SerDe is short for Serializer/Deserializer. Hive uses a SerDe together with a FileFormat to read and write table rows. The SerDe interface is used for I/O: it lets Hive read data from a table and write it back to HDFS in any custom format. For unstructured text data, the RegEx SerDe tells Hive how to parse each record with a regular expression. You can also write your own custom SerDe for any format. Let us look at the definitions of serializer and deserializer.
The deserializer converts string or binary data into Java objects that Hive can work with. It is used when reading data, for example when we submit a query against a table.
The serializer converts Java objects into a string or binary form. It is used when writing data, for example in INSERT ... SELECT statements.
Hive currently uses these FileFormat classes to read and write HDFS files:
- TextInputFormat/HiveIgnoreKeyTextOutputFormat: These 2 classes read/write data in plain text file format.
- SequenceFileInputFormat/SequenceFileOutputFormat: These 2 classes read/write data in Hadoop SequenceFile format.
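As a rough illustration of how these FileFormat classes show up in table DDL, the sketch below declares two tables: one that names the text classes explicitly and one that uses the STORED AS SEQUENCEFILE shorthand. The table and column names (page_views_text, page_views_seq, user_id, url) are hypothetical and chosen only for this example.

```sql
-- Hypothetical table: plain text files, with the input/output format classes spelled out.
CREATE TABLE page_views_text (
  user_id INT,
  url     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
  INPUTFORMAT  'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

-- Hypothetical table: Hadoop SequenceFile storage via the built-in shorthand.
CREATE TABLE page_views_seq (
  user_id INT,
  url     STRING
)
STORED AS SEQUENCEFILE;
```

In everyday use, STORED AS TEXTFILE is the usual shorthand for the first pair of classes; the long form is shown here only to make the class names visible.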
Hive currently uses these SerDe classes to serialize and deserialize data:
- MetadataTypedColumnsetSerDe: This SerDe is used to read/write delimited records such as CSV, tab-separated, or Control-A separated records.
- LazySimpleSerDe: This SerDe can be used to read the same data format as MetadataTypedColumnsetSerDe and TCTLSeparatedProtocol; however, it creates objects lazily, which gives better performance. Starting in Hive 0.14.0 it also supports reading and writing data with a specified character encoding, for example:
ALTER TABLE person SET SERDEPROPERTIES ('serialization.encoding'='GBK');
- ThriftSerDe: This SerDe is used to read/write Thrift serialized objects. The class file for the Thrift object must be loaded first.
- DynamicSerDe: This SerDe also reads/writes Thrift serialized objects, but it understands Thrift DDL, so the schema of the object can be provided at runtime. It also supports many different protocols, including TBinaryProtocol, TJSONProtocol, and TCTLSeparatedProtocol (which writes data in delimited records).
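To make the choice of SerDe concrete, here is a minimal sketch that names LazySimpleSerDe explicitly when creating a table. The columns of the person table (id, name) and the comma delimiter are assumptions made for illustration; field.delim and serialization.encoding are properties that LazySimpleSerDe reads.

```sql
-- Hypothetical definition of the person table, with the SerDe named explicitly.
CREATE TABLE person (
  id   INT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim' = ',',               -- assumed delimiter for this sketch
  'serialization.encoding' = 'GBK'   -- needs Hive 0.14.0 or later
)
STORED AS TEXTFILE;

-- Shows, among other details, which SerDe library the table uses.
DESCRIBE FORMATTED person;
```

Writing ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' instead normally selects the same SerDe implicitly for text tables, so the explicit form is mainly useful when extra SerDe properties are needed.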
Other built-in SerDes include Avro, ORC, RegEx, Parquet, CSV, and JSON.
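Two of these built-in SerDes can be sketched as follows. The table names, column names, sample log layout, and regular expression are all hypothetical; the SerDe class names and property keys (input.regex, separatorChar, quoteChar) are the ones these SerDes use.

```sql
-- Hypothetical log table: each line is assumed to look like "2024-01-15 ERROR Disk full".
-- Each capture group in input.regex maps to one column, in order.
CREATE TABLE app_log (
  log_date STRING,
  level    STRING,
  message  STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) ([^ ]*) (.*)'
)
STORED AS TEXTFILE;

-- Hypothetical CSV table read with OpenCSVSerde, which understands quoted fields.
CREATE TABLE contacts_csv (
  name  STRING,
  email STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar'     = '"'
)
STORED AS TEXTFILE;
```

Two caveats apply to this sketch: the RegexSerDe in the serde2 package is meant for reading, so a table like app_log is best treated as read-only, and OpenCSVSerde exposes every column as a string regardless of the declared type. For the columnar and Avro formats, the shorthands STORED AS ORC, STORED AS PARQUET, and STORED AS AVRO pick the matching SerDe and file format together.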
org.apache.hadoop.hive.serde is the old, deprecated SerDe library. Use org.apache.hadoop.hive.serde2, the current version, instead.