Hadoop File Formats
There are many storage formats in the Hadoop ecosystem; a short sketch of writing the same data in several of them follows this list.
- Text: CSV, JSON records. Not well suited to querying the data, and they do not support block compression. In many cases they are also not splittable in practice (this depends on the kind of task you are going to run with MapReduce; for example, if each record depends on other lines in the file, the format cannot be split cleanly).
- Sequence file: Row-based, used to transfer data between MapReduce phases. They are splittable, which makes them a good fit for MapReduce.
- Avro: Mainly used for serialization; a fast, compact binary format that supports block compression and is splittable. Most importantly, it supports schema evolution.
- Parquet: Column-oriented; excellent when only specific columns need to be retrieved.
- ORC: A mix of row and column format, meaning it stores collections of rows, and within each collection the data is stored in columnar form. It is splittable, so parallel operations can be performed easily.
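To make the comparison concrete, here is a minimal sketch of writing one small DataFrame in each of these formats with Spark. It is meant for spark-shell (where the `spark` session is predefined); the paths and the tiny demo dataset are made up for illustration, and the Avro writer assumes the spark-avro module is on the classpath.

```scala
// Paste into spark-shell; paths and the demo data are hypothetical.
import spark.implicits._

val clicks = Seq(
  ("u1", "http://example.com/home", 1200L),
  ("u2", "http://example.com/cart", 1310L)
).toDF("user", "url", "ts")

// Plain text (CSV): simple, row oriented, no block compression.
clicks.write.mode("overwrite").csv("hdfs:///tmp/clicks_csv")

// Columnar, splittable formats suited to analytical reads.
clicks.write.mode("overwrite").parquet("hdfs:///tmp/clicks_parquet")
clicks.write.mode("overwrite").orc("hdfs:///tmp/clicks_orc")

// Avro (row oriented, schema-evolution friendly); needs the spark-avro module on the classpath.
clicks.write.mode("overwrite").format("avro").save("hdfs:///tmp/clicks_avro")
```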
Workloads in the Hadoop ecosystem are broadly of two types, read and write, so let's compare these formats for both.
There are several factors to consider when choosing a storage format for writes:
1. Is the data format your application produces compatible with the format you will query?
2. Do you have a schema that changes over time? (Clickstream/event data formats generally change.)
3. Write frequency and file size: for example, if you dump each clickstream event as its own file, the files will be very small and you will need to merge them for better performance (a small compaction sketch follows below).
4. Write speed: how fast do you need to write your data?
Read this blog post on ORC for more detail: http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/
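Following up on point 3, here is a minimal compaction sketch for spark-shell. The input and output paths are hypothetical; the idea is simply to rewrite many tiny per-event files as a few large, splittable ones.

```scala
// Paste into spark-shell; paths are hypothetical.

// A directory full of tiny per-event JSON files...
val raw = spark.read.json("hdfs:///raw/clickstream/2016-10-01")

// ...rewritten as a few large, splittable Parquet files.
// Tune the partition count so each output file approaches the HDFS block size (typically 128 MB).
raw.coalesce(8)
  .write.mode("overwrite")
  .parquet("hdfs:///warehouse/clickstream/2016-10-01")
```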
Factors to consider when choosing a storage format for reads:
- Types of queries: if queries need to retrieve only a few columns or a group of columns, use either Parquet or ORC; they are very good for reads, with the penalty paid at write time. In other words, if the application is read-heavy, use Parquet/ORC (see the sketch after this list).
- Snappy and LZO are commonly used compression codecs that enable efficient block storage and processing, so check which combination is supported by your stack; for example, Parquet with Snappy compression works well in Spark.
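Here is a minimal sketch of that read-heavy pattern in spark-shell: write Parquet compressed with Snappy, then read back only the columns the query needs. Paths and the demo dataset are hypothetical.

```scala
import spark.implicits._

// Hypothetical demo data.
val events = Seq(
  ("u1", "http://example.com/home", 1200L),
  ("u2", "http://example.com/cart", 1310L)
).toDF("user", "url", "ts")

// Write Parquet with Snappy compression; the combination stays splittable.
events.write.mode("overwrite").option("compression", "snappy")
  .parquet("hdfs:///warehouse/events_parquet")

// Column pruning: only the url and ts column chunks are read from disk.
val recent = spark.read.parquet("hdfs:///warehouse/events_parquet")
  .select("url", "ts")
  .where("ts >= 1300")
recent.show()
```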
In terms of schema evolution, Avro can add, update, and delete fields; Parquet can add columns at the end; and ORC cannot do any of these yet (support is under development).
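To illustrate what Avro schema evolution looks like, here is a small sketch that writes a record with an old schema and reads it back with a newer schema that adds a field with a default. The schema and field names are made up; it can be pasted into spark-shell, which already has the Avro classes on its classpath.

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// Old (writer) schema and a newer (reader) schema that adds a field with a default.
val writerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"Click","fields":[{"name":"url","type":"string"}]}""")
val readerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"Click","fields":[
       {"name":"url","type":"string"},
       {"name":"referrer","type":"string","default":"unknown"}]}""")

// Serialize a record with the old schema...
val record = new GenericData.Record(writerSchema)
record.put("url", "http://example.com/home")
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](writerSchema).write(record, encoder)
encoder.flush()

// ...and read it back with the evolved schema; the missing field falls back to its default.
val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
val evolved = new GenericDatumReader[GenericRecord](writerSchema, readerSchema).read(null, decoder)
println(evolved.get("referrer")) // prints "unknown"
```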
Some common lessons I have learned while choosing storage formats and compression techniques in Hadoop:
- Tool selection: This is the most obvious thing to check. For example, Cloudera (Impala) does not support ORC, so choosing the right format for your Hadoop platform is very important. Another example: if you want to use Avro for your application, you must check whether your data processing engine has native support for an Avro reader and writer.
- Data change (clickstream/event): Do you want to add and delete fields from a file and still be able to read old files with the same code? If yes, know which file formats enable a flexible, evolving schema.
- File format splittability: Since Hadoop stores and processes data in blocks, you must check splittability when choosing the file format. For example, XML files are not splittable, while CSV files are splittable but do not support block compression. These are just examples to show what to compare.
- Choose either Snappy or LZO, because they strike a good balance between splittability and block compression.
- How big are your files?: Small files are the exception in Hadoop, and processing too many small files can cause performance issues. Hadoop wants large, splittable files so that its massively distributed engine can leverage data locality and parallel processing (see the file-size check below).
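As a quick sanity check on file sizes, here is a sketch that lists a directory with the Hadoop FileSystem API and compares the average file size against the HDFS block size. The directory path is hypothetical; run it in spark-shell.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val dir = new Path("hdfs:///warehouse/clickstream/2016-10-01")
val files = fs.listStatus(dir).filter(_.isFile)

val avgMb = if (files.isEmpty) 0L else files.map(_.getLen).sum / files.length / (1024L * 1024L)
val blockMb = fs.getDefaultBlockSize(dir) / (1024L * 1024L)
println(s"${files.length} files, average size ${avgMb} MB (HDFS block size ${blockMb} MB)")
// If the average is far below the block size, compact (e.g. with coalesce/repartition before writing).
```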