Data Compression in Hadoop Framework
When working with the Hadoop framework, there are many challenges in dealing with large data sets. Whether you store data in HDFS or in any other format, the main challenge is that massive data volumes can cause I/O and network bottlenecks.
In this blog I will explain the main problems that come with large data volumes and how data compression can be used to mitigate these problems in Hadoop jobs.
Why We Need Data Compression
In data-intensive Hadoop workloads, I/O operations and network data transfers take a considerable amount of time to complete. In addition, the internal MapReduce "shuffle" process is under heavy I/O pressure as well, since it often has to "spill" intermediate data to local disks before advancing from the Map phase to the Reduce phase.
In any Hadoop cluster, disk I/O and network bandwidth are precious resources that should be allocated carefully. Using compressed files for storage not only saves disk space but also speeds up data transfer across the network. When running large MapReduce jobs, the combination of data compression and reduced network load brings significant performance improvements, because I/O and network resource consumption is reduced throughout the MapReduce pipeline.
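To make this concrete, here is a minimal sketch of how output compression can be turned on in a MapReduce driver. It assumes Gzip as the codec and relies on the default identity mapper and reducer purely to keep the example short; a real job would plug in its own classes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "compressed-output-example");
        job.setJarByClass(CompressedOutputDriver.class);

        // Default identity mapper/reducer are used here just to demonstrate the
        // compression settings; a real job would set its own Mapper and Reducer.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Write the final job output in compressed form so less data hits HDFS
        // and less data crosses the network when the output is read later.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```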
Data Compression Trade-off
Using data compression in the Hadoop framework is usually a trade-off between I/O and speed of computation. When compression is enabled, it reduces I/O and network usage. Compression happens when MapReduce reads the data or when it writes it out.
When a MapReduce job is run against compressed data, CPU utilization generally increases, because the data must be decompressed before the files can be processed by the Map and Reduce tasks. Because of this, decompression usually adds to the running time of the job. However, it has been found that in many cases overall job performance improves when compression is enabled in multiple phases of the job configuration.
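As a rough sketch of what "multiple phases" means in practice, the snippet below enables compression both for the intermediate map output (spilled and shuffled data) and for the final job output. Snappy is simply an assumed codec choice for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class TwoPhaseCompressionConfig {
    public static Job configureJob() throws Exception {
        Configuration conf = new Configuration();

        // Phase 1: compress intermediate map output before it is spilled to
        // local disk and shuffled to the reducers.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        // Phase 2: compress the final job output written to HDFS.
        conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
        conf.setClass("mapreduce.output.fileoutputformat.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        // The rest of the job setup (mapper, reducer, input/output paths) goes here.
        return Job.getInstance(conf, "two-phase-compression-example");
    }
}
```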
Data Compression Supported by Hadoop
The Hadoop framework supports many compression formats for both input and output data. A compression format, or codec (coder-decoder), is a set of compiled, ready-to-use Java libraries that a developer can invoke programmatically to perform data compression and decompression in a MapReduce job. Each of these codecs implements an algorithm for compression and decompression, and each has different characteristics.
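As a minimal sketch of invoking a codec programmatically, the example below compresses a file on HDFS with the Gzip codec. The input path argument and the output naming convention are assumptions made only for illustration.

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Instantiate the codec; ReflectionUtils injects the Configuration.
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        Path input = new Path(args[0]);                                 // uncompressed source file
        Path output = new Path(args[0] + codec.getDefaultExtension());  // e.g. ".gz"

        try (InputStream in = fs.open(input);
             CompressionOutputStream out = codec.createOutputStream(fs.create(output))) {
            // Copy the raw bytes through the compressing stream.
            IOUtils.copyBytes(in, out, conf, false);
        }
    }
}
```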
Among the different compression formats, some are splittable, which can further enhance performance when reading and processing large compressed files. When a single large file is stored in HDFS, it is split into many data blocks and distributed across many nodes. If the file has been compressed using a splittable algorithm, its data blocks can be decompressed in parallel by several MapReduce tasks. However, if the file has been compressed by a non-splittable algorithm, Hadoop must pull all the blocks together and use a single MapReduce task to decompress them. Some of the common compression formats are explained below:
Gzip:
Gzip (GNU zip) is a compression utility adopted by the GNU project which generates compressed files. It is natively supported by Hadoop. It is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding.
Gzip provides very good compression performance (on average, about 2.5 times the compression that’d be offered by Snappy), but its write speed performance is not as good as Snappy’s (on average, it’s about half of Snappy’s). Gzip usually performs almost as well as Snappy in terms of read performance. Gzip is also not splittable, so it should be
used with a container format. Note that one reason Gzip is sometimes slower than Snappy for processing is that Gzip compressed files take up fewer blocks, so fewer tasks are required for processing the same data. For this reason, using smaller blocks with Gzip can lead to better performance.
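On the read side, Hadoop's CompressionCodecFactory can pick the right codec from a file's extension (for example .gz) and wrap the raw stream with a decompressor. The sketch below assumes the input path is passed as an argument.

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CompressedFileReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path(args[0]); // e.g. a file ending in .gz

        // The factory maps a file extension (.gz, .bz2, ...) to the matching codec.
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(input);

        try (InputStream in = (codec == null)
                ? fs.open(input)                              // no codec found: read as-is
                : codec.createInputStream(fs.open(input))) {  // decompress while reading
            // Stream the (decompressed) bytes to stdout.
            IOUtils.copyBytes(in, System.out, conf, false);
        }
    }
}
```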
BZIP2:
Bzip2 is a freely available, high-quality data compressor. It typically compresses files to within 10% to 15% of the best available techniques (the PPM family of statistical compressors), while being around twice as fast at compression and six times faster at decompression. Data compressed with bzip2 is splittable in Hadoop. It achieves a better compression ratio than Gzip, but is very slow.
Bzip2 provides excellent compression performance, but can be significantly slower than other compression codecs such as Snappy in terms of processing performance. Unlike Snappy and Gzip, bzip2 is inherently splittable. In the examples we have seen, bzip2 will normally compress around 9% better than GZip, in terms of storage space. However, this extra compression comes with a significant read/write performance cost. This performance difference will vary with different machines, but in general bzip2 is about 10 times slower than GZip. For this reason, it’s not an ideal codec for Hadoop storage, unless your primary need is reducing the storage footprint. One example of such a use case would be using Hadoop mainly for active archival purposes.
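Splittability can also be checked programmatically: codecs that implement Hadoop's SplittableCompressionCodec interface (such as the bzip2 codec) can be split into parallel input splits, while Gzip cannot. A small sketch:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class SplittabilityCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodec[] codecs = {
            ReflectionUtils.newInstance(GzipCodec.class, conf),
            ReflectionUtils.newInstance(BZip2Codec.class, conf)
        };
        for (CompressionCodec codec : codecs) {
            // Codecs implementing SplittableCompressionCodec (e.g. bzip2) can be
            // decompressed by several map tasks working on different splits.
            boolean splittable = codec instanceof SplittableCompressionCodec;
            System.out.println(codec.getClass().getSimpleName()
                    + " splittable: " + splittable);
        }
    }
}
```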
LZO:
The LZO (Lempel-Ziv-Oberhumer) compression format is composed of many smaller (~256K) blocks of compressed data, allowing jobs to be split along block boundaries. It supports splittable compression, which enables parallel processing of compressed text file splits by MapReduce jobs.
LZO is similar to Snappy in that it’s optimized for speed as opposed to size. Unlike Snappy, LZO compressed files are splittable, but this requires an additional indexing step. This makes LZO a good choice for things like plain-text files that are not being stored as part of a container format. It should also be noted that LZO’s license prevents it from
being distributed with Hadoop and requires a separate install, unlike Snappy, which can be distributed with Hadoop.
Because the compression blocks are of variable length, LZO needs to create an index when it compresses a file; the index tells the mapper where it can safely split the compressed file. LZO is mainly used when one needs to compress text files.
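Since LZO ships separately, the snippet below is only an assumed sketch of registering the hadoop-lzo codecs; the exact class names and property values depend on the LZO distribution installed on your cluster.

```java
import org.apache.hadoop.conf.Configuration;

public class LzoJobConfig {
    public static Configuration configure() {
        Configuration conf = new Configuration();

        // Register the LZO codecs from the separately installed hadoop-lzo
        // library (class names assume the common hadoop-lzo distribution).
        conf.set("io.compression.codecs",
                "org.apache.hadoop.io.compress.DefaultCodec,"
              + "com.hadoop.compression.lzo.LzoCodec,"
              + "com.hadoop.compression.lzo.LzopCodec");
        conf.set("io.compression.codec.lzo.class",
                "com.hadoop.compression.lzo.LzoCodec");

        // Note: after writing .lzo files, an index still has to be built with
        // hadoop-lzo's indexer tool so MapReduce can split them along block
        // boundaries.
        return conf;
    }
}
```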
LZ4:
LZ4 is a compression technique that creates splittable compressed files in Hadoop, with no need for external indexing. It can be used at any point of the speed/compression-ratio trade-off in Hadoop: from a fast mode reaching around 500 MB/s compression speed, up to high/ultra modes providing an increased compression ratio almost comparable to Gzip's.
Snappy:
Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. It is widely used inside Google, in everything from BigTable and MapReduce to its internal RPC systems.
Although Snappy doesn’t offer the best compression sizes, it does provide a good trade-off between speed and size. Processing performance with Snappy can be significantly better than with other compression formats. It’s important to note that Snappy is intended to be used with a container format like SequenceFiles or Avro, since it’s not inherently splittable.
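A minimal sketch of pairing Snappy with a container format: writing a block-compressed SequenceFile, which stays splittable even though Snappy itself is not. The key/value types and record contents are placeholders, and the native Snappy library must be available on the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;

public class SnappySequenceFileWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path output = new Path(args[0]);

        // BLOCK compression compresses batches of records together, giving Snappy
        // more data to work with, while the SequenceFile's sync markers keep the
        // file splittable for MapReduce.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class),
                SequenceFile.Writer.compression(
                        SequenceFile.CompressionType.BLOCK, new SnappyCodec()))) {
            for (int i = 0; i < 100; i++) {
                writer.append(new Text("key-" + i), new IntWritable(i));
            }
        }
    }
}
```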
To summarize the comparison between these compression formats:
Gzip: not splittable; high compression ratio (roughly 2.5 times that of Snappy); medium speed.
Bzip2: splittable; the highest compression ratio of the group; the slowest.
LZO: splittable once an external index is built; fast; moderate compression; requires a separate install.
LZ4: splittable without an index; very fast in its fast mode, with high modes approaching Gzip's ratio.
Snappy: not splittable on its own; very fast; moderate compression; best used inside a container format.