Various types of data compression in MapReduce
When the word Hadoop comes to mind, another word comes along with it: big data. Big data simply means a very large amount of data, and whenever we work with a large amount of data there is always the issue of scarcity of storage space. So how can we, or Hadoop as an architecture, handle such a critical issue? Hadoop provides a very nice and important facility to rescue us from it: data compression. We can compress our huge datasets using different Hadoop libraries. If you are still not clear about the benefits of data compression in Hadoop, let me show you. When we compress a dataset, the space required to store it decreases drastically. On the other hand, as we all know, data has to be transferred between machines in a Hadoop cluster. As a result of compression the data size shrinks, and eventually the speed at which data is transferred over the network increases. Now that we understand the purpose of data compression in MapReduce programming, let's see the different types of compression options available in Hadoop
MapReduce.
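To make this concrete, here is a minimal sketch of a MapReduce driver (in Java) with compression switched on. It assumes Snappy for the intermediate map output and Gzip for the final job output; both codec choices are only illustrative, and the mapper/reducer setup is left out.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionDemoDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Compress the intermediate map output to cut down shuffle traffic.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compression demo");
        job.setJarByClass(CompressionDemoDriver.class);
        // ... set mapper, reducer, and key/value classes for your job here ...

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Compress the final job output with Gzip.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}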
Types of Compression in Map Reduce
There are different types of compression techniques available.
Each has different characteristics. Let’s see a list of available compression
techniques and understand each one of them in detail.
· Deflate
· Gzip
· Bzip2
· LZO
· LZ4
· Snappy
Here we can distinguish each of them based on the trade-off
between space and time: the faster a codec compresses, the lower its
compression ratio tends to be, while a codec that spends more time compressing
produces smaller output.
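For reference, Hadoop ships codec classes for most of these formats: DefaultCodec (deflate), GzipCodec, BZip2Codec, Lz4Codec and SnappyCodec live in org.apache.hadoop.io.compress, while LZO support (LzopCodec) comes from the separately packaged hadoop-lzo project. The sketch below shows one minimal, illustrative way to compress a file on HDFS with any one of them; the input path comes from the command line, and BZip2Codec is just an example choice that can be swapped for any other codec class.

import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Any codec class can be swapped in here, e.g. GzipCodec,
        // Lz4Codec, SnappyCodec, or DefaultCodec (deflate).
        CompressionCodec codec =
                ReflectionUtils.newInstance(BZip2Codec.class, conf);

        Path in = new Path(args[0]);
        // getDefaultExtension() returns e.g. ".bz2", ".gz", ".deflate".
        Path out = new Path(args[0] + codec.getDefaultExtension());

        try (InputStream is = fs.open(in);
             OutputStream os = codec.createOutputStream(fs.create(out))) {
            IOUtils.copyBytes(is, os, conf);
        }
    }
}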
Deflate
Deflate is a lossless data compression format that uses a
combination of LZ77 and Huffman coding. Its file extension is .deflate. A
deflate file holds a single compressed stream (it cannot bundle multiple files)
and is not splittable. It stores data as a series of blocks.
Gzip
Gzip is a compression format based on deflate and is commonly
used to store compressed data. It has the .gz extension and comes with the gzip
tool; a gzip archive holds a single file and is not splittable. It provides a
high compression ratio, so compressing a file takes more time and uses more
CPU for both compression and decompression.
Zip
Zip uses the zip tool to perform compression and the deflate
algorithm internally. It has the .zip extension. A zip archive can contain
multiple files and is splittable at file boundaries.
Bzip2
Bzip2 is comparatively slow, but it achieves a higher compression
ratio. It cannot bundle multiple files, yet it is splittable. It uses the bzip2
algorithm and the bzip2 tool, and it takes more time to compress and decompress
data.
LZO
LZO is another compression
technique. It uses the lzop tool to perform compression and the LZO algorithm
internally. It cannot bundle multiple files and is not splittable by default
(a separate indexing step is needed to make LZO files splittable). It is a
fast compression technique with a relatively low compression ratio.
LZ4
LZ4 is a very fast compression technique that offers a tunable
speed-to-compression trade-off. It does not need external indexing, and the
compressed files can be split in this approach.
Snappy
Snappy is a technique that offers a good trade-off between
speed and compression: it focuses on higher speed at the cost of a lower
compression ratio. It is widely used in organizations, generally to compress
container formats such as Avro and sequence files.
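To make that last point concrete, here is a minimal sketch (the path and the key/value types are purely illustrative) of writing a SequenceFile with Snappy block compression; because the container format tracks record boundaries itself, the resulting file stays splittable even though Snappy alone is not.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class SnappySequenceFileWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/events.seq");   // illustrative path
        SnappyCodec codec = ReflectionUtils.newInstance(SnappyCodec.class, conf);

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                // Block compression: groups of records are compressed
                // together with Snappy, and the file remains splittable.
                SequenceFile.Writer.compression(CompressionType.BLOCK, codec))) {
            for (int i = 0; i < 100; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        }
    }
}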
When it comes to selecting the technique that best fits our use
case, for large files we should avoid codecs whose files are not splittable,
because a non-splittable file can only be processed by a single map task.
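A simple way to act on that advice, sketched below, is to let Hadoop's CompressionCodecFactory detect the codec from the file extension and then test whether it implements SplittableCompressionCodec (as bzip2 does); the printed messages are only illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // The factory picks a codec from the file extension (.gz, .bz2, ...).
        Path input = new Path(args[0]);
        CompressionCodec codec = factory.getCodec(input);

        if (codec == null) {
            System.out.println(input + " is not compressed: each HDFS block "
                    + "can be handled by its own map task.");
        } else if (codec instanceof SplittableCompressionCodec) {
            System.out.println(input + " uses a splittable codec ("
                    + codec.getClass().getSimpleName() + ").");
        } else {
            System.out.println(input + " uses a non-splittable codec ("
                    + codec.getClass().getSimpleName()
                    + "): the whole file goes to a single map task.");
        }
    }
}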
Conclusion
We are using Hadoop MapReduce, and now that we understand the
different compression techniques and the benefits of each, we will be able to
decide when to use compression and which technique best matches which scenario.
I hope this helps you make the right decision when such a scenario comes up.
Also, make a mental note that it is not necessary to compress every dataset;
apply compression only where the savings in storage and network transfer
justify the extra CPU cost.