Various types of data compression in MapReduce
When Hadoop comes to mind, another term usually comes along with it: big data.
Big data means a very large amount of data, and whenever we work with that much
data, storage space becomes scarce. So how can we, or Hadoop as an
architecture, handle such a critical issue? Hadoop provides an important
feature to rescue us from it: data compression. We can compress our huge
datasets using the different compression libraries that Hadoop supports.
If you are still not clear about the benefits of data compression in Hadoop,
let me show you. When we compress a dataset, the space required to store it
decreases drastically. We also need to transfer data across the Hadoop cluster
from one machine to another. Because compressed data is smaller, less data
crosses the network and transfers finish sooner. Now that we understand the
purpose of data compression in MapReduce programming, let’s look at the
different compression options available in Hadoop.
Types of Compression in MapReduce
There are different types of
compression techniques available, and each has different characteristics. Let’s
go through the available compression techniques and understand each one of them
in detail. We can distinguish them based on the trade-off between space and
time: a technique that compresses quickly usually saves less space, while a
technique that spends more time compressing usually produces smaller output.
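To make the space-versus-time trade-off concrete, here is a minimal sketch in Python using only the standard library, not Hadoop itself. The vocabulary, the data size, and the `measure` helper are assumptions chosen for illustration, and the exact numbers will differ from machine to machine.

```python
import bz2
import random
import time
import zlib

# Synthetic "log-like" text built from a small vocabulary; this stands in
# for a real dataset and is an assumption made only for illustration.
words = ["hadoop", "mapreduce", "block", "split", "codec", "mapper", "reducer", "hdfs"]
rng = random.Random(0)
data = " ".join(rng.choices(words, k=200_000)).encode("utf-8")


def measure(name, compress):
    """Compress `data` once and return (name, compressed size, seconds)."""
    start = time.perf_counter()
    size = len(compress(data))
    return name, size, time.perf_counter() - start


results = [
    measure("deflate, level 1 (fast)", lambda d: zlib.compress(d, 1)),
    measure("deflate, level 9 (small)", lambda d: zlib.compress(d, 9)),
    measure("bzip2, level 9", lambda d: bz2.compress(d, 9)),
]

print(f"original: {len(data)} bytes")
for name, size, seconds in results:
    print(f"{name}: {size} bytes in {seconds:.3f} s")
```

Running it prints one line per setting, so the size and the time can be compared side by side for the same input.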
Deflate is a lossless data
compression format that combines LZ77 and Huffman coding. Its file extension is
.deflate. It holds a single file and is not splittable. It stores data as a
series of blocks.
Gzip is a compression
format built on Deflate and is used to store compressed data. Gzip files have a
.gz extension and are produced by the gzip tool, which works on a single file;
the result is not splittable. It provides a high compression ratio, so
compressing files takes more time. Compression and decompression also use more
resources.
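The relationship between Gzip and Deflate can be checked with a small Python sketch (standard library only; the `payload` value is a made-up sample). It looks at the Gzip header and then decodes the same bytes with `zlib`, which works because the body of a Gzip file is a Deflate stream.

```python
import gzip
import zlib

# Hypothetical tab-separated records used only as sample input.
payload = b"key\tvalue\n" * 1000

gz_bytes = gzip.compress(payload)

# A gzip stream starts with the magic bytes 0x1f 0x8b, followed by the
# compression-method byte, where 8 means DEFLATE.
magic = gz_bytes[:2]
method = gz_bytes[2]

# zlib decodes the gzip container when told to expect a gzip header
# (wbits = 16 + MAX_WBITS), because the compressed body is DEFLATE data.
restored = zlib.decompress(gz_bytes, wbits=16 + zlib.MAX_WBITS)

print(magic.hex(), method, restored == payload)
```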
Zip uses the zip tool to
perform compression. It uses the Deflate compression algorithm internally and
has a .zip extension. An archive can hold multiple files and is splittable at
file boundaries.
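As a rough illustration of why a Zip archive can be divided at file boundaries, the following Python sketch builds an in-memory archive with two members and reads one of them without touching the other. The member names are hypothetical and only mimic typical output file names.

```python
import io
import zipfile

buffer = io.BytesIO()

# Write two independent members, each compressed with Deflate.
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("part-00000.txt", "alpha\n" * 100)
    archive.writestr("part-00001.txt", "beta\n" * 100)

# Each member is compressed on its own, so a reader can open the second
# member directly without decompressing the first.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    second = archive.read("part-00001.txt")

print(names, len(second))
```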
Bzip2 is fairly slow
compared with the others, but it achieves a higher compression ratio. It cannot
hold multiple files, but it is splittable. It uses the bzip2 algorithm and the
bzip2 tool. It takes more time to compress and decompress data.
LZO is another compression technique. It uses the lzop tool to
perform compression and the LZO algorithm internally. It cannot hold
multiple files and is not splittable on its own; it becomes splittable only
after an index has been built for the file. It is a faster compression
technique, but it achieves a lower compression ratio.
LZ4 is a compression
technique that lets us tune the balance between speed and compression ratio. We don’t need
external indexing in this technique. We can split the compressed files in this
Snappy is a technique that
provides a good trade-off between speed and compression. It favors speed over
compression ratio. It is widely used in industry and is generally used to
compress data inside container formats like Avro and sequence files.
When it comes to
selecting the technique that best fits our use case, keep in mind that for a
large file we should not use a technique in which the file is not
splittable.
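The reason large non-splittable files are a problem can be sketched in Python (standard library only; the record format is invented for the example). A reader that is handed only the second half of a Gzip file, as a second mapper would be if the file were split, finds no header at that offset and cannot start decoding, so in practice the whole file has to be read by a single task.

```python
import gzip
import zlib

# Hypothetical records standing in for one large input file.
data = b"".join(b"record %d\n" % i for i in range(50_000))
gz_bytes = gzip.compress(data)

# Reading from the beginning of the file works as expected.
readable_from_start = gzip.decompress(gz_bytes) == data

# Reading from the midpoint fails: there is no gzip header at that offset.
second_half = gz_bytes[len(gz_bytes) // 2:]
try:
    zlib.decompress(second_half, wbits=16 + zlib.MAX_WBITS)
    readable_from_middle = True
except zlib.error:
    readable_from_middle = False

print(readable_from_start, readable_from_middle)
```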
We are using Hadoop
MapReduce algorithms, and now that we understand the different compression
techniques and the benefits of each, we will be able to decide when to use them
and which technique best matches which scenario. I hope this helps you make the
right decision when such a scenario comes up. Also, make a mental note that it
is not necessary to