Various types of data compression in MapReduce
Whenever Hadoop is mentioned, big data comes to mind right along with it. Big
data simply means a very large amount of data, and working with that much data
sooner or later runs into a shortage of storage space. So how does Hadoop, as
an architecture, handle this issue? It provides a simple and important remedy:
data compression. We can compress a huge dataset using the compression
libraries available in Hadoop. If the benefits are not yet clear, there are two
main ones. First, a compressed dataset needs far less space on disk. Second,
Hadoop has to move data from one machine to another within a cluster, and
smaller data takes less time to transfer over the network. Now that the purpose
of data compression in MapReduce programming is clear, let’s look at the
compression options available in Hadoop MapReduce.
Types of Compression in Map Reduce
There are several compression techniques available, each with different
characteristics. Let’s list the available techniques and then look at each one
in detail.
· Deflate
· Gzip
· Zip
· Bzip2
· LZO
· LZ4
· Snappy
We can distinguish them by the trade-off each one makes between space and
time. A technique that compresses faster generally saves less space, and a
technique that produces smaller output generally takes longer to run.
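The space-versus-time trade-off can be observed with the JDK's built-in java.util.zip.Deflater, which exposes compression levels from fastest to smallest. This is only a minimal sketch and not Hadoop code: the class name DeflateLevels and the sample data are made up for illustration, and the sizes and timings it prints will vary by machine and input.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

final class DeflateLevels {
    // Compresses the input at the given level (1 = fastest, 9 = smallest)
    // and returns the number of compressed bytes produced.
    static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    // Hypothetical, repetitive log-like data used only for this demonstration.
    static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20000; i++) {
            sb.append("user").append(i % 97)
              .append(",click,page").append(i % 13).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleData();
        int[] levels = {Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION};
        for (int level : levels) {
            long start = System.nanoTime();
            int size = compressedSize(data, level);
            long micros = (System.nanoTime() - start) / 1000;
            System.out.printf("level %d: %d -> %d bytes in %d us%n",
                    level, data.length, size, micros);
        }
    }
}
```

Running main prints the compressed size and elapsed time for the fastest and the strongest level, which makes the trade-off visible on your own data.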
Deflate
Deflate is a lossless data compression format that combines LZ77 and Huffman
coding and stores data as a series of blocks. Its file extension is .deflate.
It holds a single file and is not splittable.
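To see what "lossless" means in practice, here is a small round trip using the JDK's java.util.zip.Deflater and Inflater classes. It is a sketch under assumptions: the class name DeflateRoundTrip and the helper names are invented for this example, and it works on in-memory byte arrays rather than HDFS files.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class DeflateRoundTrip {
    // Compresses the input with the default Deflate level.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Restores the original bytes; rejects corrupt or truncated input.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not a valid compressed stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = "hadoop hadoop hadoop mapreduce mapreduce"
                .getBytes(StandardCharsets.UTF_8);
        byte[] restored = inflate(deflate(original));
        System.out.println("round trip lossless: " + Arrays.equals(original, restored));
    }
}
```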
Gzip
Gzip is a compression format based on the Deflate algorithm and is used to
store compressed data. It has a .gz extension and is produced by the gzip tool.
A gzip file holds a single file and is not splittable. It provides a good
compression ratio, but it needs more time and more resources to compress and
decompress than speed-oriented techniques such as LZO, LZ4, and Snappy.
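The gzip format can be tried out with the JDK's GZIPOutputStream and GZIPInputStream. The sketch below is not Hadoop code and assumes in-memory data; the class name GzipDemo and its helpers are hypothetical. It also checks the two magic bytes (0x1f, 0x8b) that open every gzip stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipDemo {
    // Wraps the input in a gzip stream (Deflate data plus header and checksum).
    static byte[] gzip(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Reads a gzip stream back into the original bytes.
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz =
                     new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gz.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // True if the data starts with the gzip magic bytes 0x1f 0x8b.
    static boolean looksLikeGzip(byte[] data) {
        return data.length >= 2 && data[0] == (byte) 0x1f && data[1] == (byte) 0x8b;
    }

    public static void main(String[] args) {
        byte[] original = "key1\tvalue1\nkey2\tvalue2\n".getBytes(StandardCharsets.UTF_8);
        byte[] compressed = gzip(original);
        System.out.println("gzip header present: " + looksLikeGzip(compressed));
        System.out.println(new String(gunzip(compressed), StandardCharsets.UTF_8));
    }
}
```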
Zip
Zip is an archive format created with the zip tool. It uses the Deflate
compression algorithm internally and has a .zip extension. An archive can hold
multiple files and is splittable at file boundaries.
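The difference from the single-file formats is easy to see with the JDK's ZipOutputStream, which writes one entry per file. The following is a minimal sketch with hypothetical names (ZipDemo, the part-0000x.txt file names); it builds the archive in memory and then lists the entries, which are the file boundaries mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipDemo {
    // Writes each map entry (file name -> text content) as a separate zip entry.
    static byte[] zip(Map<String, String> files) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(out)) {
            for (Map.Entry<String, String> file : files.entrySet()) {
                zos.putNextEntry(new ZipEntry(file.getKey()));
                zos.write(file.getValue().getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Lists the entry names in the order they are stored in the archive.
    static List<String> entryNames(byte[] archive) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis =
                     new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("part-00000.txt", "alpha\n");
        files.put("part-00001.txt", "beta\n");
        System.out.println(entryNames(zip(files)));
    }
}
```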
Bzip2
Bzip2 achieves a higher compression ratio than gzip, but it is fairly slow by
comparison and takes more time to both compress and decompress data. It uses
the bzip2 algorithm and the bzip2 tool. A bzip2 file cannot hold multiple
files, but it is splittable.
LZO
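LZO is another compression technique, and it is optimized for speed. It uses
the LZO algorithm internally and the lzop tool to perform compression. An LZO
file cannot hold multiple files and is not splittable on its own, although it
can be made splittable by building an index in a preprocessing step. It
compresses quickly, at the cost of a lower compression ratio.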
LZ4
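LZ4 is a compression technique that favors speed, with compression levels that
let you trade some of that speed for a better ratio. A plain .lz4 file written
by Hadoop's built-in codec is not splittable. To split LZ4-compressed data
without external indexing, store it in a block-based container format such as
a sequence file.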
Snappy
Snappy is a technique that aims for very high speed with reasonable
compression rather than the smallest possible output. It is widely used in
industry. A Snappy-compressed file is not splittable by itself, so Snappy is
generally used inside container formats such as Avro and sequence files.
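When it comes to selecting the technique that best fits our use case,
splittability matters most for large files. If a large file is compressed in a
format that is not splittable, a single map task has to process the whole
file, so for large files we should not use a technique in which the file is
not splittable.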
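The effect of splittability on parallelism can be estimated with simple arithmetic. The sketch below is a deliberate simplification and not Hadoop's actual split logic, which also takes configured minimum and maximum split sizes into account; the class name SplitEstimate and the method estimateSplits are hypothetical.

```java
final class SplitEstimate {
    // Rough number of input splits (and therefore map tasks) for one file:
    // a non-splittable file is always read by a single task, while a
    // splittable file is cut into roughly one split per block.
    static long estimateSplits(long fileSizeBytes, long blockSizeBytes, boolean splittable) {
        if (blockSizeBytes <= 0) {
            throw new IllegalArgumentException("block size must be positive");
        }
        if (fileSizeBytes <= 0) {
            return 0;
        }
        if (!splittable) {
            return 1;
        }
        // Ceiling division: a partial last block still needs its own split.
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 1024 * mb;  // a 1 GB compressed file
        long blockSize = 128 * mb;  // an assumed 128 MB block size
        System.out.println("splittable:     " + estimateSplits(fileSize, blockSize, true));
        System.out.println("not splittable: " + estimateSplits(fileSize, blockSize, false));
    }
}
```

With these assumed sizes, the splittable file can be processed by 8 map tasks in parallel, while the non-splittable one is handled by 1.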
Conclusion
When working with Hadoop MapReduce, understanding the different compression
techniques and the benefits of each lets us decide when to use compression and
which technique best matches which scenario. I hope this helps you make the
right decision when such a scenario comes up. Also, make a mental note that it
is not necessary to