What is the use of sequence files in Hadoop?
I read about the sequence file format in a few blogs. Since I am still new to Hadoop, I don't really understand the application or purpose of sequence files. It would be really helpful if someone could explain what a sequence file is and where it is used in Hadoop.
Sequence files in Hadoop are binary files containing serialized key/value pairs. One advantage is that a sequence file can be compressed at either the record (key-value pair) level or the block level. Also, because sequence files are binary, they are generally faster to read and write than text files.
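To get a feel for why the choice between record-level and block-level compression matters, here is a minimal sketch in plain Python. It does not use Hadoop or the real sequence file format; `zlib` and the sample records are assumptions made purely for illustration. It compresses a set of small values one by one and then all together as a single batch.

```python
import zlib

# Hypothetical small records, standing in for the values stored in a sequence file.
records = [f"user={i % 5};action=login;status=ok".encode() for i in range(200)]

# Record-level idea: every value is compressed on its own.
record_level = sum(len(zlib.compress(r)) for r in records)

# Block-level idea: many records are batched and compressed together,
# so the compressor can exploit repetition across records.
block_level = len(zlib.compress(b"".join(records)))

print("record-level bytes:", record_level)
print("block-level bytes: ", block_level)
```

For many small, similar records the batched size is much smaller, which is why block compression is usually the more compact option. The trade-off is that a reader has to decompress a whole block to get at one record.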
Problem With Small Files and Hadoop
One of the main problems the sequence file format solves is processing too many small files in Hadoop. As you may know, Hadoop is not good at handling large numbers of small files, because the namenode keeps the metadata for every file in memory, so referencing a huge number of small files creates a lot of memory overhead. Besides this memory overhead, the second problem is the number of mappers: each small file is smaller than a block, so it gets its own mapper, and a job over many small files launches far more mappers than the data volume justifies.
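A rough back-of-the-envelope calculation makes both problems concrete. The sketch below assumes about 150 bytes of namenode memory per file or block object (a commonly quoted rule of thumb, not an exact figure) and a 128 MB block size. The function name and the file counts are hypothetical.

```python
BYTES_PER_OBJECT = 150        # assumed rule-of-thumb namenode cost per file/block object
BLOCK_SIZE = 128 * 1024**2    # assumed HDFS block size (128 MB)

def namenode_cost(num_files, file_size):
    """Return (namenode objects, approx. heap bytes, map tasks)."""
    # Every file occupies at least one block, however small it is.
    blocks_per_file = max(1, (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE)
    objects = num_files * (1 + blocks_per_file)   # one file object plus its block objects
    map_tasks = num_files * blocks_per_file       # roughly one input split per block
    return objects, objects * BYTES_PER_OBJECT, map_tasks

num_small, small_size = 10_000_000, 100 * 1024    # 10 million files of 100 KB each
small = namenode_cost(num_small, small_size)
packed = namenode_cost(1, num_small * small_size) # same data packed into one big file

print("small files:", small)
print("one big file:", packed)
```

Under these assumptions the small files need about 3 GB of namenode memory and 10 million map tasks, while the same data in one large file needs about 1 MB and 7,630 map tasks.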
Solution: Sequence File
Sequence files let you solve this small-files problem. As discussed, a sequence file contains key-value pairs, so you can store each small file as one pair, where the key is unique file metadata such as filename+timestamp and the value is the content of the ingested file. This way you can combine many small files into a single file and use it as input for MapReduce. This is why sequence files are often used in custom-written MapReduce programs.
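The packing idea can be sketched without Hadoop at all. The toy container below is not Hadoop's on-disk sequence file format; the `pack` and `unpack` helpers and the record layout are hypothetical. It only shows how many small files become one stream of key/value records, with filename+timestamp as the key and the file content as the value.

```python
import io
import struct

def pack(files, out):
    """Append each (name, content) pair as a length-prefixed key/value record."""
    for name, content in files:
        key = name.encode("utf-8")
        out.write(struct.pack(">II", len(key), len(content)))  # key length, value length
        out.write(key)
        out.write(content)

def unpack(inp):
    """Yield (name, content) pairs back from the container, in write order."""
    while True:
        header = inp.read(8)
        if len(header) < 8:   # end of stream
            break
        key_len, value_len = struct.unpack(">II", header)
        yield inp.read(key_len).decode("utf-8"), inp.read(value_len)

# Hypothetical small files: key = filename+timestamp, value = file content.
small_files = [("a.log+1700000000", b"first file"),
               ("b.log+1700000060", b"second file")]

container = io.BytesIO()      # stands in for the single large file on HDFS
pack(small_files, container)
container.seek(0)
restored = list(unpack(container))
print(restored)
```

In a real job you would write the pairs with Hadoop's own `SequenceFile.Writer` from Java, for example with `Text` keys and `BytesWritable` values, instead of a hand-rolled format like this one.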
Let me know if anything is still unclear.