The Hadoop Distributed File System (HDFS) was designed to store files whose sizes are in the range of gigabytes or even terabytes. Our requirements are quite different, because we have lots of small files (kilobytes or at most megabytes). However, there is a way to store such data efficiently.
A Hadoop cluster typically consists of one NameNode, which keeps track of which data is stored where, and a number of DataNodes, which are responsible for storing the data. Each file is stored as one block on a DataNode or, if it’s larger than the block size, split across several blocks. The default block size is 64 MB (128 MB in Hadoop 2 and later). Since we mainly have small files, we would end up with a lot of mostly empty blocks. However, these don’t take up more disk space than the original files (see “Hadoop: The Definitive Guide”).
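To make the block arithmetic concrete, here is a minimal sketch. The function name and the sample numbers are made up for illustration, and the 64 MB block size is just the default mentioned above.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default block size mentioned above


def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed for a file of file_size bytes (hypothetical helper)."""
    # Integer ceiling division: a partially filled last block still counts as a block.
    return (file_size + block_size - 1) // block_size


# Hypothetical workload: 1,000 files of 10 KB each.
small_files = [10 * 1024] * 1000

total_blocks = sum(block_count(size) for size in small_files)
# A block that is not full only occupies as much disk space as the data in it.
total_disk_bytes = sum(small_files)

print(total_blocks)      # one block per small file
print(total_disk_bytes)  # roughly 10 MB, far less than 1,000 full blocks
```

So the disk is not the issue here: the thousand small files need about 10 MB, but they still produce a thousand separate blocks.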
The problem is purely the number of blocks: the NameNode keeps a map in memory that records on which DataNodes each block is stored. With lots of files, this map becomes huge. On top of that, during startup each DataNode scans its file system and reports to the NameNode which blocks it is storing. The more files (and therefore blocks) there are, the longer this takes.
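A rough back-of-the-envelope estimate shows how quickly this adds up. The sketch below assumes the often-quoted rule of thumb of about 150 bytes of NameNode memory per file or block object; that figure is an assumption, not a measured value, and the function name is made up for illustration.

```python
BYTES_PER_OBJECT = 150  # assumption: rough rule of thumb, not a measured value


def namenode_memory_bytes(num_files: int, blocks_per_file: int = 1,
                          bytes_per_object: int = BYTES_PER_OBJECT) -> int:
    """Estimate NameNode memory for num_files files (hypothetical helper)."""
    # One object for the file entry itself, plus one object per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


# Ten million small files, each fitting into a single block.
estimate = namenode_memory_bytes(10_000_000)
print(estimate / 1e9, "GB")
```

Under these assumptions, ten million small files already cost around 3 GB of NameNode memory, regardless of how little data they actually contain.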
So, how do we deal with this problem? Hadoop makes it possible to pack a number of files into one archive. An archive is simply a concatenation of files into a single data file, together with an index structure for looking up the position where each file starts. Files in an archive are accessed by prepending the archive name to the filename, as if it were a folder.
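The idea of "concatenation plus index" can be illustrated with a toy in-memory version. This is only a sketch of the principle, not the actual on-disk format Hadoop uses; the names pack and read_file are made up for illustration.

```python
from typing import Dict, Tuple

Index = Dict[str, Tuple[int, int]]  # filename -> (offset, length)


def pack(files: Dict[str, bytes]) -> Tuple[bytes, Index]:
    """Concatenate all files into one blob and remember where each one starts."""
    data = bytearray()
    index: Index = {}
    for name, content in files.items():
        index[name] = (len(data), len(content))  # position before appending
        data.extend(content)
    return bytes(data), index


def read_file(data: bytes, index: Index, name: str) -> bytes:
    """Look up a file's position in the index and slice it out of the blob."""
    offset, length = index[name]
    return data[offset:offset + length]


# Two hypothetical small files packed into one archive.
archive, index = pack({"a.txt": b"hello", "b.txt": b"world!"})
print(index)                               # where each file starts and how long it is
print(read_file(archive, index, "b.txt"))  # the original content of b.txt
```

In Hadoop itself, archives are created with the hadoop archive tool and read through the har:// URI scheme, so the many small files show up as a handful of larger files to the NameNode.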
More information on the small files problem, along with some other solutions, can be found at Cloudera.



