The Hadoop File System (HDFS) was designed to store files that have sizes in the magnitude of giga- or even terrabytes. Our requirements are quite different, because we have lots of small files (kilo- or at most megabytes). However, there is a way to store such data in a efficient way.
A hadoop cluster typically consists of one NameNode, that keeps an overview what lies where etc., and a couple of DataNodes, responsible for storing the data. Each file is stored as one block on a DataNode or, if it’s larger than the block size, distributed among several blocks. The default block size is 64 MB. Having mainly small files, we would have a lot of mostly empty blocks lying around. However, these don’t take more disk space than the original files (see “Hadoop: The Definitive Guide”).
It’s purely the number of blocks, that is causing the problem: The NameNode keeps a map in memory that holds the information on which DataNodes a block is stored. With lots of files, this map becomes quite huge. On top of that, during start up each DataNode scans its file system and provides the NameNode with the information which files it is storing. The more files there are, the longer this takes. >> more…