The Hadoop Distributed File System (HDFS) was designed to store files whose sizes are on the order of gigabytes or even terabytes. Our requirements are quite different, because we have lots of small files (kilobytes or at most a few megabytes). However, there is a way to store such data efficiently.

A Hadoop cluster typically consists of one NameNode, which keeps track of where everything is stored, and a number of DataNodes, which are responsible for storing the data. Each file is stored as one block on a DataNode or, if it is larger than the block size, split across several blocks. The default block size is 64 MB. Since we have mainly small files, we would end up with a lot of blocks that are far smaller than the block size. However, these don’t take up more disk space than the original files (see “Hadoop: The Definitive Guide”).
The problem is purely the number of blocks: the NameNode keeps a map in memory that records on which DataNodes each block is stored. With lots of files, this map becomes very large. On top of that, during startup each DataNode scans its local file system and reports to the NameNode which blocks it is storing. The more files there are, the longer this takes.
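To get a feel for how the NameNode's in-memory map grows with the number of files rather than with the amount of data, here is a rough back-of-the-envelope sketch. The figure of about 150 bytes per file or block entry is an assumed rule of thumb, not a measured value, and the function names are made up for this illustration; real memory use depends on the Hadoop version and configuration.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # default block size mentioned above (64 MB)
BYTES_PER_OBJECT = 150          # ASSUMPTION: rough rule of thumb per file/block entry


def blocks_for_file(size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks a file of the given size is split into (rounded up)."""
    return (size_bytes + block_size - 1) // block_size


def namenode_bytes(num_files, file_size, block_size=BLOCK_SIZE,
                   bytes_per_object=BYTES_PER_OBJECT):
    """Rough NameNode memory estimate: one entry per file plus one per block."""
    blocks = num_files * blocks_for_file(file_size, block_size)
    return (num_files + blocks) * bytes_per_object


# Roughly the same data volume (about 1 TB), stored in two different ways.
many_small = namenode_bytes(10_000_000, 100 * 1024)   # 10 million files of 100 KB
few_large = namenode_bytes(1_000, 1024 ** 3)          # 1,000 files of 1 GB

print(f"many small files: {many_small / 1e6:.0f} MB of NameNode memory")
print(f"few large files:  {few_large / 1e6:.2f} MB of NameNode memory")
```

Under these assumptions the small-file layout needs on the order of 3 GB of NameNode memory, while the large-file layout needs only a few megabytes, even though both hold about the same amount of data.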

We are currently changing our infrastructure to use the Hadoop Distributed File System (HDFS, an open-source file system similar to Google’s) instead of dedicated file servers. We therefore needed to change the Ant task that deletes old files on the developers’ computers so that it deletes those files in HDFS instead. After some extensive research (“ant hadoop” are really bad search terms) we found that the Hadoop distribution already ships with some predefined Ant tasks. This is how they can be used:

    <!-- Jars needed by the Hadoop Ant task -->
    <path id="ant.classpath">
      <fileset dir="${libs.dir}">
        <include name="hadoop-0.18.3-ant.jar" />
        <include name="hadoop-0.18.3-core.jar" />
        <include name="commons-cli-2.0-SNAPSHOT.jar" />
      </fileset>
    </path>

    <!-- Register the task under the name "hdfs" -->
    <taskdef name="hdfs" classname="org.apache.hadoop.ant.DfsTask" classpathref="ant.classpath" />

    <!-- Create a directory in HDFS -->
    <target name="createHDFSdirectory">
      <hdfs cmd="mkdir" args="hdfs://localhost:54310/testDir" />
    </target>

    <!-- Recursively delete the directory again -->
    <target name="deleteHDFSdirectory">
      <hdfs cmd="rmr" args="hdfs://localhost:54310/testDir" />
    </target>
