Python Bindings for libhdfs

This module provides Python bindings to libhdfs, a C library that interfaces with Hadoop DFS. I haven't managed to make an installer, but here are some instructions (valid for Hadoop version 0.16). Note that it works only on Linux (ironic, given that I work on a Mac) and does not support writing yet (it will very soon).
  1. Download Hadoop DFS from the Hadoop web site.
  2. Unzip it, go into the Hadoop folder, and from there into src/c++/libhdfs
  3. Also download Java (1.6.0_02) and SWIG (I used the latest version).
    1. Be sure that the latest version is the one that runs when you invoke SWIG, i.e. the latest SWIG should be the first one found in your search path; older SWIG versions cause problems.
    2. Set your Java CLASSPATH just right! Here is mine; I listed every damn thing so that every Java class could be found:
      export CLASSPATH=/home/sguha/villa/jdk1.6/lib:/home/sguha/hadoop-0.1/hadoop-0.14.2-core.jar:/home/sguha/hadoop-0.1/hadoop-0.14.2-examples.jar:/home/sguha/hadoop-0.1/hadoop-0.14.2-test.jar:/home/sguha/hadoop-0.1/lib/commons-logging-1.0.4.jar:/home/sguha/hadoop-0.1/lib/commons-logging-api-1.0.4.jar:/home/sguha/hadoop-0.1/lib/log4j-1.2.13.jar:
  4. Back up the Makefile and download this Makefile. This Makefile won't build the Python library, but it will build the hdfs shared library for testing.
  5. Read the Makefile carefully. A few things to note:
    1. Change LIB_HDFS_BUILD_DIR to what you want.
    2. CPPFLAGS has -m64, which you should change to -m32 if you're using 32-bit Linux. Also use the matching JDK, i.e. 64-bit or 32-bit, depending on your OS and the CPPFLAGS -m option.
    3. The -L$(JAVA_HOME)/jre/lib/amd64/server option corresponds to the location of libjvm.so.
    4. This path should also be added to the $(LD) $(LDFLAGS)... line just below $(SO_TARGET).
    5. Some other files, such as jni.h and/or jni_md.h, are needed as well; their location is specified in the CPPFLAGS line.
    6. Change the Python include parameter to match your distribution (I used 2.3).
    7. Also, make sure your Java CLASSPATH is set for your Java jar files.
    8. Pay attention to the comments in lines 38-42 of the Makefile regarding LD_LIBRARY_PATH and LD_RUN_PATH.
    9. PYTHONPATH should be updated to contain LIB_HDFS_BUILD_DIR.
    10. Yeah, it's a mess, but it works. Take the time to go through the comments if your build does not work.
  6. Next you need the SWIG file, pyhdfs.swig.
  7. Read the comments at line 133 of the SWIG file before proceeding. Try compiling without performing the checks.
  8. Run SWIG:
    swig -Wall -DSWIGWORDLENGTH64 -python pyhdfs.swig
    Be sure the SWIG you run is the latest one.
  9. Edit line 86 of the Makefile, which reads
    CSRC = hdfs.c  hdfsJniHelper.c
    so that it looks like
    CSRC = hdfs.c  hdfsJniHelper.c  pyhdfs_wrap.c
  10. Run
    make clean; make
    There will be several error messages regarding the build of hdfs_test/read/write.c. Ignore these and check the build directory for _pyhdfs.so (see line 101 of the Makefile).
  11. Now download the Python module hadoop.py, and make sure your PYTHONPATH includes your LIBHDFS_BUILD_DIR folder so that Python can load everything.
  12. In Python, import hadoop
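
Most of the steps above depend on environment variables being set correctly, and a wrong value tends to surface only later as a confusing build or import failure. The following is a minimal sketch of a pre-flight check, not part of the original module. The helper names build_classpath and check_python_build are hypothetical, and the paths in the usage part are placeholders. It assumes only that the Hadoop jars sit in the Hadoop folder and in its lib subfolder, as in the CLASSPATH shown in step 3, and that the build produces _pyhdfs.so as described in step 10.

```python
import glob
import os


def build_classpath(hadoop_home, jdk_lib=None):
    """Return a CLASSPATH string listing every jar under hadoop_home.

    Assumption: jars live directly in hadoop_home and in hadoop_home/lib,
    as in the hand-written CLASSPATH from step 3.
    """
    entries = []
    if jdk_lib:
        entries.append(jdk_lib)
    # Sort so the result is stable from run to run.
    entries.extend(sorted(glob.glob(os.path.join(hadoop_home, "*.jar"))))
    entries.extend(sorted(glob.glob(os.path.join(hadoop_home, "lib", "*.jar"))))
    return os.pathsep.join(entries)


def check_python_build(build_dir, pythonpath=None):
    """Return a list of problems that would stop 'import hadoop' from working."""
    if pythonpath is None:
        pythonpath = os.environ.get("PYTHONPATH", "")
    problems = []
    if not os.path.isfile(os.path.join(build_dir, "_pyhdfs.so")):
        problems.append("_pyhdfs.so not found in " + build_dir)
    # Compare normalized absolute paths so trailing slashes do not matter.
    wanted = os.path.abspath(build_dir)
    on_path = [os.path.abspath(p) for p in pythonpath.split(os.pathsep) if p]
    if wanted not in on_path:
        problems.append(build_dir + " is not on PYTHONPATH")
    return problems


if __name__ == "__main__":
    # Placeholder paths: substitute your own Hadoop folder and build dir.
    print("export CLASSPATH=" + build_classpath("/path/to/hadoop"))
    for problem in check_python_build("/path/to/build"):
        print("WARNING: " + problem)
```

An empty list from check_python_build means both conditions from steps 10 and 11 hold; it says nothing about whether the JVM libraries can be found at run time, which is what LD_LIBRARY_PATH is for.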

Some examples

For documentation, please see the Hadoop libhdfs API (read hdfs.h), and forgive me for not writing Python docstrings. Currently, the Python module hadoop supports only reads. hadoop.HadoopDFS() invokes pyhdfs.hdfsConnect(hostname,port). The documentation (see hdfs.h) says hdfsConnect('default',0) should connect to fs.default, which should be your Hadoop DFS; for me, however, it connected to the local FS. To work around this, pass the namenode host and port number to hdfsConnect.
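
One way to avoid hard-coding the namenode is to read it from the Hadoop configuration. The sketch below is an illustration built on assumptions, not part of hadoop.py: it assumes the namenode address is stored under a property named fs.default.name in a Hadoop XML configuration file (conf/hadoop-site.xml in Hadoop releases of this era), with a value of the form host:port or hdfs://host:port. The function name namenode_from_config is hypothetical.

```python
import xml.etree.ElementTree as ET


def namenode_from_config(path):
    """Return (host, port) read from a Hadoop XML configuration file.

    Assumption: the file holds <property> elements with <name> and <value>
    children, and the namenode is stored under 'fs.default.name'.
    """
    root = ET.parse(path).getroot()
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "fs.default.name":
            value = prop.findtext("value", "").strip()
            break
    else:
        raise KeyError("fs.default.name not found in " + path)

    # Accept both 'host:port' and 'hdfs://host:port'.
    if "://" in value:
        value = value.split("://", 1)[1]
    value = value.rstrip("/")
    host, sep, port = value.rpartition(":")
    if not sep or not port.isdigit():
        # Covers values such as 'local' or 'file:///' that name no namenode.
        raise ValueError("no explicit host:port in fs.default.name: " + value)
    return host, int(port)
```

The returned pair is what you would pass as hostname and port instead of relying on ('default', 0). A ValueError here is a hint that the configuration itself points at the local filesystem.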
Feel free to play around, and do send me any suggestions at s_g_u_h_a_@_p_u_r_d_u_e_._e_d_u (remove underscores). Also, the source of hadoop.py is very lucid; please do read it.
Last modified: Sat Jun 14 11:56:21 EDT 2008