This module provides Python bindings to libhdfs, a C library that interfaces with Hadoop DFS. I haven't managed to make an installer, but here are some instructions (valid for Hadoop version 0.16). Also note that it works only on Linux (ironic, given that I work on a Mac) and does not support writing yet (it will very soon).
- Download Hadoop DFS from the Hadoop web site.
- Unzip it, go inside the Hadoop folder and into src/c++/libhdfs
- Also download Java (1.6.0_02) and SWIG (I used the latest version).
- Be sure that the latest SWIG is the one that runs, i.e. it should be the first one found in your search path; older SWIGs cause problems.
- Set your Java CLASSPATH just right! See mine; I put every damn thing in it so that every Java class could be found:
export CLASSPATH=/home/sguha/villa/jdk1.6/lib:/home/sguha/hadoop-0.1/hadoop-0.14.2-core.jar:\
/home/sguha/hadoop-0.1/hadoop-0.14.2-examples.jar:/home/sguha/hadoop-0.1/hadoop-0.14.2-test.jar:\
/home/sguha/hadoop-0.1/lib/commons-logging-1.0.4.jar:/home/sguha/hadoop-0.1/lib/commons-logging-api-1.0.4.jar:\
/home/sguha/hadoop-0.1/lib/log4j-1.2.13.jar:
- Back up the original Makefile and download this Makefile to replace it. This Makefile won't build the Python library, but it will build the hdfs shared library for testing.
- A few things to note (read the Makefile carefully):
- Change LIB_HDFS_BUILD_DIR to what you want.
- CPPFLAGS has -m64, which you should change to -m32 if you're using 32-bit Linux. Also use the right JDK, i.e. 64-bit or 32-bit, matching your OS and the CPPFLAGS -m option.
- The -L$(JAVA_HOME)/jre/lib/amd64/server option corresponds to the location of libjvm.so
- This path should also be provided to the line $(LD) $(LDFLAGS)... just below $(SO_TARGET)
- Some other files, such as jni.h and/or jni_md.h, are also needed; their location is specified in the CPPFLAGS line.
- Change the Python include parameter according to your distribution (I used 2.3).
- Also, make sure your Java CLASSPATH is set for your Java jar files.
- Pay attention to the comments in lines 38-42 of the Makefile regarding LD_LIBRARY_PATH and LD_RUN_PATH. PYTHONPATH should be updated to contain LIB_HDFS_BUILD_DIR.
- Yeah, it's a mess, but it works; take the time to go through the comments if your build does not work.
- Your SWIG file, pyhdfs.swig
- Read the comments at line 133 of the SWIG file before proceeding. Try compiling without performing the checks.
- Run SWIG:
swig -Wall -DSWIGWORDLENGTH64 -python pyhdfs.swig
Be sure the SWIG you run is the latest one.
- Edit line 86 of the Makefile (CSRC = hdfs.c hdfsJniHelper.c) so that it reads:
CSRC = hdfs.c hdfsJniHelper.c pyhdfs_wrap.c
- Run
make clean; make
There will be several error messages regarding the build of hdfs_test/read/write.c. Ignore these and check the build directory for _pyhdfs.so (see line 101 in the Makefile).
- Now download the Python module hadoop.py and make sure your PYTHONPATH includes your LIB_HDFS_BUILD_DIR folder, so that Python can load everything.
- In Python, import hadoop
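Two of the steps above (collecting every jar into CLASSPATH, and making sure the build directory holds _pyhdfs.so and is on PYTHONPATH) are easy to get wrong by hand. The following is only a rough sketch of how they could be checked from Python; the function names build_classpath and check_build are made up for illustration and are not part of hadoop.py, and the code assumes a modern Python rather than the 2.3 used for the build.

```python
import glob
import os


def build_classpath(hadoop_home, jdk_lib=None):
    """Join every jar in hadoop_home and hadoop_home/lib into a CLASSPATH string."""
    entries = []
    if jdk_lib:
        entries.append(jdk_lib)
    # Top-level jars (core, examples, test) first, then the bundled libraries.
    entries.extend(sorted(glob.glob(os.path.join(hadoop_home, "*.jar"))))
    entries.extend(sorted(glob.glob(os.path.join(hadoop_home, "lib", "*.jar"))))
    return os.pathsep.join(entries)


def check_build(build_dir, pythonpath):
    """Return a list of problems; an empty list means the basics look fine."""
    problems = []
    # The Makefile is expected to leave _pyhdfs.so in the build directory.
    if not os.path.isfile(os.path.join(build_dir, "_pyhdfs.so")):
        problems.append("_pyhdfs.so not found in " + build_dir)
    # PYTHONPATH must contain the build directory for "import hadoop" to work.
    paths = [os.path.abspath(p) for p in pythonpath.split(os.pathsep) if p]
    if os.path.abspath(build_dir) not in paths:
        problems.append("PYTHONPATH does not contain " + build_dir)
    return problems
```

You would call build_classpath with your own Hadoop folder and export the result as CLASSPATH, then call check_build with your build directory and os.environ.get("PYTHONPATH", "") before trying the import.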
Some examples
For documentation, please see the Hadoop libhdfs API (read hdfs.h); forgive me for not inserting Python docstrings. Currently, the Python module hadoop only supports reads. Also, in the code below, hadoop.HadoopDFS() invokes pyhdfs.hdfsConnect(hostname,port). The documentation (see hdfs.h) says hdfsConnect('default',0) should connect to fs.default, which should be your Hadoop DFS; for me, however, it connected to the local FS. To work around this, specify the namenode and port number to hdfsConnect.
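If you would rather not hard-code the namenode and port for this workaround, they can be read out of hadoop-site.xml. The sketch below assumes the address is stored under the property name fs.default.name as either host:port or hdfs://host:port; the function name namenode_address is invented for this example, and the code only parses the file without touching pyhdfs. It is written for a modern Python.

```python
import xml.etree.ElementTree as ET


def namenode_address(site_xml_path, key="fs.default.name"):
    """Return (host, port) for the given property, or None if absent or not host:port."""
    root = ET.parse(site_xml_path).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name", "").strip() == key:
            value = prop.findtext("value", "").strip()
            # Accept both "hdfs://host:port" and plain "host:port".
            if "://" in value:
                value = value.split("://", 1)[1]
            value = value.rstrip("/")
            host, sep, port = value.rpartition(":")
            if sep and port.isdigit():
                return host, int(port)
            # Values such as "local" or "file:///" carry no host and port.
            return None
    return None
```

The returned pair is what you would then hand to hdfsConnect in place of 'default' and 0.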
- Open a connection to the namenode given in hadoop-site.xml
import hadoop
h=hadoop.HadoopDFS()
#open a connection to the local filesystem
hl=hadoop.HadoopDFS(None)
- Open a file for reading
import hadoop
h=hadoop.HadoopDFS()
#Default open with 8 KB buffering; this is passed on to the Java interface, so the file is buffered.
f=hadoop.HadoopFile(h,filename)
#Read a single line; not as slow as you might think
f.readline()
#Read all lines
lines=f.readlines()
#xreadlines returns an iterator
for i in f.xreadlines():
    print i
- Objects of HadoopFile also have seek, tell, flush and close methods. There is also readchunks, which returns an iterator over chunks of a given size in bytes.
- Objects of HadoopDFS have copy, delete, move etc. commands, but no ls as of now, nor file size.
import hadoop
h=hadoop.HadoopDFS()
hl=hadoop.HadoopDFS(None)
#oldname must be a file in the DFS
h.rename(oldname,newname)
#oldname must be a file on the local filesystem
hl.rename(oldname,newname)
#Makes a copy of source on same DFS
h.copy(source, target)
#Copies source on DFS (in h) to local filesystem on hl
h.copy(source, target,hl)
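The readchunks method mentioned above is described as returning an iterator over pieces of a given size. To make that idea concrete, here is a minimal stand-alone sketch of such an iterator over any object with a read(size) method. It is not the implementation in hadoop.py; read_chunks is a hypothetical name, and io.BytesIO stands in for an open file so the example runs without Hadoop (on a modern Python).

```python
import io


def read_chunks(f, size=8192):
    """Yield successive blocks of at most `size` bytes from a file-like object."""
    while True:
        chunk = f.read(size)
        if not chunk:
            # An empty read means end of file.
            break
        yield chunk


# io.BytesIO is only a stand-in here; it is not a HadoopFile.
demo = io.BytesIO(b"0123456789")
sizes = [len(c) for c in read_chunks(demo, 4)]
```

Every chunk except possibly the last has exactly the requested size, which keeps memory use bounded even for large files.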
Feel free to play around, and do inform me of any suggestions at s_g_u_h_a_@_p_u_r_d_u_e_._e_d_u (remove underscores). Also, the source of hadoop.py is very lucid; please do read it.
Last modified: Sat Jun 14 11:56:21 EDT 2008