Miscellaneous Commands

Introduction

This is a list of supporting functions for running MapReduce jobs, reading and writing sequence files, and manipulating files on the Hadoop Distributed File System (HDFS).

Running Mapreduce

rhex

Once an object is created using rhmr or rhlapply, it must be sent to the Hadoop system. The function rhex does this:

rhex <- function(o, async=FALSE, mapred)

Where o is the object that rhmr or rhlapply returns. mapred is a list of the same shape as in rhmr and rhlapply; its values override those passed to rhmr (and rhlapply). If async is FALSE, the function returns when the job has finished running. The value returned is a list of Hadoop counters (e.g., bytes sent, bytes written, time taken).

If async is TRUE, the function returns immediately. In this case, the value returned can be printed (i.e., just type the returned value at the REPL) or passed to rhstatus to monitor the job.

rhjoin

rhjoin <- function(o,ignore.stderr=TRUE)

where o is returned from rhex with async=TRUE. The function returns when the job is complete, and the return value is the same as that of rhex when async is FALSE (i.e., the counters and the result, success or failure, of the job). If ignore.stderr is FALSE, progress is displayed on the screen (exactly as with rhex).

rhstatus

rhstatus <- function(o)

where o is returned from rhex with async=TRUE, or is a Hadoop job id (e.g., "job_20091031_0001"). This returns a list of counters and the progress status of the job (number of maps complete, % map complete, etc.).

print

This is a generic function for printing objects returned from rhex when async=TRUE. By default it prints the start time, job name, job id, job state, and map/reduce progress. For more verbosity, type print(o,verbose=2), which also returns a list of counters (like rhstatus).
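The asynchronous workflow described in this section can be sketched as follows. This is only an illustration built from the signatures given above: it assumes RHIPE is loaded, a Hadoop cluster is reachable, and job is an object returned by rhmr or rhlapply. The wrapper name run_async is hypothetical and not part of RHIPE.

```r
# Sketch only: assumes RHIPE is loaded and `job` comes from rhmr or rhlapply.
# `run_async` is a hypothetical helper, not a RHIPE function.
run_async <- function(job) {
  handle <- rhex(job, async = TRUE)               # returns immediately
  print(handle)                                   # job name, id, state, progress
  status <- rhstatus(handle)                      # counters and progress right now
  result <- rhjoin(handle, ignore.stderr = TRUE)  # blocks until the job completes
  list(status = status, result = result)
}
```

Keeping the handle returned by rhex is what allows the job to be inspected with rhstatus any number of times before rhjoin is finally called to wait for it.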

Serialization

rhsz

rhsz <- function(object)

Serializes a given R object. Currently the only objects that can be serialized are vectors of raws, numerics, integers, strings (including NA), and logicals (including NA), lists of these, and lists of lists of these. Attributes are copied too (e.g., the names attribute). Objects such as matrices and factors also appear to be serialized and unserialized successfully.

rhuz

rhuz <- function(object)

Unserializes a raw object returned from rhsz.
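A round trip through rhsz and rhuz is a quick way to check whether a particular object survives serialization. The sketch below assumes RHIPE is loaded; the helper name roundtrip is hypothetical, and whether the comparison comes out TRUE for a given object type is something to verify rather than assume.

```r
# Sketch only: assumes RHIPE is loaded. `roundtrip` is a hypothetical helper.
roundtrip <- function(x) {
  raw_bytes <- rhsz(x)        # serialize to a raw object
  restored  <- rhuz(raw_bytes) # unserialize it again
  list(bytes = length(raw_bytes),      # size of the serialized form
       same  = identical(restored, x)) # did the object survive unchanged?
}
```

For example, roundtrip(list(a = 1:3, b = c("x", NA))) exercises a list with a names attribute, an integer vector, and a string vector containing NA.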

Map Files

rhS2M

rhS2M <- function (files, ofile, dolocal = T, ignore.stderr = F, verbose = F)

Converts the sequence files specified by files to map files and places them in the destination ofile. If dolocal is TRUE, the conversion is done on the local machine; otherwise it is done over the cluster (which is much faster for anything larger than hundreds of megabytes). If ignore.stderr is FALSE, the MapReduce output is displayed on the R console. For example:

rhS2M("/tmp/so/p*","/tmp/so.map",dolocal=F)

rhM2M

rhM2M <- function (files, ofile, dolocal = T, ignore.stderr = F, verbose = F)

Same as rhS2M, except that it converts a group of map files to map files. Why? Consider a MapReduce job that outputs modified keys in the reduce part, i.e., the reduce receives key K0 but emits f(K0), where f(K0) <> K0. As a result, the keys in the reduce output part files won't be sorted, even though the K0 are sorted.

So, if the reducer emits K0, the output part files constitute a valid collection of sorted map files. If the reducer emits f(K0), this no longer holds. Running rhM2M on this output produces another output in which the keys are now sorted (i.e., we just run an identity MapReduce emitting f(K0), though now the input to the reducers is f(K0)).
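The ordering problem can be seen with a few lines of plain R, without Hadoop. This is only an illustration: the keys and the transformation f are made up, a single reducer is assumed, and numeric order stands in for whatever key ordering Hadoop actually applies.

```r
# Illustration only: made-up keys and a hypothetical key transformation f.
k0 <- 1:6                         # keys in the order the reducer receives them
f  <- function(k) (k * 5) %% 7    # reducer emits f(K0) instead of K0
emitted <- f(k0)                  # order in which keys land in the output file

is.unsorted(k0)        # FALSE: the input keys are sorted
is.unsorted(emitted)   # TRUE: the emitted keys are not
resorted <- sort(emitted)  # what a second identity pass keyed on f(K0) restores
```

The second pass does no real work on the values; it exists only so that the shuffle sorts on f(K0) and the resulting files are valid map files again.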

To specify the input files, it is not enough to specify the directory containing the part files, because the part files are themselves directories, each containing a sequence file and a non-sequence file. Passing the list of directories to a MapReduce job will cause it to fail when it reads the non-sequence file.

Use rhmap.sqs to specify the input files instead.

rhgetkey

rhgetkey <- function (keys, paths, sequence=NULL,skip=0,ignore.stderr = T, verbose = F)

Given a list of keys and a vector of map directories (e.g., "/tmp/ou/mapoutput/p*"), returns a list of key-value pairs. If sequence is a string, the output key-value pairs are written to sequence files on the DFS (the values are not read into R). Set skip to larger (integer) values to avoid reading in all the keys of the table: finding a key is slower, but a much larger database can be searched.
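The two ways of calling rhgetkey described above (returning the pairs to R, or writing them to the DFS) might be wrapped as in the sketch below. It assumes RHIPE is loaded and that map files already exist under the given paths; the helper name fetch_keys and its save_to argument are hypothetical.

```r
# Sketch only: assumes RHIPE is loaded and map files exist under `paths`.
# `fetch_keys` and `save_to` are hypothetical names, not part of RHIPE.
fetch_keys <- function(keys, paths, save_to = NULL) {
  if (is.null(save_to)) {
    rhgetkey(keys, paths)                      # key-value pairs returned to R
  } else {
    rhgetkey(keys, paths, sequence = save_to)  # pairs written to the DFS instead
  }
}
```

A call such as fetch_keys(list("k1", "k2"), "/tmp/ou/mapoutput/p*") would read the values into R, while supplying save_to keeps large results on the DFS.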