RHIPE has functions that access the HDFS from R, functions that are used inside MapReduce jobs, and functions for managing MapReduce jobs.
Before calling any of the functions described below, call rhinit. Calling rhinit(TRUE,TRUE,buglevel=2000) displays a slew of messages, which is useful if RHIPE does not load.
rhex - Submitting a MapReduce R Object to Hadoop
Submits a MapReduce job (created using rhmr) to the Hadoop MapReduce framework. The argument mapred serves the same purpose as the mapred argument to rhmr; values given here override the settings in the object returned from rhmr. By default the function returns when the job ends, whether it succeeded, failed, or was terminated by the user (see rhkill). When async is TRUE, the function returns immediately, leaving the job running in the background on Hadoop.
When async is TRUE, the function returns an object of class jobtoken. The print method for this class, print.jobtoken, displays the start time, duration (in seconds) and percent progress. This object can be used in calls to rhstatus, rhjoin and rhkill. Otherwise it returns a list of counters and the job state.
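The two calling modes can be sketched as a small wrapper. This is only an illustration: submit_job and its background argument are hypothetical names, job is assumed to be an object already created with rhmr, and rhinit is assumed to have been called.

```r
# Hypothetical wrapper around rhex; not part of RHIPE.
# Assumes `job` was created with rhmr() and rhinit() has already been called.
submit_job <- function(job, background = FALSE) {
  # background = FALSE: rhex blocks until the job ends and returns
  #                     the list of counters and the job state.
  # background = TRUE:  rhex returns a jobtoken immediately.
  result <- rhex(job, async = background)
  if (background) {
    # Dispatches to print.jobtoken: start time, duration, percent progress.
    print(result)
  }
  result
}
```

The returned jobtoken can then be passed to rhstatus, rhjoin or rhkill.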
rhstatus - Monitoring a MapReduce Job
This returns the status of a running MapReduce job. The parameter jobid can be either a string with the format job_datetime_id (e.g. job_201007281701_0274) or the value returned from rhex with the async option set to TRUE.
rhstatus returns a list of 4 elements:
- the state of the job (one of START, RUNNING, FAIL, COMPLETE),
- the duration in seconds,
- a data frame with columns for the Map and Reduce phases. This data frame summarizes the number of tasks, the percent complete, and the number of tasks that are pending, running, complete or have failed,
- the user-defined and built-in Hadoop MapReduce counters (user-defined counters are created with a call to rhcounter).
If mon.sec is greater than 0, a small data frame indicating the progress will be returned every mon.sec seconds. If autokill is TRUE, then any R errors caused by the map/reduce code will cause the job to be killed. If verbose is TRUE, the above list will be displayed too.
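When jobid is supplied as a string, it can help to check its shape before passing it on. The helper below is a minimal sketch and not part of RHIPE: is_job_id is a hypothetical name, and the pattern (the prefix job_, a run of digits, an underscore, and another run of digits) is an assumption inferred from the single example job_201007281701_0274.

```r
# Hypothetical helper, not part of RHIPE.
# Assumption: a job ID looks like "job_<digits>_<digits>", as in the
# example job_201007281701_0274; the real format may be stricter.
is_job_id <- function(x) {
  is.character(x) && length(x) == 1 && grepl("^job_[0-9]+_[0-9]+$", x)
}

is_job_id("job_201007281701_0274")  # TRUE
is_job_id("201007281701_0274")      # FALSE: missing the "job_" prefix
```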
rhjoin - Waiting on Completion of a MapReduce Job
Calling this function pauses the R console until the MapReduce job indicated by jobid is over (successfully or not). The parameter jobid can be either a string with the format job_datetime_id or the value returned from rhex with the async option set to TRUE. This function returns the same object as rhex, i.e. a list containing the result of the job (TRUE or FALSE indicating success or failure) and the counters returned by the job. If ignore is FALSE, the progress will be displayed on the R console (much like rhex).
rhkill - Stopping a MapReduce Job
This kills the MapReduce job with the job identifier given by jobid. The parameter jobid can be either a string with the format job_datetime_id or the value returned from rhex with the async option set to TRUE.
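Given a job that is already running in the background, rhstatus, rhjoin and rhkill might be combined as follows. This is a sketch under stated assumptions: finish_or_cancel and its cancel argument are hypothetical, token is either a job ID string or the value returned from rhex with async set to TRUE, and mon.sec = 0 is assumed to request a single status snapshot rather than repeated progress reports.

```r
# Hypothetical helper, not part of RHIPE.
# `token` is a job ID string or the jobtoken returned by rhex(..., async = TRUE).
finish_or_cancel <- function(token, cancel = FALSE) {
  # Assumption: mon.sec = 0 gives one status snapshot instead of
  # periodic progress reports.
  snapshot <- rhstatus(token, mon.sec = 0)
  if (cancel) {
    rhkill(token)                  # stop the job on Hadoop
    return(invisible(snapshot))    # hand back the last status seen
  }
  # Block until the job is over; the result is the same object rhex returns.
  rhjoin(token)
}
```

Returning the last status snapshot after a kill is a design choice made for this sketch, so that the caller can still inspect how far the job had progressed.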