rhlapply applies a user-defined function to each element of a given R list, or runs the function over the integers from 1 to n. In the former case the list is written to a sequence file, whose length is the default setting of rhwrite.
Running hundreds of thousands of separate trials can be terribly inefficient. Instead, consider grouping them, i.e. set mapred.max.tasks to a value much smaller than the length of the list.
rhlapply returns a list whose names are the names of the input list (if given).
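The grouping advice above can be illustrated locally in base R. This is only a sketch: run_trial, run_group, and the sizes are hypothetical stand-ins, and base lapply is used in place of rhlapply so the example runs without a cluster.

```r
# Hypothetical per-trial function: one small simulation per call.
run_trial <- function(i) {
  set.seed(i)
  mean(rnorm(10))
}

n_trials   <- 100000  # number of separate trials
group_size <- 1000    # trials handled by one list element

# Batch the trial indices: 100 list elements instead of 100000.
trial_ids <- seq_len(n_trials)
groups <- split(trial_ids, ceiling(trial_ids / group_size))

# Each list element now processes a whole batch of trials.
run_group <- function(ids) vapply(ids, run_trial, numeric(1))

# Local check on the first two batches with base lapply; on a cluster,
# `groups` and `run_group` would presumably be the list and function
# handed to rhlapply instead.
results <- lapply(groups[1:2], run_group)
```

Batching this way keeps the list short while each element still does a meaningful amount of work.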
rhlapply <- function(ll=NULL,
                     fun,
                     ifolder="",
                     ofolder="",
                     readIn=T,
                     inout=c('lapply','sequence'),
                     mapred=list(),
                     setup=NULL, jobname="rhlapply", doLocal=F, ...
                     )
Description follows
aggr=function(x) do.call("rbind",x)
and the result of rhlapply will be one big data frame.
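What the aggr function above does to a list of data frames can be checked in plain R. This is a minimal sketch: per_element is a made-up stand-in for per-element results, and nothing here calls RHIPE.

```r
# Stand-in for what each list element might return: a small data frame.
per_element <- lapply(1:3, function(i) {
  data.frame(id = i, value = i^2)
})

# The aggregation function from the text: row-bind all list elements.
aggr <- function(x) do.call("rbind", x)

# One data frame with one row per list element.
combined <- aggr(per_element)
print(combined)
```

do.call("rbind", x) requires every element to have the same columns, so the function applied to each element should return data frames of a consistent shape.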
An object that is passed on to rhex.
The object passed to rhex has a variable called rhipe_command, which is the command of the program that Hadoop sends information to. If the R installation on the client machine (the machine from which commands are sent) differs from the R installation on the tasktrackers, the RHIPE command runner won't be found. For example, suppose the cluster runs Linux and the client runs OS X: the rhipe_command variable will then reflect the location of the RHIPE command runner on OS X, not its location in the tasktrackers' (Linux) R distribution.
There are two ways to fix this:
a) after z <- rhlapply(...), change z[[1]][[1]]$rhipe_command to the value it should have on the tasktrackers,
or
b) set the environment variable RHIPECOMMAND on each of the tasktrackers. The RHIPE Java client reads this first, before reading the variable above.
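Option (a) amounts to an ordinary nested-list assignment in R. The sketch below uses a stand-in object, since the real one comes from rhlapply: only the nesting follows the text, and both paths are hypothetical placeholders.

```r
# Stand-in for the object returned by rhlapply; the real object holds
# more fields. Only the [[1]][[1]]$rhipe_command nesting is mirrored.
z <- list(list(list(rhipe_command = "/client/path/to/runner")))  # hypothetical path

# Hypothetical location of the command runner on the tasktrackers.
tasktracker_cmd <- "/tasktracker/path/to/runner"

# Overwrite the client-side value before the object is passed to rhex.
z[[1]][[1]]$rhipe_command <- tasktracker_cmd
```

Option (b) avoids editing each job object, at the cost of configuring every tasktracker once.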