Session 18 - Department of Statistics - Purdue University

Scaling Data Analytics

Speaker(s)

  • Bill Cleveland (Purdue University)
  • Karl Millar (Google)
  • Saptarshi Guha (Mozilla)
  • Luke Tierney (University of Iowa)
  • Lee Edlefsen (Revolution Analytics)
  • Jan Vitek (Purdue University)

Description

R is an open-source, open-development language for data analysis, with a vast package system implementing thousands of methods from statistics and machine learning. It is based on the S language, which won the 1999 ACM Software System Award and "forever altered the way people analyze, visualize, and manipulate data".

Today, R is broadly used in fields ranging from biology to finance. A key challenge, however, is scaling R to the analysis of the large, complex data sets that are now ubiquitous. There is currently substantial interest in extending the language to achieve this without compromising its strengths.

This session will focus on the state of the art of R. It will address two active areas of research: merging R with distributed computational systems such as Hadoop, which distribute the data and computation across a cluster, and refactoring the design and implementation of the R engine to achieve very substantial increases in computational efficiency and reductions in memory footprint.

Schedule

Sun, June 24 - Location: STEW 202

Time | Speaker | Title
8:30-9:00 AM Bill Cleveland Divide and Recombine: An approach to the analysis of complex big data
Abstract: Divide and Recombine (D&R) is an approach to the analysis of complex big data. The data are parallelized: divided into subsets in one or more ways by the data analyst. Numeric and visualization methods are applied to each of the subsets separately. Then the results of each method are recombined across subsets. A D&R statistical method is a pair: a division method and a recombination method. Research in statistical theory and methods for D&R pairs is very broad because optimization of statistical properties depends on the data structure of the big data being analyzed. One seeks the best D&R methods, comparing their statistical properties with those of the results that would have been obtained had it been feasible and practical to compute on all of the data without division.

The goal of D&R is deep analysis, an ability to study the data in detail despite the size and complexity, and an ability to carry out analysis wholly from within an interactive language for data analysis such as R. D&R enables both by introducing an exploitable parallelization that allows a simple and fast embarrassingly parallel computation. This makes it possible to apply almost any existing analysis method from statistics, machine learning, and visualization to complex big data.
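The division/recombination pair described above can be sketched in a few lines. The talk's setting is R; the sketch below uses Python purely for illustration, with hypothetical helper names (`divide`, `fit_slope`, `recombine`) chosen here, and the simplest possible D&R pair: divide into subsets, fit a per-subset estimate, recombine by averaging.

```python
import random

def divide(data, n_subsets):
    """One possible division method: spread the data round-robin into subsets."""
    return [data[i::n_subsets] for i in range(n_subsets)]

def fit_slope(subset):
    """Analytic method applied per subset: least-squares slope through the origin."""
    sxx = sum(x * x for x, y in subset)
    sxy = sum(x * y for x, y in subset)
    return sxy / sxx

def recombine(results):
    """One possible recombination method: average the per-subset estimates."""
    return sum(results) / len(results)

random.seed(1)
data = [(x, 2.0 * x + random.gauss(0, 0.1)) for x in range(1, 1001)]
subsets = divide(data, 10)
estimates = [fit_slope(s) for s in subsets]  # embarrassingly parallel step
print(recombine(estimates))                  # close to the true slope, 2.0
```

The middle step touches each subset independently, which is exactly the exploitable parallelism the abstract describes: in RHIPE it would run as Hadoop tasks rather than a list comprehension.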

RHIPE, the R and Hadoop Integrated Programming Environment, provides D&R analysis wholly from within R. Transparent to the user, Hadoop distributes the subsets across a cluster; schedules and carries out each subset computation with an algorithm that attempts to use a processor as close to each subset as possible; computes across the outputs of the subset computations if needed; provides fault tolerance; and enables simultaneous fair sharing of the cluster by multiple users through fine-grained intermingling of all subset computations. The division into subsets, the subset computations, and the output computations are specified by R commands that are passed to RHIPE functions, which manage the communication with Hadoop.

A cluster architecture, Delta Rho, accommodates the very diverse computational tasking of D&R with R-RHIPE, which ranges from big distributed RHIPE R jobs ("elephants") to nearly instantaneous interactive R commands ("mice") issued while the analyst studies small to moderately sized data reductions.
9:00-9:30 AM Karl Millar Scaling R to Internet Scale Data
Abstract: Analyzing internet-scale data sets requires statistical software that scales to data sizes several orders of magnitude larger than R is currently capable of handling. Tools such as MapReduce and Hadoop are capable of scaling to such large data sizes but are impractical for statisticians to use for data analysis. We will discuss the overall design and API of packages designed to work over Google's distributed computing architecture that help to address these issues. The foundation of this work is a package that provides an R version of the FlumeJava library[1] for distributed computation. This package provides a higher-level abstraction on top of Google's MapReduce framework, providing distributed generic collection classes and simple functions for manipulating them, which are automatically converted into an optimized sequence of MapReduces. Building on top of this functionality, Google is building additional packages that provide a convenient interface for working with large data sets, including distributed versions of some of R's data objects and common functions and parallelized statistical algorithms to analyze large, distributed data sets.

References

[1] Chambers, C., A. Raniwala, F. Perry, S. Adams, R. Henry, R. Bradshaw, and N. Weizenbaum (2010). FlumeJava: Easy, Efficient Data-Parallel Pipelines. ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).
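The key FlumeJava idea the abstract mentions is that operations on a distributed collection are recorded rather than executed, so a chain of them can be fused into an optimized sequence of MapReduces. A minimal single-machine sketch of that deferred-and-fused execution, in Python with an invented `PCollection` class (the real packages are R wrappers over Google infrastructure and look nothing like this), might be:

```python
class PCollection:
    """Toy deferred collection: operations are recorded, fused, and only
    executed when run() is called, mimicking FlumeJava's lazy pipelines."""

    def __init__(self, data):
        self._data = data
        self._ops = []            # deferred pipeline of element-wise functions

    def parallel_do(self, fn):
        self._ops.append(fn)      # record the operation, do not execute it
        return self               # allow chaining

    def run(self):
        # Fuse the whole chain into a single pass over the data, analogous
        # to fusing consecutive ParallelDo stages into one MapReduce.
        def fused(x):
            for fn in self._ops:
                x = fn(x)
            return x
        return [fused(x) for x in self._data]

pc = PCollection([1, 2, 3, 4])
result = pc.parallel_do(lambda x: x * 10).parallel_do(lambda x: x + 1).run()
print(result)  # [11, 21, 31, 41]
```

The point of the deferral is that the optimizer sees the whole pipeline before anything runs, so two logical passes over the data become one physical pass.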
9:30-9:55 AM Saptarshi Guha RHIPE and Massive Data Sets: Its Design, Usage, Lessons Learned and Things to Do
Abstract: The R and Hadoop Integrated Programming Environment (RHIPE) is an R package that brings the Hadoop MapReduce and distributed storage frameworks to R. From within the R IDE, the data analyst can store massive data sets using the Hadoop Distributed File System and compute on them entirely using the MapReduce paradigm. Much of the speed benefit comes from placing blocks of the data on many nodes and taking advantage of data-local computation; this also provides redundancy and reduces network bandwidth. RHIPE allows the user to study massive data sets with the same level of detail as a small data set, without ever leaving the R language. Despite this, RHIPE is language agnostic, and the data created using RHIPE can be studied in other languages, e.g., Python. In this talk, I will discuss the design of RHIPE and provide code samples of its use at Mozilla. If time permits, I will discuss some ideas about RHIPE and Hadoop NextGen.
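For readers unfamiliar with the MapReduce paradigm RHIPE exposes, its shape is: a map function emits key-value pairs, the framework groups values by key, and a reduce function collapses each group. A toy single-machine version in Python (RHIPE itself expresses these phases as R functions handed to Hadoop; the `map_reduce` driver here is invented for illustration):

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Toy MapReduce: map each record to (key, value) pairs,
    shuffle/group by key, then reduce each group."""
    shuffled = defaultdict(list)
    for rec in records:
        for key, value in map_fn(rec):            # map phase
            shuffled[key].append(value)           # shuffle: group by key
    return {k: reduce_fn(k, vs) for k, vs in shuffled.items()}  # reduce phase

lines = ["big data in R", "R and Hadoop", "big R"]
counts = map_reduce(
    lines,
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda key, values: sum(values),
)
print(counts["R"], counts["big"])  # 3 2
```

On a cluster the map calls run where the data blocks live, which is the data-local computation the abstract credits for much of the speed benefit.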
10:00-10:30 AM Break
10:30-11:00 AM Luke Tierney Some New Developments for the R Engine
Abstract: The R language for statistical computing and graphics has become a major framework for both statistical practice and research. This talk will describe some current efforts on improvements to the core computational engine, including work on compilation of R code, efforts to take advantage of multiple processor cores, and modifications to support working with larger data sets.
11:00-11:30 AM Lee Edlefsen Parallel External Memory Algorithms Applied to Generalized Linear Models
Abstract: For the past several decades, the rising tide of technology has allowed the same data analysis code to handle the increase in sizes of typical data sets. That era is ending: the size of data sets is increasing much more rapidly than the speed of single cores, of RAM, and of hard drives. To deal with this, statistical software must be able to use multiple cores and computers. Parallel external memory algorithms (PEMAs) provide the foundation for such software. External memory algorithms (EMAs) are those that do not require all data to be in RAM, and they are widely available. Parallel implementations of EMAs allow them to run on multiple cores and computers, and to process unlimited rows of data. This paper describes a general approach to efficiently parallelizing EMAs, using an R and C++ implementation of GLM as a detailed example. It examines the requirements for efficient PEMAs: the arrangement of code for automatic parallelization, efficient threading, and efficient inter-process communication. It includes billion-row benchmarks showing linear scaling with rows and nodes.
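The essence of an external memory algorithm is that each chunk of data only updates a small set of sufficient statistics, so the full data never needs to be in RAM, and because those per-chunk statistics are additive, chunks can be processed on different cores or nodes and summed. A sketch in Python, using ordinary least squares (the simplest GLM) rather than the talk's full R/C++ GLM implementation, with helper names (`chunked`, `ema_linear_fit`) invented here:

```python
def chunked(data, size):
    """Yield the data in chunks, as if read from disk a block at a time."""
    for i in range(0, len(data), size):
        yield data[i:i + size]

def ema_linear_fit(chunks):
    """External-memory fit of y = a + b*x. Each chunk only updates additive
    sufficient statistics, so RAM use is bounded by the chunk size, and
    per-chunk statistics from different cores or nodes could simply be summed."""
    n = sx = sy = sxx = sxy = 0.0
    for chunk in chunks:
        n += len(chunk)
        sx += sum(x for x, y in chunk)
        sy += sum(y for x, y in chunk)
        sxx += sum(x * x for x, y in chunk)
        sxy += sum(x * y for x, y in chunk)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

data = [(x, 3.0 + 2.0 * x) for x in range(1000)]
a, b = ema_linear_fit(chunked(data, size=100))
print(a, b)  # recovers intercept 3.0 and slope 2.0
```

A general GLM needs several such passes (one per iteration of iteratively reweighted least squares), but each pass has exactly this chunk-accumulate-solve structure.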
11:30-11:55 AM Jan Vitek Taming the Tiger: How to scale R to bigger data
Abstract: The R language is a surprisingly successful mix of computer science ideas drawn from functional and object-oriented languages. This rather unlikely linguistic cocktail would probably never have been mixed by computer scientists, yet the language has become amazingly popular. With millions of lines of R code available in repositories, we have an opportunity to evaluate the fundamental choices underlying the R language design. Using a combination of static and dynamic program analysis, we assess the success of different language features. Based on our observations, we identify challenges and opportunities for scaling R to bigger problems.

Purdue Department of Statistics, 150 N. University St, West Lafayette, IN 47907

Phone: (765) 494-6030, Fax: (765) 494-0558

© 2023 Purdue University | An equal access/equal opportunity university
