Divide & Recombine with DeltaRho R & Hadoop for Big Data Analysis - Department of Statistics - Purdue University

Divide & Recombine with DeltaRho R & Hadoop for Big Data Analysis

Organizer and Chair: William S. Cleveland, Shanti S. Gupta Distinguished Professor of Statistics, Department of Statistics, Purdue University

Speakers

  • William S. Cleveland, Shanti S. Gupta Distinguished Professor of Statistics, Department of Statistics, Purdue University
  • Wen-wen Tung, Associate Professor, Department of Earth, Atmospheric, and Planetary Sciences, Purdue University
  • Aritra Chakravorty, Ph.D. Candidate, Department of Statistics, Purdue University
Schedule

Friday, June 8, 1:30-3:30 p.m. in STEW 202

Time Speaker Title
1:30-2:10 p.m. William S. Cleveland Divide & Recombine (D&R) with DeltaRho for Big Data Analysis
Abstract: In D&R, the analyst divides the data into subsets by a D&R division method. Each analytic method is applied to each subset, independently, without communication. Outputs of each analytic method are recombined by a D&R recombination method. Sometimes the goal is one result for all of the data, such as a logistic regression; D&R theory and methods seek division and recombination methods that maximize the statistical accuracy. In practice, division is commonly based on the subject matter. The data are divided by conditioning on variables important to the analysis; the outputs can be the final result, or further analysis of the outputs is carried out. Much of D&R computation is of the simplest kind: embarrassingly parallel. DeltaRho D&R software is open-source (www.deltarho.org). The front end is the DeltaRho R package datadr. The back end is a distributed database and parallel compute engine (DD-PCE) that spreads subsets and outputs across a database, and executes the analyst's R and datadr code in parallel. The DeltaRho software component RHIPE provides integration of datadr and the widely used Hadoop DD-PCE. With D&R, we get deep analysis, which means analysis of the data at their finest granularity, including visualization. We get all of the tasking of data analysis, not just optimization. Through R we have access to thousands of methods of statistics, machine learning, and visualization. DeltaRho makes it easy to program D&R, protecting the analyst from the details of parallel computation and database management. DeltaRho can dramatically increase the data size and analytic computational complexity that are feasible in practice, whether the available hardware power is small, medium, or large. This performance does not require that all of the data reside in memory at the same time, a requirement that is a severe limitation for a large fraction of analyses in practice. In fact, the data can have a memory size that is larger than the physical memory.
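The divide, apply, and recombine steps described in the abstract can be illustrated with a minimal sketch. It is written in plain Python with the standard library only and does not use datadr or RHIPE. The function names (divide, fit_slope, recombine_mean), the toy records, and the size-weighted average used as the recombination method are all assumptions made for illustration, not the DeltaRho API or a method taken from the talk.

```python
from collections import defaultdict
from statistics import fmean


def divide(records, key):
    """Division by a conditioning variable: one subset per distinct key value."""
    subsets = defaultdict(list)
    for rec in records:
        subsets[rec[key]].append(rec)
    return dict(subsets)


def fit_slope(subset):
    """Analytic method applied to a single subset: least-squares slope of y on x."""
    xs = [r["x"] for r in subset]
    ys = [r["y"] for r in subset]
    xbar, ybar = fmean(xs), fmean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return sxy / sxx


def recombine_mean(outputs, sizes):
    """Illustrative recombination: size-weighted average of subset estimates."""
    total = sum(sizes)
    return sum(o * n for o, n in zip(outputs, sizes)) / total


# Hypothetical toy data; "site" plays the role of the conditioning variable.
records = [
    {"site": "a", "x": 0.0, "y": 1.0},
    {"site": "a", "x": 1.0, "y": 3.0},
    {"site": "a", "x": 2.0, "y": 5.0},
    {"site": "b", "x": 0.0, "y": 0.0},
    {"site": "b", "x": 2.0, "y": 8.0},
]

subsets = divide(records, "site")

# Each call reads only its own subset, so these calls need no communication
# and could run in parallel on separate workers.
per_subset = {k: fit_slope(v) for k, v in subsets.items()}

# One result for all of the data, obtained from the subset outputs alone.
overall = recombine_mean(
    list(per_subset.values()), [len(v) for v in subsets.values()]
)
```

Here per_subset is itself a possible final result (one slope per site), while overall shows the other case mentioned in the abstract, a single result for all of the data. A weighted average is only one possible recombination and is not claimed to be the statistically best choice.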
2:10-2:50 p.m. Wen-wen Tung DeltaRho for Deep Analysis of Atmospheric Convection and Precipitation to Advance the Understanding of Earth's Water Cycle
Abstract: Precipitation and its associated physical processes, such as atmospheric convection, have both operational and fundamental scientific importance. They form the parts of the local and global water cycles that concentrate the heat used to evaporate the water and deliver water to the ocean or land surfaces. Satellite remote sensing potentially offers detailed records at spatial and temporal scales small enough to resolve the local features of precipitating cloud systems over decades and across the planet. Thus, scalable data analysis techniques become ever more critical in anticipation of the deluge of current and future satellite data.

In this talk, we present the analysis of satellite-based precipitation on a fixed 0.25 deg x 0.25 deg horizontal mesh at 3-hourly time intervals from the Tropical Rainfall Measuring Mission (TRMM, 1997-2015). In combination with the 6-hourly ECMWF global data assimilation (reanalysis) products of atmospheric motions, temperature, and moisture at 1.5 deg x 1.5 deg horizontal resolution from 1998 to 2015, we attempt to infer the heating in the atmosphere associated with cloud processes and the ensuing precipitation. We then examine their sub-seasonal temporal persistence patterns as well as changes over the period of record. These analyses have been facilitated by the open-source DeltaRho software (http://deltarho.org) on clusters running the Hadoop distributed file system.
2:50-3:30 p.m. Aritra Chakravorty Introduction to Embarrassingly Parallel Statistics and its applications for computation of Quantiles and KD-trees for large data via Divide and Recombine method

Abstract: In Divide & Recombine (D&R), data are divided into subsets, analytic methods are applied to each subset independently, with no communication between processes; then the subset outputs for each method are recombined. For big data, this provides almost all of the analytic tasking needed when data are analyzed. It also provides high computational performance because typically most of the computation is embarrassingly parallel, the simplest parallel computation. As with small data, quantiles of continuous variables for all of the data across subsets are very useful summary statistics. For big data, recombination requires both numeric accuracy and high computational performance. To achieve this, the general and more widely applicable concepts of embarrassingly parallel statistics (EPS), both weak and strong, are introduced. This provides a framework for the fast and accurate D&R method EPS Fourier Quantile (EPS-FQ). It uses a Fourier series to approximate an optimization criterion for quantiles. The series terms, which are strongly EPS, are summed across subsets, and the result is optimized. The speed and accuracy of EPS-FQ are compared with those of the widely used binning method, which can also be formulated in terms of EPS. The second part of the talk is about an algorithm to construct a KD tree by Divide and Recombine. This algorithm is a more complex application of EPS that allows us to identify the vertices of the KD tree. Theoretical and simulation results are provided to demonstrate the accuracy and time complexity of the algorithms.
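The binning method mentioned in the abstract can be sketched as follows. This is not EPS-FQ, whose details are not given here. It only illustrates how a quantile for all of the data can be recovered from statistics that are computed per subset and then summed. The sketch assumes a known data range [lo, hi] with every value inside it, equal-width bins shared by all subsets, and linear interpolation within a bin. The function names and toy data are hypothetical.

```python
def bin_counts(subset, lo, hi, nbins):
    """Per-subset statistic: counts on a fixed grid of equal-width bins.

    Assumes every value lies in [lo, hi].
    """
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for v in subset:
        # Values equal to hi are placed in the last bin.
        i = min(int((v - lo) / width), nbins - 1)
        counts[i] += 1
    return counts


def sum_counts(all_counts):
    """Recombination step: add the count vectors bin by bin."""
    return [sum(col) for col in zip(*all_counts)]


def binned_quantile(counts, lo, hi, p):
    """Approximate the p-quantile from summed counts by interpolating in a bin."""
    n = sum(counts)
    width = (hi - lo) / len(counts)
    target = p * n
    cum = 0
    for i, c in enumerate(counts):
        if c > 0 and cum + c >= target:
            frac = (target - cum) / c
            return lo + (i + frac) * width
        cum += c
    return hi


# Hypothetical subsets of one continuous variable with range [0, 10].
subsets = [[0.5, 1.5, 2.5], [3.5, 4.5], [5.5, 6.5, 7.5, 8.5, 9.5]]
lo, hi, nbins = 0.0, 10.0, 10

# Each subset is binned independently, with no communication between subsets.
per_subset = [bin_counts(s, lo, hi, nbins) for s in subsets]
merged = sum_counts(per_subset)
median = binned_quantile(merged, lo, hi, 0.5)
```

The accuracy of this approach is limited by the bin width, so finer bins cost more memory per subset. That trade-off is one reason to compare binning with other recombination methods.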

Purdue Department of Statistics, 150 N. University St, West Lafayette, IN 47907

Phone: (765) 494-6030, Fax: (765) 494-0558
