Introduction to a Multithreaded and Distributed R for Big Data Analysis

RAWL 2082 | 8:30 a.m. - 4:00 p.m.

Description

R is one of the most popular software tools for data analysis. Over the past decade or so, tremendous effort has gone into making R useful for big data analysis. The results include Tessera, Revolution-R, and SparkR, among others, all of which rely on Java-based software such as Hadoop and Spark.

In this workshop, we introduce an entirely new alternative: a multithreaded and distributed R called SupR. The SupR prototype was built by modifying the existing internal C implementation of R (R-3.1.1). Its key features include

an R-style front-end obtained by maintaining the existing R syntax and internal basic data structures,
a Java-like multithreading model,
a Spark-like cluster computing environment, and
a built-in simple distributed file system.

Students are expected to bring their own laptop computers, on which they will install VirtualBox, Ubuntu, and the current version of SupR. More information on installation and course materials is available at SupR: Multithreaded and Distributed R release.