Introduction to Data Analysis for Big Data - Department of Statistics - Purdue University

Introduction to Data Analysis for Big Data

RAWL 2058 | 8:30 a.m. - 4:00 p.m.

 

Description

This one-day introductory workshop is geared toward participants who want to revitalize or improve their data analysis skills, with an emphasis on big data. Ward will present tools and techniques for the most fundamental, low-level aspects of data analysis. We are well versed in teaching such techniques to students who have no background in data analysis or programming, and the workshop has no prerequisites. It will be hands-on and driven by examples that use large data sets, and it will bring people up to speed with powerful techniques for data analysis. The intended participants are people who work in a data-driven environment and have an increasing need to analyze large data sets.

Before data can be analyzed, it must be gathered and organized, which requires a great deal of data manipulation, especially for big data sets. Sometimes the data need to be scraped from remote sources and then parsed into more natural forms, a process that often involves munging and cleaning. Being able to reproduce and reliably verify every method used for the data wrangling is more important than ever.

R will be the main tool used in the workshop. The workshop is geared toward practitioners with only a limited knowledge of R, or even no knowledge of R at all. For instance, someone who has previously used only Excel, SAS, or Tableau for data analysis is a perfect candidate for this all-day immersive workshop. We will use R and its XML scraping and parsing libraries to pull raw data from disparate sources on the internet and wrangle them into forms that are amenable to data analysis.

The entire workshop will be example-driven. Participants should bring a laptop computer (Mac, Windows, and UNIX are all welcome). We will work in RStudio. Instructions for installing the necessary software can be sent to the participants before the workshop starts. We will use R Markdown for creating reproducible documents.

By the end of the one-day workshop, participants will have learned how to scrape data sets from the web, parse the desired portions of the data, wrangle them into a form suitable for data analysis, and perform some cleaning and verification of the data. Reproducibility and reliability will be emphasized throughout the workshop.

Purdue Department of Statistics, 150 N. University St, West Lafayette, IN 47907

Phone: (765) 494-6030, Fax: (765) 494-0558
