Concepts in Computing with Data
STAT 59800-005, Summer 2009

Computer Laboratory: Wed, 9:50 AM -- 10:50 AM, in SC 183 (Banner CRN 88739)

Professor: Mark Daniel Ward
Email: mdw@purdue.edu
Office: MATH 540
Phone: 765-496-9563
Office hours: Wed, 8:30 AM -- 9:30 AM, in MATH 540


This course is based on the topics advocated at the recent Workshop on Integrating Computing into the Statistics Curricula, organized by Mark Hansen (UCLA), Deborah Nolan (UC Berkeley), and Duncan Temple Lang (UC Davis), and sponsored by the National Science Foundation and CAUSE. Dr. Ward acknowledges and thanks the organizers for sharing their materials from previous courses they have taught on similar topics.

The technologies we will cover this summer are listed in the Outline of Topics below.

Some interesting websites for data are given here. Additional sources of data or data representations are very welcome.

Course description: click here

Course policy: click here

Homework: (subject to small changes)
Outline of Topics
Week 1:
June 15-19
Reading material:
An Introduction to R,
pages 1-32
Lecture material:
day1.txt
day2.txt
day3.txt
day4.txt
day5.txt
We covered the basics: help, examples, assignments, variables, vectors, recycling, arithmetic with vectors,
and the seq() command. We used seq() and rep(), logical vectors, NA, NaN, and character vectors.
We discussed the 4 types of index vectors, as well as objects and their attributes.
We also discussed factors, levels, tapply(), functions, and cut(). We began arrays and matrices,
then learned more about them. We covered lists and data frames.
Day 5's code used two sample files: mydatafile.txt and dataFileWithoutRowHeaders.txt
Week 2:
June 22-26
Reading material:
An Introduction to R,
pages 32-49
Lecture material:
day6.txt
day7.txt
day8.txt
day9.txt
We covered the scan() command, edit(), and probability distributions. We used two tiny sample files:
trythis.dat and trythis2.dat. We discussed loops and conditionals: if, else, ifelse, for, repeat, break, next, while.
We made one small comparison between R and C++ (day7.cpp).
We defined an R t-test function. We discussed user-defined functions, including user-defined binary operators.
We discussed arguments to functions. We printed matrices without labels. We also defined recursive functions.
We discussed scope of variables, differences of R versus Splus, customizing the R environment, and generic functions.
Week 3:
June 29-July 3
Reading material:
R Graphics
by Paul Murrell

Lecture material:
day10.txt
day11.txt
day12.txt
day13.txt
How to display data badly
day15part1.txt
day15part2.txt
day15part3.txt
All book examples from R Graphics are posted free on Paul Murrell's website. We discussed output of graphics to files,
graphics display devices, and several examples. We briefly discussed plots of multiple variables and interactive plots.
We began to investigate parameters of the graphics system. We did an 8-part saga on "Controlling the Appearance of Plots".
We also studied multiple plots on a page. We discussed the paper "How to display data badly" by Howard Wainer,
The American Statistician 38(2), 1984. We examined an example pie chart and discussed how the data would be better
displayed with dotchart(z). We briefly discussed some strategies from the "Principles of Graph Construction" section
of William S. Cleveland's book, The Elements of Graphing Data, Revised Edition (Hobart Press, 1994).
We also discussed banking to 45 degrees and Q-Q plots, gave an introduction to Lattice from
Lattice: Multivariate Data Visualization with R by Deepayan Sarkar (Springer, 2008), and introduced the maps library.
Week 4:
July 6-10
Reading material:
Learning the Unix Operating System,
Chapters 3, 4, 5, 7, 8.
Also Learning the bash Shell,
Chapters 1, 3, 4, 5.
Lecture material:
day16.txt
day17.txt
day18.txt
day19.txt
day20.txt
day21.txt
We began to get familiar with UNIX, the X11 environment, and the bash shell. The two books mentioned are easiest to read by working through the examples; you can skip a lot of the discussion.

In the UNIX book, if you have ever used UNIX, you are probably already familiar with the topics covered. This is a very introductory book on UNIX.

In the bash shell book: Chapter 1 just has some bash basics. You do not need to know emacs or vi, so most of Chapter 2 can be skipped. For Chapter 4, I provided some scripts in a file called day19scripts, and we also used a file called songfile.txt.
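
To give a flavor of the kind of short script this module works with, here is a minimal sketch. It is not taken from day19scripts; the function name count_matches and the sample data are hypothetical and chosen only for illustration.

```shell
#!/bin/bash
# Hypothetical example: count the lines of a file that contain a word.
# Usage: count_matches WORD FILE   (prints the number of matching lines)
count_matches() {
    local word="$1"
    local file="$2"
    grep -c -- "$word" "$file"
}

# Build a small throwaway data file (made-up contents).
tmpfile=$(mktemp)
printf '%s\n' "red apple" "green pear" "red cherry" > "$tmpfile"

# Loop over several words and report the count for each.
for w in red green blue; do
    echo "$w: $(count_matches "$w" "$tmpfile")"
done

# Clean up the temporary file.
rm -f "$tmpfile"
```

The function takes its arguments as positional parameters ($1, $2), and the for loop calls it once per word through command substitution.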
Week 5:
July 13-17
Reading material:
Gawk: Effective awk Programming,
Chapters
1, 2, 3, 4, 5, 6
Lecture material:
day22.txt
day23.txt
day24.txt
day25.txt
day26.txt
(I recommend reading the pdf version of the Gawk book, and using the ASCII text version to copy-and-paste code examples;
just remember to use nawk instead of awk when working on expert.ics.purdue.edu)
We did an overview of the many powerful and easy-to-use features of awk. You are welcome to use their example files:
BBS-list and inventory-shipped
We studied regular expressions, input and output, all kinds of expressions in awk, as well as patterns, actions, and variables.
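
The pattern/action style described above can be sketched with a few one-liners. The data below is made up for illustration (it is not BBS-list or inventory-shipped), and on expert.ics.purdue.edu you would type nawk in place of awk, as noted above.

```shell
# Made-up sample data: name, department, hours.
data='alice stats 12
bob math 7
carol stats 9
dave cs 4'

# Regular-expression pattern: print the name on lines whose second field matches /stat/.
stats_names=$(printf '%s\n' "$data" | awk '$2 ~ /stat/ { print $1 }')
echo "$stats_names"

# A user variable and an END action: sum the third field over all lines.
total=$(printf '%s\n' "$data" | awk '{ sum += $3 } END { print sum }')
echo "total hours: $total"

# Comparison pattern plus the built-in variable NR (current line number).
printf '%s\n' "$data" | awk '$3 > 8 { print NR ": " $1 }'
```

Each awk program has the form pattern { action }: the action runs only on input lines for which the pattern is true, and a rule with no pattern runs on every line.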
Week 6:
July 20-24
Reading material:
Learning SQL,
Chapters
1, 2, 3, 4, 5, 6
Lecture material:
day28.txt
day34.txt
day41part1.txt
Dr. Ward was away during the SQL module in the Spring semester, so there are very limited notes. He picked SQL for this module because it is extremely easy to learn, so please read the 6 chapters of the book (they are very straightforward). The data from the Learning SQL book only needs to be downloaded one time; use the following instructions to download the data set from the book to your account:
First log on to expert.ics.purdue.edu by typing
ssh expert.ics.purdue.edu
and then load bash by typing
bash
To download the data, type
wget examples.oreilly.com/learningsql/LearningSQLExample.sql
Load the database using the following instructions, which you only need to do one time:
Log on to the MySQL server:
mysql -h mydb.ics.purdue.edu -p
type the password that you selected for your MySQL account (not your Career password). Then type the following:
use mdw; (but please use your Purdue ID, not my Purdue ID)
source LearningSQLExample.sql;
exit;
Now the database is loaded permanently into your MySQL account. You do NOT need to do any of the steps again.
To check that the database was loaded correctly, you can do the following:
Log back on to the MySQL server:
mysql -h mydb.ics.purdue.edu -p
type the password assigned to you in class. Then type the following:
use mdw; (again, please use your own Purdue ID, not mine)
SHOW TABLES;
You will see the names of the tables from the book.
Week 7:
July 27-31
Reading material:
Learning XML, 2nd Edition,
Lecture material:
day33.txt
day35.txt
day36.txt
day37.txt
day38part1.txt
day38part2.txt
day39part1.txt
day39part2.txt
day40.txt
day41part2.txt
day42part1.txt
day42part2.txt
day43.txt
The lectures from days 33, 35, 36, 37 all focus on XML and XPath.
We used some example files:
(If you download them using Firefox, please use the All Files option.)
examplegrades.xml
indiana.html
foodsample.xml
Library.xml
cdcatalog.xml
bookexample.xml

At the students' request, we also had a few lectures on Perl, during days 38, 39, 40, 41, 42, 43.
These lectures on Perl followed the Learning Perl book, and used the example datasets:
daysoftheweek.txt
pizza.txt
school.txt
2600.txt (which is the War and Peace book from Project Gutenberg)
some Carnegie Classification data
Week 8:
August 3-7
final, cumulative project