Concepts in Computing with Data ("STAT 598Z")
STAT 59800-012, Spring 2009

Lecture: MWF, 9:30 AM -- 10:20 AM, in REC 227 (Banner CRN 27908)
Computer Laboratory: Wed, 8:30 AM -- 9:20 AM, in SC 183 (Banner CRN 35346)

Professor: Mark Daniel Ward
Email: mdw@purdue.edu
Office: MATH 540
Phone: 765-496-9563
Office hours: Mon and Fri, 8:30 AM -- 9:20 AM, in MATH 540


This course is based on the topics advocated at the recent Workshop on Integrating Computing into the Statistics Curricula, organized by Mark Hansen (UCLA), Deborah Nolan (UC Berkeley), and Duncan Temple Lang (UC Davis), and sponsored by the National Science Foundation and CAUSE. Dr. Ward acknowledges and thanks the organizers for sharing their materials from previous courses they have taught on similar topics.

We will attempt to cover some or all of the following technologies this semester:

Some interesting websites for data are given here. Additional sources of data or data representations are very welcome:

Course description: click here

Course policy: click here

Homework: (subject to small changes)
Outline of Topics
Week 1: Mon, Jan 12 (day1.txt)
We covered basics: help, examples,
assignments, variables, vectors,
recycling, arithmetic with vectors,
and the seq() command.
Read pages 1-8 from
An Introduction to R
Wed, Jan 14 (day2.txt)
We used seq() and rep(), logical vectors,
NA, NaN, character vectors.
We discussed 4 types of index vectors.
We also discussed attributes of objects.
Read pages 9-13 from
An Introduction to R
Fri, Jan 16 (day3.txt)
We discussed objects and attributes.
We also discussed factors, levels,
tapply(), functions, cut().
We began arrays and matrices.
Read pages 14-19 from
An Introduction to R
Week 2: Mon, Jan 19 (no lecture)
Rev. Dr. Martin Luther King, Jr., Day
Wed, Jan 21 (day4.txt)
We learned more about arrays
and matrices. If you have data
that you want to use in future
projects, please let me know!
Please read pages 20-25 from
An Introduction to R
Fri, Jan 23 (day5.txt)
We covered lists and data frames.
Today's code used two sample files:
mydatafile.txt and
dataFileWithoutRowHeaders.txt
Please read pages 26-31 from
An Introduction to R
Week 3: Mon, Jan 26 (day6.txt)
We discussed Project 2.
We covered the scan() command,
edit(), and probability distributions.
Today we had two tiny sample files:
trythis.dat and trythis2.dat
Please read pages 32-35 from
An Introduction to R
Wed, Jan 28 (day7.txt)
We discussed loops and conditionals:
if, else, ifelse, for,
repeat, break, next, while
We made one small comparison between
R and C++ (see day7.cpp).
We defined an R t-test function.
Please read pages 40-42 from
An Introduction to R
Fri, Jan 30 (day8.txt)
We discussed user-defined functions,
including user-defined binary operators.
We discussed arguments to functions.
We printed matrices without labels.
We also defined recursive functions.
Please read pages 43-46 from
An Introduction to R
Week 4: Mon, Feb 2 (day9.txt)
We discussed scope of variables,
differences between R and S-PLUS,
customizing the R environment,
and generic functions.
Please read pages 47-49 from
An Introduction to R
Wed, Feb 4 (day10.txt)
We will follow Paul Murrell's book
R Graphics. All book examples
are posted free on his website.
We discussed these examples,
and output of graphics to files,
graphics display devices,
and several examples of functions
from the graphics library.
Fri, Feb 6 (day11.txt)
We continue to follow R Graphics.
Today's lecture shows plot regions
with illustrations from the book
(see the code in the day11.txt file).
We briefly discussed plots of
multiple variables and interactive plots.
We began to investigate parameters
of the Graphics system today.
Week 5: Mon, Feb 9 (day12.txt)
We continue to follow R Graphics.
We begin an 8-part saga
on "Controlling the
Appearance of Plots":
I. Colors
II. Lines
III. Text
IV. Data Symbols
Wed, Feb 11 (day13.txt)
We continue to follow R Graphics.
We finish an 8-part saga
on "Controlling the
Appearance of Plots":
V. Axes
VI. Plotting Regions
VII. Clipping
VIII. Moving to a new plot
Also: multiple plots on a page
1. traditional graphics and par()
2. layout()
3. split.screen()
Fri, Feb 13
We discussed the paper:
How to display data badly
by Howard Wainer,
The American Statistician 38(2), 1984.
We also looked at
an example pie chart and
we discussed how the data would be
more easily decoded when viewed as
a dotchart, i.e., with dotchart(z)
Week 6: Mon, Feb 16
We briefly discussed some
strategies from the "Principles
of Graph Construction" section
of
William S. Cleveland's book,
The Elements of Graphing Data,
Revised Edition
(Hobart Press, 1994)
We also discussed
banking to 45 degrees
and Q-Q plots (see day15part1.txt),
an introduction to Lattice from
Lattice: Multivariate Data Visualization with R
by Deepayan Sarkar (Springer, 2008)
(see day15part2.txt),
and an introduction to the
maps library (see day15part3.txt).
Wed, Feb 18 (day16.txt)
We began to get familiar with UNIX
and the X11 environment. Please read
Learning the Unix Operating System,
Fifth Edition
by Jerry Peek,
Grace Todino-Gonguet, and
John Strang (O'Reilly, 2001).
You don't need every topic in the book.
It is easy to read by doing the examples.
You can skip a lot of the discussion.
Consider reading Chapters 3, 4, 5, 7, 8.
Fri, Feb 20 (day17.txt)
We continue with UNIX and we begin
to work with the bash shell. See
Chapter 1 of Learning the bash Shell,
by Cameron Newham and Bill Rosenblatt
(O'Reilly, 2005).
You don't need every topic in the book.
It is easy to read by doing the examples.
You can skip a lot of the discussion.
Chapter 1 just has some bash basics.
Week 7: Mon, Feb 23 (day18.txt)
We continue to work with the bash shell.
See Chapter 3 of Learning the bash Shell.
(We also finished Chapter 1, and only
discussed a few aspects of Chapter 2.
You do not need to know emacs or vi,
so most of Chapter 2 can be skipped.)
Wed, Feb 25 (day19.txt)
We continue to work with the bash shell.
See Chapter 4 of Learning the bash Shell.
The scripts from today are in
a file called day19scripts
and we also used a songfile.txt
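
As a small illustration of the kind of short script this part of the course deals with, here is a minimal sketch of a bash function that uses a positional parameter, a string operator, and command substitution. The function name strip_ext is hypothetical, and the code is not taken from day19scripts; only the file name songfile.txt is borrowed from the outline as sample input.

```shell
#!/bin/bash
# Hypothetical helper: print a file name with its final extension removed.
strip_ext() {
    local name="$1"      # first positional parameter
    echo "${name%.*}"    # %.* deletes the shortest trailing match of ".*"
}

# Command substitution captures the function's output in a variable.
base=$(strip_ext "songfile.txt")
echo "$base"
```

Because the % operator removes only the shortest match, a name such as a.tar.gz loses just its last extension.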
Fri, Feb 27 (day20.txt)
We continue to work
with the bash shell.
See Chapters 4 and 5 of
Learning the bash Shell.
Week 8: Mon, Mar 2 (day21.txt)
We finish our work
with the bash shell.
See Chapter 5 of
Learning the bash Shell
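
Chapter 5 of the book deals with flow control. The following is a minimal sketch of a case statement, a for loop, and an if/else test working together. The function name classify is hypothetical, and the file names are used only as sample strings; no files are read.

```shell
#!/bin/bash
# Hypothetical example: classify file names by their extension.
classify() {
    case "$1" in
        *.txt)        echo "text" ;;
        *.xml|*.html) echo "markup" ;;
        *)            echo "other" ;;
    esac
}

# Loop over a few sample names and branch on the result.
for f in day21.txt indiana.html day7.cpp; do
    kind=$(classify "$f")
    if [ "$kind" = "other" ]; then
        echo "$f: not a text or markup file"
    else
        echo "$f: $kind"
    fi
done
```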
Wed, Mar 4 (day22.txt)
We did an overview of the many
powerful and easy-to-use features of awk.
See Chapter 1 of
Gawk: Effective awk Programming
by Arnold Robbins
(I recommend reading the pdf version,
and using the ASCII text version to
copy-and-paste code examples;
just remember to use nawk instead of awk
when working on expert.ics.purdue.edu)
You are welcome to use their example files:
BBS-list and inventory-shipped
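
To give a flavor of awk's pattern-and-action style, here is a minimal sketch run from the shell. The three-line table is made up for illustration (it is not BBS-list or inventory-shipped), and the command is written as awk; per the note above, use nawk in its place on expert.ics.purdue.edu.

```shell
#!/bin/bash
# Made-up sample data: name, department, hours (whitespace-separated).
data='alice stat 12
bob math 7
carol stat 9'

# Print the name and hours for rows whose second field is "stat".
stat_rows=$(printf '%s\n' "$data" | awk '$2 == "stat" { print $1, $3 }')
echo "$stat_rows"

# Add up the third field over all rows and print the total at the end.
total=$(printf '%s\n' "$data" | awk '{ sum += $3 } END { print sum }')
echo "$total"
```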
Fri, Mar 6 (day23.txt)
We studied regular expressions in awk.
See Chapter 2 of
Gawk: Effective awk Programming
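
A minimal sketch of regular expressions in awk, assuming a made-up list of file names as input (these strings are only sample text, not course files). It shows a bare /regex/ pattern, which prints every matching line, and the ~ operator, which matches a regular expression against a single field.

```shell
#!/bin/bash
# Made-up sample lines, one name per line.
lines='day22.txt
notes.pdf
day7.cpp
day23.txt'

# A /regex/ pattern with no action prints each line that matches.
txt_files=$(printf '%s\n' "$lines" | awk '/^day[0-9]+\.txt$/')
echo "$txt_files"

# The ~ operator tests one field against a regex; count the matches.
count=$(printf '%s\n' "$lines" | awk '$1 ~ /\.txt$/ { n++ } END { print n }')
echo "$count"
```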
Week 9: Mon, Mar 9 (day24.txt)
We studied input and output in awk.
See Chapters 3 and 4 of
Gawk: Effective awk Programming
Wed, Mar 11 (day25.txt)
We studied all kinds of expressions in awk.
See Chapters 4 and 5 of
Gawk: Effective awk Programming
Fri, Mar 13 (day26.txt)
We studied patterns,
actions, and variables in awk.
See Chapter 6 of
Gawk: Effective awk Programming
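
The interplay of patterns, actions, and variables can be sketched as follows. The comma-separated data and the variable name min are made up for illustration. The sketch uses a BEGIN pattern to set the built-in variable FS, the built-in NR to skip the header line, and the -v option to pass a value in from the shell.

```shell
#!/bin/bash
# Made-up comma-separated data: item, quantity.
csv='item,qty
apple,3
pear,10
plum,6'

# BEGIN runs before any input is read and sets the field separator.
# NR > 1 skips the header; min is supplied from the shell with -v.
big=$(printf '%s\n' "$csv" |
      awk -v min=5 'BEGIN { FS = "," } NR > 1 && $2 >= min { print $1 }')
echo "$big"
```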
Spring Break: Mon, Mar 16 (no lecture)
Wed, Mar 18 (no lecture)
Fri, Mar 20 (no lecture)
Week 10: Mon, Mar 23
We did an overview of SQL.
See Chapter 1, and part of Chapter 2, from
Learning SQL by Alan Beaulieu
Wed, Mar 25 (day28.txt)
Here are notes about MySQL at Purdue.
Finish Chapter 2 and begin Chapter 3 from
the Learning SQL book.
Fri, Mar 27 (no lecture)
Finish Chapter 3, and
solve the Chapter 3 exercises from
the Learning SQL book.
Week 11: Mon, Mar 30 (no lecture)
Read Chapter 4, and
solve the Chapter 4 exercises from
the Learning SQL book.
Wed, Apr 1 (no lecture)
Read Chapter 5, and
solve the Chapter 5 exercises from
the Learning SQL book.
Fri, Apr 3 (no lecture)
Read Chapter 6, and
solve the Chapter 6 exercises from
the Learning SQL book.
Week 12: Mon, Apr 6 (day33.txt)
We began to discuss XML,
which is discussed in the book:
Learning XML, 2nd Edition
by Erik T. Ray. We discussed how
to parse XML code using R,
with the XML library in
Duncan Temple Lang's
Omega project.

Some files we used:
examplegrades.xml
indiana.html
foodsample.xml
If you download them in Firefox,
please use the All Files option.
Wed, Apr 8 (day34.txt)
We discussed a few remaining
nawk questions from Project 6.
We spent most of the lecture,
however, discussing the
MySQL baseball database
that is used in Project 7.
Fri, Apr 10 (day35.txt)
We continued our discussion
from Monday about XML,
which is discussed in the book:
Learning XML, 2nd Edition.
We gave more examples, including
one introductory example using XPath.
Some files we used:
Library.xml
cdcatalog.xml
If you download them in Firefox,
please use the All Files option.
Week 13: Mon, Apr 13 (day36.txt)
We continued our discussion
from last week about XML.
We turn our attention to XPath,
which is discussed in Chapter 6
of the book
Learning XML, 2nd Edition.
We revised our example from Friday
(see Friday's notes)
and studied another example using XPath.
The new file that we used is:
bookexample.xml
If you download it in Firefox,
please use the All Files option.
Wed, Apr 15 (day37.txt)
Today we spent about 20 minutes
discussing the course, including
the positive aspects, as well as
suggestions for improvements.
We also did one comprehensive
example using R, XML, XPath,
and pattern matching. The
example is given in the
transcript of today's class notes.
Using only this short code,
we can parse the election data
directly from the New York Times.
Fri, Apr 17 (day38part1.txt and day38part2.txt)
At the request of the class
we will do an introduction
to Perl, starting today.
We will follow Learning Perl,
Fourth Edition
by Randal L. Schwartz, Tom Phoenix,
and brian d foy (O'Reilly, 2005).
We discussed integers, strings,
basic input and output,
arrays, lists, quoting,
and manipulating arrays.
Week 14: Mon, Apr 20 (day39part1.txt and day39part2.txt)
We continued to quickly
cover the topics in
Learning Perl.
We discussed foreach loops,
input/output, and gave an
introduction to hashes,
including a comprehensive
example of how to use hashes
to index the words in War and Peace.
Here are some example files we used today:
daysoftheweek.txt
pizza.txt
school.txt
and also 2600.txt
which is the text of War and Peace,
available from Project Gutenberg
Wed, Apr 22 (day40.txt)
We did an example of Perl
that combines several of the
concepts discussed so far.
We extracted information about
graduate students directly from
the web, and built a phonebook
using a hash table.
Fri, Apr 24 (day41part1.txt and day41part2.txt)
We discussed how to make
MySQL calls from inside R.
Our example is contained in
the file day41part1.txt.

Note: Calling our MySQL server from R
will only work on campus. It can also
work from your home/apartment if you
establish a VPN connection, which is easy
to do. Please ask for help if you
want to do this and get stuck.

We also continued some topics from
Learning Perl.
We discussed subroutines,
a debugging tip,
a tip for using die in case
a file cannot be opened,
and how to create strings on the fly.
Week 15: Mon, Apr 27 (day42part1.txt and day42part2.txt)
We discussed some of the
more advanced features of
pattern matching in Perl.
We also discussed using
localtime and how to make
backup copies of files
when performing modifications.
Wed, Apr 29 (day43.txt)
We took the class picture today.

We did a comprehensive example
of how to use data from R
to build some MySQL tables.
We used some of the data from the
Carnegie Classification.
We were able to easily determine
all of the universities in Indiana.

Note: Calling our MySQL server from R
will only work on campus. It can also
work from your home/apartment if you
establish a VPN connection, which is easy
to do. Please ask for help if you
want to do this and get stuck.
Fri, May 1
No lecture today.
Students are working on final projects.
Final project due date: Thursday, May 7, at 10 AM
(we do not have a final exam, but the registrar reserved us a final exam time from 8 AM to 10 AM that day)