This page contains the R code for STAT 51100 Section 010. The data sets are available at http://www.stat.purdue.edu/~lingsong/teaching/2017fall/data/. We will use the class data to illustrate the material in Chapter 1.

The following command reads the data directly from the web. You can also download the files to your local drive and run R on the local copies.

classdata <- read.table('http://www.stat.purdue.edu/~lingsong/teaching/2017fall/data/class.txt', header=TRUE, as.is=TRUE)
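
If you prefer to work from a local copy, read.table also accepts a file path. The following is only a sketch: it writes three rows of the class data to a temporary file, which stands in for a downloaded class.txt (assuming the file is whitespace-separated with a header line), and reads it back. The names localfile and localdata are made up for this illustration.

```r
# Stand-in for a downloaded copy of class.txt (assumed layout:
# whitespace-separated columns with a header line).
localfile <- tempfile(fileext = ".txt")
writeLines(c("name sex age height weight",
             "Alice F 13 56.5 84.0",
             "Becka F 13 65.3 98.0",
             "Alfred M 14 69.0 112.5"),
           localfile)

# header = TRUE: the first line holds the variable names.
# as.is = TRUE: character columns stay character instead of becoming factors.
localdata <- read.table(localfile, header = TRUE, as.is = TRUE)
str(localdata)
```

With a real download, replace localfile by the path to the saved file.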

You can display the class data set by typing its name:

classdata
##       name sex age height weight
## 1    Alice   F  13   56.5   84.0
## 2    Becka   F  13   65.3   98.0
## 3     Gail   F  14   64.3   90.0
## 4    Karen   F  12   56.3   77.0
## 5    Kathy   F  12   59.8   84.5
## 6     Mary   F  15   66.5  112.0
## 7    Sandy   F  11   51.3   50.5
## 8   Sharon   F  15   62.5  112.5
## 9    Tammy   F  14   62.8  102.5
## 10  Alfred   M  14   69.0  112.5
## 11    Duke   M  14   63.5  102.5
## 12   Guido   M  15   67.0  133.0
## 13   James   M  12   57.3   83.0
## 14 Jeffrey   M  13   62.5   84.0
## 15    John   M  12   59.0   99.5
## 16  Philip   M  16   72.0  150.0
## 17  Robert   M  12   64.8  128.0
## 18  Thomas   M  11   57.5   85.0
## 19 William   M  15   66.5  112.0

When a data set is large, we may want to look at only the first few rows. The head function is useful for this:

head(classdata)
##    name sex age height weight
## 1 Alice   F  13   56.5   84.0
## 2 Becka   F  13   65.3   98.0
## 3  Gail   F  14   64.3   90.0
## 4 Karen   F  12   56.3   77.0
## 5 Kathy   F  12   59.8   84.5
## 6  Mary   F  15   66.5  112.0
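
By default head returns six rows. As a small sketch, the argument n changes that number, and tail does the same for the last rows. To keep the example self-contained, the data frame is rebuilt here from the values printed above rather than read from the web.

```r
# Class data typed in from the listing above.
classdata <- data.frame(
  name = c("Alice", "Becka", "Gail", "Karen", "Kathy", "Mary", "Sandy",
           "Sharon", "Tammy", "Alfred", "Duke", "Guido", "James", "Jeffrey",
           "John", "Philip", "Robert", "Thomas", "William"),
  sex = c(rep("F", 9), rep("M", 10)),
  age = c(13, 13, 14, 12, 12, 15, 11, 15, 14, 14,
          14, 15, 12, 13, 12, 16, 12, 11, 15),
  height = c(56.5, 65.3, 64.3, 56.3, 59.8, 66.5, 51.3, 62.5, 62.8, 69.0,
             63.5, 67.0, 57.3, 62.5, 59.0, 72.0, 64.8, 57.5, 66.5),
  weight = c(84.0, 98.0, 90.0, 77.0, 84.5, 112.0, 50.5, 112.5, 102.5, 112.5,
             102.5, 133.0, 83.0, 84.0, 99.5, 150.0, 128.0, 85.0, 112.0),
  stringsAsFactors = FALSE
)

first3 <- head(classdata, n = 3)   # first three rows
last3 <- tail(classdata, n = 3)    # last three rows
first3
last3
dim(classdata)                     # number of rows and columns
```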

Another important R function is help, which shows how to use a function. For example,

help(head)

displays the documentation for the head function.

Now we turn to descriptive statistics and data visualization. The class data contains five variables: name, sex, age, height, and weight. Here name is the identifier and sex is a categorical variable; the other three variables are continuous. We will draw a bar chart for sex and a histogram for height. We can also generate stem-and-leaf displays in R.

The following commands generate stem-and-leaf displays for the variables weight and height:

stem(classdata$weight)
## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##    4 | 1
##    6 | 7
##    8 | 3445508
##   10 | 0332233
##   12 | 83
##   14 | 0
stem(classdata$height)
## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   5 | 1
##   5 | 67789
##   6 | 033344
##   6 | 557779
##   7 | 2

The following generates a histogram for height. With freq=FALSE, the vertical axis shows densities rather than counts.

hist(classdata$height, freq=FALSE, xlab='Height', main='Histogram of Height')

You can specify the intervals with the breaks option. The simplest way is to give the number of intervals; R treats this number as a suggestion and may use a nearby number so that the break points are round values. Alternatively, you can give the break points explicitly as a vector.
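
As a sketch of both ways of using breaks, the example below types in the heights from the listing above so that it runs on its own. The names h1 and h2 are chosen for this illustration; hist returns its result invisibly, so storing it lets us inspect the break points and counts.

```r
# Heights from the class data printed above.
height <- c(56.5, 65.3, 64.3, 56.3, 59.8, 66.5, 51.3, 62.5, 62.8, 69.0,
            63.5, 67.0, 57.3, 62.5, 59.0, 72.0, 64.8, 57.5, 66.5)

# Suggest about 10 intervals; R may choose a different number
# so that the break points are round values.
h1 <- hist(height, breaks = 10, freq = FALSE, xlab = 'Height',
           main = 'Histogram of Height (about 10 intervals)')
h1$breaks   # the break points R actually used

# Give the break points explicitly: intervals of width 5 from 50 to 75.
h2 <- hist(height, breaks = seq(50, 75, by = 5), freq = FALSE, xlab = 'Height',
           main = 'Histogram of Height (intervals of width 5)')
h2$counts   # number of observations in each interval
```

When the break points are given explicitly, they must cover the whole range of the data; otherwise hist stops with an error.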

The histogram function in the lattice package can also draw a histogram:

library(lattice)
histogram(classdata$height, xlab='Height', main='Histogram of Height')

We can also draw a bar chart for sex. Before doing so, we use the table function to obtain a frequency table.

table(classdata$sex)
## 
##  F  M 
##  9 10
barplot(table(classdata$sex), xlab='Sex', ylab='Frequency')
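
If relative frequencies are preferred to counts, one option is prop.table, which divides each count by the total. This is a small sketch; the sex vector is typed in from the listing above (9 F followed by 10 M), and the names counts and props are made up for the illustration.

```r
# Sex variable from the class data printed above.
sex <- c(rep('F', 9), rep('M', 10))

counts <- table(sex)          # frequency table
props <- prop.table(counts)   # each count divided by the total, 19
round(props, 3)

barplot(props, xlab='Sex', ylab='Relative frequency')
```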

The barchart function in the lattice package can also draw a bar chart:

barchart(classdata$sex, horizontal=FALSE, xlab='Sex', ylab='Frequency')

Several functions return descriptive statistics:

Mean

mean(classdata$height)
## [1] 62.33684

Median

median(classdata$height)
## [1] 62.8

Trimmed Mean

mean(classdata$height, trim=0.1)
## [1] 62.41765
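
To see what trim=0.1 does, the sketch below repeats the computation by hand. R drops floor(n * trim) observations from each end of the sorted data and averages the rest; with n = 19 that is one observation from each end. The names k, trimmed, and manual are made up for the illustration.

```r
# Heights from the class data printed above.
height <- c(56.5, 65.3, 64.3, 56.3, 59.8, 66.5, 51.3, 62.5, 62.8, 69.0,
            63.5, 67.0, 57.3, 62.5, 59.0, 72.0, 64.8, 57.5, 66.5)

n <- length(height)
k <- floor(n * 0.1)                        # observations dropped from each end
trimmed <- sort(height)[(k + 1):(n - k)]   # remove the k smallest and k largest
manual <- mean(trimmed)

manual
mean(height, trim = 0.1)                   # same value
```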

First quartile, median, and third quartile

quantile(classdata$height, c(.25, .5, .75))
##   25%   50%   75% 
## 58.25 62.80 65.90
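
By default (type = 7), quantile interpolates linearly between order statistics, so its quartiles can differ slightly from those given by other definitions, such as a textbook rule. The sketch below reproduces the first quartile by hand and also computes the interquartile range with IQR. The names x, pos, lo, hi, and q1 are made up for the illustration.

```r
# Heights from the class data printed above.
height <- c(56.5, 65.3, 64.3, 56.3, 59.8, 66.5, 51.3, 62.5, 62.8, 69.0,
            63.5, 67.0, 57.3, 62.5, 59.0, 72.0, 64.8, 57.5, 66.5)

x <- sort(height)
n <- length(x)
p <- 0.25

pos <- 1 + (n - 1) * p   # position in the sorted data (5.5 when n = 19)
lo <- floor(pos)
hi <- ceiling(pos)

# Interpolate between the two neighbouring order statistics.
q1 <- x[lo] + (pos - lo) * (x[hi] - x[lo])
q1

IQR(height)              # third quartile minus first quartile
```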

Variance

var(classdata$height)
## [1] 26.2869

Standard deviation

sd(classdata$height)
## [1] 5.127075
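
As a check on the definitions, the sample variance is the sum of squared deviations from the mean divided by n - 1, and the standard deviation is its square root. The sketch below computes both by hand; the names dev, s2, and s are made up for the illustration.

```r
# Heights from the class data printed above.
height <- c(56.5, 65.3, 64.3, 56.3, 59.8, 66.5, 51.3, 62.5, 62.8, 69.0,
            63.5, 67.0, 57.3, 62.5, 59.0, 72.0, 64.8, 57.5, 66.5)

n <- length(height)
dev <- height - mean(height)   # deviations from the sample mean
s2 <- sum(dev^2) / (n - 1)     # sample variance (divisor n - 1)
s <- sqrt(s2)                  # sample standard deviation

s2
s
```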

We can then draw a boxplot directly:

boxplot(classdata$height, main='Boxplot of Height')
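
Two possible extensions, given as a sketch: the formula interface draws side-by-side boxplots for the two sexes, and boxplot.stats returns the numbers behind the plot. By default, points more than 1.5 times the box length beyond the box are flagged as outliers. The vectors are typed in from the listing above, and the name bs is made up for the illustration.

```r
# Height and sex from the class data printed above (9 F followed by 10 M).
height <- c(56.5, 65.3, 64.3, 56.3, 59.8, 66.5, 51.3, 62.5, 62.8, 69.0,
            63.5, 67.0, 57.3, 62.5, 59.0, 72.0, 64.8, 57.5, 66.5)
sex <- c(rep('F', 9), rep('M', 10))

# One box per group: height on the left of ~, grouping variable on the right.
boxplot(height ~ sex, xlab='Sex', ylab='Height', main='Boxplot of Height by Sex')

# Numbers used to draw the boxplot of all heights.
bs <- boxplot.stats(height)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # observations flagged as outliers (none for these heights)
```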