We will use daily high temperatures to demonstrate Analysis of Variance (ANOVA). The high temperature data are available on the course website. A second data set used for this purpose is the iris data, which ships with R: data("iris") loads it, and its R help page describes it. In this document, we will use Sepal.Length as our measure of interest for the iris data.

hightemp<-read.table("http://www.stat.purdue.edu/~lingsong/teaching/2017fall/data/hightemp.txt", header=TRUE)
head(hightemp)
##   High Month
## 1   42   Jan
## 2   41   Jan
## 3   27   Jan
## 4   19   Jan
## 5   17   Jan
## 6   18   Jan
data("iris")

We can see whether the data is balanced or not by checking the frequencies of the class labels.

table(iris$Species)
##
##     setosa versicolor  virginica
##         50         50         50
table(hightemp$Month)
##
## Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep
##  30  31  31  29  31  31  30  31  31  30  31  30

It is straightforward to see that iris is balanced (50 observations per species), while hightemp is not: the monthly counts range from 29 to 31.
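The same balance check can also be done programmatically rather than by eye. The sketch below is only an illustration: `is_balanced` is a hypothetical helper name, not something from the course materials, and the monthly counts are typed in from the `table(hightemp$Month)` output above so that the example runs without downloading the file.

```r
# Hypothetical helper (not from the original notes): a design is balanced
# when all group counts are identical.
is_balanced <- function(counts) {
  length(unique(as.vector(counts))) == 1
}

data("iris")
iris_balanced <- is_balanced(table(iris$Species))

# Monthly counts copied from the table(hightemp$Month) output above
month_counts <- c(Apr = 30, Aug = 31, Dec = 31, Feb = 29, Jan = 31, Jul = 31,
                  Jun = 30, Mar = 31, May = 31, Nov = 30, Oct = 31, Sep = 30)
temp_balanced <- is_balanced(month_counts)

iris_balanced        # TRUE: every species has 50 observations
temp_balanced        # FALSE: the monthly counts differ
range(month_counts)  # smallest and largest group sizes
```

With the full data frame loaded, `is_balanced(table(hightemp$Month))` would give the same answer as the hard-coded counts.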

We will use comparative boxplots to get a visual impression of how the groups compare.

boxplot(hightemp$High~hightemp$Month, xlab='Month', main='Comparative Boxplot of High Temperatures')

boxplot(iris$Sepal.Length~iris$Species, xlab='Species', main='Comparative Boxplot of Sepal Length by different Species')
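As a numerical companion to the boxplots, the quantities each box displays can be tabulated by group. This is a minimal sketch on the iris data only. The object names `group_summary` and `group_sd` are chosen here for illustration and are not part of the original notes.

```r
data("iris")

# Split sepal length by species, then summarise each group.
# sapply() returns one column per species; t() puts species in rows.
by_species    <- split(iris$Sepal.Length, iris$Species)
group_summary <- t(sapply(by_species, summary))
group_summary

# Group standard deviations, useful for judging whether spreads are similar
group_sd <- tapply(iris$Sepal.Length, iris$Species, sd)
group_sd
```

The same pattern would apply to the temperature data by splitting `hightemp$High` on `hightemp$Month`.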

For the high temperature data, we want to check the normality of the deviations of each observation from its group mean ($X_{ij}-\overline{X}_{i\cdot}$). The following code calculates the deviation for each observation.

uniqueclass<-unique(hightemp$Month)
hightemp$groupmean<-NA
hightemp$grandmean<-mean(hightemp$High)
for (imonth in uniqueclass){
  tempmean<-mean(hightemp$High[hightemp$Month == imonth])
  hightemp$groupmean[hightemp$Month == imonth]<-tempmean
}
hightemp$Deviation<-hightemp$High-hightemp$groupmean

Now we can check the normality of the deviations.

qqnorm(hightemp$Deviation, main='Normal Q-Q Plot for Deviations of High Temperature Data')
qqline(hightemp$Deviation, probs=c(.25, .75), lty=2, col=2, lwd=2)
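The group-mean loop can also be written without an explicit loop. The sketch below uses base R's `ave()`, which returns each row's group mean, and applies it to the iris data so that it runs without the downloaded file. This is an alternative formulation offered for illustration, not the approach used in the notes. The variable names `group_mean`, `deviation`, and `fit` are chosen here.

```r
data("iris")

# ave() returns, for every row, the mean of that row's group
# (the default summary function is mean)
group_mean <- ave(iris$Sepal.Length, iris$Species)
deviation  <- iris$Sepal.Length - group_mean

# Cross-check: these deviations coincide with the residuals of a
# one-way model with Species as the only factor
fit <- lm(Sepal.Length ~ Species, data = iris)
all.equal(deviation, unname(residuals(fit)))

# Same normality check as for the temperature data
qqnorm(deviation, main = "Normal Q-Q Plot for Deviations of Sepal Length")
qqline(deviation, probs = c(.25, .75), lty = 2, col = 2, lwd = 2)
```

For the temperature data the equivalent one-liner would be `hightemp$High - ave(hightemp$High, hightemp$Month)`.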