Discrimination and Classification

First we need to install a library that contains some functions we will use later. The library, called MASS, is associated with the book Modern Applied Statistics with S-PLUS by Venables and Ripley. First, open Splus 3.4, then go to
http://franz.stat.wisc.edu/pub/MASS3/Windows.shtml
Download VRlib62.exe to the directory c:\splus4\library. Second, open an MS-DOS prompt, go to the above directory, and execute VRlib62.exe by simply typing
vrlib62
This sets up the temporary MASS library. Now, in the S-PLUS command window, you can use the following commands to make various functions available:
library(MASS, first=T)
library(nnet)
library(spatial)
library(class)
For discriminant analysis, we need the first command. Because the splus4 directory is temporary, the library will be lost after you log out, so you need to re-install it the next time you log in.

To check whether you have really set up the library properly, go to the S-PLUS command window and try out the following commands.

help(lda)   # you will get an error message
library(MASS, first=T)
help(lda)

We will use the iris data, a famous data set used by Fisher to illustrate his ideas on discriminant analysis. We used the iris data earlier in principal component analysis. (If you already have an object named iris, you need to go to _Data to remove it.)

help(iris)
iris
irisdat_rbind(iris[,,1],iris[,,2],iris[,,3])
irisdat
spe_(c(rep("s",50),rep("c",50), rep("v",50)))
spe
We have combined the original data into a two-dimensional matrix: the first 50 rows are from Setosa, denoted by s; the next 50 are from Versicolor, denoted by c; and the last 50 are from Virginica, denoted by v.
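If you want to confirm that the matrix and the labels line up, you can check the dimensions and the label counts:
dim(irisdat)   # should be 150 by 4
table(spe)     # 50 observations of each label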
x1_irisdat[,1]
x2_irisdat[,2]
x3_irisdat[,3]
x4_irisdat[,4]
We can use x2 and x4 to take a look at the data.
win.graph()
plot(x2, x4, type="n", xlab="x2", ylab="x4")
text(x2, x4, spe)
Any comments about the plot? Do the groups appear to come from bivariate normal distributions? It seems that species s is quite different from the other two, c and v. Now let's focus on c and v.
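The same kind of plot can be drawn for any other pair of variables; for example, to look at x1 and x3 instead:
win.graph()
plot(x1, x3, type="n", xlab="x1", ylab="x3")
text(x1, x3, spe)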
meanc_apply(irisdat[51:100,1:4], 2, mean)
meanv_apply(irisdat[101:150,1:4],2, mean)
meanc
meanv
varc_var(irisdat[51:100,1:4])
varv_var(irisdat[101:150,1:4])
varc
varv
Now the pooled variance:
varpool_(49*varc+49*varv)/(100-2)
varpool
We assume the prior probabilities are equal, and so are the costs of misclassification. Under these assumptions the rule classifies an observation x into c when a'x >= a'(meanc+meanv)/2, where a is the inverse of varpool multiplied by (meanc-meanv). Let's calculate the coefficients a of the linear discriminant (classification) function.
a_solve(varpool)%*% (meanc-meanv)
a
And the second term in the linear discriminant function:
crit_sum(a*(meanc+meanv)/2)
crit
Check whether the derived discriminant function classifies the observations in c and v correctly. T indicates that an observation is classified as coming from c, and F indicates that it is classified as coming from v.
clas_irisdat[51:150,] %*% a >= crit 
t(clas)
We know that observations 51-100 are from c and observations 101-150 are from v, so clas[1:50] corresponds to the c observations and clas[51:100] to the v observations. Now let's count the misclassifications: nctov represents the number of observations misclassified from c to v, and nvtoc represents the opposite.
nctov_0
nvtoc_0
for(i in 1:50) {
if(clas[i]==F) nctov_nctov+1
}
for(i in 51:100){
if(clas[i]==T) nvtoc_nvtoc+1
}
nctov
nvtoc
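The same counts can also be obtained without a loop, working on the clas vector directly:
sum(clas[1:50]==F)    # same as nctov
sum(clas[51:100]==T)  # same as nvtoc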
Hence the apparent error rate is (nctov+nvtoc)/100=3%. The true error rate is likely to be higher.
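One way to get a less optimistic estimate is leave-one-out cross-validation. If your version of the MASS library supports the CV argument to lda (the lines below are a sketch; cvfit is just a working name), you can compare the cross-validated classifications with the true labels:
cvfit_lda(irisdat[51:150,1:4], factor(spe[51:150]), CV=T)
table(spe[51:150], cvfit$class)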

In the following, we will use another example to illustrate linear discriminant analysis. We will need the ellipse function, so you need to source ellipse.S to set up the function once more.
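The exact contents of ellipse.S are not reproduced here. If you do not have the file, the following is a minimal sketch of such a function; it assumes ellipse(var, cent) should return points on the 95% contour (by default) of a bivariate normal with covariance matrix var and centre cent, which is how it is called below:

ellipse <- function(var, cent = c(0, 0), conf = 0.95, np = 100)
{
        # points on the conf-level contour of a bivariate normal
        # with covariance matrix var, centred at cent
        theta <- seq(0, 2 * pi, length = np)
        r <- sqrt(qchisq(conf, 2))
        circ <- r * cbind(cos(theta), sin(theta))
        t(cent + t(circ %*% chol(var)))
}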

Margolese (1970) studied hormone ratios in 15 male heterosexuals and 11 male homosexuals.

Read in the data and examine.

hormone_read.table("hormone.data",header=T)
hormone
Compute the means and covariance matrices for the two groups (the first 11 rows and the remaining 15 rows).

means_apply(hormone[1:11,-3],2,mean)
meang_apply(hormone[12:26,-3],2,mean)
vars_var(hormone[1:11,-3])
varg_var(hormone[12:26,-3])
Now plot the data with 95% confidence ellipses:

plot(hormone$a,hormone$e,xlab="Androgen",ylab="Estrogen",type="n",xlim=c(0,6),ylim=c(0,6))
text(hormone$a,hormone$e,as.character(hormone$o))
lines(ellipse(vars,cent=means),lty=2)
lines(ellipse(varg,cent=meang),lty=2)
Suppose we assume a common covariance matrix and redo the plot. The pooled estimate weights the two sample covariance matrices by 11-1=10 and 15-1=14 and divides by 26-2=24.

varpool_(10*vars+14*varg)/24
plot(hormone$a,hormone$e,xlab="Androgen",ylab="Estrogen",type="n",xlim=c(0,6),ylim=c(0,6))
text(hormone$a,hormone$e,as.character(hormone$o))
lines(ellipse(varpool,cent=means),lty=2)
lines(ellipse(varpool,cent=meang),lty=2)
Add the two group means to the plot as points.

points(means[1],means[2])
points(meang[1],meang[2])
Now compute the LDA rule.

a_solve(varpool) %*% (means-meang)
crit_sum(a*(means+meang)/2)
a
The decision line - can you see where the numbers come from?

abline(-4.063956/3.588248,4.534452/3.588248)
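Equivalently, the line can be drawn directly from a and crit: the boundary is a[1]*Androgen + a[2]*Estrogen = crit, so solving for Estrogen gives intercept crit/a[2] and slope -a[1]/a[2]. That is where the numbers above come from.
abline(crit/a[2], -a[1]/a[2])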
What's the apparent error rate?
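As a sketch of how you might check it, follow the same steps as for the iris data (clas2 is just a working name; T means an observation is classified into the group whose mean is means, i.e. the first 11 rows):
clas2_as.matrix(hormone[,-3]) %*% a >= crit
t(clas2)
sum(clas2[1:11]==F) + sum(clas2[12:26]==T)   # total number misclassified
Divide the total by 26 to get the apparent error rate.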

(To be continued)