Multivariate Normal Distribution

In this lab, we will explore the multivariate normal distribution and samples thereof. First of all, we need install some tools for later use. Open MS-DOS window, change directory and make a new subdirectory called _Help as follows (I assume that you have already set up your h:\stat524\_Data directory in the previous lab)

cd h:\stat524\_Data
mkdir _Help

Open Netscape, and go to http://lib.stat.cmu.edu/S/ellipse, and save it to h:\stat524\ as a text file, you can call it ellipse.txt, which is an packed file. To unpack ellipse.txt, go to http://lib.stat.cmu.edu/DOS/S/SWin, find unshar.ext, click on it, and save it as h:\stat524\unshar.exe. Then, go back to MS-DOS widow, make sure that you are in h:\stat524\, and type

unshar ellipse.txt

The above procedure generates several files including the source code ellipse.s. Now, open Splus from standard software\statistical package. It again takes a couple of minutes to bring out Splus. In command window, enter

>attach("h:\\stat524\\_Data", pos=1)
>source('h:\\stat524\\ellipse.s')

Now you are ready to start today's lab. Note: ellipse could be set up in your local library or in the system Splus library, I still didn't quite sort it out yet. At this stage, we will just use some functions from this package. There are also quite a lot packages available in http://lib.stat.cmu.edu/S/, if you are interested, you can explore them. Now we need a function to calculate the bivariate normal density, please check the text or the lecture notes if you have forgot the form. You can copy and paste, or directly type, the function into your Splus Command Window:

>dbinorm_function(x, mu, sigma) {
det_sigma[1,1]*sigma[2,2]-sigma[1,2]*sigma[2,1]
a_t(x-mu)%*%solve(sigma)%*%(x-mu)
exp(-a/2)/(sqrt(det)*2*pi)
}

where t is the transposition operator, %*% is the multiplication between two matrices, solve() gives the inverse of a nonsingular square matrix. Now create an identity variance matrix and a zero mean vector:

>sigma_matrix(c(1,0,0,1),ncol=2)
>sigma
>mu_c(0,0)
>mu

Create the range of values for which we want to plot the density:

>x_seq(-3,3,len=20)
>y_seq(-3,3,len=20)

Create a matrix to store the density at points (x[i],y[j]):

>z_matrix(0,ncol=20,nrow=20)

Compute the density over for the points (x[i],y[j]):

for(i in 1:20) {
   for(j in 1:20){
    z[i,j]_dbinorm(c(x[i],y[j]),mu,sigma)
}
}

Plot the result in two ways:

>win.graph()
>persp(x,y,z)
>contour(x,y,z)

Let's construct a distribution with some correlation. The determinant of the variance matrix is chosen to be one so that the scale will be similar to the identity case.

>sigmac_matrix(c(sqrt(2),1,1,sqrt(2)),ncol=2)
>sigmac

Recompute the density:

>for(i in 1:20){
   for(j in 1:20){
    z[i,j]_dbinorm(c(x[i],y[j]),mu,sigmac)
}
}

Replot:

>persp(x,y,z)
>contour(x,y,z)


Now we generate a sample with identity variance matrix and plot:

>d_matrix(rnorm(200),ncol=2)
>plot(d[,1],d[,2])


Where rnorm(200) generates 200 independent observations from the standard
normal distribution. Now compute and draw the 50% and 90% ellipsoids of the
true density:

>lines(ellipse(sigma, centre=mu, level=0.50))
>lines(ellipse(sigma, centre=mu, level=0.90))


We can generate a sample in the correlated case in this manner:

>A_t(chol(sigmac))
>nd_d %*% A


and plot as before:

>plot(nd)
>lines(ellipse(sigmac,centre=mu,level=0.50))
>lines(ellipse(sigmac,centre=mu,level=0.90))


See what will happen when we use the estimates of mean and  variance rather
than the true values:

>mus_apply(nd,2,mean)
>mus
>sigmas_var(nd)
>sigmas
>lines(ellipse(sigmas,centre=mus,level=0.50),lty=2)
>lines(ellipse(sigmas,centre=mus,level=0.90),lty=2)


The second part of this lab is about checking multivariate normality. In a 
quality control test of microwave ovens, radiation readings were taken with
door open and with the door closed. Data for 42 ovens are available. First
save the data to your own directory for plain data sets, such as 
h:\stat524\Datasets\radiation.data, and then read in and examine the data:

>rad_read.table("h:\\stat524\\Datasets\\radiation.data",header=T)
>rad


Now plot the data:

>plot(rad[,1],rad[,2],xlab="Closed",ylab="Open")


Does it look normal? Now make the QQ plots:

>qqnorm(rad[,1])
>qqline(rad[,1])
>qqnorm(rad[,2])
>qqline(rad[,2])


Do these look normal? Now compute the mean and variance:

>radvar_var(rad)
>radvar
>radmean_apply(rad,2,mean)


Now make the chi-square plot, recall that the generalized distances 
are also called Mahalanobis distances:

>md_mahalanobis(rad,radmean,radvar)
>plot(sort(md),qchisq((1:42)/43,2))


where sort() sorts md from smallest to largest, and qchisq(,2) gives 
the quantiles of chi-square distribution with 2 degrees of freedom. Notice
that the quantiles given here are sightly different from those given in the
text. Now, does it look normal? Let's apply some transformation and redo
the plots

>srad_sqrt(rad)
>plot(srad)
>qqnorm(srad[,1])
>qqline(srad[,1])
>qqnorm(srad[,2])
>qqline(srad[,2])


And redo the chi-square plots:

>sradvar_var(srad)
>sradmean_apply(srad,2,mean)
>smd_mahalanobis(srad,sradmean,sradvar)
>plot(sort(smd),qchisq((1:42)/43,2))
>abline(0,1)


This looks better. We can plot the estimated 95th percentile contour for the
mean:

>plot(srad)
>lines(ellipse(sradvar,centre=sradmean, level=0.95),lty=2)


Finally, it is interesting to see how the region look when we transform back
to the original scale of measurement:

>plot(rad[,1],rad[,2], xlab="Closed", ylab="Open")
>square_function(x) x^2
>lines(square(ellipse(sradvar,centre=sradmean,level=0.95)),lty=2)
>lines(square(ellipse(sradvar,centre=sradmean,t=squrt(qchisq(0.95,2)/42))))


Now, let us determine what is the best Box-Cox transformation for data
that does not show normality. We will use the radiation data with door
close as an example.

>x_rad[,1]
>lambda_seq(-2,2,by=0.1)
>m_length(lambda)
>g_rep(0,m)


We calculate the values of l(lambda) at various lambda between -2 and
2, and save them in g. 

>for(i in 1:m) {
if(lambda[i]!=0) a_(n-1)*var((x^{lambda[i]}-1)/lambda[i])/n
if(lambda[i]==0) a_(n-1)*var(log(x))/n
b_sum(log(x))
g[i]_-n*log(a)/2+(lambda[i]-1)*b
}


Nest, plot lambda against g, choose the maximum as the transformation
for x.

>plot(lambda,g, type="l")


Approximately what lambda you should use in the Box-Cox
transformatoin? Use QQ plot to check the normality of the transformed
data.