>testscores_read.table("testscores.data")
>testscores
The data are the results of qualifying examinations for 25 graduate
students in mathematics. diffgeom refers to differential geometry and
reals is for real analysis. Note that the differential geometry and
complex analysis examinations were closed book, while the remaining
three exams were open book.
Let's get the covariance and correlation matrices of testscores first.
> mcov_var(testscores) > mcov > rcov_cor(testscores) > rcovNow, compare the eigenvalu-eigenvector pairs of mcov and rcov, where mcov is the covariance matrix and rcov is the correlation matrix.
> emcov_eigen(mcov) > emcov > ercov_eigen(rcov) > ercovFirst, we do principle component analysis based on the covariance matrix mcov. Compute the cumulative proportions
>cumsum(emcov$values) >sum(emcov$values) >cumsum(emcov$values)/sum(emcov$values)Can you conclude that the first two component get 90% of the variation?
The factor loadings refer to the coefficients in each principle component or the correlation between the component and the variables (check page 461 for details). Let's determine the first type of factor loadings, and keep only 3 digits after the decimal point.
>fl1_round(emcov$vectors,3)
>rownames_c("diffgeon", "complex", "algebra", "reals", "statistics")
>row.names(fl1)_rowname
>fl1
V1, V2, ..., V5 are the five principle components. Any interpretation
for the first two components?
The second type of factor loadings is more complicated, according to (8-8) in the text, fl1 need be multiplied by the square root of the corresponding eigenvalue and divided by the square root of its sample variance:
>temp1_diag(mcov) //the variance of each variable
>temp2_emcov$values // the eigenvalues
>temp_sqrt(temp2/temp1)
>fl2_matrix(0,nrow=5,ncol=5)
>for(i in 1:5) {fl2[,i]_fl1[,i]*temp[i]}
>row.names(fl2)_rownames
Compare fl2 and fl1 to see how the components are different?
Now, repeat the above procedure for the correlation matrix rcov.
>cumsum(ercov$values)/sum(ercov$values)
>rfl1_round(ercov$vectors,3)
>row.names(rfl1)_rownames
>rfl2_matrix(0,nrow=5,ncol=5)
>temp_sqrt(ercov$values) //for the second type, only need consider eigenvalues
>for(i in 1:5) {rfl2[,i]_rfl1[,i]*temp[i]}
>row.names(rfl2)_rownames
please compare the results with using the covariance matrix mcov.
Now, we use scree plot to check the relative importance of the principle components.
>win.graph() >par(mfrow=c(2,2)) >plot(emcov$values, type="b",xlab="ev no.",ylab="eigenvalue", main="scree plot based on mcov") >plot(ercov$values, type="b",xlab="ev no.",ylab="eigenvalue",main="scree plot based on rcov")So far, we use the function eigen to do the principle component analysis. As a matter of fact, splus has an independent function called princomp() to perform principle component analysis.
>testscores.prc_princomp(testscores) >testscoresR.prc_princomp(testscores, cor=T) >summary(testscores.prc) >summary(testscoresR.prc) >summary(testscores.prc, loadings=T) >summary(testscoresR.prc, loadings=T)what is the difference between testscores.prc and testscoresR.prc? Compare the output with the results we derived from the eigen functions. Which type of loadings does princomp use?
Create screeplots as follows
>plot(testscores.prc) >plot(testscoresR.prc)By default, the scree plot in splus is histogram, but you change it to be a line graph.
>plot(testscores.prc, style="lines") >plot(testscoresR.prc, style="lines")The images of the original data under the principal components transformation are referred to as principle component scores. If the first two components are the most important, we can use the scores of these two components to evaluate the students performance in the qualifying examination instead of using the original five scores. princomp provides the principle component scores automatically.
>scores1_testscores.prc$scores >scores1 >scores2_testscoresR.prc$scores >scores2Now, let's do a little experimentation
>y_rep(0,25)
>for(i in 1:25) {y[i]_sum(testscores[i,])}
>y
>x_scores1[,1]+scores1[,2]
>plot(x,y)
what x and y are? what does the plot tell you in this data.