Factor Analysis

We used the testscores data in the principal component analysis lab; in this lab, the same data will be used to illustrate factor analysis. First, let's take a look at the data:
testscores
Now do a principal factor analysis with 2 factors; the principal factor method is the default:
test.fa <- factanal(testscores, factors=2)
test.fa
summary(test.fa)
How much of the total variance of the data can be explained by the two factors? What are the two factors? Now, let's calculate the communalities.
comm <- 1 - test.fa$uniquenesses
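As a side note, the communality of a variable also equals the row sum of its squared loadings. A quick check on a toy loadings matrix (made-up illustrative values, not the testscores fit):

```r
# Toy 4-variable, 2-factor loading matrix (illustrative values only)
L <- matrix(c(0.8, 0.7, 0.2, 0.1,
              0.1, 0.2, 0.7, 0.8), ncol = 2)
comm.toy <- rowSums(L^2)   # communality: variance explained by the common factors
uniq.toy <- 1 - comm.toy   # uniqueness: the specific variance left over
cbind(communality = comm.toy, uniqueness = uniq.toy)
```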
Which variable is best explained by the factors in terms of variance? Now let's have a look at the estimated model LL' + Psi:
lmat <- test.fa$loadings
psi <- diag(test.fa$uniquenesses)
emodel <- lmat %*% t(lmat) + psi
emodel
Did factanal use the covariance matrix or the correlation matrix? Now let's calculate the residual matrix:
testcor <- cor(testscores)
resmat <- testcor - emodel
resmat
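One way to answer the covariance-versus-correlation question in general: if the fitted LL' + Psi has (approximately) ones on its diagonal, the correlation matrix was factored. A sketch on simulated data (not testscores; this uses R's factanal, which fits by maximum likelihood):

```r
set.seed(42)
# Simulate 100 cases on 5 variables from a known 2-factor structure
fsim <- matrix(rnorm(100 * 2), 100, 2)
Ltrue <- matrix(c(0.8, 0.7, 0.6, 0.1, 0.2,
                  0.1, 0.2, 0.3, 0.8, 0.7), 5, 2)
xsim <- fsim %*% t(Ltrue) + matrix(rnorm(100 * 5, sd = 0.5), 100, 5)
fit <- factanal(xsim, factors = 2)
Lhat <- unclass(fit$loadings)
fitted.R <- Lhat %*% t(Lhat) + diag(fit$uniquenesses)
diag(fitted.R)   # close to 1: the correlation matrix was factored
```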

Now use the maximum likelihood method. Since the maximum likelihood approach assumes normal distributions, there is a formal test for the number of factors (pages 537-538). The test statistic follows a chi-square distribution with ((p-m)^2-(p+m))/2 degrees of freedom. Since the degrees of freedom must be positive, the number of factors m is restricted. In the testscores data, p = 5; thus, m should be less than or equal to 2.
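A quick check of this degrees-of-freedom restriction (plain arithmetic, runs in both S-PLUS and R; the function name fa.df is just an illustrative choice):

```r
# Degrees of freedom of the chi-square test for an m-factor model on p variables
fa.df <- function(p, m) ((p - m)^2 - (p + m)) / 2
fa.df(5, 1)   # 5  -> a 1-factor model can be tested
fa.df(5, 2)   # 1  -> a 2-factor model can be tested
fa.df(5, 3)   # -2 -> negative degrees of freedom, so m = 3 is ruled out
```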

test.mle <- factanal(testscores, factors=1, method="mle")
test.mle
test.mle2 <- factanal(testscores, factors=2, method="mle")
test.mle2
In the above two tests, the large p-values indicate that we could actually use just a one-factor model. But considering that there are only 25 observations in the data, the formal test could be misleading. Also, for the purpose of comparison, we use the two-factor model.
summary(test.mle2)
comm2 <- 1 - test.mle2$uniquenesses
comm2
lmat2 <- test.mle2$loadings
lmat2
psi2 <- diag(test.mle2$uniquenesses)
psi2
emodel2 <- lmat2 %*% t(lmat2) + psi2
emodel2
resmat2 <- testcor - emodel2
resmat2
Compare the solutions from the principal factor method and the maximum likelihood method.

Let's see how the factors might be rotated for the testscores data. By default, factanal() does some rotation, so let's first see what happens when no rotation is done.

test.mle <- factanal(testscores, factors=2, method="mle", rotation="none")
lmat <- loadings(test.mle)
The loadings are saved for future reference.
testnames <- c("diffgeom", "complex", "algebra", "reals", "statistics")
win.graph()
par(mfrow=c(2,2))
plot(lmat[,1],lmat[,2], type="n", xlab="Factor 1", ylab="Factor 2", xlim=c(0,1),ylim=c(0,1))
text(lmat[,1],lmat[,2], testnames)
abline(v=0,lty=2)
abline(h=0,lty=2)
From the graph, reals, statistics, and algebra form a cluster, while diffgeom and complex are distant from the cluster. Do you think that a graphical rotation can separate the clusters further? Why? Now let's try the analytical method, called varimax, covered in the lecture.
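To see what varimax does in isolation, here is a sketch on a toy loadings matrix using the standalone varimax() function (available in R; the values are made up, not the testscores loadings). Being an orthogonal rotation, it leaves the communalities unchanged:

```r
# Toy unrotated loadings (illustrative values only)
Ltoy <- matrix(c(0.6, 0.7, 0.5, 0.6,
                 0.5, 0.4, -0.5, -0.4), ncol = 2)
vtoy <- varimax(Ltoy)              # rotate to push loadings toward large or near zero
vtoy$loadings                      # rotated loadings
rowSums(Ltoy^2)                    # communalities before rotation
rowSums(unclass(vtoy$loadings)^2)  # ... and after: identical
```

Now back to the testscores data: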
test.mler <- factanal(testscores, factors=2, method="mle", rotation="varimax")
lmatr <- loadings(test.mler)
plot(lmatr[,1],lmatr[,2], type="n", xlab="Factor 1", ylab="Factor 2", xlim=c(0,1),ylim=c(0,1))
text(lmatr[,1],lmatr[,2], testnames)
abline(v=0,lty=2)
abline(h=0,lty=2)
Comparing the two plots, it appears that the rotation does not improve the interpretability much. How about trying a nonorthogonal rotation:
test.mleo <- factanal(testscores, factors=2, method="mle", rotation="oblimin")
lmato <- loadings(test.mleo)
plot(lmato[,1],lmato[,2], type="n", xlab="Factor 1", ylab="Factor 2", xlim=c(0,1),ylim=c(-0.5,0.5))
text(lmato[,1],lmato[,2], testnames)
abline(v=0,lty=2)
abline(h=0,lty=2)
Now it appears that the separation between the two groups has improved a little. The improvement comes at the cost of the orthogonality of the factors. Let's check how the factors are correlated.
summary(test.mleo)
Notice that the correlation between factor 1 and factor 2 in test.mleo is not very large.

Now, let's calculate the scores for the original data and predict scores for some new cases.

test.mle <- factanal(testscores, factors=2)
test.mle$scores
or,
predict(test.mle)
Suppose we have three new cases, (22, 50, 70, 54, 30), (22, 46, 38, 52, 62), and (42, 49, 70, 42, 50), and we want to calculate the scores for them.
newmat <- matrix(0, ncol=5, nrow=3)
newmat[1,] <- c(22, 50, 70, 54, 30)
newmat[2,] <- c(22, 46, 38, 52, 62)
newmat[3,] <- c(42, 49, 70, 42, 50)
predict(test.mle, newdata=newmat)
Notice that the scores were derived by the regression method, which is the S-PLUS default. Suppose we want to get the scores based on weighted least squares.
test.mle <- factanal(testscores, factors=2, type="weighted.ls")
predict(test.mle)
and
predict(test.mle, newdata=newmat, type="weighted.ls")
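The two scoring rules can also be written out explicitly: regression scores are z R^{-1} L, and weighted-least-squares (Bartlett) scores are z Psi^{-1} L (L' Psi^{-1} L)^{-1}, where z is the standardized data, L the loadings, and Psi the diagonal matrix of uniquenesses. A hand-computed sketch on simulated data (not testscores; this uses R's factanal, and the variable names are illustrative):

```r
set.seed(1)
# Simulate 100 cases from a 2-factor model on 5 variables
fs <- matrix(rnorm(100 * 2), 100, 2)
Ld <- matrix(c(0.8, 0.7, 0.6, 0.1, 0.2,
               0.1, 0.2, 0.3, 0.8, 0.7), 5, 2)
x <- fs %*% t(Ld) + matrix(rnorm(100 * 5, sd = 0.5), 100, 5)
fit <- factanal(x, factors = 2)
L <- unclass(fit$loadings)
Psi.inv <- diag(1 / fit$uniquenesses)
z <- scale(x)   # standardize, since the correlation matrix was factored
reg.scores <- z %*% solve(cor(x)) %*% L                               # regression
wls.scores <- z %*% Psi.inv %*% L %*% solve(t(L) %*% Psi.inv %*% L)   # weighted LS
```

The two sets of scores are highly correlated factor by factor; the regression scores are a shrunken version of the weighted-least-squares scores.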