Canonical Correlation Analysis

In this lab, we explore the canonical correlation analysis. First, we will use a splus data set called evap.x. To get the description of evap.x, just type in

help(evap.x)
evap.x

With the concern of computational stability, we transform some original variables to form a new data set denoted by evap.y as follows.

evap.y_evap.x
evap.y[,3]_evap.x[,3]/10
evap.y[,6]_evap.x[,6]/10
evap.y[,9]_evap.x[,9]/10
evap.y[,10]_evap.x[,10]/10

we partition the evap.y into two parts,

soil_evap.y[,1:3]
air_evap.y[,4:10]

where soil represents the soil conditions and air represents the air conditions. Thus, p=3, q=7. The sample covariance and correlation matrices are

evapcov_var(evap.y)
evapcor_cor(evap.y)
S11_var(soil)//S11_evapcov[1:3,1:3]
S12_var(soil, air)//S12_evapcov[1:3,4:10]
S21_t(S12)
S22_var(air)//S22_evapcov[4:10,4:10]
R11_cor(soil)//R11_evapcor[1:3,1:3]
R12_cor(soil,air)//R12_evapcor[1:3,4:10]
R21_t(R12)
R22_cor(air)// R22_evapcor[4:10,4:10]

Now, let's use the sample covariance matrix to perform canonical correlation analysis. We need to define a function to calculate the square root of a positive definite matrix.

msqrt_function(sigma){
sigmae_eigen(sigma, symmetric=TRUE)
temp_sigmae$vectors %*% sqrt(diag(sigmae$values)) %*% t(sigmae$vectors)
temp
}

Let's check some properties of msqrt

temp_msqrt(S11)
temp
temp %*% temp 
temp1_solve(msqrt(S11))
T11_solve(S11)
temp1 %*% temp1

Now let's calculate the canonical variables and the canonical correlations.

A_solve(msqrt(S11)) %*% S12 %*% solve(S22) %*% S21 %*% solve(msqrt(S11))
Aeigen_eigen(A, symmetric=TRUE)
B_solve(msqrt(S22)) %*% S21 %*% solve(S11) %*% S12 %*% solve(msqrt(S22))
Beigen_eigen(B,symmetric=TRUE)
Aeigen
Beigen

Since p=3, we have three pairs of canonical variables.

a1_solve(msqrt(S11)) %*% (-Aeigen$vectors[,1])
u1_soil %*% a1
b1_solve(msqrt(S22)) %*% (Beigen$vectors[,1])
v1_air  %*% b1
a2_solve(msqrt(S11))%*% (-Aeigen$vectors[,2])
u2_soil %*%  a2
b2_solve(msqrt(S22)) %*% (Beigen$vectors[,2])
v2_air %*% b2
a3_solve(msqrt(S11)) %*% (-Aeigen$vectors[,3])
u3_soil %*% a3
b3_solve(msqrt(S22)) %*% (Beigen$vectors[,3])
v3_air %*%  b3
u_cbind(u1,u2,u3)
v_cbind(v1,v2,v3)
uvcor_cor(u,v)

Verify that the canonical correlations are the square root of the eigenvalues of A (or the largest three eigenvalues of B). The negative signs for the eigenvectors of A is to make the canonical correlation to be positive.

Now let's derive canonical correlation and canonical variables based on the correlation matrix R.

Ar_solve(msqrt(R11)) %*% R12 %*% solve(R22) %*% R21 %*% solve(msqrt(R11))
Areigen_eigen(Ar, symmetric=TRUE)
Br_solve(msqrt(R22)) %*% R21 %*% solve(R11) %*% R12 %*% solve(msqrt(R22))
Breigen_eigen(Br,symmetric=TRUE)
Areigen
Breigen

Notice that Ar and A have the same eigenvalues, and Br and B have the same eigenvalues. Let's compare the canonical coefficients based on R with those based on S.

ar1_solve(msqrt(R11)) %*% (-Areigen$vectors[,1])
br1_solve(msqrt(R22)) %*% (Breigen$vectors[,1])
ar2_solve(msqrt(R11))%*% (-Areigen$vectors[,2])
br2_solve(msqrt(R22)) %*% (Breigen$vectors[,2])
ar3_solve(msqrt(R11)) %*% (-Areigen$vectors[,3])
br3_solve(msqrt(R22)) %*% (Breigen$vectors[,3])
ar1/a1
sqrt(diag(S11))
br1/b1
sqrt(diag(S22))

Next, let's carry out the likelihood ratio test of H0: sigma12=0 using Bartlett's correction version ((10.44) in the text).

p_3
q_7
n_46
tt_-(n-1-(p+q+1)/2)*log(prod(1-Aeigen$values))
chicrit_qchisq(0.95,p*q)

What is your conclusion?

Let's further understand what the canonical correlations are

u1.lm_lm(u1~air)
summary(u1.lm)
v1.lm_lm(v1~soil)
summary(v1.lm)

Note that lm(u1~air) is the regression of the first canonical variable against the original variables in the air group. lm(v1~soil) is similar. What are the multiple correlation coefficients? How are they related to the first canonical correlations?

We now consider two remaining issues. The first one is about the residual matrices if we want to use canonical variables to represent the original data. We will focus on the canonical variables based on the correlation matrix R. Since S22 have another 4 eigenvectors, they can form another 4 canonical variables for air.

br4_solve(msqrt(R22)) %*% (Breigen$vectors[,4])
br5_solve(msqrt(R22)) %*% (Breigen$vectors[,5])
br6_solve(msqrt(R22)) %*% (Breigen$vectors[,6])
br7_solve(msqrt(R22)) %*% (Breigen$vectors[,7])
AR_cbind(ar1,ar2,ar3)
BR_cbind(br1,br2,br3,br4,br5,br6,br7)
ARinv_solve(t(AR))
BRinv_solve(t(BR))

Let's verify the equations in (10-37) in the text

temp1_matrix(0,nrow=3,ncol=7)
for(i in 1:3){
temp1_temp1+sqrt(Areigen$values[i])*(ARinv[,i] %*% t(BRinv[,i]))
}
temp1
R12
temp2_matrix(0,nrow=3,ncol=3)
for(i in 1:3){
temp2_temp2+ARinv[,i]%*% t(ARinv[,i])
}
temp2
R11
temp3_matrix(0,nrow=7,ncol=7)
for(i in 1:7){
temp3_temp3+BRinv[,i] %*% t(BRinv[,i])
}
temp3
R22

Notice that temp2==R11, temp3==R22, but tmep1 and R12 are little off. I expected them to be identical. Rounding error is an easy explanation but I doubt it. I will really appreciate if you can provide help.

If we want to use the first two pairs of canonical variables to represent the original data, we can only approximate R11, R22, and R12. Let's calculate the residual matrices ( the equations in (10-39) in the text).

resR11_ARinv[,3] %*% t(ARinv[,3])
resR11
resR22_matrix(0,nrow=7,ncol=7)
for(i in 3:7){
resR22_resR22+BRinv[,i] %*% t(BRinv[,i])
}
resR22
resR12_sqrt(Areigen$values[3])*ARinv[,3] %*% t(ARinv[,3])
resR12

Your comments about the approximation.

The last issue is about the proportions of total variances explained by the first 2 canonical variables( the equations in (10-42) in the text).

epR11_(sum((ARinv[,1])^2)+sum((ARinv[,2])^2))/p
epR11
epR22_(sum((BRinv[,1])^2)+sum((BRinv[,2])^2))/q
epR22

Your comments on the approximation.

Remark: Splus does have a function called cancor. Since you are already experts of Splus, and cancor is not very complicated. So, please explore cancor, apply it to evap.y, and compare your results.