help(author.count)
author.count
(I am only familiar with Hemingway and Faulkner; I hope you know the other authors well enough to interpret the clustering results.) Since the works differ in length, it is more appropriate to consider relative frequencies.
sums <- apply(author.count, 1, sum)
AutAdj <- sweep(author.count, 1, sums, "/")
round(AutAdj, 3)
Now we need to extract the author names.
anames <- dimnames(author.count)[[1]]
anames
Now let's apply the hierarchical clustering method. hclust supports the single linkage method (connected), the complete linkage method (compact), and the average linkage method (average).
Look at the single linkage method first:
win.graph()
AutAdjCluS <- hclust(dist(AutAdj), method="connected")
plclust(AutAdjCluS, label=anames, main="Single Linkage")
The complete linkage method:
AutAdjCluC <- hclust(dist(AutAdj), method="compact")
plclust(AutAdjCluC, label=anames, main="Complete Linkage")
The average linkage method:
AutAdjCluA <- hclust(dist(AutAdj), method="average")
plclust(AutAdjCluA, label=anames, main="Average Linkage")
Compare the results from the above methods. What can you conclude? What are reasonable clusters?
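One way to compare the three trees more concretely is to cut each of them into the same number of groups and put the memberships side by side. The following is only a sketch under assumptions: it is written for current R rather than S-PLUS (R names the linkages "single", "complete" and "average"), it uses cutree, and the data matrix x and its labels are made up for illustration, not the author data.

```r
# Toy data (made up): 6 points in 2-D forming two loose groups.
x <- rbind(c(0.0, 0.0), c(0.1, 0.2), c(0.2, 0.1),
           c(3.0, 3.0), c(3.1, 2.9), c(2.9, 3.2))
rownames(x) <- paste0("w", 1:6)   # hypothetical labels

d <- dist(x)                      # Euclidean distances between rows

# R's names for the linkages the handout calls connected/compact/average.
methods <- c("single", "complete", "average")

# Cut each tree into k = 2 groups and collect the memberships side by side.
memb <- sapply(methods, function(m) cutree(hclust(d, method = m), k = 2))
print(memb)

# Cross-tabulate two of the solutions to see where they disagree.
print(table(single = memb[, "single"], complete = memb[, "complete"]))
```

With the author data, one would replace x by AutAdj and try a few values of k; rows that change groups from one column to the next are the works on which the linkage methods disagree.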
Next, we use the k-means clustering method. Suppose there are two clusters. We need to start with an initial grouping. Let Hemingway and Faulkner form group 1, and the other authors form group 2. Do you think this is a good starting point? We also need to calculate the means of the initial clusters.
mean1 <- apply(AutAdj[c(5,6,7,9),], 2, mean)
mean2 <- apply(AutAdj[-c(5,6,7,9),], 2, mean)
Now we apply the k-means method:
g <- kmeans(AutAdj, rbind(mean1, mean2))
g
To find out what the clusters are:
anames[g$cluster==1]
anames[g$cluster==2]
Are the derived clusters reasonable? Now let's try a different starting point:
mean12 <- apply(AutAdj[c(1,2,3,4,5,9),], 2, mean)
mean22 <- apply(AutAdj[-c(1,2,3,4,5,9),], 2, mean)
g2 <- kmeans(AutAdj, rbind(mean12, mean22))
g2
To check who ends up in which group this time:
anames[g2$cluster==1]
anames[g2$cluster==2]
Although we purposely separated the works of Hemingway and Faulkner, one of Faulkner's works ends up in the same cluster as Hemingway's. This differs from the result of the hierarchical method. Could you explore further after class and give a good interpretation?
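A possible way to explore this is to run k-means from two different starting groupings and compare both the memberships and the total within-cluster sum of squares. This is a minimal sketch, assuming current R (where kmeans returns the components cluster and tot.withinss); the data matrix x and the two starting groupings are made up for illustration.

```r
# Toy data (made up): 6 points in 2-D forming two loose groups.
x <- rbind(c(0.0, 0.0), c(0.1, 0.2), c(0.2, 0.1),
           c(3.0, 3.0), c(3.1, 2.9), c(2.9, 3.2))

# Starting point A: initial groups {1,2,3} and {4,5,6}.
startA <- rbind(colMeans(x[1:3, ]), colMeans(x[4:6, ]))

# Starting point B: a deliberately mixed initial grouping.
startB <- rbind(colMeans(x[c(1, 4, 5), ]), colMeans(x[c(2, 3, 6), ]))

fitA <- kmeans(x, centers = startA)
fitB <- kmeans(x, centers = startB)

# Cluster numbers are arbitrary labels, so compare with a cross-tabulation
# rather than by checking whether the numbers are equal.
print(table(A = fitA$cluster, B = fitB$cluster))

# A smaller total within-cluster sum of squares means a tighter solution.
print(c(A = fitA$tot.withinss, B = fitB$tot.withinss))
```

If two starting points give different partitions, the one with the smaller within-cluster sum of squares fits the data more tightly, although that alone does not make it the more interpretable grouping.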
Next, we will analyze the divorce data for the 50 states in 1970. 1=acceptable, 0=not acceptable. First, you need to save the data in the proper directory and read the data in.
divorce <- read.table("divorce.data")
divorce
Now we calculate a dissimilarity coefficient: the number of mismatches divided by
the number of grounds that are present in at least one of the two states.
divsim <- dist(as.matrix(divorce), metric="binary")
divcluH <- hclust(divsim)
plclust(divcluH, label=state.abb)
You can see which states have the same grounds and which differ by only one. There does not seem to be any clear clustering; maybe just one big cluster? Which state is the most unlike the rest?
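To make the binary coefficient concrete, here is a small worked example. It is a sketch under assumptions: the two 0/1 vectors are hypothetical, not rows of the divorce data, and it uses current R, where the argument of dist is called method rather than metric.

```r
# Two hypothetical states scored on five hypothetical grounds (1 = acceptable).
a <- c(1, 1, 0, 0, 1)
b <- c(1, 0, 1, 0, 1)

mismatches     <- sum(a != b)             # one state has 1, the other 0
either_present <- sum(a == 1 | b == 1)    # at least one state has 1
by_hand <- mismatches / either_present

# The same quantity from dist(); grounds absent in both states are ignored.
from_dist <- as.numeric(dist(rbind(a, b), method = "binary"))

print(c(by_hand = by_hand, from_dist = from_dist))
```

Here the vectors disagree on two grounds and at least one of them has a 1 on four grounds, so both calculations give 2/4 = 0.5.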
Now try k-means clustering. It is not clear how reasonable this is, because the data are binary.
g <- kmeans(as.matrix(divorce), as.matrix(divorce[1:2,]))
g
state.abb[g$cluster == 1]
state.abb[g$cluster == 2]