Tuesday, March 30, 2010

Principal Component Analysis of a U.S. city distance matrix in Excel

Principal Component Analysis (PCA) transforms a set of correlated variables into a smaller set of uncorrelated variables, the principal components. This way, a high-dimensional data set can be depicted on a low-dimensional (typically two-dimensional) graph.

PCA starts from the covariance matrix of the initial data. Each dimension of the data must have zero mean, so if needed the average of each dimension is first subtracted from every data item. Next, the eigenvalues and eigenvectors of the covariance matrix are calculated. Each eigenvalue's share of the sum of all eigenvalues shows the importance of the corresponding principal component, that is, the fraction of the total variance it captures. A suitable number of eigenvectors, those with the largest eigenvalues, is chosen, usually two. The mean-centered data matrix is multiplied by these eigenvectors, giving a two-dimensional representation of the data. If one were to calculate the covariance matrix of this new representation, it would be diagonal: each new axis keeps its own variance (its covariance with itself) and has zero covariance with the other axes.
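The steps above can be sketched outside Excel as well. The following is a minimal illustration in Python with NumPy, not the method used in the spreadsheet; the function name `pca` and the random sample data are assumptions made up for this example.

```python
import numpy as np

def pca(data, n_components=2):
    """Project the rows of `data` onto the top principal components."""
    X = np.asarray(data, dtype=float)
    # Subtract the mean of each dimension (column) so every column has zero mean.
    centered = X - X.mean(axis=0)
    # Covariance matrix of the centered data (columns are the variables).
    cov = np.cov(centered, rowvar=False)
    # eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Share of total variance captured by each component.
    explained = eigvals / eigvals.sum()
    # Multiply the centered data by the chosen eigenvectors.
    scores = centered @ eigvecs[:, :n_components]
    return scores, explained

# Made-up sample: 50 observations, 5 variables, two of them strongly correlated.
rng = np.random.default_rng(0)
sample = rng.normal(size=(50, 5))
sample[:, 1] = 2 * sample[:, 0] + 0.1 * sample[:, 1]

scores, explained = pca(sample)
# Covariance of the new representation; off-diagonal entries should be ~0.
new_cov = np.cov(scores, rowvar=False)
```

Inspecting `new_cov` is a quick way to check the diagonalization claim: the off-diagonal entries should vanish up to floating-point error.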

A document from Salk Institute presents a very thorough description of PCA, together with the full derivation of the algorithm.

In our file, a symmetric 11x11 matrix of distances between U.S. cities is used. The first two principal components are calculated from this 11-dimensional data set; together they account for 80% of the variance in the initial data. The picture above shows the two-dimensional map that was constructed. Eigenvalues and eigenvectors were calculated with the freely downloadable add-in MATRIX 2.3 - Matrix and Linear Algebra functions for EXCEL.
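To see why a distance matrix can yield a map at all, here is a small sketch in Python with NumPy. It does not use the U.S. city data from the file; the four points are the corners of a unit square, chosen only for illustration. It assumes that each row of the distance matrix is treated as one observation, as the description above suggests.

```python
import numpy as np

# Hypothetical toy data: pairwise distances between the corners of a unit square,
# listed in order around the square (not the U.S. city distances from the file).
r2 = np.sqrt(2.0)
dist = np.array([
    [0.0, 1.0, r2,  1.0],
    [1.0, 0.0, 1.0, r2 ],
    [r2,  1.0, 0.0, 1.0],
    [1.0, r2,  1.0, 0.0],
])

# Treat each row as an observation and center every column.
centered = dist - dist.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# Eigen-decomposition of the symmetric covariance matrix, largest eigenvalue first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Fraction of the variance captured by the first two components.
explained_2d = eigvals[:2].sum() / eigvals.sum()

# Two-dimensional coordinates for the four points.
coords = centered @ eigvecs[:, :2]

# Side lengths of the reconstructed figure (consecutive corners).
sides = [np.linalg.norm(coords[i] - coords[(i + 1) % 4]) for i in range(4)]
print(explained_2d, sides)
```

In this toy case the four sides of the reconstructed figure come out equal, so the square's shape is recovered. The scale and orientation are not preserved, so such a map should be read for relative positions only.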
