Principal Component Analysis of a U.S. city distance matrix in Excel
Principal Component Analysis (PCA) transforms a number of correlated variables
into a smaller number of uncorrelated variables, the principal components. This
way, a high-dimensional data set can be depicted on a low-dimensional (usually
two-dimensional) graph.
PCA is performed by calculating the covariance matrix of the initial data. Each
dimension of the initial data must have zero mean; if needed, the average of the
respective dimension is first subtracted from each data item. Next, the
eigenvalues and eigenvectors of the covariance matrix are calculated. Each
eigenvalue, taken as a proportion of the sum of all eigenvalues, shows the share
of the total variance carried by the corresponding principal component. A
suitable number of eigenvectors with the largest eigenvalues is chosen, usually
two. The centered data matrix is multiplied by these eigenvectors, giving a
two-dimensional representation of the data. If one were to calculate the
covariance matrix of the new representation, it would be diagonal: each new axis
would have covariance only with itself (in other words, variance) and no
covariance with the other dimensions, and the first axis would carry the largest
possible variance.
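The steps above can be sketched outside Excel as well. The following is a minimal illustration in Python with NumPy, not the spreadsheet implementation itself; the function name pca_project and the randomly generated sample data are assumptions made only for this example.

```python
import numpy as np

def pca_project(data, n_components=2):
    """Project the rows of `data` onto the leading principal components.

    Hypothetical helper for illustration; rows are observations,
    columns are dimensions.
    """
    X = np.asarray(data, dtype=float)
    # Center every dimension so that it has zero mean.
    centered = X - X.mean(axis=0)
    # Covariance matrix of the centered data (columns are variables).
    cov = np.cov(centered, rowvar=False)
    # eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Reorder so that the largest eigenvalue comes first.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Share of total variance carried by each principal component.
    explained = eigvals / eigvals.sum()
    # Multiply the centered data by the chosen eigenvectors.
    projected = centered @ eigvecs[:, :n_components]
    return projected, explained

# Made-up correlated sample data: 50 observations in 4 dimensions.
rng = np.random.default_rng(0)
sample = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))

scores, explained = pca_project(sample)
# Covariance of the new representation; its off-diagonal entries should be ~0.
new_cov = np.cov(scores, rowvar=False)
print(np.round(new_cov, 6))
```

Printing new_cov is a quick way to check the diagonalization property: the off-diagonal entries vanish up to rounding error, and the diagonal holds the two largest eigenvalues.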
A document from the Salk Institute presents a very thorough
description of PCA, together with the full derivation of the algorithm.
In our file, a symmetric 11x11 matrix of distances between U.S. cities is used.
The first two principal components are calculated from this 11-dimensional data
set; together they capture 80% of the variance in the initial data.
The picture above shows the two-dimensional map that was constructed.
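For readers without the Excel file, the same idea can be tried on a stand-in distance matrix. The sketch below, in Python with NumPy, assumes 11 made-up points on a plane instead of the real city distances, so the numbers it produces have nothing to do with the 80% figure quoted above. Each row of the distance matrix is treated as one observation in 11 dimensions, which is one reasonable reading of the procedure described here.

```python
import numpy as np

# Made-up planar positions for 11 points; these are NOT real city coordinates.
rng = np.random.default_rng(1)
coords = rng.uniform(0, 1000, size=(11, 2))

# Symmetric 11x11 matrix of pairwise Euclidean distances (zero diagonal).
diff = coords[:, None, :] - coords[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Treat each row as an observation in 11 dimensions and center the columns.
centered = dist - dist.mean(axis=0)

# Eigen-decomposition of the covariance matrix, largest eigenvalue first.
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Two-dimensional map and the share of variance it retains.
map_2d = centered @ eigvecs[:, :2]
share = eigvals[:2].sum() / eigvals.sum()
print(f"first two components explain {share:.0%} of the variance")
```

Plotting the two columns of map_2d against each other gives the kind of two-dimensional map discussed here; with real distances, the points would be labeled with the city names.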
Eigenvalues and eigenvectors were calculated with the freely downloadable add-in
MATRIX 2.3 - Matrix and Linear Algebra functions for EXCEL.