Tuesday, March 30, 2010

Correspondence Analysis in Excel: the semantic differential of words according to the Enron email dataset.

(See the post on Singular Value Decomposition for keywords.) After the bankruptcy of Enron, a company with $100 billion in sales and 22,000 employees, the Federal Energy Regulatory Commission made public and posted on the web a data set containing 500,000 messages exchanged among the company's 150 top executives, a real treat for people doing data mining and visualization.

Here, I have created a VBA program that traverses the 150 folders containing the emails and constructs a contingency table. This two-mode table shows how frequently each of the 150 executives used each of the 1000 most common words in the English language.
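The counting step can be sketched outside Excel as well. The following is a minimal illustration in Python rather than the VBA program itself; the function name `build_contingency_table`, the tokenizing rule, and the toy messages are all assumptions made for the example, and the real program reads the messages from the 150 folders instead of from a dictionary.

```python
import re
from collections import Counter

def build_contingency_table(messages_by_person, vocabulary):
    """Count how often each person used each vocabulary word.

    messages_by_person: dict mapping a person to a list of message texts
                        (hypothetical stand-in for one folder per executive).
    vocabulary:         list of words that become the table columns.
    Returns (people, words, table) with table[i][j] = count of word j for person i.
    """
    words = list(vocabulary)
    vocab = set(words)
    people = sorted(messages_by_person)
    table = []
    for person in people:
        counts = Counter()
        for text in messages_by_person[person]:
            # Assumed tokenizer: lowercase runs of letters and apostrophes.
            for token in re.findall(r"[a-z']+", text.lower()):
                if token in vocab:
                    counts[token] += 1
        table.append([counts[w] for w in words])
    return people, words, table

# Toy data, invented for illustration only.
sample = {
    "exec_a": ["The deal is good.", "Good work on the deal"],
    "exec_b": ["The loss is bad, very bad."],
}
people, words, table = build_contingency_table(sample, ["good", "bad", "deal"])
```

Each row of `table` corresponds to one person and each column to one word, which is the same layout as the two-mode table in the spreadsheet.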

Next, the correspondence analysis itself is carried out in Excel. Correspondence analysis displays both the rows and the columns of a two-way contingency table in one low-dimensional space. It calculates the Chi-square statistic of the table divided by n, the grand total of the table (this quantity is called the Total Inertia), and performs a Singular Value Decomposition. To calculate the coordinates for rows and columns, Q1 and Q2 are used together with the row and column sums. The theory is documented in the book Correspondence Analysis in Practice by Michael Greenacre. In our case the first two dimensions contain about 50% of the total variance (inertia).
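For readers who want to check the spreadsheet against something else, here is a minimal sketch of the textbook calculation in Python with NumPy. It is not the Excel workbook: the function name `correspondence_analysis` and the small 3 x 3 table are assumptions for illustration, and the sketch assumes every row and column sum is greater than zero.

```python
import numpy as np

def correspondence_analysis(table):
    """Simple correspondence analysis of a two-way contingency table.

    Returns (total_inertia, explained, row_coords, col_coords), where
    explained[k] is the share of inertia on dimension k and the
    coordinates are principal coordinates.
    """
    N = np.asarray(table, dtype=float)
    n = N.sum()                      # grand total
    P = N / n                        # correspondence matrix
    r = P.sum(axis=1)                # row masses (row sums / n)
    c = P.sum(axis=0)                # column masses (column sums / n)
    E = np.outer(r, c)               # expected proportions under independence
    S = (P - E) / np.sqrt(E)         # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    total_inertia = (sv ** 2).sum()  # equals Chi-square / n
    explained = sv ** 2 / total_inertia
    row_coords = (U * sv) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
    return total_inertia, explained, row_coords, col_coords

# Toy table, invented for illustration only.
toy = [[20, 5, 2],
       [3, 15, 4],
       [1, 6, 18]]
total_inertia, explained, row_coords, col_coords = correspondence_analysis(toy)
share_first_two = explained[:2].sum()  # inertia shown on a two-dimensional map
```

Plotting the first two columns of `row_coords` and `col_coords` on the same axes gives the usual map, and `share_first_two` is the figure to compare with the "about 50%" quoted above.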

More interestingly, a similar process is carried out in the second file. Here, from among the 1000 most common words in English, twelve are chosen that, in the author's perception, have either a strong positive or a strong negative connotation. The results of the correspondence analysis display the semantic differential that these words really do have: the words range from those with a strong negative meaning (on the left side of the primary axis) to those with a strong positive one (on the right side).
