Correspondence Analysis in Excel: the semantic differential of words according to the Enron email dataset
(See the post on Singular Value Decomposition for keywords.) After the bankruptcy of Enron, a company with $100 billion in sales and 22,000 employees, the Federal Energy Regulatory Commission made public and posted on the web a data set containing 500,000 messages between the 150 top executives of the company, a real treat for people doing data mining and visualization.
Here, I have created a VBA program that traverses the 150 folders containing the emails and constructs a contingency table. This two-mode table shows how frequently each of the 150 executives used each of the 1000 most common words in the English language.
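For readers who do not use VBA, the table-building step can be sketched in Python. This is not the VBA program described above, only a minimal illustration under a few assumptions: every executive has one folder under a common root, every file below that folder is a plain-text email, and the word list is supplied by the caller. The function name `build_contingency_table` and its parameters are hypothetical.

```python
from collections import Counter
from pathlib import Path
import re


def build_contingency_table(root, vocabulary):
    """Count how often each vocabulary word occurs in each top-level folder.

    Returns (labels, table): labels[i] is the name of folder i and
    table[i][j] is the number of times vocabulary[j] occurs in its files.
    """
    wanted = set(vocabulary)
    labels, table = [], []
    # Assumption: one sub-folder of `root` per executive.
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        counts = Counter()
        for path in folder.rglob("*"):
            if not path.is_file():
                continue
            # Simplification: headers are not stripped, so words in the
            # email headers are counted together with the message body.
            text = path.read_text(errors="ignore").lower()
            counts.update(w for w in re.findall(r"[a-z']+", text) if w in wanted)
        labels.append(folder.name)
        table.append([counts[w] for w in vocabulary])
    return labels, table
```

Calling `build_contingency_table(root, words)` with the root folder of the data set and a list of the 1000 words would give one row per executive and one column per word, which is the shape of table the analysis needs.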
Next, the correspondence analysis itself is carried out in Excel. Correspondence analysis displays both the rows and the columns of a two-way contingency table in one low-dimensional space. It calculates the chi-square statistic of the table divided by n, the grand total of the table (this quantity is called the total inertia), and performs a Singular Value Decomposition of the table. In order to calculate the coordinates for rows and columns, Q1 and Q2 as well as the row and column sums are used. The theory is documented in the book Correspondence Analysis in Practice by Michael Greenacre. In our case the first two dimensions account for about 50% of the total variance (inertia).
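The calculation can also be sketched outside Excel. The following Python/NumPy version follows the standard textbook formulation (SVD of the standardized residuals, then scaling by the row and column masses); it is an illustration rather than a copy of the spreadsheet. The function name `correspondence_analysis` is hypothetical, and I am assuming that `U` and `Vt` here play the role of Q1 and Q2 above. The table must not contain all-zero rows or columns.

```python
import numpy as np


def correspondence_analysis(table, n_dims=2):
    """Simple correspondence analysis of a two-way contingency table.

    Returns (row_coords, col_coords, total_inertia, explained), where the
    coordinates are principal coordinates on the first n_dims axes and
    explained holds the share of total inertia carried by each axis.
    """
    N = np.asarray(table, dtype=float)
    n = N.sum()                        # grand total of the table
    P = N / n                          # correspondence matrix
    r = P.sum(axis=1)                  # row masses (row sums / n)
    c = P.sum(axis=0)                  # column masses (column sums / n)
    expected = np.outer(r, c)          # proportions expected under independence
    S = (P - expected) / np.sqrt(expected)   # standardized residuals

    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    inertia = sigma ** 2               # principal inertias, one per axis
    total_inertia = inertia.sum()      # equals the chi-square statistic / n

    # Principal coordinates: singular vectors rescaled by the masses and
    # stretched by the singular values. The sign of each axis is arbitrary.
    row_coords = (U / np.sqrt(r)[:, None]) * sigma
    col_coords = (Vt.T / np.sqrt(c)[:, None]) * sigma
    explained = inertia / total_inertia
    return (row_coords[:, :n_dims], col_coords[:, :n_dims],
            total_inertia, explained[:n_dims])
```

Plotting the two columns of `row_coords` and `col_coords` in the same scatter chart gives the usual correspondence map, and `explained.sum()` is the share of the total inertia shown in that map. Because the SVD fixes each axis only up to sign, the map may come out mirrored compared with the one from Excel.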
More interestingly, a similar process is carried out in the second file. Here, twelve words are chosen from among the 1000 most common words in English, words which, in the author's perception, have either a strong positive or a strong negative connotation. The results of the correspondence analysis display the semantic differential that these words really do have: they range from those with a strong negative meaning (on the left side of the primary axis) to those with a strong positive one (on the right side).