This file creates the
Solver model for multidimensional scaling (see a previous post on MDS in
Excel/VBA). The application calculates MDS for the matrix from McCormick’s BEA
article (see the previous post on BEA in Excel/VBA).
Sunday, April 04, 2010
PROXIMUS with Solver
This file creates an approximation of a matrix (using PROXIMUS – see
a previous post on Koyutürk’s PROXIMUS) using Solver.
McCormick’s Bond Energy Algorithm (BEA) in Excel/VBA
This file contains an implementation of a classic seriation algorithm
by McCormick, Schweitzer and White (Problem Decomposition and Data
Reorganization by a Clustering Technique) – bond energy algorithm – in Excel/VBA.
This implementation is very similar to the fortran code on the S-Plus/R page of Fionn
Murtagh’s. Additionally, an article by Arabie,
Schleutermann, Daws and Hubert (Marketing
Applications of Sequencing and Partitioning of Nonsymmetric and/or Two-Mode
Matrices), that uses multiple iterations, is
examined. Some modifications to this algorithm are considered.
Approximating matrices with PROXIMUS in Java
This file is an application of Mehmet Koyutürk’s algorithm PROXIMUS – an efficient
framework for error-bounded compression of high-dimensional discrete attributed
datasets. As sources for information on PROXIMUS, on the internet, seem to be
limited, the best is perhaps the C source code of the R PROXIMUS package.
Sometimes, it would just be better to switch to VBA :)
This file contains vehicle
routing paths (a solution to the two truck problem a few posts back), as a
Trendline’s Microsoft Excel Solver optimization result. The problem (in the two
left red columns) is that the two paths have become mixed. One path would have
to take us: Tallinn-Märjamaa-Riidaku-Tallinn and the other:
Tallinn-Pärnu-Valgerand-Tallinn, as determined by the ones and zeros in the
solution. But the six ones determining the paths reside in the same column and
have to be separated.
The set of Excel
relationships is visualized by the formula auditing tool. The final result is
in the right side of the spreadsheet, where the two red columns have three ones
each.
A Solver model for a transportation problem that finds a location for an additional factory – using 625 changing cells :(
In this file a location for an additional factory has to be determined.
It cannot be built on an existing location, goods can only be moved in
vertical/horizontal directions, the distance between adjacent cells being 1.
Total distance has to be minimized.
A transportation model with two trucks in Solver
This model uses flow
balance constraints as well as Boolean variables.
Tuesday, March 30, 2010
Estonian peasants forming Bucket Brigades
Java applet demo
This Scratch file
demonstrates the use of bucket brigades – a way of organizing workers (e.g. on
an assembly line) so that the line balances itself. This algorithm from Bartholdi’s
materials
shows that Scratch can be used for algorithmic purposes. We start our Informatics
II – Visual Basic for Applications (VBA) in Excel – class by introducing
programming concepts in Scratch, this way mitigating the hardships of learning
IDE, syntax and programming concepts simultaneously.
Spacefilling curves approximating TSP in Java
This file contains an
algorithm that calculates a spacefilling curve – a means for mapping a
high-dimensional space in a low-dimensional space. Bartholdi’s material uses Sierpinski’s curve and the mapping is a
fast approximation that can be used in solving traveling salesman problem, in
seriation, etc. Next to the Java code, a test set from Traveling
Salesman Problem website is imported to Excel, and a spacefilling curve solution
is found. As expected this fast approximation is ca. 40% inferior to the
optimal solution.
The file, thus,
contains a presentation, Java code and Excel/VBA program. The way of creating
spacefilling curve itself is suboptimal, as I decided to read only the first
paragraph of the algorithm and to try to devise the solution (on the picture
above) myself.
A heuristic solution for a knapsack problem: using Excel’s VBA to optimize a sequence of Solver problems
Suppose we have to
bill our business partner in the total amount of $500 000, but it’s our
policy, that one invoice contains no more than $50 000 worth of records. This could be stated to be
a kind of a curious knapsack problem. At first, the model is run and an initial
invoice is created – worth of up to $50 000 –, followed by the removal of
those records from the dataset. Then the model is run again, the chosen records
are once again removed and so on – until we have created a minimal number of
invoices, all filled to the maximum possible capacity. This is, though, stated
with notion, that we are actually only using a heuristic method: each time the
$50 000 solution is found to the remaining
problem, and this is all that is taken into account. Ideally, we should look
at the entire picture at once and find the global optimum of record allocation.
This model was created for my optimization modeling class.
Multinational Corporations practice transfer pricing: lets try to fit that into a Solver/Excel network optimization model
The foundation of this Solver model is a
classical operations research problem on network optimization. There is a
certain capacity in the three “producing” countries, which can’t be exceeded;
and the demand sets the upper boundary for all product flows into the four
“consuming” countries. We know the production costs as well as the selling
prices across the countries. All that in a classical model would help us
establish the maximal profit. Regarding the MNC’s, though, we also take into
account the corporate income taxes and the fact that an MNC has, either sales
or production, subsidiaries in all countries.
If an MNC plans to
produce in country 1 and sell in country 4, it would check which one has a
lower corporate tax. If the consuming country is more competitive, the
subsidiary of country 1 will sell the product to the subsidiary of country 4 at
zero profit margin. The subsidiary of country 4 will make the sale to the
customer at market prices, thus obtaining all profits. The profits are thus
paid in the country with lower taxes.
If, on the other hand,
the producing country has lower corporate income taxes, the producing
subsidiary sells the product to the sales subsidiary – at full market value.
The sales subsidiary sells the product to the final customer at market price as
well, making no profit. All the profit is accrued in the producing country –
with the MNC once again being able to
choose the lower tax rate. This transfer pricing model was created for my optimization
modeling class.
For business students: Monte Carlo method demystified with Excel/VBA
As I did my bachelor’s
degree in the faculty of Economics and Business Administration of the University
of Tartu,
the topic of Monte Carlo simulations surfaced
a number of times. Looking back, the simplicity of this method, applied in many
classes, seems to have been rather unrevealed. As in our university VBA is
taught in all faculties, we have taken a look at this business problem, the
solution of which at its most involved part comprises of an If embedded inside
two For-s. The key to the Excel part of the application logic is calculating
the cumulative distribution function from
probability density function.
Another tentative candidate for the Excel Gurus Gone Wild: Do the Impossible with Microsoft Excel book: Convert a two-dimensional table/matrix into a column or row vector and vice versa :)
In spreadsheet
modeling one often has to convert a two-dimensional table into a row or column
vector and vice versa. This Excel file does just that by using the following formulas: =INDEX($C$17:$N$28;IF(MOD(P3;12)=0;12;MOD(P3;12));IF(P3/12=INT(P3/12);P3/12;INT(P3/12)+1))
and =INDEX($R$3:$R$146;MATCH($B31&"-"&C$30;$Q$3:$Q$146;0)).
The former formula is
actually an involved version of =INDEX($C$17:$N$28;MOD(P3;12);INT(P3/12)+1) that
fixes the last row problem.
This is, for instance,
a problem that is tackled in Solver network optimization models, when a two
dimensional cost/distance matrix is converted into a flow balance constraint
form.
Demos of Excel/VBA graphical capabilities based on NetLogo – a programmable modeling environment for simulating natural and social phenomena –: the world of rock-paper-scissors and the dynamics of bird flocks
This program is a demo of graphical capabilities that Excel has. A
famous game of rock-paper-scissors forms the basis of
our simulated world. Each species consumes one neighbor and is in turn consumed
by another. A new entity of one’s own type is spawned, when food is consumed.
There is a need for
maintaining a dynamic balance – if one was to eat all of its food, only two
species would be present, and the focal species itself would be driven out as
it’s the weaker of the two.
There are a few
specific procedures created for this program: touching(object 1, object 2)
checks whether the two objects touch (this sub program can also be used to
check whether one block of cells is within another block of cells). Pause (s)
pauses the program for a period of time, also updating the graphical display. Arrays
are used for storing graphical objects.
Another example from Netlogo models shows the
dynamics of bird flocks. Bird flocks is the one program here that has been created by my TTU students, not me. These
students come from the same high school I once attended –
Tallinna Reaalkool.
Kohonen’s Self Organizing Maps in Excel/VBA, applied for reducing dimensions of colors and of financial ratios from Google Finance
Kohonen’s Self
Organizing Maps (SOM) is a type of artificial neural network that is trained using
unsupervised learning. The number of dimensions is effectively reduced as a
two-dimensional, discrete representation of high-dimensional data is produced.
A neighborhood function is used as the topological properties of the input
space are preserved.
In this file, the output space is visualized in a 40x40 block of Excel
cells that are colored appropriately. If you would like to take a look at a
good description of the algorithm, go to AIJunkie’s website.
The second file, chooses three
financial indicators – ROE, Debt-to-Equity and Market Capitalization – and once
again depicts this three-dimensional data on a two-dimensional SOM. The ratios
are from telecom sector of Google Finance and are (0,1) normalized beforehand.
As the initial plot is “dominated” by the three biggest companies – for which a
substantial portion of the entire space is reserved (the market capitalization
ratio is extremely right skewed) – a logarithm transformation is later undertaken.
The colors of the graph are unappealing, because the ratio vectors are used as
RGB components of Excel cell colors :).
Principal Component Analysis of a U.S. city distance matrix in Excel
Principal Component
Analysis (PCA) transforms a number
of correlated variables into a smaller number of uncorrelated variables – the
principal components. This way, the high-dimensional data set can once again be
depicted on a low(two)-dimensional graph.
PCA is performed by
calculating the covariance matrix for the initial data (each dimension of
initial data must have zero means – if needed the average of the respective
dimension is first subtracted from each data item). Next, eigenvectors of
covariance matrix are calculated. The proportion of eigenvalues shows the
importance of each new principal component dimension. A suitable number of
eigenvectors is chosen, usually two. The initial data matrix is multiplied with
the eigenvectors, thus obtaining a two- dimensional representation of data. If
one was to calculate the covariance matrix for the new representation of data,
this would be diagonalized – which means that each new axis would have maximal
covariance with itself (in other words variance) and no covariance with the
other dimensions.
A document from Salk Institute presents a very thorough
description of PCA, together with the full derivation of the algorithm.
In our file, a symmetric 11x11 matrix of distances between the cities
of U.S. is taken. The first two principal components are calculated from this
11 dimensional data set – depicting 80% of the variance in the initial data.
The picture above shows the two-dimensional map that was constructed.
Eigenvalues and eigenvectors were calculated with add-in MATRIX 2.3 - Matrix and Linear
Algebra functions for EXCEL that is freely downloadable.
A primitive try at testing graph isomorphism in Excel/VBA
When we consider an adjacency matrix of a graph, each entry shows the
presence of an arc between the nodes. Taking the matrix to the power of two, an
entry shows the number of length two connections between the nodes considered.
Similarly, the adjacency matrix to the power of three shows the number of
length three connections, etc.
In this program a three-dimensional array is constructed, with the
initial matrix at the first position of the third dimension. Next the matrix is
taken to subsequent powers, each being stored at respective positions of the
third dimension. Thus, if we take position 2-3 of the adjacency matrix and look
at its powers on the third dimension, we will see how this node is connected to
the rest.
The program takes the
graph up to the power of the number of its elements. For each adjacency matrix
element the sequence of its successive powers is then written out as a text
string and the strings are sorted in an ascending order. The idea is that for
two isomorphic graphs, the results would be the same.
This, of course, is a
very primitive try at testing the isomorphism of graphs, as in the entire computational complexity theory this is one of only
two problems in existence, the complexity of which is neither polynomial time
nor NP-complete.
A set of pictures that I use for explaining Singular Value Decomposition (SVD)
According to finite-dimensional spectral theorem, a symmetric matrix
has orthonormal eigenvectors, thus Q*QT=I and QT=Q-1.
Let us take Q1 as eigenvectors of a symmetric matrix formed by A*AT;
and Q2 as eigenvectors of a symmetric matrix formed by AT*A.
According to the (not too involved) proof, SVD decomposes
matrix A in a following way: A = Q1*S*Q2T, with
eigenvalues being stored in S. As eigenvalues are stored in descending order, the matrix is
now effectively multilayered (see pictures above), with a possibility of
“peeling” off the more important vectors at first and discarding the less
influential information from the latter part of decomposition.
Using Singular Value Decomposition (SVD): Excel/VBA application that reads a BMP file and compresses it in a lossy way
(See the post on SVD
for keywords.) This application begins by
reading a BMP file – BMP format is chosen as pictures therein are stored in a
raw format, pixel by pixel, with no transformation undertaken –, and then displaying
the picture in Excel with each cell being shaded respectively.
Next, SVD is
undertaken (once again using Excel add-in MATRIX 2.3 - Matrix and Linear
Algebra functions for EXCEL that is freely downloadable). Analysis of resultant eigenvalues
shows that over 80% of the information is contained in the first 41
eigenvectors. Whereas the initial picture had information worth of
154*181=27874 bytes, the first 41 vectors of Q1 and Q2
and the first 41 eigenvalues are contained in 154*41+181*41+41*41=15416 bytes,
resulting in 44,6% compression. The quality you can judge for yourselves :). The
picture has thus been compressed in a lossy way; additionally the method could be
used for transferring the picture over a network, peel by peel, until the final
fine-grained view is obtained.
Correspondence Analysis in Excel: the semantic differential of words according to Enron email dataset.
(See the post on Singular
Value Decomposition for keywords.) After the bankruptcy of Enron, a $100 billion
sales, 22 000 employee company, the Federal Energy Regulatory Commission made
public and posted on the web, the data set containing
500 000 messages between the 150 top executives of the company – a real
treat for the people doing data mining and visualization.
Here, I have created a VBA
program that traverses the 150 folders containing the emails and constructs a contingency table. This two mode table shows how
frequently the 150 executives used the 1000 most common words in English language.
Next the process of
correspondence analysis is undertaken in Excel. Correspondence analysis
displays both the rows and the columns
of a two-way contingency table in one low-dimensional space. It calculates the Chi-square statistic divided by n (called
Total Inertia) for the table and performs its Singular Value Decomposition. In
order to the calculate coordinates for rows and columns, Q1 and Q2
as well as row and column sums are used. The theory is documented in the book Correspondence Analysis in Practice by Michael Greenacre.
In our case the first two dimensions will contain about 50% of the total
variance.
More interestingly, a
similar process is undertaken in the second file. Here, from amongst
the 1000 most common words in English, twelve are chosen – which according to
author’s perception have either a strong positive or negative connotation. The
results of correspondence analysis display the semantic differential that these words
really do have – as words are ranging from those that have a strong negative
meaning (on the left side of the primary axis) to those with a strong positive
one (on the right side).
MultiDimensional Scaling (MDS) in Excel/VBA
Information
visualization technique multidimensional scaling (MDS) has its roots in the
field of psychometrics and is used for
exploring similarity/dissimilarity data. The case in point is once again the
city distance matrix for the U.S. (see the post on PCA).
As a result a
two-dimensional representation of data is constructed, where each Euclidean distance
approximates the corresponding dissimilarity/distance
in the
original matrix. There is an objective function (called stress/Kruskal’s
stress/energy) that is minimized.
This file is based on methods
put forth in the dissertation of Wojciech Basalaj and uses Newton-Raphson method for optimization. There is a problem with convergence,
though, as solution seems to get stuck in the local minima.
Using Singular Value Decomposition (SVD) to factorize data on car properties
This file has data on five
cars: Chevrolet Matiz, Chevrolet Evanda, Chevrolet Tacuma, Chevrolet Nubira, Chevrolet
Kalos. The properties considered are engine displacement, length, width, cargo
volume, number of cylinders, power, fuel consumption and CO2 amount.
SVD reveals that 93,4%
and 5,4% of information is contained in the first two singular vectors,
respectively – leaving only 1,2% for the rest of the decomposition. The first
singular vector can be called general assessment – it shows a basic relationship
between all the cars across the entire set of variables. Our main focus is to
try to understand the contribution that the second singular vector has. As
shown above, the most influential contributions of the second singular vector
are in engine displacement, cargo volume and power (kW) – this is the
fine-tuning, which shows that the cars can’t simply be described with a rank
one matrix.
This is the second
file on this blog, that was not originally created by me (the first was the
model of bird flocks), but rather replicated and expanded.
Inverse of a matrix in Excel/VBA
This file finds the
inverse matrix in VBA, using Gauss-Jordan elimination.
Stephen Wolfram’s A New Kind of Science: a set of cellular automatas in Excel/VBA
This file(and this) uses VBA or Excel’s formulas to operate Stephen
Wolfram’s
cellular automatas from his book A New Kind of Science, including Reflector
and Alice in the Wonderland’s Cat.