Jaanbi: 2010

Sunday, April 04, 2010

MultiDimensional Scaling (MDS) with Solver

This file creates the Solver model for multidimensional scaling (see a previous post on MDS in Excel/VBA). The application calculates MDS for the matrix from McCormick’s BEA article (see the previous post on BEA in Excel/VBA).

PROXIMUS with Solver

This file creates an approximation of a matrix (using PROXIMUS – see a previous post on Koyutürk’s PROXIMUS) using Solver.

McCormick’s Bond Energy Algorithm (BEA) in Excel/VBA

This file contains an implementation of a classic seriation algorithm by McCormick, Schweitzer and White (Problem Decomposition and Data Reorganization by a Clustering Technique) – bond energy algorithm – in Excel/VBA. This implementation is very similar to the fortran code on the S-Plus/R page of Fionn Murtagh’s. Additionally, an article by Arabie, Schleutermann, Daws and Hubert (Marketing Applications of Sequencing and Partitioning of Nonsymmetric and/or Two-Mode Matrices), that uses multiple iterations, is examined. Some modifications to this algorithm are considered.

Approximating matrices with PROXIMUS in Java

This file is an application of Mehmet Koyutürk’s algorithm PROXIMUS – an efficient framework for error-bounded compression of high-dimensional discrete attributed datasets. As sources for information on PROXIMUS, on the internet, seem to be limited, the best is perhaps the C source code of the R PROXIMUS package.

Sometimes, it would just be better to switch to VBA :)

This file contains vehicle routing paths (a solution to the two truck problem a few posts back), as a Trendline’s Microsoft Excel Solver optimization result. The problem (in the two left red columns) is that the two paths have become mixed. One path would have to take us: Tallinn-Märjamaa-Riidaku-Tallinn and the other: Tallinn-Pärnu-Valgerand-Tallinn, as determined by the ones and zeros in the solution. But the six ones determining the paths reside in the same column and have to be separated.

The set of Excel relationships is visualized by the formula auditing tool. The final result is in the right side of the spreadsheet, where the two red columns have three ones each.

A Solver model for a transportation problem that finds a location for an additional factory – using 625 changing cells :(

In this file a location for an additional factory has to be determined. It cannot be built on an existing location, goods can only be moved in vertical/horizontal directions, the distance between adjacent cells being 1. Total distance has to be minimized.

A transportation model with two trucks in Solver

This model uses flow balance constraints as well as Boolean variables.

Tuesday, March 30, 2010

Estonian peasants forming Bucket Brigades

Java applet demo

This Scratch file demonstrates the use of bucket brigades – a way of organizing workers (e.g. on an assembly line) so that the line balances itself. This algorithm from Bartholdi’s materials shows that Scratch can be used for algorithmic purposes. We start our Informatics II – Visual Basic for Applications (VBA) in Excel – class by introducing programming concepts in Scratch, this way mitigating the hardships of learning IDE, syntax and programming concepts simultaneously.

Spacefilling curves approximating TSP in Java

This file contains an algorithm that calculates a spacefilling curve – a means for mapping a high-dimensional space in a low-dimensional space. Bartholdi’s material uses Sierpinski’s curve and the mapping is a fast approximation that can be used in solving traveling salesman problem, in seriation, etc. Next to the Java code, a test set from Traveling Salesman Problem website is imported to Excel, and a spacefilling curve solution is found. As expected this fast approximation is ca. 40% inferior to the optimal solution.

The file, thus, contains a presentation, Java code and Excel/VBA program. The way of creating spacefilling curve itself is suboptimal, as I decided to read only the first paragraph of the algorithm and to try to devise the solution (on the picture above) myself.

A heuristic solution for a knapsack problem: using Excel’s VBA to optimize a sequence of Solver problems

Suppose we have to bill our business partner in the total amount of $500 000, but it’s our policy, that one invoice contains no more than $50 000 worth of records. This could be stated to be a kind of a curious knapsack problem. At first, the model is run and an initial invoice is created – worth of up to $50 000 –, followed by the removal of those records from the dataset. Then the model is run again, the chosen records are once again removed and so on – until we have created a minimal number of invoices, all filled to the maximum possible capacity. This is, though, stated with notion, that we are actually only using a heuristic method: each time the $50 000 solution is found to the remaining problem, and this is all that is taken into account. Ideally, we should look at the entire picture at once and find the global optimum of record allocation. This model was created for my optimization modeling class.

Multinational Corporations practice transfer pricing: lets try to fit that into a Solver/Excel network optimization model

The foundation of this Solver model is a classical operations research problem on network optimization. There is a certain capacity in the three “producing” countries, which can’t be exceeded; and the demand sets the upper boundary for all product flows into the four “consuming” countries. We know the production costs as well as the selling prices across the countries. All that in a classical model would help us establish the maximal profit. Regarding the MNC’s, though, we also take into account the corporate income taxes and the fact that an MNC has, either sales or production, subsidiaries in all countries.

If an MNC plans to produce in country 1 and sell in country 4, it would check which one has a lower corporate tax. If the consuming country is more competitive, the subsidiary of country 1 will sell the product to the subsidiary of country 4 at zero profit margin. The subsidiary of country 4 will make the sale to the customer at market prices, thus obtaining all profits. The profits are thus paid in the country with lower taxes.

If, on the other hand, the producing country has lower corporate income taxes, the producing subsidiary sells the product to the sales subsidiary – at full market value. The sales subsidiary sells the product to the final customer at market price as well, making no profit. All the profit is accrued in the producing country – with the MNC once again being able to choose the lower tax rate. This transfer pricing model was created for my optimization modeling class.

For business students: Monte Carlo method demystified with Excel/VBA

As I did my bachelor’s degree in the faculty of Economics and Business Administration of the University of Tartu, the topic of Monte Carlo simulations surfaced a number of times. Looking back, the simplicity of this method, applied in many classes, seems to have been rather unrevealed. As in our university VBA is taught in all faculties, we have taken a look at this business problem, the solution of which at its most involved part comprises of an If embedded inside two For-s. The key to the Excel part of the application logic is calculating the cumulative distribution function from probability density function.

Another tentative candidate for the Excel Gurus Gone Wild: Do the Impossible with Microsoft Excel book: Convert a two-dimensional table/matrix into a column or row vector and vice versa :)

In spreadsheet modeling one often has to convert a two-dimensional table into a row or column vector and vice versa. This Excel file does just that by using the following formulas: =INDEX($C$17:$N$28;IF(MOD(P3;12)=0;12;MOD(P3;12));IF(P3/12=INT(P3/12);P3/12;INT(P3/12)+1)) and =INDEX($R$3:$R$146;MATCH($B31&"-"&C$30;$Q$3:$Q$146;0)).

The former formula is actually an involved version of =INDEX($C$17:$N$28;MOD(P3;12);INT(P3/12)+1) that fixes the last row problem.

This is, for instance, a problem that is tackled in Solver network optimization models, when a two dimensional cost/distance matrix is converted into a flow balance constraint form.

Demos of Excel/VBA graphical capabilities based on NetLogo – a programmable modeling environment for simulating natural and social phenomena –: the world of rock-paper-scissors and the dynamics of bird flocks

This program is a demo of graphical capabilities that Excel has. A famous game of rock-paper-scissors forms the basis of our simulated world. Each species consumes one neighbor and is in turn consumed by another. A new entity of one’s own type is spawned, when food is consumed.

There is a need for maintaining a dynamic balance – if one was to eat all of its food, only two species would be present, and the focal species itself would be driven out as it’s the weaker of the two.

There are a few specific procedures created for this program: touching(object 1, object 2) checks whether the two objects touch (this sub program can also be used to check whether one block of cells is within another block of cells). Pause (s) pauses the program for a period of time, also updating the graphical display. Arrays are used for storing graphical objects.

Another example from Netlogo models shows the dynamics of bird flocks. Bird flocks is the one program here that has been created by my TTU students, not me. These students come from the same high school I once attended – Tallinna Reaalkool.

Kohonen’s Self Organizing Maps in Excel/VBA, applied for reducing dimensions of colors and of financial ratios from Google Finance

Kohonen’s Self Organizing Maps (SOM) is a type of artificial neural network that is trained using unsupervised learning. The number of dimensions is effectively reduced as a two-dimensional, discrete representation of high-dimensional data is produced. A neighborhood function is used as the topological properties of the input space are preserved.

In this file, the output space is visualized in a 40x40 block of Excel cells that are colored appropriately. If you would like to take a look at a good description of the algorithm, go to AIJunkie’s website.

The second file, chooses three financial indicators – ROE, Debt-to-Equity and Market Capitalization – and once again depicts this three-dimensional data on a two-dimensional SOM. The ratios are from telecom sector of Google Finance and are (0,1) normalized beforehand. As the initial plot is “dominated” by the three biggest companies – for which a substantial portion of the entire space is reserved (the market capitalization ratio is extremely right skewed) – a logarithm transformation is later undertaken. The colors of the graph are unappealing, because the ratio vectors are used as RGB components of Excel cell colors :).

Principal Component Analysis of a U.S. city distance matrix in Excel

Principal Component Analysis (PCA) transforms a number of correlated variables into a smaller number of uncorrelated variables – the principal components. This way, the high-dimensional data set can once again be depicted on a low(two)-dimensional graph.

PCA is performed by calculating the covariance matrix for the initial data (each dimension of initial data must have zero means – if needed the average of the respective dimension is first subtracted from each data item). Next, eigenvectors of covariance matrix are calculated. The proportion of eigenvalues shows the importance of each new principal component dimension. A suitable number of eigenvectors is chosen, usually two. The initial data matrix is multiplied with the eigenvectors, thus obtaining a two- dimensional representation of data. If one was to calculate the covariance matrix for the new representation of data, this would be diagonalized – which means that each new axis would have maximal covariance with itself (in other words variance) and no covariance with the other dimensions.

A document from Salk Institute presents a very thorough description of PCA, together with the full derivation of the algorithm.

In our file, a symmetric 11x11 matrix of distances between the cities of U.S. is taken. The first two principal components are calculated from this 11 dimensional data set – depicting 80% of the variance in the initial data. The picture above shows the two-dimensional map that was constructed. Eigenvalues and eigenvectors were calculated with add-in MATRIX 2.3 - Matrix and Linear Algebra functions for EXCEL that is freely downloadable.

A primitive try at testing graph isomorphism in Excel/VBA

When we consider an adjacency matrix of a graph, each entry shows the presence of an arc between the nodes. Taking the matrix to the power of two, an entry shows the number of length two connections between the nodes considered. Similarly, the adjacency matrix to the power of three shows the number of length three connections, etc.

In this program a three-dimensional array is constructed, with the initial matrix at the first position of the third dimension. Next the matrix is taken to subsequent powers, each being stored at respective positions of the third dimension. Thus, if we take position 2-3 of the adjacency matrix and look at its powers on the third dimension, we will see how this node is connected to the rest.

The program takes the graph up to the power of the number of its elements. For each adjacency matrix element the sequence of its successive powers is then written out as a text string and the strings are sorted in an ascending order. The idea is that for two isomorphic graphs, the results would be the same.

This, of course, is a very primitive try at testing the isomorphism of graphs, as in the entire computational complexity theory this is one of only two problems in existence, the complexity of which is neither polynomial time nor NP-complete.

A set of pictures that I use for explaining Singular Value Decomposition (SVD)

According to finite-dimensional spectral theorem, a symmetric matrix has orthonormal eigenvectors, thus Q*Q^T=I and Q^T=Q^-1. Let us take Q₁ as eigenvectors of a symmetric matrix formed by A*A^T; and Q₂ as eigenvectors of a symmetric matrix formed by A^T*A. According to the (not too involved) proof, SVD decomposes matrix A in a following way: A = Q₁*S*Q₂^T, with eigenvalues being stored in S. As eigenvalues are stored in descending order, the matrix is now effectively multilayered (see pictures above), with a possibility of “peeling” off the more important vectors at first and discarding the less influential information from the latter part of decomposition.

Using Singular Value Decomposition (SVD): Excel/VBA application that reads a BMP file and compresses it in a lossy way

(See the post on SVD for keywords.) This application begins by reading a BMP file – BMP format is chosen as pictures therein are stored in a raw format, pixel by pixel, with no transformation undertaken –, and then displaying the picture in Excel with each cell being shaded respectively.

Next, SVD is undertaken (once again using Excel add-in MATRIX 2.3 - Matrix and Linear Algebra functions for EXCEL that is freely downloadable). Analysis of resultant eigenvalues shows that over 80% of the information is contained in the first 41 eigenvectors. Whereas the initial picture had information worth of 154*181=27874 bytes, the first 41 vectors of Q₁ and Q₂ and the first 41 eigenvalues are contained in 154*41+181*41+41*41=15416 bytes, resulting in 44,6% compression. The quality you can judge for yourselves :). The picture has thus been compressed in a lossy way; additionally the method could be used for transferring the picture over a network, peel by peel, until the final fine-grained view is obtained.

Correspondence Analysis in Excel: the semantic differential of words according to Enron email dataset.

(See the post on Singular Value Decomposition for keywords.) After the bankruptcy of Enron, a $100 billion sales, 22 000 employee company, the Federal Energy Regulatory Commission made public and posted on the web, the data set containing 500 000 messages between the 150 top executives of the company – a real treat for the people doing data mining and visualization.

Here, I have created a VBA program that traverses the 150 folders containing the emails and constructs a contingency table. This two mode table shows how frequently the 150 executives used the 1000 most common words in English language.

Next the process of correspondence analysis is undertaken in Excel. Correspondence analysis displays both the rows and the columns of a two-way contingency table in one low-dimensional space. It calculates the Chi-square statistic divided by n (called Total Inertia) for the table and performs its Singular Value Decomposition. In order to the calculate coordinates for rows and columns, Q₁ and Q₂ as well as row and column sums are used. The theory is documented in the book Correspondence Analysis in Practice by Michael Greenacre. In our case the first two dimensions will contain about 50% of the total variance.

More interestingly, a similar process is undertaken in the second file. Here, from amongst the 1000 most common words in English, twelve are chosen – which according to author’s perception have either a strong positive or negative connotation. The results of correspondence analysis display the semantic differential that these words really do have – as words are ranging from those that have a strong negative meaning (on the left side of the primary axis) to those with a strong positive one (on the right side).

MultiDimensional Scaling (MDS) in Excel/VBA

Information visualization technique multidimensional scaling (MDS) has its roots in the field of psychometrics and is used for exploring similarity/dissimilarity data. The case in point is once again the city distance matrix for the U.S. (see the post on PCA).

As a result a two-dimensional representation of data is constructed, where each Euclidean distance approximates the corresponding dissimilarity/distance in the original matrix. There is an objective function (called stress/Kruskal’s stress/energy) that is minimized.

This file is based on methods put forth in the dissertation of Wojciech Basalaj and uses Newton-Raphson method for optimization. There is a problem with convergence, though, as solution seems to get stuck in the local minima.

Using Singular Value Decomposition (SVD) to factorize data on car properties

This file has data on five cars: Chevrolet Matiz, Chevrolet Evanda, Chevrolet Tacuma, Chevrolet Nubira, Chevrolet Kalos. The properties considered are engine displacement, length, width, cargo volume, number of cylinders, power, fuel consumption and CO2 amount.

SVD reveals that 93,4% and 5,4% of information is contained in the first two singular vectors, respectively – leaving only 1,2% for the rest of the decomposition. The first singular vector can be called general assessment – it shows a basic relationship between all the cars across the entire set of variables. Our main focus is to try to understand the contribution that the second singular vector has. As shown above, the most influential contributions of the second singular vector are in engine displacement, cargo volume and power (kW) – this is the fine-tuning, which shows that the cars can’t simply be described with a rank one matrix.

This is the second file on this blog, that was not originally created by me (the first was the model of bird flocks), but rather replicated and expanded.

Jaanbi