Wednesday, September 22, 2010

Experiences from my first published project

Yesterday I reached a point where I felt comfortable with my implementation of a Double Array Trie, so I uploaded it to github. It was an instructive experience, one I plan to repeat ASAP with some other ideas I have about useful libraries. Here is a superficial report of the things I learned during the last four days.

The project
The code started its life as a component of a larger project, while I was working at the department of Informatics at the Athens University of Economics and Business. We implemented an algorithm known as Context Tree Weighting, a very effective solution to the problem of finding the most probable model that describes a data stream. The details are not important; however, one of the challenges of the implementation was that we needed to store information for every prefix of a string, for a large number of very large strings (the performance of the algorithm is more or less proportional to the length of the strings stored). Since the initial implementation was in Matlab (which has no concept of pointers), we needed to implement a trie structure using arrays. I was directed towards the double array trie, found an explanation of the algorithm at Theppitak Karoonboonyanan's site and got down to coding. However, memory utilization and performance suffered (whether that was Matlab's fault or mine is still unclear to me), so, with C being out of the question for portability reasons, I chose Java to re-implement the solution. The end result was functional and relatively stable but, this being an academic project and all, it had no relation to proper engineering. It was blazingly fast though and, what is more, extremely fun to code.
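
For context, here is a minimal sketch of the core idea behind a double array trie, following the usual base/check formulation described on Karoonboonyanan's page (the names and layout below are illustrative, not taken from my published code): the child of state s under symbol c lives at base[s] + c, and check[] remembers the parent so that clashes can be detected.

final class DoubleArrayTrieSketch {
    private final int[] base;   // base[s]: offset used to locate the children of state s
    private final int[] check;  // check[t]: the parent state that owns slot t

    DoubleArrayTrieSketch(int[] base, int[] check) {
        this.base = base;
        this.check = check;
    }

    /** Returns the state reached from state on symbol, or -1 if the transition does not exist. */
    int walk(int state, int symbol) {
        int next = base[state] + symbol;
        if (next >= 0 && next < check.length && check[next] == state) {
            return next;
        }
        return -1;
    }
}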

After school
Ever since this project ended, I wanted to release this code to the public, mostly for brag rights, since it was a difficult piece of coding. However, working in a software house does not leave much time off, so it was postponed indefinitely. The opportunity came disguised as the bankruptcy of my employer, leaving me all the time in the world to work on everything I want. So, after some time doing nothing at all, this code was unburied from a heap of bits known as the "old code archive" and I got down to some coding.

Experience applied
I have looked at and worked on a lot of public code, studying it, measuring it and solving (or even introducing) bugs. I have some expectations when it comes to quality, both of the code and of the documentation, but I also understand what over-engineering is (and my tendency to indulge in it). I reached the conclusion that the mess of code I had at the time did not pass the quality bar: although it was not wrong, it was impossible to unravel. Also, it was so tightly bound to the CTW algorithm that it was useless on its own.

Lessons learned
The fact that I knew the code proved the most difficult thing to overcome. I knew it was a mess but I could not see it. I finally managed to prevail by making myself use the code for things it was not built for but that users would expect to be available. This led to implementing interfaces, providing assertions and test cases, separating concerns and adding a degree of modularity that allows for simple extensions as well as core rewrites. I learned to implement what I expect to see in other people's code, a task easier said than done. And loads of comments.
Another point was finding a hosting solution. I ended up using github, which meant learning a new VCS and integrating it with eclipse, my workflow and my mindset. Nothing difficult, but it was a new experience.
Finally, I now have a constant nagging feeling that comes from the things I haven't done correctly. I have an empty wiki, I know I have to provide usage examples, I know I must do a performance comparison with another implementation, I know the test coverage is not complete. In other words, I know I will never reach perfection. I also know, however, that if I had waited until I reached what I perceive as satisfactory, I would never have done what I have now done. I guess this is the biggest lesson I have learned from this small endeavour: to be realistic.

On to something bigger now...

Wednesday, September 15, 2010

...then it isn't

I continue my previous post with some more experiments and a roundup.

What was left out

In my last set of measurements, I left out the most important test. I wanted to see whether there was a threshold in the ratio of serial scans to random accesses beyond which the performance characteristics of my two implementations would flip. Let me remind you here that the mixed test I performed previously gave a 1/7 probability to a serial scan (of random length and starting point) and the remaining 6/7 to a random operation at a random position.

The test

The code this time was this:
// Mix serial scans with random single-element operations, controlled by SERIAL_SCAN_PROB
for (int i = 0; i < RAND_ITERS; i++) {
    size = theArray.size();
    double op = rng.nextDouble();

    if (op <= SERIAL_SCAN_PROB) {
        // serial scan of random length, starting at a random index
        int start = nextRandomIndex(dist, size);
        int length = rng.nextInt(size - start);
        for (int j = 0; j < length; j++) {
            theArray.getFirst(j + start);
            theArray.getSecond(j + start);
        }
    } else {
        // random single-element operation at a random position
        op = rng.nextDouble();
        if (op < 0.2) {
            theArray.getFirst(nextRandomIndex(dist, size));
        } else if (op < 0.3) {
            theArray.getSecond(nextRandomIndex(dist, size));
        } else if (op < 0.4) {
            theArray.setFirst(nextRandomIndex(dist, size), rng.nextInt());
        } else if (op < 0.5) {
            theArray.setSecond(nextRandomIndex(dist, size), rng.nextInt());
        } else if (op < 0.7) {
            theArray.add(rng.nextInt(), rng.nextInt());
        } else {
            theArray.remove(rng.nextInt(size));
        }
    }
}

The value of SERIAL_SCAN_PROB is what interests us here. Since I wanted to see the effects of caching, I settled on a specific size for the arrays which, based on my previous findings, was 5M elements. As before, all experiments were done with three access distributions: ~Uni(0, size), ~N(0.5, 0.1) and ~N(0.5, 0.0001). RAND_ITERS was 1K and the whole test was run 20 times for each of the two structures. My hardware setup was the same one I used last time.


The numbers

As before, no listing of boring data. If for any reason you want them, ask and you shall receive. The same goes for the code. Here are the graphs:

[Graphs omitted: one per access distribution]
First off, the serial scans are the most time consuming of the operations, so it is natural that the more of them are performed, the more time each iteration of the test needs. Next, note that, overall, the more localized the accesses (in the order of the graphs), the bigger the difference in performance between the implementations. This is not because SingleArray does worse (note that its performance is more or less steady) but because DoubleArray performs better: the more localized the operations, the better DoubleArray does. However, there is no more random way to access an array than a uniform distribution, so only in that scenario can the single array implementation compete. Finally, more importantly but less interestingly, DoubleArray performs at least marginally better in all cases.

Conclusion

I think there are no more realistic tests to do here. There are marginal cases I would like to test to see whether the single array implementation dominates there, but they are not important right now. What comes next is my library and, after all this, I will implement it with a double array backing store.

EDIT

I did a similar experiment with a backing store of two parallel linked integer lists (not Integer objects). Even @1M elements, the random start serial reads test hadn't finished after 1.5h of 100% CPU. I killed it. 'nuff said.
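
For clarity, "linked integer lists (not Integer objects)" means something along these lines: a hand-rolled node holding a primitive int, so there is no boxing, but every access still chases a pointer to a separately allocated node (the class name is mine, for illustration).

final class IntNode {
    int value;     // primitive payload, no Integer boxing
    IntNode next;  // traversal still dereferences a pointer per element
}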

Tuesday, September 14, 2010

When data locality isn't

I am working on releasing, as an open source project, a data structure that uses two parallel arrays as its store. I tried to do some performance optimization by merging the two arrays into one, with the (naive, as it proved) logic that, since the arrays are mostly processed in parallel (that is, accesses are performed at the same indexes), it would be beneficial to interleave them into one array, using even indexes for the first and odd ones for the second. Obviously I was "thinking" that data accesses would be faster because of caching and prefetching of chunks of the now common store. Boy, was I wrong! First, take a look at the API so that we have a common vocabulary:

package org.digitalstain.structures;

/**
 * Operations that are supposed to be present on a store
 * supported by two arrays that are parallel. That means
 * they are of the same size and in general the elements
 * that correspond to the same index are accessed together.
 *
 * @param <E> The type of element to store.
 * @author Chris Gioran
 */
public interface JointArray<E> {

    /**
     * Returns the size of the arrays.
     * @return The size of the arrays.
     */
    public int size();

    /**
     * Returns the element that is at position <code>index</code>
     * in the first array.
     *
     * @param index The index requested
     * @return The element present at the requested index in the first
     *         array
     */
    public E getFirst(int index);

    /**
     * Returns the element that is at position <code>index</code>
     * in the second array.
     *
     * @param index The index requested
     * @return The element present at the requested index in the second
     *         array
     */
    public E getSecond(int index);

    /**
     * Sets the value at <code>index</code> of the first array
     * to the value <code>element</code>.
     * @param index The position to set.
     * @param element The value to set.
     */
    public void setFirst(int index, E element);

    /**
     * Sets the value at <code>index</code> of the second array
     * to the value <code>element</code>.
     * @param index The position to set.
     * @param element The value to set.
     */
    public void setSecond(int index, E element);

    /**
     * Appends a pair to the store, increasing its size if necessary.
     * @param first The element to append to the first array
     * @param second The element to append to the second array
     */
    public void add(E first, E second);

    /**
     * Removes the element at <code>index</code> from both arrays.
     * @param index The index to remove
     */
    public void remove(int index);
}


The setup

Since the details of the actual structure are irrelevant, I implemented both storage solutions bare-bones, that is, as two classes: one with two parallel arrays where the corresponding indexes act as a pair, and one with a single backing array where two consecutive indexes act as a pair. Before you go any further: which one is faster?
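
To make the comparison concrete, here is a rough sketch of the two layouts, specialised for ints. The class names are only illustrative; the discussion below refers to the real classes as DoubleArray and SingleArray, and those also implement add, remove and growth on top of the JointArray interface above. Only the index mapping matters here.

class DoubleArrayStore {          // two parallel arrays: index i pairs first[i] with second[i]
    private final int[] first, second;

    DoubleArrayStore(int size) { first = new int[size]; second = new int[size]; }

    int getFirst(int i)  { return first[i]; }
    int getSecond(int i) { return second[i]; }
    void setFirst(int i, int v)  { first[i] = v; }
    void setSecond(int i, int v) { second[i] = v; }
}

class SingleArrayStore {          // one interleaved array: pair i lives at 2*i and 2*i + 1
    private final int[] store;

    SingleArrayStore(int size) { store = new int[2 * size]; }

    int getFirst(int i)  { return store[2 * i]; }
    int getSecond(int i) { return store[2 * i + 1]; }
    void setFirst(int i, int v)  { store[2 * i] = v; }
    void setSecond(int i, int v) { store[2 * i + 1] = v; }
}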

The trials

Obviously it depends on the access mode. So I performed a series of experiments with different access scenarios.

I did serial writes:
for (int i = 0; i < theArray.size(); i++) {
    theArray.setFirst(i, i);
    theArray.setSecond(i, i);
}


serial reads:
for (int i = 0; i < INIT_SIZE * 10; i++) {
    sum += theArray.getFirst(i % INIT_SIZE);
    sum += theArray.getSecond(i % INIT_SIZE);
}

random reads:
Random rng = new Random(startTime);
for (int i = 0; i < RAND_ITERS; i++) {
    theArray.getFirst(rng.nextInt(theArray.size()));
    theArray.getSecond(rng.nextInt(theArray.size()));
}


and the crown jewel, random mixed access modes:
Random rng = new Random(startTime);
Normal normal = new Normal(0.5, 0.0001, new MersenneTwister64((int) System.currentTimeMillis()));
for (int i = 0; i < RAND_ITERS; i++) {
    size = theArray.size();
    switch (rng.nextInt(7)) {
        case 0 : theArray.getFirst(nextRandomIndex(normal, size)); break;
        case 1 : theArray.getSecond(nextRandomIndex(normal, size)); break;
        case 2 : {
            int start = nextRandomIndex(normal, size);
            int length = rng.nextInt(size - start);
            for (int j = 0; j < length; j++) {
                theArray.getFirst(j + start);
                theArray.getSecond(j + start);
            }
            break;
        }
        case 3 : theArray.setFirst(nextRandomIndex(normal, size), rng.nextInt()); break;
        case 4 : theArray.setSecond(nextRandomIndex(normal, size), rng.nextInt()); break;
        case 5 : theArray.add(rng.nextInt(), rng.nextInt()); break;
        case 6 : theArray.remove(rng.nextInt(size)); break;
        default : throw new IllegalArgumentException("Error from RNG in mixed case");
    }
}


The last test was performed with the help of the colt library, with three distributions: ~Uni(0, theArray.size()), ~N(0.5, 0.1) and ~N(0.5, 0.0001), the last two normalized to return a valid index. The host PC was a laptop with an Intel Core2Duo P9500 processor (2 cores, 6MB L2 cache) and 4 GB of RAM, 1 GB of which was dedicated to the heap of the Sun JDK 1.6.0_21 JVM.
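
The nextRandomIndex helper is not shown above. Here is a minimal sketch of what such a helper might look like, assuming it simply clamps the drawn value to [0, 1] and scales it to a valid index; the parameter type and the clamping are my assumptions, not necessarily the actual code.

import cern.jet.random.AbstractDistribution;

final class Indexes {
    // Draw from the distribution, clamp to [0, 1] and scale to a valid index in [0, size).
    static int nextRandomIndex(AbstractDistribution dist, int size) {
        double d = dist.nextDouble();
        if (d < 0.0) d = 0.0;
        if (d > 1.0) d = 1.0;
        return Math.min((int) (d * size), size - 1);
    }
}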

The numbers

I did 20 iterations of every test and took the mean and standard deviation from these. The results are tedious to reproduce here (if you need them, ping me) but the resulting charts are interesting.
Behold:

[Charts omitted]
They display the average time per element vs. the number of elements, for each type of operation.

The first two are simply sanity checks that my setup is correct. For the serial reads, things are as expected: the SingleArray implementation dominates the other as a result of prefetching, especially around the point where cache evictions start to become noticeable (around 4M elements). For the random reads test, there is no noticeable difference between the two implementations, since the test essentially thrashes the memory.

The three randomized, mixed-operations tests, however, are not all that predictable. Apart from a small non-linear increase around the 4M mark (where cache evictions become more frequent and the memory access costs start to accumulate) and a more or less constant time after that, the order is the reverse of the one I originally expected. Upon closer inspection, however, it is obvious that this is the way things should be (as if reality would get it wrong this time!). The serial access case is a very small part of the overall operations in the mixed test and, even then, the random accesses destroy any chance that, thanks to prefetching, the correct portion of the array would be in the cache. Cumulatively, these effects make the disjoint arrays solution perform measurably better.

Conclusion

The assumption that data accessed together should be located close together in memory is correct, but far more sensitive to access patterns than I originally thought. In my case I often, but not always, accessed the elements of the two parallel arrays together, so I thought it prudent to interleave them, taking advantage of prefetching and caching. However, not only did this bring no improvement, but, because of the less common random accesses within the arrays, performance suffered to a noticeable degree. It seems that to take advantage of the large caches it is necessary to couple data that is indeed strongly coupled in the vast majority of the code paths. Even then I would run an experiment or two, just to be on the safe side. I think the last test needs a bit more refinement, to see how much deviation from the coupled access pattern the interleaved layout can tolerate before performance suffers.

Wednesday, September 8, 2010

A genuine question about the curriculum of the Department of Informatics at the Athens University of Economics and Business

My qualifications

I studied Computer Science at the aforementioned university. I was the second best student of my class, with a GPA of 9.3/10. I had finished 3/4 of the required courses by my second year and was involved in research by the third semester. I published before I got my degree and was accepted for a PhD at the University of California, San Diego. I went downhill (academically speaking) from there, but the point I am trying to make is that I was a model student. I knew the material of the courses by heart, not because I studied, but because I really enjoyed it. In several courses I was the only student to achieve a perfect score. However...

My rant

... I had to get out of the academic world to see what the real world demands and takes for granted. Obviously this is a fact of life; I was not expecting anything else. What really baffles me, however, is that whole chunks of Computer Science were left not only untouched but totally unmentioned during the 3 1/2 years of my study. I took all the core CS courses, mathematics, advanced mathematics, computability, complexity, the works, but I never ONCE heard the phrase "functional programming". Not once, no joke. I keep wondering how a department that is proud to offer one of the best CS educations in the country (Greece, for those not keeping score) has no course, no lab, hell, no mention of one of the most important academic CS topics. I am not saying that we should talk about map/reduce or actors in Scala (a language also unheard of in the corridors of AUEB). No, I am simply asking that during the "Introduction to CS" or "Introduction to programming" courses, somewhere among all the Java lectures, there should be an interlude about, oh, I don't know, programming languages, with a quick mention of the possible programming paradigms. Instead, over the whole 4 years we were taught a course in Java (I was lucky to have already written a buttload of C and AVR assembly before I got there, so I was in the clear), some C++ and a lousy, barely worth mentioning course in FORTRAN. That's it. Pathetic, don't you agree?

The result

I had to get out of there and work in the industry for 3 years (in Greece, the pinnacle of technology is managing to set up a JavaEE stack, with properly documented - the implementation is optional - Web Services for the really daring) to find out about Hadoop, FP, Scala, graph databases and lock-free algorithms. Note that, apart from the technologies, this material is core CS. Not even graduate level; FP should be taught alongside C to first-year students (although others have talked about this far more eloquently than me). Hell, I have talked with professionals from my school who have no idea what the difference between an ArrayList and a LinkedList is. Honestly.

I'm done

I am really disappointed by the quality of the education I received from my studies. I learned a lot, to be honest, but looking from the outside in I realize that it was not enough. I have worked really hard to make myself competitive, and a lot of my hardship came from the inadequate preparation I received during my time there. Of course, this is Greece and no one is interested in creating quality software or implementing a solution in a way that displays imagination or in any way indicates that the programmer had fun. No sir. Maybe that is why I have been unemployed for so long and there is no light in sight. I am not qualified enough to work where I want to work and too proud to work someplace where the highlight of softeng is writing an ant script. Screw that, maybe I will be better off functionally serving burgers.