Friday, June 29, 2007

An interesting bug in Java binarySearch

Josh Bloch, author of Effective Java Programming and one of the best in the biz, blogged last summer about an error in his own implementation of binary search within Sun's JDK. It turns out that the most straightforward implementation admits a flaw so subtle that the version in java.util.Arrays, in particular, lay dormant for almost a decade. It is one of the most interesting bugs I have seen in a long time, and in fact I didn't notice it myself. Simplified slightly from the original:

// locate an index of key k in sorted array a[],
// or if k is not present, return -1
int binarySearch( int[] a, int k ){
int i = 0;
int j = a.length - 1;

while( i <= j ){
int m = (i + j) / 2;

if( k < a[m] ){ // left half
j = m - 1;

} else if( a[m] < k ){ // right half
i = m + 1;

} else { // found
return m;
}

}
return -1; // not present
}

The culprit is sly: Operating on a large enough array, computing m = (i+j)/2 is subject to integer overflow, becoming negative due to wraparound, e.g.

int M = 0x7fffffff; // 2^31 - 1
int[] a = new int[ M ];
a[ M - 1 ] = 1;

// the call...
int r = binarySearch( a, 1 );
// ...throws an ArrayIndexOutOfBoundsException !

Fortunately the midpoint can be computed safely via m = i + (j-i)/2. Already there are many blog posts on this blemish, but none concerning quality. What might a software engineer learn about testing from this long-hidden oversight?

1. A lesson from numerical analysis, that time-honored formulas aren't necessarily robust when copied to finite arithmetic.

2. Vary your test data. Surely the Sun test battery for binarySearch passed, again and again, on the same sets of arrays and keys. Fine for 9 years -- but static lists of tests become stale. Which arrays should we select, or search for? Read on...

3. Specialized conditions are required to pinpoint rare defects. This flaw cannot be exposed for sizes below 2^30, about 1 billion; no wonder nobody noticed. The truth is that the endpoints of anything almost always provide interesting insights. Here, searching far to the left always passes, and only searching far enough to the right can throw the Exception. Aha. This leads to a favorite maxim...

4. Design tests specific to the implementation. The greatest of the Java int primitives is M -- nothing to do with binary search -- instead, a boundary in the language itself! Therefore, an array of that length makes a worthwhile test case in Java (cf. Mathematica, Python). Something to remember when writing your own test suites.

Professor Kahan at Berkeley once summed up his lessons about testing (from the infamous Pentium divide glitch) into a fitting quote:
Seek singularities!  There lie all the errors.

Wisely put, as are Josh's remarks about programming defensively, vigilance, code reviews, static analysis, solid proofs, and formal verification. (I would disagree with his view that there will always be bugs, but opposing formidable Dr. Bloch isn't really something one ought to do).

The paradox is that although testing can't exterminate our errors, it still needs to be done to provide corroborative evidence of correctness, even for well-proved algorithms! Later, I'll augment this post with test strategies which might have exposed the flaw. With test methods in mind, we might be able to expose ever more errors, and add to our QA repertoire.

An exercise: Using 32-bit two's compliment integer arithmetic, compute the average of i and j, correctly rounding to even. Test it with i and j at the extremes. My solution is not short.

Tuesday, June 19, 2007

Ubuntu: A Fresh Perspective

I decided to replace the Fedora core 4 machine I had at home with the bizarrely named "Feisty Fawn" version (7.04) of Ubuntu, which has gotten oh so much attention as of late. Having never installed Ubuntu, I wanted to know what the big deal was.

There was only one disk to burn. That was a nice surprise. I started booting, and found that the CD I burned was actually a Live CD. It booted me right into Ubuntu. Naturally, I didn't read much about installing the distro, so I surmised for a minute that perhaps Ubuntu was just a Live CD. Not so -- after I actually looked at the items on the desktop, I found a single "Install" icon. Clicking this proceeded with the install of Ubuntu on my local hard drive.

The distro has so far proved to be very usable. There is an "Add/Remove Applications" feature which is remarkably like the "Add/Remove Programs" feature in Windows. That is very useful for users who grew up in a Windows world. My one complaint thus far is that I can't login as the root user. I'm sure that if I look around the net enough I can find a way around that :).

Saturday, June 9, 2007

Concurrent Programming: The Next Best Thing

I was one of the lucky few to attend the Google Developer Day in San Jose last week. It was quite interesting. One of the best talks I went to concerned the Google backend infrastructure. One of the topics they've worked on extensively is storing and organizing massive amounts of information. This probably isn't a surprise to anyone, but it highlighted to me the coming importance of concurrent programming.

For a short one-line introduction to the subject: concurrent programming concerns the simultaneous execution of multiple interacting computational tasks. O'Reilly recently pointed out that as the processor speeds remain stable, but the number of processors inside a computer continue to increase. In 5-10 years, having a computer with 50-100 processors will probably not be unusual. The real question is: how do you take advantage of all those processors?

I think many consumers assume that all these extra processors in their computers are being leveraged. I would say they are most likely assuming incorrectly, or at least, they are being leveraged inefficiently. Concurrent programming languages (such as Haskell) allow multiple processors to execute tasks simultaneously, and coordinate tasks between them. Now that's some sexy programming.

It can also make good financial sense to move into this area of programming. In O'Reilly's article, "The Faint Signals of Concurrency", utilizing an 8 CPU machine for processing data is much cheaper than buying a cluster of 8 machines for processing data.

Haskell is already a "hip" programming language (stop laughing). Among nerds, it is the cool language that wears its sunglasses at night, listens to jazz, and can recite quotes from famous philosophers. I would say that this is a bleeding edge language, and really needs more investigation by me.