Once I had a working B+Tree I sought to integrate it into Scribe to see how it worked with "real data". Well, that was a dismal failure. You see, being a disk-based data structure rather than a RAM-based one, it sucked speed-wise compared to the memory-only system used by the v1.88 build of Scribe. So I looked at the code, and there was a cache of sorts built in to keep B+Tree blocks in RAM for a while before flushing them to disk. I did a quick experiment to see how much the cached blocks were getting used and whoa, no caching was taking place. Well there yah go son, that's your problem.
So I started tinkering with the cache, first by fixing the bugs to get it working and then progressively changing its parameters and implementation to improve speed. Just fixing the bugs in the initial implementation gave a massive speed up: instead of 50 keys/sec I got around 3600 keys/sec. Nice, but still slow compared to the old way. Then I realised that it wasn't a smart cache, in that it cached the first n blocks and then that was it, so I implemented a system to remove older blocks when the cache was full. I bumped the cache size up to 256 blocks (from 128) and the speed went up to 3900 keys/sec. Ok, time to whip out the profiler and see where the time was being spent. Turns out that most of the time was going into two things: linear searches through the cache, which drastically limited how big the cache could usefully get, and the get/put functions that converted data for serialization. So I reimplemented the get/put functions as macros, getting rid of the expensive function call, and that made an immediate difference: speed was up to 4950 keys/sec.
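To give a rough idea of what get/put macros for serialization can look like, here is a minimal sketch. It is not the actual Scribe code: the macro names, the `Entry` struct and its fields are all made up for illustration, and it assumes the data is stored in the machine's native byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical macros: copy a fixed-size value into or out of a block
// buffer and advance the pointer. Being macros, they expand in place,
// so there is no function call per field.
#define PUT_FIELD(ptr, value) \
    do { std::memcpy((ptr), &(value), sizeof(value)); (ptr) += sizeof(value); } while (0)
#define GET_FIELD(ptr, value) \
    do { std::memcpy(&(value), (ptr), sizeof(value)); (ptr) += sizeof(value); } while (0)

// Hypothetical record stored in a B+Tree block.
struct Entry {
    std::uint32_t key;
    std::uint16_t count;
};

// Writes the entry field by field and returns the number of bytes used.
inline std::size_t WriteEntry(std::uint8_t *buf, std::size_t size, const Entry &e) {
    assert(size >= sizeof(e.key) + sizeof(e.count));
    std::uint8_t *p = buf;
    PUT_FIELD(p, e.key);
    PUT_FIELD(p, e.count);
    return static_cast<std::size_t>(p - buf);
}

// Reads the fields back in the same order they were written.
inline Entry ReadEntry(const std::uint8_t *buf) {
    Entry e{};
    const std::uint8_t *p = buf;
    GET_FIELD(p, e.key);
    GET_FIELD(p, e.count);
    return e;
}
```

With a modern optimizing compiler a small inline function would most likely do just as well as a macro, so treat this as an illustration of the idea rather than a recommendation.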
The existing data structures were now holding the cache size back, so I ripped out the array data structure and reimplemented it with a binary tree for offset lookups and a doubly linked list for the least recently used queue. That allowed the size of the cache to climb without hurting performance, and speed increased again: with the cache size up to 1024 blocks, speed jumped to 6400 keys/sec. The least recently used queue was working a treat, with O(1) cost for every operation.
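The tree-plus-linked-list arrangement can be sketched roughly like this. This is an illustration under assumptions, not the real implementation: `BlockCache`, `CachedBlock` and the method names are invented, `std::map` (typically a balanced binary tree) stands in for the offset lookup tree, and `std::list` stands in for the doubly linked LRU queue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <map>
#include <utility>
#include <vector>

// Hypothetical cached block: its file offset plus its bytes.
struct CachedBlock {
    std::uint64_t offset;
    std::vector<std::uint8_t> data;
};

class BlockCache {
public:
    explicit BlockCache(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Returns the block at this offset, or nullptr on a miss.
    // A hit moves the block to the front of the LRU queue in O(1).
    CachedBlock *Find(std::uint64_t offset) {
        auto it = index_.find(offset);               // O(log n) tree lookup
        if (it == index_.end()) return nullptr;
        lru_.splice(lru_.begin(), lru_, it->second); // iterator stays valid
        return &*it->second;
    }

    // Adds or replaces a block, evicting the least recently used one
    // when the cache is full.
    void Insert(std::uint64_t offset, std::vector<std::uint8_t> data) {
        if (CachedBlock *hit = Find(offset)) {
            hit->data = std::move(data);
            return;
        }
        if (lru_.size() == capacity_) {
            // A real cache would flush a dirty block to disk here.
            index_.erase(lru_.back().offset);
            lru_.pop_back();
        }
        lru_.push_front(CachedBlock{offset, std::move(data)});
        index_[offset] = lru_.begin();
    }

    std::size_t Size() const { return lru_.size(); }

private:
    std::size_t capacity_;
    std::list<CachedBlock> lru_;  // front = most recently used
    std::map<std::uint64_t, std::list<CachedBlock>::iterator> index_;
};
```

The point of the split is that the queue only ever touches its two ends or a node it already has an iterator to, so moving and evicting blocks stays O(1), while the lookup cost grows with log n instead of n as the cache gets bigger.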
But still I wasn't convinced. From CS101 any programmer knows that (balanced) binary trees have O(log n) complexity for insertion, deletion and search. Now I know of another data structure with essentially O(1) complexity for all those operations. A hash table.
So in goes the hash table and then I ran the tests again. 1500 keys/sec. Huh? Ok, 2 things could be happening here: a) I could have chosen a sucky hashing algorithm and b) my code is probably buggy. To fix the bugs I wrote a data structure verify function that scanned the entire thing and checked for conformance to the rules. This ran every time I changed something in the structure, and it saved heaps of time. Then with the bugs out of the way I got down to checking the quality of the hash algorithm. I tested the number of hash table collisions for each hashing algorithm and eventually settled on a shift and a modulus.
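A shift-and-modulus hash plus a collision counter might look something like the sketch below. The details are assumptions: it supposes the key being hashed is a block's file offset and that blocks are aligned to a power-of-two size (512 bytes in the usage note), and the function names are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Shift away the low bits (always zero for aligned block offsets),
// then fold the result into the table with a modulus.
inline std::size_t HashOffset(std::uint64_t offset, unsigned shift, std::size_t buckets) {
    assert(buckets > 0 && shift < 64);
    return static_cast<std::size_t>((offset >> shift) % buckets);
}

// Counts how many keys land in a bucket that is already occupied,
// as a crude way to compare candidate hash parameters.
inline std::size_t CountCollisions(const std::vector<std::uint64_t> &offsets,
                                   unsigned shift, std::size_t buckets) {
    std::vector<bool> used(buckets, false);
    std::size_t collisions = 0;
    for (std::uint64_t offset : offsets) {
        std::size_t bucket = HashOffset(offset, shift, buckets);
        if (used[bucket]) {
            ++collisions;
        } else {
            used[bucket] = true;
        }
    }
    return collisions;
}
```

For example, with 64 consecutive 512-byte-aligned offsets and 64 buckets, a shift of 9 spreads the keys over every bucket with no collisions, while a shift of 0 drops them all into bucket 0, because every offset is a multiple of 64.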
Now I felt I was wringing out the last of the performance from the data structure. Still, when I plugged it into Scribe I was getting about 100-130 messages/sec when rebuilding the spam word database, compared to the 600 messages/sec with a straight memory-only hash table. So it took about 5 times longer to build the word DB.
So currently I've given up on using the B+Tree directly during word DB rebuilds and I've gone back to the memory-sucking hash tables. Then, instead of writing the hash table to disk, I dump it all into the B+Tree so that I can access the data without loading the entire thing into memory. This keeps the footprint of Scribe nice and low for normal day to day running, and there isn't that expensive word DB load that kills performance during the first mail receive of the session.
The only problem now is that during the word DB rebuild my debug build sucked down a cool 1 GB of RAM, seriously paging out to disk (I've got 512 MB installed). This is obviously unacceptable, so I'll have to look at that before a release happens. But I'll probably remove the "rebuild word DB" function entirely and rely on incremental edits as mail is received and moved between folders. I'm still toying with the idea of adding mail bit by bit to the word DBs as the user moves around the folders, thus achieving the indexing incrementally, so that the speed of the B+Trees is not so much of a problem. Also, the actual word counts are not accurate yet; there is an issue in the mail -> word list function that I've been working on for a couple of days now. Damn charsets and dodgy email clients make things so complicated.
Anyway, enough said, I'm still working hard on the software and a release will happen in the next few weeks.