|Recently I've ported some B+Tree code to windows to use in Scribe for storing the spam database word indexes. The current system takes a lot of time to load into memory and uses too much memory. I spent some time optimizing the existing code, but fundamentally it's design is flawed. Really the data needs to stay on disk and be arranged for random access (by key) without loading much of the file into memory. These is an existing data structure for that called the B+Tree. It's now finding it's way into commonly file systems (like ReiserFS), and was the basis of the BeOS File System.
However my point is that Scribe is probably going to use B+Trees for it's spam word DB. To test the code I've been reading in my current spamwords.wdb file and exporting it to the new B+Tree format. The .wdb file is fairly optimal in it's size, mines 762kb, so thats our baseline. The B+Tree files are less optimal in storing raw data, because of the overhead of their structure, but they have other very desirable qualities.
Once I got the B+Tree code actually working, I began finding the best parameters for the B+Tree, my initial attempt created a 4+mb file, eek! B+Trees are parameterised by block size and order. The order (n) is an indicator of the desired number of keys per block. The B+Tree tries to keep the number of keys between n and 2*n. The block size and order are inter-related, if the order is too small for the block size the blocks have lots of wasted space, if the order is too large, the blocks get full and have to be split up into multiple blocks too soon. So I worked where the sweet spot was for my typical data (using 256 byte blocks):
However on top of that, the order effects the speed of the data structure, because the larger the order, the more keys in the block and thus more linear searching of keys at the leaf level.
I found the sweet spot for 512 byte blocks was quiet a bit slower than the sweet spot for 256 byte blocks for this reason.
I'll be putting this into the bayesian implementation in Scribe v1.89 test1... it bring down the memory footprint and speed up the first mail receive a fair bit... esp if you have large word databases.