Reverse Engineering Logicworks' Accounting Software - Part III

This article originally appeared on Logicworks' old blog.
Logicworks' new blog can be found at www.gatheringclouds.com.

In my previous two posts (Part 1, Part 2) I discussed a reverse engineering project undertaken here at Logicworks to synchronize our accounting system's invoices and purchase orders with our ticketing, inventory, and CMS system, LogicOps. I discussed a specific puzzle regarding a couple database columns that I had determined I could use to solve a problem of a lack of primary keys in the invoice line items table. The puzzle was determining the purpose for those columns and understanding the values that those columns could take. The first part of the puzzle, figuring out the purpose of the columns, seemed to be simple enough; the two columns in question, line_item_sequence_id and component_sequence_id (see below), controlled grouping and ordering of the line items in an invoice. But why did they have such off-the-wall values? And more importantly, was this an indicator that I didn't actually understand this well enough to use it in a production system?

Let's take one more look at a sample of rows coming from the database:

account_idinvoice_idline_item_sequence_idcomponent_sequence_idquantityproduct_name
012015R272462129920225.0Backup Committment
012015R2724622118401.0BW committment for server (1mb/s)
012015R2724624576001.0Dedicated Xeon SCSI server
012015R27246245760491521.0Server Chassis: dual socket Xeon (5000/5100/5300) dual/quad-core, 4-bay SCSI
012015R27246245760655361.0Intel Xeon 5410 2.33Ghz 1333FSB 12MB qual-core
012015R27246245760819202.02GB DDRII 667 FBDIMM
012015R27246245760983042.073GB SCSI 10K RPM hard drive
012015R272462457601146882.0300GB SCSI 10K RPM hard drive
012015R272462457601310721.0Microsoft Windows 2008 Enterprise
012015R2724626214401.0Shared loadbalancing service
012015R2724627852801.0Shared firewall service
012015R2724631129601.0Dedicated Xeon SCSI server
012015R27246311296491521.0Server Chassis: dual socket Xeon (5000/5100/5300) dual/quad-core, 4-bay SCSI
012015R27246311296655361.0Intel Xeon 5410 2.33Ghz 1333FSB 12MB qual-core
012015R27246311296819202.02GB DDRII 667 FBDIMM
012015R27246311296983042.073GB SCSI 10K RPM hard drive
012015R272463112961146882.0300GB SCSI 10K RPM hard drive
012015R272463112961310721.0Microsoft Windows 2008 Enterprise
012015R2724632768001.0Shared loadbalancing service
012015R2724634406401.0Shared firewall service


Some of the values here jump out for a software developer. 65536 is a power of 2, as is 131072. The other numbers aren't, but they are multiples of 16384, another power of 2. So instead of counting 0, 1, 2, 3... why did they instead count in increments of 214? Why would someone do that? And why wasn't it simply skipping by 214? There were multiples of 214 missing. To try to piece together the solution I had John in our accounting department create a new test invoice for me in a dummy account. The result was similar to the data shown above, but the values were straight increments of 214: 0, 16384, 32768, 49152, 65536, and so on:

account_idinvoice_idline_item_sequence_idcomponent_sequence_idquantityproduct_name
TESTACCT01234001.0Test Line-Item 1
TESTACCT012341638401.0Test Line-Item 2
TESTACCT012343276801.0Test Line-Item 3
TESTACCT012344915201.0Test Line-Item 4
TESTACCT012346553601.0Test Line-Item 5
TESTACCT0123465536163841.0Test Line-Item 5-1
TESTACCT0123465536327681.0Test Line-Item 5-2


Why?!? I asked John (in our Accounting Dept., who at this point is losing his patience with my obsessiveness) to randomly alter the invoice without telling me how he had changed it. Add some line items, remove some, etc. The database now reflected the changes as so:

account_idinvoice_idline_item_sequence_idcomponent_sequence_idquantityproduct_name
TESTACCT01234001.0Test Line-Item 1
TESTACCT01234819201.0New Line-Item A
TESTACCT012343276801.0Test Line-Item 3
TESTACCT012344915201.0Test Line-Item 4
TESTACCT012345734401.0Test Line-Item B
TESTACCT012346553601.0Test Line-Item 5
TESTACCT0123465536163841.0Test Line-Item 5-1
TESTACCT0123465536245761.0New Line-Item 5-C
TESTACCT0123465536327681.0Test Line-Item 5-2


I've highlighted the new rows in bold lettering above. Notice the ordering values assigned to the new rows in orange. I immediately understood what was happening, but just to make sure I asked him if he added "New Line-Item A" before deleting "Test Line-Item 2". He confirmed that this was the case.

The algorithm is simple. Start by incrementing new line items at the end of the list by 214 (0, 16384, 32768, etc.). When adding a new line item between existing items you take the average of the two adjacent line items. Since we don't care about the values here, only their relative ordering, this is a simple and very fast way of maintaining the ordering of the list without requiring renumbering of the values in the succeeding rows. Not including complexity that the database may introduce this algorithm's complexity is O(1) in computer science lingo, also known as "constant time". The algorithm may not be obvious to a non-programmer, so let me elaborate.

Consider if the ordering of line items had proceeded with 0, 1, 2, 3... and so on. To insert a new line item between the third and fourth items you'd need to give the new row the ordering value "3", and move the rest up by 1. I'll just show the sequence values to keep things clear:

01234...


We wish to insert a new item between items 2 and 3. This new item becomes item number 3...

012334...


...and the previous items that were ordered 3, 4, and so on have to be incremented.

012345...


But if you counted by some large number such as 1000, you wouldn't need to alter the other rows in order to insert a new item:

01000200030004000...

010002000250030004000...


Bear in mind here that the actual values are not of interest to us. The ordering of the values is. After some consideration it should be clear that the optimal number to use for this purpose is a power of 2, as it gives the most divisions before you find yourself with a non-integer value. It would be possible to implement so that the average always rounds up or down, but using a power of two simplifies things. For example, using 1000 as our increment we run into problems as soon as we insert 3 times, at which point we'd have to take half of 125 as the next index:

01000

05001000

0250500100

01252505001000


So, why 2 to the 14th power? The answer lies in the maximum allowed value in the database for this type and balancing the need to do this kind of insert versus the maximum number of line items you could put in an invoice. The maximum integer that can be stored in a SQL standard "int" field is 231-1, which means that you can have 217-1 line items in an invoice before you reach the maximum value for this field. And since it's 2 to the 14th power, you can subdivide a given ordering value 16 times before you run out of space to continue the algorithm. To verify my suspicions I had John create an invoice and continually add new line items immediately below the first item. It gave an error when he tried to insert the 17th line item and prevented him from continuing. The resulting values looked something like this:

0First Line Item Entered
1Sixteenth Line Item Entered
2Fifteenth Line Item Entered
4Fourteenth Line Item Entered
8Thirteenth Line Item Entered
16Twelfth Line Item Entered
32Eleventh Line Item Entered
64Tenth Line Item Entered
128Ninth Line Item Entered
256Eighth Line Item Entered
512Seventh Line Item Entered
1024Sixth Line Item Entered
2048Fifth Line Item Entered
4096Fourth Line Item Entered
8192Third Line Item Entered
16384Second Line Item Entered


Mystery solved! I now understood what the system did with these columns and how it did it. This was enough for me to be sure that those two columns weren't hiding some mysterious purpose that was going to haunt me months later when my code blew up in a firey display of false assumptions. Why this worked this way... I wasn't quite sure. Was it bad design? A novice programmer going overboard with some clever numerical phenomenon? I decided to let that loose end bounce around in the back of my head while I continued the project and maybe it would eventually make sense. It took me until a couple days later when the purpose for this strange behavior hit me completely out of the blue. The reason why this works the way it does is apparent if you consider the perspective of the programmer who built the system.

Imagine building an accounting/invoicing system with a database back end. You store the invoices and their associated line items as rows in the database with an ordering value in a column in each row. Because user interface is critical for such an app, you allow users to configure the invoices by moving line items around, deleting and adding at will. Also consider that some invoices may be very large, thousands of line items long in fact. Now, if one went with the obvious solution of ordering line items with 0, 1, 2, 3... then adding a row between the first and second line items would require a database write on all the remaining rows in the invoice (this algorithm would be O(n) or "linear time"). On a very large invoice with thousands of rows this would pause the application for at least several seconds and could put undue strain on the database server. On a system servicing a large organization generating hundreds of invoices a day this would put a very heavy strain on the database as well as dramatically effect the productivity of the accounting staff as they would spend most of their time waiting for the system to do trivial updates to the invoices. But this trick allows users to add a new line item without altering any of the existing rows in the database. It seems ugly when you look at the actual numbers, but the users never see the values in that column, only the ordering that results from it. [Note that a more complete implementation of this technique would involve "rebalancing" the sequence values when the 16-insert limit has been reached]

So it's a database scalability hack! One that is not apparent in an organization of our size, but one that might be useful when designing a system for a much larger organization. And one that I doubt I would have learned, much less invented on my own, if I hadn't taken the time to reverse engineer someone else's solution. There were other lessons to be had in that project, many in the "what not to do" category, but this puzzle was the one I'll remember the most. I gained a new respect for the anonymous developers of our accounting system and the whole experience reinforced my belief that reverse engineering is a wonderful and surprising way to learn new ways of building software.

Reverse Engineering is also the best way to work miracles in back-office software integration: the sync script was finished shortly afterwards and has been in production for 4+ years now with almost no changes, integrating our accounting system with LogicOps, eliminating the mistakes associated with manual duplication of the data, and saving hundreds of hours of data entry.