Reverse Engineering Logicworks' Accounting Software - Part I

This article originally appeared on Logicworks' old blog.
Logicworks' new blog can be found at www.gatheringclouds.com.

Reverse engineering a commercial production system can be a very enlightening and rewarding experience. Many engineers have an intuitive sense of how a system works by assuming that the system is built the way they would build it. However, there are many ways to build computing systems and by reverse engineering a system an engineer can learn entirely new design strategies and techniques, not to mention a lesson in humility when it comes to assumptions. I've had the opportunity (or burden) to reverse engineer a few large closed source back-office systems in my career and I'd like to share one of the puzzles that I've come across in the process.

The system in question is a large enterprise-grade accounting package, sold by one of the largest software companies in the world. The task was to write a synchronization module for LogicOps, our internal ticketing, inventory and CMS system, so that it would have an up-to-date record of our clients' invoices as well as providing some tracking and correlation of purchase orders to inventory. The task seemed simple: query the database for client invoices and purchase orders and add/update/remove records from LogicOps as needed. My first challenge in the process was determining which database tables and columns contained the requisite data. Luckily our accounting staff is intimately familiar with this software and were able to guide me through identifying most of the tables and columns I needed. The next challenge I couldn't have anticipated: there are no primary keys for these tables!

Here is one of the situations where an assumption of the way a system is designed can lead you to misjudge the scope and complexity of finding a solution through reverse engineering. I had been taught a very textbook view of databases. Normalization is critical for data integrity and should only be sacrificed if necessary for performance. I never expected to find that our _accounting_ database was without referential integrity. However, with some perspective on the issue (read: years), I can now appreciate that the downsides of a lack of referential integrity are complimented by the upside of a much easier business logic development environment. In fact, by relaxing database integrity a little (there were still apparently application code-level controls on data consistency) the developers of the system were in a much better position to build a one-size-fits-all accounting package. If the database schema had been perfectly rigid, the utility of the package would have suffered more than it did by the occasional loose database row or unexpected null field in a join.

Since uniquely identifying entities in the database was critical to my being able synchronize our accounting data with LogicOps, I decided that I would search for a set of columns that when taken together formed a unique key; in essence determining a composite key that I could use as a unique index. After much analysis (hackery) with various one-off Perl scripts, I determined that the columns CUSTNMBR, SOPNUMBE, LNITMSEQ, and CMPNTSEQ, were unique as a set. Yes, these are the actual column names for this system! Here's an example account from the system, with better names for the columns. Each row represents a line item on an invoice for one of our clients. I've included the quantity and product names for clarification:

account_idinvoice_idline_item_sequence_idcomponent_sequence_idquantityproduct_name
012015R272462129920225.0Backup Committment
012015R2724622118401.0BW committment for server (1mb/s)
012015R2724624576001.0Dedicated Xeon SCSI server
012015R27246245760491521.0Server Chassis: dual socket Xeon (5000/5100/5300) dual/quad-core, 4-bay SCSI
012015R27246245760655361.0Intel Xeon 5410 2.33Ghz 1333FSB 12MB qual-core
012015R27246245760819202.02GB DDRII 667 FBDIMM
012015R27246245760983042.073GB SCSI 10K RPM hard drive
012015R272462457601146882.0300GB SCSI 10K RPM hard drive
012015R272462457601310721.0Microsoft Windows 2008 Enterprise
012015R2724626214401.0Shared loadbalancing service
012015R2724627852801.0Shared firewall service
012015R2724631129601.0Dedicated Xeon SCSI server
012015R27246311296491521.0Server Chassis: dual socket Xeon (5000/5100/5300) dual/quad-core, 4-bay SCSI
012015R27246311296655361.0Intel Xeon 5410 2.33Ghz 1333FSB 12MB qual-core
012015R27246311296819202.02GB DDRII 667 FBDIMM
012015R27246311296983042.073GB SCSI 10K RPM hard drive
012015R272463112961146882.0300GB SCSI 10K RPM hard drive
012015R272463112961310721.0Microsoft Windows 2008 Enterprise
012015R2724632768001.0Shared loadbalancing service
012015R2724634406401.0Shared firewall service


I had established empirically that the first four columns above uniquely identified a given line-item on a customer's invoice, but why is that the case? Specifically, why do line_item_sequence_id and component_sequence_id have the values that they do? If they really are sequences, why not start at 0 or 1 and count upwards? In the next post I'll give more clues that I discovered as to how these columns are used in the system and what purpose they serve.