A lesson in data file formats

I have written a number of accounting packages in C, Go, Haskell, Python, Tcl (non-working), and C++. My C, Go, Python and Tcl efforts have been lost in the mists of time.

I write my input data in a bash-like format; that is to say, as a “command” with arguments separated by whitespace, using quotes if necessary, and the hash character (#) for comments. The command is actually the description of a data item.

In my experience, this is a good engineering decision, as input is easy to parse.
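As a rough illustration of how little code such a format needs, here is a minimal sketch using Python's standard shlex module. The record name "ntran" and its fields are made up for the example; they are not my real record layout.

```python
import shlex

def parse_line(line):
    """Split one bash-like line into fields.

    Returns an empty list for blank lines and comment-only lines.
    """
    # comments=True makes shlex treat '#' as the start of a comment
    return shlex.split(line, comments=True)

# Hypothetical sample input: one comment line and one data item.
sample = '''
# a comment line
ntran 2021-03-05 "Office supplies" 42.50   # trailing comment
'''

for raw in sample.splitlines():
    fields = parse_line(raw)
    if fields:
        print(fields)
```

Quoted arguments come back as single fields with the quotes removed, so downstream code only ever sees a list of strings.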

Two things happened recently:

  • I wanted to combine several years' worth of data together. This is a tricky process, because I close off the accounts at the end of every year. So there are opening balances that are redundant if I aggregate years together. There are also transactions duplicated from the previous year. These, too, would need to be eliminated from the aggregation process.
  • I am interested in exploring John Wiegley’s ledger program. The data in my format is incompatible with the program. I had experimented with ledger in the past, but I generally abandoned my efforts for a number of reasons. Repos tended to supply only an old version, closing off did not work as I needed, the output was difficult to process downstream, and compilation required boost. In the past, I was extremely resistant to installing boost, as I am keen to avoid unnecessary bloat and dependencies. I have relented recently, though, as boost provides many tempting features.

Consequently, I am faced with the problem of how to aggregate all the data together, and how to transform the data.
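To make the aggregation problem concrete, here is a rough sketch in Python. It rests on several assumptions that are not part of my actual data: that opening balances use a record named "open", that a transaction carried over from the previous year appears as an identical line, and that the years are supplied oldest first.

```python
import shlex

OPENING = "open"  # hypothetical record name for an opening balance

def aggregate(years):
    """Merge yearly inputs (a list of texts, oldest first) into one record list.

    Opening balances are kept only for the first year, and any record that
    also appeared in the immediately preceding year is dropped.
    """
    merged = []
    previous = set()          # records seen in the preceding year
    for index, text in enumerate(years):
        current = set()
        for raw in text.splitlines():
            fields = tuple(shlex.split(raw, comments=True))
            if not fields:
                continue      # blank or comment-only line
            if index > 0 and fields[0] == OPENING:
                continue      # redundant once the years are joined
            current.add(fields)
            if fields in previous:
                continue      # duplicate carried over from last year
            merged.append(list(fields))
        previous = current
    return merged
```

Comparing only against the preceding year means that two genuinely identical transactions within one year both survive. A real solution would probably need a stricter notion of "duplicate" than an exact match.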

My solution comes in two forms:

  • shlex, a program I wrote in C++ that takes input in bash-like format, and prints each field as a line. It also builds a library that can be linked with other programs. I have added a feature that reprints the input in an m4-like syntax. The program was inspired by the Python module of the same name.
  • m4, a general-purpose macro processor. m4 takes input text and transforms it according to a set of macros. The macros can be defined anywhere. m4 is language-agnostic; it is just a text transformer. m4 is a very old language, and should be ubiquitous on all UNIX and UNIX-like systems. m4 powers GNU autotools.
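The idea of reprinting a line as an m4 macro call can be sketched in Python as follows. This is not the output format of my C++ program, just an illustration of the principle: the first field becomes the macro name and every remaining field is wrapped in m4's default quote characters. The record name "ntran" is again made up.

```python
import shlex

def to_m4(line):
    """Rewrite one bash-like line as an m4 macro call, or return None
    for blank and comment-only lines."""
    fields = shlex.split(line, comments=True)
    if not fields:
        return None
    name, args = fields[0], fields[1:]
    # m4's default quotes are ` (open) and ' (close). A field that itself
    # contains either character would need extra handling, omitted here.
    quoted = ", ".join("`" + arg + "'" for arg in args)
    # m4 requires the opening parenthesis to follow the name immediately.
    return f"{name}({quoted})"

print(to_m4('ntran 2021-03-05 "Office supplies" 42.50'))
```

Once the data is in this shape, supplying a definition for the macro name is all that m4 needs in order to rewrite every record.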

So, I convert my raw input to m4-compatible macros via shlex. All I need to do now is write m4 macros to transform the input into a suitable form. This should be relatively straightforward.

m4 macros can be a bit kludgey. If more power is needed, then there is a non-standard package called pyexpander. This is a macro processing language in the style of m4, but you can embed Python code. This should, theoretically, make it a more powerful, less kludgey, processor. I have never tried it, though.

It is also worthwhile considering using Tcl, which seems tailor-made for transformation work. Tcl seems quite underrated.

If I had not chosen to write my inputs in a format that was easy to parse, but had used some kind of custom format, then I would be facing difficulties.

In fact, I am inclined to say that program output should be in a bash-like syntax, too. This should make it much easier for down-stream processes to parse.
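A small sketch of what that might look like in Python, using shlex.join (available from Python 3.8). The "balance" record and its fields are invented for the example.

```python
import shlex

def emit(fields):
    """Format a sequence of values as one bash-like line, quoting any
    field that contains whitespace or shell-special characters."""
    return shlex.join(str(field) for field in fields)

# Hypothetical output record: a name, a description and an amount.
line = emit(["balance", "Office supplies", 42.5])
print(line)

# A downstream process can recover the fields with the same module.
assert shlex.split(line) == ["balance", "Office supplies", "42.5"]
```

The appeal is that writer and reader share one quoting convention, so a description containing spaces survives the round trip without any custom escaping rules.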



About mcturra2000

Computer programmer living in Scotland.

1 Response to A lesson in data file formats

  1. Duncan says:

    M4 macros are great for small things but difficult to manage once they start to grow in size and number. I tried to write a preprocessor in M4 for an OO notation to sit inside C (like Objective-C but different). I made good progress at first but it soon turned into a world of quote-unquote pain.

    For parsing your bash-like input files, have you looked at awk? I usually use that or ‘cut’ to extract data from files of this kind. Given the presence of comments, awk is probably the better choice.
