Toying with a data query #grammar in #perl6

I have written an accounts package in C++. I have a minuscule knowledge of COBOL, which ought to be well-suited to the task, but I have found it rather cumbersome to wrangle. COBOL is still interesting in the sense that it tries to be a declarative language for exactly this kind of domain.

The thought occurred to me that it would be great if we had a much better DSL for handling data: COBOL, but better; plus SQL, but better; as a first-class construct of the language; and untethered to any database.

Perl6 seemed an obvious choice to use for this purpose.

My experiment is very preliminary at the moment, and the code consists of ~100 lines. You can view the complete source here.

So, you create a record structure like so:

record person
        name string;
        age  int;
end-record

That’s not spectacular so far, of course. Pascal, COBOL, C, etc. already give you syntax for creating structures. It’s where we’re going that is perhaps more important. The grammar for this is straightforward:

grammar rec {
	token TOP { <record-spec>* }
	token record-spec { <ws>* 'record' \s+ <rec-name> <field-descriptor>+ <ws>* 'end-record' <ws>* }
	token rec-name { \S+ }
	token field-descriptor { <ws>* <field-name> <ws>+ <field-type> <ws>* ';' }
	token field-name { \S+ }
	token field-type { <[a..z]>+ }
	token ws { <[\r\n\t\ ]> }
}

Due to my newness with P6 (Perl 6), construction took a long time. I should probably also be working in terms of rules and protos instead of using tokens everywhere. The feature set of P6 is vast, so just getting something working is good enough for me at this stage.

Two classes naturally suggest themselves: Fields, which store information about the name and type of a field, and Recs, which aggregate the fields into a record. The definition of a Field is straightforward:

class Field {
        has Str $.namo is rw;
        has Str $.type is rw;
}

A Rec is a little more complicated:

class Rec {
        has Str $.namo;
        has Field @.fields;
        has %.flup; # look up from field name to an indexed number
        has @.fnames; # field names

        method add_field(Field $f) {
                @.fields.push: $f;
                @.fnames.push: $f.namo;
                %.flup{$f.namo} = %.flup.elems;
        }
}

‘namo’ is the name we want to assign to the record, and ‘fields’ is a list of the component Fields. We also want a couple of convenience variables: ‘flup’, to obtain an index number for the position of a field, and ‘fnames’, just the names of the fields as an array.
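As a quick illustration of how these fit together (hypothetical code, not in the original source, using the fields from the earlier ‘person’ example):

my $r = Rec.new(namo => "person");
$r.add_field(Field.new(namo => "name", type => "string"));
$r.add_field(Field.new(namo => "age",  type => "int"));
say $r.fnames;      # name, age
say $r.flup{"age"}; # 1 — ‘age’ is the second field, counting from zero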

We create a class of actions, qryActs, which creates some data structures for us:

class qryActs {
	has Rec %.recs is rw;

	method record-spec ($/) {
		my $r = Rec.new(namo => $<rec-name>.Str);
		for $<field-descriptor> -> $fd {
			$r.add_field($fd.made);
		}
		%.recs{$<rec-name>.Str} = $r;
	}

	method field-descriptor ($/) { make Field.new(namo => $<field-name>.Str, type => $<field-type>.Str); }
}

We parse our textual description ($desc) of the record(s) using:

my $qa = qryActs.new;
my $r1 = rec.parse($desc, :actions($qa));

and we extract the column names for the record like so:

my @cols = $qa.recs{"person"}.fnames;

Let’s define some inputs to play with:

my $inp = q:to"FIN";
adam    26
joe     23
mark    51
FIN

and extract them to an array of arrays, splitting up the input by newlines and whitespace:

my @m = (split /\n/, (trim-trailing $inp)).map( -> $x { split /\s+/, $x ; } );

‘@cols’ is useful, because we can use the P6 module ‘Text::Table::Simple’ to print a table:

use Text::Table::Simple;

sub print_table(@data) {
        lol2table(@cols, @data).join("\n").say;
}

and print out what we have so far:

print_table @m;
O------O-----O
| name | age |
O======O=====O
| adam | 26  |
| joe  | 23  |
| mark | 51  |
--------------

Ideally we want to incorporate this into our grammar; perhaps something like this:

import tabbed file "mydata.txt" of person into people
show people

An inlining facility would also be useful. There is a world of possibilities for importation: CSV files, QIF files, all sorts. Given the open-ended nature of importation, some way of extending the language may be required. Or perhaps an external utility would be a better and more flexible approach.
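To give a flavour, an import statement might be grammared along these lines (purely a sketch; none of these tokens exist in the current code):

token import-spec {
        'import' \s+ <format> \s+ 'file' \s+ <file-name> \s+
        'of' \s+ <rec-name> \s+ 'into' \s+ <table-name>
}
token format { 'tabbed' | 'csv' | 'qif' }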

What we would like next is a way to filter data. Suppose we wanted a table of all people who were less than 50 years old. We want to write ‘age < 50’ for such a query. I created a new grammar to handle that:

grammar predi {
        token TOP { <ws>* <arg> <ws>* <rel> <ws>* <arg> <ws>* }
        token arg { <field-name> | <value> }
        token field-name { <[a..z]> \S+ }
        token value { <[0..9]>+ }
        token ws { <[\r\n\t\ ]> }
        token rel { '<' }
}

and a function that calls the grammar to create a subset:

sub filter-sub($pred-str) {
        my $pr = predi.parse($pred-str);

        sub get-val($idx, $row) {
                my $v = $pr<arg>[$idx];
                my $ret;
                if $v<field-name>:exists {
                        # look up the column index for this field name
                        my $fnum = $qa.recs{"person"}.flup{$v<field-name>.Str};
                        $ret = $row[$fnum];
                } else {
                        $ret = $v<value>;
                }
                $ret;
        }

        my @filtered;
        for @m -> $row { 
                my $v1 = get-val(0, $row);
                my $v2 = get-val(1, $row);
                if $v1 < $v2 { @filtered.append: $row; }
        }
                        
        @filtered;
}

The grammar is only primitive at the moment: it does not allow for logical operations, and has ‘<’ hard-wired as the comparison operator. Still, it is at least possible to do:

my @some = filter-sub("age < 50");
print_table @some;
O------O-----O
| name | age |
O======O=====O
| adam | 26  |
| joe  | 23  |
--------------

How cool is that?

There are many, many directions in which this idea can expand. As a first step, the predicate logic really needs to be completed and merged in with the main grammar.
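One small step in that direction would be to widen the ‘rel’ token and dispatch on whatever operator was matched. A hypothetical, untested sketch (longest-token matching means ‘<=’ wins over ‘<’ automatically):

token rel { '<=' | '<' | '>=' | '>' | '==' | '!=' }

# ...and in filter-sub, something like:
my &compare = do given $pr<rel>.Str {
        when '<'  { * <  * }
        when '<=' { * <= * }
        when '>'  { * >  * }
        when '>=' { * >= * }
        when '==' { * == * }
        when '!=' { * != * }
};
if compare($v1, $v2) { @filtered.append: $row; }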

One possibility for extension: maybe we do not know the names or the types of the records initially. So there would need to be a way of creating data structures on the fly, as you might need for a generic dataframe library.
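A hypothetical sketch of that: build a Rec on the fly from a header line naming the fields, defaulting every type to string (neither the sub nor the header convention exists in the current code):

sub rec-from-header(Str $recname, Str $header) {
        my $r = Rec.new(namo => $recname);
        for $header.split(/\s+/) -> $fname {
                $r.add_field(Field.new(namo => $fname, type => "string"));
        }
        $r;
}

my $people-rec = rec-from-header("person", "name age");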

Conversely, maybe you do know the record layout ab initio, and you might like to generate static C++ or COBOL code as a back-end. You would then be able to create a very fast processor.

Other obvious extensions: support for derived fields, report-writing, nested records, user-level record editing, table joining, heuristics for guessing datatypes (think dataframes), statistics, effortless serialisation, and grouping.
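For instance, a derived field might look something like this in the record syntax (purely speculative; the current grammar supports nothing of the kind):

record person
        name     string;
        age      int;
        is-adult bool = age >= 18;
end-record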

One last thing. Although the idea of using natural languages for programming is discredited, I’m wondering: what about constructed languages? Here, I am thinking along the lines of Esperanto or Lojban. Esperanto, because you can always deduce the object, subject, adjectives, etc. in a sentence merely by looking at the word. Lojban, because it is apparently an unambiguous language aimed at the precise expression of ideas. An idea so insane that it might just work?

 


About mcturra2000

Computer programmer living in Scotland.