Exploring a byte-code compiler using #cplusplus

bcode

Let’s build a byte-code compiler.

Motivation

There are several factors that led me to write “bcode”, my attempt at making a byte-code compiler:

  • my general interest in compiling technology
  • my involvement with neoleo, a spreadsheet that I forked from oleo, GNU’s spreadsheet program, which has not been updated for many years. Part of the program parses user formulae, compiling and decompiling them into byte-codes as required, using lex and yacc. Oleo was originally written in C, and I was interested in converting it to C++. The workings of the compiler and parser were, and to some extent still are, a bit of a mystery to me. I figured that there must be a better way to do it.
  • my involvement with ultimc, a project that I created to work in conjunction with Ultibo, a unikernel for the RPi (Raspberry Pi). I figured that it would be nice to have a scripting environment for the unikernel, and set about creating Glute (“gluing utility”, though I also like the name because it is vaguely rude). Although it currently “works”, it can only deal with commands: you cannot perform branching or define functions.
  • I was inspired by Jack Crenshaw’s series of articles, Let’s Build a Compiler, to try my own take on the subject. Jack creates a compiler that spits out Motorola assembly, which can then be assembled into an executable, assuming you have access to the Motorola chipset.
  • J.P. Bennett’s book “Introduction to Compiling Techniques – A First Course Using ANSI C, LEX and YACC” was also an influence. I bought the book in 1996 and never made my way through all of it, although I occasionally dipped in, latterly more than formerly. I am glad I bought it, as it is a pleasant introduction to compiling techniques.

Bennett’s work adopts the typical approach to writing a compiler:

  1. define a lexer using lex. He discusses ad hoc lexing techniques, where you basically roll your own lexer.
  2. analyse the syntax using yacc. He also discusses top-down parsing.
  3. produce an intermediate representation of the compiled code. He chooses an idealised assembly language.
  4. the idealised assembly language or intermediate code can then be compiled to native machine code. Bennett actually creates an interpreter for the assembly language, which is a perfectly reasonable choice.

Method of attack

I have this to say on compiler construction tools:

  • I would rather avoid code generation. Ideally, the code should just work, without any need for tool intervention. I think a lot of the drawbacks stem from the limitations of C itself.
  • they’re too complicated. This is a contentious point, I know, and everybody’s mileage varies. You have to learn them, and it’s difficult to know whether they create more problems than they solve.
  • the tools don’t fit especially harmoniously together. It always feels like trying to push two magnets together head-to-head with the same polarity: you can do it with sufficient force, but I would rather not.
  • they seem designed for the C era of programming. I want to write in C++, where encapsulation is better and I feel I have more certainty about memory hygiene (no leaks). I noticed with neoleo, for example, that it is difficult to reason about memory. I have ideas about how it could be better, although they are difficult to realise given the nature of the tools I have.

I think lex and yacc can be considered obsolete for the following reasons:

  • Rob Pike, of Go, Unix and Plan 9 fame, seems to prefer hand-rolled lexers
  • hand-rolled lexers are OK, but I think they require finicky bookkeeping of token state and the state of the input stream
  • if you don’t want to go down the lex route, and hand-rolling a very low-level lexer is too much fiddle for you, and you are willing to sacrifice some efficiency, then you can use C++ regexes. By defining a list of them and checking against each in turn, you can effectively achieve what lex does, but without the hassle of a separate tool. My blang project uses such an approach, and I think it completely obviates the need for lex (see the sketch after this list)
  • nearly all compiler writers today seem to use recursive-descent compilers, having abandoned yacc
  • recursive-descent compilers also fit in well with human intuition
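As a taste of the regex approach, here is a minimal sketch. The token names and patterns are my own inventions for illustration, not blang’s actual table:

#include <iostream>
#include <regex>
#include <string>
#include <vector>

// One candidate pattern per token type; at each position the rules are
// tried in order and the first match wins. (lex proper prefers the longest
// match; ordering the rules carefully stands in for that here.)
struct Rule { std::string name; std::regex re; };

int main() {
    std::vector<Rule> rules = {
        {"NUM",   std::regex(R"(\d+)")},
        {"IDENT", std::regex(R"([A-Za-z_]\w*)")},
        {"OP",    std::regex(R"([-+*/])")},
        {"WS",    std::regex(R"(\s+)")},
    };
    std::string src = "x1 + 42*y";
    auto it = src.cbegin();
    while (it != src.cend()) {
        std::smatch m;
        bool matched = false;
        for (const auto& r : rules) {
            // match_continuous anchors the match at the current position
            if (std::regex_search(it, src.cend(), m, r.re,
                                  std::regex_constants::match_continuous)) {
                if (r.name != "WS")                  // drop whitespace tokens
                    std::cout << r.name << " '" << m.str() << "'\n";
                it = m[0].second;                    // advance past the lexeme
                matched = true;
                break;
            }
        }
        if (!matched) {
            std::cerr << "lex error at: " << std::string(it, src.cend()) << '\n';
            break;
        }
    }
}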

I have need of an interpreter, as opposed to a compiler. I am willing to sacrifice speed, and I am not looking to perform any optimisations. I had originally thought that a good idea would be to avoid writing any kind of byte-code compiler, and just walk through a parse tree as needed. I think that could be reasonably efficient.

One of the real problems with compilers is that they rely on a lot of different “types”: integers versus strings, say, or function blocks versus arithmetic. The compiler needs to work with variants of data types, commonly called ADTs (Algebraic Data Types), or sum types. These stand in contradistinction to product types, more familiar as “the struct”. Product types are widely supported in many languages; ADTs rarely feature in language designs. The notable exceptions are languages like Haskell, which excels at using ADTs, and Lisp-like languages, which are also well-suited to ADTs because they can deal with data representations dynamically. For a language like C++, it has always felt like trying to put a square peg in a round hole. The latest version, C++17, does at last provide a variant type, std::variant. Finally! As of writing (Dec 2017), though, full C++17 compilers have not yet made their way into popular Linux distributions. I also wonder if C++’s solution is entirely satisfactory: it takes a library-based approach to variants, whereas I wonder if they would be best dealt with by syntactic extension. Time will tell.
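As a taste of std::variant, here is a minimal sketch; the value types and the Print visitor are hypothetical, chosen purely for illustration:

#include <iostream>
#include <string>
#include <variant>

// A value is either an integer or a string; the variant tracks which.
using Value = std::variant<int, std::string>;

// A visitor: one overload of operator() per alternative.
struct Print {
    void operator()(int i) const { std::cout << "int: " << i << '\n'; }
    void operator()(const std::string& s) const { std::cout << "str: " << s << '\n'; }
};

int main() {
    Value v = 42;
    std::visit(Print{}, v);        // prints "int: 42"
    v = std::string("hello");
    std::visit(Print{}, v);        // prints "str: hello"
}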

Given the above factors, I now think that byte-compiling is an idea I want to explore. So my idea is that the parser, instead of constructing a parse tree, just spits out byte-code. In Tcl, as the saying goes, “everything is a string”; it is a data-hiding mechanism. I think the same approach is good for compilers, except that for “string”, read “series of bytes”.

So this is the approach I will adopt here:

  • instead of trying to compile a full-blown language, I will try to write a compiler for a simple assembly language. And I’m not kidding here: the parsing should be as simple as possible
  • that assembly will compile to byte-code for an “idealised” stack machine
  • the byte-codes will be interpreted (see the sketch after this list)
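To make that concrete, here is a minimal sketch of such a stack machine. The opcodes are illustrative inventions, not bcode’s actual instruction set (for which see part 1, below):

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Illustrative opcodes: push a one-byte literal, add the top two stack
// items, and halt.
enum Op : uint8_t { PUSH, ADD, HALT };

void run(const std::vector<uint8_t>& code) {
    std::vector<int> stack;
    std::size_t pc = 0;                 // program counter
    for (;;) {
        switch (code[pc++]) {
        case PUSH:                      // operand is the next byte
            stack.push_back(code[pc++]);
            break;
        case ADD: {                     // pop two, push their sum
            int b = stack.back(); stack.pop_back();
            int a = stack.back(); stack.pop_back();
            stack.push_back(a + b);
            break;
        }
        case HALT:
            std::cout << "result: " << stack.back() << '\n';
            return;
        }
    }
}

int main() {
    // Byte-code for "2 + 3", i.e. the postfix sequence 2 3 +
    run({PUSH, 2, PUSH, 3, ADD, HALT});
}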

I think this is a good approach. You can compile a fully-fledged language by emitting those assembly instructions. The nature of writing a top-down parser is that you can generate assembly instructions in postfix form without constructing a syntax tree: the tree is implicit in the top-down recursion. This is great, because it means that you don’t have to store variant-type intermediate structures; you just emit assembly code as you go, in the manner adopted by Crenshaw.
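A minimal sketch of that Crenshaw-style emission, using a toy grammar (single digits and “+” only, purely for illustration):

#include <cctype>
#include <iostream>

// Recursive-descent parser for "digit (+ digit)*" that emits stack-machine
// assembly as it parses. No tree is built: the recursion is the tree.
const char* src;

void factor() {
    if (std::isdigit(static_cast<unsigned char>(*src)))
        std::cout << "PUSH " << *src++ << '\n';   // emit the operand on sight
}

void expr() {
    factor();
    while (*src == '+') {
        ++src;                                    // consume the operator...
        factor();
        std::cout << "ADD\n";                     // ...and emit it postfix
    }
}

int main() {
    src = "1+2+3";
    expr();
    std::cout << "HALT\n";
    // Output: PUSH 1, PUSH 2, ADD, PUSH 3, ADD, HALT
}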

The code

I am writing my assembler piecemeal, creating features as I go. I do hope I make it to the end. So here are the parts:

Part 1 – halting, pushing and extensible execution, no branching


References:

  • The above document is available here.
  • For more details, see part 1, where I get into the instruction set and implement it in code. Branches are not yet implemented. I intend to work on them next.

Enjoy!


Magic Hat – DTG stays in

The MHP (Magic Hat Portfolio) on Stockopedia (http://www.stockopedia.com/fantasy-funds/magic-hat-463/) is an experiment by me to see if a human can improve on a mechanical Greenblatt Magic Formula screen. I am trying to weed out “mistakes” that I feel the screening commits: unseasoned companies, scams, foreign companies (particularly Chinese), fishy accounting, and statistical quirks. Apart from that, I am agnostic as to the sector a company operates in, although I will try to avoid heavy concentration in any one sector. I will mostly apply “Strategic Ignorance”, by which I mean that I won’t try to be clever in my stockpicking; my picking will be mostly mechanical. A summary of most of my Magic Hat articles can be found on the web page http://www.markcarter.me.uk/money/greenblatt.htm. This will allow you to see, at a glance, what shares have been bought and sold in the past, as well as what shares have been rejected from consideration, and why.

Airline operator DTG (Dart Group) is due for ejection this month, as it is no longer in the Greenblatt screen. However, its Stockopedia StockRank is 94, it is classified as a SuperStock, and it has a Greenblatt rating of A- rather than A+. I’m going to be lazy and keep it in.

Taking a look at the chart of the fund’s performance, I see that it has performed exceptionally well against the FTSE 350. I should note a couple of caveats, though:

  1. although the fund has outperformed the index fairly consistently, the really superior gain has been over the course of 2017. This somewhat lumpy gain could be just luck.
  2. the FTSE 350 is heavily weighted towards the Footsie. The FTSE 250 has performed exceptionally well against the FTSE 100 over the life of the fund, so the fund’s performance would look far more modest if it were compared against the mid-caps.

So, that’s it for this month. Have a merry Christmas, and let’s hope we get lots of mild weather.

Stay safe out there.


#awk the underappreciated

At Softpanorama:

Most people are surprised when I tell them what language we use in our undergraduate AI programming class. That’s understandable. We use GAWK. GAWK, Gnu’s version of Aho, Weinberger, and Kernighan’s old pattern scanning language isn’t even viewed as a programming language by most people. Like PERL and TCL, most prefer to view it as a “scripting language.” It has no objects; it is not functional; it does no built-in logic programming. Their surprise turns to puzzlement when I confide that (a) while the students are allowed to use any language they want; (b) with a single exception, the best work consistently results from those working in GAWK. (footnote: The exception was a PASCAL programmer who is now an NSF graduate fellow getting a Ph.D. in mathematics at Harvard.) Programmers in C, C++, and LISP haven’t even been close (we have not seen work in PROLOG or JAVA)

Full version on Ronald Loui’s webpage.


#Ultibo – a “unikernel” for the #raspberrypi

I just came across Ultibo, a “unikernel” for the RPi (Raspberry Pi), any version. What a fascinating idea it is! If you are like me, you have no idea what a unikernel is. Wikipedia describes it as:

a specialised, single address space machine image constructed by using library operating systems. A developer selects, from a modular stack, the minimal set of libraries which correspond to the OS constructs required for their application to run. These libraries are then compiled with the application and configuration code to build sealed, fixed-purpose images (unikernels) which run directly on a hypervisor or hardware without an intervening OS such as Linux or Windows.

From what I gather, a unikernel is designed to run a single application on virtual hardware. It is to be contrasted with an exokernel, which is designed to run multiple applications, which needn’t know of each other’s existence, on real hardware; and contrasted yet again with a full kernel, like Linux or Windows, which provides a whole host of services.

You write your application in Lazarus, an IDE for the FreePascal compiler; Ultibo supplies a custom version of it. I’m no Pascal wizard, but I was able to put together a “program” that created a console, asked for the user’s name, and printed it back to the console. I say “program”, but it is actually the complete guts of the machine: there is no separation between program and kernel.

Pascal is much more pleasurable to program in than C. For example, Pascal has proper strings.

When you build the project, Lazarus creates a kernel.img file. You then copy that to your SD card, insert it into your Pi (I chose a Pi 2), and boot your machine. It boots straight into your program.

Here’s a simple program that I wrote:

program Joy;

{$mode objfpc}{$H+}

{ Raspberry Pi Application }
{ Add your program code below, add additional units to the "uses" section if }
{ required and create new units by selecting File, New Unit from the menu. }
{ }
{ To compile your program select Run, Compile (or Run, Build) from the menu. }

uses
  RaspberryPi,
  GlobalConfig,
  GlobalConst,
  GlobalTypes,
  Platform,
  Threads,
  SysUtils,
  Classes,
  Ultibo,
  Console,
  Framebuffer
  { Add additional units here };

var
  WindowHandle: TWindowHandle;
  name: String;

begin
  { Add your program code here }
  WindowHandle := ConsoleWindowCreate(ConsoleDeviceGetDefault, CONSOLE_POSITION_FULL, True);
  ConsoleWindowWriteLn(WindowHandle, 'Hello Ultibo! What is your name?');
  ConsoleReadLn(name);
  ConsoleWindowWrite(WindowHandle, 'Your name is ');
  ConsoleWindowWriteLn(WindowHandle, name);
  ThreadHalt(0);
end.

It creates a console, asks you for your name, and prints it back to the console. Then it stops. Not, perhaps, the most useful of programs, but I was impressed by the ease of it all.

I have no idea how the authors of Ultibo manage to get all this to work, but it works quite well.

The work is reminiscent of Charles Moore’s colorForth, and Niklaus Wirth’s Oberon OS.

Ultibo is well suited to those who like tinkering, so those interested in the RPi should find it very appealing. The RPi is an excellent platform to target, too.

I look forward to delving deeper into the system. I would like to try to build a version of it for Linux; Ultibo currently runs from Windows. Lazarus itself runs on all platforms, so I think it should be possible to compile a version for Linux. Theoretically.

Anyway, worth checking out.

[Image: EMUZ80-RPI.png, taken from the web]


Review of #elegoo basic starter kit

Recently I decided to buy the “Elegoo UNO Project Basic Starter Kit with Tutorial and UNO R3 for Arduino” from Amazon. It cost £13.99.

For those who may be unaware, the small-board electronics niche has two undisputed leaders: the Arduino and the RPi (Raspberry Pi). The Arduino is a microcontroller board, whilst the RPi is a microcomputer board. The Arduino can be used for electronics projects, i.e. controlling electrical circuitry, whilst the RPi is basically a complete computer with which you can also create electronics projects. The Arduino costs £28, and the RPi 3 costs a fraction more, at £32. Given the sheer scope of what you can do with the RPi, the advantage generally lies with the RPi, at least in my opinion.

Although the Arduino is quite expensive for what it is, it is an open hardware platform, and there are plenty of clones out there. One very popular clone is produced by Elegoo. Elegoo products are designed and produced in China, which is often a bit of a warning sign.

However, the Arduino comes as just a bare microcontroller on a board, whilst the kit I purchased is half the price of that, and comes with leads, a breadboard, resistors, LEDs, pushbuttons, a tilt detector, a photometer, and a CD explaining how to use the kit.

Being Chinese-made, the quality is not quite up to the standards set by Western companies. I inserted a lead into the breadboard, for example, and when I came to remove it, the lead’s pin broke free and remained stuck in the breadboard, so I had to remove the pin manually.

Despite these glitches, the kit worked well otherwise.

The instruction manuals that come with Elegoo kits are often derided. However, I found the CD to be well-written and informative; it appears to have been written by someone for whom English is a native language. I felt the instructions were close to the standard you would find in, say, a CamJam EduKit. So don’t dismiss the manuals: I found them educational. Use them!

Would I recommend this kit? Despite my seemingly somewhat negative comments above, I would say “definitely yes”. The price is unbeatable, and you get bits and bobs to play with, plus a set of instructions. This makes it an ideal introduction to electronics.

Remember, an Elegoo is a clone of the Arduino UNO, so it can do everything an Arduino can do, in exactly the same way. Throw in a selection of components and an instruction manual at half the price of a bare board, and what’s not to like?

[Image: elegoo.jpg – lots of blinkenlights going with the basic starter kit]


gists for #plaintextaccounting

Earlier this month, I wrote a blog post on creating a small and simple plain text accounting system. I was asked if I could make my shell scripts available, and I have now done so as gists. You can download all the files and use them.

Note that they were really only intended for my personal use, so you may find them not-so-easy to use yourself. I have approached it from a UNIX toolset viewpoint, rather than as a monolithic package.

All inputs should have their fields delimited by tabs. I recommend my basic approach even if you want to feed your inputs into something like John Wiegley’s ledger program; you retain maximum flexibility. You can write your own script to generate a suitable format. I had done so in the past, but abandoned the idea when I realised that ledger was not quite what I was looking for.

The two principal inputs are etran (equity transaction) and ntran (“normal” transaction), whose fields are as follows:

etran-2 dstamp folio ticker qty amount way desc
ntran dstamp dr cr amount desc

The meaning of each field is as follows (an example record follows the list):

  • dstamp is a date stamp of the form YYYY-MM-DD
  • folio is a symbolic folio name that you want to associate an equity transaction with
  • ticker is the ticker for an equity (e.g. VOD.L (UK) or VOD (US) for Vodafone)
  • qty is the number of shares bought or sold
  • amount is the transaction value in your currency (e.g. 1234.56)
  • way indicates whether the equity is a B (buy) or S (sell)
  • desc is a description of the transaction
  • dr is the symbolic account name to debit (e.g. “cash”)
  • cr is the symbolic account name to credit
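For illustration, here are two entirely made-up records. The fields are tab-separated; the spacing is widened here for legibility:

etran-2   2017-12-01   folio1   VOD.L   100    1234.56   B   bought Vodafone
ntran     2017-12-02   food     cash    12.34  groceries

The ntran debits “food” and credits “cash”, i.e. money left the cash account to pay for groceries.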

There are also a few minor record types, like “pending”, which is basically an uncleared ntran. That is the advantage of using a tab-delimited approach: you can easily invent new structures on an ad hoc basis, and then generate canonical transactions from them.

A commentary on each of the files in the gist is as follows:

  • derive concatenates all of your input files together, and translates pending transactions into normal ones. Adjust line 5 accordingly
  • etb creates a set of accounts. You can create subtotals, income and expense reports, balance sheets, whatever you want. You will need to set up gaap.txt, which is described in my original post
  • macc is the main driver program. Invoke it to kick off the whole set of computations
  • postify converts ntrans and etrans into simple “posts”
  • qtys.sh sums up the number of shares you have, both by portfolio and in aggregate
  • sort-inputs sorts the inputs by date and entry order
  • stocko.awk is a support program used for creating CSV files in a form usable by Stockopedia
  • stocko.sh is a driver to stocko.awk
  • sumby.cc subtotals posts by account. You will need to compile it. I had problems with numerics in awk, although I suspect I might have been able to replace the C++ code with an awk program (a sketch of the idea follows this list)
  • teet is a debugging script
  • vacc runs separately from macc. It shows you all the transactions for an account that you specify on the command line.
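As a flavour of what sumby.cc does, here is a minimal sketch of my own, not the actual gist. It assumes posts arrive on stdin as two tab-separated fields, account then amount, which is my guess at the format rather than a statement of it:

#include <cstdlib>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

// Read "account<TAB>amount" lines from stdin and print one subtotal per
// account. Input is assumed clean; there is no handling of bad numbers.
int main() {
    std::map<std::string, double> totals;
    std::string line;
    while (std::getline(std::cin, line)) {
        std::istringstream ss(line);
        std::string account, amount;
        if (std::getline(ss, account, '\t') && std::getline(ss, amount, '\t'))
            totals[account] += std::strtod(amount.c_str(), nullptr);
    }
    for (const auto& t : totals)
        std::cout << t.first << '\t' << t.second << '\n';
}

Compile with something like “g++ -O2 -o sumby sumby.cc” and feed it your posts on standard input.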

I hope that the above proves useful. If there is serious interest in this post, I might be tempted to elaborate further. Remember, it is really something for you to mould yourself, rather than an out-of-the-box solution.



gums: my (currently useless) module system for #guilelang

I know next to nothing about Scheme, but I have decided to have another go at it. I have chosen to focus on Guile.

One thing I’m trying to put together is “gums”, a “Guile module system”. I am modelling it on the idea of Quicklisp (Lisp being another language I know next to nothing about). Maybe it will come out as a cross between Quicklisp and Arch’s pacman. I wouldn’t want it to be centralised, though.

Well, anyway, the story so far is that you can download the file (here). Upon execution, it will add an entry to your .guile file and create a gums directory in your home directory. It is currently 68 lines long, which is a little more than I was hoping for. It would be nice to have a facility whereby you can update gums from the web.

The next thing for me to try is either a dummy package or “wisp”, which seems like a nice module to try to get going.

Wish me luck.

