Roman numerals using parser combinators

Tuesday, June 25, 2013

I recently submitted code for a job interview. Evaluation was on object orientation, test driven design, elegance, and production-like aspects. Solution to the core problem statement was in my opinion not the main evaluation criteria. However, I spent more time designing a nice solution. I submitted the code in Python, but my solution was influenced more by Haskell.

Problem statement was to convert between roman numerals, arabic numerals and another custom counting scheme.

The word algorithm has its origin in the technique for finding the value of a arabic numeral (more appropriately hindu arabic numerals). It is simple and elegant. "Start with zero. Consume individual numerals from right to left. Keep adding increasing powers of 10 times the individual number to the current count." Effectively:
"4221" = 1 + 2*10^1 + 2*10^2 + 4*10^3 = 1 + 20 + 200 + 4000 = 4221
In my opinion, it is NOT surprising that one of the earliest acknowledged algorithm is actually a fold in disguise (or reduce in python).

The algorithm for computing the value of a roman numeral on the other hand is more involved. Best explained in wikipedia.

There are many ways to convert roman to arabic numerals. My implementation was a parser combinator, designed ground up in python. Not industrial grade, never the less elegant. Here is a sketch.

At the lowest level are basic parser combinators. A parser by the way is a function which takes a string and returns a pair (value, unconsumed_string). Basic combinators which return such parsers are:
char c f - consume a character if equal to c, returning f(c) and rest of the string. otherwise return fail and the whole string unconsumed
empty f - always return f() and the whole string unconsumed 
and combiner p1 p2 - do p1, on success do p2 on rest of string, combine outputs with combiner. failure at any step returns fail and the whole string unconsumed
optional_upto combiner count p - apply p, count number of times. combine outputs with combiner
or - similar to "and" using a short circuited implementation
zero p f - is an or of p and empty(f())
ands p1 p2 p3 p4 ... -  similar to "and"
ors - equivalent to "ands" for "or"
Next level above are symbol combinators. Pretty simple to setup using the char combinator.
I = char( 'I' , id(1))
V = char( 'V', id(5))
Combinators for units follow.
IX = and(minus, I, X)
IV = and(minus, I, V)
Is = and(plus, I, optional_upto(plus, 2, I)    # an I followed by 2 optional I's
VIs = and(plus, V, optional_upto(plus, 3, I)  # a V followed by 3 options I's
units is an "ors" of the above in order

Implementing this using parser combinators created a mini DSL. Once setup, id, minus, and plus functions can either calculate the value, or produce the abstract syntax tree. With a bit more effort in meta-programming,  and, or can be implemented as overloaded operators using &, |, to make the code look even more DSLish.

Combinators for tens, hundreds, and thousands follow a similar pattern. Finally all these are threaded to produce the final result.
roman = ands( plus, zero(thousands), zero(hundreds), zero(tens), zero(units))
This is again a fold, albeit with more complications. NOT a surprise.

Obvious inefficiencies exist in the implementation. On seeing I, the parser looks for an X. Not finding X it discards the whole parse and again looks for I followed by a V.  Another aspect is that the basic combinators are not minimal. I did not optimize as verbose dode was preferred, given the context. I also wanted the code to resemble a recursive descent parser, to help me explain the code during the interview.



**Actual code not posted, as this was for submission @ thoughtworks. Did I make it? Yes, but rejected as the offer was not one I couldn't refuse.