Julia vs Haskell: Implementing percentile ranks

I am a newbie to both Julia and Haskell, and I am interested in obtaining an array of percentile ranks. How did I fair? Read on to find out …

Although both languages offer percentiles, none of them seem to produce an array of percentiles. I also wanted to produce an array of percentiles in the presence of missing values. Julia’s DataFrame has this concept, for example. I haven’t investigated Haskell’s Frame package as yet, so I have assumed that an equivalent would be a list of Maybes.

I did not produce first one version, then another. I had a lot of false starts, and I batted back and forwards between trying to write a Julia version, and a Haskell version. I discovered that writing in one tended to inform, and sometimes hinder, writing in the other.

Starting with the easy case, where all the values of an array are proper, I tried to produce the function ‘prank’, which takes in an array of floats, and outputs an array of percentiles. Here is what I achieved in Julia:

function prank(arr::Array)
   n = length(arr)
   result = Array(Float64, n)
 
   for i in 1:n 
      score = 0
      for j in 1:n
         score += arr[j] < arr[i] ? 1 : 0 
         score += arr[j] == arr[i] ? 0.5 : 0
      end
      result[i] = score/n
    end
   result
end

prank([23.0, 20.0, 20.0, 57.0, 46.0])

which produced the (correct) output :

5-element Array{Float64,1}:
 0.5
 0.2
 0.2
 0.9
 0.7

The code looks neat to me, and, importantly, it is intelligible. It implements the ranking algorithm on Wikipedia. Here is the equivalent code I wrote in Haskell:

prank :: [Float] -> [Float]
prank arr =
  map score  arr
  where
    len p v = fromIntegral $ length $ filter (p v) arr
    n = fromIntegral $ length arr
    score v = ((len (>) v)  + 0.5 * (len (==) v))/n

It weights in at under half the lines. The very first line is not even necessary. I’d say it’s more of a challenge to assimilate, and it took me much longer to write, too. Could a seasoned Haskeller offer a more comprehensible algorithm?

Some of what it is doing is fairly obvious. ‘map’, ‘length’, ‘filter’ are standard ideas. ‘fromIntegral’ turns an integer into a float (actually, it does something a little froodier than that, but let’s just go with that explanation for the sake of argument).

To clarify things a little, ‘score’ tries to score a value against the entire matrix. So doing ‘map score arr’ takes the score from each element of the list, against the entire list. It does that by calling ‘len’. ‘len’ takes two arguments: a mathematical comparator operator, and a value to be scored. So firstly, it constructs a filter like:

filter (> v) arr, or filter (== v) arr

The magical thing here – or perhaps the confusing thing – is that when I write something like (> 3), that’s actually a function. It’s a function that takes an argument (of type (Num a, Ord a) if you must know) and returns a boolean. Being a function, it can be evaluated, like so:

(> 3) 2

to yield the result False. It seems to be an odd way of writing it at first, but remember, when I define the function ‘score’, I don’t know what the “other” value is that I’m comparing to v. The $ operator just helps prevent nested parentheses. So instead of writing

(a (b (c x)))

I can just write

a $ b $ c x

Which implementation is more obvious between Julia and Haskell? Well, I think Julia was easier for me to write.

Next, we consider the more ambitious case where the inputs may be undefined. Julia’s DataFrames consist of DataArrays, which can take the value NA. In this case, what I do is construct a subset of the DataArray containing only well-defined floats, take the “pranks” of that, and then rebuild an array, re-filling in the NAs in the right places. Here’s the function I wrote:

 function prankM(darr::DataArray)
   """
   TODO docstring available in julia >= 0.4 (which I don't yet have
   Algorithm is at: http://en.wikipedia.org/wiki/Percentile_rank
   It agrees with scipy stats 'mean' kind
   """
   len = length(darr)
 
   function isreg(x) ! isa(x, NAtype) end
 
   # reduce the array to just valid numbers
   pranks = Array(Float64, len)
   n = 0
   for i in 1:len
     if isreg(darr[i])
       n += 1
       pranks[n] = darr[i]
     end
   end
 
   ranks = prank(pranks[1:n]) # calculate the ranks on the good values
 
   # rebuild the data array
   n = 0
   for i in 1:len
     if isreg(darr[i])
       n += 1
       darr[i] = ranks[n]
     end
   end
 
   darr
end

That’s still pretty readable. Let me pass in some data like last time, only incorporating an NA:

inp = @data([23, 20, 20, NA,  57, 46.0])
prankM(inp)

This time the output is

6-element DataArray{Float64,1}:
 0.5
 0.2
 0.2
  NA
 0.9
 0.7

One dilemma I did face, and this is common to a lot of dynamic languages, is this: is my input data mutated? That is often something of an non-obvious issue. Sometimes a language will mutate the data, sometimes not, depending on the type of data you pass in (quickly now: in Python, if you pass in a dictionary to a function, can it, or is it, mutated? How about a list? Or a list of objects? And if you can get inconsistent results, then does that really make sense?). It turns out that the above function actually does mutate the data, so I have actually probably created a bug. One of the joys of Haskell is that you’re never in any doubt about whether data is mutated, or not. It never is!

Here’s my equivalent effort in Haskell:

prankM :: [Maybe Float] -> [Maybe Float]
prankM arrM =
  rebuild [] pranks arrM
  where
    pranks = prank $ catMaybes arrM
    rebuild acc _ [] = acc
    rebuild acc (p:ps) (m:ms) =
      if isJust m
      then rebuild (acc ++ [Just p]) ps ms
      else rebuild (acc ++ [Nothing]) (p:ps) ms

Again, you see it is less than half the number of lines. this implementation is actually correct. There’s more here than I’m likely to have the patience to explain.

But, firstly, instead of taking in a list of floats, it takes in a list of “maybe” floats. A “maybe” float is either Nothing (i.e. no value, corresponding to our NA), or “just” a float (i.e. an actual float). So we would actually call it like so:

prankM [Just 23.0, Just 20.0, Just 20.0,  Nothing, Just 57.0, Just 46.0]

and obtain the result

[Just 0.5,Just 0.2,Just 0.2,Nothing,Just 0.9,Just 0.7]

It is almost the same as before, but it will return “just” the percentile ranks, or Nothing where there is no value.

The value ‘pranks’ is a list of floats. ‘catMaybes’ throws out the Nothings in the array pass in, returning the “actual” (or “just”) values for floats that are actually specified.

Then we have “rebuild”, which rebuilds the percentiles in the correct length, with the correct sequence of Justs and Nothings. I seem to have defined the function twice. What’s going on there?

Well, that’s due to Haskell’s pattern-matching. ‘rebuild’ takes 3 arguments:

  1. argument 1 (acc) – is an accumulation of the result
  2. argument 2 (the pranks) – are a list of floats
  3. argument 3 is the input array, containing a mixture of actual floats and Nothing.

The first expression for ‘rebuild is:

rebuild acc _ [] = acc

The _ means “anything”, I don’t care. The [] matches the empty list. In other words, when my list of inputs is exhausted, I just return the result accumulated so far. I don’t care what the list of pranks is; in fact, it should be empty.

The second form for rebuild is

rebuild acc (p:ps) (m:ms)

Here, (p:ps) is a list of unaccumulated pranks whose head is p, and tail is ps. Likewise for (m:ms), but for the input arguments.

So I have a list of unprocessed pranks, and unprocessed inputs. I look at the first unprocessed input. Is it a valid number? If so, then I know that the first item in my unprocessed pranks is the output to that input. So I append that prank to my accumulated value. I then recurse, reducing the number of pranks and inputs that I pass into the next round.

If there isn’t a valid number in the input, then I know I have to append Nothing to the accumulated values. I then recurse. This time, I pass in the all of the pranks that I had before, with one less input.

Easy? Confusing? A similar result can be obtained using foldl, but I find the representation above easier to comprehend. Maybe there is a neater way to do it with monads. I even think that Haskell can simulate a loop using monads, although I haven’t figured out how that all works.

Conclusions?

Well, it’s difficult to say. Haskell code is definitely shorter, but possibly takes longer to write. It could be that I’m using bad idioms, or just haven’t learned to think in the right way. I like Haskell’s type signatures, which makes it much more obvious what a function “does”. It’s much easier to reason about code correctness in Haskell, and in fact I avoided a bug that I had inadvertently created in Julia. Julia is no worse than any other typical dynamic language in that respect, though. Ultimately you should choose the language that works best for you. I’m still giving both the benefit of the doubt when it comes to playing with scientific computing.

One thing I would say would that trying to write Haskell in Julia isn’t a good idea. I defined functions within functions in Julia, but eventually decided that a more “vanilla” approach was actually neater and more comprehensible.

Advertisements

About mcturra2000

Computer programmer living in Scotland.
This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s