Skip to main content

Part 2: Applying Optimisations

Table of Contents

In Part 1, we implemented a brute-force Countdown solver in Haskell.

Unfortunately, it’s incredibly slow and inefficient.

In this post, we’ll apply some optimisations to speed things up.

Reducing The Search Space
#

As I mentioned towards the end of Part 1, the main problem is the astronomical size of the search space. For a given set of numbers and a target, we would have to generate and examine ~33 million expressions.

This is the main cause of the slow speed and high memory usage, so we’ll start with that.

In order to reduce the search space, we need a way of throwing away expressions which we don’t necessarily want to examine. In the business, this is called tree pruning.

Consider the following two expressions:

$$ 25 + (((6 * 75) / 50) * (3 + 100)) $$

$$ (((75 * 6) / 50) * (3 + 100)) + 25 $$

These are two equivalent, but different, expressions. Equivalent in that they both evaluate to the same value (952), but different in that the operands for the arithmetic operations are in a different order: \(25 + ((6 * 75) \cdots)\) in the top expression and \(((75 * 6) \cdots) + 25\) in the bottom.

Fortunately, there are some arithmetic tricks we can use to eliminate this duplication.

Algebraic Pruning
#

The main reason for our large search space is the many possible representations per value-equivalent subtree. So we need to reduce the number of value-equivalent representations generated.

The arithmetic operations \(+, -, \div, \times\) have some interesting properties to help us.

Commutativity
#

For starters, addition and multiplication are commutative operations, meaning it doesn’t matter which order the operands are in as they produce the same answer. For example, \(5 + 3 \equiv 3 + 5\) and \(5 \times 3 \equiv 3 \times 5\).

We can use this to our advantage: since both of the expressions above are equivalent, we can create a rule where we only consider one of them. For example, only consider \(x \times y\) and \(x + y\) if \(x \leq y\).

This means we’d only consider \(3 + 5\) and \(3 \times 5\).

Identity
#

Multiplication has another interesting property called identity, in other words, getting the same output as the input. \(x \times 1 = x\) and \(1 \times x = x\) - multiplying \(x\) by \(1\) you always get \(x\).

Similarly for division, if you divide \(x\) by 1, you always get \(x\) as an answer. But \(1 \div x \neq x\), so division doesn’t have true identity.

By adding another rule to throw away identity expressions for multiplication and division, we can further prune the search space.

Repeated Work
#

Cutting the per-call search space gets us a long way, but our solver still does a lot of work it doesn’t need to.

Consider what happens when we call allExprs on a four-number list [a, b, c, d]. One split gives us ([a, b], [c, d]), which triggers a recursive call to allExprs [c, d]. Another split gives us ([a], [b, c, d]), and inside that branch we’ll eventually recurse to allExprs [c, d] again. Same input, same expression list, rebuilt from scratch.

It gets worse one level up. To search the full Countdown space, we don’t call allExprs on the six numbers as a unit; we call it once per ordering of every non-empty subsequence. Two orderings of the same five-number subset don’t share any work in our brute-force implementation - the entire expression tree is rebuilt for each one.

The fix is to compute the expression list for each subset of input numbers exactly once, store it somewhere, and look up the answer. This is called memoisation.

Now we’ve got a plan, let’s build!

Expression Tree Pruning
#

We’ll start with pruning the expression tree, since this gives us a notable improvement in performance.

Let’s look at our current code for generating and evaluating expressions:

-- evaluate an expression
eval :: Expr -> Maybe Int
eval (Val n) = Just n
eval (Exp op leftExpr rightExpr) = do
  left <- eval leftExpr
  right <- eval rightExpr
  apply op left right

-- given an Op and 2 Ints, check if the Op is valid 
-- and apply it if so
apply :: Op -> Int -> Int -> Maybe Int
apply Add x y = Just (x + y)
apply Mul x y = Just (x * y)
apply Sub x y = if y > x then Nothing else Just (x - y)
apply Div x y = if y == 0 || x `mod` y /= 0 then Nothing else Just (x `div` y)

-- Split a list into all possible (Left, Right) pairs
split :: [a] -> [([a], [a])]
split xs = init . drop 1 $ zip (inits xs) (tails xs)

A few things to note:

  • We are combining expression validity and evaluation in apply.
  • We have a separate eval function for expression evaluation.
  • split has some added overhead because of the zip call with inits and tails.

Referring back to our algebraic pruning section, let’s add a new function valid, which will check if an Op can be applied to two Int arguments.

It will have the following properties:

  1. Add x y is only valid if x <= y. This way, we would only consider one expression representation, for example, 3 + 5, and not 5 + 3.
  2. Sub x y will only be allowed if x > y. This is equivalent to the rule in our old apply function, as it prevents negative values.
  3. Mul x y is only valid if neither x nor y are equal to 1, and if x <= y. This avoids identity (multiplying by 1) and equivalent expressions (3 * 5 and 5 * 3).
  4. Div x y is only valid if y isn’t equal to 1, and x / y doesn’t produce a fractional number.

Here’s the implementation:

valid :: Op -> Int -> Int -> Bool
valid Add x y = x <= y
valid Sub x y = x > y
valid Mul x y = x /= 1 && y /= 1 && x <= y
valid Div x y = y /= 1 && x `mod` y == 0

Now that we’ve moved the validity check out of apply, we can simplify it:

apply :: Op -> Int -> Int -> Int
apply Add x y = x + y
apply Sub x y = x - y
apply Mul x y = x * y
apply Div x y = x `div` y

Now we’ll think about the split function.

The issue with our original split was the added overhead of inits and tails, which generates two sublists. If we re-write our split function recursively, we can avoid the extra allocations.

Let’s think about our recursive cases:

  • When we have a list with 0 or 1 elements, we can’t split it, so we’ll return the empty list. These will be our base cases.
  • Otherwise, we’ll recurse on the remaining elements, and prepend the new list-list tuple to the result.
split :: [a] -> [([a], [a])]
split [] = []
split [_] = []
split (x : xs) = ([x], xs) : [(x : l, r) | (l, r) <- split xs]

Finally, we’ll update allExprs to use our new valid, split and apply functions:

allExprs :: [Int] -> Expressions
allExprs [] = []
allExprs [n] = [(Val n, n)]
allExprs ns =
  [ (Exp op l r, apply op lv rv)
    -- generate left and right splits
    | (ls, rs) <- split ns,
      -- generate all expressions from the left side
      (l, lv) <- allExprs ls,
      -- generate all expressions from the right side
      (r, rv) <- allExprs rs,
      op <- ops,
      -- filter out invalid expressions
      valid op lv rv
  ]

Let’s compare our new allExprs with the old one, to see how many expressions are generated given a list of numbers.

We’ll save our changes into a file called FasterCountdown.hs, and load it and our previous BruteCountdown.hs into ghci to compare performance.

ghci> :l FasterCountdown.hs 
ghci> length $ allExprs [25, 50, 75, 100, 3, 6]
418
ghci> :l BruteCountdown.hs 
ghci> length $ allExprs [25, 50, 75, 100, 3, 6]
4178

By eliminating equivalent expressions, we’ve cut down the search space by a factor of 10!

Memoisation
#

So we’ve cut down the search space, but we’re still doing the same work over and over again. Let’s use memoisation to fix that.

To implement memoisation, we need two things:

  1. A way to name a subset of the input numbers.
  2. A place to store the expression list for each name.

Naming A Subset
#

The natural choice for a subset of an n-element list is a bitmask - an Int where bit i is set if position i is included in the subset. For an input of six numbers, we need a 6-bit mask (0b000001 through 0b111111) to cover every non-empty subset.

For example, given the input number list [25, 50, 75, 100, 3, 6], the bitmask 0b100101 (decimal 37) means the subset [25, 75, 6]:

  • bit 0 = 1 → include position 0 → 25
  • bit 1 = 0 → skip position 1
  • bit 2 = 1 → include position 2 → 75
  • bit 3 = 0 → skip position 3
  • bit 4 = 0 → skip position 4
  • bit 5 = 1 → include position 5 → 6

One subtlety - Countdown solutions can contain duplicates. Two 5s in the input need to stay distinct, otherwise we’d collapse them into one and lose half of the search space. So our masks index positions in the input list, not the values themselves.

The Table
#

Here’s a tweaked allExprs function which uses a memo table, renamed to allExprsMemo:

import Data.Array (Array, listArray, (!))
import Data.Bits (bit, popCount, countTrailingZeros, (.&.), xor)

allExprsMemo :: [Int] -> Array Int Expressions
allExprsMemo ns = table
  where
    n     = length ns
    nums  = listArray (0, n - 1) ns  -- position -> value
    table = listArray (0, bit n - 1)
              [ compute m | m <- [0 .. bit n - 1] ]

    compute m = ... -- defined below

table is a \(2^n\) element array. Each cell table ! m holds the list of all expressions that can be built from the subset described by the bitmask m. The contents are produced by compute m, which we’ll define in a moment.

The interesting part is that compute itself will read from table. The two definitions (allExprsMemo and compute) are mutually recursive - the array is defined in terms of a function that indexes back into it. This is a Haskell idiom known as tying the knot.

It works because Haskell is lazy. Each cell table ! m is just a thunk until something forces it, and when it does get forced, the cells it depends on are already accessible in the array (even if their contents are also still thunks). The result is a fully memoised table that builds itself on demand.

The Recurrence
#

Now for compute, the function that actually builds each cell:

compute m
  -- popCount counts the '1's in a binary representation of an Int
  | popCount m == 0 = []
  | popCount m == 1 =
      -- nums taken from earlier definition in allExprsMemo
      let v = nums ! countTrailingZeros m
       in [(Val v, v)]
  | otherwise =
      [ (Exp op l r, apply op lv rv)
      | lm        <- properSubmasks m
      , let rm     = m `xor` lm  -- complement within m
      , (l, lv)   <- table ! lm
      , (r, rv)   <- table ! rm
      , op        <- ops
      , valid op lv rv
      ]

For a single-bit mask (a bitmask with only one 1 and the rest 0), we read its value out of nums and return a Val leaf. The empty mask returns an empty list so it can’t accidentally propagate upward.

For two or more bits, we walk every proper, non-empty submask lm of m, take its complement within m as rm = m XOR lm, look up the expressions for both halves in the table, and combine them with every valid op. This is the same nested-comprehension structure as our old allExprs, except the recursive call has been replaced with an array lookup.

To clarify, a proper submask is a submask of m that isn’t m itself. For example, given the mask 0b101, the submasks are 0b000, 0b001, 0b100, and 0b101. However, the proper submasks are the first three. Further, the non-empty constraint would filter out the first submask 0b000.

The recurrence is well-founded: lm and rm are both proper submasks of m, and any proper submask is numerically smaller than m. The dependency graph runs strictly from large masks to small ones, so no cell ever depends on itself, directly or transitively.

Enumerating Proper Submasks
#

Finally, we need a way to list every non-empty proper submask of a given mask. There’s a classic bit-twiddling trick for this:

properSubmasks :: Int -> [Int]
properSubmasks m =
  takeWhile (/= 0) . tail $ iterate (\s -> (s - 1) .&. m) m

The transition s' = (s - 1) .&. m walks downward through every submask of m in decreasing order. We start at m itself, drop it with tail (since we want proper submasks only), and stop when we reach zero.

Because we enumerate every proper submask, each partition of m shows up twice: once as (lm, rm) and once as (rm, lm). That’s deliberate, because it lets valid enforce it’s specific operand order for the commutative ops without us having to manually pick which side comes first.

Putting It All Together
#

With the memo table in place, the rest of the solver gets simpler. We no longer need permutations or subsequences, as every non-empty subset is already a mask in [1 .. 2^n - 1], and iterating those gives back the “any non-empty subset” behaviour of the old subsetPermutations with no recomputation.

solve :: [Int] -> Int -> [String]
solve ns target =
  [ show e
  | m       <- [1 .. bit (length ns) - 1]
  , (e, v)  <- table ! m
  , v == target
  ]
  where table = allExprsMemo ns

Each subset’s expression list is built exactly once and shared by every caller that lands on it.

Comparison With BruteCountdown.hs
#

Let’s remind ourselves of the running time of the old solve function from BruteCountdown.hs:

ghci> :l BruteCountdown.hs
[1 of 1] Compiling BruteCountdown   ( BruteCountdown.hs, interpreted )
Ok, one module loaded.
ghci> :set +s
ghci> solve [25, 50, 75, 100, 3, 6] 952
...
ghci> (23.53 secs, 6,405,928,856 bytes)

And now our new solution:

ghci> :l FasterCountdown.hs
[1 of 1] Compiling FasterCountdown   ( FasterCountdown.hs, interpreted )
Ok, one module loaded.
ghci> :set +s
ghci> solve [25, 50, 75, 100, 3, 6] 952
...
ghci> (1.18 secs, 444,807,904 bytes)

Wow - from ~ 24 seconds to ~ 1.2!

But we can do even better.

ghci is an interpreter, so there’s added overhead when running programs. To find out how fast our new FasterCountdown.hs compares to our old BruteCountdown.hs, we’ll compile them to native code.

Compiling
#

We need to add a Main.hs to act as our entry point. All it will do is read the list of numbers and the target number from the command line, and then call solve.

-- Main.hs
module Main where

-- when testing FasterCountdown.hs, we'll change this to FasterCountdown
import BruteCountdown (solve)
import System.Environment (getArgs)
import System.Exit (die)

main :: IO ()
main = do
  args <- getArgs
  case map reads args :: [[(Int, String)]] of
    parsed | length parsed >= 2, all (\p -> length p == 1 && null (snd (head p))) parsed -> do
      let nums = map (fst . head) (init parsed)
          target = fst (head (last parsed))
      case solve nums target of
        [] -> putStrLn "No solutions found."
        ss -> mapM_ putStrLn ss
    _ -> die "Usage: countdown N1 N2 ... Nk TARGET"

We’ll compile in run-time information, so we can see running time and memory usage.

$ ghc -O2 -with-rtsopts=-s -main-is Main Main.hs BruteCountdown.hs -o countdownbrute
$ ghc -O2 -with-rtsopts=-s -main-is Main Main.hs FasterCountdown.hs -o countdownmemo

This will produce two binaries: countdownbrute and countdownmemo.

Let’s start by running countdownbrute:

$ ./countdownbrute 25 50 75 100 3 6 952
...
INIT    time    0.001s  (  0.000s elapsed)
MUT     time    0.459s  (  0.458s elapsed)
GC      time    0.020s  (  0.021s elapsed)
EXIT    time    0.000s  (  0.000s elapsed)
Total   time    0.480s  (  0.479s elapsed)

%GC     time       0.0%  (0.0% elapsed)

Alloc rate    4,951,097,985 bytes per MUT second

Productivity  95.6% of total user, 95.5% of total elapsed

And then running countdownmemo:

$ ./countdownmemo 25 50 75 100 3 6 952
...
INIT    time    0.000s  (  0.000s elapsed)
MUT     time    0.046s  (  0.045s elapsed)
GC      time    0.121s  (  0.121s elapsed)
EXIT    time    0.000s  (  0.000s elapsed)
Total   time    0.167s  (  0.167s elapsed)

%GC     time       0.0%  (0.0% elapsed)

Alloc rate    4,662,416,340 bytes per MUT second

Productivity  27.3% of total user, 27.1% of total elapsed

Ok, what are we actually looking at?

Crunching The Numbers
#

The runtime breaks the wall time (real-world time between a program starting and finishing) into a few useful pieces. MUT is mutator time, the time spent computing things. GC is time spent in the garbage collector. Total is the wall time we care about, and Productivity is the fraction of total time spent in MUT rather than GC.

Putting the two runs side by side:

countdownbrute countdownmemo
Total 0.480s 0.167s
Mutator 0.459s 0.046s
GC 0.020s 0.121s
Productivity 95.6% 27.3%

The headline is the 2.9× speedup on total time - real, but a long way from the 20× gap we saw in ghci. What happened?

The key is the MUT times. The memoised solver does its actual computation in 0.046s against the brute solver’s 0.459s, the 10× speedup thanks to algebraic pruning and the table stopping a mountain of redundant work from happening.

The compiled brute-force solver isn’t running smarter code, it’s just running more of it very fast, because GHC’s bump-pointer allocator and generational GC are well-suited to “allocate a million little things, throw them away” workloads. ghci is much less forgiving, which is why the gap looked enormous there.

The other side of the story is the GC column. The brute solver spends almost no time collecting (0.020s) because every expression it builds is consumed and dropped immediately - there’s nothing to keep alive across the search. The memoised solver has to keep the whole \(2^n\) cell table resident until the query finishes, so the GC scans a much larger heap on every collection. We’ve traded cheaper CPU for more live data, and on this scale the GC is most of what we’re paying for.

Wrapping Up
#

Two optimisations took our brute-force Countdown solver from roughly 24 seconds in the interpreter to 1.2, and from 0.48 seconds compiled to 0.17.

The first was algebraic pruning. By recognising that commutativity and identity create equivalent expressions, we cut a single subset’s expression count by a factor of ten and collapsed every “solution” down to one canonical form.

The second was memoisation, built around bitmask-keyed lookups into a self-referential lazy array. Each subset’s expression list gets computed exactly once, no matter how many parent splits land on it. Tying the knot is one of the more striking idioms Haskell offers - the table is defined in terms of a function that indexes back into it, and laziness alone makes the whole construction well-defined.

Memoisation isn’t free, though. The table has to stay live in memory until the query finishes, and that pushes the cost from the mutator over to the garbage collector. The optimisation that gives us a 10× speedup on actual computation only buys a 3× speedup overall once GHC’s runtime gets its hands on the brute version. “Fewer cycles” doesn’t always mean “less wall time”.

You could push the solver further still - parallelism over the subsets, a more compact representation of Expr, or batching multiple targets against a single shared memo table - but the algorithmic wins are largely behind us. For now, we have something that finds every Countdown solution in under a fifth of a second, and that feels like a good place to stop.