Problem RepresentationThe first task is working up the problem representation. We need to parse:
A useful data structure in this context is a list of 19 sublists, each corresponding to a column of 8 tokens. The reason for this is that we need to easily shuffle around the columns, which are the atomic parts of the problem (i.e. those which cannot change):
s = """|de| | f|Cl|nf|ed|au| i|ti| |ma|ha|or|nn|ou| S|on|nd|on| |ry| |is|th|is| b|eo|as| | |f |wh| o|ic| t|, | |he|h | |ab| |la|pr|od|ge|ob| m|an| |s |is|el|ti|ng|il|d |ua|c | |he| |ea|of|ho| m| t|et|ha| | t|od|ds|e |ki| c|t |ng|br| |wo|m,|to|yo|hi|ve|u | t|ob| |pr|d |s |us| s|ul|le|ol|e | | t|ca| t|wi| M|d |th|"A|ma|l |he| p|at|ap|it|he|ti|le|er| |ry|d |un|Th|" |io|eo|n,|is| |bl|f |pu|Co|ic| o|he|at|mm| |hi| | |in| | | t| | | | |ye| |ar| |s | | |. |"""
There are certainly more elegant ways to handle this, but this is enough for the discussion. To revert back from this representation to a readable form:
grid = [[''] * 8 for i in range(19)] for j, line in enumerate(s.split('\n')): cols = line.strip().split('|')[1:-1] for i, col in enumerate(cols): grid[i][j] = col
We need to consider two abstract components for solving this problem: the state space exploration, and the evaluation function we must use to guide it, since there's clearly no way we can brute-force the 19! possible configurations.
def repr(grid): return '\n'.join(['|%s|' % '|'.join([grid[i][j] for i in range(19)]) for j in range(8)])
State Space ExplorationFor the exploration function, here is a simple approach: from a given grid configuration, consider all the possible grid reorderings, resulting from having an arbitrary column reinserted to an arbitrary position. Then expand them in the order of their score ranking. Here is an example showing two possible column reorderings from the initial state:
Again it's quite certain that nicer solutions exist for this: in particular, I use
def nextGrids(grid): global visited # I know.. next_grids = set() for i in range(19): for j in range(19): if i == j: continue next_grid = list(list(col) for col in grid) next_grid.insert(i, next_grid.pop(j)) next_grid = tuple(tuple(col) for col in next_grid) if next_grid not in visited: next_grids.add(next_grid) return next_grids
setsto avoid repetitions (both for the already explored states overall and the reinsertion configs for a given grid), which thus requires those ugly "tuplify/listify" operations (a
listis editable but not hashable, while a
setis the opposite). We can then use this function in a greedy and heuristic way, always expanding first the best reordering from any given state (using a heap to make our life easier):
This is obviously greedy, because it might easily overlook a better, but more convoluted solution, while blindly optimizing the current step. Our only hope is that the evaluation function might be more clever in compensation. Note also that we don't stop, simply because there is no obvious criterion for doing so. We could define a tentative one, but since that's not the goal of the exercise, let's just proceed by inspection (once I knew the answer, I cheated by hardcoding a stopping criterion, just for the purpose of counting the number of steps to reach it). But then what about the scores? But just before..
visited = set() frontier =  # heap of (score, grid)'s while True: for next_grid in nextGrids(grid): # using a min-heap so score has to be negated heappush(frontier, (-gridScore(next_grid), next_grid)) score, grid = heappop(frontier) visited.add(grid) score, grid = max(results) print repr(grid) print score
Some Doubts..At first, my solution wasn't based on a heap: it was simply considering the best configuration at any point, and then would forget about the others (in other words: it used a single-element search frontier). But I had doubts: was this strategy guaranteed to eventually exhaust the space of the possible permutations of a given set? If not, even in theory, it would seem to be a potential problem as for the generalness of the solution.. I'm not sure about the math required to describe the relationship between the n! permutations of a set and the n2-2n+1 element reorderings (there must be a technical term for this?) from any configuration, but after having performed some practical tests, I reached the conclusion that using a heap-based method (i.e. a frontier with many elements) was more sound, because although I cannot prove it, I'm pretty certain that it is guaranteed to eventually exhaust the permutation space, whereas the first method is not. In the context of this problem, it doesn't make a difference though, because the search space is so large that we would have to wait a very long time before we see these two closely related strategies behave differently.
Language ModelingEnters language modeling.. this is the component we need for the evalution function, to tell us if we are heading in the right direction or not, in terms of unscrambling the hidden message. My first intuition was that character-based n-grams would work best. Why? Because while exploring the state space, due to the problem definition, most of the time we are dealing with parts or scrambled words, not complete (i.e. real) ones. Thus a character-based n-gram should be able to help, because it works at the sub-lexical level (but happily not exclusively at this level, as it retains its power once words begin to get fully formed, which is what we want). To do this, I used the venerable Brown Corpus, a fairly small (by today's standards) body of text containing about 1M words, which should be enough for this problem (note that I could have also used NLTK):
Another debatable aspect: I use a very small value as the probability of combinations that have never been seen, instead of using a proper discounting method (e.g. Laplace) to smooth the MLE counts: I thought it wouldn't make a big difference, but I might be wrong. The outcome however is that I cannot talk about probabilities, strictly speaking, so let's continue with scores instead (which means something very close in this context).
CHAR_NGRAM_ORDER = 3 EPSILON = 1e-10 char_counts = defaultdict(int) for line in open('brown.txt'): words = re.sub(r'\W+', ' ', line).split() # tokenize with punctuation and whitespaces chars = ' '.join(words) char_counts[''] += len(chars) # this little hack is useful for 1-grams for n in range(CHAR_NGRAM_ORDER): for i in range(len(chars)): if i >= n: char_counts[chars[i-n:i+1]] += 1 # charProb('the') should be > than charProb('tha') def charProb(c): global char_counts if c in char_counts: return char_counts[c] / char_counts[c[:-1]] return EPSILON
The final required piece is the already mentioned function to compute the score for a given grid:
A couple of things to note: I use the log-likelihood because it is more numerically stable, and I also use a simple interpolation method (with uniform weights) to combine models of different orders.
def gridScore(grid): # add spaces between rows s = ' '.join([''.join([grid[i][j] for i in range(19)]) for j in range(8)]) # tokenize with punctuation and whitespaces s = ' '.join(re.sub(r'\W+', ' ', s).split()) LL = 0 # log-likelihood for i in range(len(s)): probs =  for n in range(CHAR_NGRAM_ORDER): if i >= n: probs.append(charProb(s[i-n:i+1])) # interpolated LMs with uniform weights pi = sum([p/len(probs) for p in probs]) LL += math.log(pi) return LL
So.. does this work? Not terribly well unfortunately.. Although with some tuning (the most important aspect being the order N of the model) it's possible to reach somewhat interesting states like this:
from which it's rather easy to guess the answer (we are so good at this actually that there's even an internet meme celebrating it), for some reason my experiments seemed to always find themselves stuck in some local minima from which they could not escape.
| i|nf|or|ma|ti|on|ed|Cl|au|de| S|ha|nn|on| f|ou|nd| | | |as|is| o|f | | | b|th|eo|ry|, |wh|ic|h |is| t|he| | | | m|od|el|s |an|d |ge|pr|ob|ab|il|is|ti|c |la|ng|ua| | | |et|ho|ds| t|ha|t | m|of| t|he| c|od|e |br|ea|ki|ng| | | | t|hi|s |pr|ob|le|ve|yo|u |wo|ul|d |us|e |to| s|ol|m,| | |"A| M|at|he|ma|ti|d |wi|th| t|he| p|ap|er| t|it|le|ca|l | |n,|" |pu|bl|is|he|io|Th|eo|ry| o|f |Co|mm|un|ic|at|d | | | | | | | | | |in| t|hi|s |ye|ar|. | | | | | |
The SolutionConsidering the jittery nature of my simplistic optimization function (although you prefer it to go up, there is no guarantee that it will always do), I pondered for a while about a backtracking mechanism, to no avail. The real next step is rather obvious: characters are probably not enough, the model needs to be augmented at the level of words. The character-based model should be doing most of the work in the first steps of the exploration (when the words are not fully formed), and the word-based model should take over progressively, as we zero in on the solution. It's easy to modify the previous code to introduce this new level:
After some tinkering, I found that character-based 6-grams augmented with word unigrams was the most efficient model, as it solves the problem in 15 steps only. Of course this is highly dependent on the ~890K training parameters obtained using the Brown corpus, as with some other text it would probably look quite different. I'm not sure if this is the optimal solution (it's rather hard to verify), but it should be pretty close.
CHAR_NGRAM_ORDER = 6 WORD_NGRAM_ORDER = 1 char_counts = defaultdict(int) word_counts = defaultdict(int) for line in open('brown.txt'): # words words = re.sub(r'\W+', ' ', line).split() # tokenize with punctuation and whitespaces word_counts[''] += len(words) # this little hack is useful for 1-grams for n in range(WORD_NGRAM_ORDER): for i in range(len(words)): if i >= n: word_counts[' '.join(words[i-n:i+1])] += 1 # chars chars = ' '.join(words) char_counts[''] += len(chars) for n in range(CHAR_NGRAM_ORDER): for i in range(len(chars)): if i >= n: char_counts[chars[i-n:i+1]] += 1 # wordProb('table') should be > than wordProb('tabel') def wordProb(w): global word_counts words = w.split() h = ' '.join(words[:-1]) if w in word_counts: return word_counts[w] / word_counts[h] return EPSILON def gridScore(grid): # add spaces between rows s = ' '.join([''.join([grid[i][j] for i in range(19)]) for j in range(8)]) # tokenize with punctuation and whitespaces s = ' '.join(re.sub(r'\W+', ' ', s).split()) LL = 0 # log-likelihood for i in range(len(s)): probs =  for n in range(CHAR_NGRAM_ORDER): if i >= n: probs.append(charProb(s[i-n:i+1])) if not probs: continue pi = sum([p/len(probs) for p in probs]) LL += math.log(pi) words = s.split() for i in range(len(words)): probs =  for n in range(WORD_NGRAM_ORDER): if i >= n: probs.append(wordProb(' '.join(words[i-n:i+1]))) if not probs: continue pi = sum([p/len(probs) for p in probs]) LL += math.log(pi) return LL
And finally.. isn't it loopy in a strange way that the solution to a problem refers to what is probably the most efficient way of solving it?
That was really fun, as the rest of the course was, and if this could be of interest to anyone, the code is available on GitHub.