Once the
AI Class was over, I wanted to take some time to study the
second optional NLP
problem with Python. Finding a solution is not difficult (because even an imperfect one will yield the answer anyway), but devising a satisfying
process to do so (i.e. a general way of solving similar problems), is much harder.
Problem Representation
The first task is working up the problem representation. We need to parse:
s = """|de| | f|Cl|nf|ed|au| i|ti| |ma|ha|or|nn|ou| S|on|nd|on|
|ry| |is|th|is| b|eo|as| | |f |wh| o|ic| t|, | |he|h |
|ab| |la|pr|od|ge|ob| m|an| |s |is|el|ti|ng|il|d |ua|c |
|he| |ea|of|ho| m| t|et|ha| | t|od|ds|e |ki| c|t |ng|br|
|wo|m,|to|yo|hi|ve|u | t|ob| |pr|d |s |us| s|ul|le|ol|e |
| t|ca| t|wi| M|d |th|"A|ma|l |he| p|at|ap|it|he|ti|le|er|
|ry|d |un|Th|" |io|eo|n,|is| |bl|f |pu|Co|ic| o|he|at|mm|
|hi| | |in| | | t| | | | |ye| |ar| |s | | |. |"""
A useful data structure in this context is a list of 19 sublists, each corresponding to a column of 8 tokens. The reason for this is that we need to easily shuffle around the columns, which are the atomic parts of the problem (i.e. those which cannot change):
grid = [[''] * 8 for i in range(19)]
for j, line in enumerate(s.split('\n')):
cols = line.strip().split('|')[1:-1]
for i, col in enumerate(cols):
grid[i][j] = col
There are certainly more elegant ways to handle this, but this is enough for the discussion. To revert back from this representation to a readable form:
def repr(grid):
return '\n'.join(['|%s|' % '|'.join([grid[i][j] for i in range(19)]) for j in range(8)])
We need to consider two abstract components for solving this problem: the state space exploration, and the evaluation function we must use to guide it, since there's clearly no way we can brute-force the 19! possible configurations.
State Space Exploration
For the exploration function, here is a simple approach: from a given grid configuration, consider all the possible grid reorderings, resulting from having an arbitrary column reinserted to an arbitrary position. Then expand them in the order of their score ranking. Here is an example showing two possible column reorderings from the initial state:
Note that this is not the same as simply swapping columns: from "a,b,c", for instance, we'd like to derive "b,a,c", "b,c,a", "c,a,b" and "a,c,b", which wouldn't possible with only one swapping step. One way to do this is:
def nextGrids(grid):
global visited
next_grids = set()
for i in range(19):
for j in range(19):
if i == j: continue
next_grid = list(list(col) for col in grid)
next_grid.insert(i, next_grid.pop(j))
next_grid = tuple(tuple(col) for col in next_grid)
if next_grid not in visited:
next_grids.add(next_grid)
return next_grids
Again it's quite certain that nicer solutions exist for this: in particular, I use
sets
to avoid repetitions (both for the already explored states overall and the reinsertion configs for a given grid), which thus requires those ugly "tuplify/listify" operations (a
list
is editable but not hashable, while a
set
is the opposite). We can then use this function in a greedy and heuristic way, always
expanding first the best reordering from any given state (using a
heap to make our life easier):
visited = set()
frontier = [] while True:
for next_grid in nextGrids(grid):
heappush(frontier, (-gridScore(next_grid), next_grid))
score, grid = heappop(frontier)
visited.add(grid)
score, grid = max(results)
print repr(grid)
print score
This is obviously greedy, because it might easily overlook a better, but more convoluted solution, while blindly optimizing the current step. Our only hope is that the evaluation function might be more clever in compensation. Note also that we don't stop, simply because there is no obvious criterion for doing so. We could define a tentative one, but since that's not the goal of the exercise, let's just proceed by inspection (once I knew the answer, I cheated by hardcoding a stopping criterion, just for the purpose of counting the number of steps to reach it). But then what about the scores? But just before..
Some Doubts..
At first, my solution wasn't based on a heap: it was simply considering the best configuration at any point, and then would forget about the others (in other words: it used a single-element search frontier). But I had doubts: was this strategy guaranteed to eventually exhaust the space of the possible permutations of a given set? If not, even in theory, it would seem to be a potential problem as for the generalness of the solution.. I'm not sure about the math required to describe the relationship between the
n! permutations of a set and the
n2-2n+1 element reorderings (there must be a technical term for this?) from any configuration, but after having performed some practical tests, I reached the conclusion that using a heap-based method (i.e. a frontier with many elements) was more sound, because although I cannot prove it, I'm pretty certain that it is guaranteed to eventually exhaust the permutation space, whereas the first method is not. In the context of this problem, it doesn't make a difference though, because the search space is so large that we would have to wait a very long time before we see these two closely related strategies behave differently.
Language Modeling
Enters language modeling.. this is the component we need for the evalution function, to tell us if we are heading in the right direction or not, in terms of unscrambling the hidden message. My first intuition was that character-based
n-grams would work best. Why? Because while exploring the state space, due to the problem definition, most of the time we are dealing with parts or scrambled words, not complete (i.e. real) ones. Thus a character-based n-gram should be able to help, because it works at the sub-lexical level (but happily not exclusively at this level, as it retains its power once words begin to get fully formed, which is what we want). To do this, I used the venerable
Brown Corpus, a fairly small (by today's standards) body of text containing about 1M words, which should be enough for this problem (note that I could have also used
NLTK):
CHAR_NGRAM_ORDER = 3
EPSILON = 1e-10
char_counts = defaultdict(int)
for line in open('brown.txt'):
words = re.sub(r'\W+', ' ', line).split()
chars = ' '.join(words)
char_counts[''] += len(chars)
for n in range(CHAR_NGRAM_ORDER):
for i in range(len(chars)):
if i >= n:
char_counts[chars[i-n:i+1]] += 1
def charProb(c):
global char_counts
if c in char_counts:
return char_counts[c] / char_counts[c[:-1]]
return EPSILON
Another debatable aspect: I use a very small value as the probability of combinations that have never been seen, instead of using a proper discounting method (e.g. Laplace) to smooth the MLE counts: I thought it wouldn't make a big difference, but I might be wrong. The outcome however is that I cannot talk about
probabilities, strictly speaking, so let's continue with
scores instead (which means something very close in this context).
The final required piece is the already mentioned function to compute the score for a given grid:
def gridScore(grid):
s = ' '.join([''.join([grid[i][j] for i in range(19)]) for j in range(8)])
s = ' '.join(re.sub(r'\W+', ' ', s).split())
LL = 0
for i in range(len(s)):
probs = []
for n in range(CHAR_NGRAM_ORDER):
if i >= n:
probs.append(charProb(s[i-n:i+1]))
pi = sum([p/len(probs) for p in probs])
LL += math.log(pi)
return LL
A couple of things to note: I use the
log-likelihood because it is more numerically stable, and I also use a simple interpolation method (with uniform weights) to combine models of different orders.
So.. does this work? Not terribly well unfortunately.. Although with some tuning (the most important aspect being the order N of the model) it's possible to reach somewhat interesting states like this:
| i|nf|or|ma|ti|on|ed|Cl|au|de| S|ha|nn|on| f|ou|nd| | |
|as|is| o|f | | | b|th|eo|ry|, |wh|ic|h |is| t|he| | |
| m|od|el|s |an|d |ge|pr|ob|ab|il|is|ti|c |la|ng|ua| | |
|et|ho|ds| t|ha|t | m|of| t|he| c|od|e |br|ea|ki|ng| | |
| t|hi|s |pr|ob|le|ve|yo|u |wo|ul|d |us|e |to| s|ol|m,| |
|"A| M|at|he|ma|ti|d |wi|th| t|he| p|ap|er| t|it|le|ca|l |
|n,|" |pu|bl|is|he|io|Th|eo|ry| o|f |Co|mm|un|ic|at|d | |
| | | | | | | |in| t|hi|s |ye|ar|. | | | | | |
from which it's rather easy to guess the answer (we are so good at this actually that there's even an internet
meme celebrating it), for some reason my experiments seemed to always find themselves stuck in some
local minima from which they could not escape.
The Solution
Considering the jittery nature of my simplistic optimization function (although you prefer it to go up, there is no guarantee that it will always do), I pondered for a while about a backtracking mechanism, to no avail. The real next step is rather obvious: characters are probably not enough, the model needs to be augmented at the level of words. The character-based model should be doing most of the work in the first steps of the exploration (when the words are not fully formed), and the word-based model should take over progressively, as we zero in on the solution. It's easy to modify the previous code to introduce this new level:
CHAR_NGRAM_ORDER = 6
WORD_NGRAM_ORDER = 1
char_counts = defaultdict(int)
word_counts = defaultdict(int)
for line in open('brown.txt'):
words = re.sub(r'\W+', ' ', line).split()
word_counts[''] += len(words)
for n in range(WORD_NGRAM_ORDER):
for i in range(len(words)):
if i >= n:
word_counts[' '.join(words[i-n:i+1])] += 1
chars = ' '.join(words)
char_counts[''] += len(chars)
for n in range(CHAR_NGRAM_ORDER):
for i in range(len(chars)):
if i >= n:
char_counts[chars[i-n:i+1]] += 1
def wordProb(w):
global word_counts
words = w.split()
h = ' '.join(words[:-1])
if w in word_counts:
return word_counts[w] / word_counts[h]
return EPSILON
def gridScore(grid):
s = ' '.join([''.join([grid[i][j] for i in range(19)]) for j in range(8)])
s = ' '.join(re.sub(r'\W+', ' ', s).split())
LL = 0
for i in range(len(s)):
probs = []
for n in range(CHAR_NGRAM_ORDER):
if i >= n:
probs.append(charProb(s[i-n:i+1]))
if not probs: continue
pi = sum([p/len(probs) for p in probs])
LL += math.log(pi)
words = s.split()
for i in range(len(words)):
probs = []
for n in range(WORD_NGRAM_ORDER):
if i >= n:
probs.append(wordProb(' '.join(words[i-n:i+1])))
if not probs: continue
pi = sum([p/len(probs) for p in probs])
LL += math.log(pi)
return LL
After some tinkering, I found that character-based 6-grams augmented with word unigrams was the most efficient model, as it solves the problem in
15 steps only. Of course this is highly dependent on the ~890K training parameters obtained using the Brown corpus, as with some other text it would probably look quite different. I'm not sure if this is the optimal solution (it's rather hard to verify), but it should be pretty close.
And finally.. isn't it
loopy in a strange way that the solution to a problem refers to what is probably the
most efficient way of solving it?
That was really fun, as the rest of the course was, and if this could be of interest to anyone, the code is available on
GitHub.