Sunday, June 17, 2012

Two Tales of Data Clerking

I - A couple of years ago..

I was tasked with creating a web-based medical database application to manage TB case data. As the data had spatial components, the choice of database technology (PostgreSQL, with its PostGIS extension) was easy. For the client-side tech (mostly data entry forms coming from the paper world), the choice was harder. After exploring and reviewing many frameworks, I finally settled on the powerful Ext JS library. Lastly, as I wanted to code the server-side logic in Python, I decided to go with mod_python, which wasn't then in the almost defunct state it is in today. I played for a while with the idea of a more integrated environment like Django, but I decided against it after realizing that the mildly complicated database logic I had in mind was not well supported by an ORM. In its place, I devised little_pger, a very simple, psycopg2-based "proto-ORM", wrapping common SQL queries in a convenient and pythonic interface (at least to my taste!).
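To give an idea of what I mean by wrapping common SQL queries in a pythonic interface, here is a toy sketch of the kind of call I have in mind. This is hypothetical, not little_pger's actual API, and it uses sqlite3 instead of psycopg2 so it can run standalone (the table and column names are made up too):

```python
import sqlite3

def select(cur, table, where=None):
    # build and run a simple SELECT from keyword conditions;
    # a toy illustration only -- the real thing wraps psycopg2
    sql = 'select * from %s' % table
    args = []
    if where:
        conds = sorted(where.items())
        sql += ' where ' + ' and '.join('%s = ?' % k for k, _ in conds)
        args = [v for _, v in conds]
    cur.execute(sql, args)
    return cur.fetchall()

# demo on an in-memory database
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('create table patient (id integer, name text)')
cur.execute("insert into patient values (1, 'Alice'), (2, 'Bob')")
print(select(cur, 'patient', where={'id': 2}))  # [(2, 'Bob')]
```

The point being that for simple CRUD-style queries, a thin layer like this keeps the SQL visible (and tweakable) while sparing you the string-building boilerplate.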

Having made all these choices, I was ready to build the application. I did, and in retrospect it's very clear that the most time-consuming aspect, by far, was the set of intricate user interface widgets I had to create to maximize user happiness and data consistency (with all the power Ext JS offers, it's hard to resist):
  • Different kinds of autocompletion text fields
  • Panels with variable number of sub-forms
  • Address validation widget
  • Etc..


The initial versions of the app were so heavy that I had to revise the whole UI architecture.. but then a nice thing happened: the JavaScript performance explosion brought by the browser wars.. suddenly this wasn't a concern anymore!

This system has been working very reliably (from an Ubuntu VM) ever since. It has required slightly more user training and adjustment than I'd assumed (sophisticated UI semantics are always clearer in the designer's mind), but it is still being used, and its users are still (seemingly) happy. And so this could have been the end of a happy, although rather uneventful, story, if it wasn't for the fact that I actually spent some time meditating on what I'd do differently, were I offered the opportunity.

For instance, learning about WSGI, and above all the very elegant Flask microframework, made me want to be part of the fun too. So I refactored my application in terms of it (which was mostly painless). So long, mod_python..

By then however, a kind of "UI fatigue" had begun to set in, and when I was asked to quickly code an addon module, I admit it: I did it with Django (and no one really noticed).

Time passed, I worked on many other projects and then it came again.

II - A couple of months ago..

I was asked to create a medical database to hold vaccination survey data. By that time, I had developed another condition: "database fatigue" (AKA obsessive-compulsive normalization), which is probably better explained by these schema diagrams from my previous app:


And so, even though I still strongly believed in the power of relational databases, as the NoSQL paradigm was gaining traction, MongoDB seemed like a fun alternative to try.

This time however, as I lacked the courage (or ignorance) I had when facing the prospect of "webifying" 47 pages of paper forms (though it was a mere 13 pages for this new task), I decided to explore other options. The solution I came up with is based on a kind of "end user outsourcing": since the paper form was already created as a PDF document, all I did was superimpose dynamic fields on it and add a big "Submit" button on top (attached to a bit of JavaScript that sends the form payload over HTTP). Of course this is not web-based anymore (rather, HTTP-based), and it makes for a less sophisticated user experience (e.g. no autocompletion fields), but that's the price to pay for almost pain-free UI development. So long, grid and accordion layouts..


The really nice thing about it is hidden from the user's view: the entire application server reduces to this Python function (resting of course on the immense shoulders of Flask and PyMongo), which receives the form data (as a POST dict) and stores it directly in the MongoDB database, as a schemaless document:
from flask import Flask, Response, request
from pymongo import Connection

application = Flask('form_app')

@application.route('/submit', methods=['POST'])
def submit():
    # have Flask parse the form data into a plain dict (instead of an
    # ImmutableMultiDict), so it can be stored as-is
    request.parameter_storage_class = dict
    db = Connection().form_app
    # one document per case: insert it, or update it if it already exists
    db.forms.update({'case_id': request.form['case_id']},
                    request.form, upsert=True, safe=True)
    # reply with a minimal FDF document, which the PDF reader uses to
    # display the status message in the form
    return Response("""%FDF-1.2 1 0 obj <<
                       /FDF << /Status (The form was successfully submitted, thank you!) >>
                       >> endobj trailer << /Root 1 0 R >>
                       %%EOF
                    """, mimetype='application/vnd.fdf')

Although for a shorter period of time, this system too has been working quite satisfyingly since it went online. Its users seem happy, and even though it's less sexy than a web UI, there are merits to this distributed, closer-to-paper approach. The MongoDB piece also makes it a joy (and a breeze) to shuffle data around (though I'm not sure how much of that is due to sheer novelty).

Of course these database app designs are so far apart that it could seem almost dishonest to compare them. But the goal was not to compare, just to show that it's interesting and sometimes fruitful to explore different design options, when facing similar problems.

Thursday, May 31, 2012

Reverse Engineering the Birthday Paradox

Recently, I've been thinking about the Birthday Paradox (not to be confused with this similarly named problem). The first time was while reviewing the concept of birthday attacks for the final exam of the online cryptography class I'm currently taking. The second time was while reading this intriguing exploration of the processes at play when struggling to understand a math problem.

The usual way to think about the problem is to consider the probability $\overline{p}$ of the opposite of the event we're interested in: that there is no birthday "collision" at all within a group of $k$ people. This, we're told (without really being told why), is easier to calculate, and it follows that we can retrieve the probability $p$ of our goal event (i.e. that there is at least one "collision") as $1 - \overline{p}$.
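Concretely, the indirect computation is only a few lines of Python (a minimal sketch of my own, for the classic instance of $n = 365$ possible birthdays):

```python
from functools import reduce

def p_collision(n, k):
    # 1 minus the probability that all k choices are distinct:
    # no_coll = (n/n) * ((n-1)/n) * ... * ((n-k+1)/n)
    no_coll = reduce(lambda acc, i: acc * (n - i) / n, range(k), 1.0)
    return 1 - no_coll

# the classic result: with 23 people, a shared birthday is more likely than not
print(round(p_collision(365, 23), 4))  # 0.5073
```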

The above-mentioned essay discusses this idea and the underlying principles at length, but it also casually challenges the reader with an interesting side problem: find a way to calculate $p$ directly, that is, without relying on the probability of the opposite event. Being a programmer, I decided to tackle this problem not from first principles, but from the opposite direction, by trying to derive understanding from "procedural tinkering".

In the first version of this article, I had derived a way of screening the $k$-permutations out of the Cartesian power set $N^k$. I thought this was the answer, but someone on the Cryptography Class discussion forum helped me understand that it was actually only a rearrangement of the indirect computation (i.e. $p = 1 - \overline{p}$). The correct way to compute $p$ directly should rather involve the sum of:
  • the chance of a collision occurring only on the second choice
  • the chance of a collision occurring only on the third choice
  • ...
  • the chance of a collision occurring only on the $k$-th choice
This seemed to make sense, but as I wanted to study this proposition in more detail on a simplified problem instance, I wrote this Python program:

from __future__ import division
from itertools import product

n = 10
k = 5

prev_colls = set()
p_coll = 0

def findPrevColl(p, j):
    for i in range(2, j):
        if p[:i] in prev_colls:
            return True
    return False

for j in range(2, k+1):
    n_colls = 0
    count = 0
    for p in product(range(n), repeat=j):
        if len(set(p)) < j and not findPrevColl(p, j):
            n_colls += 1
            prev_colls.add(p)
        count += 1
    print 'at k = %d, P_k = %f' % (j, n_colls / count)
    p_coll += n_colls / count
print 'sum(P_k) = %f' % p_coll

# verify result with the "indirect" formula
import operator
from numpy.testing import assert_approx_equal
assert_approx_equal(p_coll, 1 - reduce(operator.mul, range(n-k+1, n+1)) / n ** k)

# at k = 2, P_k = 0.100000
# at k = 3, P_k = 0.180000
# at k = 4, P_k = 0.216000
# at k = 5, P_k = 0.201600
# sum(P_k) = 0.697600

which works by enumerating, for every $k$, all the possible trials (of length $k$) and looking for a "new" collision with each one, "new" meaning that no prefix (of length $< k$) of this particular trial has ever caused a collision before. The probabilities for each $k$ are then summed to yield the overall, direct probability of at least one collision. Note that this code relies on the itertools module's product function to generate the Cartesian powers $N^k$ (i.e. all possible trials of length $k$) and on the set data structure for easy past and current collision detection.

Once fully convinced that this was working, the obvious next step was to derive an analytic formulation for it. By studying some actual values from my program, I figured out that the probability of a collision occurring only on the $k$-th choice should be: \[ P_{only}(n, k) = \frac{\left(\frac{n! \cdot (k-1)}{(n-k+1)!}\right)}{n^k}\] meaning that the total, direct probability of at least one collision is the sum: \[ P_{any}(n, k) = \sum_k{\frac{\left(\frac{n! \cdot (k-1)}{(n-k+1)!}\right)}{n^k}}\]

As this still wasn't fully satisfying, because it doesn't yield an intuitive understanding of what's happening, the same helpful person on the Cryptography forum offered an equivalent, but much better rewrite: \[ P_{only}(n, k) = \left(\frac{n}{n} \cdot \frac{n-1}{n} \cdots \frac{n-k+2}{n}\right) \cdot \left(\frac{k-1}{n}\right)\] which can easily be understood by imagining a bucket of $n$ elements: the chance of a collision happening exactly at choice $k$ is the probability that there was no collision within the first $k-1$ choices ($\frac{n}{n} \cdot \frac{n-1}{n} \cdots$, each factor shrinking as the bucket fills up) multiplied by the probability of a collision at $k$, which will happen $k-1$ times out of $n$, if we assume that the bucket is already filled with $k-1$ different values (by virtue of the no-previous-collision assumption).
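The closed form can be checked numerically against the enumeration's output for the same small instance ($n = 10$, $k = 5$); a quick sketch:

```python
from math import factorial

def p_only(n, k):
    # probability that the first collision occurs exactly at the k-th choice:
    # (n! / (n-k+1)!) * (k-1) / n^k
    return factorial(n) * (k - 1) / (factorial(n - k + 1) * n ** k)

def p_any(n, k):
    # direct probability of at least one collision among k choices
    return sum(p_only(n, j) for j in range(2, k + 1))

print(p_any(10, 5))  # ~0.6976, matching the enumeration above
```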

Although the conclusion of my previous attempt was that it's often possible to derive mathematical understanding from "procedural tinkering" (i.e. from programming to math), I'm not so sure anymore, as this second version is definitely a counterexample.