Saturday, October 27, 2012

A Tribute to an Unsung Pattern

I've always been a big fan of the autocomplete UI pattern. Although it was invented way before, I guess it would be fair to say that its web incarnation was made ubiquitous in the wake of the Web 2.0 revolution. It certainly wasn't its most important innovation, but I think this simple idea deserves more praise than it usually receives, because it has a deep impact on the way users interact with an underlying, often unknown data model (the Google search field is certainly the most powerful example). I'd like to pay a small tribute to this modest pattern by showing a possible implementation, using some of my favorite tools: Python, ExtJS and PostgreSQL. Many tutorials already exist for this of course, but I'd like to focus on the particular problem of being maximally tolerant with the user's input (i.e. not impose any structure or order on it), which I solve in a compact way with little_pger, a small Python "pseudo-ORM" module.

The Data

Say we'd like to create an autocomplete field for the list of Ubuntu releases, stored on our Postgres server:

The Widget

Our JavaScript client widget will be an instance of the incredibly versatile ExtJS combobox:

Note that the Model data structure (required by the combobox, and which mirrors our database table) conflates the three fields (adjective, animal and version) into a single release field, with a convert function to specify the layout of the data that the user will see.

The Backend

The widget is fed by our Python Flask WSGI backend, with which it communicates using JSON:

Making It a Little Smarter

To access the database, the backend code uses little_pger, a small module I wrote, which acts as a very thin (but Pythonic!) layer above Psycopg2 and SQL. I use it as a replacement for an ORM (as I don't really like them) and thus call it a "pseudo-ORM". In this context, it does us a nice favor by solving the problem of searching through all the fields of the table, by AND-matching, in any order, the user-supplied tokens (e.g. "lynx 10.04 lucid" should match). For this, the where parameter of the function offers a nice syntactic flexibility, as those examples show:

So in our context, the trick is to use the last pattern (i.e. set-based), with the concatenated (||) fields we are interested in searching, which could successfully match a search for "lynx 04 lucid" with this SQL query, for example:

A more complete version of this tribute's code is available on GitHub.

Friday, October 19, 2012

Stack Overflow Tag Similarity Visualization

For the recent Kaggle Stack Overflow machine learning contest, I have created this visualization submission, where the words found in questions with the most frequent tags have been used to compute their semantic similarity. The result is a matrix where nearby columns (representing their word use patterns) should correspond to similar, or related tags.

To do this, the $t$ most frequent tags have been extracted from a subset of the training examples, along with the $w$ most frequent words found in the "title" and "body" parts of questions tagged with at least one of these. This results in a $t \times w$ matrix where each column corresponds to a "tag vector" of w words. The tag vector components are computed using the tf-idf statistic, which tries to quantify the importance of a word by weighting its occurrence frequency for a particular tag ($tf$ term) against its frequency across the overall set of tags ($idf$ term). The problem with this representation is that neither the order of the columns nor the rows convey any information. Is there a way we could somehow reorder at least the tag vectors (i.e. columns)?

From the previous matrix, we can compute a new $t \times t$ matrix where each cell corresponds to the pairwise cosine similarity between two tags, which is an estimate of their degree of semantic relatedness:

Using this similarity matrix, we can reorder the columns of the first matrix: column/tag 1 should be similar to column/tag 2, which in turn should be similar to column/tag 3, and so on. The ideal reordering would maximize the sum of similarities across the whole chain of tags, but as I'm pretty sure that finding it is a NP-complete problem (thus intractable even for such a small matrix), I had to settle on a suboptimal greedy solution. Still, some interesting patterns emerge from the result, as one can see a gradient of tag relatedness, ranging from operating systems, Microsoft and web technologies, C-family languages, Apple technologies, web again, databases, and some general purpose programming languages.

Finally, as it seems that my submission is not too popular for some reason, I really wouldn't mind a quick thumbs up on the Kaggle page, if you think it's a reasonable idea!