Friday, October 19, 2012

Stack Overflow Tag Similarity Visualization

For the recent Kaggle Stack Overflow machine learning contest, I have created this visualization submission, where the words found in questions with the most frequent tags have been used to compute their semantic similarity. The result is a matrix where nearby columns (representing their word use patterns) should correspond to similar, or related tags.

To do this, the $t$ most frequent tags have been extracted from a subset of the training examples, along with the $w$ most frequent words found in the "title" and "body" parts of questions tagged with at least one of these. This results in a $t \times w$ matrix where each column corresponds to a "tag vector" of w words. The tag vector components are computed using the tf-idf statistic, which tries to quantify the importance of a word by weighting its occurrence frequency for a particular tag ($tf$ term) against its frequency across the overall set of tags ($idf$ term). The problem with this representation is that neither the order of the columns nor the rows convey any information. Is there a way we could somehow reorder at least the tag vectors (i.e. columns)?

From the previous matrix, we can compute a new $t \times t$ matrix where each cell corresponds to the pairwise cosine similarity between two tags, which is an estimate of their degree of semantic relatedness:

Using this similarity matrix, we can reorder the columns of the first matrix: column/tag 1 should be similar to column/tag 2, which in turn should be similar to column/tag 3, and so on. The ideal reordering would maximize the sum of similarities across the whole chain of tags, but as I'm pretty sure that finding it is a NP-complete problem (thus intractable even for such a small matrix), I had to settle on a suboptimal greedy solution. Still, some interesting patterns emerge from the result, as one can see a gradient of tag relatedness, ranging from operating systems, Microsoft and web technologies, C-family languages, Apple technologies, web again, databases, and some general purpose programming languages.

Finally, as it seems that my submission is not too popular for some reason, I really wouldn't mind a quick thumbs up on the Kaggle page, if you think it's a reasonable idea!