Processing math: 100%

Friday, October 19, 2012

Stack Overflow Tag Similarity Visualization

For the recent Kaggle Stack Overflow machine learning contest, I have created this visualization submission, where the words found in questions with the most frequent tags have been used to compute their semantic similarity. The result is a matrix where nearby columns (representing their word use patterns) should correspond to similar, or related tags.

To do this, the t most frequent tags have been extracted from a subset of the training examples, along with the w most frequent words found in the "title" and "body" parts of questions tagged with at least one of these. This results in a t×w matrix where each column corresponds to a "tag vector" of w words. The tag vector components are computed using the tf-idf statistic, which tries to quantify the importance of a word by weighting its occurrence frequency for a particular tag (tf term) against its frequency across the overall set of tags (idf term). The problem with this representation is that neither the order of the columns nor the rows convey any information. Is there a way we could somehow reorder at least the tag vectors (i.e. columns)?

From the previous matrix, we can compute a new t×t matrix where each cell corresponds to the pairwise cosine similarity between two tags, which is an estimate of their degree of semantic relatedness:



Using this similarity matrix, we can reorder the columns of the first matrix: column/tag 1 should be similar to column/tag 2, which in turn should be similar to column/tag 3, and so on. The ideal reordering would maximize the sum of similarities across the whole chain of tags, but as I'm pretty sure that finding it is a NP-complete problem (thus intractable even for such a small matrix), I had to settle on a suboptimal greedy solution. Still, some interesting patterns emerge from the result, as one can see a gradient of tag relatedness, ranging from operating systems, Microsoft and web technologies, C-family languages, Apple technologies, web again, databases, and some general purpose programming languages.



Finally, as it seems that my submission is not too popular for some reason, I really wouldn't mind a quick thumbs up on the Kaggle page, if you think it's a reasonable idea!

2 comments:

Note: Only a member of this blog may post a comment.