To do this, the t most frequent tags have been extracted from a subset of the training examples, along with the w most frequent words found in the "title" and "body" parts of questions tagged with at least one of these. This results in a t×w matrix where each column corresponds to a "tag vector" of w words. The tag vector components are computed using the tf-idf statistic, which tries to quantify the importance of a word by weighting its occurrence frequency for a particular tag (tf term) against its frequency across the overall set of tags (idf term). The problem with this representation is that neither the order of the columns nor the rows convey any information. Is there a way we could somehow reorder at least the tag vectors (i.e. columns)?
From the previous matrix, we can compute a new t×t matrix where each cell corresponds to the pairwise cosine similarity between two tags, which is an estimate of their degree of semantic relatedness:

Using this similarity matrix, we can reorder the columns of the first matrix: column/tag 1 should be similar to column/tag 2, which in turn should be similar to column/tag 3, and so on. The ideal reordering would maximize the sum of similarities across the whole chain of tags, but as I'm pretty sure that finding it is a NP-complete problem (thus intractable even for such a small matrix), I had to settle on a suboptimal greedy solution. Still, some interesting patterns emerge from the result, as one can see a gradient of tag relatedness, ranging from operating systems, Microsoft and web technologies, C-family languages, Apple technologies, web again, databases, and some general purpose programming languages.

Finally, as it seems that my submission is not too popular for some reason, I really wouldn't mind a quick thumbs up on the Kaggle page, if you think it's a reasonable idea!
Voted for your submission after carefully checking out the others :)
ReplyDeleteThanks!
Delete