Session abstract:
At ApacheCon 2012 we presented a system for content tagging at ZEIT Online. This talk presents recent work on improving the ranking of the tags by relying exclusively on the data itself (the complete ZEIT Online News Archive). Tags are either named entities, statistically relevant terms or phrases occurring in the news articles, or topics derived from text classification. The system uses open-source technology such as Lucene and GATE to produce these tags. We will present the requirements and expectations that an online editorial office has of such a system, along with ideas on how to meet them. All tags are ranked based on trend analysis, typical contexts and overall TF-IDF scores, all computed on the whole archive. No manual maintenance of thesauri or ontology resources is necessary.
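
To illustrate the TF-IDF part of the ranking, here is a minimal sketch in Python. It uses the textbook formulation (relative term frequency times the logarithm of inverse document frequency) and is not the scoring used by Lucene or by the system described in the talk. The function names `idf_table` and `rank_tags`, and the three-document toy archive, are assumptions made for illustration only.

```python
import math
from collections import Counter


def idf_table(archive):
    """Compute inverse document frequency for every term in the archive.

    `archive` is a list of documents, each a list of candidate tags.
    (Hypothetical helper; textbook IDF = log(N / document frequency).)
    """
    n_docs = len(archive)
    doc_freq = Counter()
    for doc in archive:
        doc_freq.update(set(doc))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def rank_tags(doc, idf):
    """Rank the candidate tags of one document by TF-IDF, highest first."""
    counts = Counter(doc)
    total = len(doc)
    scores = {t: (c / total) * idf.get(t, 0.0) for t, c in counts.items()}
    # Sort by descending score, then alphabetically to break ties.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy archive (assumed data, not taken from the ZEIT Online archive).
archive = [
    ["merkel", "berlin", "euro"],
    ["euro", "krise", "euro", "berlin"],
    ["fussball", "berlin"],
]
idf = idf_table(archive)
ranked = rank_tags(archive[1], idf)
```

In this toy archive a tag that occurs in every document ("berlin") receives a score of zero, while a tag that is rare across the archive ("krise") ranks first. This is the effect that computing the scores on the whole archive is meant to achieve.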