The Traps of Categorization
I’ve written about automatic categorization many times, but it is a complex topic with many different aspects. Although it may seem simple to the general public, I think it’s worth discussing once again (and again in the future).
This time I would like to focus on the categorization of content dealing with generic, horizontal subjects, i.e. journalistic categories such as news, sports, economics, politics and so on. For those who, like us, have developed categorization software for years, the abundance of non-institutional content on the Web (mainly blogs and similar sources) offers more opportunities than in the past to apply our technology successfully. In theory, it’s a winning solution not only for those who develop the technology but also for those who provide the content: categorization rules need only minor customization, so enriching content with quality information becomes quick and effective.
Nevertheless, two aspects must be considered carefully in order to avoid problems during implementation.
The first aspect is somewhat implicit in the personal and subjective nature of these information sources. When writing blogs, authors very often mix posts on their favorite subject or field of expertise (cinema, sports, technology…) with more intimate, personal posts that do not necessarily have a specific subject. When we try to categorize this kind of content using standard systems developed for the well-focused articles of periodicals and newspapers, we tend to obtain background noise. In order to minimize such noise, we need to be aware of the problem and apply semantic technology expertly; when we do, the final result is usually quite good and can provide added value to users.
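One simple way to picture noise reduction is to let the categorizer abstain when the evidence is weak, rather than forcing every post into a category. The sketch below is only an illustration under stated assumptions: the keyword lists, the `categorize` function and the `min_hits` threshold are all hypothetical, and plain keyword counting stands in for the far richer semantic analysis a real system would use.

```python
import re
from collections import Counter

# Hypothetical keyword lists, for illustration only.
CATEGORY_KEYWORDS = {
    "sports": {"match", "team", "goal", "league", "coach"},
    "cinema": {"film", "director", "actor", "screening", "movie"},
    "technology": {"software", "smartphone", "app", "browser", "laptop"},
}


def categorize(text, min_hits=2):
    """Return the best-matching category, or None when evidence is too thin."""
    words = re.findall(r"[a-z]+", text.lower())
    hits = Counter()
    for word in words:
        for category, keywords in CATEGORY_KEYWORDS.items():
            if word in keywords:
                hits[category] += 1
    if not hits:
        return None  # no subject detected: likely a personal post
    category, count = hits.most_common(1)[0]
    # A single stray keyword is treated as noise, not as a subject.
    return category if count >= min_hits else None


print(categorize("The team scored a late goal and the coach was thrilled."))
print(categorize("I watched a film yesterday, then went for a walk."))
```

With these assumed lists, the first post is assigned to sports, while the second, which mentions a film only in passing, is left uncategorized.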
The second aspect is the average length of this content. Quite often a post is short, no more than 500-600 characters, which makes it difficult to obtain enough reliable information to select the right category. The readers of a blog already know the subject because they have read previous posts, and therefore do not need further context to identify the main subject. For a program, however, the task is definitely more complicated: very often it does not analyze the posts of a blog in sequence, or as a set from the same information source, but receives them in random order or one at a time. In order to manage this aspect correctly, we need to accept some compromises and adjust the system progressively until we reach a good balance.
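One possible compromise, sketched here purely as an assumption and not as a description of any actual product, is to give the program a little of the context a regular reader has: remember which categories each source has produced so far, and fall back on the dominant one when a short post yields no category on its own. The class name `SourceContext`, its thresholds and the `classify` callable are all hypothetical.

```python
from collections import Counter, defaultdict

SHORT_POST_CHARS = 600  # upper end of the 500-600 character range mentioned above


class SourceContext:
    """Wraps a text classifier and remembers categories seen per source."""

    def __init__(self, classify, min_history=3, min_share=0.6):
        self.classify = classify        # callable: text -> category or None
        self.min_history = min_history  # posts needed before trusting history
        self.min_share = min_share      # how dominant the top category must be
        self.history = defaultdict(Counter)

    def categorize(self, source_id, text):
        category = self.classify(text)
        if category is not None:
            # Only confident results feed the history.
            self.history[source_id][category] += 1
            return category
        if len(text) > SHORT_POST_CHARS:
            return None  # a long post with no subject stays uncategorized
        seen = self.history[source_id]
        total = sum(seen.values())
        if total < self.min_history:
            return None  # not enough context for this source yet
        top, count = seen.most_common(1)[0]
        return top if count / total >= self.min_share else None
```

The trade-off is visible in the thresholds: set them too low and short personal posts inherit the blog's main subject, which brings the background noise back; set them too high and short posts are rarely categorized at all.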
For these kinds of projects, technology is very important, but equally important is the expertise of those who have worked in this field for years: as they say in Naples, no one is born learned.