Myths and Realities: Categorization
Categorization, a subject I have covered before, is a central activity in the effective management of the knowledge contained in texts (in technical terms, so-called “unstructured information”), yet it is shrouded in some of the most stubborn myths in the field of document processing.
But what is categorization?
The question is not trivial, because this activity goes by many names. It seems to have inherited the confused eclecticism typical of Knowledge Management: the labels range from “classification” and “clustering” to linguistic monstrosities such as “taxonomization”.
Personally, I prefer “categorization”, because I believe it is the term that best reflects the process behind the different names: sorting the available information into categories so that searching becomes easy and immediate.
In most cases categorization is performed manually, and is therefore subjective: it depends on individual choices shaped by each person’s way of thinking and needs, and also on the type of content (documents, emails, web sites, etc.).
As a manual activity, categorization presents two obvious problems: it takes a great deal of time, and it normally produces subjective category definitions that other users may find inconsistent. Automatic applications were introduced into information management technologies to solve these problems.
The first categorization systems were born immediately after the first attempts to implement research applications, but only with the recent explosion of information has the potential usefulness of automatic categorization attracted major interest. We need only consider the quantity of data available on the web today compared with a few years ago, our direct experience of managing documents on our PCs, or the phenomenon of email: in less than 10 years, the average user has gone from managing a few emails per week to about 30 emails per day…
In the field of information processing technologies (at least from an insider’s point of view), nearly all researchers have approached the problem with the fixed idea of finding an algorithm that can categorize any content automatically, with little or no manual work and with very high quality.
This is how a pragmatic approach to the problem gave way to the race for a silver bullet in automatic categorization: an imprudence that has led to inflated expectations and unsatisfactory results. In the next posts we will see how, when and why.