
Myths and Realities: Automatic Categorization

… or programs that “learn” how to categorize and programs that just categorize

Since the Seventies, many researchers have invested time and resources in developing algorithms that analyze texts already categorized by hand in order to extract automatically (or better… magically) the knowledge required to categorize other texts of the same kind.

Basically, the idea was (or rather is, because no solution has been found yet) the following:

           Take a list of the desired categories (or a tree, since categories are often hierarchical) directly from the people who need a system for automatic categorization.

           Receive from the same people a set of documents (tagged by hand) for each category, selected from the larger set of available texts.

           Use the categorization tree and the set of documents to teach the program how to recognize the distinctive features of each category. This is pure magic 😉 and it is normally referred to as training.
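To make the three steps above concrete, here is a minimal sketch in Python of what “training” can look like in its simplest form. It uses a small naive Bayes word-count model, which is only one of many possible techniques and not the method of any particular product. The categories, the four example texts and the function names (`tokenize`, `train`, `categorize`) are all hypothetical, chosen just for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count words per category from (category, text) pairs tagged by hand."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for category, text in examples:
        doc_counts[category] += 1
        word_counts[category].update(tokenize(text))
    return word_counts, doc_counts


def categorize(text, word_counts, doc_counts):
    """Return the category with the highest naive Bayes log-score."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        counts = word_counts[category]
        total_words = sum(counts.values())
        # Start from the share of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical toy data: steps 1 and 2 (categories plus hand-tagged documents).
EXAMPLES = [
    ("finance", "the bank raised interest rates on loans"),
    ("finance", "stock markets fell as investors sold shares"),
    ("sports", "the team won the match with a late goal"),
    ("sports", "the coach praised the players after the game"),
]

# Step 3: the "training".
word_counts, doc_counts = train(EXAMPLES)

print(categorize("investors worry about interest rates", word_counts, doc_counts))
```

Note that a model like this always returns one of the categories it was trained on, even for a text that belongs to none of them, and that it knows nothing beyond the words in its examples.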

This approach has produced one of the oldest and most persistent myths about Knowledge Management.

Although the solution soon proved inadequate, the wish to accomplish this magic has been so persistent that even today the market insists it must be possible to obtain a program, suitable for any field, that can start from a few examples and automatically perform a task that is often beyond the capacity of people themselves.

The idea of such a system is understandable and desirable (it may be the dream of everyone in the field of information management), but it has created exaggerated and absolutely unrealistic expectations, which are even detrimental because they interfere with the advance of the state of the art.

Systems of this kind DO NOT exist and, what’s more, as I often underline, there are no easy shortcuts for solving the complex problems of information management.

Still, when categorization is tailored to a specific need, the myth can come true, and the reality is often better than expected.

In fact, the categorization of content for personal use is still far from being economically viable (it remains pricey because the subjects are countless and tied to subjectivity). At the enterprise level, however, it has been possible for a few years now to implement automatic categorization systems that are economical and effective, provided that all the parties (firm and technology supplier, client and vendor, etc.) share clear goals and work together to avoid traps.

We will see how in the next post on this subject.
