
Myths and Realities: to Categorize or Facet?

When working on categorization projects, we often face the fact that a perfect automatic categorization cannot exist: a certain degree of subjectivity (which can also vary over time) is always involved when we assign a category or a subject to a text.

The most common situation involves taxonomies that mix heterogeneous categories: for example, when categorizing newspaper articles, customers tend to include in the taxonomy subjects such as sport and politics together with domains such as people or events.

But while categories like sport or politics are fairly objective and strictly related to the content of the text, people and events are cross-category elements, which makes them very difficult to manage with an automatic system. Articles in these categories have no common topics, no recurring or typical concepts, and no specific domains; their only shared feature is that they focus on someone or something (a person or an event).


However, it is comparatively easy for a reader to agree that articles about Leonardo da Vinci, Gorbachev, Robin Hood or Joe DiMaggio should belong to a “people” category.

In general we should always keep in mind that some choices are quite easy for us, but can be extremely complicated for a program.

For example, we may need to categorize the review of a Second World War movie. For most readers, without even having to read the whole article, the first category will be “cinema”, as the subject is a movie. The program, instead, may think* about history, war or the military, and would not consider “cinema” a relevant topic.
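To make the movie review example concrete, here is a minimal sketch of a naive categorizer that simply counts keyword hits per category. It is not how any particular product works: the keyword lists, the sample review text and the function name score_categories are all invented for this illustration.

```python
import re
from collections import Counter

# Hypothetical keyword lists, invented for this illustration only.
CATEGORY_KEYWORDS = {
    "cinema": {"movie", "film", "director", "actor", "screenplay"},
    "history": {"war", "allied", "invasion", "historical"},
    "military": {"soldiers", "troops", "battle", "army"},
}


def score_categories(text):
    """Count, for each category, how many words of the text are in its keyword list."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    scores = Counter()
    for category, keywords in CATEGORY_KEYWORDS.items():
        scores[category] = sum(1 for word in words if word in keywords)
    return scores


# Made-up review text: it mentions the movie once and the war many times.
review = (
    "The movie follows a group of soldiers after the Allied invasion of 1944. "
    "The war spares no one, and the troops fight battle after battle."
)

scores = score_categories(review)
top_category = max(scores, key=scores.get)

for category, hits in scores.most_common():
    print(category, hits)
print("top category:", top_category)
```

In this sketch the single word that tells a human reader "this is a review of a movie" counts for one hit, while the plot summary supplies several hits for the war-related categories, so "cinema" ends up at the bottom of the ranking.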

Luckily, most categorization issues can actually be solved by an automatic system which, once configured properly, will be far more objective and reliable than a person, because it will never get tired or be influenced by external factors. The person nevertheless remains the only one of the two who is really intelligent.

* think… it’s only a manner of speaking 🙂


