Never Stop Expecting More from Your Unstructured Data
This blog is a cross post from Smart Data Collective.
Where there is information, there is also software trying to make sense of it, and traditionally this software has been based on keyword technologies. These technologies are so widely used that it is common to assume keywords are an easy and effective way to access and analyze information. In reality, this is not entirely true.
Keyword technologies use probabilistic algorithms that focus on matching and cannot determine the exact meaning of each word in a search. Today’s businesses are dealing not only with traditional structured data but, increasingly, with the unstructured data (text, email, documents and more) that fills their databanks, file-sharing systems and CRMs. Analysts rightly wonder how they can effectively relate structured data to the ever-growing volume of unstructured data to produce something meaningful.
Semantic technology is able to understand a text in a way that emulates human comprehension of information. For example, it can identify that a text is about “education” and “sport” even if the text never contains those two words explicitly, as long as it contains concepts correlated with them (“education”: school, tutoring, teacher, math, etc.; “sport”: game, team, score, football, quarterback, etc.).
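To make the idea concrete, here is a minimal sketch in Python of mapping related terms to a concept. It assumes a tiny hand-written lexicon built only from the examples above, and the names `CONCEPTS` and `detect_concepts` are hypothetical. It is still a simple lookup table, so it illustrates the word-to-concept mapping rather than the linguistic analysis a real semantic engine performs.

```python
import re

# Toy concept lexicon built from the examples in the text (an assumption:
# a real semantic engine relies on a much richer linguistic knowledge base).
CONCEPTS = {
    "education": {"school", "tutoring", "teacher", "math"},
    "sport": {"game", "team", "score", "football", "quarterback"},
}

def detect_concepts(text):
    """Count, per concept, how many related terms appear in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    found = {}
    for concept, terms in CONCEPTS.items():
        count = sum(1 for token in tokens if token in terms)
        if count > 0:
            found[concept] = count
    return found

# Neither "education" nor "sport" appears in this sentence.
sample = "The math teacher also coaches the football team and its quarterback."
print(detect_concepts(sample))  # {'education': 2, 'sport': 3}
```

The sample sentence is tagged with both concepts even though neither word occurs in it, which is the behavior the paragraph describes.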
More importantly, it also comprehends conversational language and all its ambiguities (slang, abbreviations, multi-language text) to arrive at an understanding of not just words, but the user’s intention. A good example of this at work can be seen in a recent analysis that the social research firm Sociometra conducted on over 30,000 social media comments about tourist destinations (museums, monuments, etc.) and about the city of Rome, Italy, in general.
The analysis showcases the technology’s power for analyzing unstructured text and its strength in establishing connections between not just words but, more importantly, concepts. A majority of comments focused on flights, air travel, taxis, the subway and buses, which the system recognized as belonging to the category “transportation.” In this way, semantic technology was able to categorize comments based on their stated or implied subject matter and expose a hierarchy of the top concepts mentioned by commenters.
In the same way, it was able to resolve ambiguous information, determining from context whether the value judgment “cheap” was meant positively (good value, frugal) or negatively (poor quality, disappointing). Without the ability to capture the overall context through correct linguistic analysis (morphology, grammar, syntax, lexicon), it is difficult to distinguish between two or more meanings.
Here, it’s not about the guesswork of keywords, but about the ability to recognize both a single word with many meanings and many words that share, or are correlated with, the same meaning.
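As a rough illustration of why context matters for a word like “cheap,” the following Python sketch picks a sense from the surrounding words. The cue lists and the function name `sense_of_cheap` are invented for this example. A real system would use morphological, grammatical and syntactic analysis rather than counting cue words.

```python
import re

# Hypothetical cue words, invented for illustration only.
POSITIVE_CUES = {"bargain", "great", "value", "affordable", "loved"}
NEGATIVE_CUES = {"flimsy", "dirty", "broke", "disappointing", "poor"}

def sense_of_cheap(sentence):
    """Guess which sense of 'cheap' a sentence uses, based on nearby cues."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    if "cheap" not in tokens:
        return None  # the word does not occur at all
    positive = len(tokens & POSITIVE_CUES)
    negative = len(tokens & NEGATIVE_CUES)
    if positive > negative:
        return "good value"
    if negative > positive:
        return "poor quality"
    return "ambiguous"  # no cues, or the cues cancel out

print(sense_of_cheap("The museum tickets were cheap, a great bargain."))
print(sense_of_cheap("The hotel felt cheap and dirty, really disappointing."))
```

The same keyword yields “good value” for the first sentence and “poor quality” for the second, which a plain keyword match on “cheap” could not tell apart.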
In the business domain, there is a constant need for analysts to understand more and more information. And while there are plenty of good systems for analyzing structured data, extending them to unstructured information requires constantly developing (and anticipating) new lists of keywords and information to train the system, so that it can discover and share strategic links and patterns between unstructured information and the data points within an organization’s large databases.
The problem is that most organizations do not have the time or resources to do the regular document training needed to fulfill their needs for deeper knowledge. It’s next to impossible to manually think of all the terms that could have the same meaning, or the multiple ways to say something in English or any other language, not to mention all of the other insights or low-lying trends that have been underestimated or overlooked.
Is semantics the panacea for every analyst then? Of course not. Keywords can be useful, as long as we are aware of their limitations. Integrating semantics with keyword technology through faceted search is a hybrid solution that helps refine a search along different paths according to a certain order or category, and it could be a first step in the migration to full semantic search.
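A minimal sketch of this hybrid approach in Python might look like the following. It assumes each comment has already been assigned a category by a semantic engine. The sample comments, the field names and the helper functions are all invented for illustration.

```python
from collections import Counter

# Invented sample data: the "category" field stands in for the output of
# a semantic categorization step (an assumption of this sketch).
COMMENTS = [
    {"text": "The subway to the Colosseum was fast", "category": "transportation"},
    {"text": "Fast entry line at the Vatican museum", "category": "museums"},
    {"text": "Taxi from the airport was not fast at all", "category": "transportation"},
]

def keyword_search(keyword, docs):
    """Plain keyword match: keep documents containing the exact word."""
    word = keyword.lower()
    return [d for d in docs if word in d["text"].lower().split()]

def facet_counts(docs):
    """Count results per semantic category, to offer as facets."""
    return Counter(d["category"] for d in docs)

def refine(docs, category):
    """Narrow a result set to one semantic category."""
    return [d for d in docs if d["category"] == category]

hits = keyword_search("fast", COMMENTS)
print(facet_counts(hits))
print(len(refine(hits, "transportation")))  # 2
```

The keyword step retrieves every comment containing “fast,” and the semantic categories then let the user narrow the results to a single subject such as transportation.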
Companies expect more from their unstructured data. It’s not enough to be able to access it; it adds value only if it is accurately processed.