Many of our commercial and government customers are building powerful, efficient search engines for their own internal data or their customers’ data. Whether they use open source Elasticsearch or build their own engine, these applications are often expected to search natively across many languages with very high accuracy.
Stemming vs. Lemmatization
To do this, especially with European languages, a search engine must be able to handle each language’s morphological complexity.
In these languages, words commonly change form depending on how they are used. This is a challenge for search engines, because a query in one form must still match documents that contain another form of the same word. Reducing the different forms to a common base form is called normalization. Most search engines and search solutions normalize by “stemming.” Stemming is a crude method that chops characters off the end of a word in an attempt to find its root. You can imagine the problems this technique produces.
Consider the following example: the user searches for “celebrities,” the plural of “celebrity,” but the search engine reduces the query to the stem “celebr.” That search can return false positives from other words that share the same stem, such as “celebrations.” Not good!
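To make the collision concrete, here is a minimal sketch of a suffix-stripping stemmer. It is a toy written for illustration only: the suffix list and the `toy_stem` function are assumptions, not the algorithm used by any particular search engine, and real stemmers apply far more rules. Even so, it shows how unrelated words can end up with the same stem.

```python
# Toy suffix-stripping stemmer (illustrative assumption, not a real
# stemming algorithm and not any product's implementation).

# Hypothetical suffix list, ordered longest first so the longest match wins.
SUFFIXES = ("ations", "ation", "ities", "ity", "ies", "es", "s")


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Chop the first matching suffix off the end of the word."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if a reasonably long stem would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ("celebrities", "celebrity", "celebrations", "celebration"):
        print(f"{w} -> {toy_stem(w)}")
```

All four words reduce to “celebr” here, so an index keyed on these stems cannot tell a query about celebrities from a document about celebrations.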
This problem is compounded when you are dealing with searches across many languages. The only way to get truly accurate results is to perform a more advanced morphological analysis that finds the “lemma,” or dictionary form, of each word. This is called “lemmatization.”
Let’s revisit the previous example: the user searches for “celebrities,” but because the search engine uses lemmatization, the query is correctly interpreted as “celebrity,” not “celebration,” and the engine can deliver the right results. In fact, studies have shown that lemmatization is significantly more accurate than stemming for many European languages.
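The same example can be sketched with a dictionary lookup standing in for lemmatization. This is a simplified assumption for illustration: the four-entry `TOY_LEXICON` and the `toy_lemmatize` and `matching_terms` helpers are hypothetical, and a production lemmatizer relies on full morphological analysis and large lexicons for each language rather than a hand-written table.

```python
# Toy dictionary-based lemmatizer (illustrative assumption; a real
# lemmatizer performs full morphological analysis per language).

# Hypothetical lexicon mapping inflected forms to their dictionary form.
TOY_LEXICON = {
    "celebrities": "celebrity",
    "celebrity": "celebrity",
    "celebrations": "celebration",
    "celebration": "celebration",
}


def toy_lemmatize(word: str) -> str:
    """Return the lemma if the word is known, else the lowercased word."""
    key = word.lower()
    return TOY_LEXICON.get(key, key)


def matching_terms(query: str, document_terms: list[str]) -> list[str]:
    """Keep only the document terms whose lemma equals the query's lemma."""
    query_lemma = toy_lemmatize(query)
    return [t for t in document_terms if toy_lemmatize(t) == query_lemma]


if __name__ == "__main__":
    terms = ["celebrity", "celebrations", "Celebrities", "celebration"]
    print(matching_terms("celebrities", terms))
```

With lemmas as the normalized form, a query for “celebrities” matches “celebrity” and “Celebrities” but no longer matches “celebrations” or “celebration.”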
Our linguists and engineers have worked hard to bring lemmatization to our customers and their search applications. It is a standard feature of the Rosette Analyze Language component, enabling high-quality search across multiple languages.