By Eugene Reyes
Consider the case of Tamerlan Tsarnaev. Tsarnaev easily passed through the U.S. Customs and Border Protection checkpoint after landing at JFK. The name “Tsarnaev” did not appear on the watchlists. The alternate transliteration of Tsarnaev’s name, “Tsarnayev,” did. But Tsarnaev wasn’t flying under the name “Tsarnayev,” so he entered New York, traveled north, and, with his brother, bombed the 2013 Boston Marathon. The attack killed three people and injured hundreds more.[1]
Name matching is vitally important to national security, finance, healthcare, education, and an array of other industries. And you can’t accurately match names without precise transliteration.
Let’s take a closer look at what transliteration is, the challenges it presents, and how cutting-edge technologies can ease and improve the process.
Transliteration vs. translation
Transliteration is the systematic process of converting text from one script to another in predictable ways.
Today, world populations use at least 18 major alphabets.[2] If you’re reading this post, you probably feel comfortable with the Latin/Roman alphabet. Conversely, you may know nothing about Cyrillic. Therefore, the phrase “President Владимир Путин’s actions have caused global instability” leaves you scratching your head. However, you can easily understand the transliterated version of this sentence: “President Vladimir Putin’s actions have caused global instability.”
The rewriting of “Владимир Путин” as “Vladimir Putin” is an act of transliteration, not translation.
Translation seeks to convey the meaning of words from one language to another. Transliteration concerns itself with scripts and alphabets — not meaning. Think of it this way. Roughly translated from the original Old East Slavic, the name “Vladimir” means “of great power.” But when discussing world news, we don’t use the translated version of Putin’s first name. We never refer to the Russian president as “Of Great Power Putin.” Rather, we say “Vladimir” — a transliteration that uses and arranges Latin/Roman letters to roughly mimic how Putin’s first name sounds when pronounced in Russian.
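To make the "systematic" part concrete, here is a minimal Python sketch of character-by-character transliteration. The mapping table is a simplified assumption covering only the letters in this example; it is not a published standard, and the function name `transliterate` is ours.

```python
# Simplified Cyrillic-to-Latin table (illustrative assumption, not a standard).
CYRILLIC_TO_LATIN = {
    "а": "a", "в": "v", "д": "d", "и": "i", "л": "l", "м": "m",
    "н": "n", "п": "p", "р": "r", "т": "t", "у": "u",
}


def transliterate(text: str) -> str:
    """Map each Cyrillic letter to Latin letters, preserving capitalization."""
    out = []
    for ch in text:
        latin = CYRILLIC_TO_LATIN.get(ch.lower())
        if latin is None:
            out.append(ch)  # spaces, punctuation, and unmapped letters pass through
        elif ch.isupper():
            out.append(latin.capitalize())
        else:
            out.append(latin)
    return "".join(out)


print(transliterate("Владимир Путин"))  # prints "Vladimir Putin"
```

Notice that the code never asks what the name means. It only maps letters to letters, which is exactly the difference between transliteration and translation.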
Systematic, consistent transliteration means a name is always spelled the same way, which enables any search engine to find matching names. Without that, you need cutting-edge, cross-lingual name-matching technology that looks at the name as written to match names precisely between English and languages written in complex, non-Latin scripts, such as Russian, Japanese, Hebrew, and Arabic.
Transliteration of these languages is inherently difficult. While governments and NGOs have developed their own standards for transliteration, there are no universal standards. This makes reproducibility — or the ability to transliterate the same foreign characters into the same Latin/Roman characters in the same order, every time — arduous, if not impossible.
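A small Python sketch shows how two reasonable rule sets can disagree about the same name. Both schemes here are simplified assumptions: one always writes the Cyrillic letter "е" as "e," while the other, loosely modeled on conventions that write "ye" after a vowel, does not. The function name `romanize` and its flag are hypothetical.

```python
# Letters needed for this example only (illustrative assumption).
BASE = {"ц": "ts", "а": "a", "р": "r", "н": "n", "в": "v"}
VOWELS = set("аеёиоуыэюя")


def romanize(name: str, ye_after_vowel: bool) -> str:
    """Romanize a Cyrillic name under one of two simplified conventions."""
    out = []
    prev = ""
    for ch in name:
        low = ch.lower()
        if low == "е":
            # The two schemes differ only in how they treat this letter.
            if ye_after_vowel and (prev == "" or prev in VOWELS):
                latin = "ye"
            else:
                latin = "e"
        else:
            latin = BASE.get(low, ch)
        out.append(latin.capitalize() if ch.isupper() else latin)
        prev = low
    return "".join(out)


print(romanize("Царнаев", ye_after_vowel=False))  # prints "Tsarnaev"
print(romanize("Царнаев", ye_after_vowel=True))   # prints "Tsarnayev"
```

One rule, applied to one letter, is enough to produce the two spellings from the opening story. An exact-match lookup treats them as different people.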
Why?
Challenges arise from language to language. Consider the following:
- Transliterations of Japanese names typically follow one of two methodologies. One prioritizes ease of pronunciation for speakers of American English; the other prioritizes a logical mapping of Japanese characters to Latin letters. Both can introduce ambiguity in name matching.
- Existing transliteration standards for Hebrew were created for scholars whose top priority was unambiguous scientific transliteration. Such transliterations are often incompatible with those produced by laypeople, which are the kind that appear in databases and that people type into search boxes.
- Transliterating Arabic names into English is particularly difficult. Arabic has various linguistic pronunciation rules: characters may be either silent or voiced, depending on where they appear. Transliteration programs may “miss” these characters and cause spelling differences. In addition, English lacks the full range of Arabic sounds, further hampering transliteration. These and other linguistic issues merge with a lack of universal standards to create imprecise and varied Arabic transliterations. The common first name “Mohammad,” for example, has more than 30 transliterations currently in use.
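To see why spelling variants defeat exact-match search, and why naive fixes fall short, consider this rough Python sketch. It reduces a Latin-script name to a "consonant skeleton" so that common variants of "Mohammad" share one key. The `skeleton` function is a hypothetical illustration, not a technique any particular product is known to use.

```python
import re


def skeleton(name: str) -> str:
    """Reduce a Latin-script name to a crude consonant key."""
    s = re.sub(r"[^a-z]", "", name.lower())  # keep ASCII letters only
    s = re.sub(r"[aeiou]", "", s)            # drop vowels, where variants differ most
    s = re.sub(r"(.)\1+", r"\1", s)          # collapse doubled letters
    return s


variants = ["Mohammad", "Muhammad", "Mohammed", "Mohamed", "Muhamed"]
for v in variants:
    print(v, "->", skeleton(v))  # every variant maps to "mhmd"

print(skeleton("Mahmoud"))  # also "mhmd", although Mahmoud is a different name
```

The key groups the variants, but it also groups "Mahmoud," a distinct name. Crude normalization trades missed matches for false ones, which is why serious name matching needs more than string tricks.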
To resolve these problems, many people resort to machine translation programs. This is a bad idea. These programs translate word meaning rather than transliterate letters. The name of a city in northeast Syria is natively spelled “ٱلرَّقَّة.” Machine translation typically renders the English version of this name as “Tenderness.” You may know the city as Raqqa.
Lack of universal standards, language variations, and misguided machine translation efforts plague organizations trying to transliterate massive lists of names. For example, without solid transliteration rules, computational libraries often incorrectly process and store documents from different languages, leaving entries hard or impossible to find. The consequences range from funny (incorrect transliteration left Ikea debuting a workbench in the United Kingdom called “The Fartfull”) to dire. Homeland Security mistakenly lets a terrorist enter the country. Financial institutions run afoul of Know Your Customer laws. International news reports become incomprehensible.
A better way
The Rosette text analytics and discovery platform quickly extracts entities from massive lists of names, then transliterates native names rather than running them through machine translation programs.
Rosette transliteration capabilities consistently follow the user’s transliteration standards of choice. These can include the Buckwalter transliteration standard for Arabic; standards set by the United States Intelligence Community for transliteration of personal, organization, and location names; and the National Geospatial-Intelligence Agency’s standards for transliterating names of foreign places.
Rosette can transliterate across more than 20 languages. These include complex languages, most of them written in non-Latin scripts, such as Arabic, Hebrew, Japanese, Korean, Vietnamese, simplified Chinese, and traditional Chinese. It automatically and efficiently resolves name spelling ambiguities in these languages, notably partial vocalization in Arabic and word segmentation in Chinese. It also supports transliteration of non-native names appearing in a specific language: non-Arabic names written in Arabic, for example, or non-Japanese names written in Japanese.
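Partial vocalization means the same Arabic name can be written with all, some, or none of its vowel marks. As a rough illustration of the normalization problem (and not a description of how Rosette works internally), the Python sketch below strips combining marks so that the fully vocalized spelling of Raqqa quoted earlier compares equal to its unvocalized form. The function name `strip_vocalization` and the choice to fold alef wasla into plain alef are our assumptions.

```python
import unicodedata

ALEF_WASLA = "\u0671"  # alef with a wasla sign, as in the vocalized spelling
ALEF = "\u0627"        # plain alef


def strip_vocalization(text: str) -> str:
    """Remove Arabic vowel marks (nonspacing marks) and fold alef wasla."""
    no_marks = "".join(
        ch for ch in text if unicodedata.category(ch) != "Mn"
    )
    return no_marks.replace(ALEF_WASLA, ALEF)


# Raqqa with shadda and fatha marks, written as escapes for clarity.
vocalized = "\u0671\u0644\u0631\u0651\u064e\u0642\u0651\u064e\u0629"
# The same name with no vowel marks.
plain = "\u0627\u0644\u0631\u0642\u0629"

print(strip_vocalization(vocalized) == plain)  # prints True
```

Stripping marks makes the two spellings comparable, but it also discards the pronunciation cues a transliterator needs, so a real system has to recover the vowels rather than simply delete them.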
Turn to Rosette Name Translator to produce consistent spellings of names from non-Latin scripts. For searching documents or database fields that contain inconsistently transliterated names, rely on Rosette Name Indexer to improve your organization’s name-matching efforts today. Learn more about Rosette.
End Notes
[1] “Boston Marathon Bombing.” History.com, 2019. https://www.history.com/topics/21st-century/boston-marathon-bombings
[2] “How Many Alphabets Are There in the World Today?” Howmanyarethere.net, 2022. https://www.howmanyarethere.net/how-many-alphabets-are-there-in-today/