Babel Street’s Chief Innovation Officer, Gil Irizarry, spoke with our partner IMTF about the challenges of matching names in Chinese. What follows is an edited version of the conversation. You can view the full interview here.
Provide an overview of the complexity of Chinese scripts, how they are shared across languages, and the influence of culture, history, and politics on their development.
The Chinese writing system was developed over 4,000 years ago from pictures, known as pictographs and ideographs, that represented objects or things and eventually came to express concepts and ideas.
Ideographs are incredibly complex, with characters comprising numerous strokes. A stroke is the movement of a writing instrument on the page that is required to make an individual mark before lifting the instrument. The record is 172 strokes for the ancient word “Huang,” which has an unknown source and meaning.
Characters more typically have 10 to 12 strokes, and they represent a rich set of words and tones. Chinese is a tonal language: the same syllable can be pronounced with different tones (inflections) that change its meaning. When rendering Chinese ideographs into Latin characters, tonal marks are added to show how each syllable should be pronounced.
Given the history of the region, it’s not surprising that the Chinese writing system has been adopted by other countries. China refers to the characters as hanzi. Japan adapted hanzi into a system called kanji, which shares common meanings of the ideographs, but with Japanese pronunciation. Korean hanja is similar, and traditional Vietnamese also uses Chinese characters.
When looking at a string of Chinese ideographs, it’s important to know whether it represents Chinese, Japanese, or Korean. This is crucial for understanding how the word represented by the characters should be rendered.
What are the challenges of transliterating between languages that use the Han script and the Latin script?
Transliteration is a unique challenge because it involves representing the sounds of a language in a different writing script, such as representing the sounds and intonations of Chinese in a language based on the Latin script, like English.
As an example, the country known as “Italy” in English is known as “Italia” in the Italian language. The capital of that country is Rome in English, but in Italian it's Roma. If we represented the sounds of Italia and Roma in English, that would be transliteration.
What we do in this case, however, is translation, where we call the country Italy and its capital Rome.
For the capital of China, English speakers transliterate it as “Beijing” since we're trying to represent the sound as a Chinese speaker would say it. We don’t use “Northern Capital” — the literal translation of “Beijing” — since we want to preserve the Chinese pronunciation.
We also must consider the tonal mark. To transliterate Chinese ideographs precisely, we can use a system like Pinyin, in which the sounds of the ideographs are written as Latin characters with tonal marks. In this way we can capture both the sound and the tone of each syllable.
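To make that concrete, here is a minimal Python sketch of adding tone marks to Latin characters, converting numbered Pinyin such as “bei3 jing1” into tone-marked Pinyin (“běi jīng”). This is an illustration only, not a Babel Street component, and the placement rules are simplified to the common cases.

```python
# A simplified sketch of adding Pinyin tone marks to Latin characters.
import unicodedata

# Combining diacritics for tones 1-4; tone 5 (the neutral tone) gets no mark.
TONE_MARKS = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}

def mark_syllable(syllable: str) -> str:
    """Convert a numbered syllable such as 'jing1' into tone-marked 'jīng'."""
    tone, letters = int(syllable[-1]), syllable[:-1]
    if tone not in TONE_MARKS:
        return letters                      # neutral tone: no mark
    # Simplified placement rule: 'a' or 'e' takes the mark; in 'ou' it goes
    # on the 'o'; otherwise it goes on the last vowel in the syllable.
    if "a" in letters:
        idx = letters.index("a")
    elif "e" in letters:
        idx = letters.index("e")
    elif "ou" in letters:
        idx = letters.index("o")
    else:
        idx = max(letters.rfind(v) for v in "iouü")
    marked = letters[:idx + 1] + TONE_MARKS[tone] + letters[idx + 1:]
    return unicodedata.normalize("NFC", marked)

print(" ".join(mark_syllable(s) for s in "bei3 jing1".split()))  # -> běi jīng
```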
Again, it's important to know which language is using the Chinese characters. In Japanese or Korean the tonal marks would be very different, and in Japanese tone is not as important, so transliterations generally don’t need tonal marks at all. Knowing the language represented by the ideographs, and preserving the tonal marks where they matter, are both essential.
What is fuzzy name matching, how does it work, and how effective is it for non-Latin scripts?
Fuzzy name matching is a fault-tolerant process for comparing two names and assessing whether they match.
As an example, consider the name Sophia, commonly spelled S-O-P-H-I-A. Another common spelling is S-O-F-I-A. Neither one is right or wrong. Sophia has a Greek origin, and these are just different ways of representing the original Greek name.

A person may use a particular spelling of the name. But what if Sophia with a “ph” goes to an office where someone hears the name and writes it with an “f”? Did they get the name wrong? It’s wrong in the sense that it’s not the person’s name as recorded on their birth certificate, but it is a valid rendering of the name. If we want to look up that person’s medical record, we want to be able to match the Sophia with an “f” against a list or database containing Sophia with a “ph” and be sure we’ve found the right record. Fuzzy name matching software does essentially that.
How is that important for languages written in non-Latin scripts?
There are many ways of writing names that occur in these languages. Imagine a common Chinese name like Wu. Is it W-U or W-O-O? There are different standards for transliterating the same sounds, so it could be either. It can also depend on what the original script was; knowing the original characters tells us more about how to produce the correct transliteration.
This becomes much more difficult in languages whose scripts don't typically write the vowels.
For example, in Arabic the name Muhammad commonly gets rendered into English as Mohammad with an M-O or Muhammad with an M-U. Both are correct because written Arabic usually omits the short vowels, so the English vowels are just filling in for the way an Arabic speaker would say the name.

Fuzzy name matching software addresses these problems by being fault tolerant, being aware of transliteration standards, assessing the validity of transliterations, and ultimately determining whether two names match each other.
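To make the idea concrete, here is a toy Python sketch of fault-tolerant comparison. The variant-spelling rules and the use of a generic string-similarity ratio are illustrative assumptions, not how Babel Street Match actually works.

```python
# A toy fuzzy comparison: normalize known spelling variants first, then
# score whatever difference remains with a generic similarity ratio.
from difflib import SequenceMatcher

# Hypothetical variant-spelling rules (e.g. "ph" and "f" render the same sound).
VARIANTS = [("ph", "f"), ("oo", "u"), ("moh", "muh")]

def normalize(name: str) -> str:
    name = name.lower()
    for variant, canonical in VARIANTS:
        name = name.replace(variant, canonical)
    return name

def similarity(name_a: str, name_b: str) -> float:
    """Return a 0..1 score indicating how closely two names match."""
    return SequenceMatcher(None, normalize(name_a), normalize(name_b)).ratio()

print(similarity("Sophia", "Sofia"))       # 1.0 once the variants are normalized
print(similarity("Muhammad", "Mohammad"))  # 1.0 once the variants are normalized
print(similarity("Wu", "Woo"))             # 1.0 once the variants are normalized
print(similarity("Sophia", "Susan"))       # a much lower score
```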
What does “pairwise matching” refer to and how does it solve the problem of transliterating Chinese names?
Pairwise name matching is a narrower application of fuzzy name matching. Fuzzy name matching is a set of algorithms applied to names to determine whether they match. It's a very general set of algorithms used, for example, to see if a name appears on a list in a one-to-many comparison.
Pairwise matching applies when there are two names for a one-to-one comparison. Consider the example of Sophia spelled with “ph” and Sofia spelled with an “f.” Based on the fuzzy matching algorithms, a pairwise comparison will show the likelihood that these two names are a match.
Imagine now that an organization has a customer list and is searching the list for someone named “Sophia Jones” with Sophia spelled with a “ph.” The search result returns “Sofia Jones” with Sofia spelled with an “f.” Is that the same Sophia Jones? The existing system may say that it is.
Pairwise name matching can act as a sanity check on that system and provide a score that indicates the likelihood that those names are the same. Or, in cases that are less obvious, the pairwise comparison may disagree with the system if it doesn’t believe there’s a high likelihood that the names match.
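The distinction might look like this in code. The sketch below is self-contained and uses an assumed scoring function and threshold rather than the Match API: one function screens a list (one-to-many), while the other scores a single pair (one-to-one) so it can act as a sanity check on another system’s hit.

```python
# One-to-many screening versus one-to-one pairwise scoring (illustrative only).
from difflib import SequenceMatcher

def pairwise_score(name_a: str, name_b: str) -> float:
    """One-to-one comparison: likelihood that two names refer to the same person."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

def screen(query: str, name_list: list[str], threshold: float = 0.85) -> list[tuple[str, float]]:
    """One-to-many comparison: return every list entry whose score clears the threshold."""
    scored = [(name, pairwise_score(query, name)) for name in name_list]
    return [(name, score) for name, score in scored if score >= threshold]

# The existing system matched "Sophia Jones" to "Sofia Jones"; a high pairwise
# score supports that hit, while a low score would flag it for review.
print(pairwise_score("Sophia Jones", "Sofia Jones"))   # ~0.96
print(screen("Sophia Jones", ["Sofia Jones", "Stephen Jonas", "Wu Li"]))
```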
How does pairwise matching help with Chinese names?
Based on all the knowledge in our system, Babel Street Match transliterates Chinese names in a smart and fault-tolerant way. Then, we work in conjunction with an existing system as a sanity check to see whether the fuzzy pairwise match agrees with its match results.
How does the two-pass hybrid method work? How much does it reduce false positives, and how long does it take to screen names with this method?
Babel Street Match employs a two-pass system that addresses the issues of recall and precision. There are two challenges in matching names: how wide is the search, and how accurate is it? In other words, how many names come back as likely match candidates (recall), and how precise are we when evaluating those candidates (precision)?
Different name matching systems often try to optimize one over the other. With our two-pass approach, Match optimizes both. When Match receives a name, it encodes that name in a variety of different ways. One of those ways is a hash function that creates a set of hash keys to represent the name.
Then, when Match looks up the name, it tries to find all the names where the keys match. Those names become the candidates because their hashes match the name being queried. That’s the first pass, which has more breadth for recall. It yields likely candidates for a matching name.
With that set of likely candidates, Match applies AI machine learning algorithms to see which candidates most accurately match, accounting for the challenges of transliteration and tolerating things like misspellings and differing transliteration standards. That’s the second pass, for precision.
The two passes together generate an answer that has the widest breadth possible and the most accurate analysis possible. And this can be done in a matter of milliseconds, although it does depend a bit on the size of the data set being queried.
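Below is a hedged, simplified sketch of that two-pass idea. The hash key here is a crude consonant skeleton standing in for Match’s real encodings, and the second-pass scorer is a generic string ratio rather than the machine-learned models described next.

```python
# Pass 1: a cheap key widens recall by pulling back every name that shares
# the key. Pass 2: a finer scorer re-ranks those candidates for precision.
from collections import defaultdict
from difflib import SequenceMatcher

def hash_key(name: str) -> str:
    """Collapse a name to a rough consonant skeleton, e.g. 'Sophia Jones' -> 'SFJN'."""
    name = name.lower().replace("ph", "f")
    consonants = [c for c in name if c.isalpha() and c not in "aeiouhwy"]
    return "".join(dict.fromkeys(consonants)).upper()   # de-duplicate, keep order

class TwoPassIndex:
    def __init__(self, names: list[str]):
        self.buckets = defaultdict(list)
        for name in names:                               # index once, query many times
            self.buckets[hash_key(name)].append(name)

    def lookup(self, query: str, top_k: int = 3) -> list[tuple[str, float]]:
        candidates = self.buckets[hash_key(query)]       # pass 1: recall
        scored = [(name, SequenceMatcher(None, query.lower(), name.lower()).ratio())
                  for name in candidates]                # pass 2: precision
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

index = TwoPassIndex(["Sofia Jones", "Sophia Jonas", "Stephen Jones", "Wu Li", "Woo Lee"])
print(index.lookup("Sophia Jones"))   # Sofia Jones and Sophia Jonas come back, ranked
```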
Tell us about the deep learning neural network that you had to train for name matching.
The algorithms described in the previous answers are statistical models based on the hidden Markov model. To put that in computer science terms, it’s a finite state machine that looks at a particular input character and tries to decide the probability that it will yield a particular output character.
Think about the Greek alphabet, where alpha typically maps to an A, beta typically maps to a B, and gamma typically maps to a G. A finite state machine, when given an alpha as input, will almost always say that it should be output as an A. One would imagine that as this simple machine went down a string of characters and saw an alpha, it would emit an A, and when it saw a beta, it would emit a B.
But what happens when it encounters the Phi character? It’s typically a “ph” sound and there are different ways of rendering it in English. In Latin characters it could be rendered as a “ph” but it could also be rendered as an “f.”
How does the Markov model know that? This gets to machine learning and training the model with large banks of input names. The model learns that when encountering a Phi, for a certain percentage of time, a “ph” is emitted, and for another percentage of time an “f” is emitted. With that, the machine can rapidly run through input strings and emit output strings.
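As a toy illustration of that idea, the sketch below enumerates candidate Latin renderings of a Greek string using per-character emission probabilities. It is an emission-only simplification, not a full hidden Markov model, and the probabilities are invented for the example rather than learned from real training data.

```python
# P(output | input) for each character, as if learned from banks of training
# names. The values here are invented for illustration.
EMISSIONS = {
    "α": {"a": 1.0},
    "β": {"b": 0.95, "v": 0.05},
    "γ": {"g": 1.0},
    "φ": {"ph": 0.7, "f": 0.3},
}

def transliterations(name: str, beam: int = 4) -> list[tuple[str, float]]:
    """Enumerate likely Latin renderings of a Greek string with their probabilities."""
    candidates = [("", 1.0)]
    for char in name:
        options = EMISSIONS.get(char, {char: 1.0})      # unknown characters pass through
        candidates = [(prefix + output, p * q)
                      for prefix, p in candidates
                      for output, q in options.items()]
        candidates = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    return candidates

print(transliterations("φα"))   # [('pha', 0.7), ('fa', 0.3)]
```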
But Match doesn’t only have a simple finite state machine — it also has a neural network. A neural network is a kind of finite state machine with a far deeper and richer set of states. It's still based on probabilities and learns by machine learning in the sense that it takes banks of input to build the probability between the states.
But as the name implies, a neural network tries to represent the kinds of interconnections that, say, neurons have. There isn’t a single state or a single input/output result.
Using the neural network, we want to look at sequences of input and output characters. Given an input sequence, what's the probability that a particular output sequence might be a match? With a large enough data set training the model, we can gain a more accurate determination of how two strings match. As an example, we trained a deep learning neural model for matching Japanese katakana to English as represented by the Latin character set.
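For a sense of what such a model can look like, here is a small PyTorch sketch of a siamese character encoder that scores a katakana string against a Latin-script string. This is one possible architecture offered for illustration, not the model Babel Street trained, and it would need to be trained on real matched and unmatched name pairs before its scores meant anything.

```python
# A toy cross-script matcher: each script gets its own character encoder, and
# the match score is the similarity of the two encodings. Untrained weights
# produce meaningless scores; real use requires training on name pairs.
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    """Encode a character-id sequence into a fixed-size vector with a GRU."""
    def __init__(self, vocab_size: int, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        _, hidden = self.gru(self.embed(ids))
        return hidden[-1]                     # final hidden state per sequence

class PairMatcher(nn.Module):
    """Score how likely it is that two sequences in different scripts match."""
    def __init__(self, src_vocab: int, tgt_vocab: int, dim: int = 64):
        super().__init__()
        self.src_encoder = CharEncoder(src_vocab, dim)
        self.tgt_encoder = CharEncoder(tgt_vocab, dim)

    def forward(self, src_ids: torch.Tensor, tgt_ids: torch.Tensor) -> torch.Tensor:
        src_vec = self.src_encoder(src_ids)
        tgt_vec = self.tgt_encoder(tgt_ids)
        return torch.sigmoid(nn.functional.cosine_similarity(src_vec, tgt_vec))

# Hypothetical integer encodings of a katakana name and a Latin rendering.
model = PairMatcher(src_vocab=100, tgt_vocab=60)
katakana_ids = torch.tensor([[5, 17, 42, 8]])      # e.g. a katakana spelling as ids
latin_ids = torch.tensor([[3, 8, 15, 2, 9, 4]])    # e.g. "sophia" as ids
print(model(katakana_ids, latin_ids))              # untrained score, for shape only
```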
What are some of the future developments for Match and how will they improve regulatory name screening?
An upcoming set of improvements to Babel Street Match involves speed. No matter how much we improve the matching process, people always want it to be faster, and with good reason. Speed matters, especially in high-volume environments like payment processing, where many names need to be screened simultaneously. Responses can take only a few milliseconds, so we’re always optimizing our name matching solution to run as fast as possible.
Specifically, we're looking at running Match on graphics processing units (GPUs). Our AI machine learning algorithms, deep learning models, and Markov models lend themselves quite well to running on GPUs, so we're looking at hardware acceleration to make Match run as fast as possible.
Another area is language support. We support about 25 languages and we're always looking to add more. That involves a variety of efforts.
One example is name frequency models. Say Match is looking for a name like John Smith. The words “John” and “Smith” are both relatively common in a namespace. Now imagine a name like John Rutherford. “Rutherford” would be far less common than “Smith,” so we would want to give “Rutherford” more weight than “John” when calculating the overall name match score. So, in addition to the matching algorithms, we also have these name frequency models, which let us evaluate and weight the name components.
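A back-of-the-envelope sketch of that weighting idea follows. The frequencies, the fallback value, and the inverse-log weighting are assumptions made for illustration, not the actual frequency model.

```python
# Rarer name components contribute more to the overall score than common ones.
import math

# Hypothetical relative frequencies of name components in some namespace.
FREQUENCY = {"john": 0.02, "smith": 0.01, "rutherford": 0.0002}
DEFAULT_FREQUENCY = 0.001   # assumed fallback for unseen components

def weight(component: str) -> float:
    """Inverse-log frequency: the rarer the component, the larger its weight."""
    return -math.log(FREQUENCY.get(component.lower(), DEFAULT_FREQUENCY))

def weighted_score(component_scores: dict[str, float]) -> float:
    """Combine per-component match scores into one frequency-weighted score."""
    total = sum(weight(component) for component in component_scores)
    return sum(weight(c) * score for c, score in component_scores.items()) / total

# Assumed per-component similarity scores for "John Rutherford" vs. a candidate:
# a perfect match on the rare surname outweighs a weaker match on "John".
print(weighted_score({"john": 0.7, "rutherford": 1.0}))   # ~0.91, pulled toward 1.0
```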
These are things that we are constantly looking to improve for Babel Street Match. Adding languages means gathering data to inform these kinds of scoring algorithms. With existing names, we also want to reevaluate them and make sure we're doing this as accurately as possible, in part because names change.
Previously, Muhammad as a first name might have been rare. Given immigration trends, Muhammad is now more common in Western countries so the name frequency model must adapt. That’s why we're always updating our models.
We’re always happy to hear from customers on other ways the system can improve.
Disclaimer:
All names, companies, and incidents portrayed in this document are fictitious. No identification with actual persons (living or deceased), places, companies, or products is intended or should be inferred.