Shakespeare asked “What’s in a name?” It turns out there’s a lot in the name of a typical Afghani, including common nouns and personal titles that are not used as titles! One of our linguistic experts, Bushra Zawaydeh, wrote today’s post about the challenges natural language processing software faces in automatically extracting names of people from text written in the two most widely used languages of Afghanistan: Pashto and Dari (a Persian dialect). Read on to learn how these languages give headaches to any entity extractor.
Afghani names are a challenge to intelligence agencies as name spellings frequently vary when written in English. Consequently, automatic correlation of data points—using search engines and natural language processing systems—is difficult. I will discuss the cultural aspects, linguistic properties, and composition of Afghani names and how they challenge these software tools.
Languages of Afghanistan
Understanding the names begins with understanding Pashto and Dari, the official languages of Afghanistan (CIA Factbook). They are the most widely used of the roughly 30 languages spoken in the country. Both Pashto and Dari are Indo-European languages that adopted the Arabic script in the 7th century.
Dari is a dialect of Persian spoken in Afghanistan. When the Persian language adopted the Arabic script, four letters were added to the Arabic script to fit its phonology (Peh پ, Tcheh چ, Jeh ژ, Gaf گ). The shapes of the Arabic Kaf ك and Arabic Yeh ي were also modified; the Persian counterparts are ک and ی. Subsequently, Pashto adopted the Perso-Arabic script, adding eight more letters to the Persian letters (four retroflex consonants ډ ړ ڼ ټ, velar fricatives “ghe” ږ and “xin” ښ, and dental affricates /dz/ ځ and /ts/ څ).
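The letter inventory above suggests a cheap first-pass hint about which language a snippet is written in. The following is a minimal sketch, not a real language identifier: the function name guess_script and its return labels are hypothetical, and the only evidence it uses is the presence of the letters listed in this post. Short texts that happen to contain none of these letters stay undetermined.

```python
# Letters Persian added to the Arabic script (per the list in this post).
PERSIAN_LETTERS = {
    "\u067e",  # PEH
    "\u0686",  # TCHEH
    "\u0698",  # JEH
    "\u06af",  # GAF
}

# Letters Pashto added on top of the Persian set (per the list in this post).
PASHTO_LETTERS = {
    "\u0689",  # DAL WITH RING (retroflex d)
    "\u0693",  # REH WITH RING (retroflex r)
    "\u06bc",  # NOON WITH RING (retroflex n)
    "\u067c",  # TEH WITH RING (retroflex t)
    "\u0696",  # REH WITH DOT BELOW AND DOT ABOVE ("ghe")
    "\u069a",  # SEEN WITH DOT BELOW AND DOT ABOVE ("xin")
    "\u0681",  # HAH WITH HAMZA ABOVE (/dz/)
    "\u0685",  # HAH WITH THREE DOTS ABOVE (/ts/)
}


def guess_script(text):
    """Return a coarse, hypothetical label based only on letter inventory."""
    letters = set(text)
    if letters & PASHTO_LETTERS:
        return "pashto"
    if letters & PERSIAN_LETTERS:
        # Pashto text can also contain these letters, so this is ambiguous.
        return "persian_or_pashto"
    return "undetermined"
```

A Pashto-only letter is strong evidence for Pashto, while a Persian letter alone cannot separate Dari from Pashto, which is why the middle label stays deliberately vague.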
Name origins and composition
Since the official religion in Afghanistan is Islam, the majority of Afghani names are Arabic (the language of the Quran and Islam), but pronounced in the local dialect. Most Afghanis don’t have family names. Some Afghani given names are composed of one word, such as “Khalil” or “Farid.” Others have two parts (Miran, 1975): the actual given name combined with a subordinating common name that precedes or follows it, such as -Ullah, Jan, Ali, Gholam, Abdul, Mohammad, Din, Khan, and Shah.
For example, in the names “Mohammad Nasim” and “Abdul Ghafoor,” “Mohammad” and “Abdul” are less useful for identification than “Nasim” and “Ghafoor.” Unlike in Arabic names, the subordinating name and proper name are not related to the individual’s parents or grandparents.
In conversations, the titles “Khan” (Mr.) and “Jan” (Mrs., Miss) may follow the given name of men and women, respectively. Although “Khan” only follows male given names, “Jan” can be used either as a male given name or a female title. Other religious, royal, occupational, and military honorific titles that precede a name can be mistaken for part of a given name. Examples include “Agha” (meaning “Mr.”), “Mullah” (a Muslim cleric title), “Khwaja” (meaning “lord”), and “Akhund” (meaning “Muslim cleric”). Examples:
Akhund Khel اخند خېل
Akhund Mullah Obaidullah اخوند ملا عبيدالله
Khwaja Muhammad Bangesh خواجه محمد بنګش
Pashtun tribal names, such as Kasi, Tareen, Yusafzai, Tarkalani, and Mohmand, are used as surnames. Examples:
Abasin Yousufzai اباسين يوسفزی
Sher Muhammad Mohmand شېر محمد مهمند
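To make the composition rules above concrete, here is a rough sketch of how a romanized name might be split into leading titles, subordinating parts, and the parts that are most useful for identification. The function analyze_name and the two word lists are hypothetical and contain only words mentioned in this post (plus the spelling “Muhammad” from the examples). It matches whole words only, so a suffix such as -Ullah inside “Obaidullah” is not detected, and it cannot tell when “Jan” is a given name rather than a title.

```python
# Word lists drawn from the examples in this post; a real system would need
# far larger lists and many spelling variants.
LEADING_TITLES = {"agha", "mullah", "khwaja", "akhund"}
SUBORDINATE_PARTS = {
    "abdul", "mohammad", "muhammad", "gholam",
    "ali", "din", "shah", "khan", "jan",
}


def analyze_name(name):
    """Split a romanized name into titles, subordinate and distinctive parts."""
    tokens = name.split()
    titles = []
    # Peel honorifics off the front, but always leave at least one token.
    while len(tokens) > 1 and tokens[0].lower() in LEADING_TITLES:
        titles.append(tokens.pop(0))
    subordinate = [t for t in tokens if t.lower() in SUBORDINATE_PARTS]
    distinctive = [t for t in tokens if t.lower() not in SUBORDINATE_PARTS]
    return {
        "titles": titles,
        "subordinate": subordinate,
        "distinctive": distinctive,
    }
```

For “Khwaja Muhammad Bangesh” this returns “Khwaja” as a title, “Muhammad” as a subordinating part, and “Bangesh” as the distinctive part. Whether a word such as “Akhund” is a title or part of the name in a particular case still needs context that a word list cannot supply.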
Challenges to natural language processing for names
Entity extraction in Afghani languages, especially Pashto, must confront non-standard segmentation and spelling. Pashto does not yet have the level of literacy or literature, especially in digital forms, to encourage such standardization. One Pashto reader guessed that the number of printed books in Pashto is fewer than the number of hairs on his head! Spaces are often unintentionally inserted or omitted, causing words to be split apart or joined to punctuation or other words. Even if some convention in the use of spaces is applied, the spelling of the words varies considerably.
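One way to cope with stray or missing spaces when looking for a known name is to match it while ignoring whitespace entirely. The sketch below is an illustration under assumptions, not a description of any particular product: the helper spacing_tolerant_pattern is hypothetical, and treating the zero-width non-joiner (U+200C) like a space is a choice made here. It works on code points, so it applies equally to Arabic-script and romanized text; the usage examples are romanized for readability.

```python
import re

# Characters ignored between letters: any whitespace plus the
# zero-width non-joiner (an assumption of this sketch).
GAP = r"[\s\u200c]*"


def spacing_tolerant_pattern(name):
    """Compile a regex that matches `name` with spaces inserted or omitted."""
    chars = [c for c in name if not c.isspace() and c != "\u200c"]
    if not chars:
        raise ValueError("name must contain at least one non-space character")
    # Allow an optional gap between every pair of adjacent characters.
    return re.compile(GAP.join(re.escape(c) for c in chars))
```

For example, spacing_tolerant_pattern("Obaidullah") finds the name in both "Akhund Mullah Obaid ullah" and "MullahObaidullah". The trade-off is precision: because word boundaries are ignored, short names will also match inside longer, unrelated words, so the hits are candidates that need further checking.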
Name translation between English and Afghani languages requires knowledge of how the name is pronounced, which in turn requires knowledge of the short vowels, which native speakers do not conventionally write. Specifying those short vowels is itself a challenge: Pashto has no convention for doing so, and applying the convention of Modern Standard Arabic is hampered by Pashto’s additional vowels.
A general obstacle that applies to all Arabic script languages is the many ways Arabic names can be written in English (Latin script). For example, a name such as خلود can be written in many ways in English, including Kholoud, Kulud, Khulud, and many more variations. Translating into Pashto or Dari is also fraught with a certain amount of ambiguity. Consider the name “Hamid,” which can be translated to حامد or حميد. In these cases, software that applies a consistent translation standard to Pashto and Dari names can reduce errors by human translators and speed up translations.
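A common way to correlate English spellings like these is to reduce each one to a rough key and group names that share it. The sketch below is one crude way to do that and is not an established transliteration standard: the function names and the folding rules (dropping the vowels a, e, i, o, u, folding “kh” to “k” and “gh” to “g”, collapsing doubled letters) are assumptions chosen so that the spellings mentioned in this post fall together.

```python
import re
from collections import defaultdict

# Digraph foldings assumed for this sketch; they deliberately lose information.
DIGRAPHS = {"kh": "k", "gh": "g"}


def skeleton_key(name):
    """Reduce a romanized name to a vowel-free consonant skeleton."""
    s = re.sub(r"[^a-z]", "", name.lower())   # keep ASCII letters only
    for digraph, folded in DIGRAPHS.items():
        s = s.replace(digraph, folded)
    s = re.sub(r"[aeiou]", "", s)              # short vowels vary the most
    return re.sub(r"(.)\1+", r"\1", s)         # collapse doubled letters


def group_variants(names):
    """Group names whose skeleton keys are identical."""
    groups = defaultdict(list)
    for name in names:
        groups[skeleton_key(name)].append(name)
    return dict(groups)
```

With these rules, Kholoud, Kulud, and Khulud share one key, as do Mohammad and Muhammad, and Yousufzai and Yusafzai. The same looseness causes false matches: the key cannot separate the two Arabic-script names behind “Hamid,” so it is suited to producing candidates for review rather than final decisions.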
Finally, when typing, Pashto and Dari writers may substitute the Arabic kaf and yeh for the Pashto or Dari counterparts, meaning that any natural language processing of Pashto and Dari must be able to normalize such letters.
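The normalization step described here can be sketched as a simple character mapping. This is a minimal illustration, assuming that the Persian forms (ک and ی) are the target and that only these two letters are folded; the name fold_kaf_yeh is hypothetical. A real Pashto pipeline may need a finer policy, since Pashto distinguishes several forms of yeh.

```python
# Map the Arabic letters that are often typed by mistake to the Persian forms.
FOLD_TABLE = str.maketrans({
    "\u0643": "\u06a9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064a": "\u06cc",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
})


def fold_kaf_yeh(text):
    """Return `text` with Arabic kaf and yeh replaced by the Persian forms."""
    return text.translate(FOLD_TABLE)
```

Applying the same folding to both the indexed text and the query means that two spellings of a name that differ only in which kaf or yeh was typed compare as equal.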
Given the scarcity of qualified native speakers of Afghani languages, natural language processing tools are needed to automatically triage incoming data to send the highest priority messages to be translated. For these tools to work well, being able to extract named entities and translate names—often the key points in data—is an absolute necessity.