What does it mean to lemmatize and normalize emoji for text analytics?
Emoticons 🙂 and emoji 😀😆 add a bit of the nonverbal communication that humans inherently crave in our electronic communications. The addition of a winking face 😉 softens a potentially harsh statement or expresses shared camaraderie far more succinctly and immediately than words. So it’s not a surprise that these modern hieroglyphs pop up a lot in social media and text messages [1].
A history of emoji
Just as language evolves, so have emoji. Emoji first appeared in 1999 on Japanese mobile phones, and in 2007 they were approved for addition to the Unicode Standard. The goal was to sort out the mess of incompatible emoji sets among the Japanese mobile phone carriers, each of which was trying to use the Unicode private use area [2].
In the beginning, Japanese carriers used a light-colored skin tone for all emoji depicting people or body parts. However, responding to users’ desire to reflect human diversity in their communication, Unicode introduced skin tone modifiers for human emoji in Unicode 8.0 (released in mid-2015).
The set of standard emoji has also expanded to reflect diversity in other ways. The emoji “kiss” 💏 (U+1F48F) usually depicts a man and a woman, and was approved as part of Unicode 6.0 in 2010. From there it wasn’t a far leap to users wanting to depict “a man and a man kissing” or “a woman and a woman kissing,” and Unicode lets one do that with the zero-width joiner (ZWJ). A ZWJ sequence produces a glyph that looks like, and is treated as, a single character, but is actually a sequence of several characters (👨 Man, ZWJ, ❤ Heavy Black Heart, ZWJ, 💋 Kiss Mark, ZWJ, 👨 Man).
Similarly, the single character “kiss” 💏 (U+1F48F) could be represented as the sequence (👩 Woman, ZWJ, ❤ Heavy Black Heart, ZWJ, 💋 Kiss Mark, ZWJ, 👨 Man). This flexibility created situations where the same emoji can be represented in more than one way.
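To make the "one glyph, several characters" point concrete, here is a minimal Python sketch that builds the man-man kiss sequence exactly as listed above and prints its code points. The helper name `describe` is made up for illustration. Note that sequences typed on a real emoji keyboard commonly include a variation selector (U+FE0F) after the heart; it is left out here to match the text.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Man, ZWJ, heart, ZWJ, kiss mark, ZWJ, man (the sequence listed in the text)
kiss_man_man = ZWJ.join(["\U0001F468", "\u2764", "\U0001F48B", "\U0001F468"])


def describe(text):
    """Return (code point, Unicode name) pairs, one per character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


for code_point, name in describe(kiss_man_man):
    print(code_point, name)

# Seven code points, even though it typically renders as a single glyph.
print(len(kiss_man_man))
```

A tokenizer that splits on code points would see seven tokens here, which is why emoji-aware tokenization has to treat the whole sequence as one unit.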
Emoji and text analytics
For those in the text analytics world, the addition of skin tones as emoji modifiers, and the ability to depict a character in more than one way, mean that we need a way to canonicalize (or normalize) emoji for efficient processing. Suppose that a feedback analysis application is using these tokens downstream. Does the boy emoji 👦 combined with one of five different skin tones 👦🏻👦🏼👦🏽👦🏾👦🏿 really change the fact that it represents a boy? In most cases, no. Of course, if those modifiers are important to the meaning, the surface form can be used as is.
Is there a meaningful difference between “kiss” depicted as one character vs. several? It is about the same as the Japanese katakana “ga” being represented as a single character (ガ) versus two characters (カ plus the voiced sound mark ゛). In most cases the difference is meaningless and should be removed.
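As a rough illustration of the skin tone question, the sketch below strips the five skin tone modifiers (code points U+1F3FB through U+1F3FF) so that all five variants collapse to the plain boy emoji. The function name `strip_skin_tone` is hypothetical, and this is only one simple way to do it, not a description of any particular product.

```python
import re

# The five emoji skin tone modifiers occupy U+1F3FB through U+1F3FF.
SKIN_TONE = re.compile("[\U0001F3FB-\U0001F3FF]")


def strip_skin_tone(text):
    """Remove skin tone modifiers, leaving the base emoji untouched."""
    return SKIN_TONE.sub("", text)


BOY = "\U0001F466"
variants = [BOY + chr(cp) for cp in range(0x1F3FB, 0x1F400)]

# All five surface forms reduce to the same base form.
print({strip_skin_tone(v) for v in variants} == {BOY})
```

An application that does care about the modifiers can simply keep the original surface form alongside the stripped one.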
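The katakana case can be tried directly with standard Unicode normalization. The sketch below uses Python's `unicodedata.normalize` with NFC; the combining form of the voiced sound mark is U+3099, and the wrapper name `compose` is made up here. It also shows that NFC leaves the multi-character "kiss" sequence alone, so collapsing emoji sequences needs emoji-specific handling on top of standard normalization.

```python
import unicodedata


def compose(text):
    """Apply Unicode NFC normalization (canonical composition)."""
    return unicodedata.normalize("NFC", text)


ga_single = "\u30AC"        # katakana GA as one character
ga_pair = "\u30AB\u3099"    # katakana KA + combining voiced sound mark

print(compose(ga_pair) == ga_single)

# Woman, ZWJ, heart, ZWJ, kiss mark, ZWJ, man: NFC does not collapse this
# sequence to the single "kiss" emoji U+1F48F.
ZWJ = "\u200d"
kiss_sequence = ZWJ.join(["\U0001F469", "\u2764", "\U0001F48B", "\U0001F468"])
print(compose(kiss_sequence) == kiss_sequence)
```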
Rosette tackles emoji
So as part of Rosette 1.7, the tokenization and morphological analysis endpoints now support tokenizing and part-of-speech tagging for emoticons and emoji (as well as hashtags, @mentions, emails and URLs). This same functionality is also supported in our on-premises API and SDK. Furthermore, Rosette will lemmatize emoji by removing the skin tone and gender modifiers used with “people” emoji such as “surfer,” and will normalize multi-character emoji in the text stream to single emoji characters where they exist.
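To give a feel for what lemmatizing and normalizing an emoji token might involve, here is a small Python sketch. It is not Rosette's implementation or API: the function `lemmatize_emoji` and the one-entry table `SEQUENCE_TO_SINGLE` are assumptions made for illustration, and a real system would need a far more complete table.

```python
import re

ZWJ = "\u200d"
VS16 = "\ufe0f"  # variation selector-16, requests emoji presentation

# Skin tone modifiers (U+1F3FB..U+1F3FF).
SKIN_TONE = re.compile("[\U0001F3FB-\U0001F3FF]")
# A ZWJ followed by a female or male sign, optionally followed by VS16.
GENDER_SUFFIX = re.compile(ZWJ + "[\u2640\u2642]" + VS16 + "?")

# Hypothetical, deliberately tiny table: multi-character sequence -> single emoji.
SEQUENCE_TO_SINGLE = {
    # woman, ZWJ, heart, ZWJ, kiss mark, ZWJ, man -> "kiss" U+1F48F
    ZWJ.join(["\U0001F469", "\u2764", "\U0001F48B", "\U0001F468"]): "\U0001F48F",
}


def lemmatize_emoji(token):
    """Strip skin tone and gender modifiers, then collapse known sequences."""
    token = SKIN_TONE.sub("", token)
    token = GENDER_SUFFIX.sub("", token)
    # Look up without VS16 so both spellings of the heart match the table;
    # tokens that are not in the table are returned as they are.
    return SEQUENCE_TO_SINGLE.get(token.replace(VS16, ""), token)


SURFER = "\U0001F3C4"
woman_surfer_medium = SURFER + "\U0001F3FD" + ZWJ + "\u2640" + VS16
print(lemmatize_emoji(woman_surfer_medium) == SURFER)
```

Keeping the original surface form next to the lemma lets downstream code choose whichever is appropriate for its task.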
We enjoy diversity in our lives, but our language analyzers, not so much 😉
For additional reading, check out Unicode Technical Report #51 and Emojipedia.org.
End notes
[1] Emoji popularity on Twitter is tracked at http://www.emojitracker.com/.
[2] The Unicode private use area (PUA) is a range of codepoints in the Unicode Standard that are not officially assigned characters. Users can therefore assign whatever characters they want to these codepoints, but if a document uses PUA codepoints to which a different program has assigned other characters, the result is an incompatibility clash.