How to Write Annotation Guidelines for Entity Extraction

Writing annotation guidelines for entity extraction for adverse media screening in financial compliance

As much as we love to think our machine learning algorithms are the most important part of training up a model, they need human-annotated data to get started, and that data also serves, very importantly, as the benchmark for evaluating a model's accuracy.

An inescapable, essential requirement for producing good training data is a solid set of annotation guidelines. These guidelines distinguish correct from incorrect results. They also define the task and ensure that the annotation process is reliable and repeatable for independent human annotators. If the manual annotation process is not reliable, the supervised machine learning models may be confused by inconsistencies in the training data.

Getting Started

Before starting, make sure your use case is clear and thoroughly understood. For what task and goal are you extracting entities? Without careful consideration, you may end up training a model that doesn’t provide useful results for your downstream processes.

Here are the steps that will take you through the process of writing annotation guidelines using the example of entity extraction:

  1. Decide what purpose or use case the NLP results will serve
  2. Decide what entities you need, and which ones are suited to machine learning
  3. Write a first draft of the guidelines
  4. Run a pilot annotation project with a quantity of data* (minimum of three annotators)
    1. Check for inter-annotator agreement to see where the annotators tagged text differently (a minimal agreement-scoring sketch in Python appears after this list)
    2. Have a meeting with the annotators to figure out what was unclear or erroneous
    3. Update the guidelines to address points of confusion
  5. Try the “results” (human-tagged entities) in your downstream process, and ask, am I getting the results I expect?
  6. Edit the guidelines.

*The quantity of data varies, but a rule of thumb is to do a pilot annotation project that both tries out the guidelines and trains the annotators before launching into the bulk of the work. Aim for annotators to independently tag for two days and produce just enough data so that you and they can review their disagreements together in a 30-minute meeting. At the beginning of an annotation project, you will need to meet with annotators frequently to review annotation disagreements and edit the guidelines. For more details on the annotation process, see our blog post “A Day in the Life of… Building a New Entity Extraction Model.”
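
To make step 4a concrete, one simple way to quantify agreement is to score each annotator's spans against each other annotator's, exactly as you would score a model against a gold standard. The sketch below is only a minimal illustration; the offsets, entity types, and annotator names are made up, and in practice you would load the spans from whatever format your annotation tool exports.

```python
from itertools import combinations

# Each tagged entity is a (start_offset, end_offset, entity_type) tuple.
def pairwise_f1(spans_a: set, spans_b: set) -> float:
    """F1 between two annotators, counting exact span + type matches as agreement."""
    if not spans_a and not spans_b:
        return 1.0
    matches = len(spans_a & spans_b)
    precision = matches / len(spans_a) if spans_a else 0.0
    recall = matches / len(spans_b) if spans_b else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations of the same document by three annotators
annotations = {
    "ann1": {(0, 14, "PERSON"), (23, 31, "ORG")},
    "ann2": {(0, 14, "PERSON"), (23, 31, "ORG"), (40, 45, "TITLE")},
    "ann3": {(0, 14, "PERSON")},
}

# Report agreement for every annotator pair; low pairs point to unclear guidelines
for a, b in combinations(annotations, 2):
    print(f"{a} vs {b}: F1 = {pairwise_f1(annotations[a], annotations[b]):.2f}")
```

Pairs with low scores point to entity types or contexts where the guidelines need more examples; those are the disagreements to walk through in the review meeting.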

Revise, Revise, Revise

It’s important to remember that writing guidelines is an iterative process. The pilot annotation project is as much a training session for the annotators as it is a proof-of-concept to confirm which entities you should be annotating. You may find yourself asking whether a different extraction method is better suited to a particular entity type. When you work through the inter-annotator agreement with your annotators, the question of which tagging was correct might require you to refer back to your use case, and refine the purpose your NLP will be serving. You will definitely have to update your guidelines to address ambiguous and borderline cases that come up during discussions with annotators.

What Will You Do with the NLP Results?

For the purpose of illustration, let’s suppose we are writing annotation guidelines for entity extraction, and that these entities will feed into a system that is detecting negative news about certain individuals or organizations — aka, adverse media screening for financial compliance. (Banks are heavily penalized by regulators if they transact business with people and organizations who are involved in crime and misconduct.)

Financial regulations require enhanced screening of politically exposed persons (PEPs) who tend to be well-known and well-connected. The financial compliance system needs to pull up enough information to know that:

  • It found the right person to screen — and not someone else with the same name
  • It is negative news about the person — and not just generic news, as well-known people frequently appear in stories.

The corpus will consist of news and financial news articles.

Based on this task, we want to extract people, locations, organizations, titles of people, dates, currency, and types of crime and misconduct. For example, entities like locations, organizations, dates, and titles can help determine which “Michael Jordan” you have found (the basketball player or the former CEO of PepsiCo). Suppose that in our corpus, dates, currency, and types of crime are too sparse (don’t appear often enough) for machine learning. Follow this rule of thumb: You need a minimum of 1,000 instances of each entity type, but the more the better.

For sparse entities, use a method other than machine learning. Dates are suited to pattern matching, as they are written in set formats. Types of crime are possibly suited to entity lists, but we might also consider leveraging the entity extraction to build an event extraction model around specific types of criminal activity.
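
To illustrate what those non-machine-learning methods might look like, here is a small sketch that pairs a regular expression for one common date format with a keyword list (gazetteer) for crime types. Both the pattern and the list are placeholders for illustration only; real coverage would need many more date formats and a much larger, curated term list, or the event-extraction approach mentioned above.

```python
import re

# Illustrative pattern for dates like "March 13, 2020"; real coverage needs many more formats
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b"
)

# Hypothetical entity list for crime and misconduct types
CRIME_TERMS = {"money laundering", "fraud", "embezzlement", "bribery", "insider trading"}

def extract_sparse_entities(text: str):
    """Return (start, end, type, surface) tuples for dates and listed crime terms."""
    results = []
    for m in DATE_PATTERN.finditer(text):
        results.append((m.start(), m.end(), "DATE", m.group()))
    lowered = text.lower()
    for term in CRIME_TERMS:
        start = lowered.find(term)
        while start != -1:
            results.append((start, start + len(term), "CRIME", text[start:start + len(term)]))
            start = lowered.find(term, start + 1)
    return sorted(results)

print(extract_sparse_entities(
    "On March 13, 2020, the former CFO was charged with money laundering and fraud."
))
```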

What to Include in Your Annotation Guidelines

Now let’s go through the steps for writing guidelines in greater detail. Suppose after evaluating your downstream needs, you decide to tag names of people, locations, organizations, nationalities, and titles of people.

Define these entity types and give examples of entities and non-entities; a structured example of one guideline entry follows the list below. The essential components of named entity recognition (NER) guidelines are:

  1. Motivation for and a description of the entity types of interest
  2. Several positive examples of each type (“Do annotate X in the following context: ____”)
  3. Several negative examples of each type (“Do NOT annotate X in the following context: ____”), including non-entities and instances of ambiguity between different types
  4. An instruction to tag obvious misspellings of entities, since misspelled words are still a true signal of where a real entity would appear
  5. Examples drawn from cases where annotators disagreed significantly during the iterative development process, as these are usually the most instructive
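
Although guidelines usually live in an ordinary document, it can help to keep these five components organized per entity type. Purely as an illustration, a guideline entry for a TITLE type might be recorded like this; the field names and examples are hypothetical, not a prescribed schema.

```python
# Hypothetical structure for one entity type's guideline entry; field names are illustrative
TITLE_GUIDELINE = {
    "type": "TITLE",
    "motivation": "Titles help disambiguate which person a news story is about.",
    "definition": "A word or phrase describing a person's role or occupation.",
    "do_annotate": [
        {"text": "Prime Minister Haider al-Abadi", "span": "Prime Minister"},
        {"text": "VP, Big Data Lab, and Senior Fellow", "span": "VP, Big Data Lab, and Senior Fellow"},
    ],
    "do_not_annotate": [
        {"text": "The HR manager sent out a memo.",
         "reason": "Title without an accompanying person name (if your guidelines exclude these)."},
    ],
    "known_disagreements": [
        "Whether to tag bare occupations such as 'dancer' or 'lawyer'.",
    ],
}

for example in TITLE_GUIDELINE["do_annotate"]:
    print("DO tag:", repr(example["span"]), "in", repr(example["text"]))
```

Recording known disagreements alongside the examples makes it easy to revisit them each time the guidelines are revised.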

Define how you will handle:

  • Punctuation — Do you tag embedded punctuation? Example: “AT&T,” “Hitachi, Ltd.”
  • Fictional entities — Do you tag fictional entities? Example: “Harry Potter,” “Mickey Mouse”
  • Metonyms — Do you tag a nominal expression that represents an entity associated with a different entity? Example: “The White House condemned the attacks on the refugees.” Sometimes White House is a location, but here it is used to reference the executive branch of the U.S. government.
  • Proper nouns as verbs or compound nouns — Do you tag “Twitterverse” or “googling”? The answer depends on your downstream needs. For financial compliance, this is likely unnecessary.
  • Titles — What titles do you tag? There is an enormous range of titles from Mr. and Ms. to simple occupations like “dancer” or “lawyer” to long descriptive titles like “VP, Big Data Lab, and Senior Fellow, Information Research and Development.” You also need to decide if it is useful to tag a title that doesn’t appear with a person’s name. Example: “The HR manager sent out a memo.”

More questions for you to answer:

  • When is a location a nationality? Is a location used as an adjective always a nationality? Does it depend on context? Example: “Swiss pharmaceutical company Novartis was featured in a documentary by Deutsche Welle, Germany’s international broadcaster.” How do you tag “Swiss” versus “Germany’s”? In some languages, such as Japanese, this is very tricky because “Swiss” and “Switzerland” are represented by the same word, スイス.
  • How do you annotate adjacent locations? Do you tag them as a single span or separately in examples such as “Jamba Juice opened its doors at 123 Main Street, Springfield, MA” or “We visited Boston, New York, and Philadelphia”?
  • Do you tag location qualifiers? It’s clear one should tag “New York” as a location, but what about “the region of Upstate New York…”? We might say to only tag “Upstate New York” in English, but what about in Japanese, where the whole phrase looks like one word?
  • Do you annotate generic and specific instances of an entity? It’s easy to say “annotate March 13, 2020,” but what about “mid-June” or “from dawn to dusk”?
  • How do you handle embedded entities? (One possible offset-based representation is sketched after this list.)
    • Titles of people often include an organization name. Do you tag the longest entity, or does your task require tagging the entities embedded within the title? Example: “staff member of the Iraqi cabinet of Prime Minister Haider al-Abadi.” Do you tag “Iraqi” as a nationality, “Iraqi cabinet” as an organization, “Haider al-Abadi” as a person, or the entire string as a title?
    • “While visiting Hyogo Prefecture, I met my friend at Yamashita Ekimae.” In Japanese, “ekimae” refers to the area in front of a train station (eki=station, mae=front). Is the location just “Yamashita” or “Yamashita Ekimae”?
  • Are there language biases in your guidelines? Especially if you are drafting guidelines in more than one language, watch out for language biases. For example, are you making inconsistent rules influenced by the fact that English has spaces, such as tagging “ex-CEO,” but in the case of “former CEO,” only tagging “CEO”? Languages without spaces between words, such as Japanese, would be less likely to make this word-based distinction.
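
For the embedded-entities question above, it also helps to be explicit about how nested spans are recorded. The sketch below shows one possible offset-based representation for the Iraqi cabinet example; the entity type names, and the choice to keep both the long title and its embedded entities, are illustrative decisions rather than recommendations.

```python
# Hypothetical span representation allowing nested (embedded) entities, using character offsets.
# Whether you keep only the longest span or also the embedded ones is a guideline decision.
TEXT = "staff member of the Iraqi cabinet of Prime Minister Haider al-Abadi"

annotations = [
    {"start": 0,  "end": 67, "type": "TITLE"},        # the entire descriptive title
    {"start": 20, "end": 25, "type": "NATIONALITY"},  # "Iraqi"
    {"start": 20, "end": 33, "type": "ORG"},          # "Iraqi cabinet"
    {"start": 52, "end": 67, "type": "PERSON"},       # "Haider al-Abadi"
]

# Print each entity type with the text it covers
for a in annotations:
    print(a["type"], repr(TEXT[a["start"]:a["end"]]))
```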

Last Words

The above is just a starting point for writing annotation guidelines. Ultimately, seeing what is confusing for annotators will shape what your guidelines cover. Always keep in mind the purpose of the entities being extracted. If your downstream analysis is going to be doing event extraction where time is really important, maybe you should extract more than dates. You may want to include time phrases, such as “from 6 p.m. to 9 a.m.” or “two nights ago.”

Ultimately, an easy-to-understand annotation guide with plentiful examples of each rule (with examples of what to tag and what not to tag) will help your human annotators produce consistently tagged data that will enable your model to become accurate and yield useful results.

Learn what it takes to create a machine-learned model in our blog post “A Day in the Life of… Building a New Entity Extraction Model.” For information on how to compare NLP models, see our blog post series “Evaluating NLP.” Try out Rosette Entity Extractor here.