Named Entity Extraction in Legal Documents

DiliTrust
12 min read · Jan 18, 2022

Named Entity Recognition (NER) is a sequence labeling task that aims to classify the words of a sentence into several categories of interest such as person, organization, location, etc. It has been an important subject in the NLP field and has generated an even larger number of papers over the years, owing to the multitude of applications that flow from it, such as information extraction and pseudonymisation (identification and replacement of entities) in sensitive documents.

  • Legal documents

In legal documents in general, and contracts in particular, we can find various pieces of key information representing the counterparties, term, scope and other contractual details. For example, the term clause of a contract alone contains the Term, Effective Date, Expiration Date, Renewal Duration and Termination Notice. Dates and durations are very important for keeping track of active contracts and of those expiring soon, to avoid unnecessary renewals.

The entities we find in a contract depend on the type of document: in a lease, for example, we will find the rent, an entity that we won't find in an NDA or a purchase agreement. The variety of these entities and contract types makes it challenging to build a significant entity extraction dataset, and this is precisely why we will be looking to group our labels by type.

Since a salary, a rent or a transaction amount are, after all, regular amounts, we will consider the extraction of basic labels/entities such as dates, durations, amounts, etc., to make sure we have enough data to train a state-of-the-art extraction approach and deliver good results.

Beyond the variety of entities, contracts have the particularity of having value only after signature. This means that in most cases the documents we process are scanned (after manual signature) and need to undergo optical character recognition (OCR) to extract their textual content; otherwise a natural language processing approach is impossible.

The OCR step makes it even more challenging to analyse contracts and extract key information, since it adds a significant number of spelling errors and inconsistencies into the mix, especially when the provided scan is of very low quality.

The particularity of these documents extends further to include a specific vocabulary, drafting style, structure and reasoning. This means that regular pre-trained word embeddings and language models might not work as well as they do on general-domain datasets.

All the above specificities make it challenging and even more interesting for us to tackle entity extraction in legal documents.

  • Named entity recognition in the literature

Research papers related to NER have increased considerably in the last few years. Entity extraction models can be rule-based systems using expert knowledge, hybrid models using linguistic and domain-specific cues as input features for a machine learning algorithm, or end-to-end deep learning models using distributed representations of words and characters.

Each method has its own advantages and drawbacks. Rule-based approaches allow high precision but are costly to design, are not robust to noisy data, and require regular maintenance to adapt to new cases (very poor generalisation). Hybrid methods, on the other hand, combine the robustness and high accuracy of machine learning algorithms with the fine-grained information of external dictionaries or linguistic rules.

Deep learning approaches remain the most efficient nowadays, mainly thanks to their ability to learn hidden features automatically, but they require large amounts of training data.

Nevertheless, when it comes to real-world applications, even the most efficient systems struggle with noisy datasets. Typos, misspellings and missing words are very common in many fields of application due to data origins and data acquisition pipelines (the optical character recognition module, for example), affecting up to 15% of our samples and thus lowering the performance of our NER systems.

These NER models are usually benchmarked on the CoNLL-2003 dataset, where the entities present in the texts are categorised into four classes: persons, locations, organizations, and names of miscellaneous entities that do not belong to the previous three groups. The evaluation metric is the micro F1-score.
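To make the metric concrete, here is a minimal sketch of entity-level micro F1 using the seqeval library (the tag sequences are invented for illustration):

```python
# Entity-level micro F1 on IOB-tagged sequences with seqeval.
from seqeval.metrics import f1_score

# Invented gold and predicted tag sequences for two sentences.
y_true = [["B-PER", "I-PER", "O", "B-ORG"], ["B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG"], ["O", "O"]]

# Micro-averaging pools all entities before computing precision/recall:
# here 2 of 2 predictions are correct (P = 1.0) and 2 of 3 gold entities
# are found (R = 2/3), giving F1 = 0.8.
print(f1_score(y_true, y_pred))  # 0.8
```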

  • Input features

Textual data can't be directly processed by machines and thus needs to be encoded into vectors, also called embeddings, which serve as input features for neural architectures.

Many research papers emphasise the importance of these input embeddings for named entity extraction in particular, and propose different methods to encode textual data while losing as little valuable information as possible in the process.

Distributed word vectors

Many proposed approaches have shown the efficacy of using word representations as input. These word embeddings are typically pre-trained over large collections of text through unsupervised algorithms such as continuous bag-of-words (CBOW) and continuous skip-gram models (word2vec). Used as input, these embeddings can be either fixed or further fine-tuned during NER model training.

Commonly used word embeddings include Google Word2Vec, Stanford GloVe, Facebook fastText and SENNA.
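As an illustration, pre-trained vectors like these can be loaded in a couple of lines with gensim's downloader (the model name below is one of gensim's published GloVe variants):

```python
import gensim.downloader as api

# 100-dimensional GloVe vectors trained on Wikipedia + Gigaword.
glove = api.load("glove-wiki-gigaword-100")

print(glove["contract"].shape)              # (100,) - one fixed vector per word
print(glove.most_similar("lease", topn=3))  # nearest neighbours in vector space
```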

Character-based approaches

Instead of only considering word-level representations as the basic input, several studies incorporated character-based word representations learned from an end-to-end neural model.

Character-level representations have been found useful for exploiting explicit sub-word-level information such as prefixes and suffixes. Another advantage is that they naturally handle out-of-vocabulary words: character-based models are able to infer representations for unseen words and share information about morpheme-level regularities.

There are two widely used architectures for extracting character-level representations: CNN-based and RNN-based models.

These CNNs can be simple, e.g. a single convolution with one or many kernels, or complex, such as IntNet, which uses a ResNet-like architecture to better capture the internal structure of words by composing their characters from limited supervised training corpora.
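A minimal sketch of the CNN-based variant (PyTorch, with illustrative sizes): embed the characters of a word, convolve over them, then max-pool over time to get one fixed-size vector per word.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, n_chars=100, char_dim=25, n_filters=30, kernel=3):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel, padding=1)

    def forward(self, char_ids):
        # char_ids: (words, max_word_length) integer character ids
        x = self.embed(char_ids).transpose(1, 2)  # (words, char_dim, length)
        x = torch.relu(self.conv(x))              # (words, n_filters, length)
        return x.max(dim=2).values                # max-over-time pooling

words = torch.randint(1, 100, (8, 12))  # 8 words of 12 characters each
print(CharCNN()(words).shape)           # torch.Size([8, 30])
```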

Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks are the two typical choices for RNN-based approaches. Kuru et al. proposed CharNER, a character-level tagger for language-independent NER that considers a sentence as a sequence of characters and utilizes LSTMs to extract character-level representations. It outputs a tag distribution for each character instead of each word; word-level tags are then obtained from the character-level tags. Their results show that taking characters as the primary representation is superior to using words as the basic input unit.

The character-level features are often concatenated with word embeddings before feeding into an RNN context encoder. Rei et al. combined character-level representations with word embeddings using a gating mechanism; in this way, the model dynamically decides how much information to use from the character-level and word-level components.
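A hedged sketch of the gating idea (dimensions are illustrative, not Rei et al.'s exact parameterisation): a sigmoid gate computed from the word embedding decides, per dimension, how much to keep from the word vector versus the character-level vector.

```python
import torch
import torch.nn as nn

class GatedCombination(nn.Module):
    def __init__(self, dim=100):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, word_vec, char_vec):
        z = torch.sigmoid(self.gate(word_vec))   # per-dimension gate in (0, 1)
        return z * word_vec + (1 - z) * char_vec

word_vec = torch.randn(4, 100)  # 4 words, 100-d word embeddings
char_vec = torch.randn(4, 100)  # matching character-level vectors
print(GatedCombination()(word_vec, char_vec).shape)  # torch.Size([4, 100])
```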

Leveraging context

More recent approaches tried to leverage contextual information to improve the quality of the input representation.

Contextualized word vectors such as ELMo or BERT have been used to provide high-quality, informative word vectors. Other advances in language modeling using recurrent neural networks have made it viable to model language as distributions over characters.

The contextual string embeddings by Akbik et al. and CharacterBERT, developed by our colleague Hicham El Boukkouri et al., use a character-level neural language model to generate a contextualized embedding for a string of characters in a sentential context. An important property is that the embeddings are contextualized by their surrounding text, meaning that the same word has different embeddings depending on its contextual use. In addition, the language model embeddings can improve the performance of a sequence tagger even when the data comes from a different domain.
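The flair library ships these contextual string embeddings; a minimal usage sketch (the "news-forward" model is flair's pre-trained English forward character LM):

```python
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

embedding = FlairEmbeddings("news-forward")  # character-level forward LM

sentence = Sentence("The lease terminates on the expiration date .")
embedding.embed(sentence)

for token in sentence:
    # The same surface form would get a different vector in another context.
    print(token.text, token.embedding.shape)
```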

Including additional information

Besides word-level and character-level representations, some studies also incorporate additional information such as gazetteers (entity dictionaries), lexical similarity, linguistic dependencies and visual features into the final representations of words, before feeding them into the context encoding layers. In other words, the DL-based representation is combined with a feature-based approach in a hybrid manner. This may lead to improvements in NER performance, at the price of hurting the generality of these systems.

The BiLSTM-CNN model by Chiu et al. incorporates a bidirectional LSTM and a character-level CNN. Besides word embeddings, the model uses additional word-level features (capitalization, lexicons) and character-level features (a 4-dimensional vector representing the type of a character: upper case, lower case, punctuation, etc.).
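As an illustration (not Chiu et al.'s exact code), such hand-crafted features are typically simple lookups on the surface form:

```python
def capitalization_feature(word: str) -> str:
    # Word-level casing feature, one of a few discrete categories.
    if word.isupper():
        return "allCaps"
    if word.istitle():
        return "initCap"
    if word.islower():
        return "lowercase"
    return "mixed"

def char_type(ch: str) -> str:
    # Character-level type feature: upper / lower / digit / punctuation.
    if ch.isupper():
        return "upper"
    if ch.islower():
        return "lower"
    if ch.isdigit():
        return "digit"
    return "punct"

print(capitalization_feature("Agreement"))  # initCap
print([char_type(c) for c in "Inc."])       # ['upper', 'lower', 'lower', 'punct']
```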

  • Interesting Approaches

Contextual String Embeddings for Sequence Labeling (Paper, Code)

The authors of this paper propose a novel type of contextualised word embeddings based on character-level inputs. These embeddings are pre-trained using a neural character-level language modeling setup on a large unlabeled corpus.

Figure 1: High-level overview of the Flair CSE-BiLSTM-CRF approach

On top of these embeddings, Akbik et al. used a classic BiLSTM-CRF architecture for sequence labeling and obtained state-of-the-art (at the time of publication) results on the CoNLL-03 English and German datasets.
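This setup is available off the shelf in the flair library; a minimal training sketch (the corpus folder and column layout are placeholders for a CoNLL-style dataset, and the API may differ slightly across flair versions):

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Hypothetical CoNLL-style folder with train/dev/test files.
corpus = ColumnCorpus("data/", {0: "text", 1: "ner"})
label_dict = corpus.make_label_dictionary(label_type="ner")

# Forward + backward contextual string embeddings, stacked.
embeddings = StackedEmbeddings([
    FlairEmbeddings("news-forward"),
    FlairEmbeddings("news-backward"),
])

tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=label_dict,
                        tag_type="ner",
                        use_crf=True)  # BiLSTM-CRF head

ModelTrainer(tagger, corpus).train("taggers/ner", max_epochs=10)
```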

Working at the character level comes with both benefits and drawbacks. It boosts the LM's robustness, as it is a lot less sensitive to spelling errors, missing spaces between words, etc. Another advantage is the ability to capture both syntactic and semantic properties, making for better internal representations. These advantages come at the cost of the size of the context window and the length of input text samples. For example, in French a word contains 8 characters on average (study), which means that a 50-word context needs 448 LSTM units to cover: quite challenging for such an architecture. Going further, we risk gradient problems during training, not to mention long training times.

FLERT: Document-Level Features for Named Entity Recognition (Paper, Code)

While most state-of-the-art approaches for named entity recognition only consider the immediate context of the entity for its extraction (typically the sentence), this approach extends the context window to the surrounding sentences. This modeling is very useful in real-life NER applications, where more specific and concise labels require an understanding of both short-term and long-term contexts. This is especially the case with medical and legal datasets.

Figure 2: High-level overview of the FLERT approach

Contrary to the previous method, FLERT uses a transformer-based architecture, which leverages self-attention to provide better document-level features. Another interesting aspect of this approach is that only the tokens belonging to the central sentence are labeled and used for sequence tagging. This makes it a lot more flexible to use on already-annotated datasets that may have fully labeled sentences but partially labeled or unlabeled surrounding contexts.

Several approaches were tested in the paper, mainly:

  • A fine-tuning approach, which consists of taking a pre-trained transformer architecture, adding a new linear layer adapted to the task at hand, then fine-tuning the entire architecture on the NER task.
  • A feature-based approach, which consists of using the contextually enhanced word embeddings, combining them with pre-trained word embeddings (GloVe for English and fastText for other languages), and using them as input to a more classic architecture (BiLSTM-CRF).

Figure 3: Overview of feature-based approach

The best-performing approach is the fine-tuning one, achieving state-of-the-art results on several CoNLL-03 benchmark datasets.
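The flair library also implements the FLERT recipe: for training, `TransformerWordEmbeddings(..., use_context=True, fine_tune=True)` adds the surrounding-sentence context, and released models can be loaded directly for inference. A minimal inference sketch (the model name is flair's published English FLERT-style model; in practice you would substitute your own fine-tuned model):

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# Pre-trained transformer-based NER model with document-level context.
tagger = SequenceTagger.load("flair/ner-english-large")

sentence = Sentence("DiliTrust signed the agreement in Paris .")
tagger.predict(sentence)

for entity in sentence.get_spans("ner"):
    print(entity)  # span text, predicted label and confidence
```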

In this benchmark, we will compare FLERT's performance using a general-domain language model and our in-house legal language models.

SpaCy

SpaCy is an open-source library for natural language processing in Python, offering tools ranging from tokenization and POS tagging to custom deep learning training pipelines.
This library comes with pre-trained named entity recognition models for several languages that have the particularity of being both competitive with state-of-the-art models and extremely fast at training and inference. This made spaCy one of the most successful and widely used NLP libraries in industry.

I couldn't find a paper on spaCy's NER model; all we know comes from a video presentation the authors made, which explains the basic components of the architecture and the intuition behind the different layers and modules used. Basically, spaCy's NER model can be described as a:

"System that features a sophisticated word embedding strategy using subword features and "Bloom" embeddings, a deep convolutional neural network with residual connections, and a novel transition-based approach to named entity parsing. The system is designed to give a good balance of efficiency, accuracy and adaptability."
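Using the pre-trained pipelines is a one-liner; a minimal sketch with the small English model (requires `python -m spacy download en_core_web_sm` beforehand):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This Agreement is entered into on January 18, 2022 "
          "between DiliTrust and Acme Corp for a term of 24 months.")

for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. DATE and ORG entities
```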

  • Benchmark on a legal dataset

To perform this benchmark, we are going to use a subset of our French and English datasets, each comprising more than 15,000 text samples and containing 25,000 entities unevenly distributed over 9 classes (organisation, address, date, amount, duration, person, percent, area and registration_number).

Considering the approaches we are looking to benchmark (except maybe spaCy), the maximum length of our text samples is a very important hyper-parameter that has to be adapted to our dataset. For this reason, we will run a quick statistical analysis of the number of words per text sample in our dataset, as sketched below.
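A quick sketch of that analysis (the `samples` list stands in for our private dataset):

```python
import matplotlib.pyplot as plt

samples = ["..."]  # placeholder: list of raw text samples
word_counts = [len(text.split()) for text in samples]

plt.hist(word_counts, bins=50)
plt.xlabel("words per text sample")
plt.ylabel("number of samples")
plt.show()
```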

Figure 4: Distribution of word counts per text sample (French and English)

Based on the previous analysis, it seems that for French we have a significant number of samples with more than 200 words, while this is not the case for English, which has a much tighter distribution within [0, 100]. This can be explained by the writing style of French legal documents in general and contracts in particular: lawyers tend to use much longer sentences than in English contracts (of US and British origin).

Figure 5: Distribution of character counts per text sample (French and English)

This analysis makes it clear that using "Contextual String Embeddings" is going to be a difficult task for both French and English.

One way to overcome this is to truncate text samples beyond a certain threshold while making sure to respect word and entity boundaries. Sentence tokenization and dependency trees are a big help in ensuring we don't end up with incoherent and incomplete contexts.
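A hedged sketch of such sentence-aware truncation, using spaCy's sentence segmentation (the 200-word budget is illustrative):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def truncate_on_sentences(text: str, max_words: int = 200) -> str:
    kept, budget = [], max_words
    for sent in nlp(text).sents:
        n = len(sent.text.split())
        if n > budget:
            break  # stop at a sentence boundary, never mid-sentence
        kept.append(sent.text)
        budget -= n
    return " ".join(kept)

print(truncate_on_sentences("First sentence. Second sentence.", max_words=3))
# -> "First sentence."
```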

Figure 6: Distribution of token counts per text sample (French and English)

The transformer-based approach (FLERT) is less sensitive to this parameter, since we can take up to 510 tokens as input, which fits our data a lot better (perfectly for English). Even samples with more than 510 tokens are going to be a lot easier to split into meaningful chunks than in the case of "Contextual String Embeddings".

SpaCy remains unaffected by the length of the input sequence, since it relies on a transition-based approach that enables it to process sequences without length constraints, unlike the other two approaches. This makes spaCy a favourite in terms of flexibility and simplicity of preprocessing.

Results


As expected, the "Contextual String Embeddings" based approach exhibits the worst micro F1-score among the tested approaches. As mentioned previously, this is due to the limited length of the character sequences we can give to the model. SpaCy's model came in third with a very good result compared to the FLERT models, especially since it takes more than 8 times less time to train. At inference, the FLERT models require GPU infrastructure to be served with reasonable prediction times, while spaCy is extremely fast and does not require a GPU.

Finally, both FLERT models come out on top, with the one based on the legal language model being the best of the two, confirming the utility of having a domain-specific language model (similar results were observed in a clause classification benchmark illustrated in a previous blog post).

Using state-of-the-art approaches based on language models can definitely lead to the best results compared to simpler approaches (spaCy), but it comes at the cost of training and inference time, as well as the cost and infrastructure complexity of deploying these models.

From my perspective, the only way to make sure these approaches are worth the cost is to stop fine-tuning a custom LM for each individual task, and start building different task pipelines on top of the same model.

This will enable us to limit the number of models we need to deploy and the number of inferences. For example, a single LM inference at the clause level can serve both clause classification and entity extraction at the same time, making it much more interesting cost- and time-wise.
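An illustrative sketch of the idea (not our production system; the model name and label counts are placeholders): one transformer forward pass feeds two small task heads.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

encoder = AutoModel.from_pretrained("bert-base-cased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

clause_head = nn.Linear(encoder.config.hidden_size, 12)  # 12 clause types (placeholder)
ner_head = nn.Linear(encoder.config.hidden_size, 19)     # 9 entity classes in IOB + O

inputs = tokenizer("The rent shall be payable monthly.", return_tensors="pt")
hidden = encoder(**inputs).last_hidden_state  # the single shared LM inference

clause_logits = clause_head(hidden[:, 0])  # [CLS] vector -> clause class
ner_logits = ner_head(hidden)              # every token -> entity tag
print(clause_logits.shape, ner_logits.shape)
```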

We actually tried this approach with FLERT by freezing the LM while training only the last layer for NER, and the results were not good. This means we have to develop our own NER architecture, adapted to our data and our needs.
