1. Introduction
Large quantities of printed documents are scanned and archived as images. Text extraction using optical character recognition (OCR) systems is then necessary to index the documents, an essential feature for making them accessible. Unfortunately, the quality of OCR output is imperfect and sometimes far from the actual expected text, known as the ground truth. Given the cost of fixing OCR errors, the quality of OCR output is often considered sufficient for reading and exploring documents. However, several studies show that the effectiveness of systems processing OCR output texts might be considerably harmed by OCR errors (Ittner, Lewis, and Ahn 1995; Lopresti 2009).
The quality of the text generated by OCR engines depends on their algorithms and on the parameter settings of the scanner used to digitize documents, as well as on the quality of the original image and the nature of the document. For example, text generated from recent versus historical newspapers, or from well-preserved versus damaged manuscripts, does not usually have the same quality. Reasonable levels of OCR errors have relatively little impact on the human ability to read the documents. However, the text resulting from OCR is the one used for indexing. Consequently, if some words have been wrongly recognized by the OCR, they will be indexed with their errors. This represents a serious problem for document indexing and retrieval.
Named entities (NEs) are useful in many applications, such as Web search (Guo et al. 2009). A study has shown that NEs are the first point of entry for users in a search system (Gefen 2014). It is estimated that four out of five user queries on Gallica contain at least one NE (Chiron et al. 2017). Thus, properly recognizing NEs can be considered more important than properly recognizing other words. In order to improve the quality of user searches in a system, it is thus necessary to ensure the quality of these particular terms.
Named entity recognition (NER) is a traditional natural language processing (NLP) task used for many information retrieval applications (Petkova and Croft 2007; Guo et al. 2009) such as indexing and text mining. NER emerged in the mid-1990s (Grishman and Sundheim 1996). It aims to locate specific terms in a given text and to categorize them into a set of predefined classes. Three main classes are usually used for named entity labeling: person, location, and organization (Nadeau and Sekine 2007).
Combined with or subsequent to NER, named entity linking (NEL) connects NEs to external knowledge bases (KBs) such as Wikipedia, Wikidata, DBpedia (Lehmann et al. 2015), GeoNames, YAGO (Suchanek, Kasneci, and Weikum 2007), and the Google Knowledge Graph. This makes it possible to differentiate ambiguous geographical locations or names (e.g. the mention Paris can be linked to several cities or people), and implies that the descriptions from the KBs can be used for semantic enrichment.
However, NER and NEL are especially challenging for large quantities of documents, as the diversity of NEs increases with the size of the collections. In the case of digitized documents, represented by their OCRed version which may contain numerous OCR errors, NEs are particularly affected, as stated by Chiron et al. (2017). Many techniques for NER and NEL have been developed in the literature over the last 25 years. They can be classified into rule-based and machine learning-based approaches. In rule-based methods, rules are extracted manually. They rely on linguistic descriptions, trigger words, and lexicons of proper names (also known as gazetteers), and use patterns and regular expressions to locate NEs, classify them, and link them to KBs. Machine-learning approaches, on the other hand, aim to extract rules autonomously from large corpora. In the presence of OCR errors, rule-based methods are clearly hampered and unable to overcome the degradation generated by the OCR, whereas machine-learning methods introduce sufficient flexibility to be adapted to processing noisy text.
Recent works have analysed the impact of OCR errors on NER (Hamdi et al. 2020) and NEL (Linhares Pontes et al. 2019). More precisely, they analyzed different levels and types of OCR degradation and their impact on the performance of NER and NEL systems. They concluded that OCR errors are strongly related to the drop in performance of these tasks. For instance, the performance of NER systems drops from 90% to 60% when the character error rate (CER) exceeds 20%, while the results of NEL systems decrease by around 10 percentage points when the OCR error rates are, respectively, 4% and 15% at the character and word levels.
The present work extends the analysis of these previous works (Linhares Pontes et al. 2019; Hamdi et al. 2020) with a deep analysis of OCR errors over the noisy collections. We define and study types of character/word errors and the ways in which they impact the performance of NER and NEL systems. To do so, we study five aspects related to general OCR errors and compare them with human-generated misspellings: length effects, erroneous character positions, segmentation errors (NE boundaries), Levenshtein distance, and edit operations. These observations allow us to make several suggestions on how to implement effective OCR postprocessing approaches when intending to perform NER and NEL.
The rest of this article is organized as follows: in Section 2, we describe typical NER and NEL approaches. Section 2.3 reviews the impact of OCR on NER and NEL systems that process its outputs. Section 3 consists of two parts. The first concerns the datasets, with an overview of the NER and NEL datasets used. The second outlines the impact of OCR errors on NER and NEL systems from a global point of view, using results on clean and noisy OCRed texts. Based on the resulting observations, we propose an in-depth analysis of the types of OCR errors and their impact on NER and NEL in Section 4. Finally, Section 5 concludes the article.
2. Related work
This article studies NER and NEL applied to OCRed documents. Consequently, we first introduce the main underlying NER and NEL approaches, and then review works related to the impact of OCR quality on the performance of NE processing.
2.1 Named entity recognition
NER systems aim to locate NEs in a given sequence of words and to assign them a label (e.g. PER for persons, LOC for locations, and ORG for organizations). Many NER approaches annotate texts using the IOB tagging scheme, where each token is marked as being inside (I), outside (O), or at the beginning (B) of an entity of a certain class. The sentence “Paris Hilton visited Paris” is, for instance, labeled as follows: B_PER I_PER O B_LOC.
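For illustration, the IOB-tagged sentence above can be represented as token–label pairs; the following is a minimal sketch of this representation (tokenization and tag spellings follow the example in the text):

```python
# IOB-tagged tokens for "Paris Hilton visited Paris":
# B_ marks the first token of an entity, I_ a continuation, O a non-entity token.
sentence = [
    ("Paris",   "B_PER"),  # first token of the person "Paris Hilton"
    ("Hilton",  "I_PER"),  # continuation of the same person entity
    ("visited", "O"),      # not part of any entity
    ("Paris",   "B_LOC"),  # same surface form, here a location
]

for token, tag in sentence:
    print(f"{token}\t{tag}")
```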
NER approaches appeared in the 1990s (Grishman and Sundheim 1996), and early systems relied on rule-based approaches. The rules used in those systems are defined by humans and based on dictionaries, trigger words, and linguistic descriptors. Such rules require a lot of time and effort to craft and maintain, and thus cannot easily be adapted to new types of texts or entities. To overcome this problem, efforts on NER are now largely dominated by machine-learning techniques such as fully supervised learning, semi-supervised learning, unsupervised learning, and more recently deep learning.
Fully supervised approaches to NER include support vector machines (Asahara and Matsumoto 2003) and maximum entropy models (Borthwick et al. 1998), as well as sequential tagging methods such as hidden Markov models (Bikel et al. 1998) and conditional random fields (CRFs) (Filannino, Brown, and Nenadic 2013). These approaches, similarly to rule-based methods, rely on hand-crafted features, which are challenging and time-consuming to develop, and may be costly to update and generalize to new data.
More recently, neural networks have been shown to outperform other supervised algorithms for NER. The first neural network-based system was developed in 2011 (Collobert et al. 2011) and reached very competitive results for NER in comparison to previous machine-learning works. Since then, many NER systems using neural network architectures have been proposed and have shown their ability to outperform previous systems. Most deep-learning NER models are based on BiLSTM (Dernoncourt, Lee, and Szolovits 2017; Peters et al. 2017) or Transformer architectures (Vaswani et al. 2017; Devlin et al. 2019; Boros et al. 2020a). BiLSTM models with a CRF top layer as tag decoder dominate existing models (Lample et al. 2016; Ma and Hovy 2016). Deep learning-based approaches rely on word and character distributed representations. Common algorithms for such context-independent word embeddings include Google word2vec (Goldberg and Levy 2014) and Stanford GloVe (Pennington, Socher, and Manning 2014). Many other works have been proposed to enrich word representations with subword and contextual information, such as ELMo (Peters et al. 2018), Flair (Akbik et al. 2019), and BERT (Devlin et al. 2019). The effectiveness of NER systems using neural networks is due to their ability to be updated and generalized: these systems can jointly learn effective features and model parameters directly from the training dataset, instead of relying on hand-crafted features developed for a specific dataset.
2.2 Named entity linking
As mentioned in the Introduction, the aim of the NEL task is to map NEs to their corresponding entities in a KB (Shen, Wang, and Han 2015), where a KB contains a set of NEs and a set of documents.
As with NER, NEL methods based on neural networks (Ganea and Hofmann 2017; Le and Titov 2018) have shown an ability to outperform models that learn from data on the basis of manually selected, hand-crafted features. Neural network-based methods include all variants of deep learning techniques, such as transfer learning (Linhares Pontes, Moreno, and Doucet 2020b), reinforcement learning (Fang et al. 2019), and multitask learning (Martins, Marinho, and Martins 2019b), which capture relationships and complex interactions among features, allow a better analysis of documents, and improve performance. These methods combine context-aware word, span, and entity embeddings with neural similarity functions to analyze the context of mentions and correctly disambiguate them to a KB. To disambiguate NEs, NEL systems rely on three steps:
(1) generate candidate entities for each linkable NE;
(2) rank the candidate entities;
(3) predict unlinkable mentions of extracted NEs.
More precisely, the generation of candidate entities consists in retrieving from a KB the entities that may correspond to the mentions in a document. To do so, NEL systems have used dictionaries (Guo, Chang, and Kiciman 2013) and search engines (Han and Zhao 1999), or expanded surface forms from the local document (Zhang et al. 2011). The candidates are then ranked to select the most likely entity from the KB. Many state-of-the-art methods have been proposed to analyse and rank these candidates, based on name string comparison (Zheng et al. 2010), entity popularity (Guo et al. 2013), entity type (Dredze et al. 2010), textual context (Li et al. 2013), and coherence between mapping entities (Cucerzan 2007). Once candidates from the KB are ranked, the last module checks for each mention whether the input NE matches the top-ranked target NE.
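The three steps can be pictured as a minimal pipeline sketch. The candidate table, popularity scores, scoring function, and threshold below are toy assumptions for illustration, not the implementation of any system cited above:

```python
# Minimal three-step NEL sketch: candidate generation, ranking, NIL prediction.
# The candidate dictionary and popularity scores are toy assumptions.
from __future__ import annotations
import re

CANDIDATES = {
    "paris": ["Paris_(France)", "Paris_(Texas)", "Paris_Hilton"],
}
POPULARITY = {"Paris_(France)": 0.90, "Paris_(Texas)": 0.05, "Paris_Hilton": 0.05}

def generate_candidates(mention: str) -> list[str]:
    """Step 1: retrieve KB entities whose surface forms match the mention."""
    return CANDIDATES.get(mention.lower(), [])

def rank_candidates(candidates: list[str], context: set[str]) -> list[tuple[str, float]]:
    """Step 2: score candidates, here by popularity plus naive context overlap."""
    scored = []
    for entity in candidates:
        name_words = set(re.sub(r"[_()]", " ", entity).lower().split())
        scored.append((entity, POPULARITY.get(entity, 0.0) + 0.1 * len(context & name_words)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def link(mention: str, context: set[str], threshold: float = 0.1) -> str | None:
    """Step 3: return the top entity, or None (NIL) for unlinkable mentions."""
    ranked = rank_candidates(generate_candidates(mention), context)
    if not ranked or ranked[0][1] < threshold:
        return None  # unlinkable mention
    return ranked[0][0]

print(link("Paris", {"france", "capital"}))  # -> Paris_(France)
print(link("Gotham", set()))                 # -> None (no candidates)
```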
Many other studies developed end-to-end systems, which jointly handle NER and NEL. Early works were based on engineered features (Sil and Yates 2013; Luo et al. 2015). Neural network-based systems were then proposed to capture the mutual dependency between NER and NEL. The first neural system was proposed by Kolitsas, Ganea, and Hofmann (2018). Their method first generates all potential NEs and then learns similarity features based on contextual embeddings in order to disambiguate these mentions. Many other neural architectures have since been proposed, using either BiLSTM (Martins, Marinho, and Martins 2019a) or BERT (Broscheit 2019; Chen et al. 2019). More recently, Cao et al. (2021) implemented the mGENRE system, which takes advantage of language connections to predict and link NEs in multiple languages.
2.3 NLP of OCRed documents
Despite decades of research on OCR, the outputs of such systems often contain errors, especially when the OCR input document is damaged, old, or badly digitized. OCR systems sit at the beginning of processing pipelines, and their errors can propagate and sometimes be harmful for further tasks. For this reason, many researchers have studied the problems of processing text data from noisy sources. Those studies were conducted in order to understand the effects of optical recognition errors on text analysis routines and eventually estimate which zones of digitized documents may require a prior error correction process.
Much work has been done in the field of NLP to process noisy data (Lopresti 2005). For instance, Lopresti (2009) considered a text analysis pipeline consisting of sentence boundary detection, tokenization, and part-of-speech tagging of noisy unstructured text data. He reported that on the sentence boundary task, for example, insertion errors have a more destructive impact than character deletion errors, while OCR substitution errors are worse for part-of-speech tagging. More recently, Nguyen et al. (2019) proposed an analysis of OCR errors on several collections of historical documents obtained from digital libraries. They showed that 81.49% of OCRed words contain at most two erroneous characters and that characters such as ‘b’, ‘d’, ‘m’, and ‘n’ are more easily misrecognized than others. The effects of processing noisy texts have also been studied for many other NLP tasks such as machine translation (Yaser 2005), document summarization (Jing, Lopresti, and Shih 2003), and topic modelling (Mutuvi et al. 2018).
Several works have focused on information retrieval from noisy data (Croft et al. 1994). Chiron et al. (2017) proposed a method to estimate the impact of OCR errors on the use of digital libraries. They built an OCR error model using a large corpus of OCRed documents aligned with their corresponding gold standard. Their model estimates the risk that a user’s query might fail to match the targeted documents. Taghva, Borsack, and Condit (1996) showed that moderate OCR errors have no drastic impact on the effectiveness of classical information retrieval measures. Other studies focused on the impact of OCR errors on the classification of pathology reports for cancer notification (Zuccon et al. 2012). They concluded that OCR errors, even at modest rates, do not noticeably affect the extraction of cancer notification items.
2.3.1 OCR errors and NER
Concerning NER, several works have aimed to extract NEs from diverse text types such as OCR outputs (Hamdi et al. 2019), outputs of automatic speech recognition (ASR) systems (Favre, Béchet, and Nocéra 2005), informal text messages, and noisy social network posts (Ritter et al. 2011). An exhaustive survey on NER in historical documents was recently published (Ehrmann et al. 2021). Palmer and Ostendorf (2001) described an approach for improving NE extraction from ASR system outputs by explicitly modeling errors through the use of confidence scores. In a similar setting, Miller et al. (2000) studied NER performance on a variety of spoken and OCRed data. They trained a NER system on both clean and noisy input material and observed that performance degraded linearly as a function of word error rate (WER); results may lose about 8 points of F-score with only 15% WER. Rodriquez et al. (2012) reported that manual correction of OCR output does not result in a clear improvement of NER results. Many other studies took interest in NE extraction from digitized historical journals (Grover et al. 2008), broadcast news (Gotoh and Renals 2000), religious monologues, scientific books, and medical emails (Maynard et al. 2001).
Recently, studies have shown that neural network NER models are better able to alleviate OCR errors than traditional machine-learning approaches (Hamdi et al. 2019; Boros et al. 2020b; van Strien et al. 2020). However, even with neural networks, NER performance decreases considerably when applied to OCRed documents. van Strien et al. (2020) conducted a large-scale analysis of the impact of OCR errors on several NLP tasks. They found that the impact on NER is less significant than on other tasks such as dependency parsing and sentence segmentation. Interestingly, the damaging effect seems greater on geopolitical entities than on person names or dates. Hamdi et al. (2019) focused on NER. They processed five noisy datasets using a BiLSTM NER system and reported that the NER F1-score drops by about 30 percentage points when the error rate is around 20%. Huynh, Hamdi, and Doucet (2020) then applied a post-OCR correction method on these datasets and showed that the OCR impact can be considerably attenuated by only correcting OCRed words with up to two erroneous characters.
Additional related work proposed to create collections for NER on digitized Chinese documents (Lawrie, Mayfield, and Etter 2020). The aim of building such collections is to support the full context of NER over OCRed text and to improve NER performance. The proposed methodology for building OCR/NER collections is to convert blocks of text into images and then extract the text from the images using OCR, finally yielding OCRed text enriched with NER annotations.
2.3.2 OCR errors and NEL
Concerning NEL, we previously evaluated the performance of state-of-the-art NEL approaches on digitized documents with different levels of OCR quality (Linhares Pontes et al. 2019). We simulated OCR mistakes on contemporary datasets and analyzed the performance of the Ganea and Hofmann (2017) and Le and Titov (2018) systems on these data. Ganea and Hofmann embed entities and words in a common vector space and use a neural attention mechanism over local context windows to select words that are informative for the disambiguation decision. Le and Titov rely on representation learning, learning embeddings of mentions, contexts, and relations to reduce the amount of human expertise required to construct the system and make the analysis more portable across languages and domains. In our analysis, the performance of these systems decreased by around 20% when OCR errors, at the character and word levels, reached rates of 5% and 15%, respectively.
In addition to OCR errors, works in digital humanities deal with historical documents that may contain spelling variations from modern languages, which can be difficult to recognize because spelling conventions may be reformed from time to time. Some works focused on the use of available NEL approaches to analyze historical data (van Hooland et al. Reference van Hooland, De Wilde, Verborgh, Steiner and Van de Walle2013; Munnelly and Lawless Reference Munnelly and Lawless2018; Ruiz and Poibeau Reference Ruiz and Poibeau2019). Other works studied the development of features and rules to improve specific-domain NEL (Heino et al. Reference Heino, Tamper, Mäkelä, Leskinen, Ikkala, Tuominen, Koho and Hyvönen2017) or entity types (Brando, Frontini, and Ganascia Reference Brando, Frontini and Ganascia2016). Moreover, some studies focused on the effect of problems frequently encountered in historical documents on NEL (Linhares Pontes et al. Reference Linhares Pontes, Cabrera-Diego, Moreno, Boros, Hamdi, Sidère, Coustaty and Doucet2020a). They represented the entities in a continuous space and combined them with a neural attention mechanism to analyze context words and candidate entity embeddings to disambiguate mentions in historical documents. In addition, they developed several modules to handle the multilingualism and errors related to OCR engines.
Similarly to van Strien et al. (2020), we propose in the next sections to study the impact of OCR quality on NLP tasks, specifically NER and NEL, but with a detailed analysis over noisy data. Unlike previous work, we use larger corpora for evaluation without relying on post-correction. The datasets used in this work cover several languages, and the noisy versions contain different types of degradation that might be related to storage or digitization processes. Finally, we present a deep analysis of OCR quality and identify the types and levels of OCR errors under which NER and NEL remain reasonably reliable.
3. Resources
Processing NEs in a noisy context is very common with historical content, since the text to be analyzed almost always results from digitization and an OCR process. Few annotated datasets (with NEs and their links) in a noisy context aligned with their ground truth are publicly available to assess the impact of OCR errors on those tasks. As the main objective of this article is a deep analysis of OCR error categories for the NER/NEL tasks, we first review experiments on the global impact of OCR errors. To this end, we start with an overview of the two publicly available datasets we used. We then present how we simulated degradations of the digitization process and their OCR errors. Finally, we present the results obtained from the point of view of OCR quality, using the classical CER and WER measures.
3.1 Datasets
3.1.1 NER datasets
First, we focused on NER with a publicly available dataset presented by Hamdi et al. (2020). These corpora are based on the datasets presented at the Conference on Natural Language Learning in 2002 and 2003 (CoNLL-02 and CoNLL-03), on which operations were performed to synthesize realistic OCR errors. These resources consist of three datasets covering three languages: English, Spanish, and Dutch. Each dataset is split into the three subsets commonly used in machine learning, that is, a training set, a development set, and a test set. For each corpus (English, Dutch, and Spanish), degraded images and noisy texts extracted by the OCR, as well as versions aligned with the clean data at the word and character levels, are provided.
NEs are classified into four predefined categories: PER for persons, LOC for locations, ORG for organizations, and MISC for miscellaneous, which is used to annotate all NEs not belonging to any of the other three classes.
3.1.2 NEL datasets
For the NEL experiments, we used a publicly available dataset presented in Linhares Pontes et al. (2019). As with the NER datasets, these corpora are based on degraded versions of clean existing NEL datasets:
- The AIDA-CoNLL dataset (Hoffart et al. 2011) is based on the CoNLL-03 data used for the NER task. It is divided into AIDA-train for training, AIDA-A for validation, and AIDA-B for testing, and contains 1393 Reuters news articles and 27,817 linkable mentions.
- The AQUAINT dataset (Guo and Barbosa 2014) is composed of 50 short news documents (250–300 words) from the Xinhua News Service, the New York Times, and the Associated Press. It contains 727 mentions.
- The ACE2004 dataset (Guo and Barbosa 2014) is a subset of the ACE2004 coreference documents with 57 articles and 306 mentions, annotated through crowdsourcing.
- The MSNBC dataset (Guo and Barbosa 2014) is composed of 20 news articles from 10 different topics (two articles per topic: Business, U.S. Politics, Entertainment, Health, Sports, Tech & Science, Travel, News).
3.1.3 Simulation of noisy data
Both datasets follow the same construction process. Due to the lack of real noisy annotated data, they were built by simulating the process of text extraction from digitized documents. First, clean images were generated from the raw text of the NER/NEL datasets. To simulate the noise induced by digitization, the DocCreator tool developed by Journet et al. (2017) was then used. This tool provides many filters to apply various degradations to document images, such as blurring, bleeding-through, ink degradation, and holes.
In order to simulate OCRed versions, the raw texts are extracted from the NER and NEL annotated corpora. They are then converted into images, which are contaminated by injecting degradations commonly introduced during scanning. Using the Tesseract OCR engine, the noisy texts are then extracted from the degraded document images. Finally, the original and noisy texts are aligned using the RETAS tool, to match the annotations from the original corpus to the noisy version.
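As an illustration of the alignment step, the sketch below uses Python's difflib to align a ground-truth string with its noisy OCR counterpart; this is our own simplified stand-in, not the RETAS tool actually used, and the example strings are invented:

```python
# Illustrative character-level alignment between ground-truth (GT) and OCR text,
# using difflib rather than the RETAS tool used in the actual pipeline.
from difflib import SequenceMatcher

gt = "South Korea beat Japan"
ocr = "outh Korea beat Japgfl"

matcher = SequenceMatcher(a=gt, b=ocr, autojunk=False)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    print(f"{op:8s} GT[{i1}:{i2}]={gt[i1:i2]!r:14s} OCR[{j1}:{j2}]={ocr[j1:j2]!r}")

# GT annotation spans can then be projected onto the OCR side through the
# 'equal' and 'replace' blocks, e.g. GT "South Korea" -> OCR "outh Korea".
```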
Four types of degradation are applied to both datasets: character degradation adds small ink spots on characters, simulating the age of documents; phantom degradation simulates eroded characters that can occur after successive uses of documents; bleed-through simulates the ink from the back side of a page appearing on its front side; and blurring adds a blurring effect. Each of these types of degradation is applied at two levels in the NER datasets: LEV-1, where noise is applied sparsely, and LEV-2, where it is applied more frequently. On the NEL dataset, only LEV-1 has been applied. Figure 1 shows an original image and its degraded version.
Two additional settings have been defined: LEV-0 and LEV-MIX. LEV-0 is the re-OCRed version of the original images with no degradation added. It provides a baseline of the OCR engine on clean images. LEV-MIX is closer to a real-world example, representing the result of simultaneously applying the four types of degradation at LEV-1 to the original texts. Table 1 outlines the CER and WER percentages for each OCRed version of the test sets. These results are close to OCR error rates observed in real-life collections (Holley 2009).
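CER and WER are both edit-distance-based rates, computed at the character and word levels, respectively. The following minimal sketch shows one way to compute them; it is our own illustration (with invented example strings), not the evaluation script used to produce Table 1:

```python
# Minimal CER/WER computation based on Levenshtein distance.
def levenshtein(ref, hyp) -> int:
    """Edit distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: character edits over reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word edits over reference word count."""
    return levenshtein(ref.split(), hyp.split()) / max(len(ref.split()), 1)

gt, ocr = "South Korea beat Japan", "outh Korea beat Japgfl"
print(f"CER = {cer(gt, ocr):.2%}, WER = {wer(gt, ocr):.2%}")
```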
3.2 Global impact of OCR errors on NER/NEL
Neural networks and their training process have several hyper-parameters, such as the character embedding dimension, character-based token embedding, LSTM dimension, and token embedding dimension. The same parameters for training and testing have been used for both the OCRed and clean datasets. In order to quantitatively estimate the impact of OCR errors on the NER and NEL tasks, we summarize our previous works (Linhares Pontes et al. 2019; Hamdi et al. 2020) on the clean corpora and the simulated noisy ones.
3.2.1 Impact of OCR errors on NER
All the experiments reported in this article are presented in Hamdi et al. (2020). The first was conducted on the clean corpora; the same experiment was then run on the OCRed datasets. We evaluated four NER systems: one machine learning-based system and three deep learning-based systems. As the results showed that the deep learning-based systems have very similar performance and outperform the machine learning-based system, we reuse in this article the results obtained by the BiLSTM-CNN-CRF model (Ma and Hovy 2016). The model uses a forward LSTM and a backward LSTM that encode the left and right contexts, respectively; the pair is referred to as a bidirectional LSTM (BiLSTM). A CRF layer then generates the most probable sequence of predicted labels from the surrounding words. In order to obtain the best possible performance, we adapted this architecture to only use word and character embeddings. At the character level, the system induces character-level features using a convolutional neural network, adding to each word vector a character-based feature vector. Character features can be character embeddings and character types (i.e. uppercase, lowercase, numbers, punctuation marks, special characters). At the word level, we used FastText embeddings (Bojanowski et al. 2017): as a subword-based model, FastText can usually handle unseen words while still generating one vector per word, which remedies issues with out-of-vocabulary words by incorporating subword information. The word embedding model of our system relies on the pre-trained FastText model (Grave et al. 2018), while the character embeddings were trained on our data. The character-level representation vector is concatenated with the word embedding vector to feed the BiLSTM network, which converts the input sequence of words into a sequence of fixed-size vectors ($x_{1}, x_{2}, \ldots, x_{n}$) and returns another sequence of vectors ($h_{1}, h_{2}, \ldots, h_{n}$) representing NE label information at every step of the input. Finally, the output vectors of the BiLSTM are used as inputs to the CRF layer to jointly decode the best label sequence. This architecture achieved impressive results on two linguistic sequence labeling tasks: POS tagging with an accuracy exceeding 97% and NER with an F1-score of 91% (Ma and Hovy 2016). Results are detailed in Table 2.
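The input side of this architecture can be sketched in PyTorch as follows. This is a simplified illustration of the char-CNN/word-embedding concatenation feeding a BiLSTM, not the authors' exact implementation: all dimensions are invented, and the CRF decoder is replaced by a plain linear emission layer for brevity:

```python
# Sketch of the BiLSTM-CNN(-CRF) tagger input pipeline: a character CNN builds
# a per-word vector that is concatenated with the word embedding and fed to a
# BiLSTM. The CRF layer is omitted; a linear layer emits per-token tag scores.
import torch
import torch.nn as nn

class BiLSTMCNNTagger(nn.Module):
    def __init__(self, n_words, n_chars, n_tags,
                 word_dim=100, char_dim=25, char_filters=30, lstm_dim=100):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(word_dim + char_filters, lstm_dim,
                              batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(2 * lstm_dim, n_tags)  # CRF decoding omitted

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq); char_ids: (batch, seq, max_word_len)
        b, s, w = char_ids.shape
        chars = self.char_emb(char_ids.view(b * s, w)).transpose(1, 2)
        char_feat = self.char_cnn(chars).max(dim=2).values.view(b, s, -1)
        x = torch.cat([self.word_emb(word_ids), char_feat], dim=-1)
        h, _ = self.bilstm(x)          # the (h_1, ..., h_n) sequence
        return self.emissions(h)       # per-token tag scores

model = BiLSTMCNNTagger(n_words=5000, n_chars=80, n_tags=9)
scores = model(torch.zeros(2, 6, dtype=torch.long),
               torch.zeros(2, 6, 12, dtype=torch.long))
print(scores.shape)  # torch.Size([2, 6, 9])
```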
The results show that the relative performance variations are essentially comparable across the three languages. As expected, the accuracy of NER drops proportionally to the level of OCR errors, which is itself related to the degradation type and level. Additionally, Table 2 shows that NER results may drop by 3–5 percentage points from clean data to LEV-0 OCRed data, the OCRed synthesised data with no added noise. In other words, even with perfect storage and digitization, NER accuracy may be affected by OCR quality. For the other types of degradation, taking English as an example, the OCR WER varies from 8% to 50%, while the NER F-score drops from 90% to 50%.
3.2.2 Impact of OCR errors on NEL
In this subsection, we give an overview of our previous work (Linhares Pontes et al. 2019) on the impact of OCR errors on NEL. As the performance of Ganea and Hofmann’s and Le and Titov’s systems was similarly affected by the OCRed documents, we reused the results of the Ganea and Hofmann system to explain and link the impact of OCRed documents between NER and NEL applications. We used the Ganea and Hofmann system instead of an end-to-end NEL system in order to evaluate the impact of OCR quality on entity disambiguation only (using NER gold tags). This choice also allows us to evaluate the impact of OCR errors on the end-to-end entity linking process by using the outputs of the NER system as input to Ganea and Hofmann’s NEL system (Section 4.7).
In order to analyze the impact of OCR degradation on the NEL task, experiments were conducted using the system of Ganea and Hofmann (2017) on the OCRed datasets. Ganea and Hofmann proposed a deep-learning model that represents entities and words in a common vector space. Their model uses a neural attention mechanism over local context windows and a conditional random field that collectively disambiguates the mentions in a document. The model was pretrained using 300-dimensional Word2Vec word embeddings from the Wikipedia corpus published in February 2014, and then trained on the AIDA-CoNLL dataset (Hoffart et al. 2011).
Table 3 shows the performance of Ganea and Hofmann’s system (Ganea and Hofmann 2017) on datasets with multiple levels of OCR quality, in terms of micro-averaged F1 scores. It can be seen that the CER and WER generated by OCR degradation are globally correlated with the performance of NEL. While the LEV-0 and bleed degradations had the lowest CER and WER levels and generated the least impact on the F1 scores, character degradation produced larger CER and WER values and caused a bigger drop in NEL performance. In contrast to its effect on the NER datasets, the blur degradation had an impact on NEL performance similar to that of the bleed degradation. Among all datasets, ACE2004 was the most affected by OCR degradation, with a drop of almost 20 percentage points for the LEV-MIX degradation.
The combination of the OCR degradations (LEV-MIX) generated the highest CER and WER. Despite these high error rates, it does not appear to have a drastic impact on NEL performance (F1-score decreases of up to 11%).
In spite of the complexity of the NEL task and the introduction of several types of errors, the systems achieved robust results with limited document degradation (up to 5% WER). However, degradations causing a stronger decrease in OCR performance had double the impact on NEL performance.
4. In-depth OCR error analysis
This section contains a key contribution of this article. We propose an in-depth analysis of OCR errors and their impact on NER and NEL. From this analysis, one can observe that some OCR errors have a stronger impact than others on the tasks of NE recognition and linking. This observation is notably useful for the implementation of effective OCR postprocessing approaches geared towards NER and NEL.
As discussed in Section 3.2 and detailed by Hamdi et al. (2020) and Linhares Pontes et al. (2019), the impact of OCR quality on NER and NEL is very similar regardless of the system or the dataset used for each task. For NER, all the systems achieve satisfying results when the OCR error rates are between 1% and 5% at the character level and 8% and 15% at the word level. Similarly, all NEL systems reach good results when the CER is less than 7% and the WER does not exceed 17%. Beyond these rates, NER and NEL are harmed by OCR errors. For these reasons, we use for each task one system and one dataset, including all the noisy versions (Section 3). The BiLSTM-CNN-CRF model and the system of Ganea and Hofmann (2017) showed slightly better results for NER and NEL, respectively, so we used the outputs of these systems to conduct our analysis. In terms of corpora, we used the English CoNLL-03 corpus for NER and its extension, the AIDA corpus, for NEL. Indeed, they are mostly used to evaluate state-of-the-art systems for the two tasks separately rather than through an end-to-end approach. Our in-depth analysis aims at covering the most relevant phenomena that define good strategies for post-OCR correction (edit operations, Levenshtein distance, length effects, first character errors, and segmentation errors).
4.1 Recognition and linking of contaminated NEs
Contaminated NEs are NEs that have been wrongly recognized from the image by the OCR process. As mentioned earlier, NER and NEL systems are able in some cases to cope with OCR errors and to correctly recognize and link NEs; in many other cases, however, they fail to overcome these errors.
To the best of our knowledge, all post-OCR research has focused on general words (Nguyen et al. 2021); no previous study has analysed OCR errors over NEs. In order to study post-OCR on NEs, we first extracted the contaminated entities in all the noisy versions of our datasets. We then identified among them the NEs correctly recognized/linked and those wrongly recognized/linked, in order to analyse them and identify the OCR aspects that impact the effectiveness of NER and NEL systems. Table 4 shows the percentages of contaminated and noncontaminated NEs in our datasets and the relative rates of the NEs correctly tagged by the systems. We perform this analysis with several OCRed versions of both datasets, each with varying error rates resulting from the injection of different types of document image noise.
Table 4 shows that the recognition and linking of contaminated NEs is only slightly affected by OCR errors when the level of errors is low (LEV-0, bleed-through, and phantom degradation for NER; LEV-0 and blurring for NEL). Despite higher CER and WER values (3.4 and 16.9, respectively), AIDA does not contain as many contaminated NEs as the CoNLL-03 dataset. Indeed, most tokens affected by the OCR degradation are not entities. Among the degraded entities, the main OCR errors in the AIDA dataset are: deletion of characters (“outh Korea” vs. “South Korea”), accents (“Jerusalém” vs. “Jerusalem”), upper-case characters (“indonesia” vs. “Indonesia”, “United states” vs. “United States”), and punctuation (“Australia.” vs. “Australia”).
Unsurprisingly, the more NEs are contaminated, the more NER results degrade. Things are less clear for NEL. For example, the phantom degradation affected more NEs than the bleed-through, yet more contaminated NEs are correctly linked in the data with phantom degradation. In addition, all the contaminated NEs in the blurred data are successfully linked, despite their number compared to other degraded data. These observations show that an additional analysis is required to understand the different OCR aspects and their impact on NER and NEL.
OCR errors directly affect contaminated entities, but they also indirectly affect uncontaminated entities. The performance of NEL was worst for LEV-MIX and character degradation (Table 4). One of the main reasons for this performance reduction is that the context of these entities was contaminated by OCR errors, which degraded the disambiguation analysis.
For this reason, we conducted five experiments to characterize contaminated NEs and find out why they are correctly or wrongly tagged by the NER and NEL systems. We studied known aspects related to general OCR errors, including length effects, edit operations, distance to the original NEs, case sensitivity, and segmentation errors.
4.2 Length effects
It has already been observed that shorter words are more affected by OCR engines (Kukich 1992). This section examines this aspect on NE tokens. In practice, the length of OCRed tokens may differ from the actual length of the tokens in the ground truth. For example, the OCRed token “Japgfl” of length 6 comes from the ground truth (GT) word “Japan” of length 5. We therefore analysed the effect of length over the OCRed tokens, since post-OCR algorithms focus on those rather than on the GT tokens. To do so, we first categorized contaminated NEs according to the length of their tokens and then identified among them those that were correctly tagged by the NER and NEL systems. Counts of correctly/incorrectly recognized NEs according to their lengths in our datasets are shown in Figure 2.
The analysis of OCRed token lengths (see Figure 2) shows two main findings. First, the vast majority of contaminated tokens (about 78%) are of length between 4 and 10. Second, most tokens outside this interval are correctly recognized despite OCR errors. We therefore suggest focusing post-OCR correction on tokens of length 4 to 10.
Contrary to NER, the length effect does not show a significant impact on NEL. The model is mostly able to overcome OCR errors regardless of the entity length. Table 5 shows the number of noisy entities in the AIDA dataset that are correctly (or incorrectly) linked according to their lengths.
4.3 Levenshtein distance
The Levenshtein distance is a measure of the difference between two strings. It is the edit distance required to convert one string into another using the classical edit operations (insertion, deletion, and substitution). With it, we can estimate the degree of modification that the OCR degradation generated compared with the clean version, and how this degradation affected the performance of the NER/NEL systems. Depending on the edit distance, there are single-error tokens (e.g. “Spain” vs. “Span”) and multierror tokens with edit distances higher than 1 (e.g. “Spain” vs. “Syin”). Mitton (1987) reported that single-error words largely outnumber multierror words in OCR outputs. The edit distance is an important criterion for post-OCR approaches, as it helps to filter potential candidates and to select relevant ones. Figure 3 gives the distribution of contaminated NEs (blue) in the CoNLL-03 corpus according to their Levenshtein distances to the GT, as well as the numbers of NEs correctly (orange) and wrongly (gray) recognized among them.
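The bucketing behind Figure 3 can be sketched as follows. The data below is invented for illustration, and the sketch assumes the third-party python-Levenshtein package for the distance computation (any edit-distance function, such as the one in the CER/WER sketch above, would do):

```python
# Bucket contaminated NE tokens by edit distance to their GT form and report
# how many in each bucket were correctly recognized. Toy data; assumes the
# third-party `python-Levenshtein` package (pip install python-Levenshtein).
from collections import Counter
import Levenshtein

pairs = [  # (ground truth, OCR output, correctly recognized?)
    ("Spain", "Span", True),
    ("Spain", "Syin", False),
    ("Japan", "Japgfl", False),
    ("Europe", "Europi", True),
]

by_distance = Counter(Levenshtein.distance(gt, ocr) for gt, ocr, _ in pairs)
correct = Counter(Levenshtein.distance(gt, ocr) for gt, ocr, ok in pairs if ok)

for d in sorted(by_distance):
    print(f"distance {d}: {by_distance[d]} NEs, {correct[d]} correctly recognized")
```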
As Figure 3 shows, most OCR errors are single-error tokens, with approximately 58.92% of the occurrences. Among multierror tokens, most have an edit distance of 2 (22.57% in total). The rate of contaminated NEs correctly recognized remains satisfactory even when the edit distance is larger than 2. However, from a distance of 6, the NER system cannot handle OCR errors anymore, since none of the NEs are correctly identified. Interestingly, correctly recognized NEs with edit distances of 3, 4, and 5 are almost always multitoken NEs, with the errors distributed over different tokens.
Regarding NEL, the OCR degradation on AIDA generated errors with a Levenshtein distance of at most 4. Table 6 shows the number of NEs correctly linked according to the Levenshtein distance between the noisy NEs and the corresponding ground truth.
As shown in Table 6, these errors had a limited impact on the performance of the NEL system. OCR errors are dominated by single-error tokens. The NEL system relies on a probability table $p(e|m)$ to identify the entity candidates related to an entity mention m. When a degraded mention m does not have a corresponding entry in this probability table, the NEL approach cannot correctly link it to the KB. For instance, the mention “Europe” exists in our probability table (possible entity candidates: continent, band, music album, and so on) and can be disambiguated to the correct candidate. However, the mentions “Europi”, “Europe.”, “Eurape”, and “europe” (Levenshtein distance of 1 to “Europe”) do not exist in the probability table and, consequently, the NEL system cannot disambiguate them because they do not have any corresponding candidate. When these errors correspond to additional punctuation marks or lower/upper-case mistakes (e.g. “france” vs. “France” or “France.” vs. “France”), a preprocessing method normalizes the mentions and fixes these OCR errors. Moreover, a co-referencing method is used to find all mentions that refer to the same entity in a text. This process potentially links degraded mentions to nondegraded mentions and thus allows the NEL system to correctly disambiguate some of the degraded mentions.
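The $p(e|m)$ lookup and the normalization fallback can be sketched as follows. The table content and probabilities are toy assumptions, not the system's actual statistics:

```python
# Sketch of the p(e|m) candidate lookup described above, with the simple
# normalization fallback for punctuation and casing errors.
import string

P_E_M = {  # mention -> {candidate entity: p(e|m)}; toy values
    "Europe": {"Europe_(continent)": 0.85, "Europe_(band)": 0.15},
    "France": {"France": 0.98, "France_(band)": 0.02},
}
NORMALIZED = {m.lower(): m for m in P_E_M}

def candidates(mention: str) -> dict:
    if mention in P_E_M:                        # exact surface form
        return P_E_M[mention]
    cleaned = mention.strip(string.punctuation)  # "France." -> "France"
    key = NORMALIZED.get(cleaned.lower())        # "france" -> "France"
    return P_E_M.get(key, {})                    # {} -> cannot disambiguate

print(candidates("France."))  # normalization recovers the entry
print(candidates("Eurape"))   # substitution error: no candidates left
```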
For the two tasks, the analysis indicates that most contaminated NEs are of edit distance 1 or 2, in both the CoNLL (81.49%) and the AIDA (98.22%) corpora. When the Levenshtein distance exceeds 2, errors are often distributed over different tokens of the degraded NE. An edit distance threshold of 2 at the token level can therefore be adopted by post-OCR approaches to filter out irrelevant candidates.
4.4 Edit operations
In this section, we discuss in further detail the impact of the different types of edit operations. Three basic edit operation types can be performed (a classification sketch for single-error tokens follows the list):
- Substitution, where one character has been replaced by another.
- Deletion, where a character has simply not been recognized by the OCR.
- Insertion, where an additional character has been wrongly added.
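For single-error tokens, the operation type can be identified mechanically from the token lengths. The following is a minimal sketch assuming each GT/OCR pair differs by exactly one edit operation (example strings are taken from or modeled on those used in this section):

```python
# Classify the single edit operation relating a distance-1 OCR token to its
# GT form; assumes the pair differs by exactly one operation.
def edit_operation(gt: str, ocr: str) -> str:
    if len(ocr) == len(gt) - 1:
        return "deletion"       # a GT character was not recognized
    if len(ocr) == len(gt) + 1:
        return "insertion"      # an extra character was added
    if len(ocr) == len(gt) and gt != ocr:
        return "substitution"   # one character replaced by another
    raise ValueError("not a single-error pair")

print(edit_operation("Spain", "Span"))    # deletion
print(edit_operation("Spain", "Spaiin"))  # insertion
print(edit_operation("Spain", "Opain"))   # substitution
```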
Contaminated NEs necessarily contain at least one type of modification (deletion, insertion, or substitution). Nguyen et al. (2019) demonstrated that in around 23% of OCRed words, the three operations of deletion, insertion, and substitution appear together in the same word. Based on this and on the edit distance analysis (Section 4.3), post-OCR algorithms should pay more attention to single modification types than to their combinations when filtering potential candidates. To this end, we analysed the correlation between contaminated NEs and single-error types. Figure 4 and Table 7, respectively, show the corresponding distribution of contaminated NEs in the CoNLL-03 and AIDA corpora, together with the relative rates of NEs correctly/wrongly tagged in both corpora.
These results demonstrate two interesting facts. First, for NER, most of the contaminated NEs with one OCR error undergo either a substitution or an insertion operation. Second, the NEL corpus is clearly dominated by the substitution operation. On the other hand, the analysis shows that the deletion operation is easily handled by NER systems (79.5% of NEs are correctly recognized). For NEL on the AIDA corpus, contrary to NER, all the single-error NEs can be correctly linked regardless of the error type.
Our suggestion for OCR postprocessing methods is thus to focus on insertion and substitution operations when filtering potential candidates, which mainly benefits NER.
4.5 First character errors
In NEs, the first character generally has more importance than the other characters, for instance when it is a capital letter. This is illustrated by the fact that early systems for NE extraction only focused on capital letters, using them to identify and delimit NEs (in English) (McDonald 1993; Mikheev 1999). Each word (or sequence of words) with a capitalized first letter that did not occur in an ambiguous position (such as the beginning of a sentence, or in a capitalized title) was considered a NE. However, with OCR degradation, upper- and lower-case characters may be mixed up. NER results are therefore strongly impacted by case modifications (e.g. Apple vs. apple).
Mitton (1987) reported that 7% of the misspellings in his dataset occurred at the first character. Misspellings can change the first character into another capitalized character (e.g. “Spain” vs. “Opain”), into a lower-case character (e.g. “Spain” vs. “spain”), or into another type of character (e.g. “Spain” vs. “;pain”). Table 8 shows the percentages of NEs impacted by errors at the first character in NER and NEL.
The results show that the NER system is very sensitive to the first character: only 30.41% of the NEs are correctly recognized in spite of errors at the first character. For NEL, upper- and lower-case letters have a lower impact, since matching with KBs is not case-dependent. The NEL system is able to correctly link NEs even if the first character is wrongly lower-cased. However, a few errors remain critical, such as the common substitution of a lower-case “l” for an upper-case “I”, as in “Iowa” and “lowa”. In this case, the mention “lowa” does not exist in the probability table and, consequently, the NEL approach cannot link it to the KB.
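The distinction drawn here, between pure case changes (which NEL normalization can absorb) and genuine first-character substitutions (which break the $p(e|m)$ lookup), can be sketched as a simple check; the examples mirror those in the text:

```python
# Flag errors at the first character and distinguish pure case changes,
# which NEL normalization can absorb, from genuine substitutions.
def first_char_error(gt: str, ocr: str) -> str:
    if not gt or not ocr or gt[0] == ocr[0]:
        return "none"
    if gt[0].lower() == ocr[0].lower():
        return "case-only"     # e.g. "Spain" -> "spain"
    return "substituted"       # e.g. "Iowa" -> "lowa", "Spain" -> ";pain"

for gt, ocr in [("Spain", "spain"), ("Iowa", "lowa"), ("Spain", ";pain")]:
    print(f"{gt} -> {ocr}: {first_char_error(gt, ocr)}")
```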
In summary, when aiming at NER, we recommend that post-OCR systems specifically focus on the correction of first character errors, since only 30% of the contaminated NEs are properly recognized.
4.6 Segmentation errors
Also known as spacing errors, segmentation errors occur in two cases:
- over-segmentation, when a word is split into several words (generally due to different text alignments and spacing);
- under-segmentation, when multiple words are wrongly joined.
It is worth noting that over- and under-segmentation can occur simultaneously. In the case of NEs, segmentation errors occur when a white space character is omitted between the words of a multiword NE, or when white space characters are erroneously inserted between two characters in at least one token of a NE.
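Segmentation errors can be detected by comparing space-stripped forms: if the characters match once whitespace is removed, only the segmentation differs. A simplified sketch (with invented examples) that ignores simultaneous character-level errors:

```python
# Detect over-/under-segmentation of a NE by comparing space-stripped forms.
def segmentation_error(gt: str, ocr: str) -> str:
    if gt.replace(" ", "") != ocr.replace(" ", ""):
        return "character-level errors too"
    gt_tokens, ocr_tokens = gt.split(), ocr.split()
    if len(ocr_tokens) > len(gt_tokens):
        return "over-segmentation"   # e.g. "Washington" -> "Wash ington"
    if len(ocr_tokens) < len(gt_tokens):
        return "under-segmentation"  # e.g. "New York" -> "NewYork"
    return "none"

print(segmentation_error("New York", "NewYork"))        # under-segmentation
print(segmentation_error("Washington", "Wash ington"))  # over-segmentation
```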
Table 9 shows the percentages of NEs impacted by segmentation errors on the CoNLL-03 corpus. The impact is dramatic as only 19.25% of the contaminated NEs are correctly recognized.
All types of degradation in the AIDA dataset generated only a few cases of segmentation errors, and these were correctly disambiguated by the NEL system.
In conclusion, together with first character errors, segmentation errors have the highest impact on NER performance, and a similar yet weaker impact on NEL. To deal with segmentation errors, post-OCR based on auto-encoders or language models could be used in future works to attempt to decrease the impact of content degradation.
4.7 End-to-end NE processing
Finally, in order to better understand how OCR errors propagate from NER to NEL systems, we evaluate the overall impact of OCR errors in an end-to-end NEL scenario, which consists in both recognising (NER) and disambiguating (NEL) the entities into a KB. This section analyses the cumulative errors of this two-step pipeline and evaluates to what extent they impact the performance of NEL systems. We conducted two experiments: one (NEL-only) based on the LEV-1 degradations (Section 3.2), and the other based on the propagation of OCR errors between the NER and NEL systems (we applied the NER techniques on the OCRed version with errors, and then applied NEL directly on the output). The OCRed versions were obtained by applying the degradation procedures detailed in Section 3.2 (blurring, bleeding effect, phantom, and character degradation), as well as the mix of all degradations.
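The error propagation in this two-step pipeline can be sketched as follows. The `ner` and `nel` functions below are hypothetical stand-ins for the BiLSTM-CNN-CRF and Ganea and Hofmann systems, and the KB entries are invented; the point is only to show how a degraded mention that survives NER can still fail at the linking step:

```python
# End-to-end sketch: NER output feeds NEL, so NER errors on OCRed text
# propagate. `ner` and `nel` are toy stand-ins for the actual systems.
from __future__ import annotations

def ner(text: str) -> list[str]:
    # Stand-in: pretend capitalized tokens are entity mentions.
    return [tok for tok in text.split() if tok[:1].isupper()]

def nel(mention: str, kb: dict) -> str | None:
    # Stand-in: exact-match lookup; a degraded mention silently fails here.
    return kb.get(mention)

KB = {"France": "wikidata:Q142", "Paris": "wikidata:Q90"}
ocr_text = "Farnce plays in Paris"  # OCR corrupted "France"

for mention in ner(ocr_text):
    print(mention, "->", nel(mention, KB))
# Farnce -> None  (NER found the mention, NEL cannot link the degraded form)
# Paris -> wikidata:Q90
```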
In order to compare the performance of the NEL system in the disambiguation-only (Section 3.2.2) and end-to-end scenarios, we used our NER approach (BiLSTM-CNN-CRF) to recognize the entities in the AIDA dataset and then Ganea and Hofmann’s approach to disambiguate them to a KB. Table 10 shows the impact of this propagation of OCR errors from NER to NEL, that is, the performance evolution between the NEL-only system (from Section 3.2.2, reported for comparison) and the end-to-end combination for each OCR degradation; the most notable performance losses are shown in bold. Despite the good NEL performance in the NEL-only scenario for all versions of the AIDA dataset, we can observe that the combination of NER and NEL caused a performance drop of 12%. Among all OCR versions, LEV-MIX achieved the worst results.
As Table 10 shows, the NEL-only model unsurprisingly outperforms the end-to-end model. Linking the output of the NER system (end-to-end) is more complicated than disambiguating GT NEs annotated in the clean version. In the noisy versions, it is clear that NER errors on OCRed data impact the NEL process, leading to a performance drop of the end-to-end model compared to the NEL-only model. Nevertheless, despite this drop, we can consider that the end-to-end NEL pipeline achieved good results (almost 0.8 in F-score), which shows that the combination of NER and NEL systems can provide satisfying results for OCRed documents even without post-OCR correction. Recently, we proposed an analysis of the NEL process in order to overcome some OCR errors in historical documents (Linhares Pontes et al. 2020a), which can improve the performance of NEL systems on OCRed documents.
4.8 Discussion
Huynh et al. (Reference Huynh, Hamdi and Doucet2020) showed that post-OCR correction algorithms are able to improve NER results over noisy texts when error rates at the character and word levels, respectively, exceed 2% and 10%. However, this approach does not consistently increase NER performance and the results sometimes remain far from those obtained with clean text. To address this shortcoming, we believe that defining heuristics to assist post-OCR algorithms can remedy the issue in a more effective way.
To perform our analysis, the dataset selection was thus intended to target a dataset with varied OCR error types and a realistic number of errors. The spectrum of the analysis is meant to be broad in the types of degradation rather than in the amount of data processed. As shown in Section 3, the CoNLL-03 and AIDA corpora, which are frequently used in NER and NEL tasks, contain diverse types of OCR errors, and the five aspects analyzed in this section are sufficiently distributed in the noisy versions of each of them. As a matter of fact, OCR error rates in the CoNLL-03 corpus vary from 2% to 7% at the character level and from 8% to 23% at the word level, whereas in the AIDA corpus the CER is between 1% and 5% and the WER is between 4% and 18% (cf. Table 1). Our findings suggest that the performance of the NER and NEL systems does not depend so much on the type and breadth of OCR errors as on their extent.
5. Conclusion
The recognition and linking of NEs in OCRed documents remain challenging compared with the clean versions of these documents (Linhares Pontes et al. 2019; Hamdi et al. 2020). The errors generated by OCR engines and degraded documents impact the performance of NER/NEL systems. Despite the recent progress achieved with neural networks and post-OCR correction systems, several improvements can be made in order to minimize the impact of these errors and reduce the performance gap between OCRed and clean documents on these tasks.
In order to identify the types of OCR errors and propose possible solutions towards this goal, we presented an in-depth quantitative analysis of the types of OCR errors and their impact on the performance of the NER and NEL tasks, both jointly and separately. We selected a dataset that, on the one hand, has annotations for both tasks and, on the other hand, contains variable OCR error types. The study covers five types of OCR errors, and the analysis of their impact on NER and NEL output led to many interesting findings. The length analysis showed that most contaminated NEs which the NER/NEL systems fail to recognize/link are of length between 5 and 15 characters. The edit distance analysis showed that most contaminated NEs contain either single- or double-character errors; post-OCR techniques should therefore be able to fix about 81.49% of the impacted NEs with an edit distance threshold of 2. Moreover, our observations showed that character deletion errors are the most easily overcome by NER/NEL systems, contrary to character insertion and substitution errors. Post-OCR algorithms are thus recommended to primarily favor the correction of character substitutions, then insertions. For NEL, post-OCR can be limited to correcting substitutions, as the edit operation analysis showed that the NEL system is able to deal with errors generated by deletion and insertion. Regarding the position of erroneous characters, our observations showed that NER systems are particularly vulnerable to errors on the first character, notably when the case is changed. For NEL, however, case sensitivity has little impact, while the substitution of the first character does. Finally, the analysis showed that segmentation errors have a very strong impact on NER performance. Post-OCR correction techniques need to be tailored towards these kinds of errors to best benefit the NER task.
In future works, we will extend our analysis to include space errors occurring at the end of rows in multicolumn documents, which lead to block segmentation errors. We additionally plan to rely on the probabilities of OCR outputs at the character and word levels to process NEs. Using the background knowledge of this study, we could predict NEs and improve the precision of NER and NEL systems. Another important point that we aim to study is NE-focused OCR post-correction. Given the importance of NEs in the activity of the users of digital libraries, post-OCR solutions geared towards NEs would have a high impact on their access to information.
Acknowledgement
This work has been supported by the European Union’s Horizon 2020 research and innovation program under grant 770299 (NewsEye) and by the ANNA project funded by the Nouvelle-Aquitaine Region.