PRACTICAL NAMED ENTITY RECOGNITION: THE ROLE OF ENTITY AND ITS CONTEXT
Discipline
Computer Sciences
Linguistics
Subject
information extraction
machine learning
named entity recognition
natural language processing
Abstract
Supervised neural models for named entity recognition perform well within the same domain but do not reach high accuracy on entities unseen in the (pre-)training data. For better generalization, it is essential that models are able to recognize predictive contextual clues. In this thesis, we explore the role of entity names and the context (sentence) in which they appear in named entity recognition. We quantify the generalization ability of models by probing the degree to which they learn names versus contexts. We define constraining contexts as contexts with strong selectional preferences for the entity type. We argue that for constraining contexts, models should be able to recognize the entity type correctly regardless of the word identity. At the same time, we recognize that there is a generalization limit for named entity recognition that depends on the prevalence of constraining contexts, the accuracy of their automatic identification, and, for other contexts, the names appearing in the model (pre-)training data. We determine the feasibility of developing such a model by conducting human studies and by developing methods for the identification of constraining contexts. From a practical perspective, since named entity recognition models are often developed for targeted applications, we also examine the robustness of models to challenges encountered in practice. Specifically, we study the effect of entities from different countries of origin, the effect of fine-grained topics within a domain often treated as homogeneous, and the effects of temporal changes. Because benchmark datasets are homogeneous, it is challenging to identify the areas where model performance may suffer; a practical solution for better performance remains to collect representative training data samples for each such area.