PRACTICAL NAMED ENTITY RECOGNITION: THE ROLE OF ENTITY AND ITS CONTEXT

Degree type
Doctor of Philosophy (PhD)
Graduate group
Computer and Information Science
Discipline
Data Science
Computer Sciences
Linguistics
Subject
generalization
information extraction
machine learning
named entity recognition
natural language processing
Copyright date
2023
Author
Agarwal, Oshin
Abstract

Neural supervised models for named entity recognition perform well within the same domain but cannot recognize with high accuracy entities that were not seen in the (pre-)training data. For better generalization, it is essential that models are able to recognize predictive contextual clues. In this thesis, we explore the role of entity names and the context (sentence) in which they appear in named entity recognition. We quantify the generalization ability of models by probing the degree to which they learn names vs. contexts. We define constraining contexts as contexts with strong selectional preferences for the entity type. We argue that for constraining contexts, models should be able to recognize the entity type correctly regardless of the word identity. At the same time, we recognize that there is a generalization limit for named entity recognition based on the prevalence of constraining contexts, the accuracy of their automatic identification, and the names appearing in the model (pre-)training data for other contexts. We determine the feasibility of developing such a model by conducting human studies and by developing methods for the identification of constraining contexts. From a practical perspective, since named entity recognition models are often developed for targeted applications, we also examine the robustness of models to challenges encountered in practice. Specifically, we study the effect of entities from different countries of origin, the effect of fine-grained topics within a domain often treated as homogeneous, and the effects of temporal changes. While it is challenging to identify the areas where model performance may suffer given the homogeneity of benchmark datasets, a practical solution for better performance remains to collect representative training data samples for each such area.

Advisor
Nenkova, Ani
Date of degree
2023