PRACTICAL NAMED ENTITY RECOGNITION: THE ROLE OF ENTITY AND ITS CONTEXT

Agarwal, Oshin

Files

Agarwal_upenngdas_0175C_15638.pdf (1.38 MB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Computer and Information Science

Discipline

Data Science
Computer Sciences
Linguistics

Subject

generalization
information extraction
machine learning
named entity recognition
natural language processing

Copyright date

2023

Permalink

https://repository.upenn.edu/handle/20.500.14332/58960

Author

Agarwal, Oshin

Abstract

Supervised neural models for named entity recognition perform well within the same domain but fail to recognize, with high accuracy, entities not seen in the (pre-)training data. For better generalization, it is essential that models are able to recognize predictive contextual clues. In this thesis, we explore the roles that entity names and the contexts (sentences) in which they appear play in named entity recognition. We quantify the generalization ability of models by probing the degree to which they learn names versus contexts. We define constraining contexts as contexts with strong selectional preferences for the entity type. We argue that for constraining contexts, models should be able to recognize the entity type correctly regardless of the word identity. At the same time, we acknowledge that there is a generalization limit for named entity recognition based on the prevalence of constraining contexts, the accuracy of their automatic identification, and the names appearing in the model (pre-)training data for other contexts. We determine the feasibility of developing such a model by conducting human studies and by developing methods for the identification of constraining contexts. From a practical perspective, since named entity recognition models are often developed for targeted applications, we also examine the robustness of models to challenges encountered in practice. Specifically, we study the effect of entities from different countries of origin, the effect of fine-grained topics within a domain often treated as homogeneous, and the effects of temporal changes. While it is challenging to identify the areas where model performance may suffer given the homogeneity of benchmark datasets, a practical solution for better performance remains collecting representative training data samples for each such area.

Advisor

Nenkova, Ani

Date of degree

2023

Collection

Dissertations and Theses