Low-resource Named Entity Recognition

Stephen Mayhew, University of Pennsylvania


Most of the success in natural language processing (NLP) in the last 20 years has come from statistical machine learning methods that discover complex patterns in text and make predictions. These methods traditionally require supervised data, which is nearly always created by humans, as a gold standard for that task. But as we look to extend these successes to other languages, we are faced with the daunting task of starting from scratch. The years of effort that went into creating annotations for English and a select few popular languages must be relived for each new language. This unrealistic requirement means that, as we seek to perform old tasks in new languages, we must use existing resources or rapidly develop new ones. In particular, we study the problem of Named Entity Recognition (NER) in low-resource languages. The task of NER is to find and classify names in text, and the low-resource qualifier signifies that we build these models without access to training data. This thesis discusses the use of incidental signals for developing NER systems, such as character sequences indicative of named entities or partially annotated text of the kind that might come from non-speaker annotations. It describes new methods for cross-lingual NER, exploiting such resources as Wikipedia and bilingual lexicons. The penultimate chapter applies several prominent techniques to a broad array of test languages, giving valuable insights into what has been accomplished and what is left to do. The final chapter distils knowledge from several years of experience building low-resource NER systems into a practical guide.

Subject Area

Artificial intelligence

Recommended Citation

Mayhew, Stephen, "Low-resource Named Entity Recognition" (2019). Dissertations available from ProQuest. AAI27665861.