Improving named entity recognition with co-training and unlabeled bilingual data

Xiaoyi Ma, University of Pennsylvania


Supervised learning systems require a large quantity of labeled data, which is time-consuming and expensive to create and in some cases requires linguistic expertise. Semi-supervised methods combine labeled data with unlabeled data, which are abundant in the form of monolingual, parallel, or comparable text in many domains, to improve classifiers trained on labeled data alone. Used correctly, these unlabeled data can be exploited to improve state-of-the-art classifiers. This dissertation investigates ways to use unlabeled parallel and comparable bilingual text to boost the performance of named entity taggers. We treat each side of the bilingual text as a distinct view of the same data, where each view by itself would be sufficient for learning if there were enough labeled data. Machine learning problems in a parallel document context therefore fit naturally into a co-training framework, in which different views of the same document complement and improve each other through an iterative process. This approach also allows us to extend our method to document pairs that are not translations but have similar content, because for many annotation tasks (such as named entity recognition) these comparable documents are "parallel" at the document level in the sense that they have the same content and are concerned with the same people, places, and organizations. Co-training allows us to achieve the following under one framework: (1) improving current state-of-the-art taggers; (2) adapting existing taggers to new domains; (3) inducing taggers for a new language from resource-rich languages, such as English. This dissertation describes experiments and results from applying a co-training algorithm to unlabeled English-Chinese, English-Spanish, and English-Arabic bilingual text, including comparable text, to improve named entity taggers. The results show considerable improvement in the named entity taggers when the co-training algorithm and unlabeled bilingual text are used.
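
The iterative two-view process described in the abstract can be illustrated with a minimal Python sketch. This is not the dissertation's implementation: `CountTagger`, `co_train`, the `threshold` parameter, and the toy data are all hypothetical names and values chosen for illustration. The sketch also assumes that entity mentions on the two sides of the bilingual text have already been paired, whereas real named entity taggers are sequence models and the pairing itself is a hard problem.

```python
from collections import Counter, defaultdict


class CountTagger:
    """Toy stand-in for a named entity tagger (hypothetical, for illustration).

    It predicts the majority label seen for a mention during training.
    """

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, examples):
        # Retrain from scratch on (mention, label) pairs.
        self.counts = defaultdict(Counter)
        for mention, label in examples:
            self.counts[mention][label] += 1

    def predict(self, mention):
        """Return (label, confidence), or (None, 0.0) for an unseen mention."""
        seen = self.counts.get(mention)
        if not seen:
            return None, 0.0
        label, n = seen.most_common(1)[0]
        return label, n / sum(seen.values())


def co_train(seed_a, seed_b, unlabeled_pairs, rounds=3, threshold=0.9):
    """Co-train two taggers, one per language view.

    seed_a / seed_b: labeled (mention, label) examples for each view.
    unlabeled_pairs: (mention_a, mention_b) pairs assumed to refer to the
    same entity on the two sides of the bilingual text.
    """
    train_a, train_b = list(seed_a), list(seed_b)
    pool = list(unlabeled_pairs)
    tagger_a, tagger_b = CountTagger(), CountTagger()

    for _ in range(rounds):
        tagger_a.train(train_a)
        tagger_b.train(train_b)
        remaining = []
        for mention_a, mention_b in pool:
            label_a, conf_a = tagger_a.predict(mention_a)
            label_b, conf_b = tagger_b.predict(mention_b)
            used = False
            # A confident prediction in one view becomes a new training
            # example for the tagger of the other view.
            if conf_a >= threshold:
                train_b.append((mention_b, label_a))
                used = True
            if conf_b >= threshold:
                train_a.append((mention_a, label_b))
                used = True
            if not used:
                remaining.append((mention_a, mention_b))
        if len(remaining) == len(pool):
            break  # no pair was labeled this round, so stop early
        pool = remaining

    # Final retraining so both taggers see everything that was added.
    tagger_a.train(train_a)
    tagger_b.train(train_b)
    return tagger_a, tagger_b


# Toy English/Spanish data (invented for this example).
seed_en = [("London", "LOC"), ("Obama", "PER")]
seed_es = [("Madrid", "LOC")]
pairs = [("London", "Londres"), ("Madrid", "Madrid"), ("Obama", "Obama")]

tagger_en, tagger_es = co_train(seed_en, seed_es, pairs)
print(tagger_en.predict("Madrid"))   # learned from the Spanish view
print(tagger_es.predict("Londres"))  # learned from the English view
```

In this toy run, each tagger picks up labels for mentions it never saw in its own seed data, because the other view was confident about the paired mention. The same loop applies in principle to comparable (non-translated) document pairs, provided mentions of the same entity can be matched at the document level.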

Subject Area

Computer science

Recommended Citation

Ma, Xiaoyi, "Improving named entity recognition with co-training and unlabeled bilingual data" (2008). Dissertations available from ProQuest. AAI3346163.