Towards high-performance word sense disambiguation by combining rich linguistic knowledge and machine learning approaches

Jinying Chen, University of Pennsylvania

Abstract

Supervised word sense disambiguation (WSD) for truly polysemous words (in contrast to homonyms) is difficult for machine learning, mainly due to two problems: the lack of sense-tagged training data and the sparsity of the matrix of observed instances vs. features. At the same time, high accuracy is necessary for WSD to be beneficial for high-level applications, such as information retrieval, question answering, and machine translation. This paper addresses these two problems by combining rich linguistic knowledge with machine learning methods. Our work makes two major contributions. First, we demonstrate empirically that careful design and generation of linguistically motivated features help to alleviate the data sparseness inherent in WSD. We built a supervised WSD system using a smoothed maximum entropy (MaxEnt) model and linguistically motivated features (e.g., the semantic categories of a verb's noun phrase (NP) arguments). With three specific enhancements to automatic feature generation, our system achieved the best published results in an evaluation using the SENSEVAL2 English verbs with fine-grained senses (64.6% accuracy; 16.7 senses on average). We then developed a WSD-based feature generation method to filter out semantic features associated with irrelevant senses of a verb's NP arguments, which improved system performance on verbs whose senses rely heavily on their NP arguments. To generalize semantic features, we developed a new clustering algorithm that automatically acquires semantically coherent noun groups from large text corpora. Using semantic features associated with these noun groups improved our system's performance further. The second contribution of our work is showing the effectiveness of active learning in creating more labeled training data for supervised WSD. Our experiments showed that two uncertainty-based active learning methods, combined with the smoothed MaxEnt model, reduced the required training data by one half to three quarters when learning coarse-grained English verb senses. This suggests that active learning has high potential for reducing sense annotation effort and for facilitating the development of a broad-coverage, high-performance supervised WSD system.
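The abstract does not name the two uncertainty-based selection criteria, so the following is only a minimal sketch of the general idea, assuming two common choices (prediction entropy and the margin between the top two senses). The function names, instance ids, and probability values are hypothetical and are not taken from the dissertation.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predicted sense distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def margin(probs):
    """Gap between the two most probable senses; a small gap means uncertain."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

def select_for_annotation(pool, k, score=entropy, highest=True):
    """Return the ids of the k most uncertain instances in the pool.

    pool maps an instance id to the classifier's predicted sense distribution.
    Use highest=True for entropy and highest=False for margin.
    """
    ranked = sorted(pool, key=lambda i: score(pool[i]), reverse=highest)
    return ranked[:k]

# Hypothetical classifier outputs for three unlabeled verb instances
# (assumed values, for illustration only).
pool = {
    "inst_a": [0.90, 0.05, 0.05],
    "inst_b": [0.40, 0.35, 0.25],
    "inst_c": [0.49, 0.48, 0.03],
}

by_entropy = select_for_annotation(pool, 1)
by_margin = select_for_annotation(pool, 1, score=margin, highest=False)
```

In a full active learning loop, the selected instances would be sent to an annotator, added to the training set, and the classifier retrained before the next selection round. The two criteria can disagree: here entropy favors the instance whose probability mass is spread over all three senses, while margin favors the instance whose top two senses are nearly tied.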

Subject Area

Information systems|Computer science

Recommended Citation

Chen, Jinying, "Towards high-performance word sense disambiguation by combining rich linguistic knowledge and machine learning approaches" (2006). Dissertations available from ProQuest. AAI3246146.
https://repository.upenn.edu/dissertations/AAI3246146
