IRCS Technical Reports Series
Document Type
Technical Report
Date of this Version
October 2000
Abstract
This document describes the part-of-speech (POS) tagging guidelines for the Penn Chinese Treebank Project. The goal of the project is to create a 100,000-word corpus of Mandarin Chinese text with syntactic bracketing. The Chinese Treebank has been released through the Linguistic Data Consortium (LDC) and is available to the public.
The POS tagging guidelines have been revised several times during the two-year period of the project. The previous two versions were completed in December 1998 and March 1999, respectively. This document is the third and final version. We have added an introductory chapter to explain some of the rationale behind certain decisions in the guidelines. We also include English glosses for the Chinese words in the guidelines.
In this document, we first discuss the criteria for POS tagging and other factors that we considered when designing our POS tagset. Second, we describe each of the thirty-three POS tags in detail. Third, we provide tests to distinguish certain POS tag pairs and specify the treatment of some common collocations. Fourth, we list a number of words for each POS tag. Finally, we compare our tagset with three other tagsets: the tagset for the Academia Sinica Balanced Corpus in Taiwan (CKIP, 1995), the tagset for the Grammatical Knowledge Base developed by Peking University in China (Yu et al., 1998), and the tagset for the English Penn Treebank (Santorini, 1990).
Date Posted: 11 August 2006
Comments
University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-00-07.