The Part-Of-Speech Tagging Guidelines for the Penn Chinese Treebank (3.0)

Penn collection
IRCS Technical Reports Series
Xia, Fei

This document describes the part-of-speech (POS) tagging guidelines for the Penn Chinese Treebank Project. The goal of the project is to create a 100-thousand-word corpus of Mandarin Chinese text with syntactic bracketing. The Chinese Treebank has been released through the Linguistic Data Consortium (LDC) and is available to the public. The POS tagging guidelines were revised several times during the two-year period of the project. The two previous versions were completed in December 1998 and March 1999, respectively. This document is the third and final version. We have added an introductory chapter to explain the rationale behind certain decisions in the guidelines, and we now include English glosses for the Chinese words. In this document, we first discuss the criteria for POS tagging and other factors that we considered when designing our POS tagset. Second, we describe each of the thirty-three POS tags in detail. Third, we provide tests to distinguish certain pairs of POS tags and specify the treatment of some common collocations. Fourth, we list a number of words for each POS tag. Finally, we compare our tagset with three others: the tagset for the Academia Sinica Balanced Corpus in Taiwan (CKIP, 1995), the tagset for the Grammatical Knowledge Base developed by Peking University in China (Yu et al., 1998), and the tagset for the English Penn Treebank (Santorini, 1990).

University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-00-07.