Guidelines for Penn Korean Treebank Version 2.0

Penn collection
IRCS Technical Reports Series
Author
Ryu, Shijong
Abstract

The Korean Treebank Annotations Version 2.0 is the second volume of the Korean Treebank Annotations (Palmer et al., 2002; Han et al., 2002). It contains new texts from the news domain: the original corpus for the Korean Treebank 2.0 was extracted from the Korean Newswire corpus published by LDC (catalog number LDC2000T45). Version 2.0 consists of 647 news articles in 112 files, containing 132,040 words and 5,010 sentences. There are 40,252 unique words and 13,844 unique morphemes (12,681 unique morphemes excluding foreign characters and Arabic numerals). The annotated text is about 8.5 MB in size. While annotating the new texts, many new linguistic constructions and phenomena were encountered that called for additional guidelines. In addition, a few guidelines used for the first volume of the Korean Treebank were re-examined and modified for the second volume. This document outlines the guidelines newly introduced for the second volume of the Penn Korean Treebank, as well as those revised since the publication of volume 1.0. It is therefore not a self-contained document, but an addendum to the two previously published guidelines for the Penn Korean Treebank (Han and Han, 2001; Han et al., 2001).

Publication date
2005-10-20
Comments
University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-5-03.