IRCS Technical Reports Series
Document Type
Technical Report
Date of this Version
October 2005
Abstract
The Korean Treebank Annotations Version 2.0 is the second volume of The Korean Treebank Annotations (Palmer et al., 2002; Han et al., 2002). It contains new texts from the news domain: the source corpus for the Korean Treebank 2.0 was extracted from The Korean Newswire corpus published by LDC (catalog number LDC2000T45). The Korean Treebank Annotations Version 2.0 consists of 647 news articles in 112 files, containing 132,040 words and 5,010 sentences. There are 40,252 unique words and 13,844 unique morphemes (12,681 unique morphemes excluding foreign characters and Arabic numerals). The annotated text is about 8.5 MB in size.
Annotating the new texts revealed many new linguistic constructions and phenomena that called for additional guidelines. Furthermore, a few guidelines used for the first volume of the Korean Treebank were re-examined and modified for the second volume. This document outlines the guidelines newly introduced for the second volume of the Penn Korean Treebank, as well as those revised since the publication of volume 1.0. It is therefore not a self-contained document but an addendum to the two previously published guidelines for the Penn Korean Treebank (Han and Han, 2001; Han et al., 2001).
Date Posted: 07 August 2006
Comments
University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-5-03.