Guidelines for Penn Korean Treebank Version 2.0
Abstract
The Korean Treebank Annotations Version 2.0 is the second volume of the Korean Treebank Annotations (Palmer et al., 2002; Han et al., 2002). It contains new texts from the news domain: the source corpus for the Korean Treebank 2.0 was extracted from the Korean Newswire corpus published by LDC, catalog number LDC2000T45. The Korean Treebank Annotations Version 2.0 consists of 647 news articles in 112 files, which contain 132,040 words and 5,010 sentences. There are 40,252 unique words and 13,844 unique morphemes (12,681 unique morphemes excluding foreign characters and Arabic numerals). The annotated text is about 8.5 MB in size. Annotating the new texts brought up many linguistic constructions and phenomena that called for additional guidelines. In addition, a few guidelines used for the first volume of the Korean Treebank were re-examined and modified for the second volume. This document outlines the guidelines newly introduced for the second volume of the Penn Korean Treebank, as well as those revised since the publication of the first volume. It is therefore not a self-contained document, but an addendum to the two previously published guidelines for the Penn Korean Treebank (Han and Han, 2001; Han et al., 2001).