CCGbank: A Corpus of CCG Derivations and Dependency Structures Extracted from the Penn Treebank

Hockenmaier, Julia; Steedman, Mark

Penn collection

Departmental Papers (CIS)

Subject

Computer Sciences

Permalink

https://repository.upenn.edu/handle/20.500.14332/6519

Abstract

This article presents an algorithm for translating the Penn Treebank into a corpus of Combinatory Categorial Grammar (CCG) derivations augmented with local and long-range word–word dependencies. The resulting corpus, CCGbank, includes 99.4% of the sentences in the Penn Treebank. It is available from the Linguistic Data Consortium, and has been used to train wide-coverage statistical parsers that obtain state-of-the-art rates of dependency recovery. In order to obtain linguistically adequate CCG analyses, and to eliminate noise and inconsistencies in the original annotation, an extensive analysis of the constructions and annotations in the Penn Treebank was called for, and a substantial number of changes to the Treebank were necessary. We discuss the implications of our findings for the extraction of other linguistically expressive grammars from the Treebank, and for the design of future treebanks.

Publication date

2007-02-21

Comments

Suggested Citation: Hockenmaier, J. and Steedman, M. (2007). CCGbank: A Corpus of CCG Derivations and Dependency Structures Extracted from the Penn Treebank. Computational Linguistics. Vol. 33(3). p. 355-396. © 2007 Massachusetts Institute of Technology Press http://www.mitpressjournals.org/loi/coli

Collection

Articles