CCGbank: A Corpus of CCG Derivations and Dependency Structures Extracted from the Penn Treebank

Penn collection
Departmental Papers (CIS)
Discipline
Subject
Computer Sciences
Author
Hockenmaier, Julia
Steedman, Mark
Abstract

This article presents an algorithm for translating the Penn Treebank into a corpus of Combinatory Categorial Grammar (CCG) derivations augmented with local and long-range word–word dependencies. The resulting corpus, CCGbank, includes 99.4% of the sentences in the Penn Treebank. It is available from the Linguistic Data Consortium, and has been used to train wide-coverage statistical parsers that obtain state-of-the-art rates of dependency recovery. In order to obtain linguistically adequate CCG analyses, and to eliminate noise and inconsistencies in the original annotation, an extensive analysis of the constructions and annotations in the Penn Treebank was called for, and a substantial number of changes to the Treebank were necessary. We discuss the implications of our findings for the extraction of other linguistically expressive grammars from the Treebank, and for the design of future treebanks.

Publication date
2007-02-21
Comments
Suggested Citation: Hockenmaier, J. and Steedman, M. (2007). CCGbank: A Corpus of CCG Derivations and Dependency Structures Extracted from the Penn Treebank. Computational Linguistics. Vol. 33(3). p. 355-396. © 2007 Massachusetts Institute of Technology Press http://www.mitpressjournals.org/loi/coli