Models for Improved Tractability and Accuracy in Dependency Parsing
natural language processing
Automatic syntactic analysis of natural language is one of the fundamental problems in natural language processing. Dependency parses (directed trees in which edges represent the syntactic relationships between the words in a sentence) have been found to be particularly useful for machine translation, question answering, and other practical applications. For English dependency parsing, we show that models and features compatible with how conjunctions are represented in treebanks yield a parser with state-of-the-art overall accuracy and substantial improvements in the accuracy of conjunctions. For languages other than English, dependency parsing has often been formulated as either searching over trees without any crossing dependencies (projective trees) or searching over all directed spanning trees. The former sacrifices the ability to produce many natural language structures; the latter is NP-hard in the presence of features with scopes over siblings or grandparents in the tree. This thesis explores alternative ways to simultaneously produce crossing dependencies in the output and use models that parametrize over multiple edges. Gap inheritance is introduced in this thesis and quantifies the nesting of subtrees over intervals. The thesis provides O(n6) and O(n5) edge-factored parsing algorithms for two new classes of trees based on this property, and extends the latter to include grandparent factors. This thesis then defines 1-Endpoint-Crossing trees, in which for any edge that is crossed, all other edges that cross that edge share an endpoint. This property covers 95.8% or more of dependency parses across a variety of languages. A crossing-sensitive factorization introduced in this thesis generalizes a commonly used third-order factorization (capable of scoring triples of edges simultaneously). This thesis provides exact dynamic programming algorithms that find the optimal 1-Endpoint-Crossing tree under either an edge-factored model or this crossing-sensitive third-order model in O(n4) time, orders of magnitude faster than other mildly non-projective parsing algorithms and identical to the parsing time for projective trees under the third-order model. The implemented parser is significantly more accurate than the third-order projective parser under many experimental settings and significantly less accurate on none.