Machine translation using probabilistic synchronous dependency insertion grammars
Abstract
This thesis addresses the use of Probabilistic Synchronous Dependency Insertion Grammars (PSDIG) for syntax-based statistical machine translation (SMT). Dependency Insertion Grammar (DIG) is a generative grammar formalism that captures word order phenomena using a dependency representation. Its synchronous version, Synchronous DIG (SDIG), aims to capture structural divergences across languages. We prove that DIG has a generative capacity weakly equivalent to that of context-free grammars (CFG). In SDIG, the parallel sub-sentential dependency structures are defined as Elementary Tree (ET) pairs. By comparing DIG and SDIG to TAG and Synchronous TAG, we show how such formalisms are linguistically motivated. We propose a framework for learning an SDIG from parallel corpora based on synchronous tree partitioning. We introduce three algorithms that break down sentence-level parallel dependency trees into phrase-level ET pairs. (1) The synchronous hierarchical partitioning algorithm iteratively adds category constraints to word-level alignments, breaking down the dependency tree pairs and generating more fine-grained ET pairs at each iteration. However, its greedy nature motivates the second algorithm: (2) the exhaustive learner, which removes the category constraints and collects all compatible treelet pairs. Both of these algorithms use a set of heuristics in the tree-to-tree mapping process, which are combined through a Maximum Entropy model. (3) We also introduce a grammar learner that specifically learns treelet pairs that are, at the same time, linear n-gram phrases. Combining the grammar rules learned by algorithms (2) and (3) improved the MT system performance. We introduce a decoding algorithm based on several log-linearly interpolated models, including a trigram language model.
According to the BLEU automatic MT evaluation metric [Papineni et al., 2002], the PSDIG MT system performs significantly better than IBM Model 4 [Brown et al., 1990, 1993] and on par with the state-of-the-art public-domain phrase-based system Pharaoh [Koehn, 2004]. Analysis shows that PSDIG and phrase-based SMT each excel on different sentences, which suggests that the two approaches could be combined. The improved integration of syntax on both the source and target sides opens the door to more sophisticated SMT processes.
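The log-linear interpolation of models mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's decoder: the feature names (`translation`, `language_model`), the weights, and the probabilities are all hypothetical, chosen only to show how a hypothesis score is formed as a weighted sum of log-domain model scores.

```python
import math


def log_linear_score(features, weights):
    """Weighted sum of log-domain model scores for one hypothesis."""
    return sum(weights[name] * value for name, value in features.items())


def best_hypothesis(hypotheses, weights):
    """Return the hypothesis with the highest log-linear score."""
    return max(hypotheses, key=lambda h: log_linear_score(h["features"], weights))


# Hypothetical interpolation weights; in practice these would be tuned.
weights = {"translation": 1.0, "language_model": 0.5}

# Hypothetical candidates with made-up model probabilities.
hypotheses = [
    {
        "text": "candidate A",
        "features": {
            "translation": math.log(0.20),
            "language_model": math.log(0.01),
        },
    },
    {
        "text": "candidate B",
        "features": {
            "translation": math.log(0.10),
            "language_model": math.log(0.10),
        },
    },
]

best = best_hypothesis(hypotheses, weights)
print(best["text"])
```

With these made-up numbers, candidate A has the higher translation probability, but its low language-model probability outweighs that advantage, so candidate B is selected.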
Subject Area
Computer science
Recommended Citation
Ding, Yuan, "Machine translation using probabilistic synchronous dependency insertion grammars" (2006). Dissertations available from ProQuest. AAI3246151.
https://repository.upenn.edu/dissertations/AAI3246151