Forest-based algorithms in natural language processing

Liang Huang, University of Pennsylvania


Many problems in Natural Language Processing (NLP) involve an efficient search for the best derivation among exponentially many candidates. For example, a parser aims to find the best syntactic tree for a given sentence among all derivations under a grammar, and a machine translation (MT) decoder explores the space of all possible translations of the source-language sentence. In these cases, the concept of a packed forest provides a compact representation of huge search spaces by sharing common sub-derivations, which makes efficient algorithms based on Dynamic Programming (DP) possible. Building upon the hypergraph formulation of forests and well-known 1-best DP algorithms, this dissertation develops fast and exact k-best DP algorithms on forests, which are orders of magnitude faster than previously used methods on state-of-the-art parsers. We also show empirically how the improved output of our algorithms has the potential to improve results from parse reranking systems and other applications. We then extend these algorithms to approximate search when the forests are too big for exact inference. We discuss two particular instances of this new method: forest rescoring for MT decoding and forest reranking for parsing. In both cases, our methods perform orders of magnitude faster than conventional approaches. In the latter, faster search also leads to better learning: our approximate decoding makes whole-Treebank discriminative training practical and results in an accuracy better than that of any previously reported system trained on the Treebank. Finally, we apply the above techniques to the problem of syntax-based translation and propose a new paradigm, forest-based translation. This scheme translates a packed forest of the source sentence into a target sentence, rather than just using 1-best or k-best parses as in usual practice. By considering exponentially many alternatives, it alleviates the propagation of parsing errors into translation, yet comes with only fractional overhead in running time. We also push this direction further to extract translation rules from packed forests. The combined results of forest-based decoding and rule extraction show significant improvements in translation quality in large-scale experiments, and consistently outperform the hierarchical system Hiero, one of the best performing systems to date.
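
To make the hypergraph view of a packed forest concrete, the following is a minimal sketch of the well-known 1-best (Viterbi-style) DP that the abstract takes as its starting point. It is an illustration only, not the dissertation's implementation: the names `Hyperedge` and `best_derivations`, the toy forest, and all costs are hypothetical, and lower cost is assumed to be better.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Hyperedge:
    """One way to build `head` from the sub-derivations in `tails` (hypothetical type)."""
    head: str
    tails: Tuple[str, ...]
    cost: float  # illustrative cost; lower is better


def best_derivations(
    topological_order: List[str],
    incoming: Dict[str, List[Hyperedge]],
) -> Dict[str, Tuple[float, str]]:
    """Return the lowest-cost derivation (cost, bracketed tree) for every node.

    Nodes must be listed so that every tail appears before its head.
    Each node is solved once and reused by every hyperedge that shares it,
    which is where the packed representation saves work.
    """
    best: Dict[str, Tuple[float, str]] = {}
    for node in topological_order:
        edges = incoming.get(node, [])
        if not edges:
            # Leaf node: zero cost, the derivation is the node itself.
            best[node] = (0.0, node)
            continue
        candidates = []
        for edge in edges:
            total = edge.cost + sum(best[t][0] for t in edge.tails)
            tree = "(" + node + " " + " ".join(best[t][1] for t in edge.tails) + ")"
            candidates.append((total, tree))
        best[node] = min(candidates)
    return best


# Toy forest over three leaves with two competing bracketings of the root S.
incoming = {
    "X": [Hyperedge("X", ("a", "b"), 1.0)],
    "Y": [Hyperedge("Y", ("b", "c"), 2.0)],
    "S": [
        Hyperedge("S", ("X", "c"), 1.0),  # left-branching analysis
        Hyperedge("S", ("a", "Y"), 0.5),  # right-branching analysis
    ],
}
order = ["a", "b", "c", "X", "Y", "S"]
best = best_derivations(order, incoming)
print(best["S"])  # (2.0, '(S (X a b) c)')
```

In this toy forest the root has two incoming hyperedges, so it packs two derivations; the DP keeps only the cheaper one at each node. A k-best algorithm would instead keep a ranked list of candidates per node.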

Subject Area

Computer science

Recommended Citation

Huang, Liang, "Forest-based algorithms in natural language processing" (2008). Dissertations available from ProQuest. AAI3346133.