Nenkova, Ani

Search Results — showing 1–10 of 23
  • Publication
    To Memorize or to Predict: Prominence Labeling in Conversational Speech
    (2007-04-01) Nenkova, Ani; Brenier, Jason; Kothari, Anubha; Calhoun, Sasha; Whitton, Laura; Beaver, David; Jurafsky, Dan
    The immense prosodic variation of natural conversational speech makes it challenging to predict which words are prosodically prominent in this genre. In this paper, we examine a new feature, accent ratio, which captures how likely it is that a word will be realized as prominent or not. We compare this feature with traditional accent-prediction features (based on part of speech and N-grams) as well as with several linguistically motivated and manually labeled information structure features, such as whether a word is given, new, or contrastive. Our results show that the linguistic features do not lead to significant improvements, while accent ratio alone can yield prediction performance almost as good as the combination of any other subset of features. Moreover, this feature is useful even across genres; an accent-ratio classifier trained only on conversational speech predicts prominence with high accuracy in broadcast news. Our results suggest that carefully chosen lexicalized features can outperform less fine-grained features.
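The accent-ratio feature described above can be estimated directly from corpus counts: for each word, the fraction of its occurrences that were realized as prominent. The sketch below illustrates that idea only; the corpus format and function name are hypothetical, and the paper's actual feature additionally applies a significance threshold not shown here.

```python
from collections import Counter

def accent_ratios(tokens):
    """Estimate a simple accent ratio per word.

    `tokens` is a hypothetical corpus format: an iterable of
    (word, is_accented) pairs. The ratio for a word is the fraction
    of its occurrences labeled as prosodically prominent.
    """
    total = Counter()
    accented = Counter()
    for word, is_accented in tokens:
        total[word] += 1
        if is_accented:
            accented[word] += 1
    return {w: accented[w] / total[w] for w in total}

# Toy example: function words are rarely accented, content words often are.
corpus = [("the", False), ("the", False), ("important", True),
          ("the", False), ("important", True), ("important", False)]
ratios = accent_ratios(corpus)
# ratios["the"] -> 0.0, ratios["important"] -> 2/3
```

A classifier can then treat the ratio (or a thresholded version of it) as a single lexicalized feature for predicting prominence.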
  • Publication
    Using Entity Features to Classify Implicit Discourse Relations
    (2010-09-01) Joshi, Aravind K; Louis, Annie; Nenkova, Ani; Prasad, Rashmi
    We report results on predicting the sense of implicit discourse relations between adjacent sentences in text. Our investigation concentrates on the association between discourse relations and properties of the referring expressions that appear in the related sentences. The properties of interest include coreference information, grammatical role, information status and syntactic form of referring expressions. Predicting the sense of implicit discourse relations based on these features is considerably better than a random baseline and several of the most discriminative features conform with linguistic intuitions. However, these features do not perform as well as lexical features traditionally used for sense prediction.
  • Publication
    Discourse Indicators for Content Selection in Summarization
    (2010-09-01) Joshi, Aravind K; Louis, Annie; Nenkova, Ani
    We present analyses aimed at eliciting which specific aspects of discourse provide the strongest indication for text importance. In the context of content selection for single document summarization of news, we examine the benefits of both the graph structure of text provided by discourse relations and the semantic sense of these relations. We find that structure information is the most robust indicator of importance. Semantic sense only provides constraints on content selection but is not indicative of important content by itself. However, sense features complement structure information and lead to improved performance. Further, both types of discourse information prove complementary to non-discourse features. While our results establish the usefulness of discourse features, we also find that lexical overlap provides a simple and cheap alternative to discourse for computing text structure with comparable performance for the task of content selection.
  • Publication
    Automatic Summarization
    (2011-06-01) Nenkova, Ani; McKeown, Kathleen
    It has now been 50 years since the publication of Luhn’s seminal paper on automatic summarization. During these years the practical need for automatic summarization has become increasingly urgent and numerous papers have been published on the topic. As a result, it has become harder to find a single reference that gives an overview of past efforts or a complete view of summarization tasks and necessary system components. This article attempts to fill this void by providing a comprehensive overview of research in summarization, including the more traditional efforts in sentence extraction as well as the most novel recent approaches for determining important content, for domain and genre specific summarization and for evaluation of summarization. We also discuss the challenges that remain open, in particular the need for language generation and deeper semantic understanding of language that would be necessary for future advances in the field.
  • Publication
    Structural Features for Predicting the Linguistic Quality of Text: Applications to Machine Translation, Automatic Summarization and Human-Authored Text
    (2010-01-01) Nenkova, Ani; Chae, Jieun; Louis, Annie; Pitler, Emily
    Sentence structure is considered to be an important component of the overall linguistic quality of text. Yet few empirical studies have sought to characterize how and to what extent structural features determine fluency and linguistic quality. We report the results of experiments on the predictive power of syntactic phrasing statistics and other structural features for these aspects of text. Manual assessments of sentence fluency for machine translation evaluation and text quality for summarization evaluation are used as gold-standard. We find that many structural features related to phrase length are weakly but significantly correlated with fluency and classifiers based on the entire suite of structural features can achieve high accuracy in pairwise comparison of sentence fluency and in distinguishing machine translations from human translations. We also test the hypothesis that the learned models capture general fluency properties applicable to human-authored text. The results from our experiments do not support the hypothesis. At the same time structural features and models based on them prove to be robust for automatic evaluation of the linguistic quality of multi-document summaries.
  • Publication
    Information Status Distinctions and Referring Expressions: An Empirical Study of References to People in News Summaries
    (2011-03-25) Siddharthan, Advaith; Nenkova, Ani; McKeown, Kathleen
    Although there has been much theoretical work on using various information status distinctions to explain the form of references in written text, there have been few studies that attempt to automatically learn these distinctions for generating references in the context of computer-regenerated text. In this article, we present a model for generating references to people in news summaries that incorporates insights from both theory and a corpus analysis of human written summaries. In particular, our model captures how two properties of a person referred to in the summary—familiarity to the reader and global salience in the news story—affect the content and form of the initial reference to that person in a summary. We demonstrate that these two distinctions can be learned from a typical input for multi-document summarization and that they can be used to make regeneration decisions that improve the quality of extractive summaries.
  • Publication
    Detecting Prominence in Conversational Speech: Pitch Accent, Givenness and Focus
    (2008-01-01) Rangarajan Sridhar, Vivek Kumar; Nenkova, Ani; Narayanan, Shrikanth; Jurafsky, Dan
    The variability and reduction that are characteristic of talking in natural interaction make it very difficult to detect prominence in conversational speech. In this paper, we present analytic studies and automatic detection results for pitch accent, as well as on the realization of information structure phenomena like givenness and focus. For pitch accent, our conditional random field model combining acoustic and textual features has an accuracy of 78%, substantially better than chance performance of 58%. For givenness and focus, our analysis demonstrates that even in conversational speech there are measurable differences in acoustic properties and that an automatic detector for these categories can perform significantly above chance.
  • Publication
    Revisiting Readability: A Unified Framework for Predicting Text Quality
    (2008-10-01) Pitler, Emily; Nenkova, Ani
    We combine lexical, syntactic, and discourse features to produce a highly predictive model of human readers’ judgments of text readability. This is the first study to take into account such a variety of linguistic factors and the first to empirically demonstrate that discourse relations are strongly associated with the perceived quality of text. We show that various surface metrics generally expected to be related to readability are not very good predictors of readability judgments in our Wall Street Journal corpus. We also establish that readability predictors behave differently depending on the task: predicting text readability or ranking the readability. Our experiments indicate that discourse relations are the one class of features that exhibits robustness across these two tasks.
  • Publication
    Predicting the Fluency of Text with Shallow Structural Features: Case Studies of Machine Translation and Human-Written Text
    (2009-03-01) Chae, Jieun; Nenkova, Ani
    Sentence fluency is an important component of overall text readability but few studies in natural language processing have sought to understand the factors that define it. We report the results of an initial study into the predictive power of surface syntactic statistics for the task; we use fluency assessments done for the purpose of evaluating machine translation. We find that these features are weakly but significantly correlated with fluency. Machine and human translations can be distinguished with accuracy over 80%. The performance of pairwise comparison of fluency is also very high—over 90% for a multi-layer perceptron classifier. We also test the hypothesis that the learned models capture general fluency properties applicable to human-written text. The results do not support this hypothesis: prediction accuracy on the new data is only 57%. This finding suggests that developing a dedicated, task-independent corpus of fluency judgments will be beneficial for further investigations of the problem.
  • Publication
    Automatically Evaluating Content Selection in Summarization Without Human Models
    (2009-08-01) Louis, Annie; Nenkova, Ani
    We present a fully automatic method for content selection evaluation in summarization that does not require the creation of human model summaries. Our work capitalizes on the assumption that the distribution of words in the input and an informative summary of that input should be similar to each other. Results on a large scale evaluation from the Text Analysis Conference show that input-summary comparisons are very effective for the evaluation of content selection. Our automatic methods rank participating systems similarly to manual model-based pyramid evaluation and to manual human judgments of responsiveness. The best feature, Jensen- Shannon divergence, leads to a correlation as high as 0.88 with manual pyramid and 0.73 with responsiveness evaluations.