Departmental Papers (CIS)

Date of this Version

May 2004

Document Type

Conference Paper


Copyright © 2004 IEEE. Reprinted from Proceedings of the 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004), held 17-24 May 2004, Montreal, Quebec, Canada. Publisher URL:

This material is posted here with permission of the IEEE. Such permission of the IEEE does not in any way imply IEEE endorsement of any of the University of Pennsylvania's products or services. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to By choosing to view this document, you agree to all provisions of the copyright laws protecting it.


We investigate a simple algorithm that combines multiband processing and least squares fits to estimate f0 contours in speech. The algorithm is untraditional in several respects: it makes no use of FFTs or autocorrelation at the pitch period; it updates the pitch incrementally on a sample-by-sample basis; it avoids peak picking and does not require interpolation in time or frequency to obtain high resolution estimates; and it works reliably, in real time, without the need for postprocessing to produce smooth contours. We show that a baseline implementation of the algorithm, though already quite accurate, is significantly improved by incorporating a model of statistical learning into its final stages. Model parameters are estimated from training data to minimize the likelihood of gross errors in f0 as well as errors in classifying voiced versus unvoiced speech. Experimental results on several databases confirm the benefits of statistical learning.


speech processing, statistical learning



Date Posted: 27 July 2004

This document has been peer reviewed.