Departmental Papers (CIS)

Date of this Version

August 2001

Document Type

Conference Paper


Presented at the Workshop on Data Mining in Bioinformatics 2001 (BIOKDD 2001).


Many of the same modeling methods used for natural languages, specifically Markov models and HMMs, have also been applied to biological sequence analysis. In recent years, natural language models have been improved by maximum entropy methods, which allow information based on the entire history of a sequence to be considered. This is in contrast to the Markov models that have been the standard for most biological sequence models, whose predictions are generally based on some fixed number of previous emissions. To test the utility of maximum entropy modeling for biological sequence analysis, we used these methods to model amino acid sequences. Our results show that there is significant long-distance information in amino acid sequences and suggest that maximum entropy techniques may be beneficial for a range of biological sequence analysis problems.


Keywords

maximum entropy, amino acids, sequence analysis



Date Posted: 21 May 2005