Skip to content

6.8 Profile Hidden Markov Models and Biological Applications

6.8 Profile Hidden Markov Models and Biological Applications

Section titled “6.8 Profile Hidden Markov Models and Biological Applications”

In the previous sections, we developed Hidden Markov Models as a general framework for modeling biological sequences. We introduced their structure, discussed the fundamental computational problems, and showed how parameters can be learned from data.

We now take the final step and connect these ideas to one of the most important practical applications in bioinformatics:

Modeling sequence families and motifs using Profile Hidden Markov Models.


6.8.1 From Pairwise Comparison to Family Models

Section titled “6.8.1 From Pairwise Comparison to Family Models”

In earlier chapters, we approached sequence comparison through pairwise alignment. This allowed us to detect similarity between two sequences and to infer possible evolutionary relationships.

However, many biological questions go beyond pairwise comparisons. Instead of asking whether two sequences are similar, we often ask:

  • Does a sequence belong to a known protein family?
  • Does it contain a known functional motif?
  • Is it evolutionarily related to a group of sequences, even if similarity is weak?

In such cases, we are not comparing one sequence to another, but rather comparing a sequence to a family of sequences.

A natural way to represent such a family is through a multiple sequence alignment (MSA).


6.8.2 Limitations of Multiple Sequence Alignments

Section titled “6.8.2 Limitations of Multiple Sequence Alignments”

A multiple sequence alignment summarizes conserved and variable regions across a set of sequences. It provides valuable insights into:

  • conserved residues
  • variable positions
  • insertion and deletion patterns

However, as a computational model, an alignment has limitations:

  • it is a static representation
  • it does not explicitly define a generative process
  • it cannot easily assign probabilities to new sequences
  • it struggles with insertions and deletions in a principled way

To overcome these limitations, we transform the alignment into a probabilistic model.


A Profile Hidden Markov Model is an HMM derived from a multiple sequence alignment. It captures both the composition of conserved regions and the variability observed across the alignment.

The construction proceeds in several steps.


Not all columns in an alignment are equally informative. Some contain many gaps and contribute little to the definition of the motif.

A common approach is to define a threshold:

  • columns with too many gaps are removed
  • the remaining columns define the core positions of the model

These core positions correspond to match states in the HMM.


Each conserved column in the alignment becomes a match state MiM_i.

For each match state, we estimate emission probabilities by counting how often each symbol appears in that column and normalizing the counts.

This step is closely related to the construction of position probability matrices, but it is now embedded within a larger probabilistic model.


Biological sequences often contain insertions relative to the consensus motif. To model this, we introduce insertion states.

Insertion states allow:

  • emission of symbols not aligned to core positions
  • variable-length insertions between match states

These states typically have emission probabilities reflecting more general sequence composition.


Deletions are handled differently. Instead of emitting symbols, deletion states are silent states:

  • they do not emit any symbol
  • they allow the model to skip match states

This enables the model to represent sequences in which certain conserved positions are absent.


Finally, we define transitions between states:

  • match → match (continuing the motif)
  • match → insertion (introducing variability)
  • insertion → insertion (extending an insertion)
  • insertion → match (returning to the motif)
  • match → deletion (skipping a position)
  • deletion → match (resuming the motif)

These transitions encode the allowed structural variations of the sequence.


6.8.4 Interpretation: Modeling Evolutionary Processes

Section titled “6.8.4 Interpretation: Modeling Evolutionary Processes”

A key conceptual advantage of Profile HMMs is that they do not merely describe an alignment. Instead, they model the evolutionary processes that generate sequence variation.

In this view:

  • match states represent conserved positions
  • insertion states represent insertion events
  • deletion states represent deletion events

Thus, a Profile HMM can be interpreted as a probabilistic model of how a sequence family evolves over time.

This interpretation explains why Profile HMMs are often more sensitive than alignment-based methods when detecting distant homologs.


It is instructive to compare Profile HMMs with classical alignment approaches such as Needleman–Wunsch or Smith–Waterman.

AspectAlignment methodsProfile HMMs
Representationpairwise or multiple alignmentprobabilistic model
Insertions/deletionshandled via scoring schemeexplicitly modeled via states
Scoringheuristic or substitution matricesprobabilistic likelihood
Sensitivitylimited for distant homologshigher sensitivity

The key difference is that Profile HMMs integrate:

  • position-specific information
  • transition structure
  • probabilistic scoring

into a unified framework.


Profile HMMs are widely used in bioinformatics, particularly in large-scale sequence analysis.


One of the most important applications is the classification of proteins into families.

The challenge is that:

  • known family members may be few
  • evolutionary divergence may be large
  • pairwise similarity may be weak

Profile HMMs address this by aggregating information across all known family members.

Even if a new sequence is only weakly similar to any single sequence in the family, it may still match the overall profile well.


Hidden Markov Models can also be extended to more complex models for gene prediction.

In this setting, states may represent:

  • exons
  • introns
  • intergenic regions
  • start and stop codons

By modeling transitions between these states, HMMs can identify gene structures in genomic sequences.


Another application is the prediction of protein secondary structure.

Here, hidden states correspond to structural elements such as:

  • α-helices
  • β-sheets
  • coils

Emission probabilities reflect the tendency of specific amino acids to occur in each structural context.


Profile HMMs bring together all key ideas of this chapter:

  • probabilistic modeling of sequences
  • hidden states representing biological context
  • dynamic programming algorithms for inference
  • parameter estimation from data

They demonstrate how abstract models can be translated into powerful tools for real biological problems.


Hidden Markov Models provide a flexible framework for modeling biological sequences. Profile HMMs extend this framework to sequence families by incorporating:

  • conserved positions
  • insertions and deletions
  • evolutionary variation

As a result, they form the basis of many modern tools for sequence analysis and annotation.


  1. Why are multiple sequence alignments insufficient as probabilistic models?
  2. What roles do match, insertion, and deletion states play in a Profile HMM?
  3. How do Profile HMMs model evolutionary processes?
  4. Why are Profile HMMs more sensitive than pairwise alignment methods?
  5. In which types of biological problems are Profile HMMs particularly useful?

With this, we have completed the development of Hidden Markov Models as a framework for sequence modeling.

In the next chapter, we will extend these ideas further and explore how probabilistic models can be used to reconstruct evolutionary relationships between sequences.