Spring 2002, Feb. 15 lecture
Reference: Dr. Rick Hershberger's guidelines on ClustalW (http://www.rickhershberger.com/darwin2000/msa/Guided%20Activity)
Multiple sequence alignments by ClustalW
The multiple sequence alignment algorithm is sometimes called a "many-against-each-other" search because the input is a small, defined set of sequences that are compared only against each other, not against an entire database. This contrasts with the BLAST homology search algorithm, a "one-against-all" search in which the input is a single sequence that is compared against all other known sequences in the database. Thus the starting point for a multiple sequence alignment is a set of sequences that are already presumed to be homologous. Many multiple sequence alignment tools are available on the Internet. One widely used tool is ClustalW (http://www.ebi.ac.uk/clustalw). We will leave the alignment methods behind those tools to later lectures and demonstrate here how to use ClustalW.
Link to the European Bioinformatics Institute's ClustalW server (http://www.ebi.ac.uk/clustalw). We will use a supercomputer in Hinxton, near Cambridge, England for our sequence alignment. There are many options in the job submission window, but each option has a link that describes it. Some options should look familiar to you, such as MATRIX, GAP OPEN, and GAP EXTENSION. For a simple demonstration, we will use the default options in our example.
Link to the HbB_FASTAs.txt file (http://www.bioactivesite.com/darwin2000/msa/HbB_FASTAs.txt). This is a collection of hemoglobin beta chain protein sequences from a variety of species, in the FASTA format required by many biocomputing servers. Select all of these sequences and copy the text from this web page. Paste your selected hemoglobin sequences, in FASTA format, into the large text entry field. You can add or delete sequences within the entry field. Click on Run ClustalW to run your multiple sequence alignment.
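For readers who have not seen FASTA before, each record is a header line beginning with ">" followed by one or more lines of sequence. The following is a minimal sketch of reading such text in Python. The function name parse_fasta and the two toy records are made up for illustration; they are not taken from HbB_FASTAs.txt.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; join them into one string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input (hypothetical records, not real hemoglobin sequences).
example = """>seqA toy record
MVHLT
PEEK
>seqB toy record
MVHLS
"""

records = parse_fasta(example)
for header, seq in records:
    print(header, len(seq))
```

Counting the parsed records is a quick way to confirm that every sequence you pasted into the entry field was copied completely.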
Note: Full use of the features of the EBI site requires a Java-enabled web browser. If you experience difficulty accomplishing the tasks described below, you may suspect that your web browser is not configured with Java support. Visit www.netscape.com to download Navigator or Communicator, or visit www.microsoft.com to download Internet Explorer. Check during the installation process that Java is installed with the web browser software.
The ClustalW result page first shows a summary of the multiple sequence set. We have 13 sequences in the hemoglobin beta chain data. Following the summary is the pairwise alignment result, because the multiple alignment in ClustalW is actually derived from a succession of pairwise alignments. Then there is a description of sequence groups, which is used in building a guide tree for the multiple sequences. If we are interested only in the multiple alignment itself, we can skip this part.
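To get a feel for the size of the pairwise stage, note that n sequences form n(n-1)/2 distinct pairs. The sketch below enumerates the pairs for 13 sequences; the labels seq1 to seq13 are placeholders, not the names in the hemoglobin file.

```python
from itertools import combinations


def count_pairs(n):
    """Number of distinct unordered pairs among n sequences."""
    return n * (n - 1) // 2


# Placeholder labels standing in for the 13 hemoglobin entries.
labels = [f"seq{i}" for i in range(1, 14)]
pairs = list(combinations(labels, 2))

print(len(pairs))       # pairs enumerated explicitly
print(count_pairs(13))  # same count from the formula
```

Both lines print 78, so the pairwise section of the result page should contain 78 pairwise scores for this data set.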
The results of the multiple sequence alignment are shown in a series of stacked lines, each line representing one of the sequences in the query set. Gaps (dashes) are introduced as necessary to maximize the alignment of identical or similar residues among the set of sequences. These gaps reflect insertion or deletion events during evolution. At the bottom of each stack of aligned sequences are symbols that summarize the alignment at that position in the sequence. An asterisk denotes a position at which all query sequences have exactly the same amino acid. Dots indicate the degree of similarity when the position is not completely conserved: a colon (":") marks strongly similar residues and a period (".") marks weakly similar ones.
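The asterisk rule can be sketched in a few lines of Python. This is a simplified illustration only: it marks fully identical columns and leaves every other column blank, so it does not reproduce ClustalW's rules for the colon and period symbols. The function name conservation_line and the toy alignment are assumptions made for this example.

```python
def conservation_line(aligned):
    """Return a string with '*' under every fully conserved column.

    `aligned` is a list of equal-length aligned sequences in which '-'
    marks a gap. Columns containing a gap or more than one residue
    type are marked with a space.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            marks.append("*")
        else:
            marks.append(" ")
    return "".join(marks)


# Toy alignment (made-up fragments, not the hemoglobin data).
toy = ["MV-LTPE", "MVHLSPE", "MV-LTAE"]
for row in toy:
    print(row)
print(conservation_line(toy))
```

In the toy alignment, columns 1, 2, 4, and 7 receive an asterisk; column 3 does not because two of the rows have a gap there.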
For a more graphical view of the alignment, click on the gray button labeled "JalView". (For instruction on all of JalView's features, click on the text link "Use JalView".) A new browser window will open (don't close the old one!). In the JalView window, colored boxes group homologous residues. The darker the color, the greater the percentage of sequences within the set that have the same residue at that position. Notice that this view lets you quickly spot broad regions of high homology, and note individual sequences that are non-homologous at a given position.
At the end of the ClustalW result page, the guide tree for the multiple alignment is given. The tree shows the relationships among the sequences and also guides the order of the alignment, which starts from a pair of sequences and then progressively adds the remaining sequences. We will study trees further in later lectures.
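The idea behind the shading can be approximated by computing, for each alignment column, the fraction of sequences that share the most common residue. The sketch below is an assumption about how such a score might be computed; it does not reproduce JalView's actual color thresholds, and the function name and toy alignment are made up.

```python
from collections import Counter


def column_identity(aligned):
    """Fraction of sequences sharing the most common residue, per column.

    Gaps ('-') are not counted as residues, but the denominator is the
    total number of sequences, so gappy columns score lower.
    """
    fractions = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        fractions.append(top / len(column))
    return fractions


# Toy alignment (made-up fragments, not the hemoglobin data).
toy = ["MV-LTPE", "MVHLSPE", "MV-LTAE"]
for position, fraction in enumerate(column_identity(toy), start=1):
    print(position, round(fraction, 2))
```

Columns with a fraction of 1.0 correspond to the darkest boxes, and lower fractions correspond to lighter ones.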
Inferring Structure-Function Relationships by Identifying Conserved Residues and Regions
Identify the amino acid residues that are absolutely conserved among the complete set of sequences (marked by asterisks; there are eight). Switch to the JalView window. Record the position number within the human sequence for each conserved residue by clicking on that residue. The sequence entry and position will appear in the lower left corner of the applet window ("Sequence ID: Human (4) Residue=P (37)"). This proline at position 37 of the human sequence is the first of the eight conserved residues. Do not use the numbers at the top of the alignment, as these numbers indicate positions within the alignment, not within any single sequence entry. Note that amino acid number 1 in the human sequence is position number 10 within the alignment. Also note that each gap introduced in the human sequence further shifts the numbering of residues within the human sequence relative to the number scale at the top of the alignment.
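The difference between alignment positions and sequence positions comes down to counting gaps: the residue number equals the alignment column minus the number of gaps at or before that column. A minimal sketch follows, assuming 1-based numbering for both scales. The function name residue_number and the toy aligned row are hypothetical; the row is simply built with nine leading gaps so that residue 1 falls in column 10, as in the text.

```python
def residue_number(aligned_seq, column):
    """Convert a 1-based alignment column to a 1-based residue number.

    Returns None when the sequence has a gap ('-') in that column.
    """
    if not 1 <= column <= len(aligned_seq):
        raise ValueError("column is outside the alignment")
    if aligned_seq[column - 1] == "-":
        return None
    # Subtract every gap that occurs up to and including this column.
    return column - aligned_seq[:column].count("-")


# Toy aligned row (made up): nine leading gaps, then one internal gap.
toy_row = "---------MVHL-TPE"

print(residue_number(toy_row, 10))  # M in column 10 is residue 1
print(residue_number(toy_row, 14))  # gap column, so None
print(residue_number(toy_row, 15))  # T in column 15 is residue 5
```

The internal gap in column 14 shifts every later residue by one more position, which is why the offset between the two number scales is not constant along the alignment.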
(1) What amino acid residues within the human beta globin protein sequence appear to be conserved among vertebrate (and one invertebrate) hemoglobins? List amino acid and sequence position for each. How might these individual amino acids be important for protein structure or function?
(2) Are there any broad regions of consistent homology (as opposed to absolute conservation) among the sequences? How might these series of amino acids be important for protein structure or function?
Now view the GenBank sequence database entry for the human hemoglobin beta chain protein (www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=2144721&form=6&db=p&Dopt=g).
(3) Are any of the amino acids or regions you identified as conserved known to be important for hemoglobin structure or function as indicated in the Features table of the database record?