Final Examination of STAT 598K
December 3, 2001
Instructions: The exam is open book. It is due on December 14,
2001. There are 4 questions of equal credit. Data sets for some
questions are available through the hyperlinks. Please work INDEPENDENTLY.
You are welcome to discuss your questions with the instructor, but NO
group work is allowed. Please write your answers in a clear format.
- The first data set includes
6 sequences of human leucine zipper transcription factors. Use the
Feng-Doolittle progressive alignment algorithm to align these
sequences.
You do not have to write a program to perform the alignment. To make
the problem easier, use the BLAST pairwise alignment tool instead of
dynamic programming, and simply use the BLAST scores as the distances
between pairs of sequences. If you get tired of performing pairwise
alignments on the web, Standalone BLAST will be a better choice, but it
takes some work to learn how to use it.
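As a rough illustration of how pairwise scores can drive the alignment order, the sketch below converts scores to distances with the Feng-Doolittle style formula D = -ln S_eff, where S_eff = (S_obs - S_rand) / (S_max - S_rand) and S_max is the mean of the two self-alignment scores. Several things here are assumptions and not part of the exam: S_rand is set to 0, average-linkage clustering stands in for the Fitch-Margoliash tree building of the original method, and the sequence names and scores are invented.

```python
import math
from itertools import combinations

def score_to_distance(s_ab, s_aa, s_bb, s_rand=0.0):
    """D = -ln S_eff with S_eff = (S_ab - S_rand) / (S_max - S_rand).

    S_max is the mean of the two self-alignment scores.
    s_rand = 0 is a simplifying assumption; s_ab must exceed s_rand.
    """
    s_max = (s_aa + s_bb) / 2.0
    s_eff = (s_ab - s_rand) / (s_max - s_rand)
    return -math.log(s_eff)

def guide_order(names, dist):
    """Return the merge order from average-linkage clustering.

    dist maps frozenset({a, b}) to the distance between sequences a and b.
    """
    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        def avg(c1, c2):
            total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
            return total / (len(c1) * len(c2))
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Invented scores for three hypothetical sequences A, B, C.
self_scores = {"A": 100.0, "B": 100.0, "C": 100.0}
pair_scores = {("A", "B"): 80.0, ("A", "C"): 30.0, ("B", "C"): 40.0}
dist = {
    frozenset(pair): score_to_distance(s, self_scores[pair[0]], self_scores[pair[1]])
    for pair, s in pair_scores.items()
}
merges = guide_order(["A", "B", "C"], dist)
```

With these toy numbers the closest pair (A, B) is aligned first and C is then added to that group. Each merge corresponds to one pairwise or group alignment step in the progressive procedure.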
- The second data set has 29
helix-turn-helix protein sequences. An alignment
of these sequences detects a common motif in the sequence set.
Build a profile of the motif from the alignment result, and use the
profile to search for the motif in a new
sequence.
- The motif is assumed to have no gaps, and the positions inside the
motif are assumed to be independent. The profile of the motif can be
defined either by a position-specific probability distribution of amino
acids or by a position-specific score matrix (PSSM). Calculate the PSSM
for the profile. Use pseudocounts when computing the amino acid
probabilities, where the pseudocount of each residue type is
proportional to its background frequency. The total weight of the
pseudocounts is usually about 10% of the number of sequences.
- Use the profile of the motif to detect the motif region of a new sequence.
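The two steps above can be sketched as follows. With N aligned motif instances, pseudocount weight B = 0.1 N, background frequency q_a and observed count c_{i,a} of residue a at position i, one common choice is p_{i,a} = (c_{i,a} + B q_a) / (N + B) and score s_{i,a} = log2(p_{i,a} / q_a). The uniform background, the three-residue toy motifs and the query sequence below are assumptions made only for illustration; a real answer would use the 29 aligned sequences and a realistic background distribution.

```python
import math

AMINO = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(motifs, background, weight=0.1):
    """Log-odds PSSM from ungapped, equal-length motif instances."""
    n = len(motifs)
    width = len(motifs[0])
    b = weight * n  # total pseudocount mass, about 10% of the sequence count
    pssm = []
    for i in range(width):
        column = [m[i] for m in motifs]
        scores = {}
        for a in AMINO:
            p = (column.count(a) + b * background[a]) / (n + b)
            scores[a] = math.log2(p / background[a])
        pssm.append(scores)
    return pssm

def scan(pssm, seq):
    """Slide the PSSM along seq; return (start, score) of the best window."""
    width = len(pssm)
    best = None
    for start in range(len(seq) - width + 1):
        s = sum(pssm[i][seq[start + i]] for i in range(width))
        if best is None or s > best[1]:
            best = (start, s)
    return best

# Assumption: uniform background; toy motif instances and query sequence.
background = {a: 1.0 / len(AMINO) for a in AMINO}
motifs = ["QRE", "QRD", "QKE", "QRE"]
pssm = build_pssm(motifs, background)
best_start, best_score = scan(pssm, "MAAQRELLG")
```

Because positions are assumed independent, the score of a window is simply the sum of the per-position scores, and the highest-scoring window is the predicted motif region.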
- Verify that the E. coli proteins gyrase A (875 amino acids) and
gyrase B (804 amino acids) are functionally linked.
Using computational methods, you should apply at least the protein
phylogenetic profile method and the Rosetta stone method to draw your
conclusion. Please show the detailed procedures and results of your use
of the tools. A one-sentence statement of the final result will not be
accepted. Also be aware that the Expect values (E-values) you choose
when performing BLAST searches will determine the phylogenetic profiles.
Thus you may have to try different E-values.
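A minimal sketch of the two ideas, under stated assumptions: a phylogenetic profile is taken to be a 0/1 vector over genomes (1 if the best BLAST hit in that genome has an E-value at or below the cutoff), and a Rosetta stone candidate is a single subject protein that both queries hit in non-overlapping regions. All genome names, subject identifiers, E-values and coordinates below are invented placeholders, not real BLAST output.

```python
def profile(best_evalues, genomes, cutoff):
    """Binary presence/absence vector: 1 if the best hit passes the cutoff."""
    return [1 if best_evalues.get(g, float("inf")) <= cutoff else 0
            for g in genomes]

def hamming(p, q):
    """Number of genomes in which the two profiles disagree."""
    return sum(x != y for x, y in zip(p, q))

def rosetta_candidates(hits_a, hits_b):
    """Subjects hit by both queries in non-overlapping regions.

    hits_x maps a subject id to the (start, end) of the aligned
    region on the subject sequence.
    """
    found = []
    for subject in hits_a.keys() & hits_b.keys():
        (a0, a1), (b0, b1) = hits_a[subject], hits_b[subject]
        if a1 < b0 or b1 < a0:
            found.append(subject)
    return sorted(found)

# Invented best-hit E-values per genome for the two query proteins.
genomes = ["genome_1", "genome_2", "genome_3", "genome_4", "genome_5"]
evalues_a = {"genome_1": 1e-120, "genome_2": 1e-30, "genome_3": 1e-4, "genome_5": 2.0}
evalues_b = {"genome_1": 1e-110, "genome_2": 1e-8, "genome_3": 1e-40, "genome_4": 5.0}

strict = hamming(profile(evalues_a, genomes, 1e-10), profile(evalues_b, genomes, 1e-10))
loose = hamming(profile(evalues_a, genomes, 1e-3), profile(evalues_b, genomes, 1e-3))

# Invented subject coordinates for the Rosetta stone check.
hits_a = {"fusion_x": (10, 480), "protein_y": (5, 300)}
hits_b = {"fusion_x": (520, 1100), "protein_y": (100, 400)}
candidates = rosetta_candidates(hits_a, hits_b)
```

In this toy example the two profiles differ in two genomes at a cutoff of 1e-10 but agree everywhere at 1e-3, which is why the choice of E-value matters when comparing profiles.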
- A gene expression data set is
obtained from Spellman's yeast cell cycle experiments. The data include
the genes' names, their cell cycle stages, and the log-ratios of red and
green fluorescent intensities from the alpha factor and CDC15
synchronization experiments.
- Use two clustering algorithms to classify this data set. Check your
results against the known cell cycle stages and give a detailed report.
Also compare the performance of the two clustering algorithms you used.
- A gene class expression profile can be obtained by averaging over all
the genes in that class. Study the time series graphs of the class
profiles and give an ordering of the classes. This ordering should be
consistent with the cell cycle.
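The class averaging described above can be sketched as below. Ordering the classes by the time point at which each mean profile peaks is only one possible heuristic and is an assumption here, as are the gene names and log-ratio values, which are invented. The real data may also contain missing values that this sketch does not handle.

```python
from collections import defaultdict

def class_profiles(rows):
    """Mean log-ratio profile per class.

    rows is an iterable of (gene, stage, values), where values is a list
    of log-ratios, one per time point, with the same length for all genes.
    """
    grouped = defaultdict(list)
    for _gene, stage, values in rows:
        grouped[stage].append(values)
    return {
        stage: [sum(col) / len(col) for col in zip(*profiles)]
        for stage, profiles in grouped.items()
    }

def order_by_peak(profiles):
    """Sort classes by the index of the time point with maximal mean expression."""
    return sorted(profiles, key=lambda s: profiles[s].index(max(profiles[s])))

# Invented toy data: five hypothetical genes, four time points.
rows = [
    ("gene_a", "S", [-0.5, 0.2, 1.0, 0.1]),
    ("gene_b", "G1", [0.1, 1.2, 0.3, -0.4]),
    ("gene_c", "G1", [0.3, 0.8, 0.1, -0.6]),
    ("gene_d", "G2/M", [-0.8, -0.2, 0.4, 1.1]),
    ("gene_e", "S", [-0.3, 0.4, 0.8, 0.3]),
]
profiles = class_profiles(rows)
order = order_by_peak(profiles)
```

Since the cell cycle is periodic, an ordering obtained this way is meaningful only up to the choice of starting point, so it should still be checked against the plotted time series.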