BIOL 495S/ CS 490B/ MATH 490B/ STAT 490B

 

Spring 2002, Jan. 30 lecture

Database Access

Reading: Text "A Primer of Genome Science", p. 21-26 and p. 86-87.

In this lecture, we learn how to access the NCBI (National Center for Biotechnology Information) databases. We use examples to study the Entrez, BLAST and Structure databases on the NCBI website.

2.      The locus is followed by the reference, including names of the authors, the journal and the title of the article in which the sequence was or will be published, and any historical notes on updates.

3.      The meat of the file is the features section, which has subheadings describing the known extent of the gene, the coding sequence (CDS) including the predicted protein sequence, and miscellaneous features (misc_feature) such as intron-exon boundaries, identified protein domains, variations, mutations (if the sequence is derived from a mutant strain), and alternate transcripts. All of these are associated with links.

4.      Next comes the base count (the number of each nucleotide in the sequence), followed by the sequence itself in blocks of 10 bases, 60 bases per row, with a running tally of the site number at the beginning of each row.

5.      Users have the option of displaying the sequence in FASTA format, which is simply an uninterrupted sequence of letters following a header, or of downloading it as a text file by clicking on the DISPLAY and SAVE boxes at the head of the file. XML and ASN.1 files can be used to export the file to certain bioinformatics applications.

6.      When we choose GRAPHICS in the display option, the LocusView representation of the sequence is shown. Users can also access an AceView display by typing "hoxa1" into the search box at the AceView home page. The AceView page of HOXA1 gives more detailed information about the sequence, which is especially useful for those who are not familiar with the annotation of "hoxa1".
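The flat-file layout described in items 4 and 5 can be sketched in a few lines of Python. This is only an illustration of the layout, not NCBI's own code: the function names and the made-up 72-base sequence are assumptions, and the 70-character FASTA line width is simply a common choice.

```python
from collections import Counter


def base_count(seq):
    """Count each nucleotide, as in the base-count line of a record."""
    counts = Counter(seq.lower())
    return {base: counts.get(base, 0) for base in "acgt"}


def genbank_rows(seq, block=10, per_row=60):
    """Format a sequence in blocks of 10 bases, 60 bases per row,
    with the position of the first base at the start of each row."""
    seq = seq.lower()
    rows = []
    for start in range(0, len(seq), per_row):
        row = seq[start:start + per_row]
        blocks = [row[i:i + block] for i in range(0, len(row), block)]
        rows.append(f"{start + 1:>9} " + " ".join(blocks))
    return rows


def fasta(header, seq, width=70):
    """Return a FASTA record: a '>' header line, then bare sequence lines."""
    lines = [">" + header]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines)


# A made-up 72-base sequence, used only to show the layout.
example = "ACGT" * 18

if __name__ == "__main__":
    print(base_count(example))
    print("\n".join(genbank_rows(example)))
    print(fasta("demo made-up sequence", example))
```

Running the script prints the base count, two numbered rows (starting at positions 1 and 61), and the same sequence as a FASTA record.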

In the search field choose "Protein", and input the accession identifier XP_004915.1. We get a similar report for this protein sequence. The amino acid sequence is shown at the end of the file. When we do not know the accession number, we can try key words in the search field. However, selecting the right sequence from the results is always a non-trivial problem.

·         BLAST stands for Basic Local Alignment Search Tool. It includes a set of programs designed to search for similarity between a query sequence and all of the available sequence databases.

The BLAST package provides programs for finding high-scoring local alignments between the query sequence and the target database. BLAST makes a list of all 'neighborhood words' of a fixed length (by default 3 for protein sequences, and 11 for nucleic acids) that would match the query sequence somewhere with a score higher than some threshold. It then scans through the database, and whenever it finds a word in this set, it starts a 'hit extension' process to extend the possible match as an ungapped alignment in both directions, stopping at the maximum scoring extension.
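The seed-and-extend idea can be illustrated with a toy Python sketch. It is a simplification and not the BLAST implementation: it uses exact words from the query instead of scored neighborhood words, a plain match/mismatch score instead of a substitution matrix, and an assumed drop-off limit (`x_drop`) to decide when to stop extending. All names and parameter values are hypothetical.

```python
def word_index(query, w):
    """Map every length-w word of the query to its start positions."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    return index


def extend(query, subject, qi, si, w, match=1, mismatch=-1, x_drop=3):
    """Extend a word hit without gaps in both directions.

    Returns (score, query_start, subject_start, length) of the
    best-scoring extension found.
    """
    score = best = w * match
    best_right = k = 0
    # Extend to the right of the word.
    while qi + w + k < len(query) and si + w + k < len(subject):
        score += match if query[qi + w + k] == subject[si + w + k] else mismatch
        k += 1
        if score > best:
            best, best_right = score, k
        elif best - score > x_drop:
            break
    # Extend to the left, starting again from the best score so far.
    score = best
    best_left = k = 0
    while qi - k - 1 >= 0 and si - k - 1 >= 0:
        score += match if query[qi - k - 1] == subject[si - k - 1] else mismatch
        k += 1
        if score > best:
            best, best_left = score, k
        elif best - score > x_drop:
            break
    return best, qi - best_left, si - best_left, w + best_left + best_right


def seed_and_extend(query, subject, w=4):
    """Scan the subject for query words and extend each hit."""
    index = word_index(query, w)
    hits = set()
    for si in range(len(subject) - w + 1):
        for qi in index.get(subject[si:si + w], []):
            hits.add(extend(query, subject, qi, si, w))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    print(seed_and_extend("GATTACA", "TTTGATTACATTT"))
```

In the small example, every word hit extends to the same full-length match, so the set of hits collapses to a single alignment covering the whole query.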

As an example, we align the protein sequence HBA_HUMAN (ID P01922) against a given database. Go to BLAST (www.ncbi.nlm.nih.gov/BLAST), and choose standard blastp. The protein HBA_HUMAN is the query sequence. We may search for similar sequences in the nr database (a non-redundant merge of several protein databases). The similar sequences are listed in descending order of score. By changing the database, we may restrict the search to the genome of a single organism, such as yeast. In this way, we obtain a comparison between a human sequence and the yeast genome.

The statistical significance of the hits in BLAST is described by a measure called the E-value. Formally, the E-value is the number of hits with the same level of similarity that you would expect by chance if there were no true matches in the database. Thus, a hit with an E-value of 0.01 would be expected to occur once every 100 searches even when there is no true match in the database. For the results of the example search, the E-values for the top hits are all very small, strongly suggesting that the similarities are not the result of random chance.
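The relationship between an E-value and the chance of seeing such a hit can be made concrete with a short calculation. The sketch below assumes the standard Karlin-Altschul form E = K·m·n·exp(-λS) for ungapped local alignments, and assumes that the number of chance hits is Poisson distributed, so that the probability of at least one chance hit is 1 - exp(-E). The values of λ, K, the query length and the database size are illustrative assumptions, not values taken from the example search.

```python
import math


def expect_value(score, m, n, lam, k):
    """E = K * m * n * exp(-lambda * S): expected number of chance hits
    with score >= S for query length m and database length n."""
    return k * m * n * math.exp(-lam * score)


def p_at_least_one(e_value):
    """Probability of one or more chance hits, assuming a Poisson count."""
    return 1.0 - math.exp(-e_value)


if __name__ == "__main__":
    # Illustrative parameters only (assumed, not from the lecture).
    lam, k = 0.318, 0.134
    m, n = 141, 1_000_000
    for s in (30, 50, 80):
        e = expect_value(s, m, n, lam, k)
        print(f"S={s}: E={e:.3g}, P(at least one)={p_at_least_one(e):.3g}")
    # For small E, the probability is very close to E itself.
    print(p_at_least_one(0.01))
```

Two points follow directly from the formula: the E-value grows in proportion to the database size, so the same score is less significant in a larger database, and it falls exponentially as the score increases.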

We can also use pairwise alignment to align the sequences of HBA_HUMAN and HBB_HUMAN. The output reports their similarity along with other information about the alignment.
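To show what a pairwise local alignment computes, here is a minimal Smith-Waterman sketch in Python. It is not the program used on the NCBI site: the scoring values (match 2, mismatch -1, linear gap -2) are assumptions chosen for readability, and a real protein alignment would use a substitution matrix. The input strings are toy sequences, not the hemoglobin sequences.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; scores never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, top, bottom = smith_waterman("GATTACA", "TTTGATTACATTT")
    print(score)
    print(top)
    print(bottom)
```

The similarity reported by an alignment tool, such as percent identity, can be read off the two aligned strings by counting the positions where they agree.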

·         The Structure database at NCBI provides three-dimensional biomolecular structures.

Most 3D structure data are obtained from X-ray crystallography and NMR spectroscopy. The structures give a wealth of information on the biological function and evolutionary history of molecules.

One example is the protein with PDB (Protein Data Bank) ID 1AMK. Input 1AMK in the structure search field, and the MMDB summary of this protein is displayed. We can also view the 3D structure in PDB. This protein structure includes both alpha helices and beta sheets. The sequence details in PDB describe the secondary structure along the sequence, which can be used to compare structures.

It is very helpful to know the PDB ID before looking up a structure. However, most sequences have no 3D structure available, and thus have no PDB ID. You can also try a key word in the structure search, but this generally returns many related items.