Bioinformatics: Sequence and Genome Analysis
As more species’ genomes are sequenced, computational analysis of these data has become increasingly important. The second, entirely updated edition of this widely praised textbook provides a comprehensive and critical examination of the computational methods needed for analyzing DNA, RNA, and protein data, as well as genomes. The book has been rewritten to make it more accessible to a wider audience, including advanced undergraduate and graduate students. New features include chapter guides and explanatory information panels and glossary terms. New chapters in this second edition cover statistical analysis of sequence alignments, computer programming for bioinformatics, and data management and mining. Practically oriented problems at the ends of chapters enhance the value of the book as a teaching resource. The book also serves as an essential reference for professionals in molecular biology, pharmaceutical, and genome laboratories.
Univ. of Arizona, Tucson. Based on a course given at the University of Arizona, this text is a foundation for the undergraduate and graduate student. Features sequence alignment, structure prediction, database searching, underlying algorithms, examples in simple numerical terms, tables, and web sources. Hardcover, softcover also available. –This text refers to the Hardcover edition.
Good introduction, but somewhat qualitative, July 9, 2002
The field of bioinformatics has exploded in the last five years, and several monographs and textbooks have appeared to assist in the elucidation of the concepts involved. Bioinformatics is a field that grew hand-in-hand with the rise of the Internet, and anyone going into it will need expertise in the Perl and Java programming languages, as well as a fairly strong mathematical background. In this book, the author gives a very good overview of bioinformatics from a mostly qualitative and descriptive point of view, although some elementary mathematical discussions are inserted in various places. Because of the level of mathematics used, this might not be the book to use for the mathematician who desires to go into bioinformatics or computational biology. On the other hand, for the student of biology or mathematics who intends to pursue bioinformatics as a profession, this book would be an excellent choice. One cannot read the book, however, without visiting its accompanying website, for the author extends some of the results of the book there.
The book begins with an historical introduction to the subject, and a newcomer to the subject will get a brief overview of some of the first sequence analysis programs and some of the first DNA sequence databases developed long before bioinformatics was recognized as a real discipline. The author introduces some of the techniques that will be discussed in the book, such as global and local sequence alignment, dynamic programming, RNA structure prediction, and protein structure prediction. This is followed in chapter 2 by an overview of the procedures used to collect and store sequences in the laboratory. To the reader not familiar with these techniques, the discussion may be too brief. The different sequence formats used are outlined, as well as techniques used to convert from one sequence format to another.
Chapter 3 takes a closer look at the pairwise alignment of sequences, and the author also outlines the reasons for examining sequence alignment in the first place, namely to find functional, structural, and phylogenetic information. The view of sequence alignment as an optimization problem is emphasized via the dynamic programming algorithm for sequence alignment. Dot matrix analysis is discussed as a sequence alignment strategy that displays all possible matches of residues between two sequences. The author is careful to note that local alignment algorithms might give global alignments, and vice versa, because of small changes in the scoring system. The PAM and BLOSUM substitution matrices are compared as to their relative merits and pitfalls. A very detailed discussion of gap penalties is given, along with the role of the Gumbel extreme value distribution in the determination of the statistical significance of a local alignment score between two sequences. After a brief introduction to Bayesian statistics, the author shows how to use it to produce alignments between pairs of sequences and to calculate distances between sequences. The Bayes block aligner software package is discussed in detail as a tool for Bayesian sequence alignment.
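For readers who want a feel for the dynamic programming idea before opening the book, here is a minimal sketch of my own in Python (it is not taken from the book). It computes only the optimal score, not the alignment itself, and it assumes a simple match/mismatch score with a linear gap penalty in place of the PAM or BLOSUM matrices and the gap models the chapter discusses. The function name and the scoring values are illustrative assumptions.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Optimal pairwise alignment score by dynamic programming.

    local=False: global (Needleman-Wunsch style) score.
    local=True:  local (Smith-Waterman style) score.
    The match/mismatch/gap values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # In a global alignment, leading gaps are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in sequence b
            left = score[i][j - 1] + gap    # gap in sequence a
            cell = max(diag, up, left)
            if local:
                # A local alignment may restart anywhere, so never drop below zero.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell

    return best if local else score[-1][-1]
```

The only differences between the two modes are the initialization of the first row and column, the floor at zero, and where the final score is read from. That small gap between them is one way to see why modest changes to the scoring system can make one kind of alignment look like the other.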
In chapter 4, the author gives an extensive discussion of multiple sequence alignment (MSA) algorithms, the most important of these by contemporary standards being hidden Markov models. The author does, though, treat the "progressive" methods, as well as the use of genetic algorithms in doing multiple sequence alignment. The former include the classic CLUSTALW package and the PILEUP program for doing MSA. Although the discussion of hidden Markov models makes sparing use of mathematics, it does serve to explain how they work and should assist readers who need a solid understanding of them.
I did not read chapters 5 and 6 so I will omit their review. Chapter 7 is an introduction to database searches in order to find similar sequences. The algorithms developed in chapters 3 and 4 again make their appearance, and the reader is confronted with various user interfaces for performing genetic database searching online. The FASTA and BLAST tools are introduced as fast methods to do database searching. As computer performance increases in the years ahead, these and other currently existing tools will no doubt be replaced by more powerful search routines. While perusing this chapter, one cannot help but be fascinated by the current situation in the biological/genetic sciences. Once thought of as a purely descriptive science, it is now dominated by a reductionist philosophy, involving huge amounts of data, and sophisticated mathematics for the analysis of this data.
The author moves on to the methods for detecting protein-encoding regions of DNA sequences in chapter 8. The simplest method, according to the author, is to search for open reading frames (ORFs), and he discusses the reliability of methods for accomplishing this. Hidden Markov models again make their appearance as a tool to study eukaryotic internal exons and in gene prediction in microbial genomes. Neural networks are introduced as tools for finding complex patterns and relationships among sequence positions, and Grail II is discussed as a system for exon finding in eukaryotic genes. Promoter prediction in E. coli is also briefly reviewed.
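To make the ORF idea concrete, here is a small Python sketch of my own (again, not the book's code). It scans the three forward reading frames for an ATG start codon followed in frame by a stop codon. It assumes the standard stop codons TAA, TAG, and TGA, ignores the reverse strand, and uses a hypothetical `min_codons` length cutoff. A real gene finder would need much more than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is the index of the A in ATG; end is exclusive and includes
    the stop codon. min_codons counts the start and stop codons and is
    an illustrative cutoff, not a recommended value.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs
```

For example, `find_orfs("CCATGAAATTTTAGGG")` reports one ORF spanning positions 2 to 14 in the third frame. Short ORFs like this arise by chance all the time, which is why the reliability of the method matters.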
I did not read chapter 9 so I will omit its review. Chapter 10, though, is an introduction to one of the most interesting parts of bioinformatics, namely that of analyzing the entire genomes of organisms. Due to rapid experimental advances in genetics, several genomes are now available, and this allows a more global, dynamical view of the role of genes and how their expression correlates to result in a fully developed, functioning organism. The techniques discussed in earlier chapters come into play in genomic analysis, and many other more novel techniques will have to be invented if sense is to be made of the enormous amount of genomic data currently available.