Statistical Methods for Improving Motif Evaluation

Tanaka, Emi

Access status:

USyd Access

Type

Thesis

Thesis type

Doctor of Philosophy

Author/s

Tanaka, Emi

Abstract

Gene regulation, especially cis-regulation of gene expression by the binding of transcription factors, is a critical component of cellular physiology. Transcription regulation is heavily influenced by the binding of transcription factors, and as such, it is of great interest to characterise these binding sites. The binding sites of a transcription factor are collectively referred to as a regulatory motif. Recent advances in sequencing technology have generated vast amounts of biological data, so computational tools are required to process and analyse this information. In particular, computational tools were developed to search for over-represented words among a set of co-regulated sequences. Such tools would be somewhat incomplete without a statistical analysis that allows researchers to discern between real, biologically significant sites and random artefacts. By analogy, it is difficult to imagine evaluating a BLAST result without its accompanying E-value. Of the many motif finders, MEME, with over 9000 unique users recorded in the first half of 2013, is one of the most popular. Currently MEME evaluates its candidate motifs using an extension of BLAST's E-value to the motif finding context. Ng et al. (2006) previously showed the drawbacks of MEME's current significance evaluation scheme; however, because MEME relies on the same E-value to internally rank competing candidate motifs, the alternative evaluation offered by Keich and Ng (2007) was not a practical substitute. Here we offer a two-tiered significance analysis that can replace the E-value both in selecting the best candidate motif and in evaluating its overall statistical significance. We show that our new approach substantially improves MEME's motif finding performance and also provides the user with a reliable significance analysis. In addition, for large input sets our new approach is faster than the currently implemented E-value analysis.
After applying a motif finder to a set of co-regulated DNA sequences, researchers are often interested in knowing whether the reported putative motif is similar to any known motif. While several tools have been designed for this task, Habib et al. (2008) pointed out that the scores commonly used for measuring similarity between motifs do not distinguish between a good alignment of two informative columns (say, all-A) and one of two uninformative columns. This observation explains why motif comparison tools such as Tomtom occasionally return an alignment of uninformative columns, which is clearly spurious. To address this distinguishability problem, Habib et al. (2008) suggested a new score, the BLiC. This score uses a Bayesian information criterion to penalise matches that are similar to the background distribution. We show that the BLiC score exhibits other, highly undesirable properties. Therefore, as an alternative, we offer a general approach to adjust any motif similarity score so as to reduce the number of reported spurious alignments of uninformative columns. We implemented our method in Tomtom and we show that, without significantly compromising Tomtom's retrieval accuracy or runtime, we drastically reduce the number of uninformative alignments. The modified Tomtom is currently available as part of the MEME Suite at http://meme.nbcr.net.
A motif is not limited to sites regulating gene expression: more generally, a motif is a recurring nucleotide sequence pattern that has biological significance. One such example arises in the context of the origins of replication of Saccharomyces cerevisiae. Autonomously replicating sequences (ARSs) are DNA fragments that promote extrachromosomal maintenance of plasmids. These ARSs mostly coincide with origins of DNA replication and therefore we use the terms interchangeably. The origins of replication in Saccharomyces cerevisiae have a highly conserved sequence known as the ACS (ARS consensus sequence). Depending on the reference, its representation varies from the 11-bp consensus sequence WTTTAYRTTTW to a 33-bp position weight matrix. While the replication origins of some species, such as Schizosaccharomyces pombe and metazoans, do not have any known motif, Liachko et al. (2010) found that the replication origins of another budding yeast, Kluyveromyces lactis, share a 50-bp ACS motif which is inherently different from the ACS motif found in S. cerevisiae. Here we characterise ARSs in Lachancea (Saccharomyces) kluyveri, a pre-whole genome duplication budding yeast. In addition, we demonstrate that ARS function in L. kluyveri is dependent on a much longer sequence compared with S. cerevisiae and K. lactis. Furthermore, the system of replication initiation in L. kluyveri appears to be more permissive than in these other two species: it is able to initiate replication from all S. cerevisiae ARSs and most K. lactis ARSs, while only half of L. kluyveri ARSs function in S. cerevisiae and less than 10% function in K. lactis. Our findings demonstrate a replication initiation system with novel features and underscore its functional diversity within the budding yeasts.

Date

2014-02-01

Licence

The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.

Faculty/School

Faculty of Science, School of Mathematics and Statistics

Awarding institution

The University of Sydney

Subjects

motif finding
MEME
autonomously replicating sequence
ARS consensus sequence
motif significance analysis
motif comparison