Upcoming Seminars
Date: 10/20/2009
Location: 6115 Gates-Hillman
Speaker: Hagit Shatkay, Queen's University
Title: Life by the Book: Pragmatically Using Text in Large Scale -Omics.
Abstract: The genomic era, in which we have lived since the sequencing of the human genome, is characterized by tremendous amounts of biomedical data, accompanied by a significant increase in the number of related scientific publications.
Much biomedical knowledge is hidden within the abundant literature. The ability to rapidly and effectively survey the literature can support numerous applications, including multiple stages in the design and the interpretation of large-scale experiments.
A variety of methods are being applied to the biomedical literature in an attempt to meet these goals, mostly through careful mining of text for gene/protein names and interactions, using natural language processing methods. However, the idea of general “biomedical text mining” remains elusive.
Rather than view biomedical text mining as one monolithic (and not very well defined) task, we attend to specific biological goals that may benefit from the use of text. The talk will focus on several biological applications/problems involving text, and discuss some non-traditional, coarse-grain methods that we use to address them.
Past Seminars
Date: 3/2/09
Speaker: Nicholas Buchler, Rockefeller University
Title: Bait and switch: How protein sequestration generates a flexible ultrasensitive response
Abstract: Regulatory networks in cells exhibit important dynamical behaviors, such as bistability (e.g. epigenetic switch) and oscillation (e.g. clocks, cell cycle). Ultrasensitive or “all-or-none” gene expression is a necessary feature for the emergence of such dynamics in gene networks. In biology, many regulatory molecules are sequestered by an inhibitor into an inactive complex. Using an experimental approach in budding yeast, I will demonstrate how protein sequestration generates tunable, all-or-none thresholds in gene expression. A simple quantitative model for this genetic network shows that both the threshold and the degree of ultrasensitivity depend upon the abundance of the inhibitor, exactly as observed experimentally. The abundance of the inhibitor can be altered by simple mutation; thus ultrasensitive responses mediated by protein sequestration are easily tunable. Gene duplication of regulatory homodimers and loss-of-function mutations can create dominant-negatives that sequester the original duplicate into an inactive complex. These results suggest a mechanism for the rapid evolution of bistable switches and oscillators in regulatory networks.
Date: 2/19/09
Speaker: Andrew Grimson, Massachusetts Inst. of Technology
Title: Animal microRNAs: their ancient origin and contemporary targets
Abstract: Hundreds of microRNAs (miRNAs) collectively regulate a substantial fraction of the animal transcriptome. Because virtually all aspects of biology are likely impinged upon by miRNAs, the identification of the mRNAs targeted by each miRNA remains a fundamental question. Specific ~7 nt recognition sequences, located primarily in 3' UTRs, are important for target recognition. These sites are complementary to the 5' end, or seed region, of the miRNA. However, seed matches are not sufficient for repression, indicating that other characteristics help specify miRNA targeting. By combining computational and experimental approaches, we discovered five features of site context that govern site efficacy. We developed a model that combines these context determinants to quantitatively predict site performance thereby indicating which of the thousands of potential miRNA-target relationships are functional. The predictions are made without recourse to site conservation, and are therefore effective at predicting a wide variety of target interactions including nonconserved sites and siRNA off-target effects.
The scale of transcriptome regulation by miRNAs together with the extent of miRNA conservation between bilaterians (e.g., humans, flies, and worms) is evidence for the importance of miRNA biology during animal evolution. In addition to miRNAs, other bilaterian small RNAs, known as Piwi-interacting RNAs (piRNAs), protect the genome from transposons. Neither miRNAs nor piRNAs were known to exist in the simplest, pre-bilaterian, animal phyla, raising the question of whether a rich small-RNA biology is characteristic of more complex animals, or whether these small RNAs might have emerged earlier in metazoan evolution. To gain perspective on the evolution of miRNAs and piRNAs, we used high-throughput sequencing to identify small RNAs from several basal animal lineages that diverged prior to the emergence of the Bilateria. We found that the cnidarian Nematostella vectensis, a relatively close relative of bilaterians, possesses an extensive repertoire of miRNA genes, two classes of piRNAs, and a complement of proteins specific to small-RNA biology comparable to that of humans. Similarly, the sponge Amphimedon queenslandica, amongst the simplest of animals and a distant relative of bilaterians, also possesses miRNAs, piRNAs and a full complement of small-RNA machinery. These data indicate that both miRNAs and piRNAs have existed from the earliest stages of metazoan evolution and have been available to shape gene expression throughout the evolution and radiation of animal phyla.
Date: 2/16/09
Speaker: Eric Deeds, Harvard Medical School
Title: Dynamic individuality in protein-protein interaction networks
Abstract: Protein-protein interactions play a crucial role in all cellular processes, from the regulation of gene expression to the transduction and processing of extracellular signals. Over the past decade, high-throughput techniques such as Yeast 2-Hybrid (Y2H) and Tandem Affinity Purification (TAP-tagging) have provided a global picture of what the entire protein-protein interaction (PPI) network in certain organisms might look like. While these methods are often quite noisy (with potentially high rates of false positives and false negatives), they have nonetheless served as the substrate for a large body of work aimed at characterizing or explaining the general topological structure of these networks. Such purely topological studies are limited, however, by the fact that they consider a static description of an inherently dynamical system. A full characterization and understanding of the behavior of PPI networks clearly requires that one be able to describe and understand the dynamics of hundreds to thousands of objects physically interacting with one another. In this work we employ recently developed rule-based modeling techniques to perform the first large-scale stochastic simulations of the PPI network found in the cytoplasm of yeast cells. These simulations reveal that cells prepared in identical initial conditions will, at steady state, differ considerably from one another in terms of the identities of the large protein complexes found in each. Our results indicate that such dynamic individuality may arise in many complex interaction and signaling networks.
Date: 2/6/09
Speaker: Su-In Lee, Carnegie Mellon University
Title: Individual Genetic Variation and Gene Regulation: From Networks to Mechanisms
Abstract: Gene expression data of genetically diverse individuals (eQTL data) provide a unique perspective on the effect of genetic variation on cellular pathways, and help identify sequence variations with phenotypic effect. However, the large number of possible regulatory interactions, combined with the challenges of linkage disequilibrium (LD), makes it difficult to correctly identify causal polymorphisms. To resolve this problem, researchers traditionally apply heuristics for selecting among plausible hypotheses, favoring polymorphisms that are more conserved, that lead to significant amino acid change, or that reside in genes whose function is related to that of the targets. We can construct a list of properties (called regulatory features) that indicate how likely a polymorphism with that property is to change the gene regulatory network. But how do we know how much weight to attribute to different regulatory features? This talk describes a novel method, called Lirnet (linear regulation network), for identifying regulatory networks from eQTL data. Lirnet automatically learns from eQTL data how to weight regulatory features and induce a regulatory potential for candidate sequence variations. Lirnet assesses these weights simultaneously with learning a regulatory network, finding weights that lead to a more predictive network. This feature, combined with Lirnet's ability to learn the importance of these features automatically, makes it especially advantageous for mammalian systems, where many forms of prior knowledge used in simple model organisms are incomplete or unavailable.
We apply Lirnet to eQTL data in yeast, mouse and human (Phase II HapMap data), and provide statistical and biological results demonstrating that Lirnet produces significantly better regulatory programs than other recent approaches. We demonstrate in the yeast data that Lirnet can correctly suggest a specific causal sequence variation within a large, linked chromosomal region. In yeast, Lirnet uncovered a novel, experimentally validated connection between Puf3, a sequence-specific RNA binding protein, and P-bodies, cytoplasmic structures that regulate translation and RNA stability, as well as the particular causative polymorphism, a SNP in Mkt1, that induces the variation in the pathway.
Date: 1/27/09
Speaker: Derek Ruths, Rice University
Title: Execution Strategies for Executable Biological Models
Abstract: Progress in advancing our understanding of biological systems is limited by their sheer complexity, the cost of laboratory materials and equipment, and limitations of current laboratory technology. Computational and mathematical modeling provides ways to address these limitations through hypothesis generation and testing without experimentation - allowing researchers to analyze system structure and dynamics in silico and, then, design lab experiments that yield desired information about phenomena of interest. These models, however, are only as accurate and complete as the data used to build them. Currently most models are constructed from quantitative experimental data. However, since accurate quantitative measurements are hard to obtain and difficult to adapt from literature and online databases, new sources of data for building models need to be explored. In my research, I design methods for building and executing computational models of cellular networks based on qualitative experimental data, which is more abundant, easier to obtain, and reliably reproducible. Such executable models allow for in silico perturbation, simulation, and exploration of biological systems. In this talk, I will present two general strategies for building and executing Petri net-based models of biochemical networks. Both have been successfully used to model and predict the dynamics of signaling networks in normal and cancer cell lines, rivaling the accuracy of existing methods trained on quantitative data.
This work is done in collaboration with Luay Nakhleh (Rice University) and Prahlad T. Ram (MD Anderson Cancer Center).
Date: 1/15/09
Speaker: Phil Hyoun Lee, Queen's University
Title: Selecting single nucleotide polymorphisms for effective genetic association study
Abstract: Genetic variation analysis holds much promise as a basis for understanding disease-gene association. In particular, single nucleotide polymorphisms (SNPs) are at the forefront of such studies, as they are the most common form of DNA variation on the genome. However, due to the tremendous number of candidate SNPs, there is a clear need to expedite genotyping and analysis by selecting and considering only a subset of all SNPs.
In this talk, I will present three machine learning applications that successfully address the problem of SNP selection and improve on the current state of the art. The first, a tag SNP selection approach, aims to choose a subset of SNPs whose allele information can best represent the allele information of unselected SNPs. Using the formalism of Bayesian networks, it enables the selection of a subset of independent and highly predictive SNPs, without limiting the number or the location of predictive tag SNPs. The second method is based on the functionality of SNPs. It aims to directly select a subset of SNPs that are likely to be disease-causing. In a probabilistic framework, our integrative scoring system combines the functional assessments from a variety of bioinformatics tools, and prioritizes SNPs according to their potential deleterious effects on major biological functions. Lastly, I will describe a new multi-objective optimization framework for identifying SNPs that are both informative as tags and functionally significant.
Date: 1/13/09
Speaker: Xin Gao, University of Waterloo
Title: Zero in on the fully automated NMR protein structure determination
Abstract: High-throughput structural genomics requires parallelizable technologies for high-resolution protein structure determination. Nuclear Magnetic Resonance (NMR) would be such a technology if its tedious and lengthy process could be fully automated. In the talk, I will describe our efforts on a fully automated protocol for NMR protein structure determination. We have developed a singular value decomposition-based peak picking method, PICKY, which achieves an average of 88% recall and 74% precision over 32 raw spectra extracted from eight proteins. Existing resonance assignment methods, however, do not work well on incomplete and imperfect peak lists. Consequently, we have designed an integer linear programming-based assignment method. It significantly outperforms other existing programs on both perfect peak lists and noisy peak lists. With the partial resonance assignments, FALCON-NMR is developed as a hidden Markov model-based torsion angle sampling method. The whole system, AMR, has been successfully tested on four proteins with weights of approximately 15 kDa.
Date: 11/10/08
Speaker: William Noble, University of Washington
Title: Machine learning analysis of shotgun proteomics data
Abstract: Mass spectrometry has become the most widely used tool for the characterization of proteins within complex mixtures. In this talk, I will describe several successful applications of machine learning to improve the rate at which we can correctly assign peptide sequences to observed tandem mass spectra. We use supervised and semi-supervised discriminative learning methods to train a classifier that discriminates between correctly and incorrectly annotated spectra. Unlike previous methods, the classifier can be trained dynamically on each given data set, thereby adjusting to particular characteristics of the sample preparation protocol, machine platform, calibration and chromatography conditions. We have also trained a dynamic Bayesian network to model the process of peptide fragmentation within the mass spectrometer. The resulting model yields useful insights into fragmentation biochemistry as well as significantly improved peptide identification performance.
Date: 3/27/08
Speaker: Gad Kimmel, University of California, Berkeley
Title: Computational Problems in Human Genetics
Abstract: The question of how genetic variation and personal health are linked is one of the compelling puzzles facing scientists today. The ultimate goal is to exploit human variability to find genetic causes for multi-factorial diseases such as cancer and coronary heart disease. Recent technological improvements enable the typing of millions of single nucleotide polymorphisms (SNPs) for a large number of individuals. Consequently, there is a great need for efficient and accurate computational tools for rigorous and powerful analysis of these data. In my talk I am going to concentrate on two computational problems, which are essential steps in studying the data obtained by this technology: accurate and efficient significance testing with a correction for population stratification, and estimating local ancestries in admixed populations.
Date: 3/26/08
Speaker: Itamar Simon, Hebrew University
Title: A high resolution map of mouse genome replication timing suggests a role in gene regulation
Abstract: Although it is known that genomes are divided into distinct replication time zones, a more detailed understanding of their organization is limited. Taking advantage of a novel synchronization method and of genomic DNA microarrays we have mapped replication times of the entire mouse genome at a high temporal resolution. The measurement results have allowed us to assign distinct replication times to 91% of the genome, define asynchronously replicating regions and identify very large replicons. Analysis of the association between replication and transcriptional features has revealed a correlation between replication and transcription potential as well as evolutionary conservation of replication timing. Finally, analysis of large replicons, and in particular of regions at which the time of replication differs from the time of replication of a distant origin, reveals that transcription is correlated with the actual time of replication and not with the time of origin activation. Overall, these findings suggest that early replication plays a causal role in potentiating gene transcription.
Date: 3/17/08
Speaker: Olivier Elemento, Princeton University
Title: Decoding the regulatory genome
Abstract: Deciphering the non-coding regulatory genome has proved a formidable challenge. Despite the wealth of available gene expression data, there currently exists no broadly applicable method for characterizing the regulatory elements that shape the rich underlying dynamics. I will present a general framework for detecting such regulatory DNA and RNA motifs that relies on directly assessing the mutual information between sequence and gene expression measurements. Our approach makes minimal assumptions about the background sequence model and the mechanisms by which elements affect gene expression. This provides a versatile motif discovery framework, across all data types and genomes, with exceptional sensitivity and near-zero false-positive rates. Applications from yeast to human uncover novel putative and established transcription-factor binding and miRNA target sites, revealing rich diversity in their spatial configurations, pervasive co-occurrences of DNA and RNA motifs, context-dependent selection for motif avoidance, and the strong impact of post-transcriptional processes on eukaryotic transcriptomes. This approach complements our previous and ongoing work using comparative genomics, and represents a major contribution to our ongoing effort to systematically characterize eukaryotic regulatory elements and understand their role in complex processes such as development, aging and disease.
Date: 3/10/08
Speaker: Philip Kim, Yale University
Title: Jumping scales: How 3D structures and molecular genetics meet in protein networks
Abstract: Protein interaction networks form the central layer of a systems-level description of the cell. While most studies of protein networks operate on a high level of abstraction, neglecting structural and chemical aspects of each interaction, I will describe our approach of characterizing interactions by using atomic-resolution information from three-dimensional protein structures. We find that some previously recognized relationships between network topology and genomic features (e.g., hubs tending to be essential proteins) are actually more reflective of a structural quantity, the number of distinct binding interfaces. Subdividing hubs with respect to this quantity provides insight into their evolutionary rate and indicates that additional mechanisms of network growth are active in evolution.
Furthermore, I will provide an overview of a major international collaborative effort that aims to resolve interactions involved in signaling pathways. These tend to involve intrinsically disordered regions and are hence complementary to the structured interactions studied by the above approach. Our approach combines modern experimental screening techniques with a novel integrated analysis pipeline. The screens measure binding specificities with hitherto unachievable accuracy and the analysis pipeline maximizes prediction accuracy by integrating a variety of genomic and proteomic features.
Lastly, I will present a study that examined the relationship between genetic signatures of adaptive evolution and proteomic properties, such as the location of sites in protein networks and structures. Due to recent advances in genotyping and sequencing technology, human genetic variation and adaptive evolution in the primate lineage have become a major research focus. We find a striking tendency of proteins that have been subject to adaptive evolution (as compared to the chimpanzee) to be located at the periphery of the interaction network. We also find that the fixation of large-scale copy number variants into segmental duplications preferentially occurs at the network periphery, bolstering our argument for selection at the periphery. This suggests that the observed preferential selection at the network periphery may be due to an increase of adaptive events on the cellular periphery responding to changing environments.
Date: 3/3/08
Speaker: Han Liang, University of Chicago
Title: System Structures and MicroRNA regulation in humans: a view of systems biology
Abstract: MicroRNAs are ~22 nt non-coding RNAs that can post-transcriptionally repress the expression of many protein-coding genes in higher eukaryotes. Recently available functional genomic data enable us to examine the regulatory role of microRNAs at the system level. Integrating human protein-protein interaction and microRNA targeting data, I found a global correlation between protein connectivity and microRNA regulation complexity in the corresponding genes, and that microRNA regulation likely coordinates the behavior of interacting partners. To understand the evolution of microRNA-mediated regulation in humans, I evaluated the role of three types of nucleotide variation on microRNA targeting: variation between species, variation within populations and epigenetic variation. While purifying selection appears to be a driving force maintaining the stability of microRNA regulation at the system level, a small number of variants may have significant functional effects. In particular, I found an appreciable level of polymorphism at microRNA target sites (including SNPs with a signature of positive selection or within important disease genes), which suggests that allele-specific microRNA regulation is an important source of phenotypic differences among individuals.
Date: 2/28/08
Speaker: Ge Yang, Scripps Research Institute
Title: Metaphase spindle architecture and molecular motor coordination revealed by model driven computer vision
Abstract: The development of biology over the past half century makes it possible to identify the complete set of genes and proteins of an organism. A fundamental challenge remains, however, to understand the complex dynamics of and interactions between the many individual molecular components involved in situ and in space and time. Of particular importance in addressing this challenge is to understand how force and motion are generated, transmitted, and controlled within dynamic cellular structures during basic cellular processes. In this presentation, I will focus on addressing this question in two such processes: cell division and intracellular transport. First, single-fluorophore imaging and biochemical perturbation are used to investigate the architecture of the metaphase microtubule cytoskeleton in cell division. This assay provides a model system to understand how cytoskeletal filament networks are dynamically organized to transmit force and to directly generate force. Second, fluorescence imaging and genetic manipulation are used to probe the interaction between molecular motors in the axonal transport machinery of neurons. This assay provides a sufficiently reduced yet extremely powerful model system to understand the interactions between molecular motors of the same and opposite polarities in force and motion generation. Shared by both studies is the use of computer vision techniques, driven by mechanistic models, to extract high-resolution quantitative measurements of the complex spatial-temporal dynamics visualized by powerful fluorescence live cell imaging techniques. These studies reveal some fundamental and exquisite connections between force and motion generation and the dynamic organization of the cytoskeleton in cellular life.
Date: 2/25/08
Speaker: Kevin Chen, New York University
Title: Macro- and micro-evolution of gene regulation mediated by microRNAs
Abstract: Studying the evolution of cis-regulatory elements is important for three general reasons. First, mutations in these elements can cause phenotypes of medical importance; second, understanding cis-element evolution will help us design algorithms for predicting these elements; third, regulatory evolution is important for understanding phenotypic evolution. In this talk, I will focus on a class of cis-elements called "microRNA sites". MicroRNAs are small, noncoding RNAs that post-transcriptionally regulate their target mRNAs by binding to these sites. They have been implicated in many biological processes, including cancer and viral defense.
I will discuss the evolution of animal microRNA sites at two different time scales. At the macro-evolutionary time scale, we show that while the microRNA genes are well-conserved, overall their targets have diverged rapidly. However, there exists a core of deeply-conserved regulatory relationships that may be an important component of animal developmental networks. At the micro-evolutionary time scale, we use human SNP genotype data to demonstrate significant selective constraint on microRNA sites, implying that polymorphisms in these sites are candidates for causal variants of human disease. Our approach also applies to human-specific microRNA sites and we use it to identify a set of these sites in genes co-expressed with the microRNA.
Date: 2/11/08
Speaker: James Taylor, New York University
Title: Making sense of genome-scale data
Abstract: High-throughput data production technologies are revolutionizing modern biology. Translating this experimental data into discoveries of relevance to human health relies on sophisticated computational tools that can handle large-scale data (e.g. multiple genome alignments of dozens of species or billion-genotype genome-wide association studies).
This talk will first discuss a specific large-scale data analysis problem: using comparative genomics to identify and understand functional genomic regions, particularly cis-regulatory elements. Using data generated by the ENCODE project we will demonstrate the power of genome comparisons to distinguish these elements from neutral DNA and the importance of looking for more than just signs of strong evolutionary constraint. We will then describe a machine learning approach that goes beyond sequence conservation and attempts to capture broader and more informative sequence and evolutionary patterns that better distinguish different classes of elements. This approach, denoted ESPERR, uses training examples to learn encodings of multispecies alignments into reduced forms tailored for the prediction of chosen classes of functional elements. ESPERR has proven successful for a variety of classification problems. In particular, the "Regulatory Potential Score" produced using ESPERR has been used to identify putative regulatory elements with high rates of experimental validation.
Second, we will consider the more general problem of making sophisticated computational methods more available to experimental biologists. Many powerful analysis tools exist or are currently being developed, along with many excellent data warehouses and browsers. However, for the average experimental biologist with limited computer expertise, making effective use of these tools and data sources is still out of reach because many existing tools do not have easy-to-use interfaces, and different tools and data sources are not well integrated. We have developed a framework and application, called Galaxy, that solves this problem by providing an integrated web-based workspace that bridges the gap between different tools and data sources. Galaxy simultaneously targets two audiences. For tool developers, it eliminates the repetitive effort involved in creating high-quality user interfaces, while giving them the benefit of being able to provide their tools in an integrated environment. For experimental biologists, it allows running complex analyses on huge datasets with nothing more than a web browser, and without needing to worry about details of installing tools, allocating computing resources, and file format compatibility. Galaxy is not only incredibly easy to use, it is also incredibly easy to deploy. A developer or lab can create their own Galaxy instance, and start integrating custom tools with only a few minutes' work.
Date: 1/16/08
Speaker: Insuk Lee, University of Texas at Austin
Title: Network biology approaches to study complex traits
Abstract: The relationship between genotype and phenotype is a central issue in genetics, and approaches are needed that allow us to interpret the increasing collection of data on genotypic variation in terms of the effect on organismal phenotypes. Our understanding of these relationships came historically from forward-genetics approaches, which have proved remarkably powerful, but which are still difficult in complex animals, and the complete definition of pathways from forward-genetic data alone is hard. In contrast, reverse-genetics approaches allow unbiased tests across entire genomes for associations with traits of interest, e.g., by using systematic genome-wide knock-out or silencing. However, reverse genetics is in general labor intensive and time consuming, requiring enormous numbers of assays in order to span large numbers of genes in combination with multiple experimental conditions. Ideally, we would like to be able to choose which genes to target for reverse-genetics analyses, prioritizing the most likely candidates for being involved in a trait of interest. Such an approach would allow highly focused reverse-genetics studies to be performed, increasing both the sensitivity and efficiency of genetic screens. Here, we present a method for predicting gene loss-of-function phenotypes that can be applied to extend genetic screens and prioritize candidate genes for focused testing in organisms ranging from the simple single-celled yeast to the multicellular animal model C. elegans (worm) to human.

