Seminars - Ray and Stephanie Lane Center for Computational Biology - Carnegie Mellon University

Upcoming Seminars

5/2/2013 - 3pm - 6115 GHC
Ivo F. Sbalzarini, Dresden International Graduate School for Biomedicine and Bioengineering
Computational Biology with Particle Methods
Understanding the function of biological systems from the interactions between their constituents requires predictive forward models of hypothetical mechanisms. Given the complexity of biological systems, such forward models are frequently computational, where numerical simulations are used to probe a model's behavior in regimes where it cannot be solved analytically. We review the key differences between biological and engineering applications of numerical simulations and highlight the main challenges in computational data processing and simulation of biological systems. We propose to exploit the unifying algorithmic framework of particle methods to develop numerical simulations, image-processing, and optimization algorithms that meet the challenges of modern biology. We provide examples from our own work, highlighting how methodological advances in scientific computing have enabled new biological insight and progress in computer science alike. The examples include a self-organizing deterministic particle method for the simulation of multi-scale continuum models, a novel class of stochastic simulation algorithms with reduced time complexity, a domain-specific language for particle methods on heterogeneous parallel computer platforms, and a new class of particle-based image segmentation algorithms. This covers the workflow of image-based systems biology, illustrating several analogies and connections between the different fields involved.

10/4/2013 - 11am - 6115 GHC
Wing Wong, Stanford University

Past Seminars

3/29/2013
Christine Vogel, New York University
The Ups and Downs of Human Protein Expression Regulation
While transcription regulation has been studied for many years, we now have mounting evidence that the regulation of protein translation and degradation is at least as important in determining protein expression levels. Under normal conditions, for example, transcription and mRNA degradation account for ~30% of gene expression regulation in mammalian cells, while translation and protein degradation account for another 30-40%. We have now extended these studies to systems under perturbation, i.e. cells responding to a stimulus. Using a variety of large-scale methods, we examine the behavior of the mammalian proteome and transcriptome in response to environmental stresses. We have quantified the expression of ~4,000 genes and proteins and are in the process of characterizing the different regulatory patterns that we observe. Again, transcription is only half the story.

3/27/2013
Shayok Chakraborty, Arizona State University
Batch Mode Active Learning for Multimedia Pattern Recognition

The rapid escalation of technology and the widespread emergence of modern technological equipment have resulted in the generation of enormous amounts of digital data (in the form of images, videos and text, among others). This has expanded the possibility of solving real-world problems using computational learning frameworks. However, while gathering a large amount of data is cheap and easy, annotating it with class labels is an expensive process in terms of time, labor and human expertise. This has paved the way for research in the field of active learning. Such algorithms automatically select the salient and exemplar instances from large quantities of unlabeled data and are effective in reducing human labeling effort in inducing classification models. To utilize the possible presence of multiple labeling agents, there have been attempts towards a batch mode form of active learning, where a batch of data instances is selected simultaneously for manual annotation. This talk will cover a basic background of batch mode active learning, some related work and my current research in this domain. Specifically, the following three contributions will be discussed in detail: (i) batch mode active learning algorithms based on convex relaxations of an NP-hard integer quadratic programming (IQP) problem, with guaranteed bounds on the solution quality, (ii) an active matrix completion algorithm and its application to solve several variants of the active learning problem (transductive active learning, multi-label active learning, active feature acquisition and active learning for regression) and (iii) a framework for dynamic batch mode active learning, where the batch size and the specific data instances to be queried are selected adaptively through a single formulation, based on the complexity of the data stream in question.
These contributions are validated on the face recognition and facial expression recognition problems, which are commonly encountered in real world applications like robotics, security and assistive technology for the blind and the visually impaired.

3/22/2013
Vineet Bafna, University of California, San Diego
The breakage fusion bridge and other exotic structural variations: combinatorics and cancer genomics
Cancer genomes are marked by genomic instability and massive rearrangements. Recently, many exotic mechanisms have been proposed as mechanistic explanations for these rearrangements. For example, the breakage-fusion-bridge (BFB) mechanism, proposed over seven decades ago, has seen renewed interest as a source of genomic variability and gene amplification in cancer. Here, we formally model and analyze the BFB mechanism, the first rigorous formulation of the mechanism. Using this model, we show that BFB can achieve a surprisingly broad range of amplification patterns, and describe efficient combinatorial algorithms to characterize patterns consistent with BFB. An extensive analysis of simulated, cell-line, and primary tumor data reveals the existence of BFB. Our results also suggest that BFB may be hard to detect under heterogeneity and polyploidy. Time permitting, we will also discuss other sources of variation (joint work with Shay Zakov and Marcus Kinsella).

3/20/2013
Yongjin Park, Johns Hopkins University
Resolving the Structure and Dynamics of Large-scale Interactome

Community structures are embedded in real-world networks. A set of nodes or edges can be decomposed into fairly homogeneous subsets. In biological network analysis, community structures are considered functionally coherent modules. For instance, tightly connected sub-networks in a protein-protein interaction network generally correspond to protein complexes. Modules are easily identified in a network of hundreds of nodes by visual inspection or simple pattern searches. However, large-scale network datasets pose significant challenges, not only in computation, but also because their properties are completely different.
In this talk, I will describe our attempts to solve community-finding problems on genome-scale interactome datasets. I will explain how a probabilistic framework can help design simple yet powerful algorithms, for instance, avoiding “resolution-limits”, and how this framework can extend to dynamic network analysis. Next, I will talk about a newly designed inference algorithm, which is applicable to ultra large-scale hierarchical stochastic block models. We propose a nearly linear time algorithm that can efficiently compute maximum a posteriori estimates on a deep hierarchical block structure. Moreover, I will show how we combined this hierarchical model with other sources of heterogeneous biological evidence, such as RNA-seq measurements and pathway annotations.

3/19/2013
Bo Li, University of Wisconsin at Madison
Computational Analysis of RNA-Seq Data in the Absence of A Sequenced Genome: From Transcript Quantification to De novo Transcriptome Assembly Evaluation

RNA-Seq technology has revolutionized the way we study transcriptomes. In particular, it has enabled us to investigate the transcriptomes of species that have not yet had their genomes sequenced. I will discuss our work on two computational tasks that are crucial to analyzing RNA-Seq data in the absence of a sequenced genome: transcript quantification and de novo transcriptome assembly evaluation. For transcript quantification, RNA-Seq is considered a more accurate replacement for microarrays. However, to allow for the highest accuracy, methods for analyzing RNA-Seq data must address the challenge of handling reads that map to multiple genes or isoforms. We present RSEM, a generative statistical model of the sequencing process and associated inference methods, which tackles this challenge in a principled manner.
Our results on both simulated and real data sets suggest that RSEM has superior or comparable performance to other quantification methods developed at the same time. Building on RSEM, we have developed a novel probabilistic model-based method, RSEM-EVAL, for evaluating de novo transcriptome assemblies from RNA-Seq data without the ground truth. Our results on both simulated and real data sets show that our RSEM-EVAL metric correlates well with the ground truth accuracies of the assemblies. Our metric has a broad range of potential applications, such as selecting assemblers, optimizing parameters for an assembler and guiding new assembler design.

3/18/2013
A. Ercument Cicek, Case Western Reserve University
ADEMA: An Algorithm to Determine Expected Metabolite Level Alterations Using Mutual Information

Sitting at the top of the omics hierarchy, metabolomics is an important platform for understanding the changes in physiological activity due to a condition. Despite advances in analytical methodology and the increasing number of genome-scale metabolic networks of organisms, the techniques currently used to make sense of metabolic profiles are quite limited. The objective of this presentation is (1) to address the shortcomings of the current techniques, which are used for analyzing changes in metabolite levels, and (2) to describe ADEMA, a multivariate method that computes the expected metabolite level changes using the metabolic network topology and mutual information. Results show that (1) ADEMA’s prediction of alteration of the de novo lipogenesis pathway in a cystic fibrosis mouse model conforms to independently performed flux and gene expression analyses, and (2) ADEMA’s classifier scheme outperforms other well-known classification algorithms.

3/1/2013
Zia Khan, University of Chicago

Quantitative Proteomics Provides a New Window into How Genetic Differences Impact Protein-Levels Between Species
Understanding how genetic differences affect phenotypic variation within and between species is a central goal of evolutionary and medical genetics. Genetic differences that impact the regulation of a gene are key contributors to trait differences. Yet, identifying genetic differences that impact gene regulation is challenging: not all genetic differences are functional and gene regulation is the result of a complex network of interactions between genes. Measuring differing levels, or differential allele-specific expression, of gene products, RNA or protein, from two variants of a gene in the same individual provides direct evidence that a DNA sequence difference between these variants impacts their regulation. This measurement sets the stage for further studies to pinpoint functional genetic variation. While recent technological advances have made it possible to measure allele-specific RNA expression across many genes in high-throughput, the same cannot be said for protein levels. As proteins carry out much of the work of the cell, the absence of a corresponding protein measurement leaves a gap in our understanding of the genetic basis of phenotypic variation. I present a quantitative, computational method for measuring differential expression of two protein variants in an individual. The computational method is based on a simple observation that overcomes a key limitation of a data-intensive, or “big data,” technology in biological sciences called quantitative mass spectrometry. As a proof of concept, I use this computational method to study allele-specific protein levels in a hybrid between two distantly related species of yeast. This study demonstrates how this computational method provides a new window into how two classes of genetic differences have impacted protein levels between species.

2/1/2013
Luisa Hiller, Carnegie Mellon University
Genomic Plasticity: To Be or Not to Be
The Gram-positive bacterium Streptococcus pneumoniae colonizes humans as a nasopharyngeal commensal or a respiratory pathogen. This species displays extensive genomic diversity and a notable capacity to incorporate genes from neighboring cells into its genome, producing new genomic combinations. Yet, the majority of pandemic multi-drug resistant strains belong to one of several lineages that display decreased genomic diversity. In this talk I will discuss the genomic diversity and plasticity in the population, as well as possible barriers to gene exchange that may be leading to the genomic isolation of clinically important lineages.

11/30/2012
Joel McManus, Carnegie Mellon University
Evolution of post-transcriptional gene regulatory networks
Differences in gene expression are an important source of phenotypic variation and disease. Gene expression differences result from changes in gene regulatory networks, principally comprised of cis-acting sequences and trans-acting factors. These networks control numerous processes, including transcription, alternative splicing, and translation of mRNA into protein. Research over the past decade revealed that changes in trans-acting factors are responsible for most mRNA abundance differences within species, while changes in cis-regulatory sequences accumulate between species. In contrast, much less is known regarding how alternative splicing and mRNA translation regulatory networks evolve. We used high throughput sequencing of cDNA libraries from multiple Drosophila species to investigate the evolution of alternative splicing. Our results suggest that regulation of alternative splicing diverges more rapidly in non-coding regions than in coding regions, and that frame shifting alternative splicing events have more conserved regulation. We further investigated the contributions of cis- and trans-acting changes in splicing regulatory networks by comparing allele-specific splicing in F1 interspecific hybrids. In F1 nuclei, each allele is subjected to the same set of trans-acting factors. Thus differences in allele-specific splicing reflect changes in cis-regulatory element activity. Changes in cis-regulatory elements contribute more to species-specific differences in intron retention and alternative splice site usage, while changes in trans-acting factors contribute more to species-specific exon skipping differences. These results suggest important differences in the regulatory network architecture among classes of alternative splicing. We are also studying the evolution of mRNA translation using allele-specific ribosome profiling.
Our preliminary results suggest that translation regulatory networks may buffer species-specific mRNA abundance differences in budding yeast.

10/26/2012
Eric Schadt, Mt. Sinai School of Medicine
Moving towards a better understanding of human disease in the era of big data
Common human diseases and drug response are complex traits that involve entire networks of changes at the molecular level driven by genetic and environmental perturbations. Changes at the molecular level can induce changes in biochemical processes or broader molecular networks that affect cell behavior, and changes in cell behavior can affect normal tissue or whole organ function, eventually leading to pathophysiological states at the organism level that we associate with disease. While the vast majority of previous efforts to elucidate disease and drug response traits have focused on single dimensions of the system, achieving a more comprehensive view of common human diseases requires examining living systems in multiple dimensions and at multiple scales. Studies focused on identifying changes in DNA that correlate with changes in disease or drug response traits, changes in gene expression that correlate with disease or drug response traits, or changes in other molecular traits (e.g., metabolite, methylation status, protein phosphorylation status, and so on) that correlate with disease or drug response are fairly routine and have met with great success in many cases. However, to further our understanding of the complex network of molecular and cellular changes that impact disease risk, disease progression, severity, and drug response, we can more formally integrate these different data dimensions. Here I present an approach for integrating a diversity of molecular and clinical trait data to uncover models that predict complex system behavior. By integrating diverse types of data on a large scale I demonstrate that some forms of common human diseases like diabetes are most likely the result of perturbations to specific gene networks that in turn cause changes in the states of other gene networks both within and between tissues that drive biological processes associated with disease.
These models not only elucidate primary drivers of disease and drug response, but also provide a context within which to interpret biological function, beyond what could be achieved by looking at one dimension alone. That some forms of common human diseases are the result of complex interactions among networks has significant implications for drug discovery: designing drugs or drug combinations to impact entire network states rather than designing drugs that target specific disease-associated genes.

9/21/2012
Carl Kingsford, Carnegie Mellon University
Computational Challenges Comprehending Chromosome Conformation Capture Constraints

The physical shape and arrangement of chromosomes in the cell affects gene expression, long-range regulation of transcription, and genome evolution (particularly biasing which rearrangements occur), and it has been implicated in the development of several types of cancers.  New high-throughput experimental techniques derived from "chromosome conformation capture" (3C) have produced measurements that hint at the spatial proximity of regions of the genome as it is arranged in the cell.

I will describe our work in three directions to make this 3C-like data more confidently useful for correlating structure with biological function.  First, I will describe a new approach called metric filtering for discarding false-positive proximity measurements that selects edges to keep based on both their surprising observation counts and their metric consistency with other selected edges. We show this technique keeps more information and produces three-dimensional models that agree better with observations from light microscopy.

Second, I will discuss an approach based on rigidity theory to decide whether a 3C experiment has generated sufficient constraints to determine a structure. We find in fact that current experiments provide far more than enough constraints to determine a non-floppy structure for most of the genome in several organisms. As a byproduct, we produce a more practical algorithm for large-scale testing of rigidity.

Finally, I will discuss improved techniques for finding statistically significant correlations between genomic features and spatial proximity that avoid the computationally demanding and error-prone step of deriving a three-dimensional structure.

Various aspects of this research were done jointly with Geet Duggal, Hao Wang, Darya Filippova, Rob Patro, Emre Sefer, Sridhar Hannenhalli (UMD), and Michelle Girvan (UMD).

9/28/2012
Curtis Huttenhower, Harvard University
Bug bytes: bioinformatics for meta'omics and microbial community analysis
Among many surprising insights, the genomic revolution has helped us to realize that we're never alone and, in fact, barely human. For most of our lives, we share our bodies with some ten times as many microbes as human cells; these are resident in our gut and on nearly every body surface, and they are responsible for a tremendous diversity of metabolic activity, immunomodulation, and intercellular signaling.
These microbial communities have only recently become well-described using high-throughput sequencing, requiring analyses that simultaneously apply techniques from genomics, "big data" mining, and molecular epidemiology. I will discuss emerging end-to-end bioinformatics approaches for metagenomics and metatranscriptomics, including handling of sequence data for mixed microbial communities, its reconstruction into metabolic pathways, and biomarker discovery in disease. In particular, computational processing is key in identifying unique markers for microbial taxonomy and phylogeny, and in identifying genes and pathways significantly disrupted in inflammatory conditions such as Crohn's disease and ulcerative colitis.

7/20/2012
Christoph Wuelfing, UT Southwestern Medical Center
Spatiotemporal organization of lymphocyte signaling systems as a regulator of function

The subcellular organization of the T cell signaling system, similar to that of many other cell types, is highly diverse in time and space. Using systems-scale imaging of T cell signaling, we analyze such organization to gain unique insight into T cell function with an emphasis on T cell actin regulation.

5/8/2012
Tom Bartol, Salk Inst. for Biological Studies

How to Build a Synapse from Molecules, Membranes, and Monte Carlo Methods

4/30/2012
Mark Gerstein, Yale University

Analysis of Molecular Networks
My talk will be concerned with the analysis of networks and the use of networks as a "next-generation annotation" for interpreting personal genomes. I will initially describe current approaches to genome annotation in terms of one-dimensional browser tracks. Then I will describe various aspects of networks. In particular, I will touch on the following topics: (1) I will show how analyzing the structure of the regulatory network indicates that it has a hierarchical layout with the "middle-managers" acting as information-flow bottlenecks and with more "influential" TFs on top. (2) I will show that most human variation occurs at the periphery of the network. (3) I will compare the topology and variation of the regulatory network to the call graph of a computer operating system, showing that they have different patterns of variation. (4) I will talk about web-based tools for the analysis of networks (TopNet and tYNA).

http://networks.gersteinlab.org
http://tyna.gersteinlab.org

Comparing genomes to computer operating systems in terms of the topology and evolution of their regulatory control networks. KK Yan, G Fang, N Bhardwaj, RP Alexander, M Gerstein (2010). Proc Natl Acad Sci U S A 107:9186-91.

Analysis of diverse regulatory networks in a hierarchical context shows consistent tendencies for collaboration in the middle levels. N Bhardwaj, KK Yan, MB Gerstein (2010). Proc Natl Acad Sci U S A 107:6841-6.

Positive selection at the protein network periphery: evaluation in terms of structural constraints and cellular context. PM Kim, JO Korbel, MB Gerstein (2007). Proc Natl Acad Sci U S A 104:20274-9.

The tYNA platform for comparative interactomics: a web tool for managing, comparing and mining multiple networks. KY Yip, H Yu, PM Kim, M Schultz, M Gerstein (2006). Bioinformatics 22:2968-70.

4/26/2012
Pavel Sumazin, Columbia Medical Center
RNA regulatory networks help propagate the effects of genetic alterations

Biomedical researchers profile DNA and chromatin of large patient cohorts in an attempt to identify common alterations that drive pathology and can point to diagnostic and therapeutic biomarkers. Increasingly, however, it is clear that genetic and epigenetic alterations can regulate pathology combinatorially, and that different combinations of alterations may generate the same phenotype. To make full use of molecular profiling data, we need to understand how alterations affect cellular programs.

I will describe two new types of computationally predicted post-transcriptional regulatory networks. Computational and experimental evidence suggest that interactions in these networks may alter the expression of known drivers of high-grade glioma. I will describe regulators of microRNA activity, which modify the activity of microRNAs without necessarily altering their expression. These regulators may channel the effects of genomic deletions to distally downregulate established tumor suppressors. Conversely, post-transcriptional regulators of microRNA biogenesis alter the expression of known drivers of gliomagenesis by regulating the abundance of the microRNAs that target them. Alterations to these regulators lead to widespread changes to the expression of microRNAs that target known drivers of glioma.

Taken together, our results suggest that post-transcriptional regulation in the cell is both extensive and complex. We present evidence that genetic and epigenetic alterations may be amplified and propagated by post-transcriptional interactions to affect both disease initiation and outcome. Our work provides some of the building blocks necessary for reverse engineering integrated regulatory networks that will help identify driver alterations and explain their effects on cellular programs and pathology.

Bio: Pavel Sumazin is a research scientist at Columbia Medical Center. He graduated from Stony Brook University with a PhD in computer science with a focus on design and analysis of algorithms. He taught computer science theory at Portland State University, was an NSF fellow in human genetics at Cold Spring Harbor Laboratory, and served as Associate Director for bioinformatics at Columbia University’s Genome Center.

4/19/2012
Frank DiMaio, University of Washington
Protein structure determination with sparse and noisy data
Determining the structure of a protein, which involves finding the three-dimensional placement of each of a protein's thousands of atoms, is an important problem in biochemistry, providing key insights into mechanisms as well as targets for drug design. However, many proteins of biomedical importance elude traditional structure determination methods. For these proteins, sparse data -- either experimental or knowledge-based -- may provide structural information, though not enough to uniquely determine a solution. The Rosetta structure prediction methodology uses an energy-based approach to explore physically feasible protein conformations. By combining this energy function with sparse data, I can quickly infer high-accuracy protein models. I will describe the effectiveness of this approach using data from four different sources. First, I will show how we may use cryo-electron microscopy density data, which provides a very coarse envelope function describing the protein shape, to infer models that accurately recapitulate high-resolution details. I will describe how a similar approach may be used to solve difficult molecular replacement problems. Here, sparse data is confounded with significant noise; nonetheless, my approach led to the solution of thirteen protein structures, previously unsolved in the hands of expert crystallographers. Similarly, using only low-resolution crystallographic data, my approach recapitulates high-resolution details that are not captured by current refinement methods. Finally, I will describe recent breakthroughs I have made in homology modeling, where the source of data is not from experiment, but instead from previously solved protein structures. I will additionally show how these methods are broadly applicable using both experimental and statistical sources of data, with implications for both protein structure determination and design.

4/17/2012
Zhengqing Ouyang, Stanford University
Statistical modeling of next generation sequencing data for global gene regulation

Unraveling the global regulation of gene expression is essential for understanding embryonic development and human diseases. Gene expression is regulated at multiple levels, including transcription, RNA processing, and translation. At each level, regulators such as transcription factors, RNAs, and RNA-binding proteins are forming complex regulatory networks. Recent advances in high-throughput technologies, including next generation sequencing, provide unprecedented opportunities to profile multiple levels of gene regulatory information. In this talk, I will describe statistical methods for integrating next generation sequencing data to discover the principles of global gene regulation. At the transcriptional regulation level, a joint model of ChIP-Seq and RNA-Seq will be introduced. The model effectively quantifies transcription factor regulatory strength, reveals combinatorial regulation, and accurately predicts genome-wide expression levels of genes. At the post-transcriptional level, an integrative approach is proposed to reconstruct RNA secondary structures at the genome-scale from deep sequencing data. I will demonstrate the advantages of our approach and the widespread impact of RNA secondary structure on gene regulation.

3/30/2012
Jing Li, Case Western Reserve University
Rare variant discovery and calling by sequencing pooled samples with overlaps
For many complex traits/diseases, it is believed that rare variants account for the missing heritability that cannot be explained by common variants. Sequencing a large number of samples through DNA pooling is a cost-effective strategy to discover rare variants and to investigate their associations with phenotypes. Overlapping pool designs provide further benefit because such approaches can potentially identify variant carriers. However, existing algorithms for analyzing sequence data from overlapping pools are limited. We propose a complete data analysis framework for overlapping pool designs, with novelties in all three major steps: variant pool and variant locus identification, variant allele frequency estimation and variant sample decoding. The framework can be utilized in combination with any design matrix. We have investigated its performance based on two different overlapping designs, and have compared it with two state-of-the-art methods, by simulating targeted sequencing. Results show that our algorithm has made significant improvements over existing ones.

3/29/2012
Ron Dror, D.E. Shaw Research
How drugs bind and control their targets: characterizing GPCR signaling using Anton, a special-purpose supercomputer for molecular dynamics simulations

Roughly one-third of all drugs act by binding to G-protein-coupled receptors (GPCRs) and either triggering or preventing receptor activation, but the process by which they do so has proven difficult to determine using either experimental or computational approaches.  We recently completed a special-purpose machine, named Anton, that uses a combination of novel algorithms and application-specific hardware to accelerate molecular dynamics simulations by orders of magnitude, enabling all-atom protein simulations as long as a millisecond (Science 330:341-6, 2010).  Anton has made possible simulations in which drugs spontaneously associate with GPCRs to achieve bound conformations that match crystal structures almost perfectly (PNAS 108:13118-23, 2011; Nature 482:552-6, 2012).  Simulations on Anton have also captured transitions of a GPCR between its active and inactive states, allowing us to characterize the mechanism of receptor activation (Nature 469:236-40, 2011; PNAS 108:18684-9, 2011).  Our results, together with complementary experimental data, suggest opportunities for the design of drugs that achieve greater specificity and control receptor signaling more precisely.

3/28/2012
Hannah Carter, Johns Hopkins University
Identifying driver missense mutations in tumor sequencing data
Large-scale sequencing of cancer genomes is uncovering thousands of DNA alterations, but the functional relevance of the majority of these mutations to tumorigenesis is unknown. Identifying which of these mutations contribute to cancer is critical for understanding tumor biology, and for finding new diagnostic biomarkers and therapeutic targets. We have developed a computational method, called Cancer-specific High-throughput Annotation of Somatic Mutations (CHASM), to identify and prioritize the missense mutations most likely to generate functional changes in proteins that enhance tumor cell proliferation. CHASM uses a supervised machine learning technique called a random forest and more than 80 quantitative features describing amino acid changes to predict candidate driver mutations. The method has high sensitivity and specificity when discriminating between known driver missense mutations and randomly generated missense mutations, and performs well relative to other computational methods applied to this problem. CHASM has been applied to over 15 tumor sequencing studies to prioritize missense mutations for further study and initial results are promising; however, further experimental validation is needed to confirm CHASM predictions.

3/26/2012
Jianyang (Michael) Zeng, Duke University
Automated Nuclear Magnetic Resonance Assignment and Protein Structure Determination

High-throughput protein structure determination based on solution nuclear magnetic resonance (NMR) spectroscopy plays an important role in structural genomics. Unfortunately, current NMR structure determination is still limited by the lengthy time required to process and analyze the experimental data. In this talk, I will describe our recent successes in applying computational techniques to address several bottlenecks in NMR structure determination. First, I will talk about a novel high-resolution structure determination algorithm that starts with a global fold calculated from the exact and analytic solutions to the residual dipolar coupling (RDC) equations. Our high-resolution structure determination protocol has been applied to solve the NMR structures of the FF Domain 2 of human transcription elongation factor CA150 (RNA polymerase II C-terminal domain interacting protein), which have been deposited into the Protein Data Bank (PDB ID: 2KIQ). Second, I will present a Bayesian approach to determine protein side-chain rotamer conformations by integrating the likelihood function derived from unassigned NOE data with prior information (i.e., empirical molecular mechanics energies) about the protein structures. Third, I will describe an automated side-chain resonance assignment algorithm that does not require any explicit through-bond experiment to facilitate side-chain resonance assignment. All our algorithms have been tested on real NMR data. The promising results demonstrate that our algorithms can be successfully applied to high-quality protein structure determination. Since our algorithms reduce the time required in NMR assignment, they can accelerate the protein structure determination process.

3/22/2012
Roger Pique-Regi, University of Chicago
Understanding the impact of genetic variation on molecular mechanisms of transcriptional regulation

My research focuses on developing novel computational methods to identify regulatory sequences, and to model the molecular mechanisms of gene transcription control. The mapping of expression quantitative trait loci (eQTLs) has emerged as an important tool for linking genetic variation to changes in gene regulation. However, it remains difficult to identify the causal variants underlying eQTLs, and little is known about the regulatory mechanisms by which they act. We used DNase I sequencing to measure chromatin accessibility in 70 Yoruba lymphoblastoid cell lines, for which genome-wide genotypes and estimates of gene expression levels are also available. We obtained a total of 2.7 billion uniquely mapped DNase I-sequencing (DNase-seq) reads, which allowed us to infer transcription factor binding exploiting the specific DNase I cleavage footprint left on 827,000 sites corresponding to more than 100 factors. Across individuals, we identified 8,902 locations at which the DNase-seq read depth correlated significantly with genotype at a nearby locus (FDR = 10%). We call such genetic variants 'DNase I sensitivity quantitative trait loci' (dsQTLs). We found that dsQTLs are strongly enriched within inferred transcription factor binding sites and are frequently associated with allele-specific changes in transcription factor binding. A substantial number of dsQTLs are also associated with variation in the expression levels of nearby genes. Our observations indicate that dsQTLs are highly abundant in the human genome and are likely to be important contributors to phenotypic variation.

3/8/2012
Carl Kingsford, University of Maryland
Computational Challenges in Reconstructing Evolutionary Histories
I will discuss our recent efforts to reveal important evolutionary events in two biological systems.
First, I will describe our work identifying reassortments, or mixing of genomic segments, in the influenza virus. Reassortment is the main process by which new pandemic strains arise and was the driving force behind the recent “swine flu” outbreak. We have developed an algorithm and software program called GIRAF that finds reassortment events among large collections of influenza genomes.  GIRAF is the first fully automated computational approach to this problem, and it is based on the first quadratic-delay algorithm for enumerating high-weight maximal bicliques in bipartite graphs.  It allows us to quickly scan thousands of influenza genomes for reassortments. Using our algorithm, we have discovered many novel reassortment events in collections of human, avian, and swine influenza strains.
Second, I will present our recent work on reconstructing ancient biological networks. We have developed several methods for recovering interactions between molecules that were present in ancestral species, starting with only the present-day networks that we are able to measure.  We have shown that many properties of the evolution of extinct networks can be inferred using our approaches and that ancestral interactions can be inferred with high accuracy.
Various parts of this work were done jointly with Niranjan Nagarajan, Saket Navlakha, Rob Patro, Guillaume Marçais, Justin Malin, and Emre Sefer.

3/2/2012
Meera Sitharam, University of Florida
EASAL: Entropy computation for Assembly Configuration spaces via Stratified Convex Parametrizations

Differences between the geometries of molecular assembly versus folding configuration spaces are illuminated by a new theory of convex configuration spaces developed by the speaker's group. While assembly configurations of molecular complexes of up to 7 rigid monomers are already high dimensional and entropically challenging, they are far more tractable to explore, search and analyze than folding configuration spaces. This is because: (a) the assembly configuration space topology can be decomposed directly into a standard Thom-Whitney complex of active constraint regions, including boundaries of varying dimensions; (b) (the key point) these active constraint regions can be charted with convex parameterizations. We refer to the precisely roadmapped union of these charts as the atlas of the configuration space.
EASAL is the software implementation of various efficient algorithms with proven guarantees for atlasing and related search problems for such small molecular assemblies.
Atlasing the configuration spaces and assembly pathways of larger molecular assemblies is effected by recursive decomposition and recombination as smaller molecular subassemblies (that can be atlased using EASAL) making active use of symmetry often present in larger assemblies.
We have recently used EASAL (a) to correctly predict crucial interactions for the assembly of a T=1 viral shell of AAV4 (confirmed by mutagenesis experiments in the McKenna lab at UF) and (b) to illuminate features and configurational entropy of a helix packing configuration space that cause standard Metropolis Monte Carlo sampling to be non-stochastic (helix and Monte Carlo trajectory data from the lab of Maria Kurnikova, a computational chemist at CMU).

2/24/2012
Kevin White, University of Chicago
Integrating Genomic Networks to Identify Biomarkers and Drug Targets
Systems-level approaches to construct abstract molecular networks can lead to predictions about genetic and biochemical functions in cells, organisms and in disease states. We have used an integrated experimental and computational approach to construct large-scale functional networks in both model organisms and human cancer cells. Our network models are based on a combination of gene expression, transcription factor DNA binding site mapping, automated literature mining and protein-protein interaction mapping. We provide a strategy for reducing the dimensionality of the massive networks that result from such integrated whole-genome analyses. I will present examples from both Drosophila and human breast cancer cell lines that illustrate how one can translate systems biology-driven findings in model systems to useful tools for diagnosing human diseases. I will also discuss our use of large-scale genome sequence data in the context of systems approaches to developing prognostic signatures for breast cancer, and the use of cloud computing to manage and mine 'omics data.

1/27/2012
Ernest Fraenkel, Massachusetts Inst. of Technology
Integrating 'Omic' Data to Reveal Disease Mechanisms

Proteomic technologies, next-generation sequencing and RNAi screens are providing increasingly detailed descriptions of the molecular changes that occur in diseases. However, it is difficult to assemble these data into a coherent picture that could lead to new therapeutic insights for several reasons. Despite their power, each of these methods still only captures a small fraction of the cellular response. Moreover, when different assays are applied to the same problem, they often provide apparently conflicting answers. We have developed powerful new approaches to integrate these data to identify small, functionally coherent pathways that underlie cellular behavior. In this talk, I will discuss recent unpublished work from my laboratory showing that these methods suggest novel therapeutic strategies for glioblastoma multiforme.

10/14/2011
Daphne Koller, Stanford University
Twelfth Morris H. DeGroot Memorial Lecture

10/7/2011
Wendy Cornell, Merck
Comparison of 2D, 3D, and QSAR Methods for Virtual Screening

Using a set of 47 protein targets from the MDDR, we assess the performance of 2D similarity, 3D similarity, and QSAR methods at identifying active compounds for each target when starting with some number (1, 5, 10, 20, or 40) of actives. Two 2D similarity methods are tested - Toposim, which uses Dice similarity, and Lassi, which uses latent semantic structural indexing. Three QSAR methods are included - random forest, trendvector, and support vector machine (SVM). Each 2D similarity and QSAR method is used in combination with different descriptor sets, including atom pairs (AP), topological torsions (TT), binding property torsions (DT), extended connectivity fingerprints (ECFP4), and MACCS. We assess retrieval rates for single compounds as well as clusters. Among the descriptor sets, ECFP4 performed consistently the best. Although Toposim and Lassi found different hits, their retrieval rates for individual compounds were surprisingly similar. Among the QSAR methods, random forest and trendvector outperformed SVM. Combinations of methods are also explored to maximize both lead hopping and retrieval of close neighbors.
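
As a rough illustration of the Dice similarity mentioned in the abstract above, the following minimal Python sketch ranks library compounds against a query by the Dice coefficient of their descriptor sets. It is not the Toposim implementation: the function names, the use of plain sets of descriptor strings (rather than counted descriptors), and the toy library are all assumptions made for illustration.

```python
def dice_similarity(a, b):
    """Dice coefficient of two descriptor collections, treated as sets.

    Dice(A, B) = 2 * |A intersect B| / (|A| + |B|); defined here as 0.0
    when both sets are empty (an illustrative convention).
    """
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def rank_by_similarity(query, library):
    """Return library compound names sorted by decreasing Dice similarity.

    `library` is a hypothetical mapping from compound name to its
    descriptor set (e.g. atom-pair or fingerprint identifiers).
    """
    return sorted(
        library,
        key=lambda name: dice_similarity(query, library[name]),
        reverse=True,
    )


if __name__ == "__main__":
    # Toy descriptors; real descriptor sets would come from a cheminformatics toolkit.
    query = {"a", "b", "c"}
    library = {"x": {"a", "b"}, "y": {"a", "b", "c"}, "z": {"q"}}
    for name in rank_by_similarity(query, library):
        print(name, round(dice_similarity(query, library[name]), 3))
```

In a similarity-based virtual screen of this kind, the top of the ranked list would be taken as the candidate actives; with several known actives, one common choice (again an assumption here) is to score each library compound by its maximum similarity to any of them.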

8/29/2011
Wei Wu, University of Pittsburgh
Reverse Engineering Dynamic Gene Networks Underlying Breast Cancer Cell Lineages and Yeast Cell Cycles
Estimating gene regulatory networks over biological lineages or time series is central to a deeper understanding of how cells evolve during development and differentiation. One challenge in estimating such evolving networks is that their host cells are not only contiguously evolving, but also can branch over time. For example, a biologist may apply several different drugs to a malignant cancer cell to analyze the changes each drug has produced in the treated cells. Cells treated with one drug are not directly related to cells treated with another drug, but rather to the malignant cancer cells that they were derived from. Underlying these intriguing dynamic systems, one expects that the interactions between genes are not always constant over time, but rather are often transient; in other words, gene-gene interactions that occur during one time interval may disappear and then reappear later. This challenging behavior renders existing network inference methods inapplicable.
We proposed two novel approaches, Treegl and TV-DBN, which build on L1 plus time-dependent penalized graphical logistic regression to effectively estimate multiple evolving gene networks corresponding to cell types related by a tree genealogy, or cell stages related by an evolving chain, based on only a few samples from each condition. Our methods take advantage of the similarity between related networks along the biological lineage, while at the same time exposing sharp differences between the networks. We explore applications to the analysis of breast cancer development and yeast cell cycle regulation. Based on only a few microarray measurements, our algorithms are able to produce biologically valid results that provide insight into the progression and reversion of breast cancer, and into transient interactions among genes in the yeast cell cycle.

5/5/2011
Ioannis Tsamardinos, Vanderbilt University

Towards Integrative Causal Analysis of Heterogeneous Datasets and Prior Knowledge
Modern data analysis methods, for the most part, concern the analysis of a single dataset. The conclusions of an analysis are published in the scientific literature and their synthesis is left up to a human expert. Integrative Causal Analysis (INCA) aims at automating this process as much as possible. It is a new, causal-based paradigm for inducing models in the context of prior knowledge and by co-analyzing datasets that are heterogeneous in terms of measured variables, experimental conditions, or sampling methodologies. INCA is related to, but fundamentally different from, statistical meta-analysis, multi-task learning, and transfer learning.
In this talk, we illustrate the enabling INCA ideas, present INCA algorithms, and give proof-of-concept empirical results. Among others, we show that the algorithms are able to predict the existence of conditional and unconditional dependencies (correlations), as well as the strength of the dependence, between two variables Y and Z never measured on the same samples, solely based on prior studies (datasets) measuring either Y or Z, but not both. The algorithms accurately predict thousands of dependencies in a wide range of domains, demonstrating the universality of the INCA idea. The novel inferences are entailed by assumptions inspired by causal and graphical modeling theories, such as the Faithfulness Condition. The results provide ample evidence that these assumptions often hold in many real systems. The long term goal of INCA is to enable the automated large-scale integration of available data and knowledge to construct causal models involving a significant part of human concepts.

4/22/2011
Li-San Wang, University of Pennsylvania
Gene expression in aging and aging-associated disorders
Aging is a highly complex phenomenon that affects virtually all aspects of biology. In medicine, age is a primary risk factor for cancer, neurodegeneration, and many other diseases. Thus, understanding how aging proceeds and contributes to these diseases is key to finding causes and means of intervention. This presentation will cover some of our work towards understanding the connection between aging and age-associated diseases, by investigating gene expression through bioinformatic means.
The first half of my talk focuses on G-quadruplexes (Gquads). Gquads are genomic motifs consisting of four runs of guanines that can form highly stable 3D structures in vivo and have high occurrence in telomeres. Analysis of yeast and human genomic distributions suggests that Gquads are associated with differentially expressed genes in a yeast senescence model and in human fibroblasts from patients with Werner syndrome, a genetic disorder that exhibits premature aging phenotypes.
The second half of my talk concerns gene expression changes in human brain aging and Alzheimer's disease. We developed algorithms that can estimate the age of an individual using gene expression profiles. Using these algorithms, we found that brains with Alzheimer's disease or frontotemporal dementia show trends of accelerated aging in gene expression change.
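
To make the "four runs of guanines" description of G-quadruplex motifs in the abstract above concrete, here is a minimal Python sketch that scans a DNA string for candidate motifs with a regular expression. The specific thresholds (runs of at least three G's, separated by loops of one to seven bases) are an assumption based on a commonly used heuristic, not a definition given in the talk, and `find_g4_motifs` is a hypothetical helper name. The scan covers only the given strand.

```python
import re

# Assumed heuristic: four G-runs of length >= 3 separated by loops of 1-7 bases.
G4_PATTERN = re.compile(r"G{3,}(?:[ACGT]{1,7}G{3,}){3}")


def find_g4_motifs(sequence):
    """Return (start, end, motif) for each non-overlapping candidate motif.

    Positions are 0-based, end-exclusive, on the given strand only.
    """
    seq = sequence.upper()
    return [(m.start(), m.end(), m.group()) for m in G4_PATTERN.finditer(seq)]


if __name__ == "__main__":
    # Toy sequence with one candidate motif flanked by T's.
    for start, end, motif in find_g4_motifs("ttgggagggtgggagggtt"):
        print(start, end, motif)
```

A genome-wide analysis like the one described would additionally scan the reverse complement (equivalently, search for C-runs) and then relate motif positions to gene annotations; those steps are omitted here.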

4/8/2011
Tamer Kahveci, University of Florida
Computational strategies for understanding how biological networks function.
Biological networks of an organism show how different biochemical entities, such as enzymes or genes, interact with each other to perform vital functions for that organism. Each subnetwork within a network can perform various functions that it cannot perform without interacting with other entities in the network. Understanding the functions of entire networks as well as of individual subnetworks has been a prime goal for explaining how organisms work.
Dr. Kahveci's lab focuses on developing computational methods that help in understanding the functions of large-scale biological networks. This talk will focus on comparative analysis of biological networks. The topic will be considered in two parts. The first part will focus on comparative analysis of a pair of networks. This part will constitute the majority of the talk. The second part will discuss scalability issues in performing this analysis on a large database of networks. The first part will proceed step by step from a simplified model to a more realistic one. The first step will limit comparison to pairs of entities of networks and explain how we can compare networks when the biological process is explained through different types of biological entities. The second step will eliminate this limitation and describe a computational approach for when the same biological process can be performed in different numbers of steps. The last step will challenge the existing definition of similarity and introduce a new measure, functional similarity, that explains the function in terms of the steady states of the biological networks, and will describe how we can compute the steady states for large regulatory networks. The second part of the talk will discuss a probabilistic strategy for finding networks highly similar to a query network in a database that contains a large number of networks.

3/10/2011
Gerald Quon, University of Toronto
De-mixing heterogeneous gene expression profiles into their constituent components, and applications to personalized medicine
One of the primary goals of gene expression profiling experiments is to identify key genes and pathways associated with a particular condition or disease.   However, biological samples are often composed of multiple distinct cell populations, of which only a few are of interest.  We have developed ISOLATE, a computational model for separating heterogeneous mixtures of cell populations into their individual components, given only the expression profiles of heterogeneous samples and some of the homogeneous populations.  We demonstrate the accuracy and value of computational purification in three problem domains: identifying prognostic signatures for cancer, linking changes in gene expression to patient outcome in juvenile arthritis, and monitoring cell population dynamics in hematopoietic stem cell systems.

3/9/2011
Xin He, University of California, San Francisco
Understanding genetics of complex diseases through systems biology and regulatory genomics
Genome-wide association studies (GWAS) have identified many candidate loci for a number of complex traits. In most cases, however, there is little functional evidence of these loci and the mechanisms of their influence on complex traits are not clear. A very promising strategy is to link genotypes and phenotypes through molecular level traits, such as gene expression level. The first part of my talk will be focused on a new strategy we developed recently to incorporate expression QTL (eQTL) data in the analysis of GWAS. We developed a Bayesian statistical method that integrates the information of the SNPs underlying a gene expression trait, with appropriate weighting, to test if the expression of this gene contributes to the complex disease of interest. In particular, our statistical test allows us to exploit information in a large number of weak SNPs, which are often ignored but represent a collectively important part of the genetics of any complex trait.
To ultimately understand how genotypic variations influence phenotypes, we need a detailed understanding of how DNA sequences encode their immediate molecular functions. The second part of my talk will be focused on the study of regulatory sequences, which harbor a large fraction of SNPs discovered by GWAS and are believed to be important for many complex diseases. To recognize these sequences in a genome, we developed a comparative genomic method based on the assumption that functional transcription factor binding sites (TFBSs) tend to be conserved across species. The method utilizes a probabilistic model of regulatory sequence evolution that captures substitutions, insertions/deletions, and selection or turnover of TFBSs. A more difficult question about regulatory sequences is how they generate spatial-temporal expression patterns. For this purpose, we developed a quantitative model based on statistical thermodynamics theory and an efficient dynamic programming algorithm. This model incorporates a number of features of regulatory sequences, including the importance of weak TFBSs and cooperative interactions among TF molecules. We demonstrated the predictive power of our model, and by applying it to an early developmental system in Drosophila, we were able to gain understanding of the quantitative rules of gene regulation.

2/25/2011
Richard H. Lathrop, University of California, Irvine
Predicting Positive p53 Cancer Rescue Regions Using Most Informative Positive (MIP) Active Learning
Many protein engineering problems involve finding mutations that produce proteins with a particular function. Most Informative Positive (MIP) active learning is tailored to biological problems because it seeks novel and informative positive results. We applied MIP to discover mutations in the tumor suppressor protein p53 that reactivate mutated p53 found in human cancers. MIP found Positive (cancer rescue) p53 mutants in silico using 33% fewer experiments than traditional non-MIP active learning. MIP was used to select a Positive Region predicted to be enriched for p53 cancer rescue mutants. In vivo assays showed that the predicted Positive Region: (1) had significantly more (p<0.01) new strong cancer rescue mutants than control regions (Negative, and non-MIP active learning); (2) had slightly more new strong cancer rescue mutants than an Expert region selected by a human expert for purely biological considerations; and (3) rescued for the first time the previously unrescuable p53 cancer mutant P152L.

2/16/2011
Anne E. Carpenter, Broad Inst. of Harvard and MIT
Extracting quantitative information from biological images to tackle world health problems
Microscopy images contain rich information about the state of cells and organisms and are an important part of experiments to address a multitude of basic biological questions and world health problems. Our laboratory works on image analysis and data mining, primarily for high-throughput screening experiments. These experiments test thousands of chemical or genetic perturbations in order to identify the causes and potential cures of disease. Machine-learning approaches, guided by a biologist's intuition, have been particularly successful for measuring subtle and complex phenotypes in these experiments.
The biological systems being tested in high-throughput experiments are becoming increasingly physiologically relevant. For example, co-cultures of two particular cell types can better replicate certain tissue and organ systems and preserve normal cellular functions such as liver and hematopoiesis. Whole organisms like C. elegans and zebrafish can be screened for complex phenomena like behavior, infection, and metabolism. These more complex systems present new challenges in image analysis.
We are also exploring the potential of extracting patterns of morphological perturbations (“signatures”) from cell images in order to identify the similarities between various chemical or genetic treatments, in experiments to identify distinctions between human isoforms of cancer-relevant proteins, mechanisms of hepatotoxicity, and diagnostics for bipolar disorder and schizophrenia.
The methods we develop are freely available through the biologist-friendly open-source software, CellProfiler, for both small- and large-scale experiments.

2/4/2011
Jinbo Xu, Toyota Technological Institute at Chicago
Probabilistic Graphical Model for Protein Structure Prediction
If we know the primary sequence of a protein, can we predict its three-dimensional structure by computational methods? This is one of the most important and difficult problems in computational molecular biology and has tremendous implications for protein functional study and drug discovery.
Existing computational methods for protein structure prediction can be broadly classified into two categories: template-based modeling (i.e., protein threading/homology modeling) and template-free modeling (i.e., ab initio folding). Template-based modeling predicts structure of a protein using experimental structures in the Protein Data Bank (PDB) as templates while template-free modeling predicts protein structure without depending on a template.
This talk will present new probabilistic graphical models for knowledge-based protein structure prediction. In particular, this talk will present a regression-tree-based Conditional Random Fields (CRF) method for template-based modeling and a Conditional Random Fields/Conditional Neural Fields (CRF/CNF) method for template-free modeling. Experimental results indicate that our template-based method performs extremely well, especially on hard template-based modeling targets and our template-free method is also very promising for mainly-alpha proteins.

1/21/2011
Russell Malmberg, University of Georgia
Computational Searches for Non-Coding RNA; Ecological Genetics of Pitcher Plants

There will be two quite different research topics presented.
(1) The importance of RNAs that do not code for proteins, but that have functions directly as RNAs, has been recognized over the last 30 years in a series of dramatic discoveries. Estimates of the numbers of non-coding RNAs in eukaryotes vary considerably but are plausibly in the range of 0.5x to 2x the number of protein-coding RNAs. Computational identification of ncRNA genes in genomes is rendered difficult by the lack of sequence similarity of many related ncRNAs; however, some ncRNAs have their structure more conserved than their primary sequence. We have developed algorithms to search genomes for ncRNA on the basis of their structure, using conformational graph - tree decomposition methods to greatly speed up the process. We have studied the nature of evolutionary variability in RNA secondary structure, and are using these results to improve the genomic search methods.
(2) Pitcher plants (Sarracenia species) eat insects. They appeal to the inner 10-year-old in us. Different Sarracenia species have varying pitcher morphologies and varying means of digesting insects. Some species actively digest by secreting proteases and similar enzymes; other species support a microbial food web which digests the insects. We are analyzing the genetic basis of the differences between the species in pitcher morphology, insect digestion strategy, and the degree to which individual plant genotypes can support the associated microbial community.

1/14/2011
Kris Dahl, Carnegie Mellon University
Computational approaches to determine multiscale structural changes in the nucleus associated with aging

There are numerous premature aging disorders associated with altered nuclear structure. We are primarily interested in Hutchinson Gilford progeria syndrome (HGPS) which is caused by a mutation in nuclear lamin A. (1) Using integrated experimental and computational studies we examine how the mutation associated with HGPS alters the structure of the protein, even though the mutation is in an inherently disordered region. (2) We have also simulated the structural filament network in the nucleus as a reductionist model to examine the cause of morphological changes in the nucleus associated with HGPS. (3) We examine how the HGPS mutation alters nuclear response to force in situ using computational methods to analyze complex mechanical character from live cell experiments. (4) At the microlevel we also use computational image analysis of a variety of premature aging diseases to understand the role of nuclear morphology in disease progression. In sum, we have used a combination of computation and experiment to examine structural proteins in the nucleus at many length scales and how they impact the etiology of HGPS.

12/3/2010
Klaus Palme, University of Freiburg (Germany)
The magic role of auxin and beyond

Unlike most animal cells, plant cells can easily regenerate new tissues from cells derived from different tissues. These cells first dedifferentiate but later can be reprogrammed to form a wide variety of organs when properly cultured. We investigate the signalling components and molecular mechanisms that give plant cells the ability to regenerate organs de novo. Plant hormones like auxin (indole-3-acetic acid) play a fundamental role in plant cell proliferation, differentiation and organ formation. Auxin levels are controlled by biosynthesis, transport and degradation. Since its first description in the 19th century, the directional movement of auxin through the plant has attracted much attention. An overview will be given of the current status of studies aiming to understand the physiology of auxin transport and the structure-function characterization of the PIN interactome. Components of PIN nano-domains play crucial roles in determining instructive auxin gradients that direct plant development. As systems biology demands quantitative, comprehensive data, which need to be mapped into the three-dimensional landscape of cells, tissues and organs, tools for establishment of a robust three-dimensional (3D) digital atlas of cellular components in Arabidopsis roots were developed. Such an atlas may have important implications by providing previously unavailable knowledge on cellular characteristics. The intrinsic Root Coordinate System provides a reference model for the root apical meristem to annotate cells according to their location, type, and division status. This enables direct quantitative comparison between roots at single-cell resolution. Applications and innovative opportunities arising from this technological advance will be discussed.

11/19/2010
John Shon, Director of Disease and Translational Information, Hoffmann-La Roche Pharmaceuticals
Drugs to Glide from Research to the Bedside: Opportunities for Software and IT in the Life Sciences

Ever wonder what opportunities exist in Life Science IT and software? The companies require the exchange of critical information that is rich and complex. The drug development process, for example, is greatly enhanced when valuable "nuggets" are passed between professionals who are focused on the start versus the end of the drug development process. This has not been easy to accomplish. There are a number of such applications that will help accelerate and enable the creation and production of better therapeutics. Dr. Shon will present his vision for software-enabled solutions that solve the multiple challenges facing large Pharma.

10/22/2010
Michael Gilson, University of California, San Diego
Modeling molecular recognition: Free energy, entropy and mechanical stress

Better computer models of molecular recognition are needed to speed the design of new therapeutics and host-guest systems with a range of applications. I will discuss concepts and software we are developing for these purposes, as well as some unexpected insights into changes in entropy and mechanical stress on binding that have emerged from this work. In particular, changes in configurational entropy on binding appear to be as quantitatively important as changes in more commonly recognized free energy contributions, such as hydrogen bonding, and I will discuss recent developments in the characterization of entropy changes through the mutual information expansion of the entropy. In addition, we have begun to explore the application of ideas of mechanical stress at the molecular level as a potential basis for understanding the long-ranged transmission of information and other molecular mechanisms.

9/17/2010
Jean-Christophe Olivo-Marin, Institut Pasteur
Quantitative biological imaging: from cells to numbers.

This talk will present specific methods and algorithms for the analysis of 2- and 3-D+t image sequences in biological microscopy and their use in the study of host-pathogen interactions. Our goal is to automate the quantification and analysis of dynamic parameters or the characterization of phenotypic and morphological changes occurring as a consequence of the interaction between microbes and target cells. The availability of this information and its thorough analysis is of key importance in helping to decipher the underlying molecular mechanisms of infectious diseases. We will demonstrate algorithms for multi-particle tracking and active contour models for cell shape and deformation analysis and illustrate their application in projects related to the understanding of viral, bacterial and parasitic invasion of cells and tissues.

9/16/2010
Chris Bakal, Institute of Cancer Research
Signaling Networks that Regulate Morphological Noise and Promote Exploratory Behavior

Cell shape is not encoded by genomes. Rather, genes encode the signaling networks that allow cells to explore morphological space through random variations in cell shape, which we term morphological noise. Stochastic and deterministic amplification of these small variations in shape can lead to the phenotypic diversity necessary for cells to adapt to unpredictable fluctuations in cellular environment. Morphological noise thus creates an ensemble of cell shapes that somatic variation can act upon, which can ultimately be stabilized via genetic evolution. To provide insight into how signaling networks regulate the exploration of shape space, we perform quantitative measurements of single-cell morphology in the context of genome-scale RNAi screens. I will discuss the identification of noise-enhancing local networks that regulate diverse cellular processes and facilitate stochastic exploratory behavior, and whose inhibition leads to canalized phenotypes. Furthermore, we have identified a number of other genes that act as morphological noise suppressors. Through computational integration of noise signatures with orthogonal datasets we derive a dynamic model that describes the information flow on a systems level.

4/15/2010
Paul Boutros, Ontario Institute for Cancer Research
Prognostic Markers for Non-Small Cell Lung Cancer

Lung cancer is a disease with dismal prognosis; only 15% of newly-diagnosed patients survive for five years. Our understanding of how to diagnose, stage, and treat it is based largely on macroscopic or cellular phenomena. A molecular understanding of the disease may provide improved clinical management and new therapeutic options.
My group focuses on predicting the survival of lung cancer patients. In particular, we develop algorithms to exploit microarray datasets to develop biomarkers of survival, called prognostic markers. In this talk I will describe three recent results: an algorithm, a database, and an empirical finding.
First, I describe a new feature selection algorithm, called modified steepest descent (mSD). This algorithm couples gradient-descent with unsupervised machine-learning. Through greedy forward-selection it generates a six-gene prognostic marker for lung cancer that is validated in over 500 patient samples.
Second, I describe a meta-analytic database that compiles the data from nine transcriptomic studies of lung cancer. These studies were integrated using a novel normalization approach and then subjected to meta-analysis. For each gene present in the analysis (16,391 in total), the univariate prognostic capacity was calculated. I show that this database increases our statistical power sufficiently to allow separate analysis of different histological subtypes of lung cancer.
Third, I describe an analysis of biomarker plurality. From an empirical study of biomarker-space we found that the number of effective markers is very large. The inter-relationship amongst these markers contains information about gene-gene interactions, and may provide an avenue for understanding the specific pathways dysregulated in lung cancer.
Lung cancer incidence remains high and survival remains low. The development of prognostic markers may improve this situation by allowing personalized therapy. The computational approaches described here may be applicable beyond this one disease, and may provide insight into the types of methodologies that will work well for other problem-domains.

4/2/2010
Mona Singh, Princeton University
Predicting and Analyzing Cellular Networks

Proteins accomplish virtually all of their cellular functions via interactions with other molecules. As a result, a broad array of computational methods have been developed to predict protein interactions, whether with DNA, other proteins, or small molecules. In combination with high-throughput experimental technologies, we now have the ability to build large-scale biological networks across the evolutionary spectrum. Global analyses of these networks provide new opportunities for revealing protein functions and pathways and for uncovering cellular organization principles.
In my talk I will discuss computational approaches that my group has developed for the complementary problems of predicting interactions and analyzing interaction networks. In the first part of the talk, I will describe sequence and structure approaches for predicting sites in protein sequences that interact with small molecules. In the second part of my talk I will describe algorithms for analyzing protein function and functional modules, and will present a framework for explicitly incorporating known attributes of individual proteins into the analysis of biological networks, thereby allowing us to discover recurring network patterns underlying a range of biological processes.

3/26/2010
Nancy Amato, Texas A&M University
Using Motion Planning to Study Molecular Motions

Protein motions, ranging from molecular flexibility to large-scale conformational change, play an essential role in many biochemical processes. For example, some devastating diseases such as Alzheimer's and bovine spongiform encephalopathy (Mad Cow) are associated with the misfolding of proteins. Despite the explosion in our knowledge of structural and functional data, our understanding of protein movement is still very limited because it is difficult to measure experimentally and computationally expensive to simulate.
In this talk we describe a method we have developed for modeling protein motions that is based on probabilistic roadmap methods (PRM) for motion planning. Our technique yields an approximate map of a protein's potential energy landscape and can be used to generate transitional motions of a protein to the native state from unstructured conformations or between specified conformations. We describe a method based on rigidity theory that allows us to sample conformation space more efficiently than our initial sampling strategy and enables us to study a broader range of motions for larger proteins. We also describe new analysis tools that enable us to extract kinetics information, such as folding rates. For example, we show how our map-based tools for modeling and analyzing folding landscapes can capture subtle folding differences between protein G and its mutants, NuG1 and NuG2. In recent work, we have applied our techniques to identify and study the folding core. More information regarding our work, including an archive of protein motions generated with our technique, is available from our protein folding server: http://parasol.tamu.edu/foldingserver/

3/2/2010
Seyoung Kim, Carnegie Mellon University
Understanding the Genetic Basis of Complex Diseases via Genome-Phenome Association

Genome-wide association studies have recently become popular as a tool for identifying the genetic loci that are responsible for increased disease susceptibility by examining genetic and phenotypic variation across a large number of individuals. The cause of many complex disease syndromes involves the interplay of a large number of genomic variations that perturb disease-related genes in the context of a regulatory network. As patient cohorts are routinely surveyed for a large number of traits, such as hundreds of clinical phenotypes and genome-wide expression profiles for thousands of genes, this raises new computational challenges in identifying genetic variations associated simultaneously with multiple correlated traits. In this talk, I will present algorithms that go beyond the traditional approach of examining the correlation between a single genetic marker and a single trait. Our algorithms build on a sparse regression method in statistics, and are able to discover genetic variants that perturb modules of correlated molecular and clinical phenotypes during genome-phenome association mapping. Our approach is significantly better at detecting associations when genetic markers synergistically influence a group of traits.

2/25/2010
Alexander Schoenhuth, University of California-Berkeley
Classifying cancer tissue by inferring systemic markers

It has recently been shown that protein-protein interaction (PPI) subnetworks which exhibit synergistic differential gene expression in tumorigenic phenotypes are more accurate than single gene markers when it comes to classifying such phenotypes. Here we compute markers as connected subnetworks in confidence-scored PPI networks which achieve high overall confidence scores and are dysregulated in a sufficient number of patients. We do this by employing a novel, exhaustive search technique which, for the first time, renders the inherent search problem on weighted-edge networks tractable. We compute p-values for the resulting subnetworks and use the most significant candidates for classification purposes. Thereby we obtain sets of systemic markers which are superior in terms of gene ontology (GO) term enrichment. As a result, we outperform all prior approaches when classifying colon cancer versus healthy tissue.

2/23/2010
Can Alkan, University of Washington and Howard Hughes Medical Institute
Discovery and Characterization of Copy-Number Variants with Next-Gen Sequencing Technologies

Structural variation, in the broadest sense, is defined as the genomic changes among individuals that are not single nucleotide variants. These include insertions, deletions, duplications, inversions and translocations that were demonstrated to be common and ubiquitous among individuals. A variety of diseases and traits, such as schizophrenia, mental retardation, and HIV susceptibility/resistance, have been associated (both causatively and protectively) with copy-number variants (CNVs). However, CNVs, especially duplicated regions, have remained largely intractable due to difficulties in accurately resolving their structure, copy number and sequence content using hybridization-based methods. Consequently, a significant fraction of the duplicated genomic content has not been assayed by standard genetic and molecular analyses.
The realization of new ultra-high-throughput sequencing platforms such as Roche/454, Illumina/Solexa and ABI/SOLiD now makes it feasible to detect the full spectrum of genomic variation among many individual genomes, including those of cancer patients and others suffering from diseases of genomic origin. Recently I have developed a set of computational methods to comprehensively detect and characterize structural variation and segmental duplications using next-gen sequencing. My algorithms are based on two different approaches: (i) read-depth analysis to characterize segmental duplications and predict absolute copy numbers (mrFAST), and (ii) read-pair analysis to discover structural variation including inversions (VariationHunter). I applied my algorithms to detect structural variation and segmental duplications in genomes sequenced by Illumina and 454 technologies. I initially examine the genomes of three humans and experimentally validate copy-number differences in the organization of these genomes, and then describe the application of my methods to the genomes of >160 individuals sequenced as part of the 1000 Genomes Project.
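The read-depth idea behind approach (i) can be illustrated with a minimal Python sketch. This is not the speaker's pipeline: the function name, the example depths, and the use of the median depth over control regions assumed to be diploid are all illustrative assumptions, and real methods additionally correct for biases such as GC content.

```python
from statistics import median


def estimate_copy_number(window_depths, control_depths, ploidy=2):
    """Estimate an absolute copy number for each genomic window.

    window_depths:  mean read depth in each window of interest
    control_depths: mean read depth in windows assumed (hypothetically)
                    to be present at the normal diploid copy number
    """
    baseline = median(control_depths)
    if baseline <= 0:
        raise ValueError("control regions must have positive read depth")
    # Depth scales roughly linearly with copy number, so rescale each
    # window by the diploid baseline and round to the nearest integer.
    return [round(ploidy * depth / baseline) for depth in window_depths]


# Made-up depths: a normal window, a duplication, a hemizygous deletion,
# and a homozygous deletion.
copies = estimate_copy_number([30, 61, 14, 0], control_depths=[29, 30, 31, 30])
print(copies)
```

With these made-up numbers the diploid baseline is 30 reads, so a window at roughly twice that depth is called as four copies and a window at roughly half that depth as one copy.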

2/12/2010
Tandy Warnow, University of Texas at Austin
Simultaneous Alignment and Phylogenetic Tree Estimation

Molecular sequences evolve under processes that include substitutions, insertions and deletions (the latter two jointly called "indels"), as well as other mechanisms (e.g., duplications and rearrangements). The inference of the evolutionary history of these sequences has typically been performed in two stages: the first estimates the alignment on the sequences, and the second estimates the tree given that alignment. While such methods seem to work well on relatively small datasets, these two-stage approaches can produce highly incorrect trees and alignments when applied to large datasets, or ones that evolve with many indels. In this talk, I will present a new method, SATe, that my lab has been developing that uses maximum likelihood to estimate the alignment and tree at the same time, and that can be used to analyze datasets with up to 1000 sequences on a desktop in 24 hours. Our study, using both real and simulated data, shows that this method produces much more accurate trees than the current best methods. Joint work with Kevin Liu, Sindhu Raghavan, Serita Nelesen, and Randy Linder.

2/9/2010
Cheemeng Tan, Duke University
Emergent bistability in bacteria and implications for effective antibiotic treatment

A synthetic gene circuit is often engineered by considering the host cell as an invariable “chassis”. Circuit activation, however, may modulate host physiology, which in turn can drastically impact circuit behavior. In this talk, I will first discuss the engineering of a simple circuit consisting of a mutant T7 RNA polymerase (T7 RNAP*) that activates its own expression in the bacterium Escherichia coli (1). Although activation by the T7 RNAP* is noncooperative, the circuit caused bistable gene expression. This counterintuitive observation can be explained by growth retardation caused by circuit activation, which resulted in nonlinear dilution of T7 RNAP* in individual bacteria. Predictions made by models accounting for such effects were verified by further experimental measurements. Our results reveal a novel mechanism of generating bistability and underscore the need to account for host physiology modulation when engineering gene circuits.
Interestingly, bistability can also arise from interactions between bacterial physiology and antibiotics. We find that certain antibiotics, when applied at moderate concentrations, can cause ‘phenotypic bifurcation’ in bacterial growth: for the same concentration of antibiotic, a bacterial population survives only if its initial density is sufficiently high. We further show that the phenotypic bifurcation has profound implications for periodic treatment of bacteria by antibiotics. In the absence of phenotypic bifurcation, the efficacy of treatment increases with increasing frequency of antibiotic administration; in its presence, however, the efficacy of treatment can be drastically diminished at an intermediate frequency. Our results have implications for the optimal design of antibiotic treatment.
(1) C. Tan, P. Marguet, and L. You. Emergent bistability by a growth-modulating positive feedback circuit. Nature Chemical Biology, 5, 842-848, 2009.
Highlighted in “News and Views”: Slow growth leads to a switch, Nature Chemical Biology, 5, 784-785, 2009.

2/5/2010
Junming Yin, Univ. of California, Berkeley
A new statistical model for studying gene conversions

Together with crossover recombination, gene conversion is a major evolutionary mechanism responsible for shaping observed genetic variation in a population. Although crossovers and gene conversions have different effects on the evolutionary history of chromosomes and therefore leave behind different footprints in the genome, it is a challenging task to tease apart their relative contributions to the observed genetic variation. In fact, the methods employed in recent studies of recombination rate variation in the human genome actually capture combined effects of crossovers and gene conversions.
Studying gene conversion is very important, for it has been argued that ignoring gene conversion may cause problems in association studies. By explicitly incorporating overlapping gene conversion events, we propose a new statistical model that can jointly estimate the crossover rate, the gene conversion rate and the mean tract length, which is widely regarded as a very difficult problem. Our simulation results show that modeling overlapping gene conversions is crucial for improving the accuracy of the joint estimation of the aforementioned three fundamental parameters. Our analysis of real data from the telomere of the X chromosome of Drosophila melanogaster suggests that the ratio of the gene conversion rate to the crossover rate for the region may not be nearly as high as previously claimed.
Joint work with Michael I. Jordan and Yun S. Song.

2/4/2010
Marcel Schulz, Max Planck Institute for Molecular Genetics
From RNA-Seq to Ontology Graphs: Application of probabilistic models


In the first part of my talk I am going to present methods that deal with the inference of alternative splicing events from high-throughput mRNA sequencing (RNA-Seq) data. Starting from millions of paired-end RNA-Seq reads, we attempt to reconstruct the original mRNA sequences without using a genomic reference sequence. We use a de Bruijn graph approach to the problem and show that many different types of alternative splicing events can be decoded from the topology of the simplified de Bruijn graph. The approach is implemented in the software Oases. Remarkably, application to data from the RGASP competition demonstrates its usefulness for organisms of various degrees of complexity. A statistical method was developed that subsequently allows the expression levels of the reconstructed mRNAs to be inferred.
The second part of the talk will be about a new statistical method for semantic similarity searches in Ontology Graphs. The method is different from previous approaches because it incorporates the probability of random similarity scores and assigns p-values to them. An efficient algorithm has been developed that allows exact p-values to be computed. The use of the new method is illustrated with the Phenomizer webserver that assists medical geneticists in the differential diagnostic process using features of the Human Phenotype Ontology annotated to OMIM diseases.
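To make concrete how splicing events can show up in the topology of a de Bruijn graph, here is a toy Python sketch. It is not the Oases implementation: the function names, the value of k, and the two made-up transcript strings (one containing a middle exon, one skipping it) are illustrative assumptions, and real assemblers also handle sequencing errors, coverage, and reverse complements.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge: its prefix -> its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branch_nodes(graph):
    """Nodes with more than one successor, where alternative paths diverge."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Two hypothetical isoforms: exon1 + exon2 + exon3 and exon1 + exon3.
exon1, exon2, exon3 = "ACGT", "TTCA", "GGCC"
isoforms = [exon1 + exon2 + exon3, exon1 + exon3]

graph = build_de_bruijn(isoforms, k=4)
print(branch_nodes(graph))
```

In this toy input the two paths diverge at the end of the first exon and rejoin inside the third, forming the kind of bubble that a skipped exon would be expected to produce.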

1/29/2010
Quaid Morris, University of Toronto
Predicting the targets of mRNA-binding proteins


RNA-binding domains are among the most common domains in eukaryotic genomes and RNA-binding proteins (RBPs) play critical roles in post-transcriptional regulation (PTR) of gene expression by regulating mRNA processing, mRNA translation, mRNA export and mRNA stability. However, despite their importance, little is known about how RBPs identify their target sites.
As a first step towards building quantitative models of PTR, we are mapping out mRNA and RBP interactions using a combined biochemical and computational strategy. Our strategy is based on a microarray-based assay, called RNAcompete, that measures the binding affinity of a recombinant RBP for hundreds of thousands of short RNA sequences.
These sequences are designed to comprehensively query the space of possible binding preferences. We use a new RNA motif finding algorithm, RNAcontext, to infer sequence and structural binding preferences of RBPs from the RNAcompete data. However, using these motif models to find RBP binding sites on mRNAs requires estimating mRNA secondary structure computationally. Some of our recent work suggests that estimating this structure is easier than expected.


1/28/2010
Hsiao-Mei Lu, Univ. of Illinois at Chicago
Dynamics of Biological Systems: Allosteric Signal Transmission and Epigenetic Circuits


The dynamics of biological networks is critically important in conducting cellular functions. It is often a challenging task to study the dynamics of networks due to their size and complexity. Based on our successful work in characterizing protein folding dynamics in such a large conformational space through a long time evolution, the same method is proposed to study the dynamics and time evolution of allosteric signal transmission and epigenetic circuits.
Large macromolecular assemblies are often important for biological processes in the cell. Allosteric communications between different parts of these molecular machines play critical roles in cellular signaling. Although studies of the topology and fluctuation dynamics of coarse-grained residue networks can yield important insight, they do not provide characterization of the time-dependent dynamic behavior of these macromolecular assemblies. Here we develop a novel approach called the Perturbation-based Markovian Transmission (PMT) model to globally study the dynamic responses of macromolecular assemblies. By monitoring simultaneous responses of all residues (>8,000) across many (>6) decades of time span from the initial perturbation until reaching equilibrium, we show that this approach can yield rich information. With criteria based on quantitative measurements of relaxation half-time, flow amplitude change, and oscillation dynamics, this approach can identify pivot residues that are important for macromolecular movement, messenger residues that are key to signal mediation, and anchor residues important for binding interactions. Based on a detailed analysis of the GroEL-GroES chaperone system, we found that our predictions have an accuracy of 71-84% judged by independent experimental studies reported in the literature. I propose that this computational method can detect allosteric signal transmission pathways, characterize the roles of functionally important residues, and make novel predictions about the importance of additional, previously uncharacterized amino acid residues, which can be further tested in experimental studies. This approach is general and can be applied to other large macromolecular machineries such as virus capsids and ribosomal complexes.
Models based on the chemical master equation can describe the interactions involved in biomolecular networks accurately. An epigenetic circuit of the phage lambda switch in E. coli cells is modeled by the chemical master equation with full stochasticity. Based on the successfully developed model, the specific cooperative binding of CI dimer to OR1 and OR2 is found to be the only one crucial for maintaining a stable and robust phage lambda switch. The explicit computational study of mutations affecting the binding of CI dimer and Cro dimer to OR3 shows that Cro dimer is necessary for efficient phage lambda induction. The DNA looping, double positive and negative regulations, and other biochemical mutations will be studied. Algorithms are also proposed to solve larger systems efficiently.


Date: 10/20/2009
Speaker: Hagit Shatkay, Queen's University
Title: Life by the Book: Pragmatically Using Text in Large Scale -Omics.

Abstract: The genomic era, in which we have lived since the sequencing of the human genome, is characterized by tremendous amounts of biomedical data, accompanied by a significant increase in the number of related scientific publications.
Much biomedical knowledge is hidden within the abundant literature. The ability to rapidly and effectively survey the literature can support numerous applications, including multiple stages in the design and the interpretation of large-scale experiments.
A variety of methods are being applied to the biomedical literature in an attempt to meet these goals, mostly through careful mining of text for gene/protein names and interactions, using natural language processing methods. However, the idea of general “biomedical text mining” remains elusive.

Rather than view biomedical text mining as one monolithic (and not very well defined) task, we attend to specific biological goals that may benefit from the use of text. The talk will focus on several biological applications/problems involving text, and discuss some non-traditional, coarse-grained methods that we use to address them.


Emma Lundberg, Royal Inst. of Technology, Stockholm
A Human Protein Atlas

Abstract: Information on protein localization and expression at the tissue, cell and organelle level is important to map and characterize the human proteome as well as to better understand cellular functions of proteins and to find biomarkers. In the Human Protein Atlas program the human proteome is systematically analyzed using an antibody-based approach. By generation and thorough validation of antibodies, protein localization and expression in human tissues and cells can be analyzed using immunohistochemistry and fluorescence confocal microscopy. The results are publicly available in the Human Protein Atlas web portal (www.proteinatlas.org) that currently contains results from the use of more than 8,800 validated antibodies corresponding to one third of all human genes. The portal contains more than 7 million high-resolution images, each of which has been manually annotated and curated by a certified pathologist or a cell biologist to provide a knowledge base for functional studies and to allow searches and queries about protein profiles in normal and disease tissue as well as at the cellular and subcellular level. Advanced queries can be performed, including searches for chromosome location, protein class and/or tissue specificity (including the 20 most common forms of human cancer), facilitating, for instance, biomarker discovery. Our results suggest that it should be possible to extend the protein atlas to cover the majority of all human proteins, thus providing a valuable tool for biological and medical research.

Date: 3/2/2009
Speaker: Nicholas Buchler, Rockefeller University
Title: Bait and switch: How protein sequestration generates a flexible ultrasensitive response


Abstract: Regulatory networks in cells exhibit important dynamical behaviors, such as bistability (e.g. epigenetic switch) and oscillation (e.g. clocks, cell cycle). Ultrasensitive or 'all-or-none' gene expression is a necessary feature for the emergence of such dynamics in gene networks. In biology, many regulatory molecules are sequestered by an inhibitor into an inactive complex. Using an experimental approach in budding yeast, I will demonstrate how protein sequestration generates tunable, all-or-none thresholds in gene expression. A simple quantitative model for this genetic network shows that both the threshold and the degree of ultrasensitivity depend upon the abundance of the inhibitor, exactly as observed experimentally. The abundance of the inhibitor can be altered by simple mutation; thus ultrasensitive responses mediated by protein sequestration are easily tunable. Gene duplication of regulatory homodimers and loss-of-function mutations can create dominant-negatives that sequester the original duplicate into an inactive complex. These results suggest a mechanism for the rapid evolution of bistable switches and oscillators in regulatory networks.
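The threshold behavior described in this abstract can be illustrated with a generic equilibrium-binding calculation in Python. This is a sketch under stated assumptions, not the speaker's model: it assumes a single activator A and inhibitor I that bind 1:1 with dissociation constant Kd, and the function name and all concentrations are made up. Mass balance then gives a quadratic for the free activator, A^2 + (Kd + I_total - A_total) A - Kd A_total = 0, whose positive root is computed below.

```python
import math


def free_activator(a_total, i_total, kd):
    """Free activator concentration at equilibrium under 1:1 sequestration.

    Solves A^2 + (kd + i_total - a_total) * A - kd * a_total = 0 for A >= 0.
    All arguments share the same (arbitrary) concentration units.
    """
    b = kd + i_total - a_total
    return 0.5 * (-b + math.sqrt(b * b + 4.0 * kd * a_total))


# Hypothetical numbers: tight binding (small kd) and 100 units of inhibitor.
# Below the threshold (a_total < i_total) almost no activator is free;
# above it, free activator rises roughly as a_total - i_total.
for a_total in (50, 90, 110, 150):
    print(a_total, round(free_activator(a_total, i_total=100, kd=0.01), 3))
```

In this sketch the threshold sits near the total inhibitor concentration, so changing the inhibitor abundance shifts where the response switches on, which is the kind of tunability the abstract describes.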

Date: 2/19/2009
Speaker: Andrew Grimson, Massachusetts Inst. of Technology
Title: Animal microRNAs: their ancient origin and contemporary targets


Abstract: Hundreds of microRNAs (miRNAs) collectively regulate a substantial fraction of the animal transcriptome. Because virtually all aspects of biology are likely impinged upon by miRNAs, the identification of the mRNAs targeted by each miRNA remains a fundamental question. Specific ~7 nt recognition sequences, located primarily in 3' UTRs, are important for target recognition. These sites are complementary to the 5' end, or seed region, of the miRNA. However, seed matches are not sufficient for repression, indicating that other characteristics help specify miRNA targeting. By combining computational and experimental approaches, we discovered five features of site context that govern site efficacy. We developed a model that combines these context determinants to quantitatively predict site performance thereby indicating which of the thousands of potential miRNA-target relationships are functional. The predictions are made without recourse to site conservation, and are therefore effective at predicting a wide variety of target interactions including nonconserved sites and siRNA off-target effects.

The scale of transcriptome regulation by miRNAs together with the extent of miRNA conservation between bilaterians (e.g., humans, flies, and worms) is evidence for the importance of miRNA biology during animal evolution. In addition to miRNAs, other bilaterian small RNAs, known as Piwi-interacting RNAs (piRNAs), protect the genome from transposons. Neither miRNAs nor piRNAs were known to exist in the simplest, pre-bilaterian, animal phyla, raising the question of whether a rich small-RNA biology is characteristic of more complex animals, or whether these small RNAs might have emerged earlier in metazoan evolution. To gain perspective on the evolution of miRNAs and piRNAs, we used high-throughput sequencing to identify small RNAs from several basal animal lineages that diverged prior to the emergence of the Bilateria. We found that the cnidarian Nematostella vectensis, a relatively close relative of bilaterians, possesses an extensive repertoire of miRNA genes, two classes of piRNAs, and a complement of proteins specific to small-RNA biology comparable to that of humans. Similarly, the sponge Amphimedon queenslandica, amongst the simplest of animals and a distant relative of bilaterians, also possesses miRNAs, piRNAs and a full complement of small-RNA machinery. These data indicate that both miRNAs and piRNAs have existed from the earliest stages of metazoan evolution and have been available to shape gene expression throughout the evolution and radiation of animal phyla.
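The seed-matching step mentioned in the first paragraph of this abstract can be sketched in a few lines of Python. This covers only the search for ~7 nt sites complementary to the miRNA seed; it does not implement the context features or the predictive model described in the talk. The function name, the choice of nucleotides 2-8 as the seed, and the input strings are illustrative assumptions.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def seed_match_sites(mirna, utr, seed_start=1, seed_len=7):
    """Return 0-based UTR positions of sites complementary to the miRNA seed.

    The seed is taken as mirna[seed_start:seed_start + seed_len], i.e.
    nucleotides 2-8 with the defaults. Both sequences are RNA, written 5' to 3'.
    """
    seed = mirna[seed_start:seed_start + seed_len]
    # A target site pairs antiparallel to the seed, so it is the
    # reverse complement of the seed sequence.
    site = "".join(COMPLEMENT[base] for base in reversed(seed))
    return [i for i in range(len(utr) - seed_len + 1) if utr[i:i + seed_len] == site]


# Illustrative inputs: one miRNA and a short stretch of 3' UTR.
mirna = "UGAGGUAGUAGGUUGUAUAGUU"
utr = "AAGCUACCUCAUUU"
print(seed_match_sites(mirna, utr))
```

As the abstract notes, a seed match alone is not sufficient for repression, so positions found this way are only candidates that a context-based model would then score.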

Date: 2/16/2009
Speaker: Eric Deeds, Harvard Medical School
Title: Dynamic individuality in protein-protein interaction networks


Abstract: Protein-protein interactions play a crucial role in all cellular processes, from the regulation of gene expression to the transduction and processing of extracellular signals. Over the past decade, high-throughput techniques such as Yeast 2-Hybrid (Y2H) and Tandem Affinity Purification (TAP-tagging) have provided a global picture of what the entire protein-protein interaction (PPI) network in certain organisms might look like. While these methods are often quite noisy (with potentially high rates of false positives and false negatives), they have nonetheless served as the substrate for a large body of work aimed at characterizing or explaining the general topological structure of these networks. Such purely topological studies are limited, however, by the fact that they consider a static description of an inherently dynamical system. A full characterization and understanding of the behavior of PPI networks clearly requires that one be able to describe and understand the dynamics of hundreds to thousands of objects physically interacting with one another. In this work we employ recently developed rule-based modeling techniques to perform the first large-scale stochastic simulations of the PPI network found in the cytoplasm of yeast cells. These simulations reveal that cells prepared in identical initial conditions will, at steady state, differ considerably from one another in terms of the identities of the large protein complexes found in each. Our results indicate that such dynamic individuality may arise in many complex interaction and signaling networks.

Date: 2/6/09
Speaker: Su-In Lee, Carnegie Mellon University
Title: Individual Genetic Variation and Gene Regulation: From Networks to Mechanisms


Abstract: Gene expression data of genetically diverse individuals (eQTL data) provide a unique perspective on the effect of genetic variation on cellular pathways, and help identify sequence variations with phenotypic effect. However, the large number of possible regulatory interactions, combined with the challenges of linkage disequilibrium (LD), makes it difficult to correctly identify causal polymorphisms. To resolve this problem, researchers traditionally apply heuristics for selecting among plausible hypotheses, favoring polymorphisms that are more conserved, that lead to significant amino acid change, or that reside in genes whose function is related to that of the targets. We can construct a list of properties (called regulatory features) that indicate how likely a polymorphism with that property is to change the gene regulatory network. But how do we know how much weight to attribute to different regulatory features? This talk describes a novel method, called Lirnet (linear regulation network), for identifying regulatory networks from eQTL data. Lirnet automatically learns from eQTL data how to weight regulatory features and induce a regulatory potential for candidate sequence variations. Lirnet assesses these weights simultaneously with learning a regulatory network, finding weights that lead to a more predictive network. This feature, combined with Lirnet's ability to learn the importance of these features automatically, makes it especially advantageous for mammalian systems, where many forms of prior knowledge used in simple model organisms are incomplete or unavailable.
We apply Lirnet to eQTL data in yeast, mouse and human (Phase II HapMap data), and provide statistical and biological results demonstrating that Lirnet produces significantly better regulatory programs than other recent approaches. We demonstrate in the yeast data that Lirnet can correctly suggest a specific causal sequence variation within a large, linked chromosomal region. In yeast, Lirnet uncovered a novel, experimentally validated connection between Puf3, a sequence-specific RNA binding protein, and P-bodies, cytoplasmic structures that regulate translation and RNA stability, as well as the particular causative polymorphism, a SNP in Mkt1, that induces the variation in the pathway.

Date: 1/27/09
Speaker: Derek Ruths, Rice University
Title: Execution Strategies for Executable Biological Models


Abstract: Progress in advancing our understanding of biological systems is limited by their sheer complexity, the cost of laboratory materials and equipment, and limitations of current laboratory technology. Computational and mathematical modeling provides ways to address these limitations through hypothesis generation and testing without experimentation - allowing researchers to analyze system structure and dynamics in silico and, then, design lab experiments that yield desired information about phenomena of interest. These models, however, are only as accurate and complete as the data used to build them. Currently most models are constructed from quantitative experimental data. However, since accurate quantitative measurements are hard to obtain and difficult to adapt from literature and online databases, new sources of data for building models need to be explored. In my research, I design methods for building and executing computational models of cellular networks based on qualitative experimental data, which is more abundant, easier to obtain, and reliably reproducible. Such executable models allow for in silico perturbation, simulation, and exploration of biological systems. In this talk, I will present two general strategies for building and executing Petri net-based models of biochemical networks. Both have been successfully used to model and predict the dynamics of signaling networks in normal and cancer cell lines, rivaling the accuracy of existing methods trained on quantitative data.
This work is done in collaboration with Luay Nakhleh (Rice University) and Prahlad T. Ram (MD Anderson Cancer Center).
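
As a rough illustration of what "executing" a Petri net model means, the sketch below implements only the standard token-firing rule: a transition is enabled when every input place holds enough tokens, and firing it moves tokens from input places to output places. It is a minimal generic sketch, not either of the execution strategies presented in the talk; the function names, the dictionary layout, and the toy ligand/receptor network are all hypothetical.

```python
def enabled(marking, transition):
    """Return True if every input place holds at least the required tokens."""
    return all(marking.get(place, 0) >= needed
               for place, needed in transition["inputs"].items())


def fire(marking, transition):
    """Return the new marking after firing one enabled transition."""
    if not enabled(marking, transition):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)  # leave the caller's marking untouched
    for place, needed in transition["inputs"].items():
        new_marking[place] = new_marking.get(place, 0) - needed
    for place, produced in transition["outputs"].items():
        new_marking[place] = new_marking.get(place, 0) + produced
    return new_marking


# Hypothetical toy network: one ligand binds one receptor to form a complex.
marking = {"ligand": 1, "receptor": 2, "complex": 0}
bind = {"inputs": {"ligand": 1, "receptor": 1}, "outputs": {"complex": 1}}

after_binding = fire(marking, bind)
can_bind_again = enabled(after_binding, bind)  # no ligand left to consume
```

Token counts here stand in for qualitative activity levels rather than measured concentrations, which is one reason Petri nets are a natural fit for models built from qualitative data.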

Date: 1/15/09
Speaker: Phil Hyoun Lee, Queen's University
Title: Selecting single nucleotide polymorphisms for effective genetic association study

Abstract: Genetic variation analysis holds much promise as a basis for understanding disease-gene association. In particular, single nucleotide polymorphisms (SNPs) are at the forefront of such studies, as they are the most common form of DNA variation on the genome. However, due to the tremendous number of candidate SNPs, there is a clear need to expedite genotyping and analysis by selecting and considering only a subset of all SNPs.
In this talk, I will present three machine learning applications that successfully address the problem of SNP selection and improve on the current state of the art. The first, a tag SNP selection approach, aims to choose a subset of SNPs whose allele information can best represent the allele information of unselected SNPs. Using the formalism of Bayesian networks, it enables the selection of a subset of independent and highly predictive SNPs, without limiting the number or the location of predictive tag SNPs. The second method is based on the functionality of SNPs. It aims to directly select a subset of SNPs that are likely to be disease-causing. In the probabilistic framework, our integrative scoring system combines the functional assessments from a variety of bioinformatics tools, and prioritizes SNPs according to their potential deleterious effects on major biological functions. Lastly, I will describe a new multi-objective optimization framework for identifying SNPs that are both informative as tags and functionally significant.

Date: 1/13/09
Speaker: Xin Gao, University of Waterloo
Title: Zero in on the fully automated NMR protein structure determination

Abstract: High-throughput structural genomics requires parallelizable technologies for high-resolution protein structure determination. Nuclear Magnetic Resonance (NMR) would be such a technology if its tedious and lengthy process could be fully automated. In the talk, I will describe our efforts on a fully automated protocol for NMR protein structure determination. We have developed a singular value decomposition-based peak picking method, PICKY, which achieves an average of 88% recall and 74% precision over 32 raw spectra extracted from eight proteins. Existing resonance assignment methods, however, do not work well on incomplete and imperfect peak lists. Consequently, we have designed an integer linear programming-based assignment method. It significantly outperforms other existing programs on both perfect peak lists and noisy peak lists. With the partial resonance assignments, FALCON-NMR is developed as a hidden Markov model-based torsion angle sampling method. The whole system, AMR, has been successfully tested on four proteins with weights of approximately 15 kDa.


Date: 11/10/08
Speaker: William Noble, University of Washington
Title: Machine learning analysis of shotgun proteomics data


Abstract: Mass spectrometry has become the most widely used tool for the characterization of proteins within complex mixtures. In this talk, I will describe several successful applications of machine learning to improve the rate at which we can correctly assign peptide sequences to observed tandem mass spectra. We use supervised and semi-supervised discriminative learning methods to train a classifier that discriminates between correctly and incorrectly annotated spectra. Unlike previous methods, the classifier can be trained dynamically on each given data set, thereby adjusting to particular characteristics of the sample preparation protocol, machine platform, calibration and chromatography conditions. We have also trained a dynamic Bayesian network to model the process of peptide fragmentation within the mass spectrometer. The resulting model yields useful insights into fragmentation biochemistry as well as significantly improved peptide identification performance.

Date: 3/27/08
Speaker: Gad Kimmel, University of California, Berkeley
Title: Computational Problems in Human Genetics


Abstract: The question of how genetic variation and personal health are linked is one of the compelling puzzles facing scientists today. The ultimate goal is to exploit human variability to find genetic causes for multi-factorial diseases such as cancer and coronary heart disease. Recent technological improvements enable the typing of millions of single nucleotide polymorphisms (SNPs) for a large number of individuals. Consequently, there is a great need for efficient and accurate computational tools for rigorous and powerful analysis of these data. In my talk I am going to concentrate on two computational problems, each an essential step in studying the data obtained by this technology: accurate and efficient significance testing with a correction for population stratification, and estimating local ancestries in admixed populations.

Date: 3/26/08
Speaker: Itamar Simon, Hebrew University
Title: A high resolution map of mouse genome replication timing suggests a role in gene regulation


Abstract: Although it is known that genomes are divided into distinct replication time zones, a more detailed understanding of their organization is limited. Taking advantage of a novel synchronization method and of genomic DNA microarrays we have mapped replication times of the entire mouse genome at a high temporal resolution. The measurement results have allowed us to assign distinct replication times to 91% of the genome, define asynchronously replicating regions and identify very large replicons. Analysis of the association between replication and transcriptional features has revealed a correlation between replication and transcription potential as well as evolutionary conservation of replication timing. Finally, analysis of large replicons, and in particular of regions at which the time of replication differs from the time of replication of a distant origin, reveals that transcription is correlated with the actual time of replication and not with the time of origin activation. Overall, these findings suggest that early replication plays a causal role in potentiating gene transcription.

Date: 3/17/08
Speaker: Olivier Elemento, Princeton University
Title: Decoding the regulatory genome


Abstract: Deciphering the non-coding regulatory genome has proved a formidable challenge. Despite the wealth of available gene expression data, there currently exists no broadly applicable method for characterizing the regulatory elements that shape the rich underlying dynamics. I will present a general framework for detecting such regulatory DNA and RNA motifs that relies on directly assessing the mutual information between sequence and gene expression measurements. Our approach makes minimal assumptions about the background sequence model and the mechanisms by which elements affect gene expression. This provides a versatile motif discovery framework, across all data types and genomes, with exceptional sensitivity and near-zero false-positive rates. Applications from yeast to human uncover novel putative and established transcription-factor binding and miRNA target sites, revealing rich diversity in their spatial configurations, pervasive co-occurrences of DNA and RNA motifs, context-dependent selection for motif avoidance, and the strong impact of post-transcriptional processes on eukaryotic transcriptomes. This approach complements our previous and ongoing work using comparative genomics, and represents a major contribution to our ongoing effort to systematically characterize eukaryotic regulatory elements and understand their role in complex processes such as development, aging and disease.
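
To make the central quantity concrete, the sketch below computes the mutual information between a discrete "motif present/absent" label per gene and a discrete expression-cluster label per gene. It is only an illustration of the textbook definition, not the speaker's framework; the function name and the toy labels are hypothetical and invented for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information, in bits, between two equal-length discrete sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n == 0:
        return 0.0
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    count_x = Counter(xs)         # marginal counts of x
    count_y = Counter(ys)         # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y)
        mi += p_xy * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical toy data: one entry per gene.
motif_present = [1, 1, 0, 0]
cluster_informative = [0, 0, 1, 1]    # cluster tracks the motif exactly
cluster_uninformative = [0, 1, 0, 1]  # cluster is independent of the motif

mi_informative = mutual_information(motif_present, cluster_informative)
mi_uninformative = mutual_information(motif_present, cluster_uninformative)
```

A motif whose presence says nothing about expression scores zero, while one that fully determines a two-way expression split scores one bit. Deciding whether an observed score is significant would need a further step, such as comparison against shuffled labels, which this sketch omits.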

Date: 3/10/08
Speaker: Philip Kim, Yale University
Title: Jumping scales: How 3D structures and molecular genetics meet in protein networks


Abstract: Protein interaction networks form the central layer of a systems-level description of the cell. While most studies of protein networks operate on a high level of abstraction, neglecting structural and chemical aspects of each interaction, I will describe our approach of characterizing interactions by using atomic-resolution information from three-dimensional protein structures. We find that some previously recognized relationships between network topology and genomic features (e.g., hubs tending to be essential proteins) are actually more reflective of a structural quantity, the number of distinct binding interfaces. Subdividing hubs with respect to this quantity provides insight into their evolutionary rate and indicates that additional mechanisms of network growth are active in evolution.

Furthermore, I will provide an overview of a major international collaborative effort that aims to resolve interactions involved in signaling pathways. These tend to involve intrinsically disordered regions and are hence complementary to the structured interactions studied by the above approach. Our approach combines modern experimental screening techniques with a novel integrated analysis pipeline. The screens measure binding specificities with hitherto unachievable accuracy and the analysis pipeline maximizes prediction accuracy by integrating a variety of genomic and proteomic features.

Lastly, I will present a study that examined the relationship between genetic signatures of adaptive evolution and proteomic properties, such as the location of sites in protein networks and structures. Due to recent advances in genotyping and sequencing technology, human genetic variation and adaptive evolution in the primate lineage have become a major research focus. We find a striking tendency of proteins that have been subject to adaptive evolution (as compared to the chimpanzee) to be located at the periphery of the interaction network. We also find that the fixation of large-scale copy number variants into segmental duplications preferentially occurs at the network periphery, bolstering our argument for selection at the periphery. This suggests that the observed preferential selection at the network periphery may be due to an increase of adaptive events on the cellular periphery responding to changing environments.

Date: 3/3/08
Speaker: Han Liang, University of Chicago
Title: System Structures and MicroRNA regulation in humans: a view of systems biology


Abstract: MicroRNAs are ~22nt non-coding RNAs that can post-transcriptionally repress the expression of many protein-coding genes in higher eukaryotes. Recently available functional genomic data enable us to examine the regulatory role of microRNAs at the system level. Integrating human protein-protein interaction and microRNA targeting data, I found a global correlation between protein connectivity and microRNA regulation complexity in the corresponding genes, and that microRNA regulation likely coordinates the behavior of interacting partners. To understand the evolution of microRNA-mediated regulation in humans, I evaluated the role of three types of nucleotide variation on microRNA targeting: variation between species, variation within populations and epigenetic variation. While purifying selection appears to be a driving force maintaining the stability of microRNA regulation at the system level, a small number of variants may have significant functional effects. In particular, I found an appreciable level of polymorphism at microRNA target sites (including SNPs with a signature of positive selection or within important disease genes), which suggests that allele-specific microRNA regulation is an important source of phenotypic differences among individuals.

Date: 2/28/08
Speaker: Ge Yang, Scripps Research Institute
Title: Metaphase spindle architecture and molecular motor coordination revealed by model driven computer vision


Abstract: The development of biology over the past half century makes it possible to identify the complete set of genes and proteins of an organism. A fundamental challenge remains, however, to understand the complex dynamics of and interactions between the many individual molecular components involved in situ and in space and time. Of particular importance in addressing this challenge is to understand how force and motion are generated, transmitted, and controlled within dynamic cellular structures during basic cellular processes. In this presentation, I will focus on addressing this question in two such processes: cell division and intracellular transport. First, single-fluorophore imaging and biochemical perturbation are used to investigate the architecture of the metaphase microtubule cytoskeleton in cell division. This assay provides a model system to understand how cytoskeletal filament networks are dynamically organized to transmit force and to directly generate force. Second, fluorescence imaging and genetic manipulation are used to probe the interaction between molecular motors in the axonal transport machinery of neurons. This assay provides a sufficiently reduced yet extremely powerful model system to understand the interactions between molecular motors of the same and opposite polarities in force and motion generation. Shared by both studies is the use of computer vision techniques, driven by mechanistic models, to extract high-resolution quantitative measurements of the complex spatial-temporal dynamics visualized by powerful fluorescence live cell imaging techniques. These studies reveal some fundamental and exquisite connections between force and motion generation and the dynamic organization of the cytoskeleton in cellular life.

Date: 2/25/08
Speaker: Kevin Chen, New York University
Title: Macro- and micro-evolution of gene regulation mediated by microRNAs


Abstract: Studying the evolution of cis-regulatory elements is important for three general reasons. First, mutations in these elements can cause phenotypes of medical importance; second, understanding cis-element evolution will help us design algorithms for predicting these elements; third, regulatory evolution is important for understanding phenotypic evolution. In this talk, I will focus on a class of cis-elements called "microRNA sites". MicroRNAs are small, noncoding RNAs that post-transcriptionally regulate their target mRNAs by binding to these sites. They have been implicated in many biological processes, including cancer and viral defense.

I will discuss the evolution of animal microRNA sites at two different time scales. At the macro-evolutionary time scale, we show that while the microRNA genes are well-conserved, overall their targets have diverged rapidly. However, there exists a core of deeply-conserved regulatory relationships that may be an important component of animal developmental networks. At the micro-evolutionary time scale, we use human SNP genotype data to demonstrate significant selective constraint on microRNA sites, implying that polymorphisms in these sites are candidates for causal variants of human disease. Our approach also applies to human-specific microRNA sites and we use it to identify a set of these sites in genes co-expressed with the microRNA.

Date: 2/11/08
Speaker: James Taylor, New York University
Title: Making sense of genome-scale data


Abstract: High-throughput data production technologies are revolutionizing modern biology. Translating this experimental data into discoveries of relevance to human health relies on sophisticated computational tools that can handle large-scale data (e.g. multiple genome alignments of dozens of species or billion-genotype genome-wide association studies).

This talk will first discuss a specific large-scale data analysis problem: using comparative genomics to identify and understand functional genomic regions, particularly cis-regulatory elements. Using data generated by the ENCODE project we will demonstrate the power of genome comparisons to distinguish these elements from neutral DNA and the importance of looking for more than just signs of strong evolutionary constraint. We will then describe a machine learning approach that goes beyond sequence conservation and attempts to capture broader and more informative sequence and evolutionary patterns that better distinguish different classes of elements. This approach, denoted ESPERR, uses training examples to learn encodings of multispecies alignments into reduced forms tailored for the prediction of chosen classes of functional elements. ESPERR has proven successful for a variety of classification problems. In particular, the "Regulatory Potential Score" produced using ESPERR has been used to identify putative regulatory elements with high rates of experimental validation.

Second, we will consider the more general problem of making sophisticated computational methods more available to experimental biologists. Many powerful analysis tools exist or are currently being developed, along with many excellent data warehouses and browsers. However, for the average experimental biologist with limited computer expertise, making effective use of these tools and data sources is still out of reach because many existing tools do not have easy-to-use interfaces, and different tools and data sources are not well integrated. We have developed a framework and application, called Galaxy, that solves this problem by providing an integrated web-based workspace that bridges the gap between different tools and data sources. Galaxy simultaneously targets two audiences. For tool developers it eliminates the repetitive effort involved in creating high-quality user interfaces, while giving them the benefit of being able to provide their tools in an integrated environment. For experimental biologists it allows running complex analyses on huge datasets with nothing more than a web browser, and without needing to worry about details of installing tools, allocating computing resources, and file format compatibility. Galaxy is not only incredibly easy to use, it is also incredibly easy to deploy. A developer or lab can create their own Galaxy instance, and start integrating custom tools with only a few minutes' work.

Date: 1/16/08
Speaker: Insuk Lee, University of Texas at Austin
Title: Network biology approaches to study complex traits


Abstract: The relationship between genotype and phenotype is a central issue in genetics, and approaches are needed that allow us to interpret the increasing collection of data on genotypic variation in terms of the effect on organismal phenotypes. Our understanding of these relationships came historically from forward-genetics approaches, which have proved remarkably powerful, but which are still difficult in complex animals, and the complete definition of pathways from forward-genetic data alone is hard. In contrast, reverse-genetics approaches allow unbiased tests across entire genomes for associations with traits of interest, e.g., by using systematic genome-wide knock-out or silencing. However, reverse-genetics is in general labor intensive and time consuming, requiring enormous numbers of assays in order to span large numbers of genes in combination with multiple experimental conditions.
Ideally, we would like to be able to choose which genes to target for reverse-genetics analyses, prioritizing the most likely candidates for being involved in a trait of interest. Such an approach would allow highly focused reverse-genetics studies to be performed, increasing both the sensitivity and efficiency of genetic screens. Here, we present a method for predicting gene loss-of-function phenotypes that can be applied to extend genetic screens and prioritize candidate genes for focused testing in organisms ranging from the simple single-celled yeast to the multicellular animal model C. elegans (worm) to human.