Upcoming Seminars
5/2/2013 - 3pm - 6115 GHC
Ivo F. Sbalzarini, Dresden International Graduate School for Biomedicine and Bioengineering
Computational Biology with Particle Methods
Understanding the function of biological systems from the interactions between their constituents requires predictive forward models of hypothetical mechanisms. Given the complexity of biological systems, such forward models are frequently computational, where numerical simulations are used to probe a model's behavior in regimes where it cannot be solved analytically. We review the key differences between biological and engineering applications of numerical simulations and highlight the main challenges in computational data processing and simulation of biological systems. We propose to exploit the unifying algorithmic framework of particle methods to develop numerical simulations, image-processing, and optimization algorithms that meet the challenges of modern biology. We provide examples from our own work, highlighting how methodological advances in scientific computing have enabled new biological insight and progress in computer science alike. The examples include a self-organizing deterministic particle method for the simulation of multi-scale continuum models, a novel class of stochastic simulation algorithms with reduced time complexity, a domain-specific language for particle methods on heterogeneous parallel computer platforms, and a new class of particle-based image segmentation algorithms. This covers the workflow of image-based systems biology, illustrating several analogies and connections between the different fields involved.
10/4/2013 - 11am - 6115 GHC
Wing Wong, Stanford University
Past Seminars
3/29/2013
Christine Vogel, New York University
The Ups and Downs of Human Protein Expression Regulation
While
transcription regulation has been studied for many years, we now have
mounting evidence that the regulation of protein translation and
degradation is at least as important in determining protein expression
levels. Under normal conditions, for example, transcription and mRNA
degradation account for ~30% of gene expression regulation in mammalian
cells, while translation and protein degradation account for another
30-40%. We have now extended these studies to systems under
perturbation, i.e. cells responding to a stimulus. Using a variety of
large-scale methods, we examine the behavior of the mammalian proteome
and transcriptome in response to environmental stresses. We have
quantified the expression of ~4,000 genes and proteins and are in the
process of characterizing different regulatory patterns that we observe.
Again, transcription is only half the story.
3/27/2013
Shayok Chakraborty, Arizona State University
Batch Mode Active Learning for Multimedia Pattern Recognition
The
rapid escalation of technology and the widespread emergence of modern
technological equipment have resulted in the generation of enormous
amounts of digital data (in the form of images, videos and text among
others). This has expanded the possibility of solving real-world
problems using computational learning frameworks. However, while
gathering a large amount of data is cheap and easy, annotating them with
class labels is an expensive process in terms of time, labor and human
expertise. This has paved the way for research in the field of active
learning. Such algorithms automatically select the salient and exemplar
instances from large quantities of unlabeled data and are effective in
reducing human labeling effort in inducing classification models. To
utilize the possible presence of multiple labeling agents, there have
been attempts towards a batch mode form of active learning, where a
batch of data instances is selected simultaneously for manual
annotation. This talk will cover a basic background of batch mode active
learning, some related work and my current research in this domain.
Specifically, the following three contributions will be discussed in
detail: (i) batch mode active learning algorithms based on convex
relaxations of an NP-hard integer quadratic programming (IQP) problem,
with guaranteed bounds on the solution quality, (ii) an active matrix
completion algorithm and its application to solve several variants of
the active learning problem (transductive active learning, multi-label
active learning, active feature acquisition and active learning for
regression) and (iii) a framework for dynamic batch mode active
learning, where the batch size and the specific data instances to be
queried are selected adaptively through a single formulation, based on
the complexity of the data stream in question. These contributions are
validated on the face recognition and facial expression recognition
problems, which are commonly encountered in real world applications like
robotics, security and assistive technology for the blind and the
visually impaired.
3/22/2013
Vineet Bafna, University of California, San Diego
The breakage fusion bridge and other exotic structural variations: combinatorics and cancer genomics
Cancer
genomes are marked by genomic instability and massive rearrangements.
Recently, many exotic mechanisms have been proposed as mechanistic
explanations for these rearrangements. For example, the
breakage-fusion-bridge (BFB) mechanism, proposed over seven decades ago,
has seen renewed interest as a source of genomic variability and gene
amplification in cancer. Here, we formally model and analyze the BFB
mechanism, providing its first rigorous formulation. Using this
model, we show that BFB can achieve a surprisingly broad range of
amplification patterns, and describe efficient combinatorial algorithms
to characterize patterns consistent with BFB. An extensive analysis of
simulated, cell-line, and primary tumor data reveals the existence of
BFB. Our results also suggest that BFB may be hard to detect under
heterogeneity and polyploidy. Time permitting, we will also discuss other
sources of variation (joint work with Shay Zakov and Marcus Kinsella).
3/20/2013
Yongjin Park, Johns Hopkins University
Resolving the Structure and Dynamics of Large-scale Interactome
Community
structures are embedded in real-world networks. A set of nodes or
edges can be decomposed into fairly homogeneous subsets. In biological
network analysis, community structures are considered as functionally
coherent modules. For instance, tightly connected sub-networks in a
protein-protein interaction network generally correspond to protein
complexes. Modules are easily identified in a network of hundreds of
nodes by visual inspection or simple pattern searches. However,
large-scale network datasets pose significant challenges, not only in
computation, but also because their properties are completely different.
In
this talk, I will describe our attempts to solve community-finding
problems on genome-scale interactome datasets. I will explain how a
probabilistic framework can help design simple yet powerful algorithms,
for instance, avoiding “resolution-limits”, and how this framework can
extend to dynamic network analysis. Next, I will talk about a newly
designed inference algorithm, which is applicable to ultra large-scale
hierarchical stochastic block models. We propose a nearly linear time
algorithm that efficiently computes maximum a posteriori estimates on a deep
hierarchical block structure. Moreover, I will show how we combined
this hierarchical model with other sources of heterogeneous biological
evidence, such as RNA-seq measurements and pathway annotations.
3/19/2013
Bo Li, University of Wisconsin at Madison
Computational Analysis of RNA-Seq Data in the Absence of a Sequenced Genome: From Transcript Quantification to De novo Transcriptome Assembly Evaluation
RNA-Seq
technology has revolutionized the way we study transcriptomes. In
particular, it has enabled us to investigate the transcriptomes of
species that have not yet had their genomes sequenced. I will discuss
our work on two computational tasks that are crucial to analyzing
RNA-Seq data in the absence of a sequenced genome: transcript
quantification and de novo transcriptome assembly evaluation. For
transcript quantification, RNA-Seq is considered a more accurate
replacement for microarrays. However, to allow for the highest accuracy,
methods for analyzing RNA-Seq data must address the challenge of
handling reads that map to multiple genes or isoforms. We present RSEM,
a generative statistical model of the sequencing process and associated
inference methods, which tackles this challenge in a principled manner.
Our
results on both simulated and real data sets suggest that RSEM has
superior or comparable performance to other quantification methods
developed at the same time. Building on RSEM, we have developed a
novel method based on a probabilistic model, RSEM-EVAL, for evaluating de
novo transcriptome assemblies from RNA-Seq data without the ground
truth. Our results on both simulated and real data sets show that our
RSEM-EVAL metric correlates well with the ground truth accuracies of the
assemblies. Our metric has a broad range of potential applications,
such as selecting assemblers, optimizing parameters for an assembler and
guiding new assembler design.
3/18/2013
A. Ercument Cicek, Case Western Reserve University
ADEMA: An Algorithm to Determine Expected Metabolite Level Alterations Using Mutual Information
Sitting
on the top of the omics hierarchy, metabolomics is an important
platform to understand the changes in the physiological activity due to a
condition. Despite advances in analytical methodology and the
increasing number of genome-scale metabolic networks of organisms,
current techniques used to make sense of metabolic profiles
are quite limited. The objective of this presentation is (1) to address
the shortcomings of the current techniques, which are used for
analyzing changes in metabolite levels, and (2) to describe ADEMA, a
multivariate method that computes the expected metabolite level changes
using the metabolic network topology and mutual information. Results
show that (1) ADEMA’s prediction of alterations in the de novo lipogenesis
pathway in a cystic fibrosis mouse model agrees with independently
performed flux and gene expression analyses, and (2) ADEMA’s classifier
scheme outperforms other well-known classification algorithms.
3/1/2013
Zia Khan, University of Chicago
Quantitative Proteomics Provides a New Window into How Genetic Differences Impact Protein-Levels Between Species
Understanding
how genetic differences affect phenotypic variation within and
between species is a central goal of evolutionary and medical genetics.
Genetic differences that impact the regulation of a gene are key
contributors to trait differences. Yet, identifying genetic differences
that impact gene regulation is challenging: not all genetic differences
are functional and gene regulation is the result of a complex network of
interactions between genes. Measuring differing levels, or differential
allele-specific expression, of gene products, RNA or protein, from two
variants of a gene in the same individual provides direct evidence that a
DNA sequence difference between these variants impacts their
regulation. This measurement sets the stage for further studies to
pinpoint functional genetic variation. While recent technological
advances have made it possible to measure allele-specific RNA expression
across many genes in high-throughput, the same cannot be said for
protein levels. As proteins carry out much of the work of the cell, the
absence of a corresponding protein measurement leaves a gap in our
understanding of the genetic basis of phenotypic variation. I present a
quantitative computational method for measuring differential expression
of two protein variants in an individual. The computational method is
based on a simple observation that overcomes a key limitation of a
data-intensive, or “big data,” technology in biological sciences called
quantitative mass spectrometry. As a proof of concept, I use this
computational method to study allele-specific protein levels in a hybrid
between two distantly related species of yeast. This study demonstrates
how this computational method provides a new window into how two classes
of genetic differences have impacted protein levels between species.
2/1/2013
Luisa Hiller, Carnegie Mellon University
Genomic Plasticity: To Be or Not to Be
The
Gram-positive bacterium Streptococcus pneumoniae colonizes humans as a
nasopharyngeal commensal or a respiratory pathogen. This species
displays extensive genomic diversity and a notable capacity to
incorporate genes from neighboring cells into its genome, producing
new genomic combinations. Yet, the majority of pandemic multi-drug
resistant strains belong to one of several lineages that display
decreased genomic diversity. In this talk I will discuss the genomic
diversity and plasticity in the population, as well as possible barriers
to gene exchange that may be leading to the genomic isolation of
clinically important lineages.
11/30/2012
Joel McManus, Carnegie Mellon University
Evolution of post-transcriptional gene regulatory networks
Differences
in gene expression are an important source of phenotypic variation and
disease. Gene expression differences result from changes in gene
regulatory networks, principally composed of cis-acting sequences and
trans-acting factors. These networks control numerous processes,
including transcription, alternative splicing, and translation of mRNA
into protein. Research over the past decade revealed that changes in
trans-acting factors are responsible for most mRNA abundance differences
within species, while changes in cis-regulatory sequences accumulate
between species. In contrast, much less is known regarding how
alternative splicing and mRNA translation regulatory networks evolve. We
used high throughput sequencing of cDNA libraries from multiple
Drosophila species to investigate the evolution of alternative splicing.
Our results suggest that regulation of alternative splicing diverges
more rapidly in non-coding regions than in coding regions, and that
frame-shifting alternative splicing events have more conserved
regulation. We further investigated the contributions of cis- and
trans-acting changes in splicing regulatory networks by comparing
allele-specific splicing in F1 interspecific hybrids. In F1 nuclei, each
allele is subjected to the same set of trans-acting factors. Thus
differences in allele-specific splicing reflect changes in
cis-regulatory element activity. Changes in cis-regulatory elements
contribute more to species-specific differences in intron retention and
alternative splice site usage, while changes in trans-acting factors
contribute more to species-specific exon skipping differences. These
results suggest important differences in the regulatory network
architecture among classes of alternative splicing. We are also studying
the evolution of mRNA translation using allele-specific ribosome
profiling. Our preliminary results suggest that translation regulatory
networks may buffer species-specific mRNA abundance differences in
budding yeast.
10/26/2012
Eric Schadt, Mt. Sinai School of Medicine
Moving towards a better understanding of human disease in the era of big data
Common
human diseases and drug response are complex traits that involve entire
networks of changes at the molecular level driven by genetic and
environmental perturbations. Changes at the molecular level can induce
changes in biochemical processes or broader molecular networks that
affect cell behavior, and changes in cell behavior can affect normal
tissue or whole organ function, eventually leading to pathophysiological
states at the organism level that we associate with disease. While the
vast majority of previous efforts to elucidate disease and drug
response traits have focused on single dimensions of the system,
achieving a more comprehensive view of common human diseases requires
examining living systems in multiple dimensions and at multiple scales.
Studies focused on identifying changes in DNA that correlate with
changes in disease or drug response traits, changes in gene expression
that correlate with disease or drug response traits, or changes in other
molecular traits (e.g., metabolite, methylation status, protein
phosphorylation status, and so on) that correlate with disease or drug
response are fairly routine and have met with great success in many
cases. However, to further our understanding of the complex network of
molecular and cellular changes that impact disease risk, disease
progression, severity, and drug response, we can more formally integrate
these different data dimensions. Here I present an approach for
integrating a diversity of molecular and clinical trait data to uncover
models that predict complex system behavior. By integrating diverse
types of data on a large scale I demonstrate that some forms of common
human diseases like diabetes are most likely the result of perturbations
to specific gene networks that in turn cause changes in the states of
other gene networks both within and between tissues that drive
biological processes associated with disease. These models not only
elucidate primary drivers of disease and drug response but also provide a
context within which to interpret biological function, beyond what
could be achieved by looking at one dimension alone. That some forms of
common human diseases are the result of complex interactions among
networks has significant implications for drug discovery: designing
drugs or drug combinations to impact entire network states rather than
designing drugs that target specific disease associated genes.
9/21/2012
Carl Kingsford, Carnegie Mellon University
Computational Challenges Comprehending Chromosome Conformation Capture Constraints
The
physical shape and arrangement of chromosomes in the cell affects gene
expression, long-range regulation of transcription, and genome evolution
(particularly biasing which rearrangements occur), and it has been
implicated in the development of several types of cancers. New
high-throughput experimental techniques derived from "chromosome
conformation capture" (3C) have produced measurements that hint at the
spatial proximity of regions of the genome as it is arranged in the
cell.
I will describe our work in three directions to make this
3C-like data more confidently useful for correlating structure with
biological function. First, I will describe a new approach called
metric filtering for discarding false-positive proximity measurements
that selects edges to keep based on both their surprising observation
counts and their metric consistency with other selected edges. We show
this technique keeps more information and produces three-dimensional
models that agree better with observations from light microscopy.
Second,
I will discuss an approach based on rigidity theory to decide whether a
3C experiment has generated sufficient constraints to determine a
structure. We find in fact that current experiments provide far more
than enough constraints to determine a non-floppy structure for most of
the genome in several organisms. As a byproduct, we produce a more
practical algorithm for large-scale testing of rigidity.
Finally,
I will discuss improved techniques for finding statistically
significant correlations between genomic features and spatial proximity
that avoid the computationally demanding and error-prone step of
deriving a three-dimensional structure.
Various aspects of this
research were done jointly with Geet Duggal, Hao Wang, Darya Filippova,
Rob Patro, Emre Sefer, Sridhar Hannenhalli (UMD), and Michelle Girvan
(UMD).
9/28/2012
Curtis Huttenhower, Harvard University
Bug bytes: bioinformatics for meta'omics and microbial community analysis
Among
many surprising insights, the genomic revolution has helped us to
realize that we're never alone and, in fact, barely human. For most of
our lives, we share our bodies with some ten times as many microbes as
human cells; these are resident in our gut and on nearly every body
surface, and they are responsible for a tremendous diversity of
metabolic activity, immunomodulation, and intercellular signaling.
These
microbial communities have only recently become well-described using
high-throughput sequencing, requiring analyses that simultaneously apply
techniques from genomics, "big data" mining, and molecular
epidemiology. I will discuss emerging end-to-end bioinformatics
approaches for metagenomics and metatranscriptomics, including handling
of sequence data for mixed microbial communities, its reconstruction
into metabolic pathways, and biomarker discovery in disease. In
particular, computational processing is key in identifying unique
markers for microbial taxonomy and phylogeny, and in identifying genes and
pathways significantly disrupted in inflammatory conditions such as
Crohn's disease and ulcerative colitis.
7/20/2012
Christoph Wuelfing, UT Southwestern Medical Center
Spatiotemporal organization of lymphocyte signaling systems as a regulator of function
The
subcellular organization of the T cell signaling system, similar to
that of many other cell types, is highly diverse in time and space.
Using systems-scale imaging of T cell signaling, we analyze such
organization to gain unique insight into T cell function with an emphasis
on T cell actin regulation.
5/8/2012
Tom Bartol, Salk Institute for Biological Studies
How to Build a Synapse from Molecules, Membranes, and Monte Carlo Methods
4/30/2012
Mark Gerstein, Yale University
Analysis of Molecular Networks
My
talk will be concerned with the analysis of networks and the use of networks
as a "next-generation annotation" for interpreting personal genomes. I
will initially describe current approaches to genome annotation in terms
of one-dimensional browser tracks. Then I will describe various aspects
of networks. In particular, I will touch on the following topics: (1) I
will show how analyzing the structure of the regulatory network
indicates that it has a hierarchical layout with the "middle-managers"
acting as information-flow bottlenecks and with more "influential" TFs
on top. (2) I will show that most human variation occurs at the
periphery of the network. (3) I will compare the topology and variation
of the regulatory network to the call graph of a computer operating
system, showing that they have different patterns of variation. (4) I
will talk about web-based tools for the analysis of networks (TopNet and
tYNA).
http://networks.gersteinlab.org
http://tyna.gersteinlab.org
Comparing
genomes to computer operating systems in terms of the topology and
evolution of their regulatory control networks. KK Yan, G Fang, N
Bhardwaj, RP Alexander, M Gerstein (2010). Proc Natl Acad Sci U S A
107:9186-91.
Analysis of diverse regulatory networks in a
hierarchical context shows consistent tendencies for collaboration in
the middle levels. N Bhardwaj, KK Yan, MB Gerstein (2010). Proc Natl
Acad Sci U S A 107:6841-6.
Positive selection at the protein
network periphery: evaluation in terms of structural constraints and
cellular context. PM Kim, JO Korbel, MB Gerstein (2007). Proc Natl Acad
Sci U S A 104:20274-9.
The tYNA platform for comparative
interactomics: a web tool for managing, comparing and mining multiple
networks. KY Yip, H Yu, PM Kim, M Schultz, M Gerstein (2006).
Bioinformatics 22:2968-70.
4/26/2012
Pavel Sumazin, Columbia Medical Center
RNA regulatory networks help propagate the effects of genetic alterations
Biomedical
researchers profile DNA and chromatin of large patient cohorts in an
attempt to identify common alterations that drive pathology and can
point to diagnostic and therapeutic biomarkers. Increasingly, however,
it is clear that genetic and epigenetic alterations can regulate
pathology combinatorially, and that different combinations of
alterations may generate the same phenotype. To make full use of
molecular profiling data, we need to understand how alterations affect
cellular programs.
I will describe two new types of
computationally predicted post-transcriptional regulatory networks.
Computational and experimental evidence suggest that interactions in
these networks may alter the expression of known drivers of high-grade
glioma. I will describe regulators of microRNA activity, which modify
the activity of microRNAs without necessarily altering their expression.
These regulators may channel the effects of genomic deletions to
distally downregulate established tumor suppressors. Conversely,
post-transcriptional regulators of microRNA biogenesis alter the
expression of known drivers of gliomagenesis by regulating the abundance
of the microRNAs that target them. Alterations to these regulators lead
to widespread changes to the expression of microRNAs that target known
drivers of glioma.
Taken together, our results suggest that
post-transcriptional regulation in the cell is both extensive and
complex. We present evidence that genetic and epigenetic alterations may
be amplified and propagated by post-transcriptional interactions to
affect both disease initiation and outcome. Our work provides some of
the building blocks necessary for reverse engineering integrated
regulatory networks that will help identify driver alterations and
explain their effects on cellular programs and pathology.
Bio:
Pavel Sumazin is a research scientist at Columbia Medical Center. He
graduated from Stony Brook University with a PhD in computer science
with a focus on design and analysis of algorithms. He taught computer
science theory at Portland State University, was an NSF fellow in human
genetics at Cold Spring Harbor Laboratory, and served as Associate
Director for bioinformatics at Columbia University’s Genome Center.
4/19/2012
Frank DiMaio, University of Washington
Protein structure determination with sparse and noisy data
Determining
the structure of a protein, which involves finding the
three-dimensional placement of each of a protein's thousands of atoms, is
an important problem in biochemistry, providing key insights into
mechanisms as well as targets for drug design. However, many proteins
of biomedical importance elude traditional structure determination
methods. For these proteins, sparse data -- either experimental or
knowledge-based -- may provide structural information, though not enough
to uniquely determine a solution. The Rosetta structure prediction
methodology uses an energy-based approach to explore physically feasible
protein conformations. By combining this energy function with sparse
data, I can quickly infer high-accuracy protein models. I will describe
the effectiveness of this approach using data from four different
sources. First, I will show how we may use cryo-electron microscopy
density data, which provides a very coarse envelope function describing
the protein shape, to infer models that accurately recapitulate
high-resolution details. I will describe how a similar approach may be
used to solve difficult molecular replacement problems. Here, sparse
data is confounded with significant noise; nonetheless, my approach led
to the solution of thirteen protein structures, previously unsolved in
the hands of expert crystallographers. Similarly, using only
low-resolution crystallographic data, my approach recapitulates
high-resolution details that are not captured by current refinement
methods. Finally, I will describe recent breakthroughs I have made in
homology modeling, where the source of data is not from experiment, but
instead from previously solved protein structures. I will additionally
show how these methods are broadly applicable using both experimental
and statistical sources of data, with implications for both protein
structure determination and design.
4/17/2012
Zhengqing Ouyang, Stanford University
Statistical modeling of next generation sequencing data for global gene regulation
Unraveling
the global regulation of gene expression is essential for understanding
embryonic development and human diseases. Gene expression is regulated
at multiple levels, including transcription, RNA processing, and
translation. At each level, regulators such as transcription factors,
RNAs, and RNA-binding proteins form complex regulatory networks.
Recent advances in high-throughput technologies, including next
generation sequencing, provide unprecedented opportunities to profile
multiple levels of gene regulatory information. In this talk, I will
describe statistical methods for integrating next generation sequencing
data to discover the principles of global gene regulation. At the
transcriptional regulation level, a joint model of ChIP-Seq and RNA-Seq
will be introduced. The model effectively quantifies transcription
factor regulatory strength, reveals combinatorial regulation, and
accurately predicts genome-wide expression levels of genes. At the
post-transcriptional level, an integrative approach is proposed to
reconstruct RNA secondary structures at the genome-scale from deep
sequencing data. I will demonstrate the advantages of our approach and
the widespread impact of RNA secondary structure on gene regulation.
3/30/2012
Jing Li, Case Western Reserve University
Rare variant discovery and calling by sequencing pooled samples with overlaps
For
many complex traits/diseases, it is believed that rare variants account
for the missing heritability that cannot be explained by common
variants. Sequencing a large number of samples through DNA pooling is a
cost-effective strategy to discover rare variants and to investigate
their associations with phenotypes. Overlapping pool designs provide
further benefit because such approaches can potentially identify variant
carriers. However, existing algorithms for analyzing sequence data from
overlapping pools are limited. We propose a complete data analysis
framework for overlapping pool designs, with novelties in all three
major steps: variant pool and variant locus identification, variant
allele frequency estimation and variant sample decoding. The framework
can be utilized in combination with any design matrix. We have
investigated its performance based on two different overlapping designs,
and have compared it with two state-of-the-art methods, by simulating
targeted sequencing. Results show that our algorithm improves
significantly on existing ones.
3/29/2012
Ron Dror, D.E. Shaw Research
How drugs bind and control their targets: characterizing GPCR signaling using Anton, a special-purpose supercomputer for molecular dynamics simulations
Roughly one-third of all drugs act by binding to
G-protein-coupled receptors (GPCRs) and either triggering or preventing
receptor activation, but the process by which they do so has proven
difficult to determine using either experimental or computational
approaches. We recently completed a special-purpose machine, named
Anton, that uses a combination of novel algorithms and
application-specific hardware to accelerate molecular dynamics
simulations by orders of magnitude, enabling all-atom protein
simulations as long as a millisecond (Science 330:341-6, 2010). Anton
has made possible simulations in which drugs spontaneously associate
with GPCRs to achieve bound conformations that match crystal structures
almost perfectly (PNAS 108:13118-23, 2011; Nature 482:552-6, 2012).
Simulations on Anton have also captured transitions of a GPCR between
its active and inactive states, allowing us to characterize the
mechanism of receptor activation (Nature 469:236-40, 2011; PNAS
108:18684-9, 2011). Our results, together with complementary
experimental data, suggest opportunities for the design of drugs that
achieve greater specificity and control receptor signaling more
precisely.
3/28/2012
Hannah Carter, Johns Hopkins University
Identifying driver missense mutations in tumor sequencing data
Large-scale
sequencing of cancer genomes is uncovering thousands of DNA
alterations, but the functional relevance of the majority of these
mutations to tumorigenesis is unknown. Identifying which of these
mutations contribute to cancer is critical for understanding tumor
biology, and for finding new diagnostic biomarkers and therapeutic
targets. We have developed a computational method, called
Cancer-specific High-throughput Annotation of Somatic Mutations (CHASM),
to identify and prioritize the missense mutations most likely to
generate functional changes in proteins that enhance tumor cell
proliferation. CHASM uses a supervised machine learning technique called
a random forest and more than 80 quantitative features describing amino
acid changes to predict candidate driver mutations. The method has high
sensitivity and specificity when discriminating between known driver
missense mutations and randomly generated missense mutations, and
performs well relative to other computational methods applied to this
problem. CHASM has been applied to over 15 tumor sequencing studies to
prioritize missense mutations for further study and initial results are
promising; however, further experimental validation is needed to confirm
CHASM predictions.
3/26/2012
Jianyang (Michael) Zeng, Duke University
Automated Nuclear Magnetic Resonance Assignment and Protein Structure Determination
High-throughput
protein structure determination based on solution nuclear magnetic
resonance (NMR) spectroscopy plays an important role in structural
genomics. Unfortunately, current NMR structure determination is still
limited by the lengthy time required to process and analyze the
experimental data. In this talk, I will describe our recent success
stories about the applications of computational techniques in addressing
several bottlenecks in NMR structure determination. First, I will talk
about a novel high-resolution structure determination algorithm that
starts with a global fold calculated from the exact and analytic
solutions to the residual dipolar coupling (RDC) equations. Our
high-resolution structure determination protocol has been applied to
solve the NMR structures of the FF Domain 2 of human transcription
elongation factor CA150 (RNA polymerase II C-terminal domain interacting
protein), which have been deposited into the Protein Data Bank (PDB ID:
2KIQ). Second, I will present a Bayesian approach to determine protein
side-chain rotamer conformations by integrating the likelihood function
derived from unassigned NOE data, with prior information (i.e.,
empirical molecular mechanics energies) about the protein structures.
Third, I will describe an automated side-chain resonance assignment
algorithm that does not require any explicit through-bond experiment to
facilitate side-chain resonance assignment. All our algorithms have been
tested on real NMR data. The promising results demonstrate that our
algorithms can be successfully applied to high-quality protein structure
determination. Since our algorithms reduce the time required in NMR
assignment, they can accelerate the protein structure determination
process.
3/22/2012
Roger Pique-Regi, University of Chicago
Understanding the impact of genetic variation on molecular mechanisms of transcriptional regulation
My
research focuses on developing novel computational methods to identify
regulatory sequences, and to model the molecular mechanisms of gene
transcription control. The mapping of expression quantitative trait loci
(eQTLs) has emerged as an important tool for linking genetic variation
to changes in gene regulation. However, it remains difficult to identify
the causal variants underlying eQTLs, and little is known about the
regulatory mechanisms by which they act. We used DNase I sequencing to
measure chromatin accessibility in 70 Yoruba lymphoblastoid cell lines,
for which genome-wide genotypes and estimates of gene expression levels
are also available. We obtained a total of 2.7 billion uniquely mapped
DNase I-sequencing (DNase-seq) reads, which allowed us to infer
transcription factor binding exploiting the specific DNase I cleavage
footprint left on 827,000 sites corresponding to more than 100 factors.
Across individuals, we identified 8,902 locations at which the DNase-seq
read depth correlated significantly with genotype at a nearby locus
(FDR = 10%). We call such genetic variants 'DNase I sensitivity
quantitative trait loci' (dsQTLs). We found that dsQTLs are strongly
enriched within inferred transcription factor binding sites and are
frequently associated with allele-specific changes in transcription
factor binding. A substantial number of dsQTLs are also associated with
variation in the expression levels of nearby genes. Our observations
indicate that dsQTLs are highly abundant in the human genome and are
likely to be important contributors to phenotypic variation.
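The abstract reports its dsQTL calls at FDR = 10% but does not say which procedure was used. As an illustration only, the sketch below assumes the standard Benjamini-Hochberg step-up procedure; the function name and the toy p-values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values, fdr=0.10):
    """Return the indices of tests called significant at the given FDR.

    Assumption: Benjamini-Hochberg step-up procedure. The abstract does
    not state which FDR method the authors actually used.
    """
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank k with p_(k) <= fdr * k / m.
        if p_values[idx] <= fdr * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Hypothetical p-values, one per (DNase-seq window, nearby SNP) test.
toy_p = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
significant = benjamini_hochberg(toy_p, fdr=0.10)
```

In this toy input the first four tests pass the step-up threshold, so they would be the ones reported as candidate dsQTLs.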
3/8/2012
Carl Kingsford, University of Maryland
Computational Challenges in Reconstructing Evolutionary Histories
I will discuss our recent efforts to reveal important evolutionary events in two biological systems.
First,
I will describe our work identifying reassortments, or mixing of
genomic segments, in the influenza virus. Reassortment is the main
process by which new pandemic strains arise and was the driving force
behind the recent “swine flu” outbreak. We have developed an algorithm
and software program called GIRAF that finds reassortment events among
large collections of influenza genomes. GIRAF is the first fully
automated computational approach to this problem, and it is based on the
first quadratic-delay algorithm for enumerating high-weight maximal
bicliques in bipartite graphs. It allows us to quickly scan thousands
of influenza genomes for reassortments. Using our algorithm, we have
discovered many novel reassortment events in collections of human,
avian, and swine influenza strains.
Second, I will present our recent
work on reconstructing ancient biological networks. We have developed
several methods for recovering interactions between molecules that were
present in ancestral species, starting with only the present-day
networks that we are able to measure. We have shown that many
properties of the evolution of extinct networks can be inferred using
our approaches and that ancestral interactions can be inferred with high
accuracy.
Various parts of this work were done jointly with Niranjan
Nagarajan, Saket Navlakha, Rob Patro, Guillaume Marçais, Justin Malin,
and Emre Sefer.
3/2/2012
Meera Sitharam, University of Florida
EASAL: Entropy computation for Assembly Configuration spaces via Stratified Convex Parametrizations
Differences
between the geometries of molecular assembly versus folding
configuration spaces are illuminated by a new theory of convex
configuration spaces developed by the speaker's group. While assembly
configurations of molecular complexes of up to 7 rigid monomers are
already high dimensional and entropically challenging, they are far more
tractable to explore, search and analyze than folding configuration
spaces. This is because: (a) the assembly configuration space topology
can be decomposed directly into a standard Thom-Whitney complex of
active constraint regions, including boundaries of varying dimensions;
(b) (the key point) these active constraint regions can be charted with
convex parameterizations. We refer to the precisely roadmapped union of
these charts as the atlas of the configuration space.
EASAL is the
software implementation of various efficient algorithms with proven
guarantees for atlasing and related search problems for such small
molecular assemblies.
Atlasing the configuration spaces and assembly
pathways of larger molecular assemblies is effected by recursive
decomposition into, and recombination of, smaller molecular subassemblies
(that can be atlased using EASAL), making active use of the symmetry often
present in larger assemblies.
We have recently used EASAL (a) to correctly
predict crucial interactions for the assembly of a T=1 viral shell of
AAV4 (confirmed by mutagenesis experiments in the McKenna lab at UF) and
(b) to illuminate features and configurational entropy of a helix
packing configuration space that cause standard Metropolis Monte Carlo
sampling to be non-stochastic (helix and Monte Carlo trajectory data from
the lab of Maria Kurnikova, a computational chemist at CMU).
2/24/2012
Kevin White, University of Chicago
Integrating Genomic Networks to Identify Biomarkers and Drug Targets
Systems
level approaches to construct abstract molecular networks can lead to
predictions about genetic and biochemical functions in cells, organisms
and in disease states. We have used an integrated experimental and
computational approach to construct large-scale functional networks in
both model organisms and human cancer cells. Our network models are
based on a combination of gene expression, transcription factor DNA
binding site mapping, automated literature mining and protein-protein
interaction mapping. We provide a strategy for reducing the
dimensionality of the massive networks that result from such integrated
whole genome analyses. I will present examples from both Drosophila and
human breast cancer cell lines that illustrate how one can translate
systems biology-driven findings in model systems to useful tools for
diagnosing human diseases. I will also discuss our use of large scale
genome sequence data in the context of systems approaches to developing
prognostic signatures for breast cancer, and the use of cloud computing
to manage and mine 'omics data.
1/27/2012
Ernest Fraenkel, Massachusetts Inst. of Technology
Integrating 'Omic' Data to Reveal Disease Mechanisms
Proteomic
technologies, next-generation sequencing and RNAi screens are providing
increasingly detailed descriptions of the molecular changes that occur
in diseases. However, it is difficult to assemble these data into a
coherent picture that could lead to new therapeutic insights for several
reasons. Despite their power, each of these methods still only captures
a small fraction of the cellular response. Moreover, when different
assays are applied to the same problem, they often provide apparently
conflicting answers. We have developed powerful new approaches to
integrate these data to identify small, functionally coherent pathways
that underlie cellular behavior. In this talk, I will discuss recent
unpublished work from my laboratory showing that these methods suggest
novel therapeutic strategies for glioblastoma multiforme.
10/14/2011
Daphne Koller, Stanford University
Twelfth Morris H. DeGroot Memorial Lecture
10/7/2011
Wendy Cornell, Merck
Comparison of 2D, 3D, and QSAR Methods for Virtual Screening
Using a set of 47 protein targets from the MDDR, we assess the performance of 2D similarity, 3D similarity, and QSAR methods at identifying active compounds for each target when starting with some number (1, 5, 10, 20, or 40) of actives. Two 2D similarity methods are tested - Toposim, which uses Dice similarity, and Lassi, which uses latent semantic structural indexing. Three QSAR methods are included - random forest, trendvector, and support vector machine (SVM). Each 2D similarity and QSAR method is used in combination with different descriptor sets, including atom pairs (AP), topological torsions (TT), binding property torsions (DT), extended connectivity fingerprints (ECFP4), and MACCS. We assess retrieval rates for single compounds as well as clusters. Among the descriptor sets, ECFP4 performed consistently the best. Although Toposim and Lassi found different hits, their retrieval rates for individual compounds were surprisingly similar. Among the QSAR methods, random forest and trendvector outperformed SVM. Combinations of methods are also explored to maximize both lead hopping and retrieval of close neighbors.
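The abstract names Dice similarity as the measure behind Toposim. The following is a minimal sketch of similarity-based retrieval with the Dice coefficient, assuming each compound is represented as a set of descriptor strings. The descriptor strings, compound names, and helper functions are hypothetical; real atom-pair descriptors are often count-based, which this set version ignores.

```python
def dice_similarity(features_a, features_b):
    """Dice coefficient 2|A & B| / (|A| + |B|) on two feature sets."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty descriptor sets
    return 2.0 * len(a & b) / (len(a) + len(b))


def rank_by_similarity(query, library):
    """Rank library compounds by Dice similarity to one known active."""
    scored = [(name, dice_similarity(query, feats))
              for name, feats in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical descriptor sets standing in for atom-pair features.
query = {"C-C:2", "C-N:3", "C-O:1", "N-O:4"}
library = {
    "cmpd_1": {"C-C:2", "C-N:3", "C-O:1", "N-O:4"},
    "cmpd_2": {"C-C:2", "C-N:3", "S-O:2"},
    "cmpd_3": {"P-O:1"},
}
ranking = rank_by_similarity(query, library)
```

Starting from several actives instead of one would need a rule for combining scores (for example, taking the maximum over the actives); the abstract does not say which rule was used.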
8/29/2011
Wei Wu, University of Pittsburgh
Reverse Engineering Dynamic Gene Networks Underlying Breast Cancer Cell Lineages and Yeast Cell Cycles
Estimating
gene regulatory networks over biological lineages or time series is
central to a deeper understanding of how cells evolve during development
and differentiation. One challenge in estimating such evolving networks
is that their host cells are not only contiguously evolving, but also
can branch over time. For example, a biologist may apply several
different drugs to a malignant cancer cell to analyze the changes each
drug has produced in the treated cells. Cells treated with one drug are
not directly related to cells treated with another drug, but rather to
the malignant cancer cells that they were derived from. Underlying these
intriguing dynamic systems, one expects that the interactions between
genes are not always constant over time, but rather they are often
transient; in other words, gene-gene interactions that occur during one
time interval may disappear and then reappear later. This
challenging behavior renders existing network inference methods
inapplicable.
We proposed two novel approaches, Treegl and
TV-DBN, which build on the L1 plus time-dependent penalized graphical
logistic regression to effectively estimate multiple evolving gene
networks corresponding to cell types related by a tree-genealogy, or
cell stages related by an evolving chain, based on only a few samples
from each condition. Our methods take advantage of the similarity
between related networks along the biological lineage, while at the same
time exposing sharp differences between the networks. We explore
applications to the analysis of breast cancer development and yeast cell
cycle regulation. Based on only a few microarray measurements, our
algorithms are able to produce biologically valid results that provide
insight into the progression and reversion of breast cancer, and
transient interactions among genes in the yeast cell cycle.
5/5/2011
Ioannis Tsamardinos, Vanderbilt University
Towards Integrative Causal Analysis of Heterogeneous Datasets and Prior Knowledge
Modern
data analysis methods, for the most part, concern the analysis of a
single dataset. The conclusions of an analysis are published in the
scientific literature and their synthesis is left up to a human expert.
Integrative Causal Analysis (INCA) aims at automating this process as
much as possible. It is a new, causal-based paradigm for inducing models
in the context of prior knowledge and by co-analyzing heterogeneous
datasets in terms of measured variables, experimental conditions, or
sampling methodologies. INCA is related to, but is fundamentally
different from, statistical meta-analysis, multi-task learning, and
transfer learning.
In this talk, we illustrate the enabling INCA
ideas, present INCA algorithms, and give proof-of-concept empirical
results. Among others, we show that the algorithms are able to predict
the existence of conditional and unconditional dependencies
(correlations), as well as the strength of the dependence, between two
variables Y and Z never measured on the same samples, solely based on
prior studies (datasets) measuring either Y or Z, but not both. The
algorithms accurately predict thousands of dependencies in a wide range
of domains, demonstrating the universality of the INCA idea. The novel
inferences are entailed by assumptions inspired by causal and graphical
modeling theories, such as the Faithfulness Condition. The results
provide ample evidence that these assumptions often hold in many real
systems. The long term goal of INCA is to enable the automated
large-scale integration of available data and knowledge to construct
causal models involving a significant part of human concepts.
4/22/2011
Li-San Wang, University of Pennsylvania
Gene expression in aging and aging-associated disorders
Aging
is a highly complex phenomenon that affects virtually all aspects of
biology. In medicine, age is a primary risk factor for cancer,
neurodegeneration, and many other diseases. Thus, understanding how
aging proceeds and contributes to these diseases is key to finding
causes and means of intervention. This presentation will cover some of
our work towards understanding the connection between aging and
age-associated diseases, by investigating gene expression through
bioinformatic means.
The first half of my talk focuses on
G-quadruplexes (Gquads). Gquads are genomic motifs consisting of four
runs of guanines that can form highly stable 3D structures in vivo and
have high occurrence in telomeres. Analysis of yeast and human genomic
distributions suggests that Gquads are associated with differentially
expressed genes in a yeast senescence model and in human fibroblasts from
patients with Werner syndrome, a genetic disorder that exhibits
premature aging phenotypes.
The second half of my talk concerns gene
expression changes in human brain aging and Alzheimer's disease. We
developed algorithms that can estimate the age of an individual using
gene expression profiles. Using these algorithms, we found that brains
with Alzheimer's disease or frontotemporal dementia show trends of
accelerated aging in gene expression change.
4/08/2011
Tamer Kahveci, University of Florida
Computational strategies for understanding how biological networks function.
Biological networks of an organism show how different bio-chemical
entities, such as enzymes or genes, interact with each other to perform
vital functions for that organism. Each subnetwork within a network can
perform various functions that it cannot do without interacting with
other entities in the network. Understanding the functions of the entire
networks as well as the individual subnetworks has been a prime goal
for explaining how the organisms work.
Dr. Kahveci's lab is focusing
on developing computational methods that will help in understanding the
functions of large scale biological networks. This talk will focus on
comparative analysis of biological networks. This topic will be
considered in two parts. The first part will focus on comparative
analysis of a pair of networks. This part will constitute the majority
of the talk. The second part will discuss scalability issues for
performing this analysis on a large database of networks. The first part
will guide step by step starting from a simplified model to a more
realistic model. The first step will limit comparison to pairs of
entities of networks and explain how we can compare networks when the
biological process is explained through different types of biological
entities. The second step will eliminate this limitation and describe a
computational approach when the same biological process can be performed
in a different number of steps. The last step will challenge the existing
definition of similarity and introduce a new measure, functional
similarity, that explains the function in terms of the steady states of
the biological networks, and describe how we can compute the steady
states for large regulatory networks. The second part of the talk will
discuss a probabilistic strategy for finding highly similar networks to a
query network in a database that contains a large number of networks.
3/10/2011
Gerald Quon, University of Toronto
De-mixing heterogeneous gene expression profiles into their constituent components, and applications to personalized medicine
One
of the primary goals of gene expression profiling experiments is to
identify key genes and pathways associated with a particular condition
or disease. However, biological samples are often composed of multiple
distinct cell populations, of which only a few are of interest. We
have developed ISOLATE, a computational model for separating
heterogeneous mixtures of cell populations into their individual
components, given only the expression profiles of heterogeneous samples
and some of the homogeneous populations. We demonstrate the accuracy
and value of computational purification in three problem domains:
identifying prognostic signatures for cancer, linking changes in gene
expression to patient outcome in juvenile arthritis, and monitoring cell
population dynamics in hematopoietic stem cell systems.
3/9/2011
Xin He, University of California, San Francisco
Understanding genetics of complex diseases through systems biology and regulatory genomics
Genome-wide
association studies (GWAS) have identified many candidate loci for a
number of complex traits. In most cases, however, there is little
functional evidence of these loci and the mechanisms of their influence
on complex traits are not clear. A very promising strategy is to link
genotypes and phenotypes through molecular level traits, such as gene
expression level. The first part of my talk will be focused on a new
strategy we developed recently to incorporate expression QTL (eQTL) data
in the analysis of GWAS. We developed a Bayesian statistical method
that integrates the information of the SNPs underlying a gene expression
trait, with appropriate weighting, to test if the expression of this
gene contributes to the complex disease of interest. In particular, our
statistical test allows us to exploit information in a large number of
weak SNPs, which are often ignored but represent a collectively
important part of the genetics of any complex trait.
To ultimately
understand how genotypic variations influence phenotypes, we need a
detailed understanding of how DNA sequences encode their immediate
molecular functions. The second part of my talk will be focused on the
study of regulatory sequences, which harbor a large fraction of SNPs
discovered by GWAS and are believed to be important for many complex
diseases. To recognize these sequences in a genome, we developed a
comparative genomic method based on the assumption that functional
transcription factor binding sites (TFBSs) tend to be conserved across
species. The method utilizes a probabilistic model of regulatory
sequence evolution that captures substitutions, insertions/deletions,
and selection or turnover of TFBSs. A more difficult question of
regulatory sequences is to understand how these sequences generate
spatial-temporal expression patterns. For this purpose, we developed a
quantitative model based on statistical thermodynamics theory and an
efficient dynamic programming algorithm. This model incorporates a
number of features of regulatory sequences, including the importance of
weak TFBSs and cooperative interactions among TF molecules, among other
things. We demonstrated the predictive power of our model, and by
applying it to an early developmental system in Drosophila, we were able
to gain understanding of the quantitative rules of gene regulation.
2/25/2011
Richard H. Lathrop, University of California, Irvine
Predicting Positive p53 Cancer Rescue Regions Using Most Informative Positive (MIP) Active Learning
Many
protein engineering problems involve finding mutations that produce
proteins with a particular function. Most Informative Positive (MIP)
active learning is tailored to biological problems because it seeks
novel and informative positive results. We applied MIP to discover
mutations in the tumor suppressor protein p53 that reactivate mutated
p53 found in human cancers. MIP found Positive (cancer rescue) p53
mutants in silico using 33% fewer experiments than traditional non-MIP
active learning. MIP was used to select a Positive Region predicted to
be enriched for p53 cancer rescue mutants. In vivo assays showed that
the predicted Positive Region: (1) had significantly more (p<0.01)
new strong cancer rescue mutants than control regions (Negative, and
non-MIP active learning); (2) had slightly more new strong cancer rescue
mutants than an Expert region selected by a human expert for purely
biological considerations; and (3) rescued for the first time the
previously unrescuable p53 cancer mutant P152L.
2/16/2011
Anne E. Carpenter, Broad Inst. of Harvard and MIT
Extracting quantitative information from biological images to tackle world health problems
Microscopy images contain rich information about
The
biological systems being tested in high-throughput experiments are
becoming increasingly physiologically relevant. For example,
co-cultures of two particular cell types can better replicate certain
tissue and organ systems and preserve normal cellular functions such as
liver and hematopoiesis. Whole organisms like C. elegans and zebrafish
can be screened for complex phenomena like behavior, infection, and
metabolism. These more complex systems present new challenges in image
analysis.
We are also exploring the potential of extracting patterns
of morphological perturbations (“signatures”) from cell images in order
to identify the similarities between various chemical or genetic
treatments, in experiments to identify distinctions between human
isoforms of cancer-relevant proteins, mechanisms of hepatotoxicity, and
diagnostics for bipolar disorder and schizophrenia.
The methods we
develop are freely available through the biologist-friendly open-source
software, CellProfiler, for both small- and large-scale experiments.
2/4/2011
Jinbo Xu, Toyota Technological Institute at Chicago
Probabilistic Graphical Model for Protein Structure Prediction
If we know the primary sequence of a protein, can we predict its
three-dimensional structure by computational methods? This is one of the
most important and difficult problems in computational molecular
biology and has tremendous implications for protein functional study and
drug discovery.
Existing computational methods for protein structure
prediction can be broadly classified into two categories:
template-based modeling (i.e., protein threading/homology modeling) and
template-free modeling (i.e., ab initio folding). Template-based
modeling predicts structure of a protein using experimental structures
in the Protein Data Bank (PDB) as templates while template-free modeling
predicts protein structure without depending on a template.
This
talk will present new probabilistic graphical models for knowledge-based
protein structure prediction. In particular, this talk will present a
regression-tree-based Conditional Random Fields (CRF) method for
template-based modeling and a Conditional Random Fields/Conditional
Neural Fields (CRF/CNF) method for template-free modeling. Experimental
results indicate that our template-based method performs extremely well,
especially on hard template-based modeling targets and our
template-free method is also very promising for mainly-alpha proteins.
1/21/2011
Russell Malmberg, University of Georgia
Computational Searches for Non-Coding RNA; Ecological Genetics of Pitcher Plants
There will be two quite different research topics presented.
(1) The importance of RNAs that do not code for proteins, but that have
functions directly as RNAs, has been recognized over the last 30 years
in a series of dramatic discoveries. Estimates of the numbers of
non-coding RNAs in eukaryotes vary considerably but are plausibly in the
range of 0.5x to 2x the number of protein-coding RNAs. Computational
identification of ncRNA genes in genomes is rendered difficult by the
lack of sequence similarity of many related ncRNAs; however, some ncRNAs
have their structure more conserved than their primary sequence. We
have developed algorithms to search genomes for ncRNA on the basis of
their structure, using conformational graph - tree decomposition methods
to greatly speed up the process. We have studied the nature of
evolutionary variability in RNA secondary structure, and are using these
results to improve the genomic search methods.
(2) Pitcher plants (Sarracenia species) eat insects. They appeal to the
inner 10-year-old in us. Different Sarracenia species have varying
pitcher morphologies and varying means of digesting insects. Some
species actively digest insects by secreting proteases and similar enzymes; other
species support a microbial food-web which digests the insects. We are
analyzing the genetic basis of the differences between the species in
pitcher morphology, insect digestion strategy, and the degree to which
individual plant genotypes can support the associated microbial
community.
1/14/2011
Kris Dahl, Carnegie Mellon University
Computational approaches to determine multiscale structural changes in the nucleus associated with aging
There are numerous premature aging disorders associated with altered nuclear structure. We are primarily interested in Hutchinson-Gilford progeria syndrome (HGPS), which is caused by a mutation in nuclear lamin A. (1) Using integrated experimental and computational studies we examine how the mutation associated with HGPS alters the structure of the protein, even though the mutation is in an inherently disordered region. (2) We have also simulated the structural filament network in the nucleus as a reductionist model to examine the cause of morphological changes in the nucleus associated with HGPS. (3) We examine how the HGPS mutation alters nuclear response to force in situ using computational methods to analyze complex mechanical character from live cell experiments. (4) At the microlevel we also use computational image analysis of a variety of premature aging diseases to understand the role of nuclear morphology in disease progression. In sum, we have used a combination of computation and experiment to examine structural proteins in the nucleus at many length scales and how they impact the etiology of HGPS.
12/3/2010
Klaus Palme, University of Freiburg (Germany)
The magic role of auxin and beyond
Unlike most animal cells, plant cells can easily regenerate new tissues
from cells derived from different tissues. These cells first
dedifferentiate but later can be reprogrammed to form a wide variety of
organs when properly cultured. We investigate the signalling components
and molecular mechanisms that provide plant cells with the property to
regenerate de novo organs. Plant hormones like auxin (indole-3-acetic
acid) play a fundamental role in plant cell proliferation,
differentiation and organ formation. Auxin levels are controlled by
biosynthesis, transport and degradation. Since its first description in
the 19th century, the directional movement of auxin through the plant
has attracted much attention. An overview will
be given on the current status of studies aiming to understand the
physiology of auxin transport and structure-function characterization of
the PIN interactome. Components of PIN
11/19/2010
John Shon, Director of Disease and Translational Information, Hoffmann-La Roche Pharmaceuticals
Drugs to Glide from Research to the Bedside: Opportunities for Software and IT in the Life Sciences
Ever
wonder what opportunities exist in Life Science IT and software? The
companies require the exchange of critical information that is rich and
complex. The drug development process, for example, is greatly enhanced
when valuable "nuggets" are passed between professionals that are
focused on the start versus the end of the drug development process.
This has not been easy to accomplish. There are a number of such applications
that will help accelerate and enable the creation and production of
better therapeutics. Dr. Shon will present his vision for
software-enabled solutions that solve the multiple challenges facing large
Pharma.
10/22/2010
Michael Gilson, University of California, San Diego
Modeling molecular recognition: Free energy, entropy and mechanical stress
Better
computer models of molecular recognition are needed to speed the design
of new therapeutics and host-guest systems with a range of
applications. I will discuss concepts and software we are developing for
these purposes, as well as some unexpected insights into changes in
entropy and mechanical stress on binding that have emerged from this
work. In particular, changes in configurational entropy on binding
appear to be as quantitatively important as changes in more commonly
recognized free energy contributions, such as hydrogen bonding, and I
will discuss recent developments in the characterization of entropy
changes through the mutual information expansion of the entropy. In
addition, we have begun to explore the application of ideas of
mechanical stress at the molecular level as a potential basis for
understanding the long-ranged transmission of information and other
molecular mechanisms.
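For orientation, the mutual information expansion mentioned above is commonly written as a series of entropy terms of increasing order. The block below is a sketch of its second-order truncation only; the symbols S_1, S_2 and I_2 are notation chosen here for illustration, and the abstract does not state at which order the expansion is truncated in the speaker's work.

```latex
% Second-order truncation of the mutual information expansion (MIE)
% for the entropy of m coordinates x_1, ..., x_m.
% S_1: marginal entropy of one coordinate; S_2: joint entropy of a pair.
\begin{align}
  S(x_1,\dots,x_m) &\approx \sum_{i=1}^{m} S_1(x_i)
                     \;-\; \sum_{i<j} I_2(x_i, x_j), \\
  I_2(x_i, x_j)    &= S_1(x_i) + S_1(x_j) - S_2(x_i, x_j).
\end{align}
```

The pairwise terms I_2 are non-negative, so the second-order estimate is never larger than the sum of the marginal entropies.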
9/17/2010
Jean-Christophe Olivo-Marin, Institut Pasteur
Quantitative biological imaging: from cells to numbers.
This talk will present specific methods and algorithms for 2- and
3-D+t image sequences in biological microscopy and their use in the
study of host-pathogen interactions. Our goal is to automate the
quantification and analysis of dynamic parameters or the
characterization of phenotypic and morphological changes occurring as a
consequence of the interaction between microbes and
9/16/2010
Chris Bakal, Institute of Cancer Research
Signaling Networks that Regulate Morphological Noise and Promote Exploratory Behavior
Cell shape is not encoded by genomes. Rather, genes encode the
signaling networks that allow cells to explore morphological space
through random variations in cell shape, which we term morphological
noise. Stochastic and deterministic amplification of these small
variations in shape can lead to the phenotypic diversity necessary for
cells to adapt to unpredictable fluctuations in cellular environment.
Morphological noise thus creates an ensemble of cell shapes that somatic
variation can act upon, which can ultimately be stabilized via genetic
evolution. To provide insight into how signaling networks regulate the
exploration of shape space we perform quantitative measurements of
single cell morphology in the context of genome-scale RNAi screens. I
will discuss the identification of noise-enhancing local networks that
act to regulate diverse cellular processes whose inhibition leads to
canalized phenotypes and facilitate stochastic exploratory behavior.
Furthermore, we have identified a number of other genes that act as
morphological noise suppressors. Through computational integration of
noise signatures with orthogonal datasets we derive a dynamic model that
describes the information flow at a systems level.
4/15/2010
Paul Boutros, Ontario Institute for Cancer Research
Prognostic Markers for Non-Small Cell Lung Cancer
Lung
cancer is a disease with dismal prognosis; only 15% of newly-diagnosed
patients survive for five years. Our understanding of how to diagnose,
stage, and treat it is based largely on macroscopic or cellular
phenomena. A molecular understanding of the disease may provide improved
clinical management and new therapeutic options.
My group focuses on
predicting the survival of lung cancer patients. In particular, we
develop algorithms to exploit microarray datasets to develop biomarkers
of survival, called prognostic markers. In this talk I will describe
three recent results: an algorithm, a database, and an empirical
finding.
First, I describe a new feature selection algorithm, called
modified steepest descent (mSD). This algorithm couples gradient-descent
with unsupervised machine-learning. Through greedy forward-selection it
generates a six-gene prognostic marker for lung cancer that is
validated in over 500 patient samples.
Second, I describe a
meta-analytic database that compiles the data from nine transcriptomic
studies of lung cancer. These studies were integrated using a novel
normalization approach and then subjected to meta-analysis. For each gene
present in the analysis (16,391 in total), the univariate prognostic
capacity was calculated. I show that this database increases our
statistical power sufficiently to allow separate analysis of different
histological subtypes of lung cancer.
Third, I describe an analysis
of biomarker plurality. From an empirical study of biomarker-space we
found that the number of effective markers is very large. The
inter-relationship amongst these markers contains information about
gene-gene interactions, and may provide an avenue for understanding the
specific pathways dysregulated in lung cancer.
Lung cancer incidence
remains high and survival remains low. The development of prognostic
markers may improve this situation by allowing personalized therapy. The
computational approaches described here may be applicable beyond this
one disease, and may provide insight into the types of methodologies
that will work well for other problem-domains.
4/2/2010
Mona Singh, Princeton University
Predicting and Analyzing Cellular Networks
Proteins
accomplish virtually all of their cellular functions via interactions
with other molecules. As a result, a broad array of computational
methods have been developed to predict protein interactions, whether
with DNA, other proteins, or small molecules. In combination with
high-throughput experimental technologies, we are now able
to build large scale biological networks across the evolutionary
spectrum. Global analyses of these networks provide new opportunities
for revealing protein functions and pathways and for uncovering cellular
organization principles.
In my talk I will discuss computational
approaches that my group has developed for the complementary problems of
predicting interactions and analyzing interaction networks. In the
first part of the talk, I will describe sequence and structure
approaches for predicting sites in protein sequences that interact with
small molecules. In the second part of my talk I will discuss and
describe algorithms for analyzing protein function and functional
modules, and will present a framework for explicitly incorporating known
attributes of individual proteins into the analysis of biological
networks, thereby allowing us to discover recurring network patterns
underlying a range of biological processes.
3/26/2010
Nancy Amato, Texas A&M University
Using Motion Planning to Study Molecular Motions
Protein motions,
ranging from molecular flexibility to large-scale conformational
change, play an essential role in many biochemical processes. For
example, some devastating diseases such as Alzheimer's and bovine
spongiform encephalopathy (Mad Cow) are associated with the misfolding
of proteins. Despite the explosion in our knowledge of structural and
functional data, our understanding of protein movement is still very
limited because it is difficult to measure experimentally and
computationally expensive to simulate.
In this talk we describe a method we have developed for modeling protein
motions that is based on probabilistic roadmap methods (PRM) for motion
planning. Our technique yields an approximate map of a protein's
potential energy landscape and can be used to generate transitional
motions of a protein to the native state from unstructured conformations
or between specified conformations. We describe a method based on
rigidity theory that allows us to sample conformation space more
efficiently than our initial sampling strategy and enables us to study a
broader range of motions for larger proteins, and new analysis tools
that enable us to extract kinetics information, such as folding rates.
For example, we show how our map-based tools for modeling and analyzing
folding landscapes can capture subtle folding differences between
protein G and its mutants, NuG1 and NuG2. In recent work, we have
applied our techniques to identify and study the folding core. More
information regarding our work, including an archive of protein motions
generated with our technique, is available from our protein folding
server: http://parasol.tamu.edu/foldingserver/
3/2/2010
Seyoung Kim, Carnegie Mellon University
Understanding the Genetic Basis of Complex Diseases via Genome-Phenome Association
Genome-wide association studies have recently become popular as a
2/25/2010
Alexander Schoenhuth, University of California-Berkeley
Classifying cancer tissue by inferring systemic markers
It has
recently been shown that protein-protein interaction (PPI) subnetworks
which exhibit synergistic differential gene expression in tumorigenic
phenotypes are more accurate than single gene markers when it comes to
classifying such phenotypes. Here we compute markers as connected
subnetworks in confidence-scored PPI networks which achieve high overall
confidence scores and are dysregulated in a sufficient number of
patients. We do this by employing a novel, exhaustive search technique
which, for the first time, renders the inherent search problem on
weighted-edge networks tractable. We compute p-values for the resulting
subnetworks and use the most significant candidates for classification
purposes. Thereby we obtain sets of systemic markers which are superior
in terms of gene ontology (GO) term enrichment. As a result, we
outperform all prior approaches when classifying colon cancer versus
healthy tissue.
2/23/2010
Can Alkan, University of Washington and Howard Hughes Medical Institute
Discovery and Characterization of Copy-Number Variants with Next-Gen Sequencing Technologies
Structural variation, in the broadest
sense, is defined as the genomic changes among individuals that are not
single nucleotide variants. These include insertions, deletions,
duplications, inversions and translocations that were demonstrated to be
common and ubiquitous among individuals. Copy-number variants (CNVs)
have been associated, in both causative and protective roles, with a
variety of conditions such as schizophrenia, mental retardation, and HIV
susceptibility/resistance. However, CNVs, especially duplicated
regions, have remained largely intractable due to difficulties in
accurately resolving their structure, copy number and sequence content
using hybridization-based methods. Consequently, a significant fraction
of the duplicated genomic content has not been assayed by standard
genetic and molecular analyses.
The realization of new ultra-high-throughput sequencing platforms such
as Roche/454, Illumina/Solexa and ABI/SOLiD now makes it feasible to
detect the full spectrum of genomic variation among many individual
genomes, including cancer patients and others suffering from diseases of
genomic origin. Recently I have developed a set of computational
methods to comprehensively detect and characterize structural variation
and segmental duplications using next-gen sequencing.
My algorithms are based on two different approaches: (i) read-depth
analysis to characterize segmental duplications and predict absolute
copy numbers (mrFAST), and (ii) read-pair analysis to discover
structural variation including inversions (VariationHunter). I applied
my algorithms to detect structural variation and segmental duplications
in genomes sequenced by Illumina and 454 technologies. I initially
examine the genomes of three humans and experimentally validate
copy-number differences in the organization of these genomes, and then
describe the application of my methods to the genomes of >160 individuals
sequenced as part of the 1000 Genomes Project.
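As a rough illustration of the read-depth idea in approach (i), the sketch below converts per-window read depth into an integer copy-number estimate by comparing it with the depth of control regions assumed to be diploid. This is not mrFAST or the speaker's method: the function name `estimate_copy_numbers`, its parameters, and the sample numbers are hypothetical, and real pipelines also correct for effects such as GC bias and mappability that are ignored here.

```python
from statistics import mean


def estimate_copy_numbers(window_depths, control_depths, control_copies=2):
    """Estimate an integer copy number for each genomic window.

    window_depths  -- average read depth observed in each window of interest
    control_depths -- read depths of windows assumed to have `control_copies`
                      copies (hypothetical diploid control regions)
    """
    if not control_depths:
        raise ValueError("at least one control window is required")
    # Expected read depth contributed by a single copy of a region.
    depth_per_copy = mean(control_depths) / control_copies
    if depth_per_copy <= 0:
        raise ValueError("control depth must be positive")
    # Copy number is proportional to depth; round to the nearest integer.
    return [round(depth / depth_per_copy) for depth in window_depths]


# Hypothetical depths: the control windows average 30x for two copies.
copies = estimate_copy_numbers([30, 61, 14, 0], [29, 31, 30])
print(copies)
```

With these made-up depths the four windows come out as two copies, a duplication to four copies, a single-copy deletion, and a homozygous deletion.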
2/12/2010
Tandy Warnow, University of Texas at Austin
Simultaneous Alignment and Phylogenetic Tree Estimation
Molecular sequences evolve under processes that include substitutions,
insertions and deletions (the latter two jointly called "indels"), as well as other
mechanisms (e.g., duplications and rearrangements). The inference of the
evolutionary history of these sequences has thus been performed in two
stages: the first estimates the alignment on the sequences, and the
second estimates the tree given that alignment. While such methods seem
to work well on relatively small datasets, these two-stage approaches
can produce highly incorrect trees and alignments when applied to large
datasets, or ones that evolve with many indels. In this talk, I will
present a new method, SATe, that my lab has been developing that uses
maximum likelihood to estimate the alignment and tree at the same time,
and that can be used to analyze datasets with up to 1000 sequences on a
desktop in 24 hours. Our study, using both real and simulated data,
shows that this method produces much more accurate trees than the
current best methods. Joint work with Kevin Liu, Sindhu Raghavan,
Serita Nelesen, and Randy Linder.
2/9/2010
Cheemeng Tan, Duke University
Emergent bistability in bacteria and implications for effective antibiotic treatment
A synthetic gene circuit is often engineered by considering the host
cell as an invariable “chassis”. Circuit activation, however, may
modulate host physiology, which in turn can drastically impact circuit
behavior. In this talk, I will first discuss the engineering of a simple
circuit consisting of mutant T7 RNA polymerase (T7 RNAP*) that
activates its own expression in the bacterium Escherichia coli (1). Although
activation by the T7 RNAP* is noncooperative, the circuit caused
bistable gene expression. This counterintuitive observation can be
explained by growth retardation caused by circuit activation, which
resulted in nonlinear dilution of T7 RNAP* in individual bacteria.
Predictions made by models accounting for such effects were verified by
further experimental measurements. Our results reveal a novel mechanism
of generating bistability and underscore the need to account for host
physiology modulation when engineering gene circuits.
Interestingly, bistability can also arise from interactions between
bacterial physiology and antibiotics. We find that certain antibiotics,
when applied at moderate concentrations, can cause ‘phenotypic
bifurcation’ in bacterial growth: for the same concentration of
antibiotic, a bacterial population survives only if its initial density
is sufficiently high. We further show that the phenotypic bifurcation
has profound implications for periodic treatment of bacteria by
antibiotics. In the absence of phenotypic bifurcation, the efficacy of
treatment increases with increasing frequency of antibiotic
administration; otherwise, however, the efficacy of treatment can be
drastically diminished at an intermediate frequency. Our results have
implications for the optimal design of antibiotic treatment.
(1) C. Tan, P. Marguet, and L. You. Emergent bistability by a
growth-modulating positive feedback circuit. Nature Chemical Biology, 5,
842-848, 2009.
Highlighted in “News and Views”: Slow growth leads to a switch, Nature Chemical Biology, 5, 784-785, 2009.
2/5/2010
Junming Yin, Univ. of California, Berkeley
A new statistical model for studying gene conversions
Together with crossover recombination, gene conversion is a major
evolutionary mechanism responsible for shaping observed genetic
variation in a population. Although crossovers and gene conversions have
different effects on the evolutionary history of chromosomes and
therefore leave behind different footprints in the genome, it is a
challenging task to tease apart their relative contributions to the
observed genetic variation. In fact, the methods employed in recent
studies of recombination rate variation in the human genome actually
capture combined effects of crossovers and gene conversions.
Studying gene conversion is very important, for it has been argued that
ignoring gene conversion may cause problems in association studies. By
explicitly incorporating overlapping gene conversion events, we propose a
new statistical model that can jointly estimate the crossover rate, the
gene conversion rate and the mean tract length, which is widely
regarded as a very difficult problem. Our simulated results show that
modeling overlapping gene conversions is crucial for improving the
accuracy of the joint estimation of the aforementioned three fundamental
parameters. Our analysis of real data from the telomere of the X
chromosome of Drosophila melanogaster suggests that the ratio of the
gene conversion rate to the crossover rate for the region may not be
nearly as high as previously claimed.
Joint work with Michael I. Jordan and Yun S. Song.
2/4/2010
Marcel Schulz, Max Planck Institute for Molecular Genetics
From RNA-Seq to Ontology Graphs: Application of probabilistic models
In the first part of my talk I am going to present methods that deal
with the inference of alternative splicing events from high-throughput
sequencing of mRNAs (RNA-Seq) data. Starting from millions of paired-end
RNA-Seq reads, we attempt to reconstruct
The second part of the talk will be about a new statistical method for
semantic similarity searches in Ontology Graphs. The method is different
from previous approaches because it incorporates the probability of
random similarity scores and assigns p-values to them. An efficient
algorithm has been developed that allows exact p-values to be computed.
The use of the new method is illustrated with the Phenomizer webserver
that assists medical geneticists in the differential diagnostic process
using features of the Human Phenotype Ontology annotated to OMIM
diseases.
1/29/2010
Quaid Morris, University of Toronto
Predicting the targets of mRNA-binding proteins
RNA-binding domains are among the most common domains in eukaryotic
genomes and RNA-binding proteins (RBPs) play critical roles in
post-transcriptional regulation (PTR) of gene expression by regulating
mRNA processing, mRNA translation, mRNA export and mRNA stability.
However, despite their importance, little is known about how RBPs
identify their
As a first step towards building quantitative models of PTR, we are
mapping out mRNA and RBP interactions using a combined biochemical and
computational strategy. Our strategy is based on a microarray-based
assay, called RNAcompete, that measures the binding affinity of a
recombinant RBP for hundreds of thousands of short RNA sequences.
These sequences are designed to comprehensively query the space of possible binding preferences. We use a new RNA motif finding
algorithm, RNAcontext, to infer sequence and structural binding
preferences of RBPs from the RNAcompete data. However, using these
motif models to find RBP binding sites on mRNAs requires estimating mRNA
secondary structure computationally. Some of our recent work suggests
that estimating this structure is easier than expected.
1/28/2010
Hsiao-Mei Lu, Univ. of Illinois at Chicago
Dynamics of Biological Systems: Allosteric Signal Transmission and Epigenetic Circuits
The dynamics of biological networks is critically important in
conducting cellular functions. It is often a challenging task to study
the dynamics of networks due to the size and complexity. Based on our
successful work in characterizing protein folding dynamics in such a
large conformational space through a long time evolution, the same
method is proposed to study dynamics and time evolution of allosteric
signal transmission and epigenetic circuits.
Large macromolecular assemblies are often important for biological
processes in the cell. Allosteric communications between different parts of
these molecular machines play critical roles in cellular signaling.
Although studies of the topology and fluctuation dynamics of
coarse-grained residue networks can yield important insight, they do not
provide characterization of time-dependent dynamic behavior of these
macromolecular assemblies. Here we develop a novel approach called
Perturbation-based Markovian Transmission (PMT) model to globally study
the dynamic responses of the macromolecular assemblies. By monitoring
simultaneous responses of all residues (>8,000) across many (>6)
decades of time span from the initial perturbation until reaching, we
show that this approach can yield rich information. With criteria based
on quantitative measurements of relaxation half-time, flow amplitude
change, and oscillation dynamics, this approach can identify pivot
residues that are important for macromolecular movement, messenger
residues that are key to signal mediating, and anchor residues important
for binding interactions. Based on a detailed analysis of the
GroEL-GroES chaperone system, we found that our predictions have an
accuracy of 71-84% judged by independent experimental studies reported
in the literature. I propose that this computational method can detect
allosteric signal transmission pathways, characterize the roles of
functionally important residues, and make novel predictions about the
importance of additional amino acid residues previously uncharacterized,
which can be further tested in experimental studies. This approach is
general and can be applied to other large macromolecular machineries
such as virus capsids and ribosomal complexes.
Models based on the chemical master equation can describe the
interactions involved in biomolecular networks accurately. An epigenetic
circuit of phage lambda switch in E. coli cells is modeled by the
chemical master equation with full stochasticity. Based on the
successfully developed model, the specific cooperative binding of CI
dimer to OR1 and OR2 is found to be the only crucial one to maintain a
stable and robust phage lambda switch. The explicit computational study
of the mutations of the binding for CI dimer and Cro dimer to OR3 shows
that Cro dimer is necessary for efficient phage lambda induction. The
DNA looping, double positive and negative regulations, and other
biochemical mutations will be studied. Algorithms are also proposed to
solve larger systems efficiently.
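For readers unfamiliar with it, the chemical master equation referred to above has the following standard general form. The notation (state vector x, propensities a_j, stoichiometric vectors \nu_j, M reactions) is chosen here for illustration; the specific reactions and rates of the phage lambda model are not given in the abstract.

```latex
% Chemical master equation for a system with M reactions.
% P(x, t): probability that the copy-number state is x at time t
% a_j(x):  propensity of reaction j in state x
% \nu_j:   change in state caused by one firing of reaction j
\begin{equation}
  \frac{\partial P(\mathbf{x}, t)}{\partial t}
  = \sum_{j=1}^{M} \Big[
      a_j(\mathbf{x} - \boldsymbol{\nu}_j)\, P(\mathbf{x} - \boldsymbol{\nu}_j, t)
      - a_j(\mathbf{x})\, P(\mathbf{x}, t)
    \Big]
\end{equation}
```

The first term collects probability flowing into state x from neighboring states, and the second term collects probability flowing out of x.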
Date: 10/20/2009
Speaker: Hagit Shatkay, Queen's University
Title: Life by the Book: Pragmatically Using Text in Large Scale -Omics.
Abstract: The genomic era, in which we have lived since the sequencing of the
human genome, is characterized by tremendous amounts of biomedical data,
accompanied by a significant increase in the number of related
scientific publications.
Much biomedical knowledge is hidden within the abundant literature. The
ability to rapidly and effectively survey the literature can support
numerous applications, including multiple stages in the design and the
interpretation of large-scale experiments.
A variety of methods are being applied to the biomedical literature in
an attempt to meet these goals, mostly through careful mining of text
for gene/protein names and interactions, using natural language
processing methods. However, the idea of general “biomedical text
mining” remains elusive.
Rather than view biomedical text mining as one monolithic (and not very well defined) task, we attend to specific biological goals that may benefit from the use of text. The talk will focus on several biological applications/problems involving text, and discuss some non-traditional, coarse-grained methods that we use to address them.
A Human Protein Atlas
Abstract: Information on protein localization and expression on tissue,
cell and organelle level is important to map and characterize the human
proteome as well as to better understand cellular functions of proteins
and to find biomarkers. In the Human Protein Atlas program the human
proteome is systematically analyzed using an antibody-based approach. By
generation and thorough validation of antibodies, protein localization
and expression in human tissues and cells can be analyzed using
immunohistochemistry and fluorescence confocal microscopy. The results
are publicly available in the Human Protein Atlas web portal
(www.proteinatlas.org) that currently contains results from the use of
more than 8,800 validated antibodies corresponding to one third of all
human genes. The portal contains more than 7 million high-resolution
images, each of which has been manually annotated and curated by a certified
pathologist or a cell biologist to provide a knowledge base for
functional studies and to allow searches and queries about protein
profiles in normal and disease tissue as well as on a cell and
subcellular level. Advanced queries can be performed, including searches
for chromosome location, protein class and/or tissue specificity
(including the 20 most common forms of human cancer), facilitating for
instance biomarker discovery. Our results suggest that it should be
possible to extend the protein atlas to cover the majority of all human
proteins thus providing a valuable
Date: 3/2/09
Speaker: Nicholas Buchler, Rockefeller University
Title: Bait and switch: How protein sequestration generates a flexible ultrasensitive response
Abstract: Regulatory networks in cells exhibit important dynamical
behaviors, such as bistability (e.g. epigenetic switch) and oscillation
(e.g. clocks, cell cycle). Ultrasensitive or 'all-or-none' gene
expression is a necessary feature for the emergence of such dynamics in
gene networks. In biology, many regulatory molecules are sequestered by
an inhibitor into an inactive complex. Using an experimental approach in
budding yeast, I will demonstrate how protein sequestration generates
tunable, all-or-none thresholds in gene expression. A simple
quantitative model for this genetic network shows that both the
threshold and the degree of ultrasensitivity depend upon the abundance
of the inhibitor, exactly as observed experimentally. The abundance of
the inhibitor can be altered by simple mutation; thus ultrasensitive
responses mediated by protein sequestration are easily tunable. Gene
duplication of regulatory homodimers and loss-of-function mutations can
create dominant-negatives that sequester
Date: 2/19/09
Speaker: Andrew Grimson, Massachusetts Inst. of
Title: Animal microRNAs: their ancient origin and contemporary targets
Abstract: Hundreds of microRNAs (miRNAs) collectively regulate a
substantial fraction of the animal transcriptome. Because virtually all
aspects of biology are likely impinged upon by miRNAs, the
identification of the mRNAs targeted by each miRNA remains a fundamental
question. Specific ~7 nt recognition sequences, located primarily in 3'
UTRs, are important for
The scale of transcriptome regulation by miRNAs together with the
extent of miRNA conservation between bilaterians (e.g., humans, flies,
and worms) is evidence for the importance of miRNA biology during animal
evolution. In addition to miRNAs, other bilaterian small RNAs, known as
Piwi-interacting RNAs (piRNAs), protect the genome from transposons.
Neither miRNAs nor piRNAs were known to exist in the simplest,
pre-bilaterian, animal phyla, raising the question of whether a rich
small-RNA biology is characteristic of more complex animals, or whether
these small RNAs might have emerged earlier in metazoan evolution. To
gain
Date: 2/16/09
Speaker: Eric Deeds, Harvard Medical School
Title: Dynamic individuality in protein-protein interaction networks
Abstract: Protein-protein interactions play a crucial role in all
cellular processes, from the regulation of gene expression to the
transduction and processing of extracellular signals. Over the past
decade, high-throughput techniques such as Yeast 2-Hybrid (Y2H) and
Tandem Affinity Purification (TAP-tagging) have provided a global
picture of what the entire protein-protein interaction (PPI) network in
certain organisms might look like. While these methods are often quite
noisy (with potentially high rates of false positives and false
negatives), they have nonetheless served as the substrate for a large
body of work aimed at characterizing or explaining the general
topological structure of these networks. Such purely topological studies
are limited, however, by the fact that they consider a static
description of an inherently dynamical system. A full characterization
and understanding of the behavior of PPI networks clearly requires that
one be able to describe and understand the dynamics of hundreds to
thousands of objects physically interacting with one another. In this
work we employ recently developed rule-based modeling techniques to
perform the first large-scale stochastic simulations of the PPI network
found in the cytoplasm of yeast cells. These simulations reveal that
cells prepared in identical initial conditions will, at steady state,
differ considerably from one another in terms of the identities of the
large protein complexes found in each. Our results indicate that such
dynamic individuality may arise in many complex interaction and
signaling networks.
Date: 2/6/09
Speaker: Su-In Lee, Carnegie Mellon University
Title: Individual Genetic Variation and Gene Regulation: From Networks to Mechanisms
Abstract: Gene expression data of genetically diverse individuals (eQTL data) provide a unique
We apply Lirnet to eQTL data in yeast, mouse and human (Phase II HapMap
data), and provide statistical and biological results demonstrating that
Lirnet produces significantly better regulatory programs than other
recent approaches. We demonstrate in the yeast data that Lirnet can
correctly suggest a specific causal sequence variation within a large,
linked chromosomal region. In yeast, Lirnet uncovered a novel,
experimentally validated connection between Puf3, a sequence-specific
RNA binding protein, and P-bodies, cytoplasmic structures that regulate
translation and RNA stability, as well as the particular causative
polymorphism, a SNP in Mkt1, that induces the variation in the pathway.
Date: 1/27/09
Speaker: Derek Ruths, Rice University
Title: Execution Strategies for Executable Biological Models
Abstract: Progress in advancing our understanding of biological systems
is limited by their sheer complexity, the cost of laboratory materials
and equipment, and limitations of current laboratory
This work is done in collaboration with Luay Nakhleh (Rice University) and Prahlad T. Ram (MD Anderson Cancer Center).
Date: 1/15/09
Speaker: Phil Hyoun Lee, Queen's University
Title: Selecting single nucleotide polymorphisms for effective genetic association study
Abstract: Genetic variation analysis holds much promise as a basis
for understanding disease-gene association. In particular, single
nucleotide polymorphisms (SNPs) are at the forefront of such studies, as
they are the most common form of DNA variation on the genome. However,
due to the tremendous number of candidate SNPs, there is a clear need to
expedite genotyping and analysis by selecting and considering only a
subset of all SNPs.
In this talk, I will present three machine learning applications that
successfully address the problem of SNP selection and improve on the current
state of the art. The first
Date: 1/13/09
Speaker: Xin Gao, University of Waterloo
Title: Zero in on the fully automated NMR protein structure determination
Abstract: High-throughput structural genomics requires parallelizable
technologies for high-resolution protein structure determination.
Nuclear Magnetic Resonance (NMR) would be such a
Date: 11/10/08
Speaker: William
Title: Machine learning analysis of shotgun proteomics data
Abstract: Mass spectrometry has become the most widely used
Date: 3/27/08
Speaker: Gad Kimmel, University of California, Berkeley
Title: Computational Problems in Human Genetics
Abstract: The question of how genetic variation and personal health are
linked is one of the compelling puzzles facing scientists today. The
ultimate goal is to exploit human variability to find genetic causes for
multi-factorial diseases such as cancer and coronary heart disease.
Recent
Date: 3/26/08
Speaker: Itamar Simon, Hebrew University
Title: A high resolution map of mouse genome replication timing suggests a role in gene regulation
Abstract: Although it is known that genomes are divided into distinct
replication time zones, a more detailed understanding of their
organization is limited. Taking advantage of a novel synchronization
method and of genomic DNA microarrays we have mapped replication times
of the entire mouse genome at a high temporal resolution. The
measurement results have allowed us to assign distinct replication times
to 91% of the genome, define asynchronously replicating regions and
identify very large replicons. Analysis of the association between
replication and transcriptional features has revealed a correlation
between replication and transcription potential as well as evolutionary
conservation of replication timing. Finally, analysis of large
replicons, and in particular of regions at which the time of replication
differs from the time of replication of a distant origin, reveals that
transcription is correlated with the actual time of replication and not
with the time of origin activation. Overall, these findings suggest that
early replication plays a causal role in potentiating gene
transcription.
Date: 3/17/08
Speaker: Olivier Elemento, Princeton University
Title: Decoding the regulatory genome
Abstract: Deciphering the non-coding regulatory genome has proved a
formidable challenge. Despite the wealth of available gene expression
data, there currently exists no broadly applicable method for
characterizing the regulatory elements that shape the rich underlying
dynamics. I will present a general framework for detecting such
regulatory DNA and RNA motifs that relies on directly assessing the
mutual information between sequence and gene expression measurements.
Our approach makes minimal assumptions about the background sequence
model and the mechanisms by which elements affect gene expression. This
provides a versatile motif discovery framework, across all data types
and genomes, with exceptional sensitivity and near-zero false-positive
rates. Applications from yeast to human uncover novel putative and
established transcription-factor binding and miRNA
Date: 3/10/08
Speaker: Philip Kim, Yale University
Title: Jumping scales: How 3D structures and molecular genetics meet in protein networks
Abstract: Protein interaction networks form the central layer of a
systems-level description of the cell. While most studies of protein
networks operate on a high level of abstraction, neglecting structural
and chemical aspects of each interaction, I will describe our approach
of characterizing interactions by using atomic-resolution information
from three-dimensional protein structures. We find that some previously
recognized relationships between network topology and genomic features
(e.g., hubs tending to be essential proteins) are actually more
reflective of a structural quantity, the number of distinct binding
interfaces. Subdividing hubs with respect to this quantity provides
insight into their evolutionary rate and indicates that additional
mechanisms of network growth are active in evolution.
Furthermore, I will provide an overview of a major international collaborative effort that aims to resolve interactions involved in signaling pathways. These tend to involve intrinsically disordered regions and are hence complementary to the structured interactions studied by the above approach. Our approach combines modern experimental screening techniques with a novel integrated analysis pipeline. The screens measure binding specificities with hitherto unachievable accuracy, and the analysis pipeline maximizes prediction accuracy by integrating a variety of genomic and proteomic features.
Lastly, I will present a study that examined the relationship between
genetic signatures of adaptive evolution and proteomic properties, such
as the location of sites in protein networks and structures. Due to
recent advances in genotyping and sequencing
Date: 3/3/08
Speaker: Han Liang,
Title: System Structures and MicroRNA regulation in humans: a view of systems biology
Abstract: MicroRNAs are ~22nt non-coding RNAs that can
post-transcriptionally repress the expression of many protein-coding
genes in higher eukaryotes. Recently available functional genomic data
enables us to examine the regulatory role of microRNAs at the system
level. Integrating human protein-protein interaction and microRNA
targeting data, I found a global correlation between protein
connectivity and microRNA regulation complexity in the corresponding
genes, and that microRNA regulation likely coordinates the behavior of
interacting partners. To understand the evolution of microRNA-mediated
regulation in humans, I evaluated the role of three types of nucleotide
variation on microRNA targeting: variation between species, variation
within populations and epigenetic variation. While purifying selection
appears to be a driving force maintaining the stability of microRNA
regulation at the system level, a small number of variants may have
significant functional effects. In particular, I found an appreciable
level of polymorphism at microRNA
Date: 2/28/08
Speaker: Ge Yang, Scripps Research Institute
Title: Metaphase spindle architecture and molecular motor coordination revealed by model driven computer vision
Abstract: The development of biology over the past half century makes
it possible to identify the complete set of genes and proteins of an
organism. A fundamental challenge remains, however, to understand the
complex dynamics of and interactions between the many individual
molecular components involved in situ and in space and time. Of
particular importance in addressing this challenge is to understand how
force and motion are generated, transmitted, and controlled within
dynamic cellular structures during basic cellular processes. In this
presentation, I will focus on addressing this question in two such
processes: cell division and intracellular transport. First,
single-fluorophore imaging and biochemical perturbation are used to
investigate the architecture of the metaphase microtubule cytoskeleton in
cell division. This assay provides a model system to understand how
cytoskeletal filament networks are dynamically organized to transmit
force and to directly generate force. Second, fluorescence imaging and
genetic manipulation are used to probe the interaction between molecular
motors in the axonal transport machinery of neurons. This assay
provides a sufficiently reduced yet extremely powerful model system to
understand the interactions between molecular motors of the same and
opposite polarities in force and motion generation. Shared by both
studies is the use of computer vision techniques, driven by mechanistic
models, to extract high-resolution quantitative measurements of the
complex spatial-temporal dynamics visualized by powerful fluorescence
live cell imaging techniques. These studies reveal some fundamental and
exquisite connections between force and motion generation and the
dynamic organization of the cytoskeleton in cellular life.
Date: 2/25/08
Speaker: Kevin Chen, New York University
Title: Macro- and micro-evolution of gene regulation mediated by microRNAs
Abstract: Studying the evolution of cis-regulatory elements is important
for three general reasons. First, mutations in these elements can cause
phenotypes of medical importance; second, understanding cis-element
evolution will help us design algorithms for predicting these elements;
third, regulatory evolution is important for understanding phenotypic
evolution. In this talk, I will focus on a class of cis-elements called
"microRNA sites". MicroRNAs are small, noncoding RNAs that
post-transcriptionally regulate their
I will discuss the evolution of animal microRNA sites at two different time scales. At the macro-evolutionary time scale, we show that while the microRNA genes are well-conserved, overall their targets have diverged rapidly. However, there exists a core of deeply-conserved regulatory relationships that may be an important component of animal developmental networks. At the micro-evolutionary time scale, we use human SNP genotype data to demonstrate significant selective constraint on microRNA sites, implying that polymorphisms in these sites are candidates for causal variants of human disease. Our approach also applies to human-specific microRNA sites and we use it to identify a set of these sites in genes co-expressed with the microRNA.
Date: 2/11/08
Speaker: James Taylor, New York University
Title: Making sense of genome-scale data
Abstract: High-throughput data production technologies are
revolutionizing modern biology. Translating this experimental data into
discoveries of relevance to human health relies on sophisticated
computational tools that can handle large-scale data (e.g. multiple
genome alignments of dozens of species or billion-genotype genome-wide
association studies).
This talk will first discuss a specific large-scale data analysis problem: using comparative genomics to identify and understand functional genomic regions, particularly cis-regulatory elements. Using data generated by the ENCODE project we will demonstrate the power of genome comparisons to distinguish these elements from neutral DNA and the importance of looking for more than just signs of strong evolutionary constraint. We will then describe a machine learning approach that goes beyond sequence conservation and attempts to capture broader and more informative sequence and evolutionary patterns that better distinguish different classes of elements. This approach, denoted ESPERR, uses training examples to learn encodings of multispecies alignments into reduced forms tailored for the prediction of chosen classes of functional elements. ESPERR has proven successful for a variety of classification problems. In particular, the "Regulatory Potential Score" produced using ESPERR has been used to identify putative regulatory elements with high rates of experimental validation.
Second, we will consider the more general problem of making
sophisticated computational methods more available to experimental
biologists. Many powerful analysis tools exist or are currently being
developed, along with many excellent data warehouses and browsers.
However, for the average experimental biologist with limited computer
expertise, making effective use of these tools and data sources is still
out of reach because many existing tools do not have easy-to-use
interfaces, and different tools and data sources are not well
integrated. We have developed a framework and application, called
Galaxy, that solves this problem by providing an integrated web-based
workspace that bridges the
Date: 1/16/08
Speaker: Insuk Lee, University of Texas at Austin
Title: Network biology approaches to study complex traits
Abstract: The relationship between genotype and phenotype is a central
issue in genetics, and approaches are needed that allow us to interpret
the increasing collection of data on genotypic variation in terms of the
effect on organismal phenotypes. Our understanding of these
relationships came historically from forward-genetics approaches, which
have proved remarkably powerful, but which are still difficult in
complex animals, and the complete definition of pathways from
forward-genetic data alone is hard. In contrast, reverse-genetics
approaches allow unbiased tests across entire genomes for associations
with traits of interest, e.g., by using systematic genome-wide knock-out
or silencing. However, reverse-genetics is in general labor intensive
and time consuming, requiring enormous numbers of assays in order to
span a large number of genes in combination with multiple experimental
conditions. Ideally, we would like to be able to choose which genes to

