MBI Logo
MBI Logo

CTW: Recent Advances in Statistical Inference for Mathematical Biology: Titles & Abstracts

Inference in Mixed-Effects (and other) Models Through Profiling the Objective
Douglas Bates, Department of Statistics, University of Wisconsin - Madison

The use of Markov-chain Monte Carlo methods for Bayesian inference has increased awareness of the need to view the posterior distribution of the parameter (in the Bayesian sense) or the distribution of the parameter estimator for those who prefer non-Bayesian techniques. I will concentrate on non-Bayesian inference although the techniques can also be applied to the posterior density in Bayesian methods. For many statistical models, including linear and generalized linear mixed-effects models, parameter estimates are defined as the optimizer of an objective function, e.g. the MLE's maximize the log-likelihood, and inference is based upon the location of the optimizer and local approximation at the optimizer, without assessing the validity of the approximation. This made sense when fitting a single model may have involved many days waiting for the answers from shared computer systems. It doesn't make sense when models can be fit in a few seconds. By repeatedly fitting a model subject to holding a particular parameter fixed we can build up a profile of the objective with respect to the parameter and use the information to produce profile based confidence intervals. But perhaps the most important aspect of the technique is graphical presentation of the results that force us to consider the behavior of the estimator beyond the estimate, which can cast doubt on many of the principles of inference and simulation that we hold dear.

Linking systems biology models to data: a stochastic kinetic model of p53 oscillations
Richard Boys, Mathematics & Statistics, Newcastle University

This talk considers the assessment and refinement of a dynamic stochastic process model of the cellular response to DNA damage. The proposed model is a complex nonlinear continuous time latent stochastic process. It is compared to time course data on the levels of two key proteins involved in this response, captured at the level of individual cells in a human cancer cell line. The primary goal of is to "calibrate" the model by finding parameters of the model (kinetic rate constants) that are most consistent with the experimental data. Significant amounts of prior information are available for the model parameters. It is therefore most natural to consider a Bayesian analysis of the problem, using sophisticated MCMC methods to overcome the formidable computational challenges.

Estimation of ordinary differential equations with orthogonal conditions
Nicolas Brunel, Statistique et Genome lab, Federation de Mathematique d'Evry, Universite d'Evry & ENSIIE

Parameter inference of ordinary differential equations from noisy data can be seen as a nonlinear regression problem, within a parametric setting. The use of a classical statistical method such as Nonlinear Least Squares (NLS) gives rise to difficult and heavy optimization problems due to the corresponding badly posed inverse problem. Gradient Matching algorithms use a smooth (nonparametric) estimation of the solution from which is derived a nonparametric estimate of the derivative, and gives rise to a natural criterion easier than NLS to optimize. We introduce here a new class of criteria based on a weak formulation of the ODE. The estimator derived can be viewed as a generalized moment estimators which possesses nice statistical and computational properties. Finally, we consider several examples which illustrate the efficiency and the versatility of the proposed method.

Verification of a biophysical surface protein patternation model: MCMC analysis of dual fluorescence data
Nigel J. Burroughs, WSBC, Warwick, UK

In a number of biological systems a complex spatio-temporal orchestration of protein relocation is observed in cell:cell contact interfaces, predominantly a separation of small receptors from large ones, i.e. there is a segregation by size. This patterned structure is called the immunological synapse, occurring predominantly between cells of the immune system. Such structures can be realised in model experimental systems in 2D, or visualised in 3D z-stacks. A segregation mechanism that explains patternation by a minimisation of the free energy using a model of bond stretching has been extensively examined theoretically within a number of modelling frameworks; all these models agree that such a mechanism is able to explain the observed patterns within certain parameter regimes. However proving that the system is in the parameter regime conducive to separation has not been achieved. We use Bayesian inference to fit a model of fluorescence intensity to single cell data, extracting key parameters of the patternation- specifically estimating local free energies for proteins. By quantifying degrees of protein exclusion we determine the degree to which the patterns are consistent with the size dependent thermodynamic model of segregation.

Riemannian Manifolds and Statistical Models: The use of Differential Geometry in Probabilistic Modelling
Ben Calderhead, CoMPLEX (Centre for Mathematics and Physics in the Life Sciences and Experimental Biology), University College London

As mathematical models of biological systems increase in size and complexity, so too does the need for sophisticated statistical methodology that can consistently evaluate and update the evidence in favour of each model, given the limited and highly variable experimentaldata that is often available. A probabilistic approach based on Bayesian statistics offers a mathematically consistent way of characterising this uncertainty, both within the parameters and the models themselves.

There are many computational challenges associated with a Bayesian approach to model ranking and inference. The procedure is equivalent to evaluating high-dimensional integrals generally involving highly nonlinear functions. In more than three or four dimensions deterministic approaches are no longer feasible and we may resort to stochastic integration using simulation-based Monte Carlo techniques. The main challenge often lies in drawing samples from analytically intractable posterior probabilitydistributions, which exhibit strong nonlinear correlation structures, high dimensionality and non-identifiability of parameters.

In this talk I will discuss how Markov chain Monte Carlo methodology based on the natural differential geometric structure of the parameter space of a statistical model may alleviate many of the simulation issues by making proposals based on local sensitivity information. Along the way, I shall highlight the deep link between the sensitivity analysis of such statistical models and the underlying Riemannian geometry of the induced posterior probability distributions. I shall demonstrate how this methodology may be used to rank two differential equation models describing circadian rhythms in the plant Arabidopsis thaliana.

Incremental Mixture Important Sampling with Distributionally Relaxed Optimization and Application to Differential Equation Models
Dave Campbell, Simon Fraser University, Department of Statistics and Actuarial Science

Parameters from dynamic system models often end up having posterior distributions where the all but negligible probability mass is well described with curved contours in the parameter space. When using a random walk, Markov Chain Monte Carlo methods are slow to mix, owing to Gibbs samplers or Metropolis Hastings defining linear relationships whereas curved parameter relationships are more conducive to the distributional structure. While adaptively optimal methods exist for random walk sampling from distributions along curved distributions, sampling is sequential and consequently computationally slow and unable to benefit from parallel computing algorithms. In this work we present an importance Sampling Algorithm that builds a posterior approximation based on a mixture of weighted Gaussians.

Initially a large sample is drawn from the prior. Likelihood weights are estimated, and the more importantly weighted values are used as starting points for an optimization step. The results of which are used to produce a local Gaussian approximation to the posterior. Using a weighted combination of these Gaussians, an approximate sample from the posterior can be obtained. In this way the algorithm takes advantage of parallel computing unavailable to random walk based algorithms.

In the basic algorithm implementation, the prior distribution must be defined to not only reflect expert opinion but also to facilitate the posterior optimization. However when expert opinion is inconsistent with the data, the basic algorithm is easily trapped in modes of negligible posterior density. To alleviate this, we show how to use a distributional relaxation in the optimization, permitting the prior to be used for the single purpose of summarizing expert opinion, even when it is inconsistent with the data without negatively impacting the performance.

This is demonstrated using examples with differential equation model systems.

Using Bayesian and MCMC approaches for parameter estimation and model evaluation in physiology
Carson Chow, Mathematics, University of Pittsburgh and Laboratory of Biological Modeling, NIDDK, NIH

Differential equations are often used to model biological and physiological systems. An important and difficult problem is how to estimate parameters and decide which model among possible models is the best. I will show in several examples how Bayesian and Markov Chain Monte Carlo approaches provide a self-consistent framework to do both tasks. In particular, Bayesian parameter estimation provides a natural measure of parameter sensitivity and Bayesian model comparison automatically evaluates models by rewarding fit to the data while penalizing the number of parameters.

Bayesian Modeling of Smoking Exposure During Pregnancy
Vanja Dukic, Applied Mathematics, University of Colorado-Boulder

Studies trying to assess effects of prenatal exposure to cigarettes frequently acquire both self-report and biologic assays of maternal smoking. Most common biological assays are those of cotinine, a metabolite of nicotine, from urine or serum. Both of those measures have their own sources of information and bias. Single bioassay measures alone cannot reflect the metabolic mechanism over time, while self-report may have serious recall, topographic, and metabolic biases. In this project we present a Bayesian statistical model for describing in utero smoking exposure based on the combined biological and self-report information. The model takes into account heterogeneity among women and metabolism during pregnancy. The model is applied to the data from East Boston Family Study.

Modeling and inference for gene expression time series data (an overview)
Barbel Finkenstadt, Department of Statistics, University of Warwick

A central challenge in computational modeling of dynamic biological systems is parameter inference from experimental time course measurements. Here we present an overview of the modeling approaches based on stochastic population dynamic models and their approximations. For an application on the mesoscopic scale, we present a two dimensional continuous-time Bayesian hierarchical diffusion model which has the potential to address the different sources of variability that are relevant to the stochastic modelling of transcriptional and translational processes at the molecular level, namely, intrinsic noise due to the stochastic nature of the birth and deaths processes involved in chemical reactions, extrinsic noise arising from the cell-to-cell variation of kinetic parameters associated with these processes and noise associated with the measurement process. Inference is complicated by the fact that only the protein and rarely other molecular species are observed which is typically entailing problems of parameter identification in dynamical systems.

For an application on the macroscopic scale, we introduce a mechanistic 'switch' model for encoding a continuous transcriptional profile of genes over time with the aim of identifying the timing properties of mRNA synthesis which is assumed to switch between periods of transcriptional activity and inactivity, each time leading to the transition of a new steady state, while mRNA degradation is an ongoing linear process. The model is rich enough to capture a wide variety of expression behaviours including periodic genes. Finally, I will also give a brief introduction to some recent work on inferring the periodicity of the expression of circadian and other oscillating genes.

Joint work with: Maria Costa, Dan Woodcock, Dafyd Jenkins, David Rand (all Warwick Systems Biology), Michal Komorowski (Imperial College London).

Targeted CNV genotyping from SNP array data using a Variational Bayes approach
Eleni Giannoulatou, Wellcome Trust Centre for Human Genetics

Human genetic variation can be present in many forms, including Single Nucleotide Polymorphisms (SNPs) and Copy Number Variation (CNVs). Copy number variation is ubiquitous; large segments of DNA, ranging in size from hundreds to thousands of DNA bases, can differ in the number of copies between individuals. Using array technology these variants have be characterised on a large scale resulting in the development of large comprehensive catalogues of known variants. Available SNP genotyping chips often contain probes specifically targeting such CNVs. Using a large population of samples, CNV genotyping can be performed by identifying each individual's copy number level at these known CNV regions. Accurate CNV genotyping across populations is important and can provide insights into phenotypic diversity and disease susceptibility.

In comparison to SNP genotyping, these assays for CNV calling have a lower signal to noise ratio, and robust statistical algorithms for their analysis are not yet widespread. We have developed an analytical framework that through robust model selection, infers the total number of possible copy number states of a targeted region. A novel hierarchical mixture model algorithm is applied in order to accurately call targeted common CNVs across large populations within a variational Bayes framework. The approach utilises the allelic contrast provided by the SNP probes within CNVs, in addition to the total DNA intensity. This integrated SNP/CNV approach outperforms previous methods by providing more accurate calls as well as allele-specific copy number states.

Joint work with Christopher Yau, Chris Spencer, Christopher C. Holmes, and Peter Donnelly

Bayesian inference for generalized stochastic population growth models with application to aphids
Colin Gillespie, Mathematics & Statistics, Newcastle University

In this talk I will analyse the effects of various treatments on cotton aphids Aphis gossypii. The standard analysis of count data on cotton aphids determines parameter values by assuming a deterministic growth model and combines these with the corresponding stochastic model to make predictions on population sizes, depending on treatment. Here, we use an integrated stochastic model to capture the intrinsic stochasticity, of both observed aphid counts and unobserved cumulative population size for all treatment combinations simultaneously.

Unlike previous approaches, this allows us to explicitly explore and more accurately assess treatment interactions. Markov chain Monte Carlo methods and the moment closure technique are used within a Bayesian framework to integrate over uncertainty associated with the unobserved cumulative population size and estimate the result twenty-eight parameters. We restrict attention to data on aphid counts in the Texas High Plains obtained for three different levels of irrigation water, nitrogen fertiliser and block, but note that the methods we develop can be applied to a wide range of problems in population ecology.

Particle MCMC for Stochastic Kinetic Models
Andrew Golightly, School of Mathematics & Statistics, Newcastle University

We consider the problem of performing Bayesian inference for the rate constants governing stochastic kinetic models. As well as considering inference for the resulting Markov jump process (MJP) we consider working with a diffusion approximation obtained by matching the infinitesimal mean and variance of the MJP to the drift and diffusion coefficients of a stochastic differential equation (SDE). We sample from the posterior distribution of the model parameters given observations at discrete times via recently proposed particle MCMC methods. In the case of the diffusion approximation we increase the efficiency of the inference algorithm by exploiting the structure of the SDE. We present results from two toy examples: a Lotka-Volterra system and a simple model of prokaryotic autoregulation.

Dynamical Models of Periodic Processes
John Guckenheimer, Mathematics, Cornell University

Normal forms of dynamical systems can be used as nonlinear models for time series data. This talk describes the use of Floquet theory as a normal form for a stable periodic orbit. Primary goals of this joint research with Shai Revzen are to estimate the spectrum of Lyapunov exponents and to develop reduced order models based upon weakly stable modes of the system. The methods have been applied to animal locomotion, notably data from running cockroaches.

Inference for partially observed stochastic dynamic systems
Ed Ionides, Statistics, University of Michigan

Characteristic features of biological dynamic systems include stochasticity, nonlinearity, measurement error, unobserved variables, unknown system parameters, and even unknown system mechanisms. I will consider the resulting inferential challenges, with particular reference to pathogen/host systems (i.e., disease transmission). I will focus on statistical inference methodology which is based on simulations from a numerical model; such methodology is said to have the plug-and-play property. Plug-and-play methodology frees the modeler from an obligation to work with models for which transition probabilities are analytically tractable. A recent advance in plug-and-play likelihood-based inference for general partially observed Markov process models has been provided by the iterated filtering algorithm. I will discuss the theory and practice of iterated filtering.

Likelihood-based observability analysis and confidence intervals for model predictions
Clemens Kreutz, Freiburger Center for Biological Systems Analysis (ZBSA), Physics Department

Dynamic models of biochemical networks contain unknown parameters like the reaction rates and the initial concentrations of the compounds. The large number of parameters as well as their nonlinear impact on the model responses hampers the determination of confidence regions for parameter estimates. At the same time, classical approaches translating the uncertainty of the parameters into confidence intervals for model predictions are hardly feasible.

We present the so-called prediction profile likelihood which is utilized to generate reliable confidence intervals for model predictions. The prediction confidence intervals of the dynamic states are exploited for a data-based observability analysis. Moreover, a validation profile likelihood is introduced that can be applied when noisy validation experiments are judged.

The presented approaches are also applicable if there are non-identifiable parameters. Such ambiguities yield insufficiently specified model predictions that can be interpreted as nonobservability. The properties and applicability the approach are demonstrated by two examples, a small but instructive ODE model, and a model for the MAP kinase signal transduction pathway.

Work done in collaboration with A. Raue and J. Timmer. This project has been funded by the BMBF grants VirtualLiver 0315766 and FRISYS 0313921.

Errors in variables models: Diagnosing parameter estimability and MCMC convergence using empirical characteristic functions
Subhash R. Lele, Department of Mathematical and Statistical Sciences University of Alberta

Most ecological models are constructed to understand the relationship between environmental variables and an ecological response, be it site occupancy or population abundance or changes to them. The usual regression models take into account the environmental variation in the response but in many cases, the measurement of the environmental variables themselves are made with error. This is called an errors-in-variables model. Measurement error in the covariates leads to substantial issues with parameter estimability and likelihood-based inference is computationally challenging. Bayesian inference using Markov Chain Monte Carlo methods also runs into trouble because of the convergence issues with the MCMC algorithm. These convergence issues are severe especially with non-informative priors.

Errors in variables models, linear and non-linear, can be formulated as hierarchical models. Data cloning is a recently developed computational technique to conduct likelihood- based analysis for general hierarchical models. In this paper, we show that data cloning coupled with informative priors can circumvent the convergence issues with MCMC. We develop a new testing procedure to compare multivariate distributions using empirical characteristic functions and show its usefulness in diagnosing convergence of MCMC algorithm in these tricky situations. More importantly, we show that data cloning not only facilitates parameter estimation but also diagnosing which parameters are estimable and which ones are not. This is essential for drawing scientifically meaningful inferences. We illustrate the method using various linear and non-linear regression models useful in ecology. We report a somewhat surprising result that a widely used population dynamics model, the Hasell model, is non-identifiable but a closely related Generalized Beverton-Holt model is identifiable.

Work done in collaboration with Khurram Nadeem.

Inferring transcriptional and microRNA-mediated regulatory programs in cancer
Christina Leslie, Associate Member and Lab Head, Computational Biology Program, Memorial Sloan-Kettering Cancer Center

Large-scale cancer genomics projects are profiling hundreds of tumors at multiple molecular layers, including copy number, mRNA and miRNA expression, but the mechanistic relationships between these layers are often excluded from computational models. We developed a sparse regression framework for integrating molecular profiles with regulatory elements to reveal reveal mechanisms of dysregulation of gene expression in cancer, including miRNA-mediated expression changes. We applied our approach to 320 glioblastoma tumors and identified key miRNAs and transcription factors as common or subtype-specific regulators. We confirmed that target gene expression signatures for proneural subtype regulators were consistent with in vivo expression changes in a relevant mouse model. We tested two predicted proneural drivers, miR-124 and miR-132, both underexpressed in proneural tumors, by overexpression in neurospheres and observed a partial reversal of corresponding tumor expression changes. Computationally dissecting the role of miRNAs in cancer may ultimately lead to small RNA therapeutics tailored to subtype or individual.

Bayesian Analysis of Nonlinear Taxic Fields Guiding Immune Cell Motion
Ioanna Manolopoulou, Duke Statistical Science

The spatial organization of lymphocytes in the lymph nodes and spleen and its temporal dynamics influenced by biochemical taxic fields is a key component of the immune response to vaccines and infections, and essential to the induction of immune memory. We develop dynamic models of single-cell motion involving nonparametric representations of the nonlinear spatial fields that guide cellular motion, with a primary goal of learning the structure of these fields that fundamentally shape the immune response. We develop Bayesian analysis via customized Markov chain Monte Carlo methods for single-cell models, and multi-cell hierarchical extensions for aggregating models and data across multiple cells. The case study explores data from multiphoton intravital microscopy in murine lymph nodes, and we use a number of visualization tools to summarize and compare posterior inferences on the 3-dimensional taxic fields.

Joint work with Thomas B. Kepler and Mike West.

Temporal inhomogeneity and dependence of brain networks
Sofia Olhede, Department of Statistical Science, University College London

Electroencephalography (EEG) measurements yield information about electrical activity of the brain, via measured voltage fluctuations. Measurements are normally made at several locations on the scalp, and the network of activity is inferred from analysing multiple time series, that are neither linear nor stationary. We shall discuss how modern time series analysis methods can be adapted and innovated to model and make inferences of the complicated joint time-varying properties of such observations.

This is joint work with Maria Fitzgerald (UCL) and Hernando Ombao (Brown).

Summary statistics for ABC model choice
Dennis Prangle, Mathematics & Statistics, Lancaster University

ABC is a powerful method for inference of statistical models with intractable likelihoods. Recently there has been much interest in using ABC for model choice and concerns have been raised that the results are not robust to summary statistic choice. We propose a method of choosing useful summary statistics and apply the method to population genetic and epidemiological examples.

Inferring the gap between mechanism and phenotype in dynamical models of gene regulation
Dov Stekel, Biosciences, University of Nottingham

Dynamical (differential equation) models in molecular biology are often cast in terms of biological mechanisms such as transcription, translation and protein-protein and protein-DNA interactions. However, most molecular biological measurements are at the phenotypic level, such as levels of gene or protein expression in wild type and chemically or genetically perturbed systems. Mechanistic parameters are often difficult or impossible to measure. We have been combining dynamical models with statistical inference as a means to integrate phenotypic data with mechanistic hypotheses. In doing so we are able to identify key parameters that determine system behaviour, and parameters with insufficient evidence to estimate, and thus make informed predictions for further experimental work. We are also able to use inferred parameters to build stochastic and multi-scale models to investigate behaviour at single-cell level. We apply these ideas to two systems in microbiology: global gene regulation in the antibiotic-resistance bearing RK2 plasmids, and zinc uptake and efflux regulation in Escherichia coli.

Parameter estimation for bursting neural models
Joe Tien, Mathematics, The Ohio State University

Bursting is a ubiquitous phenomenon in neuroscience which involves multiple time scales (fast spikes vs. long quiescent intervals). Parameter estimation for bursting models is difficult due to these multiple scales. I will describe an approach to parameter estimation for these models which utilizes the geometry underlying bursting. This is joint work with John Guckenheimer.

Modeling and inference framework for studying noise and cell-to-cell variability in synthetic biology
Tina Toni, Department of Biological Engineering, Computer Science and Artificial Intelligence Laboratory, MIT

Synthetic biology faces many challenges in its bold attempt to construct functioning and programmable devices, circuits, and higher constructs. Chief amongst these may be selecting from multiple design possibilities given partially characterized biological systems, effects of biological noise and cell-to-cell variability. Here we report progress on developing modeling and inference techniques that incorporate noise and cell-to-cell variability to guide design principles in synthetic biology.

Sloppy Models, Information Geometry, and Data Fitting
Mark Transtrum, Physics, Cornell University

Parameter estimation by nonlinear least squares minimization is a ubiquitous problem that has an elegant geometric interpretation: all possible parameter values induce a manifold embedded within the space of data. The minimization problem is then to find the point on the manifold closest to the data. By interpreting nonlinear models as a generalized interpolation scheme, we find that the manifolds of many models, known as sloppy models, have boundaries and that their widths form a hierarchy. We describe this universal structure as a hyper-ribbon. The hyper-ribbon structure explains many of the difficulties associated with fitting nonlinear models and suggests improvements to standard algorithms. We add a "geodesic acceleration" correction to the standard Levenberg-Marquardt algorithm and observe a dramatic increase in success rate and convergence speed on many fitting problems.

Statistical Methods for High-Dimensional ODE Models for Dynamic Gene Regulatory Networks
Hulin Wu, Department of Biostatistics and Computational Biology University of Rochester, School of Medicine and Dentistry

Gene regulation is a complicated process. The interaction of many genes and their products forms an intricate biological network. Identification of this dynamic network will help us understand the biological process in a systematic way. However, the construction of such a dynamic network is very challenging for a high-dimensional system. We propose to use a set of ordinary differential equations (ODE), coupled with dimensional reduction by clustering and mixed-effects modeling techniques, to model the dynamic gene regulatory network (GRN). The ODE models allow us to quantify both positive and negative gene regulations as well as feedback effects of one set of genes in a functional module on the dynamic expression changes of the genes in another functional module, which results in a directed graph network. A five-step procedure, Clustering, Smoothing, regulation Identification, parameter Estimates refining and Function enrichment analysis (CSIEF) is developed to identify the ODE-based dynamic GRN. In the proposed CSIEF procedure, a series of cutting-edge statistical methods and techniques are employed. We apply the proposed method to identify the dynamic GRN for yeast cell cycle progression data and immune response to influenza infection. We are able to annotate the identified modules through function enrichment analyses. Some interesting biological findings are discussed. The proposed procedure is a promising tool for constructing a general dynamic GRN and more complicated dynamic networks.

Posters

Probabilistic Integration for Inference over Differential Equation Models
Oksana Chkrebtii, PhD Candidate, Department of Statistics & Actuarial Science Simon Fraser University

We introduce a new methodology for inference on model parameters for systems of ordinary differential equations. Classical estimation approaches construct the likelihood function using numerical solutions that are fully deterministic and thus ignore approximation uncertainty. We propose a probabilistic alternative to using a deterministic solver. The ODE solution is modeled as a stochastic process obtained by a sequential algorithm that samples iteratively from the derivative space and smooths the sampled points via Gaussian Process regression. The proposed Bayesian framework incorporates solution uncertainty in the likelihood and thus in the inference process. This poster describes the proposed methodology and simulations for the FitzHugh-Nagumo system of ODEs. We discuss the differences in the resulting posterior surface from that obtained with the classical approach and relate this to parameter inference.

Work done in collaboration with David Campbell (Simon Fraser University), Ben Calderhead and Mark Girolami (University College London).

Dramatic reduc- tion of dimensionality in large biochemical networks due to strong pair correlations
Michael Dworkin, The Ohio State University

Hiearchical cell signaling and gene regulatory kinetic reactions, composed of rich biochemical networks, produce decisive functional outcomes in cells that interact with diverse stimuli. Recent developments in high throughput experiments provide us with detailed views of these complex phenomena. While the amount of data from such experiments containing enormous number of variables is impressive, it is difficult to extract mechanisms underlying the complex kinetics that determine functional outcomes. Elucidating the mechanisms is essential for both scientific understanding and therapeutic applications.

We study multivariate statistical methods (e.g., principal component analysis) based on covariances used in the analysis of high throughput data sets. We show that these lead to a dramatic reduction (from hundreds to fewer than 5) of the dimensionality in the time-dependent data obtained from numerical solution of coupled ordinary differential equations describing large, biologically significant sets of biochemical reactions. We find this reduction independent of the form of the nonlinear interactions, network architecture and over a wide range of parameter values (rate constants and concentrations). We show how changes in time scales in the system are associated with the relative changes in the number of the principal components required to capture the maximal variance in the data set. We provide examples where description of the system kinetics in terms of few principal components can lead to new insights into complex multi-dimensional systems. This may lead us toward uncovering mechanisms and identifying the key processes in complex biological systems.

Combined Monte Carlo and Dynamical Modelling of Gene Regulation enables Multiple Parameter Estimations and Hypothesis Generation
Dorota Herman, Center for Systems Biology, School of Biosciences, University of Birmingham

The central control operon of plasmid RK2 is negatively and co-operatively autoregulated by dimers of two global regulators, KorA and KorB. Although experimental results indicate total repression when KorA and KorB are bound and partial repression when either KorA or KorB is bound, the mechanisms of these partial repressions have been unknown. We aim to increase the understanding of the transcriptional regulation mechanisms of the central control operon of plasmid RK2 by parameter estimation and hypothesis verification based on available data.

Analyses of the RK2 central control operon regulation were carried out by considering the steady state of a deterministic model of the system during exponential host bacterial growth. Using the Metropolis-Hastings algorithm we have estimated protein synthesis rates and revealed insignificance of monomerization rates to the model. Complementary parameter analyses were carried out with Metabolic Control Analyses. Based on the models, that include the association and dissociations of the repressors to the DNA strand, we could test different scenarios of partial repression and select the most plausible model in reference to experimental data of repression levels.

In summary, the reported results contribute to our understanding of the repression mechanisms of plasmid RK2. Moreover, we have shown that MCMC and MCA are complementary approaches to identifying biologically important parameters and determinants of the IncP-1 central control operon.

Integrated modelling and experiments to identify drug-induced changes in cell-cycle dynamics
Bartholomaeus Hirt, Centre for Mathematical Medicine and Biology, School of Mathematical Sciences, University of Nottingham

The eukaryotic cell cycle consists of five phases, namely a resting state, G0, and four cycling phases: G1, S, G2 and M, with cells progressing in this order and then dividing into two cells back in G1. We have collected experimental data and developed a mathematical model describing the cell cycle dynamics of cancer cells and their time- and dose-dependent modulation by a DNA interactive agent (RHPS4). RHPS4, which is in advanced preclinical development, is an attractive compound because of its anti-tumour properties in vitro and in vivo, inhibiting the activity of telomerase, an enzyme known for its role in cellular immortalisation in human cancer. The precise mechanism of action of the drug on the cell cycle dynamics, however, remains unclear.

We formulated compartmental models using ordinary differential equations and fitted simulations of our models to experimental observations. We performed maximum likelihood estimation of the parameters and model selection using Akaike's information criterion. We found that at low concentrations RHPS4 primarily affected cells in the G2/M phase, and caused a concentration-dependent, marked cell death in treated cells at later time points of the observation period. Since the drug uptake into the nucleus is rapid (saturation within ~5h), the observed delayed effect of 5 days of the compound is unexpected and is a novel finding of our research into this compound.

A systems biology approach to understanding Clostridium difficile infection
Sara Jabbari, Centre for Biomolecular Sciences, University of Nottingham, UK

The bacterium Clostridium difficile causes thousands of deaths every year in the United Kingdom and worldwide through healthcare-associated infections. The impact upon the healthcare system is enormous and set to worsen due to the increasing prevalence of hypervirulent and antibiotic-resistant strains. An urgent need to develop novel therapies to tackle C. difficile infection thus exists, but a key obstacle to this is our lack of understanding of the gene regulation networks governing C. difficile virulence. These complex networks of genes detect optimal conditions for successful infection, regulating the production of toxins and the initiation of related mechanisms accordingly. The complexity of such nonlinear networks defies their understanding by a single discipline alone. Our goal is to combine mathematical modelling with experimental work and parameter estimation techniques to improve our understanding of the interactions involved in the gene regulation network governing toxin production by C. difficile. We here present some numerical solutions which illustrate the capability of dynamical models to support or contradict existing hypotheses. We focus on the role of TcdC in the regulation of toxin production and show that numerical solutions to the model, combined with existing knowledge of the protein profiles, do not support the widely held belief that the lack of a functioning TcdC protein in hypervirulent strains is the key to their increased levels of toxin production.

Joint work with J.R. King and N.P. Minton.

Robustness of the Cell Signaling Network as a Means to Discriminate Among the Di erent Models of Itk Kinase Regulation in T cells
Sayak Mukherjee, Battelle Center for Mathematical Medicine, Nationwide Children's Hospital; Department of Pediatrics, Ohio State University

The process that warrants the generation of self tolerant peripheral T cells is called the thymocyte selection. During this maturation process, the overtly self reactive as well as un- responsive thymocytes are deleted from the cell population. The thymocytes equipped with T cell receptors (TCRs), capable of responding moderately to the self peptides are allowed to survive. Recently water soluble second messenger, inositol(1,3,4,5) tetrakisphosphate (IP4), has been implicated to play a crucial role in thymocyte positive selection (Huang et al.). It has been suggested that these IP4 molecules regulate the transient activation of the Tec- family protein tyrosine kinase Itk through a competing positive and negative feedbacks. The exact molecular mechanism involved in this feedback is however unclear. It is possible to con- struct more than one model with contrasting molecular mechanisms to explain the present body of experimental observations. This calls for criteria to choose among these models. Robustness in face of the variation of the parameters in a model has been ubiquitously used as a criterion for model discrimination. Here we have used the maximum entropy, calculated with the constraints imposed by the experiments as a measure of robustness. Our data in- dicates that the models which are maximally robust share a cooperative allosteric mode of Itk regulation involving dimeric PH domains.

Likelihood based Population Viability Analysis in the presence of observation error
Khurram Nadeem and Subhash R. Lele, Department of Mathematical and Statistical Sciences, University of Alberta

Population viability analysis (PVA) entails calculation of extinction risk, as defined by various extinction metrics, for a study population. These calculations strongly depend on the form of the population growth model and inclusion of demographic and/or environmental stochasticity. Form of the model and its parameters are determined based on observed population time series data. A typical population time series, consisting of estimated population sizes, inevitably has some observation error and likely has missing observations. In this paper, we present a likelihood based PVA in the presence of observation error and missing data. We illustrate the importance of incorporation of observation error in PVA by reanalyzing the population time series of song sparrow (Melospiza melodia) on Mandarte Island, British Columbia, Canada from 1975-1998. Using Akaike information criterion we show that model with observation error fits the data better than the one without observation error. The extinction risks predicted by with and without observation error models are quite different. Further analysis of possible causes for observation error revealed that some component of the observation error might be due to unreported dispersal. A complete analysis of such data, thus, would require explicit spatial models and data on dispersal along with observation error. Our conclusions are, therefore, two-fold: i) observation errors in PVA matter and ii) integrating these errors in PVA is not always enough and can still lead to important biases in parameter estimates if other processes such as dispersal are ignored.

Distinguishing sources of parasites on wild juvenile salmon
Stephanie Peacock, Biological Sciences & Centre for Mathematical Biology, University of Alberta, Canada

Ecological systems are complex. All too often, complex models are fit to ecological data without consideration of whether parameters are estimable. I present a recent example for a parasite transmission model tracking the diffusion of sea lice from salmon farms in coastal British Columbia, fit to data of sea lice abundances on migrating wild juvenile salmon. The estimability of parameters is not clear from published point estimates, but an effort to obtain standard errors on parameter estimates revealed either a lack of information in the data or non-identifiable parameters. I used data cloning techniques to verify an estimability issue, and subsequently revised the model. Results from the revised parasite transmission model support previous work, suggesting that salmon farms are a significant source of sea lice on wild juvenile salmon. But management interventions have altered parasite dynamics on salmon farms in recent years, and the temptation is great to account for these dynamics in the parasite transmission model. Can parameters be estimated if this additional complexity is introduced?

Joint work with Mark Lewis and Marty Krkosek.

Phylodynamic inference and hypothesis testing - influenza as a case study
Oliver Ratmann, Department of Biology, Duke University

The infectious disease dynamics of many human pathogens like influenza, norovirus and coronavirus are inextricably tied to their evolution. These "phylodynamics", i.e. the reciprocal, nonlinear interaction between evolutionary and ecological processes, make the infectious disease behavior of rapidly evolving pathogens so difficult to understand. Most statistical methods for the analysis of pathogen phylodynamics require that the likelihood of the data under a model that captures the nature of the disease behavior can be explicitly calculated. Currently, the scope of these methods is limited and questions on the interaction between different viral variants cannot be well-addressed. Simulation-based statistical methods circumvent likelihood calculations. We show that simulation-based methods can be used to fit and test complex phylodynamic models against both sequence and case report data by taking interpandemic human influenza virus subtype H3N2 as an example. We find that combining molecular genetic and epidemiological data is key for phylodynamic inference and hypothesis testing because the interdependencies across these two data sets severely constrain parameter space and expose quantitatively inconsistent model behavior. The proposed method is well-suited to interface population dynamic multi-strain models with both types of data and thereby offers the opportunity to analyze statistically the phylodynamics of rapidly evolving pathogens.

Improving statistical models for discovering cell type specifc genes
Sergiusz Wesolowski, Dept. of Mathematics, Computer Science, Mechanics; Warsaw University, Poland

Analysis of gene expression is one of the fundamental methods of characterizing cell populations. One of the major cells in the immune system are "helper" T cells expressing CD4 surface marker. The majority of these cells constitutes a population of conventional CD4+ T cells which supports functions of other cells of the adaptive and innate immune system. A smaller population, called regulatory CD4+ T cells (Treg), has opposite function and suppresses immune response and is responsible for the homeostasis of the immune system. The most characteristic gene expressed by Treg cells is a transcription factor Foxp3. Both conventional and Treg cells are generated in the thymus from bone marrow-derived progenitors. Treg cells produced in the thymus are called natural Treg cells. Under certain conditions, conventional CD4 T cells can express Foxp3 and acquire suppressor function. These Treg cells are called adaptive Treg.

One of the methods of investigating different subsets of CD4 T cells is to compare their gene expression profiles. This approach allows insight into cellular functions of individual cell subsets and allows for analysis of functions of differentially expressed genes. Analysis of the global expression pro les is commonly done using microarrays. To reveal genetic control of various subsets of CD4 T cells we compared gene expression pro les of resting and activated conventional CD4 T cells, resting and activated natural Treg cells and adaptive Treg cells. RNA was isolated from the respective T cell populations and hybridized to Affymetrix GeneChip M430 2.0 Plus microarrays. Three individual samples of each kind were processed.

In order to make our data set more representative, followin a similar approach described in [?], we included microarrays from the respective CD4 T cell subsets from other laboratories. These data were obtained from the GEO database: www.ncbi.nlm.nih.gov/geo. To deal with the problem we produced a framework combined from several available statistical approaches: Linear models for Microarray data, Bayesian approach, Non- Negative Matrix Factorisation [?].

Comparisons of data from multiple laboratories introduces additional levels of variability which need to be accounted for during data normalization.

Normalization attempts that adjusted mean values and standard deviation of gene expression resulted in the sets of differentially expressed genes that differed between laboratories instead of between different T cell populations. Our computations indicated that lab origin has more influence on gene expressions then investigated cell types among laboratories.

To account for multi-dimensionality of the normalization problem we developed a heuristic approach.

1. Jean-Philippe Brunet, Pablo Tamayo, Todd R. Golub, Jill P. Mesirov, Metagenes and molecular pattern discovery using matrix factorization PNAS 12 4164{4169.

2. Franz-Josef Muller, Louise C. Laurent, Dennis Kostka, Igor Ulitsky, Roy Williams, Christina Lu, In-Hyun Park, Mahendra S. Rao, Ron Shamir, Philip H. Schwartz, Nils O. Schmidt Loring, Jeanne F. Loring, Regulatory networks define phenotypic classes of human stem cell lines Nature (18 September 2008) 455 401{405.