Workshop 5: Biomarkers in HIV and Cancer Research

(April 18 - April 22, 2005)

Organizers


Victor DeGruttola
Department of Biostatistics, Harvard Medical School
Alan Perelson
Los Alamos National Laboratory
Mark Segal
Department of Biostatistics, University of California, San Francisco
Steven Skates
MGH Biostatistics Center
Jeremy Taylor
Biostatistics - School of Public Health, University of Michigan

This workshop addresses the medical application of new scientific technologies, such as microarrays and PCR. The medical applications will include the use of measurements based on these technologies as biomarkers for diagnosis, disease progression, and effects of treatment. The disease areas in which applications will be considered include HIV and other viral infections, and cancer.

HIV: The new technologies have directly impacted clinical management of HIV infection. HIV gene sequencing is used to evaluate drug susceptibility and thereby select treatment regimens for drug-experienced patients. PCR technology makes it possible to count HIV RNA particles in body compartments. Such measures allow evaluation of drug efficacy in suppressing virus both in plasma and in genital secretions. They also allow modeling of HIV dynamics, providing insight into the mechanisms of drug action. In addition to viral genomics, human genomics is also a developing area of research. In particular, there is interest in determining whether polymorphisms in specific host genes explain patient variability in treatment response, toxicity, and pharmacokinetics of antiretroviral drugs.

The sessions will include methods for relating HIV genotype to resistance phenotype; methods for modeling the accumulation of HIV resistance mutations; and relationship of host genomics to treatment response, toxicity, and pharmacokinetics of ARV therapy.

CANCER: Biomarkers are an important component of oncology practice at present, particularly in monitoring for cancer recurrence and in early detection of some cancers. However, with the recent explosion of genomic and proteomic technologies, biomarkers have the potential to contribute far more broadly to cancer research and oncology practice, including the following areas: early detection of cancer in asymptomatic subjects; differential diagnosis for patients presenting with symptoms; monitoring for recurrence; risk stratification for clinical trial eligibility or selection of subjects for prevention/early detection strategies; prognosis; aid in therapeutic decision-making; monitoring the course of therapy; and surrogate endpoints for clinical trials.

Statistical challenges abound in all these areas, ranging from methods to identify suitable biomarkers to optimization of their application. Tight collaboration between statisticians, biologists, surgeons, and physicians, where biological and medical knowledge is incorporated in the statistical modeling whenever possible, will likely increase the chances of biomarkers realizing their full potential impact.

This workshop aims to highlight the statistical challenges involved in the areas of HIV and cancer research and medical practice, present statistical research in progress, and provide a forum for discussing current answers to the statistical challenges and future directions.

Accepted Speakers

Donna Ankerst
Public Health Sciences Division, Fred Hutchinson Cancer Research Center
Kerry Bemis
Department of Statistics, Indiana Centers for Applied Protein Sciences
Steve Deeks
San Francisco General Hospital, University of California, San Francisco
Eleftherios Diamandis
Section of Clinical Biochemistry , University of Toronto
Jane Fridlyand
Center for Bioinformatics & Molecular Biostatistics, University of California, San Francisco
Debashis Ghosh
Biostatistics - School of Public Health, University of Michigan
Mary Elizabeth Halloran
Biostatistics - Rollins School of Public Health, Emory University
Rick Higgs
Genomic & Molecular Informatics, Eli Lilly
Steve Horvath
Human Genetics and Biostatistics, David Geffen School of Medicine
Clet Niyikiza
Oncology Pharmacogenomics, Eli Lilly
Alan Perelson
Los Alamos National Laboratory
Mark Segal
Department of Biostatistics, University of California, San Francisco
John Semmes
Microbiology and Molecular Cell Biology, Eastern Virginia Medical School
Steven Skates
MGH Biostatistics Center
Jeremy Taylor
Biostatistics - School of Public Health, University of Michigan
Rodolphe Thiebaut
Biostatistics - ISPED, Université de Bordeaux 2
Bruce Trock
Epidemiology and Oncology, Johns Hopkins University
Mark van der Laan
Biostatistics - School of Public Health, University of California, Berkeley
Hulin Wu
Biostatistics and Computational Biology, University of Rochester
Monday, April 18, 2005
Time Session
09:00 AM
09:45 AM
Steve Deeks - Pathogenesis of Drug-Resistant HIV: Implications for Novel Treatment Strategies

Many patients treated with combination antiretroviral therapy fail to achieve complete viral suppression. Optimizing individual treatment strategies requires an understanding of the complex relationship between replication of drug-resistant virus and the host response. In particular, the distinction between persistent drug activity, alterations in replicative capacity ("fitness") and the ability of a newly emergent variant to cause disease ("virulence") may prove to be important in designing long-term therapeutic strategies. These issues will likely become even more relevant with entry inhibitors, where drug pressure may select for X4 variants that may be less fit but more virulent. To address these issues we have performed a series of studies focusing on the determinants of disease outcome in patients with drug-resistant viremia, and have observed the following: (1) HIV is often constrained in its ability to develop high-level drug resistance while maintaining replicative capacity, (2) immune activation is reduced in patients with drug-resistant HIV (after controlling for the level of viremia) and (3) patients who durably control HIV replication despite the presence of drug resistance exhibit immunologic characteristics comparable to those observed in long-term non-progressors (e.g., low levels of T cell proliferation and activation and preserved HIV-specific IL-2- and gamma-interferon-producing CD4+ T cells). We have initiated a number of interventional studies based on the hypothesis that drug-mediated alterations in HIV fitness/virulence may be clinically useful in patients with limited therapeutic options.


Supported by NIAID (AI052745, AI055273), the UCSF/Gladstone CFAR (P30 MH59037), the California AIDS Research Center (CC99-SF, ID01-SF-049) and the SFGH GCRC (5-MO1-RR00083-37).

10:00 AM
10:45 AM
Alan Perelson - Modeling Drug Efficacy and HIV Dynamics

Simple models of HIV infection and the effects of antiretroviral therapy have typically assumed that drugs have a constant efficacy. Here I will summarize some new models that incorporate ideas from pharmacokinetics and pharmacodynamics, such that drug efficacy depends on drug concentration, which in turn depends on the drug dose and the time at which the drug is taken. These models allow estimation of the relative efficacy of different drug combinations and also allow one to explicitly incorporate the effects of missed drug doses or intentional interruption of therapy for short periods. Effects of drug resistance can also be incorporated.
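
As a rough illustration of how such models couple pharmacokinetics to viral dynamics, the sketch below drives a standard target-cell-limited model with an Emax efficacy computed from a one-compartment PK profile. The model form and every parameter value are illustrative assumptions, not the speaker's fitted models; removing entries from the dosing schedule is one simple way to represent missed doses.

```python
# Minimal sketch (illustrative parameters): viral dynamics with a
# time-varying drug efficacy driven by a one-compartment PK profile.
import numpy as np
from scipy.integrate import odeint

ka, ke, V = 1.0, 0.3, 10.0          # absorption, elimination (1/h), volume (L)
dose, tau = 600.0, 12.0             # mg per dose, dosing interval (h)
IC50, hill = 0.5, 1.0               # mg/L, Hill coefficient (assumed)

def conc(t):
    """Superpose the absorption profiles of all doses taken by time t."""
    t_doses = np.arange(0.0, t + 1e-9, tau)   # drop entries to model missed doses
    c = (dose * ka / (V * (ka - ke))) * (np.exp(-ke * (t - t_doses))
                                         - np.exp(-ka * (t - t_doses)))
    return max(c.sum(), 0.0)

def efficacy(t):
    c = conc(t)
    return c**hill / (c**hill + IC50**hill)   # Emax model: 0 <= eps < 1

def hiv(y, t, lam=1e4, d=0.01, beta=8e-12, delta=1.0, p=2000.0, cv=23.0):
    T, I, Vl = y                              # target cells, infected cells, virus
    eps = efficacy(t)
    return [lam - d * T - (1 - eps) * beta * T * Vl,
            (1 - eps) * beta * T * Vl - delta * I,
            p * I - cv * Vl]

t = np.linspace(0.0, 24 * 14, 2000)           # two weeks, in hours
sol = odeint(hiv, [1e6, 1e4, 1e5], t)
print("log10 viral load at day 14:", np.log10(sol[-1, 2]))
```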

11:15 AM
12:00 PM
Hulin Wu - Modeling and Prediction of Biomarkers Longitudinally in AIDS Clinical Studies

Although a single-event endpoint such as time to virological failure is simple and easy to use in large AIDS clinical trials, longitudinal biomarker data from close monitoring of viral load and CD4+ T cell counts can provide more detailed information regarding the pathogenesis of HIV infection and the characteristics of antiretroviral regimens. I will present a mechanistic HIV-1 dynamic model that incorporates information on pharmacokinetics, drug adherence and drug susceptibility to predict the viral load trajectory. A Bayesian approach is proposed to fit this model to clinical data from ACTG A5055, a study of two dosage regimens of indinavir (IDV) with ritonavir (RTV) in subjects failing their first PI treatment. HIV RNA testing was completed at days 0, 7, 14, 28, 56, 84, 112, 140 and 168. An intensive PK evaluation was performed on day 14 and multiple trough concentrations were subsequently collected. Pill counts were used to monitor adherence. IC50 for IDV and RTV was determined at baseline and at virologic failure.


Viral dynamic model fitting residuals were used to assess the significance of covariate effects on long-term virologic response. As univariate predictors, none of the four PK parameters C_trough, C_12h, C_max and AUC_0-12h was significantly related to virologic response (p>0.05). By including drug susceptibility (IC50), or IC50 and adherence together, C_trough, C_12h, C_max and AUC_0-12h were each significantly correlated with long-term virologic response (p=0.0055, 0.0002, 0.0136 and 0.0002, respectively, with IC50 and adherence considered). IC50 and adherence alone were not related to the virologic response. Adherence did not provide additional information beyond the PK parameters (p=0.064), beyond drug susceptibility IC50 (p=0.086), or beyond their combination (p=0.22) in predicting virologic response. Simple regression approaches did not detect any significant PD relationships.


Thus no single factor among PK, adherence and drug susceptibility could be shown to contribute significantly to long-term virologic response, but an appropriate combination of these factors via the viral dynamic modeling approach was a significant predictor of virologic response. Adherence measured by pill counts and multiple trough drug concentrations did not provide additional information for virologic response, presumably due to data quality and noise problems. HIV dynamic modeling is a powerful tool for establishing a PD relationship and correlating other factors such as adherence and drug susceptibility with long-term virologic response, since it can appropriately capture the complicated nonlinear relationships and interactions among multiple covariates. Our findings may help clinicians better understand the roles of these clinical factors in antiviral activity and predict the virologic response to various antiretroviral regimens.
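
As a toy illustration of why a combined predictor can succeed where single factors fail, the following sketch (simulated data, not the ACTG A5055 analysis) forms an inhibitory-quotient-style predictor from trough concentration, IC50 and adherence and compares single-factor and combined regressions.

```python
# Illustrative only: simulate subjects whose virologic response depends on
# the combination of PK, susceptibility and adherence, then compare simple
# regressions on each factor alone versus the combined predictor.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 44                                    # subjects (assumed)
c_trough = rng.lognormal(0.0, 0.5, n)     # trough concentration (mg/L)
ic50 = rng.lognormal(-2.0, 0.6, n)        # baseline IC50 (mg/L)
adherence = rng.uniform(0.6, 1.0, n)      # fraction of doses taken

iq = np.log(adherence * c_trough / ic50)  # inhibitory-quotient-style predictor
response = -0.8 * iq + rng.normal(0, 1.0, n)   # e.g. change in log10 HIV RNA

for name, x in [("C_trough", np.log(c_trough)), ("IC50", np.log(ic50)),
                ("adherence", adherence), ("combined", iq)]:
    res = stats.linregress(x, response)
    print(f"{name:10s} slope={res.slope:+.2f}  p={res.pvalue:.4f}")
```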

01:15 PM
02:00 PM
Mark van der Laan - Interpreting HIV Mutations to Predict Response to Antiretroviral Therapy: The deletion/substitution/addition (DSA) algorithm for the estimation of direct causal effects

Our goal is to estimate the causal effect of mutations detected in the HIV strains infecting a patient on clinical virologic response to specific antiretroviral drugs and drug combinations. We consider the following data structure: 1) viral genotype, which we summarize as the presence or absence of each viral mutation considered by the Stanford HIV Database as likely to have some effect on virologic response to antiretroviral therapy; 2) drug regimen initiated following assessment of viral genotype (the regimen may involve changing some or all of the drugs in a patient's previous regimen); and 3) change in plasma HIV RNA level (viral load) over baseline at twelve and twenty-four weeks after starting this regimen.


The effects of a set of mutations on virologic response are heavily confounded by past treatment. In addition, viral mutation profiles are often used by physicians to make treatment choices; we are interested in the direct causal effect of mutations on virologic outcome, not mediated by choice of other drugs in a patient's regimen. Finally, the need to consider multiple mutations and treatment history variables, as well as multi-way interactions between these variables, results in a high-dimensional modeling problem. This application thus requires data-adaptive estimation of the direct causal effect of a set of mutations on viral load under a particular drug, controlling for confounding and blocking the effect the mutations have on the assignment of other drugs. We developed such an algorithm based on a mix of the direct-effect causal inference framework and the data-adaptive regression deletion/substitution/addition (DSA) algorithm.
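
The following toy search (simulated data; not the authors' DSA software) conveys the deletion/substitution/addition idea: starting from the empty model, repeatedly accept whichever deletion, substitution or addition of a candidate term improves cross-validated error.

```python
# Toy D/S/A-style model search over mutation main effects and two-way
# interactions, scored by cross-validated mean squared error.
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 200, 8
X = rng.integers(0, 2, (n, p)).astype(float)      # mutation indicators
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] * X[:, 2] + rng.normal(0, 1, n)

terms = [(j,) for j in range(p)] + list(itertools.combinations(range(p), 2))

def cv_error(model_terms):
    if not model_terms:
        return float(np.var(y))
    Z = np.column_stack([X[:, list(t)].prod(axis=1) for t in model_terms])
    return -cross_val_score(LinearRegression(), Z, y, cv=5,
                            scoring="neg_mean_squared_error").mean()

current, best, improved = [], np.inf, True
while improved:
    improved = False
    moves = ([current[:i] + current[i + 1:] for i in range(len(current))] +
             [current[:i] + [t] + current[i + 1:] for i in range(len(current))
              for t in terms if t not in current] +
             [current + [t] for t in terms if t not in current])
    for cand in moves:                  # deletion, substitution, addition moves
        err = cv_error(cand)
        if err < best - 1e-6:
            current, best, improved = cand, err, True
print("selected terms:", current, " CV MSE:", round(best, 3))
```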

02:15 PM
03:00 PM
Rodolphe Thiebaut - Issues in Longitudinal Modelling of HIV Markers Using Mixed Models

Plasma HIV RNA and CD4+ T lymphocyte counts are major biomarkers used to decide when to start, change or stop treatment, as well as to evaluate treatment efficacy in HIV-infected patients. Thus, repeated measurements of these biomarkers are common in HIV studies. Such data may be analysed using models for longitudinal data such as mixed models. However, the statistical analysis is complicated by several methodological difficulties, three of which are of particular importance: (i) left-censoring of HIV RNA due to a lower quantification limit; (ii) correlation between CD4+ T lymphocytes and plasma HIV RNA; (iii) missing data due to informative dropout or disease progression. I will present a unified approach to deal with these issues by jointly modelling longitudinal measurement data and event history data. Likelihood inference can be used to estimate the parameters of such a model. I will illustrate it by studying the HIV marker response to antiretroviral treatment in randomised clinical trials and observational cohort studies. This approach might help in studying the change in markers, their prognostic value and their surrogacy.
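
A minimal sketch of issue (i), assuming a simple marginal linear trend rather than the full joint mixed model of the talk: measurements below the quantification limit contribute a normal CDF term to the likelihood, observed measurements a density term.

```python
# Tobit-type likelihood for left-censored log10 HIV RNA (simulated data).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
t = np.tile(np.arange(0.0, 10.0), 30)                # visit times, 30 subjects
y = 4.5 - 0.25 * t + rng.normal(0, 0.5, t.size)      # true log10 HIV RNA
LOD = 2.7                                            # lower quantification limit
cens = y < LOD
y_obs = np.where(cens, LOD, y)                       # censored values recorded as LOD

def negloglik(theta):
    b0, b1, log_sig = theta
    mu, sig = b0 + b1 * t, np.exp(log_sig)
    ll = np.where(cens,
                  norm.logcdf(y_obs, mu, sig),       # P(Y < LOD | mu, sig)
                  norm.logpdf(y_obs, mu, sig))
    return -ll.sum()

fit = minimize(negloglik, x0=[4.0, 0.0, 0.0], method="Nelder-Mead")
b0, b1, log_sig = fit.x
print(f"intercept {b0:.2f}, slope {b1:.3f}, sigma {np.exp(log_sig):.2f}")
```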

03:30 PM
04:15 PM
Victor DeGruttola - Joint Modeling of Progression of HIV Resistance Mutations Measured with Uncertainty and Time to Virological Failure

Development of HIV resistance mutations is a major cause of failure of antiretroviral treatment. This talk proposes a method for jointly modeling the processes of viral genetic change and treatment failure. Because the viral genome is measured with uncertainty, a hidden Markov model is used to fit the viral genetic process. The uncertain viral genotype is included as a time-dependent covariate in a Cox model for failure time, and an EM algorithm is used to estimate the model parameters. This model allows simultaneous evaluation of the sequencing uncertainty and the effect of resistance mutations on the risk of virological failure. The method is then applied to data collected in three phase II clinical trials testing antiretroviral treatments containing the drug efavirenz. Various model-checking tests are provided to assess the appropriateness of the model.
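
The hidden Markov machinery can be sketched compactly. The toy below uses assumed transition and sequencing-error probabilities (not the paper's estimates) and computes the likelihood of an observed genotype sequence with the scaled forward algorithm.

```python
# Forward algorithm for a two-state HMM: hidden state is absence/presence
# of a resistance mutation; the sequenced call is observed with error.
import numpy as np

P = np.array([[0.95, 0.05],      # per-visit transitions: wt -> {wt, mutant}
              [0.02, 0.98]])     # mutants rarely revert under drug pressure
E = np.array([[0.90, 0.10],      # P(called wt / mutant | true wt)
              [0.15, 0.85]])     # P(called wt / mutant | true mutant)
pi = np.array([0.99, 0.01])      # initial state distribution (assumed)

def forward_loglik(obs):
    """obs: 0 = called wild-type, 1 = called mutant, one call per visit."""
    alpha = pi * E[:, obs[0]]
    loglik = 0.0
    for o in obs[1:]:
        alpha = (alpha @ P) * E[:, o]
        s = alpha.sum()           # rescale to avoid numerical underflow
        loglik += np.log(s)
        alpha /= s
    return loglik + np.log(alpha.sum())

print("log-likelihood:", forward_loglik([0, 0, 1, 1, 1]))
```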

Tuesday, April 19, 2005
Time Session
09:00 AM
09:45 AM
Mark Segal - Genomewide Prediction of HIV-1 Epitopes Using Ensemble Classifiers and Amino Acid Sequence of MHC Binding Peptides

Following infection, HIV-1 proteins are digested into short peptides that bind to major histocompatibility complex (MHC) molecules. Subsequently, these bound complexes are displayed by antigen presenting cells. T cells with receptors that recognize the complexes are activated, triggering an immune response. Peptides with this ability to induce T cell response are called T cell epitopes -- prediction thereof is important for vaccine development. Sung and Simon (JCB, 2004) start with compilations of peptide sequences that {bind/don't bind} to specific MHC molecules and, using biophysical properties of the constituent amino acids, develop a classifier. Properties are used because of the inability of select classifiers to effectively handle amino acid sequence itself. Tree-structured methods are not so limited (Segal et al., Biometrics, 2001). Here, we apply these methods, along with their ensemble extensions (bagging, boosting, random forests), and show they provide improved accuracy. Both additional properties (QSAR derived) and classifiers (SVMs, ANNs) are also investigated. HIV-1 genomewide comparisons with respect to predicted / conserved epitopes are also presented.
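
A schematic of the sequence-based classification step, on simulated peptides rather than the MHC binding compilations used in the talk: one-hot encode fixed-length peptides and cross-validate a random forest.

```python
# Illustrative: random forest on one-hot encoded 9-mers with a planted
# anchor-position signal; real epitope data would replace simulate().
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

AA = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(7)

def encode(peptides):
    """One-hot encode equal-length peptides into an (n, length*20) matrix."""
    idx = np.array([[AA.index(a) for a in p] for p in peptides])
    X = np.zeros((idx.shape[0], idx.shape[1] * 20))
    for j in range(idx.shape[1]):
        X[np.arange(len(peptides)), j * 20 + idx[:, j]] = 1.0
    return X

def simulate(n):
    """Binders (noisily) prefer L at position 2 and V at position 9."""
    peps, labels = [], []
    for _ in range(n):
        pep = "".join(rng.choice(list(AA), 9))
        score = (pep[1] == "L") + (pep[8] == "V") + rng.random()
        peps.append(pep); labels.append(int(score > 1.2))
    return peps, np.array(labels)

peptides, y = simulate(2000)
rf = RandomForestClassifier(n_estimators=200, random_state=0)
print("CV AUC:", cross_val_score(rf, encode(peptides), y, cv=5,
                                 scoring="roc_auc").mean())
```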

10:15 AM
11:00 AM
Mary Elizabeth Halloran - Using Validation Sets for Outcomes with Time-to-event Data in Vaccine Studies

In many vaccine studies, confirmatory diagnosis of a suspected case is made by culture to confirm that the infectious agent of interest is present. However, such cultures are often too expensive or difficult to collect, so an operational case definition, such as "any respiratory illness", is used. This leads to many misclassified cases and serious attenuation of efficacy and effectiveness estimates. A validation sample can be used to improve the attenuated estimates. We propose a new method of analysis for validation sets with time-to-event data in vaccine studies when the baseline hazards of both the illness of interest and similar, nonspecific illnesses are changing. We analyze data from an influenza vaccine field study with these methods.
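
The attenuation and its correction can be seen in a back-of-the-envelope calculation with assumed rates (the talk's method handles the harder time-to-event setting with changing baseline hazards):

```python
# Toy illustration: a nonspecific case definition attenuates vaccine
# efficacy; scaling each arm's case count by its confirmed fraction
# (estimable from a validation sample) recovers the truth.
import numpy as np

n_arm = 5000.0                           # person-seasons per arm (assumed)
true_ve = 0.60
rate_flu, rate_other = 0.04, 0.08        # influenza vs nonspecific illness rates

cases_plc = n_arm * (rate_flu + rate_other)
cases_vac = n_arm * ((1 - true_ve) * rate_flu + rate_other)
print("naive VE:    ", 1 - cases_vac / cases_plc)     # badly attenuated

# With a large validation sample, the culture-confirmed fraction of
# suspected cases in each arm estimates these proportions:
ppv_plc = rate_flu / (rate_flu + rate_other)
ppv_vac = (1 - true_ve) * rate_flu / ((1 - true_ve) * rate_flu + rate_other)
print("corrected VE:", 1 - (cases_vac * ppv_vac) / (cases_plc * ppv_plc))
```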

11:15 AM
12:00 PM
Joseph Hogan - Biomarker Evaluation and Analysis in a Causal Framework

Biomarkers can be used for several purposes, for example as surrogate markers of treatment effect or as inputs to a diagnostic algorithm. This talk will describe applications of causal modeling and inference for both settings, and highlight the role of potential outcomes for understanding properties of a biomarker.


First, we illustrate the use of instrumental variables and associated sensitivity analysis for estimating causal treatment effects of HAART from observational cohort studies. Our focus will be on transparent representation of underlying assumptions, and on the role of coherent sensitivity analyses to understand the effects of departures from those assumptions.


Second, we will describe the role of potential outcomes for assessing diagnostic utility of a continuous biomarker. An important measure of diagnostic utility is area under the ROC curve. The area represents P(X>Y), where X and Y are, respectively, randomly-drawn marker values from the 'case' and 'non-case' populations. In some observational studies, the 'case' and 'non-case' populations may be systematically different, and bias can be introduced by confounders. We propose a new definition for area under the ROC curve that is written in terms of potential outcomes, and appeals to a causal interpretation of diagnostic utility. Standard methods for causal inference can be used to estimate the area under the curve; the ideas are illustrated by examining the diagnostic utility of viral load and CD4 as markers for HIV-related mortality, using inverse probability weighting to adjust for potential confounders. We also make qualitative and quantitative comparisons to standard methods.
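
A minimal sketch of the weighting idea on simulated data (not the talk's estimator): estimate P(X > Y) with case and non-case marker values inversely weighted by their estimated probability of group membership given a confounder.

```python
# IPW-adjusted area under the ROC curve with a single confounder.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 4000
z = rng.normal(size=n)                               # confounder (e.g. age)
case = rng.random(n) < 1 / (1 + np.exp(-z))          # case status depends on z
marker = 0.8 * case + 0.9 * z + rng.normal(0, 1, n)  # marker also depends on z

ps = LogisticRegression().fit(z.reshape(-1, 1), case).predict_proba(
    z.reshape(-1, 1))[:, 1]                          # estimated P(case | z)
w = np.where(case, 1 / ps, 1 / (1 - ps))             # inverse probability weights

def auc(x, y, wx, wy):
    """Weighted P(X > Y) over all case/non-case pairs."""
    gt = x[:, None] > y[None, :]
    wts = wx[:, None] * wy[None, :]
    return (gt * wts).sum() / wts.sum()

x, wx = marker[case], w[case]
y, wy = marker[~case], w[~case]
print("naive AUC:   ", auc(x, y, np.ones_like(x), np.ones_like(y)))
print("IPW-adjusted:", auc(x, y, wx, wy))            # confounding by z reduced
```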

Wednesday, April 20, 2005
Time Session
01:00 PM
01:30 PM
Mark Rubin - Defining Aggressive Prostate Cancer Biomarkers using a Combination of High Throughput Technologies

Developing molecular tests to predict prostate cancer progression requires first defining a meaningful endpoint. There is controversy regarding the use of PSA or biochemical failure following prostatectomy or radiation therapy for clinically localized prostate cancer as a marker of progression. As a consequence, advances in prostate cancer biomarker development may require using population-based cohorts or cases from clinical trials to identify meaningful associations. Whereas the discovery of novel candidate biomarkers was slow 5-10 years ago and often resulted from serendipity, advances in high-throughput technologies have lead to the identification of a large number of candidate genes. Strategies to identify candidate genes include the use of novel software for genomic analysis. This presentation will provide an approach to validation of these candidate genes using tissue microarrays and other high throughput technologies. Since a critical factor in the evaluation of tissue markers is reproducibility, approaches to quantitative protein expression will be presented. The approaches presented here should be applicable to other tumor types and disease processes.

01:45 PM
02:15 PM
Bruce Trock - Surrogate Endpoint Biomarkers in Chemoprevention: Mathematical evaluation of feasibility

To establish efficacy of cancer chemoprevention agents using cancer incidence as the endpoint requires very large sample sizes (thousands) and long follow-up. Surrogate endpoint biomarkers (SEBs) are biomarkers of (presumably critical) intermediate steps in the carcinogenic pathway that may permit smaller and more rapid studies. If the chemopreventive agent modulates the SEB in a manner consistent with blocking or reducing progression to carcinogenesis it may be possible to infer the reduction in cancer risk attributable to the agent. However, if there is not a perfect one-to-one correspondence between the SEB and cancer then the SEB induces misclassification of the cancer outcome. The extent of bias in the SEB as a surrogate for cancer is measured by its sensitivity and specificity. This paper will show that the relative risk (RR) observed using the SEB as a surrogate for cancer can severely underestimate the true RR when specificity is less than perfect. Furthermore, if specificity in the group receiving the chemopreventive agent is less than that in the untreated group, the RR based on the SEB may even indicate that the agent increases cancer risk. The performance characteristics of SEBs as a function of sensitivity, specificity and cancer incidence will be explored, and criteria to determine if SEBs can realistically be used will be defined.
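
The direction and size of this bias follow from a short calculation; the numbers below are assumed for illustration.

```python
# Apparent relative risk when an SEB with imperfect sensitivity (se) and
# specificity (sp) stands in for the cancer endpoint.
def apparent_rr(true_rr, p0, se, sp_treated, sp_control):
    """P(SEB+) in an arm = se * P(cancer) + (1 - sp) * P(no cancer)."""
    p1 = true_rr * p0                        # cancer incidence, treated arm
    seb_t = se * p1 + (1 - sp_treated) * (1 - p1)
    seb_c = se * p0 + (1 - sp_control) * (1 - p0)
    return seb_t / seb_c

p0 = 0.05                                    # control-arm cancer incidence
# A true RR of 0.5 appears badly attenuated even with equal 95% specificity:
print(apparent_rr(0.5, p0, se=0.9, sp_treated=0.95, sp_control=0.95))  # ~0.77
# ...and the agent appears harmful when treated-arm specificity is lower:
print(apparent_rr(0.5, p0, se=0.9, sp_treated=0.90, sp_control=0.95))  # ~1.30
```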

02:45 PM
03:15 PM
Jeremy Taylor - Statistical Issues in Cancer Biomarker Assessment

Cancer biomarkers can be used in many different ways in cancer research. They can be used as surrogate endpoints or auxiliary variables to help assess new therapies. They can be used for risk stratification prior to deciding on therapy. A biomarker might suggest responsiveness to a particular biological agent, and would thus assist in individualizing therapy. Modern technologies, such as genomics and proteomics, are producing high-dimensional sets of biomarkers, which give rise to numerous complex statistical issues. A longitudinal series of a biomarker can be useful for early detection of disease or for monitoring disease progression after therapy.


There is a general feeling that combinations of biomarkers that measure different aspects of the underlying biology may be more useful than any single biomarker. This raises the statistical challenge of how to combine biomarkers. When using combinations of biomarkers to detect disease, it is frequently appropriate to assume that the probability of disease is a monotonic function of each biomarker. By incorporating this monotonicity into the analysis it may be possible to improve its efficiency. We consider the situation of two ordered categorical variables and a binary response, where the probability of response is assumed to be monotonic in each of the biomarkers. Two approaches are considered: a Bayesian one, in which the monotonicity is built into the prior distributions, and a second in which isotonic regression in two dimensions is used.


When using a biomarker as a surrogate endpoint in a clinical trial, it is well known that one requires more than a strong association between the biomarker and the true endpoint; one also needs the biomarker to explain the effect of the treatment on the true endpoint. Various measures of the proportion of treatment effect explained by the surrogate have been proposed. An alternative approach is to view the biomarker as an auxiliary variable, use it to predict the true endpoint, and then perform inference on the true endpoint. The problem is thus converted into one of missing data, for which there are various approaches. We have developed a multiple imputation approach, in which the true endpoint is imputed based on information in the auxiliary variable, the treatment group and possibly other prognostic factors. This approach generalizes to more complex situations such as multivariate biomarkers or longitudinally measured biomarkers. A more general approach is to formulate and estimate the joint distribution of the biomarker and the true endpoint; once this is achieved, measures such as the proportion explained and predictive distributions of true endpoint values follow naturally from the model.
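
As a small sketch of the auxiliary-variable route (illustrative; a proper implementation would also redraw the imputation-model parameters in each imputation), impute the missing true endpoint from the biomarker and arm, estimate the treatment effect in each completed dataset, and combine with Rubin's rules:

```python
# Multiple imputation of a binary true endpoint from an auxiliary biomarker.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(11)
n = 600
arm = rng.integers(0, 2, n)
marker = rng.normal(0, 1, n) + 0.5 * arm                  # auxiliary variable
true_y = (rng.random(n) < 1 / (1 + np.exp(-(1.2 * marker - 0.5)))).astype(int)
observed = rng.random(n) < 0.4                            # endpoint known for 40%

X = np.column_stack([marker, arm])
imp = LogisticRegression().fit(X[observed], true_y[observed])
p_miss = imp.predict_proba(X[~observed])[:, 1]

M = 20
effects, variances = [], []
for m in range(M):                                        # M completed datasets
    y = true_y.copy()
    y[~observed] = rng.random((~observed).sum()) < p_miss # stochastic imputation
    p1, p0 = y[arm == 1].mean(), y[arm == 0].mean()
    effects.append(p1 - p0)
    variances.append(p1 * (1 - p1) / (arm == 1).sum()
                     + p0 * (1 - p0) / (arm == 0).sum())

qbar = np.mean(effects)                                   # Rubin's rules
tvar = np.mean(variances) + (1 + 1 / M) * np.var(effects, ddof=1)
print(f"risk difference {qbar:.3f} (SE {np.sqrt(tvar):.3f})")
```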

03:30 PM
03:45 PM
Kevin Coombes - Day 3 Questions & Discussion

Day 3 Questions & Discussion with Kevin Coombes

03:45 PM
04:00 PM
Colin Begg - Day 3 Questions & Discussion

Day 3 Questions & Discussion with Colin Begg

Thursday, April 21, 2005
Time Session
08:30 AM
09:00 AM
Donna Ankerst - A Verification Bias Adjustment for Inferring Operating Characteristics of a Biomarker Used to Screen

Calculations of the operating characteristics of a biomarker for disease are subject to verification bias if disease status is only verified for individuals with biomarker values within a specified range, such as values greater than the "upper limit of normal". Such data predominate in prospective studies that employ a biomarker to screen, such as the Prostate Cancer Prevention Trial (PCPT), necessitating statistical methods that accommodate potential biomarker-based verification bias in order to utilize samples from these studies.


The PCPT randomized 18,882 men aged 55 or older with a normal digital rectal examination (DRE) and prostate-specific antigen (PSA) level less than or equal to 3 ng per milliliter (ng/mL) to either finasteride or placebo for seven years. A PSA measurement and DRE were performed annually. Whenever PSA exceeded 4 ng/mL or the DRE was positive, indicating suspicion of cancer, the participant was referred for biopsy. At the end of seven years, all individuals not previously diagnosed with cancer were asked to have an end-of-study biopsy. The aim of our correlative study was to derive the operating characteristics of PSA for biopsy-detectable prostate cancer using the seven-year screening histories and outcomes from the PCPT placebo arm. We walk through this case study, illustrating a Markov chain Monte Carlo algorithm to adjust for verification bias, and end with our conclusions concerning the operating characteristics of PSA and open questions for the design of future prospective screening studies.
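
The flavor of such an adjustment can be conveyed by a schematic data-augmentation loop on simulated data, far simpler than the talk's MCMC: alternately impute the unverified disease statuses from a risk model and refit the model, then read off the operating characteristics of the PSA > 4 rule.

```python
# Schematic verification-bias adjustment; all models and numbers assumed.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(13)
n = 20000
psa = rng.lognormal(0.5, 0.8, n)
p_dis = 1 / (1 + np.exp(-(-3.2 + 0.9 * np.log(psa))))  # true risk model
disease = rng.random(n) < p_dis
verified = psa > 4.0                     # biopsy only after a positive screen
X = np.log(psa).reshape(-1, 1)

d = disease.copy()
d[~verified] = False                     # crude starting values
for it in range(30):                     # stochastic-EM-style data augmentation
    model = LogisticRegression().fit(X, d)
    p = model.predict_proba(X[~verified])[:, 1]
    d[~verified] = rng.random((~verified).sum()) < p

sens = (verified & d).sum() / d.sum()    # P(PSA > 4 | disease), adjusted
true_sens = (verified & disease).sum() / disease.sum()
print("verified-only sensitivity: 1.00 by construction (the bias)")
print(f"adjusted {sens:.2f} vs truth {true_sens:.2f}")
```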

09:15 AM
09:45 AM
Steven Skates - Longitudinal Biomarkers in Detection of Ovarian Cancer in Asymptomatic Women

Detecting ovarian cancer in asymptomatic women through regular screening tests is an appealing approach to reducing mortality from this disease, given the large survival difference between early and late stage disease and the high proportion of cases detected at late stage (80%) under usual care. However, due to the low incidence of the disease, ovarian cancer screening is a delicate balance between detecting as many cancers as possible and limiting the number of false positive results per true positive. As the bar is lowered for declaring a test positive, the proportion of cancers detected usually increases; however, the number of false positives per cancer detected also increases. The definitive diagnosis of ovarian cancer requires invasive pelvic surgery. To be considered acceptable, a screening method must find at least one ovarian cancer per ten screen-related surgeries and detect at least 70% of the ovarian cancers.


Prospective clinical screening trials with the blood test CA125, followed by ultrasound for CA125 elevated above a fixed cutpoint, resulted in a positive predictive value (# cancers at surgery / # surgeries), or PPV, exceeding 20%, with 70% of ovarian cancers screen-detected, demonstrating that this screening method is acceptable. However, only 40% of screen-detected cancers were found at an early stage. While this result doubled the percentage found at an early stage under usual care, a greater increase was required before the impact on mortality would be substantial. A method was required for increasing sensitivity while maintaining a sufficiently high PPV. Retrospective analysis of longitudinal CA125 values indicated that CA125 rose exponentially above an individual's baseline level prior to diagnosis of ovarian cancer, while in most other women CA125 fluctuated around a baseline level. Incorporating this differential CA125 behavior into the screening decision for referral to ultrasound would potentially allow greater sensitivity (detecting a rise above baseline before CA125 exceeds the fixed cutpoint) while maintaining specificity (ruling out subjects with elevated yet stable CA125 levels). Modeling the longitudinal CA125 values in cases with a hierarchical longitudinal change-point model, and the CA125 in other women with a hierarchical longitudinal model, provided the basis for assessing referral to ultrasound with the Bayes factor calculated for subjects with new CA125 values. This approach has been used in a prospective randomized ovarian cancer screening trial in the UK and will be discussed at the workshop.
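
A schematic single-subject version of the referral rule, with assumed models and values (the actual algorithm is hierarchical, pooling information across women): compare a flat-baseline model against a baseline-plus-exponential-rise model for one CA125 series and approximate the Bayes factor from BIC.

```python
# Flat vs change-point rise for one woman's CA125 series (illustrative).
import numpy as np
from scipy.optimize import minimize

t = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 4.5, 5.0])        # years on study
ca125 = np.array([14.0, 16.0, 15.0, 14.0, 22.0, 31.0, 48.0])
y = np.log(ca125)

def nll_flat(theta):                     # stable baseline model
    mu, log_s = theta
    s = np.exp(log_s)
    return 0.5 * np.sum(((y - mu) / s) ** 2) + y.size * np.log(s)

def nll_rise(theta):                     # baseline, then log-linear rise
    mu, tau, log_g, log_s = theta        # change point tau, growth rate g
    g, s = np.exp(log_g), np.exp(log_s)
    m = mu + g * np.maximum(t - tau, 0.0)
    return 0.5 * np.sum(((y - m) / s) ** 2) + y.size * np.log(s)

f0 = minimize(nll_flat, [y.mean(), 0.0], method="Nelder-Mead")
f1 = minimize(nll_rise, [y.mean(), 3.0, 0.0, 0.0], method="Nelder-Mead")
bic0 = 2 * f0.fun + 2 * np.log(y.size)
bic1 = 2 * f1.fun + 4 * np.log(y.size)
print("approx log Bayes factor (rise vs flat):", 0.5 * (bic0 - bic1))
# A large value would trigger referral to ultrasound.
```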

10:15 AM
10:45 AM
Steve Horvath - Improving Tumor Marker Validation Success Using Random Forest Clustering and Gene Co-expression Network Methods

Molecular data are widely used to screen for biomarkers with prognostic significance for clinical outcomes; e.g., gene expression data or immunohistochemical staining data may be used to screen for biomarkers that could predict post-operative survival time. A challenge is that such candidate biomarkers sometimes cannot be validated in independent data sets. Here we will describe two different approaches that we have found useful for identifying biomarkers with an increased chance of being validated.


The first approach is based on weighted gene co-expression network analysis. A clustering method is used to identify prognostic gene modules, i.e. sets of tightly co-expressed genes. Using brain cancer microarray data, we will show that highly connected prognostic 'hub' genes in these modules have a substantially increased likelihood of being validated. The second approach appears quite different: first, it uses random forest clustering to identify high-risk patient clusters; second, a biomarker-based threshold rule is derived for predicting cluster membership. Using prostate cancer data, we will provide empirical evidence that these rules can be validated, while traditional approaches may lead to candidate biomarkers that cannot be.


There appears to be a mathematical and biological connection between these two approaches. Both rely on clustering as an essential pre-processing step to identify "prognostic" clusters. The clusters correspond to global patterns that are more likely to be found in independent data sets as well. We provide empirical evidence that biomarker screening procedures based on prognostic clusters have an increased chance of validation success.
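
A minimal sketch of the connectivity calculation behind the first approach (simulated data; not the WGCNA software): soft-threshold the gene-gene correlations and rank module genes by connectivity to nominate hub candidates.

```python
# Soft-thresholded co-expression connectivity on simulated expression data.
import numpy as np

rng = np.random.default_rng(17)
n_samples, n_genes = 60, 200
latent = rng.normal(size=n_samples)               # shared module signal
expr = rng.normal(size=(n_samples, n_genes))
expr[:, :40] += np.outer(latent, rng.uniform(0.5, 1.5, 40))  # genes 0-39: module

corr = np.corrcoef(expr, rowvar=False)            # gene-gene correlations
beta = 6                                          # soft-threshold power (assumed)
adjacency = np.abs(corr) ** beta
np.fill_diagonal(adjacency, 0.0)
connectivity = adjacency.sum(axis=1)              # whole-network connectivity

hubs = np.argsort(connectivity)[::-1][:10]
print("top hub genes:", hubs)                     # mostly indices below 40
```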


Acknowledgement: The gene co-expression network part was done in collaboration with Bin Zhang, Paul Mischel, and Stan Nelson. The random forest part was done in collaboration with Tao Shi, Siavash Kurdistani, and David Seligson.

11:00 AM
11:15 AM
Elizabeth Slate - Day 4 Questions & Discussion

Day 4 Questions & Discussion with Elizabeth Slate

11:15 AM
12:15 PM
Alexander Tsodikov - Day 4 Questions & Discussion

Day 4 Questions & Discussion with Alexander Tsodikov

01:00 PM
01:30 PM
John Semmes - Tackling Cancer Diagnostic Needs with Clinical Proteomics

Our group utilizes a variety of proteomic approaches to biomarker discovery for the early detection of cancer. I will discuss our current mass spectrometry-based studies and their application to solid-tissue cancers such as prostate and head and neck. In addition, recent studies examining serum from patients infected with human T-cell leukemia virus type 1 will be presented, with emphasis on the utility of the expressed biomarkers in discriminating adult T-cell leukemia, HAM/TSP and asymptomatic infected individuals.

01:45 PM
02:15 PM
Comparative Profiling of Complex Protein Mixtures with Peptide Arrays Generated from LC-MS Mass Spectrometry

Advances in mass spectrometry (MS) instrumentation and liquid chromatography (LC), together with maturing protein databases, are driving many advances in the field of proteomics. Among the potential uses of this technology is the identification of predictive protein biological markers, or biomarkers, that can differentiate two or more groups of complex biological samples. Despite its proteome-wide potential, few clinically relevant discoveries have come from these technologies when applied to complex protein mixtures, such as serum or tissue, that are characterized by high complexity and dynamic range. Current approaches to profiling proteins are dominated by the use of MALDI or LC-MS/MS mass spectrometry, and both approaches have difficulties in practice: MALDI can identify a large number of "peaks", but identification (sequencing) of low-abundance features can be difficult, while MS/MS lacks sensitivity and has poor reproducibility and low protein coverage due to its data-dependent sampling. It has been our hypothesis that greater efficiency of protein/peptide profiling could be obtained by more efficient use of high-resolution LC-MS instrumentation where, as in MALDI approaches, differential peptides are first identified from the list of potential precursor ions (LC-MS), and then only those differential peptides are sequenced in subsequent LC-MS/MS measurements. To evaluate this hypothesis, our group has developed a suite of software algorithms that produce a peptide array from a sequence of LC-MS measurements; the peptide array can be evaluated in much the same way as a transcript array, with members identified by their accurate mass and time tags. Production of the peptide array requires substantial signal (image) processing, image alignment, and specialized normalization routines. We demonstrate that we can identify and compare hundreds or thousands of peptides and proteins across multiple replicates of biological samples. The algorithms will be demonstrated using data from increasingly complex biological samples: bacteria, yeast, and human serum.
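
A toy version of the accurate-mass-and-time matching that underlies such a peptide array (simulated feature lists; a real pipeline adds the signal processing, alignment and normalization described above):

```python
# Match LC-MS features across runs by mass (ppm) and retention-time windows.
import numpy as np

def match(reference, run, mass_ppm=10.0, rt_tol=0.5):
    """For each reference (mass, rt), return the intensity of a run feature
    within tolerance, or NaN if none matches."""
    out = np.full(len(reference), np.nan)
    for i, (m, rt) in enumerate(reference):
        dm = np.abs(run[:, 0] - m) / m * 1e6       # mass error in ppm
        drt = np.abs(run[:, 1] - rt)               # retention-time error (min)
        ok = (dm < mass_ppm) & (drt < rt_tol)
        if ok.any():
            out[i] = run[ok, 2].max()              # keep the strongest match
    return out

rng = np.random.default_rng(19)
ref = np.column_stack([rng.uniform(800, 3000, 50),    # monoisotopic masses
                       rng.uniform(10, 90, 50)])      # retention times
runs = [np.column_stack([ref[:, 0] * (1 + rng.normal(0, 3e-6, 50)),
                         ref[:, 1] + rng.normal(0, 0.1, 50),
                         rng.lognormal(10, 1, 50)]) for _ in range(4)]

array = np.column_stack([match(ref, r) for r in runs])  # peptides x runs
print("matched fraction:", np.isfinite(array).mean())
```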

02:30 PM
03:00 PM
Eleftherios Diamandis - Strategies for Discovering New Cancer Biomarkers: Opportunities and Pitfalls

Strategies for Discovering New Cancer Biomarkers: Opportunities and Pitfalls

03:15 PM
03:30 PM
Rick Higgs - A Comprehensive Label-free Method for the Relative Quantification of Proteins from Biological Samples

Global proteomics measurements are rapidly being developed to identify biomarkers for drug development applications. A major challenge with this strategy is the analysis of the raw data generated by high throughput HPLC-MS/MS experiments of protein digests from complex biological samples. This presentation will focus on a computational pipeline to automatically process HPLC-MS/MS data including: estimation of peptide charge and mass, noise filtering of MS/MS spectra, and peptide identification. Following this pre-processing of individual study samples we describe methods for chromatographic alignment and label-free relative quantification using integrated ion current of peptides from all samples in a biomarker study. Results from a rat serum variability study will be used to demonstrate how the method can be applied to biomarker discovery.

04:15 PM
04:45 PM
Kerry Bemis - Statistical Issues with LC/MS Proteomics for Biomarker Discovery Examined Using Data from Vinblastine Resistant and Sensitive Ovarian Cancer Cells

The difficult issues for the statistician designing and analyzing proteomic studies are similar to the issues with genomic studies. I will discuss the following:



  1. Normalization: controlling the systematic biases affecting all proteins in a sample.

  2. Control of the number of false positives by estimating the False Discovery Rate instead of the False Positive Rate (see the sketch after this list).

  3. Sample size calculation for controlling the False Discovery Rate.

  4. Visualizing the significant results when you have thousands of proteins to examine.


These ideas will be examined using an experiment on vinblastine-resistant and vinblastine-sensitive ovarian cancer cell lines. Lessons learned may facilitate discovery of biomarkers for vinblastine resistance.
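
For item 2, a compact Benjamini-Hochberg sketch on simulated p-values (the standard step-up FDR procedure; the specific choice of procedure here is ours, not necessarily the speaker's):

```python
# Benjamini-Hochberg: the largest k with p_(k) <= k*q/m defines the cutoff.
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    p = np.sort(np.asarray(pvals))
    m = p.size
    below = p <= q * np.arange(1, m + 1) / m
    if not below.any():
        return 0.0                        # nothing declared significant
    k = np.nonzero(below)[0].max() + 1
    return p[k - 1]                       # p-value cutoff

rng = np.random.default_rng(23)
p_null = rng.uniform(size=900)            # unchanged proteins
p_alt = rng.beta(0.1, 10.0, size=100)     # differentially abundant proteins
pvals = np.concatenate([p_null, p_alt])
cut = benjamini_hochberg(pvals, q=0.05)
print(f"cutoff {cut:.4f}, {np.sum(pvals <= cut)} proteins called significant")
```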

05:00 PM
05:15 PM
Keith Baggerly - Day 4 Questions & Discussion

Day 4 Questions & Discussion with Keith Baggerly

05:15 PM
06:15 PM
Zhen Zhang - Day 4 Questions & Discussion

Day 4 Questions & Discussion with Zhen Zhang

Friday, April 22, 2005
Time Session
01:00 PM
01:30 PM
Debashis Ghosh - Combining Genomic Data in Cancer Microarray Experiments

With the advent of new high-throughput molecular technologies, consideration of high-dimensional data is becoming more common. A major role for statisticians to play in the future of this area of bioinformatics is combining genomic data from different sources. In this talk, we will discuss two examples of such analyses. The first is combining gene expression datasets from multiple cancer studies. The second is using gene expression data to infer chromosomal alterations.

01:45 PM
02:15 PM
Adam Olshen - Some Statistical Issues in the Analysis of Array CGH Data

Cancer progression often involves alterations in DNA sequence copy number. Multiple microarray platforms now facilitate high-resolution copy number assessment of entire genomes in single experiments. This technology is generally referred to as array comparative genomic hybridization (array CGH). In my talk, I will discuss issues that have arisen in the analysis of array CGH data. Topics will include pre-processing and normalization, identification of regions of abnormal copy number, and determination as to whether copy number abnormalities can be seen in gene expression data. Our method of identifying abnormal copy number, which we call circular binary segmentation (CBS), will be introduced.
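
A bare-bones recursive segmentation sketch on simulated log-ratios (CBS itself uses a circular, permutation-calibrated statistic): split at the location maximizing a t-like statistic and recurse while the best split is strong enough.

```python
# Recursive binary segmentation of a copy-number log-ratio profile.
import numpy as np

def best_split(x, min_seg=5):
    best_t, best_i = 0.0, None
    for i in range(min_seg, x.size - min_seg):
        a, b = x[:i], x[i:]
        se = np.sqrt(x.var(ddof=1) * (1 / a.size + 1 / b.size))
        t = abs(a.mean() - b.mean()) / se
        if t > best_t:
            best_t, best_i = t, i
    return best_t, best_i

def segment(x, offset=0, thresh=5.0, out=None):
    out = [] if out is None else out
    t, i = best_split(x)
    if i is None or t < thresh:                      # stop: no strong split
        out.append((offset, offset + x.size, x.mean()))
    else:
        segment(x[:i], offset, thresh, out)
        segment(x[i:], offset + i, thresh, out)
    return out

rng = np.random.default_rng(29)
profile = np.concatenate([rng.normal(0.0, 0.2, 80),   # normal copy number
                          rng.normal(0.6, 0.2, 40),   # gain
                          rng.normal(-0.8, 0.2, 60)]) # loss
for start, end, mean in segment(profile):
    print(f"clones {start:3d}-{end:3d}: mean log-ratio {mean:+.2f}")
```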


This is joint work with E.S. Venkatraman.

02:30 PM
03:00 PM
Jane Fridlyand - Application of Array CGH to the Analysis of Cancer Data

The development of solid tumors is associated with acquisition of complex genetic alterations, indicating that failures in the mechanisms that maintain the integrity of the genome contribute to tumor evolution. Thus, one expects that the particular types of genomic derangement seen in tumors reflect underlying failures in the maintenance of genetic stability, as well as selection for changes that provide a growth advantage. To investigate genomic alterations we are using microarray-based comparative genomic hybridization (array CGH). The computational task is to map and characterize the number and types of copy number alterations present in the tumors, so as to define copy number phenotypes and to associate them with known biological markers and with gene expression data. We discuss general analytical and visualization approaches applicable to array CGH data. We also use an unsupervised hidden Markov model approach to exploit the spatial coherence between nearby clones. The clones are partitioned into states representing the underlying copy number of groups of clones. The output of the algorithm serves as input to higher-level analyses such as testing and classification. We will also discuss some preliminary results on joint analysis of copy number and gene expression data. The methods are demonstrated on simulated data as well as cell line and clinical tumor datasets.
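
A small sketch of the decoding step with fixed, assumed parameters (the talk's unsupervised HMM also estimates these from the data): Viterbi-decode each clone's underlying state (loss, normal, gain) from noisy log-ratios.

```python
# Viterbi decoding of copy-number states along a chromosome.
import numpy as np
from scipy.stats import norm

means, sd = np.array([-0.7, 0.0, 0.6]), 0.25          # emission means (assumed)
logP = np.log(np.array([[0.98, 0.01, 0.01],
                        [0.01, 0.98, 0.01],           # states persist along genome
                        [0.01, 0.01, 0.98]]))
logpi = np.log(np.array([0.1, 0.8, 0.1]))

def viterbi(x):
    n, k = x.size, means.size
    emit = norm.logpdf(x[:, None], means[None, :], sd)  # (n, k) log-likelihoods
    score = logpi + emit[0]
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + logP                    # (from-state, to-state)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emit[t]
    path = np.empty(n, dtype=int)
    path[-1] = score.argmax()
    for t in range(n - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

rng = np.random.default_rng(31)
truth = np.repeat([1, 2, 1, 0], [50, 30, 40, 30])       # normal, gain, normal, loss
x = rng.normal(means[truth], sd)
print("decoding accuracy:", (viterbi(x) == truth).mean())
```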

Name Affiliation
Ankerst, Donna ankerst@slcmsr.org Public Health Sciences Division, Fred Hutchinson Cancer Research Center
Baggerly, Keith kabagg@mdanderson.org MD Anderson Cancer Center, University of Texas M. D. Anderson Cancer Center
Ballman, Karla ballman@mayo.edu Department of Biostatistics, Mayo Clinic College of Medicine
Begg, Colin beggc@mskcc.org Department of Epidemiology & Biostatistics, Memorial Sloan-Kettering Cancer Center
Bemis, Kerry kbemis@indianacaps.com Department of Statistics, Indiana Centers for Applied Protein Sciences
Best, Janet jbest@mbi.osu.edu Mathematics, The Ohio State University
Birkner, Merrill mbirkner@berkeley.edu Biostatistics - School of Public Health, University of California, Berkeley
Borisyuk, Alla borisyuk@mbi.osu.edu Mathematical Biosciences Institute, The Ohio State University
Buechler, Steven buechler.1@nd.edu Department of Mathematics, University of Notre Dame
Coombes, Kevin krc@odin.mdacc.tmc.edu Biostatistics - MD Anderson Cancer Center, University of Texas M. D. Anderson Cancer Center
Cooper, Kristi cooperkl@umich.edu Biostatistics - School of Public Health II, University of Michigan
Craciun, Gheorghe craciun@math.wisc.edu Department of Mathematics, University of Wisconsin-Madison
Day, Roger day@upci.pitt.edu Department of Biostatistics, University of Pittsburgh
Deeks, Steve sdeeks@php.ucsf.edu San Francisco General Hospital, University of California, San Francisco
DeGruttola, Victor victor@sdac.harvard.edu Department of Biostatistics, Harvard Medical School
Demichelis, Francesca michelis@itc.it Pathology Department, Brigham and Women's Hospital
Diamandis, Eleftherios ediamandis@mtsinai.on.ca Section of Clinical Biochemistry , University of Toronto
Dodd, Lori doddl@mail.nih.gov Biometric Research Branch, National Institutes of Health
Dougherty, Daniel dpdoughe@mbi.osu.edu Mathematical Biosciences Institute, The Ohio State University
Fridlyand, Jane janef@stat.berkeley.edu Center for Bioinformatics & Molecular Biostatistics, University of California, San Francisco
Ghosh, Debashis ghoshd@umich.edu Biostatistics - School of Public Health, University of Michigan
Gimotty, Phyllis pgimotty@cceb.upenn.edu Biostatistics and Epidemiology, University of Pennsylvania
Goel, Pranay goelpra@helix.nih.gov NIDDK, Indian Institute of Science Education and Research
Guo, Yixin yixin@math.drexel.edu Department of Mathematics, Drexel University
Hall, David dhall@rdg.boehringer-ingelheim.com Biometrics and Data Management, Boehringer Ingelheim
Halloran, Mary Elizabeth mehallo@sph.emory.edu Biostatistics - Rollins School of Public Health, Emory University
Higgs, Rick higgs@lilly.com Genomic & Molecular Informatics, Eli Lilly
Hilsenbeck, Susan kkennedy@breastcenter.tmc.edu Breast Center, Baylor College of Medicine
Hogan, Joseph jhogan@stat.brown.edu Community Health - Center for Statistical Sciences, Brown University
Horvath, Steve SHorvath@mednet.ucla.edu Human Genetics and Biostatistics, David Geffen School of Medicine
Hu, Chengcheng chu@sdac.harvard.edu Department of Biostatistics, Harvard Medical School
Huang, Yangxin yhuang@bst.rochester.edu Biostatistics & Computational Biology, University of Rochester
Kannan, Dan kannan@uga.edu Department of Mathematics, University of Georgia
Kattan, Michael kattanm@ccf.org Biostatistics and Epidemiology, Cleveland Clinic Foundation
Lee, Jack jjlee@mdanderson.org Biostatistics - MD Anderson Cancer Center, University of Texas M. D. Anderson Cancer Center
Lim, Sookkyung limsk@math.uc.edu Department of Mathematical Sciences, University of Cincinnati
Lin, Shili lin.328@osu.edu Department of Statistics, The Ohio State University
Liu, Dacheng dliu@bst.rochester.edu Biostatistics and Computational Biology, University of Rochester
McCulloch, Charles chuck@biostat.ucsf.edu Epidemiology and Biostatistics, University of California, San Francisco
Melfi, Vincent melfi@mbi.osu.edu Mathematics, Michigan State University
Merrill, Stephen stephen.merrill@marquette.edu Mathematics, Statistics, & Computer Science, Marquette University
Molinaro, Annette molinaran@mail.nih.gov Division of Cancer Epidemiology & Genetics, National Institutes of Health
Morris, Jeff jeffmo@odin.mdacc.tmc.edu Biostats/Applied Math - MD Anderson Cancer Center, University of Texas M. D. Anderson Cancer Center
Niyikiza, Clet clet@lilly.com Oncology Pharmacogenomics, Eli Lilly
Normolle, Daniel monk@umich.edu Cancer Center Biostatistics Unit, University of Michigan
Oberg, Ann oberg.ann@mayo.edu Biostatistics - Cancer Center, Mayo Clinic and Foundation
Olshen, Adam olshena@mskcc.org Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center
Perelson, Alan asp@lanl.gov Los Alamos National Laboratory
Petersen, Maya mayaliv@socrates.berkeley.edu Biostatistics - School of Public Health, University of California, Berkeley
Pol, Diego dpol@mbi.osu.edu Independent Researcher, Museo Paleontologico E. Feruglio
Rajicic, Natasa Biostatistics Center, Dana-Farber Cancer Institute
Rassoul-Agha, Firas firas@math.ohio-state.edu Department of Mathematics, University of Utah
Rejniak, Katarzyna rejniak@mbi.osu.edu Mathematical Biosciences Institute, The Ohio State University
Rubin, Mark scurtis1@partners.org Brigham and Women's Hospital
Segal, Mark mark@biostat.ucsf.edu Department of Biostatistics, University of California, San Francisco
Semmes, John brassidn@evmsmail.evms.edu Microbiology and Molecular Cell Biology, Eastern Virginia Medical School
Shen, Ronglai rlshen@med.umich.edu Department of Biostatistics, University of Michigan
Shepherd, Bryan bshepher@u.washington.edu Department of Biostatistics, University of Washington
Shyr, Yu lisa.herrold@vanderbilt.edu Department of Biostatistics, Vanderbilt University
Siegmund, Kimberly kims@usc.edu Preventive Medicine - Keck School of Medicine, University of Southern California
Skates, Steven sskates@partners.org MGH Biostatistics Center
Slate, Elizabeth slateeh@musc.edu Department of Biostatistics, Medical University of South Carolina
Srivastava, Sudhir srivasts@mail.nih.gov Cancer Biomarkers Research Group, National Cancer Institute
Stubna, Michael stubna@mbi.osu.edu Engineering Team Leader, Pulsar Informatics
Su, Li lisu@stat.brown.edu Center for Statistical Sciences, Brown University
Sun, Junfeng sun@stat.ohio-state.edu Department of Statistics, The Ohio State University
Taylor, Jeremy jmgt@umich.edu Biostatistics - School of Public Health, University of Michigan
Terman, David terman@math.ohio-state.edu Mathematics Department, The Ohio State University
Thiebaut, Rodolphe rodolphe.thiebaut@isped.u-bordeaux2.fr Biostatistics - ISPED, Université de Bordeaux 2
Thurston, Sally thurston@bst.rochester.edu Biostatistics and Computational Biology, University of Rochester
Tian, Jianjun Paul tianjj@mbi.osu.edu Mathematics, College of William and Mary
Trock, Bruce btrock@jhmi.edu Epidemiology and Oncology, Johns Hopkins University
Tsodikov, Alexander atsodikov@ucdavis.edu Biostatistics - Dept. of Public Health Services, University of California, Davis
van der Laan, Mark laan@stat.berkeley.edu Biostatistics - School of Public Health, University of California, Berkeley
Verducci, Joseph verducci.1@osu.edu Department of Statistics, The Ohio State University
Wang, Zailong zlwang@mbi.osu.edu Integrated Information Sciences, Novartis
Wechselberger, Martin wm@mbi.osu.edu Mathematical Biosciences Institute, The Ohio State University
Wu, Hulin hwu@bst.rochester.edu Biostatistics and Computational Biology, University of Rochester
Wu, Zhijun zhijun@iastate.edu Math, Bioinformatics, & Computational Biology, Iowa State University
Xu, Haiyan haiyan@stat.ohio-state.edu Department of Statistics, The Ohio State University
Yang, Yang yangyang@sdac.harvard.edu School of Public Health, Harvard University
Yuan, Zheng yuanz@umich.edu Department of Biostatistics, University of Michigan
Zhang, Zhen zzhang7@jhmi.edu Pathology - Center for Biomarker Discovery, Johns Hopkins University
Zhou, Jin jzhou@mbi.osu.edu Department of Mathematics, Northern Michigan University
A Verification Bias Adjustment for Inferring Operating Characteristics of a Biomarker Used to Screen

Calculations of the operating characteristics of a biomarker for disease are subject to verification bias if the disease status is only verified for individuals with biomarkers within a specified-range, such as values greater than what is considered the "upper limit of normal". Such types of data predominate in prospective studies that employ a biomarker to screen, such as in the Prostate Cancer Prevention Trial (PCPT), necessitating statistical methods to accommodate potential biomarker-based verification bias for utilizing samples from these studies.


The PCPT randomized 18,882 men aged 55 or older with a normal digital rectal examination (DRE) and prostate-specific antigen (PSA) level less than or equal to 3 ng per milliliter (ng/mL) to either finasteride or placebo for seven years. A PSA and DRE were performed annually. Whenever PSA exceeded 4 ng/mL or the DRE was positive indicating suspicion of cancer, the participant was referred to biopsy. At the end of seven years all individuals not previously diagnosed with cancer were requested to have an end-of-study biopsy. The aim of our correlative study was to derive the operating characteristics of PSA for biopsy-detectable prostate cancer using the seven year screening histories and outcomes from the PCPT placebo arm. We walk through this case study, illustrating a Markov Chain Monte Carlo algorithm to adjust for verification bias, and ending with our conclusions concerning the operating characteristics of PSA and open questions for the design of future prospective screening studies.

Day 4 Questions & Discussion

Day 4 Questions & Discussion with Keith Baggerly

Day 3 Questions & Discussion

Day 3 Questions & Discussion with Colin Begg

Statistical Issues with LC/MS Proteomics for Biomarker Discovery Examined Using Data from Vinblastine Resistant and Sensitive Ovarian Cancer Cells

The difficult issues for the statistician designing and analyzing proteomic studies are similiar to the issues with genomic studies. I will discuss the following:



  1. Normalization: controlling the systematic biases affecting all proteins in a sample.

  2. Control of the number of false positives by estimating the False Discovery Rate instead of the False Positive Rate.

  3. Sample size calculation for controlling the False Discovery Rate.

  4. Visualizing the significant results when you have thousands of proteins to examine


These ideas will be examined using an experiment on vinblastine resistant and sensitive ovarian cancer cell lines. Lessons learned may facilitate discovery of biomarkers for vinblastine resistance.

Day 3 Questions & Discussion

Day 3 Questions & Discussion with Kevin Coombes

Pathogenesis of Drug-Resistant HIV: Implications for Novel Treatment Strategies

Many patients treated with combination antiretroviral therapy fail to achieve complete viral suppression. Optimizing individual treatment strategies requires an understanding of the complex relationship between replication of drug-resistant virus and the host response. In particular, the distinction between persistent drug activity, alterations in replicative capacity ("fitness") and the ability of a newly emergent variant to cause disease ("virulence") may prove to be important in designing long-term therapeutic strategies. These issues will likely become even more relevant with entry inhibitors, where drug-pressure may select for X4 variants that may be less fit but more virulent. To address these issue we have performed a series of studies focusing on the determinants of disease outcome in patients with drug-resistant viremia, and have observed the following: (1) HIV is often constrained in its ability to develop high-level drug resistance while maintaining replicative capacity, (2) immune activation is reduced in patients with drug-resistant HIV (after controlling for the level of viremia) and (3) patients who durably control HIV replication despite the presence of drug-resistance exhibit immunologic characteristics comparable to that observed in long-term non-progressors (e.g, low levels of T cell proliferation and activation and preserved HIV-specific IL-2 and gamma-interferon-high producing CD4+ T cells). We have initiated a number of interventional studies based on the hypothesis that drug-mediated alterations in HIV fitness/virulence may be clinically useful in patients with limited therapeutic options.


Supported by NIAID (AI052745,AI055273), the UCSF/Gladstone CFAR (P30 MH59037), the California AIDS Research Center (CC99-SF, ID01-SF-049) and the SFGH GCRC (5-MO1-RR00083-37).

Joint Modeling of Progression of HIV Resistance Mutations Measured with Uncertainty and Time to Virological Failure

Development of HIV resistance mutations is a major cause for failure of antiretroviral treatment. This article proposes a method for jointly modeling the processes of viral genetic changes and treatment failure. Because the viral genome is measured with uncertainty, a hidden markov model is used to fit the viral genetic process. The uncertain viral genotype is included as a time-dependent covariate in a Cox model for failure time, and an EM algorithm is used to estimate the model parameters. This model allows simultaneous evaluation of the sequencing uncertainty and the effect of resistance mutation on the risk of virological failure. The method is then applied to data collected in three phase II clinical trials testing antiretroviral treatments containing the drug efavirenz. Various model checking tests are provided to assess the appropriateness of the model.

Strategies for Discovering New Cancer Biomarkers: Opportunities and Pitfalls

Strategies for Discovering New Cancer Biomarkers: Opportunities and Pitfalls

Application of Array CGH to the Analysis of Cancer Data

The development of solid tumors is associated with acquisition of complex genetic alterations, indicating that failures in the mechanisms that maintain the integrity of the genome contribute to tumor evolution. Thus, one expects that the particular types of genomic derangement seen in tumors reflect underlying failures in maintenance of genetic stability, as well as selection for changes that provide growth advantage. In order to investigate genomic alterations we are using microarray-based comparative genomic hybridization (array CGH). The computational task is to map and characterize the number and types of copy number alterations present in the tumors, and so define copy number phenotypes as well as to associate them with known biological markers and with gene expression data. We discuss general analytical and visualization approaches applicable to the array CGH data. We also use unsupervised Hidden Markov Models approach to utilize the spatial coherence between nearby clones. The clones are partitioned into the states which represent underlying copy number of the group of clones. The output of the algorithm is given as an input to higher-level analyses such as testing and classification. We will also discuss some preliminary results on joint analysis of the copy number and gene expression data. The methods are demonstrated on simulated data as well as cell line and clinical tumor datasets.

Combining Genomic Data in Cancer Microarray Experiments

With the advent of new high-throughput molecular technologies, consideration of high-dimensional data is becoming more common. A major role for statisticians to play in the future of this area of bioinformatics is combining genomic data from different sources. In this talk, we will discuss two examples of such analyses. The first is combining gene expression datasets from multiple cancer studies. The second is using gene expression data to infer chromosomal alterations.

Using Validation Sets for Outcomes with Time-to-event Data in Vaccine Studies

In many vaccine studies, confirmatory diagnosis of a suspected case is made by doing a culture to confirm that the infectious agent of interest is present. However, often such cultures are too expensive or difficult to collect, so that an operational case definition, such as ``any respiratory illness'', is used. This leads to many misclassified cases and serious attenuation of efficacy and effectiveness estimates. A validation sample can be used to improve the attenuated estimates. We propose a new method of analysis for validation sets with time-to-event in vaccine studies when the baseline hazards of both the illness of interest and similar, nonspecific illnesses are changing. We analyze data from an influenza vaccine field study with these methods.

A Comprehensive Label-free Method for the Relative Quantification of Proteins from Biological Samples

Global proteomics measurements are rapidly being developed to identify biomarkers for drug development applications. A major challenge with this strategy is the analysis of the raw data generated by high throughput HPLC-MS/MS experiments of protein digests from complex biological samples. This presentation will focus on a computational pipeline to automatically process HPLC-MS/MS data including: estimation of peptide charge and mass, noise filtering of MS/MS spectra, and peptide identification. Following this pre-processing of individual study samples we describe methods for chromatographic alignment and label-free relative quantification using integrated ion current of peptides from all samples in a biomarker study. Results from a rat serum variability study will be used to demonstrate how the method can be applied to biomarker discovery.

Biomarker Evaluation and Analysis in a Causal Framework

Biomarkers can be used for several purposes, for example as surrogate markers of treatment effect or as inputs to a diagnostic algorithm. This talk will describe applications of causal modeling and inference for both settings, and highlight the role of potential outcomes for understanding properties of a biomarker.


First, we illustrate the use of instrumental variables and associated sensitivity analysis for estimating causal treatment effects of HAART from observational cohort studies. Our focus will be on transparent representation of underlying assumptions, and on the role of coherent sensitivity analyses to understand the effects of departures from those assumptions.


Second, we will describe the role of potential outcomes for assessing the diagnostic utility of a continuous biomarker. An important measure of diagnostic utility is the area under the ROC curve. The area represents P(X>Y), where X and Y are, respectively, randomly drawn marker values from the 'case' and 'non-case' populations. In some observational studies, the 'case' and 'non-case' populations may be systematically different, and bias can be introduced by confounders. We propose a new definition for the area under the ROC curve that is written in terms of potential outcomes and appeals to a causal interpretation of diagnostic utility. Standard methods for causal inference can be used to estimate the area under the curve; the ideas are illustrated by examining the diagnostic utility of viral load and CD4 as markers for HIV-related mortality, using inverse probability weighting to adjust for potential confounders. We also make qualitative and quantitative comparisons to standard methods.
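A minimal sketch of how such a weighted estimator could look, assuming a single measured confounder and logistic-regression propensity scores; this illustrates the inverse-probability-weighting idea, not the exact estimator of the talk.

    # Sketch: inverse-probability-weighted estimate of AUC = P(X > Y) when
    # case status is confounded by a covariate z. Data are simulated.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def ipw_auc(marker, case, weights):
        """Weighted estimate of P(X > Y) with ties counted as 1/2."""
        x, wx = marker[case == 1], weights[case == 1]
        y, wy = marker[case == 0], weights[case == 0]
        diff = x[:, None] - y[None, :]
        concord = (diff > 0) + 0.5 * (diff == 0)
        w = wx[:, None] * wy[None, :]
        return (concord * w).sum() / w.sum()

    rng = np.random.default_rng(3)
    n = 2000
    z = rng.normal(size=n)                        # confounder
    case = rng.binomial(1, 1 / (1 + np.exp(-z)))  # case status depends on z
    marker = 1.0 * case + 0.8 * z + rng.normal(size=n)  # marker depends on both

    # Propensity of being a case given z, used for stabilized weights.
    ps = LogisticRegression().fit(z[:, None], case).predict_proba(z[:, None])[:, 1]
    w = np.where(case == 1, case.mean() / ps, (1 - case.mean()) / (1 - ps))

    print(ipw_auc(marker, case, np.ones(n)))  # unadjusted AUC, inflated by z
    print(ipw_auc(marker, case, w))           # IPW-adjusted AUC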

Improving Tumor Marker Validation Success Using Random Forest Clustering and Gene Co-expression Network Methods

Molecular data are widely used to screen for biomarkers that have prognostic significance for clinical outcomes; for example, gene expression data or immunohistochemical staining data may be used to screen for biomarkers that could predict post-operative survival time. A challenge is that such candidate biomarkers sometimes cannot be validated in independent data sets. Here we will describe two different approaches that we have found useful for identifying biomarkers with an increased chance of being validated.


The first approach is based on weighted gene co-expression network analysis. A clustering method is used to identify prognostic gene modules, i.e. sets of tightly co-expressed genes. Using brain cancer microarray data, we will show that highly connected prognostic 'hub' genes in these modules have a substantially increased likelihood of being validated. The second approach seems to be quite different: first, it uses random forest clustering to identify high-risk patient clusters; second, a biomarker-based threshold rule is derived for predicting cluster membership. Using prostate cancer data, we will provide empirical evidence that these rules can be validated, while traditional approaches may lead to candidate biomarkers that cannot be validated.


There seems to be a mathematical and biological connection between these two approaches. Both rely on clustering as an essential pre-processing step to identify "prognostic" clusters. The clusters correspond to global patterns that are more likely to be found in independent data sets as well. We provide empirical evidence that biomarker screening procedures based on prognostic clusters have an increased chance of validation success.
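For the network approach described above, here is a minimal sketch of weighted connectivity, the quantity that defines 'hub' genes; the soft-threshold power of 6 is a conventional illustrative choice, and the module structure is simulated.

    # Sketch: a gene's connectivity is its summed soft-thresholded adjacency;
    # highly connected genes inside a prognostic module are hub candidates.
    import numpy as np

    def connectivity(expr, power=6):
        """expr: genes x samples matrix; returns per-gene weighted connectivity."""
        corr = np.corrcoef(expr)              # gene-gene correlation
        adjacency = np.abs(corr) ** power     # soft-thresholded adjacency
        np.fill_diagonal(adjacency, 0.0)
        return adjacency.sum(axis=1)

    rng = np.random.default_rng(4)
    module_driver = rng.normal(size=50)       # shared signal for one module
    expr = np.vstack([module_driver + rng.normal(0, 0.5, 50) for _ in range(20)]
                     + [rng.normal(size=50) for _ in range(20)])

    k = connectivity(expr)
    print(np.argsort(k)[::-1][:5])  # top-connectivity genes: hub candidates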


Acknowledgement: The gene co-expression network part was done in collaboration with Bin Zhang, Paul Mischel, and Stan Nelson. The random forest part was done in collaboration with Tao Shi, Siavash Kurdistani, and David Seligson.

Some Statistical Issues in the Analysis of Array CGH Data

Cancer progression often involves alterations in DNA sequence copy number. Multiple microarray platforms now facilitate high-resolution copy number assessment of entire genomes in single experiments. This technology is generally referred to as array comparative genomic hybridization (array CGH). In my talk, I will discuss issues that have arisen in the analysis of array CGH data. Topics will include pre-processing and normalization, identification of regions of abnormal copy number, and determination as to whether copy number abnormalities can be seen in gene expression data. Our method of identifying abnormal copy number, which we call circular binary segmentation (CBS), will be introduced.


This is joint work with E.S. Venkatraman.
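As a simplified illustration of the segmentation idea: actual CBS tests circular (arc) segments and calibrates its statistic by permutation, whereas the sketch below uses plain recursive binary splits with a fixed t-statistic threshold, which is only a caricature of the method.

    # Simplified, plain (non-circular) binary segmentation of log2 ratios.
    import numpy as np

    def best_split(x):
        """Return (index, |t|) of the split maximizing the two-sample t statistic."""
        best_i, best_t = None, 0.0
        for i in range(5, len(x) - 5):        # enforce a minimum segment length
            a, b = x[:i], x[i:]
            se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
            t = abs(a.mean() - b.mean()) / se
            if t > best_t:
                best_i, best_t = i, t
        return best_i, best_t

    def segment(x, offset=0, threshold=5.0, breaks=None):
        """Recursively split while the best split exceeds the threshold."""
        if breaks is None:
            breaks = []
        if len(x) < 10:
            return breaks
        i, t = best_split(x)
        if i is not None and t > threshold:
            breaks.append(offset + i)
            segment(x[:i], offset, threshold, breaks)
            segment(x[i:], offset + i, threshold, breaks)
        return sorted(breaks)

    rng = np.random.default_rng(5)
    x = np.concatenate([rng.normal(0, 0.2, 60), rng.normal(0.6, 0.2, 30),
                        rng.normal(0, 0.2, 60)])
    print(segment(x))  # expected change points near 60 and 90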

Modeling Drug Efficacy and HIV Dynamics

Simple models of HIV infection and the effects of antiretroviral therapy have typically assumed that drugs have a constant efficacy. Here I will summarize some new models that incorporate ideas from pharmacokinetics and pharmacodynamics, such that drug efficacy depends on drug concentration, which in turn depends on drug dose and the times at which doses are taken. These models allow estimation of the relative efficacy of different drug combinations and also allow one to explicitly incorporate the effects of missed drug doses or intentional stopping of therapy for short periods of time. Effects of drug resistance can also be incorporated.
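A minimal sketch of such a model, assuming a standard target-cell-limited system in which efficacy follows an Emax/Hill function of a hypothetical 12-hour-dosing drug concentration; all parameter values are illustrative, not from the talk.

    # Sketch: viral dynamics with concentration-dependent drug efficacy.
    import numpy as np
    from scipy.integrate import solve_ivp

    def concentration(t, dose=1.0, ke=6.0, interval=0.5):
        """Hypothetical PK: dosing every 0.5 day with first-order elimination."""
        return dose * np.exp(-ke * (t % interval))

    def efficacy(conc, ic50=0.25, hill=1.0):
        """Emax/Hill pharmacodynamics mapping concentration to efficacy in [0, 1)."""
        return conc**hill / (ic50**hill + conc**hill)

    def hiv_rhs(t, y, lam=1e4, d=0.01, beta=2.4e-8, delta=1.0, p=1e3, c=23.0):
        T, I, V = y                         # target cells, infected cells, virus
        eps = efficacy(concentration(t))    # time-varying drug efficacy
        dT = lam - d * T - (1 - eps) * beta * T * V
        dI = (1 - eps) * beta * T * V - delta * I
        dV = p * I - c * V
        return [dT, dI, dV]

    # Start near a pre-therapy steady state and simulate 4 weeks of dosing.
    sol = solve_ivp(hiv_rhs, (0, 28), [9.6e5, 4.0e2, 1.7e4], max_step=0.05)
    print(np.log10(sol.y[2, -1]))  # log10 viral load after 28 days of therapy
    # Missed doses can be modeled by zeroing `concentration` over the missed
    # interval, letting efficacy fall and virus rebound.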

Defining Aggressive Prostate Cancer Biomarkers using a Combination of High Throughput Technologies

Developing molecular tests to predict prostate cancer progression requires first defining a meaningful endpoint. There is controversy regarding the use of PSA or biochemical failure following prostatectomy or radiation therapy for clinically localized prostate cancer as a marker of progression. As a consequence, advances in prostate cancer biomarker development may require using population-based cohorts or cases from clinical trials to identify meaningful associations. Whereas the discovery of novel candidate biomarkers was slow 5-10 years ago and often resulted from serendipity, advances in high-throughput technologies have led to the identification of a large number of candidate genes. Strategies to identify candidate genes include the use of novel software for genomic analysis. This presentation will provide an approach to validation of these candidate genes using tissue microarrays and other high-throughput technologies. Since a critical factor in the evaluation of tissue markers is reproducibility, approaches to quantitative protein expression will be presented. The approaches presented here should be applicable to other tumor types and disease processes.

Genomewide Prediction of HIV-1 Epitopes Using Ensemble Classifiers and Amino Acid Sequence of MHC Binding Peptides

Following infection, HIV-1 proteins are digested into short peptides that bind to major histocompatibility complex (MHC) molecules. Subsequently, these bound complexes are displayed by antigen presenting cells. T cells with receptors that recognize the complexes are activated, triggering an immune response. Peptides with this ability to induce a T cell response are called T cell epitopes; predicting them is important for vaccine development. Sung and Simon (JCB, 2004) start with compilations of peptide sequences that do or do not bind to specific MHC molecules and, using biophysical properties of the constituent amino acids, develop a classifier. Properties are used because some classifiers cannot effectively handle the amino acid sequence itself. Tree-structured methods are not so limited (Segal et al., Biometrics, 2001). Here, we apply these methods, along with their ensemble extensions (bagging, boosting, random forests), and show that they provide improved accuracy. Additional properties (QSAR derived) and classifiers (SVMs, ANNs) are also investigated. HIV-1 genomewide comparisons with respect to predicted and conserved epitopes are also presented.
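The following sketch shows the basic device of handing raw sequence to a tree ensemble: 9-mer peptides are one-hot encoded position by position and fed to a random forest. The "binding" data here are randomly generated placeholders with a single planted anchor residue, not real MHC binding data.

    # Sketch: random forest classification of peptides from sequence alone.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    AA = "ACDEFGHIKLMNPQRSTVWY"

    def one_hot(peptide):
        """Encode each position of the peptide as a 20-way indicator."""
        vec = np.zeros(len(peptide) * len(AA))
        for pos, aa in enumerate(peptide):
            vec[pos * len(AA) + AA.index(aa)] = 1.0
        return vec

    rng = np.random.default_rng(6)
    def random_peptide(anchor=None):
        pep = rng.choice(list(AA), size=9)
        if anchor:                    # toy binders share an anchor at position 2
            pep[1] = anchor
        return "".join(pep)

    binders = [random_peptide(anchor="L") for _ in range(200)]
    nonbinders = [random_peptide() for _ in range(200)]
    X = np.array([one_hot(p) for p in binders + nonbinders])
    y = np.array([1] * 200 + [0] * 200)

    clf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
    clf.fit(X, y)
    print(clf.oob_score_)  # out-of-bag accuracy on the toy binding problem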

Tackling Cancer Diagnostic Needs with Clinical Proteomics

Our group utilizes a variety of proteomic approaches to biomarker discovery for the early detection of cancer. Current mass spectrometry-based studies and their application to solid tissue cancers such as prostate and head and neck will be discussed. In addition, recent studies examining serum from patients infected with human T-cell leukemia virus type 1 will be presented, with emphasis on the utility of the expressed biomarkers in discriminating adult T-cell leukemia, HAM/TSP, and asymptomatic infected individuals.

Longitudinal Biomarkers in Detection of Ovarian Cancer in Asymptomatic Women

Detecting ovarian cancer in asymptomatic women through regular screening tests is an appealing approach to reducing mortality from this disease, given the large survival difference between early and late stage disease and the high proportion of cases (80%) detected at late stage under usual care. However, because of the low incidence of the disease, ovarian cancer screening is a delicate balance between detecting as many cancers as possible and limiting the number of false positive results per true positive. As the bar is lowered for declaring a test positive, the proportion of cancers detected usually increases; however, the number of false positives per cancer detected also increases. The definitive diagnosis of ovarian cancer requires invasive pelvic surgery. To be considered acceptable, a screening method must find at least one ovarian cancer per ten screen-related surgeries and must detect at least 70% of ovarian cancers.


Prospective clinical screening trials with the blood test CA125, followed by ultrasound for CA125 elevated above a fixed cutpoint, resulted in a positive predictive value (PPV, the number of cancers found at surgery divided by the number of surgeries) exceeding 20%, with 70% of ovarian cancers screen detected, demonstrating that this screening method is acceptable. However, only 40% of screen-detected cancers were found in early stage. While this result doubled the percentage found in early stage under usual care, a greater increase was required before the impact on mortality would be substantial. A method was therefore needed to increase sensitivity while maintaining a sufficiently high PPV. Retrospective analysis of longitudinal CA125 values indicated that CA125 rose exponentially above an individual's baseline level prior to diagnosis of ovarian cancer, while in most other women CA125 fluctuated around a baseline level. Incorporating this differential CA125 behavior into the screening decision for referral to ultrasound would potentially allow greater sensitivity (detecting a rise above baseline before the level exceeds the fixed cutpoint) while maintaining specificity (ruling out subjects with elevated yet stable CA125 levels). Modeling the longitudinal CA125 values in cases with a hierarchical longitudinal change-point model, and CA125 in other women with a hierarchical longitudinal model, provided the basis for assessing referral to ultrasound via the Bayes factor calculated for subjects with new CA125 values. This approach has been used in a prospective randomized ovarian cancer screening trial in the UK and will be discussed at the workshop.
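A stripped-down sketch of the decision rule: compare a flat-baseline model to a change-point model whose mean rises linearly on the log scale (exponential rise on the original scale), and refer when the evidence ratio is large. The hierarchical priors of the actual model are omitted, so a maximized likelihood ratio stands in for the Bayes factor; the CA125 series and noise level are invented.

    # Sketch: flat baseline vs. change-point exponential rise on log(CA125).
    import numpy as np
    from scipy.stats import norm

    def log_lik_flat(t, y, sigma=0.1):
        return norm.logpdf(y, y.mean(), sigma).sum()

    def log_lik_changepoint(t, y, sigma=0.1, slopes=np.linspace(0.5, 5, 10)):
        best = -np.inf
        base = y[:2].mean()                   # crude baseline estimate
        for tau in t[1:-1]:                   # candidate change times
            for b in slopes:                  # candidate rise rates
                mean = base + b * np.clip(t - tau, 0, None)
                best = max(best, norm.logpdf(y, mean, sigma).sum())
        return best

    t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])   # years
    stable = np.log(np.array([30, 31, 29, 30, 31]))  # fluctuates around baseline
    rising = np.log(np.array([30, 29, 33, 45, 70]))  # exponential rise pattern

    for y in (stable, rising):
        lr = log_lik_changepoint(t, y) - log_lik_flat(t, y)
        print(lr)  # large values flag the rising profile for ultrasound referral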

Day 4 Questions & Discussion

Day 4 Questions & Discussion with Elizabeth Slate

Statistical Issues in Cancer Biomarker Assessment

Cancer biomarkers can be used in many different ways in cancer research. They can be used as surrogate endpoints or auxiliary variables to help assess new therapies. They can be used for risk stratification prior to deciding on therapy. A biomarker might suggest responsiveness to a particular biological agent, and would thus assist in individualizing therapy. Modern technologies, such as genomics and proteomics, are producing high dimensional sets of biomarkers, which give rise to numerous complex statistical issues. A longitudinal series of a biomarker can be useful for early detection of disease or for monitoring disease progression after therapy.

There is a general feeling that combinations of biomarkers that measure different aspects of the underlying biology may be more useful than any single biomarker. This raises the statistical challenge of how to combine biomarkers. When using combinations of biomarkers to detect disease, it is frequently appropriate to assume that the probability of disease is a monotonic function of each biomarker. By incorporating this monotonicity into the analysis, it may be possible to improve its efficiency. We consider the situation of two ordered categorical variables and a binary response, where the probability of response is assumed to be monotonic in each of the biomarkers. Two approaches are considered: a Bayesian one, in which the monotonicity is built into the prior distributions, and a second in which isotonic regression in two dimensions is used.

When using a biomarker as a surrogate endpoint in a clinical trial, it is well known that one requires more than a strong association between the biomarker and the true endpoint; one also needs the biomarker to explain the effect of the treatment on the true endpoint. Various measures of the proportion of treatment effect explained by the surrogate have been proposed. An alternative approach is to view the biomarker as an auxiliary variable, use it to predict the true endpoint, and then perform inference on the true endpoint. Thus the problem is converted into one of missing data, for which there are various approaches. We have developed a multiple imputation approach, in which the true endpoint is imputed based on information in the auxiliary variable, the treatment group and possibly other prognostic factors. This approach generalizes to more complex situations such as multivariate biomarkers or longitudinally measured biomarkers. A more general approach is to formulate and estimate the joint distribution of the biomarker and the true endpoint; once this is achieved, measures such as the proportion explained and predictive distributions of true endpoint values follow naturally from the model.
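As a sketch of the auxiliary-variable route, the following imputes a binary true endpoint from the biomarker and treatment arm and averages the treatment effect over imputations. The data, the logistic imputation model, and the 30% observation rate are placeholders, and the between-imputation variance component of Rubin's rules is omitted for brevity.

    # Sketch: multiple imputation of a true endpoint from an auxiliary biomarker.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(7)
    n = 500
    arm = rng.binomial(1, 0.5, n)
    marker = rng.normal(0.5 * arm, 1.0)               # biomarker, shifted by arm
    true_end = rng.binomial(1, 1 / (1 + np.exp(-(marker - 0.5))))
    observed = rng.random(n) < 0.3                    # endpoint observed in 30%

    # Fit the imputation model on subjects with the true endpoint observed.
    Xfull = np.column_stack([marker, arm])
    imp = LogisticRegression().fit(Xfull[observed], true_end[observed])
    p_missing = imp.predict_proba(Xfull[~observed])[:, 1]

    effects = []
    for _ in range(20):                               # 20 imputed datasets
        y = true_end.astype(float)
        y[~observed] = rng.random((~observed).sum()) < p_missing
        effects.append(y[arm == 1].mean() - y[arm == 0].mean())

    print(np.mean(effects))  # pooled point estimate across imputations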

Issues in Longitudinal Modelling of HIV Markers Using Mixed Models

Plasma HIV RNA and CD4+ T lymphocyte counts are the major biomarkers used to decide when to start, change or stop a treatment, as well as to evaluate treatment efficacy in HIV-infected patients. Repeated measurements of these biomarkers are therefore common in HIV studies, and the data may be analysed using models for longitudinal data such as mixed models. However, the statistical analysis is complicated by several methodological difficulties, three of which are of particular importance: (i) left-censoring of HIV RNA due to a lower quantification limit; (ii) correlation between CD4+ T lymphocytes and plasma HIV RNA; (iii) missing data due to informative dropout or disease progression. I will present a unified approach to deal with these issues by jointly modelling longitudinal measurement data and event history data. Likelihood inference can be used to estimate the parameters of such a model. I will illustrate the approach by studying the response of HIV markers to antiretroviral treatment in randomised clinical trials and observational cohort studies. This approach might help in studying the change in markers, their prognostic value and their surrogacy.
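A minimal sketch of the left-censoring device, using a single-group Gaussian model in place of the full joint mixed model: detected values contribute density terms to the likelihood, and values below the quantification limit contribute cumulative-probability terms. The data and the 50 copies/mL limit are illustrative.

    # Sketch: maximum likelihood with left-censored log10 viral loads.
    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import minimize

    LOQ = np.log10(50)                  # lower quantification limit (50 cp/mL)

    def neg_log_lik(params, y, censored):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)
        ll = norm.logpdf(y[~censored], mu, sigma).sum()     # detected values
        ll += norm.logcdf(LOQ, mu, sigma) * censored.sum()  # P(Y < LOQ) terms
        return -ll

    rng = np.random.default_rng(8)
    latent = rng.normal(1.9, 0.6, 300)  # true log10 viral loads near the limit
    censored = latent < LOQ
    y = np.where(censored, LOQ, latent)

    fit = minimize(neg_log_lik, x0=[2.0, 0.0], args=(y, censored))
    # The censored-data MLE recovers the mean; the naive mean of detected
    # values alone is biased upward.
    print(fit.x[0], y[~censored].mean())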

Surrogate Endpoint Biomarkers in Chemoprevention: Mathematical evaluation of feasibility

To establish efficacy of cancer chemoprevention agents using cancer incidence as the endpoint requires very large sample sizes (thousands) and long follow-up. Surrogate endpoint biomarkers (SEBs) are biomarkers of (presumably critical) intermediate steps in the carcinogenic pathway that may permit smaller and more rapid studies. If the chemopreventive agent modulates the SEB in a manner consistent with blocking or reducing progression to carcinogenesis it may be possible to infer the reduction in cancer risk attributable to the agent. However, if there is not a perfect one-to-one correspondence between the SEB and cancer then the SEB induces misclassification of the cancer outcome. The extent of bias in the SEB as a surrogate for cancer is measured by its sensitivity and specificity. This paper will show that the relative risk (RR) observed using the SEB as a surrogate for cancer can severely underestimate the true RR when specificity is less than perfect. Furthermore, if specificity in the group receiving the chemopreventive agent is less than that in the untreated group, the RR based on the SEB may even indicate that the agent increases cancer risk. The performance characteristics of SEBs as a function of sensitivity, specificity and cancer incidence will be explored, and criteria to determine if SEBs can realistically be used will be defined.
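The core arithmetic can be made explicit: with sensitivity sens and specificity spec, the apparent event probability in a group with true risk p is sens*p + (1-spec)*(1-p). The sketch below shows both the attenuation and the reversal phenomena using hypothetical numbers.

    # Worked example: observed RR based on an SEB vs. true RR for cancer.

    def observed_rr(p_treated, p_control, sens, spec_treated, spec_control):
        obs_t = sens * p_treated + (1 - spec_treated) * (1 - p_treated)
        obs_c = sens * p_control + (1 - spec_control) * (1 - p_control)
        return obs_t / obs_c

    # True RR = 0.5 for a rare cancer; imperfect specificity masks the benefit:
    print(observed_rr(0.005, 0.01, sens=0.9,
                      spec_treated=0.95, spec_control=0.95))  # ~0.93, not 0.5
    # Lower specificity in the treated group can even make the agent look harmful:
    print(observed_rr(0.005, 0.01, sens=0.9,
                      spec_treated=0.90, spec_control=0.95))  # ~1.8 > 1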

Day 4 Questions & Discussion

Day 4 Questions & Discussion with Alexander Tsodikov

Interpreting HIV Mutations to Predict Response to Antiretroviral Therapy: The deletion/substitution/addition (DSA) algorithm for the estimation of direct causal effects

Our goal is to estimate the causal effect of mutations detected in the HIV strains infecting a patient on clinical virologic response to specific antiretroviral drugs and drug combinations. We consider the following data structure: 1) viral genotype, which we summarize as the presence or absence of each viral mutation considered by the Stanford HIV Database as likely to have some effect on virologic response to antiretroviral therapy; 2) the drug regimen initiated following assessment of viral genotype (the regimen may involve changing some or all of the drugs in a patient's previous regimen); and 3) change in plasma HIV RNA level (viral load) over baseline at twelve and twenty-four weeks after starting this regimen.


The effects of a set of mutations on virologic response are heavily confounded by past treatment. In addition, viral mutation profiles are often used by physicians to make treatment choices; we are interested in the direct causal effect of mutations on virologic outcome, not mediated by the choice of other drugs in a patient's regimen. Finally, the need to consider multiple mutations and treatment history variables, as well as multi-way interactions between these variables, results in a high-dimensional modeling problem. This application thus requires data-adaptive estimation of the direct causal effect of a set of mutations on viral load under a particular drug, controlling for confounding and blocking the effect the mutations have on the assignment of other drugs. We developed such an algorithm based on a mix of the direct-effect causal inference framework and the data-adaptive regression deletion/substitution/addition (DSA) algorithm.
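A highly simplified sketch of the search component follows: deletion, substitution, and addition moves over candidate model terms, accepted when they improve cross-validated risk. The causal layer (confounding control and direct-effect estimation) that distinguishes the actual method is omitted, and the mutation data are simulated.

    # Sketch: greedy deletion/substitution/addition search over linear terms.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    def cv_risk(X, y, terms):
        if not terms:
            return np.var(y)                  # intercept-only risk
        scores = cross_val_score(LinearRegression(), X[:, sorted(terms)], y,
                                 scoring="neg_mean_squared_error", cv=5)
        return -scores.mean()

    def dsa_search(X, y, max_moves=50):
        terms, risk = set(), cv_risk(X, y, set())
        all_terms = set(range(X.shape[1]))
        for _ in range(max_moves):
            candidates = []
            candidates += [terms - {d} for d in terms]                 # deletion
            candidates += [terms - {d} | {a} for d in terms
                           for a in all_terms - terms]                 # substitution
            candidates += [terms | {a} for a in all_terms - terms]    # addition
            risks = [(cv_risk(X, y, c), c) for c in candidates]
            best_risk, best_terms = min(risks, key=lambda rc: rc[0])
            if best_risk >= risk:
                break                         # no move improves CV risk
            terms, risk = best_terms, best_risk
        return terms

    rng = np.random.default_rng(9)
    X = rng.binomial(1, 0.3, size=(400, 12))  # mutation presence indicators
    y = 1.5 * X[:, 2] - 1.0 * X[:, 7] + rng.normal(0, 1, 400)  # response
    print(dsa_search(X, y))                   # expected: {2, 7}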

Modeling and Prediction of Biomarkers Longitudinally in AIDS Clinical Studies

Although a single event endpoint such as time to virological failure is simple and easy to use in large AIDS clinical trials, longitudinal biomarker data from close monitoring of viral load and CD4+ T cell counts can provide more detailed information regarding the pathogenesis of HIV infection and the characteristics of antiretroviral regimens. I will present a mechanistic HIV-1 dynamic model that incorporates information on pharmacokinetics, drug adherence and drug susceptibility to predict viral load trajectory. A Bayesian approach is proposed to fit this model to clinical data from ACTG A5055, a study of two dosage regimens of indinavir (IDV) with ritonavir (RTV) in subjects failing their first PI treatment. HIV RNA testing was completed at days 0, 7, 14, 28, 56, 84, 112, 140 and 168. An intensive PK evaluation was performed on day 14 and multiple trough concentrations were subsequently collected. Pill counts were used to monitor adherence. IC50 for IDV and RTV was determined at baseline and at virologic failure.

Viral dynamic model fitting residuals were used to assess the significance of covariate effects on long-term virologic response. As univariate predictors, none of the four PK parameters C_trough, C_12h, C_max and AUC_0-12h was significantly related to virologic response (p>0.05). When drug susceptibility (IC50), or IC50 and adherence together, were included, C_trough, C_12h, C_max and AUC_0-12h were each significantly correlated with long-term virologic response (p=0.0055, 0.0002, 0.0136, 0.0002 with IC50 and adherence considered). IC50 and adherence alone were not related to virologic response. Adherence did not add information beyond PK parameters (p=0.064), drug susceptibility IC50 (p=0.086), or their combination (p=0.22) in predicting virologic response. Simple regression approaches did not detect any significant PD relationships.

Thus no single factor among PK, adherence and drug susceptibility could be detected as a significant contributor to long-term virologic response, but an appropriate combination of these factors through the viral dynamic modeling approach was a significant predictor. Adherence measured by pill counts and multiple trough drug concentrations did not provide additional information for virologic response, presumably due to data quality and noise problems. HIV dynamic modeling is a powerful tool to establish a PD relationship and correlate other factors such as adherence and drug susceptibility with long-term virologic response, since it can appropriately capture the complicated nonlinear relationships and interactions among multiple covariates. Our findings may help clinicians better understand the roles of these clinical factors in antiviral activity and predict the virologic response to various antiretroviral regimens.
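One way to see why combinations can succeed where single factors fail is the inhibitory quotient, IQ = C_trough/IC50, which couples drug exposure to viral susceptibility. The simulation below is an illustration of that principle with invented data, not a reanalysis of ACTG A5055.

    # Sketch: the PK/susceptibility ratio tracks response more strongly
    # than either component alone in this toy setup.
    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(10)
    n = 60
    c_trough = rng.lognormal(0.0, 0.5, n)   # trough drug concentration
    ic50 = rng.lognormal(0.0, 0.5, n)       # viral susceptibility to the drug
    iq = c_trough / ic50                    # inhibitory quotient
    response = -np.log10(1 + iq) + rng.normal(0, 0.2, n)  # viral load change

    for name, x in [("C_trough", c_trough), ("IC50", ic50), ("IQ", iq)]:
        rho, p = spearmanr(x, response)
        print(name, round(rho, 2), round(p, 4))
    # The combined quantity (IQ) shows the strongest correlation with response,
    # mirroring the pattern the abstract reports for combined predictors.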

Day 4 Questions & Discussion

Day 4 Questions & Discussion with Zhen Zhang