TARGET Project Experimental Methods

On this page researchers can find detailed information describing how TARGET data was generated by genomic platform, including protocols for establishing high-quality nucleic acid samples. 

*Note to users: Protocols are currently being added to this site, as a large portion of the data and metadata were recently released onto the TARGET Data Matrix. Please continue to check back in the coming weeks as updates are completed, and contact ocg@mail.nih.gov with any questions. 

TARGET Sample Naming 

TARGET samples are named using a coding system specific to OCG characterization programs. Please use the OCG Sample Codes document (OCG Sample Codes_finalized 05 2017.pdf) to interpret the metadata encoded in the sample name.

Nucleic Acid Sample Processing

TARGET project teams use high-quality RNA and DNA from case-matched tumor and normal tissues to generate comprehensive genomics data. Below is a table outlining how each project team generated those samples.

Sample Preparation Protocol | Project Teams
DNA/RNA Co-isolation with Qiagen AllPrep Kit (DNA/RNA) and mirVana (Total & Small RNAs) | AML, AML-IF, CCSK, RT, WT, OS, NBL (after 2013)
DNA Isolation with Qiagen QIAamp DNA Minikit | ALL P1, ALL P2, ALL MDLS
DNA Isolation with Qiagen Genomic-Tips | NBL (prior to 2013), WT
RNA Isolation with Invitrogen TRIzol | ALL P1, ALL P2, ALL MDLS
RNA Isolation with Invitrogen TRIzol and Qiagen RNeasy | NBL (prior to 2013)
DNA/RNA Co-isolation with Qiagen AllPrep kit (DNA/RNA) and mirVana (total & small RNAs) for AML

The protocol herein describes the procedures used by Fred Hutchinson Cancer Research Center to process disease tissues for RNA and/or DNA subsequently used for characterization in the NCI’s TARGET initiative.

All nucleic acid samples used in TARGET projects were quality tested for consistency using picogreen quantification and SSTR genotyping methods, regardless of where the nucleic acid was originally extracted.

Experimental Protocol

DNA/RNA Co-isolation with Qiagen AllPrep kit (DNA/RNA) and mirVana (total & small RNAs) for AML-IF

The protocol herein describes the procedures used by Fred Hutchinson Cancer Research Center to process disease tissues for RNA and/or DNA subsequently used for characterization in the NCI’s TARGET initiative.

All nucleic acid samples used in TARGET projects were quality tested for consistency using picogreen quantification and SSTR genotyping methods, regardless of where the nucleic acid was originally extracted.

Experimental Protocol

DNA/RNA co-isolation with Qiagen AllPrep kit (DNA/RNA) and mirVana (total & small RNAs) for CCSK

The protocol herein describes the procedures used by Nationwide Children’s Hospital to process disease tissues for RNA and/or DNA subsequently used for characterization in the NCI’s TARGET initiative.

All nucleic acid samples used in TARGET projects were quality tested for consistency using picogreen quantification and SSTR genotyping methods, regardless of where the nucleic acid was originally extracted.

Experimental Protocol

DNA/RNA co-isolation with Qiagen AllPrep kit (DNA/RNA) and mirVana (total & small RNAs) for RT

The protocol herein describes the procedures used by Nationwide Children’s Hospital to process disease tissues for RNA and/or DNA subsequently used for characterization in the NCI’s TARGET initiative.

All nucleic acid samples used in TARGET projects were quality tested for consistency using picogreen quantification and SSTR genotyping methods, regardless of where the nucleic acid was originally extracted.

Experimental Protocol

DNA/RNA co-isolation with Qiagen AllPrep kit (DNA/RNA) and mirVana (total & small RNAs) for WT

The protocol herein describes the procedures used by Nationwide Children’s Hospital to process disease tissues for RNA and/or DNA subsequently used for characterization in the NCI’s TARGET initiative.

All nucleic acid samples used in TARGET projects were quality tested for consistency using picogreen quantification and SSTR genotyping methods, regardless of where the nucleic acid was originally extracted.

Experimental Protocol

DNA/RNA co-isolation with Qiagen AllPrep kit (DNA/RNA) and mirVana (total & small RNAs) for OS

The protocol herein describes the procedures used by Nationwide Children’s Hospital to process disease tissues for RNA and/or DNA subsequently used for characterization in the NCI’s TARGET initiative.

Some OS validation samples were extracted outside of Nationwide Children’s Hospital, either at the National Cancer Institute or at international cancer centers including Toronto SickKids (Ontario, Canada), Chiba Cancer Center (Japan), and the GRAACC Cancer Center in Sao Paulo, Brazil.

All nucleic acid samples used in TARGET projects were quality tested for consistency using picogreen quantification and SSTR genotyping methods, regardless of where the nucleic acid was originally extracted.

Experimental Protocol

DNA/RNA co-isolation with Qiagen AllPrep kit (DNA/RNA) and mirVana (total & small RNAs) for NBL

The protocol herein describes the procedures used by Nationwide Children’s Hospital to process disease tissues for RNA and/or DNA subsequently used for characterization in the NCI’s TARGET initiative.

All nucleic acid samples used in TARGET projects were quality tested for consistency using picogreen quantification and SSTR genotyping methods, regardless of where the nucleic acid was originally extracted.

Experimental Protocol

DNA Isolation with Qiagen QIAamp DNA Minikit for ALL P1

The protocol herein describes the procedures used by Nationwide Children’s Hospital to process disease tissues for RNA and/or DNA subsequently used for characterization in the NCI’s TARGET initiative.

All nucleic acid samples used in TARGET projects were quality tested for consistency using picogreen quantification and SSTR genotyping methods, regardless of where the nucleic acid was originally extracted.

Experimental Protocol

DNA Isolation with Qiagen QIAamp DNA Minikit for ALL P2

The protocol herein describes the procedures used by Nationwide Children’s Hospital to process disease tissues for RNA and/or DNA subsequently used for characterization in the NCI’s TARGET initiative.

All nucleic acid samples used in TARGET projects were quality tested for consistency using picogreen quantification and SSTR genotyping methods, regardless of where the nucleic acid was originally extracted.

Experimental Protocol

DNA Isolation with Qiagen QIAamp DNA Minikit for ALL MDLS

The protocol herein describes the procedures used by Nationwide Children’s Hospital to process disease tissues for RNA and/or DNA subsequently used for characterization in the NCI’s TARGET initiative.

All nucleic acid samples used in TARGET projects were quality tested for consistency using picogreen quantification and SSTR genotyping methods, regardless of where the nucleic acid was originally extracted.

Experimental Protocol

DNA Isolation with Qiagen Genomic-Tips for NBL

The protocol herein describes the procedures used by Nationwide Children’s Hospital to process disease tissues for RNA and/or DNA subsequently used for characterization in the NCI’s TARGET initiative.

All nucleic acid samples used in TARGET projects were quality tested for consistency using picogreen quantification and SSTR genotyping methods, regardless of where the nucleic acid was originally extracted.

Experimental Protocol

DNA Isolation with Qiagen Genomic-Tips for WT

The protocol herein describes the procedures used by Nationwide Children’s Hospital to process disease tissues for RNA and/or DNA subsequently used for characterization in the NCI’s TARGET initiative.

Some WT validation samples were extracted in the lab of Dr. Paul Grundy and were provided for TARGET through Ann & Robert H. Lurie Children’s Hospital.

All nucleic acid samples used in TARGET projects were quality tested for consistency using picogreen quantification and SSTR genotyping methods, regardless of where the nucleic acid was originally extracted.

Experimental Protocol

RNA Isolation with Invitrogen TRIzol for ALL P1

The protocol herein describes the procedures used by University of New Mexico to process disease tissues for RNA and/or DNA subsequently used for characterization in the NCI’s TARGET initiative.

All nucleic acid samples used in TARGET projects were quality tested for consistency using picogreen quantification and SSTR genotyping methods, regardless of where the nucleic acid was originally extracted.

Experimental Protocol

RNA Isolation with Invitrogen TRIzol for ALL P2

The protocol herein describes the procedures used by University of New Mexico to process disease tissues for RNA and/or DNA subsequently used for characterization in the NCI’s TARGET initiative.

All nucleic acid samples used in TARGET projects were quality tested for consistency using picogreen quantification and SSTR genotyping methods, regardless of where the nucleic acid was originally extracted.

Experimental Protocol

RNA Isolation with Invitrogen TRIzol for ALL MDLS

The protocol herein describes the procedures used by University of New Mexico to process disease tissues for RNA and/or DNA subsequently used for characterization in the NCI’s TARGET initiative.

All nucleic acid samples used in TARGET projects were quality tested for consistency using picogreen quantification and SSTR genotyping methods, regardless of where the nucleic acid was originally extracted.

Experimental Protocol

RNA Isolation with Invitrogen TRIzol and Qiagen RNeasy for NBL

The protocol herein describes the procedures used by Nationwide Children’s Hospital to process disease tissues for RNA and/or DNA subsequently used for characterization in the NCI’s TARGET initiative.

All nucleic acid samples used in TARGET projects were quality tested for consistency using picogreen quantification and SSTR genotyping methods, regardless of where the nucleic acid was originally extracted.

Experimental Protocol

Gene Expression
Data Generation Protocols Data Analysis Protocols
Gene Chip® Human Exon ST Array (Affymetrix) NBL , OS
Gene Chip® Human Gene 1.1 ST (Affymetrix) AML
Gene Chip® Human Genome U133 Plus 2.0 Array (Affymetrix) ALL P1 , ALL P2/MDLS , CCSK , WT , PPTP
SurePrint G3 Human Gene Expression Array (Agilent) PPTP

Gene Chip® Human Exon ST Array (Affymetrix) for Neuroblastoma (NBL)

RNA was extracted from Optimal Cutting Temperature (OCT) embedded primary tumor tissues using TRIzol-based methods with QIAGEN RNeasy clean-up at Children's Hospital Los Angeles, Children's Hospital of Philadelphia, or the Children's Oncology Group Biopathology Center in Columbus, Ohio.

The manufacturer's protocols were used to label the extract, hybridize, and scan the human exon arrays (Affymetrix Human Exon Array Labeled Extract, Affymetrix Human Exon Array Hybridization Protocol, Affymetrix Human Exon Array Scan Protocol).

Level 2 data were generated by normalization and summarization using the rma-sketch analysis in Affymetrix Power Tools (APT, version 1.16.0). Level 2 batch effect corrected (BER) data were obtained by removing the batch effect related to the RNA source of the specimens. A generalized linear model (GLM; R version 3.10) was used to remove the institutional batch effect by fitting a model for each Human Exon array probeset region (PSR) to the batch effect (RNA source by institution). The GLM was adjusted for risk groups based on stage and MYCN amplification status. These Level 2 BER data were used to generate all subsequent data transformations.

Level 3 transcript-level data were derived from the Level 2 BER data, with one data set for each PSR annotation level ('core', 'extended', and 'full'). In each case, PSRs with low expression (less than the median expression level of the entire dataset) and low coefficient of variation (less than the median CV of the entire dataset) were first removed (~10% of PSRs), and the remaining PSRs were then averaged by Transcript ID (based on Affymetrix annotation).

Level 3 gene-level data were derived from the Level 3 BER transcript data sets, again with one data set for each PSR annotation level ('core', 'extended', and 'full'). In each case, PSRs with low expression (less than the median expression level of the entire dataset) and low coefficient of variation (less than the median CV of the entire dataset) were removed (~10% of PSRs) prior to averaging of PSRs by Gene Symbol (based on the BioCore package Affymetrix huex10 annotation data, huex10stprobeset.db; mappings were based on data provided by Entrez Gene, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA, with a date stamp from the source of 2014-Mar13).
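
The Level 3 filtering and averaging step described above can be illustrated with a minimal sketch. It is not the code used by the project team. It assumes that "low expression" and "low CV" are judged per PSR against the medians of the per-PSR means and CVs, that a PSR is removed only when both conditions hold, and that the function and variable names (filter_and_average, psr_values, psr_to_transcript) are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def filter_and_average(psr_values, psr_to_transcript):
    """Drop low-expression, low-CV PSRs, then average the rest per transcript.

    psr_values: {psr_id: [expression value per sample]}
    psr_to_transcript: {psr_id: transcript_id} (hypothetical annotation map)
    """
    # Per-PSR mean expression and coefficient of variation (assumed definitions).
    psr_mean = {p: mean(v) for p, v in psr_values.items()}
    psr_cv = {
        p: (pstdev(v) / abs(psr_mean[p]) if psr_mean[p] else 0.0)
        for p, v in psr_values.items()
    }
    median_expr = median(psr_mean.values())
    median_cv = median(psr_cv.values())

    # Remove a PSR only if it is below the median on BOTH criteria.
    kept = [
        p for p in psr_values
        if not (psr_mean[p] < median_expr and psr_cv[p] < median_cv)
    ]

    # Average the retained PSRs sample by sample within each transcript.
    rows_by_transcript = defaultdict(list)
    for p in kept:
        rows_by_transcript[psr_to_transcript[p]].append(psr_values[p])
    return {
        t: [mean(column) for column in zip(*rows)]
        for t, rows in rows_by_transcript.items()
    }
```

The same grouping step applies to the gene-level files if the annotation map points to gene symbols instead of transcript IDs.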

Gene Chip® Human Exon ST Array (Affymetrix) for Osteosarcoma (OS)

*Protocols performed at the Children’s Hospital of Los Angeles and Texas Children’s Hospital.

RNA was labeled using the labeling protocol described by Affymetrix and reagents from Affymetrix.

Samples were hybridized using Affymetrix hybridization kit materials and protocols on the Affymetrix Fluidics Station 450.

Scanning of the microarrays was performed according to Affymetrix's recommended protocol for the Affymetrix Genechip Scanner 3000 7G.

Data preprocessing and normalization were done using the Affymetrix APT package with RMA.

Exon-level data (L2) were transformed into gene-level data (L3) by averaging the probesets per gene.

Gene Chip® Human Gene 1.1 ST (Affymetrix) for Acute Myeloid Leukemia (AML)

Protocols were performed at Hudson Alpha, Inc., and the Fred Hutchinson Cancer Research Center.

All microarray experiments were performed according to manufacturer’s protocol using the Ambion WT Expression Kit, the GeneChip WT Terminal Labeling and Controls Kit, and the GeneTitan. The arrays were hybridized according to manufacturer’s protocol to the Human Gene 1.1 ST 96-Array Plate using the Affymetrix GeneTitan.

Arrays were scanned and raw image data (intensity files) were generated using Affymetrix GeneChip Command Console Software.

Raw intensity files were imported into Affymetrix Expression Console Software and normalized using the Robust Multichip Analysis-sketch workflow to assess quality control parameters and ensure uniform performance across the data set.  All raw files were uploaded into Partek Genomics Suite (St. Louis, MO) and RMA normalized upon import. 

To assign a single value per gene ID, multiple cluster IDs mapping to the same gene ID were averaged into one value.  The level 3 file represents the average (if multiple cluster IDs are represented) value per gene.

Gene Chip® Human Genome U133 Plus 2.0 Array (Affymetrix)  for Acute Lymphoblastic Leukemia Phase I (ALL P1)

*This protocol was performed at University of New Mexico.

1-3 µg of total RNA was labeled and hybridized to Affymetrix U133_Plus_2 arrays according to the manufacturer's recommendations (Affymetrix). A mask to remove uninformative probe pairs and Affymetrix controls was applied to all the arrays (resulting in the removal of 171 probe sets) and the default Affymetrix MAS 5.0 normalization was used on the remaining 54,504 probe sets. Array experimental quality was assessed using the following parameters, and all arrays met these criteria for inclusion:

  • GAPDH more than 5000
  • more than 20% expressed genes
  • GAPDH 3'/5' ratios less than 4
  • linear regression R² values of spiked poly(A) controls more than 0.90.
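
The four inclusion criteria above can be expressed as a simple per-array check. This is only an illustrative sketch: the function name, argument names, and the convention that the percentage of expressed genes is given on a 0-100 scale are assumptions, not part of the original protocol.

```python
def failed_qc_criteria(gapdh_signal, pct_expressed, gapdh_3_5_ratio, polya_r2):
    """Return the names of the inclusion criteria an array does not meet.

    All argument names are hypothetical; pct_expressed is assumed to be
    a percentage between 0 and 100.
    """
    checks = {
        "GAPDH signal > 5000": gapdh_signal > 5000,
        "expressed genes > 20%": pct_expressed > 20,
        "GAPDH 3'/5' ratio < 4": gapdh_3_5_ratio < 4,
        "poly(A) spike R^2 > 0.90": polya_r2 > 0.90,
    }
    return [name for name, ok in checks.items() if not ok]
```

An array is included only when the returned list is empty.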

This gene expression dataset may be accessed via the NCI caArray site or at Gene Expression Omnibus under accession number GSE11877.

Microarray gene expression profiling data were available from an initial 54,504 probe sets after masking and filtering of minimal probe sets and controls (Supplemental data). Three different unsupervised, unbiased methods were used to select genes for standard hierarchical clustering: High Coefficient of Variation (HC), as originally described by Eisen et al. (1); Cancer Outlier Profile Analysis (COPA); and Recognition of Outliers by Sampling Ends (ROSE), a novel method similar to COPA developed in the Richard Harvey laboratory at the University of New Mexico (2).

  • HC: the 54,504 probe sets were ordered by their coefficients of variation and the highest 254 probe sets were used for clustering; this method identifies probe sets having an overall high variance relative to mean intensities.
  • COPA: selects outlier probe sets, also in an unsupervised fashion, on the basis of their absolute deviation from the median at a fixed point (typically the 95th percentile).
  • ROSE: developed as an alternative to COPA; selects probe sets on the basis of both the size of the outlier group they identify and the magnitude of the deviation from expected intensity (ROSE and COPA) (2).

For all 3 probe selection methods, the top 254 probe sets (Harvey et al. (2); supplemental Table 7A) were clustered using EPCLUST (Version 0.9.23 beta, Euclidean distance, average linkage UPGMA). A threshold branch distance was applied, and the largest distinct branches above this threshold containing more than 8 patients were retained and labeled. The HC method was used as the basis of cluster definition and nomenclature, with each of the 8 predominant clusters first identified through HC being assigned a number (H1-H8). All clusters are prefixed by the method of their probe set selection (H indicates HC; C, COPA; and R, ROSE), with COPA and ROSE numbers being assigned based on the similarity of a specific cluster group's patient membership to that seen in the original H clusters.

The top 100 median rank order probe sets for each ROSE cluster are provided in Supplemental data. In the validation cohort (COG CCG 1961), the same initial masking criteria were applied to the raw data, yielding 54,504 probe sets for analysis. Applying ROSE with the same parameters used for the COG P9906 ALL cohort (2), 167 probe sets were identified for clustering. The selection criteria used for COG P9906 were also used for COPA and HC, and the top 167 probe sets derived from these methods were used for hierarchical clustering (Harvey et al. (2); supplemental Table 7A).
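
The HC selection step described above (ordering probe sets by coefficient of variation and keeping the highest N) can be sketched as follows. This is a minimal illustration, not the published implementation: the function name is hypothetical, and the use of the population standard deviation and the skipping of probe sets with a non-positive mean are assumptions.

```python
from statistics import mean, pstdev


def top_cv_probesets(expression, n):
    """Return the n probe set IDs with the highest coefficient of variation.

    expression: {probeset_id: [intensity per sample]}
    """
    cv = {}
    for probeset, values in expression.items():
        m = mean(values)
        if m > 0:  # CV is undefined for a zero mean; such probe sets are skipped
            cv[probeset] = pstdev(values) / m
    # Highest CV first, then truncate to the requested number of probe sets.
    return sorted(cv, key=cv.get, reverse=True)[:n]
```

With the values given in the text, n would be 254 for the COG P9906 cohort and 167 for the validation cohort.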

RNA Sample Preparation Methodology
Gene Expression Profiling Method (2): RNA was isolated from pretreatment diagnostic ALL samples in the 207 patients (131 bone marrow, 76 peripheral blood) using TRIzol (Invitrogen); all samples had more than 80% leukemic blasts.

References:

  1. Eisen MB, Spellman PT, Brown PO, Botstein D (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A. 95 (25):14863-14868 (PMID: 9843981)

  2. Harvey RC, Mullighan CG, Wang X, Dobbin KK, Davidson GS, Bedrick EJ, Chen IM, Atlas SR, Kang H, Ar K, Wilson CS, Wharton W, Murphy M, Devidas M, Carroll AJ, Borowitz MJ, Bowman WP, Downing JR, Relling M, Yang J, Bhojwani D, Carroll WL, Camitta B, Reaman GH, Smith M, Hunger SP, Willman CL. (2010). Identification of novel cluster groups in pediatric high-risk B-precursor acute lymphoblastic leukemia with gene expression profiling: correlation with genome-wide DNA copy number alterations, clinical characteristics, and outcome. Blood. 116 (23), 4874-84 (PMID: 20699438)

Gene Chip® Human Genome U133 Plus 2.0 Array (Affymetrix)  for Acute Lymphoblastic Leukemia Phase II and Xenografts (ALL P2 & ALL MDLS)

*Protocols performed at the University of New Mexico.

cRNA preparation for hybridization to U133_Plus_2.0 arrays was performed according to Affymetrix's recommendations (GeneChip Expression Analysis Technical Manual).  First, 300 ng of total RNA was converted to cDNA.  Biotinylated cRNA was generated from the cDNA and 15 µg was subjected to fragmentation.  Either the Affymetrix One-Cycle Target Labeling Kit or the Affymetrix 3' IVT Express Kit was used.  This v02 labeling protocol differs from v01 because the labeling kit changed: Affymetrix changed the IVT kit between 2008 and 2009.  While most of the gene expression patterns remain the same, there are some pronounced differences that may result in set effects when trying to merge data generated from the different labeling kits.

Hybridization of 12.5 µg fragmented biotinylated cRNA was performed according to Affymetrix's recommendations (GeneChip Expression Analysis Technical Manual).

Scanning of the microarrays was performed according to Affymetrix's recommended protocol (GeneChip Expression Analysis Technical Manual).

Data were masked according to the method outlined in Harvey et al., Blood 116:4874-4884 (2010), in order to remove uninformative probe pairs.  Default MAS 5.0 normalization was performed on the masked data using Expression Console software (Affymetrix).

The non-collapsed GCT file is simply the masked MAS 5.0 data from the CHP files formatted as a GCT file.  Level 3 data were generated by using the CollapseDataset algorithm of GenePattern:  http://www.broadinstitute.org/cancer/software/genepattern/.  In applying this software, the "maximum" (as opposed to "median") probeset setting was used, and the gene-to-probeset associations were obtained from the file AFFYMETRIX.chip downloaded from ftp://gseaftp.broadinstitute.org/pub/gsea/annotations/.
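
The "maximum" collapse setting mentioned above can be illustrated with a short sketch. This is not GenePattern's code. It assumes that the maximum is taken independently for each sample across all probesets mapped to the same gene, and the function and argument names are hypothetical.

```python
from collections import defaultdict


def collapse_to_gene_max(probeset_values, probeset_to_gene):
    """Collapse probeset-level values to one row per gene using the maximum.

    probeset_values: {probeset_id: [value per sample]}
    probeset_to_gene: {probeset_id: gene_symbol}; unmapped probesets are dropped
    """
    rows_by_gene = defaultdict(list)
    for probeset, values in probeset_values.items():
        gene = probeset_to_gene.get(probeset)
        if gene is not None:
            rows_by_gene[gene].append(values)
    # For every gene, take the per-sample maximum over its probesets.
    return {
        gene: [max(column) for column in zip(*rows)]
        for gene, rows in rows_by_gene.items()
    }
```

Choosing "median" instead would only change the aggregation applied to each column.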

Gene Chip® Human Genome U133 Plus 2.0 Array (Affymetrix) for Clear Cell Sarcoma of the Kidney (CCSK)

Total RNA was used for gene expression analysis using the Affymetrix U133 Plus 2.0 array (Affymetrix, Santa Clara, CA, USA), performed according to the manufacturer’s protocol. The arrays were analyzed using Gene-Chip Operating Software (GCOS) and Robust Multichip Average (RMA) normalization was performed. Differentially expressed genes were identified using significance analysis of microarrays (SAM) (1); q-values of < 0.01 and fold changes of > 2 were considered significant. Gene Set Enrichment Analysis (GSEA), version 2.0.14 (2), was performed using 1000 permutations and phenotype permutation. Lists of canonical pathways, biologic processes, and oncogenic signatures with at least 50 genes, a false discovery rate (FDR) of < 20%, and a p-value of < 0.05 were considered significant. Pearson correlation coefficient (PCC) calculation was performed using the RMA-normalized Level 3 gene expression data for 76 favorable histology Wilms tumors available in the TARGET Data Matrix. Hierarchical clustering was performed by using GenePattern’s Hierarchical Clustering module (column distance measure = Pearson correlation; row distance measure = Pearson correlation; clustering method = pairwise average-linkage) and was visualized with the HierarchicalClusteringViewer module.

Specifically:

RNA was extracted from tumor samples at Nationwide Children's BioPathology Center (BPC) by using the standard BPC protocol. RNA quality was assessed by a bioanalyzer and RNA samples were required to have a RIN > 7. Total RNA was provided to Lurie Children's Hospital Research Center at a concentration of 150 ng/µl (2 µg total) in sets of 16 samples. One WT sample for which sufficient column-purified RNA was available was selected to serve as a control sample (PAJMLZ). Each set of 16 samples received from the BPC included the WT control sample, which was therefore repeated throughout all steps of this procedure in order to ensure consistency among all steps.

250 ng of total RNA was labeled by using the Affymetrix GeneChip 3' IVT Express Kit at Lurie Children's Hospital Research Center.  All procedures, including 1st strand reverse transcription, 2nd strand synthesis, in vitro transcription of aRNA, aRNA purification, quantitation, and fragmentation were performed according to the manufacturer's protocol.

Nucleic acid hybridization to the array was performed at Lurie Children's Hospital Research Center by using the Affymetrix GeneChip Hybridization, Wash and Stain Kit per the manufacturer's instructions.

The arrays were scanned at Lurie Children's Hospital Research Center by using the Gene-Chip Operating Software (GCOS).  Each .dat file was visually inspected for large scratches and/or misalignment of the grid. GCOS was used to generate .chp files (Level 2 data), which represent the consolidation of all individual probes within a probeset, from .cel files (Level 1 data). From .chp files, GCOS was used to generate .rpt files (Level 3 data), which show probe intensity values and QC values. All samples were inspected for the following parameters:

  • Background < 45 (actual range: 28.19–43.18)
  • Noise (Raw Q) < 1.35 (actual range: 0.670–1.30)
  • Scaling Factor < 65% (actual range: 11.487–52.965)
  • % Present call > 35% (actual range: 38.4–57.7)
  • 3'/5' GAPDH < 3.92 (actual range: 0.95–3.48)

Samples with parameters outside of these limits were rerun starting at the step of RNA labeling.

All .cel files (Level 1 data) were imported into the Broad Institute’s GenePattern server and Robust Multichip Average (RMA) normalization was performed using the ExpressionFileCreator module. Data were exported as a single .txt file (Level 2 data) containing probeset information for each individual tumor within a single spreadsheet.

Several analytic quality control steps were performed. Principal component analysis (PCA) was performed to ensure that none of the samples were outliers. Pair-wise correlation coefficient analysis was performed using the data from the WT control sample that was included in each individual batch of samples. The normalized averages of the expression levels from each WT control run showed a correlation coefficient > 98%, indicating a high level of consistency. Six probesets corresponding to five genes were identified that closely correlated with gender (four male genes [RPS4Y1, DDX3Y, SMCY, and EIF1AY] and one female gene [XIST]). All samples were classified as male or female according to the expression patterns of these genes and the results were checked against the known gender of the patient. No discrepancies were detected.
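
The pair-wise correlation check on the repeated WT control sample can be sketched as below. This is an illustration only, assuming that each control run is represented as a list of normalized expression values in the same probeset order. The function names are hypothetical, and the Pearson coefficient is used because it is the correlation measure named elsewhere in this protocol.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def min_pairwise_correlation(control_runs):
    """Lowest correlation found between any two runs of the control sample."""
    return min(pearson(a, b) for a, b in combinations(control_runs, 2))
```

Under these assumptions, the consistency criterion in the text corresponds to min_pairwise_correlation(control_runs) being greater than 0.98.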

For analyses, 9/10 replicates for PAJMLZ were removed from the RMA gene expression file. A collapsed data file was created by using the Broad Institute's GenePattern CollapseDataset module with the default parameters and the maximum probe collapse method.

References:

  1. Tusher VG, Tibshirani R, Chu G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 98 (9), 5116-5121 (PMID: 11309499)
  2. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES and Mesirov JP. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 102, 15545-15550 (PMID: 16199517)

Gene Chip® Human Genome U133 Plus 2.0 Array (Affymetrix) for Wilms Tumor (WT)

*Protocol performed at Ann and Robert H. Lurie Children’s Hospital.

Gene expression analysis was performed with the Affymetrix U133+2 chip (Affymetrix, Santa Clara, CA, USA), according to the manufacturer’s protocol using the Gene-Chip Operating Software and normalized using robust multichip average normalization. Unsupervised analysis was performed using Non-negative Matrix Factorization Consensus Version 5 (1). GSEA Version 2.0.14 (2) was run using 1,000 permutations and phenotype permutation. Significant enrichment was defined as those lists with >50 genes, an FDR < 10%, and a p-value < 5%.

Specifically:

RNA quality was assessed by a bioanalyzer and RNA samples were required to have a RIN > 7. Total RNA was provided to Lurie Children's Hospital Research Center at a concentration of 150 ng/µl (2 µg total) in sets of 16 samples. One WT sample for which sufficient column-purified RNA was available was selected to serve as a control sample (PAJMLZ). Each set of 16 samples received from the BPC included the WT control sample, which was therefore repeated throughout all steps of this procedure in order to ensure consistency among all steps.

250 ng of total RNA was labeled by using the Affymetrix GeneChip 3' IVT Express Kit at Lurie Children's Hospital Research Center.  All procedures, including 1st strand reverse transcription, 2nd strand synthesis, in vitro transcription of aRNA, aRNA purification, quantitation, and fragmentation were performed according to the manufacturer's protocol.

Nucleic acid hybridization to the array was performed at Lurie Children's Hospital Research Center by using the Affymetrix GeneChip Hybridization, Wash and Stain Kit per the manufacturer's instructions.

The arrays were scanned at Lurie Children's Hospital Research Center by using the Gene-Chip Operating Software (GCOS).  Each .dat file was visually inspected for large scratches and/or misalignment of the grid. GCOS was used to generate .chp files (Level 2 data), which represent the consolidation of all individual probes within a probeset, from .cel files (Level 1 data). From .chp files, GCOS was used to generate .rpt files (Level 3 files), which show probe intensity values and QC values. All samples were inspected for the following parameters:

  • Background < 45 (actual range: 28.19–43.18)
  • Noise (Raw Q) < 1.35 (actual range: 0.670–1.30)
  • Scaling Factor < 65% (actual range: 11.487–52.965)
  • % Present call > 35% (actual range: 38.4–57.7)
  • 3'/5' GAPDH < 3.92 (actual range: 0.95–3.48)

Samples with parameters outside of these limits were rerun starting at the step of RNA labeling.

All .cel files (Level 1 data) were imported into the Broad Institute’s GenePattern server and Robust Multichip Average (RMA) normalization was performed using the ExpressionFileCreator module. Data were exported as a single .txt file (Level 2 data) containing probeset information for each individual tumor within a single spreadsheet.

Several analytic quality control steps were performed. Principal component analysis (PCA) was performed to ensure that none of the samples were outliers. Pair-wise correlation coefficient analysis was performed using the data from the WT control sample that was included in each individual batch of samples. The normalized averages of the expression levels from each WT control run showed a correlation coefficient > 98%, indicating a high level of consistency. Six probesets corresponding to five genes were identified that closely correlated with gender (four male genes [RPS4Y1, DDX3Y, SMCY, and EIF1AY] and one female gene [XIST]). All samples were classified as male or female according to the expression patterns of these genes and the results were checked against the known gender of the patient. No discrepancies were detected.

For analyses, 9/10 replicates for PAJMLZ were removed from the RMA gene expression file. A collapsed data file was created by using the Broad Institute's GenePattern CollapseDataset module with the default parameters and the maximum probe collapse method.

SAM was used to compare gene expression in 51 favorable histology WT (FHWT) tumors sequenced at CGI: those with the MLLT1 variant (5) vs the remainder that do not have the MLLT1 variant (46). Gene expression data are not available for 1 FHWT with the MLLT1 variant. SAM was run using the Level 2 gene expression data. First, probesets that had absent "A" calls for 95% (48) or more samples were filtered out, resulting in the retention of 39913 probesets for analysis. The data were log transformed prior to running SAM. Two-class unpaired analysis was run using 200 permutations; probesets with q < 0.05 were retained.
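The absent-call filter and log transformation that precede SAM can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function and argument names are hypothetical, the 95% cutoff is rounded down to a whole number of samples (which gives 48 of 51, matching the text), and base 2 is assumed for the log transform because the text does not state the base.

```python
import math


def filter_and_log(expression, calls, absent_fraction=0.95):
    """Drop probesets with too many absent calls, then log-transform the rest.

    expression: dict mapping probeset ID -> list of positive expression values
    calls:      dict mapping probeset ID -> list of detection calls ("P", "M", "A")
    """
    kept = {}
    for probeset, values in expression.items():
        n_samples = len(values)
        # Assumption: the cutoff is rounded down (48 for 51 samples).
        cutoff = int(absent_fraction * n_samples)
        n_absent = sum(1 for call in calls[probeset] if call == "A")
        if n_absent >= cutoff:
            continue  # absent in 95% or more of samples: filtered out
        # Assumption: log base 2.
        kept[probeset] = [math.log2(v) for v in values]
    return kept


# Toy data for 51 samples: one probeset with 48 absent calls, one with 47.
expression = {"probeset_1": [8.0] * 51, "probeset_2": [8.0] * 51}
calls = {
    "probeset_1": ["A"] * 48 + ["P"] * 3,
    "probeset_2": ["A"] * 47 + ["P"] * 4,
}
result = filter_and_log(expression, calls)
print(sorted(result))
```

The SAM permutation test itself is not reproduced here; the sketch only covers the preprocessing described in the paragraph above.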

References:

  1. Brunet, J. P., Tamayo, P., Golub, T. R. & Mesirov, J. P. (2004) Metagenes and molecular pattern discovery using matrix factorization. Proc. Natl Acad. Sci. USA. 101, 4164-4169 (PMID: 15016911)
  2. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES and Mesirov JP. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 102, 15545-15550 (PMID: 16199517)

Gene Chip® Human Genome U133 Plus 2.0 Array (Affymetrix)  for Pediatric Preclinical Testing Program (PPTP)

The RNA extraction was performed according to Qiagen manufacturer's protocol (RNeasy kit).

The labeling and array scanning were performed according to the manufacturer's protocol.

SurePrint G3 Human Gene Expression Array (Agilent) for Pediatric Preclinical Testing Program (PPTP)

The RNA extraction was performed according to Qiagen manufacturer's protocol (RNeasy kit).

The nucleic acid labeling was performed according to the manufacturer's protocol for One-Color Microarray-Based Gene Expression Analysis (Agilent Technologies). The Low Input Quick Amp Labeling Kit, One-Color, generated fluorescent cRNA with a sample input RNA range between 10 ng and 200 ng of total RNA or a minimum of 5 ng of poly A+ RNA for one-color processing. The method uses T7 RNA Polymerase Blend (red cap), which simultaneously amplifies target material and incorporates Cyanine 3-CTP.

The nucleic acid hybridization to the array was performed according to the manufacturer's protocol for One-Color Microarray-Based Gene Expression Analysis (Agilent Technologies). Briefly, the 10x blocking agent was prepared by adding 500 µl of nuclease-free water to the 10x agent supplied with the kit, mixed on a vortex and centrifuged for 5-10 seconds. The RNA fragmentation reaction was performed at 60°C for 30 minutes, after which the samples were cooled on ice for one minute and 2x Hi-RPM Hybridization Buffer was added to stop the reaction. These samples were further mixed, spun for 1 minute at room temperature at 13,000 x g, placed on ice and loaded on the array. The arrays were hybridized at 65°C for 17 hours. This step was followed by washing the microarray slides with Gene Expression Wash Buffers I and II.

The array scanning was performed according to the manufacturer's protocol for One-Color Microarray-Based Gene Expression Analysis (Agilent Technologies). The assembled slide holders were put into the scanner cassette, after which the appropriate scanner protocol was selected and run. To extract information on probe features from the microarray scan data, the Feature Extraction process was performed using the software provided on the Agilent website.

Copy Number
Data Generation Protocols Data Analysis Protocols
Gene Chip® Human Mapping 500K Array (Affymetrix) ALL P1
Genome-Wide Human SNP Array 6.0 (Affymetrix) ALL P1/ P2 , AML , CCSK , PPTP , WT , OS
HumanHap 550K Beadchip (Illumina) NBL
Human Omni5 BeadChip Kit (Illumina) PPTP
Infinium Omni2.5Exome-8 Kit (Illumina) ALAL

Gene Chip® Human Mapping 500K Array (Affymetrix) for Acute Lymphoblastic Leukemia Phase I (ALL P1)

*Protocol performed at St. Jude Children's Research Hospital.

DNA was extracted using QIAGEN QIAamp DNA Mini Kit according to manufacturer’s protocol.   

Nucleic acid labeling, hybridization and array scanning were performed at St. Jude Children’s Research Hospital according to the Affymetrix manufacturer’s protocol for Affymetrix Mapping 250K or Affymetrix Genome-Wide SNP6 arrays.

Normalization and data transformation protocols were carried out at St. Jude Children’s Research Hospital as follows: 250K genotypes were generated using the BRLMM algorithm implemented in GTYPE (Affymetrix). SNP6 genotypes were generated using the Birdseed v2 algorithm in Genotyping Console (Affymetrix). Samples that failed standard QC metrics (contrast QC) were excluded. To generate copy number data, data were analyzed using an extensively used and validated algorithm developed at St. Jude Children’s Research Hospital. Affymetrix SNP array CEL files (Level 1 data) and SNP call files (either .CHP or .TXT files; Level 2 data) were imported into dChip and probe level values were summarized [1]. Data were exported and normalized using the reference normalization algorithm [2]. This algorithm uses user-supplied or computationally detected diploid chromosomes to guide normalization of the entire array on a sample-by-sample basis, and optimizes normalization of complex cancer samples while eliminating batch effects. This procedure generates the .cnmz file that includes both the summarized, normalized probe intensities and the genotype data.

Optimally normalized data were subjected to paired circular binary segmentation [3] with thresholds set to detect copy number segments of >2.3 or <0.7 copies and at least 5 markers (250K data) or 8 markers (SNP6 data). Raw copy number segmentation results were inspected and curated in dChip.

References

  1. Lin M, et al. (2004) dChipSNP: significance curve and clustering of SNP-array-based loss-of-heterozygosity data. Bioinformatics 20, 1233-40 (PMID: 14871870)
  2. Pounds S, et al. (2009). Reference alignment of SNP microarray signals for copy number analysis of tumors. Bioinformatics 25, 315-21 (PMID: 19052058)
  3. Venkatraman ES, et al. (2007) A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23, 657-63 (PMID: 17234643)

Genome-Wide Human SNP Array 6.0 (Affymetrix) for Acute Lymphoblastic Leukemia Phases I & II (ALL P1, ALL P2)

*Protocol performed at St. Jude Children’s Research Hospital.

DNA was extracted using QIAGEN QIAamp DNA Mini Kit according to manufacturer’s protocol. 

Nucleic acid labeling, hybridization and array scanning were performed at St. Jude Children’s Research Hospital according to the Affymetrix manufacturer’s protocol for Affymetrix Mapping 250K or Affymetrix Genome-Wide SNP6 arrays.

Normalization and data transformation protocols were carried out at St. Jude Children’s Research Hospital as follows: 250K genotypes were generated using the BRLMM (Bayesian Robust Linear Model with Mahalanobis) algorithm implemented in GTYPE (Genotyping Analysis Software, Affymetrix). SNP6 genotypes were generated using the Birdseed v2 algorithm in Genotyping Console (Affymetrix). Samples that failed standard quality control metrics (contrast quality control) were excluded. To generate copy number data, data were analyzed using an extensively used and validated algorithm developed at St. Jude Children’s Research Hospital. Affymetrix SNP array CEL files (Level 1 data) and SNP call files (either .CHP or .TXT files; Level 2 data) were imported into dChip and probe level values were summarized [1]. Data were exported and normalized using the reference normalization algorithm [2]. This algorithm uses user-supplied or computationally detected diploid chromosomes to guide normalization of the entire array on a sample-by-sample basis, and optimizes normalization of complex cancer samples while eliminating batch effects. This procedure generates the .cnmz file that includes both the summarized, normalized probe intensities and the genotype data.

Optimally normalized data were subjected to paired circular binary segmentation [3] with thresholds set to detect copy number segments of >2.3 or <0.7 copies and at least 5 markers (250K data) or 8 markers (SNP6 data). Raw copy number segmentation results were inspected and curated in dChip.

References

  1. Lin M, et al. (2004) dChipSNP: significance curve and clustering of SNP-array-based loss-of-heterozygosity data. Bioinformatics 20, 1233-40 (PMID: 14871870)
  2. Pounds S, et al. (2009). Reference alignment of SNP microarray signals for copy number analysis of tumors. Bioinformatics 25, 315-21 (PMID: 19052058)
  3. Venkatraman ES, et al. (2007) A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23, 657-63 (PMID: 17234643)

Genome-Wide Human SNP Array 6.0 (Affymetrix) for Acute Myeloid Leukemia (AML)

*Protocols performed at the Fred Hutchinson Cancer Research Center.

All genotyping was performed according to the manufacturer’s protocol. Briefly, two identical aliquots containing 250 ng of DNA were digested with specific restriction enzymes in separate reactions; one reaction contained NspI and the other StyI. Immediately following digestion, each sample was ligated with adaptors containing a sequence complementary to the overhang generated at digestion. Following ligation, each sample was subjected to PCR amplification using standard reagents. Following PCR, each sample was assayed on a 2% agarose gel to ensure that a DNA smear of appropriate size was produced. The Nsp and Sty amplifications were combined, purified and quantitated. All samples with at least 180 micrograms total DNA were allowed to continue to enzymatic fragmentation using the Affymetrix Fragmentation reagent. The fragmented DNA was assayed on a 4% agarose gel to ensure that the size of the DNA collapsed to less than 75nt. Following fragmentation, the DNA was end-labeled with terminal deoxynucleotidyl transferase and Affymetrix DNA labeling reagent.

Nucleic acid hybridization was performed according to the manufacturer's protocol for the Affymetrix 6.0 SNP array. Each sample was resuspended in hybridization buffer and hybridized to the Affymetrix 6.0 array for 16 hours. Following hybridization, the arrays were washed on the Affymetrix Fluidics station and scanned on the GeneChip scanner.

The array scanning protocol was performed according to the manufacturer's protocol for the Affymetrix 6.0 SNP array.

All data were processed using the standard analysis suite provided by Affymetrix. The QC call rate is determined for each sample using a subset of SNPs and the DM algorithm. A QC call rate greater than 87% is a passing score for Affymetrix; the average call rate for this dataset was 99.4%. Samples passing the QC call rate were then clustered using the Birdseed algorithm. Individual data files (CEL files) were uploaded to Partek Genomics Suite (St. Louis, MO). Using a paired analysis (each patient’s remission sample was used as the reference), copy number was calculated for each probeset and is reported in the Level 2 file, TARGET_AML_level2_paired_CN_log2_format.txt.

To find areas of the genome amplified or deleted, the Partek segmentation algorithm was applied to the level 2 dataset.

The unfiltered copy number segmentation file, TARGET_AML_CN_level3_unfiltered_Diagnostic.txt, contains all segments for each patient, both changed and unchanged, with no filtering parameters applied. To reduce the number of false positives and filter out unchanged segments of the genome, the filtered file, TARGET_AML_CN_level3_filtered_Diagnostic.txt, contains only segments with a copy number <1.7 or >2.3, >99 markers and a p-value <0.05. In addition, segments from the Y chromosome and mitochondrial genome were removed.
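The filtering criteria that distinguish the filtered Level 3 file from the unfiltered one can be sketched as a single predicate. This is a minimal illustration, not the Partek workflow: the field names and the chromosome labels "Y" and "MT" are assumptions, and the actual column names in the TARGET files may differ.

```python
# Assumption: chromosomes are labeled "Y" and "MT"; the real files may use
# a different encoding.
EXCLUDED_CHROMOSOMES = {"Y", "MT"}


def keep_segment(segment):
    """Apply the filtering criteria quoted in the text to one segment."""
    copy_number_changed = (
        segment["copy_number"] < 1.7 or segment["copy_number"] > 2.3
    )
    return (
        copy_number_changed
        and segment["markers"] > 99
        and segment["p_value"] < 0.05
        and segment["chromosome"] not in EXCLUDED_CHROMOSOMES
    )


# Hypothetical segments, for illustration only.
segments = [
    {"chromosome": "7", "copy_number": 1.2, "markers": 450, "p_value": 0.001},
    {"chromosome": "1", "copy_number": 2.0, "markers": 5000, "p_value": 0.6},
    {"chromosome": "9", "copy_number": 2.8, "markers": 40, "p_value": 0.01},
    {"chromosome": "Y", "copy_number": 0.9, "markers": 300, "p_value": 0.001},
]
filtered = [s for s in segments if keep_segment(s)]
print([s["chromosome"] for s in filtered])
```

Only the chromosome 7 segment satisfies every criterion; the others fail on copy number, marker count, or chromosome.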

Genome-Wide Human SNP Array 6.0 (Affymetrix) for Clear Cell Sarcoma of the Kidney (CCSK)

Nucleic acid labeling, hybridization, and array scanning were performed on 11 CCSKs according to the manufacturer’s protocol for the Affymetrix 6.0 SNP array (Affymetrix, Santa Clara, CA, USA) and processed with the Affymetrix Genotyping Console (GTC) 4.0 software. Reference normalization was performed as described by Pounds et al. [1]. Circular binary segmentation (CBS) was performed using DNAcopy from BioConductor. Segmented regions of autosomal chromosomes containing at least 8 markers in which the log2 value was > +0.5 or < -0.5 were considered regions of gain or loss, respectively. For the other 2 CCSK samples, copy number was assessed by using relative coverage generated by whole genome sequencing.

Specifically:

*Protocol performed at Ann and Robert H. Lurie Children’s Hospital and St. Jude Children’s Research Hospital.

DNA was extracted from normal kidney, tumor, or blood samples at Nationwide Children's BioPathology Center (BPC) by using the standard BPC protocol. PicoGreen analysis was performed to verify the concentration of gDNA. Spectrophotometry was performed to verify DNA purity and gel electrophoresis was performed to verify DNA quality. Tumor and corresponding normal specimens (blood and/or normal kidney) were supplied to St. Jude Children's Research Hospital on 96-well plates allowing for the inclusion of two controls.

Nucleic acid labeling, hybridization and array scanning protocols were performed according to the Affymetrix manufacturer's protocol for the Affymetrix 6.0 SNP array at St. Jude Children's Research Hospital.

Data were provided by St. Jude Children's Research Hospital in the Affymetrix CEL file format (Level 1 data) and the CEL files were processed using Affymetrix Genotyping Console (GTC) 4.0 software to generate corresponding Birdseed .chp and .txt files (Level 2 data) by using the Birdseed v2 algorithm with the default parameters. Several quality control parameters were used:

  • Contrast QC (quality control): The average contrast QC was 1.83 for all samples, which is above the minimum of 1.7 recommended by Affymetrix. Less than 10% of samples had a contrast QC <0.4, and those samples with contrast QC <0.4 were deemed acceptable based on their heterozygosity values and Birdseed call rates.
  • DNA gender check: Samples were classified into genders using Affymetrix Genotyping Console software; no inconsistencies were noted. Only 0.03% of all samples could not be classified according to gender (“unknown”); all of these samples were tumor samples in which the gender of the corresponding normal sample was called correctly.
  • Sample Call Rate: Affymetrix GTC 4.0 software was used to check the calling rate of constitutional DNA samples and all samples had calling rates greater than the cut-off of >95.5% (range, 94.1–99.5%; mean, 97.9%). Furthermore, the calling rate of tumor samples ranged from 93.4–97.3% (mean, 97.4%).
  • DNA Autosomal Heterozygosity rate: The percentage of heterozygous SNPs among all measured SNPs was determined per sample using Affymetrix GTC 4.0 software. The heterozygosity rates of normal samples ranged from 24–32%, which is within normal limits. This rate, which is expected to be lower for tumors compared to normals, ranged from 15–32% in our tumor samples.
  • Normalization: The reference normalization procedure used for our data relies on an algorithm developed at St. Jude that uses a diploid chromosome for each sample to guide data normalization, as described [1]. In the first step, the CEL files (Level 1 data) and Birdseed .txt files (Level 2 data) are read into dChip and model-based expression index (MBEI) analysis is performed to generate probe level summarization values for each individual probe. This results in a file containing two columns for each individual sample: (1) the summarized probe value and (2) the genotype call. This file containing un-normalized data is exported from dChip as a text file and imported into R for reference normalization according to Pounds et al. [1]. This algorithm requires two input files: (1) the dChip output file described above and (2) a text file defining each SNP on the Affymetrix 6.0 chip according to chromosome and location. The reference chromosome for each sample was selected by using Nexus 6.0 software. The reference normalization algorithm provides an output text file containing two columns for each sample: (1) the normalized probe value and (2) the genotype call.

Circular binary segmentation (CBS) was then applied to the output files in order to obtain segmented copy number information. This was performed in R using the DNAcopy BioConductor package. First, the log (base=2) of the ratios of each tumor sample's signal values over the signal values of the corresponding normal samples was calculated. After detecting outliers and smoothing the log ratio signal data, CBS was applied to segment the data into regions of estimated equal copy number. CBS was performed using default parameters including “nperm = 10,000”, “alpha = 0.01”, “undo.splits = sdundo”, and “undo.SD = 1”. This algorithm resulted in a segmented file for each tumor sample relative to the corresponding normal sample.
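The tumor-over-normal log2 ratio that feeds CBS, and the gain/loss rule stated at the start of this CCSK section (at least 8 markers, log2 value > +0.5 or < -0.5), can be sketched as below. This is only an illustration of those two steps: the function names are hypothetical, and the segmentation itself was done with the DNAcopy package in R and is not reproduced here.

```python
import math


def log2_ratios(tumor_signal, normal_signal):
    """Per-marker log2 ratio of tumor signal over matched normal signal."""
    return [math.log2(t / n) for t, n in zip(tumor_signal, normal_signal)]


def call_segment(mean_log2, n_markers, min_markers=8, threshold=0.5):
    """Classify one segment using the thresholds quoted in the CCSK section."""
    if n_markers < min_markers:
        return "neutral"
    if mean_log2 > threshold:
        return "gain"
    if mean_log2 < -threshold:
        return "loss"
    return "neutral"


# Hypothetical normalized probe values for two markers.
ratios = log2_ratios([4.0, 1.0], [2.0, 2.0])
print(ratios)
print(call_segment(0.8, 25), call_segment(-0.9, 12), call_segment(0.8, 5))
```

In the text the rule applies to autosomal segments only; a chromosome check would need to be added before using this on real segment tables.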

Reference

  1. Pounds S, et al. (2009). Reference alignment of SNP microarray signals for copy number analysis of tumors. Bioinformatics 25, 315-21 (PMID: 19052058)

Genome-Wide Human SNP Array 6.0 (Affymetrix) for Pediatric Preclinical Testing Program (PPTP)

*Protocol performed at Nationwide Children’s Hospital.

The DNA extraction was performed according to the Qiagen manufacturer's protocol (DNeasy Kit) in combination with TRIzol.

All genotyping for the Genome-Wide Human SNP Array 6.0 was performed according to the Affymetrix manufacturer’s protocol. Briefly, two identical aliquots containing 250 ng of DNA were digested with specific restriction enzymes in separate reactions; one reaction contained NspI and the other StyI. Immediately following digestion, each sample was ligated with adaptors containing a sequence complementary to the overhang generated at digestion. Following ligation, each sample was subjected to PCR amplification using standard reagents. Following PCR, each sample was assayed on a 2% agarose gel to ensure that a DNA smear of appropriate size was produced. The Nsp and Sty amplifications were combined, purified and quantitated. All samples with at least 180 µg total DNA were allowed to continue to enzymatic fragmentation using the Affymetrix Fragmentation reagent. The fragmented DNA was assayed on a 4% agarose gel to ensure that the size of the DNA collapsed to less than 75nt. Following fragmentation, the DNA was end-labeled with terminal deoxynucleotidyl transferase and Affymetrix DNA labeling reagent.

Nucleic acid hybridization for the Genome-Wide Human SNP Array 6.0 was performed according to the manufacturer's protocol for the Affymetrix 6.0 SNP array. Each sample was resuspended in hybridization buffer and hybridized to the Affymetrix 6.0 array for 16 hours. Following hybridization, the arrays were washed on the Affymetrix Fluidics station and scanned on the GeneChip scanner.

The array scanning protocol for the Genome-Wide Human SNP Array 6.0 was performed according to the manufacturer's protocol for the Affymetrix 6.0 SNP array.

Genome-Wide Human SNP Array 6.0 (Affymetrix) for Wilms Tumor (WT)

*Protocol performed at Ann and Robert H. Lurie Children’s Hospital and St. Jude Children’s Research Hospital.

DNA was extracted from normal kidney, tumor, or blood samples at Nationwide Children's BioPathology Center (BPC) by using the standard BPC protocol. PicoGreen analysis was performed to verify the concentration of gDNA. Spectrophotometry was performed to verify DNA purity and gel electrophoresis was performed to verify DNA quality. Tumor and corresponding normal specimens (blood and/or normal kidney) were supplied to St. Jude Children's Research Hospital on 96-well plates allowing for the inclusion of two controls.

Nucleic acid labeling, hybridization and array scanning protocols were performed according to the Affymetrix manufacturer's protocol for the Affymetrix 6.0 SNP array at St. Jude Children's Research Hospital.

Data were provided by St. Jude Children's Research Hospital in the Affymetrix CEL file format (Level 1 data) and the CEL files were processed using Affymetrix Genotyping Console (GTC) 4.0 software to generate corresponding Birdseed .chp and .txt files (Level 2 data) by using the Birdseed v2 algorithm with the default parameters. Several quality control (QC) parameters were used:

  • Contrast QC: The average contrast QC was 1.83 for all samples, which is above the minimum of 1.7 recommended by Affymetrix. Less than 10% of samples had a contrast QC <0.4, and those samples with contrast QC <0.4 were deemed acceptable based on their heterozygosity values and Birdseed call rates.
  • DNA gender check: Samples were classified into genders using Affymetrix Genotyping Console software; no inconsistencies were noted. Only 0.03% of all samples could not be classified according to gender (“unknown”); all of these samples were tumor samples in which the gender of the corresponding normal sample was called correctly.
  • Sample Call Rate: Affymetrix GTC 4.0 software was used to check the calling rate of constitutional DNA samples and all samples had calling rates greater than the cut-off of >95.5% (range, 94.1–99.5%; mean, 97.9%). Furthermore, the calling rate of tumor samples ranged from 93.4–97.3% (mean, 97.4%).
  • DNA Autosomal Heterozygosity rate: The percentage of heterozygous SNPs among all measured SNPs was determined per sample using Affymetrix GTC 4.0 software. The heterozygosity rates of normal samples ranged from 24–32%, which is within normal limits. This rate, which is expected to be lower for tumors compared to normals, ranged from 15–32% in our tumor samples.
  • Normalization: The reference normalization procedure used for our data relies on an algorithm developed at St. Jude that uses a diploid chromosome for each sample to guide data normalization, as described [1]. In the first step, the CEL files (Level 1 data) and Birdseed .txt files (Level 2 data) are read into dChip and model-based expression index (MBEI) analysis is performed to generate probe level summarization values for each individual probe. This results in a file containing two columns for each individual sample: (1) the summarized probe value and (2) the genotype call. This file containing un-normalized data is exported from dChip as a text file and imported into R for reference normalization according to Pounds et al. [1]. This algorithm requires two input files: (1) the dChip output file described above and (2) a text file defining each SNP on the Affymetrix 6.0 chip according to chromosome and location. The reference chromosome for each sample was selected by using Nexus 6.0 software. The reference normalization algorithm provides an output text file containing two columns for each sample: (1) the normalized probe value and (2) the genotype call.

Circular binary segmentation (CBS) was then applied to the output files in order to obtain segmented copy number information. This was performed in R using the DNAcopy BioConductor package. First, the log (base=2) of the ratios of each tumor sample's signal values over the signal values of the corresponding normal samples was calculated. After detecting outliers and smoothing the log ratio signal data, CBS was applied to segment the data into regions of estimated equal copy number. CBS was performed using default parameters including “nperm = 10,000”, “alpha = 0.01”, “undo.splits = sdundo”, and “undo.SD = 1”. This algorithm resulted in a segmented file for each tumor sample relative to the corresponding normal sample.

Reference

  1. Pounds S, et al. (2009). Reference alignment of SNP microarray signals for copy number analysis of tumors. Bioinformatics 25, 315-21 (PMID: 19052058)

Genome-Wide Human SNP Array 6.0 (Affymetrix) for Osteosarcoma (OS)

*Protocol performed at Nationwide Children’s Hospital (nucleic acid extraction), Children’s Hospital of Los Angeles (data generation) and Texas Children’s Hospital (data analysis/summarization).

DNA was extracted using the DNA/RNA co-isolation method using the Qiagen AllPrep Kit.

DNA samples were labeled using the SNP 6.0 core reagent kit following Affymetrix guidelines. Hybridization of samples was performed following the manufacturer's instructions for the Affymetrix Genome-Wide Human SNP Array 6.0, and arrays were scanned using the Affymetrix GeneChip Scanner 3000 7G.

Raw CEL files were processed using Genotyping Console Software, and segmentation analysis was performed using the Partek Genomics Suite genomic segmentation algorithm.

HumanHap 550K Beadchip (Illumina) for Neuroblastoma (NBL)

*Protocol performed at Nationwide Children’s Hospital (extractions) and Children’s Hospital of Philadelphia.

The DNA was extracted at Nationwide Children's Molecular Genetics Laboratory (MGL) using either the Qiagen AllPrep co-isolation method or the Qiagen Genomic-tips protocol. QIAGEN Blood & Cell Culture DNA Kits and QIAGEN Genomic-tips with the Genomic DNA Buffer Set isolate high-molecular-weight genomic DNA directly from whole blood, lymphocytes and tissues. The procedure is based on an optimized buffer system for lysis of cells and/or nuclei, followed by binding of genomic DNA to QIAGEN Anion Exchange Resin under appropriate low-salt and pH conditions. RNA, proteins, dyes and low-molecular-weight impurities are removed by a medium-salt wash. Genomic DNA is eluted in a high-salt buffer and concentrated and desalted by isopropanol precipitation.

Nucleic acid labeling, hybridization, array scanning and data normalization were performed according to the Illumina manufacturer's protocol for the Illumina 550K array at the Children's Hospital of Philadelphia.

Data transformation was done using the OverUnder algorithm [1]. Older Level 2 data that used reference genome hg18 were remapped to hg19 during analysis so that all Level 3 copy number segmentation results use hg19.

References

  1. Attiyeh EF et al. (2009). Genomic copy number determination in cancer cells from single nucleotide polymorphism microarrays based on quantitative genotyping corrected for aneuploidy. Genome Res 19(2), 276-83 (PMID: 19141597)

Human Omni5 BeadChip Kit (Illumina) for Pediatric Preclinical Testing Program (PPTP)

*Protocol performed at Nationwide Children’s Hospital.

Nucleic acid labeling for the Human Omni5 BeadChip was performed according to Illumina's standard protocol; please refer to the Illumina Infinium LCG Quad Assay protocols manual.

Hybridization to the Human Omni5 BeadChip was performed according to the manufacturer's standard protocol; please refer to the Illumina Infinium LCG Quad Assay protocols manual.

Scanning of the Human Omni5 BeadChip was performed according to the manufacturer's standard protocol using the Illumina HiScan instrument with iScan software.

Infinium Omni2.5Exome-8 Kit (Illumina) for Acute Leukemia of Ambiguous Lineage (ALAL)

*Protocol performed at St. Jude Children's Research Hospital.

DNA from leukemia and matched germline samples was prepared for hybridization to Illumina Infinium Omni2.5 Exome-8 SNP arrays according to the manufacturer’s protocol.

The raw intensity data (*.idat files) were analyzed by the Genotyping Module of Illumina Genome Studio software version 2.0.3. Normalized Log R Ratio (LRR) and B Allele Frequency (BAF) for all the available probes in each sample were extracted.

Methylation
Data Generation Protocols Data Analysis Protocols
HpaII tiny fragment Enrichment by Ligation-mediated PCR (HELP) Assay with NimbleGen Arrays (Roche) ALL P2
Infinium HumanMethylation27 BeadChip (Illumina) AML
Infinium HumanMethylation450 BeadChip (Illumina) AML , CCSK , NBL , OS , WT
Infinium MethylationEPIC BeadChip Kit (Illumina) ALAL

HpaII tiny fragment Enrichment by Ligation-mediated PCR (HELP) Assay with NimbleGen Arrays (Roche) for Acute Lymphoblastic Leukemia Phase 2 (ALL P2)

*Protocol performed at Weill Cornell Medical College.

DNA was extracted using the QIAGEN QIAamp DNA Mini Kit according to the manufacturer’s protocol at St. Jude Children’s Research Hospital.

Nucleic acid labeling, hybridization and array scanning protocols were used according to NimbleGen (Roche) manufacturer’s protocols (see NimbleGen Dual-color DNA Labeling Kit, NimbleGen Hybridization System 4 and NimbleGen MS 200 Microarray Scanner respectively).                 

Normalization and data transformation were carried out as follows: the median-normalized log2 ratio of HpaII to MspI signal intensity was calculated as detailed in Thompson et al, Bioinformatics 2008;24:1161-7. Software: NimbleScan version 2.5.26. Value: median-normalized log2-transformed HpaII/MspI ratios. "NA": assigned if the MspI signal intensity was < 1 mean absolute deviation (MAD) above the median of random probe signals.
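The value and "NA" rules described above can be illustrated with a short sketch. It is not the Thompson et al. pipeline or NimbleScan output: the function name is hypothetical, the MAD is computed as the mean absolute deviation of the random probe signals from their median, the cutoff is applied to untransformed MspI intensities, and median normalization is taken to mean subtracting the median log2 ratio of the unmasked probes. All of these are assumptions.

```python
import math
from statistics import median


def help_log_ratios(hpaii, mspi, random_signals, n_mad=1.0):
    """Median-centered log2(HpaII/MspI) ratios, with None standing in for NA."""
    center = median(random_signals)
    # Assumption: MAD = mean absolute deviation from the median of random probes.
    mad = sum(abs(x - center) for x in random_signals) / len(random_signals)
    cutoff = center + n_mad * mad

    # Probes whose MspI signal falls below the cutoff are reported as NA.
    raw = [
        math.log2(h / m) if m >= cutoff else None
        for h, m in zip(hpaii, mspi)
    ]
    valid = [r for r in raw if r is not None]
    if not valid:
        return raw
    # Assumption: median normalization subtracts the median of unmasked ratios.
    shift = median(valid)
    return [None if r is None else r - shift for r in raw]


# Hypothetical intensities for four probes and three random probes.
ratios = help_log_ratios(
    hpaii=[64.0, 16.0, 8.0, 5.0],
    mspi=[16.0, 16.0, 16.0, 13.0],
    random_signals=[10.0, 12.0, 14.0],
)
print(ratios)
```

The last probe is masked because its MspI signal (13) is below the cutoff derived from the random probes.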

Transformation of Level 2 data matrix into per sample Level 3 files with probe set gene annotations added was performed at NCI Center for Bioinformatics and Information Technology.

Infinium HumanMethylation27 Bead Chip (Illumina) for Acute Myeloid Leukemia (AML)

*Protocols performed at the Johns Hopkins University.

Bisulfite conversion of genomic DNA was performed with the EZ DNA Methylation Kit (Zymo Research, Irvine, CA, #D5002) following the manufacturer’s protocol with modifications for the Infinium Methylation Assay. Briefly, one microgram of genomic DNA was mixed with 5 µl of Dilution Buffer and incubated at 37°C for 15 minutes and then mixed with 100 µl of conversion reagent prepared as instructed in the protocol. Mixtures were incubated in a thermocycler for 16 cycles at 95°C for 30 seconds and 50°C for 60 minutes. Bisulfite-converted DNA samples were loaded onto the provided 96-column plates for desulphonation, washing and elution. The concentration of bisulfite-converted, eluted DNA was measured by UV-absorbance using a NanoDrop-1000 (Thermo Fisher Scientific, Waltham, MA). Bisulfite-converted genomic DNA was analyzed using the Infinium Human Methylation27 Beadchip Kit (Illumina, San Diego, CA, #WG-311-1202). DNA amplification, fragmentation, array hybridization, extension and staining were performed with reagents provided in the kit according to the manufacturer’s protocol (Illumina Infinium II Methylation Assay, #WG-901-2701). Briefly, 4 µl of bisulfite-converted genomic DNA (at a minimum concentration of 20 ng/µl) was added to a 0.8 ml 96-well storage plate (Thermo Fisher Scientific), denatured in 0.014N sodium hydroxide, neutralized and then amplified for 20-24 hours at 37°C. Samples were fragmented at 37°C for 60 minutes and precipitated in isopropanol. Re-suspended samples were denatured in a 96-well plate heat block at 95°C for 20 minutes. 15 µl of each sample was loaded onto a 12-sample BeadChip, assembled in the hybridization chamber as instructed by the manufacturer and incubated at 48°C for 16-20 hours. Following hybridization, the BeadChips were washed and assembled in a fluid flow-through station for primer-extension reaction and staining with reagents and buffers provided.

Polymer-coated BeadChips were scanned in an iScan scanner (Illumina) using Inf Methylation mode.

Signal intensity and Beta value data were extracted using the Methylation module of GenomeStudio (Illumina, v2011.1) software following the Methylation analysis pipeline without normalization or background subtraction using BeadChip content descriptors provided by the manufacturer (HumanMethylation27_270596_v.1.2.bpm).

Summary beta values are provided for each locus, with annotations for Illumina probe name, gene symbol, chromosome and CpG position (UCSC hg18).

Data were normalized using functional normalization (funnorm) as implemented in the minfi package and summarized as beta values [M /(M+U)] with annotation at each locus for Illumina probe name, gene symbol, chromosome and CpG position (UCSC hg19).  Probes having an annotated SNV within the CpG or SBE site are masked as NA across all samples.  Probes where the non-detection probability was >  0.01 are masked as NA for individual samples.
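The beta-value summary and per-sample masking described above can be sketched as follows. This is a minimal illustration in plain Python, not the minfi implementation: the function name `beta_values`, the list-based inputs, and the decision to mask probes with zero total intensity are assumptions made for the example.

```python
from typing import List, Optional


def beta_values(
    meth: List[float],
    unmeth: List[float],
    detection_p: List[float],
    max_p: float = 0.01,
) -> List[Optional[float]]:
    """Return M / (M + U) per probe for one sample; None stands in for NA."""
    betas: List[Optional[float]] = []
    for m, u, p in zip(meth, unmeth, detection_p):
        total = m + u
        if p > max_p or total == 0:
            # Masked for this sample: non-detection probability above the
            # threshold. Masking zero total intensity is an assumption.
            betas.append(None)
        else:
            betas.append(m / total)
    return betas


# Three hypothetical probes; the last one fails the detection threshold.
print(beta_values([900.0, 50.0, 10.0], [100.0, 150.0, 12.0], [0.0, 0.001, 0.2]))
```

Masking of probes with an annotated SNV would be applied separately, across all samples, from a probe annotation list.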

Infinium HumanMethylation450 Bead Chip (Illumina) for Acute Myeloid Leukemia (AML)

*Protocols performed at the Phoenix Children’s Hospital.

Bisulfite conversion of genomic DNA was performed with EZ DNA methylation Kit (Zymo Research, Irvine, CA, #D5002) following the manufacturer’s protocol with modifications for the Infinium Methylation Assay. Briefly, one microgram of genomic DNA was mixed with 5 µl of Dilution Buffer and incubated at 37°C for 15 minutes and then mixed with 100 µl of conversion reagent prepared as instructed in the protocol. Mixtures were incubated in a thermocycler for 16 cycles at 95°C for 30 seconds and 50°C for 60 minutes. Bisulfite-converted DNA samples were loaded onto the provided 96-column plates for desulphonation, washing and elution. Bisulfite-converted genomic DNA was analyzed using the Infinium Human Methylation450K Beadchip Kit (Illumina, San Diego, CA, #WG-314-1001). DNA amplification, fragmentation, array hybridization, extension and staining were performed with reagents provided in the kit according to the manufacturer’s protocol (Illumina Infinium II Methylation Assay, #WG-901-2701). Briefly, 4 µl of bisulfite-converted genomic DNA was added to 0.8 ml 96-well storage plate (Thermo Fisher Scientific), denatured in 0.014N sodium hydroxide, neutralized and then amplified for 20-24 hours at 37°C. Samples were fragmented at 37°C for 60 minutes and precipitated in isopropanol. Re-suspended samples were denatured in a 96-well plate heat block at 95°C for 20 minutes. 15 µl of each sample was loaded onto a 12-sample BeadChip, assembled in the hybridization chamber as instructed by the manufacturer and incubated at 48°C for 16-20 hours.  Following hybridization, the BeadChips were washed and assembled in a fluid flow-through station for primer-extension reaction and staining with reagents and buffers provided.

Polymer-coated BeadChips were scanned in an iScan scanner (Illumina) using Inf Methylation mode.

Methylated and unmethylated signal intensity and detection p-values were extracted after background correction and dye-bias equalization by normal-exponential convolution ('noob') as implemented in the minfi package.

Summary beta values are provided for each locus, with annotations for Illumina probe name, gene symbol, chromosome and CpG position (UCSC hg18).

Data were normalized using functional normalization (funnorm) as implemented in the minfi package and summarized as beta values [M /(M+U)] with annotation at each locus for Illumina probe name, gene symbol, chromosome and CpG position (UCSC hg19).  Probes having an annotated SNV within the CpG or SBE site are masked as NA across all samples.  Probes where the non-detection probability was > 0.01 are masked as NA for individual samples.

Infinium HumanMethylation450 Bead Chip (Illumina) for Clear Cell Sarcoma of the Kidney (CCSK)

*Protocol performed at Ann & Robert H. Lurie Children's Hospital.

DNA was extracted from tumor samples at Nationwide Children's BioPathology Center (BPC) by using the standard BPC protocol. The DNA samples were analyzed by Pico green to verify gDNA concentration, spectrophotometry to verify DNA purity, and gel electrophoresis to verify DNA quality.  DNA samples (1.5 ug) diluted in nuclease-free water were provided to the Northwestern University Genomics Core in 96-well plate format for Illumina 450K DNA methylation analysis. Randomly selected samples were tested by Northwestern University to verify that the correct concentration had been provided.                

Nucleic acid hybridization and labeling were performed according to the manufacturer's protocol for the Illumina 450K array at the Northwestern University Genomics Core Facility. Nucleic acid labeling is completed after the hybridization step with Illumina Infinium 450K arrays.   

The array scanning protocol was performed according to the manufacturer's protocol for the Illumina 450K array at the Northwestern University Genomics Core Facility.                 

Raw data files (1 red and 1 green .idat file per sample and 1 .sdf file from each array, which included 12 samples per array) were processed at the Northwestern University Genomics Core Facility by BeadStudio software. The following subtables were generated by BeadStudio and were downloaded in .txt format (Level 2 data): (1) the Sample Methylation Profile .txt, (2) the Control Profile .txt, and (3) the Control Probe Profile .txt. Several quality control steps were used for these data. Samples were subjected to the internal quality controls and to a color balance check, both using the Bioconductor lumi package. Gender analysis was performed to increase our confidence that the data correctly corresponded to the expected sample. Because one of the X chromosomes is heavily methylated in females, the density of X-chromosome methylation is a good indicator of gender. Unsupervised hierarchical clustering based on all methylated regions on the X-chromosome was performed using average-linkage clustering with CLUSTER and the results were displayed with TREEVIEW. The tumors were assigned a gender based on their clustering in the tumor dendrogram, which was checked against the known gender of the sample.

The Sample Methylation Profile text, which included information for all of the samples, was broken down at the DCC into a single .txt file (level 2 data) per sample containing the following columns: (1) the sample ID, (2) probe name, (3) AVG_Beta value, (4) gene Symbol, (5) chromosome, and (6) position.

Infinium HumanMethylation450 Bead Chip (Illumina) for Neuroblastoma (NBL)

*Protocols performed at the USC Epigenome Center.

Labeling, hybridization and scanning protocols were performed following the manufacturer’s protocol using the Infinium Human Methylation450K Beadchip Kit (Illumina, San Diego, CA, #WG-314-1001).

Level 2 data contain background-corrected methylated (M) and unmethylated (U) summary intensities as extracted by the methylumi package. Non-detection probabilities (P-values) were computed as the minimum of the two values (one per allele) for the empirical cumulative density function of the negative control probes in the appropriate color channel. Background correction is performed via normal-exponential deconvolution using out-of-band probes (Triche, Jr. et al., Nucl. Acids Res. 2013). Multiple-batch archives have the intensities in each of the two channels multiplicatively scaled to match a reference sample (the sample whose R/G ratio of normalization control probes is closest to 1.0).
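The reference-sample scaling can be illustrated with a short sketch. It assumes each sample's normalization control probes have already been summarized to one red and one green value; that summary, the function names, and the sample IDs are hypothetical and are not taken from methylumi.

```python
from typing import Dict, Tuple

# sample ID -> (red, green) summary intensity of normalization control probes
Controls = Dict[str, Tuple[float, float]]


def pick_reference(controls: Controls) -> str:
    """Choose the sample whose red/green control ratio is closest to 1.0."""
    return min(controls, key=lambda s: abs(controls[s][0] / controls[s][1] - 1.0))


def scale_factors(controls: Controls) -> Dict[str, Tuple[float, float]]:
    """Per-sample multiplicative factors that bring each channel to the reference."""
    ref_red, ref_green = controls[pick_reference(controls)]
    return {s: (ref_red / r, ref_green / g) for s, (r, g) in controls.items()}


controls = {"A": (1000.0, 800.0), "B": (950.0, 1000.0), "C": (500.0, 1000.0)}
print(pick_reference(controls))
print(scale_factors(controls))
```

Each sample's red and green probe intensities would then be multiplied by its pair of factors before beta values are derived.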

Level 3 data contain derived summary measures (beta values: M/(M+U) for each interrogated locus) with annotations (based on Illumina's manifest on GEO, GPL13534) for gene symbol, chromosome (UCSC hg19, Feb 2009), and CpG/CpH coordinate (UCSC hg19, Feb 2009). Probes having a common SNP (common SNP is a SNP with Minor Allele Frequency > 1% as defined by the UCSC snp135common track) within 10bp of the interrogated CpG site or having 15bp from the interrogated CpG site overlap with a REPEAT element (as defined by RepeatMasker and Tandem Repeat Finder Masks based on UCSC hg19, Feb 2009) are masked as NA across all samples, and probes with a non-detection probability (P-value) greater than 0.05 in a given sample are masked as NA on that chip. Probes that are mapped to multiple sites on hg19 are annotated as NA for chromosome and 0 for CpG/CpH coordinate.

Infinium HumanMethylation450 Bead Chip (Illumina) for Osteosarcoma (OS)

*Protocols performed at the Phoenix Children’s Hospital and Baylor College of Medicine.

Bisulfite conversion of genomic DNA was performed with EZ-96 DNA methylation Kit (Zymo Research, Irvine, CA, #D5002) following the manufacturer’s protocol with modifications for the Infinium Methylation Assay. Briefly, one microgram of genomic DNA was mixed with 5 µl of Dilution Buffer and incubated at 37°C for 15 minutes and then mixed with 100 µl of conversion reagent prepared as instructed in the protocol. Mixtures were incubated in a thermocycler for 16 cycles at 95°C for 30 seconds and 50°C for 60 minutes. Bisulfite-converted DNA samples were loaded onto the provided 96-column plates for desulphonation, washing and elution. Bisulfite-converted genomic DNA was analyzed using the Infinium Human Methylation450K Beadchip Kit (Illumina, San Diego, CA, #WG-314-1001). DNA amplification, fragmentation, array hybridization, extension and staining were performed with reagents provided in the kit according to the manufacturer’s protocol (Illumina Infinium II Methylation Assay, #WG-901-2701). Briefly, 4 µl of bisulfite-converted genomic DNA was added to 0.8 ml 96-well storage plate (Thermo Fisher Scientific), denatured in 0.014N sodium hydroxide, neutralized and then amplified for 20-24 hours at 37°C. Samples were fragmented at 37°C for 60 minutes and precipitated in isopropanol. Re-suspended samples were denatured in a 96-well plate heat block at 95°C for 20 minutes. 15 µl of each sample was loaded onto a 12-sample BeadChip, assembled in the hybridization chamber as instructed by the manufacturer and incubated at 48°C for 16-20 hours.  Following hybridization, the BeadChips were washed and assembled in a fluid flow-through station for primer-extension reaction and staining with reagents and buffers provided.

Polymer-coated BeadChips were scanned using Illumina iScan technology, which outputs data in the format of IDAT files. These are then used to retrieve the probe intensities and calculate the beta-values. Raw unmethylated and methylated intensities were background corrected using out-of-band correction.

Probe intensities were then color corrected using Lumi's dye bias correction algorithm.  Beta-values were calculated from probe intensities and corrected for probe bias using the beta mixture quantile dilation (BMIQ) normalization method.

Infinium HumanMethylation450 Bead Chip (Illumina) for Wilms Tumor (WT)

*Protocol performed at Ann & Robert H. Lurie Children's Hospital.

DNA was extracted from tumor samples at Nationwide Children's BioPathology Center (BPC) by using the standard BPC protocol. The DNA samples were analyzed by Pico green to verify gDNA concentration, spectrophotometry to verify DNA purity, and gel electrophoresis to verify DNA quality.  DNA samples (1.5 ug) diluted in nuclease-free water were provided to the Northwestern University Genomics Core in 96-well plate format for Illumina 450K DNA methylation analysis. Randomly selected samples were tested by Northwestern University Genomics Core to verify that the correct concentration had been provided.                

Nucleic acid hybridization and labeling were performed according to the manufacturer's protocol for the Illumina 450K array at the Northwestern University Genomics Core Facility. Nucleic acid labeling is completed after the hybridization step with Illumina Infinium 450K arrays.   

The array scanning protocol was performed according to the manufacturer's protocol for the Illumina 450K array at the Northwestern University Genomics Core Facility.                 

Raw data files (1 red and 1 green .idat file per sample and 1 .sdf file from each array, which included 12 samples per array) were processed at the Northwestern University Genomics Core Facility by BeadStudio software. The following subtables were generated by BeadStudio and were downloaded in .txt format (Level 2 data): (1) the Sample Methylation Profile .txt, (2) the Control Profile .txt, and (3) the Control Probe Profile .txt. Several quality control steps were used for these data. Samples were subjected to the internal quality controls and to a color balance check, both using the Bioconductor lumi package. Gender analysis was performed to increase our confidence that the data correctly corresponded to the expected sample. Because one of the X chromosomes is heavily methylated in females, the density of X-chromosome methylation is a good indicator of gender. Unsupervised hierarchical clustering based on all methylated regions on the X-chromosome was performed using average-linkage clustering with CLUSTER and the results were displayed with TREEVIEW. The tumors were assigned a gender based on their clustering in the tumor dendrogram, which was checked against the known gender of the sample. The X-chromosome methylation profile corresponded to gender in the majority (~95%) of the samples; the other samples did not show gender-specific patterns.

The Sample Methylation Profile text, which included information for all of the samples, was broken down at the DCC into a single .txt file (level 2 data) per sample containing the following columns: (1) the sample ID, (2) probe name, (3) AVG_Beta value, (4) gene Symbol, (5) chromosome, and (6) position.

Infinium MethylationEPIC BeadChip Kit (Illumina) for Acute Leukemia of Ambiguous Lineage (ALAL)

*Protocol performed at St. Jude Children's Research Hospital.

Raw data from the Infinium MethylationEPIC BeadChip Kit (Illumina Inc.) were analyzed using the ChAMP R package [1].

In general, the raw *.idat files were imported through the “minfi” method [2], and the following filters were then applied to exclude probes: 1) with a detection P-value above 0.01 in one or more samples; 2) with beadcount <3 in at least 5% of samples; 3) that are non-CpG probes; 4) identified as SNPs [3]; 5) aligned to multiple locations [4]; and 6) on the X or Y chromosome. After filtering, “BMIQ” normalization from the ChAMP package was used, as the author suggested, to calculate methylation beta values. Batch effect was observed by the singular value decomposition method [5] and adjusted by the ComBat normalization method [6].
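The six probe-exclusion filters can be expressed as a single predicate. The sketch below is illustrative only and does not reproduce ChAMP's code: the `Probe` record, its field names, and the probe identifiers are hypothetical, and the SNP and multi-location probe sets are assumed to be supplied from published annotation lists.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Probe:
    name: str
    chrom: str
    detection_p: List[float]  # one detection P-value per sample
    bead_count: List[int]     # one bead count per sample
    is_cpg: bool = True


def keep_probe(
    probe: Probe,
    snp_probes: Set[str],
    multi_hit_probes: Set[str],
    max_p: float = 0.01,
    min_beads: int = 3,
    max_low_bead_frac: float = 0.05,
) -> bool:
    """Return True only if the probe survives all six filters."""
    if any(p > max_p for p in probe.detection_p):
        return False  # 1) detection P-value above 0.01 in one or more samples
    low = sum(1 for b in probe.bead_count if b < min_beads)
    if low / len(probe.bead_count) >= max_low_bead_frac:
        return False  # 2) beadcount < 3 in at least 5% of samples
    if not probe.is_cpg:
        return False  # 3) non-CpG probe
    if probe.name in snp_probes or probe.name in multi_hit_probes:
        return False  # 4) SNP probe, 5) aligned to multiple locations
    if probe.chrom in {"X", "Y"}:
        return False  # 6) sex-chromosome probe
    return True


probes = [
    Probe("cg_a", "1", [0.0, 0.001], [10, 12]),
    Probe("cg_b", "1", [0.0, 0.02], [10, 12]),
    Probe("cg_c", "2", [0.0, 0.0], [2, 12]),
    Probe("ch_d", "3", [0.0, 0.0], [10, 12], is_cpg=False),
    Probe("cg_e", "X", [0.0, 0.0], [10, 12]),
    Probe("cg_f", "4", [0.0, 0.0], [10, 12]),
]
kept = [p.name for p in probes if keep_probe(p, snp_probes={"cg_f"}, multi_hit_probes=set())]
print(kept)
```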

References

  1. Morris TJ, et al. (2014). ChAMP: 450k Chip Analysis Methylation Pipeline. Bioinformatics. 30(3):428-30. (PMID: 24336642)
  2. Aryee MJ, et al. (2014). Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics. 30(10):1363-9. (PMID: 24478339)
  3. Zhou W, et al. (2017). Comprehensive characterization, annotation and innovative use of Infinium DNA methylation BeadChip probes. Nucleic Acids Res. 45(4):e22. (PMID: 27924034)
  4. Nordlund J, et al. (2013). Genome-wide signatures of differential DNA methylation in pediatric acute lymphoblastic leukemia. Genome Biol. 14(9):r105. (PMID: 24063430)
  5. Teschendorff AE, et al. (2009). An epigenetic signature in peripheral blood predicts active ovarian cancer. PloS One. 4(12):e8274. (PMID: 20019873)
  6. Johnson WE, et al. (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 8(1):118-27. (PMID: 16632515)

miRNA Profiling

Platform/Sequencing Center Data Generation Protocols Data Analysis Protocols
Megaplex™ Primer Pools and TaqMan® MicroRNA Array OS
miRNA-seq (British Columbia Cancer Agency) ALL P2, AML/AML-IF, RT, WT ALL P2, AML/AML-IF, RT, WT

Megaplex™ Primer Pools and TaqMan® MicroRNA Array

*Protocols performed at Baylor College of Medicine.

RNA labeling was done according to protocol information provided by Applied Biosystems (miRNAPcr-Labeling-TaqMan:01). The manufacturer's TaqMan® protocol was followed using an ABI 7900. Raw cycle counts, thresholded to 40, were calculated by SDS software.

MicroRNA Sequencing (miRNA-seq)

*Protocols performed at British Columbia Cancer Agency.

MicroRNA-seq library construction

Small RNAs, containing microRNA (miRNA), in the flow-through material following mRNA purification on a MultiMACS separator (Miltenyi Biotec, Germany) are recovered by ethanol precipitation. MiRNA-seq libraries are constructed using a 96-well plate-based protocol developed at the BC Cancer Agency, Genome Sciences Centre. Briefly, an adenylated single-stranded DNA 3’ adapter is selectively ligated to miRNAs using a truncated T4 RNA ligase 2 (NEB Canada, cat. M0242L). An RNA 5’ adapter is then added, using a T4 RNA ligase (Ambion USA, cat. AM2141) and ATP. Next, first strand cDNA is synthesized using Superscript II Reverse Transcriptase (Invitrogen, cat.18064 014), and serves as the template for PCR. Index sequences (6 nucleotides) are introduced at this PCR step to enable multiplexed pooling of miRNA libraries. PCR products are pooled, then size-selected on an in-house developed 96-channel robot to enrich the miRNA-containing fraction and remove adapter contaminants. Each size-selected indexed pool is ethanol precipitated and quality checked on an Agilent Bioanalyzer DNA 1000 chip and quantified using a Qubit fluorometer (Invitrogen, cat. Q32854). Each pool is then diluted to a target concentration for cluster generation and loaded into a single lane of a HiSeq 2000 flow cell for sequencing with a 31-bp main read (for the insert) and a 7-bp read for the index.

miRNA/hg19 alignment:

Illumina miRNA sequencing reads were aligned to the hg19 reference using BWA version 0.5.7. This reference contains chromosomes 1-22, X, Y, MT, 20 unlocalized scaffolds and 39 unplaced scaffolds. Duplicated reads were marked with Picard Tools.

miRNA preprocessing, alignment and annotation

Briefly, the sequence data are separated into individual samples based on the index read sequences, and the reads undergo an initial QC assessment. Adapter sequence is then trimmed off, and the trimmed reads for each sample are aligned to the NCBI GRCh37-lite reference genome.

Routine QC assesses a subset of raw sequences from each pooled lane for the abundance of reads from each indexed sample in the pool, the proportion of reads that possibly originate from adapter dimers (i.e. a 5’ adapter joined to a 3’ adapter with no intervening biological sequence) and for the proportion of reads that map to human miRNAs. Sequencing error is estimated by a method originally developed for SAGE (Khattra et al., 2007).

Libraries that pass this QC stage are preprocessed for alignment. While the size-selected miRNAs vary somewhat in length, typically they are ~21 bp long, and so are shorter than the 31-bp read length. Given this, each read sequence extends some distance into the 3' sequencing adapter. Because this non-biological sequence can interfere with aligning the read to the reference genome, 3’ adapter sequence is identified and removed (trimmed) from a read. The adapter-trimming algorithm identifies as long an adapter sequence as possible, allowing a number of mismatches that depends on the adapter length found. A typical sequencing run yields several million reads; using only the first (5’) 15 bases of the 3’ adapter in trimming makes processing efficient, while minimizing the chance that an miRNA read will match the adapter sequence.
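The trimming idea can be sketched as a left-to-right scan that accepts the first position where the rest of the read matches the start of the adapter, which is also the longest adapter match. This is an illustration under stated assumptions rather than the production algorithm: the adapter string is a placeholder and not the protocol's sequence, and both the mismatch rule (`overlap // 8`) and the minimum overlap of 3 bases are invented for the example.

```python
# Placeholder 15-base adapter prefix; NOT the adapter used in the protocol.
ADAPTER_PREFIX = "TCGTATGCCGTCTTC"


def allowed_mismatches(overlap: int) -> int:
    """Assumed rule: longer adapter matches tolerate more mismatches."""
    return overlap // 8


def trim_adapter(read: str, adapter: str = ADAPTER_PREFIX, min_overlap: int = 3) -> str:
    """Remove the 3' adapter, preferring the longest adapter match."""
    for start in range(len(read) - min_overlap + 1):
        # Compare the rest of the read with at most len(adapter) adapter bases.
        overlap = min(len(adapter), len(read) - start)
        window = read[start:start + overlap]
        mismatches = sum(1 for a, b in zip(window, adapter) if a != b)
        if mismatches <= allowed_mismatches(overlap):
            return read[:start]  # earliest start = longest adapter found
    return read  # no adapter detected


insert = "TTCAAGTAATCCAGGATAGGCT"           # example 22-nt insert
read = insert + ADAPTER_PREFIX[:9]          # 31-bp read running into the adapter
print(trim_adapter(read))
```

Because only a short adapter prefix is compared at each position, the work per read stays small even when a run yields millions of reads.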

After each read has been processed, a summary report is generated containing the number of reads at each read length. Any trimmed read that is shorter than 15bp is discarded; remaining reads are submitted for alignment to the reference genome. BWA (Li and Durbin, 2009) alignment(s) for each read are checked with a series of three filters. A read with more than 3 alignments is discarded as too ambiguous. Only perfect alignments with no mismatches are used. Reads that fail the Illumina basecalling chastity filter are retained, while reads that have soft-clipped CIGAR strings are discarded.
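A minimal sketch of the length and alignment filters follows. The `Alignment` record and its field names are hypothetical, and applying the soft-clip and mismatch checks per alignment (rather than per read) is an assumption. The chastity flag is deliberately not consulted, since reads failing that filter are retained.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alignment:
    cigar: str        # e.g. "22M", or "2S20M" when soft-clipped
    mismatches: int   # number of mismatches against the reference


def passes_length(trimmed_read: str, min_len: int = 15) -> bool:
    """Trimmed reads shorter than 15 bp are discarded before alignment."""
    return len(trimmed_read) >= min_len


def usable_alignments(alignments: List[Alignment], max_alignments: int = 3) -> List[Alignment]:
    """Apply the ambiguity, perfect-match, and soft-clip filters to one read."""
    if len(alignments) > max_alignments:
        return []  # more than 3 alignments: too ambiguous
    return [
        a for a in alignments
        if a.mismatches == 0 and "S" not in a.cigar  # perfect, not soft-clipped
    ]


hits = [Alignment("22M", 0), Alignment("2S20M", 0), Alignment("22M", 1)]
print(usable_alignments(hits))
```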

For reads retained after filtering, each coordinate for each read alignment is annotated using reference databases, requiring a minimum 3-bp overlap between the alignment and an annotation. If a read has more than one alignment location, and the annotations for these are different, we use a priority list to assign a single annotation to the read, as long as only one alignment is to a miRNA. When there are multiple alignments to different miRNAs, the read is flagged as cross-mapped (de Hoon et al., 2010), and all of its miRNA annotations are preserved, while all of its non-miRNA annotations are discarded. This ensures that all annotation information about ambiguously mapped miRNAs is retained, and allows annotation ambiguity to be addressed in downstream analyses. Note that we consider miRNAs to be cross-mapped only if they map to different miRNAs, not to functionally identical miRNAs that are expressed from different locations in the genome. Such cases are indicated by miRNA miRBase names, which can have up to 4 separate sections separated by "-", e.g. hsa-mir-26a-1. A difference in the final (e.g. ‘-1’) section denotes functionally equivalent miRNAs expressed from different regions of the genome, and we consider only the first 3 sections (e.g. ‘hsa-mir-26a’) when comparing names. As long as a read maps to multiple miRNAs for which the first 3 sections of the name are identical (e.g. hsa-mir-26a-1 and hsa-mir-26a-2), it is treated as if it maps to only one miRNA, and is not flagged as cross-mapped.
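The name comparison used for cross-mapping can be shown in a few lines. The function names are hypothetical; the rule itself (compare only the first three "-"-separated sections of the miRBase name) is the one described above.

```python
from typing import Iterable


def mirna_key(name: str) -> str:
    """Keep the first three '-'-separated sections of a miRBase name."""
    return "-".join(name.split("-")[:3])


def is_cross_mapped(mirna_names: Iterable[str]) -> bool:
    """True when a read's miRNA alignments differ in their first three sections."""
    return len({mirna_key(n) for n in mirna_names}) > 1


print(is_cross_mapped(["hsa-mir-26a-1", "hsa-mir-26a-2"]))  # same miRNA, two loci
print(is_cross_mapped(["hsa-mir-26a-1", "hsa-mir-26b"]))    # different miRNAs
```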

The minimum depth of sequencing required to detect the miRNAs that are expressed in one sample is 1,000,000 reads per library mapped to miRBase annotations.

Finally, for each sample, the reads that correspond to particular miRNAs are summed and normalized to a million miRNA-aligned reads to generate the quantification files. 
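As a worked example of this normalization, per-miRNA read counts are divided by the total number of miRNA-aligned reads and multiplied by one million. The function name and the example counts are hypothetical.

```python
from typing import Dict


def reads_per_million(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale summed per-miRNA read counts to one million miRNA-aligned reads."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: n * 1_000_000 / total for name, n in counts.items()}


print(reads_per_million({"hsa-mir-21": 750, "hsa-mir-26a": 250}))
```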

MicroRNA Sequencing Analysis

*Protocols performed at British Columbia Cancer Agency.

miRNA NMF methods

We identified groups of samples with similar abundance profiles using unsupervised non-negative matrix factorization (NMF) consensus clustering of reads-per-million (RPM) data for the 25% most-variant 5p or 3p miRBase v20 mature strands. We generated a heatmap for the discriminatory miRNAs that had the highest scores in each of the four NMF metagenes (Gaujoux and Seoighe 2010) as follows. We reordered columns (samples) in a RPM-normalized abundance matrix to match the NMF result. We log2-transformed and median-centered the rows (miRs), and then hierarchically clustered the rows using an absolute centered correlation distance metric and average linkage (de Hoon 2004, Saldanha 2004). 5p and 3p mature strand names were assigned using miRBase v20. We generated covariate association P-values with R’s Fisher exact test.
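Two of the matrix-preparation steps, selecting the most-variant mature strands and log2-transforming with row median-centering, can be sketched as below. This does not cover the NMF or hierarchical clustering themselves. The pseudocount of 1, ranking by population variance of RPM values, and the rounding used to take 25% of the rows are assumptions, and the miRNA names are placeholders.

```python
import math
from statistics import median, pvariance
from typing import Dict, List


def most_variant(rpm: Dict[str, List[float]], fraction: float = 0.25) -> List[str]:
    """Names of the top `fraction` of rows ranked by variance across samples."""
    n_keep = max(1, int(len(rpm) * fraction))
    ranked = sorted(rpm, key=lambda name: pvariance(rpm[name]), reverse=True)
    return ranked[:n_keep]


def log2_median_center(row: List[float], pseudocount: float = 1.0) -> List[float]:
    """log2-transform one miRNA row, then subtract the row median."""
    logged = [math.log2(v + pseudocount) for v in row]
    mid = median(logged)
    return [x - mid for x in logged]


rpm = {
    "mir_a": [0.0, 1.0, 3.0],
    "mir_b": [5.0, 5.0, 5.0],
    "mir_c": [1.0, 1.0, 2.0],
    "mir_d": [0.0, 100.0, 200.0],
}
print(most_variant(rpm))
print(log2_median_center(rpm["mir_a"]))
```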

miRNA-Seq-Differential expression

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify miRs that were differentially expressed. Each run generated a pair of files: miRs ‘up’ and ‘down’. We filtered each file by removing miRs with median expression less than 50 RPKM in both of the input sample groups, and miRs for which the Wilcoxon BH adjusted P-value was greater than 0.05; then ranked the filtered results by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction. 
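The post-SAMseq filtering and ranking can be sketched as follows, for one of the two result files. It assumes the adjusted P-values have already been computed and does not reimplement SAMseq or the Wilcoxon test. The record keys, the pseudocount of 1 in the fold change, and ranking by absolute log2 fold change are assumptions made for the example.

```python
import math
from statistics import median
from typing import Dict, List


def filter_and_rank(
    results: List[Dict],
    min_median: float = 50.0,
    max_adj_p: float = 0.05,
    top_n: int = 10,
) -> List[str]:
    """Drop low-expression and non-significant miRs, then rank by fold change."""
    ranked = []
    for row in results:
        med_a = median(row["group_a"])
        med_b = median(row["group_b"])
        if med_a < min_median and med_b < min_median:
            continue  # median expression below threshold in both groups
        if row["adj_p"] > max_adj_p:
            continue  # BH-adjusted P-value too large
        # Median-based fold change; the +1 pseudocount is an assumption.
        log_fc = math.log2((med_a + 1.0) / (med_b + 1.0))
        ranked.append((abs(log_fc), row["mirna"]))
    ranked.sort(reverse=True)
    return [name for _, name in ranked[:top_n]]


results = [
    {"mirna": "mir_x", "group_a": [100, 200, 300], "group_b": [10, 20, 30], "adj_p": 0.01},
    {"mirna": "mir_y", "group_a": [10, 20, 30], "group_b": [5, 6, 7], "adj_p": 0.001},
    {"mirna": "mir_z", "group_a": [100, 100, 100], "group_b": [60, 60, 60], "adj_p": 0.2},
    {"mirna": "mir_w", "group_a": [120, 130, 140], "group_b": [60, 65, 70], "adj_p": 0.03},
]
print(filter_and_rank(results))
```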

MicroRNA Sequencing (miRNA-seq)

*Protocols performed at British Columbia Cancer Agency.

MicroRNA-seq library construction

Small RNAs, containing microRNA (miRNA), in the flow-through material following mRNA purification on a MultiMACS separator (Miltenyi Biotec, Germany) are recovered by ethanol precipitation. MiRNA-seq libraries are constructed using a 96-well plate-based protocol developed at the BC Cancer Agency, Genome Sciences Centre. Briefly, an adenylated single-stranded DNA 3’ adapter is selectively ligated to miRNAs using a truncated T4 RNA ligase 2 (NEB Canada, cat. M0242L). An RNA 5’ adapter is then added, using a T4 RNA ligase (Ambion USA, cat. AM2141) and ATP. Next, first strand cDNA is synthesized using Superscript II Reverse Transcriptase (Invitrogen, cat.18064 014), and serves as the template for PCR. Index sequences (6 nucleotides) are introduced at this PCR step to enable multiplexed pooling of miRNA libraries. PCR products are pooled, then size-selected on an in-house developed 96-channel robot to enrich the miRNA-containing fraction and remove adapter contaminants. Each size-selected indexed pool is ethanol precipitated and quality checked on an Agilent Bioanalyzer DNA 1000 chip and quantified using a Qubit fluorometer (Invitrogen, cat. Q32854). Each pool is then diluted to a target concentration for cluster generation and loaded into a single lane of a HiSeq 2000 flow cell for sequencing with a 31-bp main read (for the insert) and a 7-bp read for the index.

miRNA/hg19 alignment:

Illumina miRNA sequencing reads were aligned to the hg19 reference using BWA version 0.5.7. This reference contains chromosomes 1-22, X, Y, MT, 20 unlocalized scaffolds and 39 unplaced scaffolds. Duplicated reads were marked with Picard Tools.

 

miRNA preprocessing, alignment and annotation

Briefly, the sequence data are separated into individual samples based on the index read sequences, and the reads undergo an initial QC assessment. Adapter sequence is then trimmed off, and the trimmed reads for each sample are aligned to the NCBI GRCh37-lite reference genome.

Routine QC assesses a subset of raw sequences from each pooled lane for the abundance of reads from each indexed sample in the pool, the proportion of reads that possibly originate from adapter dimers (i.e. a 5’ adapter joined to a 3’ adapter with no intervening biological sequence), and the proportion of reads that map to human miRNAs. Sequencing error is estimated by a method originally developed for SAGE (Khattra et al., 2007).

Libraries that pass this QC stage are preprocessed for alignment. While the size-selected miRNAs vary somewhat in length, typically they are ~21 bp long, and so are shorter than the 31-bp read length. Given this, each read sequence extends some distance into the 3' sequencing adapter. Because this non-biological sequence can interfere with aligning the read to the reference genome, 3’ adapter sequence is identified and removed (trimmed) from a read. The adapter-trimming algorithm identifies as long an adapter sequence as possible, allowing a number of mismatches that depends on the adapter length found. A typical sequencing run yields several million reads; using only the first (5’) 15 bases of the 3’ adapter in trimming makes processing efficient, while minimizing the chance that an miRNA read will match the adapter sequence.
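The trimming step described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the adapter sequence is a placeholder (the adapter actually used is not stated here), and the mismatch tolerance (one mismatch per 8 matched bases) and the minimum overlap of 3 bases are assumptions chosen only to show a length-dependent tolerance.

```python
# Placeholder 3' adapter; the adapter actually used is not given in the text.
ADAPTER_3P = "TCGTATGCCGTCTTCTGCTTGT"
ADAPTER_PREFIX_LEN = 15  # only the first (5') 15 adapter bases are used


def allowed_mismatches(overlap_len: int) -> int:
    """Assumed tolerance: one mismatch per 8 matched adapter bases."""
    return overlap_len // 8


def trim_3p_adapter(read: str, adapter: str = ADAPTER_3P, min_overlap: int = 3) -> str:
    """Return the read with the longest acceptable 3' adapter match removed.

    min_overlap is an assumption that avoids trimming on 1-2 base matches.
    """
    prefix = adapter[:ADAPTER_PREFIX_LEN]
    # Scanning start positions left to right finds the longest adapter tail first.
    for start in range(len(read) - min_overlap + 1):
        overlap = min(len(prefix), len(read) - start)
        window = read[start:start + overlap]
        mismatches = sum(1 for a, b in zip(window, prefix) if a != b)
        if mismatches <= allowed_mismatches(overlap):
            return read[:start]
    return read  # no adapter found; the read is returned unchanged


# A 22-base insert followed by 9 adapter bases fills a 31-base read.
insert = "TGAGGTAGTAGGTTGTATAGTT"
trimmed = trim_3p_adapter(insert + ADAPTER_3P[:9])
```

Because the scan stops at the leftmost acceptable position, the removed tail is the longest adapter match that satisfies the tolerance.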

After each read has been processed, a summary report is generated containing the number of reads at each read length. Any trimmed read that is shorter than 15 bp is discarded; remaining reads are submitted for alignment to the reference genome. BWA (Li and Durbin, 2009) alignment(s) for each read are checked with a series of three filters. A read with more than 3 alignments is discarded as too ambiguous. Only perfect alignments with no mismatches are used. Reads that fail the Illumina basecalling chastity filter are retained, while reads that have soft-clipped CIGAR strings are discarded.
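The length cutoff and the three alignment filters can be written as a short sketch. The `Alignment` record and its field names are hypothetical, and reading "only perfect alignments are used" as a per-alignment filter is one interpretation of the text; real data would come from the aligner's output.

```python
from dataclasses import dataclass
from typing import List

MIN_TRIMMED_LENGTH = 15  # trimmed reads shorter than this are discarded
MAX_ALIGNMENTS = 3       # reads with more alignments are too ambiguous


@dataclass
class Alignment:
    """Hypothetical record for one alignment of a read."""
    cigar: str        # e.g. "22M" or "3S19M"
    mismatches: int   # number of mismatches against the reference


def long_enough(trimmed_read: str) -> bool:
    """Length check applied before alignment."""
    return len(trimmed_read) >= MIN_TRIMMED_LENGTH


def filter_alignments(alignments: List[Alignment]) -> List[Alignment]:
    """Return the alignments kept for one read (empty if the read is discarded)."""
    if len(alignments) > MAX_ALIGNMENTS:
        return []  # too ambiguous
    if any("S" in aln.cigar for aln in alignments):
        return []  # soft-clipped reads are discarded
    # Keep only perfect alignments (no mismatches).
    return [aln for aln in alignments if aln.mismatches == 0]
```

The chastity filter does not appear in the sketch because, per the text, reads failing it are retained.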

For reads retained after filtering, each coordinate for each read alignment is annotated using reference databases, requiring a minimum 3-bp overlap between the alignment and an annotation. If a read has more than one alignment location, and the annotations for these are different, we use a priority list to assign a single annotation to the read, as long as only one alignment is to a miRNA. When there are multiple alignments to different miRNAs, the read is flagged as cross-mapped (de Hoon et al., 2010), and all of its miRNA annotations are preserved, while all of its non-miRNA annotations are discarded. This ensures that all annotation information about ambiguously mapped miRNAs is retained, and allows annotation ambiguity to be addressed in downstream analyses. Note that we consider miRNAs to be cross-mapped only if they map to different miRNAs, not to functionally identical miRNAs that are expressed from different locations in the genome. Such cases are indicated by miRBase miRNA names, which can have up to 4 sections separated by "-", e.g. hsa-mir-26a-1. A difference in the final (e.g. ‘-1’) section denotes functionally equivalent miRNAs expressed from different regions of the genome, and we consider only the first 3 sections (e.g. ‘hsa-mir-26a’) when comparing names. As long as a read maps to multiple miRNAs for which the first 3 sections of the name are identical (e.g. hsa-mir-26a-1 and hsa-mir-26a-2), it is treated as if it maps to only one miRNA, and is not flagged as cross-mapped.
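The name comparison used to decide whether a read is cross-mapped can be illustrated with a small sketch. The function names are hypothetical; only the rule (compare the first 3 "-"-separated sections of the miRBase name) comes from the text.

```python
def mirna_key(name: str) -> str:
    """Keep the first three '-'-separated sections, e.g. 'hsa-mir-26a'."""
    return "-".join(name.split("-")[:3])


def is_cross_mapped(mirna_names) -> bool:
    """True when a read's miRNA alignments hit more than one distinct miRNA."""
    return len({mirna_key(name) for name in mirna_names}) > 1


same_mirna = is_cross_mapped(["hsa-mir-26a-1", "hsa-mir-26a-2"])  # False
different = is_cross_mapped(["hsa-mir-26a-1", "hsa-mir-26b"])     # True
```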

The minimum depth of sequencing required to detect the miRNAs that are expressed in one sample is 1,000,000 reads per library mapped to miRBase annotations.

Finally, for each sample, the reads that correspond to particular miRNAs are summed and normalized to a million miRNA-aligned reads to generate the quantification files. 
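As a worked illustration of the normalization, the sketch below scales per-miRNA read counts to reads per million miRNA-aligned reads. The function name and the example counts are made up for illustration.

```python
def reads_per_million(counts: dict) -> dict:
    """Scale per-miRNA read counts to a total of one million miRNA-aligned reads."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: count * 1_000_000 / total for name, count in counts.items()}


# Made-up counts for a single sample.
example_counts = {"hsa-mir-21": 1_500_000, "hsa-mir-26a": 500_000}
example_rpm = reads_per_million(example_counts)
```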

MicroRNA Sequencing Analysis

*Protocols performed at British Columbia Cancer Agency

miRNA NMF methods

We identified groups of samples with similar abundance profiles using unsupervised non-negative matrix factorization (NMF) consensus clustering of reads-per-million (RPM) data for the 25% most-variant 5p or 3p miRBase v20 mature strands. We generated a heatmap for the discriminatory miRNAs that had the highest scores in each of the four NMF metagenes (Gaujoux and Seoighe 2010) as follows. We reordered columns (samples) in an RPM-normalized abundance matrix to match the NMF result. We log2-transformed and median-centered the rows (miRs), and then hierarchically clustered the rows using an absolute centered correlation distance metric and average linkage (de Hoon 2004, Saldanha 2004). 5p and 3p mature strand names were assigned using miRBase v20. We generated covariate association P-values with Fisher’s exact test in R.
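The row transformation and the distance metric used for the heatmap can be sketched in plain Python. This is an illustration only: the pseudocount of 1 added before the log2 transform is an assumption (the text does not say how zeros were handled), and the clustering itself, which was done with the cited tools, is not reproduced.

```python
import math
from statistics import median


def log2_median_center(row, pseudocount=1.0):
    """Log2-transform one miRNA's RPM values and subtract the row median.

    The pseudocount is an assumption; the text does not state how zeros are handled.
    """
    logged = [math.log2(value + pseudocount) for value in row]
    center = median(logged)
    return [value - center for value in logged]


def abs_centered_correlation_distance(a, b):
    """Distance 1 - |r|, where r is the centered (Pearson) correlation."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 1.0  # assumption: treat constant rows as uncorrelated
    return 1.0 - abs(cov / math.sqrt(var_a * var_b))


centered = log2_median_center([1.0, 3.0, 7.0])
```

With this metric, strongly anti-correlated rows are as close as strongly correlated ones, so they can end up in the same cluster.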

 

miRNA-Seq-Differential expression

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify miRs that were differentially expressed. Each run generated a pair of files: miRs ‘up’ and ‘down’. We filtered each file by removing miRs with median expression less than 50 RPKM in both of the input sample groups, and miRs for which the Wilcoxon BH adjusted P-value was greater than 0.05; then ranked the filtered results by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction. 
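The post-SAMseq filtering and ranking can be sketched as below. The record layout is hypothetical, and the fold change shown (ratio of group medians with a pseudocount of 1) is an assumed definition, since the text says only "median-based". SAMseq itself and the Wilcoxon test are not reproduced; adjusted P-values are taken as inputs.

```python
import math
from statistics import median

MIN_MEDIAN_EXPRESSION = 50.0  # expression threshold from the text
MAX_ADJUSTED_P = 0.05         # BH-adjusted Wilcoxon P-value cutoff
TOP_N = 10                    # up to 10 largest fold changes are plotted


def filter_and_rank(records):
    """Filter one SAMseq result file and rank it by fold-change magnitude.

    records: iterable of dicts with hypothetical keys
    'mir', 'group_a', 'group_b' (expression values) and 'adj_p'.
    """
    kept = []
    for rec in records:
        med_a = median(rec["group_a"])
        med_b = median(rec["group_b"])
        if med_a < MIN_MEDIAN_EXPRESSION and med_b < MIN_MEDIAN_EXPRESSION:
            continue  # low expression in both groups
        if rec["adj_p"] > MAX_ADJUSTED_P:
            continue  # not significant after BH adjustment
        # Assumed definition; the pseudocount avoids division by zero.
        fold = (med_a + 1.0) / (med_b + 1.0)
        kept.append((rec["mir"], fold))
    kept.sort(key=lambda item: abs(math.log2(item[1])), reverse=True)
    return kept[:TOP_N]


# Made-up records for illustration.
example = [
    {"mir": "A", "group_a": [100, 200, 300], "group_b": [10, 20, 30], "adj_p": 0.01},
    {"mir": "B", "group_a": [10, 20, 30], "group_b": [5, 6, 7], "adj_p": 0.001},
    {"mir": "C", "group_a": [100, 100, 100], "group_b": [400, 400, 400], "adj_p": 0.2},
    {"mir": "D", "group_a": [60, 60, 60], "group_b": [70, 70, 70], "adj_p": 0.04},
]
ranked = filter_and_rank(example)
```

Running the function once on the ‘up’ file and once on the ‘down’ file gives the two ranked lists used for the figure.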

 

miRNA library construction, sequencing, and analysis

miRNA-Seq library construction, sequencing, read alignment (to miRBase v19), and miRNA expression profiling were performed as previously reported by the Cancer Genome Atlas Research Network (Cancer Genome Atlas Research Network, 2013a; 2013c).

MicroRNA Sequencing (miRNA-seq)

*Protocols Performed at British Columbia Cancer Agency). 

MicroRNA-seq library construction

Small RNAs, containing microRNA (miRNA), in the flow-through material following mRNA purification on a MultiMACS separator (Miltenyi Biotec, Germany) are recovered by ethanol precipitation. MiRNA-seq libraries are constructed using a 96-well plate-based protocol developed at the BC Cancer Agency, Genome Sciences Centre.  Briefly, an adenylated single-stranded DNA 3’ adapter is selectively ligated to miRNAs using a truncated T4 RNA ligase2 (NEB Canada, cat. M0242L). An RNA 5’ adapter is then added, using a T4 RNA ligase (Ambion USA, cat. AM2141) and ATP. Next, first strand cDNA is synthesized using Superscript II Reverse Transcriptase (Invitrogen, cat.18064 014), and serves as the template for PCR. Index sequences (6 nucleotides) are introduced at this PCR step to enable multiplexed pooling of miRNA libraries. PCR products are pooled, then size-selected on an in-house developed 96-channel robot to enrich the miRNA containing fraction and remove adapter contaminants. Each size-selected indexed pool is ethanol precipitated and quality checked on an Agilent Bioanalyzer DNA 1000 chip and quantified using a Qubit fluorometer (Invitrogen, cat. Q32854). Each pool is then diluted to a target concentration for cluster generation and loaded into a single lane of a HiSeq 2000 flow cell for sequencing with a 31-bp main read (for the insert) and a 7-bp read for the index.  

miRNA/hg19 alignment:

Illumina miRNA sequencing reads were aligned to the hg19 reference using BWA version 0.5.7. This reference contains chromosomes 1-22, X, Y, MT, 20 unlocalized scaffolds and 39 unplaced scaffolds. Duplicated reads were marked with Picard Tools.

 

miRNA preprocessing, alignment and annotation

Briefly, the sequence data are separated into individual samples based on the index read sequences, and the reads undergo an initial QC assessment. Adapter sequence is then trimmed off, and the trimmed reads for each sample are aligned to the NCBI GRCh37-lite reference genome.

Routine QC assesses a subset of raw sequences from each pooled lane for the abundance of reads from each indexed sample in the pool, the proportion of reads that possibly originate from adapter dimers (i.e. a 5’ adapter joined to a 3’ adapter with no intervening biological sequence) and for the proportion of reads that map to human miRNAs. Sequencing error is estimated by a method originally developed for SAGE (Khattra et al., 2007).

Libraries that pass this QC stage are preprocessed for alignment. While the size-selected miRNAs vary somewhat in length, typically they are ~21 bp long, and so are shorter than the 31-bp read length. Given this, each read sequence extends some distance into the 3' sequencing adapter. Because this non-biological sequence can interfere with aligning the read to the reference genome, 3’ adapter sequence is identified and removed (trimmed) from a read. The adapter-trimming algorithm identifies as long an adapter sequence as possible, allowing a number of mismatches that depends on the adapter length found. A typical sequencing run yields several million reads; using only the first (5’) 15 bases of the 3’ adapter in trimming makes processing efficient, while minimizing the chance that an miRNA read will match the adapter sequence.

After each read has been processed, a summary report is generated containing the number of reads at each read length. Any trimmed read that is shorter than 15bp is discarded; remaining reads are submitted for alignment to the reference genome. BWA (Li and Durbin, 2009) alignment(s) for each read are checked with a series of three filters. A read with more than 3 alignments is discarded as too ambiguous. Only perfect alignments with no mismatches are used. Reads that fail the Illumina basecalling chastity filter are retained, while reads that have soft-clipped CIGAR strings are discarded.

For reads retained after filtering, each coordinate for each read alignment is annotated using reference databases, requiring a minimum 3-bp overlap between the alignment and an annotation. If a read has more than one alignment location, and the annotations for these are different, we use a priority list to assign a single annotation to the read, as long as only one alignment is to a miRNA. When there are multiple alignments to different miRNAs, the read is flagged as cross-mapped (de Hoon et al., 2010), and all of its miRNA annotations are preserved, while all of its non-miRNA annotations are discarded. This ensures that all annotation information about ambiguously mapped miRNAs is retained, and allows annotation ambiguity to be addressed in downstream analyses. Note that we consider a read to be cross-mapped only if it maps to different miRNAs, not to functionally identical miRNAs that are expressed from different locations in the genome. Such cases are indicated by miRBase miRNA names, which can have up to 4 sections separated by "-", e.g. hsa-mir-26a-1. A difference in the final (e.g. ‘-1’) section denotes functionally equivalent miRNAs expressed from different regions of the genome, and we consider only the first 3 sections (e.g. ‘hsa-mir-26a’) when comparing names. As long as a read maps to multiple miRNAs for which the first 3 sections of the name are identical (e.g. hsa-mir-26a-1 and hsa-mir-26a-2), it is treated as if it maps to only one miRNA, and is not flagged as cross-mapped.
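The name comparison used to decide whether a read is cross-mapped can be sketched in a few lines. The function names below are hypothetical and the example covers only the naming rule described above, not the priority list or the annotation overlap step.

```python
def name_key(mirna_name):
    """Keep only the first 3 '-'-separated sections of a miRBase name."""
    return "-".join(mirna_name.split("-")[:3])


def is_cross_mapped(mirna_names):
    """True if the names refer to more than one distinct miRNA.

    Names that differ only in the final section (e.g. '-1' vs '-2')
    collapse to the same key and are not treated as cross-mapped.
    """
    return len({name_key(name) for name in mirna_names}) > 1


if __name__ == "__main__":
    print(is_cross_mapped(["hsa-mir-26a-1", "hsa-mir-26a-2"]))  # same miRNA, two loci
    print(is_cross_mapped(["hsa-mir-26a-1", "hsa-mir-26b"]))    # two different miRNAs
```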

The minimum depth of sequencing required to detect the miRNAs that are expressed in one sample is 1,000,000 reads per library mapped to miRBase annotations.

Finally, for each sample, the reads that correspond to particular miRNAs are summed and normalized to a million miRNA-aligned reads to generate the quantification files. 
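The normalization and the minimum-depth requirement amount to simple arithmetic, sketched here under the assumption that per-miRNA read counts are already available as a dictionary. The function names and the example counts are illustrative only.

```python
MIN_MIRNA_READS = 1_000_000  # minimum miRNA-mapped reads per library, from the text


def meets_minimum_depth(counts):
    """Check whether a library has enough miRNA-mapped reads."""
    return sum(counts.values()) >= MIN_MIRNA_READS


def reads_per_million(counts):
    """Scale raw per-miRNA counts to reads per million miRNA-aligned reads."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library has no miRNA-aligned reads")
    return {name: count * 1_000_000 / total for name, count in counts.items()}


if __name__ == "__main__":
    counts = {"hsa-mir-21": 750, "hsa-mir-26a": 250}  # toy numbers
    print(reads_per_million(counts))
    print(meets_minimum_depth(counts))
```

With the toy counts above, the two miRNAs account for 75% and 25% of the miRNA-aligned reads, so their normalized values are 750,000 and 250,000 RPM, and the library falls far short of the depth requirement.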

MicroRNA Sequencing Analysis

*Protocols performed at British Columbia Cancer Agency

miRNA NMF methods

We identified groups of samples with similar abundance profiles using unsupervised non-negative matrix factorization (NMF) consensus clustering of reads-per-million (RPM) data for the 25% most-variant 5p or 3p miRBase v20 mature strands. We generated a heatmap for the discriminatory miRNAs that had the highest scores in each of the four NMF metagenes (Gaujoux and Seoighe 2010) as follows. We reordered columns (samples) in an RPM-normalized abundance matrix to match the NMF result. We log2-transformed and median-centered the rows (miRs), and then hierarchically clustered the rows using an absolute centered correlation distance metric and average linkage (de Hoon 2004, Saldanha 2004). 5p and 3p mature strand names were assigned using miRBase v20. We generated covariate association P-values with R’s Fisher exact test.
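The row transformation and the distance metric used for the heatmap can be written out as follows. This is a sketch in plain Python rather than the clustering software cited above; the pseudocount added before the log transform is an assumption (the text does not say how zero values were handled), and the distance is taken to be one minus the absolute value of the Pearson correlation.

```python
import math
from statistics import mean, median

PSEUDOCOUNT = 1.0  # assumption: avoids log2(0) for miRNAs with zero RPM


def log2_median_center(row):
    """Log2-transform one miRNA's RPM values, then subtract the row median."""
    logged = [math.log2(value + PSEUDOCOUNT) for value in row]
    row_median = median(logged)
    return [value - row_median for value in logged]


def abs_centered_correlation_distance(x, y):
    """1 - |Pearson r|: strongly correlated or anti-correlated rows are close."""
    dx = [value - mean(x) for value in x]
    dy = [value - mean(y) for value in y]
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant row")
    r = sum(a * b for a, b in zip(dx, dy)) / denominator
    return 1.0 - abs(r)


if __name__ == "__main__":
    print(log2_median_center([0, 1, 3]))
    print(abs_centered_correlation_distance([1, 2, 3], [3, 2, 1]))
```

Because the absolute value is used, a perfectly anti-correlated pair of rows has distance 0, the same as a perfectly correlated pair. A full analysis would pass the resulting distances to an average-linkage hierarchical clustering routine.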

 

miRNA-Seq-Differential expression

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify miRs that were differentially expressed. Each run generated a pair of files: miRs ‘up’ and ‘down’. We filtered each file by removing miRs with median expression less than 50 RPKM in both of the input sample groups, and miRs for which the Wilcoxon BH-adjusted P-value was greater than 0.05. We then ranked the filtered results by a median-based fold change and generated a figure showing up to 10 of the largest fold changes in each direction.
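The post-SAMseq filtering and ranking can be sketched as below. The sketch does not run SAMseq or the Wilcoxon test; it assumes the BH-adjusted P-values are already available. The record layout, the function name, the pseudocount, and the definition of the fold change as a log2 ratio of group medians are assumptions made for illustration.

```python
import math
from statistics import median

MIN_MEDIAN_EXPRESSION = 50.0  # expression threshold from the text
MAX_ADJUSTED_P = 0.05         # BH-adjusted P-value threshold from the text
PSEUDOCOUNT = 1.0             # assumption: avoids division by zero


def filter_and_rank(records, top_n=10):
    """Filter miRs and return the largest fold changes in each direction.

    Each record is (name, group_a_values, group_b_values, adjusted_p).
    """
    kept = []
    for name, group_a, group_b, adjusted_p in records:
        median_a, median_b = median(group_a), median(group_b)
        if median_a < MIN_MEDIAN_EXPRESSION and median_b < MIN_MEDIAN_EXPRESSION:
            continue  # low expression in both groups
        if adjusted_p > MAX_ADJUSTED_P:
            continue  # not significant after adjustment
        log2_fc = math.log2((median_a + PSEUDOCOUNT) / (median_b + PSEUDOCOUNT))
        kept.append((name, log2_fc))
    up = sorted((r for r in kept if r[1] > 0), key=lambda r: r[1], reverse=True)
    down = sorted((r for r in kept if r[1] < 0), key=lambda r: r[1])
    return up[:top_n], down[:top_n]


if __name__ == "__main__":
    records = [
        ("miR-A", [200, 300, 400], [50, 60, 70], 0.01),
        ("miR-B", [10, 20, 30], [5, 6, 7], 0.001),
        ("miR-C", [100, 100, 100], [400, 400, 400], 0.2),
        ("miR-D", [60, 70, 80], [500, 600, 700], 0.03),
    ]
    up, down = filter_and_rank(records)
    print([name for name, _ in up], [name for name, _ in down])
```

In the toy input, miR-B is dropped for low expression in both groups and miR-C for its adjusted P-value, leaving miR-A as higher in the first group and miR-D as lower.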

 

miRNA library construction, sequencing, and analysis

miRNA-Seq library construction, sequencing, read alignment (to miRBase v19), and miRNA expression profiling were performed as previously reported by The Cancer Genome Atlas Research Network (Cancer Genome Atlas Research Network, 2013a; 2013c).

MicroRNA Sequencing (miRNA-seq)

*Protocols performed at British Columbia Cancer Agency

MicroRNA-seq library construction

Small RNAs, including microRNA (miRNA), in the flow-through material following mRNA purification on a MultiMACS separator (Miltenyi Biotec, Germany) are recovered by ethanol precipitation. miRNA-seq libraries are constructed using a 96-well plate-based protocol developed at the BC Cancer Agency, Genome Sciences Centre. Briefly, an adenylated single-stranded DNA 3’ adapter is selectively ligated to miRNAs using a truncated T4 RNA ligase 2 (NEB Canada, cat. M0242L). An RNA 5’ adapter is then added, using a T4 RNA ligase (Ambion USA, cat. AM2141) and ATP. Next, first-strand cDNA is synthesized using Superscript II Reverse Transcriptase (Invitrogen, cat. 18064-014), and serves as the template for PCR. Index sequences (6 nucleotides) are introduced at this PCR step to enable multiplexed pooling of miRNA libraries. PCR products are pooled, then size-selected on a 96-channel robot developed in-house to enrich the miRNA-containing fraction and remove adapter contaminants. Each size-selected indexed pool is ethanol precipitated, quality checked on an Agilent Bioanalyzer DNA 1000 chip, and quantified using a Qubit fluorometer (Invitrogen, cat. Q32854). Each pool is then diluted to a target concentration for cluster generation and loaded into a single lane of a HiSeq 2000 flow cell for sequencing with a 31-bp main read (for the insert) and a 7-bp read for the index.

miRNA/hg19 alignment

Illumina miRNA sequencing reads were aligned to the hg19 reference using BWA version 0.5.7. This reference contains chromosomes 1-22, X, Y, MT, 20 unlocalized scaffolds and 39 unplaced scaffolds. Duplicated reads were marked with Picard Tools.

 

MicroRNA Sequencing (miRNA-seq)

*Protocols performed at British Columbia Cancer Agency. 

MicroRNA-seq library construction

Small RNAs, containing microRNA (miRNA), in the flow-through material following mRNA purification on a MultiMACS separator (Miltenyi Biotec, Germany) are recovered by ethanol precipitation. MiRNA-seq libraries are constructed using a 96-well plate-based protocol developed at the BC Cancer Agency, Genome Sciences Centre.  Briefly, an adenylated single-stranded DNA 3’ adapter is selectively ligated to miRNAs using a truncated T4 RNA ligase2 (NEB Canada, cat. M0242L). An RNA 5’ adapter is then added, using a T4 RNA ligase (Ambion USA, cat. AM2141) and ATP. Next, first strand cDNA is synthesized using Superscript II Reverse Transcriptase (Invitrogen, cat.18064 014), and serves as the template for PCR. Index sequences (6 nucleotides) are introduced at this PCR step to enable multiplexed pooling of miRNA libraries. PCR products are pooled, then size-selected on an in-house developed 96-channel robot to enrich the miRNA containing fraction and remove adapter contaminants. Each size-selected indexed pool is ethanol precipitated and quality checked on an Agilent Bioanalyzer DNA 1000 chip and quantified using a Qubit fluorometer (Invitrogen, cat. Q32854). Each pool is then diluted to a target concentration for cluster generation and loaded into a single lane of a HiSeq 2000 flow cell for sequencing with a 31-bp main read (for the insert) and a 7-bp read for the index.  

miRNA/hg19 alignment

Illumina miRNA sequencing reads were aligned to the hg19 reference using BWA version 0.5.7. This reference contains chromosomes 1-22, X, Y, MT, 20 unlocalized scaffolds and 39 unplaced scaffolds. Duplicated reads were marked with Picard Tools.

 

miRNA preprocessing, alignment, and annotation

Briefly, the sequence data are separated into individual samples based on the index read sequences, and the reads undergo an initial QC assessment. Adapter sequence is then trimmed off, and the trimmed reads for each sample are aligned to the NCBI GRCh37-lite reference genome.

Routine QC assesses a subset of raw sequences from each pooled lane for the abundance of reads from each indexed sample in the pool, the proportion of reads that possibly originate from adapter dimers (i.e. a 5’ adapter joined to a 3’ adapter with no intervening biological sequence) and for the proportion of reads that map to human miRNAs. Sequencing error is estimated by a method originally developed for SAGE (Khattra et al., 2007).

Libraries that pass this QC stage are preprocessed for alignment. While the size-selected miRNAs vary somewhat in length, typically they are ~21 bp long, and so are shorter than the 31-bp read length. Given this, each read sequence extends some distance into the 3' sequencing adapter. Because this non-biological sequence can interfere with aligning the read to the reference genome, 3’ adapter sequence is identified and removed (trimmed) from a read. The adapter-trimming algorithm identifies as long an adapter sequence as possible, allowing a number of mismatches that depends on the adapter length found. A typical sequencing run yields several million reads; using only the first (5’) 15 bases of the 3’ adapter in trimming makes processing efficient, while minimizing the chance that an miRNA read will match the adapter sequence.

After each read has been processed, a summary report is generated containing the number of reads at each read length. Any trimmed read that is shorter than 15bp is discarded; remaining reads are submitted for alignment to the reference genome. BWA (Li and Durbin, 2009) alignment(s) for each read are checked with a series of three filters. A read with more than 3 alignments is discarded as too ambiguous. Only perfect alignments with no mismatches are used. Reads that fail the Illumina basecalling chastity filter are retained, while reads that have soft-clipped CIGAR strings are discarded.

For reads retained after filtering, each coordinate for each read alignment is annotated using a reference databases, and requiring a minimum 3-bp overlap between the alignment and an annotation. If a read has more than one alignment location, and the annotations for these are different, we use a priority list to assign a single annotation to the read, as long as only one alignment is to a miRNA. When there are multiple alignments to different miRNAs, the read is flagged as cross-mapped (de Hoon et al., 2010), and all of its miRNA annotations are preserved, while all of its non-miRNA annotations are discarded. This ensures that all annotation information about ambiguously mapped miRNAs is retained, and allows annotation ambiguity to be addressed in downstream analyses. Note that we consider miRNAs to be cross-mapped only if they map to different miRNAs, not to functionally identical miRNAs that are expressed from different locations in the genome. Such cases are indicated by miRNA miRBase names, which can have up to 4 separate sections separated by "-", e.g. hsa-mir-26a-1. A difference in the final (e.g. ‘-1’) section denotes functionally equivalent miRNAs expressed from different regions of the genome, and we consider only the first 3 sections (e.g. ‘hsa-mir-26a’) when comparing names. As long as a read maps to multiple miRNAs for which the first 3 sections of the name are identical (e.g. hsa-mir-26a-1 and hsa-mir-26a-2), it is treated as if it maps to only one miRNA, and is not flagged as cross-mapped.

The minimum depth of sequencing required to detect the miRNAs that are expressed in one sample is 1,000,000 reads per library mapped to miRBase annotations.

Finally, for each sample, the reads that correspond to particular miRNAs are summed and normalized to a million miRNA-aligned reads to generate the quantification files.
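As a small worked illustration of this normalization, the sketch below scales per-miRNA read counts for one sample to reads per million miRNA-aligned reads. The function name and the dictionary input format are assumptions made for the example; the quantification files themselves may be laid out differently.

```python
def reads_per_million(mirna_counts):
    """Scale per-miRNA read counts so that they sum to one million for the sample."""
    total = sum(mirna_counts.values())
    if total == 0:
        raise ValueError("sample has no miRNA-aligned reads")
    return {name: count * 1_000_000 / total for name, count in mirna_counts.items()}
```

With counts of 250 and 750 reads for two miRNAs in a sample, the normalized values would be 250,000 and 750,000 RPM respectively.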

 

miRNA library construction, sequencing, and analysis

miRNA-Seq library construction, sequencing, read alignment (to miRBase v19), and miRNA expression profiling were performed as previously reported by The Cancer Genome Atlas Research Network (Cancer Genome Atlas Research Network, 2013a; 2013c).

MicroRNA Sequencing Analysis

*Protocols performed at British Columbia Cancer Agency

miRNA NMF methods

We identified groups of samples with similar abundance profiles using unsupervised non-negative matrix factorization (NMF) consensus clustering of reads-per-million (RPM) data for the 25% most-variant 5p or 3p miRBase v20 mature strands. We generated a heatmap for the discriminatory miRNAs that had the highest scores in each of the four NMF metagenes (Gaujoux and Seoighe 2010) as follows. We reordered columns (samples) in an RPM-normalized abundance matrix to match the NMF result. We log2-transformed and median-centered the rows (miRs), and then hierarchically clustered the rows using an absolute centered correlation distance metric and average linkage (de Hoon 2004, Saldanha 2004). 5p and 3p mature strand names were assigned using miRBase v20. We generated covariate association P-values with R’s Fisher exact test.
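The row transformation and the distance metric used for the heatmap can be sketched as below. This is a sketch under stated assumptions, not the analysis code: the pseudocount added before the log2 transform is hypothetical (the text does not say how zero RPM values were handled), and the distance is written as 1 - |r| with r the Pearson correlation, which is the usual definition of the absolute centered correlation distance in clustering software.

```python
import math
from statistics import median


def log2_median_center(row, pseudocount=1.0):
    """Log2-transform one miRNA's RPM values and subtract the row median.

    The pseudocount is an assumption made here to avoid log2(0).
    """
    logged = [math.log2(value + pseudocount) for value in row]
    row_median = median(logged)
    return [value - row_median for value in logged]


def abs_centered_correlation_distance(x, y):
    """Return 1 - |r|, where r is the Pearson correlation of x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant row")
    return 1.0 - abs(sxy / math.sqrt(sxx * syy))
```

Under this metric, two miRNAs with perfectly anti-correlated profiles are at distance 0, so they cluster together just as perfectly correlated ones would.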

 

miRNA-Seq-Differential expression

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify miRs that were differentially expressed. Each run generated a pair of files: miRs ‘up’ and ‘down’. We filtered each file by removing miRs with median expression less than 50 RPKM in both of the input sample groups, and miRs for which the Wilcoxon BH adjusted P-value was greater than 0.05; then ranked the filtered results by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction. 
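The post-SAMseq filtering step can be illustrated with the sketch below. It shows a standard Benjamini-Hochberg (BH) adjustment and the two removal criteria described above; it does not reproduce SAMseq or the Wilcoxon test. The function names and the argument layout are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


def keep_mir(median_group_a, median_group_b, adjusted_p, min_expression=50, alpha=0.05):
    """Apply the two filters: expression in at least one group, and adjusted P <= alpha."""
    low_in_both = median_group_a < min_expression and median_group_b < min_expression
    return (not low_in_both) and adjusted_p <= alpha
```

A miR is removed only when its median expression is below the threshold in both groups, so a miR that is absent in one group but well expressed in the other is kept.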

 

miRNA library construction, sequencing, and analysis

miRNA-Seq library construction, sequencing, read alignment (to miRBase v19), and miRNA expression profiling were performed as previously reported by The Cancer Genome Atlas Research Network (Cancer Genome Atlas Research Network, 2013a; 2013c).

MicroRNA Sequencing (miRNA-seq)

*Protocols performed at British Columbia Cancer Agency. 

MicroRNA-seq library construction

Small RNAs, containing microRNA (miRNA), in the flow-through material following mRNA purification on a MultiMACS separator (Miltenyi Biotec, Germany) are recovered by ethanol precipitation. MiRNA-seq libraries are constructed using a 96-well plate-based protocol developed at the BC Cancer Agency, Genome Sciences Centre. Briefly, an adenylated single-stranded DNA 3’ adapter is selectively ligated to miRNAs using a truncated T4 RNA ligase 2 (NEB Canada, cat. M0242L). An RNA 5’ adapter is then added, using a T4 RNA ligase (Ambion USA, cat. AM2141) and ATP. Next, first strand cDNA is synthesized using Superscript II Reverse Transcriptase (Invitrogen, cat. 18064-014), and serves as the template for PCR. Index sequences (6 nucleotides) are introduced at this PCR step to enable multiplexed pooling of miRNA libraries. PCR products are pooled, then size-selected on an in-house developed 96-channel robot to enrich the miRNA-containing fraction and remove adapter contaminants. Each size-selected indexed pool is ethanol precipitated and quality checked on an Agilent Bioanalyzer DNA 1000 chip and quantified using a Qubit fluorometer (Invitrogen, cat. Q32854). Each pool is then diluted to a target concentration for cluster generation and loaded into a single lane of a HiSeq 2000 flow cell for sequencing with a 31-bp main read (for the insert) and a 7-bp read for the index.

miRNA/hg19 alignment

Illumina miRNA sequencing reads were aligned to the hg19 reference using BWA version 0.5.7. This reference contains chromosomes 1-22, X, Y, MT, 20 unlocalized scaffolds and 39 unplaced scaffolds. Duplicated reads were marked with Picard Tools.

 

miRNA preprocessing, alignment, and annotation

Briefly, the sequence data are separated into individual samples based on the index read sequences, and the reads undergo an initial QC assessment. Adapter sequence is then trimmed off, and the trimmed reads for each sample are aligned to the NCBI GRCh37-lite reference genome.

Routine QC assesses a subset of raw sequences from each pooled lane for the abundance of reads from each indexed sample in the pool, the proportion of reads that possibly originate from adapter dimers (i.e. a 5’ adapter joined to a 3’ adapter with no intervening biological sequence) and for the proportion of reads that map to human miRNAs. Sequencing error is estimated by a method originally developed for SAGE (Khattra et al., 2007).

Libraries that pass this QC stage are preprocessed for alignment. While the size-selected miRNAs vary somewhat in length, typically they are ~21 bp long, and so are shorter than the 31-bp read length. Given this, each read sequence extends some distance into the 3' sequencing adapter. Because this non-biological sequence can interfere with aligning the read to the reference genome, 3’ adapter sequence is identified and removed (trimmed) from a read. The adapter-trimming algorithm identifies as long an adapter sequence as possible, allowing a number of mismatches that depends on the adapter length found. A typical sequencing run yields several million reads; using only the first (5’) 15 bases of the 3’ adapter in trimming makes processing efficient, while minimizing the chance that an miRNA read will match the adapter sequence.
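The trimming idea described above can be sketched as follows. This is only an illustration: the mismatch allowance (one mismatch per 8 aligned adapter bases), the minimum match length, and the adapter sequence used in the usage note are all hypothetical, since the text does not state the actual values or the adapter sequence.

```python
def trim_3p_adapter(read, adapter, max_adapter_bases=15, min_match=3):
    """Remove the 3' adapter from a read, using only the adapter's first 15 bases.

    Scanning from the 5' end means the first acceptable position yields the
    longest adapter match. The allowed mismatches grow with the matched length
    (hypothetical rule: one mismatch per 8 aligned bases).
    """
    probe = adapter[:max_adapter_bases]
    for start in range(len(read) - min_match + 1):
        tail = read[start:start + len(probe)]
        # zip stops at the shorter sequence, so a partial adapter at the read end is handled.
        mismatches = sum(1 for a, b in zip(tail, probe) if a != b)
        allowed = len(tail) // 8
        if mismatches <= allowed:
            return read[:start]
    return read
```

For instance, with a placeholder adapter "ACGTACGTTGCATGC", a 31-base read made of a 21-base insert followed by the first 10 adapter bases would be trimmed back to the 21-base insert, while a read containing no adapter-like sequence would be returned unchanged.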

After each read has been processed, a summary report is generated containing the number of reads at each read length. Any trimmed read that is shorter than 15 bp is discarded; remaining reads are submitted for alignment to the reference genome. BWA (Li and Durbin, 2009) alignment(s) for each read are checked with a series of three filters. A read with more than 3 alignments is discarded as too ambiguous. Only perfect alignments with no mismatches are used. Reads that fail the Illumina basecalling chastity filter are retained, while reads that have soft-clipped CIGAR strings are discarded.

For reads retained after filtering, each coordinate for each read alignment is annotated using reference databases, requiring a minimum 3-bp overlap between the alignment and an annotation. If a read has more than one alignment location, and the annotations for these are different, we use a priority list to assign a single annotation to the read, as long as only one alignment is to a miRNA. When there are multiple alignments to different miRNAs, the read is flagged as cross-mapped (de Hoon et al., 2010), and all of its miRNA annotations are preserved, while all of its non-miRNA annotations are discarded. This ensures that all annotation information about ambiguously mapped miRNAs is retained, and allows annotation ambiguity to be addressed in downstream analyses. Note that we consider miRNAs to be cross-mapped only if they map to different miRNAs, not to functionally identical miRNAs that are expressed from different locations in the genome. Such cases are indicated by miRNA miRBase names, which can have up to 4 separate sections separated by "-", e.g. hsa-mir-26a-1. A difference in the final (e.g. ‘-1’) section denotes functionally equivalent miRNAs expressed from different regions of the genome, and we consider only the first 3 sections (e.g. ‘hsa-mir-26a’) when comparing names. As long as a read maps to multiple miRNAs for which the first 3 sections of the name are identical (e.g. hsa-mir-26a-1 and hsa-mir-26a-2), it is treated as if it maps to only one miRNA, and is not flagged as cross-mapped.

The minimum depth of sequencing required to detect the miRNAs that are expressed in one sample is 1,000,000 reads per library mapped to miRBase annotations.

Finally, for each sample, the reads that correspond to particular miRNAs are summed and normalized to a million miRNA-aligned reads to generate the quantification files.

 

miRNA library construction, sequencing, and analysis

miRNA-Seq library construction, sequencing, read alignment (to miRBase v19), and miRNA expression profiling were performed as previously reported by The Cancer Genome Atlas Research Network (Cancer Genome Atlas Research Network, 2013a; 2013c).

MicroRNA Sequencing Analysis

*Protocols performed at British Columbia Cancer Agency

miRNA NMF methods

We identified groups of samples with similar abundance profiles using unsupervised non-negative matrix factorization (NMF) consensus clustering of reads-per-million (RPM) data for the 25% most-variant 5p or 3p miRBase v20 mature strands. We generated a heatmap for the discriminatory miRNAs that had the highest scores in each of the four NMF metagenes (Gaujoux and Seoighe 2010) as follows. We reordered columns (samples) in an RPM-normalized abundance matrix to match the NMF result. We log2-transformed and median-centered the rows (miRs), and then hierarchically clustered the rows using an absolute centered correlation distance metric and average linkage (de Hoon 2004, Saldanha 2004). 5p and 3p mature strand names were assigned using miRBase v20. We generated covariate association P-values with R’s Fisher exact test.

 

miRNA-Seq-Differential expression

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify miRs that were differentially expressed. Each run generated a pair of files: miRs ‘up’ and ‘down’. We filtered each file by removing miRs with median expression less than 50 RPKM in both of the input sample groups, and miRs for which the Wilcoxon BH adjusted P-value was greater than 0.05; then ranked the filtered results by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction. 

MicroRNA Sequencing (miRNA-seq)

*Protocols performed at British Columbia Cancer Agency.

MicroRNA-seq library construction

Small RNAs, containing microRNA (miRNA), in the flow-through material following mRNA purification on a MultiMACS separator (Miltenyi Biotec, Germany) are recovered by ethanol precipitation. MiRNA-seq libraries are constructed using a 96-well plate-based protocol developed at the BC Cancer Agency, Genome Sciences Centre. Briefly, an adenylated single-stranded DNA 3’ adapter is selectively ligated to miRNAs using a truncated T4 RNA ligase 2 (NEB Canada, cat. M0242L). An RNA 5’ adapter is then added, using a T4 RNA ligase (Ambion USA, cat. AM2141) and ATP. Next, first strand cDNA is synthesized using Superscript II Reverse Transcriptase (Invitrogen, cat. 18064-014), and serves as the template for PCR. Index sequences (6 nucleotides) are introduced at this PCR step to enable multiplexed pooling of miRNA libraries. PCR products are pooled, then size-selected on an in-house developed 96-channel robot to enrich the miRNA-containing fraction and remove adapter contaminants. Each size-selected indexed pool is ethanol precipitated and quality checked on an Agilent Bioanalyzer DNA 1000 chip and quantified using a Qubit fluorometer (Invitrogen, cat. Q32854). Each pool is then diluted to a target concentration for cluster generation and loaded into a single lane of a HiSeq 2000 flow cell for sequencing with a 31-bp main read (for the insert) and a 7-bp read for the index.

miRNA/hg19 alignment:

Illumina miRNA sequencing reads were aligned to the hg19 reference using BWA version 0.5.7. This reference contains chromosomes 1-22, X, Y, MT, 20 unlocalized scaffolds and 39 unplaced scaffolds. Duplicated reads were marked with Picard Tools.

 

miRNA preprocessing, alignment and annotation

Briefly, the sequence data are separated into individual samples based on the index read sequences, and the reads undergo an initial QC assessment. Adapter sequence is then trimmed off, and the trimmed reads for each sample are aligned to the NCBI GRCh37-lite reference genome.

Routine QC assesses a subset of raw sequences from each pooled lane for the abundance of reads from each indexed sample in the pool, the proportion of reads that possibly originate from adapter dimers (i.e. a 5’ adapter joined to a 3’ adapter with no intervening biological sequence) and for the proportion of reads that map to human miRNAs. Sequencing error is estimated by a method originally developed for SAGE (Khattra et al., 2007).

Libraries that pass this QC stage are preprocessed for alignment. While the size-selected miRNAs vary somewhat in length, typically they are ~21 bp long, and so are shorter than the 31-bp read length. Given this, each read sequence extends some distance into the 3' sequencing adapter. Because this non-biological sequence can interfere with aligning the read to the reference genome, 3’ adapter sequence is identified and removed (trimmed) from a read. The adapter-trimming algorithm identifies as long an adapter sequence as possible, allowing a number of mismatches that depends on the adapter length found. A typical sequencing run yields several million reads; using only the first (5’) 15 bases of the 3’ adapter in trimming makes processing efficient, while minimizing the chance that an miRNA read will match the adapter sequence.

After each read has been processed, a summary report is generated containing the number of reads at each read length. Any trimmed read that is shorter than 15 bp is discarded; remaining reads are submitted for alignment to the reference genome. BWA (Li and Durbin, 2009) alignment(s) for each read are checked with a series of three filters. A read with more than 3 alignments is discarded as too ambiguous. Only perfect alignments with no mismatches are used. Reads that fail the Illumina basecalling chastity filter are retained, while reads that have soft-clipped CIGAR strings are discarded.

For reads retained after filtering, each coordinate for each read alignment is annotated using reference databases, requiring a minimum 3-bp overlap between the alignment and an annotation. If a read has more than one alignment location, and the annotations for these are different, we use a priority list to assign a single annotation to the read, as long as only one alignment is to a miRNA. When there are multiple alignments to different miRNAs, the read is flagged as cross-mapped (de Hoon et al., 2010), and all of its miRNA annotations are preserved, while all of its non-miRNA annotations are discarded. This ensures that all annotation information about ambiguously mapped miRNAs is retained, and allows annotation ambiguity to be addressed in downstream analyses. Note that we consider miRNAs to be cross-mapped only if they map to different miRNAs, not to functionally identical miRNAs that are expressed from different locations in the genome. Such cases are indicated by miRNA miRBase names, which can have up to 4 separate sections separated by "-", e.g. hsa-mir-26a-1. A difference in the final (e.g. ‘-1’) section denotes functionally equivalent miRNAs expressed from different regions of the genome, and we consider only the first 3 sections (e.g. ‘hsa-mir-26a’) when comparing names. As long as a read maps to multiple miRNAs for which the first 3 sections of the name are identical (e.g. hsa-mir-26a-1 and hsa-mir-26a-2), it is treated as if it maps to only one miRNA, and is not flagged as cross-mapped.

The minimum depth of sequencing required to detect the miRNAs that are expressed in one sample is 1,000,000 reads per library mapped to miRBase annotations.

Finally, for each sample, the reads that correspond to particular miRNAs are summed and normalized to a million miRNA-aligned reads to generate the quantification files. 

MicroRNA Sequencing Analysis

*Protocols performed at British Columbia Cancer Agency

miRNA NMF methods

We identified groups of samples with similar abundance profiles using unsupervised non-negative matrix factorization (NMF) consensus clustering of reads-per-million (RPM) data for the 25% most-variant 5p or 3p miRBase v20 mature strands. We generated a heatmap for the discriminatory miRNAs that had the highest scores in each of the four NMF metagenes (Gaujoux and Seoighe 2010) as follows. We reordered columns (samples) in an RPM-normalized abundance matrix to match the NMF result. We log2-transformed and median-centered the rows (miRs), and then hierarchically clustered the rows using an absolute centered correlation distance metric and average linkage (de Hoon 2004, Saldanha 2004). 5p and 3p mature strand names were assigned using miRBase v20. We generated covariate association P-values with R’s Fisher exact test.

 

miRNA-Seq-Differential expression

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify miRs that were differentially expressed. Each run generated a pair of files: miRs ‘up’ and ‘down’. We filtered each file by removing miRs with median expression less than 50 RPKM in both of the input sample groups, and miRs for which the Wilcoxon BH adjusted P-value was greater than 0.05; then ranked the filtered results by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction. 

MicroRNA Sequencing (miRNA-seq)

*Protocols performed at British Columbia Cancer Agency.

MicroRNA-seq library construction

Small RNAs, containing microRNA (miRNA), in the flow-through material following mRNA purification on a MultiMACS separator (Miltenyi Biotec, Germany) are recovered by ethanol precipitation. MiRNA-seq libraries are constructed using a 96-well plate-based protocol developed at the BC Cancer Agency, Genome Sciences Centre. Briefly, an adenylated single-stranded DNA 3’ adapter is selectively ligated to miRNAs using a truncated T4 RNA ligase 2 (NEB Canada, cat. M0242L). An RNA 5’ adapter is then added, using a T4 RNA ligase (Ambion USA, cat. AM2141) and ATP. Next, first strand cDNA is synthesized using Superscript II Reverse Transcriptase (Invitrogen, cat. 18064-014), and serves as the template for PCR. Index sequences (6 nucleotides) are introduced at this PCR step to enable multiplexed pooling of miRNA libraries. PCR products are pooled, then size-selected on an in-house developed 96-channel robot to enrich the miRNA-containing fraction and remove adapter contaminants. Each size-selected indexed pool is ethanol precipitated and quality checked on an Agilent Bioanalyzer DNA 1000 chip and quantified using a Qubit fluorometer (Invitrogen, cat. Q32854). Each pool is then diluted to a target concentration for cluster generation and loaded into a single lane of a HiSeq 2000 flow cell for sequencing with a 31-bp main read (for the insert) and a 7-bp read for the index.

miRNA/hg19 alignment:

Illumina miRNA sequencing reads were aligned to the hg19 reference using BWA version 0.5.7. This reference contains chromosomes 1-22, X, Y, MT, 20 unlocalized scaffolds and 39 unplaced scaffolds. Duplicated reads were marked with Picard Tools.

 

miRNA preprocessing, alignment and annotation

Briefly, the sequence data are separated into individual samples based on the index read sequences, and the reads undergo an initial QC assessment. Adapter sequence is then trimmed off, and the trimmed reads for each sample are aligned to the NCBI GRCh37-lite reference genome.

Routine QC assesses a subset of raw sequences from each pooled lane for the abundance of reads from each indexed sample in the pool, the proportion of reads that possibly originate from adapter dimers (i.e. a 5’ adapter joined to a 3’ adapter with no intervening biological sequence) and for the proportion of reads that map to human miRNAs. Sequencing error is estimated by a method originally developed for SAGE (Khattra et al., 2007).

Libraries that pass this QC stage are preprocessed for alignment. While the size-selected miRNAs vary somewhat in length, typically they are ~21 bp long, and so are shorter than the 31-bp read length. Given this, each read sequence extends some distance into the 3' sequencing adapter. Because this non-biological sequence can interfere with aligning the read to the reference genome, 3’ adapter sequence is identified and removed (trimmed) from a read. The adapter-trimming algorithm identifies as long an adapter sequence as possible, allowing a number of mismatches that depends on the adapter length found. A typical sequencing run yields several million reads; using only the first (5’) 15 bases of the 3’ adapter in trimming makes processing efficient, while minimizing the chance that an miRNA read will match the adapter sequence.

After each read has been processed, a summary report is generated containing the number of reads at each read length. Any trimmed read that is shorter than 15 bp is discarded; remaining reads are submitted for alignment to the reference genome. BWA (Li and Durbin, 2009) alignment(s) for each read are checked with a series of three filters. A read with more than 3 alignments is discarded as too ambiguous. Only perfect alignments with no mismatches are used. Reads that fail the Illumina basecalling chastity filter are retained, while reads that have soft-clipped CIGAR strings are discarded.

For reads retained after filtering, each coordinate for each read alignment is annotated using reference databases, requiring a minimum 3-bp overlap between the alignment and an annotation. If a read has more than one alignment location, and the annotations for these are different, we use a priority list to assign a single annotation to the read, as long as only one alignment is to a miRNA. When there are multiple alignments to different miRNAs, the read is flagged as cross-mapped (de Hoon et al., 2010), and all of its miRNA annotations are preserved, while all of its non-miRNA annotations are discarded. This ensures that all annotation information about ambiguously mapped miRNAs is retained, and allows annotation ambiguity to be addressed in downstream analyses. Note that we consider miRNAs to be cross-mapped only if they map to different miRNAs, not to functionally identical miRNAs that are expressed from different locations in the genome. Such cases are indicated by miRNA miRBase names, which can have up to 4 separate sections separated by "-", e.g. hsa-mir-26a-1. A difference in the final (e.g. ‘-1’) section denotes functionally equivalent miRNAs expressed from different regions of the genome, and we consider only the first 3 sections (e.g. ‘hsa-mir-26a’) when comparing names. As long as a read maps to multiple miRNAs for which the first 3 sections of the name are identical (e.g. hsa-mir-26a-1 and hsa-mir-26a-2), it is treated as if it maps to only one miRNA, and is not flagged as cross-mapped.

The minimum depth of sequencing required to detect the miRNAs that are expressed in one sample is 1,000,000 reads per library mapped to miRBase annotations.

Finally, for each sample, the reads that correspond to particular miRNAs are summed and normalized to a million miRNA-aligned reads to generate the quantification files. 

MicroRNA Sequencing Analysis

*Protocols performed at British Columbia Cancer Agency

miRNA NMF methods

We identified groups of samples with similar abundance profiles using unsupervised non-negative matrix factorization (NMF) consensus clustering of reads-per-million (RPM) data for the 25% most-variant 5p or 3p miRBase v20 mature strands. We generated a heatmap for the discriminatory miRNAs that had the highest scores in each of the four NMF metagenes (Gaujoux and Seoighe 2010) as follows. We reordered columns (samples) in an RPM-normalized abundance matrix to match the NMF result. We log2-transformed and median-centered the rows (miRs), and then hierarchically clustered the rows using an absolute centered correlation distance metric and average linkage (de Hoon 2004, Saldanha 2004). 5p and 3p mature strand names were assigned using miRBase v20. We generated covariate association P-values with R’s Fisher exact test.

 

miRNA-Seq-Differential expression

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify miRs that were differentially expressed. Each run generated a pair of files: miRs ‘up’ and ‘down’. We filtered each file by removing miRs with median expression less than 50 RPKM in both of the input sample groups, and miRs for which the Wilcoxon BH adjusted P-value was greater than 0.05; then ranked the filtered results by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction. 

MicroRNA Sequencing (miRNA-seq)

*Protocols performed at British Columbia Cancer Agency.

MicroRNA-seq library construction

Small RNAs, containing microRNA (miRNA), in the flow-through material following mRNA purification on a MultiMACS separator (Miltenyi Biotec, Germany) are recovered by ethanol precipitation. MiRNA-seq libraries are constructed using a 96-well plate-based protocol developed at the BC Cancer Agency, Genome Sciences Centre. Briefly, an adenylated single-stranded DNA 3’ adapter is selectively ligated to miRNAs using a truncated T4 RNA ligase 2 (NEB Canada, cat. M0242L). An RNA 5’ adapter is then added, using a T4 RNA ligase (Ambion USA, cat. AM2141) and ATP. Next, first strand cDNA is synthesized using Superscript II Reverse Transcriptase (Invitrogen, cat. 18064-014), and serves as the template for PCR. Index sequences (6 nucleotides) are introduced at this PCR step to enable multiplexed pooling of miRNA libraries. PCR products are pooled, then size-selected on an in-house developed 96-channel robot to enrich the miRNA-containing fraction and remove adapter contaminants. Each size-selected indexed pool is ethanol precipitated and quality checked on an Agilent Bioanalyzer DNA 1000 chip and quantified using a Qubit fluorometer (Invitrogen, cat. Q32854). Each pool is then diluted to a target concentration for cluster generation and loaded into a single lane of a HiSeq 2000 flow cell for sequencing with a 31-bp main read (for the insert) and a 7-bp read for the index.

miRNA/hg19 alignment:

Illumina miRNA sequencing reads were aligned to the hg19 reference using BWA version 0.5.7. This reference contains chromosomes 1-22, X, Y, MT, 20 unlocalized scaffolds and 39 unplaced scaffolds. Duplicated reads were marked with Picard Tools.

 

miRNA preprocessing, alignment and annotation

Briefly, the sequence data are separated into individual samples based on the index read sequences, and the reads undergo an initial QC assessment. Adapter sequence is then trimmed off, and the trimmed reads for each sample are aligned to the NCBI GRCh37-lite reference genome.

Routine QC assesses a subset of raw sequences from each pooled lane for the abundance of reads from each indexed sample in the pool, the proportion of reads that possibly originate from adapter dimers (i.e. a 5’ adapter joined to a 3’ adapter with no intervening biological sequence) and for the proportion of reads that map to human miRNAs. Sequencing error is estimated by a method originally developed for SAGE (Khattra et al., 2007).

Libraries that pass this QC stage are preprocessed for alignment. While the size-selected miRNAs vary somewhat in length, typically they are ~21 bp long, and so are shorter than the 31-bp read length. Given this, each read sequence extends some distance into the 3' sequencing adapter. Because this non-biological sequence can interfere with aligning the read to the reference genome, 3’ adapter sequence is identified and removed (trimmed) from a read. The adapter-trimming algorithm identifies as long an adapter sequence as possible, allowing a number of mismatches that depends on the adapter length found. A typical sequencing run yields several million reads; using only the first (5’) 15 bases of the 3’ adapter in trimming makes processing efficient, while minimizing the chance that an miRNA read will match the adapter sequence.

After each read has been processed, a summary report is generated containing the number of reads at each read length. Any trimmed read that is shorter than 15 bp is discarded; remaining reads are submitted for alignment to the reference genome. BWA (Li and Durbin, 2009) alignment(s) for each read are checked with a series of three filters. A read with more than 3 alignments is discarded as too ambiguous. Only perfect alignments with no mismatches are used. Reads that fail the Illumina basecalling chastity filter are retained, while reads that have soft-clipped CIGAR strings are discarded.

For reads retained after filtering, each coordinate for each read alignment is annotated using reference databases, requiring a minimum 3-bp overlap between the alignment and an annotation. If a read has more than one alignment location, and the annotations for these are different, we use a priority list to assign a single annotation to the read, as long as only one alignment is to a miRNA. When there are multiple alignments to different miRNAs, the read is flagged as cross-mapped (de Hoon et al., 2010), and all of its miRNA annotations are preserved, while all of its non-miRNA annotations are discarded. This ensures that all annotation information about ambiguously mapped miRNAs is retained, and allows annotation ambiguity to be addressed in downstream analyses. Note that we consider miRNAs to be cross-mapped only if they map to different miRNAs, not to functionally identical miRNAs that are expressed from different locations in the genome. Such cases are indicated by miRNA miRBase names, which can have up to 4 separate sections separated by "-", e.g. hsa-mir-26a-1. A difference in the final (e.g. ‘-1’) section denotes functionally equivalent miRNAs expressed from different regions of the genome, and we consider only the first 3 sections (e.g. ‘hsa-mir-26a’) when comparing names. As long as a read maps to multiple miRNAs for which the first 3 sections of the name are identical (e.g. hsa-mir-26a-1 and hsa-mir-26a-2), it is treated as if it maps to only one miRNA, and is not flagged as cross-mapped.

The minimum depth of sequencing required to detect the miRNAs that are expressed in one sample is 1,000,000 reads per library mapped to miRBase annotations.

Finally, for each sample, the reads that correspond to particular miRNAs are summed and normalized to a million miRNA-aligned reads to generate the quantification files. 

MicroRNA Sequencing Analysis

*Protocols performed at British Columbia Cancer Agency

miRNA NMF methods

We identified groups of samples with similar abundance profiles using unsupervised non-negative matrix factorization (NMF) consensus clustering of reads-per-million (RPM) data for the 25% most-variant 5p or 3p miRBase v20 mature strands. We generated a heatmap for the discriminatory miRNAs that had the highest scores in each of the four NMF metagenes (Gaujoux and Seoighe 2010) as follows. We reordered columns (samples) in an RPM-normalized abundance matrix to match the NMF result. We log2-transformed and median-centered the rows (miRs), and then hierarchically clustered the rows using an absolute centered correlation distance metric and average linkage (de Hoon 2004, Saldanha 2004). 5p and 3p mature strand names were assigned using miRBase v20. We generated covariate association P-values with R’s Fisher exact test.

 

miRNA-Seq-Differential expression

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify miRs that were differentially expressed. Each run generated a pair of files: miRs ‘up’ and ‘down’. We filtered each file by removing miRs with median expression less than 50 RPKM in both of the input sample groups, and miRs for which the Wilcoxon BH adjusted P-value was greater than 0.05; then ranked the filtered results by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction. 

 

miRNA library construction, sequencing, and analysis

miRNA-Seq library construction, sequencing, read alignment (to miRBase v19), and miRNA expression profiling were performed as previously reported by The Cancer Genome Atlas Research Network (Cancer Genome Atlas Research Network, 2013a; 2013c).

MicroRNA Sequencing (miRNA-seq)

*Protocols performed at British Columbia Cancer Agency

MicroRNA-seq library construction

Small RNAs, including microRNA (miRNA), in the flow-through material following mRNA purification on a MultiMACS separator (Miltenyi Biotec, Germany) are recovered by ethanol precipitation. miRNA-seq libraries are constructed using a 96-well plate-based protocol developed at the BC Cancer Agency, Genome Sciences Centre. Briefly, an adenylated single-stranded DNA 3’ adapter is selectively ligated to miRNAs using a truncated T4 RNA ligase 2 (NEB Canada, cat. M0242L). An RNA 5’ adapter is then added using T4 RNA ligase (Ambion USA, cat. AM2141) and ATP. Next, first-strand cDNA is synthesized using Superscript II Reverse Transcriptase (Invitrogen, cat. 18064-014) and serves as the template for PCR. Index sequences (6 nucleotides) are introduced at this PCR step to enable multiplexed pooling of miRNA libraries. PCR products are pooled, then size-selected on an in-house developed 96-channel robot to enrich the miRNA-containing fraction and remove adapter contaminants. Each size-selected indexed pool is ethanol precipitated, quality checked on an Agilent Bioanalyzer DNA 1000 chip, and quantified using a Qubit fluorometer (Invitrogen, cat. Q32854). Each pool is then diluted to a target concentration for cluster generation and loaded into a single lane of a HiSeq 2000 flow cell for sequencing with a 31-bp main read (for the insert) and a 7-bp read for the index.

miRNA/hg19 alignment:

Illumina miRNA sequencing reads were aligned to the hg19 reference using BWA version 0.5.7. This reference contains chromosomes 1-22, X, Y, MT, 20 unlocalized scaffolds and 39 unplaced scaffolds. Duplicated reads were marked with Picard Tools.

 

miRNA preprocessing, alignment and annotation

Briefly, the sequence data are separated into individual samples based on the index read sequences, and the reads undergo an initial QC assessment. Adapter sequence is then trimmed off, and the trimmed reads for each sample are aligned to the NCBI GRCh37-lite reference genome.

Routine QC assesses a subset of raw sequences from each pooled lane for the abundance of reads from each indexed sample in the pool, for the proportion of reads that possibly originate from adapter dimers (i.e. a 5’ adapter joined to a 3’ adapter with no intervening biological sequence), and for the proportion of reads that map to human miRNAs. Sequencing error is estimated by a method originally developed for SAGE (Khattra et al., 2007).

Libraries that pass this QC stage are preprocessed for alignment. While the size-selected miRNAs vary somewhat in length, typically they are ~21 bp long, and so are shorter than the 31-bp read length. Given this, each read sequence extends some distance into the 3' sequencing adapter. Because this non-biological sequence can interfere with aligning the read to the reference genome, 3’ adapter sequence is identified and removed (trimmed) from a read. The adapter-trimming algorithm identifies as long an adapter sequence as possible, allowing a number of mismatches that depends on the adapter length found. A typical sequencing run yields several million reads; using only the first (5’) 15 bases of the 3’ adapter in trimming makes processing efficient, while minimizing the chance that an miRNA read will match the adapter sequence.
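As a rough illustration of this trimming step, the sketch below scans a read for the first 15 bases of the 3’ adapter and removes everything from the leftmost acceptable match onward. It is not the BCCA implementation: the adapter sequence is a placeholder, and the mismatch rule (10% of the overlap length, rounded down) and the 3-base minimum overlap are assumptions standing in for the unspecified "number of mismatches that depends on the adapter length found".

```python
ADAPTER_PREFIX_LEN = 15  # only the first (5') 15 adapter bases are used, per the text


def trim_3prime_adapter(read, adapter, max_mismatch_rate=0.1, min_overlap=3):
    """Return the read with any 3' adapter sequence removed.

    Scanning left to right finds the longest adapter match first.
    max_mismatch_rate and min_overlap are hypothetical parameters.
    """
    probe = adapter[:ADAPTER_PREFIX_LEN]
    for start in range(len(read) - min_overlap + 1):
        overlap = min(len(probe), len(read) - start)
        allowed = int(overlap * max_mismatch_rate)
        mismatches = sum(1 for i in range(overlap) if read[start + i] != probe[i])
        if mismatches <= allowed:
            return read[:start]
    return read


# Placeholder adapter and a 22-nt insert, sequenced as a 31-bp read.
adapter = "TCGTATGCCGTCTTCTGCTTG"
insert = "TGAGGTAGTAGGTTGTATAGTT"
read = (insert + adapter)[:31]
trimmed = trim_3prime_adapter(read, adapter)
```

Because a ~21-bp insert leaves only about 10 adapter bases inside a 31-bp read, the partial-overlap case at the end of the read is the common one, which is why the sketch accepts matches shorter than the full 15-base probe.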

After each read has been processed, a summary report is generated containing the number of reads at each read length. Any trimmed read that is shorter than 15 bp is discarded; remaining reads are submitted for alignment to the reference genome. BWA (Li and Durbin, 2009) alignment(s) for each read are checked with a series of three filters. A read with more than 3 alignments is discarded as too ambiguous. Only perfect alignments with no mismatches are used. Reads that fail the Illumina basecalling chastity filter are retained, while reads that have soft-clipped CIGAR strings are discarded.
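The length cutoff and the three alignment filters can be expressed as a single function. This is a sketch under assumptions: the dictionary keys `mismatches` and `cigar` are a hypothetical representation of a parsed alignment, and soft-clipping is checked per alignment here, which is one possible reading of the text.

```python
def usable_alignments(trimmed_read, alignments, min_length=15, max_alignments=3):
    """Return the alignments of a read that survive the filters (empty list = discard)."""
    if len(trimmed_read) < min_length:
        return []  # too short after adapter trimming
    if len(alignments) > max_alignments:
        return []  # too ambiguous
    return [
        a for a in alignments
        if a["mismatches"] == 0      # perfect matches only
        and "S" not in a["cigar"]    # no soft-clipped bases
    ]


read = "TGAGGTAGTAGGTTGTATAGTT"
kept = usable_alignments(read, [
    {"mismatches": 0, "cigar": "22M"},
    {"mismatches": 1, "cigar": "22M"},
    {"mismatches": 0, "cigar": "20M2S"},
])
```

Chastity-filter status is deliberately not consulted, since reads failing that filter are retained.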

For reads retained after filtering, each coordinate of each read alignment is annotated using reference databases, requiring a minimum 3-bp overlap between the alignment and an annotation. If a read has more than one alignment location and the annotations for these differ, we use a priority list to assign a single annotation to the read, as long as only one alignment is to a miRNA. When there are multiple alignments to different miRNAs, the read is flagged as cross-mapped (de Hoon et al., 2010), and all of its miRNA annotations are preserved, while all of its non-miRNA annotations are discarded. This ensures that all annotation information about ambiguously mapped miRNAs is retained, and allows annotation ambiguity to be addressed in downstream analyses. Note that we consider a read to be cross-mapped only if it maps to different miRNAs, not to functionally identical miRNAs that are expressed from different locations in the genome. Such cases are indicated by miRBase miRNA names, which can have up to 4 sections separated by "-", e.g. hsa-mir-26a-1. A difference in the final (e.g. ‘-1’) section denotes functionally equivalent miRNAs expressed from different regions of the genome, and we consider only the first 3 sections (e.g. ‘hsa-mir-26a’) when comparing names. As long as a read maps to multiple miRNAs for which the first 3 sections of the name are identical (e.g. hsa-mir-26a-1 and hsa-mir-26a-2), it is treated as if it maps to only one miRNA, and is not flagged as cross-mapped.
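The name-comparison rule for cross-mapping is compact enough to show directly. The function names below are hypothetical; the logic follows the rule as stated: compare only the first 3 "-"-separated sections of each miRBase name.

```python
def mirna_name_key(mirbase_name):
    """Keep the first 3 sections of a miRBase name, e.g. hsa-mir-26a-1 -> hsa-mir-26a."""
    return "-".join(mirbase_name.split("-")[:3])


def is_cross_mapped(mirna_names):
    """True if a read's miRNA annotations differ in their first 3 name sections."""
    return len({mirna_name_key(name) for name in mirna_names}) > 1


same_mirna = is_cross_mapped(["hsa-mir-26a-1", "hsa-mir-26a-2"])
different_mirnas = is_cross_mapped(["hsa-mir-26a-1", "hsa-mir-26b"])
```

The first call returns False (two genomic copies of the same miRNA), while the second returns True and the read would be flagged as cross-mapped.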

The minimum depth of sequencing required to detect the miRNAs that are expressed in one sample is 1,000,000 reads per library mapped to miRBase annotations.

Finally, for each sample, the reads that correspond to particular miRNAs are summed and normalized to a million miRNA-aligned reads to generate the quantification files. 
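The normalization amounts to scaling each miRNA's read count by the sample's total miRNA-aligned reads. A minimal sketch, with a hypothetical function name and input layout:

```python
def reads_per_million(mirna_counts):
    """Scale per-miRNA read counts to reads per million miRNA-aligned reads (RPM)."""
    total = sum(mirna_counts.values())
    if total == 0:
        return {name: 0.0 for name in mirna_counts}
    return {name: count * 1_000_000 / total for name, count in mirna_counts.items()}


rpm = reads_per_million({"hsa-mir-21": 3, "hsa-mir-26a": 1})
```

In this toy sample, hsa-mir-21 accounts for three quarters of the miRNA-aligned reads and so receives 750,000 RPM.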


Whole Genome Sequencing
Sequencing Center | Data Generation Protocol | Data Analysis Protocol
British Columbia Cancer Agency (BCCA) | ALL P1, NBL, RT, ALAL | ALL P1, NBL, RT, ALAL
Complete Genomics Inc. (CGI) | ALL P1/P2, AML, CCSK, NBL, OS, WT | ALL P1/P2, AML, CCSK, NBL, OS, WT

Illumina genomic plate-based library construction (350-450bp insert size):

2 µg of genomic DNA in a 96-well format was fragmented by Covaris E210 sonication for 30 seconds using a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency’s Genome Sciences Centre 96-well Genomic ~350bp-450bp insert Illumina Library Construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the DNA was purified in a 96-well microtitre plate using Ampure XP SPRI beads (40-45 µL beads per 60 µL DNA), and was subjected to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA polymerase, and T4 polynucleotide kinase, respectively, in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to use in the subsequent adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE indexed primer set, with cycle conditions: 98°C for 30 sec, followed by 6 cycles of 98°C for 15 sec, 62°C for 30 sec and 72°C for 30 sec, and a final extension at 72°C for 5 min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was gel purified (8% PAGE or 1.5% Metaphor agarose in an in-house custom-built robot), and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using a Qubit fluorometer (Invitrogen), then diluted to 8 nM. The final concentration was confirmed by Quant-iT dsDNA HS Assay prior to generating 100-bp paired-end reads on the Illumina HiSeq 2000/2500 platform using v3 chemistry.

WGS/hg19 alignment:

Illumina paired-end whole genome sequencing reads were aligned to the hg19 reference using BWA version 0.5.7.  This reference contains chromosomes 1-22, X, Y, MT, 20 unlocalized scaffolds and 39 unplaced scaffolds. Multiple lanes of sequences were merged and duplicated reads were marked with Picard Tools.

Structural variant detection

Structural variant detection was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq, assemblies were performed at alternate k-mer values from k50 to k96, using positive strand and ambiguous strand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged, and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single-end mode using k-mer values of k24 and k44. The contigs and reads were then reassembled at k64 in single-end mode and then finally at k64 in paired-end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large-scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high-confidence GMAP (v2012-12-20) alignments to two distinct genomic regions. Evidence for the alignments was provided by aligning reads back to the contigs and by aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large-scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment-specific events, the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against the matching genome tumour and, where available, germline bam files. This resulted in compartment-specific structural variant events and, where germline was available, putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

 

Genomic SNV analyses

SNVs from WGS-seq data were analyzed using all three methods described below:

Mpileup

SNVs were analyzed with SAMtools mpileup v.0.1.17 (Li et al., 2009) either on single or paired libraries. Each chromosome was analyzed separately using the -C50 -DSBuf parameters. The resulting vcf files were merged and filtered to remove low-quality SNVs by using samtools varFilter (with default parameters) as well as to remove SNVs with a QUAL score of less than 20 (vcf column 6). Finally, SNVs were annotated with gene annotations from Ensembl v66 using snpEff (Cingolani et al., 2012b), and dbSNP v137 membership was assigned using snpSift (Cingolani et al., 2012a).
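The QUAL cutoff can be illustrated with a small filter over VCF lines. This is a sketch, not the script used at BCCA: the function name is hypothetical, and dropping records whose QUAL is missing (".") is an assumption. QUAL is read from the sixth tab-separated column, as noted in the text.

```python
def filter_vcf_by_qual(lines, min_qual=20.0):
    """Yield header lines plus variant records whose QUAL (column 6) is >= min_qual."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and header lines
            continue
        qual = line.rstrip("\n").split("\t")[5]
        # Assumption: records with a missing QUAL (".") are dropped.
        if qual != "." and float(qual) >= min_qual:
            yield line


vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\tG\t55.0\t.\tDP=30\n",
    "1\t200\t.\tC\tT\t12.3\t.\tDP=8\n",
    "1\t300\t.\tG\tA\t.\t.\tDP=5\n",
]
passed = list(filter_vcf_by_qual(vcf))
```

Only the record at position 100 survives, along with the two header lines.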

Strelka

To analyze compartment-specific SNVs, samples were analyzed pairwise with the default settings of Strelka v0.4.7 (Saunders et al., 2012). Primary tumor samples and relapse/metastasis samples were compared against the germline sample. In the absence of a germline sample, the relapse/metastasis samples were compared against the primary tumor sample.

MutationSeq

SNVs were analyzed pairwise with SAMtools mpileup v.0.1.17 (Li et al., 2009). Each chromosome was analyzed separately using the -C50 -DSBuf parameters. Before the resulting vcf files were merged, they were filtered to remove all indels and low-quality SNVs by using samtools varFilter (with default parameters) as well as to remove SNVs with a QUAL score of less than 20 (vcf column 6). The SNVs in the resulting vcf files were further filtered and scored using mutationSeq v1.0.2 and annotated with gene annotations from Ensembl v66 using snpEff (Cingolani et al., 2012b) and with dbSNP v137 and COSMIC v64 membership using snpSift (Cingolani et al., 2012a).

 

Copy number variation (CNV) analysis

The techniques outlined in (Jones et al., 2010) were followed to analyze copy number changes. Sequence quality filtering was used to remove all reads of low mapping quality (Q < 10). Due to the varying amounts of sequence reads from each sample, aligned reference reads were first used to define genomic bins of equal reference coverage to which depths of alignments of sequence from each of the tumor samples were compared. This resulted in a measurement of the relative number of aligned reads from the tumors and reference in bins of variable length along the genome, where bin width is inversely proportional to the number of mapped reference reads. A hidden Markov model (HMM) was used to classify and segment continuous regions of copy number loss, neutrality, or gain using methodology outlined previously (Shah et al., 2006). The five states reported by the HMM were: loss (1), neutral (2), gain (3), amplification (4), and high-level amplification (5).
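The equal-reference-coverage binning idea can be sketched as follows. This is an illustration under assumptions, not the method of Jones et al. (2010) itself: the function names, the use of read start positions on a single chromosome, and the fixed `reads_per_bin` value are hypothetical, and library-size scaling and the HMM segmentation step are omitted.

```python
from bisect import bisect_left


def equal_coverage_bin_ends(reference_positions, reads_per_bin):
    """Return bin end coordinates so that each bin holds reads_per_bin reference reads.

    Bins are therefore narrow where reference coverage is high and wide where it is low.
    """
    ordered = sorted(reference_positions)
    return [ordered[i] for i in range(reads_per_bin - 1, len(ordered), reads_per_bin)]


def tumor_to_reference_ratios(tumor_positions, bin_ends, reads_per_bin):
    """Count tumor reads per bin and divide by the (constant) reference count per bin."""
    counts = [0] * len(bin_ends)
    for pos in tumor_positions:
        idx = bisect_left(bin_ends, pos)  # first bin whose end is >= pos
        if idx < len(bin_ends):
            counts[idx] += 1
    return [c / reads_per_bin for c in counts]


bin_ends = equal_coverage_bin_ends([10, 20, 30, 40, 50, 60], reads_per_bin=2)
ratios = tumor_to_reference_ratios([5, 15, 18, 20, 35, 70], bin_ends, reads_per_bin=2)
```

In the toy data, the first bin has twice as many tumor reads as reference reads (a candidate gain), the second has half as many, and the third has none; an HMM would then segment such ratios into the five states listed above.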

Amplified and deleted CNV regions are further screened for interspersed repeats and low-complexity DNA sequences, which include long interspersed nuclear elements (LINE), short interspersed nuclear elements (SINE), long terminal repeat elements (LTR), DNA repeat elements (DNA), low-complexity repeats, satellite repeats, simple repeats (micro-satellites), and RNA repeats (including RNA, tRNA, rRNA, snRNA, scRNA, srpRNA).

Repeat sequences in the genome pose challenges in the identification of CNVs with next-generation sequencing data, as the short reads sequenced from repetitive regions cannot be mapped unambiguously. Exclusion or random placement of the reads aligned to multiple regions can either reduce the sensitivity of CNV detection or result in the identification of false deletions in repeated regions. Due to the limitations of both alignment and subsequent segmentation algorithms, CNVs called in regions harboring highly repeated sequences should be carefully scrutinized. Therefore, in addition to focal CNV functional annotation, recurrence among patients, and presence of overlapping trans-ABySS events, the number and types of repeats are added to the annotation of candidate CNVs to further narrow down the prioritized list for verification. It is recommended that the candidate CNVs be prioritized based on the presence of genes of interest, high recurrence among patients, presence of overlapping trans-ABySS events, and low frequency or absence of repeat elements.

Illumina genomic plate-based library construction (350-450bp insert size):

2ug of genomic DNA in a 96-well format was fragmented by Covaris E210 sonication for 30 seconds using a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency’s Genome Sciences Centre 96-well Genomic ~350bp-450bp insert Illumina Library Construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the DNA was purified in a 96-well microtitre plate using Ampure XP SPRI beads (40-45uL beads per 60uL DNA), and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE indexed primer set,  with cycle conditions: 98˚C for 30sec followed by 6 cycles of 98˚C  for 15 sec, 62˚C for 30 sec and 72˚C for 30 sec, and a final extension at 72˚C for 5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was gel purified (8% PAGE or 1.5% Metaphor agarose in an in-house custom built robot), and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final concentration was confirmed by Quant-iT dsDNA HS Assay prior to generating 100bp paired end reads on the Illumina HiSeq 2000/2500 platform using v3 chemistry.

WGS/hg19 alignment:

Illumina paired-end whole genome sequencing reads were aligned to the hg19 reference using BWA version 0.5.7.  This reference contains chromosomes 1-22, X, Y, MT, 20 unlocalized scaffolds and 39 unplaced scaffolds. Multiple lanes of sequences were merged and duplicated reads were marked with Picard Tools.

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

 

Genomic SNV analyses

SNVs from WGS-seq data were analyzed using all three methods described below:

Mpileup

SNVs were analyzed with SAMtools mpileup v.0.1.17 (Li et al., 2009) either on single or paired libraries.  Each chromosome was analyzed separately using the -C50-DSBuf parameters. The resulting vcf files were merged and filtered to remove low quality SNVs by using samtools varFilter (with default parameters) as well as to remove SNVs with a QUAL score of less than 20 (vcf column 6). Finally, SNVs were annotated with gene annotations from ensembl v66 using snpEff (Cingolani et al., 2012b) and the dbSNP v137 db membership assigned using snpSift  (Cingolani et al., 2012a).

Strelka

To analyze compartment specific SNVs, samples were analyzed pair wise with the default settings of Strelka v0.4.7 (Saunders et al., 2012).  Primary tumor samples and relapse/met were compared against the germline sample. In the absence of a germline sample, the relapse/met samples were compared against the primary tumor sample.

MutationSeq

SNVs were analyzed pair wise with SAMtools mpileup v.0.1.17 (Li et al., 2009).  Each chromosome was analyzed separately using the -C50-DSBuf parameters. Before merging the resulting vcf files, they were filtered to remove all indels and low quality SNVs by using samtools varFilter (with default parameters) as well as to remove SNVs with a QUAL score of less than 20 (vcf column 6).  The SNVs in the resulting vcf files were further filtered and scored using mutationSeq v1.0.2 and annotated with gene annotations from ensembl v66 using snpEff (Cingolani et al., 2012b)  and the dbSNP v137 and cosmic 64 db membership using snpSift  (Cingolani et al., 2012a).

 

Copy number variation (CNV) analysis

The techniques outlined in (Jones et al., 2010) were followed to analyze copy number changes. Sequence quality filtering was used to remove all reads of low mapping quality (Q < 10). Due to the varying amounts of sequence reads from each sample, aligned reference reads were first used to define genomic bins of equal reference coverage to which depths of alignments of sequence from each of the tumor samples were compared. This resulted in a measurement of the relative number of aligned reads from the tumors and reference in bins of variable length along the genome, where bin width is inversely proportional to the number of mapped reference reads. A hidden Markov model (HMM) was used to classify and segment continuous regions of copy number loss, neutrality, or gain using methodology outlined previously (Shah et al., 2006). The five states reported by the HMM were: loss (1), neutral (2), gain (3), amplification (4), and high-level amplification (5).

Amplified and deleted CNV regions are further screened for interspersed repeats and low-complexity DNA sequences, including long interspersed nuclear elements (LINE), short interspersed nuclear elements (SINE), long terminal repeat elements (LTR), DNA repeat elements (DNA), low-complexity repeats, satellite repeats, simple repeats (microsatellites), and RNA repeats (including RNA, tRNA, rRNA, snRNA, scRNA, srpRNA).

Repeat sequences in the genome pose challenges for the identification of CNVs from next-generation sequencing data because short reads from repetitive regions cannot be mapped unambiguously. Exclusion or random placement of reads that align to multiple regions can either reduce the sensitivity of CNV detection or produce false deletions in repeated regions. Because of the limitations of both alignment and subsequent segmentation algorithms, CNVs called in regions harboring highly repeated sequences should be carefully scrutinized. Therefore, in addition to focal CNV functional annotation, recurrence among patients, and the presence of overlapping trans-ABySS events, the number and types of repeats are added to the annotation of candidate CNVs to further narrow the prioritized list for verification. It is recommended that candidate CNVs be prioritized based on the presence of genes of interest, high recurrence among patients, the presence of overlapping trans-ABySS events, and a low frequency or absence of repeat elements.
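One way to apply these prioritization criteria is a simple sort, sketched below in Python. The source lists the criteria but does not say how they are weighted, so the lexicographic ordering used here is an assumption, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CandidateCNV:
    name: str
    genes_of_interest: int      # number of genes of interest in the region
    recurrence: int             # number of patients carrying the event
    transabyss_overlap: bool    # an overlapping trans-ABySS event exists
    repeat_count: int           # number of annotated repeats in the region


def priority_key(cnv):
    # Assumed ordering: genes of interest first, then recurrence,
    # then trans-ABySS support, then fewer repeats.
    return (cnv.genes_of_interest > 0, cnv.recurrence,
            cnv.transabyss_overlap, -cnv.repeat_count)


def prioritize(candidates):
    """Return candidates sorted from highest to lowest priority."""
    return sorted(candidates, key=priority_key, reverse=True)


candidates = [
    CandidateCNV("cnv_a", 0, 5, True, 0),
    CandidateCNV("cnv_b", 1, 2, False, 3),
    CandidateCNV("cnv_c", 1, 2, True, 10),
    CandidateCNV("cnv_d", 1, 2, True, 1),
]
ranked = [c.name for c in prioritize(candidates)]
```

A weighted score would be an equally reasonable design; the tuple key is used here only because it keeps each criterion visible.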

Illumina genomic plate-based library construction (350-450bp insert size):

2ug of genomic DNA in a 96-well format was fragmented by Covaris E210 sonication for 30 seconds using a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency’s Genome Sciences Centre 96-well Genomic ~350bp-450bp insert Illumina Library Construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the DNA was purified in a 96-well microtitre plate using Ampure XP SPRI beads (40-45uL beads per 60uL DNA), and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE indexed primer set,  with cycle conditions: 98˚C for 30sec followed by 6 cycles of 98˚C  for 15 sec, 62˚C for 30 sec and 72˚C for 30 sec, and a final extension at 72˚C for 5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was gel purified (8% PAGE or 1.5% Metaphor agarose in an in-house custom built robot), and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final concentration was confirmed by Quant-iT dsDNA HS Assay prior to generating 100bp paired end reads on the Illumina HiSeq 2000/2500 platform using v3 chemistry.
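The dilution to 8nM relies on converting a mass concentration to a molar one. The Python sketch below shows that arithmetic under stated assumptions: an average mass of 660 g/mol per base pair of double-stranded DNA (a common approximation), a mean fragment length supplied by the user, and hypothetical function names. It is not part of the protocol itself.

```python
def ng_per_ul_to_nM(conc_ng_per_ul, mean_fragment_bp, g_per_mol_per_bp=660.0):
    """Convert a dsDNA concentration from ng/uL to nM.

    nM = (ng/uL) / (660 g/mol/bp * fragment length in bp) * 1e6
    """
    return conc_ng_per_ul / (g_per_mol_per_bp * mean_fragment_bp) * 1e6


def diluent_volume_ul(stock_nM, stock_volume_ul, target_nM=8.0):
    """Volume of diluent to add so the stock reaches target_nM (C1*V1 = C2*V2)."""
    if stock_nM < target_nM:
        raise ValueError("stock is already below the target concentration")
    return stock_volume_ul * stock_nM / target_nM - stock_volume_ul


# Made-up example: a 2.64 ng/uL library with a 400 bp mean fragment length.
stock = ng_per_ul_to_nM(2.64, 400)        # approximately 10 nM
to_add = diluent_volume_ul(10.0, 10.0)    # diluent for 10 uL of a 10 nM stock
```

The mean fragment length should include the adapters, since they contribute to the mass measured by the fluorometric assay.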

PCR-free whole genome sequencing:

Briefly, 500ng of genomic DNA was arrayed in a 96-well microtitre plate and subjected to shearing by sonication (Covaris LE220). Sheared DNA was end-repaired, and size selected using paramagnetic PCRClean DX beads (C-1003-450, Aline Biosciences) targeting a 300-400bp fraction.  After 3’ A-tailing, full length TruSeq adapters were ligated. Libraries were purified using paramagnetic (Aline Biosciences) beads. PCR-free genome library concentrations were quantified using a qPCR Library Quantification kit (KAPA, KK4824) prior to sequencing with paired-end 125 base reads on the Illumina HiSeq2500 platform using V4 chemistry according to manufacturer recommendations.

Illumina genomic plate-based library construction (350-450bp insert size):

2ug of genomic DNA in a 96-well format was fragmented by Covaris E210 sonication for 30 seconds using a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency’s Genome Sciences Centre 96-well Genomic ~350bp-450bp insert Illumina Library Construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the DNA was purified in a 96-well microtitre plate using Ampure XP SPRI beads (40-45uL beads per 60uL DNA), and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE indexed primer set,  with cycle conditions: 98˚C for 30sec followed by 6 cycles of 98˚C  for 15 sec, 62˚C for 30 sec and 72˚C for 30 sec, and a final extension at 72˚C for 5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was gel purified (8% PAGE or 1.5% Metaphor agarose in an in-house custom built robot), and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final concentration was confirmed by Quant-iT dsDNA HS Assay prior to generating 100bp paired end reads on the Illumina HiSeq 2000/2500 platform using v3 chemistry.

WGS/hg19 alignment:

Illumina paired-end whole genome sequencing reads were aligned to the hg19 reference using BWA version 0.5.7. This reference contains chromosomes 1-22, X, Y, MT, 20 unlocalized scaffolds and 39 unplaced scaffolds. Multiple lanes of sequence were merged, and duplicate reads were marked with Picard Tools.

Structural variant detection

Structural variant detection was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq, assemblies were performed at alternate k-mer values from k50 to k96 using positive-strand and ambiguous-strand reads as well as negative-strand and ambiguous-strand reads. The positive- and negative-strand assemblies were extended where possible, merged, and then concatenated to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single-end mode using k-mer values of k24 and k44. The contigs and reads were then reassembled at k64 in single-end mode and finally at k64 in paired-end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large-scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high-confidence GMAP (v2012-12-20) alignments to two distinct genomic regions. Evidence for the alignments was provided by aligning reads back to the contigs and by aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large-scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.
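The notion of a contig aligning to two distinct genomic regions can be sketched in Python as below. This does not reproduce trans-ABySS: the record type, the function names, and the 10,000 bp separation used to call two same-chromosome alignments "distinct" are hypothetical, and the read-support filtering is left out.

```python
from collections import defaultdict
from itertools import combinations
from typing import NamedTuple


class Alignment(NamedTuple):
    contig: str
    chrom: str
    start: int
    end: int


def distinct_regions(a, b, min_distance):
    """True if two alignments lie on different chromosomes or far apart."""
    if a.chrom != b.chrom:
        return True
    gap = max(a.start, b.start) - min(a.end, b.end)
    return gap >= min_distance


def candidate_contigs(alignments, min_distance=10_000):
    """Names of contigs with at least one pair of alignments to distinct regions."""
    by_contig = defaultdict(list)
    for aln in alignments:
        by_contig[aln.contig].append(aln)
    return sorted(
        contig for contig, alns in by_contig.items()
        if any(distinct_regions(a, b, min_distance) for a, b in combinations(alns, 2))
    )


# Made-up alignments: c1 spans two chromosomes, c2 has two nearby blocks,
# c3 aligns to a single region.
alns = [
    Alignment("c1", "9", 1000, 1200),
    Alignment("c1", "22", 5000, 5150),
    Alignment("c2", "1", 100, 300),
    Alignment("c2", "1", 350, 500),
    Alignment("c3", "4", 100, 200),
]
candidates = candidate_contigs(alns)
```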

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment-specific events, the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated and screened against the matching tumour genome and, where available, germline BAM files. This yielded compartment-specific structural variant events and, where germline was available, putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.
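The screening logic can be summarized with a small Python sketch. It assumes that read support for each event has already been tallied per compartment. The compartment labels, event keys, and function name are hypothetical, and the compendium filter is not shown.

```python
def classify_events(support, germline_available):
    """Label each event by supporting tumour compartments and likely origin.

    support maps an event key to the set of compartments whose reads
    support it, e.g. {"primary", "relapse", "germline"}.
    """
    result = {}
    for event, compartments in support.items():
        tumour_compartments = sorted(compartments - {"germline"})
        if not germline_available:
            origin = "unknown"          # no germline BAM to screen against
        elif "germline" in compartments:
            origin = "putative germline"
        else:
            origin = "putative somatic"
        result[event] = (tumour_compartments, origin)
    return result


# Made-up events for one patient with a germline sample.
support = {
    "del:1:1000-2000": {"primary", "relapse", "germline"},
    "fusion:9:22": {"primary"},
    "ins:5:300": {"relapse"},
}
labels = classify_events(support, germline_available=True)
```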

 

Genomic SNV analyses

SNVs from WGS data were analyzed using all three methods described below:

Mpileup

SNVs were analyzed with SAMtools mpileup v0.1.17 (Li et al., 2009) on either single or paired libraries. Each chromosome was analyzed separately using the -C50 -DSBuf parameters. The resulting VCF files were merged and filtered to remove low-quality SNVs using samtools varFilter (with default parameters), and SNVs with a QUAL score of less than 20 (VCF column 6) were also removed. Finally, SNVs were annotated with gene annotations from Ensembl v66 using snpEff (Cingolani et al., 2012b) and with dbSNP v137 database membership using snpSift (Cingolani et al., 2012a).
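The per-chromosome invocation can be sketched as a small Python helper that only builds the argument lists and does not run anything. The file paths and the function name are hypothetical, and the chromosome names assume the 1-22, X, Y, MT naming. Converting the piped BCF output to VCF and the varFilter step are outside this sketch.

```python
def mpileup_commands(reference_fasta, bam_paths, chromosomes):
    """Build one `samtools mpileup` argument list per chromosome.

    -C50 and -DSBuf are the parameters quoted in the text; the trailing
    'f' of -DSBuf takes the reference FASTA as its argument, and -r
    restricts the run to a single chromosome.
    """
    commands = []
    for chrom in chromosomes:
        cmd = ["samtools", "mpileup", "-C50", "-DSBuf", reference_fasta, "-r", chrom]
        cmd.extend(bam_paths)  # one BAM (single library) or two (paired)
        commands.append(cmd)
    return commands


CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]

# Hypothetical file names, for illustration only.
commands = mpileup_commands("hg19.fa", ["normal.bam", "tumour.bam"], CHROMOSOMES)
```

Building the commands as lists rather than shell strings avoids quoting problems if they are later handed to a process runner.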

Strelka

To identify compartment-specific SNVs, samples were analyzed pairwise with the default settings of Strelka v0.4.7 (Saunders et al., 2012). Primary tumor and relapse/metastasis samples were each compared against the germline sample. In the absence of a germline sample, the relapse/metastasis samples were compared against the primary tumor sample.
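The choice of comparator described above can be written as a short Python sketch. The role names, the file names, and the function itself are hypothetical; the sketch only decides which samples are paired and does not run Strelka.

```python
def strelka_pairs(samples):
    """Return (comparator, test) pairs following the pairing rule above.

    samples maps a role ('germline', 'primary', 'relapse') to a BAM path,
    or omits the role when that sample does not exist.
    """
    germline = samples.get("germline")
    primary = samples.get("primary")
    relapse = samples.get("relapse")
    pairs = []
    if germline:
        # With a germline sample, every tumour sample is compared against it.
        for tumour in (primary, relapse):
            if tumour:
                pairs.append((germline, tumour))
    elif primary and relapse:
        # Without germline, relapse is compared against the primary tumour.
        pairs.append((primary, relapse))
    return pairs


with_germline = strelka_pairs({"germline": "g.bam", "primary": "p.bam", "relapse": "r.bam"})
without_germline = strelka_pairs({"primary": "p.bam", "relapse": "r.bam"})
```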

MutationSeq

SNVs were analyzed pairwise with SAMtools mpileup v0.1.17 (Li et al., 2009). Each chromosome was analyzed separately using the -C50 -DSBuf parameters. Before the resulting VCF files were merged, they were filtered to remove all indels and low-quality SNVs using samtools varFilter (with default parameters), and SNVs with a QUAL score of less than 20 (VCF column 6) were also removed. The SNVs in the resulting VCF files were further filtered and scored using mutationSeq v1.0.2, then annotated with gene annotations from Ensembl v66 using snpEff (Cingolani et al., 2012b) and with dbSNP v137 and COSMIC v64 database membership using snpSift (Cingolani et al., 2012a).

Copy number variation (CNV) analysis

The techniques outlined in Jones et al. (2010) were followed to analyze copy number changes. Sequence quality filtering was used to remove all reads of low mapping quality (Q < 10). Because the number of sequence reads varied between samples, aligned reference reads were first used to define genomic bins of equal reference coverage, against which the alignment depth of each tumor sample was compared. This produced a measurement of the relative number of aligned reads from the tumor and the reference in bins of variable length along the genome, where bin width is inversely proportional to the number of mapped reference reads. A hidden Markov model (HMM) was used to classify and segment continuous regions of copy number loss, neutrality, or gain using methodology outlined previously (Shah et al., 2006). The five states reported by the HMM were: loss (1), neutral (2), gain (3), amplification (4), and high-level amplification (5).
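Once each bin has been assigned one of the five states, contiguous bins sharing a state can be merged into segments. The Python sketch below illustrates only that final merging step, under the assumption that per-bin state codes are already available. It does not implement the HMM, and the function name and tuple layout are hypothetical.

```python
from itertools import groupby

# State codes and labels as listed in the text.
STATE_LABELS = {
    1: "loss",
    2: "neutral",
    3: "gain",
    4: "amplification",
    5: "high-level amplification",
}


def segments_from_states(bins, states):
    """Merge adjacent bins on the same chromosome that share a state.

    bins is a list of (chrom, start, end) in genome order; states holds
    one state code per bin.
    """
    segments = []
    paired = zip(bins, states)
    for (chrom, state), group in groupby(paired, key=lambda item: (item[0][0], item[1])):
        group = list(group)
        start = group[0][0][1]   # start of the first bin in the run
        end = group[-1][0][2]    # end of the last bin in the run
        segments.append((chrom, start, end, STATE_LABELS[state]))
    return segments


# Made-up bins and state calls.
bins = [("1", 0, 100), ("1", 100, 250), ("1", 250, 300), ("1", 300, 900), ("2", 0, 500)]
states = [2, 2, 1, 1, 1]
segments = segments_from_states(bins, states)
```

Including the chromosome in the grouping key keeps a run of identical states from being merged across a chromosome boundary.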

Amplified and deleted CNV regions are further screened for interspersed repeats and low-complexity DNA sequences, including long interspersed nuclear elements (LINE), short interspersed nuclear elements (SINE), long terminal repeat elements (LTR), DNA repeat elements (DNA), low-complexity repeats, satellite repeats, simple repeats (microsatellites), and RNA repeats (including RNA, tRNA, rRNA, snRNA, scRNA, srpRNA).

Repeat sequences in the genome pose challenges for the identification of CNVs from next-generation sequencing data because short reads from repetitive regions cannot be mapped unambiguously. Exclusion or random placement of reads that align to multiple regions can either reduce the sensitivity of CNV detection or produce false deletions in repeated regions. Because of the limitations of both alignment and subsequent segmentation algorithms, CNVs called in regions harboring highly repeated sequences should be carefully scrutinized. Therefore, in addition to focal CNV functional annotation, recurrence among patients, and the presence of overlapping trans-ABySS events, the number and types of repeats are added to the annotation of candidate CNVs to further narrow the prioritized list for verification. It is recommended that candidate CNVs be prioritized based on the presence of genes of interest, high recurrence among patients, the presence of overlapping trans-ABySS events, and a low frequency or absence of repeat elements.

Illumina genomic plate-based library construction (350-450bp insert size):

2ug of genomic DNA in a 96-well format was fragmented by Covaris E210 sonication for 30 seconds using a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency’s Genome Sciences Centre 96-well Genomic ~350bp-450bp insert Illumina Library Construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the DNA was purified in a 96-well microtitre plate using Ampure XP SPRI beads (40-45uL beads per 60uL DNA), and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE indexed primer set,  with cycle conditions: 98˚C for 30sec followed by 6 cycles of 98˚C  for 15 sec, 62˚C for 30 sec and 72˚C for 30 sec, and a final extension at 72˚C for 5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was gel purified (8% PAGE or 1.5% Metaphor agarose in an in-house custom built robot), and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final concentration was confirmed by Quant-iT dsDNA HS Assay prior to generating 100bp paired end reads on the Illumina HiSeq 2000/2500 platform using v3 chemistry.

WGS/hg19 alignment:

Illumina paired-end whole genome sequencing reads were aligned to the hg19 reference using BWA version 0.5.7. This reference contains chromosomes 1-22, X, Y, MT, 20 unlocalized scaffolds and 39 unplaced scaffolds. Multiple lanes of sequence were merged, and duplicate reads were marked with Picard Tools.

Structural variant detection

Structural variant detection was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq, assemblies were performed at alternate k-mer values from k50 to k96 using positive-strand and ambiguous-strand reads as well as negative-strand and ambiguous-strand reads. The positive- and negative-strand assemblies were extended where possible, merged, and then concatenated to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single-end mode using k-mer values of k24 and k44. The contigs and reads were then reassembled at k64 in single-end mode and finally at k64 in paired-end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large-scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high-confidence GMAP (v2012-12-20) alignments to two distinct genomic regions. Evidence for the alignments was provided by aligning reads back to the contigs and by aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large-scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment-specific events, the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated and screened against the matching tumour genome and, where available, germline BAM files. This yielded compartment-specific structural variant events and, where germline was available, putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

Genomic SNV analyses

SNVs from WGS data were analyzed using all three methods described below:

Mpileup

SNVs were analyzed with SAMtools mpileup v0.1.17 (Li et al., 2009) on either single or paired libraries. Each chromosome was analyzed separately using the -C50 -DSBuf parameters. The resulting VCF files were merged and filtered to remove low-quality SNVs using samtools varFilter (with default parameters), and SNVs with a QUAL score of less than 20 (VCF column 6) were also removed. Finally, SNVs were annotated with gene annotations from Ensembl v66 using snpEff (Cingolani et al., 2012b) and with dbSNP v137 database membership using snpSift (Cingolani et al., 2012a).

Strelka

To identify compartment-specific SNVs, samples were analyzed pairwise with the default settings of Strelka v0.4.7 (Saunders et al., 2012). Primary tumor and relapse/metastasis samples were each compared against the germline sample. In the absence of a germline sample, the relapse/metastasis samples were compared against the primary tumor sample.

MutationSeq

SNVs were analyzed pairwise with SAMtools mpileup v0.1.17 (Li et al., 2009). Each chromosome was analyzed separately using the -C50 -DSBuf parameters. Before the resulting VCF files were merged, they were filtered to remove all indels and low-quality SNVs using samtools varFilter (with default parameters), and SNVs with a QUAL score of less than 20 (VCF column 6) were also removed. The SNVs in the resulting VCF files were further filtered and scored using mutationSeq v1.0.2, then annotated with gene annotations from Ensembl v66 using snpEff (Cingolani et al., 2012b) and with dbSNP v137 and COSMIC v64 database membership using snpSift (Cingolani et al., 2012a).

Copy number variation (CNV) analysis

The techniques outlined in Jones et al. (2010) were followed to analyze copy number changes. Sequence quality filtering was used to remove all reads of low mapping quality (Q < 10). Because the number of sequence reads varied between samples, aligned reference reads were first used to define genomic bins of equal reference coverage, against which the alignment depth of each tumor sample was compared. This produced a measurement of the relative number of aligned reads from the tumor and the reference in bins of variable length along the genome, where bin width is inversely proportional to the number of mapped reference reads. A hidden Markov model (HMM) was used to classify and segment continuous regions of copy number loss, neutrality, or gain using methodology outlined previously (Shah et al., 2006). The five states reported by the HMM were: loss (1), neutral (2), gain (3), amplification (4), and high-level amplification (5).

Amplified and deleted CNV regions are further screened for interspersed repeats and low-complexity DNA sequences, including long interspersed nuclear elements (LINE), short interspersed nuclear elements (SINE), long terminal repeat elements (LTR), DNA repeat elements (DNA), low-complexity repeats, satellite repeats, simple repeats (microsatellites), and RNA repeats (including RNA, tRNA, rRNA, snRNA, scRNA, srpRNA).
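Annotating a CNV with the number and types of overlapping repeats amounts to an interval-overlap count. The Python sketch below assumes half-open coordinates and a repeat table already loaded as tuples; the function name, the tuple layout, and the example records are hypothetical.

```python
from collections import Counter


def repeat_summary(cnv, repeats):
    """Count repeats overlapping a CNV, grouped by repeat class.

    cnv is (chrom, start, end); repeats is an iterable of
    (chrom, start, end, repeat_class). Intervals are half-open.
    """
    chrom, start, end = cnv
    counts = Counter()
    for r_chrom, r_start, r_end, r_class in repeats:
        # Two half-open intervals overlap when each starts before the other ends.
        if r_chrom == chrom and r_start < end and r_end > start:
            counts[r_class] += 1
    return dict(counts)


# Made-up repeat annotations around a candidate CNV on chromosome 7.
cnv = ("7", 1000, 5000)
repeats = [
    ("7", 900, 1100, "LINE"),
    ("7", 2000, 2300, "SINE"),
    ("7", 4000, 4100, "SINE"),
    ("7", 5000, 5200, "LTR"),   # starts exactly at the CNV end, so no overlap
    ("8", 1500, 1600, "LINE"),  # different chromosome
]
summary = repeat_summary(cnv, repeats)
```

A linear scan is enough for illustration; a genome-wide repeat table would normally call for an indexed interval structure.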

Repeat sequences in the genome pose challenges for the identification of CNVs from next-generation sequencing data because short reads from repetitive regions cannot be mapped unambiguously. Exclusion or random placement of reads that align to multiple regions can either reduce the sensitivity of CNV detection or produce false deletions in repeated regions. Because of the limitations of both alignment and subsequent segmentation algorithms, CNVs called in regions harboring highly repeated sequences should be carefully scrutinized. Therefore, in addition to focal CNV functional annotation, recurrence among patients, and the presence of overlapping trans-ABySS events, the number and types of repeats are added to the annotation of candidate CNVs to further narrow the prioritized list for verification. It is recommended that candidate CNVs be prioritized based on the presence of genes of interest, high recurrence among patients, the presence of overlapping trans-ABySS events, and a low frequency or absence of repeat elements.

Illumina genomic plate-based library construction (350-450bp insert size):

2ug of genomic DNA in a 96-well format was fragmented by Covaris E210 sonication for 30 seconds using a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency’s Genome Sciences Centre 96-well Genomic ~350bp-450bp insert Illumina Library Construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the DNA was purified in a 96-well microtitre plate using Ampure XP SPRI beads (40-45uL beads per 60uL DNA), and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE indexed primer set,  with cycle conditions: 98˚C for 30sec followed by 6 cycles of 98˚C  for 15 sec, 62˚C for 30 sec and 72˚C for 30 sec, and a final extension at 72˚C for 5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was gel purified (8% PAGE or 1.5% Metaphor agarose in an in-house custom built robot), and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final concentration was confirmed by Quant-iT dsDNA HS Assay prior to generating 100bp paired end reads on the Illumina HiSeq 2000/2500 platform using v3 chemistry.

WGS/hg19 alignment:

Illumina paired-end whole genome sequencing reads were aligned to the hg19 reference using BWA version 0.5.7. This reference contains chromosomes 1-22, X, Y, MT, 20 unlocalized scaffolds and 39 unplaced scaffolds. Multiple lanes of sequence were merged, and duplicate reads were marked with Picard Tools.

Structural variant detection

Structural variant detection was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq, assemblies were performed at alternate k-mer values from k50 to k96 using positive-strand and ambiguous-strand reads as well as negative-strand and ambiguous-strand reads. The positive- and negative-strand assemblies were extended where possible, merged, and then concatenated to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single-end mode using k-mer values of k24 and k44. The contigs and reads were then reassembled at k64 in single-end mode and finally at k64 in paired-end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large-scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high-confidence GMAP (v2012-12-20) alignments to two distinct genomic regions. Evidence for the alignments was provided by aligning reads back to the contigs and by aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large-scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment-specific events, the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated and screened against the matching tumour genome and, where available, germline BAM files. This yielded compartment-specific structural variant events and, where germline was available, putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

Genomic SNV analyses

SNVs from WGS data were analyzed using all three methods described below:

Mpileup

SNVs were analyzed with SAMtools mpileup v0.1.17 (Li et al., 2009) on either single or paired libraries. Each chromosome was analyzed separately using the -C50 -DSBuf parameters. The resulting VCF files were merged and filtered to remove low-quality SNVs using samtools varFilter (with default parameters), and SNVs with a QUAL score of less than 20 (VCF column 6) were also removed. Finally, SNVs were annotated with gene annotations from Ensembl v66 using snpEff (Cingolani et al., 2012b) and with dbSNP v137 database membership using snpSift (Cingolani et al., 2012a).

Strelka

To identify compartment-specific SNVs, samples were analyzed pairwise with the default settings of Strelka v0.4.7 (Saunders et al., 2012). Primary tumor and relapse/metastasis samples were each compared against the germline sample. In the absence of a germline sample, the relapse/metastasis samples were compared against the primary tumor sample.

MutationSeq

SNVs were analyzed pairwise with SAMtools mpileup v0.1.17 (Li et al., 2009). Each chromosome was analyzed separately using the -C50 -DSBuf parameters. Before the resulting VCF files were merged, they were filtered to remove all indels and low-quality SNVs using samtools varFilter (with default parameters), and SNVs with a QUAL score of less than 20 (VCF column 6) were also removed. The SNVs in the resulting VCF files were further filtered and scored using mutationSeq v1.0.2, then annotated with gene annotations from Ensembl v66 using snpEff (Cingolani et al., 2012b) and with dbSNP v137 and COSMIC v64 database membership using snpSift (Cingolani et al., 2012a).

Copy number variation (CNV) analysis

The techniques outlined in Jones et al. (2010) were followed to analyze copy number changes. Sequence quality filtering was used to remove all reads of low mapping quality (Q < 10). Because the number of sequence reads varied between samples, aligned reference reads were first used to define genomic bins of equal reference coverage, against which the alignment depth of each tumor sample was compared. This produced a measurement of the relative number of aligned reads from the tumor and the reference in bins of variable length along the genome, where bin width is inversely proportional to the number of mapped reference reads. A hidden Markov model (HMM) was used to classify and segment continuous regions of copy number loss, neutrality, or gain using methodology outlined previously (Shah et al., 2006). The five states reported by the HMM were: loss (1), neutral (2), gain (3), amplification (4), and high-level amplification (5).

Amplified and deleted CNV regions are further screened for interspersed repeats and low-complexity DNA sequences, including long interspersed nuclear elements (LINE), short interspersed nuclear elements (SINE), long terminal repeat elements (LTR), DNA repeat elements (DNA), low-complexity repeats, satellite repeats, simple repeats (microsatellites), and RNA repeats (including RNA, tRNA, rRNA, snRNA, scRNA, srpRNA).

Repeat sequences in the genome pose challenges for the identification of CNVs from next-generation sequencing data because short reads from repetitive regions cannot be mapped unambiguously. Exclusion or random placement of reads that align to multiple regions can either reduce the sensitivity of CNV detection or produce false deletions in repeated regions. Because of the limitations of both alignment and subsequent segmentation algorithms, CNVs called in regions harboring highly repeated sequences should be carefully scrutinized. Therefore, in addition to focal CNV functional annotation, recurrence among patients, and the presence of overlapping trans-ABySS events, the number and types of repeats are added to the annotation of candidate CNVs to further narrow the prioritized list for verification. It is recommended that candidate CNVs be prioritized based on the presence of genes of interest, high recurrence among patients, the presence of overlapping trans-ABySS events, and a low frequency or absence of repeat elements.

Illumina genomic plate-based library construction (350-450bp insert size):

2ug of genomic DNA in a 96-well format was fragmented by Covaris E210 sonication for 30 seconds using a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency’s Genome Sciences Centre 96-well Genomic ~350bp-450bp insert Illumina Library Construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the DNA was purified in a 96-well microtitre plate using Ampure XP SPRI beads (40-45uL beads per 60uL DNA), and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE indexed primer set,  with cycle conditions: 98˚C for 30sec followed by 6 cycles of 98˚C  for 15 sec, 62˚C for 30 sec and 72˚C for 30 sec, and a final extension at 72˚C for 5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was gel purified (8% PAGE or 1.5% Metaphor agarose in an in-house custom built robot), and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final concentration was confirmed by Quant-iT dsDNA HS Assay prior to generating 100bp paired end reads on the Illumina HiSeq 2000/2500 platform using v3 chemistry.

PCR-free whole genome sequencing:

Briefly, 500 ng of genomic DNA was arrayed in a 96-well microtitre plate and subjected to shearing by sonication (Covaris LE220). Sheared DNA was end-repaired and size selected using paramagnetic PCRClean DX beads (C-1003-450, Aline Biosciences), targeting a 300-400 bp fraction. After 3’ A-tailing, full-length TruSeq adapters were ligated. Libraries were purified using paramagnetic beads (Aline Biosciences). PCR-free genome library concentrations were quantified using a qPCR Library Quantification kit (KAPA, KK4824) prior to sequencing with paired-end 125 base reads on the Illumina HiSeq 2500 platform using v4 chemistry according to the manufacturer's recommendations.

Whole genome bisulfite-Seq library construction and sequencing:

1-5 µg of Qubit (Life Technologies, Carlsbad, CA) quantified genomic DNA was utilized for library construction as described (Gascard et al., 2015). To track the efficiency of bisulfite conversion, 1 ng of unmethylated lambda DNA (Promega) was spiked into 1 µg of genomic DNA quantified using Qubit fluorometry and arrayed in a 96-well microtitre plate. DNA was sheared to a target size of 300 bp using Covaris sonication and the fragments were end repaired using DNA ligase and dNTPs at 30 °C for 30 minutes. Repaired DNA was purified using a 2:1 AMPure XP beads to sample ratio and eluted in 40 µL elution buffer in preparation for A-tailing, which involved the addition of adenosine to the 3’ end of DNA fragments using Klenow fragment and dATP, followed by incubation at 37 °C for 30 minutes. Following reaction clean-up with magnetic beads, cytosine methylated paired-end adapters (5’- AmCAmCTmCTTTmCmCmCTAmCAmCGAmCGmCTmCTTmCmCGATmCT-3’ and 3’- GAGmCmCGTAAGGAmCGAmCTTGGmCGAGAAGGmCTAG-5’) were ligated to the DNA at 30 °C for 20 minutes and the adapter-flanked DNA fragments were bead purified. Prior to bisulfite conversion, an aliquot of library fragments was amplified with 10 cycles of PCR and sized on an Agilent Bioanalyzer High Sensitivity DNA chip. Amplicons were between 200 and 700 bp in length. Bisulfite conversion of the methylated adapter-ligated DNA fragments was achieved using the EZ Methylation-Gold kit (Zymo Research) following the manufacturer’s protocol. Five cycles of PCR using HiFi polymerase (Kapa Biosystems) were used to enrich the bisulfite-converted DNA and introduce fault-tolerant hexamer barcode sequences. Post-PCR purification and size selection of bisulfite-converted DNA was performed from precast 8% TBE gels (Invitrogen), extracting the 350-500 bp fraction, or the 275-425 bp fraction if the former was of weak intensity. Gel slurries were added to Spin-X filter tubes (Fisher) and the eluate was ethanol precipitated and resuspended in EB.

To determine final library concentrations, fragment sizes were assessed using a high sensitivity DNA assay (Agilent) and DNA was quantified by Qubit fluorometry. Where necessary, libraries were diluted in elution buffer supplemented with 0.1% Tween-20 to achieve a concentration of 8 nM for Illumina HiSeq 2000/2500 flowcell cluster generation. Libraries were sequenced using paired-end 100/125 nt V3/4 sequencing chemistry on an Illumina HiSeq 2000/2500 following the manufacturer's protocols (Illumina, Hayward, CA). Raw sequences from whole genome bisulfite sequencing (WGBS) were examined for quality, sample swap, reagent contamination and bisulfite conversion rate using custom in-house scripts.
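Because the lambda spike-in is unmethylated, every cytosine in reads mapping to lambda should be read as thymine after a complete conversion, so the fraction that is converted estimates conversion efficiency. The in-house scripts are not described here; the following is only a minimal sketch of that calculation, with a hypothetical function name and counts assumed to have been tallied already from lambda-aligned reads.

```python
def bisulfite_conversion_rate(converted_c: int, unconverted_c: int) -> float:
    """Fraction of lambda cytosine positions read as converted (C read as T).

    converted_c:   cytosine observations in lambda-aligned reads that were converted
    unconverted_c: cytosine observations in lambda-aligned reads that remained C
    """
    if converted_c < 0 or unconverted_c < 0:
        raise ValueError("counts must be non-negative")
    total = converted_c + unconverted_c
    if total == 0:
        raise ValueError("no lambda cytosine observations to evaluate")
    return converted_c / total


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(bisulfite_conversion_rate(converted_c=9950, unconverted_c=50))
```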

WGS/hg19 alignment:

Illumina paired-end whole genome sequencing reads were aligned to the hg19 reference using BWA version 0.5.7. This reference contains chromosomes 1-22, X, Y, MT, 20 unlocalized scaffolds and 39 unplaced scaffolds. Multiple lanes of sequences were merged and duplicate reads were marked with Picard Tools.

Structural variant detection

Structural variant detection was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq, assemblies were performed at alternate k-mer values from k50 to k96, using positive strand and ambiguous strand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single-end mode using k-mer values of k24 and k44. The contigs and reads were then reassembled at k64 in single-end mode and finally at k64 in paired-end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large-scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high-confidence GMAP (v2012-12-20) alignments to two distinct genomic regions. Evidence for the alignments was provided by aligning reads back to the contigs and by aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large-scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.
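The selection logic described above (a contig aligning to two distinct genomic regions, then a read-support filter) can be illustrated with a short sketch. This is not the trans-ABySS implementation: the `Alignment` record, the 10 kb distance used to call two same-chromosome regions "distinct", and the minimum read count are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    contig: str
    chrom: str
    start: int
    end: int


def is_split_contig(a: Alignment, b: Alignment, min_distance: int = 10_000) -> bool:
    """True when two alignments of one contig fall in distinct genomic regions."""
    if a.chrom != b.chrom:
        return True
    # Gap between the two aligned blocks; negative when they overlap.
    gap = max(a.start, b.start) - min(a.end, b.end)
    return gap >= min_distance


def candidate_events(alignments, read_support, min_reads: int = 4):
    """Return contigs with exactly two distinct-region alignments and enough reads."""
    by_contig = {}
    for aln in alignments:
        by_contig.setdefault(aln.contig, []).append(aln)
    events = []
    for contig, alns in by_contig.items():
        if len(alns) != 2:
            continue
        if is_split_contig(alns[0], alns[1]) and read_support.get(contig, 0) >= min_reads:
            events.append(contig)
    return events
```

In this sketch `read_support` maps a contig name to the number of reads spanning its breakpoint, standing in for the read evidence gathered by aligning reads back to the contigs.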

Insertions and deletions were identified by gapped alignment of contigs to the human reference, using GMAP for RNA-seq and BWA for WGS. Confidence in each event was calculated from the alignment of reads back to the event breakpoint in the contigs. The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment-specific events, the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against the matching genome tumour BAM files and, where available, germline BAM files. This resulted in compartment-specific structural variant events and, where germline was available, putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

 

Genomic SNV analyses

SNVs from WGS data were analyzed using all three methods described below:

Mpileup

SNVs were analyzed with SAMtools mpileup v.0.1.17 (Li et al., 2009) on either single or paired libraries. Each chromosome was analyzed separately using the -C50 -DSBuf parameters. The resulting VCF files were merged and filtered to remove low-quality SNVs using samtools varFilter (with default parameters), and SNVs with a QUAL score of less than 20 (VCF column 6) were also removed. Finally, SNVs were annotated with gene annotations from Ensembl v66 using SnpEff (Cingolani et al., 2012b), and dbSNP v137 membership was assigned using SnpSift (Cingolani et al., 2012a).
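The QUAL threshold step can be sketched in a few lines. This is an illustration rather than the pipeline's own script: the function names are hypothetical, and treating a missing QUAL value (".") as failing the filter is an assumption.

```python
def passes_qual(vcf_line: str, min_qual: float = 20.0) -> bool:
    """Keep header lines and records whose QUAL (VCF column 6) is at least min_qual."""
    if vcf_line.startswith("#"):
        return True
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]  # column 6, zero-based index 5
    # Assumption: records with a missing QUAL are dropped.
    return qual != "." and float(qual) >= min_qual


def filter_vcf_lines(lines, min_qual: float = 20.0):
    """Return only the lines that pass the QUAL threshold."""
    return [line for line in lines if passes_qual(line, min_qual)]
```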

Strelka

To analyze compartment-specific SNVs, samples were analyzed pairwise with the default settings of Strelka v0.4.7 (Saunders et al., 2012). Primary tumor and relapse/metastasis samples were compared against the germline sample. In the absence of a germline sample, the relapse/metastasis samples were compared against the primary tumor sample.
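The pairing rule above can be written down explicitly. The sketch below is only one reading of it; the role keys ("germline", "primary", "relapse") and the function name are hypothetical.

```python
def strelka_comparisons(samples: dict) -> list:
    """Return (tumor, comparator) sample pairs following the pairing rule.

    samples maps a role ('germline', 'primary', 'relapse') to a sample ID,
    with missing roles absent or set to None.
    """
    germline = samples.get("germline")
    primary = samples.get("primary")
    relapse = samples.get("relapse")
    pairs = []
    if germline:
        # With a germline sample, each available tumor sample is compared against it.
        for tumor in (primary, relapse):
            if tumor:
                pairs.append((tumor, germline))
    elif primary and relapse:
        # Without germline, the relapse/metastasis sample is compared against the primary.
        pairs.append((relapse, primary))
    return pairs
```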

MutationSeq

SNVs were analyzed pairwise with SAMtools mpileup v.0.1.17 (Li et al., 2009). Each chromosome was analyzed separately using the -C50 -DSBuf parameters. Before the resulting VCF files were merged, they were filtered to remove all indels and low-quality SNVs using samtools varFilter (with default parameters), and SNVs with a QUAL score of less than 20 (VCF column 6) were also removed. The SNVs in the resulting VCF files were further filtered and scored using mutationSeq v1.0.2 and annotated with gene annotations from Ensembl v66 using SnpEff (Cingolani et al., 2012b), and with dbSNP v137 and COSMIC v64 membership using SnpSift (Cingolani et al., 2012a).
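Removing indels before merging amounts to keeping only single-base substitutions. One simple way to test for that, shown here as an assumption rather than as the pipeline's actual rule, is to require a one-base REF allele and one-base ALT alleles; the function name is hypothetical.

```python
def is_snv_record(vcf_line: str) -> bool:
    """True when a VCF record has a 1-base REF and only 1-base ALT alleles."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt = fields[3], fields[4]
    alt_alleles = alt.split(",")
    # Any allele longer than one base (or a missing ALT) is treated as a non-SNV.
    return len(ref) == 1 and all(len(a) == 1 and a != "." for a in alt_alleles)
```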

 

Copy number variation (CNV) analysis

The techniques outlined in Jones et al. (2010) were followed to analyze copy number changes. Sequence quality filtering was used to remove all reads of low mapping quality (Q < 10). Because the number of sequence reads varied between samples, aligned reference reads were first used to define genomic bins of equal reference coverage, against which the depths of aligned sequence from each of the tumor samples were compared. This resulted in a measurement of the relative number of aligned reads from the tumors and reference in bins of variable length along the genome, where bin width is inversely proportional to the number of mapped reference reads. A hidden Markov model (HMM) was used to classify and segment continuous regions of copy number loss, neutrality, or gain using methodology outlined previously (Shah et al., 2006). The five states reported by the HMM were: loss (1), neutral (2), gain (3), amplification (4), and high-level amplification (5).
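The equal-coverage binning idea can be shown on a single chromosome with a small sketch. It is a simplified illustration, not the method of Jones et al. (2010): the function names are hypothetical, read start positions are assumed to be distinct at bin boundaries, leftover reference reads that do not fill a bin are ignored, and no normalization for total sequencing depth or HMM segmentation is included.

```python
from bisect import bisect_left


def equal_coverage_bins(reference_positions, reads_per_bin: int):
    """Return half-open (start, end) bins that each hold reads_per_bin reference reads."""
    if reads_per_bin <= 0:
        raise ValueError("reads_per_bin must be positive")
    positions = sorted(reference_positions)
    bins = []
    for i in range(0, len(positions) - reads_per_bin + 1, reads_per_bin):
        start = positions[i]
        end = positions[i + reads_per_bin - 1] + 1  # exclusive end
        bins.append((start, end))
    return bins


def tumor_to_reference_ratios(bins, tumor_positions, reads_per_bin: int):
    """Count tumor reads per bin and divide by the fixed reference count."""
    tumor = sorted(tumor_positions)
    ratios = []
    for start, end in bins:
        count = bisect_left(tumor, end) - bisect_left(tumor, start)
        ratios.append(count / reads_per_bin)
    return ratios
```

Regions where reference reads are dense produce narrow bins and sparse regions produce wide ones, which is the inverse relationship between bin width and mapped reference reads described above.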

Amplified and deleted CNV regions are further screened for interspersed repeats and low-complexity DNA sequences, which include long interspersed nuclear elements (LINE), short interspersed nuclear elements (SINE), long terminal repeat elements (LTR), DNA repeat elements (DNA), low-complexity repeats, satellite repeats, simple repeats (micro-satellites), and RNA repeats (including RNA, tRNA, rRNA, snRNA, scRNA, srpRNA).

 

Repeat sequences in the genome pose challenges for the identification of CNVs from next-generation sequencing data, as short reads sequenced from repetitive regions cannot be mapped unambiguously. Exclusion or random placement of reads that align to multiple regions can either reduce the sensitivity of CNV detection or result in the identification of false deletions in repeated regions. Due to the limitations of both alignment and subsequent segmentation algorithms, CNVs called in regions harboring highly repeated sequences should be carefully scrutinized. Therefore, in addition to focal CNV functional annotation, recurrence among patients, and the presence of overlapping trans-ABySS events, the number and types of repeats are added to the annotation of candidate CNVs to further narrow down the prioritized list for verification. It is recommended that candidate CNVs be prioritized based on the presence of genes of interest, high recurrence among patients, the presence of overlapping trans-ABySS events, and a low frequency or absence of repeat elements.
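The recommended prioritization can be expressed as a sort over annotated candidates. The text lists the criteria but not their relative weights, so the strict ordering used below (gene of interest first, then recurrence, then trans-ABySS overlap, then repeat count) is an assumption, as are the record fields and function name.

```python
from dataclasses import dataclass


@dataclass
class CandidateCNV:
    name: str
    has_gene_of_interest: bool
    recurrence: int                  # number of patients sharing the event
    overlaps_trans_abyss_event: bool
    repeat_count: int                # annotated repeat elements in the region


def prioritize(cnvs):
    """Sort candidates so the most promising come first (assumed criterion order)."""
    return sorted(
        cnvs,
        key=lambda c: (
            not c.has_gene_of_interest,        # genes of interest first
            -c.recurrence,                     # higher recurrence first
            not c.overlaps_trans_abyss_event,  # supported by an assembly event first
            c.repeat_count,                    # fewer repeats first
        ),
    )
```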

Illumina genomic plate-based library construction (350-450bp insert size):

2ug of genomic DNA in a 96-well format was fragmented by Covaris E210 sonication for 30 seconds using a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency’s Genome Sciences Centre 96-well Genomic ~350bp-450bp insert Illumina Library Construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the DNA was purified in a 96-well microtitre plate using Ampure XP SPRI beads (40-45uL beads per 60uL DNA), and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE indexed primer set,  with cycle conditions: 98˚C for 30sec followed by 6 cycles of 98˚C for 15 sec, 62˚C for 30 sec and 72˚C for 30 sec, and a final extension at 72˚C for 5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was gel purified (8% PAGE or 1.5% Metaphor agarose in an in-house custom built robot), and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final concentration was confirmed by Quant-iT dsDNA HS Assay prior to generating 100bp paired end reads on the Illumina HiSeq 2000/2500 platform using v3 chemistry.

 

Whole genome bisulfite-Seq library construction and sequencing:

1-5 mg of Qubit (Life Technologies, Carlsbad, CA) quantified genomic DNA was utilized for library construction as described (Gascard et al., 2015). To track the efficiency of bisulfite conversion, 1 ng of unmethylated lambda DNA (Promega) was spiked into 1 µg genomic DNA quantified using Qubit fluorometry and arrayed in a 96-well microtitre plate. DNA was sheared to a target size of 300 bp using Covaris sonication and the fragments were end repaired using DNA ligase and dNTPs at 30o C for 30 minutes. Repaired DNA was purified using a 2:1 AMPure XP beads to sample ratio and eluted in 40 µL elution buffer in preparation for A-tailing; which involved the addition of adenosine to the 3’ end of DNA fragments using Klenow fragment and dATP, followed by incubation at 37o C for 30 minutes. Following reaction clean-up with magnetic beads, cytosine methylated paired-end adapters (5’- AmCAmCTmCTTTmCmCmCTAmCAmCGAmCGmCTmCTTmCmCGATmCT-3’ and 3’- GAGmCmCGTAAGGAmCGAmCTTGGmCGAGAAGGmCTAG-5’) were ligated to the DNA at 30o C for 20 minutes and adapter flanked DNA fragments bead were purified. Prior to bisulfite conversion an aliquot of library fragments was amplified with 10 cycles of PCR and sized on an Agilent Bioanalyzer High Sensitivity DNA chip. Amplicons were between 200-700 bp in length. Bisulfite conversion of the methylated adapter-ligated DNA fragments was achieved using the EZ Methylation-Gold kit (Zymo Research) following the manufacturer’s protocol. Five cycles of PCR using HiFi polymerase (Kapa Biosystems) was used to enrich the bisulfite converted DNA and introduce fault tolerant hexamer barcode sequences. Post-PCR purification and size-selection of bisulfite converted DNA was performed from precast 8% TBE gels (Invitrogen), extracting the 350-500 bp fraction, or 275-425 bp fraction if the former was of weak intensity. Gel slurries were added to Spin-X filter tubes (Fisher) and the eluate was ethanol precipitated and resuspended in EB. 
To determine final library concentrations, fragment sizes were assessed using a high sensitivity DNA assay (Agilent) and DNA quantified by Qubit fluorometry. Where necessary, libraries were diluted in elution buffer supplemented with 0.1% Tween-20 to achieve a concentration of 8 nM for Illumina HiSeq2000/2500 flowcell cluster generation. Libraries were sequenced using paired-end 100/125 nt V3/4 sequencing chemistry on an Illumina HiSeq2000/2500 following manufacturer's protocols (Illumina, Hayward, CA). Raw sequences from whole genome bisulfite sequencing (WGBS) were examined for quality, sample swap, reagent contamination and bisulfite conversion rate using custom in house scripts.

WGS/hg19 alignment:

Illumina paired-end whole genome sequencing reads were aligned to the hg19 reference using BWA version 0.5.7.  This reference contains chromosomes 1-22, X, Y, MT, 20 unlocalized scaffolds and 39 unplaced scaffolds. Multiple lanes of sequences were merged and duplicated reads were marked with Picard Tools.

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

 

Genomic SNV analyses

SNVs from WGS-seq data were analyzed using all three methods described below:

Mpileup

SNVs were analyzed with SAMtools mpileup v.0.1.17 (Li et al., 2009) either on single or paired libraries.  Each chromosome was analyzed separately using the -C50-DSBuf parameters. The resulting vcf files were merged and filtered to remove low quality SNVs by using samtools varFilter (with default parameters) as well as to remove SNVs with a QUAL score of less than 20 (vcf column 6). Finally, SNVs were annotated with gene annotations from ensembl v66 using snpEff (Cingolani et al., 2012b) and the dbSNP v137 db membership assigned using snpSift  (Cingolani et al., 2012a).

Strelka

To analyze compartment specific SNVs, samples were analyzed pair wise with the default settings of Strelka v0.4.7 (Saunders et al., 2012).  Primary tumor samples and relapse/met were compared against the germline sample. In the absence of a germline sample, the relapse/met samples were compared against the primary tumor sample.

MutationSeq

SNVs were analyzed pair wise with SAMtools mpileup v.0.1.17 (Li et al., 2009).  Each chromosome was analyzed separately using the -C50-DSBuf parameters. Before merging the resulting vcf files, they were filtered to remove all indels and low quality SNVs by using samtools varFilter (with default parameters) as well as to remove SNVs with a QUAL score of less than 20 (vcf column 6).  The SNVs in the resulting vcf files were further filtered and scored using mutationSeq v1.0.2 and annotated with gene annotations from ensembl v66 using snpEff (Cingolani et al., 2012b)  and the dbSNP v137 and cosmic 64 db membership using snpSift  (Cingolani et al., 2012a).

 

Copy number variation (CNV) analysis

The techniques outlined in Jones et al. (2010) were followed to analyze copy number changes. Sequence quality filtering was used to remove all reads of low mapping quality (Q < 10). Because the number of sequence reads varied between samples, aligned reference reads were first used to define genomic bins of equal reference coverage, against which the alignment depth of each tumor sample was compared. This resulted in a measurement of the relative number of aligned reads from the tumor and the reference in bins of variable length along the genome, where bin width is inversely proportional to the number of mapped reference reads. A hidden Markov model (HMM) was used to classify and segment continuous regions of copy number loss, neutrality, or gain using methodology outlined previously (Shah et al., 2006). The five states reported by the HMM were: loss (1), neutral (2), gain (3), amplification (4), and high-level amplification (5).
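The binning idea can be sketched as follows, assuming read start positions on a single chromosome. This is a simplified illustration only: the function names, the pseudocount, and the library-size normalization are assumptions, and the HMM segmentation step is not shown.

```python
import bisect
import math


def equal_coverage_bins(ref_positions, reads_per_bin):
    """Build half-open bins [start, end) that each hold `reads_per_bin`
    reference reads; a trailing partial bin is dropped."""
    ref_positions = sorted(ref_positions)
    n_full = len(ref_positions) // reads_per_bin
    bins = []
    for i in range(n_full):
        start = ref_positions[i * reads_per_bin]
        nxt = (i + 1) * reads_per_bin
        end = ref_positions[nxt] if nxt < len(ref_positions) else ref_positions[-1] + 1
        bins.append((start, end))
    return bins


def tumor_counts(bins, tumor_positions):
    """Count tumor reads whose start falls inside each bin."""
    tumor_positions = sorted(tumor_positions)
    return [
        bisect.bisect_left(tumor_positions, end) - bisect.bisect_left(tumor_positions, start)
        for start, end in bins
    ]


def log2_ratios(counts, reads_per_bin, tumor_total, ref_total, pseudocount=0.5):
    """Library-size-normalized log2(tumor / reference) per bin."""
    return [
        math.log2(((c + pseudocount) / tumor_total) / ((reads_per_bin + pseudocount) / ref_total))
        for c in counts
    ]


ref = [10, 20, 30, 40, 50, 60]      # hypothetical reference read starts
tumor = [12, 14, 16, 18, 35, 55]    # hypothetical tumor read starts
bins = equal_coverage_bins(ref, reads_per_bin=2)
counts = tumor_counts(bins, tumor)
ratios = log2_ratios(counts, 2, tumor_total=len(tumor), ref_total=len(ref))
```

Because each bin is defined by a fixed number of reference reads, bins become narrower where the reference is deeply covered and wider where it is sparse, which matches the variable bin width described above.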

Amplified and deleted CNV regions are further screened for interspersed repeats and low-complexity DNA sequences, including long interspersed nuclear elements (LINE), short interspersed nuclear elements (SINE), long terminal repeat elements (LTR), DNA repeat elements (DNA), low-complexity repeats, satellite repeats, simple repeats (microsatellites), and RNA repeats (including RNA, tRNA, rRNA, snRNA, scRNA, srpRNA).

Repeat sequences in the genome pose challenges for the identification of CNVs with next-generation sequencing data, as the short reads sequenced from repetitive regions cannot be mapped unambiguously. Exclusion or random placement of reads that align to multiple regions can either reduce the sensitivity of CNV detection or result in the identification of false deletions in repeated regions. Given the limitations of both alignment and subsequent segmentation algorithms, CNVs called in regions harboring highly repeated sequences should be carefully scrutinized. Therefore, in addition to focal CNV functional annotation, recurrence among patients, and the presence of overlapping trans-ABySS events, the number and types of repeats are added to the annotation of candidate CNVs to further narrow down the prioritized list for verification. It is recommended that candidate CNVs be prioritized based on the presence of genes of interest, high recurrence among patients, the presence of overlapping trans-ABySS events, and a low frequency or absence of repeat elements.
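One way to apply the recommended prioritization is a multi-key sort, sketched below. The field names and the order in which the criteria are weighted are assumptions made for illustration; the text lists the criteria but does not rank them.

```python
def prioritize_cnvs(cnvs):
    """Sort candidate CNVs so the most promising candidates come first."""
    return sorted(
        cnvs,
        key=lambda c: (
            not c["gene_of_interest"],      # CNVs containing genes of interest first
            -c["recurrence"],               # then higher recurrence among patients
            not c["transabyss_overlap"],    # then those with an overlapping trans-ABySS event
            c["repeat_count"],              # then fewer annotated repeats
        ),
    )


# Hypothetical candidates, for illustration only.
candidates = [
    {"id": "C", "gene_of_interest": False, "recurrence": 10, "transabyss_overlap": True, "repeat_count": 0},
    {"id": "B", "gene_of_interest": True, "recurrence": 3, "transabyss_overlap": True, "repeat_count": 5},
    {"id": "A", "gene_of_interest": True, "recurrence": 3, "transabyss_overlap": True, "repeat_count": 0},
]
ranked_ids = [c["id"] for c in prioritize_cnvs(candidates)]
```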

Illumina genomic plate-based library construction (350-450bp insert size):

2 µg of genomic DNA in a 96-well format was fragmented by Covaris E210 sonication for 30 seconds using a “Duty cycle” of 20% and an “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency’s Genome Sciences Centre 96-well Genomic ~350bp-450bp insert Illumina Library Construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the DNA was purified in a 96-well microtitre plate using Ampure XP SPRI beads (40-45 µL beads per 60 µL DNA) and was subjected to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA polymerase, and T4 polynucleotide kinase in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to use in the subsequent adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE indexed primer set, with the following cycle conditions: 98˚C for 30 sec, followed by 6 cycles of 98˚C for 15 sec, 62˚C for 30 sec and 72˚C for 30 sec, and a final extension at 72˚C for 5 min. The PCR products were purified using Ampure XP SPRI beads and checked with a Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was gel purified (8% PAGE or 1.5% Metaphor agarose in an in-house custom-built robot), and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and the Quant-iT dsDNA HS Assay Kit on a Qubit fluorometer (Invitrogen), then diluted to 8 nM. The final concentration was confirmed by Quant-iT dsDNA HS Assay prior to generating 100 bp paired-end reads on the Illumina HiSeq 2000/2500 platform using v3 chemistry.
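The dilution to 8 nM can be worked through with a short calculation. The sketch below assumes the commonly used approximation of 660 g/mol per base pair of double-stranded DNA; the function names and the example concentration and fragment size are hypothetical and are not taken from the protocol.

```python
def library_molarity_nm(conc_ng_per_ul, mean_fragment_bp, bp_mass=660.0):
    """Convert a dsDNA concentration (ng/uL) to nM, assuming ~660 g/mol per bp."""
    return conc_ng_per_ul * 1e6 / (bp_mass * mean_fragment_bp)


def diluent_volume_ul(stock_nm, stock_volume_ul, target_nm=8.0):
    """Diluent volume needed to bring a library to target_nm (C1*V1 = C2*V2)."""
    if stock_nm < target_nm:
        raise ValueError("library is already below the target concentration")
    final_volume = stock_nm * stock_volume_ul / target_nm
    return final_volume - stock_volume_ul


# Hypothetical library: 2.64 ng/uL with a 400 bp mean fragment size is about 10 nM.
molarity = library_molarity_nm(2.64, 400)
to_add = diluent_volume_ul(10.0, 10.0)  # 10 uL at 10 nM needs 2.5 uL of diluent
```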

 

Whole genome bisulfite-Seq library construction and sequencing:

1-5 µg of Qubit (Life Technologies, Carlsbad, CA) quantified genomic DNA was utilized for library construction as described (Gascard et al., 2015). To track the efficiency of bisulfite conversion, 1 ng of unmethylated lambda DNA (Promega) was spiked into 1 µg of genomic DNA quantified using Qubit fluorometry and arrayed in a 96-well microtitre plate. DNA was sheared to a target size of 300 bp using Covaris sonication, and the fragments were end repaired using DNA ligase and dNTPs at 30 °C for 30 minutes. Repaired DNA was purified using a 2:1 AMPure XP beads to sample ratio and eluted in 40 µL of elution buffer in preparation for A-tailing, which involved the addition of adenosine to the 3’ end of DNA fragments using Klenow fragment and dATP, followed by incubation at 37 °C for 30 minutes. Following reaction clean-up with magnetic beads, cytosine methylated paired-end adapters (5’- AmCAmCTmCTTTmCmCmCTAmCAmCGAmCGmCTmCTTmCmCGATmCT-3’ and 3’- GAGmCmCGTAAGGAmCGAmCTTGGmCGAGAAGGmCTAG-5’) were ligated to the DNA at 30 °C for 20 minutes, and adapter-flanked DNA fragments were bead purified. Prior to bisulfite conversion, an aliquot of library fragments was amplified with 10 cycles of PCR and sized on an Agilent Bioanalyzer High Sensitivity DNA chip. Amplicons were between 200 and 700 bp in length. Bisulfite conversion of the methylated adapter-ligated DNA fragments was achieved using the EZ Methylation-Gold kit (Zymo Research) following the manufacturer’s protocol. Five cycles of PCR using HiFi polymerase (Kapa Biosystems) were used to enrich the bisulfite converted DNA and introduce fault-tolerant hexamer barcode sequences. Post-PCR purification and size selection of bisulfite converted DNA was performed from precast 8% TBE gels (Invitrogen), extracting the 350-500 bp fraction, or the 275-425 bp fraction if the former was of weak intensity. Gel slurries were added to Spin-X filter tubes (Fisher), and the eluate was ethanol precipitated and resuspended in EB.
To determine final library concentrations, fragment sizes were assessed using a high sensitivity DNA assay (Agilent) and DNA was quantified by Qubit fluorometry. Where necessary, libraries were diluted in elution buffer supplemented with 0.1% Tween-20 to achieve a concentration of 8 nM for Illumina HiSeq2000/2500 flowcell cluster generation. Libraries were sequenced using paired-end 100/125 nt V3/4 sequencing chemistry on an Illumina HiSeq2000/2500 following the manufacturer's protocols (Illumina, Hayward, CA). Raw sequences from whole genome bisulfite sequencing (WGBS) were examined for quality, sample swap, reagent contamination and bisulfite conversion rate using custom in-house scripts.
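Because the lambda spike-in is unmethylated, every cytosine in reads mapped to lambda should be read as thymine after conversion, so the fraction of T calls at lambda cytosine positions estimates the conversion rate. The sketch below illustrates that arithmetic only; it is not the in-house script referred to above, and the function name and example counts are hypothetical.

```python
def bisulfite_conversion_rate(unconverted_c_calls, converted_t_calls):
    """Fraction of lambda cytosine positions read as T (converted)."""
    total = unconverted_c_calls + converted_t_calls
    if total == 0:
        raise ValueError("no coverage at lambda cytosine positions")
    return converted_t_calls / total


# Hypothetical counts: 5 unconverted and 995 converted calls.
rate = bisulfite_conversion_rate(5, 995)
```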

PCR-free whole genome sequencing:

Briefly, 500 ng of genomic DNA was arrayed in a 96-well microtitre plate and subjected to shearing by sonication (Covaris LE220). Sheared DNA was end-repaired and size selected using paramagnetic PCRClean DX beads (C-1003-450, Aline Biosciences), targeting a 300-400 bp fraction. After 3’ A-tailing, full-length TruSeq adapters were ligated. Libraries were purified using paramagnetic beads (Aline Biosciences). PCR-free genome library concentrations were quantified using a qPCR Library Quantification kit (KAPA, KK4824) prior to sequencing with paired-end 125 base reads on the Illumina HiSeq2500 platform using V4 chemistry according to manufacturer recommendations.

PCR-free whole genome sequencing:

To minimize library bias and coverage gaps associated with PCR amplification of high GC or AT-rich regions, we have implemented a version of the TruSeq DNA PCR-free kit (E6875-6877B-GSC, New England Biolabs), automated on a Microlab NIMBUS liquid handling robot (Hamilton).

WGS/hg19 alignment:

Illumina paired-end whole genome sequencing reads were aligned to the hg19 reference using BWA version 0.5.7. This reference contains chromosomes 1-22, X, Y, MT, 20 unlocalized scaffolds and 39 unplaced scaffolds. Multiple lanes of sequences were merged, and duplicate reads were marked with Picard Tools.
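The idea behind duplicate marking can be shown with a deliberately simplified, single-end sketch: reads sharing the same chromosome, start position and strand are grouped, the read with the highest summed base quality is kept as the representative, and the rest are flagged. Picard's actual implementation is more involved (for example, it considers both mates of a pair); the record layout and names below are assumptions for illustration.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Each read is a dict with 'name', 'chrom', 'pos', 'strand' and 'qual_sum'.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["pos"], read["strand"])].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["qual_sum"])  # representative read
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates


# Hypothetical reads merged from two lanes.
reads = [
    {"name": "r1", "chrom": "1", "pos": 100, "strand": "+", "qual_sum": 900},
    {"name": "r2", "chrom": "1", "pos": 100, "strand": "+", "qual_sum": 700},
    {"name": "r3", "chrom": "1", "pos": 100, "strand": "-", "qual_sum": 650},
    {"name": "r4", "chrom": "2", "pos": 100, "strand": "+", "qual_sum": 800},
]
flagged = mark_duplicates(reads)
```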

Structural variant detection

Structural variant detection was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq, assemblies at alternate k-mer values from k50 to k96 were performed using positive strand and ambiguous strand reads, as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged, and then concatenated to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single-end mode using k-mer values of k24 and k44. The contigs and reads were then reassembled at k64 in single-end mode and finally at k64 in paired-end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large-scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high-confidence GMAP (v2012-12-20) alignments to two distinct genomic regions. Evidence for the alignments was provided by aligning reads back to the contigs and by aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large-scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.
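The selection of contigs that align to two distinct genomic regions, followed by a read-support filter, can be sketched as below. This is an illustration only and not the trans-ABySS implementation: the data layout, the 1,000 bp separation and the minimum of two supporting reads are hypothetical values, since the actual thresholds are not stated above.

```python
def is_split_contig(alignments, min_gap=1000):
    """True when a contig has exactly two alignments to distinct regions:
    different chromosomes, or the same chromosome at least min_gap apart."""
    if len(alignments) != 2:
        return False
    (chrom_a, start_a, end_a), (chrom_b, start_b, end_b) = alignments
    if chrom_a != chrom_b:
        return True
    gap = max(start_a, start_b) - min(end_a, end_b)  # negative when overlapping
    return gap >= min_gap


def candidate_events(contigs, min_reads=2):
    """Names of split contigs that also meet the read-support threshold."""
    return [
        name
        for name, info in contigs.items()
        if is_split_contig(info["alignments"]) and info["supporting_reads"] >= min_reads
    ]


# Hypothetical contigs with (chromosome, start, end) alignments.
contigs = {
    "c1": {"alignments": [("chr2", 100, 200), ("chr5", 300, 400)], "supporting_reads": 5},
    "c2": {"alignments": [("chr1", 100, 200), ("chr1", 250, 350)], "supporting_reads": 9},
    "c3": {"alignments": [("chr3", 100, 200), ("chr9", 10, 90)], "supporting_reads": 1},
}
events = candidate_events(contigs)
```

Here only c1 is retained: c2 aligns to two nearby positions on the same chromosome, and c3 lacks sufficient read support.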

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment-specific events, the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated and screened against the matching tumour genome BAM files and, where available, germline BAM files. This yielded compartment-specific structural variant events and, where germline was available, putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.
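The screening logic above can be summarized as a small decision function. The labels and the function name are assumptions for illustration; read-support thresholds and the germline compendium filter are not modeled.

```python
def classify_event(in_tumour, in_germline):
    """Classify one structural variant call.

    in_tumour:   True if the tumour BAM supports the event.
    in_germline: True/False if a germline BAM was screened, None if unavailable.
    """
    if in_germline is None:
        # Without a germline sample, somatic status cannot be assigned.
        return "tumour-supported" if in_tumour else "unsupported"
    if in_germline:
        return "putative germline"
    return "putative somatic" if in_tumour else "unsupported"


labels = [
    classify_event(True, False),
    classify_event(True, True),
    classify_event(True, None),
]
```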

 



 



 



All CGI protocols can be found via the TARGET Data Matrix, in the "README" files of the whole genome datasets to which they apply.

All CGI protocols can be found via the TARGET Data Matrix under the whole genome datasets where it is applied in "README" files.

All CGI protocols can be found via the TARGET Data Matrix under the whole genome datasets where it is applied in "README" files.

Whole Exome Sequencing
Sequencing Center | Data Generation Protocols | Data Analysis Protocols
Baylor College of Medicine | ALL P2, AML, NBL MDLS, PPTP, WT | ALL P2, AML, NBL MDLS, PPTP, WT
Broad Institute | NBL | NBL
St. Jude Children’s Research Hospital (SJCRH) | ALAL | ALAL

Whole Exome Sequencing

*Protocols were performed at Baylor College of Medicine.

Library construction

Specimen processing, DNA extraction, standard QC and Illumina paired-end pre-capture library preparation followed the manufacturer's protocol (Illumina Inc., San Diego, CA) with the following modifications: 0.5-1 µg of genomic DNA in a 100 µl volume was sheared into fragments of approximately 300 base pairs in a Covaris E210 system (Covaris, Inc., Woburn, MA). The settings were 10% duty cycle, intensity of 4 and 200 cycles per burst, for 120 seconds. Fragment size was checked using a 2.2% Flash Gel DNA Cassette (Lonza, Walkersville, MD, Cat. No. 57023). End-repair of fragmented DNA was performed in a 90 µl total reaction volume containing sheared DNA, 9 µl 10X buffer, 5 µl End Repair Enzyme Mix and H2O (NEBNext End-Repair Module, New England BioLabs, Ipswich, MA, Cat. No. E6050L), incubated at 20°C for 30 minutes. A-tailing was performed in a total reaction volume of 60 µl containing end-repaired DNA, 6 µl 10X buffer, 3 µl Klenow fragment (NEBNext dA-Tailing Module; Cat. No. E6053L) and H2O, followed by incubation at 37°C for 30 minutes. Illumina multiplex adapter ligation (NEBNext Quick Ligation Module, Cat. No. E6056L) was performed in a total reaction volume of 90 µl containing 18 µl 5X buffer, 5 µl ligase, 0.5 µl 100 µM adaptor and H2O at room temperature for 30 minutes. After ligation, PCR with Illumina PE 1.0 and modified barcode primers (manuscript in preparation) was performed in 170 µl reactions containing 85 µl of 2X Phusion High-Fidelity PCR master mix, adaptor-ligated DNA, 1.75 µl of 50 µM primers and H2O. PCR was performed using a 5-minute initial denaturation at 95°C, 6-10 cycles of 15 seconds at 95°C, 15 seconds at 60°C and 30 seconds at 72°C, followed by a final extension for 5 minutes at 72°C. Agencourt XP Beads (Beckman Coulter Genomics, Inc., Danvers, MA, Cat. No. A63882) were used to purify DNA after each enzymatic reaction. After purification, PCR product quantity and size distribution were determined using the Caliper GX 1K/12K/High Sensitivity Assay Labchip (Hopkinton, MA, Cat. No. 760517).
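The reaction volumes above can be sanity-checked with simple dilution arithmetic (C1 x V1 = C2 x V2). The Python sketch below is illustrative only: the function names are hypothetical, the PCR timing counts programmed hold times without instrument ramping, and it assumes the stated 1.75 µl of 50 µM refers to a single primer stock.

```python
# Illustrative sketch only; names are hypothetical, values come from the text above.

def final_concentration(stock_conc, stock_volume_ul, total_volume_ul):
    """Solve C1 * V1 = C2 * V2 for C2 (result is in the units of stock_conc)."""
    return stock_conc * stock_volume_ul / total_volume_ul

# Buffer and master-mix strengths, expressed as "X" multiples.
end_repair_buffer_x = final_concentration(10, 9, 90)    # 9 ul of 10X in 90 ul
a_tailing_buffer_x = final_concentration(10, 6, 60)     # 6 ul of 10X in 60 ul
ligation_buffer_x = final_concentration(5, 18, 90)      # 18 ul of 5X in 90 ul
pcr_master_mix_x = final_concentration(2, 85, 170)      # 85 ul of 2X in 170 ul

# Oligo concentrations in micromolar.
adaptor_um = final_concentration(100, 0.5, 90)          # 0.5 ul of 100 uM in 90 ul
primer_um = final_concentration(50, 1.75, 170)          # assumption: per primer stock

def pcr_hold_seconds(cycles):
    """Programmed hold time for the stated PCR program (ramp time ignored)."""
    initial_denaturation = 5 * 60
    per_cycle = 15 + 15 + 30      # denature, anneal, extend
    final_extension = 5 * 60
    return initial_denaturation + cycles * per_cycle + final_extension

if __name__ == "__main__":
    print("buffers (X):", end_repair_buffer_x, a_tailing_buffer_x,
          ligation_buffer_x, pcr_master_mix_x)
    print("adaptor (uM): %.3f, primer (uM): %.3f" % (adaptor_um, primer_um))
    for n in (6, 10):
        print(n, "cycles:", pcr_hold_seconds(n) / 60, "minutes of hold time")
```

Each buffer works out to a 1X final strength, which is consistent with the stated total reaction volumes.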

Exome capture

Illumina pre-capture libraries (1 µg DNA input) were hybridized in solution to SeqCap EZ Human Exome 2.0 (Nimblegen, Madison, WI) probes, which target approximately 44 Mb of sequence from approximately 30K genes, according to the manufacturer's protocol with the following modifications: hybridization-enhancing oligos IHE1, IHE2 and IHE3 replaced oligos HE1.1 and HE2.1, and post-capture LM-PCR was performed using 14 cycles. Capture libraries were quantified using the Caliper GX 1K/12K/High Sensitivity Assay Labchip (Hopkinton, MA, Cat. No. 760517). Capture efficiency was evaluated with a qPCR-based quality check on the built-in controls (qPCR SYBR Green assays, Applied Biosystems, Grand Island, NY). Four standardized oligo sets, RUNX2, PRKG1, SMG1, and NLK, were employed as internal quality controls. The enrichment of the capture libraries was estimated to range from 7- to 9-fold over background.

Library templates were prepared for sequencing using Illumina's cBot cluster generation system with TruSeq PE Cluster Generation Kits (Part no. PE-401-3001). Briefly, these libraries were denatured with sodium hydroxide and diluted to 6-9 pM in hybridization buffer to achieve a load density of ~800K clusters/mm². Each library pool was loaded in a single lane of a HiSeq flow cell, and each lane was spiked with 2% phiX control library for run quality control. The sample libraries then underwent bridge amplification to form clonal clusters, followed by hybridization with the sequencing primer. Sequencing runs were performed in paired-end mode on the Illumina HiSeq 2000 platform. Sequencing-by-synthesis reactions, run with TruSeq SBS Kits (Part no. FC-401-3001), were extended for 101 cycles from each end, with an additional 7 cycles for the index read. Sequencing runs generated approximately 300-400 million successful reads on each lane of a flow cell, with approximately 9-10 Gb produced per sample. With these sequencing yields, an average of 95% of the targeted exome bases were covered to a depth of 20X or greater.

Real Time Analysis (RTA) software was used for image analysis and nucleotide base calling. On average, about 80-100 million successful reads, consisting of 2 x 100 bp, were generated on each lane of a flow cell.
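The yield and coverage figures above relate to each other through simple arithmetic. The Python sketch below is a rough illustration, not the center's pipeline. The function names are hypothetical, the nominal depth ignores duplicate and off-target reads, and the per-base depth list is made-up toy data used only to show how a "fraction of target bases at 20X or greater" statistic is computed.

```python
# Illustrative sketch only; names are hypothetical and the depth list is toy data.

TARGET_SIZE_BP = 44_000_000  # approximate capture target size stated in the text

def total_cycles(read_cycles=101, index_cycles=7):
    """Cycles per paired-end run: two reads plus one index read."""
    return 2 * read_cycles + index_cycles

def nominal_depth(yield_bp, target_bp=TARGET_SIZE_BP):
    """Raw yield divided by target size (an upper bound on mean on-target depth)."""
    return yield_bp / target_bp

def fraction_at_or_above(depths, threshold=20):
    """Fraction of target bases whose depth is >= threshold."""
    if not depths:
        raise ValueError("depths must not be empty")
    return sum(1 for d in depths if d >= threshold) / len(depths)

if __name__ == "__main__":
    print("cycles per run:", total_cycles())
    for gb in (9, 10):
        print(gb, "Gb -> nominal depth of about", round(nominal_depth(gb * 1e9), 1), "X")
    toy_depths = [0, 5, 19, 20, 21, 40, 100, 150, 30, 25]  # hypothetical per-base depths
    print("fraction >= 20X:", fraction_at_or_above(toy_depths))
```

The nominal depth is much higher than 20X because only part of the raw yield lands on target as unique, usable reads.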

Mapping Reads

Illumina HiSeq bcl files were processed using BCLConvertor v1.7.1. All reads from the prepared libraries that passed the Illumina Chastity filter were formatted into fastq files. The fastq files were aligned to human reference genome build 37 (NCBI) using BWA (bwa-0.5.9-R16) with default parameters, except for the following: seed sequence, 40 bp; seed mismatch, 2; total mismatches allowed, 3. BAM files generated from alignment were preprocessed using GATK (v1.3-8-gb0e6afe) [1] to recalibrate and locally realign reads.
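As a rough illustration, the non-default alignment parameters could be expressed as `bwa aln` options. The Python sketch below only builds the command lines and does not run them. Mapping the stated parameters to `-l` (seed length), `-k` (maximum differences in the seed) and `-n` (maximum differences overall) is an assumption based on the BWA manual, not something the protocol spells out, and all file names are hypothetical.

```python
# Illustrative sketch only; the flag mapping is an assumption and file names are hypothetical.
import shlex

def bwa_aln_command(reference, fastq, seed_length=40, seed_mismatches=2, max_mismatches=3):
    """Build a `bwa aln` command; its output (a .sai file) is written to stdout."""
    return ["bwa", "aln",
            "-l", str(seed_length),      # seed length in bp
            "-k", str(seed_mismatches),  # maximum differences allowed in the seed
            "-n", str(max_mismatches),   # maximum differences allowed overall
            reference, fastq]

def bwa_sampe_command(reference, sai1, sai2, fastq1, fastq2):
    """Build a `bwa sampe` command that pairs the two `bwa aln` results into SAM."""
    return ["bwa", "sampe", reference, sai1, sai2, fastq1, fastq2]

if __name__ == "__main__":
    ref = "build37.fa"  # hypothetical reference file name
    for fq in ("sample_R1.fastq", "sample_R2.fastq"):
        print(shlex.join(bwa_aln_command(ref, fq)), ">", fq.replace(".fastq", ".sai"))
    print(shlex.join(bwa_sampe_command(
        ref, "sample_R1.sai", "sample_R2.sai", "sample_R1.fastq", "sample_R2.fastq")))
```

Building the argument lists without executing them keeps the sketch independent of any particular BWA installation.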

Mutation Detection

Sequence variants were called from tumor and matched normal BAM files using Atlas [2], an integrative variant analysis suite of tools specializing in the separation of true SNPs and insertions and deletions (indels) from sequencing and mapping errors in whole exome capture sequencing (WXS) data. The suite implements logistic regression models trained on validated WXS data to identify the true variants. ATLAS-SNP-2 (v1.3) [3] and ATLAS-Indel-2 (v0.3.1), along with Pindel (v0.2.4q) [4], were run on the BAM files, producing variant data that were further filtered to remove all variants that were observed fewer than 5 times or were present in less than 0.08 of the reads (i.e., variant allele fraction must be greater than 0.08 to undergo validation). At least one variant read of Q30 or better was required, and the variant had to lie in the central portion of the read (excluding the first 15% from the 5' end of the read and the last 20% from the 3' end). In addition, reads harboring the variant must have been observed in both forward and reverse orientations. Finally, the variant base must not have been observed in the normal tissue. Indels were discovered by similar processing except indels must have been observed in at least 10 of the reads.
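The post-calling SNV filters described above can be sketched as a single predicate. This Python example is one possible reading of the criteria, not the Atlas implementation. The `Candidate` type and its field names are hypothetical, the allele-fraction boundary is treated as inclusive, and the "central portion" rule is interpreted as requiring at least one supporting read with the variant inside the central window.

```python
# Illustrative sketch only; types, field names and boundary handling are assumptions.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Candidate:
    variant_reads: int                       # tumor reads supporting the variant
    total_reads: int                         # tumor reads covering the site
    best_variant_quality: int                # highest base quality among variant reads
    variant_offsets: List[Tuple[int, int]]   # (0-based offset from 5' end, read length)
    forward_variant_reads: int
    reverse_variant_reads: int
    normal_variant_reads: int                # variant reads in the matched normal

def in_central_portion(offset, read_length, trim_5p=0.15, trim_3p=0.20):
    """True if the variant lies outside the first 15% and the last 20% of the read."""
    relative = offset / read_length
    return trim_5p <= relative <= 1.0 - trim_3p

def passes_snv_filters(c, min_reads=5, min_fraction=0.08, min_quality=30):
    """Apply the read-count, allele-fraction, quality, position, strand and normal filters."""
    if c.total_reads <= 0:
        return False
    fraction = c.variant_reads / c.total_reads
    return (c.variant_reads >= min_reads
            and fraction >= min_fraction
            and c.best_variant_quality >= min_quality
            and any(in_central_portion(o, n) for o, n in c.variant_offsets)
            and c.forward_variant_reads > 0
            and c.reverse_variant_reads > 0
            and c.normal_variant_reads == 0)

if __name__ == "__main__":
    example = Candidate(8, 50, 35, [(50, 101), (3, 101)], 5, 3, 0)  # made-up values
    print(passes_snv_filters(example))
```

Keeping the thresholds as keyword arguments makes it easy to see how each stated cutoff affects which candidates survive.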

 

Whole Exome Sequencing

*Protocols were performed at Baylor College of Medicine.

Library construction

Specimen processing, DNA extraction, standard QC and Illumina paired-end pre-capture libraries were prepared according to the manufacturer's protocol (Illumina Inc, San Diego, CA) with the following modifications: 0.5-1 µg genomic DNA in 100 µl volume was sheared into fragments of approximately 300 base pairs in a Covaris E210 system (Covaris, Inc. Woburn, MA). The settings were 10% duty cycle, intensity of 4, and 200 cycles per burst for 120 seconds. Fragment size was checked using a 2.2% Flash Gel DNA Cassette (Lonza, Walkersville, MD, Cat. No. 57023).

End-repair of fragmented DNA was performed in a 90 µl total reaction volume containing sheared DNA, 9 µl 10X buffer, 5 µl End Repair Enzyme Mix and H2O (NEBNext End-Repair Module, New England BioLabs, Ipswich, MA, Cat. No. E6050L), incubated at 20°C for 30 minutes. A-tailing was performed in a total reaction volume of 60 µl containing end-repaired DNA, 6 µl 10X buffer, 3 µl Klenow fragment (NEBNext dA-Tailing Module; Cat. No. E6053L) and H2O, followed by incubation at 37°C for 30 minutes. Illumina multiplex adapter ligation (NEBNext Quick Ligation Module, Cat. No. E6056L) was performed in a total reaction volume of 90 µl containing 18 µl 5X buffer, 5 µl ligase, 0.5 µl 100 µM adaptor and H2O at room temperature for 30 minutes.

After ligation, PCR with Illumina PE 1.0 and modified barcode primers (manuscript in preparation) was performed in 170 µl reactions containing 85 µl of 2X Phusion High-Fidelity PCR master mix, adaptor-ligated DNA, 1.75 µl of 50 µM primers and H2O. PCR was performed using a 5-minute initial denaturation at 95°C, followed by 6-10 cycles of 15 seconds at 95°C, 15 seconds at 60°C and 30 seconds at 72°C, and a final extension of 5 minutes at 72°C. Agencourt XP Beads (Beckman Coulter Genomics, Inc., Danvers, MA, Cat. No. A63882) were used to purify DNA after each enzymatic reaction. After purification, PCR product quantity and size distribution were determined using the Caliper GX 1K/12K/High Sensitivity Assay Labchip (Hopkinton, MA, Cat. No. 760517).
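As a quick consistency check on the reaction set-ups above, the final concentration of each stock reagent follows from C1·V1 = C2·V2. The sketch below uses only volumes stated in the protocol. The helper name is hypothetical, and whether the 1.75 µl of 50 µM primers refers to each primer or to a primer mix is not stated in the source.

```python
def final_concentration(stock_conc: float, stock_volume_ul: float,
                        total_volume_ul: float) -> float:
    """Dilution formula C2 = C1 * V1 / V2 (units follow the stock)."""
    return stock_conc * stock_volume_ul / total_volume_ul


# Buffers and master mix: each should come out at 1X in the final reaction.
buffers = {
    "end-repair 10X buffer": final_concentration(10, 9, 90),
    "A-tailing 10X buffer": final_concentration(10, 6, 60),
    "ligation 5X buffer": final_concentration(5, 18, 90),
    "Phusion 2X master mix": final_concentration(2, 85, 170),
}

# Oligos: final concentration in µM.
adaptor_um = final_concentration(100, 0.5, 90)
primer_um = final_concentration(50, 1.75, 170)

for name, fold in buffers.items():
    print(f"{name}: {fold:g}X")
print(f"adaptor: {adaptor_um:.3f} µM, primers: {primer_um:.3f} µM")
```

All four buffers work out to 1X, and both oligos to roughly 0.5 µM, so the stated volumes are internally consistent.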

Exome capture

Illumina pre-capture libraries (1 µg DNA input) were hybridized in solution to SeqCap EZ Human Exome 2.0 (Nimblegen, Madison, WI) probes, which target approximately 44 Mb of sequence from approximately 30,000 genes, according to the manufacturer's protocol with the following modifications: hybridization-enhancing oligos IHE1, IHE2 and IHE3 replaced oligos HE1.1 and HE2.1, and post-capture LM-PCR was performed using 14 cycles. Capture libraries were quantified using the Caliper GX 1K/12K/High Sensitivity Assay Labchip (Hopkinton, MA, Cat. No. 760517). The efficiency of the capture was evaluated by performing a qPCR-based quality check on the built-in controls (qPCR SYBR Green assays, Applied Biosystems, Grand Island, NY). Four standardized oligo sets, RUNX2, PRKG1, SMG1, and NLK, were employed as internal quality controls. The enrichment of the capture libraries was estimated to range from 7- to 9-fold over background.

Library templates were prepared for sequencing using Illumina's cBot cluster generation system with TruSeq PE Cluster Generation Kits (Part no. PE-401-3001). Briefly, the libraries were denatured with sodium hydroxide and diluted to 6-9 pM in hybridization buffer to achieve a load density of ~800K clusters/mm². Each library pool was loaded in a single lane of a HiSeq flow cell, and each lane was spiked with 2% phiX control library for run quality control. The sample libraries then underwent bridge amplification to form clonal clusters, followed by hybridization with the sequencing primer. Sequencing runs were performed in paired-end mode on the Illumina HiSeq 2000 platform. Using TruSeq SBS Kits (Part no. FC-401-3001), sequencing-by-synthesis reactions were extended for 101 cycles from each end, with an additional 7 cycles for the index read. Sequencing runs generated approximately 300-400 million successful reads on each lane of a flow cell, with approximately 9-10 Gb produced per sample. With these sequencing yields, an average of 95% of the targeted exome bases were covered to a depth of 20X or greater.
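The yield figures above can be related to read counts and depth with simple arithmetic. The sketch below is a back-of-the-envelope estimate only. It assumes 101 bp reads and the approximately 44 Mb capture target, and it treats every sequenced base as on-target, which the source does not claim, so the depth is an upper bound rather than an expected value. The function names are hypothetical.

```python
READ_LENGTH_BP = 101          # cycles per read end, from the protocol
TARGET_SIZE_BP = 44_000_000   # approximate capture target size


def reads_for_yield(yield_bp: float, read_length_bp: int = READ_LENGTH_BP) -> float:
    """Number of reads needed to produce a given number of bases."""
    return yield_bp / read_length_bp


def upper_bound_depth(yield_bp: float, target_bp: int = TARGET_SIZE_BP) -> float:
    """Mean depth if every sequenced base fell on the target (assumption)."""
    return yield_bp / target_bp


for gigabases in (9, 10):
    yield_bp = gigabases * 1_000_000_000
    million_reads = reads_for_yield(yield_bp) / 1e6
    depth = upper_bound_depth(yield_bp)
    print(f"{gigabases} Gb: {million_reads:.1f} M reads, depth at most {depth:.0f}X")
```

Under these assumptions, 9-10 Gb per sample corresponds to roughly 89-99 million 101 bp reads.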

Real Time Analysis (RTA) software was used for image analysis and nucleotide base calling. On average, about 80-100 million successful reads, consisting of 2 × 100 bp, were generated on each lane of a flow cell.

Whole Exome Sequencing

*Protocols were performed at Baylor College of Medicine.

Library construction

Specimen processing, DNA extraction, standard QC and preparation of Illumina paired-end pre-capture libraries followed the manufacturer's protocol (Illumina Inc, San Diego, CA) with the following modifications: 0.5-1ug genomic DNA in 100ul volume was sheared into fragments of approximately 300 base pairs in a Covaris E210 system (Covaris, Inc. Woburn, MA). The settings were 10% duty cycle, intensity of 4, and 200 cycles per burst for 120 seconds. Fragment size was checked using a 2.2% Flash Gel DNA Cassette (Lonza, Walkersville, MD, Cat. No. 57023). End-repair of fragmented DNA was performed in 90ul total reaction volume containing sheared DNA, 9ul 10X buffer, 5ul End Repair Enzyme Mix and H2O (NEBNext End-Repair Module, New England BioLabs, Ipswich, MA, Cat. No. E6050L), incubated at 20°C for 30 minutes. A-tailing was performed in a total reaction volume of 60ul containing end-repaired DNA, 6ul 10X buffer, 3ul Klenow fragment (NEBNext dA-Tailing Module; Cat. No. E6053L) and H2O, followed by incubation at 37°C for 30 minutes. Illumina multiplex adapter ligation (NEBNext Quick Ligation Module Cat. No. E6056L) was performed in a total reaction volume of 90ul containing 18ul 5X buffer, 5ul ligase, 0.5ul 100uM adapter and H2O at room temperature for 30 minutes. After ligation, PCR with Illumina PE 1.0 and modified barcode primers (manuscript in preparation) was performed in 170ul reactions containing 85ul of 2x Phusion High-Fidelity PCR master mix, adapter-ligated DNA, 1.75ul of 50uM primers and H2O. PCR was performed using a 5-minute initial denaturation at 95°C, 6-10 cycles of 15 seconds at 95°C, 15 seconds at 60°C and 30 seconds at 72°C, followed by a final extension for 5 minutes at 72°C. Agencourt XP Beads (Beckman Coulter Genomics, Inc., Danvers, MA, Cat. No. A63882) were used to purify DNA after each enzymatic reaction. After purification, PCR product quantity and size distribution were determined using the Caliper GX 1K/12K/High Sensitivity Assay Labchip (Hopkinton, MA, Cat. No. 760517).
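As a quick sanity check on the reaction set-ups above, the working concentrations can be derived from the stated volumes with C1·V1 = C2·V2. The sketch below is illustrative only: the function names are hypothetical, and it assumes the quoted 1.75ul of 50uM primers refers to a single primer stock (the protocol does not say whether this is per primer or a premix).

```python
def final_concentration(stock: float, added_ul: float, total_ul: float) -> float:
    """Return the concentration after dilution (same unit as `stock`)."""
    if total_ul <= 0 or not 0 <= added_ul <= total_ul:
        raise ValueError("volumes must satisfy 0 <= added_ul <= total_ul and total_ul > 0")
    return stock * added_ul / total_ul


# Buffers and master mix, expressed as "X" strength.
end_repair_buffer_x = final_concentration(10, 9, 90)    # 10X buffer, 9ul in 90ul
a_tailing_buffer_x = final_concentration(10, 6, 60)     # 10X buffer, 6ul in 60ul
ligation_buffer_x = final_concentration(5, 18, 90)      # 5X buffer, 18ul in 90ul
phusion_mix_x = final_concentration(2, 85, 170)         # 2x master mix, 85ul in 170ul

# Oligos, in uM.
adapter_um = final_concentration(100, 0.5, 90)          # 100uM adapter, 0.5ul in 90ul
primer_um = final_concentration(50, 1.75, 170)          # 50uM primers, 1.75ul in 170ul

print(f"adapter: {adapter_um:.2f} uM, primers: {primer_um:.2f} uM")
```

Under these assumptions every buffer and the Phusion master mix end up at 1X, the adapter at roughly 0.56 uM and the primers at roughly 0.51 uM.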

Exome capture

Illumina pre-capture libraries (1ug DNA input) were hybridized in solution to SeqCap EZ Human Exome 2.0 (Nimblegen, Madison, WI) probes targeting approximately 44 Mb of sequence from approximately 30K genes according to the manufacturer's protocol with the following modifications: hybridization-enhancing oligos IHE1, IHE2 and IHE3 replaced oligos HE1.1 and HE2.1, and post-capture LM-PCR was performed using 14 cycles. Capture libraries were quantified using the Caliper GX 1K/12K/High Sensitivity Assay Labchip (Hopkinton, MA, Cat. No. 760517). The efficiency of the capture was evaluated by performing a qPCR-based quality check on the built-in controls (qPCR SYBR Green assays, Applied Biosystems, Grand Island, NY). Four standardized oligo sets, RUNX2, PRKG1, SMG1, and NLK, were employed as internal quality controls. The enrichment of the capture libraries was estimated to range from 7- to 9-fold over background.

Library templates were prepared for sequencing using Illumina's cBot cluster generation system with TruSeq PE Cluster Generation Kits (Part no. PE-401-3001). Briefly, these libraries were denatured with sodium hydroxide and diluted to 6-9 pM in hybridization buffer to achieve a load density of ~800K clusters/mm². Each library pool was loaded in a single lane of a HiSeq flow cell, and each lane was spiked with 2% phiX control library for run quality control. The sample libraries then underwent bridge amplification to form clonal clusters, followed by hybridization with the sequencing primer. Sequencing runs were performed in paired-end mode using the Illumina HiSeq 2000 platform. Using the TruSeq SBS Kits (Part no. FC-401-3001), sequencing-by-synthesis reactions were extended for 101 cycles from each end, with an additional 7 cycles for the index read. Sequencing runs generated approximately 300-400 million successful reads on each lane of a flow cell, with approximately 9-10 Gb produced per sample. With these sequencing yields, samples achieved an average of 95% of the targeted exome bases covered to a depth of 20X or greater.

Real Time Analysis (RTA) software was used for image analysis and nucleotide base calling. On average, about 80-100 million successful reads, consisting of 2 x 100 bp, were generated on each lane of a flow cell.
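The yield figures quoted for the HiSeq 2000 runs can be related to one another with simple arithmetic. The sketch below uses the 300-400 million reads per lane, 101 cycles, 9-10 Gb per sample and approximately 44 Mb target stated in this section. It assumes that each "successful read" is a single 101-base read and that every sequenced base falls on target with no duplicates, so the depth values are upper bounds rather than observed coverage. All names are hypothetical.

```python
READ_LENGTH_BP = 101   # cycles per read end, from the protocol
TARGET_MB = 44         # approximate capture target size, from the protocol


def lane_yield_gb(reads_millions: float, read_length_bp: int = READ_LENGTH_BP) -> float:
    """Gigabases produced by a lane, assuming each read has read_length_bp bases."""
    return reads_millions * 1e6 * read_length_bp / 1e9


def max_mean_depth(sample_gb: float, target_mb: float = TARGET_MB) -> float:
    """Upper bound on mean target depth if every base were on target."""
    return sample_gb * 1e9 / (target_mb * 1e6)


lane_low_gb, lane_high_gb = lane_yield_gb(300), lane_yield_gb(400)
depth_low, depth_high = max_mean_depth(9), max_mean_depth(10)

print(f"lane yield: {lane_low_gb:.1f}-{lane_high_gb:.1f} Gb")
print(f"max mean depth per sample: {depth_low:.0f}-{depth_high:.0f}X")
```

Under these assumptions a lane yields roughly 30-40 Gb, and 9-10 Gb per sample corresponds to at most about 205-227X mean depth over the target before off-target and duplicate reads are discounted.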

Mapping Reads

Illumina HiSeq bcl files were processed using BCLConvertor v1.7.1. All reads from the prepared libraries that passed the Illumina Chastity filter were formatted into fastq files. The fastq files were aligned to human reference genome build 37 (NCBI) using BWA (bwa-0.5.9-R16) with default parameters except for the following: seed sequence: 40 bp, seed mismatch: 2, total mismatches allowed: 3. BAM files generated from alignment were preprocessed using GATK (v1.3-8-gb0e6afe) [1] to recalibrate and locally realign reads.
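The three non-default alignment settings can be written as a `bwa aln` invocation. The protocol does not give the exact command line, so the mapping used here is an assumption: seed sequence length to `-l`, seed mismatches to `-k`, and total mismatches to `-n`. The file names and the helper function are hypothetical, and the sketch only builds the argument list without running BWA.

```python
import shlex


def bwa_aln_command(reference: str, fastq: str,
                    seed_length: int = 40,
                    seed_mismatches: int = 2,
                    total_mismatches: int = 3) -> list[str]:
    """Build a `bwa aln` argument list; alignments are written to stdout."""
    return [
        "bwa", "aln",
        "-l", str(seed_length),        # seed length in bp (assumed mapping)
        "-k", str(seed_mismatches),    # maximum differences within the seed
        "-n", str(total_mismatches),   # maximum differences over the whole read
        reference,
        fastq,
    ]


cmd = bwa_aln_command("ref.fa", "reads_1.fastq")
print(shlex.join(cmd))  # bwa aln -l 40 -k 2 -n 3 ref.fa reads_1.fastq
```

Keeping the command as a list makes it suitable for `subprocess.run` without shell quoting issues; each read end would be aligned separately before pairing.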

Mutation Detection

Sequence variants were called from tumor and matched normal BAM files using Atlas [2], an integrative variant analysis suite of tools specializing in the separation of true SNPs and insertions and deletions (indels) from sequencing and mapping errors in whole exome capture sequencing (WXS) data. The suite implements logistic regression models trained on validated WXS data to identify the true variants. ATLAS-SNP-2 (v1.3) [3] and ATLAS-Indel-2 (v0.3.1), along with Pindel (v0.2.4q) [4], were run on the BAM files, producing variant data that were further filtered to remove all variants observed fewer than 5 times or present in less than 0.08 of the reads (i.e., the variant allele fraction had to be at least 0.08 for a variant to undergo validation). At least one variant read of Q30 or better was required, and the variant had to lie in the central portion of the read (at least 15% of the read length from the 5' end and 20% from the 3' end). In addition, reads harboring the variant had to be observed in both forward and reverse orientations. Finally, the variant base could not be observed in the normal tissue. Indels were discovered by similar processing, except that indels had to be observed in at least 10 of the reads.
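The post-calling filters described above can be expressed as a small predicate. This is a minimal sketch, not the Atlas implementation: the data structure and function names are hypothetical, and it assumes that the Q30 and central-portion requirements are each satisfied if at least one supporting read meets them, which the text leaves open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantRead:
    offset: int         # 0-based position of the variant from the read's 5' end
    read_length: int
    base_quality: int   # Phred quality of the variant base
    is_reverse: bool


def in_central_portion(read: VariantRead) -> bool:
    """True if the variant is past the first 15% and before the last 20% of the read."""
    fraction = read.offset / read.read_length
    return 0.15 <= fraction <= 0.80


def passes_filters(variant_reads: list[VariantRead], tumor_depth: int,
                   normal_variant_count: int,
                   min_reads: int = 5, min_vaf: float = 0.08) -> bool:
    if len(variant_reads) < min_reads:
        return False                                  # observed fewer than 5 times
    if tumor_depth <= 0 or len(variant_reads) / tumor_depth < min_vaf:
        return False                                  # allele fraction below 0.08
    if not any(r.base_quality >= 30 for r in variant_reads):
        return False                                  # no variant read of Q30 or better
    if not any(in_central_portion(r) for r in variant_reads):
        return False                                  # variant only near read ends
    if {r.is_reverse for r in variant_reads} != {True, False}:
        return False                                  # seen on one strand only
    return normal_variant_count == 0                  # absent from matched normal


both_strands = [VariantRead(50, 101, 35, i % 2 == 0) for i in range(6)]
forward_only = [VariantRead(50, 101, 35, False) for _ in range(6)]

print(passes_filters(both_strands, tumor_depth=50, normal_variant_count=0))
print(passes_filters(forward_only, tumor_depth=50, normal_variant_count=0))
```

The first call passes every filter, while the second fails the strand requirement. For indels, the text implies calling the same predicate with a stricter read threshold, for example `min_reads=10`.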

Whole Exome Sequencing

*Protocols were performed at Baylor College of Medicine.

Library construction

Specimen processing, DNA extraction, standard QC and Illumina paired-end pre-capture libraries were prepared according to the manufacturer's protocol (Illumina Inc, San Diego, CA) with the following modifications: 0.5 - 1ug genomic DNA in 100ul volume was sheared into fragments of approximately 300 base pairs in a Covaris E210 system (Covaris, Inc. Woburn, MA). The setting was 10% duty cycle, intensity of 4,200 cycles per burst for 120 seconds. Fragment size was checked using a 2.2% Flash Gel DNA Cassette (Lonza, Walkersville, MD, Cat. No.57023). End-repair of fragmented DNA was performed in 90ul total reaction volume containing sheared DNA, 9 ul 10X buffer, 5 ul END Repair Enzyme Mix and H2O (NEBNext End-Repair Module, New England BioLabs, Ipswich, MA, Cat. No. E6050L), incubated at 20°C for 30 minutes. A-tailing was performed in a total reaction volume of 60ul containing end-repaired DNA, 6ul 10X buffer, 3ul Klenow fragment (NEBNext dA-Tailing Module; Cat. No. E6053L) and H2O followed by incubation at 37°C for 30 minutes. Illumina multiplex adapter ligation (NEBNext Quick Ligation Module Cat. No. E6056L) was performed in a total reaction volume of 90ul containing 18ul 5X buffer, 5ul ligase, 0.5ul 100uM adaptor and H2O at room temperature for 30 minutes. After ligation, PCR with Illumina PE 1.0 and modified barcode primers (manuscript in preparation) was performed in 170ul reactions containing 85ul of 2x Phusion High-Fidelity PCR master mix, adaptor ligated DNA, 1.75ul of 50uM primers and H2O. PCR was performed using a 5-minute initial denaturation at 95°C, 6-10 cycles of 15 seconds at 95°C, 15 seconds at 60°C and 30 seconds at 72°C followed by a final extension for 5 minute at 72°C. Agencourt XP Beads (Beckman Coulter Genomics, Inc., Danvers, MA, Cat. No. A63882) were used to purify DNA after each enzymatic reaction. 
After purification, PCR product quantification and size distribution was determined using the Caliper GX 1K/12K/High Sensitivity Assay Labchip (Hopkinton, MA, Cat. No. 760517).

Exome capture

Illumina pre-capture libraries (1ug DNA input) were hybridized in solution to SeqCap EZ Human Exome 2.0 (Nimblegen, Madison, WI) probes targeting approximately 44Mbs of sequence from approximately 30K genes according to the manufacturer's protocol with the following modifications: hybridization enhancing oligos IHE1, IHE2 and IHE3 replaced oligos HE1.1 and HE2.1 and post-capture LM-PCR was performed using 14 cycles. Capture libraries were quantified using Caliper GX 1K/12K/High Sensitivity Assay Labchip (Hopkinton, MA, Cat. No. 760517). The efficiency of the capture was evaluated by performing a qPCR-based quality check on the built-in controls (qPCR SYBR Green assays, Applied Biosystems, Grand Island, NY). Four standardized oligo sets, RUNX2, PRKG1, SMG1, and NLK, were employed as internal quality controls. The enrichment of the capture libraries was estimated to range from 7- to 9-fold over background.

Library templates were prepared for sequencing using Illumina's cBot cluster generation system with TruSeq PE Cluster Generation Kits (Part no. PE-401-3001). Briefly, the libraries were denatured with sodium hydroxide and diluted to 6-9 pM in hybridization buffer to achieve a load density of ~800K clusters/mm². Each library pool was loaded in a single lane of a HiSeq flow cell, and each lane was spiked with 2% phiX control library for run quality control. The sample libraries then underwent bridge amplification to form clonal clusters, followed by hybridization with the sequencing primer. Sequencing runs were performed in paired-end mode on the Illumina HiSeq 2000 platform. Using TruSeq SBS Kits (Part no. FC-401-3001), sequencing-by-synthesis reactions were extended for 101 cycles from each end, with an additional 7 cycles for the index read. Sequencing runs generated approximately 300-400 million successful reads on each lane of a flow cell, with approximately 9-10 Gb produced per sample. With these sequencing yields, samples achieved an average of 95% of the targeted exome bases covered to a depth of 20X or greater.
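For a rough sense of scale, the per-sample yield can be compared with the size of the capture target. The sketch below is a back-of-the-envelope illustration, not part of the protocol. It divides the quoted 9-10 Gb yield by the approximately 44 Mb target. The result is an upper bound on mean depth, because off-target reads, duplicates and unmapped reads are ignored. The function name is hypothetical.

```python
# Back-of-the-envelope depth estimate from the figures quoted in the text.
# This ignores off-target, duplicate and unmapped reads, so it is an upper bound.

TARGET_BASES = 44e6  # approximately 44 Mb capture target

def mean_raw_depth(yield_bases: float, target_bases: float = TARGET_BASES) -> float:
    """Sequenced bases divided by target size."""
    return yield_bases / target_bases

low_depth = mean_raw_depth(9e9)    # 9 Gb per sample
high_depth = mean_raw_depth(10e9)  # 10 Gb per sample

# Cycles per run: 101 from each end plus a 7-cycle index read.
total_cycles = 2 * 101 + 7

if __name__ == "__main__":
    print(f"raw depth upper bound: {low_depth:.0f}X to {high_depth:.0f}X")
    print(f"total sequencing cycles: {total_cycles}")
```

The upper bound comes to roughly 205X to 227X, which leaves considerable headroom for the stated outcome of 95% of target bases covered at 20X or greater.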

Real Time Analysis (RTA) software was used for image analysis and nucleotide base calling. On average, about 80-100 million successful reads, consisting of 2 × 100 bp, were generated on each lane of a flow cell.

Whole Exome Sequencing

*Protocols were performed at Baylor College of Medicine.

Mapping Reads

Illumina HiSeq bcl files were processed using BCLConvertor v1.7.1. All reads from the prepared libraries that passed the Illumina Chastity filter were formatted into fastq files. The fastq files were aligned to human reference genome build 37 (NCBI) using BWA (bwa-0.5.9-R16) with default parameters, with the following exceptions: seed sequence: 40 bp; seed mismatch: 2; total mismatches allowed: 3. BAM files generated from alignment were preprocessed using GATK (v1.3-8-gb0e6afe) [1] to recalibrate and locally realign reads.
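As an illustration of how the non-default alignment settings might be expressed, the sketch below builds a `bwa aln` argument list. The source does not give the command line that was run. Mapping the seed length, seed mismatches and total mismatches to the `-l`, `-k` and `-n` options is an assumption based on the `bwa aln` option descriptions. The file names are hypothetical.

```python
# Hypothetical reconstruction of the alignment call. The source does not give
# the actual command line; the flag mapping below is an assumption:
#   -l  seed length
#   -k  maximum differences allowed in the seed
#   -n  maximum differences allowed in the whole read
from typing import List

def bwa_aln_command(reference: str, fastq: str,
                    seed_length: int = 40,
                    seed_mismatches: int = 2,
                    total_mismatches: int = 3) -> List[str]:
    """Return an argv list for `bwa aln`; nothing is executed here."""
    return [
        "bwa", "aln",
        "-l", str(seed_length),
        "-k", str(seed_mismatches),
        "-n", str(total_mismatches),
        reference, fastq,
    ]

if __name__ == "__main__":
    # File names are placeholders for illustration only.
    print(" ".join(bwa_aln_command("build37.fa", "sample_R1.fastq")))
```

For paired-end data, `bwa aln` would be run once per read file and the two outputs combined with `bwa sampe`. That step is omitted here.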

Mutation Detection

Sequence variants were called from tumor and matched normal BAM files using Atlas [2], an integrative variant analysis suite of tools that specializes in separating true SNPs and insertions and deletions (indels) from sequencing and mapping errors in whole exome capture sequencing (WXS) data. The suite implements logistic regression models trained on validated WXS data to identify the true variants. ATLAS-SNP-2 (v1.3) [3] and ATLAS-Indel-2 (v0.3.1), along with Pindel (v0.2.4q) [4], were run on the BAM files, producing variant data that were further filtered to remove all variants that were observed fewer than 5 times or were present in less than 0.08 of the reads (i.e., variant allele fraction must be greater than 0.08 to undergo validation). At least one variant read of Q30 or better was required, and the variant had to lie in the central portion of the read (15% from the 5' end of the read and 20% from the 3' end). In addition, reads harboring the variant must have been observed in both forward and reverse orientations. Finally, the variant base must not have been observed in the normal tissue. Indels were discovered by similar processing, except that indels must have been observed in at least 10 of the reads.
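The post-calling filters described above can be sketched as a single predicate. This is a minimal illustration of the stated thresholds, not the Atlas or Pindel implementation. All class and function names are hypothetical. Two points are assumptions because the text leaves them open: a variant passes the position test if at least one supporting read carries it in the central portion, and threshold boundaries are treated as inclusive.

```python
# Minimal sketch of the stated SNV filters; names and boundary handling are
# assumptions, and this is not the Atlas or Pindel code.
from dataclasses import dataclass
from typing import List

MIN_VARIANT_READS = 5      # remove variants observed fewer than 5 times
MIN_VAF = 0.08             # remove variants in less than 0.08 of the reads
MIN_BASE_QUALITY = 30      # at least one variant read of Q30 or better
FIVE_PRIME_MARGIN = 0.15   # excluded fraction at the 5' end of the read
THREE_PRIME_MARGIN = 0.20  # excluded fraction at the 3' end of the read

@dataclass
class SupportingRead:
    base_quality: int  # Phred quality of the variant base
    position: int      # 0-based offset of the variant from the 5' end
    read_length: int
    is_reverse: bool

def in_central_portion(read: SupportingRead) -> bool:
    """True if the variant lies outside both excluded read ends."""
    fraction = read.position / read.read_length
    return FIVE_PRIME_MARGIN <= fraction <= 1.0 - THREE_PRIME_MARGIN

def passes_filters(supporting: List[SupportingRead],
                   tumor_depth: int,
                   normal_variant_reads: int) -> bool:
    if len(supporting) < MIN_VARIANT_READS:
        return False
    if tumor_depth <= 0 or len(supporting) / tumor_depth < MIN_VAF:
        return False
    if not any(r.base_quality >= MIN_BASE_QUALITY for r in supporting):
        return False
    # Assumption: one centrally placed supporting read is sufficient.
    if not any(in_central_portion(r) for r in supporting):
        return False
    # Both forward and reverse orientations must be represented.
    if {r.is_reverse for r in supporting} != {True, False}:
        return False
    # The variant base must be absent from the matched normal.
    return normal_variant_reads == 0

def make_reads(n_forward: int, n_reverse: int) -> List[SupportingRead]:
    """Helper for examples: Q35 reads with the variant mid-read."""
    return ([SupportingRead(35, 50, 101, False) for _ in range(n_forward)]
            + [SupportingRead(35, 50, 101, True) for _ in range(n_reverse)])

if __name__ == "__main__":
    print(passes_filters(make_reads(3, 3), tumor_depth=40, normal_variant_reads=0))
```

The indel rule (a higher read-support requirement) would follow the same pattern with a different minimum count.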


 




Whole Exome Sequencing

*Protocols were performed at Baylor College of Medicine.

Library construction

Specimen processing, DNA extraction, standard QC and Illumina paired-end pre-capture libraries were prepared according to the manufacturer's protocol (Illumina Inc, San Diego, CA) with the following modifications: 0.5 - 1ug genomic DNA in 100ul volume was sheared into fragments of approximately 300 base pairs in a Covaris E210 system (Covaris, Inc. Woburn, MA). The setting was 10% duty cycle, intensity of 4,200 cycles per burst for 120 seconds. Fragment size was checked using a 2.2% Flash Gel DNA Cassette (Lonza, Walkersville, MD, Cat. No.57023). End-repair of fragmented DNA was performed in 90ul total reaction volume containing sheared DNA, 9 ul 10X buffer, 5 ul END Repair Enzyme Mix and H2O (NEBNext End-Repair Module, New England BioLabs, Ipswich, MA, Cat. No. E6050L), incubated at 20°C for 30 minutes. A-tailing was performed in a total reaction volume of 60ul containing end-repaired DNA, 6ul 10X buffer, 3ul Klenow fragment (NEBNext dA-Tailing Module; Cat. No. E6053L) and H2O followed by incubation at 37°C for 30 minutes. Illumina multiplex adapter ligation (NEBNext Quick Ligation Module Cat. No. E6056L) was performed in a total reaction volume of 90ul containing 18ul 5X buffer, 5ul ligase, 0.5ul 100uM adaptor and H2O at room temperature for 30 minutes. After ligation, PCR with Illumina PE 1.0 and modified barcode primers (manuscript in preparation) was performed in 170ul reactions containing 85ul of 2x Phusion High-Fidelity PCR master mix, adaptor ligated DNA, 1.75ul of 50uM primers and H2O. PCR was performed using a 5-minute initial denaturation at 95°C, 6-10 cycles of 15 seconds at 95°C, 15 seconds at 60°C and 30 seconds at 72°C followed by a final extension for 5 minute at 72°C. Agencourt XP Beads (Beckman Coulter Genomics, Inc., Danvers, MA, Cat. No. A63882) were used to purify DNA after each enzymatic reaction. 
After purification, PCR product quantification and size distribution was determined using the Caliper GX 1K/12K/High Sensitivity Assay Labchip (Hopkinton, MA, Cat. No. 760517).

Exome capture

Illumina pre-capture libraries (1ug DNA input) were hybridized in solution to SeqCap EZ Human Exome 2.0 (Nimblegen, Madison, WI) probes targeting approximately 44Mbs of sequence from approximately 30K genes according to the manufacturer's protocol with the following modifications: hybridization enhancing oligos IHE1, IHE2 and IHE3 replaced oligos HE1.1 and HE2.1 and post-capture LM-PCR was performed using 14 cycles. Capture libraries were quantified using Caliper GX 1K/12K/High Sensitivity Assay Labchip (Hopkinton, MA, Cat. No. 760517). The efficiency of the capture was evaluated by performing a qPCR-based quality check on the built-in controls (qPCR SYBR Green assays, Applied Biosystems, Grand Island, NY). Four standardized oligo sets, RUNX2, PRKG1, SMG1, and NLK, were employed as internal quality controls. The enrichment of the capture libraries was estimated to range from 7- to 9-fold over background.

Library templates were prepared for sequencing using Illumina's cBot cluster generation system with TruSeq PE Cluster Generation Kits (Part no. PE-401-3001). Briefly, these libraries were denatured with sodium hydroxide and diluted to 6-9 pM in hybridization buffer in order to achieve a load density of ~800K clusters/mm2. Each library pool was loaded in a single lane of a HiSeq flow cell, and each lane was spiked with 2% phiX control library for run quality control. The sample libraries then underwent bridge amplification to form clonal clusters, followed by hybridization with the sequencing primer. Sequencing runs were performed in paired-end mode using the Illumina HiSeq 2000 platform. Using the TruSeq SBS Kits (Part no. FC-401-3001), sequencing-by-synthesis reactions were extended for 101 cycles from each end, with an additional 7 cycles for the index read. Sequencing runs generated approximately 300-400 million successful reads on each lane of a flow cell, with approximately 9-10 Gb produced per sample. With these sequencing yields, samples achieved an average of 95% of the targeted exome bases covered to a depth of 20X or greater.

Real Time Analysis (RTA) software was used for image analysis and nucleotide base calling. On average, about 80-100 million successful reads, consisting of 2 × 100 bp, were generated on each lane of a flow cell.
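As a rough cross-check of the yield figures quoted above, the per-sample output can be divided by the size of the capture target. This is only a back-of-the-envelope sketch: it assumes every sequenced base falls on target, which is never true in practice, so the result is an upper bound rather than the depth actually achieved. The function name is illustrative.

```python
# Upper-bound depth estimate from the figures quoted in the text.
TARGET_BASES = 44e6              # ~44 Mb capture target
YIELD_PER_SAMPLE = (9e9, 10e9)   # ~9-10 Gb produced per sample


def raw_mean_depth(yield_bases: float, target_bases: float) -> float:
    """Mean depth if all sequenced bases landed on the target (an assumption)."""
    return yield_bases / target_bases


low, high = (raw_mean_depth(y, TARGET_BASES) for y in YIELD_PER_SAMPLE)
print(f"raw mean depth: {low:.0f}x to {high:.0f}x")
```

Off-target reads, PCR duplicates and uneven capture all reduce the usable depth, which is why the protocol reports coverage as the fraction of target bases at 20X or greater rather than as a mean.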

Whole Exome Sequencing

*Protocols were performed at Baylor College of Medicine.

Mapping Reads

Illumina HiSeq bcl files were processed using BCLConvertor v1.7.1. All reads from the prepared libraries that passed the Illumina Chastity filter were formatted into fastq files. The fastq files were aligned to human reference genome build 37 (NCBI) using BWA (bwa-0.5.9-R16) with default parameters, with the following exceptions: seed sequence, 40 bp; seed mismatch, 2; total mismatches allowed, 3. BAM files generated from the alignment were preprocessed using GATK (v1.3-8-gb0e6afe) [1] to recalibrate and locally realign reads.
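The non-default alignment settings can be written down as a `bwa aln` argument list. The sketch below only assembles the command and does not run it. Mapping the three stated exceptions onto the `-l` (seed length), `-k` (maximum differences in the seed) and `-n` (maximum differences in the read) options is an assumption, since the protocol does not give the command line, and the file names are hypothetical.

```python
import shlex
from typing import List


def bwa_aln_args(reference: str, fastq: str) -> List[str]:
    """Build a `bwa aln` command for one fastq file (output goes to stdout)."""
    return [
        "bwa", "aln",
        "-l", "40",   # assumed: "seed sequence: 40 bp"
        "-k", "2",    # assumed: "seed mismatch: 2"
        "-n", "3",    # assumed: "total mismatches allowed: 3"
        reference,
        fastq,
    ]


# Hypothetical file names, for illustration only.
cmd = bwa_aln_args("GRCh37.fa", "sample_R1.fastq")
print(shlex.join(cmd))
```

Note that `-n` in `bwa aln` bounds the edit distance, which counts gaps as well as mismatches, so it is only an approximate match for "total mismatches allowed".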

Mutation Detection

Sequence variants were called from tumor and matched normal BAM files using Atlas [2], an integrative variant analysis suite of tools specializing in the separation of true SNPs and insertions and deletions (indels) from sequencing and mapping errors in whole exome capture sequencing (WXS) data. The suite implements logistic regression models trained on validated WXS data to identify the true variants. ATLAS-SNP-2 (v1.3) [3] and ATLAS-Indel-2 (v0.3.1), along with Pindel (v0.2.4q) [4], were run on the BAM files. The resulting variant calls were further filtered to remove any variant observed fewer than 5 times or present in less than 0.08 of the reads (i.e., the variant allele fraction must be greater than 0.08 to undergo validation). At least one variant read of Q30 or better was required, and the variant had to lie in the central portion of the read (15% from the 5' end of the read and 20% from the 3' end). In addition, reads harboring the variant must have been observed in both forward and reverse orientations. Finally, the variant base must not have been observed in the normal tissue. Indels were discovered by similar processing, except that indels must have been observed in at least 10 of the reads.
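The post-calling filters for base substitutions can be summarized as a single predicate. This is a minimal sketch, not the Atlas implementation: the data structure and function names are hypothetical, the "central portion" rule is read as a position between 15% and 80% of the read length, and at least one supporting read is assumed to need to satisfy it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SupportingRead:
    base_quality: int         # Phred quality of the variant base
    position_fraction: float  # variant position / read length (0.0 = 5' end)
    is_reverse: bool          # True if the read maps to the reverse strand


def in_central_portion(frac: float) -> bool:
    # Assumed reading of "15% from the 5' end and 20% from the 3' end".
    return 0.15 <= frac <= 0.80


def passes_snv_filters(variant_reads: List[SupportingRead],
                       total_depth: int,
                       normal_variant_reads: int) -> bool:
    n = len(variant_reads)
    if n < 5:                                   # observed fewer than 5 times
        return False
    if total_depth <= 0 or n / total_depth < 0.08:
        return False                            # below 0.08 of the reads
    if not any(r.base_quality >= 30 for r in variant_reads):
        return False                            # needs one Q30+ variant read
    if not any(in_central_portion(r.position_fraction) for r in variant_reads):
        return False                            # only seen near read ends
    if {r.is_reverse for r in variant_reads} != {True, False}:
        return False                            # needs both orientations
    return normal_variant_reads == 0            # absent from matched normal


reads = [SupportingRead(35, 0.5, i % 2 == 0) for i in range(6)]
print(passes_snv_filters(reads, total_depth=50, normal_variant_reads=0))
```

The text leaves the handling of a variant allele fraction of exactly 0.08 ambiguous; the sketch keeps such variants, and tightening the comparison to `<=` would exclude them.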

Whole Exome Sequencing

*Protocols were performed at Baylor College of Medicine.

Library construction

Specimen processing, DNA extraction and standard QC were carried out, and Illumina paired-end pre-capture libraries were prepared, according to the manufacturer's protocol (Illumina Inc, San Diego, CA) with the following modifications. Genomic DNA (0.5-1 ug in a 100 ul volume) was sheared into fragments of approximately 300 base pairs in a Covaris E210 system (Covaris, Inc., Woburn, MA) using a 10% duty cycle, an intensity of 4, and 200 cycles per burst for 120 seconds. Fragment size was checked using a 2.2% Flash Gel DNA Cassette (Lonza, Walkersville, MD, Cat. No. 57023). End-repair of fragmented DNA was performed in a 90 ul total reaction volume containing sheared DNA, 9 ul 10X buffer, 5 ul End Repair Enzyme Mix and H2O (NEBNext End-Repair Module, New England BioLabs, Ipswich, MA, Cat. No. E6050L), incubated at 20°C for 30 minutes. A-tailing was performed in a total reaction volume of 60 ul containing end-repaired DNA, 6 ul 10X buffer, 3 ul Klenow fragment (NEBNext dA-Tailing Module; Cat. No. E6053L) and H2O, followed by incubation at 37°C for 30 minutes. Illumina multiplex adapter ligation (NEBNext Quick Ligation Module, Cat. No. E6056L) was performed in a total reaction volume of 90 ul containing 18 ul 5X buffer, 5 ul ligase, 0.5 ul 100 uM adaptor and H2O at room temperature for 30 minutes. After ligation, PCR with Illumina PE 1.0 and modified barcode primers (manuscript in preparation) was performed in 170 ul reactions containing 85 ul of 2X Phusion High-Fidelity PCR master mix, adaptor-ligated DNA, 1.75 ul of 50 uM primers and H2O. PCR was performed using a 5-minute initial denaturation at 95°C, 6-10 cycles of 15 seconds at 95°C, 15 seconds at 60°C and 30 seconds at 72°C, followed by a final extension for 5 minutes at 72°C. Agencourt XP Beads (Beckman Coulter Genomics, Inc., Danvers, MA, Cat. No. A63882) were used to purify DNA after each enzymatic reaction.
After purification, PCR products were quantified and their size distribution was determined using the Caliper GX 1K/12K/High Sensitivity Assay Labchip (Hopkinton, MA, Cat. No. 760517).
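The reagent volumes quoted above can be checked with the usual dilution relation (stock concentration × stock volume = final concentration × total volume). The sketch below applies it to each reaction. The labels are illustrative, and whether the 1.75 ul of 50 uM primers refers to each primer or to a primer mix is not stated, so that figure is computed for the quoted volume as given.

```python
def final_concentration(stock_conc: float, stock_volume_ul: float,
                        total_volume_ul: float) -> float:
    """C_final = C_stock * V_stock / V_total (same units as stock_conc)."""
    return stock_conc * stock_volume_ul / total_volume_ul


# (stock concentration, stock volume in ul, total reaction volume in ul)
reagents = {
    "end-repair buffer (X)": (10, 9, 90),
    "A-tailing buffer (X)": (10, 6, 60),
    "ligation buffer (X)": (5, 18, 90),
    "ligation adaptor (uM)": (100, 0.5, 90),
    "PCR master mix (X)": (2, 85, 170),
    "PCR primers (uM)": (50, 1.75, 170),
}

final = {name: final_concentration(*spec) for name, spec in reagents.items()}
for name, conc in final.items():
    print(f"{name}: {conc:.3f}")
```

Each buffer and the master mix work out to 1X in the final reaction, which confirms that the quoted volumes are internally consistent with the stated total volumes.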

Exome capture

Illumina pre-capture libraries (1 ug DNA input) were hybridized in solution to SeqCap EZ Human Exome 2.0 probes (Nimblegen, Madison, WI), which target approximately 44 Mb of sequence from approximately 30K genes. Hybridization followed the manufacturer's protocol with the following modifications: hybridization-enhancing oligos IHE1, IHE2 and IHE3 replaced oligos HE1.1 and HE2.1, and post-capture LM-PCR was performed using 14 cycles. Capture libraries were quantified using the Caliper GX 1K/12K/High Sensitivity Assay Labchip (Hopkinton, MA, Cat. No. 760517). Capture efficiency was evaluated with a qPCR-based quality check on the built-in controls (qPCR SYBR Green assays, Applied Biosystems, Grand Island, NY), using four standardized oligo sets (RUNX2, PRKG1, SMG1, and NLK) as internal quality controls. The enrichment of the capture libraries was estimated to range from 7- to 9-fold over background.

Library templates were prepared for sequencing using Illumina's cBot cluster generation system with TruSeq PE Cluster Generation Kits (Part no. PE-401-3001). Briefly, the libraries were denatured with sodium hydroxide and diluted to 6-9 pM in hybridization buffer to achieve a load density of ~800K clusters/mm². Each library pool was loaded in a single lane of a HiSeq flow cell, and each lane was spiked with 2% phiX control library for run quality control. The sample libraries then underwent bridge amplification to form clonal clusters, followed by hybridization with the sequencing primer. Sequencing runs were performed in paired-end mode on the Illumina HiSeq 2000 platform. Using TruSeq SBS Kits (Part no. FC-401-3001), sequencing-by-synthesis reactions were extended for 101 cycles from each end, with an additional 7 cycles for the index read. Sequencing runs generated approximately 300-400 million successful reads on each lane of a flow cell, with approximately 9-10 Gb produced per sample. With these yields, an average of 95% of the targeted exome bases were covered to a depth of 20X or greater.

Real Time Analysis (RTA) software was used for image analysis and nucleotide base calling. On average, about 80-100 million successful reads, consisting of 2 × 100 bp, were generated on each lane of a flow cell.

Whole Exome Sequencing

*Protocols were performed at Baylor College of Medicine.

Mapping Reads

Illumina HiSeq bcl files were processed using BCLConvertor v1.7.1. All reads from the prepared libraries that passed the Illumina Chastity filter were formatted into fastq files. The fastq files were aligned to human reference genome build 37 (NCBI) using BWA (bwa-0.5.9-R16) with default parameters, with the following exceptions: seed sequence, 40 bp; seed mismatch, 2; total mismatches allowed, 3. BAM files generated from the alignment were preprocessed using GATK (v1.3-8-gb0e6afe) [1] to recalibrate and locally realign reads.

Mutation Detection

Sequence variants were called from tumor and matched normal BAM files using Atlas [2], an integrative variant analysis suite of tools specializing in the separation of true SNPs and insertions and deletions (indels) from sequencing and mapping errors in whole exome capture sequencing (WXS) data. The suite implements logistic regression models trained on validated WXS data to identify the true variants. ATLAS-SNP-2 (v1.3) [3] and ATLAS-Indel-2 (v0.3.1), along with Pindel (v0.2.4q) [4], were run on the BAM files. The resulting variant calls were further filtered to remove any variant observed fewer than 5 times or present in less than 0.08 of the reads (i.e., the variant allele fraction must be greater than 0.08 to undergo validation). At least one variant read of Q30 or better was required, and the variant had to lie in the central portion of the read (15% from the 5' end of the read and 20% from the 3' end). In addition, reads harboring the variant must have been observed in both forward and reverse orientations. Finally, the variant base must not have been observed in the normal tissue. Indels were discovered by similar processing, except that indels must have been observed in at least 10 of the reads.

 

Whole Exome Sequencing

*Protocols were performed at Baylor College of Medicine.

Library construction

Specimen processing, DNA extraction, standard QC and Illumina paired-end pre-capture libraries were prepared according to the manufacturer's protocol (Illumina Inc, San Diego, CA) with the following modifications: 0.5 - 1ug genomic DNA in 100ul volume was sheared into fragments of approximately 300 base pairs in a Covaris E210 system (Covaris, Inc. Woburn, MA). The setting was 10% duty cycle, intensity of 4,200 cycles per burst for 120 seconds. Fragment size was checked using a 2.2% Flash Gel DNA Cassette (Lonza, Walkersville, MD, Cat. No.57023). End-repair of fragmented DNA was performed in 90ul total reaction volume containing sheared DNA, 9 ul 10X buffer, 5 ul END Repair Enzyme Mix and H2O (NEBNext End-Repair Module, New England BioLabs, Ipswich, MA, Cat. No. E6050L), incubated at 20°C for 30 minutes. A-tailing was performed in a total reaction volume of 60ul containing end-repaired DNA, 6ul 10X buffer, 3ul Klenow fragment (NEBNext dA-Tailing Module; Cat. No. E6053L) and H2O followed by incubation at 37°C for 30 minutes. Illumina multiplex adapter ligation (NEBNext Quick Ligation Module Cat. No. E6056L) was performed in a total reaction volume of 90ul containing 18ul 5X buffer, 5ul ligase, 0.5ul 100uM adaptor and H2O at room temperature for 30 minutes. After ligation, PCR with Illumina PE 1.0 and modified barcode primers (manuscript in preparation) was performed in 170ul reactions containing 85ul of 2x Phusion High-Fidelity PCR master mix, adaptor ligated DNA, 1.75ul of 50uM primers and H2O. PCR was performed using a 5-minute initial denaturation at 95°C, 6-10 cycles of 15 seconds at 95°C, 15 seconds at 60°C and 30 seconds at 72°C followed by a final extension for 5 minute at 72°C. Agencourt XP Beads (Beckman Coulter Genomics, Inc., Danvers, MA, Cat. No. A63882) were used to purify DNA after each enzymatic reaction. 
After purification, PCR product quantification and size distribution was determined using the Caliper GX 1K/12K/High Sensitivity Assay Labchip (Hopkinton, MA, Cat. No. 760517).

Exome capture

Illumina pre-capture libraries (1ug DNA input) were hybridized in solution to SeqCap EZ Human Exome 2.0 (Nimblegen, Madison, WI) probes targeting approximately 44Mbs of sequence from approximately 30K genes according to the manufacturer's protocol with the following modifications: hybridization enhancing oligos IHE1, IHE2 and IHE3 replaced oligos HE1.1 and HE2.1 and post-capture LM-PCR was performed using 14 cycles. Capture libraries were quantified using Caliper GX 1K/12K/High Sensitivity Assay Labchip (Hopkinton, MA, Cat. No. 760517). The efficiency of the capture was evaluated by performing a qPCR-based quality check on the built-in controls (qPCR SYBR Green assays, Applied Biosystems, Grand Island, NY). Four standardized oligo sets, RUNX2, PRKG1, SMG1, and NLK, were employed as internal quality controls. The enrichment of the capture libraries was estimated to range from 7- to 9-fold over background.

Library templates were prepared for sequencing using Illumina's cBot cluster generation system with TruSeq PE Cluster Generation Kits (Part no. PE-401-3001). Briefly, these libraries were denatured with sodium hydroxide and diluted to 6-9 pM in hybridization buffer in order to achieve a load density of ~800K clusters/mm2. Each library pool was loaded in a single lane of a HiSeq flow cell, and each lane was spiked with 2% phiX control library for run quality control. The sample libraries then underwent bridge amplification to form clonal clusters, followed by hybridization with the sequencing primer. Sequencing runs were performed in paired-end mode using the Illumina HiSeq 2000 platform. Using the TruSeq SBS Kits (Part no. FC-401-3001), sequencing-by-synthesis reactions were extended for 101 cycles from each end, with an additional 7 cycles for the index read. Sequencing runs generated approximately 300-400 million successful reads on each lane of a flow cell, with approximately 9-10 Gb produced per sample. With these sequencing yields, samples achieved an average of 95% of the targeted exome bases covered to a depth of 20X or greater.

Real Time Analysis (RTA) software was used to process the image analysis and nucleotide base calling. On average, about 80-100 million successful reads, consisting of 2X 100 bp, were generated on each lane of a flow cell.

Whole Exome Sequencing

*Protocols were performed at Baylor College of Medicine.

Mapping Reads

Illumina HiSeq bcl files were processed using BCLConvertor v1.7.1. All reads from the prepared libraries that passed the Illumina Chastity filter were formatted into fastq files. The fastq files were aligned to human reference genome build37 (NCBI) using BWA (bwa-0.5.9-R16) with default parameters with the following exceptions: seed sequence: 40 bpseed mismatch: 2, total mismatches allowed: 3. BAM files generated from alignment were preprocessed using GATK (v1.3-8-gb0e6afe) [1] to recalibrate and locally realign reads.

Mutation Detection

Sequence variants were called from tumor and matched normal BAM files using Atlas [2], an integrative variant analysis suite of tools specializing in the separation of true SNPs and insertions and deletions (indels) from sequencing and mapping errors in whole exome capture sequencing (WXS) data. The suite implements logistic regression models trained on validated WXS data to identify the true variants. ATLAS-SNP-2 (v1.3) [3] and ATLAS-Indel-2 (v0.3.1), along with Pindel (v0.2.4q) [4], were run on the BAM files, producing variant data that were further filtered to remove all variants observed fewer than 5 times or present in less than 0.08 of the reads (i.e., the variant allele fraction must be greater than 0.08 to undergo validation). At least one variant read of Q30 or better was required, and the variant had to lie in the central portion of the read (15% from the 5' end of the read and 20% from the 3' end). In addition, reads harboring the variant must have been observed in both forward and reverse orientations. Finally, the variant base must not have been observed in the normal tissue. Indels were discovered by similar processing, except that indels must have been observed in at least 10 of the reads.
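The post-calling SNV filters described above can be sketched as a single predicate. This is not the Atlas implementation: the record layout and field names are hypothetical, the "central portion" rule is interpreted as requiring at least one variant read that places the variant between 15% and 80% of the read length from the 5' end, and variants at exactly 0.08 allele fraction are kept here although the text leaves that boundary ambiguous.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    alt_reads: int          # tumor reads supporting the variant
    total_reads: int        # tumor reads covering the site
    best_alt_qual: int      # highest base quality among variant reads
    rel_positions: list     # variant position in each variant read, as a
                            # fraction of read length from the 5' end
    fwd_alt: int            # variant reads in forward orientation
    rev_alt: int            # variant reads in reverse orientation
    normal_alt_reads: int   # variant reads in the matched normal


def passes_snv_filters(c, min_alt=5, min_vaf=0.08):
    """Apply the read-count, allele-fraction, quality, position,
    strand, and matched-normal filters in turn."""
    if c.total_reads == 0 or c.alt_reads < min_alt:
        return False
    if c.alt_reads / c.total_reads < min_vaf:
        return False
    if c.best_alt_qual < 30:
        return False
    # Central portion: at least 15% from the 5' end and 20% from the 3' end.
    if not any(0.15 <= p <= 0.80 for p in c.rel_positions):
        return False
    if c.fwd_alt == 0 or c.rev_alt == 0:
        return False
    return c.normal_alt_reads == 0
```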

Whole Exome Sequencing

*Protocols were performed at the Broad Institute. Please reference Pugh et al. (Published in final edited form as: Nat Genet. 2013 Mar; 45(3): 279–284).

The generation, sequencing, and analysis of 222 pairs of exome libraries at the Broad Institute was performed using a previously described protocol. Due to the small quantities of DNA available, 81 DNA samples were amplified using Phi29-based multiple-strand displacement whole genome amplification (Repli-g service, QIAgen). Exonic regions were captured by in-solution hybridization using RNA baits similar to those described, supplemented with probes capturing additional genes listed in RefSeq beyond the original Consensus Coding Sequence (CCDS) set. In total, ~33 Mb of genomic sequence was targeted, consisting of 193,094 exons from 18,863 genes annotated by the CCDS and RefSeq databases as coding for protein or micro-RNA (accessed November 2010). Sequencing of 76 bp paired-end reads was performed using Illumina Genome Analyzer IIx and HiSeq 2000 instruments. Reads were aligned to the hg19/GRCh37 build of the reference human genome sequence using BWA. PCR duplicates were flagged in the bam files for exclusion from further analysis using the Picard MarkDuplicates tool. To confirm sample identity, copy number profiles derived from sequence data were compared with those derived from microarray data when available. Candidate somatic base substitutions were detected using muTect (previously referred to as muTector), and insertions and deletions were detected using IndelGenotyper. Segmental copy number ratios were calculated as the ratio of tumor fractional read-depth to the average fractional read-depth observed in normal samples for that region.
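The segmental copy number ratio defined in the last sentence above can be written out as follows. This is a minimal sketch: the function names and the read counts in the example are hypothetical, and "fractional read-depth" is taken to mean reads in the region divided by total reads for the sample.

```python
def fractional_depth(region_reads, total_reads):
    """Share of a sample's reads that fall in the region."""
    return region_reads / total_reads


def copy_ratio(tumor_region, tumor_total, normals):
    """Tumor fractional depth divided by the mean normal fractional depth.

    normals: list of (region_reads, total_reads) pairs, one per normal sample.
    """
    normal_mean = sum(fractional_depth(r, t) for r, t in normals) / len(normals)
    return fractional_depth(tumor_region, tumor_total) / normal_mean


# Hypothetical counts: the tumor carries 1.5x the normal share of reads.
ratio = copy_ratio(300, 1_000_000, [(200, 1_000_000), (400, 2_000_000)])
print(round(ratio, 3))
```

Normalizing by each sample's total read count keeps the ratio comparable across libraries sequenced to different depths.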

Removal of oxoG library preparation artifact

Cases sequenced using WGA and native DNA were sequenced more than eight months apart by the Sequencing Platform at the Broad Institute. Initial comparison of candidate mutation calls from these two data sets identified a preponderance of apparent G>T or C>A substitutions of low allele fraction (<0.15) and within specific sequence contexts (Supplementary Figure 2A). We subsequently characterized this artifact and developed a method to detect and remove these events. In brief, these artifacts are introduced at the DNA shearing step of the library construction process and arise from the oxidation of guanine bases (oxoG) by high-energy sonication. During downstream PCR, oxoG bases preferentially pair with thymine rather than cytosine, resulting in apparent G>T or C>A substitutions of low allele fraction and enriched within specific sequence contexts (Supplementary Figure 2B). Consistent with this mechanism, the intensity of the sonication process was increased with the introduction of a new 150 bp shearing protocol between preparation of the WGA and native DNA samples.

The number of artifacts in a library was apparently sample-dependent (Supplementary Figure 2C) and these events were found in unmatched tumor and normal libraries. In some cases, thousands of candidate mutations were called in cases with a heavily affected tumor sample and an unaffected normal. However, nearly every sample had at least one such artifact and we have observed similar events in publicly available data sets from other centers, suggesting a common artifact mode that was exacerbated in some of our samples. To address this problem, we devised a method to differentiate oxoG artifacts from bona fide mutations.

Due to the modification of only one strand of a G:C base-pair (i.e. only the G base), reads supporting the artifact have characteristic read-orientation conferred upon adapter ligation. Therefore, all reads supporting an artifact were almost exclusively derived from the first or second read of the Illumina HiSeq instrument. Bona fide variants are supported by near-equal numbers of first and second reads. We made use of the skewed read-orientation combinations and low allele fractions characteristic of this artifact to identify and remove oxoG artifacts from mutation calls in our cohort (i.e. removal of all variants with allele fraction <0.1 or exclusively supported by a single read orientation). 
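The removal rule stated above (allele fraction below 0.1, or support from a single read orientation only) can be sketched as a simple predicate. The function and argument names are hypothetical, and this sketch does not reproduce any sequence-context modeling the original method may have used.

```python
def fails_oxog_filter(alt_first_read, alt_second_read, total_reads, min_af=0.1):
    """Return True if a candidate variant would be removed.

    alt_first_read / alt_second_read: variant-supporting reads that are the
    first or second read of a pair, respectively.
    """
    alt = alt_first_read + alt_second_read
    if total_reads == 0 or alt == 0:
        return True  # no supporting evidence at all
    allele_fraction = alt / total_reads
    single_orientation = alt_first_read == 0 or alt_second_read == 0
    return allele_fraction < min_af or single_orientation
```

A bona fide variant is expected to show near-equal support from first and second reads, so it passes both conditions when its allele fraction is 0.1 or higher.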

Verification of somatic mutations and rearrangements

We used a combination of genotyping and sequencing technologies to verify random candidate mutations (PCR/Sanger and PCR/HiSeq sequencing of candidates from Complete Genomics and BC Cancer Agency Illumina WGS and RNA-seq data), as well as mutations supportive of our significance analyses (Sequenom and PCR/MiSeq of WES and WGS data). Combining all of the validation experiments resulted in overall validation rates of 87% for substitutions (525/605 candidates, 241/282 coding) and 34% for indels (27/79 candidates, 26/41 coding). Some mutations were verified using multiple technologies and therefore the total number of candidate mutations verified is lower than the sum total of mutations described in the Supplementary Note. See Supplementary Note for details and cross-platform comparisons.

Integrated analysis of somatic variation from exome and genome data sets

Somatic mutations detected in WGS, WES, and RNA-seq data sets were annotated using Oncotator (See Broad Institute Cancer Genome Analysis webpage). Genes mutated at a statistically significant frequency were identified using MutSig, a method that identifies genes with mutation frequencies greater than expected by chance, given detected background mutation rates, gene length and callable sequence in each tumor/normal pair. The relationship between mutation frequency and age of diagnosis was tested using the Spearman rank test. The implementation of the Kolmogorov-Smirnov test in R version 2.11.1 (ks.test) was used to test differences in mutation frequency distributions of several clinical variables (Supplementary Table 4).

Germline variant analysis

Detection of pathogenic germline variation at base-pair resolution in a cohort of cancer patients is complicated by selection of an appropriately matched and sized control population, relatively high carrier frequencies for unrelated disorders, and complex genetics underlying cancer predisposition. To nominate germline variants predisposing to neuroblastoma, we searched for enrichment of putative functional variants in the blood-derived DNA samples from our WES cohort compared to normal DNAs from 1,974 European American individuals sequenced by the National Heart, Lung, and Blood Institute Grand Opportunity Exome Sequencing Project (ESP). As indel calls from the ESP cohort were not publicly available at the time of our study, we did not include them in our analysis.

To ensure consistency and accuracy of germline variant detection, all neuroblastoma WES cases were called simultaneously with 800 WES cases from the 1000Genomes project using the UnifiedGenotyper from the Genome Analysis Toolkit. A principal component analysis of the genotype calls was performed to determine the ethnic background of our cases (Supplementary Figure 7) with respect to three 1000Genomes populations. As over 80% of our cohort was Caucasian or ad-mixed Caucasian, we downloaded genotyping calls and coverage information from 1,974 European American individuals available on the ESP website to serve as a control population. To focus our analysis on rare variation consistent with the low prevalence of neuroblastoma, we removed from both data sets all variants present in individuals sequenced as part of the 1000 Genomes project. Next, we generated two lists of rare variants: overlaps with clinically-reported variants recorded in ClinVar (downloaded 4/27/2012, 284 variants in neuroblastoma, 2,947 in ESP) and loss-of-function variants in any of 924 genes listed in the Cancer Gene Census, Familial Cancer database, or a list of DNA repair genes (86 neuroblastoma, 1,068 ESP). We then tested each gene for significant enrichment of variants in the neuroblastoma compared to the ESP cohort (1-tailed Fisher’s exact test, Supplementary Tables 7 and 8).
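The per-gene enrichment test named above, a 1-tailed Fisher's exact test, can be sketched in pure Python from the hypergeometric distribution. This is a generic implementation and not the code used in the study; the 2x2 layout (carriers and non-carriers in the case and control cohorts) and the counts in the usage line are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a, b: carriers / non-carriers in the case cohort
    c, d: carriers / non-carriers in the control cohort
    Returns the probability of seeing a or more case carriers, given the
    fixed row and column totals.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)


# Hypothetical counts for one gene: 4 of 222 cases vs 2 of 1,974 controls.
p_value = fisher_exact_greater(4, 218, 2, 1972)
print(f"p = {p_value:.3g}")
```

Because many genes are tested, the resulting p-values still require correction for multiple testing, as the text notes below.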

The germline ClinVar analysis uncovered four genes of significance driven by single variants seen at greater frequency in neuroblastoma compared to ESP: CYP2D6, NOD2, SLC34A3, and HPD. All of these variants are present at low frequency in an expanded European American ESP cohort (rs5030865 in 1/8,524 chromosomes, rs104895438 in 5/8600, rs121918239 in 14/8514, and rs137852868 in 11/8600), suggesting they are benign polymorphisms. Note that, while candidates detected by this approach are not significant after correction for multiple testing, we believe there is sufficient biological rationale and supporting evidence for validation in larger cohorts. We also looked for overlap with sites recorded in COSMIC. This analysis identified a TP53 variant associated with Li-Fraumeni syndrome.

Whole Exome Sequencing

*Protocols were performed at St. Jude Children’s Research Hospital.

Library construction utilized DNA tagmentation (fragmentation and adapter attachment) using the reagent provided in the Illumina Nextera rapid exome kit (version 1.2) and was performed on the Caliper Biosciences (PerkinElmer) Sciclone G3. First-round PCR (10 cycles) was performed using Illumina Nextera kit v1.2 reagents, and clean-up steps employed BC/Agencourt AMPure XP beads. Target capture utilized the Illumina Nextera rapid capture exome kit v1.2 and supplied hybridization and associated reagents. The pre-hybridization pool size was 12 samples, and second-round PCR (10 cycles) was performed with Nextera kit v1.2 reagents. Library quality control was performed using a Victor fluorescence plate reader with Quant-iT dsDNA reagents for pre-pool quantitation, and an Agilent Bio-analyzer 2200 for final library quantitation. Paired-end sequencing was performed using an Illumina HiSeq 2500 with a read length of 100 bp.

Paired-end WXS data were aligned to the human reference genome GRCh37 by BWA [1] (version 0.7.12). Samtools [2] (version 1.3.1) was used to generate chromosomal coordinate-sorted and indexed bam files, and then the Picard (version 1.129) MarkDuplicates module was used to mark PCR duplicates.

SNV/indel calling and filter workflow. The GATK UnifiedGenotyper module was used to identify SNVs and indels from leukemia and germline samples, which were filtered by an in-house pipeline, excluding: 1) reported common SNPs/indels from UCSC dbSNP v142; 2) germline mutations detected from matched germline control samples. All the non-silent SNVs/indels yielded by the filtering pipeline were manually reviewed, and only the highly reliable somatic ones were reported. Adjacent nucleotide changes on the same allele were merged into a single mutation.
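The two exclusion steps and the merging of adjacent changes can be sketched as below. This is an illustration, not the in-house pipeline: variants are represented as hypothetical (chrom, pos, ref, alt) tuples, and the merge step assumes the calls have already been confirmed to lie on the same allele, which in practice requires read-level phasing.

```python
def filter_somatic(tumor_calls, germline_calls, common_snps):
    """Keep tumor calls absent from the matched germline sample and
    from the set of common polymorphisms."""
    excluded = set(germline_calls) | set(common_snps)
    return [v for v in tumor_calls if v not in excluded]


def merge_adjacent(calls):
    """Merge changes at consecutive positions into one multi-nucleotide change."""
    merged = []
    for chrom, pos, ref, alt in sorted(calls):
        if merged:
            p_chrom, p_pos, p_ref, p_alt = merged[-1]
            if p_chrom == chrom and p_pos + len(p_ref) == pos:
                merged[-1] = (p_chrom, p_pos, p_ref + ref, p_alt + alt)
                continue
        merged.append((chrom, pos, ref, alt))
    return merged


# Hypothetical calls for one patient.
tumor = [("chr1", 100, "A", "T"), ("chr1", 101, "C", "G"),
         ("chr2", 50, "G", "A"), ("chr3", 7, "T", "C")]
germline = [("chr2", 50, "G", "A")]
common = {("chr3", 7, "T", "C")}
print(merge_adjacent(filter_somatic(tumor, germline, common)))
```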

For patients with flow sorted subpopulations of leukemia cells sequenced, the mutation calling for each population was performed de novo. Mutations detected from some/one of the samples were checked across the other samples from the same patient.

References

  1. Li H, et al. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 25(14):1754-60. (PMID: 19451168)
  2. Li H, et al. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics. 25(16):2078-9. (PMID: 19505943)

mRNA Sequencing
Sequencing Center Data Generation Protocols Data Analysis Protocols
British Columbia Cancer Agency (BCCA) ALL , AML , NBL , RT , WT , ALAL ALL , AML , NBL , RT , WT , ALAL
NCI Center for Cancer Research CCSK , NBL CCSK , NBL
St. Jude Children’s Research Hospital (SJCRH) ALAL

RNA-Seq (plate-based) library construction (pre-2014):

2-3 µg total RNA samples were arrayed into a 96-well plate and polyadenylated (PolyA+) RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10 µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA). Double-stranded cDNA was synthesized from the purified polyA+ RNA using the Superscript Double-Stranded cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5 µM. The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The quality was checked for a random sampling on an Agilent Bioanalyzer using the High Sensitivity DNA chip Assay. cDNA was fragmented by Covaris E210 (Covaris, USA) sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. Plate-based libraries were prepared following the BC Cancer Agency, Genome Sciences Centre paired-end (PE) protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subjected to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the adapter ligation reaction that followed. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set, with cycle conditions of 98˚C for 30 sec, followed by 10-15 cycles of 98˚C for 10 sec, 65˚C for 30 sec and 72˚C for 30 sec, and then 72˚C for 5 min.
The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using an in-house 96-channel size selection robot, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8 nM. The final concentration was verified by Quant-iT dsDNA HS Assay prior to Illumina HiSeq2000 PE 75 base sequencing.
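Diluting a library to 8 nM requires converting the fluorometric mass concentration to molarity. The sketch below uses the common approximation of 660 g/mol per base pair of double-stranded DNA; the function names, the example concentration, the mean fragment size, and the final volume are illustrative assumptions and do not come from the protocol.

```python
def library_molarity_nM(conc_ng_per_ul, mean_fragment_bp, bp_mass=660.0):
    """Convert a dsDNA concentration in ng/uL to nM."""
    return conc_ng_per_ul * 1e6 / (bp_mass * mean_fragment_bp)


def dilution_volumes(stock_nM, target_nM=8.0, final_volume_ul=20.0):
    """Return (stock volume, diluent volume) in uL to reach target_nM."""
    if stock_nM < target_nM:
        raise ValueError("stock is already below the target concentration")
    stock_ul = final_volume_ul * target_nM / stock_nM
    return stock_ul, final_volume_ul - stock_ul


# Hypothetical library: 2.64 ng/uL with a mean fragment size of 400 bp.
stock = library_molarity_nM(2.64, 400)
print(stock, dilution_volumes(stock))
```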

Strand-specific RNA-seq (plate based) library construction (post-2014):

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2 µg total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10 µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

First-strand cDNA was synthesized from the purified polyA+ RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5 µM along with a final concentration of 1 µg/µL Actinomycin D, followed by cleanup with Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second-strand cDNA was synthesized following the Superscript cDNA Synthesis protocol, replacing the dTTP with dUTP in the dNTP mix. This allowed the second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction, thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subjected to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the adapter ligation reaction that followed. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1 U/µL) at 37°C for 30 min followed by deactivation at 95°C for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set, with cycle conditions of 98˚C for 30 sec, followed by 10-13 cycles of 98˚C for 10 sec, 65˚C for 30 sec and 72˚C for 30 sec, and then 72˚C for 5 min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8 nM.
The final library concentration was verified again by Quant-iT dsDNA HS Assay prior to Illumina sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to the GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which were downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA was run using default parameters, except that the option (-s) was included to disable Smith-Waterman alignment.
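The repositioning step can be illustrated with a small coordinate conversion. This is a sketch under stated assumptions, not the production script: it assumes each junction sequence is built from a fixed number of flanking bases (`flank`) taken from the upstream and downstream exons, and it reports the spliced alignment with an `N` operation in the CIGAR string. All names and coordinates are hypothetical.

```python
def reposition_junction_read(offset, read_len, donor_end, acceptor_start, flank):
    """Map a read aligned to an exon-exon junction sequence back to the genome.

    offset: 0-based start of the read within the junction sequence
    donor_end: 1-based genomic position of the last base of the upstream exon
    acceptor_start: 1-based genomic position of the first base of the
        downstream exon
    flank: bases taken from each exon to build the junction sequence
    Returns (1-based genomic start, CIGAR string).
    """
    if offset < 0 or offset + read_len > 2 * flank:
        raise ValueError("read does not fit inside the junction sequence")
    if offset >= flank:
        # Read lies entirely in the downstream exon.
        return acceptor_start + offset - flank, f"{read_len}M"
    start = donor_end - flank + 1 + offset
    left = flank - offset  # read bases that fall in the upstream exon
    if left >= read_len:
        # Read lies entirely in the upstream exon.
        return start, f"{read_len}M"
    intron = acceptor_start - donor_end - 1
    return start, f"{left}M{intron}N{read_len - left}M"


# Hypothetical junction: upstream exon ends at 1000, downstream starts at 2001.
print(reposition_junction_read(44, 75, 1000, 2001, 74))
```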

Finally, reads failing the Illumina chastity filter were flagged with a custom script, and duplicated reads were flagged with Picard Tools.

Structural variant detection

Structural variant detection was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq, assemblies at alternate k-mers from k50 to k96 were performed using positive strand and ambiguous strand reads, as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single-end mode using k-mer values of k24 and k44. The contigs and reads were then reassembled at k64 in single-end mode and then finally at k64 in paired-end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions. Evidence for the alignments was provided by aligning reads back to the contigs and by aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment-specific events, the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour and, where available, germline bam files. This resulted in compartment-specific structural variant events and, where germline was available, putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30. The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supporting the variant; 3) probability of heterozygous and homozygous variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read support only from positions no more than 5 bases from read ends; 6) support only from reads spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads improperly paired; 8) fewer than 2 properly paired supporting reads. SNVs located in exons equal to or smaller than the read length, 100 bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that were only supported by reads spanning an exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).
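The eight exclusion criteria and the small-exon exception can be sketched as follows. The record fields are hypothetical summaries that would have to be derived from the SNVMix2 output and the BAM files, and two interpretations are assumptions: criterion 3 is taken as the summed probability of the heterozygous and homozygous variant genotypes, and criterion 5 as "no supporting read places the variant more than 5 bases from a read end".

```python
from dataclasses import dataclass


@dataclass
class RnaSnv:
    ref_base: str
    support_reads: int           # reads supporting the variant
    p_variant: float             # P(heterozygous) + P(homozygous variant)
    overlaps_indel: bool
    support_away_from_ends: int  # supporting reads with the variant more
                                 # than 5 bases from a read end
    support_non_junction: int    # supporting reads not spanning a junction
    improper_support: int        # improperly paired supporting reads


def failed_filters(s):
    """Return the numbers (1-8) of the exclusion criteria the SNV trips."""
    proper = s.support_reads - s.improper_support
    checks = {
        1: s.ref_base.upper() == "N",
        2: s.support_reads <= 1,
        3: s.p_variant < 0.99,
        4: s.overlaps_indel,
        5: s.support_away_from_ends == 0,
        6: s.support_non_junction == 0,
        7: s.support_reads > 0 and s.improper_support / s.support_reads > 0.5,
        8: proper < 2,
    }
    return [number for number, failed in checks.items() if failed]


def keep_snv(s, small_exon=False):
    """Keep an SNV that trips no criterion; for exons no longer than the
    read length, junction-only support (criterion 6) is tolerated."""
    failed = failed_filters(s)
    if small_exon:
        failed = [number for number in failed if number != 6]
    return not failed
```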

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.
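The expression pre-filter and the median-based ranking can be sketched without external packages. This covers only the median threshold and the fold change; the Wilcoxon/BH step and SAMseq itself are not reproduced. The text does not define the fold change further, so the pseudocount used here to avoid division by zero is an assumption, as are the function names and example values.

```python
from statistics import median


def prefilter_genes(group_a, group_b, min_median=5.0):
    """Drop genes whose median RPKM is below min_median in both groups.

    group_a, group_b: dicts mapping gene -> list of RPKM values.
    """
    return [gene for gene in group_a
            if median(group_a[gene]) >= min_median
            or median(group_b[gene]) >= min_median]


def median_fold_change(a_values, b_values, pseudocount=1.0):
    """Ratio of group medians, with an assumed pseudocount."""
    return (median(a_values) + pseudocount) / (median(b_values) + pseudocount)


# Hypothetical RPKM values for three genes in two sample groups.
a = {"G1": [10, 12, 14], "G2": [1, 2, 3], "G3": [0, 1, 2]}
b = {"G1": [2, 3, 4], "G2": [8, 9, 10], "G3": [1, 1, 1]}
kept = prefilter_genes(a, b)
ranked = sorted(kept, key=lambda g: median_fold_change(a[g], b[g]), reverse=True)
print(ranked)
```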

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.
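Gene selection for the NMF input matrix can be sketched as below. The clustering itself is not shown. Two points are assumptions: the coefficient of variation uses the population standard deviation, and the top 25% is taken from the genes that pass both the noise and mean-expression filters. The function name and example values are hypothetical.

```python
from statistics import mean, pstdev


def nmf_input_genes(rpkm, noise=0.2, noise_frac=0.75, min_mean=10.0, top_frac=0.25):
    """Select the most variable expressed genes.

    rpkm: dict mapping gene -> list of RPKM values across samples.
    """
    cv = {}
    for gene, values in rpkm.items():
        at_noise = sum(1 for v in values if v <= noise) / len(values)
        if at_noise >= noise_frac:   # at or below noise in >= 75% of samples
            continue
        if mean(values) < min_mean:  # require a mean RPKM of at least 10
            continue
        cv[gene] = pstdev(values) / mean(values)  # coefficient of variation
    ranked = sorted(cv, key=cv.get, reverse=True)
    n_keep = max(1, int(len(ranked) * top_frac)) if ranked else 0
    return ranked[:n_keep]


# Hypothetical RPKM values across four samples.
example = {
    "A": [10, 10, 10, 10],
    "B": [5, 15, 5, 15],
    "C": [20, 22, 18, 20],
    "D": [1, 39, 1, 39],
    "E": [0.1, 0.1, 0.1, 100],
    "F": [1, 2, 3, 2],
}
print(nmf_input_genes(example))
```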

RNA-Seq (plate-based) library construction (pre-2014):

2-3 ug total RNA samples were arrayed into a 96-well plate and polyadenylated (PolyA+) RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA). Double-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript Double-Stranded cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM. The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The quality was checked for a random sampling on an Agilent Bioanalyzer using the High Sensitivity DNA chip Assay.  cDNA was fragmented by Covaris E210 (Covaris, USA) sonication for 55 seconds, a “Duty cycle” of 20% and “Intensity” of 5. Plate-based libraries were prepared following the BC Cancer Agency, Genome Sciences Centre paired-end (PE) protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailling by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-15 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. 
The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of desired size range was purified using an in-house 96-channel size selection robot, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final concentration was verified by Quant-iT dsDNA HS Assay prior to Illumina HiSeq2000 PE 75 base sequencing.

Strand-specific RNA-seq (plate based) library construction (post-2014):

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. 
The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large-scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high-confidence GMAP (v2012-12-20) alignments to two distinct genomic regions. Evidence for the alignments was provided by aligning reads back to the contigs and by aligning reads to genomic coordinates. Events were then filtered on read-support thresholds. Large-scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment-specific events, the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against the matching genome tumour and, where available, germline BAM files. This resulted in compartment-specific structural variant events and, where germline was available, putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on the positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30. The SNVs were further filtered to exclude those for which 1) the reference base was N; 2) only 1 read supported the variant; 3) the probability of the heterozygous and homozygous variant genotypes was smaller than 0.99; 4) the position overlapped with insertions or deletions; 5) read support came from positions no more than 5 bases from read ends; 6) support came only from reads spanning an exon-exon junction; 7) more than 0.5 of the supporting reads were improperly paired; or 8) fewer than 2 properly paired reads supported the variant. SNVs located in exons no longer than the read length (100 bp in this case) are a special case, because all of their coverage may come from reads spanning exon-exon junctions. We therefore also identified small-exonic SNVs that were supported only by reads spanning an exon-exon junction but passed the other 7 filtering criteria above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).
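
The eight exclusion criteria and the small-exon exception can be read as a simple decision rule. The sketch below is illustrative only: the `SnvCall` record and its field names are hypothetical, criterion 3 is interpreted as the summed probability of the two variant genotypes, and criterion 5 is interpreted as support coming only from positions near read ends. It is not the pipeline's actual code.

```python
from dataclasses import dataclass


@dataclass
class SnvCall:
    """Hypothetical per-SNV summary; field names are assumptions."""
    ref_base: str
    supporting_reads: int
    p_variant: float                   # P(heterozygous) + P(homozygous variant)
    overlaps_indel: bool
    support_only_near_read_ends: bool  # all support within 5 bases of a read end
    support_only_junction_reads: bool  # all support from exon-exon junction reads
    improper_pair_fraction: float      # fraction of supporting reads improperly paired
    proper_pair_support: int           # properly paired supporting reads


def failed_filters(call):
    """Return the set of criterion numbers (1-8) that the call fails."""
    checks = {
        1: call.ref_base.upper() == "N",
        2: call.supporting_reads <= 1,  # one read (or none) supports the variant
        3: call.p_variant < 0.99,
        4: call.overlaps_indel,
        5: call.support_only_near_read_ends,
        6: call.support_only_junction_reads,
        7: call.improper_pair_fraction > 0.5,
        8: call.proper_pair_support < 2,
    }
    return {number for number, failed in checks.items() if failed}


def classify(call, in_small_exon):
    """Apply the filters, keeping junction-only calls in exons <= read length."""
    failed = failed_filters(call)
    if not failed:
        return "pass"
    if failed == {6} and in_small_exon:
        # Special case: only criterion 6 fails and the exon is no longer
        # than the read length, so junction-only support is expected.
        return "pass_small_exon"
    return "fail"
```

Returning the set of failed criteria, rather than a single boolean, makes the small-exon exception a one-line comparison.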

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify differentially expressed genes. For each run on a pair of sample groups, we first reduced the number of genes by removing those with a median below 5 RPKM in both groups and those for which the Wilcoxon BH-adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change and generated a figure showing up to 10 of the largest fold changes in each direction.
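
The gene pre-filtering step that precedes SAMseq can be sketched as follows. This is a rough illustration, not the original R code: it assumes raw Wilcoxon P-values have already been computed for every gene, and that the Benjamini-Hochberg (BH) adjustment is applied across all of those genes. The function names and input layout are hypothetical.

```python
from statistics import median


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    n = len(pvalues)
    # Walk from the largest P-value down, enforcing monotonicity.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for steps_from_top, i in enumerate(order):
        rank = n - steps_from_top  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def prefilter(group_a, group_b, pvalues, min_median=5.0, alpha=0.05):
    """Return genes to submit to SAMseq.

    group_a, group_b: dict of gene -> list of RPKM values per sample.
    pvalues: dict of gene -> raw Wilcoxon P-value (assumed precomputed).
    """
    genes = list(pvalues)
    adj = dict(zip(genes, bh_adjust([pvalues[g] for g in genes])))
    kept = []
    for gene in genes:
        low_in_both = (median(group_a[gene]) < min_median
                       and median(group_b[gene]) < min_median)
        if not low_in_both and adj[gene] <= alpha:
            kept.append(gene)
    return kept
```

A gene is dropped only when its median is below the threshold in both groups, so a gene that is silent in one group and expressed in the other is retained.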

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix from the top 25% most-variant genes, ranking expressed genes with a mean RPKM of at least 10 by their coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, using the default Brunet algorithm and 200 iterations for the clustering run. Rank survey profiles of the cophenetic correlation and silhouette width were used to suggest a specific cluster solution.
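
The gene selection that builds the NMF input matrix can be illustrated with a short sketch. Several details are assumptions rather than facts from the protocol: the top 25% is taken from the genes that survive both filters, the coefficient of variation uses the population standard deviation, and the function name `select_nmf_genes` is hypothetical.

```python
from statistics import mean, pstdev


def select_nmf_genes(rpkm, noise=0.2, noise_fraction=0.75,
                     min_mean=10.0, top_fraction=0.25):
    """Pick the most variable expressed genes for an NMF input matrix.

    rpkm: dict of gene -> list of RPKM values, one per sample.
    """
    cv_by_gene = {}
    for gene, values in rpkm.items():
        # Drop genes at or below the noise threshold in >= 75% of samples.
        at_noise = sum(v <= noise for v in values) / len(values)
        if at_noise >= noise_fraction:
            continue
        average = mean(values)
        if average < min_mean:
            continue
        # Coefficient of variation = standard deviation / mean.
        cv_by_gene[gene] = pstdev(values) / average
    ranked = sorted(cv_by_gene, key=cv_by_gene.get, reverse=True)
    if not ranked:
        return []
    n_keep = max(1, int(len(ranked) * top_fraction))
    return ranked[:n_keep]
```

Ranking by the coefficient of variation rather than the raw variance avoids favouring genes simply because they are highly expressed.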

RNA-Seq (plate-based) library construction (pre-2014):

Total RNA samples (2-3 µg) were arrayed into a 96-well plate, and polyadenylated (PolyA+) RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on-column DNaseI treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10 µL of DEPC-treated water with 1:20 SuperaseIN (Life Technologies, USA). Double-stranded cDNA was synthesized from the purified PolyA+ RNA using the Superscript Double-Stranded cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5 µM. The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and a VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The quality of a random sampling was checked on an Agilent Bioanalyzer using the High Sensitivity DNA chip Assay. The cDNA was fragmented by Covaris E210 (Covaris, USA) sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. Plate-based libraries were prepared following the BC Cancer Agency Genome Sciences Centre paired-end (PE) protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads and was subjected to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, PicoGreen quantification was performed to determine the amount of Illumina PE adapters to use in the subsequent adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set, with cycling conditions of 98°C for 30 sec, followed by 10-15 cycles of 98°C for 10 sec, 65°C for 30 sec and 72°C for 30 sec, and then 72°C for 5 min. The PCR products were purified using Ampure XP SPRI beads and checked with the Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using an in-house 96-channel size selection robot, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and the Quant-iT dsDNA HS Assay Kit on a Qubit fluorometer (Invitrogen), then diluted to 8 nM. The final concentration was verified by Quant-iT dsDNA HS Assay prior to Illumina HiSeq2000 PE 75 base sequencing.
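
The dilution to 8 nM depends on converting a mass concentration into a molar one. The protocol does not state how this was done; the sketch below uses the common approximation of 660 g/mol per base pair of double-stranded DNA, and the function names, the example final volume and the inputs are all hypothetical.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair of dsDNA (common approximation)


def library_nM(conc_ng_per_ul, mean_fragment_bp):
    """Convert a dsDNA concentration in ng/uL to nM for a given mean size."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * mean_fragment_bp)


def dilution_volumes(conc_ng_per_ul, mean_fragment_bp,
                     target_nM=8.0, final_ul=20.0):
    """Return (library_ul, diluent_ul) needed to reach target_nM in final_ul."""
    stock_nM = library_nM(conc_ng_per_ul, mean_fragment_bp)
    if stock_nM < target_nM:
        raise ValueError("library is already below the target concentration")
    library_ul = final_ul * target_nM / stock_nM
    return library_ul, final_ul - library_ul
```

For example, a library at 2.64 ng/µL with a mean fragment size of 400 bp works out to about 10 nM under this approximation, so a 20 µL dilution to 8 nM would take roughly 16 µL of library and 4 µL of diluent.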

Strand-specific RNA-seq (plate based) library construction (post-2014):

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified from 2 µg total RNA using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on-column DNaseI treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10 µL of DEPC-treated water with 1:20 SuperaseIN (Life Technologies, USA).

First-strand cDNA was synthesized from the purified PolyA+ RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5 µM along with a final concentration of 1 µg/µL Actinomycin D, followed by cleanup with Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second-strand cDNA was synthesized following the Superscript cDNA Synthesis protocol, replacing dTTP with dUTP in the dNTP mix so that the second strand could be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction, thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and a VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based, paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads and was subjected to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, PicoGreen quantification was performed to determine the amount of Illumina PE adapters to use in the subsequent adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads and digested with UNG (1 U/µL) at 37°C for 30 min, followed by deactivation at 95°C for 15 min. The digested cDNA was purified using Ampure XP SPRI beads and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set, with cycling conditions of 98°C for 30 sec, followed by 10-13 cycles of 98°C for 10 sec, 65°C for 30 sec and 72°C for 30 sec, and then 72°C for 5 min. The PCR products were purified using Ampure XP SPRI beads and checked with the Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and the Quant-iT dsDNA HS Assay Kit on a Qubit fluorometer (Invitrogen), then diluted to 8 nM. The final library concentration was verified again by Quant-iT dsDNA HS Assay prior to Illumina sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to the GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly with exon-exon junction sequences whose coordinates were defined from the annotations of all transcripts in Ensembl (v59), RefSeq and known genes from the UCSC genome browser, which were downloaded on August 19, 2010, August 8, 2010, and August 19, 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome and marked with 'ZJ:Z' tags. BWA was run using default parameters, except that the option (-s) was included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter were flagged with a custom script, and duplicate reads were flagged with Picard Tools.
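The flagging step can be illustrated with the FLAG field of a SAM/BAM record. This is a minimal sketch, not the custom script itself: the text does not say which bit that script sets, so the use of the QC-fail bit (0x200) for chastity failures is an assumption, and the function names are hypothetical. The duplicate bit (0x400) is the one Picard's duplicate marking sets.

```python
# SAM FLAG bits as defined by the SAM specification.
FLAG_QC_FAIL = 0x200    # read fails platform/vendor quality checks
FLAG_DUPLICATE = 0x400  # read is a PCR or optical duplicate


def flag_read(flag: int, fails_chastity: bool, is_duplicate: bool) -> int:
    """Return the FLAG with the QC-fail and duplicate bits set as needed.

    Assumption: chastity failures are recorded in the 0x200 bit.
    """
    if fails_chastity:
        flag |= FLAG_QC_FAIL
    if is_duplicate:
        flag |= FLAG_DUPLICATE
    return flag


def is_usable(flag: int) -> bool:
    """True when neither the QC-fail nor the duplicate bit is set."""
    return (flag & (FLAG_QC_FAIL | FLAG_DUPLICATE)) == 0


if __name__ == "__main__":
    original = 99  # paired, properly paired, mate reverse, first in pair
    flagged = flag_read(original, fails_chastity=True, is_duplicate=False)
    print(original, is_usable(original))
    print(flagged, is_usable(flagged))
```

Because the reads are flagged rather than removed, downstream tools can decide for themselves whether to skip them.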

 

Structural variant detection:

Structural variant detection was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq, assemblies were performed at alternate k-mer values from k50 to k96, once using positive-strand plus ambiguous-strand reads and once using negative-strand plus ambiguous-strand reads. The positive- and negative-strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single-end mode using k-mer values of k24 and k44. The contigs and reads were then reassembled at k64 in single-end mode and finally at k64 in paired-end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large-scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high-confidence GMAP (v2012-12-20) alignments to two distinct genomic regions. Evidence for the alignments was provided by aligning reads back to the contigs and by aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large-scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment-specific events, the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against the matching genome tumour BAM files and, where available, germline BAM files. This resulted in compartment-specific structural variant events and, where germline was available, putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on the positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30. The SNVs were further filtered to exclude those called on the basis of 1) reference base N; 2) only 1 read supporting the variant; 3) probability of the heterozygous and homozygous variant genotypes smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read support coming from positions no more than 5 bases from read ends; 6) support coming only from reads spanning an exon-exon junction; 7) more than 0.5 of the supporting reads being improperly paired; 8) fewer than 2 properly paired supporting reads. SNVs located in exons equal to or smaller than the read length (100 bp in this case) are a special case, because all of their coverage may come from reads spanning exon-exon junctions. We therefore also identified small-exonic SNVs that were supported only by reads spanning an exon-exon junction but passed the other 7 filtering criteria above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).
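The eight exclusion criteria and the small-exon exception can be sketched as a single filter function. This is an illustrative sketch only: the record layout and every field name are hypothetical, criterion 3 is assumed to test the sum of the heterozygous and homozygous variant probabilities, and criterion 5 is assumed to mean that all support lies within 5 bases of read ends.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SnvCall:
    """Hypothetical per-site summary; field names are illustrative only."""
    ref_base: str
    supporting_reads: int
    p_het: float                      # genotype probability, heterozygous
    p_hom: float                      # genotype probability, homozygous variant
    overlaps_indel: bool
    support_only_near_read_ends: bool  # assumed reading of criterion 5
    support_only_junction_reads: bool
    improper_pair_support: int
    proper_pair_support: int
    exon_length: int


def failed_filters(call: SnvCall) -> List[int]:
    """Return the numbers (1-8) of the exclusion criteria that the call hits."""
    failed = []
    if call.ref_base.upper() == "N":
        failed.append(1)
    if call.supporting_reads <= 1:
        failed.append(2)
    if call.p_het + call.p_hom < 0.99:  # assumption: probabilities are summed
        failed.append(3)
    if call.overlaps_indel:
        failed.append(4)
    if call.support_only_near_read_ends:
        failed.append(5)
    if call.support_only_junction_reads:
        failed.append(6)
    if call.supporting_reads > 0 and (
        call.improper_pair_support / call.supporting_reads > 0.5
    ):
        failed.append(7)
    if call.proper_pair_support < 2:
        failed.append(8)
    return failed


def classify(call: SnvCall, read_length: int = 100) -> str:
    """Classify a call as passing, small-exon junction-only, or failing."""
    failed = failed_filters(call)
    if not failed:
        return "pass"
    # Small-exon special case: only criterion 6 is hit, exon <= read length.
    if failed == [6] and call.exon_length <= read_length:
        return "small_exon_junction_only"
    return "fail"
```

Returning the list of failed criteria, rather than a single boolean, makes the small-exon exception a one-line check and keeps the reason for each rejection available for review.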

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with a median below 5 RPKM in both groups, and those for which the Benjamini-Hochberg (BH) adjusted Wilcoxon P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and genes ‘down’. We then ranked the genes by a median-based fold change and generated a figure showing up to 10 of the largest fold changes in each direction.
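The gene pre-filter applied before SAMseq can be sketched as follows. This is a minimal sketch under stated assumptions: the Wilcoxon P-values are taken as already computed, the BH adjustment is assumed to run over every gene supplied, and the function and argument names are hypothetical. The SAMseq step itself is not reproduced.

```python
from statistics import median
from typing import Dict, List, Sequence


def bh_adjust(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def prefilter(
    rpkm_a: Dict[str, Sequence[float]],
    rpkm_b: Dict[str, Sequence[float]],
    wilcoxon_p: Dict[str, float],
    min_median: float = 5.0,
    max_adj_p: float = 0.05,
) -> List[str]:
    """Drop genes that are low in both groups or not significant after BH."""
    genes = list(wilcoxon_p)
    adj = dict(zip(genes, bh_adjust([wilcoxon_p[g] for g in genes])))
    kept = []
    for g in genes:
        low_in_both = (
            median(rpkm_a[g]) < min_median and median(rpkm_b[g]) < min_median
        )
        if low_in_both or adj[g] > max_adj_p:
            continue
        kept.append(g)
    return kept
```

A gene is removed on the expression criterion only when its median is below the threshold in both groups, so a gene that is silent in one group and expressed in the other is retained.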

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix from the top 25% most-variant genes, ranking expressed genes with a mean RPKM of at least 10 by their coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, using the default Brunet algorithm and 200 iterations for the clustering run. Rank survey profiles of the cophenetic correlation coefficient and silhouette width were used to suggest a cluster solution.
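The gene selection that produces the NMF input matrix can be sketched as below. Several details are assumptions not stated in the text: the top 25% is taken over the genes that survive both expression filters, the count is rounded down (with a minimum of one gene), and the population standard deviation is used for the coefficient of variation. The function name is hypothetical, and the NMF run itself is not shown.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def select_nmf_genes(
    rpkm: Dict[str, Sequence[float]],
    noise: float = 0.2,
    noise_fraction: float = 0.75,
    min_mean: float = 10.0,
    top_fraction: float = 0.25,
) -> List[str]:
    """Return the most variable expressed genes, ranked by CV (descending)."""
    scored = []
    for gene, values in rpkm.items():
        # Noise filter: drop genes at or below `noise` in >= 75% of samples.
        n_noise = sum(1 for v in values if v <= noise)
        if n_noise >= noise_fraction * len(values):
            continue
        m = mean(values)
        if m < min_mean:
            continue
        scored.append((pstdev(values) / m, gene))  # coefficient of variation
    # Highest CV first; ties broken by gene name for reproducibility.
    scored.sort(key=lambda t: (-t[0], t[1]))
    n_keep = max(1, int(len(scored) * top_fraction)) if scored else 0
    return [gene for _, gene in scored[:n_keep]]
```

With equal sample counts per gene, the choice between the population and sample standard deviation rescales every CV by the same factor and does not change the ranking.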

RNA-Seq (plate-based) library construction (pre-2014):

Total RNA samples (2-3 µg) were arrayed into a 96-well plate, and polyadenylated (PolyA+) RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on-column DNaseI treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10 µL of DEPC-treated water with 1:20 SuperaseIN (Life Technologies, USA). Double-stranded cDNA was synthesized from the purified polyA+ RNA using the Superscript Double-Stranded cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5 µM. The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and a VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The quality of a random sampling was checked on an Agilent Bioanalyzer using the High Sensitivity DNA chip Assay. cDNA was fragmented by Covaris E210 (Covaris, USA) sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. Plate-based libraries were prepared following the BC Cancer Agency Genome Sciences Centre paired-end (PE) protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads and was subjected to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, PicoGreen quantification was performed to determine the amount of Illumina PE adapters to use in the subsequent adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set, with cycle conditions of 98°C for 30 sec, followed by 10-15 cycles of 98°C for 10 sec, 65°C for 30 sec and 72°C for 30 sec, and then 72°C for 5 min. The PCR products were purified using Ampure XP SPRI beads and checked with a Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using an in-house 96-channel size selection robot, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and the Quant-iT dsDNA HS Assay Kit on a Qubit fluorometer (Invitrogen), then diluted to 8 nM. The final concentration was verified by Quant-iT dsDNA HS Assay prior to Illumina HiSeq2000 PE 75 base sequencing.
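The dilution to 8 nM depends on converting a mass concentration and a mean fragment size into molarity. The protocol does not give the formula used, so the sketch below relies on the common approximation of 660 g/mol per base pair of double-stranded DNA, and the function names are hypothetical.

```python
BP_MOLAR_MASS = 660.0  # g/mol per base pair; a common dsDNA approximation


def library_molarity_nm(conc_ng_per_ul: float, mean_fragment_bp: float) -> float:
    """Convert a dsDNA concentration (ng/uL) to nanomolar.

    nM = (ng/uL * 1e6) / (660 g/mol/bp * mean fragment length in bp)
    """
    return conc_ng_per_ul * 1e6 / (BP_MOLAR_MASS * mean_fragment_bp)


def diluent_volume_ul(stock_nm: float, stock_ul: float, target_nm: float = 8.0) -> float:
    """Volume of diluent to add so that `stock_ul` of stock reaches `target_nm`."""
    if stock_nm < target_nm:
        raise ValueError("stock is already below the target concentration")
    # C1 * V1 = C2 * V2, so V2 = C1 * V1 / C2; diluent is V2 - V1.
    return stock_ul * stock_nm / target_nm - stock_ul


if __name__ == "__main__":
    molarity = library_molarity_nm(conc_ng_per_ul=5.28, mean_fragment_bp=500)
    print(round(molarity, 2), "nM")
    print(round(diluent_volume_ul(molarity, stock_ul=10.0), 2), "uL of diluent")
```

The mean fragment length would come from the size assay and the mass concentration from the fluorometric assay, which is why both measurements precede the dilution.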

Strand-specific RNA-seq (plate based) library construction (post-2014):

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified from 2 µg of total RNA using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on-column DNaseI treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10 µL of DEPC-treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-strand cDNA was synthesized from the purified polyA+ RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5 µM along with a final concentration of 1 µg/µL Actinomycin D, followed by cleanup with Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second-strand cDNA was synthesized following the Superscript cDNA Synthesis protocol, replacing dTTP with dUTP in the dNTP mix so that the second strand could be digested with UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction, thereby achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and a VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based, paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads and was subjected to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, PicoGreen quantification was performed to determine the amount of Illumina PE adapters to use in the subsequent adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads and digested with UNG (1 U/µL) at 37°C for 30 min, followed by deactivation at 95°C for 15 min. The digested cDNA was purified using Ampure XP SPRI beads and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set, with cycle conditions of 98°C for 30 sec, followed by 10-13 cycles of 98°C for 10 sec, 65°C for 30 sec and 72°C for 30 sec, and then 72°C for 5 min. The PCR products were purified using Ampure XP SPRI beads and checked with a Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and the Quant-iT dsDNA HS Assay Kit on a Qubit fluorometer (Invitrogen), then diluted to 8 nM. The final library concentration was verified again with the Quant-iT dsDNA HS Assay prior to Illumina sequencing.

Strand-specific ribodepletion RNA sequencing:

Enzymatic reactions were set up in a 96-well plate (Thermo Fisher Scientific) on a Microlab NIMBUS liquid handler (Hamilton Robotics, USA). 100 ng of DNase I treated total RNA in 6 µL was hybridized to rRNA probes in a 7.5 µL reaction. Heat-sealed plates were incubated at 95°C for 2 minutes, followed by an incremental reduction in temperature of 0.1°C per second to 22°C (730 cycles). The rRNA in the rRNA-DNA hybrids was digested using RNase H in a 10 µL reaction incubated in a thermocycler at 37°C for 30 minutes. To remove excess rRNA probes (DNA) and residual genomic DNA contamination, DNase I was added in a total reaction volume of 25 µL and incubated at 37°C for 30 minutes. RNA was purified using RNA MagClean DX beads (Aline Biosciences, USA) with 15 minutes of binding time and 7 minutes of clearing on a magnet, followed by two 70% ethanol washes, 5 minutes to air dry the RNA pellet and elution in 36 µL DEPC water. The plate containing RNA was stored at -80°C prior to cDNA synthesis.
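The 730 cycles quoted for the hybridization ramp follow from the temperature span and the ramp rate. The short check below assumes that each cycle is a single one-second step of 0.1°C, which the text implies but does not state; the function name is hypothetical.

```python
def ramp_steps(start_c: float, end_c: float, rate_c_per_s: float) -> int:
    """Number of one-second steps needed to ramp down from start to end."""
    return round((start_c - end_c) / rate_c_per_s)


if __name__ == "__main__":
    steps = ramp_steps(95.0, 22.0, 0.1)
    print(steps, "steps, about", round(steps / 60, 1), "minutes")
```

Under that assumption the ramp lasts a little over 12 minutes, in addition to the initial 2 minutes at 95°C.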

First-strand cDNA was synthesized from the purified RNA (minus rRNA) using the Maxima H Minus First Strand cDNA Synthesis kit (Thermo Fisher, USA) and random hexamer primers at a concentration of 8 ng/µL along with a final concentration of 0.4 µg/µL Actinomycin D, followed by PCR Clean DX bead purification on a Microlab NIMBUS robot (Hamilton Robotics, USA). The second-strand cDNA was synthesized following the NEBNext Ultra Directional Second Strand cDNA Synthesis protocol (NEB), which incorporates dUTP in the dNTP mix so that the second strand can be digested with USER enzyme (NEB) in the post-adapter ligation reaction, thereby achieving strand specificity.

cDNA was fragmented by Covaris LE220 sonication for 130 seconds (2 x 65 seconds) at a “Duty cycle” of 30%, 450 Peak Incident Power (W) and 200 Cycles per Burst in a 96-well microTUBE Plate (P/N: 520078) to achieve average fragment lengths of 200-250 bp. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based library construction protocol on a Microlab NIMBUS robot (Hamilton Robotics, USA). Briefly, the sheared cDNA was subjected to end-repair and phosphorylation in a single reaction using an enzyme premix (NEB) containing T4 DNA polymerase, Klenow DNA Polymerase and T4 polynucleotide kinase, incubated at 20°C for 30 minutes. Repaired cDNA was purified in 96-well format using PCR Clean DX beads (Aline Biosciences, USA) and 3’ A-tailed (adenylation) using Klenow fragment (3’ to 5’ exo minus) with incubation at 37°C for 30 minutes prior to enzyme heat inactivation. Illumina PE adapters were ligated at 20°C for 15 minutes. The adapter-ligated products were purified using PCR Clean DX beads, then digested with USER enzyme (1 U/µL, NEB) at 37°C for 15 minutes, followed immediately by 13 cycles of indexed PCR using Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) and Illumina’s PE primer set. PCR parameters: 98°C for 1 minute, followed by 13 cycles of 98°C for 15 seconds, 65°C for 30 seconds and 72°C for 30 seconds, and then 72°C for 5 minutes. The PCR products were purified and size-selected using a 1:1 PCR Clean DX beads-to-sample ratio (twice), and the eluted DNA quality was assessed with a Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA) and quantified using a Quant-iT dsDNA High Sensitivity Assay Kit on a Qubit fluorometer (Invitrogen) prior to library pooling and size-corrected final molar concentration calculation for Illumina HiSeq2500 sequencing with paired-end 75 base reads.

RNA-seq lite library construction (AML-IF):

For each sample, approximately 10 ng of total RNA was processed using the SMART(TM) cDNA synthesis protocol, including SMARTScribe Reverse Transcriptase (Clontech, #639536). This method uses a modified oligo(dT) primer to prime the first-strand synthesis reaction and a template-switching mechanism to generate full-length single-stranded cDNAs containing the complete 5’ end of the mRNA as well as universal priming sequences for end-to-end amplification during 20 cycles of PCR. The amplified cDNA was subjected to Illumina paired-end library construction using the NEBNext paired-end DNA sample Prep Kit (NEB, E6000B-25). Libraries were sequenced on Illumina HiSeq2000 instruments.


Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. 
The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.
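The gene pre-filter and the fold-change ranking can be sketched in a few lines. This sketch makes several assumptions that the text does not state: the raw Wilcoxon P-values are supplied by the caller, the Benjamini-Hochberg adjustment is applied across all supplied genes, and a pseudocount of 1 RPKM is added before taking the ratio of medians. All function names are hypothetical, and SAMseq itself is not reimplemented here.

```python
from statistics import median


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = m - rank_from_top  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def prefilter(rpkm_a, rpkm_b, raw_p, min_median=5.0, max_adj_p=0.05):
    """Drop genes with median < min_median in both groups or adjusted P > max_adj_p.

    rpkm_a, rpkm_b: dict of gene -> list of RPKM values in each sample group.
    raw_p: dict of gene -> unadjusted Wilcoxon P-value between the groups.
    """
    genes = list(raw_p)
    adj = dict(zip(genes, bh_adjust([raw_p[g] for g in genes])))
    kept = []
    for g in genes:
        low_in_both = median(rpkm_a[g]) < min_median and median(rpkm_b[g]) < min_median
        if not low_in_both and adj[g] <= max_adj_p:
            kept.append(g)
    return kept


def rank_by_fold_change(genes, rpkm_a, rpkm_b, pseudocount=1.0):
    """Order genes from largest to smallest ratio of group medians (A over B)."""
    fc = {
        g: (median(rpkm_a[g]) + pseudocount) / (median(rpkm_b[g]) + pseudocount)
        for g in genes
    }
    return sorted(genes, key=lambda g: fc[g], reverse=True)
```

The head of the ranked list gives the largest changes in one direction and the tail gives the largest in the other, which is the information the summary figure draws on.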

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix from the top 25% most-variant genes, ranking expressed genes with a mean RPKM of at least 10 by coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, using the default Brunet algorithm and 200 iterations for the clustering run. Rank survey profiles of the cophenetic correlation and silhouette width suggest a specific cluster solution.
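The gene selection for the NMF input matrix can be sketched as follows. The sketch assumes that the top 25% is taken from the genes that pass both the noise filter and the mean-RPKM cutoff, and that the coefficient of variation uses the population standard deviation; neither detail is stated in the text. The function name is hypothetical, and the NMF clustering itself is not shown.

```python
from statistics import mean, pstdev


def nmf_input_genes(rpkm, noise=0.2, noise_fraction=0.75,
                    min_mean=10.0, top_fraction=0.25):
    """Select the most-variant genes for an NMF input matrix.

    rpkm: dict of gene -> list of RPKM values across samples.
    Returns gene names ordered by decreasing coefficient of variation.
    """
    candidates = {}
    for gene, values in rpkm.items():
        # Drop genes at or below the noise threshold in >= 75% of samples.
        at_noise = sum(v <= noise for v in values)
        if at_noise / len(values) >= noise_fraction:
            continue
        # Keep only genes with a mean RPKM of at least 10.
        if mean(values) >= min_mean:
            candidates[gene] = values

    # Rank by coefficient of variation (standard deviation over mean).
    cv = {g: pstdev(v) / mean(v) for g, v in candidates.items()}
    ranked = sorted(cv, key=cv.get, reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    return ranked[:n_top]
```

The rows of the NMF input matrix would then be the RPKM values of the returned genes across all samples.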

RNA-seq lite library construction (AML-IF):

For each sample, approximately 10 ng of total RNA was processed using the SMART(TM) cDNA synthesis protocol, including SMARTScribe Reverse Transcriptase (Clontech, #639536). This method uses a modified oligo(dT) primer to prime the first-strand synthesis reaction and a template-switching mechanism to generate full-length single-stranded cDNAs containing the complete 5’ end of the mRNA as well as universal priming sequences for end-to-end amplification during 20 cycles of PCR. The amplified cDNA was subjected to Illumina paired-end library construction using the NEBNext paired-end DNA Sample Prep Kit (NEB, E6000B-25). Libraries were sequenced on Illumina HiSeq2000 instruments.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified from 2 µg of total RNA using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany), with on-column DNase I treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10 µL of DEPC-treated water with 1:20 SuperaseIN (Life Technologies, USA).

First-strand cDNA was synthesized from the purified polyA+ RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5 µM, along with Actinomycin D at a final concentration of 1 µg/µL, followed by cleanup with Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second-strand cDNA was synthesized following the Superscript cDNA Synthesis protocol, replacing dTTP with dUTP in the dNTP mix. This allowed the second strand to be digested with UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter-ligation reaction, thereby achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and a VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and an “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based, paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads and was subjected to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA polymerase, and T4 polynucleotide kinase in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, PicoGreen quantification was performed to determine the amount of Illumina PE adapters to use in the subsequent adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads and digested with UNG (1 U/µL) at 37°C for 30 min, followed by deactivation at 95°C for 15 min. The digested cDNA was purified using Ampure XP SPRI beads and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set, with cycling conditions of 98°C for 30 sec, followed by 10-13 cycles of 98°C for 10 sec, 65°C for 30 sec and 72°C for 30 sec, and then 72°C for 5 min. The PCR products were purified using Ampure XP SPRI beads and checked with a Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and the Quant-iT dsDNA HS Assay Kit on a Qubit fluorometer (Invitrogen), then diluted to 8 nM. The final library concentration was confirmed with a second Quant-iT dsDNA HS Assay before Illumina sequencing.
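The dilution to 8 nM depends on converting a mass concentration into a molar one. A minimal sketch of that arithmetic is given here. It assumes the common approximation of 660 g/mol per base pair of double-stranded DNA and takes the mean fragment length from the sizing assay. The function names and the 20 µL final volume are illustrative and are not part of the protocol.

```python
def dsdna_nanomolar(ng_per_ul, mean_fragment_bp, g_per_mol_per_bp=660.0):
    """Convert a dsDNA mass concentration (ng/uL) to molarity (nM)."""
    return ng_per_ul * 1e6 / (g_per_mol_per_bp * mean_fragment_bp)


def dilution_volumes(stock_nm, target_nm=8.0, final_ul=20.0):
    """Return (library_ul, diluent_ul) needed to reach target_nm in final_ul."""
    if stock_nm < target_nm:
        raise ValueError("stock is already below the target concentration")
    library_ul = target_nm * final_ul / stock_nm
    return library_ul, final_ul - library_ul


# Example: a 10 ng/uL library with a mean fragment length of 400 bp.
stock = dsdna_nanomolar(10.0, 400)        # roughly 37.9 nM
library_ul, diluent_ul = dilution_volumes(stock)
```

Re-quantifying after dilution, as the protocol does, guards against pipetting error and against a mean fragment length that differs from the value used in the conversion.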

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to the GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly with exon-exon junction sequences whose coordinates were defined from annotations of any transcripts in Ensembl (v59), RefSeq and known genes from the UCSC genome browser, which were downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome and marked with 'ZJ:Z' tags. BWA was run using default parameters, except that the option (-s) was included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter were flagged with a custom script, and duplicated reads were flagged with Picard Tools.
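The repositioning of junction-mapped reads back to the genome can be illustrated with a small coordinate calculation. This is a sketch under stated assumptions, not the pipeline's actual script. It assumes a junction sequence built from the last `flank` bases of the upstream exon followed by the first bases of the downstream exon, an ungapped forward-strand alignment, and 1-based genomic coordinates. The function name and parameters are hypothetical, and the content of the 'ZJ:Z' tag is not reproduced.

```python
def reposition_junction_read(offset, read_len, flank, exon1_end, exon2_start):
    """Map an ungapped alignment on a junction sequence to genome coordinates.

    offset: 0-based start of the read on the junction sequence.
    flank: number of upstream-exon bases at the start of the junction sequence.
    exon1_end, exon2_start: 1-based genomic coordinates of the exon boundaries.
    Returns (genomic_start, cigar) for the repositioned read.
    """
    if offset + read_len <= flank:
        # Read lies entirely within the upstream exon part.
        return exon1_end - flank + 1 + offset, f"{read_len}M"
    if offset >= flank:
        # Read lies entirely within the downstream exon part.
        return exon2_start + (offset - flank), f"{read_len}M"
    # Read spans the junction: split it and skip the intron with an N operation.
    left = flank - offset
    right = read_len - left
    intron = exon2_start - exon1_end - 1
    return exon1_end - left + 1, f"{left}M{intron}N{right}M"


# A 100 bp read starting 50 bases into a junction sequence with a 99 bp flank.
start, cigar = reposition_junction_read(50, 100, 99, exon1_end=1000, exon2_start=2001)
```

In this example the read is placed at position 952 with CIGAR `49M1000N51M`, so that its first 49 bases cover the end of the upstream exon and the remaining 51 bases cover the start of the downstream exon.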

 

Structural variant detection:

Structural variant detection was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly, assemblies at alternate k-mer values from k50 to k96 were performed using positive-strand and ambiguous-strand reads, as well as negative-strand and ambiguous-strand reads. The positive- and negative-strand assemblies were extended where possible, merged and then concatenated to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single-end mode using k-mer values of k24 and k44. The contigs and reads were then reassembled at k64 in single-end mode and finally at k64 in paired-end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large-scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high-confidence GMAP (v2012-12-20) alignments to two distinct genomic regions. Evidence for the alignments was provided by aligning reads back to the contigs and by aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large-scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment-specific events, the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated and screened against the matching tumour genome and, where available, germline BAM files. This yielded compartment-specific structural variant events and, where germline data were available, putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

RNA-seq lite library construction (AML-IF):

For each sample, approximately 10ng of total RNA was processed using the SMART(TM) cDNA synthesis protocol including SMARTScribe Reverse Transcriptase (Clontech, #639536). This method deploys a modified oligo(dT) primer to prime the first strand synthesis reaction and a template switching mechanism to generate full-length single-stranded cDNAs containing the complete 5’ end of the mRNA as well as universal priming sequences for end-to-end amplification during 20 cycles of PCR. The amplified cDNA was subject to Illumina paired-end library construction using NEBNext paired-end DNA sample Prep Kit (NEB, E6000B-25). Libraries were sequenced on Illumina HiSeq2000 instruments.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and a VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based, paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads and was subjected to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA polymerase and T4 polynucleotide kinase in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, PicoGreen quantification was performed to determine the amount of Illumina PE adapters to use in the subsequent adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads and digested with UNG (1 U/µL) at 37 °C for 30 minutes, followed by deactivation at 95 °C for 15 minutes. The digested cDNA was purified using Ampure XP SPRI beads and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set, with cycling conditions of 98 °C for 30 seconds, followed by 10-13 cycles of 98 °C for 10 seconds, 65 °C for 30 seconds and 72 °C for 30 seconds, and then 72 °C for 5 minutes. The PCR products were purified using Ampure XP SPRI beads and checked with the Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and the Quant-iT dsDNA HS Assay Kit on a Qubit fluorometer (Invitrogen), then diluted to 8 nM. The final library concentration was verified with a second Quant-iT dsDNA HS Assay prior to Illumina sequencing.

Strand-specific ribodepletion RNA sequencing:

Enzymatic reactions were set up in a 96-well plate (Thermo Fisher Scientific) on a Microlab NIMBUS liquid handler (Hamilton Robotics, USA). 100 ng of DNase I-treated total RNA in 6 µL was hybridized to rRNA probes in a 7.5 µL reaction. Heat-sealed plates were incubated at 95 °C for 2 minutes, followed by an incremental reduction in temperature of 0.1 °C per second to 22 °C (730 cycles). The rRNA in the RNA:DNA hybrids was digested using RNase H in a 10 µL reaction incubated in a thermocycler at 37 °C for 30 minutes. To remove excess rRNA probes (DNA) and residual genomic DNA contamination, DNase I was added in a total reaction volume of 25 µL and incubated at 37 °C for 30 minutes. RNA was purified using RNA MagClean DX beads (Aline Biosciences, USA) with 15 minutes of binding time and 7 minutes of clearing on a magnet, followed by two 70% ethanol washes, 5 minutes of air drying of the RNA pellet and elution in 36 µL of DEPC water. The plate containing RNA was stored at -80 °C prior to cDNA synthesis.
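The "730 cycles" figure for the probe hybridization ramp follows directly from the stated temperatures and step size. The short sketch below only reproduces that arithmetic; the function name is hypothetical and nothing here is taken from the instrument program itself.

```python
def ramp_steps(start_c, end_c, step_c):
    """Number of equal temperature decrements needed to go from start_c to end_c."""
    return round((start_c - end_c) / step_c)


# Values stated in the protocol: 95 °C down to 22 °C in 0.1 °C steps, one step per second.
steps = ramp_steps(95.0, 22.0, 0.1)
ramp_seconds = steps * 1  # one step per second
print(steps, "steps,", ramp_seconds, "seconds")
```

At one step per second, the ramp therefore takes a little over 12 minutes in addition to the initial 2 minute hold.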

First-strand cDNA was synthesized from the purified RNA (minus rRNA) using the Maxima H Minus First Strand cDNA Synthesis kit (Thermo-Fisher, USA) and random hexamer primers at a concentration of 8 ng/µL, along with Actinomycin D at a final concentration of 0.4 µg/µL, followed by PCR Clean DX bead purification on a Microlab NIMBUS robot (Hamilton Robotics, USA). The second-strand cDNA was synthesized following the NEBNext Ultra Directional Second Strand cDNA Synthesis protocol (NEB), which incorporates dUTP in the dNTP mix. This allows the second strand to be digested using USER™ enzyme (NEB) in the post-adapter ligation reaction, thereby achieving strand specificity.

cDNA was fragmented by Covaris LE220 sonication for 130 seconds (2 x 65 seconds) at a “Duty cycle” of 30%, 450 W Peak Incident Power and 200 Cycles per Burst in a 96-well microTUBE Plate (P/N: 520078) to achieve average fragment lengths of 200-250 bp. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based library construction protocol on a Microlab NIMBUS robot (Hamilton Robotics, USA). Briefly, the sheared cDNA was subjected to end-repair and phosphorylation in a single reaction using an enzyme premix (NEB) containing T4 DNA polymerase, Klenow DNA polymerase and T4 polynucleotide kinase, incubated at 20 °C for 30 minutes. Repaired cDNA was purified in 96-well format using PCR Clean DX beads (Aline Biosciences, USA), and 3’ A-tailed (adenylated) using Klenow fragment (3’ to 5’ exo minus) with incubation at 37 °C for 30 minutes prior to enzyme heat inactivation. Illumina PE adapters were ligated at 20 °C for 15 minutes. The adapter-ligated products were purified using PCR Clean DX beads, then digested with USER™ enzyme (1 U/µL, NEB) at 37 °C for 15 minutes, followed immediately by 13 cycles of indexed PCR using Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) and Illumina’s PE primer set. PCR parameters: 98 °C for 1 minute, followed by 13 cycles of 98 °C for 15 seconds, 65 °C for 30 seconds and 72 °C for 30 seconds, and then 72 °C for 5 minutes. The PCR products were purified and size-selected using a 1:1 PCR Clean DX beads-to-sample ratio (twice). The eluted DNA quality was assessed with the Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA), and the DNA was quantified using a Quant-iT dsDNA High Sensitivity Assay Kit on a Qubit fluorometer (Invitrogen) prior to library pooling and size-corrected final molar concentration calculation for Illumina HiSeq2500 sequencing with paired-end 75 base reads.
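The size-corrected molar concentration mentioned above can be illustrated with the common approximation of 660 g/mol per base pair of double-stranded DNA. The sketch below is not the centre's own calculation: the function names, the example concentration and the 400 bp mean library size are all hypothetical.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair of dsDNA (standard approximation, assumed here)


def library_molarity_nm(conc_ng_per_ul, mean_fragment_bp):
    """Convert a mass concentration (ng/µL) into nM for a given mean library size (bp)."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * mean_fragment_bp)


def pooling_volume_ul(molarity_nm, target_fmol):
    """Volume contributing target_fmol of library (1 nM equals 1 fmol/µL)."""
    return target_fmol / molarity_nm


# Hypothetical library: 2.64 ng/µL by fluorometry, 400 bp mean size by LabChip.
molarity = library_molarity_nm(2.64, 400)
print(round(molarity, 2), "nM")
print(round(pooling_volume_ul(molarity, 20), 2), "µL for 20 fmol")
```

Because the mean size enters the denominator, two libraries with the same mass concentration but different size distributions contribute different numbers of molecules to a pool, which is why the size correction matters.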

RNA-Seq (plate-based) library construction (pre-2014):

Total RNA samples (2-3 µg) were arrayed into a 96-well plate, and polyadenylated (PolyA+) RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on-column DNase I treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10 µL of DEPC-treated water with 1:20 SuperaseIN (Life Technologies, USA). Double-stranded cDNA was synthesized from the purified polyA+ RNA using the Superscript Double-Stranded cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5 µM. The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and a VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). Quality was checked on a random sampling of wells with an Agilent Bioanalyzer using the High Sensitivity DNA chip Assay. cDNA was fragmented by Covaris E210 (Covaris, USA) sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. Plate-based libraries were prepared following the BC Cancer Agency Genome Sciences Centre paired-end (PE) protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads and was subjected to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA polymerase and T4 polynucleotide kinase in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, PicoGreen quantification was performed to determine the amount of Illumina PE adapters to use in the subsequent adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set, with cycling conditions of 98 °C for 30 seconds, followed by 10-15 cycles of 98 °C for 10 seconds, 65 °C for 30 seconds and 72 °C for 30 seconds, and then 72 °C for 5 minutes. The PCR products were purified using Ampure XP SPRI beads and checked with the Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using an in-house 96-channel size selection robot, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and the Quant-iT dsDNA HS Assay Kit on a Qubit fluorometer (Invitrogen), then diluted to 8 nM. The final concentration was verified by Quant-iT dsDNA HS Assay prior to Illumina HiSeq2000 PE 75 base sequencing.

Strand-specific RNA-seq (plate based) library construction (post-2014):

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified from 2 µg of total RNA using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany), with on-column DNase I treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10 µL of DEPC-treated water with 1:20 SuperaseIN (Life Technologies, USA).

First-strand cDNA was synthesized from the purified polyA+ RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5 µM, along with Actinomycin D at a final concentration of 1 µg/µL, followed by purification with Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second-strand cDNA was synthesized following the Superscript cDNA Synthesis protocol, replacing dTTP with dUTP in the dNTP mix. This allows the second strand to be digested with UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction, thereby achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and a VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based, paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads and was subjected to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA polymerase and T4 polynucleotide kinase in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, PicoGreen quantification was performed to determine the amount of Illumina PE adapters to use in the subsequent adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads and digested with UNG (1 U/µL) at 37 °C for 30 minutes, followed by deactivation at 95 °C for 15 minutes. The digested cDNA was purified using Ampure XP SPRI beads and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set, with cycling conditions of 98 °C for 30 seconds, followed by 10-13 cycles of 98 °C for 10 seconds, 65 °C for 30 seconds and 72 °C for 30 seconds, and then 72 °C for 5 minutes. The PCR products were purified using Ampure XP SPRI beads and checked with the Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and the Quant-iT dsDNA HS Assay Kit on a Qubit fluorometer (Invitrogen), then diluted to 8 nM. The final library concentration was verified with a second Quant-iT dsDNA HS Assay prior to Illumina sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to the GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly with exon-exon junction sequences whose coordinates were defined from transcript annotations in Ensembl (v59), RefSeq and the UCSC genome browser known genes, which were downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back onto the genome and marked with 'ZJ:Z' tags. BWA was run using default parameters, except that the option (-s) was included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter were flagged with a custom script, and duplicate reads were flagged with Picard Tools.

Structural variant detection:

Structural variant detection was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq, assemblies at alternate k-mer values from k50 to k96 were performed on positive-strand plus ambiguous-strand reads and, separately, on negative-strand plus ambiguous-strand reads. The positive- and negative-strand assemblies were extended where possible, merged and then concatenated to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single-end mode using k-mer values of k24 and k44. The contigs and reads were then reassembled at k64 in single-end mode and finally at k64 in paired-end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large-scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high-confidence GMAP (v2012-12-20) alignments to two distinct genomic regions. Evidence for the alignments was provided by aligning reads back to the contigs and by aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large-scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment-specific events, the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated and screened against the matching tumour genome BAM files and, where available, germline BAM files. This resulted in compartment-specific structural variant events and, where germline data were available, putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired reads were placed in a mix-fragment BAM. SNVs were then detected on the positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30. The SNVs were further filtered to exclude those for which 1) the reference base was N; 2) only 1 read supported the variant; 3) the probability of a heterozygous or homozygous variant genotype was smaller than 0.99; 4) the position overlapped an insertion or deletion; 5) read support came from positions no more than 5 bases from read ends; 6) support came only from reads spanning an exon-exon junction; 7) a proportion of more than 0.5 of the supporting reads were improperly paired; or 8) fewer than 2 properly paired reads supported the variant. SNVs located in exons equal to or shorter than the read length (100 bp in this case) are a special case, because all of their coverage may come from reads spanning exon-exon junctions. We therefore also identified small-exonic SNVs that were supported only by junction-spanning reads but passed the other 7 filtering criteria above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).
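The eight exclusion criteria and the small-exon exception can be expressed as a simple rule set. The sketch below is illustrative only and is not the pipeline's code: the `SnvCall` record and its field names are hypothetical, criterion 3 is interpreted as the sum of the heterozygous and homozygous-variant probabilities, and criteria 5 and 6 are interpreted as "no supporting read outside that category".

```python
from dataclasses import dataclass

READ_LENGTH = 100  # bp, as stated in the text


@dataclass
class SnvCall:
    """Hypothetical per-site summary; field names are assumptions for illustration."""
    ref_base: str
    support_reads: int           # reads carrying the variant allele
    p_het: float                 # genotype probability, heterozygous
    p_hom_var: float             # genotype probability, homozygous variant
    overlaps_indel: bool
    support_away_from_ends: int  # supporting reads with the variant > 5 bases from a read end
    support_non_junction: int    # supporting reads not spanning an exon-exon junction
    support_improper: int        # supporting reads that are improperly paired
    exon_length: int             # length of the exon containing the site (bp)


def failed_filters(call):
    """Return the set of criterion numbers (1-8) that the call fails."""
    proper = call.support_reads - call.support_improper
    checks = {
        1: call.ref_base.upper() == "N",
        2: call.support_reads <= 1,
        3: call.p_het + call.p_hom_var < 0.99,
        4: call.overlaps_indel,
        5: call.support_away_from_ends == 0,
        6: call.support_non_junction == 0,
        7: call.support_reads > 0 and call.support_improper / call.support_reads > 0.5,
        8: proper < 2,
    }
    return {number for number, failed in checks.items() if failed}


def classify(call):
    """Label a call as 'pass', 'small_exon' (fails only criterion 6 in a short exon) or 'fail'."""
    failed = failed_filters(call)
    if not failed:
        return "pass"
    if failed == {6} and call.exon_length <= READ_LENGTH:
        return "small_exon"
    return "fail"


good = SnvCall("A", 10, 0.995, 0.004, False, 8, 6, 1, 250)
junction_only = SnvCall("A", 10, 0.995, 0.004, False, 8, 0, 1, 80)
print(classify(good), classify(junction_only))
```

Returning the set of failed criteria, rather than a single boolean, makes the small-exon exception easy to state: a call is rescued only when criterion 6 is the sole failure.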

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.
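The median-based pre-filter and ranking described above can be sketched in a few lines. This is a minimal illustration, not the analysis code: the Wilcoxon/BH filter and the SAMseq test itself are not reproduced, the function names are hypothetical, and the fold change is assumed to be a ratio of group medians with a small offset to avoid division by zero.

```python
from statistics import median


def prefilter(group_a, group_b, min_median=5.0):
    """Drop genes whose median RPKM is below min_median in BOTH groups."""
    return [
        gene for gene in group_a
        if median(group_a[gene]) >= min_median or median(group_b[gene]) >= min_median
    ]


def median_fold_change(a_values, b_values, offset=1.0):
    """Ratio of group medians; the offset is an assumption, not from the source."""
    return (median(a_values) + offset) / (median(b_values) + offset)


def top_changes(group_a, group_b, genes, n=10):
    """Return up to n genes with the largest fold change in each direction."""
    fc = {g: median_fold_change(group_a[g], group_b[g]) for g in genes}
    up = sorted((g for g in genes if fc[g] > 1), key=fc.get, reverse=True)[:n]
    down = sorted((g for g in genes if fc[g] < 1), key=fc.get)[:n]
    return up, down


# Hypothetical RPKM values for three genes in two sample groups.
group_a = {"G1": [20, 30, 40], "G2": [1, 2, 3], "G3": [2, 4, 6]}
group_b = {"G1": [5, 10, 15], "G2": [0, 1, 2], "G3": [30, 40, 50]}

kept = prefilter(group_a, group_b)
up, down = top_changes(group_a, group_b, kept)
print(kept, up, down)
```

In this toy input, G2 is removed because its median is below 5 RPKM in both groups, while G3 is kept because it is well expressed in one group.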

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix from the top 25% most-variant genes, obtained by ranking expressed genes with a mean RPKM of at least 10 by their coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, using the default Brunet algorithm and 200 iterations for the clustering run. Rank survey profiles of the cophenetic correlation coefficient and silhouette width were then used to identify a preferred cluster solution.
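The gene selection that precedes NMF can be sketched as follows. This is an illustration under stated assumptions rather than the original R code: the function name is hypothetical, the coefficient of variation uses the sample standard deviation, and the 25% cut is rounded up and applied to the genes that pass both the noise and mean-RPKM filters.

```python
import math
from statistics import mean, stdev


def select_nmf_genes(rpkm, noise=0.2, max_noise_frac=0.75, min_mean=10.0, top_frac=0.25):
    """Return gene names for the NMF input matrix, most variable first."""
    cv = {}
    for gene, values in rpkm.items():
        # Remove genes at or below the noise threshold in at least 75% of samples.
        noise_frac = sum(v <= noise for v in values) / len(values)
        if noise_frac >= max_noise_frac:
            continue
        # Rank only expressed genes with a mean RPKM of at least 10.
        if mean(values) >= min_mean:
            cv[gene] = stdev(values) / mean(values)
    ranked = sorted(cv, key=cv.get, reverse=True)
    n_keep = math.ceil(len(ranked) * top_frac)  # rounding up is an assumption
    return ranked[:n_keep]


# Hypothetical RPKM values across four samples.
rpkm = {
    "A": [10, 10, 10, 10],
    "B": [5, 15, 5, 15],
    "C": [1, 39, 0, 0],
    "D": [20, 22, 18, 20],
    "NOISE": [0.1, 0.0, 0.2, 50],
    "LOW": [1, 2, 3, 4],
}
print(select_nmf_genes(rpkm))
```

The resulting gene list would then define the rows of the expression matrix passed to the clustering step.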

RNA-Seq (plate-based) library construction (pre-2014):

2-3 ug total RNA samples were arrayed into a 96-well plate and polyadenylated (PolyA+) RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA). Double-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript Double-Stranded cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM. The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The quality was checked for a random sampling on an Agilent Bioanalyzer using the High Sensitivity DNA chip Assay.  cDNA was fragmented by Covaris E210 (Covaris, USA) sonication for 55 seconds, a “Duty cycle” of 20% and “Intensity” of 5. Plate-based libraries were prepared following the BC Cancer Agency, Genome Sciences Centre paired-end (PE) protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailling by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-15 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. 
The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of desired size range was purified using an in-house 96-channel size selection robot, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final concentration was verified by Quant-iT dsDNA HS Assay prior to Illumina HiSeq2000 PE 75 base sequencing.

Strand-specific RNA-seq (plate based) library construction (post-2014):

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. 
The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

RNA-Seq (plate-based) library construction (pre-2014):

2-3 ug total RNA samples were arrayed into a 96-well plate and polyadenylated (PolyA+) RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA). Double-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript Double-Stranded cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM. The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The quality was checked for a random sampling on an Agilent Bioanalyzer using the High Sensitivity DNA chip Assay.  cDNA was fragmented by Covaris E210 (Covaris, USA) sonication for 55 seconds, a “Duty cycle” of 20% and “Intensity” of 5. Plate-based libraries were prepared following the BC Cancer Agency, Genome Sciences Centre paired-end (PE) protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailling by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-15 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. 
The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of desired size range was purified using an in-house 96-channel size selection robot, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final concentration was verified by Quant-iT dsDNA HS Assay prior to Illumina HiSeq2000 PE 75 base sequencing.

Strand-specific RNA-seq (plate based) library construction (post-2014):

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and a VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and an “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based, paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads and was subjected to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA polymerase, and T4 polynucleotide kinase in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, PicoGreen quantification was performed to determine the amount of Illumina PE adapters to use in the subsequent adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads and digested with UNG (1 U/µL) at 37°C for 30 min, followed by deactivation at 95°C for 15 min. The digested cDNA was purified using Ampure XP SPRI beads and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set, with cycling conditions of 98°C for 30 sec, followed by 10-13 cycles of 98°C for 10 sec, 65°C for 30 sec and 72°C for 30 sec, and then 72°C for 5 min. The PCR products were purified using Ampure XP SPRI beads and checked on a Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and the Quant-iT dsDNA HS Assay Kit on a Qubit fluorometer (Invitrogen), then diluted to 8 nM. The final library concentration was verified again by Quant-iT dsDNA HS Assay before Illumina sequencing.
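
Because the dUTP-marked second strand is digested before amplification, the orientation of each read pair carries the strand of the original transcript. The sketch below shows one way that strand could be read from standard SAM flag bits. It assumes the usual dUTP convention, in which read 1 is antisense to the transcript; the protocol text does not state this convention, and the function name and the labels are hypothetical.

```python
# Standard SAM flag bits (from the SAM format specification).
READ_PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
REVERSE = 0x10
READ1 = 0x40
READ2 = 0x80


def fragment_strand(flag: int) -> str:
    """Classify a read as coming from a 'positive', 'negative' or 'mix' fragment.

    Assumption: dUTP second-strand marking, so read 1 maps antisense to the
    transcript and read 2 maps sense. Unmapped or improperly paired reads
    are labelled 'mix'.
    """
    if (flag & UNMAPPED) or not (flag & READ_PAIRED) or not (flag & PROPER_PAIR):
        return "mix"
    is_reverse = bool(flag & REVERSE)
    if flag & READ1:
        # Read 1 on the reverse strand implies a plus-strand transcript.
        return "positive" if is_reverse else "negative"
    if flag & READ2:
        # Read 2 has the same orientation as the transcript.
        return "negative" if is_reverse else "positive"
    return "mix"


if __name__ == "__main__":
    for flag in (83, 163, 99, 147, 77):
        print(flag, fragment_strand(flag))
```

If a library follows the opposite convention, swapping the two return values in each branch is the only change needed.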

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to the GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly with exon-exon junction sequences whose coordinates were defined based on the annotations of any transcript in Ensembl (v59), RefSeq and known genes from the UCSC genome browser, which were downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome and marked with 'ZJ:Z' tags. BWA was run using default parameters, except that the option (-s) was included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter were flagged with a custom script, and duplicate reads were flagged with Picard Tools.
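
The repositioning of junction-mapped reads can be pictured with a small coordinate conversion. This is only a sketch under stated assumptions, not the custom pipeline itself: it assumes each junction sequence is built from the last `flank` bases of the upstream exon joined to the first `flank` bases of the downstream exon, that the read aligns without gaps, and that all coordinates are 1-based on the forward strand. The function name and the layout of the junction reference are hypothetical.

```python
def reposition_junction_read(junction_pos: int, read_length: int,
                             left_exon_end: int, right_exon_start: int,
                             flank: int):
    """Map an ungapped alignment on a junction sequence back to the genome.

    Returns (genomic_start, cigar, spans_junction). The intron is written
    with the SAM CIGAR 'N' (skipped region) operator.
    """
    if right_exon_start <= left_exon_end:
        raise ValueError("downstream exon must start after the upstream exon ends")
    if junction_pos < 1 or junction_pos + read_length - 1 > 2 * flank:
        raise ValueError("alignment does not fit on the junction sequence")

    # Number of read bases that fall on the upstream exon.
    left_len = max(0, min(read_length, flank - junction_pos + 1))
    right_len = read_length - left_len

    if right_len == 0:  # read lies entirely within the upstream exon
        return left_exon_end - flank + junction_pos, f"{read_length}M", False
    if left_len == 0:   # read lies entirely within the downstream exon
        start = right_exon_start + (junction_pos - flank - 1)
        return start, f"{read_length}M", False

    intron = right_exon_start - left_exon_end - 1
    start = left_exon_end - flank + junction_pos
    return start, f"{left_len}M{intron}N{right_len}M", True


if __name__ == "__main__":
    # Hypothetical junction: exon ends at 1000, next exon starts at 2001.
    print(reposition_junction_read(50, 75, 1000, 2001, 74))
```

A read flagged as spanning the junction is the kind of record that would receive the 'ZJ:Z' tag; the tag's value is not described in the text, so the sketch returns only a boolean.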

 

Structural variant detection:

Structural variant detection was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq, assemblies were performed at alternate k-mer values from k50 to k96, using positive-strand and ambiguous-strand reads as well as negative-strand and ambiguous-strand reads. The positive- and negative-strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single-end mode using k-mer values of k24 and k44. The contigs and reads were then reassembled at k64 in single-end mode and finally at k64 in paired-end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large-scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high-confidence GMAP (v2012-12-20) alignments to two distinct genomic regions. Evidence for the alignments was provided by aligning reads back to the contigs and by aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large-scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in each event was calculated from the alignment of reads back to the event breakpoint in the contigs. The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment-specific events, the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against the matching genome tumour BAM files and, where available, germline BAM files. This resulted in compartment-specific structural variant events and, where germline was available, putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.
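
The compartment screening in the last step can be sketched as a simple labelling rule. The text gives no read thresholds or decision logic, so everything below is an illustrative assumption: the `EventSupport` fields, the `min_reads` cutoff of 2 and the label strings are hypothetical and only show the general shape of such a classification.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass
class EventSupport:
    """Read support for one structural variant event (hypothetical fields)."""
    tumour_genome_reads: int
    rna_reads: int
    germline_reads: Optional[int]  # None when no germline BAM is available


def classify_event(support: EventSupport,
                   min_reads: int = 2) -> Tuple[Set[str], str]:
    """Return (compartments with support, origin label) for an event.

    min_reads is an assumed threshold, not a value from the protocol.
    """
    compartments: Set[str] = set()
    if support.tumour_genome_reads >= min_reads:
        compartments.add("genome")
    if support.rna_reads >= min_reads:
        compartments.add("transcriptome")

    if support.germline_reads is None:
        origin = "undetermined"        # no germline sample to screen against
    elif support.germline_reads >= min_reads:
        origin = "putative germline"
    elif compartments:
        origin = "putative somatic"
    else:
        origin = "unsupported"
    return compartments, origin


if __name__ == "__main__":
    print(classify_event(EventSupport(12, 30, 0)))
    print(classify_event(EventSupport(0, 8, None)))
```

An event supported only in RNA-seq would come back with the transcriptome compartment alone, which is one reading of the "compartment-specific" events described above.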

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on the positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30. The SNVs were further filtered to exclude those with 1) a reference base of N; 2) only 1 read supporting the variant; 3) a probability of the heterozygous and homozygous variant allele smaller than 0.99; 4) a position overlapping insertions or deletions; 5) read support from positions no more than 5 bases from read ends; 6) support only from reads spanning an exon-exon junction; 7) a proportion of improperly paired supporting reads greater than 0.5; 8) fewer than 2 properly paired supporting reads. SNVs located in exons equal to or smaller than the read length (100 bp in this case) are a special case, because all of their coverage may come from reads spanning exon-exon junctions. We therefore also identified small-exonic SNVs that were supported only by reads spanning an exon-exon junction but passed the other 7 filtering criteria above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).
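
The eight exclusion criteria and the small-exon exception can be written as a single predicate. This is a sketch of the logic as described, not the production filter: the `SnvCall` fields are hypothetical summaries of a call, and criteria 5 and 6 are interpreted here as "no supporting read passes", which is an assumption about the original implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SnvCall:
    """Per-site summary of an SNV call (all field names are hypothetical)."""
    ref_base: str
    supporting_reads: int
    p_variant: float               # probability of a het or hom variant genotype
    overlaps_indel: bool
    support_away_from_ends: int    # supporting reads with the variant > 5 bases from a read end
    support_not_junction: int      # supporting reads that do not span an exon-exon junction
    improper_pair_support: int     # supporting reads that are improperly paired
    exon_length: int


def failed_filters(call: SnvCall) -> List[int]:
    """Return the numbers (1-8) of the exclusion criteria the call triggers."""
    failed = []
    if call.ref_base.upper() == "N":
        failed.append(1)
    if call.supporting_reads <= 1:
        failed.append(2)
    if call.p_variant < 0.99:
        failed.append(3)
    if call.overlaps_indel:
        failed.append(4)
    if call.support_away_from_ends == 0:
        failed.append(5)
    if call.support_not_junction == 0:
        failed.append(6)
    if call.supporting_reads > 0 and \
            call.improper_pair_support / call.supporting_reads > 0.5:
        failed.append(7)
    if call.supporting_reads - call.improper_pair_support < 2:
        failed.append(8)
    return failed


def keep_snv(call: SnvCall, read_length: int = 100) -> bool:
    """Keep calls passing all criteria, plus junction-only calls in small exons."""
    failed = failed_filters(call)
    if not failed:
        return True
    # Small-exon exception: only criterion 6 fails and the exon fits in a read.
    return failed == [6] and call.exon_length <= read_length


if __name__ == "__main__":
    call = SnvCall("A", 10, 0.999, False, 8, 0, 1, 80)
    print(failed_filters(call), keep_snv(call))
```

Returning the list of failed criteria rather than a bare boolean makes the small-exon exception easy to express and to audit.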

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with a median below 5 RPKM in both groups, and those for which the BH-adjusted Wilcoxon P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and genes ‘down’. We then ranked the genes by a median-based fold change and generated a figure showing up to 10 of the largest fold changes in each direction.
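
The gene pre-filter applied before SAMseq can be sketched as follows. The original analysis was run in R; this Python version is only an illustration of the two filtering rules. It assumes the raw Wilcoxon P-values have already been computed elsewhere and that the Benjamini-Hochberg adjustment is taken over all genes supplied, which the text does not specify. The function and argument names are hypothetical.

```python
from statistics import median
from typing import Dict, List, Sequence, Tuple


def bh_adjust(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def prefilter_genes(expr: Dict[str, Tuple[Sequence[float], Sequence[float]]],
                    raw_p: Dict[str, float],
                    min_median: float = 5.0,
                    alpha: float = 0.05) -> List[str]:
    """Keep genes with median RPKM >= min_median in at least one group
    and a BH-adjusted P-value <= alpha."""
    names = list(expr)
    adj = dict(zip(names, bh_adjust([raw_p[g] for g in names])))
    kept = []
    for gene in names:
        group_a, group_b = expr[gene]
        if median(group_a) < min_median and median(group_b) < min_median:
            continue  # low in both groups
        if adj[gene] > alpha:
            continue  # not significant after adjustment
        kept.append(gene)
    return kept


if __name__ == "__main__":
    expr = {"g1": ([10, 12, 11], [30, 28, 35]),
            "g2": ([1, 2, 1], [2, 1, 3]),
            "g3": ([8, 9, 7], [8, 9, 8])}
    raw_p = {"g1": 0.001, "g2": 0.001, "g3": 0.9}
    print(prefilter_genes(expr, raw_p))
```

Only the genes returned by this step would be passed on to the two-class unpaired SAMseq run.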

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm and 200 iterations for the clustering run. Rank survey profiles of the cophenetic coefficient and silhouette width suggested a specific cluster solution.
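
The gene selection that builds the NMF input matrix can be sketched in a few lines. This is an illustration, not the original R code: it assumes the top 25% is taken from the genes that remain after the noise and mean-RPKM filters, uses the sample standard deviation for the coefficient of variation, and uses hypothetical function and parameter names.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence


def select_nmf_genes(rpkm: Dict[str, Sequence[float]],
                     noise: float = 0.2,
                     noise_fraction: float = 0.75,
                     min_mean: float = 10.0,
                     top_fraction: float = 0.25) -> List[str]:
    """Return the most-variant genes, ranked by coefficient of variation."""
    cv = {}
    for gene, values in rpkm.items():
        # Drop genes at or below the noise threshold in >= 75% of samples.
        at_noise = sum(v <= noise for v in values)
        if at_noise / len(values) >= noise_fraction:
            continue
        gene_mean = mean(values)
        if gene_mean < min_mean:
            continue
        cv[gene] = stdev(values) / gene_mean  # needs at least two samples

    ranked = sorted(cv, key=cv.get, reverse=True)
    if not ranked:
        return []
    n_top = max(1, int(len(ranked) * top_fraction))
    return ranked[:n_top]


if __name__ == "__main__":
    rpkm = {"low": [0.1, 0.0, 0.2, 5.0],
            "weak": [2, 3, 4, 5],
            "flat": [20, 20, 21, 19],
            "mid": [10, 20, 15, 25],
            "mid2": [30, 31, 29, 40],
            "var": [5, 50, 10, 60]}
    print(select_nmf_genes(rpkm))
```

The rows of the NMF input matrix would then be the selected genes and the columns the samples, with the consensus clustering itself run in the NMF package.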

RNA-Seq (plate-based) library construction (pre-2014):

Total RNA samples (2-3 µg) were arrayed into a 96-well plate, and polyadenylated (PolyA+) RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on-column DNase I treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10 µL of DEPC-treated water with 1:20 SuperaseIN (Life Technologies, USA). Double-stranded cDNA was synthesized from the purified PolyA+ RNA using the Superscript Double-Stranded cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5 µM. The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and a VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). Quality was checked on a random subset of samples using an Agilent Bioanalyzer with the High Sensitivity DNA chip assay. The cDNA was fragmented by Covaris E210 (Covaris, USA) sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. Plate-based libraries were prepared following the BC Cancer Agency Genome Sciences Centre paired-end (PE) protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads and subjected to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA polymerase, and T4 polynucleotide kinase in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, PicoGreen quantification was performed to determine the amount of Illumina PE adapters to use in the subsequent adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set, with cycling conditions of 98°C for 30 sec, followed by 10-15 cycles of 98°C for 10 sec, 65°C for 30 sec and 72°C for 30 sec, and then 72°C for 5 min.
The PCR products were purified using Ampure XP SPRI beads and checked on a Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using an in-house 96-channel size selection robot, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and the Quant-iT dsDNA HS Assay Kit on a Qubit fluorometer (Invitrogen), then diluted to 8 nM. The final concentration was verified by Quant-iT dsDNA HS Assay prior to Illumina HiSeq2000 PE 75 base sequencing.
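The dilution to 8 nM depends on converting a fluorometric mass concentration into a molar one. The sketch below uses the common approximation of 660 g/mol per base pair of double-stranded DNA; the function names, the example concentration and the fragment size are illustrative assumptions and are not taken from the protocol.

```python
# Approximate average molar mass of one base pair of dsDNA (g/mol).
AVG_BP_MASS = 660.0


def library_nanomolar(conc_ng_per_ul, mean_fragment_bp):
    """Convert a dsDNA mass concentration (ng/uL) to molarity (nM)."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * mean_fragment_bp)


def dilution_volumes(stock_nm, target_nm, final_volume_ul):
    """Return (library_ul, diluent_ul) from C1 * V1 = C2 * V2."""
    if stock_nm < target_nm:
        raise ValueError("stock is already below the target concentration")
    library_ul = target_nm * final_volume_ul / stock_nm
    return library_ul, final_volume_ul - library_ul


# Hypothetical library: 6.6 ng/uL with a 250 bp mean fragment size.
stock = library_nanomolar(conc_ng_per_ul=6.6, mean_fragment_bp=250)
lib_ul, diluent_ul = dilution_volumes(stock, target_nm=8.0, final_volume_ul=20.0)
print(f"stock = {stock:.1f} nM; mix {lib_ul:.1f} uL library + {diluent_ul:.1f} uL diluent")
```

The mean fragment size would normally come from the Agilent or LabChip trace, which is why size assessment and fluorometric quantification are both needed before dilution.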

Strand-specific RNA-seq (plate based) library construction (post-2014):

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified from 2 µg of total RNA using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on-column DNase I treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10 µL of DEPC-treated water with 1:20 SuperaseIN (Life Technologies, USA).

First-strand cDNA was synthesized from the purified PolyA+ RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5 µM, along with a final concentration of 1 µg/µL Actinomycin D, followed by cleanup with Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second-strand cDNA was synthesized following the Superscript cDNA Synthesis protocol, replacing the dTTP with dUTP in the dNTP mix. This allows the second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction, thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and a VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based, paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads and subjected to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA polymerase, and T4 polynucleotide kinase in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, PicoGreen quantification was performed to determine the amount of Illumina PE adapters to use in the subsequent adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads and digested with UNG (1 U/µL) at 37°C for 30 min, followed by deactivation at 95°C for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set, with cycling conditions of 98°C for 30 sec, followed by 10-13 cycles of 98°C for 10 sec, 65°C for 30 sec and 72°C for 30 sec, and then 72°C for 5 min. The PCR products were purified using Ampure XP SPRI beads and checked on a Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and the Quant-iT dsDNA HS Assay Kit on a Qubit fluorometer (Invitrogen), then diluted to 8 nM.
The final library concentration was verified again by Quant-iT dsDNA HS Assay prior to Illumina sequencing.

Strand-specific ribodepletion RNA sequencing:

Enzymatic reactions were set up in a 96-well plate (Thermo Fisher Scientific) on a Microlab NIMBUS liquid handler (Hamilton Robotics, USA). 100 ng of DNase I-treated total RNA in 6 µL was hybridized to rRNA probes in a 7.5 µL reaction. Heat-sealed plates were incubated at 95°C for 2 minutes, followed by an incremental reduction in temperature of 0.1°C per second to 22°C (730 cycles). The rRNA in the rRNA:DNA hybrids was digested using RNase H in a 10 µL reaction incubated in a thermocycler at 37°C for 30 minutes. To remove excess rRNA probes (DNA) and residual genomic DNA contamination, DNase I was added in a total reaction volume of 25 µL and incubated at 37°C for 30 minutes. RNA was purified using RNA MagClean DX beads (Aline Biosciences, USA) with 15 minutes of binding time and 7 minutes of clearing on a magnet, followed by two 70% ethanol washes, 5 minutes of air drying of the RNA pellet, and elution in 36 µL of DEPC water. The plate containing RNA was stored at -80°C prior to cDNA synthesis.
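The 730 cycles quoted for the hybridization ramp follow directly from the temperature span and the ramp rate. A minimal arithmetic check, assuming one 0.1°C step per second as described above (the function name is illustrative):

```python
def ramp_steps(start_c, end_c, step_c):
    """Number of equal temperature steps needed to ramp from start_c down to end_c."""
    return round((start_c - end_c) / step_c)


steps = ramp_steps(start_c=95.0, end_c=22.0, step_c=0.1)
ramp_minutes = steps * 1.0 / 60.0  # one step per second
print(steps, round(ramp_minutes, 1))
```

At one step per second, the ramp itself therefore takes a little over 12 minutes, in addition to the 2-minute hold at 95°C.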

First-strand cDNA was synthesized from the purified RNA (minus rRNA) using the Maxima H Minus First Strand cDNA Synthesis kit (Thermo Fisher, USA) and random hexamer primers at a concentration of 8 ng/µL, along with a final concentration of 0.4 µg/µL Actinomycin D, followed by PCR Clean DX bead purification on a Microlab NIMBUS robot (Hamilton Robotics, USA). The second-strand cDNA was synthesized following the NEBNext Ultra Directional Second Strand cDNA Synthesis protocol (NEB), which incorporates dUTP in the dNTP mix. This allows the second strand to be digested using USER™ enzyme (NEB) in the post-adapter ligation reaction, thus achieving strand specificity.

cDNA was fragmented by Covaris LE220 sonication for 130 seconds (2 x 65 seconds) at a “Duty cycle” of 30%, 450 Peak Incident Power (W) and 200 Cycles per Burst in a 96-well microTUBE Plate (P/N: 520078) to achieve 200-250 bp average fragment lengths. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based library construction protocol on a Microlab NIMBUS robot (Hamilton Robotics, USA). Briefly, the sheared cDNA was subjected to end-repair and phosphorylation in a single reaction using an enzyme premix (NEB) containing T4 DNA polymerase, Klenow DNA polymerase and T4 polynucleotide kinase, incubated at 20°C for 30 minutes. Repaired cDNA was purified in 96-well format using PCR Clean DX beads (Aline Biosciences, USA), and 3’ A-tailed (adenylation) using Klenow fragment (3’ to 5’ exo minus) with incubation at 37°C for 30 minutes prior to enzyme heat inactivation. Illumina PE adapters were ligated at 20°C for 15 minutes. The adapter-ligated products were purified using PCR Clean DX beads, then digested with USER™ enzyme (1 U/µL, NEB) at 37°C for 15 minutes, followed immediately by 13 cycles of indexed PCR using Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) and Illumina’s PE primer set. PCR parameters: 98°C for 1 minute, followed by 13 cycles of 98°C for 15 seconds, 65°C for 30 seconds and 72°C for 30 seconds, and then 72°C for 5 minutes. The PCR products were purified and size-selected using a 1:1 PCR Clean DX beads-to-sample ratio (twice). The eluted DNA quality was assessed on a Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA) and quantified using a Quant-iT dsDNA High Sensitivity Assay Kit on a Qubit fluorometer (Invitrogen) prior to library pooling and size-corrected final molar concentration calculation for Illumina HiSeq2500 sequencing with paired-end 75 base reads.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified from 2 µg of total RNA using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on-column DNase I treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10 µL of DEPC-treated water with 1:20 SuperaseIN (Life Technologies, USA).

First-strand cDNA was synthesized from the purified PolyA+ RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5 µM, along with a final concentration of 1 µg/µL Actinomycin D, followed by cleanup with Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second-strand cDNA was synthesized following the Superscript cDNA Synthesis protocol, replacing the dTTP with dUTP in the dNTP mix. This allows the second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction, thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and a VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based, paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads and subjected to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA polymerase, and T4 polynucleotide kinase in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, PicoGreen quantification was performed to determine the amount of Illumina PE adapters to use in the subsequent adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads and digested with UNG (1 U/µL) at 37°C for 30 min, followed by deactivation at 95°C for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set, with cycling conditions of 98°C for 30 sec, followed by 10-13 cycles of 98°C for 10 sec, 65°C for 30 sec and 72°C for 30 sec, and then 72°C for 5 min. The PCR products were purified using Ampure XP SPRI beads and checked on a Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and the Quant-iT dsDNA HS Assay Kit on a Qubit fluorometer (Invitrogen), then diluted to 8 nM.
The final library concentration was verified again by Quant-iT dsDNA HS Assay prior to Illumina sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to the GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly with exon-exon junction sequences whose coordinates were defined from annotations of any transcripts in Ensembl (v59), RefSeq and known genes from the UCSC genome browser, which were downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome and marked with 'ZJ:Z' tags. BWA was run using default parameters, except that the option (-s) was included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter were flagged with a custom script, and duplicated reads were flagged with Picard Tools.
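The custom chastity-flagging script is not described here. As a rough illustration of what flagging (as opposed to removing) reads means, the sketch below sets bits in the SAM FLAG field. The SAM specification defines 0x200 as "not passing filters" and 0x400 as "PCR or optical duplicate"; that the custom script uses 0x200 is an assumption, and the function name is hypothetical.

```python
# FLAG bits defined by the SAM specification.
QC_FAIL = 0x200    # read fails platform/vendor quality checks
DUPLICATE = 0x400  # read is a PCR or optical duplicate


def flag_read(flag, passed_chastity, is_duplicate):
    """Return the SAM FLAG with QC-fail and duplicate bits set as needed.

    The read stays in the file; downstream tools decide whether to skip it.
    """
    if not passed_chastity:
        flag |= QC_FAIL
    if is_duplicate:
        flag |= DUPLICATE
    return flag


# FLAG 99 = paired, properly paired, mate on reverse strand, first in pair.
failed_chastity = flag_read(99, passed_chastity=False, is_duplicate=False)
marked_duplicate = flag_read(99, passed_chastity=True, is_duplicate=True)
print(failed_chastity, marked_duplicate)
```

Because the bits are only set, never cleared, the original alignment information is preserved and the flagged reads can still be recovered for quality-control summaries.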

 

Structural variant detection:

Structural variant detection was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq, assemblies at alternate k-mers from k50 to k96 were performed using positive-strand and ambiguous-strand reads, as well as negative-strand and ambiguous-strand reads. The positive- and negative-strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single-end mode using k-mer values of k24 and k44. The contigs and reads were then reassembled at k64 in single-end mode and finally at k64 in paired-end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large-scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high-confidence GMAP (v2012-12-20) alignments to two distinct genomic regions. Evidence for the alignments was provided by aligning reads back to the contigs and by aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large-scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment-specific events, the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated and screened against the matching tumour genome and, where available, germline BAM files. This yielded compartment-specific structural variant events and, where germline data were available, putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired reads were placed in a mix-fragment BAM. SNVs were then detected on the positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30. The SNVs were further filtered to exclude those for which: 1) the reference base was N; 2) only 1 read supported the variant; 3) the probability of the heterozygous and homozygous variant genotypes was smaller than 0.99; 4) the position overlapped with insertions or deletions; 5) read support came from positions no more than 5 bases from read ends; 6) support came only from reads spanning an exon-exon junction; 7) more than 0.5 of the supporting reads were improperly paired; 8) fewer than 2 properly paired reads supported the variant. SNVs located in exons equal to or smaller than the read length (100 bp in this case) are a special case, because all of their coverage may come from reads spanning exon-exon junctions; we therefore also identified small-exonic SNVs that were supported only by reads spanning an exon-exon junction but passed the other 7 filtering criteria above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).
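The eight exclusion criteria and the small-exon exception can be read as a set of boolean checks per candidate SNV. The following is a minimal sketch under stated assumptions: the `SNVCandidate` fields are hypothetical per-site summaries (not SNVMix2 output columns), `p_variant` is assumed to be the summed heterozygous and homozygous variant probability, and criterion 5 is interpreted as "no supporting read has the variant more than 5 bases from a read end".

```python
from dataclasses import dataclass


@dataclass
class SNVCandidate:
    ref_base: str
    n_support: int               # reads supporting the variant
    p_variant: float             # assumed: P(het) + P(hom variant)
    overlaps_indel: bool
    n_support_interior: int      # support >5 bases away from read ends
    n_support_non_junction: int  # support not spanning an exon-exon junction
    n_support_improper: int      # improperly paired supporting reads
    in_small_exon: bool = False  # exon length <= read length


def failed_filters(snv):
    """Return the numbers (1-8) of the exclusion criteria that the SNV triggers."""
    n_proper = snv.n_support - snv.n_support_improper
    checks = {
        1: snv.ref_base.upper() == "N",
        2: snv.n_support <= 1,
        3: snv.p_variant < 0.99,
        4: snv.overlaps_indel,
        5: snv.n_support_interior == 0,
        6: snv.n_support_non_junction == 0,
        7: snv.n_support > 0 and snv.n_support_improper / snv.n_support > 0.5,
        8: n_proper < 2,
    }
    return [number for number, triggered in checks.items() if triggered]


def keep_snv(snv):
    """Keep SNVs passing all filters, plus small-exon SNVs failing only filter 6."""
    failed = failed_filters(snv)
    if not failed:
        return True
    return snv.in_small_exon and failed == [6]


good = SNVCandidate("A", 10, 0.999, False, 8, 6, 1)
junction_only = SNVCandidate("C", 5, 0.995, False, 5, 0, 0, in_small_exon=True)
weak = SNVCandidate("G", 1, 0.5, False, 1, 1, 0)
print(keep_snv(good), keep_snv(junction_only), failed_filters(weak))
```

Returning the list of triggered criteria, rather than a single pass/fail value, makes the small-exon exception a one-line comparison and keeps the reason for each rejection available for review.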

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.
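The pre-filter applied before SAMseq can be sketched as follows. This is an illustration only: raw Wilcoxon P-values are assumed to have been computed elsewhere, the Benjamini-Hochberg (BH) adjustment is assumed to be applied across all genes supplied, and the function and variable names are hypothetical. SAMseq itself is an R method and is not reproduced here.

```python
from statistics import median


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def prefilter(rpkm_a, rpkm_b, pvalues, min_median=5.0, alpha=0.05):
    """Drop genes with median < min_median RPKM in both groups or BH-adjusted P > alpha.

    rpkm_a, rpkm_b: dict of gene -> list of RPKM values for each sample group.
    pvalues: dict of gene -> raw Wilcoxon P-value between the two groups.
    """
    genes = list(pvalues)
    adjusted = dict(zip(genes, bh_adjust([pvalues[g] for g in genes])))
    kept = []
    for gene in genes:
        low_in_both = (median(rpkm_a[gene]) < min_median
                       and median(rpkm_b[gene]) < min_median)
        if not low_in_both and adjusted[gene] <= alpha:
            kept.append(gene)
    return kept


group_a = {"G1": [10, 12, 14], "G2": [1, 2, 3], "G3": [20, 22, 24]}
group_b = {"G1": [30, 33, 36], "G2": [2, 3, 4], "G3": [21, 22, 23]}
raw_p = {"G1": 0.001, "G2": 0.001, "G3": 0.8}
kept_genes = prefilter(group_a, group_b, raw_p)
print(kept_genes)
```

In this toy input, G2 is dropped for low expression in both groups and G3 for a non-significant adjusted P-value, so only G1 would be passed on to the differential expression step.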

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix from the top 25% most-variant genes, ranking expressed genes with a mean RPKM of at least 10 by their coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm and 200 iterations for the clustering run. Rank survey profiles of the cophenetic coefficient and silhouette width suggest a specific cluster solution.
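The gene selection that builds the NMF input matrix can be sketched as below. This covers only the filtering and ranking, not the NMF clustering itself. Two points are assumptions rather than stated facts: the top 25% is taken from the ranked list of expressed genes with mean RPKM of at least 10, and the coefficient of variation uses the population standard deviation. The function name and parameters are illustrative.

```python
import math
from statistics import mean, pstdev


def select_nmf_genes(rpkm, noise=0.2, noise_fraction=0.75,
                     min_mean=10.0, top_fraction=0.25):
    """Return the most variable expressed genes, ranked by coefficient of variation.

    rpkm: dict of gene -> list of RPKM values across samples.
    """
    cv = {}
    for gene, values in rpkm.items():
        n_noise = sum(v <= noise for v in values)
        if n_noise >= noise_fraction * len(values):
            continue  # at or below the noise threshold in at least 75% of samples
        mu = mean(values)
        if mu < min_mean:
            continue  # not expressed highly enough to be ranked
        cv[gene] = pstdev(values) / mu
    ranked = sorted(cv, key=cv.get, reverse=True)
    n_keep = math.ceil(top_fraction * len(ranked))
    return ranked[:n_keep]


expression = {
    "A": [10, 20, 30, 40],
    "B": [24, 25, 25, 26],
    "C": [5, 15, 25, 95],
    "D": [18, 20, 22, 20],
    "noise": [0.0, 0.1, 0.2, 50.0],
    "low": [1, 2, 3, 4],
}
selected = select_nmf_genes(expression)
print(selected)
```

Ranking by coefficient of variation rather than raw variance keeps highly expressed genes from dominating the selection purely because of their scale.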

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. 
The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with a median below 5 RPKM in both groups, and those for which the Benjamini-Hochberg (BH) adjusted Wilcoxon P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and genes ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.
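The gene pre-filter and the fold-change ranking can be sketched in a few lines of Python. This is a minimal illustration, not the code that was run: SAMseq itself (an R package) is not reproduced, the Wilcoxon P-values are taken as given inputs, and the function names, the decision to apply the BH adjustment across all supplied genes, and the small `offset` that guards against division by zero are all assumptions.

```python
from statistics import median


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def prefilter_genes(rpkm_a, rpkm_b, raw_p, min_median=5.0, alpha=0.05):
    """Drop genes with median < min_median in both groups or BH-adjusted P > alpha."""
    genes = list(raw_p)
    adjusted = dict(zip(genes, bh_adjust([raw_p[g] for g in genes])))
    kept = []
    for g in genes:
        low_in_both = median(rpkm_a[g]) < min_median and median(rpkm_b[g]) < min_median
        if not low_in_both and adjusted[g] <= alpha:
            kept.append(g)
    return kept


def rank_by_fold_change(rpkm_a, rpkm_b, genes, offset=0.01, top_n=10):
    """Return (up, down): up to top_n genes with the largest median-based fold changes."""
    ratio = {g: (median(rpkm_a[g]) + offset) / (median(rpkm_b[g]) + offset) for g in genes}
    up = sorted((g for g in genes if ratio[g] > 1), key=lambda g: ratio[g], reverse=True)
    down = sorted((g for g in genes if ratio[g] < 1), key=lambda g: ratio[g])
    return up[:top_n], down[:top_n]
```

In the workflow described above, the ranking step would be applied to the genes reported by SAMseq rather than directly to the pre-filtered list.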

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix from the top 25% most-variant genes, obtained by ranking expressed genes with a mean RPKM of at least 10 by their coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, using the default Brunet algorithm and 200 iterations for the clustering run. Rank survey profiles of the cophenetic correlation and silhouette width suggested a specific cluster solution.
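The gene selection that precedes NMF can be illustrated with the short Python sketch below. It covers only the construction of the gene list, not the NMF run itself. The function name, the use of the population standard deviation for the coefficient of variation, and the reading of "top 25%" as a fraction of the genes that pass the noise and mean-expression filters are assumptions.

```python
from statistics import mean, pstdev


def select_nmf_genes(rpkm, noise=0.2, noise_fraction=0.75, min_mean=10.0, top_fraction=0.25):
    """Return the most-variant expressed genes, ranked by coefficient of variation.

    rpkm maps gene name -> list of RPKM values, one per sample.
    """
    candidates = []
    for gene, values in rpkm.items():
        # Remove genes at or below the noise threshold in at least 75% of samples.
        n_noise = sum(1 for v in values if v <= noise)
        if n_noise >= noise_fraction * len(values):
            continue
        # Keep only genes with a mean RPKM of at least 10.
        m = mean(values)
        if m < min_mean:
            continue
        candidates.append((pstdev(values) / m, gene))
    # Highest coefficient of variation first; gene name breaks ties.
    candidates.sort(key=lambda item: (-item[0], item[1]))
    n_keep = int(len(candidates) * top_fraction)
    return [gene for _, gene in candidates[:n_keep]]
```

The rows of the NMF input matrix would then be the RPKM values of the returned genes.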

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified from 2 µg of total RNA using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany), with on-column DNase I treatment as per the manufacturer's instructions. The eluted polyA+ RNA was ethanol-precipitated and resuspended in 10 µL of DEPC-treated water with 1:20 SuperaseIN (Life Technologies, USA).

First-strand cDNA was synthesized from the purified polyA+ RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5 µM, along with Actinomycin D at a final concentration of 1 µg/µL, followed by purification with Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second-strand cDNA was synthesized following the Superscript cDNA Synthesis protocol, replacing dTTP with dUTP in the dNTP mix. This allowed the second strand to be digested with UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter-ligation reaction, thereby achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and a VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and an “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based, paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads and subjected to end-repair and phosphorylation in a single reaction using T4 DNA polymerase, Klenow DNA polymerase and T4 polynucleotide kinase, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, PicoGreen quantification was performed to determine the amount of Illumina PE adapters to use in the subsequent adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads and digested with UNG (1 U/µL) at 37°C for 30 min, followed by deactivation at 95°C for 15 min. The digested cDNA was purified using Ampure XP SPRI beads and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set, with cycling conditions of 98°C for 30 sec, followed by 10-13 cycles of 98°C for 10 sec, 65°C for 30 sec and 72°C for 30 sec, and then 72°C for 5 min. The PCR products were purified using Ampure XP SPRI beads and checked with a Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and the Quant-iT dsDNA HS Assay Kit on a Qubit fluorometer (Invitrogen), then diluted to 8 nM.
The final library concentration was verified with a second Quant-iT dsDNA HS Assay before Illumina sequencing.
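Diluting a library to 8 nM requires converting the fluorometric mass concentration into a molar concentration. The protocol does not state how this was done, so the Python sketch below uses the common approximation of 660 g/mol per base pair of double-stranded DNA. That constant, the function names and the example values are assumptions.

```python
def library_molarity_nm(conc_ng_per_ul, mean_fragment_bp, bp_mass=660.0):
    """Convert a dsDNA concentration (ng/µL) to nM for a given mean fragment size (bp)."""
    return conc_ng_per_ul * 1e6 / (bp_mass * mean_fragment_bp)


def dilution_to_target(stock_nm, stock_volume_ul, target_nm=8.0):
    """Return (final volume, diluent to add), both in µL, to reach the target molarity."""
    if stock_nm < target_nm:
        raise ValueError("stock is already below the target concentration")
    final_volume = stock_nm * stock_volume_ul / target_nm
    return final_volume, final_volume - stock_volume_ul


# Example with made-up values: 2.64 ng/µL at a mean size of 400 bp is about 10 nM,
# so 10 µL of stock needs about 2.5 µL of diluent to reach 8 nM.
molarity = library_molarity_nm(2.64, 400)
final_volume, diluent = dilution_to_target(molarity, 10.0)
```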

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to the GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly with exon-exon junction sequences whose coordinates were defined from annotations of all transcripts in Ensembl (v59), RefSeq and the UCSC Genome Browser known genes, which were downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome and marked with 'ZJ:Z' tags. BWA was run using default parameters, except that the -s option was included to disable Smith-Waterman alignment.
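Repositioning a junction-aligned read amounts to a coordinate conversion from the junction sequence back to the genome. The actual repositioning code is not described here, so the Python sketch below is an assumption-laden illustration: it treats a junction sequence as the left exon flank followed directly by the right exon flank, considers only ungapped alignments, and uses 1-based inclusive coordinates. The function name and arguments are hypothetical.

```python
def reposition(read_pos, read_len, left_start, left_end, right_start):
    """Map an ungapped alignment on a junction sequence to (genomic position, CIGAR).

    read_pos    1-based start of the read on the junction sequence
    left_start  genomic start of the left exon flank
    left_end    genomic end of the left exon flank (last exonic base)
    right_start genomic start of the right exon flank (first exonic base)
    """
    left_len = left_end - left_start + 1
    if read_pos > left_len:
        # The read lies entirely in the right flank.
        return right_start + (read_pos - left_len - 1), f"{read_len}M"
    genomic_pos = left_start + read_pos - 1
    bases_left = left_len - read_pos + 1
    if bases_left >= read_len:
        # The read lies entirely in the left flank.
        return genomic_pos, f"{read_len}M"
    # The read crosses the junction: the intron becomes an N operation.
    intron_len = right_start - left_end - 1
    return genomic_pos, f"{bases_left}M{intron_len}N{read_len - bases_left}M"
```

For example, with a left flank at 1000-1099 and a right flank starting at 2000, a 100-base read starting at junction position 91 maps to genomic position 1090 with CIGAR `10M900N90M`.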

Finally, reads failing the Illumina chastity filter were flagged with a custom script, and duplicate reads were flagged with Picard Tools.
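In the SAM/BAM format, flagging is done by setting bits in the FLAG field rather than by removing reads: bit 0x200 marks a read that fails quality checks and bit 0x400 marks a duplicate. Whether the custom script set 0x200 for chastity failures is an assumption here, and the helper names in this Python sketch are hypothetical.

```python
QC_FAIL = 0x200    # SAM FLAG bit: read fails platform/vendor quality checks
DUPLICATE = 0x400  # SAM FLAG bit: PCR or optical duplicate


def mark(flag, chastity_failed=False, duplicate=False):
    """Return the SAM FLAG with the QC-fail and/or duplicate bits set."""
    if chastity_failed:
        flag |= QC_FAIL
    if duplicate:
        flag |= DUPLICATE
    return flag


def is_usable(flag):
    """True if the read is neither QC-failed nor marked as a duplicate."""
    return not flag & (QC_FAIL | DUPLICATE)
```

Downstream tools can then skip flagged reads while the reads themselves remain in the BAM file.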

 

Structural variant detection

Structural variant detection was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq, assemblies at alternate k-mer values from k50 to k96 were performed using positive-strand and ambiguous-strand reads, as well as negative-strand and ambiguous-strand reads. The positive- and negative-strand assemblies were extended where possible, merged and then concatenated to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single-end mode using k-mer values of k24 and k44. The contigs and reads were then reassembled at k64 in single-end mode and finally at k64 in paired-end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large-scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high-confidence GMAP (v2012-12-20) alignments to two distinct genomic regions. Evidence for the alignments was provided by aligning reads back to the contigs and by aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large-scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment-specific events, the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated and screened against the matching tumour genome BAM files and, where available, germline BAM files. This yielded compartment-specific structural variant events and, where germline data were available, putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

Strand-specific ribodepletion RNA sequencing:

Enzymatic reactions were set up in a 96-well plate (Thermo Fisher Scientific) on a Microlab NIMBUS liquid handler (Hamilton Robotics, USA). 100 ng of DNase I-treated total RNA in 6 µL was hybridized to rRNA probes in a 7.5 µL reaction. Heat-sealed plates were incubated at 95°C for 2 minutes, followed by an incremental reduction in temperature of 0.1°C per second to 22°C (730 cycles). The rRNA in the rRNA:DNA hybrids was digested using RNase H in a 10 µL reaction incubated in a thermocycler at 37°C for 30 minutes. To remove excess rRNA probes (DNA) and residual genomic DNA contamination, DNase I was added in a total reaction volume of 25 µL and incubated at 37°C for 30 minutes. RNA was purified using RNA MagClean DX beads (Aline Biosciences, USA) with 15 minutes of binding time and 7 minutes of clearing on a magnet, followed by two 70% ethanol washes, 5 minutes of air drying of the RNA pellet and elution in 36 µL of DEPC water. The plate containing RNA was stored at -80°C prior to cDNA synthesis.
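The figure of 730 cycles follows directly from the ramp parameters: a drop from 95°C to 22°C in steps of 0.1°C. The short Python check below reproduces that arithmetic. The function name is hypothetical, and the ramp duration assumes exactly one step per second, as the stated rate implies.

```python
def ramp(start_c, end_c, step_c=0.1, seconds_per_step=1.0):
    """Return (number of steps, total seconds) for a stepwise temperature ramp."""
    steps = round((start_c - end_c) / step_c)
    return steps, steps * seconds_per_step


steps, seconds = ramp(95, 22)  # 730 steps over 730 seconds, roughly 12 minutes
```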

First-strand cDNA was synthesized from the purified RNA (minus rRNA) using the Maxima H Minus First Strand cDNA Synthesis kit (Thermo-Fisher, USA) and random hexamer primers at a concentration of 8 ng/µL, along with Actinomycin D at a final concentration of 0.4 µg/µL, followed by PCR Clean DX bead purification on a Microlab NIMBUS robot (Hamilton Robotics, USA). The second-strand cDNA was synthesized following the NEBNext Ultra Directional Second Strand cDNA Synthesis protocol (NEB), which incorporates dUTP in the dNTP mix. This allowed the second strand to be digested with USER™ enzyme (NEB) in the post-adapter-ligation reaction, thereby achieving strand specificity.

cDNA was fragmented by Covaris LE220 sonication for 130 seconds (2 x 65 seconds) at a “Duty cycle” of 30%, 450 W Peak Incident Power and 200 Cycles per Burst in a 96-well microTUBE Plate (P/N: 520078) to achieve average fragment lengths of 200-250 bp. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based library construction protocol on a Microlab NIMBUS robot (Hamilton Robotics, USA). Briefly, the sheared cDNA was subjected to end-repair and phosphorylation in a single reaction using an enzyme premix (NEB) containing T4 DNA polymerase, Klenow DNA polymerase and T4 polynucleotide kinase, incubated at 20°C for 30 minutes. Repaired cDNA was purified in 96-well format using PCR Clean DX beads (Aline Biosciences, USA), then 3’ A-tailed (adenylated) using Klenow fragment (3’ to 5’ exo minus) with incubation at 37°C for 30 minutes prior to enzyme heat inactivation. Illumina PE adapters were ligated at 20°C for 15 minutes. The adapter-ligated products were purified using PCR Clean DX beads, then digested with USER™ enzyme (1 U/µL, NEB) at 37°C for 15 minutes, followed immediately by 13 cycles of indexed PCR using Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) and Illumina’s PE primer set. PCR parameters: 98°C for 1 minute, followed by 13 cycles of 98°C for 15 seconds, 65°C for 30 seconds and 72°C for 30 seconds, and then 72°C for 5 minutes. The PCR products were purified and size-selected twice using a 1:1 PCR Clean DX beads-to-sample ratio. The eluted DNA quality was assessed with a Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA), and the DNA was quantified using a Quant-iT dsDNA High Sensitivity Assay Kit on a Qubit fluorometer (Invitrogen) prior to library pooling and calculation of the size-corrected final molar concentration for Illumina HiSeq2500 sequencing with paired-end 75-base reads.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. 
The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired reads were placed in a mix-fragment BAM file. SNVs were then detected on the positive- and negative-fragment BAM files separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30. The SNVs were further filtered to exclude those with 1) a reference base of N; 2) only 1 read supporting the variant; 3) a probability of a heterozygous or homozygous variant genotype smaller than 0.99; 4) a position overlapping an insertion or deletion; 5) read support from positions no more than 5 bases from read ends; 6) support only from reads spanning an exon-exon junction; 7) more than 0.5 of the supporting reads improperly paired; 8) fewer than 2 properly paired supporting reads. SNVs located in exons equal to or smaller than the read length (100 bp in this case) are a special case, because all of their coverage may come from reads spanning exon-exon junctions. We therefore also identified small-exonic SNVs that were supported only by reads spanning an exon-exon junction but passed the other 7 filtering criteria above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).
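
The eight exclusion criteria and the small-exon exception can be read as a simple decision rule. The Python sketch below is illustrative only and is not the pipeline's actual code: the SnvCall record, its field names, and the reading of criterion 3 as a single combined probability and of criterion 5 as "no supporting base lies more than 5 bases from a read end" are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class SnvCall:
    """Per-site summary of one SNV call (all field names are hypothetical)."""
    ref_base: str
    supporting_reads: int
    p_variant: float             # assumed: P(heterozygous) + P(homozygous variant)
    overlaps_indel: bool
    support_away_from_ends: int  # supporting reads with the variant > 5 bases from both read ends
    non_junction_support: int    # supporting reads that do not span an exon-exon junction
    improper_pair_support: int   # supporting reads that are improperly paired
    proper_pair_support: int     # supporting reads that are properly paired
    exon_length: int             # length in bp of the exon containing the site


def failed_filters(call, min_prob=0.99, max_improper_fraction=0.5):
    """Return the numbers (1-8) of the exclusion criteria that the call triggers."""
    failed = []
    if call.ref_base.upper() == "N":
        failed.append(1)
    if call.supporting_reads <= 1:
        failed.append(2)
    if call.p_variant < min_prob:
        failed.append(3)
    if call.overlaps_indel:
        failed.append(4)
    if call.support_away_from_ends == 0:
        failed.append(5)
    if call.non_junction_support == 0:
        failed.append(6)
    if (call.supporting_reads > 0 and
            call.improper_pair_support / call.supporting_reads > max_improper_fraction):
        failed.append(7)
    if call.proper_pair_support < 2:
        failed.append(8)
    return failed


def classify(call, read_length=100):
    """Label a call as 'pass', 'small_exon_candidate', or 'fail'."""
    failed = failed_filters(call)
    if not failed:
        return "pass"
    # Special case: junction-only support is tolerated in exons no longer than a read.
    if failed == [6] and call.exon_length <= read_length:
        return "small_exon_candidate"
    return "fail"
```

In this sketch a call that trips only criterion 6 and lies in an exon of at most 100 bp is kept in a separate class rather than discarded, mirroring the small-exonic SNV exception described above.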

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify differentially expressed genes. For each run on a pair of sample groups, we first reduced the number of genes by removing those with a median below 5 RPKM in both groups, and those for which the Benjamini-Hochberg (BH) adjusted Wilcoxon P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and genes ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.
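
As an illustration of the gene pre-filtering step that precedes SAMseq, the following Python sketch applies the two removal rules. It is a minimal example under stated assumptions, not the code used for TARGET: the function names and the dictionary-based inputs are hypothetical, and the per-gene Wilcoxon P-values are assumed to have been computed beforehand with a statistics package.

```python
from statistics import median


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def prefilter(genes, group_a, group_b, pvalues, min_median=5.0, alpha=0.05):
    """Return the genes that would be passed on to SAMseq.

    group_a and group_b map gene -> list of RPKM values in each sample group;
    pvalues maps gene -> unadjusted Wilcoxon P-value (computed elsewhere).
    """
    adjusted = bh_adjust([pvalues[g] for g in genes])
    kept = []
    for gene, q in zip(genes, adjusted):
        low_in_both = (median(group_a[gene]) < min_median and
                       median(group_b[gene]) < min_median)
        if not low_in_both and q <= alpha:
            kept.append(gene)
    return kept
```

A gene is dropped if its median is below 5 RPKM in both groups or if its adjusted P-value exceeds 0.05; everything else goes forward to the SAMseq run.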

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix from the top 25% most-variant genes, ranking expressed genes with a mean RPKM of at least 10 by their coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, using the default Brunet algorithm and 200 iterations for the clustering run. Rank survey profiles of the cophenetic coefficient and silhouette width were used to suggest a specific cluster solution.
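
The gene selection for the NMF input matrix can be sketched as follows. This Python example is an illustration under assumptions rather than the original analysis code: the function name and input layout are hypothetical, the sample standard deviation is used for the coefficient of variation, and "top 25%" is taken to mean the top quarter of the genes that survive the noise and mean-expression filters.

```python
import math
from statistics import mean, stdev


def select_nmf_genes(rpkm, noise=0.2, noise_fraction=0.75,
                     min_mean=10.0, top_fraction=0.25):
    """Pick the most-variant expressed genes for the NMF input matrix.

    rpkm maps gene -> list of RPKM values, one per sample (at least 2 samples).
    """
    cv_by_gene = {}
    for gene, values in rpkm.items():
        n_noise = sum(1 for v in values if v <= noise)
        if n_noise >= noise_fraction * len(values):
            continue  # at or below the noise threshold in at least 75% of samples
        mu = mean(values)
        if mu < min_mean:
            continue  # not expressed strongly enough to be ranked
        cv_by_gene[gene] = stdev(values) / mu  # coefficient of variation
    ranked = sorted(cv_by_gene, key=cv_by_gene.get, reverse=True)
    n_keep = math.ceil(top_fraction * len(ranked))
    return ranked[:n_keep]
```

The returned gene list would then define the rows of the matrix handed to the NMF consensus clustering run.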

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified from 2 µg of total RNA using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany), with on-column DNaseI treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10 µL of DEPC-treated water with 1:20 SuperaseIN (Life Technologies, USA).

First-strand cDNA was synthesized from the purified polyA+ RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5 µM along with a final concentration of 1 µg/µL Actinomycin D, followed by cleanup with Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second-strand cDNA was synthesized following the Superscript cDNA Synthesis protocol, replacing dTTP with dUTP in the dNTP mix. This allows the second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter-ligation reaction, thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and a VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based, paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads and subjected to end repair and phosphorylation in a single reaction using T4 DNA polymerase, Klenow DNA polymerase, and T4 polynucleotide kinase, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, PicoGreen quantification was performed to determine the amount of Illumina PE adapters to be used in the subsequent adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads and digested with UNG (1 U/µL) at 37°C for 30 min, followed by deactivation at 95°C for 15 min. The digested cDNA was purified using Ampure XP SPRI beads and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set, with cycling conditions of 98°C for 30 sec, followed by 10-13 cycles of 98°C for 10 sec, 65°C for 30 sec and 72°C for 30 sec, and then 72°C for 5 min. The PCR products were purified using Ampure XP SPRI beads and checked with a Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and the Quant-iT dsDNA HS Assay Kit on a Qubit fluorometer (Invitrogen), then diluted to 8 nM.
The final library concentration was checked again by Quant-iT dsDNA HS Assay before Illumina sequencing.
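
Diluting a library to 8 nM requires converting the fluorometric mass concentration into molarity. The Python sketch below shows one common way to do this and is an assumption-laden illustration, not part of the protocol: it uses the widely quoted approximation of 660 g/mol per base pair of double-stranded DNA, takes the mean fragment size as an input (for example from the Agilent assay), and uses a hypothetical final volume of 20 µL.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair; a common approximation for dsDNA


def library_molarity_nm(conc_ng_per_ul, mean_fragment_bp):
    """Convert a dsDNA concentration in ng/uL to nM for a given mean fragment size."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * mean_fragment_bp)


def dilution_volumes(conc_ng_per_ul, mean_fragment_bp,
                     target_nm=8.0, final_volume_ul=20.0):
    """Return (library_ul, diluent_ul) needed to reach target_nm in final_volume_ul."""
    stock_nm = library_molarity_nm(conc_ng_per_ul, mean_fragment_bp)
    if stock_nm < target_nm:
        raise ValueError("library is already more dilute than the target")
    library_ul = final_volume_ul * target_nm / stock_nm
    return library_ul, final_volume_ul - library_ul
```

For example, a library measured at 2.64 ng/µL with a mean fragment size of 400 bp works out to about 10 nM under this approximation, so roughly 16 µL of library plus 4 µL of diluent would give 20 µL at 8 nM.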

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to a GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences from the GRCh37-lite assembly with exon-exon junction sequences whose coordinates were defined from the annotations of all transcripts in Ensembl (v59), RefSeq and known genes from the UCSC genome browser, which were downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome and marked with 'ZJ:Z' tags. BWA was run using default parameters, except that the option (-s) was included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter were flagged with a custom script, and duplicate reads were flagged with Picard Tools.

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-strand cDNA was synthesized from the purified polyA+ RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5 µM along with a final concentration of 1 µg/µL Actinomycin D, followed by purification with Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second-strand cDNA was synthesized following the Superscript cDNA Synthesis protocol, replacing dTTP with dUTP in the dNTP mix so that the second strand could be digested with UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter-ligation reaction, thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and a VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads and was subjected to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, PicoGreen quantification was performed to determine the amount of Illumina PE adapters to be used in the subsequent adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads and digested with UNG (1 U/µL) at 37°C for 30 min, followed by deactivation at 95°C for 15 min. The digested cDNA was purified using Ampure XP SPRI beads and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set, with cycling conditions of 98°C for 30 sec, followed by 10-13 cycles of 98°C for 10 sec, 65°C for 30 sec and 72°C for 30 sec, and then 72°C for 5 min. The PCR products were purified using Ampure XP SPRI beads and checked with a Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and the Quant-iT dsDNA HS Assay Kit on a Qubit fluorometer (Invitrogen), then diluted to 8 nM. The final library concentration was verified again with the Quant-iT dsDNA HS Assay before Illumina sequencing.
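The dilution to 8 nM requires converting a mass concentration into a molar one. The following is a minimal sketch of that arithmetic, not the centre's own tooling. It assumes the common approximation of 660 g/mol per base pair of double-stranded DNA, and the function names and example values are hypothetical.

```python
# Assumption: average mass of one base pair of dsDNA is ~660 g/mol.
AVG_BP_MASS = 660.0


def library_molarity_nm(conc_ng_per_ul, mean_fragment_bp):
    """Convert a dsDNA concentration in ng/uL to nM for a given mean fragment size."""
    if mean_fragment_bp <= 0:
        raise ValueError("mean fragment size must be positive")
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * mean_fragment_bp)


def dilution_volumes(stock_nm, target_nm, final_volume_ul):
    """Return (stock volume, diluent volume) in uL needed to reach target_nm."""
    if target_nm > stock_nm:
        raise ValueError("stock is already below the target concentration")
    stock_ul = target_nm * final_volume_ul / stock_nm
    return stock_ul, final_volume_ul - stock_ul


if __name__ == "__main__":
    molarity = library_molarity_nm(2.0, 300)  # hypothetical library
    print(f"{molarity:.2f} nM")
    print(dilution_volumes(molarity, 8.0, 20.0))
```

The mean fragment size would come from the size profile (for example the Agilent or LabChip trace), which is why both a sizing assay and a fluorometric assay are needed.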

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to a GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly with exon-exon junction sequences whose coordinates were defined from the annotations of all transcripts in Ensembl (v59), RefSeq and the UCSC genome browser known genes, downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome and marked with 'ZJ:Z' tags. BWA was run using default parameters, except that the option (-s) was included to disable Smith-Waterman alignment.
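As an illustration of the repositioning step, here is a minimal sketch of how an ungapped hit on a junction sequence could be mapped back to genome coordinates. The repositioning code used for TARGET is not described here, so the junction layout (equal flanks from each exon, forward strand only), the function name and the parameters are all assumptions. The content of the 'ZJ:Z' tag is not specified in the text and is not modelled.

```python
def reposition_junction_hit(exon1_end, exon2_start, flank, offset, read_len):
    """Map an ungapped hit on a junction sequence back to the genome.

    Assumed layout: the junction sequence is the last `flank` bases of the
    upstream exon (ending at 1-based `exon1_end`) joined to the first `flank`
    bases of the downstream exon (starting at 1-based `exon2_start`).
    `offset` is the 0-based start of the hit within the junction sequence.
    Returns (1-based genomic start, CIGAR string).
    """
    if offset < 0 or offset + read_len > 2 * flank:
        raise ValueError("hit extends beyond the junction sequence")
    if offset + read_len <= flank:
        # Hit lies entirely in the upstream exon.
        return exon1_end - flank + 1 + offset, f"{read_len}M"
    if offset >= flank:
        # Hit lies entirely in the downstream exon.
        return exon2_start + (offset - flank), f"{read_len}M"
    # Hit crosses the junction: insert an N operation for the skipped intron.
    left = flank - offset
    right = read_len - left
    intron = exon2_start - exon1_end - 1
    return exon1_end - left + 1, f"{left}M{intron}N{right}M"


if __name__ == "__main__":
    print(reposition_junction_hit(1000, 2001, 75, 50, 50))
```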

Finally, reads failing the Illumina chastity filter were flagged with a custom script, and duplicate reads were flagged with Picard Tools.
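The custom flagging script is not published here. As a sketch of what such flagging amounts to in a SAM/BAM record, the example below sets the standard SAM FLAG bits for a QC failure (0x200) and a duplicate (0x400). That the custom script used the 0x200 bit is an assumption, and the function name is hypothetical.

```python
QC_FAIL = 0x200    # SAM FLAG bit: read fails platform/vendor quality checks
DUPLICATE = 0x400  # SAM FLAG bit: PCR or optical duplicate


def flag_read(flag, failed_chastity=False, is_duplicate=False):
    """Return the SAM FLAG with QC-fail and/or duplicate bits set.

    Reads are flagged rather than removed, so downstream tools can decide
    whether to skip them.
    """
    if failed_chastity:
        flag |= QC_FAIL
    if is_duplicate:
        flag |= DUPLICATE
    return flag


if __name__ == "__main__":
    # 99 = paired, properly paired, mate reverse strand, first in pair
    print(flag_read(99, failed_chastity=True))
    print(flag_read(99, is_duplicate=True))
```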

Structural variant detection:

Structural variant detection was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly, alternate k-mers from k50 to k96 were assembled using positive-strand and ambiguous-strand reads as well as negative-strand and ambiguous-strand reads. The positive- and negative-strand assemblies were extended where possible, merged and then concatenated to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single-end mode using k-mer values of k24 and k44. The contigs and reads were then reassembled at k64 in single-end mode and finally at k64 in paired-end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large-scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high-confidence GMAP (v2012-12-20) alignments to two distinct genomic regions. Evidence for the alignments was provided by aligning reads back to the contigs and by aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large-scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.
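The idea of a contig aligning to two distinct genomic regions can be sketched as follows. This is only an illustration of the concept and not the trans-ABySS implementation. The separation cut-off and both read-support thresholds are hypothetical values, since the text does not state them.

```python
def spans_two_regions(blocks, min_separation=10_000):
    """Return True if a contig's alignment blocks hit two distinct regions.

    `blocks` is a list of (chrom, start, end) tuples for one contig.
    `min_separation` is a hypothetical cut-off for same-chromosome blocks.
    """
    for i, (c1, s1, e1) in enumerate(blocks):
        for c2, s2, e2 in blocks[i + 1:]:
            if c1 != c2:
                return True
            if max(s1, s2) - min(e1, e2) >= min_separation:
                return True
    return False


def passes_read_support(contig_reads, genome_reads,
                        min_contig_reads=2, min_genome_reads=2):
    """Apply hypothetical thresholds to the two kinds of read evidence."""
    return contig_reads >= min_contig_reads and genome_reads >= min_genome_reads


if __name__ == "__main__":
    contig = [("chr9", 100, 200), ("chr22", 500, 600)]
    print(spans_two_regions(contig) and passes_read_support(5, 3))
```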

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in each event was calculated from the alignment of reads back to the event breakpoint in the contigs. The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment-specific events, the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated and screened against the matching tumour genome and, where available, germline BAM files. This yielded compartment-specific structural variant events and, where germline data were available, putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired reads were put into the mix-fragment BAM. SNVs were then detected on the positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30. The SNVs were further filtered to exclude those that met any of the following conditions: 1) the reference base was N; 2) only 1 read supported the variant; 3) the probability of the heterozygous and homozygous variant genotypes was smaller than 0.99; 4) the position overlapped an insertion or deletion; 5) read support came from positions no more than 5 bases from read ends; 6) support came only from reads spanning an exon-exon junction; 7) more than 0.5 of the supporting reads were improperly paired; 8) fewer than 2 properly paired reads supported the variant. SNVs located in exons equal to or smaller than the read length (100 bp in this case) are a special case, because all of their coverage may come from reads spanning exon-exon junctions. We therefore also identified small-exon SNVs that were supported only by junction-spanning reads but passed the other 7 filtering criteria. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).
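The eight filters and the small-exon exception can be sketched as a single classification function. This is a hedged illustration, not the pipeline's code. The record fields are hypothetical, filter 3 is read as the sum of the heterozygous and homozygous-variant probabilities, and filter 5 is read as "no supporting read places the variant more than 5 bases from a read end". Both readings are assumptions.

```python
from dataclasses import dataclass

READ_LENGTH = 100  # bp, as stated in the text


@dataclass
class SnvCall:
    ref_base: str
    supporting_reads: int
    p_het: float               # genotype probability, heterozygous
    p_hom_var: float           # genotype probability, homozygous variant
    overlaps_indel: bool
    support_away_from_ends: int  # supporting reads with variant >5 bases from ends
    support_not_junction: int    # supporting reads not spanning a junction
    improper_pair_support: int   # supporting reads that are improperly paired
    exon_length: int


def failed_filters(call):
    """Return the set of filter numbers (1-8) that the call fails."""
    failed = set()
    if call.ref_base.upper() == "N":
        failed.add(1)
    if call.supporting_reads <= 1:
        failed.add(2)
    if call.p_het + call.p_hom_var < 0.99:
        failed.add(3)
    if call.overlaps_indel:
        failed.add(4)
    if call.support_away_from_ends == 0:
        failed.add(5)
    if call.support_not_junction == 0:
        failed.add(6)
    if call.supporting_reads and \
            call.improper_pair_support / call.supporting_reads > 0.5:
        failed.add(7)
    if call.supporting_reads - call.improper_pair_support < 2:
        failed.add(8)
    return failed


def classify(call):
    """Return 'pass', 'pass_small_exon' or 'fail' for one SNV call."""
    failed = failed_filters(call)
    if not failed:
        return "pass"
    # Small-exon exception: only the junction-only filter (6) was failed.
    if failed == {6} and call.exon_length <= READ_LENGTH:
        return "pass_small_exon"
    return "fail"
```

For example, a call supported by six properly paired, junction-spanning reads in an 80 bp exon would be classified as `pass_small_exon`, while the same call in a 500 bp exon would be classified as `fail`.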

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with a median of less than 5 RPKM in both groups, and those for which the Wilcoxon BH-adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.
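The pre-filtering applied before SAMseq can be sketched as below. The original analysis was run in R, so this Python version is only an illustration. It takes raw Wilcoxon P-values as input rather than computing them, and it assumes the Benjamini-Hochberg adjustment was applied across all genes before the median filter. The function names are hypothetical.

```python
from statistics import median


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def prefilter_genes(group_a, group_b, raw_p, min_median=5.0, alpha=0.05):
    """Return genes kept for SAMseq.

    group_a / group_b map gene -> list of RPKM values per sample;
    raw_p maps gene -> unadjusted Wilcoxon P-value between the groups.
    """
    genes = list(raw_p)
    adj = dict(zip(genes, bh_adjust([raw_p[g] for g in genes])))
    kept = []
    for g in genes:
        low_in_both = (median(group_a[g]) < min_median
                       and median(group_b[g]) < min_median)
        if low_in_both or adj[g] > alpha:
            continue
        kept.append(g)
    return kept
```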

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix from the top 25% most-variable genes, ranking expressed genes with a mean RPKM of at least 10 by their coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, using the default Brunet algorithm and 200 iterations for the clustering run. Rank survey profiles of the cophenetic correlation coefficient and silhouette width were used to suggest a specific cluster solution.
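The gene selection that builds the NMF input matrix can be sketched as follows. This is an illustration only, since the original work used R. It assumes that the sample standard deviation is used for the coefficient of variation and that "top 25%" refers to the ranked set of genes passing both expression filters. The function name is hypothetical.

```python
from math import ceil
from statistics import mean, stdev


def select_nmf_genes(rpkm, noise=0.2, noise_fraction=0.75,
                     min_mean=10.0, top_fraction=0.25):
    """Return the most variable genes for the NMF input matrix.

    rpkm maps gene -> list of RPKM values, one per sample.
    """
    cv = {}
    for gene, values in rpkm.items():
        # Drop genes at or below the noise threshold in >= 75% of samples.
        n_noise = sum(v <= noise for v in values)
        if n_noise >= noise_fraction * len(values):
            continue
        gene_mean = mean(values)
        if gene_mean < min_mean:
            continue
        cv[gene] = stdev(values) / gene_mean
    ranked = sorted(cv, key=cv.get, reverse=True)
    n_keep = ceil(len(ranked) * top_fraction)
    return ranked[:n_keep]


if __name__ == "__main__":
    example = {
        "A": [10, 30, 20, 40],
        "B": [20, 20, 21, 19],
        "C": [0.1, 0.1, 0.2, 50],
        "D": [12, 11, 13, 12],
        "E": [5, 100, 10, 60],
        "F": [1, 2, 1, 2],
    }
    print(select_nmf_genes(example))
```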

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified from 2 µg of total RNA using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on-column DNase I treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10 µL of DEPC-treated water with 1:20 SuperaseIN (Life Technologies, USA).

First-strand cDNA was synthesized from the purified polyA+ RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5 µM along with a final concentration of 1 µg/µL Actinomycin D, followed by purification with Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second-strand cDNA was synthesized following the Superscript cDNA Synthesis protocol, replacing dTTP with dUTP in the dNTP mix so that the second strand could be digested with UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter-ligation reaction, thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and a VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads and was subjected to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, PicoGreen quantification was performed to determine the amount of Illumina PE adapters to be used in the subsequent adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads and digested with UNG (1 U/µL) at 37°C for 30 min, followed by deactivation at 95°C for 15 min. The digested cDNA was purified using Ampure XP SPRI beads and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set, with cycling conditions of 98°C for 30 sec, followed by 10-13 cycles of 98°C for 10 sec, 65°C for 30 sec and 72°C for 30 sec, and then 72°C for 5 min. The PCR products were purified using Ampure XP SPRI beads and checked with a Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and the Quant-iT dsDNA HS Assay Kit on a Qubit fluorometer (Invitrogen), then diluted to 8 nM. The final library concentration was verified again with the Quant-iT dsDNA HS Assay before Illumina sequencing.

Strand-specific ribodepletion RNA sequencing:

Enzymatic reactions were set up in a 96-well plate (Thermo Fisher Scientific) on a Microlab NIMBUS liquid handler (Hamilton Robotics, USA). 100 ng of DNase I-treated total RNA in 6 µL was hybridized to rRNA probes in a 7.5 µL reaction. Heat-sealed plates were incubated at 95°C for 2 minutes, followed by an incremental reduction in temperature of 0.1°C per second to 22°C (730 cycles). The rRNA in the rRNA:DNA hybrids was digested using RNase H in a 10 µL reaction incubated in a thermocycler at 37°C for 30 minutes. To remove excess rRNA probes (DNA) and residual genomic DNA contamination, DNase I was added in a total reaction volume of 25 µL and incubated at 37°C for 30 minutes. RNA was purified using RNA MagClean DX beads (Aline Biosciences, USA) with 15 minutes of binding time and 7 minutes of clearing on a magnet, followed by two 70% ethanol washes, 5 minutes to air dry the RNA pellet and elution in 36 µL of DEPC water. The plate containing RNA was stored at -80°C prior to cDNA synthesis.
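The stated 730 cycles follow directly from the ramp parameters. The short sketch below reproduces that arithmetic with exact fractions to avoid floating-point rounding. The function name is hypothetical, and it assumes one 0.1°C step per second as described.

```python
from fractions import Fraction


def ramp_profile(start_c=95, end_c=22, step_c="0.1", seconds_per_step=1):
    """Return (number of steps, total ramp time in minutes)."""
    steps = (Fraction(start_c) - Fraction(end_c)) / Fraction(step_c)
    total_seconds = steps * seconds_per_step
    return int(steps), float(total_seconds) / 60


if __name__ == "__main__":
    steps, minutes = ramp_profile()
    print(steps, round(minutes, 1))
```

Under these assumptions the ramp from 95°C to 22°C takes 730 one-second steps, or a little over 12 minutes.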

First-strand cDNA was synthesized from the purified RNA (minus rRNA) using the Maxima H Minus First Strand cDNA Synthesis kit (Thermo-Fisher, USA) and random hexamer primers at a concentration of 8 ng/µL along with a final concentration of 0.4 µg/µL Actinomycin D, followed by PCR Clean DX bead purification on a Microlab NIMBUS robot (Hamilton Robotics, USA). The second-strand cDNA was synthesized following the NEBNext Ultra Directional Second Strand cDNA Synthesis protocol (NEB), which incorporates dUTP in the dNTP mix so that the second strand could be digested with USER™ enzyme (NEB) in the post-adapter-ligation reaction, thus achieving strand specificity.

cDNA was fragmented by Covaris LE220 sonication for 130 seconds (2 x 65 seconds) at a “Duty cycle” of 30%, 450 Peak Incident Power (W) and 200 Cycles per Burst in a 96-well microTUBE Plate (P/N: 520078) to achieve average fragment lengths of 200-250 bp. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based library construction protocol on a Microlab NIMBUS robot (Hamilton Robotics, USA). Briefly, the sheared cDNA was subjected to end-repair and phosphorylation in a single reaction using an enzyme premix (NEB) containing T4 DNA polymerase, Klenow DNA Polymerase and T4 polynucleotide kinase, incubated at 20°C for 30 minutes. Repaired cDNA was purified in 96-well format using PCR Clean DX beads (Aline Biosciences, USA) and 3’ A-tailed (adenylated) using Klenow fragment (3’ to 5’ exo minus) with incubation at 37°C for 30 minutes prior to enzyme heat inactivation. Illumina PE adapters were ligated at 20°C for 15 minutes. The adapter-ligated products were purified using PCR Clean DX beads, then digested with USER™ enzyme (1 U/µL, NEB) at 37°C for 15 minutes, followed immediately by 13 cycles of indexed PCR using Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) and Illumina’s PE primer set. PCR parameters: 98°C for 1 minute, followed by 13 cycles of 98°C for 15 seconds, 65°C for 30 seconds and 72°C for 30 seconds, and then 72°C for 5 minutes. The PCR products were purified and size-selected twice using a 1:1 PCR Clean DX beads-to-sample ratio. The quality of the eluted DNA was assessed with a Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA), and the DNA was quantified using a Quant-iT dsDNA High Sensitivity Assay Kit on a Qubit fluorometer (Invitrogen) prior to library pooling and size-corrected final molar concentration calculation for Illumina HiSeq2500 sequencing with paired-end 75 base reads.

Strand-specific ribodepletion RNA sequencing:

The NEBNext rRNA Depletion Kit for Human/Mouse/Rat (NEB, E6310X) was used to remove cytoplasmic and mitochondrial ribosomal RNA (rRNA) species from total RNA.




Strand-specific ribodepletion RNA sequencing:

To remove cytoplasmic and mitochondrial ribosomal RNA (rRNA) species from total RNA NEBNext rRNA Depletion Kit for Human/Mouse/Rat was used (NEB, E6310X).

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large-scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high-confidence GMAP (v2012-12-20) alignments to two distinct genomic regions. Evidence for the alignments was provided by aligning reads back to the contigs and by aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large-scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment-specific events, the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against the matching genome tumour and, where available, germline BAM files. This yielded compartment-specific structural variant events and, where germline data were available, putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired reads were placed in a mix-fragment BAM. SNVs were then detected separately on the positive- and negative-split BAMs using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30. The SNVs were further filtered to exclude those for which 1) the reference base was N; 2) only 1 read supported the variant; 3) the probability of the heterozygous and homozygous variant genotypes was smaller than 0.99; 4) the position overlapped with insertions or deletions; 5) read support came from positions no more than 5 bases from read ends; 6) support came only from reads spanning an exon-exon junction; 7) more than 0.5 of the supporting reads were improperly paired; or 8) fewer than 2 properly paired reads supported the variant. SNVs located in exons no longer than the read length (100 bp in this case) are a special case, because all of their coverage may come from reads spanning exon-exon junctions. We therefore also identified small-exonic SNVs that were supported only by reads spanning an exon-exon junction but passed the other 7 filtering criteria above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).
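The eight exclusion criteria and the small-exon exception can be expressed as a simple decision function. The sketch below is illustrative only: the record layout and every field name are hypothetical, and criteria 5 and 6 are read here as "no supporting read places the variant more than 5 bases from a read end" and "no supporting read that does not span a junction", which is one possible interpretation of the text.

```python
from dataclasses import dataclass


@dataclass
class SnvCall:
    # All field names are hypothetical; they mirror the eight criteria.
    ref_base: str
    supporting_reads: int
    p_variant: float               # probability of het + hom variant genotype
    overlaps_indel: bool
    support_beyond_read_ends: int  # support > 5 bases away from read ends
    non_junction_support: int      # support from reads not spanning a junction
    improper_pair_fraction: float
    proper_pair_support: int


def failed_criteria(call):
    """Return the numbers (1-8) of the exclusion criteria that a call meets."""
    checks = {
        1: call.ref_base.upper() == "N",
        2: call.supporting_reads <= 1,
        3: call.p_variant < 0.99,
        4: call.overlaps_indel,
        5: call.support_beyond_read_ends == 0,
        6: call.non_junction_support == 0,
        7: call.improper_pair_fraction > 0.5,
        8: call.proper_pair_support < 2,
    }
    return {number for number, failed in checks.items() if failed}


def classify(call, exon_length, read_length=100):
    """Classify a call as 'pass', 'small_exon_pass' or 'fail'."""
    failed = failed_criteria(call)
    if not failed:
        return "pass"
    # Small-exon exception: only criterion 6 fails, exon <= read length.
    if failed == {6} and exon_length <= read_length:
        return "small_exon_pass"
    return "fail"
```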

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.
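The pre-filtering and ranking steps around SAMseq can be sketched as follows. This is a minimal Python illustration, not the R code that was run: SAMseq itself is not reproduced, the Wilcoxon P-values are taken as given, and the function names, the log2 scale and the pseudocount used to avoid division by zero are assumptions made for the example.

```python
from math import log2
from statistics import median


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def prefilter(group_a, group_b, adj_p, min_median=5.0, alpha=0.05):
    """Keep genes not lowly expressed in both groups and with adj. P <= alpha."""
    keep = []
    for gene in group_a:
        low_in_both = (median(group_a[gene]) < min_median
                       and median(group_b[gene]) < min_median)
        if not low_in_both and adj_p[gene] <= alpha:
            keep.append(gene)
    return keep


def rank_by_median_fold_change(group_a, group_b, genes, top_n=10,
                               pseudocount=0.01):
    """Return up to top_n genes with the largest changes in each direction."""
    fc = {g: log2((median(group_a[g]) + pseudocount)
                  / (median(group_b[g]) + pseudocount)) for g in genes}
    up = sorted((g for g in genes if fc[g] > 0),
                key=lambda g: fc[g], reverse=True)[:top_n]
    down = sorted((g for g in genes if fc[g] < 0),
                  key=lambda g: fc[g])[:top_n]
    return up, down
```

In the workflow described above, the ranking would be applied to the 'up' and 'down' gene lists returned by SAMseq rather than directly to the pre-filtered genes.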

Strand-specific ribodepletion RNA sequencing:

Enzymatic reactions were set up in a 96-well plate (Thermo Fisher Scientific) on a Microlab NIMBUS liquid handler (Hamilton Robotics, USA). 100 ng of DNase I-treated total RNA in 6 µL was hybridized to rRNA probes in a 7.5 µL reaction. Heat-sealed plates were incubated at 95 °C for 2 minutes, followed by an incremental reduction in temperature of 0.1 °C per second to 22 °C (730 cycles). The rRNA in the rRNA:DNA hybrids was digested using RNase H in a 10 µL reaction incubated in a thermocycler at 37 °C for 30 minutes. To remove excess rRNA probes (DNA) and residual genomic DNA contamination, DNase I was added in a total reaction volume of 25 µL and incubated at 37 °C for 30 minutes. RNA was purified using RNA MagClean DX beads (Aline Biosciences, USA) with 15 minutes of binding time, 7 minutes of clearing on a magnet, two 70% ethanol washes, 5 minutes to air-dry the RNA pellet and elution in 36 µL DEPC water. The plate containing RNA was stored at -80 °C prior to cDNA synthesis.

First-strand cDNA was synthesized from the purified RNA (minus rRNA) using the Maxima H Minus First Strand cDNA Synthesis kit (Thermo-Fisher, USA) and random hexamer primers at a concentration of 8 ng/µL along with a final concentration of 0.4 µg/µL Actinomycin D, followed by PCR Clean DX bead purification on a Microlab NIMBUS robot (Hamilton Robotics, USA). The second strand cDNA was synthesized following the NEBNext Ultra Directional Second Strand cDNA Synthesis protocol (NEB) that incorporates dUTP in the dNTP mix, allowing the second strand to be digested using USER enzyme (NEB) in the post-adapter ligation reaction and thus achieving strand specificity.

cDNA was fragmented by Covaris LE220 sonication for 130 seconds (2 x 65 seconds) at a “Duty cycle” of 30%, 450 Peak Incident Power (W) and 200 Cycles per Burst in a 96-well microTUBE Plate (P/N: 520078) to achieve 200-250 bp average fragment lengths. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based library construction protocol on a Microlab NIMBUS robot (Hamilton Robotics, USA). Briefly, the sheared cDNA was subjected to end-repair and phosphorylation in a single reaction using an enzyme premix (NEB) containing T4 DNA polymerase, Klenow DNA Polymerase and T4 polynucleotide kinase, incubated at 20 °C for 30 minutes. Repaired cDNA was purified in 96-well format using PCR Clean DX beads (Aline Biosciences, USA), and 3’ A-tailed (adenylation) using Klenow fragment (3’ to 5’ exo minus) with incubation at 37 °C for 30 minutes prior to enzyme heat inactivation. Illumina PE adapters were ligated at 20 °C for 15 minutes. The adapter-ligated products were purified using PCR Clean DX beads, then digested with USER enzyme (1 U/µL, NEB) at 37 °C for 15 minutes, followed immediately by 13 cycles of indexed PCR using Phusion DNA Polymerase (Thermo Fisher Scientific Inc., USA) and Illumina’s PE primer set. PCR parameters: 98 °C for 1 minute, followed by 13 cycles of 98 °C for 15 seconds, 65 °C for 30 seconds and 72 °C for 30 seconds, and then 72 °C for 5 minutes. The PCR products were purified and size-selected using a 1:1 PCR Clean DX beads-to-sample ratio (twice). The eluted DNA quality was assessed with a Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc., USA), and the DNA was quantified using a Quant-iT dsDNA High Sensitivity Assay Kit on a Qubit fluorometer (Invitrogen) prior to library pooling and size-corrected final molar concentration calculation for Illumina HiSeq2500 sequencing with paired-end 75 base reads.

*Protocols were performed through the laboratory of Dr. Javed Khan.

RNA-seq Library construction and sequencing by Illumina HiSeq2000:

RNA-seq libraries were prepared using Illumina TruSeq Stranded Total RNA Sample Preparation kits according to the manufacturer's protocol. Briefly, ribosomal RNA was removed using Ribo-Zero Gold beads. After purification, total RNA was fragmented into pieces of about 200 nt and then reverse-transcribed using reverse transcriptase and random primers. Second strand cDNA was synthesized using DNA polymerase I and RNase H. A single 'A' base was added to these cDNA fragments, which were then ligated with adaptors. The products were purified and enriched with PCR to create the final RNA-seq libraries. RNA libraries were sequenced on Illumina HiSeq2000 using 100 bp paired-end sequencing according to the manufacturer's protocol.

*Protocols were performed in the laboratory of Dr. Javed Khan.

Reads Alignment:

Reads were aligned to the reference genome (GRCh37) using TopHat version 2.0.8b with default options, except for the options specifying the number of processor threads and fusion search. An example command for alignment of FASTQ files is shown below.

tophat -o tophat.out -p 6 --fusion-search --fusion-min-dist 100000 GRCh37 read_1.fq read_2.fq

Gene and isoform expression:

Gene and isoform expression from RNA-seq data were estimated using Cufflinks version 2.1.1 with default options and a supplied reference annotation (Homo_sapiens.GRCh37.71.gtf). In this mode, Cufflinks does not assemble novel transcripts and ignores alignments that are not structurally compatible with any reference transcript.

Exon expression:

RPKM for a given exon (ExonX) is determined by: ((raw base counts / median read length) * 10^9) / (total reads * exon length). The raw base count for ExonX is the total number of bases aligned to that genomic segment. Raw base counts are used instead of raw read counts because in many cases only a portion of a read aligns to a given exon.
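As a worked illustration of the formula above, the short Python function below computes the exon RPKM from the four quantities named in the text. The function and argument names are chosen for this example and are not part of the original pipeline.

```python
def exon_rpkm(raw_base_count, median_read_length, total_reads, exon_length):
    """RPKM for one exon, following the formula given in the text."""
    # Convert aligned bases into a (possibly fractional) number of reads.
    fractional_reads = raw_base_count / median_read_length
    # Scale per kilobase of exon and per million reads (10^3 * 10^6 = 10^9).
    return fractional_reads * 1e9 / (total_reads * exon_length)


# Example: 5,000 aligned bases from 100 bp reads on a 1,000 bp exon,
# in a library of 1,000,000 reads, gives 50 reads per kb per million.
example_value = exon_rpkm(5000, 100, 1_000_000, 1000)
```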

Gene fusion:

The gene fusion file was generated using deFuse version 0.6.1 with default parameters and the reference annotation Homo_sapiens.GRCh37.69.

*Protocols were performed through the laboratory of Dr. Javed Khan.

Reads Alignment:

RNA sequencing reads were aligned to the GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly with exon-exon junction sequences whose coordinates were defined from the annotations of all transcripts in Ensembl (v59), RefSeq and the known genes from the UCSC genome browser, which were downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome and marked with 'ZJ:Z' tags. BWA was run using default parameters, except that the option (-s) was included to disable Smith-Waterman alignment. Finally, reads failing the Illumina chastity filter were flagged with a custom script, and duplicate reads were flagged with Picard Tools.

Gene Coverage Analysis Protocol: www.bcgsc.ca/downloads/genomes/Homo_sapiens/hg19/1000genomes/bwa_ind/genome/  

Data Level: 3 Data File: *.gene.quantification.txt

The gene coverage analysis was performed with our internal analysis pipeline version 1.1 using "composite" gene annotations from the hg19 (GRCh37-lite) version of the TCGA GAF v3.0. These composite gene models were created in June 2011 by UNC (with assistance from UCSC) based on the annotations in the "UCSC genes" database. Each composite gene annotation was generated by collapsing all transcripts of that gene into a single model such that exonic bases in a composite gene model were the union of exonic bases from all known transcripts of the gene. Thus, the locations of the exonic boundaries used for the gene coverage analysis were not based on a single canonical transcript for each gene. Consequently, the exonic boundaries in a composite gene model may not correspond to the actual boundaries of the expressed transcripts. For simplicity, throughout this document and in the gene coverage results files, a composite gene model is simply referred to as a gene, and it is associated with the id of the gene whose transcripts contributed to that composite model.

To generate the raw read counts, we first counted the number of bases of each read that were inside exonic regions in a gene, and then divided this total base count by the read length. Thus our values for the raw number of reads were not whole numbers. For example, if an entire 50 bp read mapped to an exon, we added 50 to the total base count, which ultimately contributed 1 to the raw read count; if only 25 bases of the read's alignment fell within an exon's boundaries, the total base count was incremented by 25, which ultimately contributed 0.5 to the raw read count. In order to comply with the file format specification enforced by the DCC validator, our raw read counts are rounded to the closest whole number. A gene's raw read count is the sum of raw read counts for exons belonging to the gene. Gene coverage is its raw read count divided by the sum of its exon lengths.

RPKM is calculated using the formula: (number of reads mapped to all exons in a gene x 1,000,000,000)/(NORM_TOTAL x sum of the lengths of all exons in the gene)

[Note: NORM_TOTAL = the total number of reads that are mapped to all exons from the composite gene models. (i.e. sum of the fractional read count for all exons)]

If a read alignment contained a deletion or a large gap, the read did not contribute coverage inside the region spanned by the deletion/gap. Each of the paired end reads was counted separately. We excluded reads from pairs that failed Illumina's Chastity filter, as well as reads with mapping quality < 10.
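The counting rule described above (bases inside exons divided by read length, with gaps contributing nothing and low-quality reads excluded) can be sketched as follows. This is an illustration only, not the internal pipeline: the tuple layout for reads, the half-open coordinate convention and the function names are assumptions, and the exons of a composite gene model are assumed not to overlap one another.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals [start, end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def fractional_read_count(exons, reads, read_length, min_mapq=10):
    """Fractional read count for one composite gene.

    exons: list of (start, end) intervals of the composite gene model.
    reads: list of (blocks, mapq, passed_chastity), where blocks is the
           list of aligned (start, end) segments of one read, so bases
           inside a deletion or large gap are simply absent.
    """
    total_bases = 0
    for blocks, mapq, passed_chastity in reads:
        # Exclude chastity failures and reads with mapping quality < 10.
        if not passed_chastity or mapq < min_mapq:
            continue
        for b_start, b_end in blocks:
            for e_start, e_end in exons:
                total_bases += overlap(b_start, b_end, e_start, e_end)
    return total_bases / read_length


def gene_rpkm(gene_read_count, norm_total, exon_length_sum):
    """RPKM = reads x 10^9 / (NORM_TOTAL x summed exon length)."""
    return gene_read_count * 1e9 / (norm_total * exon_length_sum)
```

For example, with 50 bp reads, a read lying fully inside an exon adds 1 to the count, while a read with only 25 bases inside an exon adds 0.5.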

*.gene.quantification.txt: A tab-delimited text file containing the following fields:
- gene = Gene ID from GAF (version 3.0). The ID follows the nomenclature '<HUGO gene symbol>|<Entrez ID>'. If the combination of the HUGO symbol and the Entrez ID is not unique, an additional 'NofM' descriptor is added. An ID with '?' indicates that the HUGO gene symbol or Entrez ID is not available, e.g. U80769|?; TRNA_Pseudo|?|8of100
- raw_counts = Sum of the fractions of reads (rounded to the nearest integer, as required by the RNA-seq validator) that mapped to the collapsed transcripts representing a specific gene. Reads from pairs that did not pass Illumina’s Chastity filter, or with mapping quality less than 10 (i.e. reads that did not map uniquely), were excluded from the calculation.
- median_length_normalized = Average coverage over all exons in the collapsed transcripts, i.e. the sum of the coverage depth at each base in all exons divided by the sum of the exon lengths.
- RPKM = Reads per kilobase of exon per million. The calculation is described in detail above.
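A file in this layout can be read with a few lines of Python. The sketch below assumes that the file starts with a header row carrying the four field names listed above, which should be checked against the actual files; the function names are chosen for this example.

```python
import csv


def parse_gene_id(gene_id):
    """Split '<HUGO symbol>|<Entrez ID>[|NofM]' into its parts.

    A '?' in either position is returned as None.
    """
    parts = gene_id.split("|")
    symbol = None if parts[0] == "?" else parts[0]
    entrez = None if parts[1] == "?" else int(parts[1])
    n_of_m = None
    if len(parts) > 2:
        n, m = parts[2].split("of")
        n_of_m = (int(n), int(m))
    return symbol, entrez, n_of_m


def read_gene_quantification(lines):
    """Parse tab-delimited lines (header row assumed) into dictionaries."""
    rows = []
    for row in csv.DictReader(lines, delimiter="\t"):
        symbol, entrez, n_of_m = parse_gene_id(row["gene"])
        rows.append({
            "symbol": symbol,
            "entrez_id": entrez,
            "n_of_m": n_of_m,
            "raw_counts": int(row["raw_counts"]),
            "median_length_normalized": float(row["median_length_normalized"]),
            "RPKM": float(row["RPKM"]),
        })
    return rows
```

Because `csv.DictReader` accepts any iterable of lines, an open file handle can be passed directly, for example `read_gene_quantification(open(path))`, where `path` is a placeholder for a file name.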

Exon Coverage Analysis:

Data Level: 3 Data File: *.exon.quantification.txt

The exon coverage analysis was performed with our internal analysis pipeline version 1.1 using "composite" gene annotations from the hg19 (GRCh37-lite) version of the TCGA GAF v3.0. These composite gene models were created in June 2011 by UNC (with assistance from UCSC) based on the annotations in the "UCSC genes" database. Similar to the gene coverage analysis, all transcripts of a given gene were collapsed into a single model such that exonic bases in a composite gene model were the union of exonic bases from all known transcripts of the gene. For simplicity, throughout this document and in the exon coverage results files, a collapsed exon is simply referred to as an exon.

To generate the raw read counts, we first counted the number of bases of each read that were inside an exonic region, and then divided this total base count by the read length. Thus our values for the raw number of reads were not whole numbers. For example, if an entire 50 bp read mapped to an exon, we added 50 to the total base count, which ultimately contributed 1 to the raw read count; if only 25 bases of the read's alignment fell within an exon's boundaries, the total base count was incremented by 25, which ultimately contributed 0.5 to the raw read count. In order to comply with the file format specification enforced by the DCC validator, our raw read counts are rounded to the closest whole number. Exon coverage is the raw read count of an exon divided by its length.

RPKM is calculated using the formula: (number of reads (fractional) mapped to an exon x 1,000,000,000)/(NORM_TOTAL x length of an exon)

[Note: NORM_TOTAL = the total number of reads (fractional) that mapped to exons, excluding those in the mitochondrial chromosome]

If a read alignment contained a deletion or a large gap, the read did not contribute coverage inside the region spanned by the deletion/gap. Each of the paired end reads was counted separately. We excluded reads from pairs that failed Illumina's Chastity filter, as well as reads with mapping quality < 10.

*.exon.quantification.txt: A tab-delimited text file containing the following fields:
- exon = Exon coordinates according to GAF (version 3.0) with the nomenclature chr<chromosome number>:<start coordinate>-<end coordinate>:<strand>. A '.' in the <strand> position indicates that no strand information was available, e.g. chr10:120810487-120810613:.
- raw_counts = Sum of the fractions of reads (rounded to the nearest integer, as required by the RNA-seq validator) that mapped to an exon. Reads from pairs that did not pass Illumina’s Chastity filter or with mapping quality less than 10 were excluded from the calculation.
- median_length_normalized = Average coverage over the exon, i.e. the sum of the coverage depth at each base in an exon divided by the length of the exon.
- RPKM = Reads per kilobase of exon per million.
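The exon identifier can be split into its components as shown in the sketch below. The helper names are chosen for this example, and the length calculation assumes that the start and end coordinates are 1-based and inclusive, which is not stated in the text and should be confirmed against the GAF documentation.

```python
def parse_exon_id(exon_id):
    """Split 'chr<N>:<start>-<end>:<strand>' into its components.

    A '.' strand (no strand information) is returned as None.
    """
    chrom, span, strand = exon_id.split(":")
    start, end = (int(x) for x in span.split("-"))
    return chrom, start, end, (None if strand == "." else strand)


def exon_length(start, end):
    """Exon length, ASSUMING 1-based inclusive coordinates."""
    return end - start + 1


def exon_coverage(raw_read_count, length):
    """Exon coverage = raw read count of the exon divided by its length."""
    return raw_read_count / length
```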

Exon expression:

The exon expression file was generated using dexseq_count.py, included in the R package DEXSeq 1.12.1, with the annotation Homo_sapiens.GRCh37.71.gff and default parameters except for -p yes (indicating that the data are paired-end) and -s no (indicating that the data are not from a strand-specific assay).

*Protocols were performed through the laboratory of Dr. Javed Khan.

RNA-seq Library construction and sequencing by Illumina HiSeq2000:

PolyA+ RNA was purified using the MACS mRNA isolation kit (Miltenyi Biotec, Bergisch Gladbach, Germany) from 5-10 µg of DNase I-treated total RNA as per the manufacturer’s instructions. Double-stranded cDNA was synthesized from the purified polyA+ RNA using the Superscript Double-Stranded cDNA Synthesis kit (Invitrogen, Carlsbad, CA, USA) and random hexamer primers (Invitrogen) at a concentration of 5 µM. The cDNA was fragmented by sonication and a paired-end sequencing library was prepared following the Illumina paired-end library preparation protocol (Illumina, Hayward, CA, USA). RNA samples were prepared with Illumina TruSeq RNA Sample Preparation V2 kits according to the manufacturer's protocol. Poly-A containing mRNA was purified using poly-T oligo-attached magnetic beads and then fragmented. RNA fragments of ~200 bp were reverse-transcribed and ligated with adaptors for sequencing. RNA libraries were sequenced on Illumina HiSeq2000 using 100 bp paired-end sequencing according to the manufacturer's protocol.

*Protocols were performed in the laboratory of Dr. Javed Khan.

RNA-seq Library construction and sequencing by Illumina HiSeq2000:

PolyA+ RNA was purified using the MACS mRNA isolation kit (Miltenyi Biotec, Bergisch Gladbach, Germany), from 5-10ug of DNaseI-treated total RNA as per the manufacturer’s instructions. Double-stranded cDNA was synthesized from the purified polyA+ RNA using the Superscript Double-Stranded cDNA Synthesis kit (Invitrogen, Carlsbad, CA, USA) and random hexamer primers (Invitrogen) at a concentration of 5µM. The cDNA was fragmented by sonication and a paired-end sequencing library prepared following the Illumina paired-end library preparation protocol (Illumina, Hayward, CA, USA). RNA samples were prepared by Illumina TruSeqRNA Sample Preparation V2 kits according to the manufacturer's protocol. Poly-A containing mRNA was purified using poly-T oligo-attached magnetic beads and then fragmented. RNA fragments of ~200bp were reverse-transcribed and ligated with adaptors for sequencing. RNA libraries were sequenced on Illumina HiSeq2000 using 100bp paired-end sequencing according to the manufacturer's protocol.

*Protocols were performed in the laboratory of Dr. Javed Khan.

Reads Alignment:

Align reads to reference genome (GRCh37) using Tophat version 2.0.8b with default options, expect for options specifying number of processor threads and fusion search. An example code for alignment with fastq files is shown below.

-o tophat.out –p 6 --fusion-search –fusion-min-dist 100000 GRCh37 read_1.fq read_2.fq

RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment. Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

Gene Coverage Analysis Protocol: www.bcgsc.ca/downloads/genomes/Homo_sapiens/hg19/1000genomes/bwa_ind/genome/  

Data Level: 3 Data File: *.gene.quantification.txt

The gene coverage analysis was performed with our internal analysis pipeline version 1.1 using "composite" gene annotations from the hg19 (GRCh37-lite) version of the TCGA GAF v3.0. These composite gene models were created in June 2011 by UNC (with assistance from UCSC) based on the annotations in the "UCSC genes" database. Each composite gene annotation was generated by collapsing all transcripts of that gene into a single model such that exonic bases in a composite gene model were the union of exonic bases from all known transcripts of the gene. Thus, the locations of the exonic boundaries used for the gene coverage analysis were not based on a single canonical transcript for each gene. Consequently, the exonic boundaries in a composite gene model may not correspond to the actual boundaries of the expressed transcripts. For simplicity, throughout this document and in the gene coverage results files, a composite gene model is simply referred to as a gene, and it is associated with the id of the gene whose transcripts contributed to that composite model. To generate the raw read counts, we first counted the number of bases of each read that were inside exonic regions in a gene, and then divided this total base count by the read length. Thus our values for the raw number of reads were not whole numbers (i.e. if the entire 50bp read mapped to an exon, we would add 50 to the total base count, which would ultimately contribute 1 to the raw read count. However, if only 25 bases of the read's alignment fell within an exon's boundaries, the total base count would be incremented by 25, which would ultimately contribute 0.5 to the raw read count). In order to comply with the file format specification enforced by the DCC validator, our raw read counts are rounded to the closest whole number. A gene's raw read count is the sum of raw read counts for exons belonging to the gene. Gene coverage is its raw read count divided by the sum of its exon lengths. 
RPKM is calculated using the formula: (number of reads mapped to all exons in a gene x 1,000,000,000)/(NORM_TOTAL x sum of the lengths of all exons in the gene )

[Note: NORM_TOTAL = the total number of reads that are mapped to all exons from the composite gene models. (i.e. sum of the fractional read count for all exons)]

If a read alignment contained a deletion or a large gap, the read did not contribute coverage inside the region spanned by the deletion/gap. Each of the paired end reads was counted separately. We excluded reads from pairs that failed Illumina's Chastity filter, as well as reads with mapping quality < 10.
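The fractional counting and RPKM arithmetic described above can be sketched as follows. This is a minimal illustration, not the internal pipeline itself: the function names, the half-open interval convention, and the toy reads are assumptions made for the example. A read with a deletion or gap is represented as several aligned blocks, so the gapped region contributes no coverage.

```python
from typing import Iterable, List, Tuple

# ASSUMPTION: intervals are half-open [start, end) reference coordinates.
Interval = Tuple[int, int]


def overlap_bases(blocks: Iterable[Interval], exons: List[Interval]) -> int:
    """Count aligned bases inside exons (exons assumed non-overlapping)."""
    total = 0
    for b_start, b_end in blocks:
        for e_start, e_end in exons:
            total += max(0, min(b_end, e_end) - max(b_start, e_start))
    return total


def fractional_count(reads: Iterable[Tuple[List[Interval], int]],
                     exons: List[Interval]) -> float:
    """Sum over reads of (exonic bases / read length)."""
    return sum(overlap_bases(blocks, exons) / read_len
               for blocks, read_len in reads)


def rpkm(count: float, norm_total: float, exon_length_sum: int) -> float:
    """(count x 1,000,000,000) / (NORM_TOTAL x summed exon length)."""
    return count * 1_000_000_000 / (norm_total * exon_length_sum)


# Hypothetical composite gene with two 100 bp exons.
exons = [(100, 200), (300, 400)]
reads = [
    ([(120, 170)], 50),              # fully exonic 50 bp read: adds 1.0
    ([(175, 225)], 50),              # 25 of 50 bases exonic: adds 0.5
    ([(180, 200), (300, 330)], 50),  # gapped alignment, both blocks exonic: adds 1.0
]
gene_count = fractional_count(reads, exons)
print(gene_count, round(gene_count))      # fractional value and the rounded raw count
print(rpkm(gene_count, 1_000_000, 200))   # NORM_TOTAL of 1,000,000 is made up
```

Note that rounding is applied only to the reported raw count; the sketch passes the fractional value to the RPKM formula, consistent with the definition of NORM_TOTAL as a sum of fractional read counts.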

*.gene.quantification.txt: A tab-delimited text file containing the following fields:
- gene = Gene ID from GAF (version 3.0). The ID follows the nomenclature '<HUGO gene symbol>|<Entrez ID>'. If the combination of the HUGO symbol and the Entrez ID is not unique, an additional 'NofM' descriptor is added. An ID with '?' indicates that the HUGO gene symbol or Entrez ID is not available, e.g. U80769|?; TRNA_Pseudo|?|8of100
- raw_counts = Sum of fraction of reads (rounded off to nearest integer - restricted by the RNA-seq validator) that mapped to collapsed transcripts representing a specific gene. Reads from pairs that did not pass Illumina’s Chastity filter or with mapping quality less than 10, i.e. reads that did not map uniquely, were excluded from calculation.
- median_length_normalized = Average coverage over all exons in the collapsed transcripts, i.e. the sum of the coverage depth at each base in all exons divided by the sum of the exon lengths.
- RPKM = Reads per kilobase of exon per million. The calculation is described in detail above.
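The gene ID nomenclature above can be split into its parts with a few lines of Python. This is a sketch under stated assumptions: the `GeneId` type, its field names, and `parse_gene_id` are hypothetical helpers for illustration, and '?' is mapped to `None`.

```python
from typing import NamedTuple, Optional


class GeneId(NamedTuple):
    symbol: Optional[str]      # HUGO gene symbol, None if '?'
    entrez_id: Optional[str]   # Entrez ID, None if '?'
    ordinal: Optional[int]     # N in the optional 'NofM' descriptor
    total: Optional[int]       # M in the optional 'NofM' descriptor


def parse_gene_id(field: str) -> GeneId:
    """Parse '<HUGO gene symbol>|<Entrez ID>' with an optional '|NofM' suffix."""
    parts = field.split("|")
    if len(parts) not in (2, 3):
        raise ValueError(f"unexpected gene ID: {field!r}")
    symbol = None if parts[0] == "?" else parts[0]
    entrez_id = None if parts[1] == "?" else parts[1]
    ordinal = total = None
    if len(parts) == 3:
        n, sep, m = parts[2].partition("of")
        if not sep:
            raise ValueError(f"unexpected NofM descriptor: {parts[2]!r}")
        ordinal, total = int(n), int(m)
    return GeneId(symbol, entrez_id, ordinal, total)


# The two example IDs given in the field description.
print(parse_gene_id("U80769|?"))
print(parse_gene_id("TRNA_Pseudo|?|8of100"))
```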

Exon Coverage Analysis:

Data Level: 3 Data File: **.exon.quantification.txt

The exon coverage analysis was performed with our internal analysis pipeline version 1.1 using "composite" gene annotations from the hg19 (GRCh37-lite) version of the TCGA GAF v3.0. These composite gene models were created in June 2011 by UNC (with assistance from UCSC) based on the annotations in the "UCSC genes" database. Similar to the gene coverage analysis, all transcripts of a given gene were collapsed into a single model such that exonic bases in a composite gene model were the union of exonic bases from all known transcripts of the gene. For simplicity, throughout this document and in the exon coverage results files, the collapsed exons are simply referred to as an exon. To generate the raw read counts, we first counted the number of bases of each read that were inside an exonic region, and then divided this total base count by the read length. Thus our values for the raw number of reads were not whole numbers (i.e. if the entire 50bp read mapped to an exon, we would add 50 to the total base count, which would ultimately contribute 1 to the raw read count. However, if only 25 bases of the read's alignment fell within an exon's boundaries, the total base count would be incremented by 25, which would ultimately contribute 0.5 to the raw read count). In order to comply with the file format specification enforced by the DCC validator, our raw read counts are rounded to the closest whole number. Exon coverage is the raw read count of an exon divided by its length. RPKM is calculated using the formula (number of reads (fractional) mapped to an exon x 1,000,000,000)/(NORM_TOTAL x length of an exon) [Note: NORM_TOTAL = the total number of reads (fractional) that mapped to exons, excluding those in the mitochondrial chromosome] If a read alignment contained a deletion or a large gap, the read did not contribute coverage inside the region spanned by the deletion/gap. Each of the paired end reads was counted separately. 
We excluded reads from pairs that failed Illumina's Chastity filter, as well as reads with mapping quality < 10.

**.exon.quantification.txt: A tab-delimited text file containing the following fields:
- exon = Exon coordinates according to GAF (version 3.0) with the nomenclature chr<chromosome number>:<start coordinate>-<end coordinate>:<strand>. '.' in the <strand> indicates that there was no strand information available, e.g. chr10:120810487-120810613:.
- raw_counts = Sum of fraction of reads (rounded off to nearest integer - restricted by the RNA-seq validator) that mapped to an exon. Reads from pairs that did not pass Illumina’s Chastity filter or with mapping quality less than 10 were excluded from calculation.
- median_length_normalized = Average coverage over the exon, i.e. the sum of the coverage depth at each base in an exon divided by the length of the exon.
- RPKM = Reads per kilobase of exon per million.
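A small parser for the exon coordinate field might look like the sketch below. The names `Exon`, `parse_exon`, and `exon_length` are hypothetical. Two assumptions are made here and are not stated in the field description: the trailing '.' in the example is read as the strand placeholder, and coordinates are treated as 1-based and inclusive when computing length.

```python
import re
from typing import NamedTuple

# chr<chromosome>:<start>-<end>:<strand>, where strand is '+', '-' or '.'
_EXON_RE = re.compile(r"^(chr[^:]+):(\d+)-(\d+):([+\-.])$")


class Exon(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str  # '.' means no strand information was available


def parse_exon(field: str) -> Exon:
    """Split an exon coordinate string into chromosome, start, end, strand."""
    match = _EXON_RE.match(field.strip())
    if match is None:
        raise ValueError(f"unexpected exon coordinate: {field!r}")
    chrom, start, end, strand = match.groups()
    return Exon(chrom, int(start), int(end), strand)


def exon_length(exon: Exon) -> int:
    # ASSUMPTION: 1-based inclusive coordinates.
    return exon.end - exon.start + 1


exon = parse_exon("chr10:120810487-120810613:.")
print(exon, exon_length(exon))
```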

Gene and isoform expression:

Gene and isoform expression values were generated from RNA-seq data using Cufflinks version 2.1.1 with default options and a supplied reference annotation (Homo_sapiens.GRCh37.71.gtf) for estimation of expression. In this mode Cufflinks does not assemble novel transcripts, and it ignores alignments that are not structurally compatible with any reference transcript.

Exon expression:

The exon expression file was generated using dexseq_count.py, included in the R package DEXSeq 1.12.1, with annotation (Homo_sapiens.GRCh37.71.gff) and default parameters except for -p yes (indicates the data is paired end) and -s no (indicates the data is not from a strand-specific assay).
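For orientation, the non-default options above can be assembled into a command line as in the sketch below. The helper name and the alignment and output file names are hypothetical; the positional order (annotation, alignment file, output file) follows the script's documented usage as best understood here and should be checked against the installed DEXSeq version. The code only builds and prints the command; it does not run it.

```python
import shlex
from typing import List


def dexseq_count_command(gff: str, alignments: str, output: str,
                         script: str = "dexseq_count.py") -> List[str]:
    """Build an argument list with the two non-default options from the text."""
    return [
        "python", script,
        "-p", "yes",   # data is paired end
        "-s", "no",    # data is not from a strand-specific assay
        gff, alignments, output,
    ]


# Hypothetical input and output names, for illustration only.
cmd = dexseq_count_command("Homo_sapiens.GRCh37.71.gff",
                           "sample.sam",
                           "sample.exon_counts.txt")
print(shlex.join(cmd))
```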

Gene fusion:

The gene fusion file was generated using deFuse version 0.6.1 with default parameters and with reference annotation Homo_sapiens.GRCh37.69.

St Jude’s RNA-seq Protocol for ALAL:

At SJCRH, total RNA quality and quantity were assessed on Agilent RNA6000 chips (Agilent Technologies) and Qubit (Life Technologies). RNA-seq libraries were prepared from 500 ng of total RNA for each sample following Illumina RNA-seq protocols, including DNase treatment and phenol purification, cDNA conversion, fragmentation by Covaris Ultrasonicator, end repair, deoxyadenosine tailing, adaptor ligation and PCR amplification (ten cycles). Libraries with a 10 pM concentration were clustered on an Illumina cBot, and each flow cell was loaded onto a HiSeq instrument for sequencing using the Illumina 2×100 bp sequencing kit.

Other Targeted Sequencing
Sequencing Platform Data Generation Protocols Data Analysis Protocols
Kinome Sequencing ALL P1 ALL P1
Targeted Capture Sequencing AML , NBL , WT
Targeted Resequencing (Sanger) ALL P1 ALL P1

Kinome Sanger Sequencing

*Protocols performed at British Columbia Cancer Agency. Please refer to Loh et al. (Tyrosine kinome sequencing of pediatric acute lymphoblastic leukemia: a report from the Children's Oncology Group TARGET Project, Blood. 2013 Jan 17; 121(3): 485–488).

Patient selection and characteristics

Forty-five cryopreserved diagnostic bone marrow or peripheral blood specimens with at least 80% blasts from children with newly diagnosed ALL were selected for kinome sequencing (Table 1), including 23 from COG P9906 that lacked JAK mutations and 22 AALL0232 patients of unknown JAK mutation status. AALL0232 eligibility included age at least 10 years and/or initial peripheral blood white blood cell count of at least 50 000/μL. Minimal residual disease (MRD) burden was determined via flow cytometry in 1 of 2 central reference laboratories at day 29 of induction therapy. All P9906/AALL0232 patients or their parents/guardians provided informed consent for treatment and for banking of specimens for future research in accordance with the Declaration of Helsinki. Institutional review board approval for the laboratory studies was granted by St Jude Children's Research Hospital and the University of New Mexico.

GEP and sample selection

RNA extraction and GEP characterization for P9906 cases have been described previously. Affymetrix U133 Plus Version 2.0 gene expression microarray and Affymetrix SNP Version 6.0 microarray profiling were performed on 608 patients consecutively enrolled on AALL0232 with sufficient banked material available; 325 were used as a training set, after modeling the Ph-like GEP on the BCR-ABL1+ patients within this training set (n = 21).

We then applied Prediction Analysis for Microarrays (PAM), trained using Ph+ cases to identify all Ph-like cases (supplemental Figure 1, available on the Blood Web site; see the Supplemental Materials link at the top of the online article). We classified patients in the test set (283 AALL0232 patients) using this Ph-like signature and assessed the outcome of all Ph-like patients enrolled on AALL0232. We further applied this Ph-like PAM algorithm to the COG P9906 samples to identify Ph-like cases in that cohort, and then assessed the prognosis of this group of ALL cases.

We selected 45 P9906 and AALL0232 cases that were either predicted to be Ph-like by PAM (31 cases; 12 of 23 from P9906 and 19 of 22 from AALL0232), or had high CRLF2 expression or other features suggestive of activated kinase signaling (n = 14) for sequence analysis of 126 genes that encode TKs or mediators of kinase signaling (supplemental Table 1). The entire coding and untranslated regions of each selected gene were subsequently amplified by PCR of whole genome amplified (QIAGEN) genomic DNA and subjected to Sanger sequencing (Beckman Coulter Genomics). A CEPH sample (NA19085) was included as a normal control. Sequence variations were detected using SNPdetector and novel, putative nonsilent coding mutations were selected for validation. Forty-one novel variants that failed in the validation assay or had no matching germline samples were compared with germline variants identified by the National Center for Biotechnology Information Exome Sequencing Project (http://evs.gs.washington.edu/EVS) and 1000 Genomes Project deposited in dbSNP 135 (http://www.ncbi.nlm.nih.gov/projects/SNP). For the 22 patient samples from AALL0232, we performed Sanger sequencing separately for the 5 most commonly mutated exons of JAK1 and JAK2. The gene expression data for COG P9906 have been deposited at the National Center for Biotechnology Information Gene Expression Omnibus (accession no. GSE11877). The gene expression data without metadata for COG AALL0232 are deposited at the National Cancer Institute caArray site, project identifier EXP-578 (https://array.nci.nih.gov/caarray/project/EXP-578).


Targeted Capture Sequencing

*Protocols were performed at British Columbia Cancer Agency and Fred Hutchinson Cancer Research Center.

Sample cohort selection

Some sequence mutations identified in the relapse-enriched discovery cohort, along with some previously published variants in adult AML, were further analyzed in more than 600 additional cases (a variety of sample combinations, including primary tumors, some relapsed tumors, and some matched normal samples). The TARGET AML project team employed targeted capture sequencing to look at the presence and frequency of alterations in 400 gene variants. This validation effort was performed in an unbiased cohort that was randomly selected from patients enrolled on a single COG protocol, which allowed for determination of the frequency of these changes across a broader spectrum of AML subtypes.

Probe design for custom capture validation sequencing

Probes were designed using Agilent's SureDesign web-based tool (https://earray.chem.agilent.com/suredesign/). There were 420 targets provided as HUGO gene symbols along with 560 RefSeq IDs. RefSeq IDs were used to avoid ambiguity about which isoform of a gene symbol was being targeted. In addition to these 560 RefSeq IDs, 15 genomic positions were also included in the target region.

During the design phase, the RefSeq IDs were limited to coding exons and UTRs (both 5' and 3'). Each region was padded with an additional 10 bases on both the 5' and 3' ends. This resulted in an overall target space of 2.376 Mbp (megabase pairs). Probe density was specified at 2x, with moderately stringent repeat masking and balanced boosting options selected.

According to the Agilent SureDesign report, 43,137 probes were designed for a total size of 2.785 Mbp (this total counts overlapping bases in adjacent probes separately). 98.7144% of the 2.376 Mbp target region was covered by a probe. Custom Java code was developed at the British Columbia Cancer Agency Genome Sciences Centre (BCGSC) to independently verify the accuracy of the probe design. This code incorporated a BLAT alignment of each probe against the reference genome (H. sapiens, hg19, GRCh37, February 2009) to ensure the target region was covered. Once the probe design was verified, Agilent SureSelect XT Custom 0.5-2.9Mb probes were ordered.
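The idea behind the coverage check, padding each target region by 10 bases and asking what fraction of the padded target is overlapped by at least one probe, can be sketched in a few lines. This is an illustrative Python sketch only, not the BCGSC Java code: it skips the BLAT alignment step, handles a single chromosome, uses half-open intervals, and all function names and coordinates are made up for the example.

```python
from typing import List, Tuple

# ASSUMPTION: half-open [start, end) intervals on one chromosome.
Interval = Tuple[int, int]


def pad(intervals: List[Interval], flank: int = 10) -> List[Interval]:
    """Extend each region by `flank` bases on both ends."""
    return [(max(0, start - flank), end + flank) for start, end in intervals]


def merge(intervals: List[Interval]) -> List[Interval]:
    """Collapse overlapping or adjacent intervals so bases are counted once."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(targets: List[Interval], probes: List[Interval]) -> float:
    """Fraction of target bases overlapped by at least one probe."""
    targets, probes = merge(targets), merge(probes)
    total = sum(end - start for start, end in targets)
    covered = sum(max(0, min(t_end, p_end) - max(t_start, p_start))
                  for t_start, t_end in targets
                  for p_start, p_end in probes)
    return covered / total if total else 0.0


# Hypothetical exons and probe placements.
targets = pad([(100, 200), (500, 560)])   # padded to (90, 210) and (490, 570)
probes = [(90, 210), (500, 560)]          # second target's padding is left uncovered
print(covered_fraction(targets, probes))
```

Applied to the reported figures, 98.7144% of a 2.376 Mbp target corresponds to roughly 2.345 Mbp of target bases covered by at least one probe.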

Whole genome library construction and multiplex custom gene capture

Genomic DNA libraries from which gene regions of interest were captured were constructed according to British Columbia Cancer Agency Genome Sciences Centre (BCGSC) plate-based and paired-end library protocols on a Biomek FX liquid handling robot (Beckman-Coulter, USA). Briefly, 1 µg of high molecular weight genomic DNA was sonicated (Covaris E210) in a 60 µL volume to 200-300 bp. Sonicated DNA was purified with magnetic beads (Agencourt, Ampure). The DNA fragments were end-repaired, phosphorylated and bead purified in preparation for A-tailing. Illumina sequencing adapters were ligated overnight at 20 °C, and adapter-ligated products were bead purified and enriched with 4 cycles of PCR using primers containing a hexamer index that enables library pooling. 94 ng from each of 19 to 24 different libraries was pooled prior to custom capture using Agilent SureSelect XT Custom 0.5-2.9Mb probes. The pooled libraries were hybridized to the RNA probes at 65 °C for 24 hours. Following hybridization, streptavidin-coated magnetic beads (Dynal, MyOne) were used for custom capture. Post-capture material was purified on MinElute columns (Qiagen) followed by post-capture enrichment with 10 cycles of PCR using primers that maintain the library-specific indices. Paired-end 100 base reads were sequenced per pool in a single lane of an Illumina HiSeq2500 instrument.

Targeted Capture Sequencing

*Protocols were performed at British Columbia Cancer Agency and Children's Hospital of Philadelphia.

Sample cohort selection

Some sequence mutations identified in the discovery cohort, along with some previously published variants, were further analyzed in an additional 500 cases (tumor and matched normal samples). The TARGET NBL project team employed targeted capture sequencing to look at the presence and frequency of alterations in 400 gene variants. This validation effort was performed in an unbiased cohort that was randomly selected from patients enrolled on a single COG protocol, which allowed for determination of the frequency of these changes across a broader spectrum of NBL subtypes.

Probe design for custom capture validation sequencing

Probes were designed using Agilent's SureDesign online web-based tool, located at URL: https://earray.chem.agilent.com/suredesign/.  There were 420 targets provided as HUGO gene symbols along with 560 RefSeq IDs.  RefSeq IDs were used in order to avoid ambiguity in targeting the incorrect isoform of a gene symbol. In addition to these 560 RefSeqIDs, 15 genomic positions were also included in the target region.

During the design phase, the RefSeq IDs were limited to coding exons, UTRs (both 5' and 3' UTR).  Each region was padded with an additional 10 bases on both the 5' and 3' ends.  This resulted in an overall target space of 2.376 Mbp (megabase pairs). Probe density was specified at 2x, with moderately stringent repeat masking, and balanced boosting options selected.

According to the Agilent SureDesign report, 43,137 probes were designed for a total size of 2.785 Mbp (this total counts overlapping bases in adjacent probes separately). Probes covered 98.7144% of the 2.376 Mbp target region. Custom Java code was developed at the British Columbia Cancer Agency Genome Sciences Centre (BCGSC) to independently verify the accuracy of the probe design. This code incorporated a BLAT alignment of each probe against the reference genome (H. sapiens, hg19, GRCh37, February 2009) to confirm that the target region was covered. Once the probe design was verified, Agilent SureSelect XT Custom 0.5-2.9Mb probes were ordered.
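The BCGSC verification code was written in Java and is not reproduced here. As a rough illustration of the kind of check described, the Python sketch below computes the fraction of target bases overlapped by at least one probe. It assumes probe alignment coordinates are already available (for example, parsed from BLAT output) as 0-based, half-open intervals; all coordinates shown are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), 0-based half-open


def merge_by_chrom(intervals: List[Interval]) -> Dict[str, List[Tuple[int, int]]]:
    """Group intervals by chromosome and merge any that overlap or touch."""
    by_chrom: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in sorted(intervals):
        spans = by_chrom[chrom]
        if spans and start <= spans[-1][1]:
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    return by_chrom


def covered_fraction(targets: List[Interval], probes: List[Interval]) -> float:
    """Fraction of target bases overlapped by at least one probe."""
    probe_spans = merge_by_chrom(probes)
    target_spans = merge_by_chrom(targets)
    total = 0
    covered = 0
    for chrom, spans in target_spans.items():
        for t_start, t_end in spans:
            total += t_end - t_start
            # Merged probe spans do not overlap, so overlaps can be summed.
            for p_start, p_end in probe_spans.get(chrom, []):
                covered += max(0, min(t_end, p_end) - max(t_start, p_start))
    return covered / total if total else 0.0


# Hypothetical targets and probe alignments, for illustration only.
targets = [("chr1", 0, 100), ("chr2", 0, 100)]
probes = [("chr1", 0, 60), ("chr1", 40, 90), ("chr2", 50, 100)]
print(f"{covered_fraction(targets, probes):.4%}")
```

The nested loop is quadratic per chromosome; for a design of tens of thousands of probes, a sweep over sorted intervals or an interval tree would be the more practical choice.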

Whole genome library construction and multiplex custom gene capture

Genomic DNA libraries, from which gene regions of interest were captured, were constructed according to British Columbia Cancer Agency Genome Sciences Centre (BCGSC) plate-based and paired-end library protocols on a Biomek FX liquid handling robot (Beckman-Coulter, USA). Briefly, 1 μg of high molecular weight genomic DNA was sonicated (Covaris E210) in a 60 μL volume to 200-300 bp. Sonicated DNA was purified with magnetic beads (Agencourt, Ampure). The DNA fragments were end-repaired, phosphorylated and bead purified in preparation for A-tailing. Illumina sequencing adapters were ligated overnight at 20 °C, and adapter-ligated products were bead purified and enriched with 4 cycles of PCR using primers containing a hexamer index that enables library pooling. 94 ng from each of 19 to 24 different libraries were pooled prior to custom capture using Agilent SureSelect XT Custom 0.5-2.9Mb probes. The pooled libraries were hybridized to the RNA probes at 65 °C for 24 hours. Following hybridization, streptavidin-coated magnetic beads (Dynal, MyOne) were used for custom capture. Post-capture material was purified on MinElute columns (Qiagen), followed by post-capture enrichment with 10 cycles of PCR using primers that maintain the library-specific indices. Each pool was sequenced with paired-end 100-base reads in a single lane of an Illumina HiSeq 2500 instrument.
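As a quick worked example of the pooling step, the short Python sketch below computes the total DNA mass entering capture for the smallest and largest pools. The per-library mass and pool sizes come from the protocol above; the function name is illustrative only.

```python
NG_PER_LIBRARY = 94.0  # ng contributed by each indexed library (from the protocol)


def pooled_mass_ng(n_libraries: int, ng_per_library: float = NG_PER_LIBRARY) -> float:
    """Total DNA mass (ng) in a pool of equally weighted libraries."""
    return n_libraries * ng_per_library


for n in (19, 24):
    # 19 libraries -> 1786 ng; 24 libraries -> 2256 ng
    print(f"{n} libraries: {pooled_mass_ng(n):.0f} ng")
```

Under these figures, each capture reaction receives roughly 1.8 to 2.3 μg of pooled library.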

Targeted Capture Sequencing

*Protocols were performed at British Columbia Cancer Agency and Ann & Robert H. Lurie Children's Hospital.

Sample cohort selection

Some sequence mutations identified in the poor outcome discovery cohort, along with some previously published variants, were further analyzed in more than 550 additional cases (tumor samples only). The TARGET KT project team used targeted capture sequencing to assess the presence and frequency of alterations in 400 gene variants. This validation effort was performed in an unbiased cohort randomly selected from patients enrolled on a single COG protocol, which allowed the frequency of these changes to be determined across a broader spectrum of WT subtypes.

Probe design for custom capture validation sequencing

Probes were designed using Agilent's web-based SureDesign tool (https://earray.chem.agilent.com/suredesign/). The targets were provided as 420 HUGO gene symbols along with 560 RefSeq IDs. RefSeq IDs were used to avoid ambiguity about which isoform of a gene was targeted. In addition to these 560 RefSeq IDs, 15 genomic positions were also included in the target region.

During the design phase, the regions for each RefSeq ID were limited to coding exons and UTRs (both 5' and 3'). Each region was padded with an additional 10 bases on both the 5' and 3' ends. This resulted in an overall target space of 2.376 Mbp (megabase pairs). Probe density was specified at 2x, with moderately stringent repeat masking and balanced boosting options selected.

According to the Agilent SureDesign report, 43,137 probes were designed for a total size of 2.785 Mbp (this total counts overlapping bases in adjacent probes separately). Probes covered 98.7144% of the 2.376 Mbp target region. Custom Java code was developed at the British Columbia Cancer Agency Genome Sciences Centre (BCGSC) to independently verify the accuracy of the probe design. This code incorporated a BLAT alignment of each probe against the reference genome (H. sapiens, hg19, GRCh37, February 2009) to confirm that the target region was covered. Once the probe design was verified, Agilent SureSelect XT Custom 0.5-2.9Mb probes were ordered.
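To put the reported coverage figure in absolute terms, the Python sketch below converts the percentage into approximate base counts. The inputs are the rounded values quoted in the design report, so the results are estimates rather than exact counts.

```python
TARGET_BP = 2_376_000   # 2.376 Mbp target space, as reported (rounded)
COVERED_PCT = 98.7144   # percent of target covered by a probe, as reported

covered_bp = round(TARGET_BP * COVERED_PCT / 100)
uncovered_bp = TARGET_BP - covered_bp

print(f"covered:   ~{covered_bp:,} bp")    # ~2,345,454 bp
print(f"uncovered: ~{uncovered_bp:,} bp")  # ~30,546 bp
```

On these numbers, roughly 30.5 kbp of the target region has no overlapping probe, which is consistent with repeat masking having been applied during design.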

Whole genome library construction and multiplex custom gene capture

Genomic DNA libraries, from which gene regions of interest were captured, were constructed according to British Columbia Cancer Agency Genome Sciences Centre (BCGSC) plate-based and paired-end library protocols on a Biomek FX liquid handling robot (Beckman-Coulter, USA). Briefly, 1 μg of high molecular weight genomic DNA was sonicated (Covaris E210) in a 60 μL volume to 200-300 bp. Sonicated DNA was purified with magnetic beads (Agencourt, Ampure). The DNA fragments were end-repaired, phosphorylated and bead purified in preparation for A-tailing. Illumina sequencing adapters were ligated overnight at 20 °C, and adapter-ligated products were bead purified and enriched with 4 cycles of PCR using primers containing a hexamer index that enables library pooling. 94 ng from each of 19 to 24 different libraries were pooled prior to custom capture using Agilent SureSelect XT Custom 0.5-2.9Mb probes. The pooled libraries were hybridized to the RNA probes at 65 °C for 24 hours. Following hybridization, streptavidin-coated magnetic beads (Dynal, MyOne) were used for custom capture. Post-capture material was purified on MinElute columns (Qiagen), followed by post-capture enrichment with 10 cycles of PCR using primers that maintain the library-specific indices. Each pool was sequenced with paired-end 100-base reads in a single lane of an Illumina HiSeq 2500 instrument.
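Because each library in a pool carries its own hexamer index, reads from a shared lane can be assigned back to their source library after sequencing. The Python sketch below shows the idea using exact index matching. The index sequences, library names, and reads are hypothetical, and exact matching is an assumption; the demultiplexing method actually used is not described in this protocol.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def demultiplex(
    reads: Iterable[Tuple[str, str]],
    index_to_library: Dict[str, str],
) -> Dict[str, List[str]]:
    """Group read sequences by library using an exact match on the 6-base index."""
    bins: Dict[str, List[str]] = defaultdict(list)
    for index_seq, read_seq in reads:
        # Reads whose index is not in the sheet go to an "unassigned" bin.
        library = index_to_library.get(index_seq, "unassigned")
        bins[library].append(read_seq)
    return dict(bins)


# Hypothetical index sheet and (index, read) pairs, for illustration only.
index_to_library = {"ACGTAC": "lib01", "TGCATG": "lib02"}
reads = [
    ("ACGTAC", "GATTACA"),
    ("TGCATG", "CCGGTTA"),
    ("ACGTAC", "TTAGGC"),
    ("NNNNNN", "AAAAAA"),
]
bins = demultiplex(reads, index_to_library)
for library, seqs in sorted(bins.items()):
    print(library, seqs)

# A 6-base index allows 4**6 = 4096 distinct sequences, far more than the
# 19 to 24 libraries per pool, which leaves room to choose well-separated indices.
print(4 ** 6)
```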

Targeted Sanger Resequencing

*Protocols were performed at British Columbia Cancer Agency. Please refer to Roberts et al., "Genetic alterations activating kinase and cytokine receptor signaling in high-risk acute lymphoblastic leukemia," Cancer Cell. 2012 Aug 14; 22(2): 153–166.

mRNA-seq and whole genome sequencing

mRNA-seq was performed using a method similar to one previously described. For WGS, Illumina paired-end whole genome shotgun libraries were prepared from 1 μg of genomic DNA as previously described. Sequencing was performed on the Illumina Genome Analyzer GAIIx or HiSeq 2000 platforms. Methods for library preparation, sequencing and detection of rearrangements, DNA copy number alterations and sequence variations are provided in the Supplemental Experimental Procedures of Roberts et al.

RT-PCR, genomic mapping and sequencing

Putative rearrangements identified by mRNA-seq were validated by RT-PCR and Sanger sequencing. Leukemic cell RNA was reverse-transcribed using Superscript III (Life Technologies), and fusion products were amplified with Phusion HF polymerase (New England Biolabs). Genomic mapping of the EBF1-PDGFRB and BCR-JAK2 rearrangement breakpoints was performed using whole-genome-amplified (Qiagen, Germany) leukemic cell DNA.
