Skip to main content

Breadcrumb

  1. Home
  2. Programs
  3. Cancer Genome Characterization Initiative
  4. cgci resources
  5. Experimental Methods for HIV+ Tumor Molecular Characterization Project

Experimental Methods for HIV+ Tumor Molecular Characterization Project

On this page researchers can find detailed information describing how CGCI data was generated by genomic platform, including protocols for establishing high-quality nucleic acid samples.

CGCI Sample Naming 

CGCI samples are named using a coding system specific to OCG characterization programs. Please use the OCG Sample Codes document to properly discern the metadata reiterated within the sample name. OCG Sample Codes_finalized 05 2017.pdf

Nucleic Acid Sample Processing

CGCI project teams use high-quality RNA and DNA from case-matched tumor and normal tissues to generate comprehensive genomics data. See the Sample Preparation Protocols below for details.

*Cick on the project symbol under each method to open up protocols.*

Column One Sample Preparation Protocols
DNA/RNA Extraction with AllPrep (DNA) and mirVana (total RNA with small RNA) Isolation Kits, DNA extraction with QIAmp DNA Blood Mini Kit CaCx
Sample Preparation Protocols
Whole Genome Sequencing
Data Generation Protocols Data Analysis Protocols
CaCx CaCx

Data Generation Protocols

The data generation protocols for the HTMCP- Cervical Cancer project were acquired from the following manuscript.

Gagliardi A, Porter VL, Zong Z, et al. Analysis of Ugandan cervical carcinomas identifies human papillomavirus clade-specific epigenome and transcriptome landscapes. Nat Genet. 2020;52(8):800-810. (PMID: 32747824)

Whole genome sequencing library construction

One microgram of DNA was arrayed in each well of a 96-well plate and sheared by sonication (Covaris). Sheared DNA was end-repaired and size selected using AMPure XP beads targeting a 300-400bp fraction. After 3’ A-tailing, full length TruSeq adapters were ligated. Libraries were purified using AMPure XP beads. Library fragment sizes were assessed using an aliquot of PCR amplified library DNA on an Agilent 2100 Bioanalyzer DNA1000 chip, or Caliper GX DNA1000 chip.

Genome library construction for custom capture

DNA (500 ng) was sonicated (Covaris) to 250-350 bp, purified using PCRClean DX magnetic beads (Aline Biosciences), end-repaired, phosphorylated and bead purified before A-tailing using a custom NEB Paired-End Sample Prep Premix Kit. Illumina sequencing adapters were ligated overnight at 16oC, bead purified and enriched with 6 cycles of PCR using indexed primers enabling library pooling and sequenced using paired-end 125 base reads in a single flowcell lane.

Whole genome sequencing

Tumor genomes were sequenced to a target depth of 80X coverage, and normal blood samples to 40X coverage using 125bp reads. Sequencing was performed on an Illumina HiSeq2500.

Significantly mutated genes (SMGs)

MutSig2CV (https://software.broadinstitute.org/cancer/cga/mutsig)was used to identify significantly mutated genes (SMGs) using default parameters and a genome-wide coverage file. Genes meeting a threshold of q<0.1 and passing manual review in IGV for 3 samples with a mutation were considered to be SMGs.

Experimental protocols

To request more information or approval regarding the following protocol, please contact BC Cancer at labqa@bcgsc.ca

96-well PCR Free Library Construction on NIMBUS for Illumina Sequencing

Data Analysis Protocols

The data analysis protocols for the HTMCP- Cervical Cancer project were acquired from the following manuscript.

Gagliardi A, Porter VL, Zong Z, et al. Analysis of Ugandan cervical carcinomas identifies human papillomavirus clade-specific epigenome and transcriptome landscapes. Nat Genet. 2020;52(8):800-810. (PMID: 32747824)

Estimation of tumor content  

Tumor purity and ploidy were estimated using Ploidetect (L.C., in preparation). Tumor reads and heterozygous SNP allele frequencies in non-overlapping bins (~100 kb and equal coverage in the matched normal samples) were computed for each case. Read counts were modelled using Gaussian mixture models (GMM), modified to restrict component means as a fixed depth apart and component variances to be equal to one another. Allele frequencies were modelled using a separate GMM, incorporating priors from the first. Models were generated for each possible value of tumor purity and scored based on the mean likelihood of both the depth and the allele frequency GMMs. All results were verified by review of GMM parameters and their fit to the data. Estimates were congruent with observed copy number data in 104 out of 118 samples. In the remaining cases, purity and ploidy were determined by review of alternate models.

Somatic alteration detection

Tumor and normal sequencing reads were aligned to the human reference genome (hg19) using BWA-MEM v0.7.). Read duplicates were marked using sambamba(v0.5.5). Somatic single nucleotide variants (SNVs) were identified using Strelka (v1.0.6). A panel of 2,735 genes including mutated oncogenes, tumor suppressors, epigenetic modifiers, splicing factors, or other genes recurrently mutated (≥3 cases) in this cohort, were selected for targeting sequencing in the extended cohort. The coding mutation rate was reported for each tumor as the number of coding SNVs (low, moderate or high SNPeff annotation) per Mb.

Mutation signatures and HRD score

SNVs were categorized into 96 mutation classes based on 6 variant types and 16 trinucleotide contexts. For each sample, values of the 96 classes were used to compute a non-negative least squares deconvolution based on 30 previously described mutational signatures (COSMIC). The APOBEC signature reported for each sample is the max exposure value of signature 2 or 13.

HRD scores were computed using the R package HRDtools (v0.0.0.9), as the arithmetic sum of loss of heterozygosity (LOH), TAI, and LST scores, determined based on published guidelines.

Copy ratio landscape comparisons

Copy number alterations (CNAs) between cohorts were called and analysed using (v4.0.9, https://gatkforums.broadinstitute.org/gatk/discussion/9143/how-to-call-somatic-copy-number-variants-using-gatk4-cnv) and GISTIC2.0 (v2.0.17). Genomic intervals were prepared by dividing the reference genome into equally sized bins (1000bp). A panel of normals was generated to median sample reference counts. Allele counts were collected independently for the tumor and matching normal. Continuous segments were modelled with both the allelic ratios and copy ratios.

Germline CNAs previously identified in the TCGA cervical cancer (CESC) study were filtered out to remove any potential germline CNVs in this cohort. Segments were excluded if 75% or more of the segment overlapped with these.

Somatic copy number alterations in TCGA CESC tumors were determined previously using SNP 6.0 arrays11 and these were downloaded from the Broad GDAC website. The 178 samples in TCGA CESC core set were used for comparison to our HIV- samples.

To determine regions of CNA variance between cohorts (HIV- vs. HIV+, HIV- vs. TCGA), analyses were performed on each cohort separately according to the GISTIC2 (v2.0.22) documentation (http://software.broadinstitute.org/cancer/software/genepattern/modules/docs/GISTIC_2.0) with the parameters -qvt 0.25, -genegistic 1, -broad 1, -brlen 0.5, -conf 0.99, -armpeel 1, -savegene 1, -gcm extreme, -maxseg 3000. The genome was binned into 1kb segments and the fraction of patients having a copy gain (>0.1) or loss (<-0.1) was calculated based on the mean segment values for each cohort. Significantly amplified and deleted chromosome arms were identified using an FDR threshold < 0.25. Unique or shared arms and cytobands were identified as those significant in one cohort, and not the other.

Specific copy number alterations in samples

Regions of CNA in individual samples were identified using the Hidden Markov model-based approach CNAseq (v0.0.6). Amplifications were defined as those with predicted total copies greater than twice the estimated tumor ploidy, and homozygous deletions were defined as those predicted to have loss of all copies.

Non-coding mutation hotspots

Non-coding variants annotated by SNPeff as 5' Flank, 3' Flank, IGR, 3' UTR, Intron, 5'UTR, RNA, Splice_Site, Translation_Start_Site were used as input to Rainstorm , with all parameters set at default values (k=4).

Of the 3,094 hotspot regions identified, we focused on 3,539 potential point mutation hotspots, present in 3 or more samples.  These were filtered for those called by both Strelka and MutationSeq and did not reside in centromeric regions. Further filtering removed any variant called in a normal sample, reducing the potential non-coding hotspots to 404, of which 7 (high confidence) were confirmed by manual review.

Hotspots were annotated as ‘potential promoter’, ‘potential enhancer’, or ‘intergenic’ using ChIP-seq data (enhancer= intersect of H3K4me1 and H3K27ac peaks, promoter= H3K4me3).

We assessed the potential for other non-coding hotspots to alter transcription factor binding using motifBreakR26.

DNA methylation analysis

Human DNA methylation using the EPIC array (Illumina) was performed by The Centre for Applied Genomics, The Hospital for Sick Children, Toronto, Canada. The DNA methylation beta-values for 115 samples were binarized as unmethylated (beta<=0.25) and methylated (beta>0.25). The 8,000 most variable probes were clustered using the ConsensusClusterPlus (v1.38.0, R) with a ‘binary’ distance and ‘ward.D2’ clustering method with 1,000 iterations.

Differentially methylated probes (DMP) and differentially methylated regions (DMR) between clade A7- and A9-infected samples were determined using CHAMP (v2.10.2, R) (q<0.05); DMRs used the ‘bumphunter’ method. For the DMPs, associated gene, genomic feature, and CpG island features of these probes came from CHAMP. DMRs were intersected with protein-coding genes (hg19 Ensembl (v75), n=20,232) using bedtools (v2.27.1).

Human and viral gene expression and gene ontology enrichment analyses

Clustering analysis was performed with ConsenusClusterPlus (v1.38.0, R) using log10(RPKM) values using the ‘Pearson’ method and ‘ward.D2’ linkage with 1000 iterations. Human genes used included the top 1,000 most variable genes (RPKM > 5 in at least one patient). All 118 samples were included in human gene clustering and 117 in viral gene clustering (no gff file was available for HPV51).

Differential gene expression between groups (A7 vs. A9; E6 and E7 high vs. low) was performed using the DESeq2 (v1.14.1, R). Genes were filtered using an adjusted p-value <0.05, >1.5-fold change in mean expression, and a baseMean expression >1000. For the A7 vs. A9 comparison, the differential analysis was normalized for histology using a multifactorial approach. Results from the normalized analysis were compared to those using only squamous A7 and A9 samples to ensure histology correction was only removing expression differences attributed to histologies (89% concordance).

Functional enrichment of the significantly differentially expressed genes in the A7 vs. A9 comparison was performed using STRING (v11.0). For visualisation, enrichment scores for A7-enriched ontologies were set to negative values. Functional enrichment of the significantly differentially expressed upregulated and downregulated gene lists for the E6 and E7 analysis was performed using HOMER (v4.10.3).

ERV quantification

5,467,457 repeat elements and hg19 coordinates (chromosomes 1-22, X) were downloaded from RepeatMasker Open v.4.0.5 (http://www.repeatmasker.org/faq.htm). To minimize read count bias from nearby expressed protein-coding genes, we filtered for ERVs >10 kb away from their nearest gene. Raw expression values were calculated by counting the number of reads that mapped unambiguously (mates mapped within 10 kb) to each region and were normalized for sequencing depth and length by conversion to RPKMs.

Transcriptome Sequencing
Data Generation Protocols Data Analysis Protocols
CaCx CaCx

Data Generation Protocols

The data generation protocols for the HTMCP-Cervical Cancer project were acquired from the following manuscript.

Gagliardi A, Porter VL, Zong Z, et al. Analysis of Ugandan cervical carcinomas identifies human papillomavirus clade-specific epigenome and transcriptome landscapes. Nat Genet. 2020;52(8):800-810. (PMID: 32747824)

PolyA RNA library construction

Polyadenylated (PolyA+) messenger RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 3 µg total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. Eluted PolyA+ RNA was ethanol precipitated and resuspended in 10 µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

First-strand cDNA was synthesized from the purified messenger RNA using the Maxima H Minus First Strand cDNA Synthesis kit (Thermo-Fisher, USA) and random hexamer primers with 0.4 µg/µL Actinomycin D, followed by PCR Clean DX bead purification on a Microlab NIMBUS robot (Hamilton Robotics, USA). The second strand cDNA was synthesized following the NEBNext Ultra Directional Second Strand cDNA Synthesis protocol (NEB) that incorporates dUTP in the dNTP mix, allowing the second strand to be digested using USERTM enzyme (NEB) in the post-adapter ligation reaction and thus achieving strand specificity.

cDNA was fragmented by sonication (Covaris) to achieve 200-250 bp fragment lengths. Sheared cDNA was then subject to end-repair and phosphorylation in a single reaction using enzyme premix (NEB) containing T4 DNA polymerase, Klenow DNA Polymerase and T4 polynucleotide kinase. Repaired cDNA was purified and 3’ A-tailed (adenylation) using Klenow fragment (3’ to 5’ exo minus). Illumina PE adapters were ligated. The adapter-ligated products were purified using PCR Clean DX beads, then digested with USERTM enzyme (1 U/µL, NEB) followed immediately by 13 cycles of indexed PCR using Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) and Illumina’s PE primer set. PCR products were purified and size-selected using a 1:1 PCR Clean DX beads-to-sample ratio (twice), and the eluted DNA quality was quantified prior to library pooling and size-corrected final molar concentration calculation.

Transcriptome sequencing

Transcriptomes were sequenced using 75bp paired-end reads on Illumina HiSeq2500.

Experimental protocols

To request more information or approval regarding the following protocols, please contact BC Cancer at labqa@bcgsc.ca

96-well Plate-based Strand-specific cDNA Synthesis using Maxima H Minus on Hamilton NIMBUS

Nimbus-assisted 96-well PCR-enriched Library Construction for Illumina Sequencing

Magnetic bead-based mRNA isolation 

miRNA3 - Plate Format miRNA Library Construction

Data Analysis Protocol

The data analysis protocols for the HTMCP-Cervical Cancer project were acquired from the following manuscript.

Gagliardi A, Porter VL, Zong Z, et al. Analysis of Ugandan cervical carcinomas identifies human papillomavirus clade-specific epigenome and transcriptome landscapes. Nat Genet. 2020;52(8):800-810. (PMID: 32747824)

Expression profiling

RNA-Seq reads were aligned to the human reference genome (hg19) with BWA-MEM (v0.7.6a). Reads aligned to exon junctions were repositioned in the genome as large-gapped alignments using in-house software (JAGuaR v1.7.5). Unambiguously aligned, filtered reads were used to calculate coverage over the total collapsed exonic regions in each gene as annotated in Ensembl v69, and RPKM values were calculated for each gene. Structural variants in RNA Seq data were identified using the assembly-based tools ABySS v1.3.4 and TransABySS v1.4.10 and alignment-based tool DeFuse (v0.6.2).

Gene expression and gene ontology enrichment analyses

Clustering analysis was performed with ConsenusClusterPlus (v1.38.0, R) using log10(RPKM) values using the ‘Pearson’ method and ‘ward.D2’ linkage with 1000 iterations. Human genes used included the top 1,000 most variable genes (RPKM > 5 in at least one patient). All 118 samples were included in human gene clustering and 117 in viral gene clustering (no gff file was available for HPV51).

Differential gene expression between groups (A7 vs. A9; E6 and E7 high vs. low) was performed using the DESeq2 (v1.14.1, R). Genes were filtered using an adjusted p-value <0.05, >1.5-fold change in mean expression, and a baseMean expression >1000. For the A7 vs. A9 comparison, the differential analysis was normalized for histology using a multifactorial approach. Results from the normalized analysis were compared to those using only squamous A7 and A9 samples to ensure histology correction was only removing expression differences attributed to histologies (89% concordance).

Functional enrichment of the significantly differentially expressed genes in the A7 vs. A9 comparison was performed using STRING (v11.0). For visualisation, enrichment scores for A7-enriched ontologies were set to negative values. Functional enrichment of the significantly differentially expressed upregulated and downregulated gene lists for the E6 and E7 analysis was performed using HOMER (v4.10.3).

Estimation of immune cell content

To quantify expression signatures of leukocytes in our samples, we ran CIBERSORT (v1.0.4) on the expression RPKMs using -p 500, -q and -a options and the LM22 signature matrix provided. Total CD4+ T-cell content is the sum of the scores from the CD4+ T cells in CIBERSORT; naive, memory resting, memory activated, follicular helper and regulatory T cells.

ChIP-Sequencing
Data Generation Protocols Data Analysis Protocols
CaCx CaCx

Data Generation Protocols

The data generation protocols for the HTMCP-Cervical Cancer project were acquired from the following manuscript.

Gagliardi A, Porter VL, Zong Z, et al. Analysis of Ugandan cervical carcinomas identifies human papillomavirus clade-specific epigenome and transcriptome landscapes. Nat Genet. 2020;52(8):800-810. (PMID: 32747824)

Native chromatin immunoprecipitation (ChIP) sequencing

Fifty-two tumor samples were lysed in 0.1% Triton X-100, 0.1% Deoxycholate buffer plus protease inhibitors (PI).  Extracted chromatin was digested with micrococcal nuclease (MNase) enzyme (NEB) and the reaction quenched using 250 µM of EDTA. 1% Triton X-100 and 1% Deoxycholate were mixed and added to the samples on ice. 4% of digested chromatin was used as input control, the remaining was pre-cleared with Protein A/G Dynabeads (Invitrogen) in IP buffer (20 mM Tris-HCl [pH7.5], 2 mM EDTA, 150 mM NaCl, 0.1% Triton X-100, 0.1% Deoxycholate, PI) at 4oC for 1.5 hours. Supernatants were transferred to a 96-well plate containing the antibody-bead complex, and incubated overnight at 4oC with agitation. Immunoprecipitated samples were washed twice with low salt buffer (20 mM Tris-HCl [pH 8.0], 0.1% SDS, 1% Triton X-100, 2 mM EDTA, 150 mM NaCl) and twice with high salt buffer (same, but with 500 mM NaCl). DNA-antibody complexes were eluted in Elution Buffer (100 mM NaHCO3, 1% SDS), at 65°C for 1.5 hours with mixing (1350 rpm). Qiagen Protease was used to digest protein in the eluted DNA at 50°C for 30 minutes with mixing (600 rpm). ChIP DNA was purified using Sera-Mag beads (Fisher Scientific) with 30% PEG before library construction as described for custom capture.

Amplified libraries were purified as described above (ALINE Biosciences) and the DNA quality and quantity determined using Caliper LabChip GX DNA High Sensitivity assay (PerkinElmer) and the Quant-iT dsDNA high sensitivity assay (ThermoFisher Scientific).

Experimental protocols

To request more information or approval regarding the following protocols, please contact BC Cancer at labqa@bcgsc.ca

qPCR of Native ChIP Libraries

Native ChIP Using 100,000 Cells

Data Analysis Protocol

The data analysis protocols for the HTMCP- Cervical Cancer project were acquired from the following manuscript.

Gagliardi A, Porter VL, Zong Z, et al. Analysis of Ugandan cervical carcinomas identifies human papillomavirus clade-specific epigenome and transcriptome landscapes. Nat Genet. 2020;52(8):800-810. (PMID: 32747824)

ChIP-Seq alignment and peak calling

ChIP sequence reads (75nt) were aligned to the human reference genome (hg19) with BWA-MEM (v0.7.6a, parameters: -M). Read duplicates were marked using sambamba (v0.5.5). Forty-seven samples had all 6 histone marks (4 broad: H3K4me1, H3K9me3, H3K27me3, H3K36me3 and 2 narrow: H3K4me3 and H3K27ac), and 5 had a subset of these.

Peaks were called using MACS2 (v2.1.1) with default parameters, comparing each mark to its control. Bedgraph output files were converted to the library size-normalized bigWig format for manual inspection using the UCSC and IGV genome browsers.

ChIP-seq data quality was assessed using Encode guidelines. Samples had a minimum of 50 million sequenced reads for narrow marks and 100 million for broad marks. The percentage of uniquely mapped reads was above 70%, and the percentage of duplicated reads varied between 1 and 10%. The non-redundant fraction, fraction of reads in peaks (FRIP) and sequencing saturation using preseq v2.0.2 (https://github.com/smithlabcode/preseq) were also assessed.

ChIP clustering analyses

The union of peaks for each histone mark was found by concatenating peak files and merging overlapping regions using bedtools v2.27.1. The normalized coverage of each sample in the peak union was counted using deeptools (v3.0.2). For each mark, the top 1% most variable peaks were clustered using the ConsensusClusterPlus (v1.38.0, R) using the ‘pearson’ distance and ‘complete’ clustering method with 1000 iterations for k=2-10 clusters. The 54 consensus clustering solutions (6 marks x 9 solutions) were then analysed using a Cluster of Clusters Analysis (COCA). For active marks, pairwise probabilities were generated for 27 solutions (3 marks x 9 solutions). For marks in which some samples had missing data, pairwise comparisons were normalized to exclude samples in those marks. Matrices of probabilities (54x52, 27x52) were clustered using pheatmap (v1.0.10, R) with the ‘pearson’ distance and ‘complete’ clustering method.

H3K4me3, H3K27ac and H3K4me1 peaks differentially present between HPV clades (A7 vs. A9) were determined using DiffBind with DESeq2 method (v2.2.12, R, FDR<0.01, fold-change>2). Coverage at peaks was counted 500bp around the centre of the peak, and a multifactorial experimental design was performed to normalize histology differences (referred to as blocking factor). Significantly differential peaks were intersected using bedtools (v2.27.1). Associated genes were identified using the nearest transcription start site (TSS) to the differential H3K4me1 and intersected H3K4me3/H3K27ac regions, identified using bedtools (v2.27.1) to RefSeq’s hg19 annotation.

HPV integration events and ChIP

HPV integration sites were determined using chimeric reads mapping to both human and HPV genomes. Within each sample, integration sites were merged into a single integration event (n=257) if they were <500 kb apart. HPV integration hotspots were determined by counting the number of events that fell within a 500 kb bin across the genome.

ChIP-seq alterations at HPV events were clustered using the log2(fold-change) of normalized coverage (RPM) of the integrated sample versus the mean RPM of the unintegrated samples using the ‘pheatmap’ (v1.0.10, R) with a ‘ward.D2’ clustering method. Events <20 kb were extended to 20 kb to obtain adequate coverage of the region.

For each mark per event (6 modifications in 99 events), a control peakset was made by randomly selecting 1,000 peak regions of the same mark on the same chromosome as the event, and extending the peaks from the center to the same size as the event. Normalized ChIP-seq coverage of the histone modification at these 1,000 random peaks was counted in the 52 samples, and the log2(fold change) of coverage was calculated for the integrated sample. A p-value was calculated for the fold change of the integration event based on the distribution of fold changes in these control peaks. Benjamini-Hochberg adjusted p-values<0.05 was regarded as significant.

Targeted Sequencing
Data Generation Protocols Data Analysis Protocols
CaCx CaCx

Data Generation Protocols

The data generation protocols for the HTMCP-Cervical Cancer project were acquired from the following manuscript.

Gagliardi A, Porter VL, Zong Z, et al. Analysis of Ugandan cervical carcinomas identifies human papillomavirus clade-specific epigenome and transcriptome landscapes. Nat Genet. 2020;52(8):800-810. (PMID: 32747824)

Genome library construction for custom capture

DNA (500 ng) was sonicated (Covaris) to 250-350 bp, purified using PCRClean DX magnetic beads (Aline Biosciences), end-repaired, phosphorylated and bead purified before A-tailing using a custom NEB Paired-End Sample Prep Premix Kit. Illumina sequencing adapters were ligated overnight at 16oC, bead purified and enriched with 6 cycles of PCR using indexed primers enabling library pooling and sequenced using paired-end 125 base reads in a single flowcell lane.

Custom capture validation of SNVs

Out of the cohort of 212 cervical cancer patients, 118 comprised the discovery cohort and 89 comprised the extension cohort. From the 89 extension cohort, targeted sequencing was performed in two phases. The first phase was to capture 2,735 selected genes, and the second phase was to capture the KMT2D gene and non-coding hotspots.

DNA libraries from the 89 patients were pooled prior to hybridization capture of 2,735 target genes using SureSelect XT custom probes (Agilent) and RNA probes at 65°C for 24 hours. Streptavidin-coated magnetic beads (Dynal, MyOne) were used for custom capture, followed by purification on MinElute columns (Qiagen) and enrichment with 10 PCR cycles using primers that maintain library-specific indices. Pooled libraries were sequenced generating 125bp paired-end reads. To capture the KMT2D gene and non-coding hotspots, 544 120bp xGen Lockdown probes were designed and synthesized (Integrated DNA Technologies) for targeted capture sequencing.

Experimental protocols

To request more information or approval regarding the following protocols, please contact BC Cancer at labqa@bcgsc.ca

Hybridization Capture of Illumina Libraries using xGEN Lockdown Probes

Multiplex Agilent Whole Exome or Target Capture

Data Generation Protocols

The data generation protocols for the HTMCP-Cervical Cancer project were acquired from the following manuscript.

Gagliardi A, Porter VL, Zong Z, et al. Analysis of Ugandan cervical carcinomas identifies human papillomavirus clade-specific epigenome and transcriptome landscapes. Nat Genet. 2020;52(8):800-810. (PMID: 32747824)

Somatic alteration detection

Tumor and normal sequencing reads were aligned to the human reference genome (hg19) using BWA-MEM v0.7.6a). Read duplicates were marked using sambamba (v0.5.5). Somatic single nucleotide variants (SNVs) were identified using Strelka (v1.0.6). A panel of 2,735 genes including mutated oncogenes, tumor suppressors, epigenetic modifiers, splicing factors, or other genes recurrently mutated (≥3 cases) in this cohort, were selected for targeting sequencing in the extended cohort. The coding mutation rate was reported for each tumor as the number of coding SNVs (low, moderate or high SNPeff annotation) per Mb.

Mutation signatures and HRD score

SNVs were categorized into 96 mutation classes based on 6 variant types and 16 trinucleotide contexts. For each sample, values of the 96 classes were used to compute a non-negative least squares deconvolution based on 30 previously described mutational signatures (COSMIC). The APOBEC signature reported for each sample is the max exposure value of signature 2 or 13.

HRD scores were computed using the R package HRDtools (v0.0.0.9), as the arithmetic sum of loss of heterozygosity (LOH), TAI, and LST scores, determined based on published guidelines.

HPV Typing and Integration
Data Generation Protocol Data Analysis Protocols
CaCx CaCx

Data Generation Protocols

The data generation protocols for the HTMCP-Cervical Cancer project were acquired from the following manuscript.

Gagliardi A, Porter VL, Zong Z, et al. Analysis of Ugandan cervical carcinomas identifies human papillomavirus clade-specific epigenome and transcriptome landscapes. Nat Genet. 2020;52(8):800-810. (PMID: 32747824)

HPV typing

Microbial detection, HPV typing and HPV integration detection were performed using BioBloom tools (BBT, v2.0.11b). Where 2 or more HPV types were integrated (n=3), the dominant type was determined by E6/E7 expression. Where no integration was found (n=9), the dominant HPV was determined by the type with the most read evidence.

Data Analysis Protocols

The data analysis protocols for the HTMCP-Cervical Cancer project were acquired from the following manuscript.

Gagliardi A, Porter VL, Zong Z, et al. Analysis of Ugandan cervical carcinomas identifies human papillomavirus clade-specific epigenome and transcriptome landscapes. Nat Genet. 2020;52(8):800-810. (PMID: 32747824)

HPV expression

To determine expression of HPV genes, fasta genome references and gff annotation files were downloaded from NCBI for 16 HPV strains. HPV51 did not have a gff file so one infected sample was excluded (samples with HPV expression n=117). Samples were aligned to their HPV strain using BWA-mem v0.7.6a Sambamba. The fraction of reads with sequencing quality greater than Q10 within gene boundaries were counted and normalized to reads per kilobase of exons per million reads mapped to HPV (RPKM).

Clustering analysis was performed on E1, E2, E6 and E7 log10(RPKM) with ConsenusClusterPlus R package (v1.38.0) using the ‘pearson’ method and ‘ward.D2’ linkage, and the sample scaled expression was visualized using the ‘pheatmap’ R package (v1.0.10). E4 and E5 z-scores were added after clustering as these genes are not present in every HPV type. Differential gene expression between clusters was done by running DESeq2 R package (v1.14.1) to compare each cluster to the two other clusters, and genes were filtered using an adjusted p-value <0.05, >1.5-fold change in mean expression, and a baseMean expression >1000. Functional enrichment of the significantly differentially expressed genes was performed using STRING (v11.0).

HPV integration events and ChIP

HPV integration sites were determined using chimeric reads mapping to both human and HPV genomes. Within each sample, integration sites were merged into a single integration event (n=257) if they were <500 kb apart. HPV integration hotspots were determined by counting the number of events that fell within a 500 kb bin across the genome.

ChIP-seq alterations at HPV events were clustered using the log2(fold-change) of normalized coverage (RPM) of the integrated sample versus the mean RPM of the unintegrated samples using the ‘pheatmap’ (v1.0.10, R) with a ‘ward.D2’ clustering method. Events <20 kb were extended to 20 kb to obtain adequate coverage of the region.

For each mark per event (6 modifications in 99 events), a control peakset was made by randomly selecting 1,000 peak regions of the same mark on the same chromosome as the event, and extending the peaks from the center to the same size as the event. Normalized ChIP-seq coverage of the histone modification at these 1,000 random peaks was counted in the 52 samples, and the log2(fold change) of coverage was calculated for the integrated sample. A p-value was calculated for the fold change of the integration event based on the distribution of fold changes in these control peaks. Benjamini-Hochberg adjusted p-values<0.05 was regarded as significant.

HPV integration events and expression

For each integration event (n=257), we identified all protein-coding genes (hg19 Ensembl (v75), n=20,232) that fell within the event +/-10 kb, which revealed 255 genes near integration events. Fold changes of integrated samples were calculated based on the mean expression of all samples lacking events, and p-values were derived from the distribution of expression of the gene across all samples. Oncogenes were identified by OncoKB. The same method was applied to identify ERVs upregulated at HPV integration events (n=34 events). Samples were labelled as having a statistically significant integration event if they had a fold change≥2 and Benjamini-Hochberg adjusted p-value≤0.05 for genes, and fold change≥10 and Benjamini-Hochberg adjusted p-value≤0.05 for ERVs based on the distribution of fold changes for each respectively. Samples with significant events were correlated with T cell infiltration scores from CIBERSORT, and genes from the gene ontologies for dsRNA sensing pathways (GO:0043330) and type I interferon signalling (GO:0060337).

Last Updated: