Introduction to TCGA Land Content

From Array Suite Wiki


Introduction to TCGA Land

The Cancer Genome Atlas (TCGA) is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing. TCGA Land includes data from 33 cancer types, with RNA-Seq, DNA-Seq, Copy Number, Methylation, Expression array (Agilent), and protein array (RPPA) data.

The latest version of TCGA data can be found within TCGA_B38_GC33. Learn more about the curation effort to unify data here: https://digitalinsights.qiagen.com/news/blog/discovery/a-better-way-to-explore-tcga-data/

Data Sources

Most TCGA data are sourced from the GDC data portal; additional data are sourced from publicly deposited data and curated from TCGA-related publications.


Land Version     Genome Build   Gene Model             Note
TCGA_B38_GC33    Human.B38      OmicsoftGenCode_V33    Current Version
TCGA_B37         Human.B37.3    OmicsoftGene20130723   Archived
TCGA_B38         Human.B38      OmicsoftGenCode_V24    Archived

Data Types

  • CNV Calling: Gistic2 Call and TCGA Land CNV Call (Segment Data)
  • DNA-Seq Somatic Mutation
  • Expression Ratio (Agilent)
  • Methylation450 BeadChip (n.b. technical restrictions mean these data are only available in TCGA_B37)
  • Mass Spectrometry (MS)
  • miRNA-Seq
  • RPPA (protein array)
  • RPPA_RBN (protein array)
    • Replicate-based normalization (RBN) for cross-tumor comparisons (M.D. Anderson)
  • RNA-Seq, including:
    • Single-end and Paired-end fusion calling
    • RNA-Seq somatic mutation, from matched tumor/normal pairs
    • Exon Junction and Exon Usage
    • Expression (Gene- and Transcript- level quantification)
  • Metadata, including TCGA Marker Paper information

Laboratory Methods

Agilent Expression Array (Agilent G4502A)

Illumina HiSeq sequencing (GAII, GAIIx, HiSeq 2000, HiSeq2500)

Illumina DNA Sequencing

Methylation450 Array (B37 genome, from Level 3 beta values)

Processing Methods

Expression Data

Omicsoft Affymetrix Microarray Preprocessing

Note: For expression arrays, there may be some discrepancies between published data and the values in TCGA Land. Please see the accompanying wiki page for an explanation of where these differences arise.

RNA-Seq data

OmicScript Pipeline and Building Land From RNA-Seq Data

HLA (Class I) identification uses the aligned RNA-Seq reads. The OptiType program aligns RNA-Seq reads to the HLA reference and then performs an optimization to determine the most likely HLA Class I allele. See OptiType - precision HLA typing from next-generation sequencing data.pdf for a description of the algorithm. TCGA has classified this information as restricted access.

DNA-seq data

OmicSoft does not reprocess other genomic data, but extracts data directly from the original datasets.

  • TCGA_B38: Mutation calls provided in B38 are public somatic mutation data derived from merging the mutations in all MAF files downloaded from the GDC (v27.0). To see more information on the mutation calling pipelines used by the GDC, please visit MAF_source
  • TCGA_B37: Collating all of the TCGA data, especially the DNA somatic mutation data, has been quite complex. TCGA data have historically been housed in various repositories (Broad Firehose, UCSC Cancer Genomics Hub (CGHub), TCGA Data Portal, cBioPortal). With data generation for TCGA now wrapping up, the NCI is attempting to store all of the generated data (both raw and processed) on the Genomic Data Commons (GDC). However, because GDC and cBioPortal appear to update their respective databases at different times, there are still some discrepancies between the two portals. To provide our users with the most comprehensive TCGA dataset, OmicSoft is actively merging data from these different TCGA repositories into one unified dataset. When we first began curating the TCGA Land, our starting files were the MAF files downloaded from the TCGA Data Portal (now deprecated and replaced by the NCI GDC) and Broad Firehose. We have been updating our TCGA Land as new annotated somatic mutations are released on the GDC. cBioPortal uses its own curation and analysis pipeline, which differs from that of GDC. Recently, we have merged mutation calls from cBioPortal to address this discrepancy.
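The idea of merging mutation calls from several MAF files can be pictured with a minimal Python sketch. It is an illustration only, not OmicSoft's actual pipeline. It assumes tab-delimited MAF files with `#` comment lines and the standard GDC MAF column names; the function names and the choice of de-duplication key are hypothetical, and the rows are made-up placeholders.

```python
import csv
import io

# Columns used to decide that two MAF rows describe the same mutation call.
# This key is an assumption made for the illustration.
KEY_COLUMNS = (
    "Tumor_Sample_Barcode",
    "Chromosome",
    "Start_Position",
    "End_Position",
    "Reference_Allele",
    "Tumor_Seq_Allele2",
)


def read_maf(handle):
    """Yield one dict per mutation row, skipping '#' comment lines."""
    data_lines = (line for line in handle if not line.startswith("#"))
    yield from csv.DictReader(data_lines, delimiter="\t")


def merge_maf_records(handles):
    """Merge rows from several MAF files, keeping the first copy of each call."""
    merged = {}
    for handle in handles:
        for row in read_maf(handle):
            key = tuple(row[col] for col in KEY_COLUMNS)
            merged.setdefault(key, row)
    return list(merged.values())


# Two tiny in-memory "files" with placeholder rows that share one call.
header = "\t".join(("Hugo_Symbol",) + KEY_COLUMNS)
maf_a = io.StringIO(
    "#version placeholder\n" + header + "\n"
    "BRAF\tTCGA-XX-0001-01A\tchr7\t100\t100\tA\tT\n"
)
maf_b = io.StringIO(
    header + "\n"
    "BRAF\tTCGA-XX-0001-01A\tchr7\t100\t100\tA\tT\n"
    "TP53\tTCGA-XX-0002-01A\tchr17\t200\t200\tC\tG\n"
)

merged_calls = merge_maf_records([maf_a, maf_b])
```

Keeping the first copy of a duplicated call is only one possible policy; a real merge might instead prefer one source over another or keep track of which repositories reported each call.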


Users may notice a discrepancy in the number of samples with data in the B37 and B38 Lands. GDC Legacy data (the source for TCGA_B37 data) contained mutation calls from arrays and GDC, while data in B38 are exclusively from whole exome sequencing (WXS) data. A small number of WXS cases fail to pass the QC and harmonization process at GDC and are thus excluded from the pipeline.

Excluded samples: TCGA-09-0365-01A, TCGA-09-0367-01A, TCGA-13-0757-01A, TCGA-13-0758-01A, TCGA-13-0764-01A, TCGA-DS-A1OB-01A
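When comparing sample counts against another source, it can help to drop the excluded samples from the other list first. The following is a minimal sketch, assuming the other source identifies samples by TCGA barcodes whose first 16 characters are the sample/vial portion (for example `TCGA-09-0365-01A`); the helper name `drop_excluded` and the example input IDs are hypothetical.

```python
# Sample IDs listed above as excluded.
EXCLUDED_SAMPLES = frozenset({
    "TCGA-09-0365-01A",
    "TCGA-09-0367-01A",
    "TCGA-13-0757-01A",
    "TCGA-13-0758-01A",
    "TCGA-13-0764-01A",
    "TCGA-DS-A1OB-01A",
})


def drop_excluded(sample_ids, excluded=EXCLUDED_SAMPLES):
    """Return IDs whose first 16 characters are not in the excluded set.

    Matching on the first 16 characters lets longer aliquot-level
    barcodes be compared against the sample-level IDs above.
    Input order is preserved.
    """
    return [sid for sid in sample_ids if sid[:16] not in excluded]


# Hypothetical input: one excluded sample (as a longer barcode) and one other.
kept = drop_excluded(["TCGA-09-0365-01A-02D-XXXX-01", "TCGA-XX-0001-01A"])
```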


Mass Spectrometry (MS)

Raw MS data are downloaded from the Clinical Proteomic Tumor Analysis Consortium (CPTAC). These data are typically obtained from four centers: the Broad Institute, Pacific Northwest National Laboratory (PNNL), Johns Hopkins University, and Vanderbilt. Currently, the data from these projects available in the Land are log2 ratios (iTRAQ) taken exclusively from the Broad Institute and PNNL. Two types of protein levels are reported: 1) overall protein levels and 2) variant levels (i.e. phosphorylation). For example, note the entries for BRAF:

Mass spec.png
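Because the MS values are log2 ratios, a value of 0 means no change relative to the reference, and each unit corresponds to a two-fold difference. A small sketch of the conversion back to a linear ratio follows; the function name is hypothetical, the input values are illustrative rather than taken from the Land, and the reference used for the iTRAQ ratio is not specified here.

```python
def log2_ratio_to_fold_change(log2_ratio):
    """Convert a log2 ratio into a linear-scale ratio (sample / reference)."""
    return 2.0 ** log2_ratio


# Illustrative values only: -1 is half the reference level, +1 is double.
fold_changes = [log2_ratio_to_fold_change(v) for v in (-1.0, 0.0, 1.0)]
```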

A note on Copy Number Variation data

Some differences may be observed between legacy TCGA Lands and the current TCGA_B38_GC33 Land because:

  • For the new TCGA_B38_GC33 Land, we processed GDC coordinates directly on Human Genome B38; the legacy TCGA_B38 Land data were remapped from TCGA_B37 data.
  • For the current TCGA_B38_GC33 release, we processed Gistic2 calls from masked segmentation files from GDC, using the Gistic2 parameters used by GDC and calling 5-level categories (Homozygous_Deletion, Heterozygous_Deletion, Diploid, Gain, Amplification). TCGA_B37/TCGA_B38 Gistic2 calls were imported from the Broad GDAC Firehose, whose pipeline differed from the current GDC Gistic2 pipeline.
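For readers who work with raw Gistic2 output alongside the Land, the sketch below shows how the five categories could line up with Gistic2's thresholded integer calls (-2 to 2). The one-to-one mapping is an assumption made for illustration; this page does not document how the Land labels are assigned, and the names `GISTIC_LEVEL_TO_CATEGORY` and `categorize` are hypothetical.

```python
# Assumed correspondence between Gistic2 thresholded values and the
# category labels named above (not confirmed by this page).
GISTIC_LEVEL_TO_CATEGORY = {
    -2: "Homozygous_Deletion",
    -1: "Heterozygous_Deletion",
    0: "Diploid",
    1: "Gain",
    2: "Amplification",
}


def categorize(levels):
    """Translate thresholded integer calls into category labels."""
    return [GISTIC_LEVEL_TO_CATEGORY[int(level)] for level in levels]


labels = categorize([-2, 0, 2])
```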


Key Metadata Columns

  • Tumor Type: The type of tumor. See Primary Grouping for details.
  • Sample Type: The type of sample, indicating where the sample is from. It includes information such as whether the sample is from normal or tumor tissue, and whether it is from a primary tumor, a recurrent tumor, a cell line, etc.
  • Land Tissue: The tissue from which the cell line was derived, using OmicSoft's curation Controlled Vocabulary
  • Land Sample Type: A detailed description of the cell type from which the cell line was derived, using OmicSoft's curation Controlled Vocabulary
  • Tumor or Normal: Indicates whether a sample is from a tumor or normal sample.

Primary Grouping

Tumor Types (All 33 tumor types from TCGA)

SampleDistributionbyTumorType.png

Curated Marker Papers

As part of curating and updating the TCGA Lands, OmicSoft adds new metadata columns corresponding to TCGA consortium publications.

TCGAMarkerPaper.png

Found under Sample Metadata | Clinical Data | TCGAMarkerPaper, the column name format is TumorType_PublicationTitleYear Description; additional information (such as PubMedID) can be found by hovering over the metadata column.

For example, to find curated columns from Comprehensive molecular portraits of human breast tumours (Nature 2012), search for BRCA_Nature2012 to see eight additional curated columns to filter and group data on.

TCGAMarkerPaper BRCA Nature2012.png
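The same prefix search can be done on an exported list of metadata column names. The sketch below assumes the column names are available as plain strings; the helper `find_marker_paper_columns` and all example column names are hypothetical placeholders that only follow the naming pattern described above.

```python
def find_marker_paper_columns(column_names, prefix):
    """Return column names starting with the prefix (case-insensitive)."""
    wanted = prefix.lower()
    return [name for name in column_names if name.lower().startswith(wanted)]


# Placeholder names following "TumorType_PublicationTitleYear Description".
columns = [
    "BRCA_Nature2012 Example Column A",
    "BRCA_Nature2012 Example Column B",
    "OTHER_Journal2000 Example Column",
    "Tumor Type",
]

brca_columns = find_marker_paper_columns(columns, "BRCA_Nature2012")
```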

