Introduction to GTEx Land Content

From Array Suite Wiki

GTEx Lands

The Genotype-Tissue Expression (GTEx) project aims to create a comprehensive public atlas of gene expression and regulation across multiple human tissues. The GTEx project can help researchers understand the correlation between tissue-specific gene expression and human diseases. According to the GTEx Portal, “GTEx will help researchers to understand inherited susceptibility to disease and will be a resource database and tissue bank for many studies in the future.” The GTEx Land contains RNA-Seq and Affymetrix expression data for all normal tissues. It provides high-quality normal control samples against which researchers can benchmark their patient or drug-response sample data.

The GTEx Land can be used in conjunction with other Lands (such as the TCGA and CCLE Lands) to create virtual Lands. Comparisons across datasets are possible because the Lands use controlled vocabularies and their expression and RNA-Seq data are processed with OmicSoft's standard pipelines.

Land Versions

  • GTEx_B38_GC33: GTEx-v8 data, aligned to Human.B38 and OmicsoftGenCode.V33. This is the latest version of the Land and will continue to be updated with new content and metadata.
  • GTEx_B37: GTEx-v8 data, aligned to Human.B37.3 and OmicsoftGene20130723.
  • GTEx_B38: GTEx-v6 data, aligned to Human.B38 and OmicsoftGenCode.V24. This Land is superseded by GTEx_B38_GC33 and will not be updated.

Data Source

GTEx Portal (GTEx v8).

Data Types

  • 819 samples with Affymetrix Expression data (HuGene-1_1-st-v1)
  • 16963 samples with RNA-Seq data; based on SRA files
  • 201 samples with MS data; based on matrices from PMID 32916130
    • 418 samples in GTEx, flagged as poor quality or from cell lines, are excluded from this Land.

Laboratory Methods

  • Affymetrix Expression Array
  • Illumina TruSeq RNA sequencing
  • Mass Spectrometry proteomics

Processing Methods

RNA-Seq data:

  • MS proteomics data: [1] protein-level expression for 201 samples
  • HLA (Class I) identification using the RNA-Seq aligned reads. GTEx has classified this information as restricted access. The OptiType program aligns RNA-Seq reads to the HLA reference genome and then performs an optimization to determine the most likely HLA Class I alleles. See OptiType - precision HLA typing from next-generation sequencing data.pdf for a description of the algorithm.

Statistical analyses

Over 1300 statistical comparisons were performed to reveal key expression differences between GTEx cohorts.

  • 92 statistical comparisons were performed, corresponding to the 52 sub-tissues described in TissueDetail_GTEx, using DESeq2 v1.30. For each comparison, all samples in the Case group (one TissueDetail, e.g. Liver) were compared to an aggregated control group composed of carefully selected samples representing all other TissueDetail groups.
  • 1222 statistical comparisons were performed, of the following types:
    • TissueDetail_GTEx vs others
      • Within a tissue
      • Within a tissue + sex
      • Within a tissue + age range
      • Within a tissue + sex + age range
    • Male vs female
      • Within a tissue
      • Within a tissue and age range
    • Age range vs others
      • Within a tissue
      • Within a tissue + sex
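The "X vs others, within a stratum" pattern in the list above can be sketched in a few lines of Python. This is only an illustration of how such case and control groups could be enumerated from sample metadata, not the procedure actually used to build the Land. The function name `case_vs_others`, the sample IDs, and the `Sex` and `AgeRange` field names are assumptions; only `Tissue_GTEx` and `TissueDetail_GTEx` are column names taken from this page.

```python
from collections import defaultdict

# Hypothetical sample metadata. "Sex" and "AgeRange" are assumed field names.
SAMPLES = [
    {"id": "S1", "Tissue_GTEx": "Skin", "TissueDetail_GTEx": "Skin - Sun Exposed (Lower leg)",
     "Sex": "male", "AgeRange": "20-29"},
    {"id": "S2", "Tissue_GTEx": "Skin", "TissueDetail_GTEx": "Skin - Sun Exposed (Lower leg)",
     "Sex": "female", "AgeRange": "20-29"},
    {"id": "S3", "Tissue_GTEx": "Skin", "TissueDetail_GTEx": "Skin - Not Sun Exposed (Suprapubic)",
     "Sex": "male", "AgeRange": "30-39"},
    {"id": "S4", "Tissue_GTEx": "Skin", "TissueDetail_GTEx": "Skin - Not Sun Exposed (Suprapubic)",
     "Sex": "male", "AgeRange": "20-29"},
    {"id": "S5", "Tissue_GTEx": "Liver", "TissueDetail_GTEx": "Liver",
     "Sex": "male", "AgeRange": "20-29"},
]


def case_vs_others(samples, case_field, within_fields):
    """Enumerate 'one value of case_field vs all others' inside each stratum.

    A stratum is a unique combination of the values in `within_fields`.
    Returns {(stratum, case_value): (case_ids, control_ids)}.
    """
    strata = defaultdict(list)
    for sample in samples:
        strata[tuple(sample[f] for f in within_fields)].append(sample)

    comparisons = {}
    for stratum, members in strata.items():
        values = sorted({m[case_field] for m in members})
        if len(values) < 2:
            continue  # only one group in this stratum, so nothing to compare
        for value in values:
            case = [m["id"] for m in members if m[case_field] == value]
            control = [m["id"] for m in members if m[case_field] != value]
            comparisons[(stratum, value)] = (case, control)
    return comparisons


# TissueDetail_GTEx vs others, within a tissue
by_tissue = case_vs_others(SAMPLES, "TissueDetail_GTEx", ["Tissue_GTEx"])
# TissueDetail_GTEx vs others, within a tissue + sex
by_tissue_sex = case_vs_others(SAMPLES, "TissueDetail_GTEx", ["Tissue_GTEx", "Sex"])
```

Passing `"Sex"` or `"AgeRange"` as `case_field` gives the "Male vs female" and "Age range vs others" families in the same way. Strata with a single group (Liver in this toy data) produce no comparison.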

The control samples were chosen by:

  • Calculating the mean expression per gene within a TissueDetail_GTEx category
  • Creating groups of 8 samples (or 2 in the case of Tissue_GTEx: Brain) and comparing the mean per-gene expression of each subgroup to the mean per-gene expression of all the samples
  • Selecting the most representative group, defined by the cosine similarity between the expression profile of the subgroup and that of the entire tissue
    • Sampling brain tissue: Because of the large number of TissueDetail_GTEx samples from different brain regions, only two samples from each TissueDetail were selected
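The selection steps above can be illustrated with a minimal Python sketch. It assumes that candidate subgroups are drawn at random, which this page does not state, and the names `mean_profile`, `cosine_similarity`, `pick_representative_group`, and the `n_candidates` and `seed` parameters are all hypothetical. It is a sketch of the idea, not the pipeline used to build the Land.

```python
import math
import random


def mean_profile(samples):
    """Per-gene mean over samples; each sample is a list of expression values."""
    return [sum(values) / len(samples) for values in zip(*samples)]


def cosine_similarity(a, b):
    """Cosine of the angle between two expression profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero profile has no direction to compare
    return dot / (norm_a * norm_b)


def pick_representative_group(samples, group_size=8, n_candidates=1000, seed=0):
    """Return (indices, similarity) for the best of n_candidates random subgroups.

    'Best' means the subgroup whose mean profile has the highest cosine
    similarity to the mean profile of all samples in the category.
    Random candidate generation is an assumption of this sketch.
    """
    overall = mean_profile(samples)
    size = min(group_size, len(samples))
    rng = random.Random(seed)
    best_indices, best_score = None, -math.inf
    for _ in range(n_candidates):
        indices = sorted(rng.sample(range(len(samples)), size))
        subgroup = mean_profile([samples[i] for i in indices])
        score = cosine_similarity(subgroup, overall)
        if score > best_score:
            best_indices, best_score = indices, score
    return best_indices, best_score
```

Under these assumptions, a call such as `pick_representative_group(samples_for_one_tissue_detail)` would be made once per TissueDetail_GTEx category, with `group_size=2` for the brain categories.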

Key Metadata Columns

  • Tissue: Tissues such as brain, blood, heart, lung, kidney etc., using OmicSoft controlled vocabulary
  • Tissue_GTEx: Tissues, using GTEx controlled vocabulary
  • TissueDetail_GTEx: Sub-category within a tissue, such as Brain - Amygdala, Brain - Cortex, Brain - Hippocampus, Brain - Spinal cord (cervical c-1) etc., using GTEx terminology
  • OncoSampleType: Curated by the OmicSoft Land curation team using OmicSoft's controlled vocabularies, allowing users to easily merge the data with OncoLands such as TCGA
  • Tumor or Normal: Indicates whether a sample is from a tumor sample or a normal sample. All GTEx data are from normal samples.

Key Views

GTEx data are tissue-specific. One of the most common ways to visualize the data is to group them by Tissue:

GTEx GeneFPKMforEGFR.png

If the user is interested in more detailed information within a tissue type, the data can be filtered for one or a few tissue types and then grouped by Tissue Detail Type:

GTEx GeneFPKMforEGFRbyTissueDetail.png
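Outside the Land views, the same group-then-filter logic can be sketched in Python on exported sample-level data. The records, the FPKM values, and the `median_by_group` helper below are invented for illustration and are not real GTEx measurements. Only the `Tissue` and `TissueDetail_GTEx` column names come from this page.

```python
from collections import defaultdict
from statistics import median

# Hypothetical per-sample records for a single gene. FPKM values are invented.
RECORDS = [
    {"Tissue": "skin", "TissueDetail_GTEx": "Skin - Sun Exposed (Lower leg)", "FPKM": 40.0},
    {"Tissue": "skin", "TissueDetail_GTEx": "Skin - Not Sun Exposed (Suprapubic)", "FPKM": 30.0},
    {"Tissue": "skin", "TissueDetail_GTEx": "Skin - Sun Exposed (Lower leg)", "FPKM": 50.0},
    {"Tissue": "brain", "TissueDetail_GTEx": "Brain - Cortex", "FPKM": 4.0},
    {"Tissue": "brain", "TissueDetail_GTEx": "Brain - Amygdala", "FPKM": 2.0},
    {"Tissue": "brain", "TissueDetail_GTEx": "Brain - Cortex", "FPKM": 6.0},
    {"Tissue": "liver", "TissueDetail_GTEx": "Liver", "FPKM": 10.0},
]


def median_by_group(records, group_field, tissues=None):
    """Median FPKM per value of `group_field`.

    If `tissues` is given, only records whose Tissue is in that set are kept,
    which mirrors filtering to one or a few tissue types before grouping.
    """
    groups = defaultdict(list)
    for record in records:
        if tissues is not None and record["Tissue"] not in tissues:
            continue
        groups[record[group_field]].append(record["FPKM"])
    return {name: median(values) for name, values in groups.items()}


# Group all samples by Tissue
per_tissue = median_by_group(RECORDS, "Tissue")
# Filter to brain, then group by the finer TissueDetail_GTEx category
per_brain_region = median_by_group(RECORDS, "TissueDetail_GTEx", tissues={"brain"})
```

The median is used here only as a compact stand-in for the per-group distributions that the views display.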

Multiple metadata fields can be combined to reveal differences in expression across sample groups, such as CXCL9 expression, which is significantly down-regulated in the analysis of sun-exposed skin in males from 20-29 years vs other groups. Sample-level expression shows a trend of increased expression of CXCL9 in sun-exposed skin in older age groups.

GTEx multilevel.png

Differential expression from each statistical comparison can be explored in a volcano plot, such as the comparison of sun-exposed skin from male samples at age range 20-29 years vs other age ranges.

GTEx comparisons.png

MS proteomics expression in GTEx tissue samples can be visualized as variable plots, or compared with RNA-seq expression from the same samples using the RNA-seq to MS Integration View.

