Getting Started with DNAseq Analysis

From Array Suite Wiki

Getting Started with DNA-seq pipeline functions

Array Studio provides a suite of tools to quickly, easily, and reliably process DNA-seq data. Users can either execute each step of the analysis one by one or use the DNA-seq pipeline function. The following set of videos walks through the functions automatically executed by the standard DNA-seq pipeline, starting with raw reads in .fastq format, and then through some of the common downstream analysis functions. By the end, the user will have raw and aligned QC results, aligned .bam and .bas files, annotated variants, and copy-number variation data.

  • DNA-seq pipeline function [00:30]
  • DNA-seq downstream analysis functions [01:00]

<HTML5video type="youtube" width="900" height="500" autoplay="false">ejMUR-izqck?rel=0</HTML5video>

Running the DNA-seq pipeline

Running your DNA-seq data (whole-genome or whole-exome) through the Array Studio DNA-seq pipeline takes only a few mouse clicks. The output data can be used for downstream analysis.

  • DNA-seq Workflow window [00:10]
  • Run the DNA-seq pipeline [00:25]
  • DNA-seq pipeline options [01:00]
  • View OmicScript for module command [02:50]
  • Pipeline output objects [03:50]
  • Locating .BAM Summary (.bas) files [04:35]

<HTML5video type="youtube" width="900" height="500" autoplay="false">69ZJr0imkIU?rel=0</HTML5video>

Raw Data QC

Before aligning your DNA-seq data, you must first perform quality control (QC) on the raw data to spot common problems such as adapter or barcode sequence contamination, degraded quality at the ends of reads, or problematic samples. The Array Studio Raw Data QC Wizard reports a number of useful measures of raw NGS quality and can be run as part of the DNA-seq pipeline function. However, this module should be run before the pipeline, to determine what read filtering or trimming might be needed.

Additional information about how to interpret the output of these functions can be found in the RNA-seq Raw Data QC Analysis video. It is also recommended that you view the video on Filtering and Trimming NGS reads.

  • Run the Raw Data QC Wizard [00:14]
  • Raw Data QC Output [02:30]
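
As a rough illustration of one of the measures a raw data QC step typically reports, the sketch below computes the mean Phred quality at each read position, which is how degraded quality at the ends of reads usually shows up. It is not the Raw Data QC Wizard's implementation; the function name, the Phred+33 offset, and the assumption of unwrapped four-line FASTQ records are all illustrative choices.

```python
from io import StringIO
from typing import List, TextIO


def mean_quality_per_position(fastq: TextIO, phred_offset: int = 33) -> List[float]:
    """Return the mean Phred score at each read position in a FASTQ stream.

    Assumes four lines per record and Phred+33 encoding (both assumptions).
    """
    totals: List[int] = []  # sum of quality scores seen at each position
    counts: List[int] = []  # number of reads that cover each position
    for line_number, line in enumerate(fastq):
        if line_number % 4 != 3:  # the quality string is the 4th line of each record
            continue
        for position, char in enumerate(line.rstrip("\n")):
            if position == len(totals):
                totals.append(0)
                counts.append(0)
            totals[position] += ord(char) - phred_offset
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


# Two made-up reads; the second one loses quality toward its 3' end.
example = StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGT\n+\nII5#\n"
)
print(mean_quality_per_position(example))  # [40.0, 40.0, 30.0, 21.0]
```

A steady drop in these means toward the end of the read is the usual signal that trimming may be worthwhile before alignment.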

<HTML5video type="youtube" width="900" height="500" autoplay="false">yCDoAutjEV4?rel=0</HTML5video>

Map DNA-seq Reads to Genome

Array Studio uses OSA4 to align NGS reads to the genome as part of the DNA-seq pipeline. Users can also align reads independently of the pipeline to gain greater control over mapping parameters.

  • Run the Map DNA-seq Reads to Genome function [00:15]
  • NGS object output [03:15]
  • Add sample metadata to NGS object Design table [03:35]

Aligned Data QC

Array Studio automatically generates an Alignment Report after aligning reads to the genome or transcriptome. Additional alignment statistics can be generated by running the Aligned Data QC module.

  • Run the Aligned Data QC Module [00:10]
  • Aligned QC Output Data object [01:25]

<HTML5video type="youtube" width="900" height="500" autoplay="false">WcY3hDhJahU?rel=0</HTML5video>

Identify DNA Sequence Variation

The Array Studio DNA-seq pipeline automatically runs "Summarize Variant Data" to identify SNPs, insertions, and deletions. Users can also run this module outside of the pipeline to gain more control over read and mapping quality cutoffs, insertion/deletion calling, and more. The output Variant Report can be annotated with Mutation Annotator databases.

  • Run the Summarize Variant Data module [00:14]
    • Specify output options, including VCF files [01:14]
  • Variant Report output [02:20]
  • Annotate a Variant Report by Variant Databases [03:16]
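
To make the idea of read and mapping quality cutoffs concrete, here is a minimal sketch of how such cutoffs change the allele frequencies observed at a single genomic position. It is not the "Summarize Variant Data" algorithm; the `Observation` type, the function name, and the default cutoffs of 20 are hypothetical.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class Observation(NamedTuple):
    """One read's base call at a single position (hypothetical structure)."""
    base: str
    base_quality: int     # Phred-scaled base quality
    mapping_quality: int  # Phred-scaled mapping quality of the read


def allele_frequencies(
    observations: List[Observation],
    min_base_quality: int = 20,
    min_mapping_quality: int = 20,
) -> Dict[str, float]:
    """Return allele frequencies after dropping low-quality observations."""
    kept = [
        obs.base
        for obs in observations
        if obs.base_quality >= min_base_quality
        and obs.mapping_quality >= min_mapping_quality
    ]
    if not kept:
        return {}
    counts = Counter(kept)
    return {base: count / len(kept) for base, count in counts.items()}


# Made-up pileup: the last two observations fail the cutoffs and are ignored.
pileup = [
    Observation("A", 30, 60),
    Observation("A", 32, 60),
    Observation("A", 31, 60),
    Observation("G", 35, 60),
    Observation("G", 5, 60),   # low base quality
    Observation("T", 30, 0),   # ambiguous mapping
]
print(allele_frequencies(pileup))  # {'A': 0.75, 'G': 0.25}
```

Stricter cutoffs remove noise such as the spurious "T" above, at the cost of reducing usable depth.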

<HTML5video type="youtube" width="900" height="500" autoplay="false">Y3JZTi-kG3k?rel=0</HTML5video>

Generate and Annotate VCF Variant Data

Variant Call Format (VCF) is the most common format for reporting sequence variation. Array Studio variant detection can output merged or individual VCF files, and can organize and annotate these data for efficient filtering in Array Studio.

  • Generate .vcf output from "Summarize Variant Data" [00:05]
  • Convert "Summarize Variant Data" pipeline output to .vcf [00:13]
  • Stream .vcf data in Array Studio [01:51]
  • Annotate .vcf data to generate an interactive OSCR object [02:32]
  • OSCR output of annotated .vcf file [03:32]
  • Filter Annotated OSCR data by column contents [04:05]
  • View selected Variant Details [04:35]
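
For readers who want to inspect exported .vcf files outside of Array Studio, the following sketch reads the eight fixed VCF columns and keeps records that pass a quality cutoff. The function names and the cutoff of 30 are illustrative assumptions, the example records are made up, and the code does not reflect how Array Studio streams or annotates VCF data.

```python
from typing import Dict, Iterable, Iterator

# The eight fixed, tab-separated columns defined by the VCF format.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per data line, skipping header and blank lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        yield dict(zip(VCF_COLUMNS, fields[:8]))


def passes_filters(record: Dict[str, str], min_qual: float = 30.0) -> bool:
    """Keep records whose FILTER is PASS (or unset) and whose QUAL meets the cutoff."""
    if record["FILTER"] not in ("PASS", "."):
        return False
    if record["QUAL"] == ".":  # missing quality value
        return False
    return float(record["QUAL"]) >= min_qual


# Made-up records for illustration only.
example_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG\t58.2\tPASS\tDP=40",
    "chr1\t22222\t.\tC\tT\t12.0\tPASS\tDP=8",
    "chr2\t33333\t.\tG\tGA\t77.0\tLowDepth\tDP=3",
]
for record in parse_vcf_records(example_lines):
    if passes_filters(record):
        print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
```

Only the first record survives here: the second falls below the quality cutoff and the third carries a non-PASS filter value.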

Identify Somatic and Germline Mutations in Matched-Pair Data

Array Studio includes a function for identifying somatic vs. germline variants in NGS data by comparing matched-pair samples from the same subject. Output data can be filtered by gene, location, variant type, and more. Reports and VCF output can also be annotated for predicted functional significance, and variant data can be viewed directly in the Array Studio Genome Browser.

  • Design Table metadata requirements for Matched-pair grouping [00:25]
  • The Matched-Pair Variant module [01:05]
  • Matched-Pair Variant output [02:15]
  • Annotate Matched-Pair Variant Report/VCF file [02:33]
  • Filter report by Annotation Result [04:05]
  • View Variation in the Genome Browser [04:45]
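
The general logic of a matched-pair comparison can be sketched as follows: a variant seen in the disease sample but absent from the matched normal sample is a somatic candidate, while a variant seen in both is likely germline. This is only a simplified illustration with hypothetical thresholds and names, not the Matched-Pair Variant module's actual method.

```python
def classify_variant(
    tumor_alt_reads: int,
    tumor_depth: int,
    normal_alt_reads: int,
    normal_depth: int,
    min_depth: int = 10,           # hypothetical coverage requirement
    min_variant_vaf: float = 0.10, # hypothetical: frequency needed to call a variant present
    max_absent_vaf: float = 0.02,  # hypothetical: frequency tolerated when calling it absent
) -> str:
    """Label a variant from alt-read counts in a matched tumor/normal pair."""
    if tumor_depth < min_depth or normal_depth < min_depth:
        return "insufficient coverage"
    tumor_vaf = tumor_alt_reads / tumor_depth
    normal_vaf = normal_alt_reads / normal_depth
    if normal_vaf >= min_variant_vaf:
        return "germline"  # present in the normal sample
    if tumor_vaf >= min_variant_vaf and normal_vaf <= max_absent_vaf:
        return "somatic"   # present in tumor, absent from normal
    return "unclassified"


print(classify_variant(12, 40, 0, 35))   # somatic
print(classify_variant(20, 40, 18, 36))  # germline
print(classify_variant(3, 5, 0, 30))     # insufficient coverage
```

The coverage check matters because a variant can look absent from the normal sample simply because too few reads cover that position.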

Summarize NGS Coverage Data to Detect Copy Number Variants

DNA-seq whole-genome and whole-exome data can be processed in Array Studio to detect amplification and deletion events. Array Studio compares the relative signal between samples from the same subject and flags regions with unusually high or low signal in the disease sample as potential copy number variation (CNV) events.

  • NGS Design Metadata requirements [00:17]
  • Run the Summarize Copy Number Variations (CNV) Module [00:45]
  • CNV Summary Output [02:23]
  • Segment the CNV Summary for cleaner output [02:48]
  • CNV Segmentation table output [03:40]
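
The comparison of relative signal described above can be illustrated with a small sketch that computes per-bin log2 ratios of disease to normal coverage, centers them on the median to correct for differences in sequencing depth, and flags bins beyond a threshold. The bin counts, the pseudocount, and the thresholds of ±0.3 are illustrative assumptions; this is not the Summarize Copy Number Variations module's algorithm, and it omits segmentation entirely.

```python
import math
from statistics import median
from typing import List


def centered_log2_ratios(
    disease_counts: List[int],
    normal_counts: List[int],
    pseudocount: float = 0.5,  # avoids division by zero in empty bins
) -> List[float]:
    """Per-bin log2(disease/normal), shifted so the median bin sits at 0."""
    raw = [
        math.log2((disease + pseudocount) / (normal + pseudocount))
        for disease, normal in zip(disease_counts, normal_counts)
    ]
    baseline = median(raw)  # most bins are assumed to be copy-neutral
    return [value - baseline for value in raw]


def flag_bins(ratios: List[float], gain: float = 0.3, loss: float = -0.3) -> List[str]:
    """Label each bin as a potential gain, loss, or neutral region."""
    labels = []
    for ratio in ratios:
        if ratio >= gain:
            labels.append("gain")
        elif ratio <= loss:
            labels.append("loss")
        else:
            labels.append("neutral")
    return labels


# Made-up read counts for five genomic bins.
disease = [100, 100, 200, 100, 50]
normal = [100, 100, 100, 100, 100]
print(flag_bins(centered_log2_ratios(disease, normal)))
# ['neutral', 'neutral', 'gain', 'neutral', 'loss']
```

Median centering means a disease sample sequenced twice as deeply as its normal still yields the same labels, since only the relative differences between bins matter.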

Visualize NGS Copy Number Variations

Segmented Copy Number Variation data, detected from DNA-seq data, can be visualized in multiple ways. Users can view the log2 ratios of Disease:Normal signal by segment, as a scatter plot, along a chromosome schematic, or in the Genome Browser.

  • Scatterplot of Discrete Copy Number Segments [00:05]
  • Interpreting Log2Ratio of Signal as Copy Number [00:44]
  • Segment Chromosome View [01:58]
  • Segment View [02:45]
  • View Coverage for Segment in the Genome Browser [03:14]
  • Scale Coverage of DNA-seq data between samples [03:48]
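
When interpreting a log2 ratio as a copy number, a common simplification is to assume a diploid normal sample and a pure disease sample, in which case the copy number is 2 × 2^(log2 ratio). The sketch below encodes that relationship; the function names are hypothetical, and real samples with normal-cell contamination or subclonal events will show ratios closer to zero than this model predicts.

```python
import math


def copy_number_from_log2_ratio(log2_ratio: float, normal_ploidy: int = 2) -> float:
    """Estimated copies in the disease sample, assuming a pure sample."""
    return normal_ploidy * 2 ** log2_ratio


def log2_ratio_from_copy_number(copies: float, normal_ploidy: int = 2) -> float:
    """Expected Disease:Normal log2 ratio for a given copy number (copies > 0)."""
    return math.log2(copies / normal_ploidy)


for ratio in (-1.0, 0.0, 1.0):
    print(ratio, copy_number_from_log2_ratio(ratio))
# -1.0 1.0   (single-copy loss)
# 0.0 2.0    (copy-neutral)
# 1.0 4.0    (two extra copies)
```

Note that the scale is not symmetric in copies: gaining one copy (3 copies) gives a log2 ratio of about 0.58, while losing one copy gives -1.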

Integrate CNV chip and DNA-seq Visualizations in the Omicsoft Genome Browser

After importing your SNP chip and DNA-seq (WXS/WGS) data into Array Studio and segmenting them, you can quickly visualize them in the Genome Browser. Array Studio Views can take you to a region of interest, or you can just explore.

  • Source Data: DNA-seq pipeline .bam and segmented CNV calls, SNP chip intensity and segmented CNV calls [00:19]
  • Open Genome Browser session on region of interest [01:18]
  • Set .bam display track properties to compare coverage [02:17]
  • Add SNP Array probe-level intensity data [02:50]
  • Set SNP display track properties to compare intensities [03:29]
  • Add Copy number predictions [04:15]
  • Customize Copy number tracks [04:55]
  • Choose a new region from Analysis to view with custom Genome Browser tracks [05:45]
