Map RNA-Seq Reads to Genome (Illumina)
The Map RNA-Seq Reads to Genome (Illumina) module allows the user to map raw sequence reads to the genome. This module returns a number of summary statistics, an NGS dataset (used for downstream analysis such as exon junction generation, paired fusion gene detection, and more), and a microarray dataset containing expression values.
General
Notes on duplicates
This function will not mark duplicate reads in SAM entries.
Input/Output
Accepted file formats include FASTQ, FASTA, QSEQ, and AUTO (AUTO allows the use of any combination of the listed file types).
Basic
In the Basic section, the user has a number of options:
- Reads are paired: The user can choose whether this is a paired-end sequencing analysis. If so, the reads are paired automatically using a numbering logic (e.g. _1, _2 or .1, .2).
- By default, OmicSoft will pre-sort input files to identify paired samples. In exceptional situations (e.g. if you have grouped multiple input files into a single sample with Sample Registration or "Add List", and some files have the same names across multiple folders), you may want to select "pair files in order", so that input files are assumed to already be pre-sorted in the proper order.
- Search for novel exon junctions during the alignment.
- For quality encoding, the user can choose Automatic (recommended) or explicitly set the quality encoding as either Illumina or Sanger.
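The numbering logic behind the "Reads are paired" option can be illustrated with a short Python sketch. This is not OmicSoft's implementation: the function name `pair_fastq_files`, the regular expression, and the error handling are assumptions made for illustration. Only the `_1`/`_2` and `.1`/`.2` naming conventions come from the text above.

```python
import re
from collections import defaultdict

# Assumed pattern: a mate number (1 or 2) preceded by "_" or ".",
# followed by one or more alphabetic file extensions.
MATE_SUFFIX = re.compile(r"^(?P<sample>.+)[_.](?P<mate>[12])(?P<ext>(\.[A-Za-z]+)+)$")

def pair_fastq_files(filenames):
    """Group file names into (mate 1, mate 2) tuples by shared sample prefix."""
    mates = defaultdict(dict)
    for name in sorted(filenames):
        match = MATE_SUFFIX.match(name)
        if match is None:
            raise ValueError(f"no mate suffix found in {name!r}")
        mates[match["sample"]][match["mate"]] = name

    pairs = []
    for sample, found in sorted(mates.items()):
        if set(found) != {"1", "2"}:
            raise ValueError(f"sample {sample!r} is missing a mate")
        pairs.append((found["1"], found["2"]))
    return pairs

pairs = pair_fastq_files(["b.2.fastq.gz", "a_2.fastq", "a_1.fastq", "b.1.fastq.gz"])
```

Because the sketch sorts the names before pairing, it mirrors the default pre-sorting behavior. The "pair files in order" option would instead take the files two at a time in the order given.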
Performance and Reporting
For the performance and reporting section, there are a number of important settings.
- Total penalty is the total number of indels or mismatches allowed for a successful mapping.
- The penalty is defined as the maximal number of mismatches allowed, plus the gap penalty if an indel is present in the alignment. The gap penalty is usually set to one or two (default is two). By default, OmicSoft automatically sets the maximal penalty for each read to Max(2, (read length - 31) / 15), based on the trimmed read length. Below is a table of the automatic penalty for reads of 17-106 nt:
Read Length    Penalty
17-76          2
77-91          3
92-106         4
- Fixed - The user can override the automatic penalty setting and set the penalty to any fixed number. Penalty values of 2, 3, and 4 are the most common for reads < 100 nt.
- Thread number is the total number of threads to be allocated to the process. The more threads that are allocated, the faster the algorithm will run. By default, this is set to the number of CPUs on the user’s computer. It should not exceed the number of CPUs available, but can be reduced at the user’s discretion.
- Job number - Specifying the parallel "Job number" will spawn off new processes to run the alignments. If you have 24 samples, you could specify "Job number" = 12 to run 12 alignments at once.
- Non-unique mapping is for handling ties (reads mapped to multiple locations on the genome). You can report up to a specified number of ties (if mapped to more than this number of locations, the read will be unmapped), or choose to exclude them completely from the mapping and counting.
- Optionally, SAM files can be generated, and the user can choose not to import the data directly into the project.
- The user can specify the output folder for the results.
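As a rough illustration of the penalty settings, the following Python sketch encodes the automatic penalty table and the penalty check described above. The names `AUTO_PENALTY_TABLE`, `automatic_penalty`, and `within_penalty` are hypothetical. The source does not state how the division in the Max formula is rounded, so the sketch looks up the published table ranges directly rather than evaluating the formula. It also assumes the gap penalty is charged once per alignment that contains an indel.

```python
# Ranges copied from the table above: (shortest read, longest read, penalty).
AUTO_PENALTY_TABLE = [
    (17, 76, 2),
    (77, 91, 3),
    (92, 106, 4),
]

def automatic_penalty(trimmed_read_length):
    """Return the automatic total penalty for a trimmed read of 17-106 nt."""
    for shortest, longest, penalty in AUTO_PENALTY_TABLE:
        if shortest <= trimmed_read_length <= longest:
            return penalty
    raise ValueError("read length is outside the 17-106 nt range of the table")

def within_penalty(mismatches, has_indel, max_penalty, gap_penalty=2):
    """Check mismatches plus the gap penalty (if an indel is present) against the limit.

    Assumption: the gap penalty is added once, regardless of indel count.
    """
    total = mismatches + (gap_penalty if has_indel else 0)
    return total <= max_penalty

limit = automatic_penalty(80)                     # a read of 80 nt falls in the 77-91 range
accepted = within_penalty(1, True, limit)         # 1 mismatch + gap penalty of 2
rejected = not within_penalty(2, True, limit)     # 2 mismatches + gap penalty of 2
```

With a fixed penalty, `max_penalty` would simply be the user-supplied number instead of the table lookup.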
Advanced
In the Advanced tab, the user can set a number of options related to indel detection and paired-end data.
- Detect indels (middle indels + end indels) - Select this option to perform indel detection.
- Indel penalty can be set, and is defined as the allowable open gap penalty.
- Maximal middle insertion size, maximal middle deletion size, maximum end insertion size, maximal end deletion size, and minimal distal end size can be set in this section as well.
- For paired-end data, the user can set the expected insert size of the paired-end reads, the standard deviation of the insert size, and the strand mode for the pairs (different strand for Illumina data, or same strand for SOLiD data).
- For the read trimming section, the user can choose to trim the reads using a quality score of a specified amount or below. If a base has a quality score at or below the specified amount, the read is trimmed at that point, although the algorithm will only trim, at most, down to a read size of 17 bases.
- Advanced trimming allows the user to trim by various options:
- Trim first # nucleotides - Will remove the specified number of nucleotides from the beginning of the sequence.
- Trim last # nucleotides - Will remove the specified number of nucleotides from the end of the sequence.
- Trim by quality - See above (default is 2).
- Trim by final length - Will remove nucleotides from the end of the sequence to achieve the specified final length.
- The Adapter Stripping section allows the user to strip adapters from the 3’ end of the read, by specifying the adapter sequence.
- (Deprecated) Optionally, the user can choose to exclude any unmatched reads (without adapters) from further analysis and mapping.
- Trim reads first - The user should understand the order of operations that takes place when doing the 3' end adapter stripping during an alignment:
1. Quality trimming/other trimming options
2. Strip adapters
See AdapterStripping 3'End and AdapterStripping Right for more details.
If a read contains a barcode sequence (Multiplex Identifier, MID) at its end, this sequence may interfere with the adapter stripping module. The 3' adapter stripping performs a localized alignment at the right end of the read, but it is unable to find internal adapters. As a result, no adapter will be stripped.
You may remove the MID sequence using either the MID Extraction + Adapter Stripping module, or choose to trim the # of bases of the MID sequence from the end of the read using Advanced Trimming options (i.e. "Trim last nucleotides").
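The order of operations and the MID problem can be sketched in Python as follows. This is a simplified illustration, not the Array Studio algorithm. All function names are hypothetical. The quality trim is assumed to scan from the start of the read and cut at the first low-quality base. The adapter match is an exact suffix comparison standing in for the localized alignment, and the order of the individual trimming steps within step 1 is also an assumption.

```python
MIN_TRIMMED_LENGTH = 17  # quality trimming never shortens a read below this size

def trim_last(seq, quals, n):
    """Remove n nucleotides from the end of the read ("Trim last # nucleotides")."""
    keep = max(len(seq) - n, 0)
    return seq[:keep], quals[:keep]

def quality_trim(seq, quals, threshold=2, min_length=MIN_TRIMMED_LENGTH):
    """Cut at the first base scoring at or below threshold (assumed scan direction)."""
    cut = len(seq)
    for i, q in enumerate(quals):
        if q <= threshold:
            cut = i
            break
    cut = max(cut, min(min_length, len(seq)))  # keep at least 17 bases
    return seq[:cut], quals[:cut]

def strip_3prime_adapter(seq, quals, adapter, min_overlap=3):
    """Strip an adapter (or adapter prefix) only if it sits at the very end of the read."""
    longest = min(len(adapter), len(seq))
    for k in range(longest, min_overlap - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:-k], quals[:-k]
    return seq, quals  # internal adapters are not found

def preprocess(seq, quals, adapter, mid_length=0, threshold=2):
    # Step 1: trimming options run first.
    seq, quals = trim_last(seq, quals, mid_length)
    seq, quals = quality_trim(seq, quals, threshold)
    # Step 2: adapter stripping runs on the already-trimmed read.
    return strip_3prime_adapter(seq, quals, adapter)

insert = "ACGTACGTACGTACGTACGTAC"
adapter = "TGGAATTCTCGG"   # illustrative adapter sequence
mid = "CCAA"               # illustrative MID barcode
read = insert + adapter + mid
quals = [30] * len(read)

untrimmed, _ = strip_3prime_adapter(read, quals, adapter)          # MID blocks the match
cleaned, _ = preprocess(read, quals, adapter, mid_length=len(mid))  # MID trimmed first
```

In this example the adapter is left in place when the MID is still attached, and is removed once the last four bases are trimmed before stripping.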
In the Misc section, the user can choose to:
- Write unpaired/unmapped files into separate files (for later analysis).
- Exclude unmapped reads in BAM files.
- Replace existing BAM files.
- Keep trimmed base. If this option is selected, the whole read is kept in the BAM file after trimming (quality trimming or other customized trimming), and the trimmed portion of the read is treated as soft clipping. Note that this rule does not apply to the adapter-stripped portion.
- Optimize the BAM file for storage. If this option is selected, Array Studio will generate a compressed BAM file to save 40-60% of storage space. More details can be found here: CompressBam
- Zip format - Select which format is used in compressing the files (default is "None").
- Finally, the user can choose to name the output file.
Preview
In the "Preview" tab, the user can select the option to "Preview the reads (sampling + align + QC)" and set the "Sampling percentage" to be used.
Results
The resulting files can include an alignment summary table and the NGS dataset (for downstream analysis such as creation of exon junction data, fusion data, etc.). If paired data is used, the expression dataset will contain the combination of both sets of pairs.
The various calculations for the Alignment report are as follows:
- Observation 1 - Name of first half of pair
- Observation 2 - Name of second half of pair
- Total read # - Total # of reads
- Uniquely paired read # - Reads that are both uniquely mapped and paired
- Non-uniquely paired read # - Reads that are not uniquely mapped but are paired
- Uniquely mapped read #1 - Reads that are uniquely mapped, but not paired, from file #1
- Uniquely mapped read #2 - Reads that are uniquely mapped, but not paired, from file #2
- Non-uniquely mapped read #1 - Reads that are non-uniquely mapped, but not paired, from file #1
- Non-uniquely mapped read #2 - Reads that are non-uniquely mapped, but not paired, from file #2
- Unmapped read #1 - Reads that are unmapped from file #1
- Unmapped read #2 - Reads that are unmapped from file #2
- Uniquely paired read % - Percentage of reads that are both uniquely mapped and paired
- Non-uniquely paired read % - Percentage of reads that are not uniquely mapped but are paired
- Uniquely mapped read 1 % - Percentage of reads that are uniquely mapped, but not paired, from file #1
- Uniquely mapped read 2 % - Percentage of reads that are uniquely mapped, but not paired, from file #2
- Non-uniquely mapped read 1 % - Percentage of reads that are non-uniquely mapped, but not paired, from file #1
- Non-uniquely mapped read 2 % - Percentage of reads that are non-uniquely mapped, but not paired, from file #2
- Unmapped read 1 % - Percentage of reads that are unmapped from file #1
- Unmapped read 2 % - Percentage of reads that are unmapped from file #2
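The relationship between the count columns and the percentage columns can be sketched in Python. The source does not state which denominator the report uses, so this sketch assumes every percentage is taken relative to "Total read #". The function name `add_percentages` and the example counts are hypothetical and do not come from a real alignment report.

```python
# Count column -> percentage column, using the names listed above.
PERCENT_COLUMNS = {
    "Uniquely paired read #": "Uniquely paired read %",
    "Non-uniquely paired read #": "Non-uniquely paired read %",
    "Uniquely mapped read #1": "Uniquely mapped read 1 %",
    "Uniquely mapped read #2": "Uniquely mapped read 2 %",
    "Non-uniquely mapped read #1": "Non-uniquely mapped read 1 %",
    "Non-uniquely mapped read #2": "Non-uniquely mapped read 2 %",
    "Unmapped read #1": "Unmapped read 1 %",
    "Unmapped read #2": "Unmapped read 2 %",
}

def add_percentages(counts):
    """Return a copy of one report row with the percentage columns filled in.

    Assumption: the denominator is "Total read #" for every percentage.
    """
    total = counts["Total read #"]
    if total <= 0:
        raise ValueError("Total read # must be positive")
    report = dict(counts)
    for count_name, pct_name in PERCENT_COLUMNS.items():
        report[pct_name] = 100.0 * counts[count_name] / total
    return report

# Made-up counts for illustration only.
example = {
    "Total read #": 200,
    "Uniquely paired read #": 150,
    "Non-uniquely paired read #": 20,
    "Uniquely mapped read #1": 10,
    "Uniquely mapped read #2": 5,
    "Non-uniquely mapped read #1": 5,
    "Non-uniquely mapped read #2": 2,
    "Unmapped read #1": 4,
    "Unmapped read #2": 4,
}
row = add_percentages(example)
```

If the categories are mutually exclusive and the assumed denominator is right, the percentage columns of a row should sum to 100.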