Filter NGS Files
This module preprocesses NGS files before they are imported. Quality scores can be used to trim the reads before filtering. The order of operations for trimming, stripping, and filtering is as follows:
- Trim reads on length/quality (Advanced tab)
- Strip adapters from reads (Advanced tab)
- The order of these two steps can be toggled by "Trim reads first."
- Filter reads on length and quality (General tab)
- Filter reads by source (Advanced tab)
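The ordering above can be written down as a small sketch. The function name and step labels are hypothetical, and it assumes that unchecking "Trim reads first" simply swaps the first two steps.

```python
from typing import List


def preprocessing_steps(trim_reads_first: bool = True) -> List[str]:
    """Return the preprocessing step names in execution order."""
    trim_and_strip = ["trim reads", "strip adapters"]
    if not trim_reads_first:
        # Assumption: with the option unchecked, adapters are stripped before trimming.
        trim_and_strip.reverse()
    # The two filtering steps always run after trimming and stripping.
    return trim_and_strip + ["filter on length and quality", "filter by source"]


if __name__ == "__main__":
    print(preprocessing_steps(trim_reads_first=True))
    print(preprocessing_steps(trim_reads_first=False))
```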
To open this module, please go to Analysis | NGS | Preprocess | Filter.
Input Data Requirements
Accepted file formats include FASTQ, QSEQ, SFF and AUTO (AUTO allows any combination of the listed file types).
Add files to menu
- Add: adds the selected files.
- Add Folder: adds all samples in the selected folder (local projects only).
- Search: finds files based on sample registration (server projects only).
- Add List: adds files from a list (a grouping file for alignment functions can also be added this way).
- Quality encoding: Illumina quality scores, Sanger quality scores, or Automatic (the encoding is detected automatically).
- Since 2011, Illumina's CASAVA pipeline (v1.8+) has used Sanger quality encoding, not Illumina.
- Job number: The number of parallel jobs to run.
- Filter out reads if trim length < xx: Will filter out any reads from the file if the length after quality trimming is less than the specified amount.
- Filter out reads if maximal quality is <xx: Will filter out any reads where the maximum quality score is below the specified value for that read (i.e. if all the quality scores are relatively low, then filter this read).
- Filter out reads if average quality is <xx: Will filter out any reads where the average quality score is below the specified value for that read (i.e. if the average quality score for a read is relatively low, then filter this read).
- Filter out reads if poly AGCT rate is >=xx%: Will filter out any reads where the poly AGCT rate is greater than or equal to the specified percentage for that particular read. If a read has too many A's, G's, C's, or T's, it is most likely an artifact and the user may want to filter it out. Such reads usually cannot be uniquely aligned.
- Input files are paired: Specify whether this is a paired experiment (see here for details on paired-end naming conventions).
- Zip format: Select which format is used in compressing the files.
- Paired end filtering:
- Filter out the pair if both reads fail the filtering criteria: both reads are kept if either read passes all filters.
- Filter out the pair if either read fails the filtering criteria: both reads are filtered if either read fails a filter.
- Generate flag files only (.ff or .ff2): With this option checked, only flag files containing the filtering information are stored in the output folder (.ff for single end, .ff2 for paired end); no new NGS files are generated. If this option is unchecked, new NGS files containing only the reads that pass the filtering criteria are stored in the output folder.
If this option is checked, modules such as the Map RNA-Seq module require that the BAM output folder be the same as the folder containing the .ff or .ff2 files (or that the flag files be moved there). Array Studio reads the filter information for the source read files from the .ff or .ff2 files in the destination folder.
- Output folder: The folder where the newly filtered files are written.
- For the read trimming section, the user can choose to trim the reads using a quality score of a specified amount or below. If that base pair has a quality score below the specified amount, the read is trimmed at that point, although the algorithm will only trim, at most, down to a read size of 17 base pairs.
- Advanced trimming allows the user to trim by various options:
- Trim first # nucleotides: Will remove the specified number of nucleotides from the beginning of the sequence.
- Trim last # nucleotides: Will remove the specified number of nucleotides from the end of the sequence.
- Trim by quality: See above (default is 2).
- Trim by final length: Will remove nucleotides from the end of the sequence to achieve the specified final length.
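The quality trimming and read filtering options in the list above can be illustrated with a minimal Python sketch. It is not Array Studio's implementation: all function names and default thresholds are hypothetical, the trim is assumed to cut at the first base whose quality is at or below the threshold, and the poly AGCT rate is assumed to be the share of the read taken up by its most frequent base.

```python
from collections import Counter
from typing import List, Tuple

MIN_TRIMMED_LENGTH = 17  # quality trimming never shortens a read below this


def decode_qualities(qual: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to scores (33 = Sanger, 64 = older Illumina)."""
    return [ord(ch) - offset for ch in qual]


def trim_by_quality(seq: str, quals: List[int], threshold: int = 2) -> Tuple[str, List[int]]:
    """Cut at the first base with quality <= threshold, keeping at least 17 bases."""
    cut = len(seq)
    for i, q in enumerate(quals):
        if q <= threshold:
            cut = i
            break
    # Never trim below the floor, and never extend past the read itself.
    cut = min(len(seq), max(cut, MIN_TRIMMED_LENGTH))
    return seq[:cut], quals[:cut]


def poly_rate(seq: str) -> float:
    """Assumed definition: fraction of the read made up of its most common base."""
    if not seq:
        return 0.0
    return max(Counter(seq.upper()).values()) / len(seq)


def passes_filters(seq: str, quals: List[int], min_length: int = 20,
                   min_max_quality: int = 15, min_avg_quality: float = 10.0,
                   max_poly_rate: float = 0.8) -> bool:
    """Apply the four read-level filters; the default thresholds are placeholders."""
    if not seq or len(seq) < min_length:
        return False
    if max(quals) < min_max_quality:
        return False
    if sum(quals) / len(quals) < min_avg_quality:
        return False
    if poly_rate(seq) >= max_poly_rate:
        return False
    return True


def keep_pair(pass1: bool, pass2: bool, drop_if_either_fails: bool) -> bool:
    """Paired-end rule: drop on either failure, or only when both mates fail."""
    if drop_if_either_fails:
        return pass1 and pass2
    return pass1 or pass2
```

A read would typically be decoded, trimmed, and then tested with `passes_filters`; for paired files the two results are combined with `keep_pair` according to the selected paired-end filtering mode.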
The Adapter Stripping window appears after selecting the "Customize" button.
- No Adapter Stripping: No attempt will be made to remove adapter sequences from reads.
- Strip 3' end adapters (end only): The 3' ends of reads will be compared to the adapter sequence for a match.
- Strip right adapter (middle or end): The read is searched for a match to the adapter sequence. If a match is found, the adapter sequence is trimmed, along with any sequence 3' of the adapter.
- Strip multiple adapters: Multiple adapter sequences can be listed.
- (Deprecated) Exclude unmatched reads - The user can choose to exclude unmatched reads (without adapters) from further analysis and mapping.
- Trim reads first - The user should understand the order of operations that take place when doing the 3' end adapter stripping during an alignment. See AdapterStripping 3'End and AdapterStripping Right for more details.
- Note: starting from V8.0, when AdapterStripping is on, FilterNgsFiles generates *.filtered.fastq.gz in the output folder regardless of whether WriteFilterFiles is set to True or False. The output is a *.filtered.fastq.gz file, not the old *.filtered+stripped.fastq.gz file. Stripping is treated as part of trimming and runs at the same time as filtering: each read is trimmed, stripped, and then aligned to the filter source; if it does not align, the remaining trimmed and stripped part of the read is written to the *.filtered.fastq.gz file.
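The difference between the two stripping modes above can be sketched as follows. This is a simplified illustration using exact string matching; the module itself may tolerate mismatches, and the function names and the `min_overlap` parameter are hypothetical.

```python
def strip_3prime_end(read: str, adapter: str, min_overlap: int = 3) -> str:
    """End-only mode: remove the longest read suffix that equals an adapter prefix."""
    longest = min(len(read), len(adapter))
    for k in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    return read  # no adapter found at the 3' end


def strip_right(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Middle-or-end mode: cut at a full adapter match anywhere in the read,
    dropping the adapter and everything 3' of it; otherwise try the end-only rule."""
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    return strip_3prime_end(read, adapter, min_overlap)


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"
    internal = "ACGTACGT" + adapter + "TTTT"
    # The end-only mode leaves an internal adapter in place; the right mode removes it.
    print(strip_3prime_end(internal, adapter))
    print(strip_right(internal, adapter))
```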
Filter By Source
Specifies a list of filter sources. This is used in RNA-Seq pipelines to remove rRNA and other contaminants. Reads are first aligned to adapters (or other sequences in the selected sources), allowing mismatches. If a read aligns, it is filtered. If it does not align, the software tries to align the first 25 bp of the read to adapters. If the first 25 bp align with a perfect match, the read is filtered.
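The two-stage decision described above can be sketched in Python. An ungapped mismatch scan stands in for the real aligner here, which is an assumption, as are the function names and the `max_mismatches` default.

```python
from typing import Dict, Optional

SEED_LENGTH = 25  # the "first 25 bp" rule


def best_mismatch_count(query: str, target: str) -> Optional[int]:
    """Smallest mismatch count of query against any same-length window of target."""
    if not query or len(query) > len(target):
        return None
    return min(
        sum(a != b for a, b in zip(query, target[i:i + len(query)]))
        for i in range(len(target) - len(query) + 1)
    )


def matching_source(read: str, sources: Dict[str, str],
                    max_mismatches: int = 2) -> Optional[str]:
    """Return the name of the first source that would filter the read, else None."""
    # Stage 1: the whole read, with mismatches allowed.
    for name, seq in sources.items():
        mismatches = best_mismatch_count(read, seq)
        if mismatches is not None and mismatches <= max_mismatches:
            return name
    # Stage 2: the first 25 bases, perfect match only.
    seed = read[:SEED_LENGTH]
    if len(seed) == SEED_LENGTH:
        for name, seq in sources.items():
            if seed in seq:
                return name
    return None
```

A read is filtered when `matching_source` returns a source name. Counting the returned names per source gives the kind of per-source tally that a filtering report could contain.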
Currently supported sources include:
- Custom fasta file - Specifies a custom file for filtering by source. This can be a .fasta file with the naming convention of:
>Category$SequenceName (e.g. >rRNA$5SRNA indicates a sequence that belongs to the rRNA category and is the sequence for 5S rRNA)
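As a quick way to check that a custom file follows this naming convention, the headers can be parsed with a few lines of Python. The helper names below are hypothetical and are not part of Array Studio.

```python
from typing import Dict, Iterable, Optional, Tuple


def parse_source_header(header: str) -> Tuple[str, str]:
    """Split '>Category$SequenceName' into (category, sequence_name)."""
    text = header.lstrip(">").strip()
    category, sep, name = text.partition("$")
    if not sep or not category or not name:
        raise ValueError(f"expected '>Category$SequenceName', got {header!r}")
    return category, name


def read_filter_fasta(lines: Iterable[str]) -> Dict[Tuple[str, str], str]:
    """Collect sequences keyed by (category, sequence_name)."""
    records: Dict[Tuple[str, str], str] = {}
    key: Optional[Tuple[str, str]] = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            key = parse_source_header(line)
            records[key] = ""
        elif key is not None:
            # Sequence lines are concatenated under the most recent header.
            records[key] += line
    return records


if __name__ == "__main__":
    example = [">rRNA$5SRNA", "ACGT", "TTGA"]
    print(read_filter_fasta(example))
```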
Note: Filter source determines whether the whole read matches "IlluminaAdapters", "Ercc", "Human.rRNA", or "Human.tRNA". It does not perform adapter stripping.
Note: The user should understand the order of operations that take place when doing the 3' end adapter stripping during an alignment:
- Quality trimming/other trimming options
- Strip adapters
If a read contains a barcode sequence (Multiplex Identifier, MID) at its end, this sequence may interfere with adapter stripping. The 3' adapter stripping performs a localized alignment at the right end of the read and cannot find internal adapters, so no adapter is stripped. You may remove the MID sequence using the MID Extraction + Adapter Stripping module, or trim the number of bases in the MID sequence from the end of the read using the Advanced Trimming options (i.e. "Trim last # nucleotides").
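A small sketch of why a trailing MID blocks end-only stripping, and how trimming the MID length first restores the match. The sequences, function names, and the exact-match check are illustrative assumptions, not the module's alignment logic.

```python
def trim_last(read: str, n: int) -> str:
    """Remove n nucleotides from the end of the read."""
    return read[:len(read) - n] if n > 0 else read


def ends_with_adapter_prefix(read: str, adapter: str, min_overlap: int = 5) -> bool:
    """True if the read's 3' end exactly matches a prefix of the adapter."""
    longest = min(len(read), len(adapter))
    return any(read.endswith(adapter[:k]) for k in range(min_overlap, longest + 1))


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"   # example adapter sequence
    mid = "TCAGTC"              # made-up 6-base MID
    read = "ACGTACGTTTGCA" + adapter[:10] + mid

    # With the MID still attached, the adapter is internal and is not seen at the end.
    print(ends_with_adapter_prefix(read, adapter))
    # After trimming len(mid) bases, the adapter sits at the 3' end again.
    print(ends_with_adapter_prefix(trim_last(read, len(mid)), adapter))
```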
If the user de-selected "generate .ff2 files", then a new set of .fastq files will be generated in the specified folder, as "*.filtered.fastq.gz".
If the user selected "generate .ff2 files", the user will find a set of .ff2 files that contain trimming information for the OSA aligner. As noted above, the .ff2 files should be saved in the eventual output folder for BAM files from the alignment step.
A summary report is also generated, containing columns for the "File Name", "Total Read #" and "Passed Read #".
If Filter By Source is enabled, a MicroArray data "SourceFilteringCounts" will be generated to report the number of reads filtered because they match a source (e.g. reads that match rRNA sequences).
Read more about Understand Filter Ngs Files reports.
Very large files (150 million+ reads) may fail in this analysis, unless you Increase Cloud analysis memory map.