Filter for significant genes from multiple tests
Case Study: Filter for significant genes from multiple tests
Overview
Large-scale genomic experiments often include multiple timepoints, treatments, genotypes, and tissues. Array Studio has multiple modules, including ANOVA and General Linear Model, that can account for experimental design in identifying significantly up- or down-regulated genes in one or more tests. The output Inference Report contains fold-change and p-value for each test, and can be filtered by one or more columns to find significant genes from the specified test.
But what if you want to find the set of genes that were significantly different in any of the statistical tests, instead of one particular test? For example, what if you wanted to find a set of candidate gene expression markers for different cell types? Depending on how complex you want the filter logic to be, Array Studio has multiple methods to find interesting genes.
Example: Blueprint Gene Expression
The example dataset is RNA-seq count data from Blueprint. DESeq was used to look for differentially expressed genes across the Tissue column, testing all pairwise comparisons between Cord Blood, Bone Marrow, Thymus, Tonsil, and Venous Blood (ten tests in total).
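As a quick check on the number of tests, the pairwise comparisons among the five tissues can be enumerated. This is only an illustrative sketch: the "A vs B" naming and the direction of each comparison are assumptions, and the Inference Report may label or orient the tests differently.

```python
from itertools import combinations

# Tissue groups named in the example dataset.
tissues = ["Cord Blood", "Bone Marrow", "Thymus", "Tonsil", "Venous Blood"]

# Each unordered pair of tissues corresponds to one differential-expression test.
# The "A vs B" label format is an assumption made for illustration.
comparisons = [f"{a} vs {b}" for a, b in combinations(tissues, 2)]

print(len(comparisons))
for name in comparisons:
    print(name)
```

Five groups give 5 x 4 / 2 = 10 comparisons, which matches the ten tests filtered later in this case study.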
Using Lists as Filters
Although -Omic data, Table data, and Inference Reports can be permanently subset by observation and variable, it is usually more convenient to "hide" data you aren't interested in, by filtering.
One convenient way to filter is with a List object, which is simply a list of items (such as sample IDs, gene/probeset names, etc.).
Lists can be applied to filter columns (e.g. the GeneID column) to only show GeneIDs in your list.
The examples below demonstrate ways to generate lists of the genes passing one or more criteria, so that these lists can be used to look at these significant genes in various ways.
Filtering by matching any columns
By default, the Filter window in the View Controller uses AND logic: if filters are set for P-value < 0.05 in both Cord Blood vs Bone Marrow and Thymus vs Bone Marrow, only variables that have P-value < 0.05 in BOTH tests will be displayed:
However, this AND logic can be switched to OR logic by clicking on the Options button in the Row filter tab, and selecting Match Any Column:
Now, the Volcano plot shows all genes that are significantly different in Cord Blood vs Bone Marrow or Thymus vs Bone Marrow.
A list of these genes can be generated by right-clicking List in the Solution Explorer, then selecting Add List from Visible Rows:
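The difference between the default AND logic and Match Any Column can be sketched outside Array Studio. The gene names and p-values below are made up for illustration, and `filter_genes` is a hypothetical helper, not an Array Studio function.

```python
# Hypothetical raw p-values per gene for two of the tests (made-up numbers).
pvalues = {
    "GeneA": {"Cord Blood vs Bone Marrow": 0.001, "Thymus vs Bone Marrow": 0.20},
    "GeneB": {"Cord Blood vs Bone Marrow": 0.03, "Thymus vs Bone Marrow": 0.01},
    "GeneC": {"Cord Blood vs Bone Marrow": 0.40, "Thymus vs Bone Marrow": 0.60},
    "GeneD": {"Cord Blood vs Bone Marrow": 0.30, "Thymus vs Bone Marrow": 0.004},
}


def filter_genes(table, cutoff=0.05, match_any=False):
    """Return genes whose p-values pass the cutoff in all tests (AND)
    or in at least one test (OR, analogous to Match Any Column)."""
    combine = any if match_any else all
    return [
        gene
        for gene, tests in table.items()
        if combine(p < cutoff for p in tests.values())
    ]


and_list = filter_genes(pvalues)                  # default AND logic
or_list = filter_genes(pvalues, match_any=True)   # Match Any Column

print("AND:", and_list)
print("OR: ", or_list)
```

With these made-up values, only GeneB passes under AND logic, while GeneA, GeneB, and GeneD pass under OR logic.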
Combining Lists to build complex logical filters
If you are interested in more complex logic, such as (P-value < 0.05 AND Fold-change > 2) in Cord Blood vs Bone Marrow OR in Thymus vs Bone Marrow, simply matching ANY filter column will not work, because a gene that passes only one of the two criteria would still be displayed.
Instead, you can build a list of genes that pass both the fold-change and P-value criteria in the first test:
Then, clear the filters, and build a list of genes that pass both the fold-change and P-value criteria in the second test:
Finally, combine the two lists, by right-clicking List and selecting Manage Lists:
Select the lists to combine, select Make List from Union, and click Perform operation, to create a new list with genes from either list:
The "Union" list contains all genes that passed the criteria P-value <.05 AND Fold-change > 2 in either test. In this way, multiple tests can be filtered for specific criteria, and genes that pass these tests can be grouped for further study.
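The logic of this workflow can be sketched as a set union. The report values and the `passing` and `passes_any_column` helpers below are hypothetical and only illustrate why one list per test, followed by a union, differs from matching any single column.

```python
# Hypothetical report: gene -> test -> (fold_change, p_value). Made-up numbers.
T1 = "Cord Blood vs Bone Marrow"
T2 = "Thymus vs Bone Marrow"
report = {
    "GeneA": {T1: (3.1, 0.001), T2: (1.2, 0.40)},
    "GeneB": {T1: (1.1, 0.30), T2: (4.0, 0.002)},
    "GeneC": {T1: (2.5, 0.60), T2: (1.0, 0.80)},   # high fold-change, not significant
    "GeneD": {T1: (1.3, 0.01), T2: (1.4, 0.70)},   # significant, low fold-change
    "GeneE": {T1: (1.0, 0.90), T2: (1.1, 0.90)},
}


def passing(report, test, min_fc=2.0, max_p=0.05):
    """Genes that pass BOTH criteria within a single test."""
    return {
        gene
        for gene, tests in report.items()
        if tests[test][0] > min_fc and tests[test][1] < max_p
    }


def passes_any_column(tests, min_fc=2.0, max_p=0.05):
    """Naive 'match any column': any single criterion in any test is enough."""
    return any(fc > min_fc or p < max_p for fc, p in tests.values())


list1 = passing(report, T1)   # first saved list
list2 = passing(report, T2)   # second saved list
union = list1 | list2         # analogous to Make List from Union

naive_any = {gene for gene, tests in report.items() if passes_any_column(tests)}

print(sorted(union))
print(sorted(naive_any))
```

In this sketch the union contains only GeneA and GeneB, whereas the naive any-column filter also lets GeneC and GeneD through, even though neither passes both criteria in the same test.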
Summarize Inference Reports to generate significant lists
Another way to generate a set of lists for complex queries is with the Summarize Inference Report module.
For example, if you wanted to find the list of genes passing the criteria P-value <.05 AND Fold-change > 2 for any of the ten tests, you could either set up the filter logic for each test, and save a separate list each time, or simply click MicroArray | Inference | Summarize Inference Report:
Select the Estimates (tests) of interest (e.g. All of them), specify the query (e.g. Fold change >2, Raw PValue <.05), and click Add, then Submit:
Clicking on any row of the resulting table will display the passing genes in the Details Window, and selecting all rows of the table will display the union of those individual results:
As before, create a list from the selected rows to obtain every gene that was significantly up-regulated in at least one of the tests.
Test Case
To rapidly generate a list of significantly up- and down-regulated genes at a cutoff of log2 fold-change (Estimate) < -5 or > 5 AND P-value < 0.0001 (highlighted below),
Summarize Inference Report can be set up like this:
This generates a union list of ~2000 genes that are significantly up or down in at least one test.
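A rough sketch of what such a summary computes is shown below. The `summarize` function, the test names chosen, and all values are hypothetical; this is not the Array Studio implementation, only the same cutoff logic applied to a small made-up table.

```python
# Hypothetical report: gene -> test -> (log2 fold-change Estimate, raw p-value).
T1 = "Tonsil vs Thymus"
T2 = "Cord Blood vs Venous Blood"
report = {
    "GeneA": {T1: (6.2, 1e-6), T2: (0.3, 0.5)},
    "GeneB": {T1: (-5.8, 1e-5), T2: (-7.1, 1e-8)},
    "GeneC": {T1: (5.5, 0.01), T2: (4.9, 1e-7)},   # fails one criterion in each test
    "GeneD": {T1: (0.1, 0.9), T2: (6.0, 5e-5)},
}


def summarize(report, min_abs_estimate=5.0, max_p=0.0001):
    """Group passing genes by (test, direction), one group per summary row."""
    summary = {}
    for gene, tests in report.items():
        for test, (estimate, p) in tests.items():
            if p >= max_p:
                continue  # not significant in this test
            if estimate > min_abs_estimate:
                summary.setdefault((test, "up"), set()).add(gene)
            elif estimate < -min_abs_estimate:
                summary.setdefault((test, "down"), set()).add(gene)
    return summary


summary = summarize(report)

# Selecting all rows of the summary corresponds to taking the union.
union = set().union(*summary.values())

for (test, direction), genes in sorted(summary.items()):
    print(test, direction, len(genes))
print(sorted(union))
```

Each key of `summary` plays the role of one row in the summary table, and the union over all rows is the list that would be saved for downstream filtering.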
This list can be used, for example, to filter the input genes for Hierarchical Clustering:
The clustering then shows the significant gene expression patterns within each tissue:
Tonsil, bone marrow, and thymus all cluster, while some cord blood samples have gene signatures more closely matching venous blood.