The Gene Shaving module performs Gene Shaving on an -Omic data object and uses "gap statistics" to find the best cluster size. It attempts to identify groups of genes that have coherent expression and are optimal for various properties of the variation in their expression.
The basic idea of Gene Shaving is that:
- For the -Omic data, compute the first principal component as a super-gene, i.e., the value of the super-gene in each observation is a linear combination of the values of all genes in that observation. The super-gene captures the largest share of the overall variation, which means the samples can be readily separated into different groups by this super-gene.
- Calculate the correlation between each gene and the super-gene.
- If a gene is highly correlated with the super-gene, that gene can also be used to group samples. Genes with high correlation tend to have a coherent effect, and we want to cluster them together.
- If a gene has a low correlation with the super-gene, it contributes little to the variation captured by the super-gene, and the module "shaves" it off (usually the module shaves off the 5% to 10% of genes with the lowest correlation).
- Generate a new -Omic data object with the remaining genes.
- Repeat the steps above on the new data object until only one gene is left.
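The iteration above can be sketched in Python with NumPy. This is a minimal illustration, not the module's actual implementation. The function name `shave_sequence`, the genes-by-samples layout, the use of absolute correlation (so that negatively correlated genes also count as coherent), and the rule of dropping at least one gene per pass are all assumptions made for the example.

```python
import numpy as np

def shave_sequence(X, shave_fraction=0.05):
    """Return nested gene-index arrays, largest first, ending with one gene.

    X is a genes-by-samples array (assumed layout). shave_fraction is the
    share of the remaining genes dropped in each pass (assumed meaning).
    """
    X = np.asarray(X, dtype=float)
    # Center each gene (row) across samples so the SVD yields principal components.
    Xc = X - X.mean(axis=1, keepdims=True)
    keep = np.arange(X.shape[0])
    sequence = [keep.copy()]
    while keep.size > 1:
        sub = Xc[keep]
        # First right singular vector: the super-gene profile across samples.
        super_gene = np.linalg.svd(sub, full_matrices=False)[2][0]
        # Absolute correlation of each remaining gene with the super-gene
        # (rows are centered and super_gene has unit length).
        norms = np.linalg.norm(sub, axis=1)
        corr = np.abs(sub @ super_gene) / np.where(norms > 0, norms, 1.0)
        # Drop at least one gene per pass, but never all of them.
        n_drop = min(max(1, int(shave_fraction * keep.size)), keep.size - 1)
        order = np.argsort(-corr, kind="stable")  # most correlated first
        keep = np.sort(keep[order[: keep.size - n_drop]])
        sequence.append(keep.copy())
    return sequence

# Hypothetical toy data: 5 coherent genes hidden among 20.
rng = np.random.default_rng(0)
signal = rng.normal(size=10)
X_demo = rng.normal(size=(20, 10))
X_demo[:5] = 5.0 * signal + 0.1 * rng.normal(size=(5, 10))
print([genes.size for genes in shave_sequence(X_demo, shave_fraction=0.1)])
```

Each entry of the returned list is a candidate cluster, and every candidate is a subset of the one before it.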
The shaving procedure generates a nested sequence of -Omic data sets, each containing fewer genes than the one before it. We use "gap statistics" to determine which data set in the sequence should be used as the optimized cluster.
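One way to picture the selection step is the sketch below. It assumes a common formulation in which each candidate cluster is scored by the percent of its variance explained by its mean profile, and the gap is that score minus its average over data sets in which each gene's values have been shuffled across samples. Whether the module uses exactly this measure is not stated here. All function names are hypothetical, and a compact shaving loop is bundled in so the example runs on its own.

```python
import numpy as np

def nested_clusters(X, shave_fraction=0.1):
    """Compact shaving loop: nested gene-index arrays, largest first."""
    Xc = X - X.mean(axis=1, keepdims=True)
    keep, out = np.arange(X.shape[0]), []
    while True:
        out.append(keep)
        if keep.size == 1:
            return out
        sub = Xc[keep]
        pc1 = np.linalg.svd(sub, full_matrices=False)[2][0]
        score = np.abs(sub @ pc1) / np.maximum(np.linalg.norm(sub, axis=1), 1e-12)
        n_keep = max(1, keep.size - max(1, int(shave_fraction * keep.size)))
        keep = np.sort(keep[np.argsort(-score, kind="stable")[:n_keep]])

def percent_between_variance(X, genes):
    """100 * variance of the cluster's mean profile / total cluster variance."""
    block = X[genes] - X[genes].mean(axis=1, keepdims=True)
    # Flip genes that run opposite to the cluster's main trend (assumption).
    pc1 = np.linalg.svd(block, full_matrices=False)[2][0]
    block = block * np.where(block @ pc1 < 0, -1.0, 1.0)[:, None]
    total = np.var(block)
    return 100.0 * np.var(block.mean(axis=0)) / total if total > 0 else 0.0

def gap_curve(X, n_permutations=20, shave_fraction=0.1, seed=0):
    """Return (sizes, observed score, gap, best cluster) for the sequence."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    clusters = nested_clusters(X, shave_fraction)
    observed = np.array([percent_between_variance(X, c) for c in clusters])
    null = np.zeros((n_permutations, len(clusters)))
    for b in range(n_permutations):
        # Shuffle each gene's values across samples independently.
        Xp = rng.permuted(X, axis=1)
        null[b] = [percent_between_variance(Xp, c)
                   for c in nested_clusters(Xp, shave_fraction)]
    gap = observed - null.mean(axis=0)
    sizes = np.array([c.size for c in clusters])
    return sizes, observed, gap, clusters[int(np.argmax(gap))]
```

Calling `gap_curve(X)` on a genes-by-samples array returns the candidate sizes alongside the gap at each size, and the last return value is the candidate with the largest gap. Averaging over more permutations smooths the reference curve at the cost of rerunning the shaving loop once per permutation.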
If the user wants to generate more than one cluster of genes, this module orthogonalizes each row of the current -Omic data with respect to the optimized cluster, and uses the orthogonalized -Omic data as the starting data for the first step of the next round.
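The orthogonalization can be illustrated as follows. This sketch assumes that "with respect to the optimized cluster" means with respect to the cluster's mean expression profile across samples; the function name `orthogonalize_rows` and the genes-by-samples layout are hypothetical.

```python
import numpy as np

def orthogonalize_rows(X, cluster_genes):
    """Remove from every gene the part explained by the cluster's mean profile."""
    X = np.asarray(X, dtype=float)
    # Center each gene (row) across samples.
    Xc = X - X.mean(axis=1, keepdims=True)
    # Mean profile of the chosen cluster (assumed reference direction).
    profile = Xc[cluster_genes].mean(axis=0)
    denom = profile @ profile
    if denom == 0:
        return Xc  # nothing to project out
    # Projection coefficient of each gene onto the profile.
    coeff = (Xc @ profile) / denom
    return Xc - np.outer(coeff, profile)
```

After this step every row has zero inner product with the cluster's mean profile, so the next round of shaving looks for a pattern that the first cluster does not already describe.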
To run this module, go to MicroArray | Pattern | Gene Shaving.
Input Data Requirements
This module works on -Omic data types.
- Project & Data: The window includes a dropdown box to select the Project and Data object to be analyzed.
- Variables: Selections can be made on which variables should be included in the analysis (options include All variables, Selected variables, Visible variables, and Customized variables (select any pre-generated Lists)).
- Observations: Selections can be made on which observations should be included in the analysis (options include All observations, Selected observations, Visible observations, and Customized observations (select any pre-generated Lists)).
- Output name: The user can choose to name the output data object.
- Component number: Specify the number of clusters to generate.
- Shave percent: Define the percentage of genes shaved off in each iteration (default is 5).
- Permutation number: This module uses permutations to determine the cluster size (that is, to calculate the gap statistic). Users can define the number of permutations here.
- A larger permutation number takes longer to run but generates more accurate estimates.
- Output view: The user has the option to automatically output a HeatmapView, ProfileView or None.
- Report gap statistics: Checking the checkbox will create a .GapStats Table in the Table tab of the Solution Explorer to show how the gap statistic changes as the cluster size changes.
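To get a rough feel for how the Shave percent and Permutation number options affect running time, the sketch below counts shaving passes. It assumes that the number of genes removed per pass is rounded down with a minimum of one, and that every permutation repeats the full shaving sequence; neither detail is confirmed for this module, and both function names are hypothetical.

```python
def shaving_sizes(n_genes, shave_percent=5):
    """Cluster sizes visited when shave_percent of the genes is dropped per pass."""
    sizes = [n_genes]
    while sizes[-1] > 1:
        n_drop = max(1, int(sizes[-1] * shave_percent / 100))
        n_drop = min(n_drop, sizes[-1] - 1)  # always keep at least one gene
        sizes.append(sizes[-1] - n_drop)
    return sizes

def total_passes(n_genes, shave_percent=5, n_permutations=20):
    """Shaving passes for the real data plus one full sequence per permutation."""
    iterations = len(shaving_sizes(n_genes, shave_percent)) - 1
    return iterations * (1 + n_permutations)

print(len(shaving_sizes(1000, 5)) - 1, "passes at 5 percent")
print(len(shaving_sizes(1000, 10)) - 1, "passes at 10 percent")
```

Under these assumptions, a larger shave percent shortens the sequence, while the total work grows in proportion to the permutation number.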
The command will create a Cluster object in the Cluster section of the user's project in the Solution Explorer. If Report gap statistics was checked, the user will have the option to choose columns for the special GapStats LineView generated in the new .GapStats Table, as shown below:
The cluster size with the largest gap is used as the optimized cluster, as displayed in the Profile and Heatmap views.
Profile view: in this case, the first cluster has 4 genes, and the profile shows the expression of the 4 genes over all samples. Select the lines to see the genes in the cluster.
Heatmap view: in this case, the first cluster has 4 genes, and the heatmap shows the expression of the 4 genes over all samples.