The Regression module performs regression on a dataset and uses external cross validation to find the best model. The user can choose from Lasso, Ridge Regression, PLS regression, Support Vector Machine, and Neural Network, and can generate one or more models from each model type chosen. The resulting model can then be used with a new dataset to predict the response. This command generates an .omodel file and two tables in the Prediction tab of the Solution Explorer: a Table containing Prediction information and a Cross Validation Report Table.
To run this module, select MicroArray | Predict | Regression from the menu.
Input Data Requirements
This module works on -Omic data types.
- Project & Data: The window includes a dropdown box to select the Project and Data object to be analyzed.
- Variables: Selections can be made on which variables should be included in the analysis (options include All variables, Selected variables, Visible variables, and Customized variables (select any pre-generated Lists)).
- Observations: Selections can be made on which observations should be included in the analysis (options include All observations, Selected observations, Visible observations, and Customized observations (select any pre-generated Lists)).
- Output name: The user can choose to name the output data object.
- Predict: Specify the column in the design table that contains the variable to predict.
Step 1 (required): Specify Model
Clicking Specify Model opens the Specify Regression Models window. First, the user can choose a Model type in the Options section. Available model types include Lasso, Ridge Regression, Partial Least Square, Support Vector Machine, and Neural Network.
Once the user selects a model type in the Options section, the corresponding algorithm parameters become available.
- Lasso - Lambda count: Array Studio tunes Lambda by trying different values from 0 to 1. By default, Lambda count is 101, which means Array Studio tries 101 Lambda values: 0, 0.01, ..., 1.
- Ridge - fraction count: Array Studio tunes the fraction parameter for Ridge by trying different values from 0 to 1. By default, fraction count is 101, which means Array Studio tries 101 values: 0, 0.01, ..., 1.
- PLS - K Search List: PLS is a supervised version of PCA. It applies a dimension reduction method to summarize and classify data. K here is the range of reduced dimensions in the PLS algorithm.
- SVM - Cost Search List (2^): Defines the cost search range for the Support Vector Machine algorithm. A more detailed discussion can be found here.
- SVM - Gamma Search List (2^): Defines the gamma search range for the Support Vector Machine algorithm. A more detailed discussion can be found here.
- NN - Size: Defines the number of hidden layers for the Neural Network algorithm.
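The search ranges above can be pictured as plain lists of candidate values. The following is a minimal sketch, not Array Studio's implementation: the function names are hypothetical, the exponent lists are made-up examples, and it assumes that the "(2^)" label means each entry in a search list is an exponent of 2.

```python
def unit_interval_grid(count=101):
    """Return `count` evenly spaced values from 0 to 1 inclusive."""
    if count < 2:
        raise ValueError("count must be at least 2")
    return [i / (count - 1) for i in range(count)]


def power_of_two_grid(exponents):
    """Turn a list of exponents e into the candidate values 2**e.

    Assumption: this is what the "(2^)" label on the SVM lists means.
    """
    return [2.0 ** e for e in exponents]


# Default Lasso/Ridge grid described above: 0, 0.01, ..., 1 (101 values).
lambda_grid = unit_interval_grid(101)

# Hypothetical exponent lists, for illustration only.
cost_grid = power_of_two_grid([-2, 0, 2, 4])
gamma_grid = power_of_two_grid([-6, -4, -2])

# Each (cost, gamma) pair is one SVM candidate to evaluate by cross validation.
svm_candidates = [(c, g) for c in cost_grid for g in gamma_grid]
```

Under these assumptions, a longer search list multiplies the number of candidate models, so tuning time grows with the product of the cost and gamma list lengths.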
Step 2 (optional): Specify Model File
Step 2 specifies the name and location of the model file that will be generated. This is optional: by default, Array Studio saves model files as .omodel files in the My Documents\OmicSoft\Models folder.
Step 3 (optional): Change Classification Options
- Observation normalization: Defines the method used to normalize observations: None (default), Center, or CenterScale. Use this option when the training dataset and test dataset come from very different samples (the same option must also be selected when Predicting).
- Normalized against all variables: This option is only meaningful when not all variables are used to build the model. When selected, each observation is normalized using all the variables in the dataset (instead of just the variables chosen in the Variables section).
- Select variable based on F-test: Checking this box lets Array Studio pre-screen variables based on an F-test. If unchecked, all variables are used, which increases the time and memory Array Studio needs and is not recommended. A number of variables in the hundreds is recommended, so checking this option is highly recommended.
- Selection size: Defines the number of variables to select (50 by default).
- Cross validation fold: Sets the number of cross-validation folds run on the models (setting the number of folds equal to the number of samples is equivalent to the “Leave one out” option).
- Leave one out: Select this option to see a table of the cross-validated predictions that support the bar charts of % accuracy. This helps in determining how well the classifiers predict a particular class of interest.
- Output model #: The user can decide how many top models are output. By default, this option is set to 1, and only the best model out of all models chosen to be run is output.
- Report cross validation predictions: The user can also choose to output the prediction result from cross validation.
- Random number seed: This value is used to initialize the random number generator in the module (e.g., for the cross validation). By default this is set to 0, which uses the system clock.
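The two normalization options above can be illustrated with a short sketch. The exact formulas Array Studio uses are not stated here, so this assumes that Center subtracts the observation's mean across variables and that CenterScale additionally divides by the sample standard deviation. The function name and the `reference` argument are hypothetical.

```python
from statistics import mean, stdev


def normalize_observation(values, method="None", reference=None):
    """Normalize one observation (one sample's values across variables).

    values:    values for the variables used in the model
    method:    "None", "Center", or "CenterScale"
    reference: values used to compute the mean and SD; passing the values
               of ALL variables mimics "Normalized against all variables"
    """
    if method == "None":
        return list(values)
    ref = list(values) if reference is None else list(reference)
    center = mean(ref)
    if method == "Center":
        return [v - center for v in values]
    if method == "CenterScale":
        scale = stdev(ref)  # assumption: sample SD (n - 1 denominator)
        if scale == 0:
            raise ValueError("cannot scale an observation with zero variance")
        return [(v - center) / scale for v in values]
    raise ValueError(f"unknown method: {method}")


# One observation measured on five variables, two of which were selected.
all_values = [1.0, 2.0, 3.0, 4.0, 5.0]
selected = all_values[:2]

centered_on_selected = normalize_observation(selected, "Center")
centered_on_all = normalize_observation(selected, "Center", reference=all_values)
```

The last two calls show why the "Normalized against all variables" choice matters: the same two selected values are shifted by a different mean depending on which variables define the reference.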
An .omodel file is generated and placed in the location specified in Step 2 of the process. Essentially, this .omodel file contains the regression model used, as well as the data for the selected variables for that dataset. This .omodel file can then be used to Predict Response.
In the Solution Explorer, two Tables are generated under the Prediction Tab. The first table is the .Predicted Table, and contains a column for chip number, a column for Observed, and then one column for each of the regression models exported by the command. It also contains any covariate information from the Design Table. This can be used as a first step in determining how each of the models performed via cross-validation. An example of a .Predicted table is shown below.
The second table generated by the command is a Cross Validation Report (.CVReport). This contains an RMSError mean and an RMSError standard deviation for each generated model. It automatically generates a view with RMSError.Mean on the Y axis and model type on the X axis.
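To make the RMSError mean and standard deviation concrete, here is a minimal sketch of k-fold cross validation that reports both. It is not Array Studio's code: all function names are hypothetical, a one-variable least-squares line stands in for the real model types, and the per-fold aggregation (mean and sample SD of the fold RMSE values) is an assumption. The seed handling mimics the documented convention that 0 means a non-reproducible run.

```python
import math
import random
from statistics import mean, stdev


def assign_folds(n_obs, n_folds, seed=0):
    """Shuffle observation indices and deal them into n_folds groups."""
    # Seed 0 is treated as "do not fix the seed", as described for the module.
    rng = random.Random() if seed == 0 else random.Random(seed)
    order = list(range(n_obs))
    rng.shuffle(order)
    return [order[i::n_folds] for i in range(n_folds)]


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (stand-in for a real model)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(mean((o - p) ** 2 for o, p in zip(observed, predicted)))


def cross_validate(xs, ys, n_folds, seed=0):
    """Return (mean, SD) of the per-fold RMSE."""
    if not 2 <= n_folds <= len(xs):
        raise ValueError("n_folds must be between 2 and the number of observations")
    fold_errors = []
    for test_idx in assign_folds(len(xs), n_folds, seed):
        held_out = set(test_idx)
        train = [i for i in range(len(xs)) if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        predictions = [a + b * xs[i] for i in test_idx]
        fold_errors.append(rmse([ys[i] for i in test_idx], predictions))
    return mean(fold_errors), stdev(fold_errors)


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8, 11.2, 12.9, 15.1, 17.0, 18.9]

rms_mean, rms_sd = cross_validate(xs, ys, n_folds=5, seed=42)
# With as many folds as observations, each fold holds out exactly one sample.
loo_mean, loo_sd = cross_validate(xs, ys, n_folds=len(xs), seed=42)
```

Comparing these two numbers across model types is the kind of comparison the .CVReport view is described as supporting: a lower mean suggests better held-out accuracy, and a smaller standard deviation suggests more stable performance across folds.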
Finally, a List is generated in the List tab of the Solution Explorer. It contains the variables selected by the command to be used for the models.
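The variable list comes from the F-test pre-screen in Step 3. The text does not say which F-test Array Studio applies, so the sketch below makes an assumption: it scores each variable with the F statistic of a one-variable linear regression against the response and keeps the top-scoring ones. The function names and the toy data are hypothetical.

```python
from statistics import mean


def f_statistic(xs, ys):
    """F statistic for the simple regression of ys on xs (1 and n-2 d.f.)."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable or response carries no signal
    r2 = sxy * sxy / (sxx * syy)
    if r2 >= 1.0:
        return float("inf")  # perfect linear fit
    return r2 / (1.0 - r2) * (n - 2)


def prescreen(variables, response, selection_size=50):
    """Return the names of the `selection_size` highest-scoring variables.

    variables: dict mapping variable name -> list of values, one per observation
    response:  list of response values, one per observation
    """
    ranked = sorted(
        variables,
        key=lambda name: f_statistic(variables[name], response),
        reverse=True,
    )
    return ranked[:selection_size]


# Toy data: three variables measured on five observations.
response = [1.0, 2.0, 3.0, 4.0, 5.0]
variables = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.5],  # tracks the response closely
    "gene_b": [3.0, 1.0, 4.0, 1.0, 5.0],  # weak relationship
    "gene_c": [2.0, 2.0, 2.0, 2.0, 2.0],  # constant
}
selected_variables = prescreen(variables, response, selection_size=2)
```

Because each variable is scored on its own, this kind of pre-screen is fast but ignores interactions between variables. That is a property of the sketch, not a documented property of the module.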