
From Array Suite Wiki

Classification

Overview

The Classification module performs classification on a dataset and uses external cross validation to find the best model. The user can choose from KNN, SVM, PLS-DA, LDA, NeuralNet, and NaiveBayes, and then has the option of generating one or more models from those chosen. A generated model can then be used with a new dataset to Predict the classification. The command generates a .omodel file, a Table containing Prediction information, and a Cross Validation Report Table (both tables appear in the Prediction tab of the Solution Explorer).

To run this module, select MicroArray | Predict | Classification from the menu.

Classification menu.png

Input Data Requirements

This module works on -Omic data types.

General Options

Classification0.png

Input/Output

  • Project & Data: The window includes a dropdown box to select the Project and Data object to be classified.
  • Variables: Selections can be made on which variables should be included in the classification (options include All variables, Selected variables, Visible variables, and Customized variables (select any pre-generated Lists)).
  • Observations: Selections can be made on which observations should be included in the classification (options include All observations, Selected observations, Visible observations, and Customized observations (select any pre-generated Lists)).
  • Output name: The user can choose to name the output data object.


Options

  • Classify: Specify which factor in the design table contains the variable to classify.

Step 1 (required): Specify Model

Classification1.png

Clicking Specify Model opens the Specify Classification Models window. First, the user can choose a Model type in the Options section. Available model types include K-Nearest Neighbor, Partial Least Square, Support Vector Machine, Neural Network, Linear Discriminant Analysis and Naive Bayes.

Once the user selects a model type in the Options section, the corresponding algorithm parameters become available.

  • KNN - K Search List: Define the range of numbers of nearest neighbors to search in the KNN algorithm.
  • KNN - Scale: Choose whether to scale the variables before applying the KNN algorithm.
  • PLS - K Search List: PLS is a supervised version of PCA. It applies a dimension reduction method to summarize and classify the data. K here is the range of numbers of reduced dimensions to search in the PLS algorithm.
  • SVM - Cost Search List (2^): Define the Cost search range for the Support Vector Machine algorithm. A more detailed discussion can be found here.
  • SVM - Gamma Search List (2^): Define the gamma search range for the Support Vector Machine algorithm. A more detailed discussion can be found here.
  • NN - Size: Define the number of hidden layers for the Neural Network algorithm.
  • LDA - Pool Variance: This option is only meaningful when not all variables are used for the LDA (Linear Discriminant Analysis) classification. If set to Yes, Array Studio will calculate the variance using all the variables in the dataset (instead of just the variables chosen in the Variables section).
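
As a rough illustration of how search lists expand into candidate models, the Python sketch below assumes that the "(2^)" label means each entry in the Cost and Gamma lists is an exponent of 2, and that every Cost/Gamma pair is evaluated. The example lists and the function name are hypothetical; they are not Array Studio defaults or part of its API.

```python
from itertools import product

def expand_power_of_two(exponents):
    """Convert a search list of exponents into parameter values 2**e."""
    return [2.0 ** e for e in exponents]

# Hypothetical search lists, entered as exponents of 2 (assumption).
cost_exponents = [-1, 0, 1, 2]
gamma_exponents = [-6, -4, -2]

costs = expand_power_of_two(cost_exponents)    # 0.5, 1.0, 2.0, 4.0
gammas = expand_power_of_two(gamma_exponents)  # 2**-6, 2**-4, 2**-2

# Every (cost, gamma) pair is one candidate SVM configuration.
grid = list(product(costs, gammas))

for cost, gamma in grid:
    print(f"cost={cost:g}, gamma={gamma:g}")
```

Under these assumptions, 4 Cost values and 3 Gamma values give 12 candidate SVM models, each of which would be scored by cross validation.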

Step 2 (optional): Specify Model File

In Step 2, the user specifies the name and location of the model file that will be generated. This step is optional: by default, Array Studio saves model files as .omodel files in the My Documents\OmicSoft\Models folder.

Step 3 (optional): Change Classification Options

Classification3.png

  • Observation normalization: Define the method used to normalize observations: None (default), Center, or CenterScale. This option should be used when the training dataset and test dataset come from very different samples (the same setting should be selected when Predicting).
    • Normalized against all variables: This option is only meaningful when not all variables are used for the classification. When it is selected, the observations are normalized using all the variables in the dataset (instead of just the variables chosen in the Variables section).
  • Select variable based on F-test: Checking this box lets Array Studio pre-screen the variables based on an F-test. If unchecked, all variables will be used, which increases the time and memory needed to perform the Classification and is not recommended. A number of variables in the hundreds is recommended, so checking this option is highly recommended.
    • Selection size: Define the number of variables to select (50 by default).
  • Cross validation fold: Sets the number of folds used to cross-validate the models (setting the number of folds equal to the number of samples is equivalent to the “Leave one out” method of classification).
    • Leave one out: Select this option to see a table of the cross-validated predictions that support the bar charts of % accuracy. This will help in determining how well the classifiers do at predicting a particular class of interest.
  • Output model #: The user can decide how many top models will be output. By default, this option is set to 1, and only the best model out of all models chosen to be run is output.
  • Report cross validation predictions: The user can also choose to output the prediction result from cross validation.
  • Random number seed: This value is used to initialize the randomizer in the module (e.g., the cross validation). By default this is set to 0, which uses the system clock.
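
To make the Observation normalization choices concrete, here is a minimal Python sketch. It assumes the conventional meanings of the option names: Center subtracts each observation's mean across its variables, and CenterScale additionally divides by the observation's sample standard deviation. Array Studio's exact formulas are not stated on this page, and the function name is hypothetical.

```python
from statistics import mean, stdev

def normalize_observation(values, method="None"):
    """Normalize one observation (its values across all chosen variables).

    Assumed semantics (not confirmed by Array Studio documentation):
      "None"        -> values are returned unchanged
      "Center"      -> subtract the observation's mean
      "CenterScale" -> subtract the mean, then divide by the sample std dev
    """
    if method == "None":
        return list(values)
    m = mean(values)
    centered = [v - m for v in values]
    if method == "Center":
        return centered
    if method == "CenterScale":
        s = stdev(values)
        if s == 0:
            raise ValueError("cannot scale an observation with zero variance")
        return [c / s for c in centered]
    raise ValueError(f"unknown normalization method: {method}")

observation = [2.0, 4.0, 6.0]
centered = normalize_observation(observation, "Center")
scaled = normalize_observation(observation, "CenterScale")
print(centered, scaled)
```

Whatever method is used for training must also be applied to new observations before prediction, which is why the same setting should be selected when Predicting.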


Output Results

In the Solution Explorer, two Tables are generated under the Prediction tab. The first table is the .Predicted Table. It contains a column for chip number, a column for the Observed classification, and one column for each of the classification models exported by the command. This table can be used as a first step in determining how each of the models performed in cross validation. If a model did not correctly predict the classification of a sample, the prediction is highlighted in red in the table. An example of a .Predicted table is shown below.

Classification6.png
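
The comparison behind the red highlighting can be sketched in a few lines of Python. The class labels, model names, and function name here are made up for illustration and do not come from an actual .Predicted table.

```python
def find_misclassified(observed, predictions):
    """Return, for each model, the row indices where prediction != observed."""
    return {
        model: [i for i, (obs, pred) in enumerate(zip(observed, preds)) if obs != pred]
        for model, preds in predictions.items()
    }

# Hypothetical Observed column and one prediction column per exported model.
observed = ["Tumor", "Normal", "Tumor", "Normal"]
predictions = {
    "KNN": ["Tumor", "Normal", "Normal", "Normal"],
    "LDA": ["Tumor", "Normal", "Tumor", "Normal"],
}

misclassified = find_misclassified(observed, predictions)
for model, rows in misclassified.items():
    print(model, "misclassified rows:", rows)
```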

The second table generated by the command is a Cross Validation Report (.CVReport). It contains the mean accuracy and standard deviation for each generated model. It also automatically generates a view with Accuracy.Mean on the Y axis and model type on the X axis.

Classification7.png
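
The following Python sketch shows one way a per-model mean accuracy and standard deviation can be derived from k-fold cross validation, and why setting the fold count equal to the number of samples gives leave-one-out. It is an illustration under stated assumptions, not Array Studio's implementation: the 1-nearest-neighbor toy classifier, the data, the function names, and the "Accuracy.SD" key are all hypothetical (only Accuracy.Mean is named on this page).

```python
import random
from statistics import mean, stdev

def kfold_indices(n_samples, n_folds, seed=1):
    """Shuffle sample indices and deal them into n_folds disjoint test sets."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def nearest_neighbor_predict(train_x, train_y, x):
    """Toy 1-nearest-neighbor classifier on a single numeric feature."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[best]

def cross_validate(xs, ys, n_folds, seed=1):
    """Return the accuracy obtained on each held-out fold."""
    accuracies = []
    for test_idx in kfold_indices(len(xs), n_folds, seed):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        correct = sum(
            nearest_neighbor_predict(train_x, train_y, xs[i]) == ys[i]
            for i in test_idx
        )
        accuracies.append(correct / len(test_idx))
    return accuracies

# Made-up, well-separated data: one feature, two classes.
xs = [0.1, 0.2, 0.3, 0.4, 5.1, 5.2, 5.3, 5.4]
ys = ["A", "A", "A", "A", "B", "B", "B", "B"]

fold_accuracies = cross_validate(xs, ys, n_folds=4)
report = {
    "Accuracy.Mean": mean(fold_accuracies),
    "Accuracy.SD": stdev(fold_accuracies),  # hypothetical column name
}

# With as many folds as samples, every test set holds exactly one sample.
loo_folds = kfold_indices(len(xs), n_folds=len(xs))
print(report, loo_folds)
```

Repeating this for each candidate model and ranking by mean accuracy is one plausible reading of how the "best" models are chosen for output.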

Finally, a List is generated in the List tab of the Solution Explorer. It contains the variables selected by the command to be used for the models.

Classification8.png
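
When the F-test option from Step 3 is checked, a variable list like this one could plausibly be produced by ranking variables on a one-way ANOVA F statistic and keeping the top "Selection size" entries. The Python sketch below illustrates that idea only. Array Studio's actual F-test procedure is not described on this page, and the variable names, data, and function names are hypothetical.

```python
from statistics import mean

def f_statistic(values, labels):
    """One-way ANOVA F statistic for one variable across class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n_total, n_groups = len(values), len(groups)
    if n_groups < 2 or n_total <= n_groups:
        raise ValueError("need at least 2 classes and more samples than classes")
    grand_mean = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        # No within-class spread: perfectly separating, or a constant variable.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (n_groups - 1)) / (ss_within / (n_total - n_groups))

def select_top_variables(data, labels, selection_size):
    """Rank variables by F statistic (largest first) and keep the top ones."""
    scores = {name: f_statistic(vals, labels) for name, vals in data.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:selection_size]

# Made-up data: three variables measured on six observations in two classes.
labels = ["A", "A", "A", "B", "B", "B"]
data = {
    "gene1": [1.0, 1.2, 0.8, 5.0, 5.2, 4.8],  # class means far apart
    "gene2": [3.0, 2.0, 4.0, 3.0, 4.0, 2.0],  # identical class means
    "gene3": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0],  # modest difference
}

selected = select_top_variables(data, labels, selection_size=2)
print(selected)
```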


OmicScript

Classification

