
# Correlation-based QC

## Overview

The **Correlation-based QC** command calculates **Median Absolute Deviation** scores (MAD Scores, discussed below) for the samples in each Group, using one of several correlation methods. A new Table is generated in the **Solution Explorer** under the **QC folder**. By default, this command generates two views: a ScatterView comparing MAD Score by Group, and a TableView. Outliers can be selected and removed from -Omic data using the **Exclude selection** button of the Task/Update tab. If the command is used in the standard fashion (i.e. running Correlation-based QC once per experiment), a standard cutoff can be applied to search for outliers. The standard cutoff used by Array Studio is -5: samples with a MAD Score below -5 are considered outliers.

To run this module, select **MicroArray | QC | Correlation-based QC** from the menu.

### Input Data Requirements

This module works on -Omic data types.

## General Options

### Input/Output

**Project & Data**: The window includes a dropdown box to select the Project and Data object to be filtered.

**Variables**: Selections can be made on which variables should be included in the filtering. Options include All variables, Selected variables, Visible variables, and Customized variables (select any pre-generated Lists).

**Observations**: Selections can be made on which observations should be included in the filtering. Options include All observations, Selected observations, Visible observations, and Customized observations (select any pre-generated Lists).

**Output name**: The user can choose to name the output data object.

### Options

**Group**: The user can choose a Group by which to run the correlation-based QC. This will be used to generate the MAD Score and group statistics.

**Method**: The user can choose different methods for correlation, including Pearson, Spearman, and Kendall (see http://en.wikipedia.org/wiki/Correlation).

**Cutoff**: The Cutoff box can be used to change the cutoff for the MAD Score (default is -5). This will change whether the Fail column for that sample will be marked Y or N.

**Output correlation summary statistics**: If the user checks this box, this analysis will output a Table object that contains the correlation statistic summary for each group.

## Output Results

If the **Output correlation summary statistics** box is checked, a Table object similar to the following will be generated:

An example ScatterView of MAD Scores versus Group is shown below.

A TableView is generated that includes the Pass/Fail status, the Group, Average Correlation of the observation in its group, Correlation Difference (between the average correlation of the observation in its group and the average correlation of the entire group), and all other columns from the Design Table. An example is shown below.

The **Exclude Selection** button (available in the **Task** tab of the **View Controller**) generates a List of observation IDs from the current -Omic data that excludes any selected observations. This List can be used for future analysis and QC processes (using the Observations box in most Analysis windows). In addition, the button reruns the Correlation-based QC command, excluding the selected samples.

### Further information

Further information on the calculation of MAD scores:

- For each sample, calculate the correlation difference. This is the difference between the average of all the pairwise correlations that involve the sample (within the same group) and the average of all the pairwise correlations that do not involve the sample. For example, if group 1 contains samples a, b, c, and d, the correlation difference of sample *a* is Average(correlation(a, b), correlation(a, c), correlation(a, d)) minus Average(correlation(b, c), correlation(b, d), correlation(c, d)). If sample *a* is an outlier, it correlates poorly with the other samples in its group, so the difference will be negative.

- Now we have a vector of values (one for each sample). We convert this vector to MAD scores (robust Z-scores) by subtracting the median from each value, then dividing the result by the scaled median absolute deviation (MAD). We use a standard MAD cutoff (e.g. -5) to determine the outliers.
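The first step above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Array Studio's implementation: the function names `pearson` and `correlation_differences`, the dictionary layout of the input, and the toy data are all hypothetical, and only the Pearson method is shown.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_differences(samples):
    """Map sample name -> correlation difference within one group.

    `samples` maps sample name -> list of measurements (hypothetical layout).
    At least three samples are needed so that both averages are defined.
    """
    # Correlation of every unordered pair of samples in the group.
    pair_corr = {
        frozenset((s, t)): pearson(samples[s], samples[t])
        for s, t in combinations(samples, 2)
    }
    diffs = {}
    for s in samples:
        with_s = [r for pair, r in pair_corr.items() if s in pair]
        without_s = [r for pair, r in pair_corr.items() if s not in pair]
        diffs[s] = mean(with_s) - mean(without_s)
    return diffs


# Toy group: a, b, and c follow the same trend; d runs the opposite way.
group1 = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.1, 2.0, 2.9, 4.2, 5.1],
    "c": [0.9, 2.1, 3.2, 3.9, 4.8],
    "d": [5.0, 4.0, 3.0, 2.0, 1.0],
}
diffs = correlation_differences(group1)
```

In this toy group, sample `d` receives the lowest (negative) correlation difference, which is the pattern the first step describes for an outlier.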

The MAD score is calculated using the following formula:

Median Absolute Deviation (MAD) = median(|Yi - Median of Correlation Differences|), where Yi is the correlation difference of sample i

MAD Score = (Sample *a* Correlation Difference - Median of Correlation Differences) / (MAD * 1.4826)

The MAD (Median Absolute Deviation) and Median (of the Correlation Differences) are calculated for the entire dataset and therefore are the same numbers for each sample. However, these numbers will change if the dataset is changed.
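As a worked example of the formula, the following Python sketch converts a vector of correlation differences into MAD Scores and a Y/N fail flag. It is an illustration only: the function name `mad_scores`, the sample names, and the input values are hypothetical, and the handling of a zero MAD is an assumption, since this page does not say how that case is treated. The constant 1.4826 is the usual scaling that makes the MAD comparable to a standard deviation for normally distributed data.

```python
from statistics import median


def mad_scores(corr_diffs, cutoff=-5.0):
    """Return {sample: (mad_score, fail_flag)} from correlation differences.

    `corr_diffs` maps sample name -> correlation difference (hypothetical layout).
    The median and MAD are computed once, over all supplied values.
    """
    values = list(corr_diffs.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Assumption: refuse to divide by zero rather than guess a score.
        raise ValueError("MAD is zero; MAD Scores are undefined")
    result = {}
    for sample, value in corr_diffs.items():
        score = (value - med) / (mad * 1.4826)
        # A sample fails when its score falls below the cutoff (default -5).
        result[sample] = (score, "Y" if score < cutoff else "N")
    return result


# Made-up correlation differences; s5 sits far below the others.
example = {"s1": 0.02, "s2": 0.01, "s3": 0.00, "s4": -0.01, "s5": -0.50}
scores = mad_scores(example)
```

With these made-up values the median is 0.00 and the MAD is 0.01, so `s5` scores roughly -0.50 / (0.01 * 1.4826), far below the -5 cutoff, and is the only sample flagged Y.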