# NMF Clustering

## Overview

The **NMF Clustering** command performs non-negative matrix factorization (NMF). A more detailed description of applying NMF to clustering microarray data can be found here.

The Omic-data matrix **V** (*p × n*), where *p* is the number of variables (e.g., genes) and *n* is the number of observations (e.g., samples), is factorized into two matrices **H** (*p × k*) and **W** (*k × n*), such that none of the three matrices has any negative elements. Here *k* is the number of clusters. The goal is to find a small number (*k*) of metagenes (stored in matrix **H**), each defined as a positive linear combination of the *p* genes. The gene expression pattern of each sample can then be approximated as a positive linear combination of these metagenes. Samples are clustered as follows:

Given a factorization **V ≈ HW**, we can use matrix **W** to group the *n* samples into *k* clusters. Each sample is placed into the cluster corresponding to the most highly expressed metagene in that sample; that is, sample *j* is placed in cluster *i* if *w<sub>ij</sub>* is the largest entry in column *j*.
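The assignment rule above can be sketched in a few lines of Python. This is only an illustration: the function name `assign_observation_clusters` and the small example matrix are hypothetical, and **W** is represented as a plain list of *k* rows with *n* entries each.

```python
def assign_observation_clusters(W):
    """For each column j of W (k x n), return the row index i with the largest w_ij."""
    k = len(W)        # number of clusters (metagenes)
    n = len(W[0])     # number of observations (samples)
    # Ties are resolved in favour of the lowest cluster index (an assumption).
    return [max(range(k), key=lambda i: W[i][j]) for j in range(n)]


# Hypothetical W with k = 2 metagenes and n = 3 samples.
W = [
    [0.9, 0.1, 0.4],
    [0.1, 0.8, 0.6],
]
print(assign_observation_clusters(W))
```

Each returned value is a zero-based cluster index, one per sample, in column order.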

Note that in matrix **V**, each row is a variable and each column is an observation, so the observations are clustered with **W**, where **V ≈ HW**. It follows that in the transposed matrix **V<sup>T</sup>**, each row is an observation and each column is a variable, so the variables can be clustered in the same way with **H**, where **V<sup>T</sup> ≈ W<sup>T</sup>H<sup>T</sup>**.

Both variable clusters and observation clusters will be generated. The command will create Cluster objects (variable and observation) in the **Cluster** tab of the **Solution Explorer**. Using the **View Controller**, these Clusters can be used in a variety of Views with Data objects. In addition, a Heatmap or ProfileView of the clustering will be generated for the selected Data object.
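As a rough illustration of the transpose argument, the sketch below clusters variables by taking the largest entry in each row of **H**, and checks that this matches applying the column rule to **H<sup>T</sup>**. The function names and the example matrix are hypothetical.

```python
def assign_variable_clusters(H):
    """For each row of H (p x k), return the column index of its largest entry."""
    return [max(range(len(row)), key=row.__getitem__) for row in H]


def assign_by_columns(M):
    """Column rule used for observations: largest entry in each column of M."""
    return [max(range(len(M)), key=lambda i: M[i][j]) for j in range(len(M[0]))]


# Hypothetical H with p = 3 variables and k = 2 metagenes.
H = [
    [0.7, 0.2],
    [0.1, 0.5],
    [0.3, 0.4],
]
H_T = [list(col) for col in zip(*H)]   # transpose: k x p

# Row rule on H and column rule on H^T give the same variable clusters.
print(assign_variable_clusters(H))
print(assign_by_columns(H_T))
```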

Note: While there is no theoretical limit on the number of variables that can be clustered, this algorithm may require more memory than others. An **Out-of-Memory** error may be generated when there is not enough available RAM in the computer to handle the number of selected variables.

To run this module, select **MicroArray | Pattern | NMF Clustering**.

### Input Data Requirements

This module works on -Omic data types.

## General Options

### Input/Output

**Project & Data**: The window includes a dropdown box to select the Project and Data object to be clustered.

**Variables**: Selections can be made on which variables should be included in the clustering. Options include All variables, Selected variables, Visible variables, and Customized variables (select any pre-generated Lists).

**Observations**: Selections can be made on which observations should be included in the clustering. Options include All observations, Selected observations, Visible observations, and Customized observations (select any pre-generated Lists).

**Output name**: The user can choose to name the output data object.

### Options

**Search cluster#**: The user can either enter a single cluster number for the algorithm to find (e.g., tell the program to find 9 clusters), or use the range option, in which case Array Studio searches that range and selects the best number of clusters based on silhouette.

**Current Data scale**: Select the data scale; available options are Log2 and Linear. This should be set to the scale used for the selected Data object (i.e., if the data has been logged, select Log2).

**Output view**: The user has the option to automatically output a **Heatmap**, a **Profile View**, or **None**.

**Stopping rule**: The NMF function uses an iterative numerical approximation algorithm. The iteration stops when the difference between the statistics from the last two iterations is smaller than the stopping rule (default is 1E-6; this should not be changed unless the user is familiar with the algorithm).

**Max iteration**: The maximum number of iterations for the clustering algorithm (default is 100).

**Normalize variables**: Checking this box will scale the variables during the clustering algorithm (checked by default).

**Optimize observation clustering**: Checking this box will take observation clustering into account by calculating the sum of each observation's silhouette when finding the best *k*, if the user inputs a range of *k*.

**Optimize variable clustering**: Checking this box will take variable clustering into account by calculating the sum of each variable's silhouette when finding the best *k*, if the user inputs a range of *k*.

**Output W matrix**: Output the **W** matrix as an -Omic data object for observation clustering.

**Output H matrix**: Output the **H** matrix as an -Omic data object for variable clustering.
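To show how the **Stopping rule** and **Max iteration** options interact, here is a minimal sketch of an iterative NMF loop. It makes several assumptions that this document does not confirm: it uses the widely known multiplicative update rules for the squared-error objective, it takes the squared reconstruction error as the "statistic" compared between iterations, and it uses plain Python lists for self-containment. The names `nmf`, `tol`, `max_iter`, and `seed` are illustrative, not Array Studio parameters.

```python
import random


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def squared_error(V, H, W):
    HW = matmul(H, W)
    return sum((v - a) ** 2 for rv, ra in zip(V, HW) for v, a in zip(rv, ra))


def nmf(V, k, tol=1e-6, max_iter=100, seed=0):
    """Factorize V (p x n) into H (p x k) and W (k x n), all non-negative."""
    rng = random.Random(seed)
    p, n = len(V), len(V[0])
    H = [[rng.random() + 0.1 for _ in range(k)] for _ in range(p)]
    W = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    eps = 1e-12                      # guards against division by zero
    prev = squared_error(V, H, W)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # W <- W * (H^T V) / (H^T H W), element-wise
        Ht = transpose(H)
        num, den = matmul(Ht, V), matmul(matmul(Ht, H), W)
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(k)]
        # H <- H * (V W^T) / (H W W^T), element-wise
        Wt = transpose(W)
        num, den = matmul(V, Wt), matmul(H, matmul(W, Wt))
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(p)]
        err = squared_error(V, H, W)
        if abs(prev - err) < tol:    # stopping rule reached
            break
        prev = err
    return H, W, iteration


# Hypothetical 4 x 4 data matrix with two visible blocks.
V = [
    [5.0, 4.0, 1.0, 0.5],
    [4.0, 5.0, 0.5, 1.0],
    [1.0, 0.5, 6.0, 5.0],
    [0.5, 1.0, 5.0, 6.0],
]
H, W, iterations = nmf(V, k=2)
print("iterations used:", iterations)
print("squared error:", squared_error(V, H, W))
```

The loop ends either when the change in error falls below `tol` or when `max_iter` is reached, whichever comes first, which is why loosening the stopping rule or lowering the iteration cap can leave the factorization less converged.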

## Output Results

The command will generate silhouette reports for all cluster numbers. The heatmaps for both observations and variables can be found under the corresponding Omic-data.

Note: No clustering tree is provided within each cluster, but observations/variables with lower dissimilarity are placed closer together.

The Silhouettes tables for observations and variables are also generated (if the user chooses to optimize both observations and variables). Each Silhouettes table shows the assigned cluster name, the nearest neighbor cluster name, and the silhouette width for each observation/variable under each possible *k* value.
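For readers unfamiliar with silhouettes, the sketch below computes the three columns described above (assigned cluster, nearest neighbor cluster, width) using the standard silhouette definition. It assumes Euclidean distance as the dissimilarity and at least two clusters; the document does not state which dissimilarity Array Studio uses, and the function name `silhouette_table` is hypothetical.

```python
import math


def silhouette_table(points, labels):
    """Return (assigned cluster, nearest neighbor cluster, width) per item.

    Assumes Euclidean distance and at least two distinct cluster labels.
    """
    clusters = sorted(set(labels))
    rows = []
    for i, (pt, own) in enumerate(zip(points, labels)):
        # Mean distance from item i to the members of each cluster (excluding itself).
        mean_dist = {}
        for c in clusters:
            members = [q for j, (q, lab) in enumerate(zip(points, labels))
                       if lab == c and j != i]
            if members:
                mean_dist[c] = sum(math.dist(pt, q) for q in members) / len(members)
        others = {c: d for c, d in mean_dist.items() if c != own}
        neighbor = min(others, key=others.get)   # closest other cluster
        b = others[neighbor]
        if own not in mean_dist:                 # singleton cluster: width 0 by convention
            width = 0.0
        else:
            a = mean_dist[own]
            width = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
        rows.append((own, neighbor, width))
    return rows


# Hypothetical items with two well-separated clusters.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels = [0, 0, 1, 1]
table = silhouette_table(points, labels)
for row in table:
    print(row)

# The sum of widths is the kind of score the options describe for choosing k.
print("sum of silhouette widths:", sum(width for _, _, width in table))
```

Widths close to 1 indicate an item that sits well inside its assigned cluster, while widths near 0 or below suggest it lies between clusters or closer to its neighbor cluster.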