Principal Component Analysis
The Principal Component Analysis module runs a principal component analysis (PCA) on the selected dataset. PCA summarizes the original variables of each observation as a small number of principal components. Each component is a linear combination of the original variables, chosen so that its variance is maximized. The first several principal components therefore give a compact view of the overall variance among the observations. The user can display the observations in a 2D or 3D plot to find outliers.
To run this module, go to MicroArray | QC | Principal Component Analysis.
Input Data Requirements
This module works on -Omic data objects.
- Project & Data: The window includes a dropdown box to select the Project and Data object to be analyzed.
- Variables: Selections can be made on which variables should be included in the analysis (options include All variables, Selected variables, Visible variables, and Customized variables (select any pre-generated Lists)).
- Observations: Selections can be made on which observations should be included in the analysis (options include All observations, Selected observations, Visible observations, and Customized observations (select any pre-generated Lists)).
- Output name: The user can choose to name the output data object.
- Component number: Sets the number of principal components to be generated (default: 2). Setting the Component number to 3 will generate a 3D plot (shown in Output Result). Setting the Component number to 4 or more will generate a pairwise scatterview plot of the PCA for the top components, up to the number specified (also shown in Output Result).
- Group: A Group can be selected from the -Omic data's design table to assign different colors to different levels in this group.
- Scale variables: If the Scale Variables option is checked (which is the default), Array Studio uses an adjusted "unit variance" scaling (similar to the methods used by programs like SIMCA). The variables are first centered and then scaled using unit variance, with the scaling factor determined by Standard Deviation * Sqr((n-1)/n).
- Output scores: Checking this box will output the PCA score for each calculated component for each sample. For example, if the Component number is 2, the first 2 component scores for each sample would be generated in PCA score tables.
- Output loadings: Checking this box will generate a loading plot and a .PcaLoadings Table. In PCA, the loadings are the weights of each variable in a component. The larger the absolute value of a variable's loading, the more the variable contributes to that component, and thus the more important the variable is.
- Generate list for top % variables: Selecting this option creates a list (one for each component) of the top "x" % of rows contributing to that component. It does this by ranking the absolute value of the loadings for each component. This list can then be used with the "Reorder Rows by List" table function on the loadings table to display the loadings in a ranked order.
- Output eigen values: Each principal component has an eigenvalue. The first components have larger eigenvalues than later components. The larger a component's eigenvalue, the more of the total variance that component accounts for. Checking this box will generate a Table object with the eigenvalues.
- Output ordered heatmap: Checking this box will generate a heatmap view for each component in the original Data object (the generated ordered heatmap uses eigenvalues from SVD to order the rows and columns).
- Calculate Hotelling T2: Generates a Hotelling T2 ellipse using the Alpha level (0.05 by default). Data points outside the ellipse have only a small probability of sharing the same variance structure as the data points inside the ellipse. A detailed description can be found here. Please note that if users specify grouping information, more than one ellipse will appear in the 2D plot (if there are m levels in the grouping factor, m + 1 ellipses are shown in the plot).
- Alpha level: The value used to control the size of the ellipse. The larger the alpha value, the smaller the ellipse. The default value is 0.05.
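The scaling, scores, loadings, eigenvalue, and top % list options described above can be illustrated with a short NumPy sketch. This is not Array Studio's implementation. The function names are hypothetical, the input is assumed to be an observations-by-variables matrix with no constant variables, "Standard Deviation" is assumed to be the sample standard deviation, "Sqr" is assumed to mean square root, and the eigenvalues are assumed to use an n denominator to match the scaling.

```python
import numpy as np

def scale_unit_variance(X):
    """Center each variable (column), then divide by SD * sqrt((n - 1) / n)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    centered = X - X.mean(axis=0)
    # Assumption: sample SD (ddof=1) times sqrt((n - 1) / n),
    # which is numerically the population SD (ddof=0).
    factor = X.std(axis=0, ddof=1) * np.sqrt((n - 1) / n)
    return centered / factor

def pca_svd(X, n_components=2):
    """Return (scores, loadings, eigenvalues) for an already centered matrix."""
    n = X.shape[0]
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]   # one row per observation
    loadings = Vt[:n_components].T                    # one row per variable
    # Assumption: variance with an n denominator; other tools divide by n - 1.
    eigenvalues = S[:n_components] ** 2 / n
    return scores, loadings, eigenvalues

def top_percent_variables(loading_column, percent):
    """Indices of the top `percent` % of variables, ranked by |loading|."""
    k = max(1, int(round(len(loading_column) * percent / 100.0)))
    order = np.argsort(-np.abs(loading_column))       # largest |loading| first
    return order[:k]

# Example with random data: 12 observations, 6 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 6))
Xs = scale_unit_variance(X)
scores, loadings, eigenvalues = pca_svd(Xs, n_components=2)
top = top_percent_variables(loadings[:, 0], percent=50)
```

Under these assumptions every scaled variable has unit variance, so the eigenvalues of all components sum to the number of variables, and each eigenvalue divided by that sum gives the share of variance the component accounts for.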
An example 2D PCA plot with grouping information is shown below. In this case, the grouping factor is “treatment”. There are two levels in this factor: control and DBP. There are 3 ellipses representing 3 confidence regions: one for all samples (the black ellipse), one for control samples (the small blue ellipse) and one for DBP samples (the green ellipse). The position of each data point is calculated from all samples, i.e., the loadings and scores are computed using all samples, but each ellipse is determined by its corresponding group.
The main purpose of PCA is to find potential outliers. For 2D PCA plots, the user can easily find the outliers using the ellipse. The data points outside of the ellipse should have a (significantly) different data pattern than the ones inside the ellipse.
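The ellipse-based outlier check can be sketched in plain Python. This is a minimal illustration, not Array Studio's implementation: the function names are hypothetical, and the limit uses one common textbook form, T2_limit = A(n - 1)/(n - A) * F(alpha; A, n - A) with A = 2 components. The exact formula Array Studio applies is not stated here. For two components the F critical value has a closed form, so no statistics library is needed.

```python
import math

def f_critical_2(alpha, m):
    """Upper-alpha critical value of the F distribution with (2, m) degrees of freedom.

    With numerator df = 2 the survival function is (1 + 2x/m) ** (-m/2),
    which can be inverted directly.
    """
    return (m / 2.0) * (alpha ** (-2.0 / m) - 1.0)

def hotelling_t2_limit(n, alpha=0.05):
    """Assumed T2 limit for a 2-component scores plot with n observations."""
    a = 2
    return a * (n - 1) / (n - a) * f_critical_2(alpha, n - a)

def outside_ellipse(scores, alpha=0.05):
    """Flag each (t1, t2) score pair that falls outside the ellipse."""
    n = len(scores)
    limit = hotelling_t2_limit(n, alpha)
    # Scores from centered data already have mean zero; re-centering is a safeguard.
    means = [sum(s[k] for s in scores) / n for k in (0, 1)]
    variances = [
        sum((s[k] - means[k]) ** 2 for s in scores) / (n - 1) for k in (0, 1)
    ]
    flags = []
    for s in scores:
        t2 = sum((s[k] - means[k]) ** 2 / variances[k] for k in (0, 1))
        flags.append(t2 > limit)
    return flags

# Example: 19 points on a unit circle plus one distant point.
points = [
    (math.cos(2 * math.pi * k / 19), math.sin(2 * math.pi * k / 19))
    for k in range(19)
]
points.append((10.0, 0.0))
flags = outside_ellipse(points, alpha=0.05)
```

In this sketch a larger alpha gives a smaller limit and therefore a smaller ellipse, which matches the behavior of the Alpha level option. Per-group ellipses would apply the same calculation to the scores of each group separately.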
An example loadings plot is shown below.
An example of an ordered heatmap is shown below.
An example of a generated eigenvalues report is shown below:
Setting the PCA component number to 3 will generate an interactive 3D scores plot. In the 3D plot, the user can rotate the view to find potential outliers. Currently, Array Studio does not provide the Hotelling T2 ellipse for 3D plots.
Setting the PCA component number to 4 or more will generate a scores plot similar to the one below.