An Introduction to a Machine Learning Algorithm: Principal Component Analysis

Written by Dhairya Shandilya

"Analysis is the critical starting point of strategic thinking" - Kenichi Ohmae

Large sets of data are increasingly common and are often difficult to interpret. Principal Component Analysis (PCA) was invented to make such datasets easier to work with: it reduces their dimensionality as much as possible while keeping the loss of data (information) to a minimum. So, in this article, we'll primarily talk about PCA, its history, a basic explanation of the principal components, its limitations, its applications, and some similar techniques.


What is Principal Component Analysis?


Principal component analysis, or PCA, is a statistical procedure that summarizes the information in large data tables by means of a smaller set of “summary indices” that can be visualized and analyzed more easily. It is a simple technique for extracting structure from confusing and complex sets of data and condensing them into a much simpler form with minimal information loss.
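To make this concrete, here is a minimal sketch (assuming NumPy and scikit-learn are installed; the data below are random and purely illustrative) that reduces a hypothetical 10-variable data table to two summary indices:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # 200 observations, 10 variables

pca = PCA(n_components=2)             # keep only 2 summary indices
scores = pca.fit_transform(X)         # each row: the observation's 2 indices

print(scores.shape)                   # (200, 2)
print(pca.explained_variance_ratio_)  # share of variance each PC retains
```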


History of PCA


PCA was invented in 1901 by Karl Pearson, an English mathematician and biostatistician, as an analogue of the principal axis theorem in mechanics. It was later independently developed and named by Harold Hotelling, an American mathematical statistician and influential economic theorist, during the 1930s.


Details


PCA is defined as an orthogonal linear transformation that maps the data to a new coordinate system such that the greatest variance under scalar projection of the data comes to lie on the first principal component (also called the first coordinate), the second greatest variance on the second principal component, and so on. Now, in the next few points, we'll see how PCA actually works:


1) Mean Centering-

In mean centering, we first compute the average of each variable. This vector of averages can be interpreted as a point in variable space, situated at the center of the point swarm. Subtracting the averages from the data re-positions the coordinate system so that this average point becomes the origin.
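A small NumPy sketch of this step (the data matrix here is made up for illustration; rows are observations, columns are variables):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(100, 3))   # hypothetical data, 3 variables

averages = X.mean(axis=0)                # the variable averages (the average point)
X_centered = X - averages                # move the origin to the average point

print(X_centered.mean(axis=0))           # ~0 for every variable after centering
```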


2) First Principal Component-


After mean centering, the dataset is ready for computation of the first summary index, i.e. the first principal component (PC1). This component is the line in the K-dimensional variable space that best approximates the data in the least-squares sense. The line passes through the average point, and each observation can be projected onto it to obtain a coordinate value along the PC line. This new coordinate value is known as a score.
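Here is a sketch, assuming a mean-centered matrix like X_centered above, that finds the first principal component via the covariance matrix and computes the scores by projection:

```python
import numpy as np

rng = np.random.default_rng(0)
X_centered = rng.normal(size=(100, 3))
X_centered -= X_centered.mean(axis=0)            # make sure the data are centered

cov = np.cov(X_centered, rowvar=False)           # K x K covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigenvalues in ascending order

pc1 = eigenvectors[:, -1]                        # direction of greatest variance
scores_pc1 = X_centered @ pc1                    # projection of each observation = its score

print(scores_pc1[:5])
```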


3) Second Principal Component-


The second principal component (PC2) is oriented in such a way that it reflects the second largest source of variation in the data while being orthogonal to the first PC. PC2 also passes through the average point.


4) Two PCs define a model plane-

Two PCs form a plane, a window into the multidimensional variable space that can easily be visualized graphically. Each observation may be projected onto this plane, giving a pair of scores for each observation.
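A sketch of this step, continuing the eigendecomposition approach above on made-up data: the first two eigenvectors span the model plane, and projecting the observations onto it gives two scores per observation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)

cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

plane = eigenvectors[:, [-1, -2]]         # PC1 and PC2 side by side
scores = X @ plane                        # shape (100, 2): a point on the plane
                                          # for every observation

print(np.dot(plane[:, 0], plane[:, 1]))   # ~0: PC1 and PC2 are orthogonal
print(scores.shape)                       # these 2-D scores can be scatter-plotted
```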


Limitations


The applicability of PCA is limited by certain assumptions. PCA can only capture linear correlations between the features, so it fails when the relationships in the data are nonlinear. Another major limitation arises if mean centering is skipped before constructing the covariance matrix: the first principal component then tends to point toward the mean of the data rather than the direction of maximum variance.
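A small illustration of the linearity assumption (toy data, for demonstration only): here the second variable is completely determined by the first, but the relationship is quadratic, so the covariance that PCA relies on is close to zero and the structure goes unnoticed.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1000)
y = x ** 2                     # fully determined by x, but not linearly

X = np.column_stack([x, y])
X -= X.mean(axis=0)

# The off-diagonal entry (the covariance between x and y) is close to zero,
# so PCA treats the two variables as almost unrelated.
print(np.cov(X, rowvar=False))
```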


Applications


There are multiple applications of PCA. Some of them are listed below:


1) Quantitative finance- In quantitative finance, PCA can be applied directly to the risk management of interest rate derivative (IRD) portfolios (a small simulated sketch follows after this list).


2) Neuroscience- A variant of PCA is used in neuroscience to identify the specific properties of a stimulus that increase a neuron's probability of generating an action potential. This technique is called spike-triggered covariance analysis.
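For the interest-rate use case in point 1, here is a hedged, simulated sketch (the maturities, volatilities, and level/slope structure are assumptions made up for this demonstration, not real market data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
maturities = np.array([1, 2, 5, 10, 30])            # years (hypothetical curve)
level = rng.normal(scale=0.05, size=(500, 1))       # parallel shifts of the curve
slope = rng.normal(scale=0.02, size=(500, 1))       # steepening / flattening moves
noise = rng.normal(scale=0.005, size=(500, 5))
yield_changes = level + slope * (maturities / 30) + noise

pca = PCA(n_components=3)
pca.fit(yield_changes)

# The first few components typically summarize most of the portfolio's rate risk.
print(pca.explained_variance_ratio_)
```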


Similar Techniques


1) Network Component Analysis (NCA)- NCA models a gene regulatory network as a bipartite graph whose vertices can be divided into two sets, i.e. the regulatory factors and the regulated genes.


2) Independent Component Analysis (ICA)- ICA is directed at similar problems as PCA, but it finds additively separable components rather than successive approximations.
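A short, hedged sketch of the contrast (assuming scikit-learn's FastICA; the two source signals below are made up): PCA looks for directions of maximum variance, while ICA tries to recover the independent sources that were mixed together.

```python
import numpy as np
from sklearn.decomposition import FastICA, PCA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                          # independent source 1
s2 = np.sign(np.sin(3 * t))                 # independent source 2
S = np.column_stack([s1, s2])
A = np.array([[1.0, 0.5], [0.5, 2.0]])      # mixing matrix
X = S @ A.T                                 # the observed, mixed signals

ica = FastICA(n_components=2, random_state=0)
sources_ica = ica.fit_transform(X)          # aims to separate the sources

pca = PCA(n_components=2)
sources_pca = pca.fit_transform(X)          # finds uncorrelated, max-variance axes

print(sources_ica.shape, sources_pca.shape)
```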


So, what are your views on Principal Component Analysis (PCA)? Do let us know in the comments section below. Also, please like, share, and log in to Techvik for more interesting tech-related blogs like this.

