Outlier Detection using Principal Component Analysis

Summary

My first article presents an analysis of an experiment on a structured dataset containing \(5\) outliers. The goal of the experiment is to find the outliers, and the analysis is based on \(1,500\) observations and \(7\) features.

After standardizing the data, a model consisting of a set of linearly uncorrelated principal components was created.

After creating the linear model, a figure showing the outliers in the dataset was produced, and finally a table of statistics of the principal components was compiled.

References

Nomenclature

\begin{array}{|l|l|}
\hline
\textbf{Symbol} & \textbf{Definition} \\
\hline
\mathbb{R} & \text{Set of real numbers}\\
\mu(\cdot) & \text{Mean}\\
||\cdot||_{2} & \text{Euclidean norm}\\
\sigma^{2}(\cdot) & \text{Variance}\\
m & \text{Number of observations in the experiment}\\
n & \text{Number of features in the experiment}\\
X: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m} & m \times n \text{ design matrix}\\
\mathbf{x_{k}} \in \mathbb{R}^{m}& \text{The k-th feature vector in } X\\
L & \text{The number of principal components}\\
U: \mathbb{R}^{L} \rightarrow \mathbb{R}^{m} & m \times L \text{ matrix that contains the left singular }\\
&\text{orthogonal unit vectors of }X \text{ corresponding to}\\
& \text{the largest singular values}\\
\lambda_{k} & \text{The k-th eigenvalue of the matrix } X^{T}X\\
\mathbf{u_{k}} \in \mathbb{R}^{m} & \text{The k-th left singular orthogonal unit vector of }X\\
\mathbf{c_{k}} \in \mathbb{R}^{m} & \text{The k-th principal component}\\
\mathcal{N}\left(\mu,\sigma^{2}\right) & \text{Normal distribution with mean } \mu \text{ and variance } \sigma^{2}\\
\hline
\end{array}

Dataset

My dataset consists of \(n = 7\) features and \(m = 1,500\) observations. Each entry in the dataset is perturbed by random noise. The first \(5\) observations are shown below:


\begin{array}{|r|r|r|r|r|r|r|}
\hline \mathbf{x_{1}}
& \mathbf{x_{2}}
& \mathbf{x_{3}}
& \mathbf{x_{4}}
& \mathbf{x_{5}}
& \mathbf{x_{6}}
& \mathbf{x_{7}}\\
\hline
63 & 55 & 59 & 57 & 67 & 52 & 70\\
64 & -2 & -3 & -4 & -5 & 51 & 74\\
65 & 55 & 60 & 57 & 71 & 51 & 76\\
69 & 59 & 64 & 61 & 75 & 55 & 79\\
-1 & 59 & -3 & -4 & -5 & 56 & 79\\
\hline
\end{array}
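The exact process that generated the dataset is not shown in this article. Purely as an illustration, the sketch below constructs a dataset of the same shape in Python: correlated features with additive noise and \(5\) injected deviant rows. The base level, the noise scale, and which entries are overwritten are illustrative assumptions, not the actual generating process.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility

m, n = 1_500, 7    # observations and features, as in the article
n_outliers = 5     # number of deviant observations to inject

# A shared base level makes the features correlated; the per-entry noise
# mimics the "random noise" mentioned above (parameters are illustrative).
base = rng.normal(loc=60.0, scale=5.0, size=(m, 1))
data = base + rng.normal(scale=3.0, size=(m, n))

# Inject outliers: overwrite several entries of a few rows with small
# values, similar to the -1 ... -5 entries visible in the table above.
outlier_rows = rng.choice(m, size=n_outliers, replace=False)
for i in outlier_rows:
    cols = rng.choice(n, size=4, replace=False)
    data[i, cols] = -rng.integers(1, 6, size=4)
```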

Design Matrix

The design matrix \( X \) is derived from the dataset by standardization. Because you do not want to compare apples and oranges, I scaled the dataset such that each feature has the properties of a standard normal distribution ( \(\mu = 0\) and \(\sigma^{2} = 1\) ). Thus, for \(k \in \left\{1,2,3,4,5,6,7\right\}\), \[ \mathbf{x_{k}}\sim \mathcal{N}\left(0,1\right). \]
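In code, this standardization is a column-wise operation. A minimal NumPy sketch, assuming `data` holds the raw \(m \times n\) dataset:

```python
import numpy as np

# Standardize each feature: subtract its mean and divide by its standard
# deviation, so every column of X has mean 0 and variance 1.
X = (data - data.mean(axis=0)) / data.std(axis=0)

assert np.allclose(X.mean(axis=0), 0.0)  # mu = 0 for every feature
assert np.allclose(X.var(axis=0), 1.0)   # sigma^2 = 1 for every feature
```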

Principal Component Analysis

The goal of the algorithm is to find uncorrelated principal components that retain most of the information. The algorithm uses the Singular Value Decomposition of \(X\) to find the principal components [1].

The inputs of the algorithm are the number of components to keep, \(L = 4\), and the design matrix \(X\).

The output is a linear transformation of \(X\) into \(L = 4\) uncorrelated principal components.
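A minimal sketch of this step, applying NumPy's singular value decomposition to the standardized design matrix `X` from the previous section:

```python
import numpy as np

L = 4  # number of principal components to keep

# Thin SVD: X = U @ diag(S) @ Vt, singular values S in descending order.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Each principal component is a left singular vector scaled by its
# singular value: c_k = sqrt(lambda_k) * u_k, collected in an m x L matrix.
C = U[:, :L] * S[:L]

print(np.round(S[:L], 2))  # the article reports 97.14, 22.55, 17.62, 9.30
```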

Model

The linear transformation of \(X\) into \(4\) uncorrelated principal components has the model below.\[\begin{bmatrix}\mathbf{c_{1}}&\mathbf{c_{2}}&\mathbf{c_{3}}&\mathbf{c_{4}}\end{bmatrix}
=\begin{bmatrix}97.14\mathbf{u_{1}}&22.55\mathbf{u_{2}}&17.62\mathbf{u_{3}}&9.30\mathbf{u_{4}}\end{bmatrix}
=U
\begin{bmatrix}
97.14&0&0&0\\
0&22.55&0&0\\
0&0&17.62&0\\
0&0&0&9.30\\
\end{bmatrix}\]
where the matrix on the left-hand side contains the principal components, and the diagonal matrix on the right-hand side contains the singular values of \(X\).

The figure below shows you \(5\) numbered observations that can be considered outliers.
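The original plot is not reproduced here; the sketch below shows one way to recreate it, numbering the observations that lie farthest from the origin in the component space (flagging the top \(5\) distances is an illustrative rule, not necessarily the one used for the figure):

```python
import matplotlib.pyplot as plt
import numpy as np

# Distance of each observation from the origin in the space of the
# retained components; the largest distances are outlier candidates.
dist = np.linalg.norm(C, axis=1)
outliers = np.argsort(dist)[-5:]  # indices of the 5 most extreme points

plt.scatter(C[:, 0], C[:, 1], s=10, alpha=0.5)
plt.scatter(C[outliers, 0], C[outliers, 1], color="red")
for i in outliers:
    plt.annotate(str(i), (C[i, 0], C[i, 1]))  # number the flagged points
plt.xlabel("$c_1$")
plt.ylabel("$c_2$")
plt.show()
```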

Evaluation

Statistics are computed for each principal component. The results are shown in the table below.


\begin{array}{|c|r|r|r|}
\hline
k &\sigma^{2}\left(\mathbf{c_{k}}\right) &\sqrt{\lambda_{k}} &\mu\left(\mathbf{c_{k}}\right) \\
\hline
1 & 6.2907864 & 97.14 &0\\
2 & 0.3390017 & 22.55 &0\\
3 & 0.2067414 & 17.62 &0\\
4 & 0.0576600 & 9.30 &0\\
\hline
\end{array}
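These statistics can be reproduced directly from the component matrix; a brief sketch, reusing `C` and `S` from the SVD code above:

```python
# Variance, square root of the eigenvalue, and mean per component.
for k in range(L):
    print(k + 1, C[:, k].var(), S[k], C[:, k].mean())
```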

Because \(\mathbf{c_{1}}\) has the highest variance, you may conclude that the first component retains most of the information.

Moreover, \(\lambda_{k} = 1,500\,\sigma^{2}\left(\mathbf{c_{k}}\right)\) holds for all \(k \in \{1,2,3,4\}\): each component has zero mean, so \(\sigma^{2}\left(\mathbf{c_{k}}\right) = ||\mathbf{c_{k}}||^{2}_{2}/m\), and therefore \[m \sigma^{2}\left(\mathbf{c_{k}}\right) = ||\mathbf{c_{k}}||^{2}_{2} =\mathbf{c^{T}_{k}}\mathbf{c_{k}}=\sqrt{\lambda_{k}}\left(\mathbf{u^{T}_{k}}\mathbf{u_{k}}\right)\sqrt{\lambda_{k}} = \lambda_{k}.\]
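This identity is easy to verify numerically, continuing from the same sketch:

```python
import numpy as np

m = X.shape[0]      # 1,500 observations
lam = S[:L] ** 2    # eigenvalues: lambda_k is the squared singular value

# Each component has zero mean, so var(c_k) = ||c_k||^2 / m,
# and m * var(c_k) recovers lambda_k.
assert np.allclose(C.mean(axis=0), 0.0)
assert np.allclose(m * C.var(axis=0), lam)
```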

Conclusion

The experiment was performed on a structured dataset that contains \(5\) highly deviant observations in comparison with the other \(1,495\). This analysis has shown that the \(5\) outliers can be confidently found using principal components.

Machine Learning Experiment

(c) 2018, Jan Ruijgrok. All Rights Reserved.