Apr 7, 2022
Outliers in data preprocessing: Spotting the odd one out!
More than 2.5 quintillion bytes of data were generated daily in 2020 (source: GS Analytics). To put this in perspective, a quintillion is a million million million or, more simply, a 1 followed by 18 zeros. It is therefore not surprising that a significant amount of this data is subject to errors. Data scientists are aware of this and routinely check their databases for values that stand out from the rest. This process, referred to as “outlier identification” in the case of numerical variables, has become a standard step in data preprocessing.
The search for outliers
The search for univariate outliers is quite straightforward. For instance, if we are dealing with human heights and most of the individuals' measurements are expected to range between 150 cm and 190 cm, then heights such as 1.70 cm and 1700 cm must be understood to be annotation errors. Aside from such gross outliers, which should definitely be cleaned when performing data preprocessing tasks, there is still room for outliers that are inherent to the type of data we are dealing with. For instance, some people could be 140 cm or 200 cm tall. This type of outlier is typically identified with rules of thumb, such as the absolute value of the z-score being greater than 3. Unless there is an obvious reason to do so (such as an annotation error), these outliers should not be removed in general; still, it is important to identify them and monitor their influence on the modelling task to be performed.
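As a minimal sketch of this rule of thumb, the snippet below flags observations whose z-score exceeds 3 in absolute value. The heights are simulated purely for illustration, with a single 1700 cm annotation error planted by hand:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sample: 200 simulated heights around 170 cm (sd 8 cm),
# plus one planted annotation error of 1700 cm
heights = np.append(rng.normal(170.0, 8.0, size=200), 1700.0)

# z-score of every observation with respect to the sample mean and sd
z = (heights - heights.mean()) / heights.std()

# Rule of thumb: flag observations with |z| > 3 as potential outliers
flagged = heights[np.abs(z) > 3]
```

Note that the outlier itself inflates the sample mean and standard deviation, so with very small samples an extreme value can even mask itself; robust alternatives to the mean and standard deviation mitigate this.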
A more difficult problem arises when we are dealing with multivariate data. For example, imagine that we are dealing with human heights and weights and that we have obtained the data represented in the scatterplot below. The individual marked in red is not a univariate outlier in either of the two dimensions separately; however, when height and weight are considered jointly, this individual clearly stands out from the rest.
A popular technique for the identification of multivariate outliers is based on the use of the Mahalanobis distance, which is just a measure of how far a point $x$ is from the centre of the data. Mathematically speaking, the formula is as follows:

$$d(x) = \sqrt{(x - \mu)^\top \, \Sigma^{-1} \, (x - \mu)},$$

where $\mu$ represents the mean vector (i.e., the centre of the data) and $\Sigma$ the covariance matrix, both of them typically being estimated from the data by the sample mean vector and the sample covariance matrix.
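As a minimal sketch, this distance can be computed with NumPy as follows; the height/weight data and its parameters are simulated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated height/weight data (cm, kg); parameters are illustrative only
X = rng.multivariate_normal([170.0, 70.0], [[64.0, 28.0], [28.0, 25.0]], size=300)

mu = X.mean(axis=0)              # sample mean vector
Sigma = np.cov(X, rowvar=False)  # sample covariance matrix
Sigma_inv = np.linalg.inv(Sigma)

# Mahalanobis distance of every point from the centre of the data
diff = X - mu
d = np.sqrt(np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff))

# The most outlying individual under this metric
most_outlying = int(np.argmax(d))
```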
Interestingly, the Mahalanobis distance may be used for drawing tolerance ellipses, that is, curves joining the points that lie at a given Mahalanobis distance from the centre of the data, thus allowing us to easily identify outliers. For instance, returning to the example of human height and weight, it can be seen that the individual marked in red is actually the most outlying point once the overall shape of the dataset is taken into account.
In fact, one could understand the Mahalanobis distance as the multivariate alternative to the z-score. More precisely, ‘being at a Mahalanobis distance d from the centre’ is the multivariate equivalent of ‘being d standard deviations away from the mean’ in the univariate setting. Therefore, under certain assumptions, such as the data being obtained from a multivariate Gaussian distribution, it is possible to estimate the proportion of individuals lying inside and outside a tolerance ellipse. In the case above, we are representing a 95% tolerance ellipse, meaning that around 95% of the data points are expected to lie inside the ellipse if the data is obtained from a multivariate Gaussian distribution.
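Under the Gaussian assumption mentioned above, squared Mahalanobis distances follow a chi-squared distribution with as many degrees of freedom as there are variables, which yields the radius of the 95% tolerance ellipse. A small simulation, with made-up height/weight parameters, checks that roughly 95% of the points indeed fall inside:

```python
import numpy as np
from scipy.stats import chi2

# Radius of the 95% tolerance ellipse for p = 2 variables:
# squared Mahalanobis distances are chi-squared with p degrees of freedom
p = 2
cutoff = np.sqrt(chi2.ppf(0.95, df=p))

# Simulated Gaussian height/weight data (illustrative parameters)
rng = np.random.default_rng(2)
X = rng.multivariate_normal([170.0, 70.0], [[64.0, 28.0], [28.0, 25.0]], size=5000)

diff = X - X.mean(axis=0)
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
d = np.sqrt(np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff))

inside = np.mean(d <= cutoff)  # expected to be close to 0.95
```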
The identification of multivariate outliers becomes even more problematic as the number of dimensions increases because it is no longer possible to represent the data points in a scatterplot. In such a case, we should rely on two- or three-dimensional scatterplots for selected subsets of the variables or for new carefully constructed variables obtained from dimensionality reduction techniques. Quite conveniently, the Mahalanobis distance may still be used as a tool for identifying multivariate outliers in higher dimensions, even when it is no longer possible to draw tolerance ellipses. For this purpose, it is common to find graphics such as the one below, where the indices of the individuals in the dataset are plotted against their corresponding Mahalanobis distances. The blue dashed horizontal line represents the same level as that marked by the tolerance ellipse above. It is easy to spot the three individuals lying outside the tolerance ellipse by looking at the three points above the blue dashed horizontal line and, in particular, the individual marked in red is shown again to clearly stand out from the other data points.
As a drawback of this method for the identification of multivariate outliers, some authors have pointed out that the Mahalanobis distance is itself strongly influenced by the outliers. For instance, imagine that five additional individuals — also marked in red in the scatterplot below — are added to the dataset. The tolerance ellipse (in red) has now been broadened and contains the individual previously identified as the most outlying. To avoid this problem, we may replace the sample mean vector and the sample covariance matrix in the definition of the Mahalanobis distance by other alternatives that are not strongly influenced by the outliers. A popular option is the Minimum Covariance Determinant (MCD) estimator, which jointly estimates the mean vector and the covariance matrix and identifies a tolerance ellipse that is closer to the original ellipse (the blue one) than to the ellipse heavily influenced by the outliers (the red one).
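A rough sketch of this comparison using scikit-learn's MinCovDet; the data and the planted outliers below are made up for illustration:

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.default_rng(3)
# Illustrative data: 200 regular height/weight points plus 6 planted outliers
X = rng.multivariate_normal([170.0, 70.0], [[64.0, 28.0], [28.0, 25.0]], size=200)
planted = rng.multivariate_normal([200.0, 40.0], [[4.0, 0.0], [0.0, 4.0]], size=6)
X_all = np.vstack([X, planted])

# Classical (outlier-sensitive) vs robust (MCD) location/scatter estimates
emp = EmpiricalCovariance().fit(X_all)
mcd = MinCovDet(random_state=0).fit(X_all)

# Squared Mahalanobis distances of the planted outliers under each estimate:
# the robust fit makes the outliers stand out much more clearly
d2_emp = emp.mahalanobis(planted)
d2_mcd = mcd.mahalanobis(planted)
```

The intuition is that the MCD fit is driven by the most concentrated subset of the data, so the planted points cannot inflate the covariance estimate and shrink their own distances.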
Another potential drawback for the identification of multivariate outliers is the shape of the dataset, since the Mahalanobis distance only takes into account linear relationships between variables. More specifically, the Mahalanobis distance should not be used when there is clear evidence of several clusters of individuals in the data or, more generally, when the shape of the dataset is not roughly elliptical. In such cases, we may want to tap into different approaches, such as “depth-based” and “density-based” outlier detection techniques.
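As one example of a density-based alternative, scikit-learn's LocalOutlierFactor can flag an isolated point sitting between two clusters, a situation where a single tolerance ellipse would be misleading. The two-cluster dataset below is made up for illustration:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(4)
# Two tight, well-separated clusters plus one isolated point between them
cluster_a = rng.normal([0.0, 0.0], 0.5, size=(100, 2))
cluster_b = rng.normal([10.0, 10.0], 0.5, size=(100, 2))
isolated = np.array([[5.0, 5.0]])
X = np.vstack([cluster_a, cluster_b, isolated])

# Density-based detection: label -1 marks points flagged as outliers
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)
```

The isolated point has a far lower local density than its neighbours in either cluster, so it is flagged even though it lies close to the overall centre of the data.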
To summarise, in this article we have seen a prominent technique for outlier identification that should be performed as a data preprocessing task. Additionally, data analysts or data scientists may also be interested in reducing the influence of outliers on the resulting model by considering techniques that are less sensitive to the presence of outliers (for such purposes, the reader is directed to classic books on robust statistics). However, the study of outliers should not end there, since it is also important to ultimately analyse the influence of the outliers on the performance of the analytical model. More precisely, one must be careful with the so-called influential points, which are outliers that, when deleted from the dataset, noticeably change the resulting model and its outputs. Further analysis of the reasons why these influential points appear in the dataset must be performed, not only by data professionals but also by experts with deep domain knowledge of the problem at hand.