What are outliers?
An outlier is a data point that is extremely high or extremely low compared with its nearest neighbors and with the rest of the values in a dataset.
Outliers are extreme values that stand out greatly from the overall pattern of values in a dataset.
Formal Definition:
An outlier is an observation that lies far from, and diverges from, the overall pattern in a sample. Outliers in input data can skew and mislead the training process of machine learning algorithms, resulting in longer training times, less accurate models, and ultimately poorer results.
Image credit: datascience.foundation
Example: Suppose your class has 100 students. 99 of them score around 52% on a test, and 1 student scores 99.9%. That one student is the outlier: their score pulls the class average up, so the average no longer truly represents how the class performed.
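To make the classroom example concrete, here is a minimal sketch using Python's standard `statistics` module. The scores are hypothetical (exactly 52 for the 99 typical students), chosen only to mirror the example above.

```python
from statistics import mean, median

# Hypothetical scores: 99 students at 52% and one student at 99.9%.
scores = [52.0] * 99 + [99.9]

mean_without = mean(scores[:-1])   # average of the 99 typical students
mean_with = mean(scores)           # average once the outlier is included
median_with = median(scores)       # the median does not react to the extreme score

print(f"mean without outlier: {mean_without:.2f}")
print(f"mean with outlier:    {mean_with:.2f}")
print(f"median with outlier:  {median_with:.2f}")
```

The mean moves while the median stays put. The shift grows as the sample gets smaller or the outlier gets more extreme.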
Reasons for outliers in the data:
· Data entry errors (human errors)
· Measurement errors (instrument errors)
· Experimental errors (data extraction or experiment planning/executing errors)
· Intentional (dummy outliers made to test detection methods)
· Data processing errors (data manipulation errors)
· Sampling errors (extracting or mixing data from wrong or various sources)
· Natural (not an error, novelties in data)
When are Outliers Dangerous?
This is an interesting question, because sometimes outliers are useful, as in anomaly detection algorithms, where you need the outliers in order to detect suspicious activity.
Example: If you want to detect suspicious credit card transactions, you have to build your model on data that includes outliers, because a suspicious transaction looks a little different from a normal one. In this case you have to keep the outliers in the dataset, since your model depends on them to catch suspicious transactions.
The most troublesome part is that outliers are easy to find in a dataset, but it is hard to know what to do with them. You can keep them, remove them, or change each outlier's value to something more in line with the rest of your data. So you have to be very careful while handling outliers.
Effects of Outliers on Machine Learning Algorithms
With some algorithms you have to be very careful with outliers, for example Linear Regression, Logistic Regression, AdaBoost, and Deep Learning. What these algorithms have in common is that they learn weights from the data.
So, as a rule of thumb, to know whether outliers will impact a model, check whether the algorithm is weight-based.
Tree-based algorithms (decision tree, random forest, etc.) are much less affected by outliers than the weight-based algorithms discussed above.
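As a rough illustration of why weight-based models are sensitive, the sketch below fits a one-variable least-squares line twice: once on clean data and once with a single outlier. The helper `fit_slope` and the data are hypothetical, written only for this example.

```python
def fit_slope(xs, ys):
    """Return the least-squares slope (the learned weight) of y on x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

xs = list(range(1, 11))
clean_ys = [2 * x for x in xs]         # exact relationship y = 2x
outlier_ys = clean_ys[:-1] + [100]     # last point replaced by an outlier

clean_slope = fit_slope(xs, clean_ys)
outlier_slope = fit_slope(xs, outlier_ys)

print(f"slope without outlier: {clean_slope:.2f}")
print(f"slope with outlier:    {outlier_slope:.2f}")
```

One bad point out of ten is enough to change the learned weight substantially, which then affects the prediction for every other point.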
How to treat Outliers?
Trimming
Trimming means removing the outliers from your data. The drawback is that if there are many outliers and you remove them all, your dataset becomes thin (you are left with less data). The advantage is that it is very fast: you only have to drop the outliers.
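A minimal sketch of trimming in plain Python. The data and the limits (0 and 100) are assumptions for illustration; in practice the limits would come from one of the detection techniques.

```python
# Hypothetical ages with two implausible entries (150 and -3).
ages = [22, 25, 27, 24, 150, 26, -3, 23]

# Assumed limits, chosen by hand for this example.
lower_limit, upper_limit = 0, 100

# Keep only the rows that fall inside the limits.
trimmed = [age for age in ages if lower_limit <= age <= upper_limit]
removed = len(ages) - len(trimmed)

print(trimmed)
print(f"rows removed: {removed}")
```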
Capping
The thing to remember here is that outliers always sit at the lowest or highest end of your data, never in the middle. So you set a limit on each side (lower and upper): any value below the lower limit or above the upper limit is treated as an outlier. Each outlier is then replaced with the limit value: low outliers get the lower limit and high outliers get the upper limit.
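Capping can be sketched the same way. The helper `cap` and the limits below are hypothetical and picked by hand for the example.

```python
def cap(value, lower_limit, upper_limit):
    """Clamp a value into the range [lower_limit, upper_limit]."""
    return max(lower_limit, min(value, upper_limit))

# Hypothetical ages with two implausible entries (150 and -3).
ages = [22, 25, 27, 24, 150, 26, -3, 23]

# Assumed limits, chosen by hand for this example.
capped = [cap(age, 0, 100) for age in ages]

print(capped)
```

Unlike trimming, the dataset keeps its original length; only the extreme values change.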
There are a few more treatment methods. One is to treat the outliers as missing values and then handle them accordingly. Another is discretization, which means taking a numeric column and splitting it into ranges: if you have numbers from 1–100 you can create ranges like [1–10], [11–20], and so on, and you can also give each range a name.
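Discretization into fixed-width ranges could look like the sketch below. The function name `to_range_label` and the width of 10 are assumptions that simply follow the 1–100 example above.

```python
def to_range_label(value, width=10):
    """Map an integer from 1-100 to a range label such as '11-20'."""
    start = ((value - 1) // width) * width + 1
    return f"{start}-{start + width - 1}"

values = [1, 10, 11, 57, 100]
labels = [to_range_label(v) for v in values]

print(labels)
```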
How to detect Outliers?
If you suspect there are outliers in your data, how do you confirm it? There are three common approaches, and which one to use depends on how the column is distributed.
Normal Distribution
If the column you are working with is normally distributed, detection is simple: roughly 68% of the values lie within 1 standard deviation of the mean, roughly 95% within 2 standard deviations, and about 99.7% within 3 standard deviations. So any data point above (mean + 3 standard deviations) or below (mean - 3 standard deviations) is considered an outlier.
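These percentages can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8+). The helper `share_within` is a hypothetical name used only in this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```

Only about 0.3% of normally distributed values fall outside 3 standard deviations, which is why that boundary is a common cutoff.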
Skewed Distribution
If the column you are working with is not normally distributed but skewed, use the interquartile range (IQR), which is essentially what a box plot shows.
Techniques for Outlier Detection and Removal
Z-Score Technique
This technique assumes that the column you are working with is normally distributed.
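A minimal sketch of the z-score technique using only the standard library. The data is hypothetical, and the cutoff of 3 standard deviations follows the rule described for normally distributed columns.

```python
from statistics import mean, stdev

# Hypothetical, roughly symmetric measurements with one extreme value (120).
values = [50, 51, 49, 52, 48, 50, 51, 49, 50, 52,
          48, 50, 51, 49, 50, 52, 48, 50, 51, 49, 120]

mu = mean(values)
sigma = stdev(values)            # sample standard deviation

lower_limit = mu - 3 * sigma
upper_limit = mu + 3 * sigma

# z-score: how many standard deviations each value is from the mean.
z_scores = [(v - mu) / sigma for v in values]
outliers = [v for v in values if v < lower_limit or v > upper_limit]

print(f"limits: {lower_limit:.2f} to {upper_limit:.2f}")
print(f"outliers: {outliers}")
```

Note that in very small samples a single point can never reach a z-score of 3, because the outlier itself inflates the standard deviation, so this technique needs a reasonable number of rows.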
With this technique the lower limit of your column is (mean - 3 standard deviations) and the upper limit is (mean + 3 standard deviations). Once you have these limits, you can decide whether to remove your outliers (trimming) or replace them with the limit values (capping).
IQR and Box-Plot Method
Use this method when the column you are working with is skewed. To use it, you first need to know what a box plot and the IQR are. A box plot is built from percentiles: the 25th percentile (Q1), the 50th percentile (the median), and the 75th percentile (Q3). The IQR is the distance between Q3 and Q1.
In this method you compute the IQR and then derive the lower and upper limits of the data from it, conventionally Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. You can also see the outliers visually by drawing a box plot.
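The limit calculation can be sketched with `statistics.quantiles` from the standard library. The data is hypothetical, the 1.5 multiplier is the conventional box-plot whisker rule rather than something this article specifies, and the plot itself is left out.

```python
from statistics import quantiles

# Hypothetical right-skewed values with one extreme entry (160).
values = [18, 20, 21, 22, 22, 23, 24, 25, 26, 28, 30, 33, 37, 45, 160]

# quantiles(..., n=4) returns the three quartile cut points.
q1, q2, q3 = quantiles(values, n=4)
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_limit or v > upper_limit]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"limits: {lower_limit} to {upper_limit}")
print(f"outliers: {outliers}")
```

Different libraries interpolate quartiles slightly differently, so the exact limits may vary a little between tools.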
Percentile Treatment
This is a very simple technique in which you choose a percentile cutoff. Make sure the cutoff is symmetric: whatever share of the data you leave out at the upper end, leave out the same share at the lower end.
For example, if you leave out the top 1 percentile (a commonly used choice), leave out the bottom 1 percentile as well.
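A minimal sketch of percentile treatment with a symmetric 1% cutoff. The `percentile` helper is hypothetical and uses simple linear interpolation; the data is made up so that one low and one high value stand out.

```python
import math

def percentile(sorted_values, p):
    """Linear-interpolation percentile (p from 0 to 100) of a sorted list."""
    k = (len(sorted_values) - 1) * p / 100
    low, high = math.floor(k), math.ceil(k)
    if low == high:
        return sorted_values[low]
    fraction = k - low
    return sorted_values[low] + (sorted_values[high] - sorted_values[low]) * fraction

# Hypothetical data: 99 ordinary values plus one very low and one very high value.
values = [5] + list(range(100, 199)) + [900]
ordered = sorted(values)

lower_limit = percentile(ordered, 1)     # 1st percentile
upper_limit = percentile(ordered, 99)    # 99th percentile

trimmed = [v for v in values if lower_limit <= v <= upper_limit]
capped = [min(max(v, lower_limit), upper_limit) for v in values]

print(f"limits: {lower_limit} to {upper_limit}")
print(f"rows kept after trimming: {len(trimmed)} of {len(values)}")
```

The same limits serve both treatments: `trimmed` drops the extreme rows, while `capped` keeps every row and pulls the extremes in to the limits.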
Frequently Asked Questions (FAQs)
1. What are outliers in machine learning?
Outliers are data points that significantly deviate from the majority of the data. They can be caused by errors, anomalies, or simply rare events.
2. Why are outliers problematic for machine learning models?
Outliers can negatively impact the performance of machine learning models in several ways:
· Overfitting: Models can focus on fitting the outliers rather than the underlying patterns in the majority of the data.
· Reduced accuracy: Outliers can pull the model’s predictions towards themselves, leading to inaccurate predictions for other data points.
· Unstable models: The presence of outliers can make the model’s predictions sensitive to small changes in the data.
3. How can outliers be detected?
There are several methods for detecting outliers, including:
· Distance-based measures: These measures, like Z-score and interquartile range (IQR), calculate the distance of a data point from the center of the data distribution.
· Visualization techniques: Techniques like boxplots and scatter plots can visually identify data points that lie far away from the majority of the data.
· Clustering algorithms: Clustering algorithms can automatically group similar data points, isolating outliers as separate clusters.
4. How can we handle outliers?
There are several approaches to handling outliers in machine learning:
· Removing outliers: This is a simple approach but can lead to information loss.
· Clipping: Outliers are capped to a certain value instead of being removed completely.
· Transformation: Data can be transformed to reduce the impact of outliers, such as using log transformations for skewed data.
· Robust models: Certain models are less sensitive to outliers, such as decision trees and support vector machines.
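The transformation option can be sketched as follows. The income figures are hypothetical, and a base-10 logarithm is just one possible choice; it only works for strictly positive values.

```python
import math

# Hypothetical incomes with one very large value.
incomes = [30_000, 45_000, 52_000, 61_000, 5_000_000]

# Log transform compresses large values much more than small ones.
log_incomes = [math.log10(x) for x in incomes]

raw_spread = max(incomes) / min(incomes)
log_spread = max(log_incomes) / min(log_incomes)

print(f"max/min before transform: {raw_spread:.1f}")
print(f"max/min after transform:  {log_spread:.2f}")
```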
Conclusion
Throughout this exercise, we saw how you can encounter unusual data, i.e. outliers, during the data analysis phase. We learned about techniques that can be used to detect and remove those outliers.