Outliers are not as complex as they look; you only need to understand what they are and how they are identified in A level maths. Like other statistical concepts, outliers have their own peculiarities, and there are standard rules for identifying them, which makes the calculation straightforward in most situations.
These extreme values should not be removed without justification, because they may form a critical part of the distribution. Removing them without careful consideration risks invalidating your statistical results.
This page has everything worth knowing about outliers. It examines what outliers represent, how they are calculated and their usefulness for data cleaning.
What Are Outliers?
Outliers are extreme data values that do not conform to the general pattern of a data set. They often result from mistakes made during data collection, or they may be genuine but unusual observations. Either way, they are data points that lie far from the rest of the distribution. For instance, in the data set below, 2 and 210 are outliers:
2, 98, 99, 101, 101, 102, 103, 105, 107, 210
2 is extremely low compared to the next lowest number in the distribution, while 210 is extremely high compared to the next highest value. A real-life example would be Pinocchio in a class of children: the length of his nose is an outlier because it is far longer than those of all the other children in the class.
Outliers affect the summary statistics calculated from a data set. For instance, a single very large value can inflate the mean while barely changing the median. The mean and the range are the measures most distorted by outliers, whereas the median and the mode are largely unaffected.
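The effect on summary statistics can be checked with a short Python sketch. It uses the data set from this page; the variable names are illustrative, and the comparison simply adds the high value 210 back to the eight central values.

```python
from statistics import mean, median

# The eight central values from the example data set on this page.
central = [98, 99, 101, 101, 102, 103, 105, 107]

# The same values with the high outlier 210 included.
with_high_outlier = central + [210]

for label, values in [("without 210", central), ("with 210", with_high_outlier)]:
    spread = max(values) - min(values)  # the range of the data
    print(label, "mean:", mean(values), "median:", median(values), "range:", spread)
```

In this sketch the mean moves from 102 to 114 and the range from 9 to 112, while the median only moves from 101.5 to 102.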
It's important to include outlier data when calculating the mean or range of a data set unless the question states otherwise.
However, outliers can only be included in or excluded from a calculation once they have been identified. Note that an outlier should only be removed from a data set if it appears to be an erroneous value. There must be a good reason for eliminating an outlier; otherwise, the results drawn from the data may be distorted.
How Are Outliers Calculated?
Outliers are usually identified in one of two ways: using the quartiles and the interquartile range, or using the mean and the standard deviation.
With the first method, an outlier is a value that lies beyond the lower quartile (Q1) or the upper quartile (Q3) by more than a set multiple of the interquartile range. A value is an outlier if it is:
- Less than Q1 − k × (interquartile range), or
- Greater than Q3 + k × (interquartile range)
- Where k is a constant usually given by the examiner. The most common value is k = 1.5.
The steps involved in finding outliers using the interquartile range are as follows:
- Arrange the numbers in the data set in order.
- Find the 25th percentile (Q1) and the 75th percentile (Q3), then subtract Q1 from Q3 to get the interquartile range (IQR).
- Multiply the interquartile range by 1.5.
- Add the result to Q3 to get the upper boundary, and subtract it from Q1 to get the lower boundary.
- Compare the ordered data with these two boundaries.
- Any value above the upper boundary or below the lower boundary is an outlier.
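The interquartile range steps can be sketched in Python as follows. This is a minimal illustration rather than a definitive method: the function names are hypothetical, and the quartiles are taken as the medians of the lower and upper halves of the ordered data, which is only one of several conventions (your exam board or calculator may use a slightly different one).

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: for an odd number of values, the middle value is left
    out of both halves. Other quartile conventions exist.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]
    return median(lower_half), median(upper_half)

def iqr_outliers(values, k=1.5):
    """Return (lower boundary, upper boundary, outliers) for the k x IQR rule."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

data = [2, 98, 99, 101, 101, 102, 103, 105, 107, 210]
lower, upper, outliers = iqr_outliers(data)
print(lower, upper, outliers)
```

For the example data set on this page, Q1 = 99 and Q3 = 105 under this convention, so the interquartile range is 6, the boundaries are 90 and 114, and the values flagged are 2 and 210.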
Notably, these formulas give two distinct boundary values, often called fences. Alternatively, outliers can be identified using the mean and the standard deviation of the distribution. With this method, a value is an outlier if it is:
- Less than x̄ − kσ, or
- Greater than x̄ + kσ
- Where x̄ is the mean, σ is the standard deviation and k is a constant usually given by the examiner. The most common value is k = 2.
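The mean and standard deviation rule can be sketched in the same way. The function name is hypothetical, and the sketch assumes the population standard deviation (dividing by n); if your course uses the sample version (dividing by n − 1), `statistics.stdev` could be swapped in.

```python
from statistics import mean, pstdev

def sd_outliers(values, k=2):
    """Return (lower boundary, upper boundary, outliers) for the mean ± k sd rule.

    Assumption: the population standard deviation (pstdev) is used.
    """
    centre = mean(values)
    sd = pstdev(values)
    lower = centre - k * sd
    upper = centre + k * sd
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

data = [2, 98, 99, 101, 101, 102, 103, 105, 107, 210]
lower, upper, outliers = sd_outliers(data)
print(round(lower, 1), round(upper, 1), outliers)
```

For the example data set, the mean is 102.8 and the standard deviation is roughly 46.6, so the boundaries are roughly 9.6 and 196.0 and both 2 and 210 are flagged. Note that the outliers themselves enlarge the standard deviation, so this rule can be less sensitive than the interquartile range rule.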
More advanced methods for finding outliers, which go beyond A level maths, include Peirce's criterion, the modified Thompson tau test, Dixon's Q test, Grubbs' test and the generalized ESD test.
How Are Outliers Represented on Box Plots?
On a box plot, outliers are plotted individually, usually as crosses, beyond the ends of the whiskers. If the minimum or maximum value of the data turns out to be an outlier, the whisker should not extend to it, so a new end point must be chosen for that whisker.
The whisker is then drawn to the next data value that is not an outlier, that is, the lowest or highest value inside the outlier boundaries. Where the individual data values are unknown, the whisker is drawn to the outlier boundary itself.
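As a rough illustration, the whisker ends for the example data set can be found with a few lines of Python. The boundaries 90 and 114 are assumed here: they come from the 1.5 × interquartile range rule with Q1 = 99 and Q3 = 105, and the variable names are illustrative.

```python
data = [2, 98, 99, 101, 101, 102, 103, 105, 107, 210]

# Assumed outlier boundaries from the 1.5 x IQR rule (Q1 = 99, Q3 = 105).
lower_boundary, upper_boundary = 90, 114

ordered = sorted(data)

# Values outside the boundaries are plotted individually as outliers.
outliers = [v for v in ordered if v < lower_boundary or v > upper_boundary]

# The whiskers stop at the most extreme values that are not outliers.
inside = [v for v in ordered if lower_boundary <= v <= upper_boundary]
whisker_low, whisker_high = inside[0], inside[-1]

print("whiskers:", whisker_low, whisker_high, "outliers:", outliers)
```

Here the whiskers would end at 98 and 107, with 2 and 210 marked separately.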
When Should Outliers Be Cleaned?
Data cleaning is the process of dealing with erroneous or missing values in a data set, which can include removing outliers that are found to be errors.
It's important to consider the context of an outlier critically before removing it. For instance, a salary far higher than those of the other employees would likely belong to the CEO of the company. By contrast, a student's test score above 100% would likely be a marking error.
Where an outlier is considered erroneous, like the student scoring over 100%, it should be removed from the data set. However, it should not be removed if it is a genuine part of the data set, as in the case of the CEO's salary.
That is to say, where an outlier is not an error, it should not be removed. The only justification for removing an outlier is that it constitutes an error in the data set.
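The distinction can be illustrated with a small Python sketch. The scores and salaries below are made up for illustration, and the function name is hypothetical; the point is that values are removed only because they are impossible in context, not merely because they are extreme.

```python
def remove_impossible_scores(scores, lowest=0, highest=100):
    """Split percentage scores into (kept, removed).

    A score is removed only if it lies outside the possible range,
    which indicates a recording or marking error.
    """
    kept = [s for s in scores if lowest <= s <= highest]
    removed = [s for s in scores if s < lowest or s > highest]
    return kept, removed

# Hypothetical test scores: 150% cannot be a genuine result.
scores = [55, 62, 71, 150, 68, 90]
kept, removed = remove_impossible_scores(scores)
print(kept, removed)

# Hypothetical salaries: 250000 is extreme but plausible (the CEO),
# so it stays in the data set.
salaries = [28000, 31000, 30000, 250000]
```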
As this page has shown, outliers are not difficult to find, especially if the data set is small, and there are more advanced procedures for large data sets. They should only be removed if they are errors; otherwise, they are part of the data set and must be included in the calculations.
The key to working with outliers is to understand the context that has led to their appearance in the data set. If the context is ignored and outliers are removed arbitrarily, the statistical results may be distorted.
Take your time to look at the distribution to understand the context. Don't jump to the conclusion that outlier values are unimportant and should be removed. Several ways of identifying outliers have been provided in this piece; practice them to get the hang of finding outliers accurately.
This post was updated on 01 Aug, 2023.