Outliers Outliers Out!!

Identify and Remove the Outliers

Does your data contain outliers? Which samples are they, and how can you remove them from your data? We will work through these questions with an example in R.

The library we use is “metan”. Install the package if required [use the code install.packages("metan")] and load it.

library(metan)

Now set the working directory. I have created a folder named Blog on drive E; use your own folder here. Remember to separate folders and subfolders with either / or \\, and not with a single \.

setwd("E:\\Blog")

I have created a sample dataset with four variables and 28 samples in .csv format and named it “outlier.csv”. The data table is given in the last section of the blog.

Read the data and look at its first few rows. The function head() shows the first 6 rows of the data by default.

data <- read.csv("outlier.csv", header = TRUE)
head(data)

##    pH   EC   OC   BD
## 1 7.2 1.80 1.60 1.19
## 2 7.2 0.08 0.40 1.14
## 3 7.0 2.10 0.40 1.41
## 4 7.1 0.34 0.44 1.48
## 5 7.2 0.36 0.40 1.46
## 6 7.4 0.38 0.06 1.46

 

Now we can use the function inspect() to get an overview of the data: the variable names, whether there are any missing values, the number of rows, the minimum, median, and maximum of each variable, and the number of possible outliers in each variable.

inspect(data, plot = FALSE)

## # A tibble: 4 x 9
##   Variable Class   Missing Levels Valid_n   Min Median   Max Outlier
##   <chr>    <chr>   <chr>   <chr>    <int> <dbl>  <dbl> <dbl>   <dbl>
## 1 pH       numeric No      -           28  6.4    7.1    7.8       0
## 2 EC       numeric No      -           28  0.08   0.21   2.1       2
## 3 OC       numeric No      -           28  0.05   0.7    2.1       1
## 4 BD       numeric No      -           28  1.14   1.47   1.9       3

 

So, in the data, there are four variables namely “pH”, “EC”, “OC”, and “BD”.

All variables are numeric.

There are no missing values.

There are 28 valid samples for all the variables.

The numbers of possible outliers in the variables “EC”, “OC”, and “BD” are 2, 1, and 3, respectively. The variable “pH” has none.
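
If you want to see where such counts can come from, here is a minimal sketch in base R. It assumes the usual box-plot rule (values lying more than 1.5 times the box length beyond the box), which is what boxplot.stats() applies; I am assuming, not asserting, that inspect() counts outliers the same way. The data frame toy and the helper count_outliers() are made up for illustration and are not part of metan or of outlier.csv.

```r
# Toy data (hypothetical values, NOT the blog's outlier.csv)
toy <- data.frame(
  a = c(7.0, 7.1, 7.2, 7.1, 7.3, 7.2, 7.0, 7.4),
  b = c(0.20, 0.21, 0.19, 0.22, 0.20, 0.23, 0.21, 2.10)
)

# Hypothetical helper: number of values flagged by the box-plot rule
count_outliers <- function(x) length(boxplot.stats(x)$out)

# Apply the helper to every column of the data frame
n_out <- sapply(toy, count_outliers)
print(n_out)
```

Unlike boxplot(), the function boxplot.stats() does the calculation without drawing anything, so it is handy when you only need the numbers.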

Notice that the plot argument is switched off in the code above. If you switch it on (plot = TRUE), you will also get a correlation matrix plot along with the above statistics.

 

Now let us find the outliers for a single variable. We use the function find_outliers().

olOC <- find_outliers(data, var = OC, plots = TRUE)

## Trait: OC
## Number of possible outliers: 1
## Line(s): 28
## Proportion: 3.7%
## Mean of the outliers: 2.1
## Maximum of the outliers: 2.1  | Line 28
## Minimum of the outliers: 2.1  | Line 28
## With outliers:    mean = 0.827 | CV = 56.007%
## Without outliers: mean = 0.78 | CV = 50.999%

 

The result shows that there is a single possible outlier in the variable OC, and that it is in line number 28. As there is only one outlier, the maximum and the minimum of the outliers are the same.

It also shows the mean and the CV of the data with and without the outliers.

Further, as I have used plots = TRUE, it has produced box-plots and histograms with and without the outliers.


Now, how do we remove the identified outlier(s)?

First we will select the outliers. Here I have created a variable out.OC which will contain the outliers present in the variable OC of the data. The outer parentheses make R print the result as well.

(out.OC <- boxplot(data$OC)$out)


## [1] 2.1

So, it is the value 2.1 for OC in the data.
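
To make clear what boxplot() flags as an outlier, the limits can also be worked out by hand. The following is a small sketch under stated assumptions: the vector x is made up for illustration (it is not the OC column of outlier.csv), and the variable names are my own. Base R builds the box from the hinges returned by fivenum(), which can differ slightly from the quartiles given by quantile().

```r
# Hypothetical values, used only to illustrate the rule
x <- c(0.40, 0.44, 0.40, 0.06, 0.70, 0.90, 1.10, 0.80, 0.65, 2.90)

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum
h <- fivenum(x)
lower_hinge <- h[2]
upper_hinge <- h[4]
box_length  <- upper_hinge - lower_hinge

# Anything beyond 1.5 box lengths from the box is flagged
lower_limit <- lower_hinge - 1.5 * box_length
upper_limit <- upper_hinge + 1.5 * box_length
manual_out  <- x[x < lower_limit | x > upper_limit]

print(manual_out)
# Compare with what base R reports for the same vector
print(all.equal(manual_out, boxplot.stats(x)$out))
```

The multiplier 1.5 is the default of the coef argument in both boxplot.stats() and boxplot(); a larger value flags fewer points.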

Now we can remove the outlier from the data.

OC <- data[!(data$OC %in% out.OC), ]

In the above line of code, we have created a new data frame named OC as a subset of the old data, which excludes the rows having the outliers of the variable OC.
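
The removal step can also be tried on a small self-contained example. This is only a sketch: the data frame toy and all the names in it are made up for illustration. It also shows why a logical index with ! is a safer choice than -which() for this job: when no outlier is found, -which() selects nothing and returns an empty data frame.

```r
# Toy data (hypothetical values, NOT the blog's outlier.csv)
toy <- data.frame(id = 1:6,
                  OC = c(0.40, 0.44, 0.40, 0.42, 0.41, 2.10))

# Outlying values by the box-plot rule, computed without drawing a plot
out_vals <- boxplot.stats(toy$OC)$out

# Keep only the rows whose OC value is not among the outliers
cleaned <- toy[!(toy$OC %in% out_vals), ]
print(cleaned)

# Edge case: suppose no outlier had been found
no_out <- numeric(0)
rows_logical <- nrow(toy[!(toy$OC %in% no_out), ])     # all rows are kept
rows_which   <- nrow(toy[-which(toy$OC %in% no_out), ]) # every row is lost
print(c(rows_logical, rows_which))
```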

Now again we can check if there are still some outliers in the data.

boxplot(OC$OC)$out



## numeric(0)

The answer is no [numeric(0)].

This can be visualized by the box-plot above.

 

This was only about univariate outliers. We can also identify and remove the bivariate and multivariate outliers. To understand various methods to identify and remove univariate, bivariate, and multivariate outliers from the data, you can consult the following books.




The data used




Dr. Nirmal Kumar
Sr. Scientist
Division of Remote Sensing Applications
ICAR-National bureau of Soil Survey and Land Use Planning


 

 

 
