My Understanding of Data Cleaning

ifeelfree
Nov 28, 2020

I use this post to summarize my data cleaning experience.

Part 1: What is data cleaning?

Data cleaning is the process of identifying the incorrect, incomplete, inaccurate, irrelevant or missing parts of the data and then modifying, replacing or deleting them as necessary.

Part 2: Data cleaning scenario 1: missing data

(1) Drop the missing data

  • Drop any row that is missing a value in at least one column (how='any'), or only the rows that are missing values in every column (how='all'):
dataframe.dropna(how='any')
dataframe.dropna(how='all')

Before and after removing missing data, it is a good idea to call info() on the DataFrame to see how the missing values are distributed across its columns.
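As a minimal sketch of the two dropna modes and the before/after check, assuming pandas and NumPy are installed (the toy columns "age" and "city" are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration:
# row 0 is complete, rows 1 and 2 are partly missing, row 3 is entirely missing.
dataframe = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, np.nan],
    "city": ["Paris", "Lyon", None, None],
})

# Count missing values per column before cleaning.
missing_before = dataframe.isna().sum()

# dropna returns a new DataFrame; the original is left unchanged.
dropped_any = dataframe.dropna(how="any")  # keeps only fully populated rows
dropped_all = dataframe.dropna(how="all")  # drops only rows where every column is missing

dataframe.info()    # non-null counts per column, before
dropped_any.info()  # non-null counts per column, after
```

Note that how='any' can discard a large share of the rows when missing values are spread across many columns, so comparing the row counts before and after is worthwhile.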

(2) Imputation

  • mean
  • median
  • mode (categorical data or a variable with outliers)
  • impute 0, a very small number, or a very large number to differentiate missing values from other values
  • use k-nearest neighbors (KNN) to impute values from the samples whose other features are most similar
  • Last Observation Carried Forward (LOCF) & Next Observation Carried Backward (NOCB)
    This is a common statistical approach to the analysis of longitudinal repeated-measures data where some follow-up observations may be missing. Longitudinal data track the same sample at different points in time. Both methods can introduce bias into the analysis and perform poorly when the data have a visible trend.
  • Linear Interpolation
    This method works well for a time series with some trend but is not suitable for seasonal data
  • Seasonal Adjustment + Linear Interpolation
    This method works well for data with both trend and seasonality
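Several of the options above can be sketched in pandas as follows. This is only an illustration on invented toy data (the columns "income" and "color" and the sentinel value -1 are assumptions), and it leaves out KNN imputation and seasonal adjustment, which usually need additional libraries:

```python
import numpy as np
import pandas as pd

# Toy columns invented for illustration.
df = pd.DataFrame({
    "income": [10.0, 20.0, np.nan, 90.0],
    "color": ["red", "blue", "red", None],
})

# Mean and median imputation for a numeric column.
mean_filled = df["income"].fillna(df["income"].mean())
median_filled = df["income"].fillna(df["income"].median())

# Mode imputation for a categorical column (mode() ignores missing values).
mode_filled = df["color"].fillna(df["color"].mode()[0])

# Constant imputation with a sentinel that cannot be confused with real values.
sentinel_filled = df["income"].fillna(-1)

# A short time-ordered series with a gap in the middle.
series = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0])

locf = series.ffill()                         # Last Observation Carried Forward
nocb = series.bfill()                         # Next Observation Carried Backward
linear = series.interpolate(method="linear")  # straight line across the gap
```

On this series, LOCF and NOCB fill the gap with a flat segment (1.0 or 4.0), while linear interpolation follows the trend, which is why the carried-observation methods do poorly on trending data.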

(3) ML algorithms that support missing data

Some implementations of machine learning algorithms, such as gradient-boosted decision trees, can handle missing values directly, so no imputation step is needed.

Part 3: Data cleaning scenario 2: outliers

Outliers are extreme values that deviate from the other observations in the data. They may indicate variability in a measurement, experimental error, or a novelty. In other words, an outlier is an observation that diverges from the overall pattern in a sample.

(1) Using ML models to remove outliers

Use the trained model to evaluate every image in the dataset and rank the images by their loss. The images with the highest loss are the most likely to be mislabeled or otherwise problematic, so check each of them and decide whether it should be deleted. Doing this comfortably requires a good UI, and ipywidgets can be used to build one.
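The ranking step can be sketched with NumPy alone. The helper rank_by_loss below is hypothetical, and the sketch assumes a classifier that outputs one probability per class and a cross-entropy loss; the probabilities and labels are invented for illustration, and the review UI is not shown:

```python
import numpy as np

def rank_by_loss(probabilities, labels, top_k=5):
    """Return the indices of the top_k highest-loss samples and all per-sample losses."""
    probabilities = np.asarray(probabilities, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Probability the model assigned to each sample's labelled class.
    p_true = probabilities[np.arange(len(labels)), labels]
    # Per-sample cross-entropy loss; clipping avoids log(0).
    losses = -np.log(np.clip(p_true, 1e-12, 1.0))
    # Sort from highest loss to lowest.
    order = np.argsort(-losses)
    return order[:top_k], losses

# Invented predictions for four samples and two classes.
probs = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.95, 0.05],  # the model is confident in class 0, but the label says 1
    [0.60, 0.40],
])
labels = np.array([0, 1, 1, 0])

suspects, losses = rank_by_loss(probs, labels, top_k=2)
print(suspects)  # indices of the samples to review first
```

The returned indices are only candidates for manual review; a high loss can also mean the sample is simply hard, not wrong.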
