My Understanding of Exploratory Data Analysis

ifeelfree
Nov 30, 2020

Exploratory data analysis (EDA), pioneered by John Tukey, set a foundation for the field of data science. The key idea of EDA is that the first and most important step in any project based on data is to look at the data. By summarizing and visualizing the data, you can gain valuable intuition and understanding of the project. Exploratory analysis should be a cornerstone of any data science project.

· Part 1: Location Estimator
· Part 2: Variability Estimator
· Part 3: Distribution Estimator
· Part 4: Correlation
· Part 5: Contingency Table
· Part 6: More Variables (more than 2)
· Part 7: Feature Visualization
· Part 8: Reference

Part 1: Location Estimator

Trimmed mean

It is defined as the average of all values after dropping a fixed number of extreme values.

In the formula, the values x must be sorted first: with x_(1) ≤ x_(2) ≤ … ≤ x_(n) and p values dropped from each end, the trimmed mean is (x_(p+1) + x_(p+2) + … + x_(n-p)) / (n - 2p). A trimmed mean eliminates the influence of extreme values.
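As a quick illustration (the data below is made up and SciPy is not used elsewhere in this post), scipy.stats provides a trimmed mean directly:

import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(np.mean(x))               # 14.5, pulled up by the outlier 100
print(stats.trim_mean(x, 0.1))  # 5.5, drops the smallest and largest value before averaging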

Weighted mean/median

The weighted mean is the sum of each value times its weight, divided by the sum of the weights. There are two main reasons for using a weighted mean or median:

  • Some values are intrinsically more variable than others, and highly variable observations are given a lower weight. For example, if we are taking the average from multiple sensors and one of the sensors is less accurate, then we might down-weight the data from that sensor.
  • The data collected does not equally represent the different groups that we are interested in measuring. For example, because of the way an online experiment was conducted, we may not have a set of data that accurately reflects all groups in the user base. To correct that, we can give a higher weight to the values from the groups that were underrepresented.

Weighted median is calculated in this way: instead of the middle number, the weighted median is a value such that the sum of the weights is equal for the lower and upper halves of the sorted list. Like the median, the weighted median is robust to outliers.

The implementation is as follows (using the Python packages wquantiles and numpy; state is the state murder-rate data set used in the reference book):

import numpy as np
import wquantiles

np.average(state['Murder.Rate'], weights=state['Population'])
wquantiles.median(state['Murder.Rate'], weights=state['Population'])
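For intuition, here is a minimal hand-rolled weighted median that follows the definition above (a hypothetical helper for illustration, not the wquantiles implementation):

import numpy as np

def weighted_median(values, weights):
    # sort the values (carrying their weights along), then return the value
    # at which the cumulative weight first reaches half of the total weight
    order = np.argsort(values)
    values = np.asarray(values)[order]
    weights = np.asarray(weights)[order]
    cum_weights = np.cumsum(weights)
    return values[np.searchsorted(cum_weights, weights.sum() / 2.0)]

print(weighted_median([1, 2, 3, 4], [1, 1, 1, 10]))  # 4: the heavily weighted value dominates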

Summary

  • The basic metric is the mean.
  • While robust estimators (median, trimmed mean, weighted mean/median) are valid for small data sets, they do not provide added benefit for large or even moderately sized data sets.
  • Statisticians and data scientists use different terms for the same thing. Statisticians use the term estimate while data scientists use the term metric for location statistics.
  • Location and variability are referred to as the first and second moments of a distribution. The third and fourth moments are called skewness and kurtosis. Skewness refers to whether the data is skewed toward larger or smaller values; kurtosis indicates the propensity of the data to have extreme values (see the short example after this list).
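For reference, pandas exposes both higher moments directly; a minimal sketch on made-up data:

import numpy as np
import pandas as pd

s = pd.Series(np.random.exponential(size=1000))  # right-skewed sample data
print(s.skew())  # third moment: positive for a right-skewed distribution
print(s.kurt())  # fourth moment (excess kurtosis): larger when extreme values are more likely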

Part 2: Variability Estimator

Deviation: the difference between the observed values and the estimate of location.

Variance: the sum of squared deviations from the mean divided by n - 1, where n is the number of data values. Dividing by n - 1 rather than n gives an unbiased estimate of the variance. There are n - 1 degrees of freedom because of one constraint: the deviations are measured from the sample mean, which is itself computed from the data. For most problems, data scientists do not need to worry about degrees of freedom.
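The n versus n - 1 choice is just the ddof argument in numpy (a quick sketch on random data):

import numpy as np

x = np.random.randn(100)
print(np.var(x, ddof=0))  # divides by n (numpy's default)
print(np.var(x, ddof=1))  # divides by n - 1, the unbiased sample variance (pandas' .var() default)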

Mean absolute deviation: the mean of the absolute values of the deviations from the mean. It is also known as the l1-norm or Manhattan norm.

Median absolute deviation from the median (MAD): the median of the absolute values of the deviations from the median. MAD is usually scaled by the factor 1.4826 so that it is comparable to the standard deviation for normally distributed data (for a normal distribution, 50% of the values fall within +/- one unscaled MAD of the median). See statsmodels.robust.scale.mad.

import numpy as np
import pandas as pd
from statsmodels import robust

data = np.random.randn(30, 2)
df = pd.DataFrame(data, columns=['column_1', 'column_2'])
print(df['column_1'].std())              # sample standard deviation
print(robust.scale.mad(df['column_1']))  # MAD, scaled by 1.4826 by default

Range: the difference between the largest and the smallest value in the data set.

Order statistics: metrics based on the data values sorted from smallest to biggest.

Percentile and quantile: the Pth percentile is a value such that at least P percent of the values are less than or equal to it; a quantile expresses the same idea on a 0-1 scale (the 0.8 quantile is the 80th percentile).

Interquartile Range (IQR): the difference between the 75th percentile and the 25th percentile.

data = np.random.randn(30, 2)
df = pd.DataFrame(data, columns=['column_1','column_2'])
df.head(3)
df['column_1'].quantile(0.75)-df['column_1'].quantile(0.25)

Percentiles

df['column_1'].quantile([0.05, 0.75, 0.95])

Part 3: Distribution Estimator

Boxplots

  • horizontal thick line in the box: the median
  • top/bottom edges of the box: the 75th and 25th percentiles
  • the dashed lines are called whiskers; they extend to the furthest point within 1.5 times the interquartile range (IQR). Any data outside the whiskers is plotted as single points or circles, often treated as outliers.
  • range: the span of values covered along the y-axis

df['column_1'].plot.box()

An enhanced alternative to the boxplot is the violin plot:

import seaborn as sns

sns.violinplot(df['column_1'], inner='quartile')

The density is mirrored and flipped over, and the resulting shape is filled in, creating an image resembling a violin. The advantage of a violin plot is that it can show nuances in the distribution that aren’t perceptible in a boxplot. On the other hand, the boxplot more clearly shows the outliers in the data.

Frequency table and histogram

binned_column = pd.cut(df['column_1'], 10)
print(type(binned_column))
binned_column.value_counts(sort=False)

A histogram is a way to visualize the frequency table.

ax = df['column_1'].plot.hist(figsize=(6, 6))
ax.set_xlabel('range value')

Density plots and estimates

A density plot is a smoothed version of a histogram.

ax = df['column_1'].plot.hist(density=True, bins=11, figsize=(6, 6))  # density=True puts the histogram on the same scale as the density curve
df['column_1'].plot.density(ax=ax)
ax.set_xlabel('range value')

We can also use a bar or pie plot to show categorical data. Counts are computed first (category_column is a hypothetical categorical column):

counts = df['category_column'].value_counts()
ax = counts.plot.bar()  # or counts.plot.pie()
ax.set_xlabel('category')
ax.set_ylabel('count')

For categorical data, the mode is the value (or values, in case of a tie) that appears most often in the data; it is a simple summary statistic for categorical data and is generally not used for numeric data.
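A minimal sketch with a made-up categorical column:

import pandas as pd

fuel = pd.Series(['gas', 'diesel', 'gas', 'electric', 'gas'])
print(fuel.mode()[0])       # 'gas', the most frequent category
print(fuel.value_counts())  # the full frequency table behind the mode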

Part 4: Correlation

Variables X and Y (each with measured data) are said to be positively correlated if high values of X go with high values of Y and low values of X go with low values of Y. If high values of X go with low values of Y, and vice versa, the variables are negatively correlated. We can use a heatmap to visualize the correlation matrix.

print(df.corr())
sns.heatmap(df.corr())

Like the mean and standard deviation, the correlation coefficient is sensitive to outliers in the data.
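A quick made-up demonstration of that sensitivity (not from the original text): a single extreme point can wipe out an otherwise strong Pearson correlation.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = x + rng.normal(scale=0.1, size=100)
print(np.corrcoef(x, y)[0, 1])  # close to 1

x[0], y[0] = 10.0, -10.0        # inject one extreme, discordant point
print(np.corrcoef(x, y)[0, 1])  # the correlation collapses toward 0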

Another way of visualizing the relationship between two measured variables is a scatterplot. The x-axis represents one variable and the y-axis the other, and each point on the graph is a record.

df.plot.scatter(x='column_1', y='column_2')

However, a scatter plot may not be appropriate for data sets with a very large number of records. In that case, a better option is a hexagonal binning plot. Rather than plotting individual points, which would merge into a monolithic dark cloud, we group the records into hexagonal bins and plot each hexagon with a color indicating the number of records in that bin.

[Figure: left, scatter plot; right, hexagonal binning plot]

df.plot.hexbin(x='column_1', y='column_2')

An alternative is to use a contour plot: the contours form, in essence, a topographical map over two variables; each contour band represents a specific density of points, increasing as one nears a “peak.”

[Figure: left, scatter plot; right, contour plot]
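A minimal sketch of such a contour plot using seaborn's kdeplot, with a scatter underneath for context (column names follow the earlier examples; the data is random):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(500, 2), columns=['column_1', 'column_2'])

ax = df.plot.scatter(x='column_1', y='column_2', alpha=0.3, figsize=(5, 5))
sns.kdeplot(data=df, x='column_1', y='column_2', ax=ax)  # contour bands of point density
plt.show()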

In summary, to show the relationship between two numerical variables we have the following options:

  • scatter plot: often used when there is a small number of records
  • hexagonal binning and contour plots: used when there are many records
  • heat map: often used to illustrate a matrix, for example a correlation matrix or a confusion matrix

Part 5: Contingency Table

A contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables.

import numpy as np
import pandas as pd

# creating some data
a = np.array(["foo", "foo", "foo", "foo",
              "bar", "bar", "bar", "bar",
              "foo", "foo", "foo"], dtype=object)

b = np.array(["one", "one", "one", "two",
              "one", "one", "one", "two",
              "two", "two", "one"], dtype=object)

res = pd.crosstab(a, [b], rownames=['a'], colnames=['b'])
res

Part 6: More Variables (more than 2)

Both boxplots and violin plots support grouping by an additional categorical variable. For example (airline_stats is the airline-delay data set from the reference book):

airline_stats.boxplot(by='airline', column='pct_carrier_delay')
sns.violinplot(airline_stats['airline'], airline_stats['pct_carrier_delay'])

The same applies to scatter plots, hexagonal binning plots, and so on. The basic idea is conditioning: split the records by one or more conditioning variables and show the same pair of variables once per subset, panel by panel.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# LOANS_INCOME_CSV is the path to the loans income data set used in the reference book
loans_income = pd.read_csv(LOANS_INCOME_CSV, squeeze=True)

sample_data = pd.DataFrame({
    'income': loans_income.sample(1000),
    'type': 'Data',
})
sample_mean_05 = pd.DataFrame({
    'income': [loans_income.sample(5).mean() for _ in range(1000)],
    'type': 'Mean of 5',
})
sample_mean_20 = pd.DataFrame({
    'income': [loans_income.sample(20).mean() for _ in range(1000)],
    'type': 'Mean of 20',
})
results = pd.concat([sample_data, sample_mean_05, sample_mean_20])

g = sns.FacetGrid(results, col='type', col_wrap=1, height=2, aspect=2)
g.map(plt.hist, 'income', range=[0, 200000], bins=40)
g.set_axis_labels('Income', 'Count')
g.set_titles('{col_name}')
plt.tight_layout()
plt.show()

Part 7: Feature Visualization

t-SNE

t-SNE is an unsupervised non-linear technique primarily used for data exploration and visualizing high-dimensional data.

t-SNE differs from PCA by preserving only small pairwise distances or local similarities whereas PCA is concerned with preserving large pairwise distances to maximize variance.

1. Step 1: measure similarities between points in the high-dimensional space. Think of a set of data points scattered in that space. For each data point x_i we center a Gaussian distribution over that point, measure the density of every other point x_j under that Gaussian, and then renormalize over all points. This yields a set of probabilities P_ij that are proportional to the similarities: if two points have equal density under the Gaussian centered on x_i, their similarities are equal, which is how the local structure of the high-dimensional space is captured. The Gaussian can be tuned through the perplexity, which influences the variance of the distribution (the size of the neighborhood) and, in effect, the number of nearest neighbors considered. The typical range for perplexity is between 5 and 50.

2. Step 2 is similar to step 1, but instead of a Gaussian distribution it uses a Student t-distribution with one degree of freedom, also known as the Cauchy distribution. This gives a second set of probabilities Q_ij in the low-dimensional space. The Student t-distribution has heavier tails than the normal distribution, which allows far-apart points to be modeled better.

3. The last step is to make the probabilities in the low-dimensional space (Q_ij) reflect those in the high-dimensional space (P_ij) as closely as possible, so that the two structures are similar. The difference between the two probability distributions is measured with the Kullback-Leibler (KL) divergence, an asymmetric measure that heavily penalizes representing a large P_ij with a small Q_ij. Finally, gradient descent is used to minimize this KL cost function.
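For reference, the standard definitions behind these three steps (from van der Maaten and Hinton's original t-SNE paper; they are not spelled out in the text above) are:

p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}, \qquad \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

Here \sigma_i is the per-point Gaussian bandwidth chosen to match the target perplexity, x are the high-dimensional points, and y are their low-dimensional counterparts.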

Typical uses of t-SNE:

(1) as a preprocessing step: run t-SNE on the high-dimensional data and use the low-dimensional output as input to some other classification model;

(2) data exploration, for example checking how well classes separate.
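A minimal sketch of both uses with scikit-learn's TSNE on the built-in digits data set (the data set and parameter values are illustrative choices, not from the original text):

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()  # 64-dimensional handwritten-digit features

# perplexity roughly controls the effective number of neighbors (typical range: 5-50)
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(digits.data)

# data exploration: color by label to see how well the classes separate;
# the embedding could also be fed into another classifier
plt.scatter(embedding[:, 0], embedding[:, 1], c=digits.target, cmap='tab10', s=5)
plt.colorbar(label='digit')
plt.show()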

Part 8: Reference

Book

  • Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python, Chapter 1, Exploratory Data Analysis

