My Understanding of Data and Sampling Distributions

ifeelfree
Dec 1, 2020

How data is sampled is one of the most important parts of any machine learning project. The collected data must be accurate and complete, and random sampling provides a way of reducing bias. The sampling distribution is used to check whether the collected samples follow the underlying population distribution.

· Part 1: Random Sampling
· Sampling vs population
· What is a good sample?
· What is a proper sample size?
· Part 2: Resampling
· Sampling distribution
· Central limit theorem
· Standard Error
· The Bootstrap
· Permutation
· Part 3: Confidence Intervals
· Calculate confidence intervals via the bootstrap
· Part 4: Normal Distribution
· Part 5: Other Distributions
· Long-tailed distributions
· Student’s t-Distribution
· Binomial distribution
· Chi-Square distribution
· F-distribution
· Poisson distribution
· Exponential distribution
· Weibull distribution
· Part 6: Data Collection in the era of Big Data
· Part 7: Reference
· Book
· Paper
· Video
· Code

Part 1: Random Sampling

Sampling vs population

A sample is a subset of data from a larger data set (population). Sampling can be classified into two categories: 1) sampling with replacement and 2) sampling without replacement.
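As a quick illustration (a minimal sketch; the population array below is made up for the example), NumPy can draw a sample either way:

import numpy as np

population = np.arange(1, 11)                                              # a toy "population" of 10 values
with_replacement = np.random.choice(population, size=5, replace=True)      # the same value may be drawn more than once
without_replacement = np.random.choice(population, size=5, replace=False)  # each value appears at most once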

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(seed=1)
x = np.linspace(-3, 3, 300)            # support for the population density
xsample = stats.norm.rvs(size=1000)    # a random sample of 1000 points

fig, axes = plt.subplots(ncols=2, figsize=(5, 1.5))
ax = axes[0]
ax.fill(x, stats.norm.pdf(x))          # left: the population (data) distribution
ax.set_axis_off()
ax.set_xlim(-3, 3)
ax = axes[1]
ax.hist(xsample, bins=30)              # right: histogram of the sample
ax.set_axis_off()
ax.set_xlim(-3, 3)
plt.show()

What is a good sample?

Data quality in data science involves completeness, consistency of format, cleanliness, and accuracy of individual data points.

Several factors may affect sampling completeness:

  • sample bias: the sample differs in some meaningful, nonrandom way from the larger population it is meant to represent. Properly defining an accessible population is difficult, and there may be selection bias, the practice of selectively choosing data.
  • self-selection sampling bias: a type of sample bias in which the sample consists of those willing to be sampled. For example, the people who submit reviews on Yelp choose to do so themselves, and their motivation to write a review may come from a bad experience.
  • regression to the mean: a phenomenon involving successive measurements on a given variable: extreme observations tend to be followed by more central ones.

Sampling may have random errors, and these errors do not tend strongly in any direction.

In stratified sampling, the population is divided up into strata, and random samples are taken from each stratum. Political pollsters might seek to learn the electoral preferences of whites, blacks, and Hispanics. A simple random sample taken from the population would yield too few blacks and Hispanics, so those strata could be over-weighted in stratified sampling to yield equivalent sample sizes.

In classification problems, we often use stratified sampling when splitting the data into training and test sets. This is particularly important if the minority class is much less frequent than the other class: with a purely random split we could be unlucky enough to have no minority-class samples in the training set and all of them in the test set, drastically reducing performance.
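For instance (a sketch assuming a feature matrix X and a binary label vector y, names not from the original text), scikit-learn's train_test_split can stratify on the labels so that both splits keep the original class proportions:

from sklearn.model_selection import train_test_split

# stratify=y preserves the minority-class proportion in both the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)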

Specifying a hypothesis and then collecting data following randomization and random sampling principles ensures against bias.

All other forms of data analysis run the risk of bias resulting from the data collection/analysis process (repeated running of models in data mining, data snooping in research, and after-the-fact selection of interesting events).

When using temporal data, instead, the split should be done sequentially. In particular, the Training data should be the initial 70% of the data ordered temporally, while the Test set should be the most recent part. This prevents the model from, quite literally, “looking into the future”.
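A minimal sketch of such a sequential split, assuming a pandas DataFrame df already sorted by its time column:

split_idx = int(len(df) * 0.7)
train = df.iloc[:split_idx]   # earliest 70% of the records
test = df.iloc[split_idx:]    # most recent 30% of the records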

What is a proper sample size?

The classic scenario for the value of big data is when the data is not only big but also sparse, as with search queries, where most individual queries are rare: only a very large data set contains enough examples of the rare cases. Otherwise, a modest random sample is often enough, and sample quality matters more than sheer size.

Part 2: Resampling

Sampling distribution

It is important to distinguish the distribution of the individual data points, known as data distribution, and the distribution of a sample statistic, known as the sampling distribution.

Central limit theorem

It says that the means drawn from multiple samples will resemble the familiar bell-shaped normal curve, even if the source population is not normally distributed, provided that the sample size is large enough and the departure of the data from normality is not too great.
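A small simulation makes this concrete: even when the population is strongly skewed (exponential here), the distribution of sample means looks roughly bell-shaped. This is a sketch; the sample size and number of repetitions are arbitrary.

import pandas as pd
from scipy import stats

# draw 2000 samples of size 50 from a skewed (exponential) population
sample_means = [stats.expon.rvs(scale=1, size=50).mean() for _ in range(2000)]
pd.Series(sample_means).plot.hist(bins=30)   # histogram of sample means: approximately normal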

Standard Error

The standard error is a single metric that sums up the variability in the sampling distribution for a statistic. The standard error can be estimated using a statistic based on standard deviation s of the sample values and the sample size n.

Standard error = s/sqrt(n)
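In code this is a one-liner (a sketch assuming the data is held in a numeric pandas Series named sample):

import numpy as np

std_err = sample.std() / np.sqrt(len(sample))   # s / sqrt(n)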

There is no need to draw brand new samples to calculate the Standard Error; instead, we can use bootstrap resamples.

The Bootstrap

  • no assumption that the data follows a normal distribution
  • the bootstrap resamples the observed data set with replacement; it is also a powerful tool for increasing the variability of training samples (as in bagging)

import pandas as pd
from sklearn.utils import resample

# loans_income: a pandas Series of incomes (loaded elsewhere)
results = []
for nrepeat in range(1000):
    sample = resample(loans_income)   # draw a bootstrap resample (with replacement)
    results.append(sample.median())
results = pd.Series(results)

print('Bootstrap Statistics:')
print(f'original: {loans_income.median()}')
print(f'bias: {results.mean() - loans_income.median()}')
print(f'std. error: {results.std()}')   # standard error of the median

Permutation

Permutation and the bootstrap are the two main types of resampling procedures. The bootstrap is used to assess the reliability of an estimate, while permutation tests are used to test hypotheses.
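A minimal permutation-test sketch for the difference in means between two groups; group_a and group_b are assumed to be NumPy arrays of measurements (names not from the original text):

import numpy as np

observed = group_a.mean() - group_b.mean()
pooled = np.concatenate([group_a, group_b])
perm_diffs = []
for _ in range(1000):
    shuffled = np.random.permutation(pooled)   # shuffle, i.e. randomly reassign group membership
    diff = shuffled[:len(group_a)].mean() - shuffled[len(group_a):].mean()
    perm_diffs.append(diff)
# p-value: fraction of permuted differences at least as extreme as the observed one
p_value = np.mean(np.abs(perm_diffs) >= abs(observed))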

Part 3: Confidence Intervals

Calculate confidence intervals via the bootstrap

Confidence intervals always come with a coverage level, expressed as a percentage, say 90% or 95%.

The higher the coverage level, the wider the confidence interval; the coverage level is also called the level of confidence.

One way to think of a 90% confidence interval is as follows: it is the interval that encloses the central 90% of the bootstrap sampling distribution of a sample statistic. The procedure is as follows:

a) draw a random sample of size n with replacement from the data

b) record the statistic of interest for the resample

c) repeat a) and b) R times

d) for an x% confidence interval, trim [(100-x)/2]% of the R resample results from either end of the distribution

e) the trim points are the endpoints of x% bootstrap confidence interval
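The steps above translate almost directly into code. A sketch for a 90% confidence interval of the mean, assuming the data is a numeric pandas Series named data:

import numpy as np
from sklearn.utils import resample

boot_means = [resample(data).mean() for _ in range(1000)]   # steps a)-c): R = 1000 bootstrap resamples
lower, upper = np.percentile(boot_means, [5, 95])           # steps d)-e): trim 5% from each end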

Calculate confidence intervals via z-test

  • null hypothesis
  • significance level for the z-statistic (e.g., 0.05)

- p>0.1 not significant

- p<0.1 marginally significant

- p<0.05 significant

- p<0.01 highly significant

import numpy as np
from scipy import stats

# n_obs: total number of observations; n_control: count observed in the control group
p = 0.5
sd = np.sqrt(p * (1 - p) * n_obs)
z = ((n_control + 0.5) - p * n_obs) / sd   # z-statistic with continuity correction
print(z)                                   # z
print(2 * stats.norm.cdf(z))               # two-sided p-value

Part 4: Normal Distribution

  • In a normal distribution, 68% of the data lies within one standard deviation of the mean, and 95% lies within two standard deviations.
  • A standard normal distribution is one in which the units on the x-axis are expressed in terms of standard deviations away from the mean. The transformed value is termed a z-score, and the normal distribution is sometimes called the z-distribution. To convert data to z-scores, subtract the mean of the data and divide by the standard deviation (a short sketch follows the QQ-plot code below).
  • QQ-Plot: used to visually determine how close a sample distribution is to the normal distribution.
import matplotlib.pyplot as plt
from scipy import stats

fig, ax = plt.subplots(figsize=(4, 4))
norm_sample = stats.norm.rvs(size=1000, scale=4, loc=8)   # 1000 draws from N(8, 4^2)
stats.probplot(norm_sample, plot=ax)                      # QQ-plot against the normal distribution
plt.tight_layout()
plt.show()
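The z-score conversion mentioned above is just centering and scaling; for the same norm_sample it can be computed directly:

z_scores = (norm_sample - norm_sample.mean()) / norm_sample.std()   # subtract the mean, divide by the standard deviation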

Part 5: Other Distributions

Long-tailed distributions

This means that we are much more likely to observe extreme values than would be expected if the data had a normal distribution.

[Figure: (a) a long-tailed distribution vs. (b) a normal distribution]

Another common pattern in the QQ-plot of a long-tailed distribution: the points are close to the line for data within about one standard deviation of the mean, but drift far from it in the tails.

Student’s t-Distribution

The t-distribution is a family of distributions resembling the normal distribution but with thicker tails. The larger the sample, the more normally shaped the t-distribution becomes.

a) How does the t-distribution arise?

b) Assume the population is normal with a known (hypothesized) mean but unknown standard deviation; standardizing the sample mean using the sample standard deviation, instead of the population standard deviation, leads to the t-distribution.

c) As the sample size increases, the t-distribution approaches the normal distribution.

Student's t-distribution is often used for calculating confidence intervals. Suppose, for example, we want a 90% confidence interval for the mean; the interval is given by the formula:

x̄ +/- t(0.05, n-1) * s / sqrt(n)

where n is the sample size, x̄ and s are the sample mean and standard deviation, and t(0.05, n-1) is the critical value of the t-distribution with n-1 degrees of freedom.
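A sketch of this calculation with scipy, assuming the data is a numeric pandas Series named sample:

import numpy as np
from scipy import stats

n = len(sample)
t_crit = stats.t.ppf(0.95, df=n - 1)            # critical value for a 90% two-sided interval
margin = t_crit * sample.std() / np.sqrt(n)
ci = (sample.mean() - margin, sample.mean() + margin)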

The formal definition of the t-distribution: if Z is a standard normal variable and V is an independent chi-square variable with ν degrees of freedom, then T = Z / sqrt(V/ν) follows a t-distribution with ν degrees of freedom.

Binomial distribution

The binomial distribution is the frequency distribution of the number of successes (x) in a given number of trials (n) with specified probability (p) of success in each trial.
There is a family of binomial distributions, depending on the values of n and p.

stats.binom.pmf(12, n=18, p=0.5)   # probability of exactly 12 successes in 18 trials with p = 0.5

With a large enough number of trials (particularly when p is close to 0.5), the binomial distribution is practically indistinguishable from a normal distribution with mean n*p and variance n*p*(1-p).
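A quick numerical check of this approximation (the numbers n = 100, p = 0.5 and the cutoff 55 are arbitrary, chosen for illustration):

import numpy as np
from scipy import stats

n, p = 100, 0.5
exact = stats.binom.cdf(55, n=n, p=p)                                       # exact binomial probability P(X <= 55)
approx = stats.norm.cdf(55.5, loc=n * p, scale=np.sqrt(n * p * (1 - p)))    # normal approximation with continuity correction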

Chi-Square distribution

Expectation is defined loosely as “nothing unusual or of note in the data”; this is also termed the “null hypothesis”. The chi-square statistic measures the extent to which observed counts depart from the counts expected under that null hypothesis, and the chi-square distribution describes how this statistic behaves when the null hypothesis is true.
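A sketch of a chi-square goodness-of-fit test with scipy; the counts below are made up for illustration:

from scipy import stats

observed = [45, 55, 50, 50]   # observed counts per category
expected = [50, 50, 50, 50]   # counts expected under the null hypothesis
chi2_stat, p_value = stats.chisquare(observed, f_exp=expected)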

F-distribution

Steps to obtain the F-distribution:

  • Select a random sample of size n1 from a normal population, having a standard deviation equal to σ1.
  • Select an independent random sample of size n2 from a normal population, having a standard deviation equal to σ2.
  • The F statistic is the ratio of s1²/σ1² to s2²/σ2², where s1 and s2 are the sample standard deviations.

The F-statistic is the ratio of the variability among the group means to the variability within each group (this is the form used in ANOVA).
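scipy's one-way ANOVA computes exactly this ratio; a sketch with three made-up groups of measurements:

from scipy import stats

group1 = [23, 25, 21, 22, 24]
group2 = [30, 28, 29, 31, 27]
group3 = [22, 24, 23, 25, 21]
f_stat, p_value = stats.f_oneway(group1, group2, group3)   # between-group vs. within-group variability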

Poisson distribution

The key parameter in a Poisson distribution is λ, or lambda. This is the mean number of events that occur in a specified interval of time or space. The variance of a Poisson distribution is also λ.

sample = stats.poisson.rvs(2, size=1000)   # 1000 draws from a Poisson distribution with λ = 2
pd.Series(sample).plot.hist()

Exponential distribution

Using the same parameter λ that we used in the Poisson distribution, we can also model the distribution of the time between events: time between visits to a website or between cars arriving at a toll plaza. It is also used in engineering to model time to failure, and in process management to model, for example, the time required per service call.

# note: scipy's expon is parameterized by scale = 1/λ; the first positional argument is loc, not the rate
sample = stats.expon.rvs(scale=1 / 0.2, size=100)   # rate λ = 0.2, i.e. scale = 5
pd.Series(sample).plot.hist()

Weibull distribution

It is a generalization of the exponential distribution in which the event rate is allowed to change over time, as specified by a shape parameter.

sample = stats.weibull_min.rvs(1.5, scale=5000, size=100)   # shape = 1.5, scale = 5000
pd.Series(sample).plot.hist()

Part 6: Data Collection in the era of Big Data

Not all data samples are equal — some are more valuable to your model than others. For example, if your model has already trained on 1M scans of normal lungs and only 1000 scans of cancerous lungs, a scan of a cancerous lung is much more valuable than a scan of a normal lung. Indiscriminately accepting all available data might hurt your model’s performance and even make it susceptible to data poisoning attacks.

Building the Software 2.0 Stack (Andrei Karpathy, Spark+AI Summit 2018)

Part 7: Reference

Book

  • Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python, Chapter 2, “Data and Sampling Distributions”

Paper

Video

Code
