Understanding Sampling & Estimation in Statistics

Sampling and estimation are fundamental concepts in statistics that allow us to make inferences about a population based on a sample. In this article, we will explore the basics of sampling and estimation, their importance, and practical applications.

Sampling

Definition

Sampling involves selecting a subset of individuals or items from a larger population to estimate characteristics of the entire population.

Types of Sampling
  1. Random Sampling: Each member of the population has an equal chance of being selected.
  2. Stratified Sampling: The population is divided into subgroups (strata) and random samples are taken from each stratum.
  3. Systematic Sampling: Every nth member of the population is selected.
  4. Cluster Sampling: The population is divided into clusters, and entire clusters are randomly selected.

Python Code Examples for Sampling

1. Random Sampling
import numpy as np

# Population data (example)
population = np.arange(1, 101)

# Simple random sampling
sample_size = 10
random_sample = np.random.choice(population, size=sample_size, replace=False)
print(f"Random Sample: {random_sample}")

Output:

Random Sample: [31 57 14 42  8 21 96  4 49 15]

Your sample will differ each time you run the code, because the selection is random.
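
Because the draw is random, it can help to fix a seed when you want to reproduce a result. The following is a minimal sketch using NumPy's Generator interface; the seed value 42 and the name `repeat_sample` are arbitrary choices for illustration, not part of the example above.

```python
import numpy as np

# Seeded generator: the seed value 42 is an arbitrary choice for illustration.
rng = np.random.default_rng(seed=42)

population = np.arange(1, 101)

# Draw 10 distinct members; replace=False means no member is picked twice.
random_sample = rng.choice(population, size=10, replace=False)
print(f"Random Sample: {random_sample}")

# A fresh generator with the same seed reproduces the same sample.
repeat_sample = np.random.default_rng(seed=42).choice(population, size=10, replace=False)
print(f"Same sample again: {np.array_equal(random_sample, repeat_sample)}")
```

Omitting the seed gives a different sample on every run, which is usually what you want outside of tutorials and tests.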

2. Stratified Sampling
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Example dataset
data = {
'Age': np.random.randint(18, 65, size=100),
'Income': np.random.randint(20000, 80000, size=100),
'Gender': np.random.choice(['Male', 'Female'], size=100)
}

df = pd.DataFrame(data)
df  # in a notebook, this line displays the DataFrame

# Stratified sampling based on Gender
strata = df['Gender']
train, test = train_test_split(df, test_size=0.2, stratify=strata)
print(f"Stratified Sample (Train):\n{train.head()}\n")
print(f"Stratified Sample (Test):\n{test.head()}")
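
The example above splits the data into two stratified parts. If you only need a single stratified sample, one possible approach is to sample within each group directly with pandas. This is a sketch under assumptions: the 20% fraction, the seed, and the reduced two-column dataset are illustrative choices, and `groupby(...).sample` requires pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed, for repeatability only

# Illustrative dataset with the same kind of columns as the example above
df = pd.DataFrame({
    "Age": rng.integers(18, 65, size=100),
    "Gender": rng.choice(["Male", "Female"], size=100),
})

# Proportional stratified sample: draw 20% of the rows within each Gender stratum
stratified_sample = df.groupby("Gender").sample(frac=0.2, random_state=0)

# Compare the stratum shares in the full data and in the sample
print(df["Gender"].value_counts(normalize=True))
print(stratified_sample["Gender"].value_counts(normalize=True))
```

The two sets of proportions should be close, which is the point of stratifying: each subgroup is represented in the sample roughly in proportion to its share of the population.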

3. Systematic Sampling
import numpy as np

# Population data (example)
population = np.arange(2, 13)

# Systematic sampling: take every k-th member, where k is the sampling interval
sample_size = 3
interval = len(population) // sample_size  # k = 11 // 3 = 3
systematic_sample = population[::interval][:sample_size]  # keep exactly sample_size members
print(f"Systematic Sample: {systematic_sample}")

Output:

Systematic Sample: [2 5 8]
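
In practice, systematic sampling usually begins at a randomly chosen position within the first interval, so that every member has a chance of being selected. A minimal sketch of that variant follows; the population of 100, the sample size of 10, and the seed are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed

population = np.arange(1, 101)  # 100 members
sample_size = 10
interval = len(population) // sample_size  # sampling interval k = 10

# Pick a random start within the first interval, then take every k-th member
start = rng.integers(0, interval)
systematic_sample = population[start::interval][:sample_size]
print(f"Start index: {start}, sample: {systematic_sample}")
```

Every selected member is exactly `interval` positions after the previous one; only the starting point is random.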

4. Cluster Sampling
import numpy as np
import pandas as pd
# Example dataset
data = {
'City': np.random.choice(['City A', 'City B', 'City C', 'City D'], size=100),
'Income': np.random.randint(20000, 80000, size=100)
}

df = pd.DataFrame(data)
df  # in a notebook, this line displays the DataFrame

# Cluster sampling: Select 2 clusters randomly
clusters = df['City'].unique()
selected_clusters = np.random.choice(clusters, size=2, replace=False)
clusters, selected_clusters

Output:

(array(['City B', 'City C', 'City D', 'City A'], dtype=object),
 array(['City A', 'City D'], dtype=object))

cluster_sample = df[df['City'].isin(selected_clusters)]
print(f"Cluster Sample:\n{cluster_sample.head()}")

Estimation

Definition

Estimation involves inferring population parameters (e.g., mean, proportion) based on sample data.

Python Code Example for Estimation
import numpy as np
import scipy.stats as stats

# Sample data (example)
sample_data = np.random.normal(loc=50, scale=10, size=30)
sample_data

Output: an array of 30 values. Yours will differ from run to run because the data are randomly generated.

# Point estimate of the mean
sample_mean = np.mean(sample_data)
print(f"Sample Mean: {sample_mean}")

# Output: Sample Mean: 50.14196388941628

# Confidence interval for the mean
confidence_level = 0.95
degrees_freedom = len(sample_data) - 1
sample_standard_error = stats.sem(sample_data)
confidence_interval = stats.t.interval(confidence_level, degrees_freedom, sample_mean, sample_standard_error)
print(f"95% Confidence Interval: {confidence_interval}")

# Output: 95% Confidence Interval: (45.18535024124498, 55.09857753758758)
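
To see what the interval calculation does, it can be written out by hand: the interval is the sample mean plus or minus a critical t value times the standard error, where the standard error is the sample standard deviation divided by the square root of the sample size. The sketch below is one way to check this against SciPy; the seed and the names `manual_interval` and `scipy_interval` are illustrative assumptions.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(seed=7)  # arbitrary seed
sample_data = rng.normal(loc=50, scale=10, size=30)

n = len(sample_data)
sample_mean = np.mean(sample_data)
sample_std = np.std(sample_data, ddof=1)   # sample standard deviation (n - 1 in the denominator)
standard_error = sample_std / np.sqrt(n)   # the same quantity stats.sem computes

# Two-sided 95% critical value from the t distribution with n - 1 degrees of freedom
t_critical = stats.t.ppf(1 - (1 - 0.95) / 2, df=n - 1)
margin = t_critical * standard_error

manual_interval = (sample_mean - margin, sample_mean + margin)
scipy_interval = stats.t.interval(0.95, n - 1, loc=sample_mean, scale=standard_error)
print(f"Manual: {manual_interval}")
print(f"SciPy:  {scipy_interval}")
```

Both lines should print the same pair of numbers, up to floating-point rounding.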

Real-World Use

Sampling and estimation are used in various fields such as market research (estimating customer preferences), healthcare (estimating disease prevalence), and quality control (estimating defect rates).
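
As an illustration of the quality-control case, a defect rate is a proportion, and it can be estimated from a sample of inspected items. The sketch below uses the normal-approximation (Wald) interval, which is only one of several possible methods and works best with large samples. The inspection data are simulated, and the 5% underlying defect rate, the sample size of 500, and the seed are made-up values for this example.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(seed=3)  # arbitrary seed

# Hypothetical inspection results: 1 = defective, 0 = acceptable.
# The 5% underlying defect rate is an assumption made up for this example.
inspected = rng.binomial(n=1, p=0.05, size=500)

n = len(inspected)
p_hat = inspected.mean()  # point estimate of the defect rate

# Normal-approximation (Wald) 95% confidence interval for a proportion
z = stats.norm.ppf(0.975)
standard_error = np.sqrt(p_hat * (1 - p_hat) / n)
lower = max(0.0, p_hat - z * standard_error)
upper = min(1.0, p_hat + z * standard_error)

print(f"Estimated defect rate: {p_hat:.3f}")
print(f"95% Confidence Interval: ({lower:.3f}, {upper:.3f})")
```

The same pattern applies to the other examples mentioned, such as the share of customers who prefer a product or the prevalence of a disease in a surveyed group.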

Conclusion

Sampling and estimation are crucial for making inferences about a population based on sample data. In this article, we’ve explored the basics of sampling methods, estimation techniques, and their practical applications using Python examples.

Practice Set

  1. Perform a simple random sampling on a dataset of your choice and calculate the sample mean.
  2. Estimate the population mean and construct a 95% confidence interval using a sample dataset.
