Continuous data are numerical and occur across a range. Anything that changes incrementally is continuous data, such as the temperature outside, my age, or the number of hours that have passed since New Years 2000. Ordinal data are also numerical, but they are more concerned with the order in which they occur than with the values themselves. Rankings are ordinal data, such as class rank or a ranked-choice vote. Nominal data is not numerical, but rather consists of names, labels, or categories. The names of all the students in the class would be nominal data, as would ne the name of the state which each of us lives in.
Let’s say I’m trying to evaluate the relationship between a remote student’s distance from Williamsburg during the Spring 2021 semester and the letter grade that they earn in DATA 146. In this case, the feature is distance from Williamsburg (which varies for each student), and the target is their grade (this might be correlated with distance). Distance from Williamsburg would be a continuous variable, since it is numerical and incremental. Letter grade is categorical since it is represented with a label rather than a number, but it is also ordinal in that A is clearly higher than B, B higher than C, and so on. The name of the town where each student resides could also be attached to the data as a categorical representation of location.
import numpy as np
n = 1000
a = 5.0
b = 5.0
#np.random.seed(146)
x1 = np.random.beta(a, b, size=n)
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 8))
plt.hist(x1, rwidth=0.9)
plt.show()
mean1 = np.mean(x1)
median1 = np.median(x1)
print("The mean is "+str(mean1)+".")
print("The median is "+str(median1)+".")
The mean is 0.49187385260651606.
The median is 0.4905356820841425.
n = 1000
a = 2.0
b = 8.0
#np.random.seed(146)
x2 = np.random.beta(a, b, size=n)
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 8))
plt.hist(x2, rwidth=0.9)
plt.show()
mean2 = np.mean(x2)
median2 = np.median(x2)
print("The mean is "+str(mean2)+".")
print("The median is "+str(median2)+".")
The mean is 0.19481817832741233.
The median is 0.17276231730476235.
n = 1000
a = 8.0
b = 2.0
#np.random.seed(146)
x3 = np.random.beta(a, b, size=n)
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 8))
plt.hist(x3, rwidth=0.9)
plt.show()
mean3 = np.mean(x3)
median3 = np.median(x3)
print("The mean is "+str(mean3)+".")
print("The median is "+str(median3)+".")
The mean is 0.7962899834253547.
The median is 0.8169509494635234.
import pandas as pd
df = pd.read_csv('gapminder.tsv', sep='\t')
min_year = df['year'].min()
max_year = df['year'].max()
min_life = df['lifeExp'].min()
max_life = df['lifeExp'].max()
n_bins = 10
my_bins = np.linspace(min_life, max_life, n_bins + 1)
idx_min_year = df['year'] == min_year
idx_max_year = df['year'] == max_year
df_min_year = df[idx_min_year]
df_max_year = df[idx_max_year]
plt.figure(figsize=(6, 6))
plt.hist(df_min_year['lifeExp'], rwidth=0.9,
label=min_year, alpha=0.5, bins=my_bins)
plt.hist(df_max_year['lifeExp'], rwidth=0.9,
label=max_year, alpha=0.5, bins=my_bins)
plt.xlabel('Life Expectancy')
plt.ylabel('Number of countries')
plt.legend()
plt.show()
min_year = df['year'].min()
max_year = df['year'].max()
min_life = np.log10(df['lifeExp'].min())
max_life = np.log10(df['lifeExp'].max())
n_bins = 10
my_bins = np.linspace(min_life, max_life, n_bins + 1)
idx_min_year = df['year'] == min_year
idx_max_year = df['year'] == max_year
df_min_year = df[idx_min_year]
df_max_year = df[idx_max_year]
plt.figure(figsize=(6, 6))
plt.hist(np.log10(df_min_year['lifeExp']), rwidth=0.9,
label=min_year, alpha=0.5, bins=my_bins)
plt.hist(np.log10(df_max_year['lifeExp']), rwidth=0.9,
label=max_year, alpha=0.5, bins=my_bins)
plt.xlabel('$\log_{10}$ Life Expectancy')
plt.ylabel('Number of countries')
plt.legend()
plt.show()
The plot under the logarithmic transformation offers a much clearer picture of the change in life expectancy across countries from 1952 to 2007.
import seaborn as sns
plt.figure(figsize=(6, 6))
sns.boxplot(x=df['year'], y=df['pop'])
plt.ylabel('Population')
plt.xlabel('Year')
plt.show()
log_pop = np.log10(df['pop'])
plt.figure(figsize=(6, 6))
sns.boxplot(x=df['year'], y=(log_pop))
plt.xlabel('Year')
plt.ylabel('Population')
plt.show()
Once again, the logarithmic data is just more intuitive to interpret. The outliers in the untransformed graph dominate the visual representation, changing the scale of the y-axis so that even seeing most of the data is almost impossible. The logarithmic translation clarifies the visual representation of change in population amongst all countries from 1952 to 2007.