A package is a collection of modules, each of which can contain a set of related commands.
A library is a collection of related functions that can be downloaded together for use in a work session.
First, you need to install the package: pip install pandas
. Then, to make the library of functions accessible, you need to import the library: import numpy as np
.
#Example 1
pip install numpy
import numpy as np
#Example 2
pip install pandas
import pandas as pd
In the second example above, I typed not just “import pandas” but “import pandas as pd”. This makes “pd” the alias for “pandas”, a shorter name that I can use to call functions. Using an alias can save time and also reduce the likelihood of typos interfering with code running correctly.
A data frame is a way of organizing data that consists of rows and columns. The first row is a header row containing column names, and the first column is an index by which the data frame is sorted.
A great one is pandas.
pd.read_csv
data = pd.read_csv('gapminder.tsv',sep='\t')
The read_csv() function assumes comma-separated values, so if the values are separated otherwise (by tabs, in this case), then that must be specified.
Yes, the argument sep must be defined if the values are separated by something other than commas.
The number of rows and columns is given by the shape function. The size of the data, given by the size function, describes the number of rows multiplied by the number of columns. Additionally, a row can be called an observation and a column can be called a variable.
data.shape
(1704, 6)
data.size
10224
df = pd.read_csv('gapminder.tsv',sep='\t')
df.year
0 1952
1 1957
2 1962
3 1967
4 1972
...
1699 1987
1700 1992
1701 1997
1702 2002
1703 2007
Name: year, Length: 1704, dtype: int64
Yes, the variable occurs in intervals of 5. If I were adding years since 2007, I would add 2012 and 2017.
For each year, data is added for every country. If I add two additional years, that will be an outcome for each year and for each country. Since there are 2 years and 142 countries, 284 new outcomes would be added in total.
countries = len(df['country'].unique())
countries*2
284
Rwanda in 1992 had the lowest life expectancy.
min_lifeExp = df['lifeExp'].min()
min_lifeExp
23.599
idx_min_lifeExp = (df['lifeExp'] == min_lifeExp)
temp = df[idx_min_lifeExp]
temp['country']
1292 Rwanda
Name: country, dtype: object
temp['year']
1292 1992
Name: year, dtype: int64
In the 1990s, Rwanda was experiencing a genocide. Tens of thousands of Tutsis, an ethnic minority group, were murdered. This helps to explain the drastically low life expectancy of 23.599 years.
gdp = df['pop']*df['gdpPercap']
df['gdp'] = gdp
df.head()
country | continent | year | lifeExp | pop | gdpPercap | gdp | |
---|---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 | 6.567086e+09 |
1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 | 7.585449e+09 |
2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 | 8.758856e+09 |
3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138 | 9.648014e+09 |
4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 | 9.678553e+09 |
subset_country = df[df['country'].isin(['Germany','France','Italy','Spain'])]
subset_2007 = subset_country[subset_country['year']==2007]
subset_2007 = subset_2007.sort_values('gdp',ascending=False)
subset_2007
country | continent | year | lifeExp | pop | gdpPercap | gdp | |
---|---|---|---|---|---|---|---|
575 | Germany | Europe | 2007 | 79.406 | 82400996 | 32170.37442 | 2.650871e+12 |
539 | France | Europe | 2007 | 80.657 | 61083916 | 30470.01670 | 1.861228e+12 |
779 | Italy | Europe | 2007 | 80.546 | 58147733 | 28569.71970 | 1.661264e+12 |
1427 | Spain | Europe | 2007 | 80.941 | 40448191 | 28821.06370 | 1.165760e+12 |
Country | GDP |
---|---|
Germany | 2.65e+12 |
France | 1.86e+12 |
Italy | 1.66e+12 |
Spain | 1.17e+12 |
Germany’s total GDP grew the most between 2002 and 2007.
subset_country = df[df['country'].isin(['Germany','France','Italy','Spain'])]
subset_2002 = subset_country[subset_country['year']==2002]
subset_2002 = subset_2002.sort_values('gdp',ascending=False)
germanyDif = subset_2007.iloc[0]['gdp'].astype(int) - subset_2002.iloc[0]['gdp'].astype(int)
franceDif = subset_2007.iloc[1]['gdp'].astype(int) - subset_2002.iloc[1]['gdp'].astype(int)
italyDif = subset_2007.iloc[2]['gdp'].astype(int) - subset_2002.iloc[2]['gdp'].astype(int)
spainDif = subset_2007.iloc[3]['gdp'].astype(int) - subset_2002.iloc[3]['gdp'].astype(int)
max(germanyDif,franceDif,italyDif,spainDif) == germanyDif
True
The & operator determines whether both conditions are true. For example, the expression (2>1) & (3>2)
is true, because both sides of the comparison are true. == is used to evaluate whether two things are equal to each other. For example, 2 == 2
is true. | is the inclusive or, which evaluates to true if either condition is true or if both are true. For example, (2>1) | (2<1)
would still evaluate to true even though the second condition is false. Yet (2>1) | (1<2)
also evaluates to true, and both conditions are true. ^ is the exclusive or, which only evaluates to true if one but not both conditions are true. For example, (2>1) ^ (2<1)
would still evaluate to true even though the second condition is false. But (2>1) ^ (1<2)
also evaluates to false, since both conditions are true.
(2>1) & (3>2)
True
2 == 2
True
(2>1) | (2<1)
True
(2>1) | (1<2)
True
(2>1) ^ (2<1)
True
(2>1) ^ (1<2)
False
.loc retrieves a value based on the label given by its index. .iloc retrieves a value based on its integer position in a series. Extracting a series of consecutive observations can be done with .iloc, by extracting the data from several integer positions in a row:
df.iloc[0:5]
country | continent | year | lifeExp | pop | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 |
2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 |
3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138 |
4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 |
To extract observations from a series of consecutive columns, a subset of the data frame can be created using column names. iloc can also be used to subset consecutive columns by using a colon to indicate that all rows should be included and specifying which columns to subset in brackets.
columns = ['country','continent','year']
df_subset = df[columns]
df_subset.head()
country | continent | year | |
---|---|---|---|
0 | Afghanistan | Asia | 1952 |
1 | Afghanistan | Asia | 1957 |
2 | Afghanistan | Asia | 1962 |
3 | Afghanistan | Asia | 1967 |
4 | Afghanistan | Asia | 1972 |
df.iloc[:,[0,1,2]]
country | continent | year | |
---|---|---|---|
0 | Afghanistan | Asia | 1952 |
1 | Afghanistan | Asia | 1957 |
2 | Afghanistan | Asia | 1962 |
3 | Afghanistan | Asia | 1967 |
4 | Afghanistan | Asia | 1972 |
... | ... | ... | ... |
1699 | Zimbabwe | Africa | 1987 |
1700 | Zimbabwe | Africa | 1992 |
1701 | Zimbabwe | Africa | 1997 |
1702 | Zimbabwe | Africa | 2002 |
1703 | Zimbabwe | Africa | 2007 |
1704 rows × 3 columns
An API is an application programming interface, the part of a remote server that interacts with requests. It’s basically the messenger between the client and the provider. To construct a request, you need to import the requests library and specify the url to be requested. Then a local file must be created for writing, such as in the format below. Finally, the file can be imported to the current work session as a pandas data frame.
import requests
url = "https://api.covidtracking.com/v1/states/daily.csv"
r = requests.get(url)
file_name = 'thisismyfile' + '.csv'
with open(file_name, 'wb') as f:
f.write(r.content)
import pandas as pd
dfex = pd.read_csv(file_name)
dfex.head()
date | state | positive | probableCases | negative | pending | totalTestResultsSource | totalTestResults | hospitalizedCurrently | hospitalizedCumulative | ... | dataQualityGrade | deathIncrease | hospitalizedIncrease | hash | commercialScore | negativeRegularScore | negativeScore | positiveScore | score | grade | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20210217 | AK | 54799.0 | NaN | NaN | NaN | totalTestsViral | 1611971.0 | 32.0 | 1238.0 | ... | NaN | 1 | 5 | c002b868e44913a74a70528d3fa75e7e3e5b4191 | 0 | 0 | 0 | 0 | 0 | NaN |
1 | 20210217 | AL | 483167.0 | 104382.0 | 1859516.0 | NaN | totalTestsPeopleViral | 2238301.0 | 1030.0 | 44541.0 | ... | NaN | 89 | 0 | 3d4a8eef551e9cf3e5ab585388add5a45719e97b | 0 | 0 | 0 | 0 | 0 | NaN |
2 | 20210217 | AR | 314192.0 | 65337.0 | 2344576.0 | NaN | totalTestsViral | 2593431.0 | 638.0 | 14392.0 | ... | NaN | 26 | 0 | 1de895ab8fa94a2d14163418fea02180332cea5f | 0 | 0 | 0 | 0 | 0 | NaN |
3 | 20210217 | AS | 0.0 | NaN | 2140.0 | NaN | totalTestsViral | 2140.0 | NaN | NaN | ... | NaN | 0 | 0 | 2857f0702fbfc1243e77e13a044b5830f3bea397 | 0 | 0 | 0 | 0 | 0 | NaN |
4 | 20210217 | AZ | 801055.0 | 53685.0 | 2901236.0 | NaN | totalTestsViral | 7281277.0 | 1941.0 | 55983.0 | ... | NaN | 82 | 118 | 7f4d3ca6991968ec0fb4227de7d9ced82b0d1ccf | 0 | 0 | 0 | 0 | 0 | NaN |
5 rows × 55 columns
The apply() function applies the same function to each row in a data frame. It can quickly and tidily enact the same action upon many rows of data.
A loop can also be used to go through the data frame and apply a function to each row. apply() does this more neatly and efficiently.
Instead of using .iloc, a subset of a data frame can be created by using specific column names. The subset is selected by listing the column names, filtered by viewing the data frame of only those column names, and assigned to a new variable name (df_subset), as shown below.
columns = ['country','continent','year']
df_subset = df[columns]
df_subset.head()
country | continent | year | |
---|---|---|---|
0 | Afghanistan | Asia | 1952 |
1 | Afghanistan | Asia | 1957 |
2 | Afghanistan | Asia | 1962 |
3 | Afghanistan | Asia | 1967 |
4 | Afghanistan | Asia | 1972 |