Data146

Project 1

Lucy Greenman

Part 1

What is a package?

A package is a collection of modules, each of which can contain a set of related commands.

What is a library?

A library is a collection of related functions that can be downloaded together for use in a work session.

What two steps do you need to execute in order to install a package and then make that library of functions accessible?

First, you need to install the package from the command line: pip install pandas. Then, to make that library of functions accessible in your work session, you need to import it: import pandas as pd.

Provide examples of how you would execute these two steps using packages from class.

#Example 1
pip install numpy
import numpy as np
#Example 2
pip install pandas
import pandas as pd

Be sure to include an alias and explain why it is a good idea to do so.

In the second example above, I typed not just “import pandas” but “import pandas as pd”. This makes “pd” the alias for “pandas”, a shorter name that I can use to call functions. Using an alias can save time and also reduce the likelihood of typos interfering with code running correctly.

Part 2

What is a data frame?

A data frame is a way of organizing data that consists of rows and columns. The first row is a header row containing the column names, and each row is labeled by an index.
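For instance, here is a minimal sketch of building a small data frame by hand with pandas (the column names and values are made up purely for illustration):

import pandas as pd
# each dictionary key becomes a column name; each list becomes that column's values
example = pd.DataFrame({'name': ['a', 'b', 'c'], 'value': [1, 2, 3]})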

Identify a library of functions that is particularly useful for working with data frames.

A great one is pandas.

In order to read a file in its remote location within the file system of your operating system, which command would you use?

pd.read_csv()

Provide an example of how to read a file and import it into your work session in order to create a new data frame.

data = pd.read_csv('gapminder.tsv',sep='\t')

Also, describe why specifying an argument within a read_() function can be significant.

The read_csv() function assumes comma-separated values, so if the values are separated otherwise (by tabs, in this case), then that must be specified.

Does data that is saved as a file in a different type of format require a particular argument in order for a data frame to be successfully imported?

Yes, the argument sep must be defined if the values are separated by something other than commas.

Also, provide an example that describes a data frame you created. How do you determine how many rows and columns are in a data frame? Is there an alternate terminology for describing rows and columns?

The number of rows and columns is given by the shape attribute. The size attribute gives the total number of values, which is the number of rows multiplied by the number of columns. Additionally, a row can be called an observation and a column can be called a variable.

data.shape
(1704, 6)
data.size
10224

Part 3

Import the gapminder.tsv data set and create a new data frame.

df = pd.read_csv('gapminder.tsv',sep='\t')

Interrogate and describe the year variable within the data frame you created.

df.year
0       1952
1       1957
2       1962
3       1967
4       1972
        ... 
1699    1987
1700    1992
1701    1997
1702    2002
1703    2007
Name: year, Length: 1704, dtype: int64

Does this variable exhibit regular intervals? If you were to add new outcomes to the raw data in order to update and make it more current, which years would you add to each subset of observations?

Yes, the year variable occurs at regular five-year intervals. If I were adding years after 2007, I would add 2012 and 2017 to each country's set of observations.

Stretch goal: can you identify how many new outcomes in total you would be adding to your data frame?

For each new year, data is added for every country, so each additional year contributes one new outcome per country. Since there are 2 new years and 142 countries, 2 × 142 = 284 new outcomes would be added in total.

countries = len(df['country'].unique())
countries*2
284

Part 4

Using the data frame you created by importing the gapminder.tsv data set, determine which country at what point in time had the lowest life expectancy.

Rwanda in 1992 had the lowest life expectancy.

min_lifeExp = df['lifeExp'].min()
min_lifeExp
23.599
idx_min_lifeExp = (df['lifeExp'] == min_lifeExp)
temp = df[idx_min_lifeExp]
temp['country']
1292    Rwanda
Name: country, dtype: object
temp['year']
1292    1992
Name: year, dtype: int64

Conduct a cursory level investigation as to why this was the case and provide a brief explanation in support of your explanation.

In the early 1990s, Rwanda was in the midst of a civil war that culminated in the 1994 genocide, in which hundreds of thousands of Tutsis, an ethnic minority group, were murdered. This conflict helps to explain the drastically low life expectancy of 23.599 years.

Part 5

Using the data frame you created by importing the gapminder.tsv data set, multiply the variable pop by the variable gdpPercap and assign the results to a newly created variable.

gdp = df['pop']*df['gdpPercap']
df['gdp'] = gdp
df.head()
country continent year lifeExp pop gdpPercap gdp
0 Afghanistan Asia 1952 28.801 8425333 779.445314 6.567086e+09
1 Afghanistan Asia 1957 30.332 9240934 820.853030 7.585449e+09
2 Afghanistan Asia 1962 31.997 10267083 853.100710 8.758856e+09
3 Afghanistan Asia 1967 34.020 11537966 836.197138 9.648014e+09
4 Afghanistan Asia 1972 36.088 13079460 739.981106 9.678553e+09

Then subset and order from highest to lowest the results for Germany, France, Italy and Spain in 2007. Create a table that illustrates your results (you are welcome to either create a table in markdown or plot/save in PyCharm and upload the image).

subset_country = df[df['country'].isin(['Germany','France','Italy','Spain'])]
subset_2007 = subset_country[subset_country['year']==2007]
subset_2007 = subset_2007.sort_values('gdp',ascending=False)
subset_2007
country continent year lifeExp pop gdpPercap gdp
575 Germany Europe 2007 79.406 82400996 32170.37442 2.650871e+12
539 France Europe 2007 80.657 61083916 30470.01670 1.861228e+12
779 Italy Europe 2007 80.546 58147733 28569.71970 1.661264e+12
1427 Spain Europe 2007 80.941 40448191 28821.06370 1.165760e+12
Country GDP
Germany 2.65e+12
France 1.86e+12
Italy 1.66e+12
Spain 1.17e+12

Stretch goal: which of the four European countries exhibited the most significant increase in total gross domestic product during the previous 5-year period (to 2007)?

Germany’s total GDP grew the most between 2002 and 2007.

subset_country = df[df['country'].isin(['Germany','France','Italy','Spain'])]
subset_2002 = subset_country[subset_country['year']==2002]
subset_2002 = subset_2002.sort_values('gdp',ascending=False)
germanyDif = subset_2007.iloc[0]['gdp'].astype(int) - subset_2002.iloc[0]['gdp'].astype(int)
franceDif = subset_2007.iloc[1]['gdp'].astype(int) - subset_2002.iloc[1]['gdp'].astype(int)
italyDif = subset_2007.iloc[2]['gdp'].astype(int) - subset_2002.iloc[2]['gdp'].astype(int)
spainDif = subset_2007.iloc[3]['gdp'].astype(int) - subset_2002.iloc[3]['gdp'].astype(int)
max(germanyDif,franceDif,italyDif,spainDif) == germanyDif
True
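The row-by-row differences above rely on the 2002 and 2007 subsets happening to sort into the same country order. A sketch that matches countries by label instead (assuming df already contains the gdp column created in this part) could pivot the data by year:

europe4 = df[df['country'].isin(['Germany','France','Italy','Spain'])]
# one row per country, one column per year, values are total GDP
gdp_by_year = europe4.pivot(index='country', columns='year', values='gdp')
gdp_change = gdp_by_year[2007] - gdp_by_year[2002]
gdp_change.idxmax()
'Germany'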

Part 6

You have been introduced to four logical operators thus far: &, ==, | and ^. Describe each one including its purpose and function. Provide an example of how each might be used in the context of programming.

The & operator determines whether both conditions are true. For example, the expression (2>1) & (3>2) is true, because both sides of the comparison are true. == is used to evaluate whether two things are equal to each other. For example, 2 == 2 is true. | is the inclusive or, which evaluates to true if either condition is true or if both are true. For example, (2>1) | (2<1) evaluates to true even though the second condition is false, and (2>1) | (1<2) also evaluates to true because both conditions are true. ^ is the exclusive or, which only evaluates to true if one but not both of the conditions is true. For example, (2>1) ^ (2<1) evaluates to true even though the second condition is false, but (2>1) ^ (1<2) evaluates to false, since both conditions are true.

(2>1) & (3>2)
True
2 == 2
True
(2>1) | (2<1)
True
(2>1) | (1<2)
True
(2>1) ^ (2<1)
True
(2>1) ^ (1<2)
False

Part 7

Describe the difference between .loc and .iloc. Provide an example of how to extract a series of consecutive observations from a data frame.

.loc retrieves rows and values based on their index labels, while .iloc retrieves them based on their integer position. Extracting a series of consecutive observations can be done with .iloc by slicing a range of consecutive integer positions:

df.iloc[0:5]
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106
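For comparison, the same rows can be retrieved by label with .loc. Note that a .loc slice includes both endpoints, so the sketch below returns the same five rows as df.iloc[0:5] only because the default integer index labels happen to match the positions:

df.loc[0:4]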

Stretch goal: provide an example of how to extract all observations from a series of consecutive columns.

To extract all observations from a series of consecutive columns, a subset of the data frame can be created using column names. .iloc can also be used to subset consecutive columns, with a colon indicating that all rows should be included and the column positions listed in brackets.

columns = ['country','continent','year']
df_subset = df[columns]
df_subset.head()
country continent year
0 Afghanistan Asia 1952
1 Afghanistan Asia 1957
2 Afghanistan Asia 1962
3 Afghanistan Asia 1967
4 Afghanistan Asia 1972
df.iloc[:,[0,1,2]]
country continent year
0 Afghanistan Asia 1952
1 Afghanistan Asia 1957
2 Afghanistan Asia 1962
3 Afghanistan Asia 1967
4 Afghanistan Asia 1972
... ... ... ...
1699 Zimbabwe Africa 1987
1700 Zimbabwe Africa 1992
1701 Zimbabwe Africa 1997
1702 Zimbabwe Africa 2002
1703 Zimbabwe Africa 2007

1704 rows × 3 columns

Part 8

Describe how an api works. Provide an example of how to construct a request to a remote server in order to pull data, write it to a local file and then import it to your current work session.

An API is an application programming interface, the part of a remote server that interacts with requests. It’s basically the messenger between the client and the provider. To construct a request, you need to import the requests library and specify the url to be requested. Then a local file must be created for writing, such as in the format below. Finally, the file can be imported to the current work session as a pandas data frame.

import requests
import pandas as pd
# request the data from the remote server
url = "https://api.covidtracking.com/v1/states/daily.csv"
r = requests.get(url)
# write the response content to a local file
file_name = 'thisismyfile' + '.csv'
with open(file_name, 'wb') as f:
    f.write(r.content)
# read the local file into the current work session as a data frame
dfex = pd.read_csv(file_name)
dfex.head()
date state positive probableCases negative pending totalTestResultsSource totalTestResults hospitalizedCurrently hospitalizedCumulative ... dataQualityGrade deathIncrease hospitalizedIncrease hash commercialScore negativeRegularScore negativeScore positiveScore score grade
0 20210217 AK 54799.0 NaN NaN NaN totalTestsViral 1611971.0 32.0 1238.0 ... NaN 1 5 c002b868e44913a74a70528d3fa75e7e3e5b4191 0 0 0 0 0 NaN
1 20210217 AL 483167.0 104382.0 1859516.0 NaN totalTestsPeopleViral 2238301.0 1030.0 44541.0 ... NaN 89 0 3d4a8eef551e9cf3e5ab585388add5a45719e97b 0 0 0 0 0 NaN
2 20210217 AR 314192.0 65337.0 2344576.0 NaN totalTestsViral 2593431.0 638.0 14392.0 ... NaN 26 0 1de895ab8fa94a2d14163418fea02180332cea5f 0 0 0 0 0 NaN
3 20210217 AS 0.0 NaN 2140.0 NaN totalTestsViral 2140.0 NaN NaN ... NaN 0 0 2857f0702fbfc1243e77e13a044b5830f3bea397 0 0 0 0 0 NaN
4 20210217 AZ 801055.0 53685.0 2901236.0 NaN totalTestsViral 7281277.0 1941.0 55983.0 ... NaN 82 118 7f4d3ca6991968ec0fb4227de7d9ced82b0d1ccf 0 0 0 0 0 NaN

5 rows × 55 columns

Part 9

Describe the apply() function from the pandas library. What is its purpose?

The apply() function applies a given function along an axis of a data frame, either to each row or to each column; it can also be applied element by element to a single column (a series). It can quickly and tidily enact the same action on many rows of data.

Using apply() to various class objects is an alternative (potentially preferable approach) to writing what other type of command? Why do you think apply() could be a preferred approach?

A loop can also be used to go through the data frame and apply a function to each row. apply() does this more neatly and efficiently.
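As a brief sketch with the gapminder data frame from earlier (the new column names here are purely illustrative), apply() can act element-wise on a single column or row-wise across the whole frame:

# element-wise on one column (a series)
df['lifeExp_rounded'] = df['lifeExp'].apply(round)
# row-wise across the data frame (axis=1 passes each row to the function)
df['gdp_check'] = df.apply(lambda row: row['pop'] * row['gdpPercap'], axis=1)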

Part 10

Also describe an alternative approach to filtering the number of columns in a data frame. Instead of using .iloc, what other approach might be used to select, filter and assign a subset number of variables to a new data frame?

Instead of using .iloc, a subset of a data frame can be created by using specific column names. The subset is selected by listing the column names, filtered by viewing the data frame of only those column names, and assigned to a new variable name (df_subset), as shown below.

columns = ['country','continent','year']
df_subset = df[columns]
df_subset.head()
country continent year
0 Afghanistan Asia 1952
1 Afghanistan Asia 1957
2 Afghanistan Asia 1962
3 Afghanistan Asia 1967
4 Afghanistan Asia 1972