</div>
Adapted by Juan Carlos Basto Pineda from the very nice lesson on Plotting and Programming in Python by Software Carpentry.
Licensed under CC-BY 4.0 2018–2021 by The Carpentries
Licensed under CC-BY 4.0 by Software Carpentry Foundation
I strongly recommend that you visit the full lesson later on.
Let's start with a quick view of some sampled Gross Domestic Product data.
import pandas as pd
data = pd.read_csv('data/gapminder_gdp_oceania.csv')
data
Use index_col to specify that a column's values should be used as row headings
data = pd.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')
data
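To see what index_col changes, here is a minimal sketch that reads the same text twice. The CSV content, the country names, and the numbers are made up for illustration; they are not taken from the gapminder files.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the lesson's data file;
# the names and numbers are invented for illustration only.
csv_text = """country,gdpPercap_2002,gdpPercap_2007
Freedonia,100.0,110.0
Sylvania,200.0,220.0
"""

# Without index_col, pandas numbers the rows 0, 1, ... and keeps
# 'country' as an ordinary column.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col='country', the country names become the row labels.
by_country = pd.read_csv(io.StringIO(csv_text), index_col='country')

print(default_index.index.tolist())
print(by_country.index.tolist())
print(by_country.columns.tolist())
```

Once the countries are the row labels, rows can be looked up by name instead of by position.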
Use the DataFrame.info() method to find out more about a dataframe
data.info()
Labels of rows are stored as DataFrame.index and labels of columns as DataFrame.columns
data.index
data.columns
How would you get a new dataframe containing only the data for New Zealand after 2000, labeling the row with the index NZ and labeling the columns with just the year?
# Your Answer Here
Use DataFrame.T to transpose a dataframe.
print(data.T)
Use DataFrame.describe() to get summary statistics about data.
DataFrame.describe() returns summary statistics only for the columns that hold numerical data. All other columns are ignored unless you pass the argument include='all'.
data.describe(include='all')
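The effect of include='all' is easiest to see on a dataframe that mixes text and numbers. The sketch below uses a small made-up dataframe (the column names and values are assumptions, not lesson data).

```python
import pandas as pd

# Toy dataframe with one text column and one numeric column (invented values).
toy = pd.DataFrame(
    {'continent': ['X', 'X', 'Y'], 'gdp': [1.0, 2.0, 3.0]},
    index=['Freedonia', 'Sylvania', 'Ruritania'],
)

# By default only the numeric column is summarised.
numeric_only = toy.describe()

# include='all' adds the text column, with rows such as 'unique' and 'top'.
everything = toy.describe(include='all')

print(numeric_only.columns.tolist())
print(everything.columns.tolist())
print(everything.loc['unique', 'continent'])
```

Statistics that do not apply to a column type (for example the mean of a text column) show up as NaN in the combined table.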
Read the data in gapminder_gdp_americas.csv into a variable called americas and display its summary statistics.
# Your Answer Here
Try americas.shape, americas.head(), and americas.tail(). What do they do?
# Your Answer Here
Write to the file gdp_LACoNGA.csv, checking the help of the pd.DataFrame.to_csv command first. You can do this by typing help(americas.to_csv)
# Your Answer Here
Let's take a look at the data from European countries
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
data.head(2)
After checking the following commands, what do you think DataFrame.iloc does?
data.iloc[0]
data.iloc[:,0]
data.iloc[0,:]
data.iloc[0,0]
We can access data by the labels as well. Remember that row labels are given by data.index and column labels by data.columns
data.index
data.columns
After checking the following commands, what do you think DataFrame.loc does?
data.loc["Albania", "gdpPercap_1952"]
Use : on its own to mean all columns or all rows.
The following line extracts a full row based on its index label
data.loc["Albania", :]
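Note that you would get the same result printing data.loc["Albania"] (without a second index). Plain square brackets, as in data["Albania"], would not work here, because they look up column labels rather than row labels.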
data.loc["Albania"]
In case you want data from a given column, change the position of the :
data.loc[:,'gdpPercap_2007']
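As a side-by-side comparison, the sketch below selects the same pieces of a small dataframe twice, once by integer position with .iloc and once by label with .loc. The dataframe is a made-up stand-in (hypothetical country names and values), not the European data.

```python
import pandas as pd

# Toy dataframe with invented values, shaped like the lesson's data.
toy = pd.DataFrame(
    {'gdpPercap_1952': [10.0, 20.0, 30.0],
     'gdpPercap_2007': [11.0, 22.0, 33.0]},
    index=['Freedonia', 'Sylvania', 'Ruritania'],
)

# A full row: position 0 versus the label 'Freedonia'.
row_by_position = toy.iloc[0]
row_by_label = toy.loc['Freedonia']

# A single cell: (row 1, column 0) versus (row label, column label).
cell_by_position = toy.iloc[1, 0]
cell_by_label = toy.loc['Sylvania', 'gdpPercap_1952']

# A full column: position 1 versus the column label.
col_by_position = toy.iloc[:, 1]
col_by_label = toy.loc[:, 'gdpPercap_2007']

print(row_by_position.equals(row_by_label))
print(cell_by_position == cell_by_label)
print(col_by_position.equals(col_by_label))
```

Label-based access tends to be more readable and keeps working if rows or columns are reordered, while positional access is handy when you only care about "the first few" entries.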
Or you can access a column using just square brackets and the column name, or using a dot (.), without the need for .loc or .iloc. The dot form only works when the column name is a valid Python identifier:
data['gdpPercap_2007']
data.gdpPercap_2007
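The following sketch checks, on a small made-up dataframe, that the three ways of selecting a column agree, and shows two limits of the shortcuts. The column named 'gdp per cap' and the country names are hypothetical, chosen only to illustrate the point.

```python
import pandas as pd

# Toy dataframe (invented values); the second column name contains spaces.
toy = pd.DataFrame(
    {'gdpPercap_2007': [11.0, 22.0], 'gdp per cap': [1.0, 2.0]},
    index=['Freedonia', 'Sylvania'],
)

by_bracket = toy['gdpPercap_2007']
by_dot = toy.gdpPercap_2007
by_loc = toy.loc[:, 'gdpPercap_2007']
print(by_bracket.equals(by_dot) and by_bracket.equals(by_loc))

# A name with spaces cannot be written after a dot, so brackets are needed.
with_spaces = toy['gdp per cap']

# Square brackets look up column labels, so a row label raises KeyError.
try:
    toy['Freedonia']
    row_lookup_failed = False
except KeyError:
    row_lookup_failed = True
print(row_lookup_failed)
```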
Modify values using = and providing a value or an array with the right shape
data.loc["Albania", "gdpPercap_1952"] = 0
data.iloc[0:3,0] = [1,2,3]
# Any sequence with one value per row works here; range(len(data)) is just a placeholder.
data['new_column'] = list(range(len(data)))
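What "the right shape" means can be tried out safely on a small made-up dataframe. In this sketch the names and values are invented; the last assignment is deliberately too short to show the error pandas raises.

```python
import pandas as pd

# Toy dataframe with three rows (invented values).
toy = pd.DataFrame(
    {'gdpPercap_1952': [10.0, 20.0, 30.0],
     'gdpPercap_1957': [11.0, 22.0, 33.0]},
    index=['Freedonia', 'Sylvania', 'Ruritania'],
)

# A single cell takes a single value.
toy.loc['Freedonia', 'gdpPercap_1952'] = 0.0

# A slice of two rows in one column takes two values.
toy.iloc[0:2, 1] = [1.0, 2.0]

# A new column needs exactly one value per row.
toy['rank'] = [3, 2, 1]

# Two values for three rows do not fit, so pandas raises ValueError.
try:
    toy['too_short'] = [1, 2]
    length_error = False
except ValueError:
    length_error = True

print(toy)
print(length_error)
```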
# Your Answer Here
Select multiple columns or rows using DataFrame.loc and a named slice
data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
All the statistical operators that work on entire dataframes work the same way on slices.
For example, let's find the maximum over the selected years for each country:
data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].max(axis = 1)
Other well-known functions such as min, median, mean, and std are also available, since pandas is built on top of NumPy.
data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].median()
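Two details of the commands above are worth trying on a small example: a label slice with .loc includes its end point, unlike a positional slice with .iloc, and the axis argument decides whether a statistic is computed per row or per column. The dataframe below is a made-up stand-in with hypothetical names and values.

```python
import pandas as pd

# Toy 3x3 dataframe (invented values).
toy = pd.DataFrame(
    {'y1962': [1.0, 4.0, 7.0],
     'y1967': [2.0, 5.0, 8.0],
     'y1972': [3.0, 6.0, 9.0]},
    index=['Freedonia', 'Sylvania', 'Ruritania'],
)

# Label slices include both ends: 2 rows x 2 columns.
by_label = toy.loc['Freedonia':'Sylvania', 'y1962':'y1967']

# Positional slices exclude the end: 1 row x 1 column.
by_position = toy.iloc[0:1, 0:1]

# axis=1 works across columns, giving one value per row (country).
per_row_max = by_label.max(axis=1)

# The default (axis=0) works down the rows, giving one value per column (year).
per_column_median = by_label.median()

print(by_label.shape, by_position.shape)
print(per_row_max)
print(per_column_median)
```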
In which of the following countries and years was the GDP per capita larger than 10,000?
# Use a subset of data to keep output readable.
subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
print('Subset of data:\n', subset)
# Which values were greater than 10000 ?
print('\nWhere are values large?\n', subset > 10000)
mask = subset > 10000
subset[mask]
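Here is the same masking idea on a tiny made-up dataframe, where it is easy to check every cell by eye. The names and values are invented; DataFrame.where is shown as an equivalent spelling of the same selection.

```python
import pandas as pd

# Toy dataframe (invented values) with one cell below the threshold.
toy = pd.DataFrame(
    {'y1962': [5000.0, 12000.0], 'y1967': [11000.0, 15000.0]},
    index=['Freedonia', 'Sylvania'],
)

# Comparing a dataframe with a number gives a dataframe of True/False.
mask = toy > 10000

# Indexing with the mask keeps values where it is True and puts NaN elsewhere.
masked = toy[mask]

# DataFrame.where(mask) produces the same result.
same_as_where = masked.equals(toy.where(mask))

print(mask)
print(masked)
print(same_as_where)
```

Because most statistics skip NaN by default, a call such as masked.count(axis=1) reports how many values passed the test in each row.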
Indexing with a mask gives the value where the mask is true, and NaN (Not a Number) where it is false.
mask_higher = data > data.mean()
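# Summing the boolean mask counts the columns in which each country is above the mean;
# dividing by the number of columns turns that count into a fraction.
wealth_score = mask_higher.aggregate('sum', axis=1) / len(data.columns)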
data['ws'] = wealth_score
data.head()
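When building a score from a mask, it matters whether you sum the mask itself or the masked values. The sketch below contrasts the two on a made-up dataframe (hypothetical names and values): summing the boolean mask counts the columns above the mean, while summing the masked data adds up the values themselves.

```python
import pandas as pd

# Toy dataframe (invented values); column means are 4.0 and 6.0.
toy = pd.DataFrame(
    {'y1962': [1.0, 2.0, 9.0], 'y1967': [1.0, 8.0, 9.0]},
    index=['Freedonia', 'Sylvania', 'Ruritania'],
)

above_mean = toy > toy.mean()

# True counts as 1, so this is the number of columns above the mean per row,
# divided by the number of columns: a fraction between 0 and 1.
fraction_above = above_mean.sum(axis=1) / len(toy.columns)

# This instead adds up the values that passed the test (NaN is skipped).
sum_of_values = toy[above_mean].sum(axis=1)

print(fraction_above)
print(sum_of_values)
```

A fraction takes only a handful of distinct values, which is what makes it usable as a category for grouping.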
Use DataFrame.groupby to apply a math function to subsets of data according to a given parameter
What was the total contribution of the countries in each wealth score category in each year?
data.groupby(by='ws').sum()
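To check by hand what groupby followed by sum does, here is a sketch on four made-up rows, two of which share the same score. The names, values, and scores are invented for illustration.

```python
import pandas as pd

# Toy dataframe (invented values); Sylvania and Ruritania share ws = 0.5.
toy = pd.DataFrame(
    {'y1962': [1.0, 2.0, 3.0, 4.0],
     'y1967': [10.0, 20.0, 30.0, 40.0],
     'ws': [0.0, 0.5, 0.5, 1.0]},
    index=['Freedonia', 'Sylvania', 'Ruritania', 'Genovia'],
)

# Rows with the same 'ws' are pooled, and each remaining column is summed
# within its pool. The group values become the index of the result.
totals = toy.groupby(by='ws').sum()

print(totals)
```

The result has one row per distinct score, and the grouping column no longer appears among the columns.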
Use groupby to classify and then apply a customized mathematical function with GroupBy.apply
df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1, 2, 3], 'C': [4, 6, 5]})
df
g = df.groupby('A')
# Select the numeric columns first so the text column 'A' stays out of the division.
g[['B', 'C']].apply(lambda x: x / x.sum())
g.apply(lambda x: x.C.max() - x.B.min())
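The two kinds of custom function used above behave differently: one returns a value per original row, the other a single value per group. This sketch reuses the same small dataframe and selects the numeric columns before calling apply, which is an assumption made here to keep the text column out of the arithmetic across pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1, 2, 3], 'C': [4, 6, 5]})
g = df.groupby('A')

# Each value divided by its group's column total: one output row per input row.
shares = g[['B', 'C']].apply(lambda x: x / x.sum())

# One number per group: the largest C minus the smallest B within the group.
spread = g[['B', 'C']].apply(lambda x: x.C.max() - x.B.min())

print(shares)
print(spread)
```

In group a, B holds 1 and 2, so the shares are 1/3 and 2/3; group b has a single row, so its shares are 1. The exact index layout of shares can differ between pandas versions, but the values are the same.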
groupby object according to the wealth score
# Your Answer Here
# Your Answer Here
# Your Answer Here
# Your Answer Here