
BASICS¶

Contributors:¶

Juan Carlos Basto Pineda (juan.basto.pineda@gmail.com)

Adapted by Juan Carlos Basto Pineda from the very nice lesson on Plotting and Programming in Python by Software Carpentry.
Licensed under CC-BY 4.0 2018–2021 by The Carpentries
Licensed under CC-BY 4.0 by Software Carpentry Foundation

I strongly recommend visiting the full lesson later on.

Reading Tabular Data into DataFrames¶

Let's start with a quick view of some sampled Gross Domestic Product data.

In [ ]:

import pandas as pd

In [ ]:

data = pd.read_csv('data/gapminder_gdp_oceania.csv')

In [ ]:

data

Use `index_col` to specify that a column’s values should be used as row headings¶

In [ ]:

data = pd.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')
data

Use the `DataFrame.info()` method to find out more about a dataframe¶

In [ ]:

data.info()

Labels of rows as `DataFrame.index` and labels of columns as `DataFrame.columns`¶

In [ ]:

data.index

In [ ]:

data.columns

`DataFrame.index` and `DataFrame.columns` can be redefined as necessary¶

Check the following commands
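As a minimal sketch of relabeling, the cell below uses a tiny toy dataframe with made-up values (not the real Gapminder numbers). The names `toy` and `renamed`, and the new labels, are only for illustration.

```python
import pandas as pd

# Toy values for illustration only (not the real Gapminder numbers).
toy = pd.DataFrame(
    {"gdpPercap_1952": [10.0, 20.0], "gdpPercap_2007": [30.0, 40.0]},
    index=["Country A", "Country B"],
)

# Assign a new list of labels; its length must match the number of rows/columns.
toy.index = ["A", "B"]
toy.columns = [1952, 2007]

# rename() returns a modified copy and only touches the labels you list.
renamed = toy.rename(index={"A": "Alpha"}, columns={1952: "y1952"})
print(renamed)
```

Direct assignment replaces all labels at once, while `rename` is handy when you only want to change a few of them.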

You can get rid of rows or columns you don't need with `DataFrame.drop`¶

DataFrame.drop
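A small sketch of `DataFrame.drop`, again on a toy dataframe with made-up values and hypothetical country names:

```python
import pandas as pd

# Toy values for illustration only (not the real Gapminder numbers).
toy = pd.DataFrame(
    {"gdpPercap_1952": [10.0, 20.0, 30.0], "gdpPercap_2007": [40.0, 50.0, 60.0]},
    index=["Country A", "Country B", "Country C"],
)

# Drop a row by its index label and a column by its name.
# drop() returns a new dataframe; `toy` itself is left unchanged.
smaller = toy.drop(index="Country B", columns="gdpPercap_1952")
print(smaller)
```

Both `index` and `columns` also accept a list of labels if you need to remove several at once.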

How would you get a new dataframe containing only the data for New Zealand after 2000, with the row labeled NZ and the columns labeled just with the year?

In [ ]:

# Your Answer Here

Use `DataFrame.T` to transpose a dataframe.¶

In [ ]:

print(data.T)

Use `DataFrame.describe()` to get summary statistics about data.¶

`describe()` summarizes only the columns that hold numerical data. All other columns are ignored unless you pass the argument include='all'.

In [ ]:

data.describe(include='all')
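To see what `include='all'` changes, here is a sketch on a toy dataframe that has one text column. The `continent` column and all values are made up for illustration; the Oceania table loaded above only has numeric columns once `country` is used as the index.

```python
import pandas as pd

# Toy values for illustration only; "continent" is a hypothetical text column.
toy = pd.DataFrame(
    {"continent": ["X", "X", "Y"], "gdpPercap_2007": [10.0, 20.0, 30.0]},
    index=["Country A", "Country B", "Country C"],
)

# Default: only the numeric column is summarized.
numeric_only = toy.describe()

# include="all": the text column is summarized too (count, unique, top, freq).
everything = toy.describe(include="all")

print(numeric_only)
print(everything)
```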

Read the data in gapminder_gdp_americas.csv into a variable called americas and display its summary statistics.

In [ ]:

# Your Answer Here

Check the commands americas.shape, americas.head(), and americas.tail(). What do they do?

In [ ]:

# Your Answer Here

Drop some rows to retain only the Latin American countries that have universities participating in LACoNGA-Physics.
Store the reduced dataframe in a file named gdp_LACoNGA.csv, checking the help of the pd.DataFrame.to_csv command first. You can do this by typing help(americas.to_csv).

In [ ]:

# Your Answer Here

Accessing and modifying data¶

Let's take a look at the data from European countries

In [ ]:

data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')

In [ ]:

data.head(2)

After running the following commands, what do you think DataFrame.iloc does?

In [ ]:

data.iloc[0]

In [ ]:

data.iloc[:,0]

In [ ]:

data.iloc[0,:]

In [ ]:

data.iloc[0,0]

We can access data by the labels as well. Remember that row labels are given by data.index and column labels by data.columns

In [ ]:

data.index

In [ ]:

data.columns

After running the following command, what do you think DataFrame.loc does?

In [ ]:

data.loc["Albania", "gdpPercap_1952"]

Use `:` on its own to mean all columns or all rows.¶

The following line serves to extract a full row based on its index

In [ ]:

data.loc["Albania", :]

Note that you get the same result with data.loc["Albania"] (without a second index). Plain square brackets, as in data["Albania"], do not work here: they select columns, so this raises a KeyError.

In [ ]:

data.loc["Albania"]
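The difference between selecting a row with `.loc` and selecting a column with plain square brackets can be sketched on a toy dataframe (made-up values, hypothetical country names):

```python
import pandas as pd

# Toy values for illustration only (not the real Gapminder numbers).
toy = pd.DataFrame(
    {"gdpPercap_1952": [10.0, 20.0], "gdpPercap_2007": [30.0, 40.0]},
    index=["Country A", "Country B"],
)

row = toy.loc["Country A"]     # a row, selected by its index label
col = toy["gdpPercap_2007"]    # square brackets alone select a column

# A row label is not a column name, so plain brackets raise a KeyError.
try:
    toy["Country A"]
    found = True
except KeyError:
    found = False
print(found)
```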

In case you want data from a given column, change the position of the :

In [ ]:

data.loc[:,'gdpPercap_2007']

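You can also access a column with square brackets and the column name, or with dot notation, without .loc or .iloc. Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute: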

In [ ]:

data['gdpPercap_2007']

In [ ]:

data.gdpPercap_2007

Modifying data¶

The same expression you use to read a value or a range of values can be used to replace data: just add = and provide a value or an array with the right shape.
For instance:
data.loc["Albania", "gdpPercap_1952"] = 0
data.iloc[0:3,0] = [1,2,3]
To add a new column, assign values to a column name that does not exist yet:
data['new_column'] = [array of values with the right shape]
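The assignments above can be tried safely on a copy. This is only a sketch on a toy dataframe with made-up values; `toy` and `work` are hypothetical names.

```python
import pandas as pd

# Toy values for illustration only (not the real Gapminder numbers).
toy = pd.DataFrame(
    {"gdpPercap_1952": [10.0, 20.0, 30.0], "gdpPercap_2007": [40.0, 50.0, 60.0]},
    index=["Country A", "Country B", "Country C"],
)

# copy() gives an independent dataframe, so `toy` keeps its original values.
work = toy.copy()

work.loc["Country A", "gdpPercap_1952"] = 0.0   # single value, by label
work.iloc[0:2, 1] = [1.0, 2.0]                  # two values, by position
work["new_column"] = [7, 8, 9]                  # new column, one value per row
print(work)
```

Working on a copy is a good habit while experimenting, since a plain `work = toy` would make both names point to the same dataframe.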

Exercise:¶

Create a copy of the DataFrame
Change the first value of the GDP for the first country to 5
Replace all numbers in a column by 1's
Add a new column with 0's

In [ ]:

# Your Answer Here

Select multiple columns or rows using `DataFrame.loc` and a named slice¶

In [ ]:

data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
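One detail worth checking yourself: slices by label in `.loc` include both endpoints, whereas slices by position in `.iloc` exclude the stop position, like ordinary Python slices. A sketch with a toy dataframe (made-up values and labels):

```python
import pandas as pd

# Toy values for illustration only.
toy = pd.DataFrame(
    {"y1952": [1, 2, 3, 4], "y1957": [5, 6, 7, 8], "y1962": [9, 10, 11, 12]},
    index=["A", "B", "C", "D"],
)

by_label = toy.loc["B":"C", "y1952":"y1957"]   # both endpoints are included
by_position = toy.iloc[1:3, 0:2]                # stop positions are excluded

# Both selections contain rows B and C and the first two columns.
print(by_label.equals(by_position))
```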

Result of slicing can be used in further operations.¶

All the statistical operators that work on entire dataframes work the same way on slices.
E.g., let's find the maximum for each country across certain columns; axis = 1 means the operation runs along each row.

In [ ]:

data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].max(axis = 1)

Other well-known functions like min, median, mean, and std are available too, as pandas is built on top of numpy.

In [ ]:

data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].median()

Use comparisons to select data based on value.¶

In which of the following countries and years was the GDP per capita larger than 10,000?

In [ ]:

# Use a subset of data to keep output readable.
subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
print('Subset of data:\n', subset)

In [ ]:

# Which values were greater than 10000 ?
print('\nWhere are values large?\n', subset > 10000)

The concept of Boolean masks.¶

In [ ]:

mask = subset > 10000
subset[mask]

Indexing with the mask returns the value where the mask is True, and NaN (Not a Number) where it is False.
This is useful because NaNs are ignored by operations like max, min, mean, etc.
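To see how NaNs are skipped, here is a sketch on a toy dataframe with made-up values, using the same 10000 threshold as above:

```python
import pandas as pd

# Toy values for illustration only (not the real Gapminder numbers).
toy = pd.DataFrame(
    {"y1962": [8000.0, 12000.0], "y1967": [11000.0, 9000.0]},
    index=["Country A", "Country B"],
)

mask = toy > 10000
masked = toy[mask]          # values where mask is True, NaN elsewhere
print(masked)

# mean() and count() only consider the values that survived the mask.
print(masked.mean())
print(masked.count())
```

Each column keeps a single value after masking, so the column mean is just that value and `count()` reports 1 for both columns.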

`Aggregate` method to apply a mathematical function along rows or columns¶

What do you think the following commands are doing?

In [ ]:

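# True where a country's value is above that year's mean over all countries
mask_higher = data > data.mean()
# Sum the Boolean mask itself (True counts as 1), so the score falls into
# a few discrete categories that can be used for grouping later
wealth_score = mask_higher.aggregate('sum', axis=1) / len(data.columns)
data['ws'] = wealth_score
data.head()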

`DataFrame.groupby` to apply a math function to subsets of data according to a given parameter¶

What was the total contribution of the countries in each category of wealth score in each year?

In [ ]:

data.groupby(by='ws').sum()

Use `groupby` to classify and then apply a customized mathematical function with `GroupBy.apply`¶

In [ ]:

df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1,2,3], 'C': [4,6, 5]})
df

In [ ]:

g = df.groupby('A')

In [ ]:

# Select the numeric columns first: the text column 'A' cannot be divided
g[['B', 'C']].apply(lambda x: x / x.sum())

In [ ]:

g.apply(lambda x: x.C.max() - x.B.min())

Exercise¶

Create a groupby object according to the wealth score

In [ ]:

# Your Answer Here

For each group, calculate the mean GDP in 2007

In [ ]:

# Your Answer Here

For each country, calculate the percentage of its contribution to the total GDP of its group in 2007

In [ ]:

# Your Answer Here

For each group, calculate the total contribution to the GDP across all years

In [ ]:

# Your Answer Here
