title

</div>

BASICS

Contributors:

Adapted by Juan Carlos Basto Pineda from the very nice lesson on Plotting and Programming in Python by Software Carpentry.
Licensed under CC-BY 4.0 2018–2021 by The Carpentries
Licensed under CC-BY 4.0 by Software Carpentry Foundation

I strongly recommend that you visit the full lesson later on.

Reading Tabular Data into DataFrames

Let's start with a quick view of some sampled Gross Domestic Product data.

Use index_col to specify that a column’s values should be used as row headings
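As a minimal sketch of what index_col changes, the snippet below reads the same CSV text twice. The CSV content, the column names and the numbers are invented for illustration; a real file path would normally be passed to read_csv instead of the in-memory text used here.

```python
import io
import pandas as pd

# Toy CSV text standing in for a real file; the values are invented.
csv_text = """country,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007
Australia,27000.0,31000.0,34000.0
New Zealand,21000.0,23000.0,25000.0
"""

# Without index_col, rows get the default integer labels 0, 1, ...
plain = pd.read_csv(io.StringIO(csv_text))

# With index_col, the 'country' column becomes the row labels.
data = pd.read_csv(io.StringIO(csv_text), index_col="country")

print(plain.index.tolist())
print(data.index.tolist())
```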

Use the DataFrame.info() method to find out more about a dataframe
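A short sketch of calling info() on a small dataframe. The dataframe is built by hand with invented values, so the names and numbers are assumptions rather than the lesson's data.

```python
import pandas as pd

# Toy dataframe with invented values.
data = pd.DataFrame(
    {
        "gdpPercap_1997": [27000.0, 21000.0],
        "gdpPercap_2002": [31000.0, 23000.0],
        "gdpPercap_2007": [34000.0, 25000.0],
    },
    index=["Australia", "New Zealand"],
)

# info() prints the index type, the column names, the non-null counts,
# the dtypes and the memory usage. It prints directly and returns None.
result = data.info()
```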

Row labels are stored in DataFrame.index and column labels in DataFrame.columns

DataFrame.index and DataFrame.columns can be redefined as necessary

Check the following commands
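Commands along these lines show how the labels can be inspected and replaced. This is only a sketch on a hand-built dataframe with invented values; the short labels AUS and NZ are chosen here for illustration.

```python
import pandas as pd

# Toy dataframe with invented values.
data = pd.DataFrame(
    {"gdpPercap_2002": [31000.0, 23000.0], "gdpPercap_2007": [34000.0, 25000.0]},
    index=["Australia", "New Zealand"],
)

# Inspect the current labels.
print(data.index)
print(data.columns)

# Assigning a list of the right length replaces the labels in place.
data.index = ["AUS", "NZ"]
data.columns = [2002, 2007]
print(data)
```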

You can get rid of rows or columns you don't need with DataFrame.drop
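A minimal sketch of DataFrame.drop on a toy dataframe (invented values). Note that drop returns a new dataframe and leaves the original untouched.

```python
import pandas as pd

# Toy dataframe with invented values.
data = pd.DataFrame(
    {
        "gdpPercap_1997": [27000.0, 21000.0],
        "gdpPercap_2002": [31000.0, 23000.0],
        "gdpPercap_2007": [34000.0, 25000.0],
    },
    index=["Australia", "New Zealand"],
)

# Drop a row by its label.
no_australia = data.drop(index="Australia")

# Drop one or more columns by name.
recent = data.drop(columns=["gdpPercap_1997"])

print(no_australia)
print(recent)
```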

How would you get a new dataframe containing only the data for New Zealand after 2000, with the row labelled NZ and the columns labelled with just the year?

Use DataFrame.T to transpose a dataframe.
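A small sketch of transposing, again on invented values: rows become columns and columns become rows.

```python
import pandas as pd

# Toy dataframe with invented values.
data = pd.DataFrame(
    {
        "gdpPercap_1997": [27000.0, 21000.0],
        "gdpPercap_2002": [31000.0, 23000.0],
        "gdpPercap_2007": [34000.0, 25000.0],
    },
    index=["Australia", "New Zealand"],
)

# .T is an attribute, not a method: no parentheses.
transposed = data.T
print(transposed)
```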

Use DataFrame.describe() to get summary statistics about data.

DataFrame.describe() summarizes only the columns that hold numerical data. All other columns are ignored unless you pass the argument include='all'.
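The difference can be sketched on a toy dataframe that mixes a text column with numerical ones. The continent column and all values are invented for this example.

```python
import pandas as pd

# Toy dataframe with one text column and two numerical columns (invented values).
data = pd.DataFrame(
    {
        "continent": ["Oceania", "Oceania"],
        "gdpPercap_2002": [31000.0, 23000.0],
        "gdpPercap_2007": [34000.0, 25000.0],
    },
    index=["Australia", "New Zealand"],
)

# Default: only the numerical columns are summarized.
summary = data.describe()

# include='all' also reports count, unique, top and freq for the text column.
full = data.describe(include="all")

print(summary)
print(full)
```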

Accessing and modifying data

Let's take a look at the data from European countries

After checking the following commands, what do you think DataFrame.iloc does?
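Commands of this kind can be tried on a small hand-built dataframe. The countries, column names and values below are invented stand-ins for the European data.

```python
import pandas as pd

# Toy dataframe with invented values.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 8300.0, 7000.0],
        "gdpPercap_1957": [1900.0, 9700.0, 8700.0],
        "gdpPercap_1962": [2300.0, 10900.0, 10500.0],
    },
    index=["Albania", "Belgium", "France"],
)

# A single entry by integer position: first row, first column.
first = data.iloc[0, 0]

# Negative positions count from the end, as in Python lists.
last = data.iloc[-1, -1]

# Position slices exclude the end point, as in Python lists.
block = data.iloc[0:2, 1:3]

print(first, last)
print(block)
```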

We can access data by the labels as well. Remember that row labels are given by data.index and column labels by data.columns

After checking the following commands, what do you think DataFrame.loc does?
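A sketch of label-based access on a toy dataframe (invented countries and values, used here in place of the real data).

```python
import pandas as pd

# Toy dataframe with invented values.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 8300.0, 7000.0],
        "gdpPercap_1957": [1900.0, 9700.0, 8700.0],
        "gdpPercap_1962": [2300.0, 10900.0, 10500.0],
    },
    index=["Albania", "Belgium", "France"],
)

# A single entry by row label and column label.
value = data.loc["Belgium", "gdpPercap_1957"]

# A list of labels selects several rows at once.
picked = data.loc[["Albania", "France"], "gdpPercap_1962"]

print(value)
print(picked)
```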

Use : on its own to mean all columns or all rows.

The following line serves to extract a full row based on its index

Note that data.loc["Albania"] (without a second index) gives the same result. Plain square brackets do not: data["Albania"] looks for a column named Albania and raises a KeyError.
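The behaviour of the different ways of asking for a row can be checked with a short sketch on a toy dataframe (invented values). The variable bracket_failed is a name introduced here only to record what happens.

```python
import pandas as pd

# Toy dataframe with invented values.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 8300.0, 7000.0],
        "gdpPercap_1957": [1900.0, 9700.0, 8700.0],
    },
    index=["Albania", "Belgium", "France"],
)

# Full row, with and without the explicit ':' for the columns.
row = data.loc["Albania", :]
same_row = data.loc["Albania"]

# Plain square brackets select columns, so a row label is not found.
try:
    data["Albania"]
    bracket_failed = False
except KeyError:
    bracket_failed = True

print(row.equals(same_row), bracket_failed)
```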

In case you want data from a given column, change the position of the :

You can also access a column with square brackets and the column name, or with a dot (.), without needing .loc or .iloc:
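A sketch comparing the three spellings on a toy dataframe (invented values). The dot form only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute.

```python
import pandas as pd

# Toy dataframe with invented values.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 8300.0, 7000.0],
        "gdpPercap_1957": [1900.0, 9700.0, 8700.0],
    },
    index=["Albania", "Belgium", "France"],
)

# Three ways to get the same column.
with_loc = data.loc[:, "gdpPercap_1952"]
with_brackets = data["gdpPercap_1952"]
with_dot = data.gdpPercap_1952

print(with_loc.equals(with_brackets), with_loc.equals(with_dot))
```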

Modifying data
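As a minimal sketch, the same selectors can appear on the left-hand side of an assignment. The values and the new column name growth_52_62 are invented for this example, and the work is done on a copy so the original toy dataframe stays unchanged.

```python
import pandas as pd

# Toy dataframe with invented values.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 8300.0, 7000.0],
        "gdpPercap_1962": [2300.0, 10900.0, 10500.0],
    },
    index=["Albania", "Belgium", "France"],
)

modified = data.copy()

# Overwrite a single entry selected by label.
modified.loc["Albania", "gdpPercap_1952"] = 1650.0

# Assigning to a new column name creates the column.
modified["growth_52_62"] = modified["gdpPercap_1962"] - modified["gdpPercap_1952"]

print(modified)
```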

Exercise:

Select multiple columns or rows using DataFrame.loc and a named slice
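A sketch of slicing by label on a toy dataframe (invented values). Unlike position slices with iloc, label slices with loc include both end points.

```python
import pandas as pd

# Toy dataframe with invented values.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 8300.0, 7000.0],
        "gdpPercap_1957": [1900.0, 9700.0, 8700.0],
        "gdpPercap_1962": [2300.0, 10900.0, 10500.0],
    },
    index=["Albania", "Belgium", "France"],
)

# Rows Albania through Belgium, columns 1957 through 1962, ends included.
subset = data.loc["Albania":"Belgium", "gdpPercap_1957":"gdpPercap_1962"]
print(subset)
```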

Result of slicing can be used in further operations.

All the statistical operators that work on entire dataframes work the same way on slices.
E.g., let's find the maximum of certain columns. Other well-known functions like min, median, mean and std are also available, as pandas is built on top of numpy.
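For instance, a slice can be stored in a variable and summarized directly. This is a sketch on invented values, not the lesson's data.

```python
import pandas as pd

# Toy dataframe with invented values.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 8300.0, 7000.0],
        "gdpPercap_1957": [1900.0, 9700.0, 8700.0],
        "gdpPercap_1962": [2300.0, 10900.0, 10500.0],
    },
    index=["Albania", "Belgium", "France"],
)

subset = data.loc["Belgium":"France", "gdpPercap_1957":"gdpPercap_1962"]

# By default these work column by column and return one value per column.
col_max = subset.max()
col_min = subset.min()

print(col_max)
print(col_min)
```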

Use comparisons to select data based on value.

In which of the following countries/years was the GDP larger than 10,000?
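A question like this can be answered with an element-wise comparison, sketched here on invented values. The result is a dataframe of True/False values with the same shape as the input.

```python
import pandas as pd

# Toy dataframe with invented values.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 8300.0, 7000.0],
        "gdpPercap_1957": [1900.0, 9700.0, 8700.0],
        "gdpPercap_1962": [2300.0, 10900.0, 10500.0],
    },
    index=["Albania", "Belgium", "France"],
)

# The comparison is applied to every entry.
above = data > 10000
print(above)

# Summing twice counts how many entries pass the test.
n_above = int(above.sum().sum())
print(n_above)
```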

The concept of Boolean masks.
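A sketch of using a Boolean dataframe as a mask, on invented values. Entries where the mask is False become NaN, and most summary functions skip NaN by default.

```python
import pandas as pd

# Toy dataframe with invented values.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 8300.0, 7000.0],
        "gdpPercap_1957": [1900.0, 9700.0, 8700.0],
        "gdpPercap_1962": [2300.0, 10900.0, 10500.0],
    },
    index=["Albania", "Belgium", "France"],
)

mask = data > 10000

# Indexing with the mask keeps entries where it is True and puts NaN elsewhere.
masked = data[mask]
print(masked)

# count() reports the non-NaN entries per column; mean() ignores NaN.
kept_per_column = masked.count()
mean_1962 = masked["gdpPercap_1962"].mean()
```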

Use DataFrame.aggregate to apply a mathematical function along rows or columns

What do you think the following commands are doing?
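Commands along these lines can be tried on a toy dataframe (invented values). The axis argument decides whether the function collapses the rows (one result per column, the default) or the columns (one result per row).

```python
import pandas as pd

# Toy dataframe with invented values.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 8300.0, 7000.0],
        "gdpPercap_1957": [1900.0, 9700.0, 8700.0],
        "gdpPercap_1962": [2300.0, 10900.0, 10500.0],
    },
    index=["Albania", "Belgium", "France"],
)

# One maximum per column (axis=0 is the default).
col_max = data.aggregate("max")

# One maximum per row.
row_max = data.aggregate("max", axis=1)

# A list of function names gives one row of results per function.
both = data.aggregate(["min", "max"])

print(col_max)
print(row_max)
print(both)
```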

Use DataFrame.groupby to apply a mathematical function to subsets of the data defined by a given key

What was the total contribution of the countries in each category of wealth score in each year?
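One way to approach this is sketched below on invented values. The definition of the wealth score used here, the number of years in which a country is above that year's average, is an assumption made for the example and may differ from the score used in the lesson.

```python
import pandas as pd

# Toy dataframe with invented values.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 8300.0, 7000.0],
        "gdpPercap_1957": [1900.0, 9700.0, 8700.0],
        "gdpPercap_1962": [2300.0, 10900.0, 10500.0],
    },
    index=["Albania", "Belgium", "France"],
)

# Assumed score: in how many years is each country above the yearly mean?
wealth_score = (data > data.mean()).sum(axis=1)
wealth_score.name = "wealth_score"

# Group the rows by score and add up each year's values within each group.
totals = data.groupby(wealth_score).sum()
print(totals)
```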

Use groupby to classify and then apply a customized mathematical function with GroupBy.apply
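A sketch of a custom function applied group by group, on invented values. Both the helper value_range and the count-based wealth score are hypothetical names and definitions introduced for this example.

```python
import pandas as pd

# Toy dataframe with invented values.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 8300.0, 7000.0],
        "gdpPercap_1957": [1900.0, 9700.0, 8700.0],
        "gdpPercap_1962": [2300.0, 10900.0, 10500.0],
    },
    index=["Albania", "Belgium", "France"],
)

# Assumed score: number of years above that year's mean.
wealth_score = (data > data.mean()).sum(axis=1)
wealth_score.name = "wealth_score"


def value_range(group):
    """Hypothetical helper: spread between the richest and poorest country per year."""
    return group.max() - group.min()


# apply calls value_range once per group and stacks the results.
spread = data.groupby(wealth_score).apply(value_range)
print(spread)
```

Grouping by a separate Series, rather than by a column of the dataframe, keeps the key out of the data handed to the function.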

Exercise