Annex: Tabular data analysis with pandas#

Note

If you have not yet set up Python on your computer, you can execute this tutorial in your browser via Google Colab. Click on the rocket in the top right corner and launch “Colab”. If that doesn’t work download the .ipynb file and import it in Google Colab.

Then install pandas and numpy by executing the following command in a Jupyter cell at the top of the notebook.

!pip install -q pandas numpy

Pandas is a an open source library providing tabular data structures and data analysis tools.In other words, if you can imagine the data in an Excel spreadsheet, then Pandas is the tool for the job.

Note

Documentation for this package is available at https://pandas.pydata.org/docs/.

Package Imports#

This will be our first experience with importing a package.

Usually we import pandas with the alias pd.

We might also need numpy, Python’s main library for numerical computations.

import pandas as pd
import numpy as np

Series#

A Series represents a one-dimensional array of data. It is similar to a dictionary consisting of an index and values, but has more functions.

Note

Example data on Germany’s final six nuclear power plants is from Wikipedia.

names = ["Neckarwestheim", "Isar 2", "Emsland"]
values = [1269, 1365, 1290]
s = pd.Series(values, index=names)
s
Neckarwestheim    1269
Isar 2            1365
Emsland           1290
dtype: int64
dictionary = {
    "Neckarwestheim": 1269,
    "Isar 2": 1365,
    "Emsland": 1290,
}
s = pd.Series(dictionary)
s
Neckarwestheim    1269
Isar 2            1365
Emsland           1290
dtype: int64

Arithmetic operations can be applied to the whole pd.Series.

s**0.5
Neckarwestheim    35.623026
Isar 2            36.945906
Emsland           35.916570
dtype: float64

We can access the underlying index object if we need to:

s.index
Index(['Neckarwestheim', 'Isar 2', 'Emsland'], dtype='object')

We can get values back out using the index via the .loc attribute

s.loc["Isar 2"]
1365

Or by raw position using .iloc

s.iloc[2]
1290

We can pass a list or array to loc to get multiple rows back:

s.loc[["Neckarwestheim", "Emsland"]]
Neckarwestheim    1269
Emsland           1290
dtype: int64

DataFrame#

Series are limited to a single column. A more useful Pandas data structure is the DataFrame. A DataFrame is basically a bunch of series that share the same index.

data = {
    "capacity": [1269, 1365, 1290],  # MW
    "type": ["PWR", "PWR", "PWR"],
    "start_year": [1989, 1988, 1988],
    "end_year": [np.nan, np.nan, np.nan],
}
df = pd.DataFrame(data, index=["Neckarwestheim", "Isar 2", "Emsland"])
df
capacity type start_year end_year
Neckarwestheim 1269 PWR 1989 NaN
Isar 2 1365 PWR 1988 NaN
Emsland 1290 PWR 1988 NaN

A wide range of statistical functions are available on both Series and DataFrames.

df.min()
capacity      1269
type           PWR
start_year    1988
end_year       NaN
dtype: object
df.mean(numeric_only=True)
capacity      1308.000000
start_year    1988.333333
end_year              NaN
dtype: float64

We can get a single column as a Series using python’s getitem syntax on the DataFrame object.

df["capacity"]
Neckarwestheim    1269
Isar 2            1365
Emsland           1290
Name: capacity, dtype: int64

Indexing works very similar to series

df.loc["Emsland"]
capacity      1290
type           PWR
start_year    1988
end_year       NaN
Name: Emsland, dtype: object

But we can also specify the column(s) and row(s) we want to access

df.at["Emsland", "start_year"]
1988

We can also add new columns to the DataFrame:

df["reduced_capacity"] = df.capacity * 0.8
df
capacity type start_year end_year reduced_capacity
Neckarwestheim 1269 PWR 1989 NaN 1015.2
Isar 2 1365 PWR 1988 NaN 1092.0
Emsland 1290 PWR 1988 NaN 1032.0

We can also remove columns or rows from a DataFrame:

Note

This operation needs to be an inplace operation to be permanent.

df.drop("reduced_capacity", axis="columns", inplace=True)

We can also drop columns with only NaN values

df.dropna(axis=1)
capacity type start_year
Neckarwestheim 1269 PWR 1989
Isar 2 1365 PWR 1988
Emsland 1290 PWR 1988

Or fill it up with default “fallback” data:

df.fillna(2023)
capacity type start_year end_year
Neckarwestheim 1269 PWR 1989 2023.0
Isar 2 1365 PWR 1988 2023.0
Emsland 1290 PWR 1988 2023.0

Sorting Data#

We can also sort the entries in dataframes, e.g. alphabetically by index or numerically by column values

df.sort_index()
capacity type start_year end_year
Emsland 1290 PWR 1988 NaN
Isar 2 1365 PWR 1988 NaN
Neckarwestheim 1269 PWR 1989 NaN
df.sort_values(by="capacity", ascending=False)
capacity type start_year end_year
Isar 2 1365 PWR 1988 NaN
Emsland 1290 PWR 1988 NaN
Neckarwestheim 1269 PWR 1989 NaN

Filtering Data#

We can also filter a DataFrame using a boolean series obtained from a condition. This is very useful to build subsets of the DataFrame.

df.capacity > 1300
Neckarwestheim    False
Isar 2             True
Emsland           False
Name: capacity, dtype: bool
df[df.capacity > 1300]
capacity type start_year end_year
Isar 2 1365 PWR 1988 NaN

We can also combine multiple conditions, but we need to wrap the conditions with brackets!

df[(df.capacity > 1300) & (df.start_year >= 1988)]
capacity type start_year end_year
Isar 2 1365 PWR 1988 NaN

Or we make SQL-like queries:

df.query("start_year == 1988")
capacity type start_year end_year
Isar 2 1365 PWR 1988 NaN
Emsland 1290 PWR 1988 NaN
threshold = 1300
df.query("start_year == 1988 and capacity > @threshold")
capacity type start_year end_year
Isar 2 1365 PWR 1988 NaN

Modifying Values#

In many cases, we want to modify values in a dataframe based on some rule. To modify values, we need to use .loc or .iloc

df.loc["Isar 2", "capacity"] = 1366
df
capacity type start_year end_year
Neckarwestheim 1269 PWR 1989 NaN
Isar 2 1366 PWR 1988 NaN
Emsland 1290 PWR 1988 NaN

Sometimes it can be useful to rename columns:

df.rename(columns=dict(type="reactor"))
capacity reactor start_year end_year
Neckarwestheim 1269 PWR 1989 NaN
Isar 2 1366 PWR 1988 NaN
Emsland 1290 PWR 1988 NaN

Sometimes it can be useful to replace values:

df.replace({"PWR": "Pressurized water reactor"})
capacity type start_year end_year
Neckarwestheim 1269 Pressurized water reactor 1989 NaN
Isar 2 1366 Pressurized water reactor 1988 NaN
Emsland 1290 Pressurized water reactor 1988 NaN

Time Series#

Time indexes are great when handling time-dependent data.

Let’s first read some time series data, using the pd.read_csv() function, which takes a local file path ora link to an online resource.

The example data hourly time series for Germany in 2015 for:

  1. electricity demand from OPSD in GW

  2. onshore wind capacity factors from renewables.ninja in per-unit of installed capacity

  3. offshore wind capacity factors from renewables.ninja in per-unit of installed capacity

  4. solar PV capacity factors from renewables.ninja in per-unit of installed capacity

  5. electricity day-ahead spot market prices in €/MWh from EPEX Spot zone DE/AT/LU retrieved via SMARD platform

url = (
    "https://tubcloud.tu-berlin.de/s/pKttFadrbTKSJKF/download/time-series-lecture-2.csv"
)
ts = pd.read_csv(url, index_col=0, parse_dates=True)
ts.head()
load onwind offwind solar prices
2015-01-01 00:00:00 41.151 0.1566 0.7030 0.0 NaN
2015-01-01 01:00:00 40.135 0.1659 0.6875 0.0 NaN
2015-01-01 02:00:00 39.106 0.1746 0.6535 0.0 NaN
2015-01-01 03:00:00 38.765 0.1745 0.6803 0.0 NaN
2015-01-01 04:00:00 38.941 0.1826 0.7272 0.0 NaN

We can use Python’s slicing notation inside .loc to select a date range, and then use the built-in plotting feature of Pandas:

ts.loc["2015-01-01":"2015-03-01", "load"].plot()
<Axes: >
_images/5027921f45aa71ed75677a9bcbe5b1729e4b4f67b23d26f9c2dfa6e566dbec27.png
ts.loc["2015-05-01", "solar"].plot()
<Axes: >
_images/5eb67b37ae9fb768881e87e0c8148231d25d5afa5242e5ec642ba3c63b348751.png

A common operation is to change the resolution of a dataset by resampling in time, which Pandas exposes through the resample function.

Note

The resample periods are specified using pandas offset index syntax.

ts["onwind"].resample("ME").mean().plot()
<Axes: >
_images/2a6bf69a61cb87efbdd188eb8e01d0451c020e14e74c93051a16ed0a9f99dd99.png

Groupby Functionality#

DataFrame objects have a groupby method. The simplest way to think about it is that you pass another series, whose values are used to split the original object into different groups.

Here’s an example which retrieves the total generation capacity per country:

fn = "https://raw.githubusercontent.com/PyPSA/powerplantmatching/master/powerplants.csv"
df = pd.read_csv(fn, index_col=0)
df.iloc[:5, :10]
Name Fueltype Technology Set Country Capacity Efficiency DateIn DateRetrofit DateOut
id
0 Brokdorf Nuclear Steam Turbine PP Germany 1410.0 0.33 1986.0 1986.0 2021.0
1 Kernkraftwerk Emsland Nuclear Steam Turbine PP Germany 1336.0 0.33 1988.0 1988.0 2023.0
2 Isar Nuclear Steam Turbine PP Germany 1410.0 0.33 1979.0 1988.0 2023.0
3 Gemeinschaftskernkraftwerk Neckarwestheim Nuclear Steam Turbine PP Germany 1310.0 0.33 1976.0 1989.0 2023.0
4 Borssele Hard Coal Steam Turbine PP Netherlands 485.0 NaN 1973.0 NaN 2034.0
grouped = df.groupby("Country").Capacity.sum()
grouped.head()
Country
Albania                    2370.400000
Austria                   24643.200368
Belgium                   21443.151009
Bosnia and Herzegovina     4827.195964
Bulgaria                  15699.186363
Name: Capacity, dtype: float64

Let’s break apart this operation a bit. The workflow with groupby can be divided into three general steps:

  1. Split: Partition the data into different groups based on some criterion.

  2. Apply: Do some caclulation within each group, e.g. minimum, maximum, sums.

  3. Combine: Put the results back together into a single object.

Grouping is not only possible on a single columns, but also on multiple columns. For instance, we might want to group the capacities by country and fuel type. To achieve this, we pass a list of functions to the groupby functions.

capacities = df.groupby(["Country", "Fueltype"]).Capacity.sum()
capacities
Country         Fueltype     
Albania         Hydro             1743.9
                Other               98.0
                Solar              294.5
                Wind               234.0
Austria         Hard Coal         1331.4
                                  ...   
United Kingdom  Other               55.0
                Solar            11668.6
                Solid Biomass     4919.0
                Waste              288.9
                Wind             38670.3
Name: Capacity, Length: 242, dtype: float64

By grouping by multiple attributes, our index becomes a pd.MultiIndex (a hierarchical index with multiple levels.

capacities.index[:5]
MultiIndex([('Albania',     'Hydro'),
            ('Albania',     'Other'),
            ('Albania',     'Solar'),
            ('Albania',      'Wind'),
            ('Austria', 'Hard Coal')],
           names=['Country', 'Fueltype'])

We can use the .unstack function to reshape the multi-indexed pd.Series into a pd.DataFrame which has the second index level as columns.

capacities.unstack().tail().T
Country Spain Sweden Switzerland Ukraine United Kingdom
Fueltype
Biogas NaN NaN NaN NaN 31.000000
Geothermal NaN NaN NaN NaN NaN
Hard Coal 11904.878478 291.000000 NaN 24474.0 33823.617061
Hydro 26069.861248 14273.686625 20115.0408 6590.0 4576.175000
Lignite 1831.400000 NaN NaN NaN NaN
Natural Gas 28394.244000 2358.000000 55.0000 4687.9 36366.400000
Nuclear 7733.200000 9859.000000 3355.0000 17635.0 19181.000000
Oil 1854.371000 1685.000000 NaN NaN 100.000000
Other NaN NaN NaN NaN 55.000000
Solar 36998.200000 281.800000 96.8000 5628.7 11668.600000
Solid Biomass 563.000000 2432.600000 NaN NaN 4919.000000
Waste 388.054000 NaN NaN NaN 288.900000
Wind 34371.650000 16958.800000 55.0000 461.4 38670.300000

Exercises#

Task 1: Provide a list of unique fuel types included in the power plants dataset.

Hide code cell content
df.Fueltype.unique()
array(['Nuclear', 'Hard Coal', 'Hydro', 'Oil', 'Lignite', 'Natural Gas',
       'Solid Biomass', 'Wind', 'Other', 'Solar', 'Waste', 'Biogas',
       'Geothermal'], dtype=object)

Task 2: Filter the dataset by power plants with the fuel type “Hard Coal”. How many hard coal power plants are there?

Hide code cell content
coal = df.loc[df.Fueltype == "Hard Coal"]
coal
Name Fueltype Technology Set Country Capacity Efficiency DateIn DateRetrofit DateOut lat lon Duration Volume_Mm3 DamHeight_m StorageCapacity_MWh EIC projectID
id
4 Borssele Hard Coal Steam Turbine PP Netherlands 485.000000 NaN 1973.0 NaN 2034.0 51.433200 3.716000 NaN 0.0 0.0 0.0 {'49W000000000054X'} {'BEYONDCOAL': {'BEYOND-NL-2'}, 'ENTSOE': {'49...
80 Didcot Hard Coal CCGT PP United Kingdom 1490.000000 0.550000 1970.0 1998.0 2013.0 51.622300 -1.260800 NaN 0.0 0.0 0.0 {'48WSTN0000DIDCBC'} {'BEYONDCOAL': {'BEYOND-UK-22'}, 'ENTSOE': {'4...
81 Emile Huchet Hard Coal CCGT PP France 596.493211 NaN 1958.0 2010.0 2022.0 49.152500 6.698100 NaN 0.0 0.0 0.0 {'17W100P100P0345B', '17W100P100P0344D'} {'BEYONDCOAL': {'BEYOND-FR-67'}, 'ENTSOE': {'1...
82 Amercoeur Hard Coal CCGT PP Belgium 451.000000 0.187765 1968.0 NaN 2009.0 50.431000 4.395500 NaN 0.0 0.0 0.0 {'22WAMERCO000010Y'} {'BEYONDCOAL': {'BEYOND-BE-27'}, 'ENTSOE': {'2...
83 Timelkam Hard Coal CCGT CHP Austria 400.000000 0.590000 1962.0 2008.0 2008.0 48.010800 13.589100 NaN 0.0 0.0 0.0 {'14WENERGIE--WT02'} {'BEYONDCOAL': {'BEYOND-AT-9'}, 'ENTSOE': {'14...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
29782 St Hard Coal NaN CHP Germany 21.645000 NaN 1982.0 NaN NaN 49.976593 9.068953 NaN 0.0 0.0 0.0 {nan} {'MASTR': {'MASTR-SEE971943692655'}}
29807 Uer Hard Coal NaN CHP Germany 15.200000 NaN 1964.0 NaN NaN 51.368132 6.662350 NaN 0.0 0.0 0.0 {nan} {'MASTR': {'MASTR-SEE988421065542'}}
29816 Walheim Hard Coal NaN PP Germany 244.000000 NaN 1964.0 NaN NaN 49.017585 9.157690 NaN 0.0 0.0 0.0 {nan, nan} {'MASTR': {'MASTR-SEE964114029633', 'MASTR-SEE...
29833 Wd Ffw Hard Coal NaN CHP Germany 123.000000 NaN 1990.0 NaN NaN 50.099000 8.653000 NaN 0.0 0.0 0.0 {nan, nan} {'MASTR': {'MASTR-SEE915289541482', 'MASTR-SEE...
29838 West Hard Coal NaN CHP Germany 277.000000 NaN 1985.0 NaN NaN 52.442456 10.762681 NaN 0.0 0.0 0.0 {nan, nan} {'MASTR': {'MASTR-SEE917432813484', 'MASTR-SEE...

332 rows × 18 columns

Task 3: Identify the three largest coal power plants. In which countries are they located? When were they built?

Hide code cell content
coal.loc[coal.Capacity.nlargest(3).index]
Name Fueltype Technology Set Country Capacity Efficiency DateIn DateRetrofit DateOut lat lon Duration Volume_Mm3 DamHeight_m StorageCapacity_MWh EIC projectID
id
249 Kozienice Hard Coal Steam Turbine PP Poland 3682.216205 NaN 1972.0 NaN 2042.0 51.6647 21.4667 NaN 0.0 0.0 0.0 {'19W000000000095U', '19W000000000104I'} {'BEYONDCOAL': {'BEYOND-PL-96'}, 'ENTSOE': {'1...
3650 Vuglegirska Hard Coal CCGT PP Ukraine 3600.000000 NaN 1972.0 NaN NaN 48.4652 38.2027 NaN 0.0 0.0 0.0 {nan} {'GPD': {'WRI1005107'}, 'GEO': {'GEO-43001'}}
513 Opole Hard Coal Steam Turbine PP Poland 3071.893939 NaN 1993.0 NaN 2020.0 50.7518 17.8820 NaN 0.0 0.0 0.0 {'19W0000000001292'} {'BEYONDCOAL': {'BEYOND-PL-16'}, 'ENTSOE': {'1...

Task 4: What is the average “DateIn” of each “Fueltype”? Which type of power plants is the oldest on average?

Hide code cell content
2024 - df.groupby("Fueltype").DateIn.mean().sort_values()
Fueltype
Hard Coal        52.092593
Hydro            51.260690
Nuclear          48.214953
Lignite          47.184211
Other            31.543103
Waste            26.845588
Geothermal       23.857143
Oil              23.410138
Solid Biomass    22.516588
Natural Gas      22.089599
Wind             14.483789
Biogas           11.415730
Solar             8.584491
Name: DateIn, dtype: float64

Task 5: In the time series provided, calculate the annual average capacity factors of wind and solar.

Hide code cell content
ts.mean()
load       54.736992
onwind      0.205556
offwind     0.362993
solar       0.122621
prices     31.835717
dtype: float64

Task 6: In the time series provided, calculate and plot the monthly average electricity price.

Hide code cell content
ts["prices"].resample("ME").mean().plot()
<Axes: >
_images/1a11f0b6800bad32c86a47bed1fc4be378e6c8f0ac03f981889520a3a2dae384.png