Ten-line usage example
Suppose we want to analyze annual unemployment data for some European countries. All we need to know in advance is the data provider: Eurostat.
pandaSDMX makes it easy to search the directory of dataflows and the complete structural metadata for the datasets available through a selected dataflow. (This example skips these steps; see the walkthrough.)
The data we want is in a dataflow with the identifier UNE_RT_A.
This dataflow references a data structure definition (DSD) with that same ID.
The DSD contains or references all the metadata describing data sets available through this dataflow: the concepts, things measured, dimensions, and lists of codes used to label each dimension.
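To make the relationship between these pieces concrete, here is a minimal sketch in plain Python. The classes Codelist, Dimension and DataStructure are hypothetical stand-ins for illustration, not pandaSDMX's own classes, and only a handful of codes that appear in this example are included.

```python
from dataclasses import dataclass, field


@dataclass
class Codelist:
    """A named list of codes, mapping each code ID to a label."""
    id: str
    codes: dict = field(default_factory=dict)


@dataclass
class Dimension:
    """One dimension of a data set; its allowed values come from a code list."""
    id: str
    codelist: Codelist

    def is_valid(self, code: str) -> bool:
        return code in self.codelist.codes


@dataclass
class DataStructure:
    """Illustrative DSD: an ordered collection of dimensions."""
    id: str
    dimensions: list

    def validate(self, key: dict) -> bool:
        """Check that every code in `key` belongs to its dimension's code list."""
        by_id = {d.id: d for d in self.dimensions}
        return all(
            dim in by_id and by_id[dim].is_valid(code)
            for dim, codes in key.items()
            for code in codes.split("+")  # 'EL+ES' means "EL or ES"
        )


# A tiny subset of the real code lists, for illustration only.
geo = Codelist("GEO", {"EL": "Greece", "ES": "Spain", "IE": "Ireland"})
age = Codelist("AGE", {"TOTAL": "Total", "UNK": "Unknown"})

dsd = DataStructure("UNE_RT_A", [Dimension("age", age), Dimension("geo", geo)])

print(dsd.validate({"geo": "EL+ES+IE"}))  # True
print(dsd.validate({"geo": "XX"}))        # False
```

The point of the sketch is that a code is only meaningful relative to the code list attached to a dimension, which is why the metadata is downloaded before any data is requested.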
In [1]: import pandasdmx as sdmx
In [2]: estat = sdmx.Request('ESTAT')
Download the metadata:
In [3]: metadata = estat.datastructure('UNE_RT_A')
In [4]: metadata
Out[4]:
<pandasdmx.StructureMessage>
  <Header>
    id: 'DSD1676610120'
    prepared: '2023-02-17T05:02:00.711000+00:00'
    sender: <Agency ESTAT>
    source:
    test: False
  response: <Response [200]>
  Codelist (6): UNIT AGE GEO OBS_FLAG SEX FREQ
  ConceptScheme (1): UNE_RT_A
  DataStructureDefinition (1): UNE_RT_A
Explore the contents of some code lists:
In [5]: for cl in 'AGE', 'UNIT':
   ...:     print(sdmx.to_pandas(metadata.codelist[cl]))
   ...:
                              name  parent
AGE
TOTAL                        Total     AGE
LFD              Late foetal death     AGE
LFD1   Late foetal death (group 1)     AGE
LFD2   Late foetal death (group 2)     AGE
MN0                   Zero minutes     AGE
...                            ...     ...
AVG                        Average     AGE
NRP                    No response     AGE
NSP                  Not specified     AGE
OTH                          Other     AGE
UNK                        Unknown     AGE

[654 rows x 2 columns]
                                                            name  parent
UNIT
TOTAL                                                      Total    UNIT
NR                                                        Number    UNIT
NR_HAB                                     Number per inhabitant    UNIT
THS                                                     Thousand    UNIT
MIO                                                      Million    UNIT
...                                                          ...     ...
PD_PCH_SM_NAC  Price index (implicit deflator), percentage ch...    UNIT
CRC_MEUR                 Current replacement costs, million euro    UNIT
CRC_MNAC       Current replacement costs, million units of na...    UNIT
PYR_MEUR           Previous year replacement costs, million euro    UNIT
PYR_MNAC       Previous year replacement costs, million units...    UNIT

[695 rows x 2 columns]
Next we download a dataset. To obtain data on Greece, Ireland and Spain only, we use codes from the code list ‘GEO’ to specify a key for the dimension named ‘geo’. We also use a query parameter, ‘startPeriod’, to limit the scope of the data returned:
In [6]: resp = estat.data(
   ...:     'UNE_RT_A',
   ...:     key={'geo': 'EL+ES+IE'},
   ...:     params={'startPeriod': '2015'},
   ...: )
   ...:
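For readers unfamiliar with SDMX keys, the following sketch shows how a selection such as {'geo': 'EL+ES+IE'} corresponds to the dot-separated key used in SDMX REST queries: one slot per dimension, an empty slot as a wildcard, and '+' joining alternative codes. The helper build_key is hypothetical and written only for illustration; the dimension order shown is an assumption, since the real order is defined by the DSD.

```python
def build_key(dimension_order, selection):
    """Build an SDMX REST-style key: one dot-separated slot per dimension.

    An empty slot acts as a wildcard; several codes in one slot are joined
    with '+', meaning "any of these".
    """
    unknown = set(selection) - set(dimension_order)
    if unknown:
        raise KeyError(f"not a dimension: {sorted(unknown)}")

    slots = []
    for dim in dimension_order:
        codes = selection.get(dim, "")       # no selection: wildcard
        if not isinstance(codes, str):
            codes = "+".join(codes)          # accept a list of codes too
        slots.append(codes)
    return ".".join(slots)


# ASSUMPTION: illustrative dimension order; the DSD defines the real one.
order = ["freq", "age", "unit", "sex", "geo"]

print(build_key(order, {"geo": "EL+ES+IE"}))               # ....EL+ES+IE
print(build_key(order, {"sex": "T", "geo": ["EL", "ES"]}))  # ...T.EL+ES
```

Passing the key as a dict, as in the call above, avoids having to know the position of each dimension by heart.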
resp is a DataMessage object. We use its to_pandas() method to convert it to a pandas.DataFrame, and select on the age dimension we saw in the metadata above:
In [7]: data = (
   ...:     resp.to_pandas(datetime={'dim': 'TIME_PERIOD', 'freq': 'freq'})
   ...:     .xs('Y15-74', level='age', axis=1, drop_level=False)
   ...: )
   ...:
We can now explore the data set as expressed in a familiar pandas object. First, show dimension names:
In [8]: data.columns.names
Out[8]: FrozenList(['age', 'unit', 'sex', 'geo'])
…and corresponding key values along these dimensions:
In [9]: data.columns.levels
Out[9]: FrozenList([['Y15-24', 'Y15-29', 'Y15-74', 'Y20-64', 'Y25-54', 'Y25-74', 'Y55-74'], ['PC_ACT', 'PC_POP', 'THS_PER'], ['F', 'M', 'T'], ['EL', 'ES', 'IE']])
Select some data of interest: show aggregate unemployment rates across ages (‘Y15-74’ on the AGE dimension) and sexes (‘T’ on the SEX dimension), expressed as a percentage of active population (‘PC_ACT’ on the UNIT dimension):
In [10]: data.loc[:, ('Y15-74', 'PC_ACT', 'T')]
Out[10]:
geo            EL    ES   IE
TIME_PERIOD
2015         25.0  22.1  9.9
2016         23.9  19.6  8.4
2017         21.8  17.2  6.7
2018         19.7  15.3  5.8
2019         17.9  14.1  5.0
2020         17.6  15.5  5.9
2021         14.7  14.8  6.2
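The two selection steps used above, .xs() on a named column level and .loc with a partial tuple, can be tried without a network connection. The sketch below rebuilds a small frame by hand from the first three rows of the output above; the extra 'Y15-24' column is a placeholder with no values, added only so that there is something to filter out.

```python
import pandas as pd

# Column labels carry the four dimensions seen in data.columns.names.
columns = pd.MultiIndex.from_tuples(
    [
        ("Y15-24", "PC_ACT", "T", "EL"),  # placeholder column, no values
        ("Y15-74", "PC_ACT", "T", "EL"),
        ("Y15-74", "PC_ACT", "T", "ES"),
        ("Y15-74", "PC_ACT", "T", "IE"),
    ],
    names=["age", "unit", "sex", "geo"],
)
frame = pd.DataFrame(
    [
        [float("nan"), 25.0, 22.1, 9.9],
        [float("nan"), 23.9, 19.6, 8.4],
        [float("nan"), 21.8, 17.2, 6.7],
    ],
    index=pd.Index([2015, 2016, 2017], name="TIME_PERIOD"),
    columns=columns,
)

# .xs() keeps only columns with the given label on the named level;
# drop_level=False retains 'age' in the result's column index.
subset = frame.xs("Y15-74", level="age", axis=1, drop_level=False)

# A partial tuple in .loc fixes the leading levels (age, unit, sex);
# the remaining level, geo, becomes the column index of the result.
rates = subset.loc[:, ("Y15-74", "PC_ACT", "T")]
print(rates)
```

Because drop_level=False keeps the age level, the later .loc selection still has to name 'Y15-74' as the first element of the tuple.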