Ten-line usage example
Suppose we want to analyze annual unemployment data for some European countries. All we need to know in advance is the data provider: Eurostat.
pandaSDMX makes it easy to search the directory of dataflows and to explore the complete structural metadata about the datasets available through a selected dataflow. (This example skips these steps; see the walkthrough.)
The data we want is in a dataflow with the identifier une_rt_a.
This dataflow references a data structure definition (DSD) with the ID DSD_une_rt_a.
The DSD, in turn, contains or references all the metadata describing data sets available through this dataflow: the concepts, things measured, dimensions, and lists of codes used to label each dimension.
In [1]: import pandasdmx as sdmx
In [2]: estat = sdmx.Request('ESTAT')
Download the metadata:
In [3]: metadata = estat.datastructure('DSD_une_rt_a')
In [4]: metadata
Out[4]:
<pandasdmx.StructureMessage>
<Header>
id: 'IDREF119616'
prepared: '2020-05-15T10:16:44.873Z'
receiver: 'Unknown'
sender: 'Unknown'
response: <Response [200]>
Codelist (7): CL_AGE CL_FREQ CL_GEO CL_OBS_FLAG CL_OBS_STATUS CL_SEX ...
ConceptScheme (1): CS_DSD_une_rt_a
DataStructureDefinition (1): DSD_une_rt_a
Explore the contents of some code lists:
In [5]: for cl in 'CL_AGE', 'CL_UNIT':
...:     print(sdmx.to_pandas(metadata.codelist[cl]))
...:
CL_AGE
Y15-24 From 15 to 24 years
Y15-74 From 15 to 74 years
Y20-64 From 20 to 64 years
Y25-54 From 25 to 54 years
Y25-74 From 25 to 74 years
Y55-74 From 55 to 74 years
Name: AGE, dtype: object
CL_UNIT
THS_PER Thousand persons
PC_POP Percentage of total population
PC_ACT Percentage of active population
Name: UNIT, dtype: object
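The printed output suggests that each converted code list is a pandas Series mapping code IDs to human-readable names. As a minimal sketch of how such a Series can be used for lookups, the snippet below builds a stand-in for CL_UNIT by hand from the values printed above rather than downloading it; the variable names unit_labels and codes are illustrative only.

```python
import pandas as pd

# Hand-built stand-in for sdmx.to_pandas(metadata.codelist['CL_UNIT']),
# copied from the output above; no request is sent to Eurostat.
unit_labels = pd.Series(
    {
        "THS_PER": "Thousand persons",
        "PC_POP": "Percentage of total population",
        "PC_ACT": "Percentage of active population",
    },
    name="UNIT",
)
unit_labels.index.name = "CL_UNIT"

# Look up the label for a single code.
print(unit_labels["PC_ACT"])

# Translate several codes at once, preserving the requested order.
codes = ["PC_ACT", "THS_PER"]
labels = unit_labels.reindex(codes).tolist()
print(labels)
```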
Next we download a dataset. To obtain data on Greece, Ireland and Spain only, we use codes from the code list ‘CL_GEO’ to specify a key for the dimension named ‘GEO’. We also use a query parameter, ‘startPeriod’, to limit the scope of the data returned:
In [6]: resp = estat.data(
...: 'une_rt_a',
...: key={'GEO': 'EL+ES+IE'},
...: params={'startPeriod': '2007'},
...: )
...:
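To see how a key dict like the one above relates to SDMX REST key syntax, here is a minimal sketch. In that syntax, the values for successive dimensions are separated by dots, alternative values within one dimension are joined with +, and an empty position selects all values of that dimension. The helper build_sdmx_key is hypothetical and is not part of pandaSDMX; the dimension order is assumed to match the column level names shown later in this example.

```python
def build_sdmx_key(dimension_order, selection):
    """Join per-dimension selections into an SDMX REST-style key string.

    Hypothetical helper for illustration; not a pandaSDMX function.
    """
    unknown = set(selection) - set(dimension_order)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    # An empty position means "all values" for that dimension.
    return ".".join(selection.get(dim, "") for dim in dimension_order)


# Assumed dimension order for the une_rt_a dataflow.
DIMENSIONS = ["FREQ", "AGE", "UNIT", "SEX", "GEO"]

key = build_sdmx_key(DIMENSIONS, {"GEO": "EL+ES+IE"})
print(key)
```

Only the last position is filled, so every other dimension is left unrestricted.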
resp is a DataMessage object. We use its to_pandas() method to convert it to a pandas.DataFrame, then select on the AGE dimension we saw in the metadata above:
In [7]: data = resp.to_pandas().xs('Y15-74', level='AGE',
...: axis=1, drop_level=False)
...:
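The effect of drop_level=False may be easier to see on a small example. The sketch below applies the same xs() call to a toy DataFrame whose column levels mimic the dimension names in this example; the numbers are placeholders, not Eurostat data.

```python
import pandas as pd

# Toy columns mimicking the dimensions of the real data set.
columns = pd.MultiIndex.from_product(
    [["A"], ["Y15-24", "Y15-74"], ["PC_ACT"], ["T"], ["EL", "ES"]],
    names=["FREQ", "AGE", "UNIT", "SEX", "GEO"],
)
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],  # placeholder values
    index=pd.Index(["2007", "2008"], name="TIME_PERIOD"),
    columns=columns,
)

# Keep the AGE level in the result, as in the example above.
selected = toy.xs("Y15-74", level="AGE", axis=1, drop_level=False)

# With the default drop_level=True, the AGE level is removed instead.
dropped = toy.xs("Y15-74", level="AGE", axis=1)

print(list(selected.columns.names))
print(list(dropped.columns.names))
```

Keeping the level means that later selections can still name all the dimensions in their original order.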
We can now explore the data set as expressed in a familiar pandas object. First, show dimension names:
In [8]: data.columns.names
Out[8]: FrozenList(['FREQ', 'AGE', 'UNIT', 'SEX', 'GEO'])
…and corresponding key values along these dimensions:
In [9]: data.columns.levels
Out[9]: FrozenList([['A'], ['Y15-24', 'Y15-74', 'Y20-64', 'Y25-54', 'Y25-74', 'Y55-74'], ['PC_ACT', 'PC_POP', 'THS_PER'], ['F', 'M', 'T'], ['EL', 'ES', 'IE']])
Select some data of interest: show aggregate unemployment rates across ages (‘Y15-74’ on the AGE dimension) and sexes (‘T’ on the SEX dimension), expressed as a percentage of active population (‘PC_ACT’ on the UNIT dimension):
In [10]: data.loc[:, ('A', 'Y15-74', 'PC_ACT', 'T')]
Out[10]:
GEO EL ES IE
TIME_PERIOD
2007-01-01 8.4 8.2 5.0
2008-01-01 7.8 11.3 6.8
2009-01-01 9.6 17.9 12.6
2010-01-01 12.7 19.9 14.6
2011-01-01 17.9 21.4 15.4
2012-01-01 24.5 24.8 15.5
2013-01-01 27.5 26.1 13.8
2014-01-01 26.5 24.5 11.9
2015-01-01 24.9 22.1 10.0
2016-01-01 23.6 19.6 8.4
2017-01-01 21.5 17.2 6.7
2018-01-01 19.3 15.3 5.8
2019-01-01 17.3 14.1 5.0
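Once the data is in a DataFrame, ordinary pandas operations apply. As a brief sketch, the snippet below rebuilds the table above by hand, with values copied from the output rather than downloaded, and finds the year in which each country's rate peaked. The variable names rates, peak_year and peak_value are illustrative.

```python
import pandas as pd

# Values copied from the output above; no request is sent to Eurostat.
rates = pd.DataFrame(
    {
        "EL": [8.4, 7.8, 9.6, 12.7, 17.9, 24.5, 27.5,
               26.5, 24.9, 23.6, 21.5, 19.3, 17.3],
        "ES": [8.2, 11.3, 17.9, 19.9, 21.4, 24.8, 26.1,
               24.5, 22.1, 19.6, 17.2, 15.3, 14.1],
        "IE": [5.0, 6.8, 12.6, 14.6, 15.4, 15.5, 13.8,
               11.9, 10.0, 8.4, 6.7, 5.8, 5.0],
    },
    index=pd.to_datetime([f"{year}-01-01" for year in range(2007, 2020)]),
)
rates.index.name = "TIME_PERIOD"
rates.columns.name = "GEO"

# Year of the highest unemployment rate per country, and the rate itself.
peak_year = rates.idxmax().map(lambda ts: ts.year)
peak_value = rates.max()

print(pd.DataFrame({"peak_year": peak_year, "peak_rate": peak_value}))
```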