Ten-line usage example

Suppose we want to analyze annual unemployment data for some European countries. All we need to know in advance is the data provider: Eurostat.

pandaSDMX makes it easy to search the directory of dataflows and to explore the complete structural metadata about the datasets available through a selected dataflow. (This example skips these steps; see the walkthrough.)

The data we want is in a dataflow with the identifier une_rt_a. This dataflow references a data structure definition (DSD) with the ID DSD_une_rt_a. The DSD, in turn, contains or references all the metadata describing the data sets available through this dataflow: the concepts, things measured, dimensions, and lists of codes used to label each dimension.

In [1]: import pandasdmx as sdmx

In [2]: estat = sdmx.Request('ESTAT')

Download the metadata:

In [3]: metadata = estat.datastructure('DSD_une_rt_a')

In [4]: metadata
Out[4]: 
<pandasdmx.StructureMessage>
  <Header>
    id: 'IDREF113827'
    prepared: '2020-05-14T21:35:35.471Z'
    receiver: 'Unknown'
    sender: 'Unknown'
  response: <Response [200]>
  Codelist (7): CL_AGE CL_FREQ CL_GEO CL_OBS_FLAG CL_OBS_STATUS CL_SEX ...
  ConceptScheme (1): CS_DSD_une_rt_a
  DataStructureDefinition (1): DSD_une_rt_a

Explore the contents of some code lists:

In [5]: for cl in 'CL_AGE', 'CL_UNIT':
   ...:     print(sdmx.to_pandas(metadata.codelist[cl]))
   ...: 
CL_AGE
Y15-24    From 15 to 24 years
Y15-74    From 15 to 74 years
Y20-64    From 20 to 64 years
Y25-54    From 25 to 54 years
Y25-74    From 25 to 74 years
Y55-74    From 55 to 74 years
Name: AGE, dtype: object
CL_UNIT
THS_PER                   Thousand persons
PC_POP      Percentage of total population
PC_ACT     Percentage of active population
Name: UNIT, dtype: object
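
Because each converted code list is a pandas Series that maps a code to its description, it can serve as a lookup table for labels. The sketch below does not contact Eurostat: it rebuilds the CL_UNIT series by hand from the output printed above, and the variable names are illustrative only.

```python
import pandas as pd

# Hand-built stand-in for sdmx.to_pandas(metadata.codelist['CL_UNIT']),
# copied from the output shown above (no network access needed).
unit_labels = pd.Series(
    {
        "THS_PER": "Thousand persons",
        "PC_POP": "Percentage of total population",
        "PC_ACT": "Percentage of active population",
    },
    name="UNIT",
)
unit_labels.index.name = "CL_UNIT"

# Look up a single code ...
label = unit_labels["PC_ACT"]

# ... or translate several codes at once, keeping their order.
codes = pd.Series(["PC_ACT", "THS_PER"])
translated = codes.map(unit_labels).tolist()

print(label)
print(translated)
```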

Next we download a dataset. To obtain data on Greece, Ireland and Spain only, we use codes from the code list ‘CL_GEO’ to specify a key for the dimension named ‘GEO’. We also use a query parameter, ‘startPeriod’, to limit the scope of the data returned:

In [6]: resp = estat.data(
   ...:     'une_rt_a',
   ...:     key={'GEO': 'EL+ES+IE'},
   ...:     params={'startPeriod': '2007'},
   ...:     )
   ...: 
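
For orientation, SDMX REST services express such a key as a dot-separated string with one position per dimension: alternative codes are joined with '+', and an empty position matches every code. The sketch below shows that convention in isolation. The helper build_key is hypothetical and is not part of pandaSDMX, and the dimension order is an assumption taken from the column level names shown later in this example.

```python
def build_key(dimension_order, selection):
    """Join per-dimension selections into an SDMX REST style key string.

    Hypothetical helper for illustration; not a pandaSDMX function.
    """
    unknown = set(selection) - set(dimension_order)
    if unknown:
        raise KeyError(f"unknown dimension(s): {sorted(unknown)}")

    parts = []
    for dim in dimension_order:
        # A dimension that is not selected becomes an empty (wildcard) slot.
        value = selection.get(dim, "")
        if not isinstance(value, str):
            # Several codes for one dimension are OR-ed together with '+'.
            value = "+".join(value)
        parts.append(value)
    return ".".join(parts)


# Assumed dimension order for this dataflow.
dims = ["FREQ", "AGE", "UNIT", "SEX", "GEO"]

key = build_key(dims, {"GEO": ["EL", "ES", "IE"]})
print(key)
```

Passing a dict such as {'GEO': 'EL+ES+IE'} means we do not have to count the empty positions ourselves.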

resp is a DataMessage object. We use its to_pandas() method to convert it to a pandas.DataFrame, then select on the AGE dimension we saw in the metadata above:

In [7]: data = resp.to_pandas().xs('Y15-74', level='AGE',
   ...:           axis=1, drop_level=False)
   ...: 
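
The drop_level=False argument matters for the later steps: without it, xs() removes the AGE level from the column index. The following sketch illustrates the difference on a small synthetic frame. The numbers are placeholders, not Eurostat data, and the frame only imitates the column layout of the real result.

```python
import pandas as pd

# Synthetic stand-in for resp.to_pandas(): columns are a MultiIndex whose
# levels are the SDMX dimensions. Values are placeholders, not real data.
columns = pd.MultiIndex.from_product(
    [["A"], ["Y15-24", "Y15-74"], ["PC_ACT"], ["T"], ["EL", "ES"]],
    names=["FREQ", "AGE", "UNIT", "SEX", "GEO"],
)
index = pd.to_datetime(["2007-01-01", "2008-01-01"]).rename("TIME_PERIOD")
frame = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],
    index=index,
    columns=columns,
)

# Keep the AGE level, so a full key such as
# ('A', 'Y15-74', 'PC_ACT', 'T') can still be used with .loc afterwards.
kept = frame.xs("Y15-74", level="AGE", axis=1, drop_level=False)

# Default behaviour: the selected level disappears from the columns.
dropped = frame.xs("Y15-74", level="AGE", axis=1)

print(list(kept.columns.names))
print(list(dropped.columns.names))
```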

We can now explore the data set as expressed in a familiar pandas object. First, show dimension names:

In [8]: data.columns.names
Out[8]: FrozenList(['FREQ', 'AGE', 'UNIT', 'SEX', 'GEO'])

…and corresponding key values along these dimensions:

In [9]: data.columns.levels
Out[9]: FrozenList([['A'], ['Y15-24', 'Y15-74', 'Y20-64', 'Y25-54', 'Y25-74', 'Y55-74'], ['PC_ACT', 'PC_POP', 'THS_PER'], ['F', 'M', 'T'], ['EL', 'ES', 'IE']])

Select some data of interest: show aggregate unemployment rates across ages (‘Y15-74’ on the AGE dimension) and sexes (‘T’ on the SEX dimension), expressed as a percentage of active population (‘PC_ACT’ on the UNIT dimension):

In [10]: data.loc[:, ('A', 'Y15-74', 'PC_ACT', 'T')]
Out[10]: 
GEO            EL    ES    IE
TIME_PERIOD                  
2007-01-01    8.4   8.2   5.0
2008-01-01    7.8  11.3   6.8
2009-01-01    9.6  17.9  12.6
2010-01-01   12.7  19.9  14.6
2011-01-01   17.9  21.4  15.4
2012-01-01   24.5  24.8  15.5
2013-01-01   27.5  26.1  13.8
2014-01-01   26.5  24.5  11.9
2015-01-01   24.9  22.1  10.0
2016-01-01   23.6  19.6   8.4
2017-01-01   21.5  17.2   6.7
2018-01-01   19.3  15.3   5.8
2019-01-01   17.3  14.1   5.0
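
From here on, ordinary pandas operations apply. As a minimal sketch, the code below re-enters the figures printed above by hand (so that it runs without network access) and finds the year in which the unemployment rate peaked in each country. The names rates and summary are illustrative.

```python
import pandas as pd

# Figures copied from the output above: unemployment as a percentage of
# the active population, ages 15 to 74, both sexes.
years = pd.to_datetime([f"{year}-01-01" for year in range(2007, 2020)])
rates = pd.DataFrame(
    {
        "EL": [8.4, 7.8, 9.6, 12.7, 17.9, 24.5, 27.5,
               26.5, 24.9, 23.6, 21.5, 19.3, 17.3],
        "ES": [8.2, 11.3, 17.9, 19.9, 21.4, 24.8, 26.1,
               24.5, 22.1, 19.6, 17.2, 15.3, 14.1],
        "IE": [5.0, 6.8, 12.6, 14.6, 15.4, 15.5, 13.8,
               11.9, 10.0, 8.4, 6.7, 5.8, 5.0],
    },
    index=years.rename("TIME_PERIOD"),
)
rates.columns.name = "GEO"

# idxmax() returns the index label (a timestamp) of each column's maximum.
summary = pd.DataFrame(
    {
        "peak_year": rates.idxmax().dt.year,
        "peak_rate": rates.max(),
    }
)
print(summary)
```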