pandaSDMX: Statistical Data and Metadata eXchange in Python

pandaSDMX is an Apache 2.0-licensed Python library to retrieve statistical data and metadata disseminated in SDMX 2.1, an ISO standard widely used by institutions such as statistics offices, central banks, and international organisations. pandaSDMX exposes datasets and related structural metadata, including dataflows, code lists, and data structure definitions, as pandas Series or multi-indexed DataFrames. Many other output formats and storage backends are available through Odo.

Main features

  • support for many SDMX features including
    • generic data sets in the SDMX-ML format
    • data sets in the SDMX-JSON format
    • data structure definitions, code lists and concept schemes
    • dataflow definitions and content-constraints
    • categorisations and category schemes
  • pythonic representation of the SDMX information model
  • validation of column selections against code lists and content-constraints, where available, when requesting datasets
  • export of data and structural metadata such as code lists as multi-indexed pandas DataFrames or Series, and to many other formats and database backends via Odo
  • read and write SDMX messages to and from files
  • configurable HTTP connections
  • support for requests-cache, allowing SDMX messages to be cached in memory, MongoDB, Redis or SQLite
  • extensible through custom readers and writers for alternative input and output formats
  • growing test suite

Example

Suppose we want to analyze annual unemployment data for some European countries. All we need to know in advance is the data provider: Eurostat. pandaSDMX makes it easy to search the directory of dataflows and to explore the complete structural metadata about the datasets available through a selected dataflow. We skip this step here; the impatient reader may jump directly to Basic usage. The dataflow with the ID ‘une_rt_a’ contains the data we want. The dataflow definition references a data structure definition with the ID ‘DSD_une_rt_a’, which contains or references all the metadata describing data sets available through this dataflow: the dimensions, concept schemes, and corresponding code lists.

In [1]: from pandasdmx import Request

In [2]: estat = Request('ESTAT')

# Download the metadata and expose it as a dict mapping resource names to pandas DataFrames
In [3]: metadata = estat.datastructure('DSD_une_rt_a').write()

# Show some code lists. The MultiIndex must be sorted (lexsorted)
# before label-based selection; otherwise pandas raises an
# UnsortedIndexError.
In [4]: metadata.codelist.sort_index().loc[['AGE', 'UNIT']]
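Label-based selection on a pandas MultiIndex requires the index to be lexsorted; calling sort_index() first avoids the UnsortedIndexError that pandas raises on an unsorted index. A minimal pure-pandas sketch, using hypothetical codelist-like data rather than a real download:

```python
import pandas as pd

# Hypothetical codelist-style frame: an unsorted two-level MultiIndex
# (dimension, code), similar in shape to metadata.codelist.
idx = pd.MultiIndex.from_tuples(
    [("UNIT", "PC_ACT"), ("AGE", "TOTAL"), ("UNIT", "THS_PER"), ("SEX", "T")],
    names=["dim", "code"],
)
codelist = pd.DataFrame(
    {"name": ["Percentage of active population", "Total",
              "Thousand persons", "Total"]},
    index=idx,
)

# Sorting the MultiIndex makes label-based selection safe and fast;
# the list picks all rows whose first level is 'AGE' or 'UNIT'.
subset = codelist.sort_index().loc[["AGE", "UNIT"]]
```

The same sort_index() call applied to the real codelist DataFrame makes the lookup above work reliably across pandas versions.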

Next we download a dataset. We use codes from the code list ‘GEO’ to obtain data on Greece, Ireland and Spain only.

In [5]: resp = estat.data('une_rt_a', key={'GEO': 'EL+ES+IE'}, params={'startPeriod': '2007'})

# We use a generator expression to select some columns
# and write them to a pandas DataFrame
In [6]: data = resp.write(s for s in resp.data.series if s.key.AGE == 'TOTAL')

# Explore the data set. First, show dimension names
In [7]: data.columns.names
Out[7]: FrozenList(['UNIT', 'AGE', 'SEX', 'GEO', 'FREQ'])

# and corresponding dimension values
In [8]: data.columns.levels
Out[8]: FrozenList([['PC_ACT', 'PC_POP', 'THS_PER'], ['TOTAL'], ['F', 'M', 'T'], ['EL', 'ES', 'IE'], ['A']])

# Show aggregate unemployment rates across ages and sexes as
# percentage of active population
In [9]: data.loc[:, ('PC_ACT', 'TOTAL', 'T')]
Out[9]: 
GEO            EL    ES    IE
FREQ            A     A     A
TIME_PERIOD                  
2016         23.6  19.6   7.9
2015         24.9  22.1   9.4
2014         26.5  24.5  11.3
2013         27.5  26.1  13.1
2012         24.5  24.8  14.7
2011         17.9  21.4  14.7
2010         12.7  19.9  13.9
2009          9.6  17.9  12.0
2008          7.8  11.3   6.4
2007          8.4   8.2   4.7

Quick install

  • conda install -c alcibiade pandasdmx # Latest release should be available soon. Check the version!
  • pip install pandasdmx # for all others
