pandaSDMX: Statistical Data and Metadata eXchange in Python

pandaSDMX is an Apache 2.0-licensed Python package that aims to become the most intuitive and versatile tool for retrieving statistical data and metadata disseminated in SDMX format. Out of the box, it supports the SDMX services of the European statistics office (Eurostat), the European Central Bank (ECB), the French National Institute for statistics (INSEE), the Australian Bureau of Statistics, and the OECD (JSON only). pandaSDMX can export data and metadata as pandas DataFrames, the de facto standard for data analysis in Python. From pandas you can export data and metadata to Excel, R and other tools. As of version 0.4, pandaSDMX can also export data to many other file formats and database backends via Odo.

Main features

  • support for many SDMX features including
    • generic data sets in SDMXML format
    • compact data sets in SDMXJSON format (OECD only)
    • data structure definitions, code lists and concept schemes
    • dataflow definitions and content-constraints
    • categorisations and category schemes
  • pythonic representation of the SDMX information model
  • validation of column selections against code lists and content-constraints, where available, when requesting datasets
  • export data and structural metadata such as code lists as multi-indexed pandas DataFrames or Series, and many other formats and database backends via Odo
  • read and write SDMX messages to and from local files
  • configurable HTTP connections
  • support for requests-cache, which allows caching SDMX messages in memory, MongoDB, Redis or SQLite
  • extensible through custom readers and writers for alternative input and output formats of data and metadata
  • growing test suite

Example

Suppose we want to analyze annual unemployment data for some European countries. All we need to know in advance is the data provider, Eurostat. pandaSDMX makes it easy to search the directory of dataflows and to explore the complete structural metadata about the datasets available through a selected dataflow. We skip this step here; impatient readers may jump directly to Basic usage.

The dataflow with the ID ‘une_rt_a’ contains the data we want. Its dataflow definition references a data structure definition (DSD) with the ID ‘DSD_une_rt_a’, which contains or references all the metadata describing the data sets available through this dataflow: the dimensions, concept schemes, and corresponding code lists.

In [1]: from pandasdmx import Request

In [2]: estat = Request('ESTAT')

# Download the metadata and expose it as a dict mapping resource names to pandas DataFrames
In [3]: metadata = estat.datastructure('DSD_une_rt_a').write()

# Show some code lists
In [4]: metadata.codelist.loc[['AGE', 'UNIT']]
Out[4]: 
             dim_or_attr                             name
AGE  AGE               D                              AGE
     TOTAL             D                            Total
     Y25-74            D              From 25 to 74 years
     Y_LT25            D               Less than 25 years
UNIT UNIT              D                             UNIT
     PC_ACT            D  Percentage of active population
     PC_POP            D   Percentage of total population
     THS_PER           D                 Thousand persons

Next we download a data set. We use codes from the code list ‘GEO’ to obtain data on Greece, Ireland and Spain only.
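
In an SDMX REST data query, a key is a dot-separated string with one position per dimension, where `+` joins alternative codes and an empty position acts as a wildcard. The following is a minimal sketch of how a key dict like the one used in this example could be flattened into such a string and checked against code lists. It is not pandaSDMX's own implementation; `make_sdmx_key`, `DIM_ORDER` and `CODES` are hypothetical names, and the dimension order shown is an assumption rather than the actual order in ‘DSD_une_rt_a’.

```python
def make_sdmx_key(dim_order, selection, codes=None):
    """Flatten a {dimension: 'A+B'} selection into a dot-separated SDMX key."""
    unknown = set(selection) - set(dim_order)
    if unknown:
        raise ValueError("unknown dimension(s): %s" % ", ".join(sorted(unknown)))
    parts = []
    for dim in dim_order:
        # An empty string leaves the position blank, i.e. a wildcard.
        value = selection.get(dim, "")
        if value and codes is not None and dim in codes:
            invalid = [c for c in value.split("+") if c not in codes[dim]]
            if invalid:
                raise ValueError(
                    "invalid code(s) for %s: %s" % (dim, ", ".join(invalid)))
        parts.append(value)
    return ".".join(parts)


# Hypothetical dimension order and code list; real ones come from the DSD.
DIM_ORDER = ["FREQ", "AGE", "UNIT", "SEX", "GEO"]
CODES = {"GEO": {"EL", "ES", "IE", "DE"}}

key = make_sdmx_key(DIM_ORDER, {"GEO": "EL+ES+IE"}, CODES)
print(key)  # ....EL+ES+IE
```

Validating before sending the request means a misspelled code fails locally with a clear message instead of producing an empty or rejected response from the server.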

In [5]: resp = estat.data('une_rt_a', key={'GEO': 'EL+ES+IE'}, params={'startPeriod': '2007'})

# We use a generator expression to narrow down the column selection
# and write these columns to a pandas DataFrame
In [6]: data = resp.write(s for s in resp.data.series if s.key.AGE == 'TOTAL')
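
The generator-expression filter can be tried without any network access. The sketch below uses stand-in objects rather than real pandaSDMX classes: `Key` and `Series` are hypothetical namedtuples that mimic only the `s.key.AGE` attribute access used above, and the sex code 'F' is an assumed value.

```python
from collections import namedtuple

# Stand-ins for the series of a data message (illustration only).
Key = namedtuple("Key", ["UNIT", "AGE", "SEX", "GEO"])
Series = namedtuple("Series", ["key"])

all_series = [
    Series(Key("PC_ACT", "TOTAL", "T", "EL")),
    Series(Key("PC_ACT", "Y_LT25", "T", "EL")),
    Series(Key("THS_PER", "TOTAL", "F", "ES")),
]

# Lazy filter: nothing is evaluated until a consumer iterates over it.
selected = (s for s in all_series if s.key.AGE == "TOTAL")

kept = [s.key.GEO for s in selected]
print(kept)  # ['EL', 'ES']
```

Because a generator is consumed once, it should be passed to exactly one consumer; build a list instead if the selection needs to be reused.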

# Explore the data set. First, show dimension names
In [7]: data.columns.names

# and corresponding dimension values
In [8]: data.columns.levels

# Show aggregate unemployment rates across ages and sexes as
# percentage of active population
In [9]: data.loc[:, ('PC_ACT', 'TOTAL', 'T')]
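
To see what a partial-tuple selection on multi-indexed columns returns, here is a small self-contained sketch in plain pandas. The frame `toy` is hypothetical: its level order (UNIT, AGE, SEX, GEO) is inferred from the tuple in the selection above, and its cell values are placeholder integers, not real statistics.

```python
import pandas as pd

# Column levels in the order the selection tuple implies (assumed).
columns = pd.MultiIndex.from_product(
    [["PC_ACT"], ["TOTAL", "Y_LT25"], ["T"], ["EL", "ES", "IE"]],
    names=["UNIT", "AGE", "SEX", "GEO"],
)

# Placeholder values 0..11, one row per period; not real data.
values = [list(range(6)), list(range(6, 12))]
toy = pd.DataFrame(values, index=["2007", "2008"], columns=columns)

# A partial tuple fixes the leading levels; every GEO column
# underneath that combination is returned.
total = toy.loc[:, ("PC_ACT", "TOTAL", "T")]
print(total.shape)  # (2, 3)
```

The result keeps one column per country, so each row reads as a cross-country comparison for one period.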

Quick install

  • conda install -c alcibiade pandasdmx # for Anaconda users
  • pip install pandasdmx # for all others
