pandaSDMX: Statistical Data and Metadata eXchange in Python

pandaSDMX is an Apache 2.0-licensed Python client to retrieve statistical data and metadata disseminated in SDMX 2.1, an ISO standard widely used by institutions such as statistics offices, central banks, and international organisations. pandaSDMX exposes datasets and related structural metadata, including dataflows, codelists, and datastructure definitions, as pandas Series or multi-indexed DataFrames. Many other output formats and storage backends are available thanks to Odo.

Main features

  • support for many SDMX 2.1 features
  • SDMXML and SDMXJSON formats
  • pythonic representation of the SDMX information model
  • validation of column selections against code lists and content constraints (if available) when requesting datasets
  • export data and structural metadata such as code lists as multi-indexed pandas DataFrames or Series, and many other formats as well as database backends via Odo
  • read and write SDMX messages to and from files
  • configurable HTTP connections
  • support for requests-cache, allowing SDMX messages to be cached in memory, MongoDB, Redis or SQLite
  • extensible through custom readers and writers for alternative input and output formats
  • growing test suite

Example

Suppose we want to analyze annual unemployment data for some European countries. All we need to know in advance is the data provider, eurostat. pandaSDMX makes it super easy to search the directory of dataflows, and analyze the complete structural metadata about the datasets available through the selected dataflow. We will skip this step here. The impatient reader may directly jump to Basic usage. The dataflow with the ID ‘une_rt_a’ contains the data we want. The dataflow definition references the datastructure definition which contains or references all the metadata describing data sets available through this dataflow: the dimensions, concept schemes, and corresponding code lists.

In [1]: from pandasdmx import Request

In [2]: estat = Request('ESTAT')

# Download the metadata and expose it as a dict mapping resource names to pandas DataFrames
In [3]: flow_response = estat.dataflow('une_rt_a')

In [4]: structure_response = flow_response.dataflow.une_rt_a.structure(request=True, target_only=False)

# Show some code lists.
In [5]: structure_response.write().codelist.loc['GEO'].head()
Out[5]: 
    dim_or_attr      name
GEO           D       GEO
AT            D   Austria
BE            D   Belgium
BG            D  Bulgaria
CY            D    Cyprus

Next we download a dataset. We use codes from the code list ‘GEO’ to obtain data on Greece, Ireland and Spain only.

In [6]: resp = estat.data('une_rt_a', key={'GEO': 'EL+ES+IE'}, params={'startPeriod': '2007'})

# We use a generator expression to select some columns
# and write them to a pandas DataFrame
In [7]: data = resp.write(s for s in resp.data.series if s.key.AGE == 'TOTAL')

# Explore the data set. First, show dimension names
In [8]: data.columns.names
Out[8]: FrozenList(['UNIT', 'AGE', 'SEX', 'GEO', 'FREQ'])

# and corresponding dimension values
In [9]: data.columns.levels
Out[9]: FrozenList([['PC_ACT', 'PC_POP', 'THS_PER'], ['TOTAL'], ['F', 'M', 'T'], ['EL', 'ES', 'IE'], ['A']])

# Show aggregate unemployment rates across ages and sexes as
# percentage of active population
In [10]: data.loc[:, ('PC_ACT', 'TOTAL', 'T')]
Out[10]: 
GEO            EL    ES    IE
FREQ            A     A     A
TIME_PERIOD                  
2018         19.3  15.3   5.8
2017         21.5  17.2   6.7
2016         23.6  19.6   8.4
2015         24.9  22.1  10.0
2014         26.5  24.5  11.9
2013         27.5  26.1  13.8
2012         24.5  24.8  15.5
2011         17.9  21.4  15.4
2010         12.7  19.9  14.6
2009          9.6  17.9  12.6
2008          7.8  11.3   6.8
2007          8.4   8.2   5.0

Quick install

  • pip install pandasdmx

Table of contents

What’s new?

v0.9 (2018-04)

This version is the last tested on Python 2.x. Future versions will be tested on Python 3.5+ only.

New features
  • four new data providers: INEGI (Mexico), Norges Bank (Norway), the International Labour Organization (ILO) and the Italian statistics office (ISTAT)
  • model: make Ref instances callable for resolving them, i.e. getting the referenced object by making a remote request if needed
  • improve loading of structure-specific messages when DSD is not passed / must be requested on the fly
  • process multiple and cascading content constraints as described in the Technical Guide (Chap. 6 of the SDMX 2.1 standard)
  • StructureMessages and DataMessages now have properties to compute the constrained and unconstrained codelists as dicts of frozensets of codes. For DataMessage this is useful when series_keys was set to True when making the request. This prompts the data provider to generate a dataset without data, but with the complete set of series keys. This is the most accurate representation of the available series. Agencies such as IMF and ECB support this feature.

v0.8.2 (2017-12-21)

  • fix reading of structure-specific data sets when DSD_ID is present in the data set

v0.8.1 (2017-12-20)

  • fix broken package preventing pip installs of the wheel

v0.8 (2017-12-12)

  • add support for an alternative data set format defined for SDMXML messages. These so-called structure-specific data sets lend themselves to large data queries. File sizes are typically about 60 % smaller than with equivalent generic data sets. To make use of structure-specific data sets, instantiate Request objects with agency IDs such as ‘ECB_S’, ‘INSEE_S’ or ‘ESTAT_S’ instead of ‘ECB’ etc. These alternative agency profiles prompt pandaSDMX to execute data queries for structure-specific data sets. For all other queries they behave exactly as their siblings. See a code example in chapter 5 of the docs.
  • raise ValueError when user attempts to request a resource other than data from an agency delivering data in SDMX-JSON format only (OECD and ABS).
  • Update INSEE profile
  • handle empty series properly
  • data2pd writer: the code for Series index generation was rewritten from scratch to make better use of pandas’ time series functionality. However, some data sets, in particular from INSEE, which come with bimonthly or semi-annual frequencies, cannot be rendered as a PeriodIndex. Pass parse_time=False to the .write method to prevent errors.

v0.7.0 (2017-06-10)

  • add new data providers:
    • Australian Bureau of Statistics
    • International Monetary Fund - SDMXCentral only
    • United Nations Division of Statistics
    • UNESCO (free registration required)
    • World Bank - World Integrated Trade Solution (WITS)
  • new feature: load metadata on data providers from a JSON file; allow the user to add new agencies on the fly by specifying an appropriate JSON file using the pandasdmx.api.Request.load_agency_profile() method.
  • new pandasdmx.api.Request.preview_data() providing a powerful fine-grained key validation algorithm by downloading all series keys of a dataset and exposing them as a pandas DataFrame which is then mapped to the cartesian product of the given dimension values. Works only with data providers such as ECB and UNSD which support “series-keys-only” requests. This feature could be wrapped by a browser-based UI for building queries.
  • sdmxjson reader: add support for flat and cross-sectional datasets, preserve dimension order where possible
  • structure2pd writer: in codelists, output Concept rather than Code attributes in the first line of each code-list. This may provide more information.

v0.6.1 (2017-02-03)

  • fix 2to3 issue which caused crashes on Python 2.7

v0.6 (2017-01-07)

This release contains some important stability improvements.

Bug fixes
  • JSON data from OECD is now properly downloaded
  • The data writer tries to glean a frequency value for a time series from its attributes. This is helpful when exporting data sets, e.g., from INSEE (Issue 41).
Known issues

A data set which lacks a FREQ dimension or attribute can be exported as a pandas DataFrame only when parse_time=False, i.e. no DateTime index is generated. The resulting DataFrame has a string index. Use pandas magic to create a DateTimeIndex from there.

v0.5 (2016-10-30)

New features
  • new reader module for SDMX JSON data messages
  • add OECD as data provider (data messages only)
  • pandasdmx.model.Category is now an iterator over categorised objects. This greatly simplifies category usage. Besides, categories with the same ID while belonging to multiple category schemes are no longer conflated.
API changes
  • Request constructor: make agency ID case-insensitive
  • As Category is now an iterator over categorised objects, Categorisations is no longer considered part of the public API.
Bug fixes
  • sdmxml reader: fix AttributeError in write_source method, thanks to Topas
  • correctly distinguish between categories with same ID while belonging to different category schemes

v0.4 (2016-04-11)

New features
  • add new provider INSEE, the French statistics office (thanks to Stéphan Rault)
  • register ‘.sdmx’ files with Odo if available
  • logging of http requests and file operations.
  • new structure2pd writer to export codelists, dataflow-definitions and other structural metadata from structure messages as multi-indexed pandas DataFrames. Desired attributes can be specified and are represented by columns.
API changes
  • pandasdmx.api.Request constructor accepts a log_level keyword argument which can be set to a log-level for the pandasdmx logger and its children (currently only pandasdmx.api)
  • pandasdmx.api.Request now has a timeout property to set the timeout for http requests
  • extend api.Request._agencies configuration to specify agency- and resource-specific settings such as headers. Future versions may exploit this to provide reader selection information.
  • api.Request.get: specify http_headers per request. Defaults are set according to agency configuration
  • Response instances expose Message attributes to make application code more succinct
  • rename pandasdmx.api.Message attributes to singular form. Old names are deprecated and will be removed in the future.
  • pandasdmx.api.Request exposes resource names such as data, datastructure, dataflow etc. as descriptors calling ‘get’ without specifying the resource type as string. In interactive environments, this saves typing and enables code completion.
  • data2pd writer: return attributes as namedtuples rather than dict
  • use patched version of namedtuple that accepts non-identifier strings as field names and makes all fields accessible through dict syntax.
  • remove GenericDataSet and GenericDataMessage. Use DataSet and DataMessage instead
  • sdmxml reader: return strings or unicode strings instead of LXML smart strings
  • sdmxml reader: remove most of the specialized read methods. Adapt model to use generalized methods. This makes code more maintainable.
  • pandasdmx.model.Representation for DSD attributes and dimensions now supports text not just codelists.
Other changes and enhancements
  • documentation has been overhauled. Code examples are now much simpler thanks to the new structure2pd writer
  • testing: switch from nose to py.test
  • improve packaging. Include tests in sdist only
  • numerous bug fixes

v0.3.1 (2015-10-04)

This release fixes a few bugs which caused crashes in some situations.

v0.3.0 (2015-09-22)

  • support for requests-cache allowing to cache SDMX messages in memory, MongoDB, Redis or SQLite
  • pythonic selection of series when requesting a dataset: Request.get allows the key keyword argument in a data request to be a dict mapping dimension names to values. In this case, the dataflow definition, datastructure definition, and content-constraint are downloaded on the fly, cached in memory and used to validate the keys. The dotted key string needed to construct the URL will be generated automatically.
  • The Response.write method takes a parse_time keyword arg. Set it to False to avoid parsing of dates, times and time periods as exotic formats may cause crashes.
  • The Request.get method takes a memcache keyword argument. If set to a string, the received Response instance will be stored in the dict Request.cache for later use. This is useful when, e.g., a DSD is needed multiple times to validate keys.
  • fixed base URL for Eurostat
  • major refactorings to enhance code maintainability

v0.2.2

  • Make HTTP connections configurable by exposing the requests.get API through the pandasdmx.api.Request constructor. Hence, proxy servers, authorisation information and other HTTP-related parameters consumed by requests.get can be specified for each Request instance and used in subsequent requests. The configuration is exposed as a dict through a new Request.client.config attribute.
  • Responses have a new http_headers attribute containing the HTTP headers returned by the SDMX server

v0.2.1

  • Request.get: allow fromfile to be a file-like object
  • extract SDMX messages from zip archives if given. Important for large datasets from Eurostat
  • automatically get a resource at a URL given in the footer of the received message. This makes it possible to automatically retrieve large datasets from Eurostat that have been made available at the given URL. The number of attempts and the time to wait before each request are configurable via the get_footer_url argument.

v0.2 (2015-04-13)

This version is a quantum leap. The whole project has been redesigned and rewritten from scratch to provide robust support for many SDMX features. The new architecture is centered around a pythonic representation of the SDMX information model. It is extensible through readers and writers for alternative input and output formats. Export to pandas has been dramatically improved. Sphinx documentation has been added.

v0.1 (2014-09)

Initial release

FAQ

Can pandaSDMX connect to SDMX providers other than INSEE, ECB and Eurostat?

Any SDMX provider that generates SDMX 2.1-compliant messages can be supported. INSEE, ECB and Eurostat are hard-coded. Others may be added in a few lines. Alternatively, a custom base URL can be provided to the pandasdmx.api.Request.get() method. See the docstring. Support for SDMX 2.0 messages could be added as a new reader module. Perhaps the model would have to be tweaked a bit as well.

Writing large datasets to pandas DataFrames is slow. What can I do?

The main performance hit comes from parsing the time or time period strings. For regular data such as monthly series (unlike, say, trading-day data), call the write method with fromfreq set to True so that only the first string is parsed and the rest is inferred from the frequency of the series. Caution: if the series is stored in the XML document in reverse chronological order, the reverse_obs argument must be set to True as well to prevent the resulting DataFrame index from extending into a remote future.
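
A minimal sketch, assuming resp is a Response holding such a dataset (fromfreq and reverse_obs are the write arguments described above):

>>> df = resp.write(fromfreq=True)                     # parse only the first period string
>>> df = resp.write(fromfreq=True, reverse_obs=True)   # if observations are stored newest-first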

Getting started

Installation

Prerequisites

pandaSDMX is a pure Python package. As such it should run on any platform. It requires Python 2.7, 3.4 or higher.

It is recommended to use one of the common Python distributions for scientific data analysis. Along with a current Python interpreter, these distributions include many useful packages for data analysis. For other Python distributions (not only scientific) see here.

pandaSDMX has the following dependencies:

  • the data analysis library pandas which itself depends on a number of packages
  • the HTTP library requests
  • LXML for XML processing.
  • JSONPATH-RW for JSON processing.
Optional dependencies
  • requests-cache, which allows caching SDMX messages in memory, MongoDB, Redis and more.
  • odo for fancy data conversion and database export
  • IPython is required to build the Sphinx documentation. To do this, check out the pandaSDMX repository on GitHub.
  • py.test to run the test suite.
Download

From the command line of your OS, issue

pip install pandasdmx

Installation with conda is currently not supported.

Of course, you can also download the tarball from PyPI and issue python setup.py install from the package directory.

Running the test suite

The test suite is contained in the source distribution. It is recommended to run the tests with py.test.

Package overview

Modules

api
module containing the API to make queries to SDMX web services, validate keys (filters), etc. See pandasdmx.api.Request, in particular its get method. pandasdmx.api.Request.get() returns pandasdmx.api.Response instances.
model
implements the SDMX information model.
remote
contains a wrapper class around requests for HTTP. It is called by pandasdmx.api.Request.get() to make HTTP requests to SDMX services, and can also read SDMXML files instead of querying them over the web.

Subpackages

reader
read SDMX files and instantiate the appropriate classes from pandasdmx.model. There are currently two readers: one for XML-based SDMXML 2.1 and one for SDMX-JSON 2.1.
writer

contains writer classes transforming SDMX artefacts into other formats or writing them to arbitrary destinations such as databases. As of v0.6.0, two writers are available:

  • ‘data2pandas’ exports a dataset or portions thereof to a pandas DataFrame or Series.
  • ‘structure2pd’ exports structural metadata such as lists of data-flow definitions, code-lists, concept-schemes etc. which are contained in a structural SDMX message as a dict mapping resource names (e.g. ‘dataflow’, ‘codelist’) to pandas DataFrames.
utils
utility functions and classes. Contains a wrapper around dict allowing attribute access to dict items.
tests
unit tests and sample files

What next?

The following chapters explain the key characteristics of SDMX, demonstrate the basic usage of pandaSDMX and provide additional information on some advanced topics. While users that are new to SDMX are likely to benefit a lot from reading the next chapter on SDMX, normal use of pandaSDMX should not strictly require this. The Basic usage chapter should enable you to retrieve datasets and write them to pandas DataFrames. But if you want to exploit the full richness of the information model, or simply feel more comfortable if you know what happens behind the scenes, the SDMX introduction is for you. It also contains links to reference materials on SDMX.

A very short introduction to SDMX

General purpose

SDMX (short for: Statistical Data and Metadata eXchange) is a set of standards and guidelines aimed at facilitating the production, dissemination, retrieval and processing of statistical data and metadata. SDMX is sponsored by a wide range of public institutions including the UN, the IMF, the World Bank, BIS, ILO, FAO, the OECD, the ECB, Eurostat, and a number of national statistics offices. These and other institutions provide a vast array of current and historic data sets and metadata sets via free or fee-based REST and SOAP web services. pandaSDMX only supports SDMX 2.1, the latest version of this standard. Some agencies such as the ILO and WHO still offer SDMX 2.0-compliant services. These cannot be accessed by pandaSDMX. It is expected that most SDMX providers will ultimately upgrade to the latest version of the standard.

Information model

At its core, SDMX defines an information model consisting of a set of classes, their logical relations, and semantics. There are classes defining things like data sets, metadata sets, data and metadata structures, processes, organisations and their specific roles to name but a few. The information model is agnostic as to its implementation. The SDMX standard provides an XML-based implementation (see below). And a more efficient JSON-variant called SDMXJSON is being standardised by the SDMX Technical Standards Working Group. PandaSDMX supports both formats.

The following sections briefly introduce some key elements of the information model.

Data sets

A data set can broadly be described as a container of ordered observations and attributes attached to them. Observations (e.g. the annual unemployment rate) are classified by dimensions such as country, age, sex, and time period. Attributes may further describe an individual observation or a set of observations. Typical uses for attributes are the level of confidentiality, or data quality. Observations may be clustered into series, in particular time series. The data set must explicitly specify the dimension at observation, such as ‘time’, ‘time_period’ or anything else. If a data set consists of series whose dimension at observation is neither time nor time period, the data set is called cross-sectional. A data set that is not grouped into series, i.e. where all dimension values, including time if available, are stated for each observation, is called a flat data set. Flat data sets are hardly memory-efficient, but they benefit from a very simple representation.

An attribute may be attached to a series to express the fact that it applies to all contained observations. This increases efficiency and adds meaning. Subsets of series within a data set may be clustered into groups. A group is defined by specifying one or more dimension values, but not all: at least the dimension at observation and one other dimension must remain free (or wild-carded). Otherwise, the group would in fact be either a single observation or a series. The main purpose of a group is to serve as a convenient attachment point for attributes: a given attribute may be attached to all series within the group at once. Attributes may finally be attached to the entire data set, i.e. to all series/observations therein.

Structural metadata: data structure definition, concept scheme and code list

In the above section on data sets, we have carelessly used structural terms such as dimension, dimension value and attachment of attributes. This is because it is almost impossible to talk about data sets without talking about their structure. The information model provides a number of classes to describe the structure of data sets without talking about data. The container class for this is called DataStructureDefinition (in short: DSD). It contains a list of dimensions and, for each dimension, a reference to exactly one concept describing its meaning. A concept describes the set of permissible dimension values. This can be done in various ways depending on the intended data type. Finite value sets (such as country codes, currencies, a data quality classification etc.) are described by reference to code lists. Infinite value sets are described by facets, which are simply a way to express that a dimension may have int, float or time-stamp values, to name but a few. A set of concepts referred to in the dimension descriptors of a data structure definition is called a concept scheme.

The set of allowed observation values such as the unemployment rate measured in per cent is defined by a special dimension called MeasureDimension.

Dataflow definition

A dataflow describes how a particular data set is structured (by referring to a DSD), how often it is updated over time by its maintaining agency, under what conditions it will be provided etc. The terminology is a bit confusing: You cannot actually obtain a dataflow from an SDMX web service. Rather, you can request one or more dataflow definitions describing how datasets under this dataflow are structured, which codes may be used to query for desired columns etc. The dataflow definition and the artefacts to which it refers give you all the information you need to exploit the data sets you can request using the dataflow’s ID.

A DataFlowDefinition is a class that describes a dataflow. A DataFlowDefinition has a unique identifier, a human-readable name and potentially a more detailed description. Both may be multi-lingual. The dataflow’s ID is used to query the data set it describes. The dataflow also features a reference to the DSD which structures the data sets available under this dataflow ID. For instance, in the frontpage example we used the dataflow ID ‘une_rt_a’.

Constraints

Constraints are a mechanism to specify a subset of keys from the set of possible combinations of keys available in the referenced code lists for which there is actually data. For example, a constraint may reflect the fact that in a certain country there are no lakes or hospitals, and hence no data about water quality or hospitalization.

There are two types of constraints:

A content-constraint is a mechanism to express the fact that data sets of a given dataflow only comprise columns for a subset of values from the code-lists representing dimension values. For example, the datastructure definition for a dataflow on exchange rates references the code list of all country codes in the world, whereas the data sets provided under this dataflow only covers the ten largest currencies. These can be enumerated by a content-constraint attached to the dataflow definition or DSD. Content-constraints can be used to validate dimension names and values (a.k.a. keys) when requesting data sets selecting columns of interest. pandaSDMX supports content constraints and provides convenient methods to validate keys, compute the constrained code lists etc.

An attachment-constraint describes to which parts of a data set (column/series, group of series, observation, the entire data set) certain attributes may be attached. Attachment-constraints are not supported by pandaSDMX as this feature is needed only for data set generation. However, pandaSDMX does support attributes in the information model and when exporting data sets to pandas.

Category schemes and categorisations

Categories serve to classify or categorise things like dataflows, e.g., by subject matter. Multiple categories may belong to a container called a CategoryScheme.

A Categorisation links the thing to be categorised, e.g., a DataFlowDefinition, to a Category.

Class hierarchy

The SDMX information model defines a number of abstract base classes from which subclasses such as DataFlowDefinition or DataStructureDefinition are derived. E.g., DataFlowDefinition inherits from MaintainableArtefact, which provides attributes indicating the maintaining agency. MaintainableArtefact inherits from VersionableArtefact, which, in turn, inherits from IdentifiableArtefact, which inherits from AnnotableArtefact, and so forth. Hence, DataStructureDefinition may have a unique ID, a version, a natural language name in multiple languages, a description, and annotations. pandaSDMX takes full advantage of this class hierarchy.

Implementations of the information model

Background

There are two implementations of the information model:

  • SDMXML is XML-based. It is fully standardised and covers the complete information model. However, it is a bit heavy-weight and data providers are gradually shifting to the JSON flavor currently in the works.
  • SDMXJSON: This recent JSON-based implementation is more lightweight and efficient. While standardisation is in an advanced stage, structure-messages are not yet covered. Data messages work well though, and pandaSDMX supports them as from v0.5.
SDMXML

The SDMX standard defines an XML-based implementation of the information model called SDMXML. An SDMXML document contains exactly one SDMX Message. There are several types of Message, such as GenericDataMessage, which represents a data set in generic form, i.e. containing all the information required to interpret it. Hence, data sets in generic representation may be used without knowing the related DataStructureDefinition. The downside is that generic data set messages are much larger than their sister format, structure-specific data sets. pandaSDMX has always supported generic data set messages. In v0.8, support for structure-specific data messages was added. SDMX-JSON messages can be consumed as well.

The term ‘structure-specific dataset’ reflects the fact that in order to interpret such a dataset, one needs to know the datastructure definition (DSD). Otherwise, it would be impossible to distinguish dimension values from attributes etc. Hence, when downloading a structure-specific dataset, pandaSDMX downloads the DSD on the fly or retrieves it from a local cache.

Another important SDMXML message type is StructureMessage which may contain artefacts such as DataStructureDefinitions, code lists, conceptschemes, categoryschemes and so forth.

SDMXML provides that each message contains a Header containing some metadata about the message. Finally, SDMXML messages may contain a Footer element. It provides information on any errors that have occurred on the server side, e.g., if the requested data set exceeds the size limit, or the server needs some time to make it available under a given link.

The test suite comes with a number of small SDMXML demo files. View them in your favorite XML editor to get a deeper understanding of the structure and content of various message types.

SDMX services provide XML schemas to validate a particular SDMXML file. However, pandaSDMX does not yet support validation.

SDMXJSON

SDMXJSON represents SDMX data sets and related metadata as JSON files provided by RESTful web services. Early adopters of this format are OECD, ECB and IMF. As of v0.5, pandaSDMX supports the OECD’s REST interface for SDMXJSON. However, note that structural metadata is not yet fully standardised. Hence, it is impossible at this stage to download dataflow definitions, codelists etc. from ABS (Australia) and OECD.

SDMX web services

The SDMX standard defines both a REST and a SOAP web service API. As of v0.8, pandaSDMX only supports the REST API.

The URL specifies the type, providing agency, and ID of the requested SDMX resource (dataflow, categoryscheme, data etc.). The query part of the URL (after the ‘?’) may be used to give optional query parameters. For instance, when requesting data, the scope of the data set may be narrowed down by specifying a key to select only matching columns (e.g. on a particular country). The dimension names and values used to select the rows can be validated by checking if they are contained in the relevant codelists referenced by the datastructure definition (see above), and any content-constraint attached to the dataflow definition for the queried data set. Moreover, rows may be chosen by specifying a startperiod and endperiod for the time series. In addition, the query part may set a references parameter to instruct the SDMX server to return a number of other artefacts along with the resource actually requested. For example, a DataStructureDefinition contains references to code lists and concept schemes (see above). If the ‘references’ parameter is set to ‘all’, these will be returned in the same StructureMessage. The next chapter contains some examples to demonstrate this mechanism. Further details can be found in the SDMX User Guide, and the Web Service Guidelines.
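
For illustration, a data request that narrows the result both by key and by time range might look roughly like this (a sketch; the dataflow ID ‘EXR’ and the parameter values are just examples):

>>> from pandasdmx import Request
>>> ecb = Request('ECB')
>>> resp = ecb.data('EXR', key={'CURRENCY': 'USD'},
...                 params={'startPeriod': '2017', 'endPeriod': '2018'})
>>> flows = ecb.dataflow('EXR', params={'references': 'all'})   # also return the DSD, codelists etc.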

Further reading

  • The SDMX standards and guidelines are the authoritative resource. This page is a must for anyone eager to dive deeper into SDMX. Start with the User Guide and the Information Model (Part 2 of the standard). The Web Services Guidelines contain instructive examples for typical queries.
  • Eurostat SDMX page
  • European Central Bank SDMX page It links to a range of study guides and helpful video tutorials.
  • SDMXSource: - Java, .NET and ActionScript implementations of SDMX software, in part open source

Basic usage

Overview

This chapter illustrates the main steps of a typical workflow, namely:

  1. Choose a data provider
  2. Download the catalogue of dataflows available from the data provider and select a dataflow for further inspection
  3. download metadata on the selected dataflow including the datastructure definition, concepts, codelists and content constraints describing the datasets available through that dataflow
  4. Analyze the metadata as pandas DataFrames or by directly inspecting the Pythonic information model
  5. Specify the needed portions of the data from the dataflow by constructing a selection (“key”) of series and a period/time range for the prospective dataset
  6. Download the actual dataset specified by dataflow ID, key and period/time range
  7. write the dataset or selected series thereof to a pandas DataFrame or Series to analyze the dataset

Each of these steps shares common tasks which flow from the architecture of pandaSDMX:

  1. Use a pandasdmx.api.Request instance to get an SDMX message from a web service or file.
  2. Explore the returned pandasdmx.api.Response instance. The SDMX message is contained in its msg attribute. Note that there are two types of message: DataMessage and StructureMessage. The former contains a data set, the latter contains structural metadata about one or more dataflows, most importantly one or more dataflow definitions and related metadata such as the datastructure definition, codelists, constraints etc.

Connecting to an SDMX web service, caching

First, we instantiate pandasdmx.api.Request. The constructor accepts an optional agency ID as string. The list of supported agencies can be viewed here, or as shown below.

In [1]: from pandasdmx import Request  # 'from pandasdmx import *' would do the same

In [2]: ecb = Request('ECB')

ecb is now configured so as to make requests to the European Central Bank. If you want to send requests to multiple agencies, instantiate multiple Request objects.

Configuring the http connection

To pre-configure the HTTP connections to be established by a Request instance, you can pass all keyword arguments consumed by the underlying HTTP library requests. For a complete description of the options see the requests documentation. For example, a proxy server can be specified for subsequent requests like so:

In [3]: ecb_via_proxy = Request('ECB', proxies={'http': 'http://1.2.3.4:5678'})

HTTP request parameters are exposed through a dict. It may be modified between requests.

In [4]: ecb_via_proxy.client.config
Out[4]: {'proxies': {'http': 'http://1.2.3.4:5678'}, 'stream': True, 'timeout': 30.1}

The Request.client attribute acts a bit like a requests.Session in that it conveniently stores the configuration for subsequent HTTP requests. Modify it to change the configuration. For convenience, pandasdmx.api.Request has a timeout property to set the timeout in seconds for http requests.
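
For example (a sketch; the value is arbitrary):

>>> ecb_via_proxy.timeout = 60.5   # seconds; applies to subsequent HTTP requests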

Caching received files

Since v0.3.0, requests-cache is supported. To use it, pass an optional cache keyword argument to the Request() constructor. If given, it must be a dict whose items will be passed to the requests_cache.install_cache function. Use it to cache SDMX messages in databases such as MongoDB, Redis or SQLite. See the requests-cache docs for further information.
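
For example, the following sketch caches received HTTP responses in a local SQLite file for one hour; the dict items are passed straight to requests_cache.install_cache, so any option accepted by that function should work:

>>> from pandasdmx import Request
>>> ecb = Request('ECB', cache={'cache_name': 'sdmx_cache',
...                             'backend': 'sqlite',
...                             'expire_after': 3600})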

Loading a file instead of requesting it via http

Request instances can load SDMX messages from local files. Issuing r = Request() without passing any agency ID instantiates a Request object not tied to any agency. It may only be used to load SDMX messages from files, unless a pre-fabricated URL is passed to pandasdmx.api.Request.get().
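
A minimal sketch (the file name is made up; the fromfile keyword is described in the chapter on working with files):

>>> from pandasdmx import Request
>>> req = Request()                           # not tied to any agency
>>> resp = req.get(fromfile='mydata.sdmx')    # parse a local SDMX file
>>> df = resp.write()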

Obtaining and exploring metadata about datasets

This section illustrates how to download and explore metadata. Assume we are looking for time-series on exchange rates. Our best guess is that the European Central Bank provides a relevant dataflow. We could google for the dataflow ID or browse the ECB’s website. However, we choose to use SDMX metadata to get a complete overview of the dataflows the ECB provides.
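
As a sketch of this first step, we can download the complete list of the ECB's dataflow definitions and search their names via pandas (assuming, as for the other structure DataFrames, a 'name' column):

>>> flow_response = ecb.dataflow()
>>> dataflows = flow_response.write().dataflow          # DataFrame of dataflow definitions
>>> dataflows[dataflows['name'].str.contains('exchange', case=False)]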

Working with datasets

Selecting and requesting data from a dataflow

Requesting a dataset is as easy as requesting a dataflow definition or any other SDMX artefact: Just call the pandasdmx.api.Request.get() method and pass it ‘data’ as the resource_type and the dataflow ID as resource_id. As a shortcut, you can use the data descriptor which calls the get method implicitly.

Generic or structure-specific data format?

Data providers which support SDMXML offer data sets in two distinct formats:

  • generic data sets: These are self-contained but less memory-efficient. They are suitable for small to medium data sets, but less so for large ones.
  • Structure-specific data sets: This format is memory-efficient (typically about 60 per cent smaller than a generic data set) but it requires the datastructure definition (DSD) to interpret the XML file. The DSD must be downloaded prior to parsing the dataset. pandaSDMX can do this behind the scenes. However, as we shall see in the next section, the DSD can also be provided by the caller to save an additional request.

The intended data format is chosen by selecting the agency. For example, ‘ECB’ provides generic data sets, whereas ‘ECB_S’ provides structure-specific data sets. Hence, there are actually two agency IDs for ECB, ESTAT etc. Note that data providers supporting SDMXJSON only work with a single format for data sets. Hence, there is merely one agency ID for OECD and ABS.

Filtering

In most cases we want to filter the data by columns or rows in order to request only the data we are interested in. Not only does this increase performance; some dataflows are also very large and would exceed server or client limits. The REST API of SDMX offers two ways to narrow down a data request:

  • specifying dimension values which the series to be returned must match (filtering by column labels) or
  • limiting the time range or number of observations per series (filtering by row labels)

From the ECB’s dataflow on exchange rates, we specify the CURRENCY dimension to be either ‘USD’ or ‘JPY’. This can be done by passing a key keyword argument to the get method or the data descriptor. It may either be a string (low-level API) or a dict. The dict form introduced in v0.3.0 is more convenient and pythonic as it allows pandaSDMX to infer the string form from the dict. Its keys (= dimension names) and values (= dimension values) will be validated against the datastructure definition as well as the content-constraints if available.

Content-constraints are implemented only in their CubeRegion flavor. KeyValueSets are not yet supported. In this case, the provided dimension values will be validated only against the unconstrained codelist. It is thus not always guaranteed that the dataset actually contains the desired data, e.g., because the country of interest does not deliver the data to the SDMX data provider. Note that even constrained codelists do not guarantee that for a given key there will be data on the server. This is because the codelists may mislead the user into thinking that every element of their cartesian product is a valid key for a series, whereas there is actually data merely for a subset of that product. The KeyValue flavor of content constraints is thus a more accurate predictor, but this feature is not known to be used by any data provider. Thus pandaSDMX does not support it.

Another way to validate a key against valid codes is a series-keys-only dataset, i.e. a dataset with all possible series keys where no series contains any observation. pandaSDMX supports this validation method as well. However, it is disabled by default. Pass series_keys=True to the request method to validate a given key against a series-keys-only dataset rather than the DSD, as sketched below.
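
A sketch of this, assuming the keyword argument is simply forwarded by the data method to get (the provider must support series-keys-only requests, e.g. ECB):

>>> resp = ecb.data('EXR', key={'CURRENCY': ['USD', 'JPY']},
...                 params={'startPeriod': '2017'},
...                 series_keys=True)   # validate the key against the series keys, not the DSD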

If we choose the string form of the key, it must consist of ‘.’-separated slots representing the dimensions. Values are optional. As we saw in the previous section, the ECB’s dataflow for exchange rates has five relevant dimensions, the ‘CURRENCY’ dimension being at position two. This yields the key ‘.USD+JPY...’. The ‘+’ can be read as an ‘OR’ operator. The dict form is shown further below.
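
Using the string form directly, the same selection would look roughly like this (the three trailing dots leave the remaining dimensions unconstrained):

>>> resp = ecb.data('EXR', key='.USD+JPY...', params={'startPeriod': '2016'})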

Further, we will set a meaningful start period for the time series to exclude any prior data from the request.

To request the data in generic format, we could simply issue:

>>> data_response = ecb.data(resource_id = 'EXR', key={'CURRENCY': ['USD', 'JPY']}, params = {'startPeriod': '2016'})

However, we want to demonstrate how structure-specific data sets are requested. To this end, we instantiate a one-off Request object configured to make requests for efficient structure-specific data, and we pass it the DSD obtained in the previous section. Without passing the DSD, it would be downloaded automatically right after the data set:

In [21]: data_response = Request('ecb_s').data(resource_id = 'EXR',
   ....: key={'CURRENCY': ['USD', 'JPY']},
   ....: params = {'startPeriod': '2017'}, dsd=dsd)
   ....: 

In [22]: data = data_response.data

In [23]: type(data)
Out[23]: pandasdmx.model.DataSet

Anatomy of data sets

This section explains the key elements and structure of a data set. You can skip it on first read when you just want to be able to download data and export it to pandas. More advanced operations, e.g., exporting only a subset of series to pandas, requires some understanding of the anatomy of a dataset including observations and attributes.

As we saw in the previous section, the datastructure definition (DSD) is crucial to understanding the data structure, the meaning of dimension and attribute values, and to select series of interest from the entire data set by specifying a valid key.

The pandasdmx.model.DataSet class has the following features:

dim_at_obs
attribute showing which dimension is at observation level. For time series its value is either ‘TIME’ or ‘TIME_PERIOD’. If it is ‘AllDimensions’, the dataset is said to be flat. In this case there are no series, just a flat list of observations.
series
property returning an iterator over pandasdmx.model.Series instances
obs
method returning an iterator over the observations. Only for flat datasets.
attributes
namedtuple of attributes, if any, that are attached at dataset level

The pandasdmx.model.Series class has the following features:

key
namedtuple mapping dimension names to dimension values
obs
method returning an iterator over observations within the series
attributes
namedtuple mapping any attribute names to values
groups
list of pandasdmx.model.Group instances to which this series belongs. Note that groups are merely attachment points for attributes.

In [24]: data.dim_at_obs
Out[24]: 'TIME_PERIOD'

In [25]: series_l = list(data.series)

In [26]: len(series_l)
Out[26]: 16

In [27]: series_l[5].key
Out[27]: SeriesKey(FREQ='D', CURRENCY='USD', CURRENCY_DENOM='EUR', EXR_TYPE='SP00', EXR_SUFFIX='A')

In [28]: set(s.key.FREQ for s in data.series)
Out[28]: {'A', 'D', 'H', 'M', 'Q'}

This dataset thus comprises 16 time series of several different period lengths. We could have chosen to request only daily data in the first place by providing the value D for the FREQ dimension. In the next section we will show how columns from a dataset can be selected through the information model when writing to a pandas DataFrame.

Writing data to pandas
Selecting columns using the model API

As we want to write data to a pandas DataFrame rather than an iterator of pandas Series, we avoid mixing up different frequencies as pandas may raise an error when passed data with incompatible frequencies. Therefore, we single out the series with daily data. The pandasdmx.api.Response.write() method accepts an optional iterable to select a subset of the series contained in the dataset. Thus we can now generate our pandas DataFrame from daily exchange rate data only:

In [29]: daily = (s for s in data.series if s.key.FREQ == 'D')

In [30]: cur_df = data_response.write(daily)

In [31]: cur_df.shape
Out[31]: (653, 2)

In [32]: cur_df.tail()
Out[32]: 
FREQ                 D        
CURRENCY           JPY     USD
CURRENCY_DENOM     EUR     EUR
EXR_TYPE          SP00    SP00
EXR_SUFFIX           A       A
TIME_PERIOD                   
2019-07-18      120.89  1.1216
2019-07-19      120.93  1.1226
2019-07-22      121.03  1.1215
2019-07-23      120.82  1.1173
2019-07-24      120.41  1.1140

Specifying whether to write observations, attributes or both

The docstring of the pandasdmx.writer.data2pandas.Writer.write() method explains a number of optional arguments to control whether or not another dataframe should be generated for the attributes, which attributes it should contain, and, most importantly, whether the resulting pandas Series should be concatenated to a single DataFrame at all (asframe=True is the default).
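
A sketch, reusing data_response from above (the exact flags accepted by the attributes argument are listed in the docstring; an empty string is assumed to suppress attributes altogether):

>>> daily = (s for s in data.series if s.key.FREQ == 'D')
>>> obs_only = data_response.write(daily, attributes='')   # observations only, no attributes DataFrame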

Controlling index generation

The write method provides the following parameters to control index generation. This is useful to increase performance for large datasets with regular indexes (e.g. monthly data), and to avoid crashes caused by exotic datetime formats not parsed by pandas:

  • fromfreq: if True, the index will be extrapolated from the first date or period and the frequency. This is only robust if the dataset has a uniform index, i.e. no gaps (unlike, e.g., daily trading data).
  • reverse_obs: if True, return observations in a series in reverse document order. This may be useful to establish chronological order, in particular in combination with fromfreq. Default is False.
  • parse_time: if pandas raises parsing errors due to exotic date-time formats, set parse_time to False to obtain a string index rather than a datetime index. Default is True. See the sketch below.
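
For instance, a sketch of a write call that skips date parsing entirely:

>>> daily = (s for s in data.series if s.key.FREQ == 'D')
>>> df = data_response.write(daily, parse_time=False)   # string index instead of a DatetimeIndex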

Working with files

The pandasdmx.api.Request.get() method accepts two optional keyword arguments tofile and fromfile. If a file path or, in case of fromfile, a file-like object is given, any SDMX message received from the server will be written to a file, or a file will be read instead of making a request to a remote server.

The file to be read may be a zip file (new in version 0.2.1). In this case, the SDMX message must be the first file in the archive. The same works for zip files returned from an SDMX server. This happens, e.g., when Eurostat finds that the requested dataset is too large. In this case, the first request will yield a message with a footer containing a link to a zip file to be made available after some time. The link may be extracted by issuing something like:

>>> resp.footer.text[1]

and passed as the url argument when calling get a second time to retrieve the zipped data message.

Since version 0.2.1, this second request can be performed automatically through the get_footer_url parameter. It defaults to (30, 3), which means that three attempts will be made at 30-second intervals. This behavior is useful when requesting large datasets from Eurostat. Deactivate it by setting get_footer_url to None.
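
A sketch combining these options for a potentially large Eurostat query (the file name is made up; tofile and get_footer_url are forwarded to get):

>>> estat = Request('ESTAT')
>>> resp = estat.data('une_rt_a', key={'GEO': 'EL+ES+IE'},
...                   get_footer_url=(30, 3),    # poll the footer URL up to 3 times, 30 s apart
...                   tofile='une_rt_a.sdmx')    # also keep the raw message on disk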

In addition, since version 0.4 you can use pandasdmx.api.Response.write_source() to save the serialized XML tree to a file.

Caching Response instances in memory

The get API provides a rudimentary cache for Response instances. It is a simple dict mapping user-provided names to Response instances. If we want to cache a Response, we can provide a suitable name by passing the memcache keyword argument to the get method. Pre-existing items under the same key will be overwritten.
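
For example (a sketch; 'EXR_flow' is just a made-up cache key, and the descriptor is assumed to forward the keyword to get):

>>> flow_response = ecb.dataflow('EXR', memcache='EXR_flow')   # store the Response under this name
>>> cached = ecb.cache['EXR_flow']                             # retrieve it later without a new request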

Note

Caching of HTTP responses can also be achieved through requests-cache. Activate the cache by instantiating pandasdmx.api.Request with a cache keyword argument. It must be a dict of configuration values passed to requests_cache.install_cache.

Using odo to export datasets to other data formats and database backends

Since version 0.4, pandaSDMX supports odo, a great tool to convert datasets to a variety of data formats and database backends. To use this feature, you have to call pandasdmx.odo_register() to register .sdmx files with odo. Then you can convert an .sdmx file containing a dataset to, say, a CSV file or an SQLite or PostgreSQL database in a few lines:

>>> import pandasdmx
>>> from odo import odo
>>> pandasdmx.odo_register()
>>> odo('mydata.sdmx', 'sqlite:///mydata.sqlite')

Behind the scenes, odo uses pandaSDMX to convert the .sdmx file to a pandas DataFrame and performs any further conversions from there based on odo’s conversion graph. Any keyword arguments passed to odo will be passed on to pandasdmx.api.Response.write().

There is a limitation though: In the exchange rate example from the previous chapter, we needed to select same-frequency series from the dataset before converting the data set to pandas. This will likely cause crashes as odo’s discover method is unaware of this selection. Hence, .sdmx files can only be exported using odo if they can be exported to pandas without passing any arguments to pandasdmx.api.Response.write().

Handling errors

The pandasdmx.api.Response instance generated upon receipt of the response from the server has a status_code attribute. The SDMX web services guidelines explain the meaning of these codes. In addition, if the SDMX server has encountered an error, it may return a message which includes a footer containing explanatory notes. pandaSDMX exposes the content of a footer via a text attribute which is a list of strings.
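
A sketch of how application code might act on this (attribute names as described above; a footer may be absent):

>>> estat = Request('ESTAT')
>>> resp = estat.data('une_rt_a', key={'GEO': 'EL+ES+IE'})
>>> code = resp.status_code                  # HTTP status code
>>> if resp.footer is not None:              # explanatory notes, if the server sent a footer
...     print('\n'.join(resp.footer.text))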

Note

pandaSDMX raises only HTTP errors with status codes between 400 and 499. Codes >= 500 do not raise an error, as the SDMX web services guidelines assign special meanings to those codes. The caller must therefore raise an error if needed.

Logging

Since version 0.4, pandaSDMX can log certain events, such as when a connection to a web service is made or a file has been successfully downloaded. It uses the logging package from the Python stdlib. To activate logging, you must set the parent logger’s level to the desired value as described in the logging docs. Example:

>>> pandasdmx.logger.setLevel(10)

Data providers

Overview

pandaSDMX supports a number of data providers out of the box. Each data provider is configured by an item in agencies.json in the package root. Data providers are identified by a case-insensitive string such as “ECB”, “ESTAT_S” or “OECD”. For each pre-configured data provider, agencies.json contains the URL and name of the SDMX API and potentially some additional metadata about the provider’s web API. The configuration information about data providers is stored in the dict-type class attribute _agencies of Request. Other data providers can be configured by passing a suitable JSON file to the pandasdmx.api.Request.add_agency() method, which will be used to update the dict storing the agency configuration.

Pre-configured data providers

This section describes the data providers supported out of the box. The most salient distinction between data providers derives from the supported API: While OECD and Australian Bureau of Statistics (ABS) are only supported with regards to their SDMX-JSON APIs, all others send SDMX-ML messages. SDMX-JSON is currently confined to data messages. Hence, pandaSDMX features relating to structural metadata are unavailable when making requests to OECD or ABS.

Agencies supporting SDMXML messages come in two flavors: one for generic data sets (e.g. ECB, ESTAT, INSEE etc.), the other for structure-specific data sets (e.g., ECB_S, ESTAT_S etc.).

Australian Bureau of Statistics (ABS)

SDMX-JSON only. Start by browsing the website to retrieve the dataflow you’re interested in. Then try to fine-tune a planned data request by providing a valid key (= selection of series from the dataset). No automatic validation can be performed as structural metadata is unavailable.

Eurostat
  • SDMXML-based API.
  • thousands of dataflows on a wide range of topics.
  • No categorisations available.
  • Long response times are reported. Increase the timeout attribute to avoid timeout exceptions.
European Central Bank (ECB)
  • SDMXML-based API
  • supports categorisations of data-flows
  • supports preview_data and series-key based key validation
  • in general short response times
French National Institute for Statistics (INSEE)
  • SDMXML-based API.
  • An issue has been reported, apparently due to a missing pericite codelist in StructureMessages. This may cause crashes. Avoid downloading this type of message. Prepare the key as a string using the web interface, and simply download a dataset.
International Labour Organization (ILO)

ILO’s SDMX web API deviates in some respects from the others. It is highly recommended to read the API guide. Here are some of the gotchas:

  • dataflow IDs take on the role of a filter. E.g., there are dataflows for individual countries, ages, sexes etc. rather than merely for different indicators.
  • Do not set the ‘references’ parameter to ‘all’ as is done by pandaSDMX by default when one requests a dataflow specified by ID. ILO can handle ‘references’ = ‘descendants’ and some others, but not ‘all’.
  • As the default format is SDMX 2.0, the ‘format’ parameter should be set to ‘generic_2_1’ or equivalent for each request.
International Monetary Fund (IMF) - SDMX Central only
  • SDMXML-based API
  • supports series-key-only and hence dataset-based key validation and construction.
Italian Statistics Office (ISTAT)

ISTAT uses roughly the same server platform as Eurostat.

Norges Bank (Central Bank of Norway, “NB” or “NB_S”)
  • agency ID: ‘NB’ for generic, “NB_S” for structure-specific data
  • few dataflows, so do not use categoryscheme
  • it is unknown whether NB supports series-keys-only
Organisation for Economic Cooperation and Development (OECD)

SDMX-JSON only. Start by browsing the website to retrieve the dataflow you’re interested in. Then try to fine-tune a planned data request by providing a valid key (= selection of series from the dataset). No automatic validation can be performed as structural metadata is unavailable.

United Nations Statistics Division (UNSD)
  • SDMXML-based API
  • supports preview_data and series-key based key validation
  • supports categoryscheme even though it offers very few dataflows, so don’t use this feature. Moreover, it seems that categories confusingly include dataflows which UNSD does not actually provide.
UNESCO
  • free registration required
  • subscription key must be provided either as parameter or HTTP-header with each request
  • SDMXML-based API
  • An issue with structure-specific datasets has been reported. It seems that Series are not recognized due to some oddity in the XML format.

Advanced topics

References in the SDMX information model and REST APIs

Background

Some SDMX artefacts (objects) reference others to indicate a relationship between both objects. pandaSDMX represents such references as instances of pandasdmx.model.Ref.

Such references can be considered as edges of a directed graph whose nodes are the SDMX objects. Objects referenced by another object are denoted as its children. These and their children are called its descendants. Objects referring to another object are called its parents, and so forth. Siblings of an object are on the same level of the graph considered as a multi-rooted tree.

For example, the DSD referenced by a DataflowDefinition is its child, and the codelists referenced by the dimensions defined in the DSD are children of that child, i.e. descendants of said DataflowDefinition.

The pandasdmx.model.Ref class

pandasdmx.model.Ref instances identify the referenced target by attributes such as id, agency_id, package (= resource type) etc. To resolve a reference, i.e. to retrieve the target, Ref instances are callable (new in v0.9). The __call__ method accepts some arguments to influence the retrieval process.

  • Set request to True to allow remote requests in case the target is not found in the current message.
  • Set target_only to False if you want the entire SDMX message that has been downloaded rather than just the referenced artefact.
  • raise_errors specifies whether an exception will be raised or suppressed (in which case None is returned). See the sketch below.
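
A minimal sketch, reusing the dataflow message from the frontpage example (structure is the Ref attribute pointing to the DSD):

>>> dsd = flow_response.dataflow.une_rt_a.structure(request=True, target_only=True)
>>> structure_msg = flow_response.dataflow.une_rt_a.structure(request=True, target_only=False)
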
Using references in requests

SDMX web services support a references parameter in HTTP requests which can take on values such as ‘all’, ‘descendants’ and so forth. This parameter instructs the web service to include, when generating the DataMessage or StructureMessage, the objects implicitly designated by the references parameter alongside the explicitly requested resource. For example, in the request

>>> response = some_agency.dataflow('SOME_ID', params={'references': 'all'})

the agency will return

  • the dataflow ‘SOME_ID’ explicitly specified
  • the DSD referenced by the dataflow’s structure attribute
  • the codelists referenced indirectly by the DSD
  • any content-constraints which reference the dataflow or the DSD.

It is much more efficient to request many objects in a single request. Thus, pandaSDMX provides sensible defaults for the references parameter in common situations. For example, when a single dataflow is requested by specifying its ID, pandaSDMX sets references to ‘all’ as this appears most useful. On the other hand, when the dataflow ID is wildcarded, it is more practical not to request all referenced objects alongside, as the response would likely be excessively large, and the user is deemed to be interested in the bird’s-eye perspective (the list of dataflows) before focusing on a particular dataflow and its descendants and ancestors. The default value for the references parameter can be overridden.

Note that some agencies differ in their behavior regarding the references parameter. E.g., Eurostat (ESTAT) does not return the DSD when requesting a dataflow even though references is set to ‘all’. This behavior is likely inconsistent with the SDMX standard.

Category schemes

SDMX supports category schemes to categorize dataflow definitions and other objects. This helps to retrieve, e.g., a dataflow of interest. Note that not all agencies support category schemes. The ECB is a good example of one that does. However, as the ECB’s SDMX service offers fewer than 100 dataflows, using category schemes is not strictly necessary. A counter-example is Eurostat, which offers more than 6000 dataflows yet does not categorize them. Hence, the user must search through the flat list of dataflows.

To search the list of dataflows by category, we request the category scheme from the ECB’s SDMX service and explore the response like so:

In [1]: from pandasdmx import *

In [2]: ecb = Request('ecb')

In [3]: cat_response = ecb.categoryscheme()

Like any other scheme, a category scheme is essentially a dict mapping ID’s to the actual SDMX objects. To display the categorised items, in our case the dataflow definitions contained in the category on exchange rates, we iterate over the Category instance (new in version 0.5):

In [4]: cat_response.categoryscheme.keys()
Out[4]: dict_keys(['MOBILE_NAVI', 'JDF_NAVI', 'MOBILE_BASKETS', 'MOBILE_NAVI_PUB'])

In [5]: list(cat_response.categoryscheme.MOBILE_NAVI['07'])
Out[5]: 
[DataflowDefinition | EXR | Exchange Rates,
 DataflowDefinition | WTS | Trade weights]

The information model in detail

The easiest way to understand the class hierarchy of the information model is to download a DSD from a data provider and inspect the resulting objects’ base classes and MRO.

In most situations, structural metadata is represented by subclasses of dict mapping the SDMX artefacts’ IDs to the artefacts themselves. The most intuitive examples are the container of code lists and the codes within a code list.

Likewise, categorisations, categoryschemes, and many other artefacts from the SDMX information model are represented by subclasses of dict.

If dict keys are valid attribute names, you can use attribute syntax. This is thanks to pandasdmx.utils.DictLike, a thin wrapper around dict that internally uses a patched third-party tool.
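
For instance, building on the category-scheme example above, item access and attribute access are interchangeable (a brief sketch; only the equality is the point here):

>>> cat_response.categoryscheme['MOBILE_NAVI'] == cat_response.categoryscheme.MOBILE_NAVI
True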

In particular, the categoryscheme attribute of a pandasdmx.model.StructureMessage instance is an instance of DictLike. The DictLike container for the received category schemes uses the id attribute of pandasdmx.model.CategoryScheme as keys. This level of generality is required to cater for situations in which more than one category scheme is returned.

Note that pandasdmx.utils.DictLike has an aslist method. It returns its values as a new list sorted by id. The sorting criterion may be overridden in subclasses. We can see this when dealing with dimensions in a pandasdmx.model.DataStructureDefinition where the dimensions are ordered by position.
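
As a sketch, assuming dsd is a DataStructureDefinition obtained from a structure message and that its dimension descriptor is exposed as a dimensions attribute (an assumption, not verified here):

>>> [dim.id for dim in dsd.dimensions.aslist()]   # dimension IDs in the order defined by the DSD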

Accessing the underlying XML document

The information model does not (yet) expose all attributes of SDMX messages. However, the underlying XML elements are accessible from almost everywhere. This is thanks to the base class pandasdmx.model.SDMXObject. It injects two attributes: _elem and _reader which grant access to the XML element represented by the model class instance as well as the reader instance.
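
For example, the raw XML behind any model instance obtained from an SDMXML message can be inspected like this (obj is a placeholder for such an instance; lxml is the library whose root element is mentioned in the API reference below):

>>> from lxml import etree
>>> print(etree.tostring(obj._elem, pretty_print=True).decode())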

Extending pandaSDMX

pandaSDMX is now extensible by readers and writers. While the API needs a few refinements, it should be straightforward to start from pandasdmx.writer.data2pandas when developing writers for alternative output formats such as spreadsheets, databases, or web applications. A hypothetical skeleton is sketched below.
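
The following is a purely hypothetical skeleton of such a writer. It assumes only the BaseWriter constructor signature documented in the API reference below and the convention that writers expose a write method to which Response.write() delegates; the attribute names used inside the body are illustrative guesses.

from pandasdmx.writer import BaseWriter

class ListWriter(BaseWriter):
    """Toy writer that collects (series key, observation value) pairs into a list."""

    def write(self, source=None, **kwargs):
        # 'self.msg' as the attribute holding the parsed message is an assumption.
        msg = source if source is not None else self.msg
        rows = []
        for series in msg.data.series:   # generic data message; attribute path assumed
            for obs in series.obs(with_attributes=False):
                rows.append((series.key, obs.value))
        return rows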

Similarly, readers for additional or future SDMX formats would be useful.

Interested developers should contact the author at fhaxbox66@gmail.com.

pandasdmx

pandasdmx package

Subpackages
pandasdmx.reader package
Submodules
pandasdmx.reader.sdmxjson module

This module contains a reader for SDMXJSON v2.1.

class pandasdmx.reader.sdmxjson.Reader(request, dsd, **kwargs)[source]

Bases: pandasdmx.reader.BaseReader

Read SDMXJSON 2.1 and expose it as instances from pandasdmx.model

dataset_attrib(sdmxobj)[source]
dim_at_obs(sdmxobj)[source]
generic_groups(sdmxobj)[source]
getitem0 = operator.itemgetter(0)
getitem_key = operator.itemgetter('_key')
group_key(sdmxobj)[source]
header_error(sdmxobj)[source]
initialize(source)[source]
international_str(name, sdmxobj)[source]

return DictLike of xml:lang attributes. If node has no attributes, assume that language is ‘en’.

iter_generic_obs(sdmxobj, with_value, with_attributes)[source]
iter_generic_series_obs(sdmxobj, with_value, with_attributes, reverse_obs=False)[source]
iter_series(sdmxobj)[source]
read_as_str(name, sdmxobj, first_only=True)[source]
series_attrib(sdmxobj)[source]
series_key(sdmxobj)[source]
structured_by(sdmxobj)[source]
write_source(filename)[source]

Save source to file by calling write on the root element.

class pandasdmx.reader.sdmxjson.XPath(path)[source]

Bases: object

pandasdmx.reader.sdmxml module

This module contains a reader for SDMXML v2.1.

class pandasdmx.reader.sdmxml.Reader(request, dsd, **kwargs)[source]

Bases: pandasdmx.reader.BaseReader

Read SDMX-ML 2.1 and expose it as instances from pandasdmx.model

dataset_attrib(sdmxobj)
dim_at_obs(sdmxobj)[source]
generic_groups(sdmxobj)[source]
group_key(sdmxobj)[source]
header_error(sdmxobj)[source]
initialize(source)[source]
international_str(name, sdmxobj)[source]

return DictLike of xml:lang attributes. If node has no attributes, assume that language is ‘en’.

iter_generic_obs(sdmxobj, with_value, with_attributes)[source]
iter_generic_series_obs(sdmxobj, with_value, with_attributes, reverse_obs=False)[source]
iter_series(sdmxobj)[source]
series_attrib(sdmxobj)[source]
series_key(sdmxobj)[source]
structured_by(sdmxobj)[source]
write_source(filename)[source]

Save XML source to file by calling write on the root element.

Module contents

This module contains the base class for readers.

class pandasdmx.reader.BaseReader(request, dsd, **kwargs)[source]

Bases: object

initialize(source)[source]
read_as_str(name, sdmxobj, first_only=True)[source]
read_identifiables(cls, sdmxobj, offset=None)[source]

If sdmxobj inherits from dict: update it with modelized elements. These must be instances of model.IdentifiableArtefact, i.e. have an ‘id’ attribute. This will be used as dict keys. If sdmxobj does not inherit from dict: return a new DictLike.

read_instance(cls, sdmxobj, offset=None, first_only=True)[source]

If cls is in _paths and the path matches, return an instance of cls for the first XML element or, if first_only is False, a list of cls instances for all elements found. If no matches were found, return None.

pandasdmx.utils package
Submodules
pandasdmx.utils.aadict module
class pandasdmx.utils.aadict.aadict[source]

Bases: dict

A dict subclass that allows attribute access to be synonymous with item access, e.g. mydict.attribute == mydict['attribute']. It also provides several other useful helper methods, such as pick() and omit().

static d2a(subject)[source]
static d2ar(subject)[source]
omit(*args)[source]
pick(*args)[source]
update([E, ]**F) → None. Update D from dict/iterable E and F.[source]

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]

pandasdmx.utils.anynamedtuple module
pandasdmx.utils.anynamedtuple.namedtuple(typename, field_names, verbose=False, rename=False)[source]

Returns a new subclass of tuple with named fields. This is a patched version of collections.namedtuple from the stdlib. Unlike the latter, it accepts non-identifier strings as field names. All values are accessible through dict syntax. Fields whose names are identifiers are also accessible via attribute syntax as in ordinary namedtuples, alongside traditional indexing. This feature is needed as SDMX allows field names to contain ‘-‘.

>>> Point = namedtuple('Point', ['x', 'y'])
>>> Point.__doc__                   # docstring for the new class
'Point(x, y)'
>>> p = Point(11, y=22)             # instantiate with positional args or keywords
>>> p[0] + p[1]                     # indexable like a plain tuple
33
>>> x, y = p                        # unpack like a regular tuple
>>> x, y
(11, 22)
>>> p.x + p.y                       # fields also accessible by name
33
>>> d = p._asdict()                 # convert to a dictionary
>>> d['x']
11
>>> Point(**d)                      # convert from a dictionary
Point(x=11, y=22)
>>> p._replace(x=100)               # _replace() is like str.replace() but targets named fields
Point(x=100, y=22)
Module contents

module pandasdmx.utils - helper classes and functions

class pandasdmx.utils.DictLike[source]

Bases: pandasdmx.utils.aadict.aadict

Thin wrapper around dict type

It allows attribute-like item access, has a find() method and inherits other useful features from aadict.

any()[source]

return an arbitrary or the only value. If dict is empty, raise KeyError.

aslist()[source]

return values() as unordered list

find(search_str, by='name', language='en')[source]

Select values by attribute

Parameters:
  • search_str (str) – the string to search for
  • by (str) – the name of the attribute to search by, defaults to ‘name’. The specified attribute must be either a string or a dict mapping language codes to strings. Such attributes occur, e.g., in pandasdmx.model.NameableArtefact which is a base class for pandasdmx.model.DataflowDefinition and many others.
  • language (str) – language code specifying the language of the text to be searched, defaults to ‘en’
Returns:

items where value.<by> contains the search_str. International strings stored as dicts with language codes as keys are searched. Capitalization is ignored.

Return type:

DictLike
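
A hedged usage sketch, assuming dataflows is a DictLike of DataflowDefinition instances (e.g. the dataflow attribute of a StructureMessage):

>>> matches = dataflows.find('exchange')   # search the English names
>>> list(matches.keys())                   # IDs of the matching dataflows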

class pandasdmx.utils.LazyDict(func, *args, **kwargs)[source]

Bases: dict

lazily compute values by calling func(k)

class pandasdmx.utils.NamedTupleFactory[source]

Bases: object

Wrap the namedtuple function from the collections stdlib module to return a singleton if a namedtuple with the same field names has already been created.

cache = {('dim', 'value', 'attrib'): <class 'pandasdmx.utils.SeriesObservation'>, ('key', 'value', 'attrib'): <class 'pandasdmx.utils.GenericObservation'>}
pandasdmx.utils.concat_namedtuples(*tup, **kwargs)[source]

Concatenate 2 or more namedtuples. The new namedtuple type is provided by NamedTupleFactory. Returns a new namedtuple instance.

pandasdmx.utils.str2bool(s)[source]
pandasdmx.writer package
Submodules
pandasdmx.writer.data2pandas module

This module contains a writer class that writes a generic data message to pandas dataframes or series.

class pandasdmx.writer.data2pandas.Writer(msg, **kwargs)[source]

Bases: pandasdmx.writer.BaseWriter

iter_pd_series(iter_series, dim_at_obs, dtype, attributes, reverse_obs, fromfreq, parse_time)[source]
write(source=None, asframe=True, dtype=<class 'numpy.float64'>, attributes='', reverse_obs=False, fromfreq=False, parse_time=True)[source]

Transform a pandasdmx.model.DataMessage instance to a pandas DataFrame or iterator over pandas Series.

Parameters:
  • source (pandasdmx.model.DataMessage) – a pandasdmx.model.DataSet or iterator of pandasdmx.model.Series
  • asframe (bool) – if True, merge the series of values and/or attributes into one or two multi-indexed pandas.DataFrame(s), otherwise return an iterator of pandas.Series. (default: True)
  • dtype (str, NP.dtype, None) – datatype for values. Defaults to NP.float64. If None, do not return the values of a series; in this case, attributes must not be an empty string so that at least some attributes are returned.
  • attributes (str, None) – string determining which attributes, if any, should be returned in separate series or a separate DataFrame. Allowed values: ‘’, ‘o’, ‘s’, ‘g’, ‘d’ or any combination thereof such as ‘os’, ‘go’. Defaults to ‘’. Where ‘o’, ‘s’, ‘g’, and ‘d’ mean that attributes at observation, series, group and dataset level will be returned as members of per-observation namedtuples.
  • reverse_obs (bool) – if True, return observations in reverse order. Default: False
  • fromfreq (bool) – if True, extrapolate time periods from the first item and FREQ dimension. Default: False
  • parse_time (bool) – if True (default), try to generate datetime index, provided that dim_at_obs is ‘TIME’ or ‘TIME_PERIOD’. Otherwise, parse_time is ignored. If False, always generate index of strings. Set it to False to increase performance and avoid parsing errors for exotic date-time formats unsupported by pandas.
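
A hedged example of combining some of these options, assuming resp is a pandasdmx.api.Response holding a data message:

>>> df = resp.write(parse_time=False)    # index of period strings instead of a datetime index
>>> it = resp.write(asframe=False)       # iterator of pandas Series, one per SDMX series
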
pandasdmx.writer.structure2pd module

This module contains a writer class that writes artefacts from a StructureMessage to pandas DataFrames. This is useful, e.g., to visualize codes from a codelist or concepts from a concept scheme. The writer is more general, though: it can output any collection of nameable SDMX objects.

class pandasdmx.writer.structure2pd.Writer(msg, **kwargs)[source]

Bases: pandasdmx.writer.BaseWriter

write(source=None, rows=None, **kwargs)[source]

Transform structural metadata, i.e. codelists, concept schemes, lists of dataflow definitions or category schemes, from a pandasdmx.model.StructureMessage instance into a pandas DataFrame. This method is called by pandasdmx.api.Response.write(). It is not part of the public-facing API, yet certain kwargs are propagated from there.

Parameters:
  • source (pandasdmx.model.StructureMessage) – a pandasdmx.model.StructureMessage instance.
  • rows (str) – sets the desired content to be extracted from the StructureMessage. Must be a name of an attribute of the StructureMessage. The attribute must be an instance of dict whose keys are strings. These will be interpreted as ID’s and used for the MultiIndex of the DataFrame to be returned. Values can be either instances of dict such as for codelists and categoryscheme, or simple nameable objects such as for dataflows. In the latter case, the DataFrame will have a flat index. (default: depends on content found in Message. Common is ‘codelist’)
  • columns (str, list) – if str, it denotes the attribute or attributes of the values (nameable SDMX objects such as Code or ConceptScheme) that will be stored in the DataFrame. If a list, it must contain strings that are valid attribute names. Defaults to: [‘name’, ‘description’]
  • constraint (bool) – if True (default), apply any constraints to codelists, i.e. only the codes allowed by the constraints attached to the DSD, dataflow and provision agreements contained in the message are written to the DataFrame. Otherwise, the entire codelist is written.
  • lang (str) – locale identifier. Specifies the preferred language for international strings such as names. Default is ‘en’.
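
A hedged example, assuming struct_resp is a pandasdmx.api.Response holding a StructureMessage with codelists; the kwargs shown are simply propagated from Response.write() as described above:

>>> out = struct_resp.write(rows='codelist', columns='name', constraint=False, lang='en')
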
Module contents

This module contains the base class for writers.

class pandasdmx.writer.BaseWriter(msg, **kwargs)[source]

Bases: object

Submodules
pandasdmx.api module

This module defines two classes: pandasdmx.api.Request and pandasdmx.api.Response. Together, these form the high-level API of pandasdmx. Requesting data and metadata from an SDMX server requires a good understanding of this API and a basic understanding of the SDMX web service guidelines; only the chapters on REST services are relevant, as pandasdmx does not support the SOAP interface.

class pandasdmx.api.Request(agency='', cache=None, log_level=None, **http_cfg)[source]

Bases: object

Get SDMX data and metadata from remote servers or local files.

agency
categoryscheme

Descriptor to wrap Request.get for convenient calls without specifying the resource as arg.

clear_cache(key=None)[source]

If key is given, remove that item from the cache if it exists. Otherwise (the default, key=None), clear the entire cache.

codelist

Descriptor to wrap Request.get for convenient calls without specifying the resource as arg.

conceptscheme

Descriptor to wrap Request.get for convenient calls without specifying the resource as arg.

contentconstraint

Descriptor to wrap Request.get for convenient calls without specifying the resource as arg.

data

Descriptor to wrap Request.get for convenient calls without specifying the resource as arg.

dataflow

Descriptor to wrap Request.get for convenient calls without specifying the resource as arg.

datastructure

Descriptor to wrap Request.get for convenient calls without specifying the resource as arg.

get(resource_type='', resource_id='', agency='', version=None, key='', params={}, headers={}, fromfile=None, tofile=None, url=None, get_footer_url=(30, 3), memcache=None, writer=None, dsd=None, series_keys=True)[source]

get SDMX data or metadata and return it as a pandasdmx.api.Response instance.

While ‘get’ can load any SDMX file (also as zip-file) specified by ‘fromfile’, it can only construct URLs for the SDMX service set for this instance. Hence, you have to instantiate a pandasdmx.api.Request instance for each data provider you want to access, or pass a pre-fabricated URL through the url parameter.

Parameters:
  • resource_type (str) – the type of resource to be requested. Values must be one of the items in Request._resources such as ‘data’, ‘dataflow’, ‘categoryscheme’ etc. It is used for URL construction, not to read the received SDMX file. Hence, if fromfile is given, resource_type may be ‘’. Defaults to ‘’.
  • resource_id (str) – the id of the resource to be requested. It is used for URL construction. Defaults to ‘’.
  • agency (str) – ID of the agency providing the data or metadata. Used for URL construction only; it tells the SDMX web service which agency the requested information originates from. Note that an SDMX service may provide information from multiple data providers. It may be ‘’ if fromfile is given. Not to be confused with the agency ID passed to __init__(), which specifies the SDMX web service to be accessed.
  • key (str, dict) – select columns from a dataset by specifying dimension values. If type is str, it must conform to the SDMX REST API, i.e. dot-separated dimension values. If ‘key’ is of type ‘dict’, it must map dimension names to allowed dimension values. Two or more values can be separated by ‘+’ as in the str form. The DSD will be downloaded and the items are validated against it before downloading the dataset.
  • params (dict) – defines the query part of the URL. The SDMX web service guidelines (www.sdmx.org) explain the meaning of permissible parameters. It can be used to restrict the time range of the data to be delivered (startperiod, endperiod), whether parents, siblings or descendants of the specified resource should be returned as well (e.g. references=’parentsandsiblings’). Sensible defaults are set automatically depending on the values of other args such as resource_type. Defaults to {}.
  • headers (dict) – http headers. Given headers will overwrite instance-wide headers passed to the constructor. Defaults to None, i.e. use defaults from agency configuration
  • fromfile (str) – path to the file to be loaded instead of accessing an SDMX web service. Defaults to None. If fromfile is given, args relating to URL construction will be ignored.
  • tofile (str) – file path to write the received SDMX file on the fly. This is useful, e.g., if you want to save it for later loading as local file with fromfile or if you want to open an SDMX file in an XML editor.
  • url (str) – URL of the resource to download. If given, any other arguments such as resource_type or resource_id are ignored. Default is None.
  • get_footer_url ((int, int)) – tuple of the form (seconds, number_of_attempts). Determines the behavior in case the received SDMX message has a footer where one of its lines is a valid URL. get_footer_url defines how many attempts should be made to request the resource at that URL after waiting so many seconds before each attempt. This behavior is useful when requesting large datasets from Eurostat. Other agencies do not seem to send such footers. Once an attempt to get the resource has been successful, the original message containing the footer is dismissed and the dataset is returned. The tofile argument is propagated. Note that the written file may be a zip archive. pandaSDMX handles zip archives since version 0.2.1. Defaults to (30, 3).
  • memcache (str) – if given, return the Response instance from self.cache (a dict) if it is already cached; otherwise download the resource and cache the Response instance.
  • writer (str) – optional custom writer class. Should inherit from pandasdmx.writer.BaseWriter. Defaults to None, i.e. one of the included writers is selected as appropriate.
  • dsd (model.DataStructureDefinition) – DSD to be passed on to the sdmxml reader to process a structure-specific dataset without an incidental http request.
  • series_keys (bool) – if True (default), use the SeriesKeysOnly http param if supported by the agency (e.g. ECB) to download all valid key combinations. This is the most accurate key validation method. Otherwise, i.e. if False or if the agency does not support SeriesKeysOnly requests, key validation is performed using codelists and content constraints, if any.
Returns:
Response instance containing the requested SDMX message.
Return type:pandasdmx.api.Response
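
For instance, the tofile/fromfile arguments can be combined to work offline (a sketch; some_agency and ‘SOME_FLOW’ are placeholders):

>>> resp = some_agency.data('SOME_FLOW', tofile='some_flow.xml')   # save the raw SDMX file
>>> resp2 = some_agency.data(fromfile='some_flow.xml')             # later, parse it without network access
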
classmethod list_agencies()[source]

Return a sorted list of valid agency IDs. These can be used to create Request instances.

classmethod load_agency_profile(source)[source]

Classmethod loading metadata on a data provider. source must be a json-formatted string or file-like object describing one or more data providers (URL of the SDMX web API, resource types, etc.). The dict Request._agencies is updated with the metadata from the source.

Returns None

prepare_key(key)[source]

Split any value of the form ‘v1+v2+v3’ into a list and return a new key dict. Values that are lists already are left unchanged.
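
A hedged illustration with placeholder dimension names; the output shows the expected shape:

>>> req = Request('ECB')
>>> req.prepare_key({'DIM1': 'A+B+C', 'DIM2': ['X']})
{'DIM1': ['A', 'B', 'C'], 'DIM2': ['X']}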

preview_data(flow_id, key=None, count=True, total=True, dsd=None)[source]

Get keys or number of series for a prospective dataset query allowing for keys with multiple values per dimension. It downloads the complete list of series keys for a dataflow rather than using constraints and DSD. This feature is, however, not supported by all data providers. ECB, IMF_SDMXCENTRAL and UNSD are known to work.

Parameters:
  • flow_id (str) – dataflow ID
  • key (dict) – optional key mapping dimension names to values or lists of values. Must have been validated before; it is not checked whether key values are actually valid dimension names and values. Default: {}
  • count (bool) – if True (default), return the number of series of the dataset designated by flow_id and key. If False, the actual keys are returned as a pandas DataFrame or dict of DataFrames, depending on the value of ‘total’.
  • total (bool) – if True (default), return the aggregate number of series or a single DataFrame (depending on the value of ‘count’). If False, return a dict mapping keys to DataFrames of series keys. E.g., if key={‘COUNTRY’: ‘IT+CA+AU’}, the dict will have 3 items describing the series keys for each country respectively. If ‘count’ is True, dict values will be int rather than PD.DataFrame.
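
A hedged sketch using the ECB’s exchange-rates dataflow mentioned elsewhere in this documentation; the dimension name ‘CURRENCY’ is an assumption:

>>> ecb = Request('ECB')
>>> ecb.preview_data('EXR')                                             # total number of series (an int)
>>> ecb.preview_data('EXR', key={'CURRENCY': 'USD+JPY'}, total=False)   # dict of per-currency results
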
series_keys(flow_id, cache=True, dsd=None)[source]

Get an empty dataset with all possible series keys.

Return a pandas DataFrame. Each column represents a dimension, each row a series key of datasets of the given dataflow.

timeout
class pandasdmx.api.ResourceGetter(resource_type)[source]

Bases: object

Descriptor to wrap Request.get for convenient calls without specifying the resource as arg.

class pandasdmx.api.Response(msg, url, headers, status_code, writer=None)[source]

Bases: object

Container class for SDMX messages.

It is instantiated by Request.get().

msg

a pythonic representation of the SDMX message

Type:pandasdmx.model.Message
status_code

the status code from the http response, if any

Type:int
url

the URL, if any, that was sent to the SDMX server

Type:str
headers

http response headers returned by requests

Type:dict
write(source=None, **kwargs)[source]

Wrapper to call the writer’s write method if present.

Parameters:source (pandasdmx.model.Message, iterable) – stuff to be written. If a pandasdmx.model.Message is given, the writer itself must determine what to write unless specified in the keyword arguments. If an iterable is given, the writer should write each item. Keyword arguments may specify what to do with the output depending on the writer’s API. Defaults to self.msg.
Returns:anything the writer returns.
Return type:type
write_source(filename)[source]

Write the XML source to a file by calling the ‘write’ method of the lxml root element. Useful to save the XML source file for offline use. Similar to passing the tofile arg to Request.get().

Parameters:filename (str) – name/path of target file
Returns:whatever the lxml serializer returns.
exception pandasdmx.api.SDMXException[source]

Bases: Exception

pandasdmx.model module

This module is part of the pandaSDMX package

SDMX 2.1 information model
© 2014 Dr. Leo (fhaxbox66@gmail.com)
class pandasdmx.model.AnnotableArtefact(reader, elem, **kwargs)[source]

Bases: pandasdmx.model.SDMXObject

annotations
class pandasdmx.model.Annotation(reader, elem, **kwargs)[source]

Bases: pandasdmx.model.SDMXObject

annotationtype
id
text
title
url
class pandasdmx.model.AttributeDescriptor(*args, **kwargs)[source]

Bases: pandasdmx.model.ComponentList

class pandasdmx.model.Categorisation(*args, **kwargs)[source]

Bases: pandasdmx.model.MaintainableArtefact

class pandasdmx.model.Categorisations(*args, **kwargs)[source]

Bases: pandasdmx.model.SDMXObject, pandasdmx.utils.DictLike

class pandasdmx.model.Category(*args, **kwargs)[source]

Bases: pandasdmx.model.Item

class pandasdmx.model.CategoryScheme(*args, **kwargs)[source]

Bases: pandasdmx.model.ItemScheme

class pandasdmx.model.Code(*args, **kwargs)[source]

Bases: pandasdmx.model.Item

class pandasdmx.model.Codelist(*args, **kwargs)[source]

Bases: pandasdmx.model.ItemScheme

class pandasdmx.model.CodelistHandler(*args, **kwargs)[source]

Bases: pandasdmx.model.KeyValidatorMixin

High-level API implementing the application of content constraints to codelists. It is primarily used as a mixin to StructureMessage instances containing codelists, a DSD, dataflow and related constraints. However, it may also be used stand-alone. It computes the constrained codelists in collaboration with the Constrainable, ContentConstraint and CubeRegion classes.

class pandasdmx.model.Component(*args, **kwargs)[source]

Bases: pandasdmx.model.IdentifiableArtefact

concept
concept_identity
local_repr
class pandasdmx.model.ComponentList(*args, **kwargs)[source]

Bases: pandasdmx.model.IdentifiableArtefact, pandasdmx.model.Scheme

class pandasdmx.model.Concept(*args, **kwargs)[source]

Bases: pandasdmx.model.Item

class pandasdmx.model.ConceptScheme(*args, **kwargs)[source]

Bases: pandasdmx.model.ItemScheme

class pandasdmx.model.Constrainable[source]

Bases: object

apply(dim_codes=None, attr_codes=None)[source]

Compute the constrained code lists as frozensets by merging the constraints resulting from all ContentConstraint instances into a dict of sets of valid codes for dimensions and attributes respectively. Each codelist is constrained by at most one Constraint so that no set operations are required.

Return tuple of constrained_dimensions(dict), constrained_attribute_codes(dict)

constrained_by
class pandasdmx.model.Constraint(*args, **kwargs)[source]

Bases: pandasdmx.model.MaintainableArtefact

class pandasdmx.model.ContentConstraint(*args, **kwargs)[source]

Bases: pandasdmx.model.Constraint

apply(dim_codes=None, attr_codes=None)[source]

Compute the constrained code lists as frozensets by merging the constraints resulting from all cube regions into a dict of sets of valid codes for dimensions and attributes respectively. We assume that each codelist is constrained by at most one cube region so that no set operations are required.

Return tuple of constrained_dimension_codes(dict), constrained_attribute_codes(dict)

class pandasdmx.model.CubeRegion(*args, **kwargs)[source]

Bases: pandasdmx.model.SDMXObject

apply(dim_codes=None, attr_codes=None)[source]

Compute the code lists constrained by the cube region as frozensets.

Parameters:
  • dim_codes (dict) – maps dim IDs to the referenced codelist represented by a frozenset. The set may or may not be constrained by a higher-level ContentConstraint. See the Technical Guideline (Part 6 of the SDMX Standard). Default is None (disregard dimensions)
  • attr_codes (dict) – same as above, but for attributes as specified by a DSD.

Return tuple of constrained_dimensions(dict), constrained_attribute_codes(dict)

class pandasdmx.model.DataAttribute(*args, **kwargs)[source]

Bases: pandasdmx.model.Component

related_to
usage_status
class pandasdmx.model.DataMessage(*args, **kwargs)[source]

Bases: pandasdmx.model.KeyValidatorMixin, pandasdmx.model.Message

class pandasdmx.model.DataSet(*args, **kwargs)[source]

Bases: pandasdmx.model.SDMXObject

attrib
dim_at_obs
groups
iter_groups
obs(with_values=True, with_attributes=True)[source]

return an iterator over observations in a flat dataset. An observation is represented as a namedtuple with 3 fields (‘key’, ‘value’, ‘attrib’).

obs.key is a namedtuple of dimensions. Its field names represent dimension names, its values the dimension values.

obs.value is a string that can in most cases be interpreted as float64. obs.attrib is a namedtuple of attribute names and values.

with_values and with_attributes: If one or both of these flags is False, the respective value will be None. Use these flags to increase performance. The flags default to True.

series

return an iterator over Series instances in this DataSet. Note that DataSets in flat format, i.e. header.dim_at_obs = “AllDimensions”, have no series. Use DataSet.obs() instead.
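
A hedged sketch of iterating over a data set, assuming data_msg is a pandasdmx.model.DataMessage in generic (non-flat) format; the ‘data’ attribute holding the DataSet is an assumption:

>>> ds = data_msg.data
>>> for series in ds.series:
...     for obs in series.obs(with_attributes=False):
...         print(series.key, obs.value)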

class pandasdmx.model.DataStructureDefinition(*args, **kwargs)[source]

Bases: pandasdmx.model.Constrainable, pandasdmx.model.MaintainableArtefact

class pandasdmx.model.DataflowDefinition(*args, **kwargs)[source]

Bases: pandasdmx.model.Constrainable, pandasdmx.model.StructureUsage

class pandasdmx.model.Dimension(*args, **kwargs)[source]

Bases: pandasdmx.model.Component

class pandasdmx.model.DimensionDescriptor(*args, **kwargs)[source]

Bases: pandasdmx.model.ComponentList

class pandasdmx.model.Facet(facet_type=None, facet_value_type='', itemscheme_facet='', *args, **kwargs)[source]

Bases: object

facet_type = {}
facet_value_type = ('String', 'Big Integer', 'Integer', 'Long', 'Short', 'Double', 'Boolean', 'URI', 'DateTime', 'Time', 'GregorianYear', 'GregorianMonth', 'GregorianDate', 'Day', 'MonthDay', 'Duration')
itemscheme_facet = ''
class pandasdmx.model.Footer(reader, elem, **kwargs)[source]

Bases: pandasdmx.model.SDMXObject

code
severity
text
class pandasdmx.model.Group(*args, **kwargs)[source]

Bases: pandasdmx.model.SDMXObject

class pandasdmx.model.Header(*args, **kwargs)[source]

Bases: pandasdmx.model.SDMXObject

error
id
prepared
receiver
sender
class pandasdmx.model.IdentifiableArtefact(*args, **kwargs)[source]

Bases: pandasdmx.model.AnnotableArtefact

uri
class pandasdmx.model.Item(*args, **kwargs)[source]

Bases: pandasdmx.model.NameableArtefact

children
class pandasdmx.model.ItemScheme(*args, **kwargs)[source]

Bases: pandasdmx.model.MaintainableArtefact, pandasdmx.model.Scheme

is_partial
class pandasdmx.model.KeyValidatorMixin[source]

Bases: object

Mix-in class with methods for key validation. Relies on properties computing code sets, constrained codes etc. Subclasses are DataMessage and CodelistHandler which is, in turn, inherited by StructureMessage.

class pandasdmx.model.KeyValue(*args, **kwargs)[source]

Bases: pandasdmx.model.SDMXObject

values
class pandasdmx.model.MaintainableArtefact(*args, **kwargs)[source]

Bases: pandasdmx.model.VersionableArtefact

is_external_ref
is_final
maintainer
service_url
structure_url
class pandasdmx.model.MeasureDescriptor(*args, **kwargs)[source]

Bases: pandasdmx.model.ComponentList

class pandasdmx.model.MeasureDimension(*args, **kwargs)[source]

Bases: pandasdmx.model.Dimension

class pandasdmx.model.Message(*args, **kwargs)[source]

Bases: pandasdmx.model.SDMXObject

class pandasdmx.model.NameableArtefact(*args, **kwargs)[source]

Bases: pandasdmx.model.IdentifiableArtefact

description
name
class pandasdmx.model.PrimaryMeasure(*args, **kwargs)[source]

Bases: pandasdmx.model.Component

class pandasdmx.model.ProvisionAgreement(*args, **kwargs)[source]

Bases: pandasdmx.model.Constrainable, pandasdmx.model.MaintainableArtefact

class pandasdmx.model.Ref(reader, elem, **kwargs)[source]

Bases: pandasdmx.model.SDMXObject

agency_id
id
maintainable_parent_id
package
ref_class
version
class pandasdmx.model.ReportingYearStartDay(*args, **kwargs)[source]

Bases: pandasdmx.model.DataAttribute

class pandasdmx.model.Representation(*args, **kwargs)[source]

Bases: pandasdmx.model.SDMXObject

class pandasdmx.model.SDMXObject(reader, elem, **kwargs)[source]

Bases: object

class pandasdmx.model.Scheme(*args, **kwargs)[source]

Bases: pandasdmx.utils.DictLike

aslist()[source]

return values() as unordered list

class pandasdmx.model.Series(*args, **kwargs)[source]

Bases: pandasdmx.model.SDMXObject

group_attrib

return a namedtuple containing all attributes attached to the groups of which the given series is a member

obs(with_values=True, with_attributes=True, reverse_obs=False)[source]

return an iterator over observations in a series. An observation is represented as a namedtuple with 3 fields (‘key’, ‘value’, ‘attrib’). obs.key is a namedtuple of dimensions, obs.value is a string value and obs.attrib is a namedtuple of attributes. If with_values or with_attributes is False, the respective value is None. Use these flags to increase performance. The flags default to True.

class pandasdmx.model.StructureMessage(*args, **kwargs)[source]

Bases: pandasdmx.model.CodelistHandler, pandasdmx.model.Message

class pandasdmx.model.StructureUsage(*args, **kwargs)[source]

Bases: pandasdmx.model.MaintainableArtefact

structure
class pandasdmx.model.TimeDimension(*args, **kwargs)[source]

Bases: pandasdmx.model.Dimension

class pandasdmx.model.VersionableArtefact(*args, **kwargs)[source]

Bases: pandasdmx.model.NameableArtefact

valid_from
valid_to
version
pandasdmx.remote module

This module is part of pandaSDMX. It contains classes for HTTP access.

class pandasdmx.remote.REST(cache, http_cfg)[source]

Bases: object

Query SDMX resources via REST or from a file

The constructor accepts arbitrary keyword arguments that will be passed to the requests.get function on each call. This makes the REST class somewhat similar to a requests.Session. E.g., proxies or authorisation data need only be provided once. The keyword arguments are stored in self.config. Modify this dict to issue the next ‘get’ request with changed arguments.
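
A hedged example of providing such keyword arguments through pandasdmx.api.Request, whose **http_cfg is presumably forwarded to this class; proxies and timeout are standard requests options:

>>> from pandasdmx import Request
>>> ecb = Request('ECB', proxies={'https': 'https://proxy.example.org:8080'}, timeout=30.5)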

get(url, fromfile=None, params={}, headers={})[source]

Get SDMX message from REST service or local file

Parameters:
  • url (str) – URL of the REST service without the query part. If None, fromfile must be set. Default is None
  • params (dict) – will be appended as query part to the URL after a ‘?’
  • fromfile (str) – path to SDMX file containing an SDMX message. It will be passed on to the reader for parsing.
  • headers (dict) – http headers. Overwrite instance-wide headers. Default is {}
Returns:

three objects:

  1. file-like object containing the SDMX message
  2. the complete URL, if any, including the query part constructed from params
  3. the status code

Return type:

tuple

Raises:

HTTPError – if the SDMX service responded with status code 401. Otherwise, the status code is returned.

max_size = 16777216

upper bound for the in-memory temp file. Larger files will be spooled to disk.

request(url, params={}, headers={})[source]

Retrieve SDMX messages. If needed, override in subclasses to support other data providers.

Parameters:url (str) – The URL of the message.
Returns:the xml data as file-like object
pandasdmx.remote.is_url(s)[source]

return True if s (str) is a valid URL, False otherwise.

Module contents

pandaSDMX - a Python package for SDMX - Statistical Data and Metadata eXchange

class pandasdmx.Request(agency='', cache=None, log_level=None, **http_cfg)[source]

Bases: object

Re-exported from pandasdmx.api at package level; see pandasdmx.api.Request above for the full documentation of its attributes and methods.

Contributing

Contributions such as bug reports or pull requests and any other user feedback are much appreciated. Development takes place on GitHub. There is also a low-traffic mailing list.

License

Notwithstanding other licenses applicable to any third-party software included in this package, pandaSDMX is licensed under the Apache 2.0 license, a copy of which is included in the source distribution.

Copyright 2014, 2015 Dr. Leo <fhaxbox66@gmail.com>, All Rights Reserved.
