@khaeru
Created September 12, 2024 08:11
{
"cells": [
{
"cell_type": "markdown",
"id": "8c07c69c-2d75-4b39-a724-b55f09c6548f",
"metadata": {},
"source": [
"# Access and check SDMX metadata\n",
"\n",
"This example uses two files from the [sdmx-test-data](https://github.com/khaeru/sdmx-test-data/tree/main/ESTAT) repo, both in SDMX-ML (XML) format.\n",
"\n",
"- `esms-structure.xml` contains a Structure Message with the Metadata Structure Definition and other associated structural metadata.\n",
"- `esms.xml` contains a Metadata Message with a single Metadata Set containing a single Metadata Report.\n",
"\n",
"Use the [`sdmx1`](https://sdmx1.readthedocs.io) package to load the two messages; show their contents:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "23f62f3c-0209-4eae-8a64-df5b23aebf59",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<sdmx.StructureMessage>\n",
" <Header>\n",
" id: 'ESMS'\n",
" prepared: '2010-11-13T08:00:33+08:00'\n",
" sender: <Agency ESTAT>\n",
" source: \n",
" test: False\n",
" Categorisation (2): PSC.DEM.TOT DEMO_TOT\n",
" CategoryScheme (1): DATAFLOWS_SCHEME\n",
" Codelist (1): CL_COUNTRY\n",
" ConceptScheme (2): META_UPDATE ESMS_CONCEPTS\n",
" DataflowDefinition (1): DEMO_TOT\n",
" MetadataStructureDefinition (1): ESMS_SIMPLE\n",
" DataStructureDefinition (1): DEMOGRAPHY"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import sdmx\n",
"\n",
"msg_structure = sdmx.read_sdmx(\"esms-structure.xml\")\n",
"msg_metadata = sdmx.read_sdmx(\"esms.xml\")\n",
"\n",
"# msg_structure, msg_metadata # Verbose\n",
"msg_structure"
]
},
{
"cell_type": "markdown",
"id": "789c915b-a5d4-4bb9-adc7-c77500b5d70b",
"metadata": {},
"source": [
"Notice that the structure message also contains other structural artefacts that are referred to by the Metadata Structure Definition.\n",
"For example, the scheme containing all the concepts for which metadata are provided:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c73c3b33-6cc9-4a25-a1e5-070016b43718",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<ConceptScheme ESTAT:ESMS_CONCEPTS(1.0) (19 items): Eurostat SDMX Metadata Structure concepts>"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cs = msg_structure.concept_scheme[\"ESMS_CONCEPTS\"]\n",
"cs"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "1dea4580-3893-45aa-be72-cfce6126d6fb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'ADDRESS': <Concept ADDRESS: Address>,\n",
" 'ADDRESS_CITY': <Concept ADDRESS_CITY: Address City>,\n",
" 'ADDRESS_COUNTRY': <Concept ADDRESS_COUNTRY: Address Country>,\n",
" 'ADDRESS_STREET': <Concept ADDRESS_STREET: Address Street>,\n",
" 'ADDRESS_POST_CODE': <Concept ADDRESS_POST_CODE: Address Postal Code>,\n",
" 'COMMENT': <Concept COMMENT: Comment>,\n",
" 'CONTACT': <Concept CONTACT: Contact>,\n",
" 'CONTACT_EMAIL': <Concept CONTACT_EMAIL: Contact email address>,\n",
" 'CONTACT_NAME': <Concept CONTACT_NAME: Contact name>,\n",
" 'CONTACT_PHONE': <Concept CONTACT_PHONE: Contact phone number>,\n",
" 'DATA_DESCR': <Concept DATA_DESCR: Data description>,\n",
" 'META_CERTIFIED': <Concept META_CERTIFIED: Metadata last ceritfied>,\n",
" 'META_LAST_UPDATE': <Concept META_LAST_UPDATE: Metadata last update>,\n",
" 'META_POSTED': <Concept META_POSTED: Metadata last posted>,\n",
" 'META_UPDATE': <Concept META_UPDATE: Metadata Update>,\n",
" 'NEXT_DATE': <Concept NEXT_DATE: Next Date>,\n",
" 'ORGANISATION': <Concept ORGANISATION: Organisation>,\n",
" 'ORGANISATION_UNIT': <Concept ORGANISATION_UNIT: Organisation unit>,\n",
" 'STAT_PRES': <Concept STAT_PRES: Statistical presentation>}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cs.items"
]
},
{
"cell_type": "markdown",
"id": "253a3b7d-a2a4-4a35-8589-16b355dad735",
"metadata": {},
"source": [
"Retrieve the Metadata Structure Definition itself:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "685b46a6-2791-43ea-8307-57705b452e0c",
"metadata": {},
"outputs": [],
"source": [
"msd = msg_structure.metadatastructure[\"ESMS_SIMPLE\"]"
]
},
{
"cell_type": "markdown",
"id": "a5afddf5-ba4c-4321-bb31-ffd6520f697b",
"metadata": {},
"source": [
"(Note that this is an \"ESMS_**SIMPLE**\" structure. A separate \"ESMS_**FULL**\" exists, but Eurostat does not appear to publish it as SDMX 2.1, only as the much older SDMX 2.0, which fewer tools support.)\n",
"\n",
"It contains multiple Report Structures; retrieve one:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "495d0cdf-74a8-4126-9509-2bb0fc357d87",
"metadata": {},
"outputs": [],
"source": [
"rs = msd.report_structure[\"ESMS_SIMPLE_REPORT\"]"
]
},
{
"cell_type": "markdown",
"id": "195863d8-ce78-49d2-abb5-0557375cd351",
"metadata": {},
"source": [
"Display its components—the Metadata Attributes that structure any conforming report:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "5ad9c4f6-6e20-403c-b16d-314915302857",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CONTACT <Concept CONTACT: Contact>\n",
" ORGANISATION <Concept ORGANISATION: Organisation>\n",
" ORGANISATION_UNIT <Concept ORGANISATION_UNIT: Organisation unit>\n",
" NAME <Concept CONTACT_NAME: Contact name>\n",
" ADDRESS <Concept ADDRESS: Address>\n",
" STREET <Concept ADDRESS_STREET: Address Street>\n",
" CITY <Concept ADDRESS_CITY: Address City>\n",
" POSTAL_CODE <Concept ADDRESS_POST_CODE: Address Postal Code>\n",
" COUNTRY <Concept ADDRESS_COUNTRY: Address Country>\n",
" PHONE <Concept CONTACT_PHONE: Contact phone number>\n",
" EMAIL <Concept CONTACT_EMAIL: Contact email address>\n",
"META_UPDATE <Concept META_UPDATE: Metadata Update>\n",
" CERTIFIED <Concept META_CERTIFIED: Metadata last ceritfied>\n",
" POSTED <Concept META_POSTED: Metadata last posted>\n",
" NEXT <Concept NEXT_DATE: Next Date>\n",
" UPDATED <Concept META_LAST_UPDATE: Metadata last update>\n",
" NEXT <Concept NEXT_DATE: Next Date>\n",
"STAT_PRES <Concept STAT_PRES: Statistical presentation>\n",
" DATA_DESCR <Concept DATA_DESCR: Data description>\n"
]
}
],
"source": [
"def show_metadata_attribute(mda, indent=\"\"):\n",
"    # Print this attribute's ID and concept, then recurse into its children\n",
"    print(f\"{indent}{mda.id} {mda.concept_identity!r}\")\n",
"    for c in mda.child:\n",
"        show_metadata_attribute(c, indent + \" \")\n",
"\n",
"for mda in rs.components:\n",
"    show_metadata_attribute(mda)"
]
},
{
"cell_type": "markdown",
"id": "7d24f90a-2bc1-4392-9a0a-d2ff91845b9e",
"metadata": {},
"source": [
"Next, turn to the Metadata Message.\n",
"Select the single Metadata Set, and its single Metadata Report:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "68f8d234-9dcd-4958-a87f-de61d36f5f6c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MetadataReport(metadata=[OtherNonEnumeratedAttributeValue(value_for='CONTACT', parent=None, child=[OtherNonEnumeratedAttributeValue(value_for='ORGANISATION', parent=None, child=[], value='Eurostat, the statistical office of the European Union'), OtherNonEnumeratedAttributeValue(value_for='ORGANISATION_UNIT', parent=None, child=[], value='Unit F1: Population'), OtherNonEnumeratedAttributeValue(value_for='ADDRESS', parent=None, child=[OtherNonEnumeratedAttributeValue(value_for='STREET', parent=None, child=[], value='RUE ALPHONSE WEICKER 5'), OtherNonEnumeratedAttributeValue(value_for='CITY', parent=None, child=[], value='LUXEMBOURG'), OtherNonEnumeratedAttributeValue(value_for='POSTAL_CODE', parent=None, child=[], value='2721'), OtherNonEnumeratedAttributeValue(value_for='COUNTRY', parent=None, child=[], value='LU')], value=None), OtherNonEnumeratedAttributeValue(value_for='EMAIL', parent=None, child=[], value='pop_unit@ec.europa.eu')], value=None), OtherNonEnumeratedAttributeValue(value_for='META_UPDATE', parent=None, child=[OtherNonEnumeratedAttributeValue(value_for='CERTIFIED', parent=None, child=[], value='2009-12-10'), OtherNonEnumeratedAttributeValue(value_for='POSTED', parent=None, child=[], value='2010-01-13'), OtherNonEnumeratedAttributeValue(value_for='UPDATED', parent=None, child=[OtherNonEnumeratedAttributeValue(value_for='NEXT', parent=None, child=[], value='2011-01-20')], value='2010-01-13')], value=None), OtherNonEnumeratedAttributeValue(value_for='STAT_PRES', parent=None, child=[XHTMLAttributeValue(value=<Element {http://www.w3.org/1999/xhtml}div at 0x782cb3fa2840>, value_for='DATA_DESCR', parent=None, child=[])], value=None)], target=None, attaches_to=TargetObjectKey(key_values={'REPORT_PERIOD_TARGET': TargetReportPeriod(value_for='REPORT_PERIOD_TARGET', report_period='2010'), 'CATEGORY': TargetIdentifiableObject(value_for='CATEGORY', obj=<Category PSC.DEM.TOT>), 'DATA_PROVIDER': TargetIdentifiableObject(value_for='DATA_PROVIDER', obj=<DataProvider ESTAT>)}))"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mds = msg_metadata.data[0]\n",
"mdr = mds.report[0]\n",
"mdr"
]
},
{
"cell_type": "markdown",
"id": "28d33574-e323-47a1-a0a7-af6aaed23303",
"metadata": {},
"source": [
"We can iterate over the Reported Attributes in this report and show their actual values:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "7f1d8c49-f61f-472b-985e-214b036d193c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CONTACT None\n",
" ORGANISATION 'Eurostat, the statistical office of the European Union'\n",
" ORGANISATION_UNIT 'Unit F1: Population'\n",
" ADDRESS None\n",
"  STREET 'RUE ALPHONSE WEICKER 5'\n",
"  CITY 'LUXEMBOURG'\n",
"  POSTAL_CODE '2721'\n",
"  COUNTRY 'LU'\n",
" EMAIL 'pop_unit@ec.europa.eu'\n",
"META_UPDATE None\n",
" CERTIFIED '2009-12-10'\n",
" POSTED '2010-01-13'\n",
" UPDATED '2010-01-13'\n",
"  NEXT '2011-01-20'\n",
"STAT_PRES None\n",
" DATA_DESCR <Element {http://www.w3.org/1999/xhtml}div at 0x782cb3fa2840>\n"
]
}
],
"source": [
"def show_reported_attribute(ra, indent=\"\"):\n",
"    # Print the ID of the attribute being reported and its value, then recurse\n",
"    print(f\"{indent}{ra.value_for} {ra.value!r}\")\n",
"    for c in ra.child:\n",
"        show_reported_attribute(c, indent + \" \")\n",
"\n",
"for ra in mdr.metadata:\n",
"    show_reported_attribute(ra)"
]
},
{
"cell_type": "markdown",
"id": "86984a91-5b3f-4786-be50-2c8859447c78",
"metadata": {},
"source": [
"Note that the value of the DATA_DESCR attribute is XHTML, which is why it appears above as an `Element` object rather than a string.\n",
"\n",
"The attributes of the Metadata Set refer to the Metadata Structure Definition:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "2451a4b7-aa08-4588-8070-d1fe50161710",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'urn:sdmx:org.sdmx.infomodel.metadatastructure.MetadataStructureDefinition=ESTAT:ESMS(1.0)'"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mds.structured_by.urn"
]
},
{
"cell_type": "markdown",
"id": "c8e89e6a-6d89-4ed7-a1c8-a81e2c8b7e5b",
"metadata": {},
"source": [
"This **should** mean that the metadata report is, by construction, consistent with the referenced structure. If it were not, the message would be malformed (or inconsistent) SDMX.\n",
"\n",
"However, we can also ‘validate’ manually:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "d877648c-d09b-49f2-a0eb-43a4c2a31f4e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"mda.id = 'CONTACT' ra.value_for = 'CONTACT'\n",
" mda.id = 'ORGANISATION' ra.value_for = 'ORGANISATION'\n",
" mda.id = 'ORGANISATION_UNIT' ra.value_for = 'ORGANISATION_UNIT'\n",
" mda.id = 'ADDRESS' ra.value_for = 'ADDRESS'\n",
"  mda.id = 'STREET' ra.value_for = 'STREET'\n",
"  mda.id = 'CITY' ra.value_for = 'CITY'\n",
"  mda.id = 'POSTAL_CODE' ra.value_for = 'POSTAL_CODE'\n",
"  mda.id = 'COUNTRY' ra.value_for = 'COUNTRY'\n",
" mda.id = 'EMAIL' ra.value_for = 'EMAIL'\n",
"mda.id = 'META_UPDATE' ra.value_for = 'META_UPDATE'\n",
" mda.id = 'CERTIFIED' ra.value_for = 'CERTIFIED'\n",
" mda.id = 'POSTED' ra.value_for = 'POSTED'\n",
" mda.id = 'UPDATED' ra.value_for = 'UPDATED'\n",
"  mda.id = 'NEXT' ra.value_for = 'NEXT'\n",
"mda.id = 'STAT_PRES' ra.value_for = 'STAT_PRES'\n",
" mda.id = 'DATA_DESCR' ra.value_for = 'DATA_DESCR'\n"
]
}
],
"source": [
"def check_mda_ra(mda, ra, indent=\"\"):\n",
"    print(f\"{indent}{mda.id = } {ra.value_for = }\")\n",
"    for child_ra in ra.child:\n",
"        # Identify the corresponding MetadataAttribute by its ID\n",
"        child_mdas = [c for c in mda.child if c.id == child_ra.value_for]\n",
"        # Exactly one MetadataAttribute must match each reported attribute\n",
"        assert 1 == len(child_mdas)\n",
"        check_mda_ra(child_mdas[0], child_ra, indent=indent + \" \")\n",
"\n",
"# Top-level attributes are paired by position; children are matched by ID\n",
"for mda, ra in zip(rs.components, mdr.metadata):\n",
"    check_mda_ra(mda, ra)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
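
The manual check in the last notebook cell stops at the first mismatch because it uses `assert`. As a rough sketch of the same idea, the variant below collects every mismatch instead and matches by ID at every level, including the top one. It does not use the `sdmx` package: `Attr`, `Value`, and `find_problems` are hypothetical stand-ins that only mimic the `id`, `value_for`, and `child` attributes used in the notebook, and the `FAX` entry is invented test data that is deliberately absent from the structure.

```python
from dataclasses import dataclass, field


@dataclass
class Attr:
    """Hypothetical stand-in for a metadata attribute in a report structure."""

    id: str
    child: list["Attr"] = field(default_factory=list)


@dataclass
class Value:
    """Hypothetical stand-in for a reported attribute value in a report."""

    value_for: str
    value: object = None
    child: list["Value"] = field(default_factory=list)


def find_problems(attrs, values, path=""):
    """Return the paths of reported values without exactly one matching attribute."""
    problems = []
    for v in values:
        here = f"{path}/{v.value_for}"
        # Match by ID, as the notebook does for child attributes
        matches = [a for a in attrs if a.id == v.value_for]
        if len(matches) != 1:
            problems.append(here)
            continue
        # Descend into the children of the matched attribute
        problems.extend(find_problems(matches[0].child, v.child, here))
    return problems


# A small structure, loosely modelled on the attribute IDs shown above
structure = [
    Attr("CONTACT", [Attr("ORGANISATION"), Attr("EMAIL")]),
    Attr("STAT_PRES", [Attr("DATA_DESCR")]),
]

# A report with one invented value ("FAX") that the structure does not allow
report = [
    Value("CONTACT", child=[Value("ORGANISATION", "Eurostat"), Value("FAX", "n/a")]),
    Value("STAT_PRES", child=[Value("DATA_DESCR", "...")]),
]

problems = find_problems(structure, report)
print(problems)
```

An empty list means every reported value found exactly one attribute with the same ID at the same level. Like the notebook's check, this sketch does not test whether attributes required by the structure are missing from the report.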