3.3. Metadata interoperability

3.3.1. Why metadata is essential for data stations

Metadata plays a crucial role in a data station. Without good metadata, researchers and systems cannot tell what data is available, what its quality is, or under what conditions it may be used. Metadata makes it possible to:

Find datasets: A researcher can search a catalogue for datasets on heart failure in a specific age group.
Understand datasets: The description explains which variables are present, how the data was collected, and what limitations apply.
Process datasets automatically: Systems can assess on the basis of standardised metadata whether a dataset is suitable for a particular analysis.

Interoperability between data stations requires that metadata is described in a standardised way. International standards and profiles have been developed for this purpose.

DCAT, DCAT-AP, HealthDCAT-AP and the Health-RI metadata schema

DCAT (Data Catalog Vocabulary) is a W3C standard originally designed for describing datasets in data catalogues. It provides a common vocabulary with which organisations can describe their datasets, regardless of the domain. DCAT version 3 (2024) introduces, among other things, the DatasetSeries class for describing related datasets.

DCAT-AP is a European application profile of DCAT, developed by the European Commission. It refines DCAT with additional mandatory and recommended fields relevant to European government data, such as access rights and applicable legislation. DCAT-AP forms the basis for many national data catalogues.

HealthDCAT-AP is an extension of DCAT-AP specifically for the healthcare sector. It adds metadata elements that are essential for health data, such as:

Health categories (EHR, images, genomic data)
Population characteristics (age range, number of unique individuals)
Coding systems (SNOMED CT, ICD-10, LOINC)
Retention periods and storage deadlines
The responsible Health Data Access Body (HDAB)

The Health-RI metadata schema is a Dutch profile based on DCAT-AP and HealthDCAT-AP, with specific adaptations for the Health-RI Data Catalogue. This schema defines which metadata elements are mandatory, recommended or optional for Dutch health data.

3.3.2. DCAT classes in a data station

In DCAT, various classes are used to describe the structure of a data catalogue. For data stations, the following classes are most relevant.

The catalogue: dcat:Catalog

A dcat:Catalog represents the data station itself as a collection of datasets and data services. The catalogue acts as the entry point for users and systems wishing to know what data is available. The catalogue not only describes which datasets exist, but also via which services they are accessible. The following properties are essential for a catalogue:

Property	Description	Example
`dct:title`	Name of the catalogue	"Data Catalogue Hospital X"
`dct:description`	Description of the content	"Catalogue with clinical datasets for research"
`dct:publisher`	Organisation publishing the catalogue	Hospital X
`dcat:dataset`	References to datasets	Links to available datasets
`dcat:service`	References to data services	Links to API endpoints

In practice, one organisation may have multiple catalogues. For example, a hospital may have a catalogue for imaging data (MRI, CT, PET) and a separate catalogue for clinical data from the EHR.
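As a minimal sketch, the catalogue properties from the table above could be written in Turtle as follows. All IRIs under example.org are hypothetical placeholders, and modelling the publisher as a foaf:Agent is an assumption in line with common DCAT practice.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# Hypothetical identifiers; replace with the data station's own IRIs.
<https://example.org/catalog/clinical> a dcat:Catalog ;
    dct:title "Data Catalogue Hospital X"@en ;
    dct:description "Catalogue with clinical datasets for research"@en ;
    dct:publisher <https://example.org/org/hospital-x> ;
    # One dcat:dataset / dcat:service statement per listed resource
    dcat:dataset <https://example.org/dataset/covid19-admissions> ;
    dcat:service <https://example.org/service/fhir-api> .

<https://example.org/org/hospital-x> a foaf:Agent ;
    foaf:name "Hospital X" .
```

A second catalogue, for example for imaging data, would be described the same way under its own IRI.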

The dataset: dcat:Dataset

A dcat:Dataset represents a logical collection of data described and made available as a unit. The granularity of a dataset is not technically prescribed, but is determined by the organisation of the data holder and the way in which researchers want to be able to find and request the data. Datasets can, for example, be organised:

By disease: "COVID-19 admission data 2020-2024"
By domain: "Cardiology patient data"
By research project: "IMPROVE study diabetes cohort"

For datasets in healthcare, the following metadata elements are particularly relevant:

Category	HealthDCAT-AP elements	Purpose
Population	`numberOfUniqueIndividuals`, `numberOfRecords`, `populationCoverage`, `minTypicalAge`, `maxTypicalAge`	Insight into size and composition
Terminology	`hasCodingSystem`	Which coding systems are used (SNOMED, ICD-10, LOINC)
Legal	`accessRights`, `applicableLegislation`	Access conditions and legislation (EHDS, GDPR)
Quality	`hasQualityAnnotation`	Reference to quality reporting
Privacy	`retentionPeriod`, `hdab`	Retention period and responsible authority
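The population, terminology and privacy rows of the table could be expressed roughly as in the sketch below. The healthdcatap namespace IRI, the datatypes, all figures and all example.org IRIs are assumptions for illustration and should be checked against the HealthDCAT-AP release in use.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
# Assumed namespace IRI for HealthDCAT-AP; verify against the published profile.
@prefix healthdcatap: <http://healthdataportal.eu/ns/health#> .

<https://example.org/dataset/covid19-admissions> a dcat:Dataset ;
    dct:title "COVID-19 admission data 2020-2024"@en ;
    # Population: invented figures, for illustration only
    healthdcatap:numberOfUniqueIndividuals "12500"^^xsd:nonNegativeInteger ;
    healthdcatap:numberOfRecords "48000"^^xsd:nonNegativeInteger ;
    healthdcatap:minTypicalAge "18"^^xsd:nonNegativeInteger ;
    healthdcatap:maxTypicalAge "95"^^xsd:nonNegativeInteger ;
    healthdcatap:populationCoverage "Adult patients admitted with COVID-19"@en ;
    # Terminology: placeholder IRIs standing in for the coding systems
    healthdcatap:hasCodingSystem <https://example.org/codesystem/snomed-ct> ,
                                 <https://example.org/codesystem/icd-10> ;
    # Privacy: retention period and the responsible HDAB (hypothetical IRI)
    healthdcatap:retentionPeriod [
        a dct:PeriodOfTime ;
        dcat:startDate "2020-01-01"^^xsd:date ;
        dcat:endDate "2035-12-31"^^xsd:date
    ] ;
    healthdcatap:hdab <https://example.org/hdab/example> .
```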

The dataset series: dcat:DatasetSeries

The dcat:DatasetSeries class is intended for describing collections of related datasets that are published separately but are logically connected. This is particularly suitable for:

Periodic cohorts: Annually published extractions from the same patient population
Versions of research datasets: Successive versions of a cohort after corrections or extensions
Longitudinal data: Measurement moments in a follow-up study

The series defines common metadata for the data (such as access rights and coding systems), while individual datasets retain their specific characteristics (such as the number of records and the temporal coverage).
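The split between series-level and dataset-level metadata might look as follows. This is a sketch under assumptions: the IRIs, titles and record counts are hypothetical, and the healthdcatap namespace IRI should be verified against the published profile.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix healthdcatap: <http://healthdataportal.eu/ns/health#> .

# Common metadata is stated once, on the series.
<https://example.org/series/cohort-annual> a dcat:DatasetSeries ;
    dct:title "Annual cohort extractions"@en ;
    dct:accessRights <http://publications.europa.eu/resource/authority/access-right/NON_PUBLIC> ;
    healthdcatap:hasCodingSystem <https://example.org/codesystem/snomed-ct> .

# Each member keeps its own record count and temporal coverage.
<https://example.org/dataset/cohort-2023> a dcat:Dataset ;
    dct:title "Cohort extraction 2023"@en ;
    dcat:inSeries <https://example.org/series/cohort-annual> ;
    healthdcatap:numberOfRecords "41000"^^xsd:nonNegativeInteger ;
    dct:temporal [
        a dct:PeriodOfTime ;
        dcat:startDate "2023-01-01"^^xsd:date ;
        dcat:endDate "2023-12-31"^^xsd:date
    ] .
```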

The distribution: dcat:Distribution

A dcat:Distribution describes a concrete realisation of a dataset: how the data can actually be obtained or accessed. One dataset can have multiple distributions:

Distribution type	Description	Typical use
Bulk export	One-time download of the complete dataset	Research in a central SPE
API access	Programmatic access via a service	Federated analysis
Synthetic sample	Representative test data	Development and validation of algorithms
View/subset	Predefined selection	Specific research purpose

The distribution specifies technical details such as the file format (dcat:mediaType), the access URL (dcat:accessURL), and optionally a reference to the underlying data service.
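A dataset with a bulk export and an API distribution could be sketched as follows. The IRIs and titles are hypothetical; dcat:accessService is the DCAT property that links a distribution to its underlying data service.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

<https://example.org/dataset/covid19-admissions> a dcat:Dataset ;
    dcat:distribution <https://example.org/distribution/covid19-bulk-csv> ,
                      <https://example.org/distribution/covid19-fhir> .

# Bulk export: a downloadable file with an IANA media type
<https://example.org/distribution/covid19-bulk-csv> a dcat:Distribution ;
    dct:title "Bulk export (CSV)"@en ;
    dcat:accessURL <https://example.org/download/covid19-admissions.csv> ;
    dcat:mediaType <https://www.iana.org/assignments/media-types/text/csv> .

# API access: the distribution points at the data service behind it
<https://example.org/distribution/covid19-fhir> a dcat:Distribution ;
    dct:title "API access (FHIR)"@en ;
    dcat:accessURL <https://fhir.example.org/> ;
    dcat:accessService <https://example.org/service/fhir-api> .
```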

The data service: dcat:DataService

A dcat:DataService describes an operational service that provides access to datasets. Where a distribution represents a static snapshot or downloadable file, a data service provides dynamic access to the underlying data. The service can execute real-time queries, retrieve current data, or even perform computations without the data leaving the data station. Different types of data services are possible, for example:

Data service type	Example	Description
FHIR API	`https://fhir.example.org/`	REST API in accordance with HL7 FHIR standard
OMOP query service	SQL interface for OMOP-CDM	Standardised queries on OMOP database
DICOM WADO-RS	Web access for medical images	Retrieval of images via web protocol
Federated endpoint	KIK-V or PLUGIN node	Receipt and execution of algorithms

For automatic processing by systems, it is essential that the data service makes clear how it can be accessed. This is done via:

dcat:endpointURL: The base address of the service
dcat:endpointDescription: Link to the technical documentation (e.g. OpenAPI specification or FHIR CapabilityStatement)
dct:conformsTo: The standard to which the service conforms

3.3.3. Static versus dynamic data sources

An important distinction when modelling data sources is the difference between static and dynamic data sources. This distinction has direct consequences for how metadata is managed and how version control is applied.

Static data sources: cohorts and extractions

Static data sources are datasets that were defined and extracted at a specific point in time. The data does not change after extraction, which ensures reproducibility of research results. This type of data source is common in research projects where a fixed population and observation period are defined. Typical examples of static data sources are:

Research cohorts: A predefined group of patients for a specific study
Periodic snapshots: Annual or quarterly extractions from a source system
Anonymised datasets: One-time generated datasets for open research

For static data sources, explicit version control is essential:

Property	Use	Example
`dcat:version`	Version number of the dataset	"2024.1"
`dct:issued`	Publication date	"2024-03-15"
`dct:modified`	Date of last modification	"2024-03-20"
`dcat:inSeries`	Link to a series	Link to the overarching DatasetSeries
`dcat:prev`	Reference to the previous dataset in the series (for a version chain outside a series, DCAT 3 provides `dcat:previousVersion`)	Link to previous dataset

The DatasetSeries class is particularly useful for static data sources when:

New versions of the same type of data are published periodically
Researchers need to be able to select specific versions
There is common metadata that can be defined at series level
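Combining the version control properties with a series, one release of a static dataset might be described as in this sketch. The IRIs are hypothetical; the version number and dates are the example values from the table above.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

<https://example.org/dataset/cohort-2024-1> a dcat:Dataset ;
    dct:title "Cohort extraction 2024.1"@en ;
    dcat:version "2024.1" ;
    dct:issued "2024-03-15"^^xsd:date ;
    dct:modified "2024-03-20"^^xsd:date ;
    # Membership of the overarching series
    dcat:inSeries <https://example.org/series/cohort-annual> ;
    # Previous dataset in the same series
    dcat:prev <https://example.org/dataset/cohort-2023-1> .
```

Because each release has its own IRI, a researcher can cite exactly the version that was analysed.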

Dynamic data sources: active databases

Dynamic data sources are databases that are continuously updated, such as EHR systems, OMOP-CDM databases or XNAT environments. Creating a new version for every data change is not meaningful for such sources.

For dynamic data sources, we recommend the following approach:

Aspect	Recommendation
The database as a whole	Describe as `dcat:Dataset` with `dct:accrualPeriodicity` to indicate the update frequency
Version control	Use `dcat:version` only for structural changes (e.g. upgrade from OMOP CDM v5.3 to v5.4)
Temporal coverage	Use `dct:temporal` with only a start date (no end date) to indicate that the data is actively growing
Statistics	Periodically update population statistics (`numberOfRecords`, `numberOfUniqueIndividuals`) with the corresponding `dct:modified`
Schema documentation	Describe via `dcat:Distribution` with reference to schema documentation

Practical example: versioning strategy

A hospital has an OMOP-CDM database that is updated daily with new patient data. The metadata is managed as follows:

Schema version: dcat:version = "OMOP CDM v5.4" – changes only when the data model is upgraded
Update frequency: dct:accrualPeriodicity = DAILY – indicates that data is added daily
Statistics: numberOfRecords and numberOfUniqueIndividuals are updated monthly, with dct:modified on the update date
Temporal coverage: dct:temporal with start date 2018-01-01 and no end date
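The four points above could be captured in a single dataset description, sketched below. The dataset IRI, the statistics and the dct:modified date are invented, and the healthdcatap namespace IRI is an assumption. The sketch uses dct:accrualPeriodicity because the property is defined in DCMI Terms rather than in the dcat namespace, with the DAILY value taken from the EU frequency vocabulary.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix healthdcatap: <http://healthdataportal.eu/ns/health#> .

<https://example.org/dataset/omop-cdm> a dcat:Dataset ;
    dct:title "OMOP-CDM database Hospital X"@en ;
    # Changes only when the data model is upgraded
    dcat:version "OMOP CDM v5.4" ;
    # Data is added daily
    dct:accrualPeriodicity <http://publications.europa.eu/resource/authority/frequency/DAILY> ;
    # Open-ended period: start date only, no end date
    dct:temporal [
        a dct:PeriodOfTime ;
        dcat:startDate "2018-01-01"^^xsd:date
    ] ;
    # Statistics refreshed monthly (invented figures), with the refresh date
    healthdcatap:numberOfRecords "2400000"^^xsd:nonNegativeInteger ;
    healthdcatap:numberOfUniqueIndividuals "310000"^^xsd:nonNegativeInteger ;
    dct:modified "2024-06-01"^^xsd:date .
```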

The Research Data Alliance has published data versioning use cases and versioning recommendations that can be taken into account in the final choice and implementation of the versioning strategy.

3.3.4. Metadata for primary versus secondary use

When describing metadata, it is important to distinguish between primary and secondary use of data. This distinction influences which metadata elements receive emphasis.

Primary use: focus on the service

For primary use (direct care delivery), the source systems remain the authoritative source. In DCAT, the focus is primarily on describing what is available and how it can be consulted. Since a dataset is not created for every individual patient, the emphasis is on describing the source system and the associated data service.

The dcat:DataService plays a central role here. It is important that the service is well described, so that a requesting system (for example another EHR or a care application) can correctly request and use the data.

Secondary use: focus on content and context

For secondary use (research, policy, innovation), the emphasis shifts to the content and quality of the data, as well as the context in which the data was collected. HealthDCAT-AP enables explicit modelling of these aspects:

Aspect	HealthDCAT-AP elements
Access rights	`dct:accessRights` with the value `NON_PUBLIC`, `RESTRICTED` or `PUBLIC`
Legislation	References to EHDS, GDPR, national laws
Purpose binding	`dpv:hasPurpose` (e.g. academic research, public health)
Legal basis	`dpv:hasLegalBasis` (e.g. public interest, consent)
Quality	Reference to quality reports via `dqv:hasQualityAnnotation`
HDAB	`healthdcatap:hdab` – the responsible Health Data Access Body
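A secondary-use description combining these aspects might look like the sketch below. It rests on several assumptions: the example.org IRIs are hypothetical, the healthdcatap namespace IRI should be verified, the DPV concepts dpv:AcademicResearch and dpv:Consent should be checked against the DPV version in use, and the quality annotation follows the Web Annotation pattern that DQV builds on.

```turtle
@prefix dcat:   <http://www.w3.org/ns/dcat#> .
@prefix dct:    <http://purl.org/dc/terms/> .
@prefix dcatap: <http://data.europa.eu/r5r/> .
@prefix dpv:    <https://w3id.org/dpv#> .
@prefix dqv:    <http://www.w3.org/ns/dqv#> .
@prefix oa:     <http://www.w3.org/ns/oa#> .
@prefix healthdcatap: <http://healthdataportal.eu/ns/health#> .

<https://example.org/dataset/covid19-admissions> a dcat:Dataset ;
    # Access rights from the EU access-right vocabulary
    dct:accessRights <http://publications.europa.eu/resource/authority/access-right/RESTRICTED> ;
    # Legislation: ELI identifier of the GDPR
    dcatap:applicableLegislation <http://data.europa.eu/eli/reg/2016/679/oj> ;
    # Purpose binding and legal basis (DPV concepts, to be verified)
    dpv:hasPurpose dpv:AcademicResearch ;
    dpv:hasLegalBasis dpv:Consent ;
    # Responsible Health Data Access Body (hypothetical IRI)
    healthdcatap:hdab <https://example.org/hdab/example> ;
    dqv:hasQualityAnnotation <https://example.org/quality/covid19-annotation> .

# The annotation links the dataset to a quality report (hypothetical document)
<https://example.org/quality/covid19-annotation> a dqv:QualityAnnotation ;
    oa:hasTarget <https://example.org/dataset/covid19-admissions> ;
    oa:hasBody <https://example.org/reports/covid19-data-quality> .
```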

Endpoint descriptions for automatic processing

When a catalogue browser or federated query engine wants to determine the capabilities of a data service automatically, it needs to know what type of specification the dcat:endpointDescription points to. A FHIR CapabilityStatement, for example, has a different structure from an OpenAPI specification, and systems need to know how to interpret each. When the type of specification is made explicit with dct:conformsTo, a system can select the right parser and discover the available capabilities.

Combining endpointDescription with conformsTo

Always combine dcat:endpointDescription with dct:conformsTo to make the type of specification explicit:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

<https://example.org/service/fhir-api> a dcat:DataService ;
    dcat:endpointURL <https://fhir.example.org/> ;
    dcat:endpointDescription <https://fhir.example.org/metadata> ;
    # Explicitly indicate that this is a FHIR CapabilityStatement
    dct:conformsTo <http://hl7.org/fhir/StructureDefinition/CapabilityStatement> .

<https://example.org/service/rest-api> a dcat:DataService ;
    dcat:endpointURL <https://api.example.org/> ;
    dcat:endpointDescription <https://api.example.org/openapi.json> ;
    # Explicitly indicate that this is an OpenAPI specification
    dct:conformsTo <https://spec.openapis.org/oas/v3.1.0> .

3.3.5. Examples of metadata for a data station

The following examples illustrate how different types of data sources in a data station can be described with DCAT and HealthDCAT-AP. Each example demonstrates specific aspects of the standard.

Example 1: FHIR system (dynamic data source)

A FHIR system in a hospital can be exposed in various ways. In this example, a COVID-19 dataset is described that grows dynamically as new admissions are registered.