Layers, pipes and patterns: detailing the concept of data stations as a foundational building block for federated data systems in healthcare

Abstract

We generalize the concept of Personal Health Train in terms of the European Interoperability Reference Architecture, using the concept of a data station as a foundational building block. By conceptualizing data stations as persistent data storage within the domain of the Data Holder that is tightly integrated with decentralized computing resources, we outline steps towards interoperable federated analytics networks in healthcare.

Keywords

personal health train, federated learning, data station, data systems, data engineering, lakehouse, Apache Arrow, Apache Parquet, Apache Iceberg, duckdb, polars

A shift towards federated data systems as a design paradigm

The ambition for a seamlessly connected digital healthcare ecosystem, capable of leveraging vast quantities of patient data, remains elusive. Designing and implementing health data platforms is notoriously difficult, given the heterogeneity and complexity of such systems. To address these issues, federated data systems are gaining traction as an overarching design paradigm in which data remains securely at its source and computations are performed in a decentralized, distributed fashion.

Recent technological inventions offer important new enablers to implement federated data systems, most notably:

  • Significant increase in single-node computing capabilities: it is now possible to process up to 1 TB of tabular data on a single node, which enables increasingly large volumes of data to be processed in a decentralized, federated system [1, 2];
  • Maturity of federated analytics and specifically federated learning as a means of performing analysis whilst ‘hiding’ the data from third parties, including training of deep learning models through aggregation of weights [3–5];
  • Privacy-enhancing technologies (PETs) such as secure multi-party computation (MPC) that are now sufficiently mature to be used on an industrial scale, enabling computations to be done under encryption (in-the-blind) thereby significantly improving security across a network of participants [6, 7];
  • The composable data stack as a solution design that allows for unbundling of the venerable relational database into loosely coupled components, thereby making it easier and more practical to implement federated data systems using cloud-based components with microservices and thus opening up a transition path towards more modular and robust architectures [8, 9].
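
The aggregation of weights mentioned in the second enabler can be sketched as a sample-size-weighted average, commonly known as federated averaging. This is a minimal illustration in plain Python, not the method of any cited framework; the function name `federated_average` and the example numbers are hypothetical.

```python
from typing import Sequence


def federated_average(
    local_weights: Sequence[Sequence[float]],
    sample_counts: Sequence[int],
) -> list[float]:
    """Combine locally trained weight vectors into one global vector.

    Each site contributes in proportion to its number of training samples.
    Only weights and counts leave a site; the records themselves do not.
    """
    if len(local_weights) != len(sample_counts):
        raise ValueError("one sample count is required per weight vector")
    total = sum(sample_counts)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    n_params = len(local_weights[0])
    global_weights = [0.0] * n_params
    for weights, count in zip(local_weights, sample_counts):
        if len(weights) != n_params:
            raise ValueError("all weight vectors must have the same length")
        for i, w in enumerate(weights):
            global_weights[i] += w * count / total
    return global_weights


# Hypothetical example: three hospitals with different cohort sizes.
site_weights = [[0.2, 1.0], [0.4, 2.0], [0.6, 3.0]]
site_counts = [100, 100, 200]
global_model = federated_average(site_weights, site_counts)
```

In this sketch the largest site (200 of 400 samples) determines half of each global parameter. Production systems typically combine this step with the privacy-enhancing technologies listed above.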

The architectural shift from centralized to federated data systems is not merely a technical evolution. Modern approaches to data governance are undergoing a similar paradigm shift towards federated solutions. Federated data systems are inherently more aligned with contemporary data governance frameworks, including the Data Governance Act (DGA), the European Health Data Space (EHDS) and the concept of data solidarity [10]. Within the context of large corporations, the concept of a data mesh is increasingly being adopted as well, which in essence is a federation of data producers and consumers within a commercial enterprise. From the perspective of sovereignty and solidarity, we believe that a commons-based, federated approach has distinct benefits in moving towards a more equitable, open digital infrastructure [11].

However, this ongoing paradigm shift is not without challenges. The notion of what constitutes a federated data system needs to be defined in more detail if we are to see the forest for the trees between different meanings of the same word. For example, ‘federation’ can refer to any of the following concepts:

  • Data federation in the context of distributed databases addresses the problem of uniformly accessing multiple, possibly heterogeneous data sources by mapping them into a unified schema, such as an RDF(S)/OWL ontology or a relational schema, and by supporting the execution of queries, such as SPARQL or SQL queries, over that unified schema [12];
  • Federation within the context of a Personal Health Train (PHT) refers to the concept where data processing is brought to the (personal health) data rather than the other way around, allowing access to (private) data to be controlled and ethical and legal concerns to be observed [13–17]. It is just one of many solution designs that are collectively grouped as federated analytics [3];
  • Federation in a Trusted Research Environment (TRE) pertains to a mechanism for data sharing in a temporary staging environment within a network of research organizations through federation services such as localization and access [18];
  • Federation in the context of data spaces, as described in the DSSC Blueprint 2.0, pertains to supporting the interplay of participants in a data space, operating in accordance with the policies and rules specified in the Rulebook by the data space authority.
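
The PHT meaning of federation, in which processing travels to the data, can be illustrated with a small sketch. It assumes a "train" is simply a function that a station runs on its local records and that returns only aggregates. The class `DataStation`, the function `sum_and_count_age` and the records are hypothetical and do not reflect the API of any PHT implementation.

```python
from dataclasses import dataclass, field
from typing import Callable

# A "train" is modelled here as a function that receives local records and
# returns only aggregate values (an assumption made for illustration).
Train = Callable[[list[dict]], dict]


@dataclass
class DataStation:
    """Hypothetical station: records stay inside, only train output leaves."""

    name: str
    _records: list[dict] = field(default_factory=list, repr=False)

    def run(self, train: Train) -> dict:
        return train(self._records)


def sum_and_count_age(records: list[dict]) -> dict:
    """Train payload: local sum and count of the 'age' field."""
    ages = [r["age"] for r in records]
    return {"sum": sum(ages), "count": len(ages)}


stations = [
    DataStation("hospital_a", [{"age": 60}, {"age": 70}]),
    DataStation("hospital_b", [{"age": 50}]),
]

# The coordinating party only ever sees per-station aggregates.
partials = [station.run(sum_and_count_age) for station in stations]
federated_mean_age = sum(p["sum"] for p in partials) / sum(
    p["count"] for p in partials
)
```

A real deployment would add access control and disclosure checks on what a train may return; those are deliberately omitted here.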

What, then, is a viable development path out of this creative chaos?

Towards data availability for secondary use

Inspired by previous calls to action to move towards open architectures for health data systems [19, 20] and the notion of the hourglass model [19, 21, 22], we hypothesize that the concept of a ‘data station’ can be used as a foundational building block for federated data systems. A data station should provide a set of minimal standards (at the waist of the hourglass), thereby maximizing the freedom to operate between data providers and data consumers within the context of a health data space. Note that this approach has many similarities with the FAIR Hourglass [22, 23]. The data station approach presented here focuses on enabling secondary use of routinely collected clinical data, using the architecture of the Personal Health Train as a starting point [13–17]. The objective of this paper is to extend this architecture in order to address four design questions that are relevant in ongoing efforts to implement nation-wide federated systems.

Online transactional vs. analytical processing

First, it is well known that data systems have different design and performance characteristics depending on whether they are built for online transactional processing (OLTP) or online analytical processing (OLAP), as summarized in Table 1 below (taken from [24]).

Table 1: Distinguishing characteristics between online transactional and analytical systems
Property | Transaction processing systems (OLTP) | Analytic systems (OLAP)
Main read pattern | Small number of records per query, fetched by key | Aggregate over large number of records
Main write pattern | Random-access, low-latency writes from user input | Bulk import (ETL) or event stream
Data modeling | Predefined | Defined post-hoc, either schema-on-read or schema-on-write
Primarily used by | End user/customer, via web application | Internal analyst, for decision support
What data represents | Latest state of data (current point in time) | History of events that happened over time
Dataset size | Gigabytes to terabytes | Terabytes to petabytes
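
The two read patterns in Table 1 can be contrasted with a small, self-contained sketch. The table `observation` and its values are hypothetical. SQLite is used only because it ships with Python; it is a row-oriented engine, whereas analytical engines such as DuckDB [1] are designed for the second kind of query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observation ("
    "id INTEGER PRIMARY KEY, patient_id TEXT, year INTEGER, systolic REAL)"
)
conn.executemany(
    "INSERT INTO observation (id, patient_id, year, systolic) VALUES (?, ?, ?, ?)",
    [
        (1, "p1", 2023, 120.0),
        (2, "p2", 2023, 140.0),
        (3, "p1", 2024, 130.0),
        (4, "p3", 2024, 150.0),
    ],
)

# OLTP-style read: fetch one record by its key.
point_lookup = conn.execute(
    "SELECT patient_id, systolic FROM observation WHERE id = ?", (3,)
).fetchone()

# OLAP-style read: aggregate over many records.
yearly_mean = conn.execute(
    "SELECT year, AVG(systolic) FROM observation GROUP BY year ORDER BY year"
).fetchall()

conn.close()
```

The point lookup touches a single row through the primary key, while the aggregate scans every row, which is why the two workloads favour different storage layouts.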

Current efforts to design and implement the EHDS in fact aim to support primary (i.e. OLTP) and secondary (i.e. OLAP) use in one go [25, 26]. This Herculean endeavour has spawned many initiatives to develop a coherent architecture and support implementation across Europe that ultimately should lead to interoperability in the broadest sense of the word, most notably:

  • The Data Space Blueprint v2.0 (DSB2) by the Data Spaces Support Centre ([27]) that serves as a vital guide for organizations building and participating in data spaces.
  • The Simpl Programme ([28]) that aims to develop an open source, secure middleware that supports data access and interoperability in European data initiatives. It provides multiple compatible components, free to use, that adhere to a common standard of data quality and data sharing.
  • TEHDAS2 ([29]), a joint action that prepares the ground for the harmonised implementation of the secondary use of health data in the EHDS.

We believe, however, that in order to successfully design and implement health data spaces, more detailed analysis and solution patterns are required that distinguish between primary (OLTP) and secondary (OLAP) data use. Although functional components can be shared between these two, the devil is in the details. Hence one of the objectives of this paper is to detail an open, technology-agnostic architecture for secondary use, to complement existing efforts and guide development in the field.

Centralized vs decentralized processing

A second design question pertains to the choice between centralized and distributed or decentralized platforms. This choice is not only driven by technical considerations (scalability, elasticity, fault tolerance, latency) but is also strongly dependent on organizational, legal or regulatory requirements such as data residency. The general approach of EHDS and other data spaces is federative by nature, that is, decentralized. For example, DSB2 stresses the need for interoperability and federative protocols within and across data spaces.

Upon closer inspection, however, specific functional components that are foreseen within the EHDS are best characterized as centralized (sub-)systems. As an example, consider the secure processing environments (SPEs) as defined in article 73 of the EHDS. Known examples of such SPEs include data platforms provided by national statistics offices (CBS Microdata environment), healthcare-specific national platforms (Finland’s Kapseli platform) and Trusted Research Environments (TREs) within the domain of research (see EOSC-ENTRUST for examples across Europe). Given that healthcare data is often vertically partitioned (data elements of the same subject are scattered across various data holders), SPEs provide the most effective means to (temporarily) share, integrate and analyse such data. Many SPEs are therefore best described as centralized systems, which means that data spaces constitute a hybrid architecture that includes both centralized and decentralized components. A more detailed analysis is required to arrive at scalable solution patterns that combine centralized and decentralized processing in a larger federated health data system.
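
To illustrate why vertically partitioned data tends to be brought together in one place, the following sketch links two partitions that hold different attributes for the same subjects. It assumes the data holders share a common pseudonym per subject; the holders, pseudonyms, field names and the function `link_vertical_partitions` are all hypothetical.

```python
# Two hypothetical data holders, each with different attributes for the same
# subjects, keyed by a shared pseudonym (an assumption for illustration).
gp_records = {
    "pseudo-001": {"diagnosis": "T2DM"},
    "pseudo-002": {"diagnosis": "COPD"},
}
pharmacy_records = {
    "pseudo-001": {"medication": "metformin"},
    "pseudo-003": {"medication": "salbutamol"},
}


def link_vertical_partitions(*partitions: dict[str, dict]) -> dict[str, dict]:
    """Inner-join partitions on pseudonym, merging attribute dictionaries."""
    if not partitions:
        return {}
    shared_keys = set(partitions[0])
    for part in partitions[1:]:
        shared_keys &= set(part)
    linked = {}
    for key in sorted(shared_keys):
        merged: dict = {}
        for part in partitions:
            merged.update(part[key])
        linked[key] = merged
    return linked


# Inside the SPE both partitions are (temporarily) available side by side.
linked = link_vertical_partitions(gp_records, pharmacy_records)
```

Only the subject present in both partitions survives the join. Performing the same linkage without co-locating the records requires more involved techniques, which is one reason centralized SPEs remain attractive for this case.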

Solution patterns to populate data stations

A third design issue pertains to the mechanisms through which the data stations are to be populated with data by the data holder. In essence, the industry is converging towards solution patterns from data engineering and data warehousing where the data station is realized using the lakehouse architecture (see [30] and references therein). We stress the importance of the difference between OLTP and OLAP systems, where we envisage a data station for secondary use as an OLAP system. To illustrate this difference, consider the OpenHIE specification, where the Shared Health Record is positioned as the data station for OLTP and the analytics data warehouse as a separate OLAP functional building block that is populated using FHIR data pipelines [30, 31]. Figure 1 shows an example of such a setup, in this case for the imaging domain. The Extract-Transform-Load pattern of this pipeline includes various data (pre-)processing steps, including image format alignment, de-identification and transformation to a common data model (CDM).

Figure 1: Example of data preparation pipeline in the imaging domain, as developed within the EUCAIM project (link document)
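
The processing steps of such an ETL pipeline can be expressed as chained filters. The sketch below uses Python generators for two of the steps named above, de-identification and transformation to a common model. It is an illustration only and not the EUCAIM implementation; the field names and the target schema are made up and do not correspond to a real CDM.

```python
from typing import Callable, Iterable, Iterator

Record = dict
Filter = Callable[[Iterable[Record]], Iterator[Record]]


def deidentify(records: Iterable[Record]) -> Iterator[Record]:
    """Drop direct identifiers (field names are hypothetical)."""
    for rec in records:
        yield {k: v for k, v in rec.items() if k not in {"name", "address"}}


def to_common_model(records: Iterable[Record]) -> Iterator[Record]:
    """Rename source fields to a made-up target schema, not a real CDM."""
    mapping = {"pid": "person_id", "mod": "modality", "dt": "study_date"}
    for rec in records:
        yield {mapping.get(k, k): v for k, v in rec.items()}


def pipeline(source: Iterable[Record], *filters: Filter) -> Iterator[Record]:
    """Connect filters with pipes: each filter consumes the previous output."""
    stream: Iterable[Record] = source
    for f in filters:
        stream = f(stream)
    return iter(stream)


source_rows = [
    {"pid": "p1", "name": "Jane Doe", "mod": "CT", "dt": "2024-01-05"},
    {"pid": "p2", "name": "John Roe", "mod": "MR", "dt": "2024-02-11"},
]
station_rows = list(pipeline(source_rows, deidentify, to_common_model))
```

Because each filter only depends on the record stream it receives, steps can be added, removed or reordered without touching the others.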

Federation as a system of systems

The fourth design challenge pertains to the need for having a ‘system of systems’ [32] in the federated health data system at large. In a real-world setting, secondary health data sharing will need to take into account the limited resources and expertise of smaller health care providers. In fact, the EHDS explicitly addresses this issue in article 50, which exempts micro enterprises from the obligation to directly participate in the EHDS, while each member state may opt to form so-called data intermediaries to act as a go-between (recital 59). Along a different dimension, we foresee a system of systems of various regional or domain-specific federated networks that are loosely coupled, which poses design challenges in terms of autonomy (to what extent can each sub-network make independent choices), connectivity (how to connect sub-systems) and diversity.

Design questions framed within realization of Secure Processing Environments for secondary use

By addressing these four design questions, our aim is to specify the solution designs for a federated health data system with national coverage that can support secondary data use within the context of the EHDS. To clarify this objective, consider Figure 2, which depicts the main concepts relevant to our paper in terms of an Archimate layered view. Starting from the top, the Secure Processing Environment (SPE) is taken as the key capability that needs to be realized as part of the EHDS. SPEs can be realized using different types of platforms, namely centralized, federated or hybrid. At the time of writing, we observe that there are already quite a few products available in the market, be it commercial or as an established open source project. The secure processing platforms (note that we use platform to designate the business product and SPE to denote the capability) are composed of various value creation services as defined in the DSB2 (Figure 3). These services in turn are realized by underlying data analysis processes, which are envisaged to be supported by (Data) Service Providers.

Figure 2: Layered view of Secure Processing Environments (SPEs) as capability in the European Health Data Space.
Figure 3: Taxonomy of value creation services as defined in the DSSC Blueprint 2.0. Source: dssc.eu.

Outline

With this overarching layered view in mind, we loosely follow an Action Design Research approach [33, 34] to address these design questions. Our main contributions are:

  • A harmonized ontology of a data station and data hub, that integrates the PHT architecture [15, 17] into the European Interoperability Reference Architecture [EIRA] and the DSSC Blueprint 2.0
  • Comparative analysis of existing implementations
  • Synthesis of the above into functional and technical description of a data station in Archimate, thereby focusing on three primary software patterns [35]:
    • the layers pattern for addressing various aspects of interoperability across the stack, including the Legal, Organizational, Semantic and Technical (LOST) framework as defined in EIRA and extending earlier work by [36] who have introduced five layers of interoperability specifically for PHT;
    • the hub-and-spoke event broker topology pattern, where we generalize the concept of client-server federated analytics to formulate specifications for achieving interoperability across frameworks that leverage de facto standards for container orchestration with Kubernetes;
    • the pipes and filters pattern for addressing various solution designs for the extract-transform-load (ETL) mechanisms through which data stations are populated, following the current common practice of data lake and lakehouse solution designs [37–41];
  • Proposals for new designs for specific domains, such as genomics or imaging.
  • A reference implementation of data stations that uses container-based trains as a generic infrastructure for federated learning and federated analysis.

References

1. Raasveldt M, Mühleisen H (2019) DuckDB: An Embeddable Analytical Database. In: Proceedings of the 2019 International Conference on Management of Data. ACM, Amsterdam Netherlands, pp 1981–1984
2. Nahrstedt F, Karmouche M, Bargieł K, Banijamali P, Nalini Pradeep Kumar A, Malavolta I (2024) An Empirical Study on the Energy Usage and Performance of Pandas and Polars Data Analysis Python Libraries. In: Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering. ACM, Salerno Italy, pp 58–68
3. Wang Z, Ji H, Zhu Y, Wang D, Han Z (2025) A Survey on Federated Analytics: Taxonomy, Enabling Techniques, Applications and Open Issues. IEEE Commun Surv Tutorials 1–1. https://doi.org/10.1109/COMST.2025.3558755
4. Rieke N, Hancox J, Li W, et al (2020) The future of digital health with federated learning. npj Digit Med 3(1, 1):1–7. https://doi.org/10.1038/s41746-020-00323-1
5. Teo ZL, Jin L, Liu N, et al (2024) Federated machine learning in healthcare: A systematic review on clinical applications and technical architecture. Cell Reports Medicine 5(2):101419. https://doi.org/10.1016/j.xcrm.2024.101419
6. (2023) The PET Guide. United Nations
7. (2023) From privacy to partnership. The Royal Society
8. Pedreira P, Erling O, Karanasos K, et al (2023) The Composable Data Management System Manifesto. Proc VLDB Endow 16(10):2679–2685. https://doi.org/10.14778/3603581.3603604
9. The Composable Codex. https://voltrondata.com/codex.html. Accessed 16 Oct 2024
10. Prainsack B, El-Sayed S (2023) Beyond Individual Rights: How Data Solidarity Gives People Meaningful Control over Data. The American Journal of Bioethics 23(11):36–39. https://doi.org/10.1080/15265161.2023.2256267
11. Krewer J, Warso Z (2024) Digital Commons as Providers of Public Digital Infrastructures. Open Future Foundation
12. Gu Z, Corcoglioniti F, Lanti D, et al (2022) A systematic overview of data federation systems. Semantic Web 15(1):107–165. https://doi.org/10.3233/SW-223201
13. Beyan O, Choudhury A, van Soest J, et al (2020) Distributed Analytics on Sensitive Medical Data: The Personal Health Train. Data Intelligence 2(1–2):96–107. https://doi.org/10.1162/dint_a_00032
14. Choudhury A, van Soest J, Nayak S, Dekker A (2020) Personal Health Train on FHIR: A Privacy Preserving Federated Approach for Analyzing FAIR Data in Healthcare. In: Bhattacharjee A, Borgohain SKr, Soni B, Verma G, Gao X-Z (eds) Machine Learning, Image Processing, Network Security and Data Sciences. Springer, Singapore, pp 85–95
15. Bonino da Silva Santos LO, Ferreira Pires L, Martinez V, Moreira J, Guizzardi R (2022) Personal Health Train Architecture with Dynamic Cloud Staging. SN Computer Science 4. https://doi.org/10.1007/s42979-022-01422-4
16. Zhang C, Choudhury A, Volmer L, et al (2023) Secure and Private Healthcare Analytics: A Feasibility Study of Federated Deep Learning with Personal Health Train. https://www.researchsquare.com/article/rs-3158418/v1. Accessed 24 Oct 2024
17. Choudhury A, Volmer L, Martin F, et al (2025) Advancing Privacy-Preserving Health Care Analytics and Implementation of the Personal Health Train: Federated Deep Learning Study. JMIR AI 4(1):e60847. https://doi.org/10.2196/60847
18. (2024) European network of trusted research environments (EOSC-ENTRUST) project. https://eosc-entrust.eu/. Accessed 26 Jun 2025
19. Estrin D, Sim I (2010) Health care delivery. Open mHealth architecture: An engine for health care innovation. Science 330(6005):759–760. https://doi.org/10.1126/science.1196187
20. Mehl GL, Seneviratne MG, Berg ML, et al (2023) A full-STAC remedy for global digital health transformation: Open standards, technologies, architectures and content. Oxford Open Digital Health 1:oqad018. https://doi.org/10.1093/oodh/oqad018
21. Beck M (2019) On the hourglass model. Communications of the ACM 62(7):48–57. https://doi.org/10.1145/3274770
22. Schultes E (2023) The FAIR hourglass: A framework for FAIR implementation. FC 1(1):13–17. https://doi.org/10.3233/FC-221514
23. Bonino da Silva Santos LO, Guizzardi G, Prince Sales T (2022) FAIR Digital Object Framework
24. Kleppmann M, Riccomini C (2026) Designing data-intensive applications, 2nd edition (early release). O’Reilly
25. (2025) European Health Data Space Regulation (EHDS). https://health.ec.europa.eu/ehealth-digital-health-and-care/european-health-data-space-regulation-ehds_en. Accessed 9 Jun 2025
26. Cascini F, Pantovic A, Al-Ajlouni YA, Puleo V, De Maio L, Ricciardi W (2024) Health data sharing attitudes towards primary and secondary use of data: A systematic review. eClinicalMedicine 71:102551. https://doi.org/10.1016/j.eclinm.2024.102551
27. Data Spaces Blueprint v2.0. https://dssc.eu/space/BVE2/1071251457/Data+Spaces+Blueprint+v2.0+-+Home. Accessed 9 Jun 2025
28. Simpl Programme. https://simpl-programme.ec.europa.eu/. Accessed 9 Jun 2025
29. TEHDAS2. https://tehdas.eu/. Accessed 9 Jun 2025
30. Kapitan D, Heddema F, Dekker A, Sieswerda M, Verhoeff B-J, Berg M (2025) Data Interoperability in Context: The Importance of Open-Source Implementations When Choosing Open Standards. Journal of Medical Internet Research 27(1):e66616. https://doi.org/10.2196/66616
31.
32. Gorod A, Sauser B, Boardman J (2008) Paradox: Holarchical view of system of systems engineering management. In: 2008 IEEE International Conference on System of Systems Engineering. pp 1–6
33. Sein M, Henfridsson O, Purao S, Rossi M, Lindgren R (2011) Action Design Research. MIS Quarterly 35:37–56. https://doi.org/10.2307/23043488
34. Venable J, Pries-Heje J, Baskerville R (2016) FEDS: A Framework for Evaluation in Design Science Research. Eur J Inf Syst 25(1):77–89. https://doi.org/10.1057/ejis.2014.36
35. Buschmann F, Meunier R, Rohnert H, Sommerlad P, Stal M (1996) A System of Patterns. Wiley
36. Welten S, de Arruda Botelho Herr M, Hempel L, et al (2024) A study on interoperability between two Personal Health Train infrastructures in leukodystrophy data analysis. Sci Data 11(1):663. https://doi.org/10.1038/s41597-024-03450-6
37. Harby AA, Zulkernine F (2025) Data Lakehouse: A survey and experimental study. Information Systems 127:102460. https://doi.org/10.1016/j.is.2024.102460
38. Schneider J, Gröger C, Lutsch A, Schwarz H, Mitschang B (2024) The Lakehouse: State of the Art on Concepts and Technologies. SN COMPUT SCI 5(5):449. https://doi.org/10.1007/s42979-024-02737-0
39. Hai R, Koutras C, Quix C, Jarke M (2023) Data Lakes: A Survey of Functions and Systems. IEEE Transactions on Knowledge and Data Engineering 35(12):12571–12590. https://doi.org/10.1109/TKDE.2023.3270101
40. AbouZaid A, Barclay PJ, Chrysoulas C, Pitropakis N (2025) Building a modern data platform based on the data lakehouse architecture and cloud-native ecosystem. Discov Appl Sci 7(3):166. https://doi.org/10.1007/s42452-025-06545-w
41. Mazumdar D, Hughes J, Onofre JB (2023) The Data Lakehouse: Data Warehousing and More. http://arxiv.org/abs/2310.08697. Accessed 11 Jan 2024

Citation

BibTeX citation:
@article{2025,
  author = {},
  title = {Layers, Pipes and Patterns: Detailing the Concept of Data
    Stations as a Foundational Building Block for Federated Data Systems
    in Healthcare},
  journal = {GitHub},
  pages = {1-3},
  date = {2025-06-01},
  url = {https://github.com/health-ri/aida/data-stations/index.html},
  langid = {en},
  abstract = {We generalize the concept of Personal Health Train in
    terms of the European Interoperability Reference Architecture, using
    the concept of a data station as a foundational building block. By
    conceptualizing data stations as persistent data storage within the
    domain of the Data Holder that is tightly integrated with
    decentralized computing resources, we outline steps towards
    interoperable federated analytics networks in healthcare.}
}
For attribution, please cite this work as: