Motivation

A shift towards federated data systems as a design paradigm

The ambition for a seamlessly connected digital healthcare ecosystem, capable of leveraging vast quantities of patient data, remains elusive. Designing and implementing health data platforms is notoriously difficult, given the heterogeneity and complexity of such systems. To address these issues, federated data systems are gaining traction as an overarching design paradigm in which data remains securely at its source and computations are performed in a decentralized, distributed fashion.

Recent technological advances offer important new enablers for implementing federated data systems, most notably:

A significant increase in single-node computing capabilities: it is now possible to process up to 1 TB of tabular data on a single machine, enabling increasingly large volumes of data to be processed in a decentralized, federated system [1, 2];

The maturity of federated analytics, and specifically federated learning, as a means of performing analysis whilst 'hiding' the data from third parties, including training of deep learning models through aggregation of weights [3, 4, 5];

Privacy-enhancing technologies (PETs), such as secure multi-party computation (MPC), that are now sufficiently mature to be used on an industrial scale, enabling computations to be done under encryption ('in the blind') and thereby significantly improving security across a network of participants [6, 7];

The composable data stack as a solution design that allows the venerable relational database to be unbundled into loosely coupled components, making it easier and more practical to implement federated data systems using cloud-based components and microservices, and thus opening up a transition path towards more modular and robust architectures [8, 9].
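To make the phrase "aggregation of weights" concrete, the following is a minimal sketch of sample-weighted averaging of model parameters, in the spirit of federated averaging. It is an illustration under stated assumptions, not the method of any cited work: the function name `federated_average`, the flat list representation of weights, and the example numbers are all hypothetical.

```python
from typing import List, Sequence, Tuple


def federated_average(updates: Sequence[Tuple[List[float], int]]) -> List[float]:
    """Combine locally trained weights into one global weight vector.

    Each update is (weights, n_samples). Only these weights leave a site;
    the underlying records stay at their source. (Hypothetical helper.)
    """
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("at least one site must contribute samples")
    dim = len(updates[0][0])
    averaged = [0.0] * dim
    for weights, n in updates:
        if len(weights) != dim:
            raise ValueError("all sites must share the same model shape")
        for i, w in enumerate(weights):
            # Each site contributes in proportion to its local sample count.
            averaged[i] += w * n / total
    return averaged


# Two hypothetical sites with made-up weights and sample counts.
site_updates = [([1.0, 2.0], 100), ([3.0, 4.0], 300)]
global_weights = federated_average(site_updates)
print(global_weights)
```

In a real deployment the weights would be tensors exchanged over a network, and the aggregation step itself could be protected with a PET such as MPC; the sketch only shows the arithmetic of the aggregation.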

The architectural shift from centralized to federated data systems is not merely a technical evolution. Modern approaches to data governance are undergoing a similar paradigm shift towards federated solutions. Federated data systems are inherently more aligned with contemporary data governance frameworks, including the Data Governance Act (DGA), the European Health Data Space (EHDS) and the concept of data solidarity [10]. Within large corporations, the concept of a data mesh, which is in essence a federation of data producers and consumers within a commercial enterprise, is increasingly being adopted as well. From the perspective of sovereignty and solidarity, we believe that a commons-based, federated approach has distinct benefits in moving towards a more equitable, open digital infrastructure [11].

However, this ongoing paradigm shift towards federation is not without challenges. The notion of what constitutes a federated data system needs to be defined in more detail if we are to see the forest for the trees between different meanings of the same word. For example, 'federation' can refer to any of the following concepts:

Data federation in the context of distributed databases addresses the problem of uniformly accessing multiple, possibly heterogeneous data sources by mapping them into a unified schema, such as an RDF(S)/OWL ontology or a relational schema, and by supporting the execution of queries, such as SPARQL or SQL queries, over that unified schema [12];

Federation within the context of a Personal Health Train (PHT) refers to the concept where data processing is brought to the (personal health) data rather than the other way around, allowing access to (private) data to be controlled and ethical and legal concerns to be observed [13, 14, 15, 16, 17]; it is just one of many solution designs that are collectively grouped as federated analytics [3];

Federation in a Trusted Research Environment (TRE) pertains to a mechanism for data sharing in a temporary staging environment within a network of research organizations through federation services such as localization and access [18];

Federation in the context of data spaces, as described in the DSSC Blueprint 2.0, pertains to supporting the interplay of participants in a data space, operating in accordance with the policies and rules specified in the Rulebook by the data space authority.
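Of these meanings, data federation in the database sense is the easiest to show in a few lines. The sketch below maps two differently structured sources into one unified relational schema and runs a single SQL query over it. It is a toy under stated assumptions: both "sites" are in-memory SQLite databases in one process, and all table names, column names, codes and rows are invented for illustration. A real data federation system would reach remote sources through connectors rather than `ATTACH`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")                      # stands in for site A
conn.execute("ATTACH DATABASE ':memory:' AS site_b")    # stands in for site B

# Site A stores sex as text and birth year as an integer (hypothetical schema).
conn.execute("CREATE TABLE main.patients (patient_id TEXT, sex TEXT, birth_year INTEGER)")
conn.executemany(
    "INSERT INTO main.patients VALUES (?, ?, ?)",
    [("a1", "F", 1950), ("a2", "M", 1962)],
)

# Site B uses numeric gender codes and full dates (hypothetical schema).
conn.execute("CREATE TABLE site_b.persons (id INTEGER, gender INTEGER, date_of_birth TEXT)")
conn.executemany(
    "INSERT INTO site_b.persons VALUES (?, ?, ?)",
    [(1, 2, "1948-03-01"), (2, 1, "1971-11-23"), (3, 2, "1960-07-15")],
)

# The unified schema: a TEMP view may reference tables in any attached database.
conn.execute("""
    CREATE TEMP VIEW unified_patient AS
    SELECT 'site_a' AS source, patient_id, sex, birth_year
    FROM main.patients
    UNION ALL
    SELECT 'site_b',
           CAST(id AS TEXT),
           CASE gender WHEN 1 THEN 'M' WHEN 2 THEN 'F' END,
           CAST(substr(date_of_birth, 1, 4) AS INTEGER)
    FROM site_b.persons
""")

# One query over the unified schema, regardless of how each source stores its data.
rows = conn.execute(
    "SELECT sex, COUNT(*), MIN(birth_year) FROM unified_patient GROUP BY sex ORDER BY sex"
).fetchall()
print(rows)
```

The mapping logic lives entirely in the view definition, which is what distinguishes this meaning of federation from the PHT approach, where the computation travels to each source instead.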

What, then, is a viable development path out of this creative chaos?


  1. Mark Raasveldt and Hannes Mühleisen. DuckDB: an Embeddable Analytical Database. In Proceedings of the 2019 International Conference on Management of Data, 1981–1984. Amsterdam, Netherlands, June 2019. ACM. URL: https://dl.acm.org/doi/10.1145/3299869.3320212 (visited on 2025-06-09), doi:10.1145/3299869.3320212

  2. Felix Nahrstedt, Mehdi Karmouche, Karolina Bargieł, Pouyeh Banijamali, Apoorva Nalini Pradeep Kumar, and Ivano Malavolta. An Empirical Study on the Energy Usage and Performance of Pandas and Polars Data Analysis Python Libraries. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, 58–68. Salerno, Italy, June 2024. ACM. URL: https://dl.acm.org/doi/10.1145/3661167.3661203 (visited on 2025-06-09), doi:10.1145/3661167.3661203

  3. Zibo Wang, Haichao Ji, Yifei Zhu, Dan Wang, and Zhu Han. A Survey on Federated Analytics: Taxonomy, Enabling Techniques, Applications and Open Issues. IEEE Communications Surveys & Tutorials, pages 1–1, 2025. URL: https://ieeexplore.ieee.org/document/10960683/ (visited on 2025-06-19), doi:10.1109/COMST.2025.3558755

  4. Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletarì, Holger R. Roth, Shadi Albarqouni, Spyridon Bakas, Mathieu N. Galtier, Bennett A. Landman, Klaus Maier-Hein, Sébastien Ourselin, Micah Sheller, Ronald M. Summers, Andrew Trask, Daguang Xu, Maximilian Baust, and M. Jorge Cardoso. The future of digital health with federated learning. npj Digital Medicine, 3(1):1–7, September 2020. Publisher: Nature Publishing Group. URL: https://www.nature.com/articles/s41746-020-00323-1 (visited on 2023-04-23), doi:10.1038/s41746-020-00323-1

  5. Zhen Ling Teo, Liyuan Jin, Nan Liu, Siqi Li, Di Miao, Xiaoman Zhang, Wei Yan Ng, Ting Fang Tan, Deborah Meixuan Lee, Kai Jie Chua, John Heng, Yong Liu, Rick Siow Mong Goh, and Daniel Shu Wei Ting. Federated machine learning in healthcare: A systematic review on clinical applications and technical architecture. Cell Reports Medicine, 5(2):101419, February 2024. URL: https://www.sciencedirect.com/science/article/pii/S2666379124000429 (visited on 2025-09-17), doi:10.1016/j.xcrm.2024.101419

  6. United Nations Committee of Experts on Big Data and Data Science for Official Statistics. The PET Guide. Technical Report, United Nations, 2023. URL: https://unstats.un.org/bigdata/task-teams/privacy/guide/ (visited on 2025-01-22). 

  7. The Royal Society. From privacy to partnership. Technical Report, The Royal Society, January 2023. 

  8. Pedro Pedreira, Orri Erling, Konstantinos Karanasos, Scott Schneider, Wes McKinney, Satya R Valluri, Mohamed Zait, and Jacques Nadeau. The Composable Data Management System Manifesto. Proceedings of the VLDB Endowment, 16(10):2679–2685, June 2023. URL: https://dl.acm.org/doi/10.14778/3603581.3603604 (visited on 2023-12-27), doi:10.14778/3603581.3603604

  9. The Composable Codex. URL: https://voltrondata.com/codex.html (visited on 2024-10-16). 

  10. Barbara Prainsack and Seliem El-Sayed. Beyond Individual Rights: How Data Solidarity Gives People Meaningful Control over Data. The American Journal of Bioethics, 23(11):36–39, November 2023. URL: https://www.tandfonline.com/doi/full/10.1080/15265161.2023.2256267 (visited on 2023-11-11), doi:10.1080/15265161.2023.2256267

  11. Jan Krewer and Zuzanna Warso. Digital Commons as Providers of Public Digital Infrastructures. Technical Report, Open Future Foundation, November 2024. URL: https://openfuture.eu/publication/digital-commons-as-providers-of-public-digital-infrastructures (visited on 2025-06-15). 

  12. Zhenzhen Gu, Francesco Corcoglioniti, Davide Lanti, Alessandro Mosca, Guohui Xiao, Jing Xiong, and Diego Calvanese. A systematic overview of data federation systems. Semantic Web, 15(1):107–165, December 2022. Publisher: IOS Press. URL: https://content.iospress.com/articles/semantic-web/sw223201 (visited on 2024-05-27), doi:10.3233/SW-223201

  13. Oya Beyan, Ananya Choudhury, Johan van Soest, Oliver Kohlbacher, Lukas Zimmermann, Holger Stenzhorn, Md. Rezaul Karim, Michel Dumontier, Stefan Decker, Luiz Olavo Bonino da Silva Santos, and Andre Dekker. Distributed Analytics on Sensitive Medical Data: The Personal Health Train. Data Intelligence, 2(1-2):96–107, January 2020. URL: https://doi.org/10.1162/dint_a_00032 (visited on 2023-02-16), doi:10.1162/dint_a_00032

  14. Ananya Choudhury, Johan van Soest, Stuti Nayak, and Andre Dekker. Personal Health Train on FHIR: A Privacy Preserving Federated Approach for Analyzing FAIR Data in Healthcare. In Arup Bhattacharjee, Samir Kr. Borgohain, Badal Soni, Gyanendra Verma, and Xiao-Zhi Gao, editors, Machine Learning, Image Processing, Network Security and Data Sciences, Communications in Computer and Information Science, 85–95. Singapore, 2020. Springer. doi:10.1007/978-981-15-6315-7_7

  15. Luiz Olavo Bonino da Silva Santos, Luis Ferreira Pires, Virginia Martinez, João Moreira, and Renata Guizzardi. Personal Health Train Architecture with Dynamic Cloud Staging. SN Computer Science, October 2022. doi:10.1007/s42979-022-01422-4

  16. Chong Zhang, Ananya Choudhury, Leroy Volmer, Johan Soest, Inigo Bermejo, Andre Dekker, Aiara Lobo Gomes, and Leonard Wee. Secure and Private Healthcare Analytics: A Feasibility Study of Federated Deep Learning with Personal Health Train. July 2023. ISSN: 2693-5015. URL: https://www.researchsquare.com/article/rs-3158418/v1 (visited on 2024-10-24), doi:10.21203/rs.3.rs-3158418/v1

  17. Ananya Choudhury, Leroy Volmer, Frank Martin, Rianne Fijten, Leonard Wee, Andre Dekker, and Johan van Soest. Advancing Privacy-Preserving Health Care Analytics and Implementation of the Personal Health Train: Federated Deep Learning Study. JMIR AI, 4(1):e60847, February 2025. Publisher: JMIR Publications Inc., Toronto, Canada. URL: https://ai.jmir.org/2025/1/e60847 (visited on 2025-06-26), doi:10.2196/60847

  18. European network of trusted research environments (EOSC-ENTRUST) project. March 2024. URL: https://eosc-entrust.eu/ (visited on 2025-06-26). 
