GDI Harmonised Minimal Data Model

Living Document,

This version:
TBD
Issue Tracking:
GitHub
Editors:
Ana Konrad
Hannah Neikes
Jeroen Beliën
Joeri van der Velde
Abhishek Nayak
Aedin Culhane
Not Ready For Implementation

This spec is not yet ready for implementation. It exists in this repository to record the ideas and promote discussion.

Before attempting to implement this spec, please contact the editors.


Abstract


Some context about the GDI HMD.

1. Introduction

Some info about GDI here.

1.1. Goals and Scope of the Harmonized Minimal Data Model

The Harmonized Minimal Data Model (HMDM) serves several purposes, including:

  1. Each dataset submitted to the catalogue should adhere to this model (to an extent that is still to be decided).

  2. Search options (filters/facets) in the catalogue can be built on the model.

  3. The HMDM can, and needs to, be implemented in the Beacons and/or other discovery tools.

Alongside the HMDM, we also provide a minimal metadata for submission schema.

Version 1 of the metadata for submission model was released and can be seen here. This version was based on the December 2024 version of HealthDCAT-AP. Because work on HealthDCAT-AP is still ongoing, the profile has changed since then, so our metadata for submission is no longer compliant with the current version of HealthDCAT-AP.

We will update the metadata for submission to version 1.1. Following HealthDCAT-AP, it will be provided in three variants based on the access level of the dataset: Open, Sensitive and Protected, which differ in the cardinalities of their properties. The metadata for submission model will include all the current mandatory fields from those HealthDCAT-AP variants, as well as a custom class created specifically for the submission of metadata in GDI, called HMD Submission. The namespaces for the HMD Submission class and its properties were made available through the FAIR Genomes repository. Some mandatory fields that were already included in the GDI MS8 deliverable will stay mandatory for backwards compatibility. This update will be minimal: it only extends the current model with new mandatory fields.

1.2. Overview and Diagram

2. Matches

Different countries use different standards or vocabularies for their data. For example, in the harmonized minimal data model we mostly use terms from SNOMED CT and NCIT, but some countries only use ICD-10, and it’s difficult for them to convert everything to SNOMED. To make things easier, we decided to focus on matches between terms from different systems. This way, a dataset using ICD-10 can still be included in the GDI Catalogue, as long as we can clearly show how its terms connect to SNOMED or other terms.

For example:

This will make the data both human- and machine-readable.

2.1. How To Use Matches

We started by adding different types of matches to all items and values in the model. The types of matches (match attribute), their definitions and examples can be found in the table below.

Match Attribute | Definition | Example
skos:exactMatch | Two concepts have the same meaning and can be used interchangeably in all contexts. This link is transitive: if A is an exact match to B, and B is an exact match to C, then A is also an exact match to C. | SNOMED Date of birth has an exact match to EFO Date of birth. Both refer to the same concept with no variation in meaning.
skos:closeMatch | Two concepts are sufficiently similar that they can be used interchangeably in many applications and schemes, but they are not strictly identical. This link is not meant to be transitive. | SNOMED Patient has a close match to NCIT Study Participant. A patient is an individual receiving medical care, while a study participant is someone enrolled in a research study. In some cases these terms can be used interchangeably (e.g., in clinical trials involving patients), but not all study participants are patients; some might be healthy controls.
skos:relatedMatch | Two concepts are associated but not similar enough to be considered exact or close matches. | HGNC BRCA1 and NCIT Breast Carcinoma have a skos:relatedMatch relationship because they are strongly associated but not identical. BRCA1 is a gene, while breast cancer is a disease that may develop as a result of certain BRCA1 mutations. However, not all BRCA1 mutations lead to breast cancer, and not all cases of breast cancer are caused by BRCA1 mutations, so the two concepts are related rather than broader or narrower.
skos:broadMatch | The concept represented by the external entity is broader than the item. This is the inverse property of skos:narrowMatch. | NCIT Breast Carcinoma has a skos:broadMatch relationship to SNOMED Cancer. Cancer is the broader concept because it represents a general category that includes multiple types of cancer, one of which is Breast Cancer.
skos:narrowMatch | The concept represented by the external entity is narrower than the item. This is the inverse property of skos:broadMatch. | SNOMED Cancer has a skos:narrowMatch relationship to NCIT Breast Carcinoma. Breast Cancer is a specific subtype of Cancer.

In addition to the matches, we also record the source of each match. If the mapping has already been done, for example SNOMED to ICD-10 in the SNOMED browser or in the T-Rex browser, we include this information with the match in the spreadsheet.
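As an illustration only, the match attributes from the table above and the recorded source could be held in a small record type. The sketch below is a hedged Python example, not part of the model: the class name Match, its field names, the term labels used as identifiers, and the source strings are all assumptions made for this example. It shows two properties stated in the table: skos:broadMatch and skos:narrowMatch are inverses of each other, and skos:exactMatch is transitive.

```python
from dataclasses import dataclass

# Inverse of each match attribute. broadMatch and narrowMatch invert each
# other; exactMatch, closeMatch and relatedMatch are symmetric in SKOS.
INVERSE = {
    "skos:broadMatch": "skos:narrowMatch",
    "skos:narrowMatch": "skos:broadMatch",
    "skos:exactMatch": "skos:exactMatch",
    "skos:closeMatch": "skos:closeMatch",
    "skos:relatedMatch": "skos:relatedMatch",
}


@dataclass(frozen=True)
class Match:
    """One match between two terms (hypothetical structure)."""
    subject: str     # term in the model, written here as "SYSTEM:label"
    match_type: str  # one of the keys of INVERSE
    obj: str         # term in the external system
    source: str      # where the mapping came from


def invert(match: Match) -> Match:
    """Return the same match read from the other direction."""
    return Match(match.obj, INVERSE[match.match_type], match.subject, match.source)


def exact_equivalents(matches, start):
    """Collect every term linked to `start` by a chain of exact matches."""
    neighbours = {}
    for m in matches:
        if m.match_type == "skos:exactMatch":
            neighbours.setdefault(m.subject, set()).add(m.obj)
            neighbours.setdefault(m.obj, set()).add(m.subject)
    seen = {start}
    stack = [start]
    while stack:
        term = stack.pop()
        for other in neighbours.get(term, ()):
            if other not in seen:
                seen.add(other)
                stack.append(other)
    seen.discard(start)
    return seen


# Placeholder data: labels stand in for real codes, sources are invented.
matches = [
    Match("NCIT:Breast Carcinoma", "skos:broadMatch", "SNOMED:Cancer", "manual curation"),
    Match("SNOMED:Date of birth", "skos:exactMatch", "EFO:Date of birth", "manual curation"),
    Match("EFO:Date of birth", "skos:exactMatch", "EXAMPLE:birth date", "manual curation"),
]

print(invert(matches[0]))
print(sorted(exact_equivalents(matches, "SNOMED:Date of birth")))
```

Close and related matches are deliberately left out of exact_equivalents, because the table describes skos:closeMatch as not meant to be transitive.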

2.2. Using SSSOM

From now on, we will follow the SSSOM (Simple Standard for Sharing Ontological Mappings) approach. This is a standard format for describing matches between terms. We will:

Using this model and the SSSOM standard will help us:

3. Main Classes

In the Harmonized Minimal Dataset, we defined 12 different classes.

3.1. Subject

Item | Definition | Value | Cardinality
Birth Date | The calendar date on which a person was born. | Complete date, without time, following ISO 8601. If only year or year-month is available, use that. xsd:date or xsd:gYearMonth or xsd:gYear | 1..1
Administrative Gender | The gender of a person used for administrative purposes. | HL7 Administrative Gender | 1..1
Biological Sex at Birth | The sex of a person at birth as molecularly proven. | HL7 ValueSet: Birth Sex | 0..1
Date of Last Follow-up | Date of last follow-up, partial date with month and year. | Date (YYYY-MM-DD), ISO 8601 format | 0..1
Country of Origin | A person’s descent or lineage, from a person or from a population. | 2- or 3-letter code from ISO 3166-1 if only a country code is provided. For a country subdivision, a value from ISO 3166-2 | 0..1
Subject ID | A sequence of characters used to identify, name, or characterize a trial or study subject. | String | 1..1
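To make the Subject table concrete, the following Python sketch checks a record against the mandatory (1..1) items and the accepted Birth Date formats. It is a minimal illustration under stated assumptions: the Subject class, its field names, validate_subject, and the example values are hypothetical and not defined by this model, and coded values such as Administrative Gender are only checked for presence, not against the HL7 value set.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Year, year-month, or full calendar date without time
# (the lexical shapes of xsd:gYear, xsd:gYearMonth and xsd:date,
# ignoring optional timezone suffixes).
BIRTH_DATE = re.compile(r"\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?")


@dataclass
class Subject:
    """Hypothetical container for the Subject items."""
    subject_id: str                                # 1..1
    birth_date: str                                # 1..1
    administrative_gender: str                     # 1..1
    biological_sex_at_birth: Optional[str] = None  # 0..1
    date_of_last_follow_up: Optional[str] = None   # 0..1
    country_of_origin: Optional[str] = None        # 0..1


def is_valid_birth_date(value: str) -> bool:
    """Accept YYYY, YYYY-MM or a real calendar date YYYY-MM-DD."""
    if not BIRTH_DATE.fullmatch(value):
        return False
    if len(value) == 10:
        try:
            date.fromisoformat(value)  # rejects dates such as 1980-02-30
        except ValueError:
            return False
    return True


def validate_subject(subject: Subject) -> list:
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    if not subject.subject_id:
        problems.append("Subject ID is mandatory")
    if not subject.administrative_gender:
        problems.append("Administrative Gender is mandatory")
    if not is_valid_birth_date(subject.birth_date):
        problems.append("Birth Date must be YYYY, YYYY-MM or YYYY-MM-DD")
    return problems


# Example values are invented; "female" is used as a placeholder code.
print(validate_subject(Subject("S-001", "1980-05", "female")))
print(validate_subject(Subject("S-002", "1980-02-30", "female")))
```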

3.2. Sample

3.3. Sequence

3.4. Sample Relation

3.5. Diagnosis

Item | Definition | Value | Cardinality
Date of Diagnosis | Date at which diagnosis was made. | Date (YYYY-MM-DD), ISO 8601 format | 0..1
Administrative Gender | The gender of a person used for administrative purposes. | HL7 Administrative Gender | 1..1
Diagnosis | The investigation, analysis and recognition of the presence and nature of disease, condition, or injury from expressed signs and symptoms; also, the scientific determination of any kind; the concise results of such an investigation. | Children of Disease (Disorder) in SNOMED | 0..n
Provisional diagnosis / clinical diagnosis | An initial diagnosis that is subject to change as new information becomes available. | Children of Disease (Disorder) in SNOMED | 0..n
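The cardinality notation used in these tables (0..1, 1..1, 0..n) can be checked mechanically. The sketch below is one possible reading of that notation, with "n" taken to mean no upper bound; the function names parse_cardinality and check_cardinality and the example values are assumptions made for this illustration.

```python
def parse_cardinality(text):
    """Split a string such as "0..n" into (lower, upper); upper is None for "n"."""
    lower, upper = text.split("..")
    return int(lower), (None if upper == "n" else int(upper))


def check_cardinality(text, values):
    """Return True if the number of values fits the cardinality."""
    lower, upper = parse_cardinality(text)
    count = len(values)
    return count >= lower and (upper is None or count <= upper)


# Placeholder values standing in for coded diagnoses.
print(check_cardinality("0..n", ["diagnosis A", "diagnosis B"]))
print(check_cardinality("1..1", []))
```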

3.6. Treatment

3.7. Environmental Exposure

3.8. Biomarker

3.9. TNM

3.10. Variant

3.11. Target Gene

3.12. Digital Resource

Conformance

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

References

Normative References

[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://datatracker.ietf.org/doc/html/rfc2119