1. Introduction
Some info about GDI here.
1.1. Goals and Scope of the Harmonized Minimal Data Model
The Harmonized Minimal Data Model (HMDM) serves several purposes, including:
- Each dataset submitted to the catalogue should adhere to this model (to a certain extent, still to be decided).
- Search options (filters/facets) in the catalogue can be based on the model.
- The HMDM can, and needs to, be implemented in the Beacons and/or other discovery tools.
Alongside the HMDM, we provide a minimal metadata for submission schema.
Version 1 of the metadata for submission model was released and can be seen here. It was based on the December 2024 version of HealthDCAT-AP. Because work on HealthDCAT-AP is still ongoing, the specification has changed since then, so our metadata for submission is no longer compliant with the current version of HealthDCAT-AP.
We will update the metadata for submission to version 1.1. Following HealthDCAT-AP, it will be provided in three variants based on the access level of the dataset: Open, Sensitive and Protected. The variants differ in the cardinalities of their properties. The model will include all current mandatory fields from those HealthDCAT-AP variants, as well as a custom class created specifically for the submission of metadata in GDI, called HMD Submission. The namespaces for the HMD Submission class and its properties were made available through the FAIR Genomes repository. Some mandatory fields that were already included in the GDI MS8 deliverable will stay mandatory for backwards compatibility. The update will be minimal: it only extends the current model with new mandatory fields.
1.2. Overview and Diagram
2. Matches
Different countries use different standards or vocabularies for their data. For example, in the harmonized minimal data model we mostly use terms from SNOMED CT and NCIT, but some countries only use ICD-10, and it’s difficult for them to convert everything to SNOMED. To make things easier, we decided to focus on matches between terms from different systems. This way, a dataset using ICD-10 can still be included in the GDI Catalogue, as long as we can clearly show how its terms connect to SNOMED or other terms.
For example:
- One dataset might use a SNOMED term to describe a disease.
- Another dataset might describe the same disease using an ICD-10 term.
- Without a match between them, our system would think they are two different things.
- If we add an exact match, the system will understand that they are the same, just described with different vocabularies.
This makes the data both human- and machine-readable.
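
The scenario above can be sketched in a few lines of Python. This is only an illustration, not the catalogue's implementation: the identifiers are placeholders rather than real SNOMED CT or ICD-10 codes, and the names `EXACT_MATCH_TO_SNOMED`, `to_model_term` and `same_concept` are hypothetical.

```python
# Placeholder identifiers: these are NOT real SNOMED CT or ICD-10 codes.
# Each entry records an exact match from an external term to the SNOMED
# term used in the model.
EXACT_MATCH_TO_SNOMED = {
    "icd10:EXAMPLE_DISEASE": "snomed:EXAMPLE_DISEASE",
}


def to_model_term(term: str) -> str:
    """Return the matched SNOMED term, or the term itself if no match is recorded."""
    return EXACT_MATCH_TO_SNOMED.get(term, term)


def same_concept(term_a: str, term_b: str) -> bool:
    """Two terms denote the same concept if they resolve to the same model term."""
    return to_model_term(term_a) == to_model_term(term_b)


dataset_a = {"diagnosis": "snomed:EXAMPLE_DISEASE"}
dataset_b = {"diagnosis": "icd10:EXAMPLE_DISEASE"}

# Without the match, a plain comparison treats the two terms as different.
print(dataset_a["diagnosis"] == dataset_b["diagnosis"])  # False
# With the exact match, both resolve to the same concept.
print(same_concept(dataset_a["diagnosis"], dataset_b["diagnosis"]))  # True
```
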
2.1. How To Use Matches
We started by adding different types of matches to all items and values in the model. The types of matches (match attribute), their definitions and examples can be found in the table below.
Match Attribute | Definition | Example |
---|---|---|
skos:exactMatch | Two concepts have the same meaning and can be used interchangeably in all contexts. This link is transitive, meaning if A is an exact match to B, and B is an exact match to C, then A is also an exact match to C. | SNOMED Date of birth has an exact match to EFO Date of birth. Both refer to the same concept with no variation in meaning. |
skos:closeMatch | Two concepts are sufficiently similar that they can be used interchangeably in many applications and schemes, but they are not strictly identical. This link is not meant to be transitive. | SNOMED Patient has a close match to NCIT Study Participant. A patient is an individual receiving medical care, while a study participant is someone enrolled in a research study. In some cases, these terms can be used interchangeably (e.g., in clinical trials involving patients), but not all study participants are patients—some might be healthy controls. |
skos:relatedMatch | Two concepts are associated but not similar enough to be considered exact or close matches. | HGNC BRCA1 has a skos:relatedMatch to NCIT Breast Carcinoma because the two are strongly associated but not identical. BRCA1 is a gene, while breast carcinoma is a disease that may develop as a result of certain BRCA1 mutations. However, not all BRCA1 mutations lead to breast cancer, and not all cases of breast cancer are caused by BRCA1 mutations, so the concepts are related but neither is broader or narrower than the other. |
skos:broadMatch | The concept represented by the external entity is broader than the item. This is the inverse property of skos:narrowMatch. | NCIT Breast Carcinoma has a skos:broadMatch to SNOMED Cancer. Cancer is the broader concept: it is a general category that includes multiple types of cancer, one of which is Breast Carcinoma. |
skos:narrowMatch | The concept represented by the external entity is narrower than the item. This is the inverse property of skos:broadMatch. | SNOMED Cancer has a skos:narrowMatch to NCIT Breast Carcinoma. Breast Carcinoma is the narrower concept: it is a specific subtype of Cancer. |
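
Two properties from the table lend themselves to a small sketch: skos:exactMatch is transitive, and skos:broadMatch and skos:narrowMatch are inverses of each other. The Python below is a minimal illustration under our own assumptions. Mappings are plain (subject, predicate, object) tuples, the term labels stand in for real identifiers, the `OTHER:` vocabulary is invented for the example, and the function names are hypothetical.

```python
from collections import defaultdict

# broadMatch and narrowMatch are inverse properties of each other.
INVERSE = {
    "skos:broadMatch": "skos:narrowMatch",
    "skos:narrowMatch": "skos:broadMatch",
}


def with_inverses(mappings):
    """Return the mappings plus the inverse of every broad/narrow statement."""
    result = set(mappings)
    for subject, predicate, obj in mappings:
        if predicate in INVERSE:
            result.add((obj, INVERSE[predicate], subject))
    return result


def exact_match_groups(mappings):
    """Group terms linked by skos:exactMatch, following the link transitively."""
    parent = {}

    def find(term):
        # Union-find lookup with path halving.
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]
            term = parent[term]
        return term

    for subject, predicate, obj in mappings:
        if predicate == "skos:exactMatch":
            parent[find(subject)] = find(obj)

    groups = defaultdict(set)
    for term in parent:
        groups[find(term)].add(term)
    return list(groups.values())


mappings = [
    ("SNOMED:Date of birth", "skos:exactMatch", "EFO:Date of birth"),
    # Hypothetical third vocabulary, added to show transitivity.
    ("EFO:Date of birth", "skos:exactMatch", "OTHER:Date of birth"),
    ("NCIT:Breast Carcinoma", "skos:broadMatch", "SNOMED:Cancer"),
]

# All three "Date of birth" terms end up in a single group.
print([sorted(group) for group in exact_match_groups(mappings)])
# The narrowMatch from SNOMED:Cancer to NCIT:Breast Carcinoma is derived.
print(sorted(with_inverses(mappings)))
```

Only skos:exactMatch is grouped this way. skos:closeMatch is deliberately left out because, as the table notes, it is not meant to be transitive.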
In addition to the matches, we also record the source of each match. If the mapping already exists, for example SNOMED to ICD-10 in the SNOMED browser or in the T-Rex browser, we include this information with the match in the spreadsheet.
2.2. Using SSSOM
From now on, we follow the SSSOM (Simple Standard for Sharing Ontological Mappings) method, a standard format for describing matches between terms. We will:
- Add all required SSSOM elements.
- Include recommended and optional elements if they are useful.
- Make sure our matches follow the SSSOM rules.
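
As a rough sketch of what one of our matches could look like as an SSSOM row, the Python below writes a small tab-separated table. It assumes that the required mapping elements are `subject_id`, `predicate_id`, `object_id` and `mapping_justification`, which is our reading of the SSSOM specification and should be checked against the version in use. The identifiers are placeholders, the helper names are hypothetical, and the metadata header that a complete SSSOM file carries (for example the prefix map) is left out.

```python
import csv
import io

# Assumed required SSSOM mapping slots; verify against the SSSOM version in use.
REQUIRED_SLOTS = ["subject_id", "predicate_id", "object_id", "mapping_justification"]
# Two optional slots we find useful for human readers.
OPTIONAL_SLOTS = ["subject_label", "object_label"]


def missing_required(row):
    """Return the required slots that are absent or empty in a mapping row."""
    return [slot for slot in REQUIRED_SLOTS if not row.get(slot)]


def to_sssom_tsv(rows):
    """Serialise mapping rows as a tab-separated table, rejecting incomplete rows."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=REQUIRED_SLOTS + OPTIONAL_SLOTS,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",
    )
    writer.writeheader()
    for row in rows:
        missing = missing_required(row)
        if missing:
            raise ValueError(f"mapping is missing required slots: {missing}")
        writer.writerow(row)
    return buffer.getvalue()


rows = [
    {
        # Placeholder identifiers, not real codes.
        "subject_id": "snomed:EXAMPLE_DISEASE",
        "subject_label": "Example disease",
        "predicate_id": "skos:exactMatch",
        "object_id": "icd10:EXAMPLE_DISEASE",
        "object_label": "Example disease",
        "mapping_justification": "semapv:ManualMappingCuration",
    }
]

print(to_sssom_tsv(rows))
```

Rejecting rows with missing required elements at write time keeps incomplete matches out of the shared file, which is the point of the first and third items in the list above.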
Using this model and the SSSOM standard will help us:
- Make it easier for teams using different vocabularies to share their data.
- Let people and software search and analyze the data more effectively.
- Support more countries and teams in joining the GDI Catalogue without needing to change everything.
3. Main Classes
In the Harmonized Minimal Dataset, we defined 12 different classes.
3.1. Subject
Item | Definition | Value | Cardinality |
---|---|---|---|
Birth Date | The calendar date on which a person was born. | Complete date, without time, following ISO 8601. If only the year or year-month is available, use that. Datatype: xsd:date, xsd:gYearMonth or xsd:gYear | 1..1 |
Administrative Gender | The gender of a person used for administrative purposes. | HL7 Administrative Gender | 1..1 |
Biological Sex at Birth | The sex of a person at birth, as molecularly proven. | HL7 ValueSet: Birth Sex | 0..1 |
Date of Last Follow-up | Date of last follow-up, partial date with month and year. | Date (YYYY-MM-DD), ISO 8601 format | 0..1 |
Country of Origin | A person’s descent or lineage, from a person or from a population. | 2- or 3-letter code from ISO 3166-1 if only a country code is provided. If a country subdivision is provided, a value from ISO 3166-2 | 0..1 |
Subject ID | A sequence of characters used to identify, name, or characterize a trial or study subject. | String | 1..1 |
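
The value rules for Birth Date and Country of Origin can be checked mechanically. The following Python is a minimal sketch, not part of the model: the function names are hypothetical, and the country check only looks at the shape of the code (letters, length, hyphen), not at whether the code is actually assigned in ISO 3166.

```python
import re
from datetime import date

# Lexical shapes of the three allowed Birth Date datatypes.
_DATE_PATTERNS = {
    "xsd:gYear": re.compile(r"\d{4}"),
    "xsd:gYearMonth": re.compile(r"\d{4}-(0[1-9]|1[0-2])"),
    "xsd:date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}

# Shape-only checks; membership in the ISO 3166 code lists is not verified.
_COUNTRY_PATTERNS = {
    "ISO 3166-1": re.compile(r"[A-Z]{2,3}"),
    "ISO 3166-2": re.compile(r"[A-Z]{2}-[A-Z0-9]{1,3}"),
}


def birth_date_type(value):
    """Return the XSD datatype a birth date string fits, or None if it fits none."""
    for xsd_type, pattern in _DATE_PATTERNS.items():
        if pattern.fullmatch(value):
            if xsd_type == "xsd:date":
                try:
                    # Rejects impossible dates such as 1980-02-30.
                    date.fromisoformat(value)
                except ValueError:
                    return None
            return xsd_type
    return None


def country_code_shape(value):
    """Return which ISO 3166 part a code is shaped like, or None."""
    for part, pattern in _COUNTRY_PATTERNS.items():
        if pattern.fullmatch(value):
            return part
    return None


for example in ["1980", "1980-05", "1980-05-17", "17/05/1980"]:
    print(example, "->", birth_date_type(example))

for example in ["NL", "NLD", "NL-UT", "Netherlands"]:
    print(example, "->", country_code_shape(example))
```
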
3.2. Sample
3.3. Sequence
3.4. Sample Relation
3.5. Diagnosis
Item | Definition | Value | Cardinality |
---|---|---|---|
Date of Diagnosis | Date at which diagnosis was made. | Date (YYYY-MM-DD), ISO 8601 format | 0..1 |
Administrative Gender | The gender of a person used for administrative purposes. | HL7 Administrative Gender | 1..1 |
Diagnosis | The investigation, analysis and recognition of the presence and nature of disease, condition, or injury from expressed signs and symptoms; also, the scientific determination of any kind; the concise results of such an investigation. | Children of Disease (Disorder) in SNOMED | 0..n |
Provisional diagnosis / clinical diagnosis | An initial diagnosis that is subject to change as new information becomes available. | Children of Disease (Disorder) in SNOMED | 0..n |