1. Introduction
Some info about GDI here.
1.1. Goals and Scope of the Harmonized Minimal Data Model
The Harmonized Minimal Data Model (HMDM) serves several purposes, including:
- Each dataset submitted to the catalogue should adhere to this model (to an extent still to be decided).
- Search options (filters/facets) in the catalogue can be built on the model.
- The HMDM can, and needs to, be implemented in the Beacons and/or other discovery tools.
Next to the HMDM, we provide a minimal metadata for submission schema.
Version 1 of the metadata for submission model was released and can be seen here. This version was based on the December 2024 version of HealthDCAT-AP. Because work on HealthDCAT-AP is still ongoing, it has changed since then, so our metadata for submission is no longer compliant with the current version of HealthDCAT-AP. We will update the metadata for submission to version 1.1, which, following HealthDCAT-AP, will be provided in three variants based on the access level of the dataset: Open, Sensitive, and Protected. The variants differ in the cardinalities of their properties. The metadata for submission model will include all current mandatory fields from those HealthDCAT-AP variants, as well as a custom class created specifically for the submission of metadata in GDI, called HMD Submission. The namespaces for the HMD Submission class and its properties were made available through the FAIR Genomes repository. Some mandatory fields that were already included in the GDI MS8 deliverable will stay mandatory for backwards compatibility. This update will be minimal: it only extends the current model with new mandatory fields.
1.2. Overview and Diagram
2. Main Classes
In the Harmonized Minimal Dataset, we defined 12 different classes.
2.1. Subject
| Item | Definition | Value | Cardinality |
|---|---|---|---|
| Birth Date | The calendar date on which a person was born. | Complete date, without time, following ISO 8601. If only the year or year-month is available, use that. xsd:date, xsd:gYearMonth, or xsd:gYear | 1..1 |
| Administrative Gender | The gender of a person used for administrative purposes. | HL7 Administrative Gender | 1..1 |
| Biological Sex at Birth | The sex of a person at birth, as molecularly proven. | HL7 ValueSet: Birth Sex | 0..1 |
| Date of Last Follow-up | Date of last follow-up, partial date with month and year. | Date (YYYY-MM-DD), ISO 8601 format | 0..1 |
| Country of Origin | A person’s descent or lineage, from a person or from a population. | 2- or 3-letter code from ISO 3166-1 if only a country code is provided. For a country subdivision, a value from ISO 3166-2 | 0..1 |
| Subject ID | A sequence of characters used to identify, name, or characterize a trial or study subject. | String | 1..1 |
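
The Birth Date item accepts three precisions (full date, year-month, or year). As a minimal sketch of how a submission tool might check this, the hypothetical helper below classifies a value as `xsd:date`, `xsd:gYearMonth`, or `xsd:gYear`. The function name is an assumption for illustration, and time zone suffixes, which XSD permits, are not handled.

```python
import re
from datetime import date

# Lexical patterns for the three accepted precisions (no time zone support).
_PATTERNS = {
    "xsd:date": re.compile(r"(\d{4})-(\d{2})-(\d{2})"),
    "xsd:gYearMonth": re.compile(r"(\d{4})-(\d{2})"),
    "xsd:gYear": re.compile(r"(\d{4})"),
}


def classify_birth_date(value: str) -> str:
    """Return the XSD datatype name matching value, or raise ValueError."""
    for datatype, pattern in _PATTERNS.items():
        match = pattern.fullmatch(value)
        if match is None:
            continue
        parts = [int(p) for p in match.groups()]
        # Pad a missing month/day with 1 so date() can check calendar validity.
        year, month, day = (parts + [1, 1])[:3]
        date(year, month, day)  # raises ValueError for e.g. month 13
        return datatype
    raise ValueError(f"not an ISO 8601 date, year-month, or year: {value!r}")


print(classify_birth_date("1980-05-17"))  # xsd:date
print(classify_birth_date("1980-05"))     # xsd:gYearMonth
print(classify_birth_date("1980"))        # xsd:gYear
```
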
2.2. Diagnosis
| Item | Definition | Value | Cardinality |
|---|---|---|---|
| Date of Diagnosis | Date at which diagnosis was made. | Date (YYYY-MM-DD), ISO 8601 format | 0..1 |
| Diagnosis | The investigation, analysis and recognition of the presence and nature of disease, condition, or injury from expressed signs and symptoms; also, the scientific determination of any kind; the concise results of such an investigation. | Children of Disease (Disorder) in SNOMED | 0..n |
| Provisional diagnosis / clinical diagnosis | An initial diagnosis that is subject to change as new information becomes available. | Children of Disease (Disorder) in SNOMED | 0..n |
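
The Cardinality column in these tables uses the `min..max` notation, where `n` means unbounded. The sketch below shows one way to check a list of submitted values against such a cardinality. The function names are hypothetical, and conditional cardinalities such as `1..1 (conditional)` are deliberately out of scope.

```python
from typing import Optional, Sequence, Tuple


def parse_cardinality(text: str) -> Tuple[int, Optional[int]]:
    """Parse 'min..max'; an upper bound of 'n' is returned as None (unbounded)."""
    lower, upper = text.strip().split("..")
    return int(lower), None if upper == "n" else int(upper)


def satisfies(cardinality: str, values: Sequence) -> bool:
    """Return True if the number of values fits the cardinality."""
    lower, upper = parse_cardinality(cardinality)
    count = len(values)
    return count >= lower and (upper is None or count <= upper)


# Diagnosis is 0..n, so several coded diagnoses are allowed.
print(satisfies("0..n", ["diagnosis-1", "diagnosis-2"]))  # True
# Date of Diagnosis is 0..1, so two dates are not allowed.
print(satisfies("0..1", ["2020-01-01", "2021-01-01"]))    # False
```
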
2.3. Sample
| Item | Definition | Value | Cardinality |
|---|---|---|---|
| Anatomical sample location | Anatomic site from which the sample was taken. | ICD-11 Anatomy and topography | 1..1 |
| Pathological state | The pathological condition of the sample. | Tissue Normal, Germline Normal, Primary Tumor, Tumor Metastasis, Recurrent Tumor, Organoid, Tumoroid | 1..1 |
| Date | Defines the date of sampling. | Date (YYYY-MM-DD), ISO 8601 format | 0..1 |
| ID | Unique identifier for a collected specimen assigned by data provider. | String | 0..1 |
| Organism | A living entity. | Children of NCIT Organism | 1..1 |
| Biospecimen_Type | The type of a material sample taken from a biological entity for testing, diagnostic, propagation, treatment or research purposes. This includes particular types of cellular molecules, cells, tissues, organs, body fluids, embryos, and body excretory substances. | Value from the Type of Sample table in SPREC Codes v3.0: ASC - Ascites fluid; AMN - Amniotic fluid; BAL - Bronchoalveolar lavage; BLD - Blood (whole); BMA - Bone marrow aspirate; BMK - Breast milk; BUC - Buccal cells; BUF - Unficolled buffy coat, viable; CEL - Ficoll mononuclear cells, viable; CEN - Fresh cells from non blood specimen type; CLN - Cells from nonblood specimen type (e.g., disrupted tissue), viable; CRD - Cord blood; CSF - Cerebrospinal fluid; NAS - Nasal washing; PEL - Ficoll mononuclear cells, nonviable; PEN - Cells from nonblood specimen type (e.g., disrupted tissue), nonviable; PFL - Pleural fluid; PL1 - Plasma, single spun; PL2 - Plasma, double spun; SAL - Saliva; SEM - Semen; SER - Serum; SPT - Sputum; STL - Stool; SYN - Synovial fluid; TER - Tears; U24 - 24-h urine; URN - Urine; ZZZ - Other | 1..1 |
| Extraction_Technique | The technique of extraction of the sample. | Values within MESH Specimen Handling | 0..1 |
| Storage_Conditions | Storage conditions of the sample. | FAIR Genomes Storage Conditions or SPREC Codes v3.0 | 0..n |
| Assayed_Biological_Macromolecule | Macromolecule derived from the sample. | Children of EFO biological macromolecule | 0..1 |
| Sampling_Intent | Describes the purpose for taking the sample. | String | 0..n |
2.3.1. TNM
| Item | Definition | Value | Cardinality |
|---|---|---|---|
| Tumor_size | Indicates the size of the primary tumor. | Double | 0..1 |
| Tumor_size_unit | Indicates the unit in which tumor size is stated. | UCUM | 0..1 |
| Lymph_Node_Status | Indicates lymph node involvement according to the international TNM classification for solid tumors. | Children of 1279504008 \| American Joint Committee on Cancer ycN category allowable value (qualifier value) | 0..1 |
| Metastases | Indicates presence or absence of metastasis according to the international TNM classification for solid tumors. | Children of American Joint Committee on Cancer pathological M category allowable value (qualifier value) or Children of American Joint Committee on Cancer clinical M category allowable value (qualifier value) and HL7 NULL flavor in case not determined. | 0..1 |
| Version | Version of the TNM classification. | String | 0..1 |
| Stage | The extent of a cancer in the body. Staging is usually based on the size of the tumor, whether lymph nodes contain cancer, and whether the cancer has spread from the original site to other parts of the body. | or or or | 0..1 |
| Pathology_Clinical | Indication whether TNM classification is based on clinical or pathological evaluation. If pathological is available, this always goes before clinical. | Clinical or pathological | 1..1 |
| Date of evaluation | Date of the clinical/pathological evaluation. | Date (YYYY-MM-DD), ISO 8601 format | 0..1 |
2.3.2. Sample Relation
| Item | Definition | Value | Cardinality |
|---|---|---|---|
| Related_Sample | Points to a related sample. | Sample ID | 1..n |
| Relation_Type | Describes the relationship type between the two connected samples. | List to be defined/extended: New sample, Derived, Aliquot, Control (experiment), Disease versus Control, Matched, Longitudinal, Paired, Familial, Spatial | 1..1 |
| Relation_Description | Free text describing the Relation Type of two samples. | String | 0..1 |
2.4. Sequence
| Item | Definition | Value | Cardinality |
|---|---|---|---|
| Target | Identification of the sequenced target. | Whole Genome Sequencing (WGS) Whole Exome Sequencing (WES) Multi-gene panel sequencing array OTH | 1..1 |
| Sequencing_Date | Defines the date of sequencing. | Date (YYYY-MM-DD), ISO 8601 format | 0..1 |
| ISO 15189 accredited | Indication whether the laboratory is accredited according to ISO 15189 (clinical). | Boolean | 0..1 |
| ISO 17025 accredited | Indication whether the laboratory is accredited according to ISO 17025 (research). | Boolean | 0..1 |
| WIP Protocol | | String or https://www.protocols.io URL | |
| Participated in proficiency testing | Indication whether the laboratory had participated in any proficiency testing, such as interlaboratory comparison. | Boolean | 0..1 |
| IVDR_passed | Indicate whether the methodology (including chemistry and sequencing standards) used for sequencing follows the In vitro diagnostic medical devices (IVDR) regulation passed by the EU in April 2017. | Boolean | 0..1 |
| Sequencing Platform | The used sequencing platform (i.e. brand, name of a company that produces sequencer equipment). | FAIR Genomes or EFO list | 1..n |
| Average depth of coverage | Mean coverage for whole genome sequencing, or mean target coverage for whole exome and targeted sequencing (e.g. 60x: the average number of times each target base has been ‘read’ by the sequencer). | Integer | 1..1 |
| Breadth of coverage | Breadth of coverage (or evenness) is the proportion or percentage of reads that have been sequenced at the provided average depth of coverage (e.g. if the Average depth of coverage is 60 and the evenness is 50% at 60x, the value here is 50). | Integer | 0..1 |
| Additional NGS quality control metrics | Statement of any additional NGS quality control metrics. | String | 0..1 |
| Initial_input_file_format | Identification of the genomic file format of the initial input file (e.g. fastq, bam, cram). | EDAM’s file types and formats | 1..1 |
| Final_output_file_format | Identification of the genomic file format of the final output file (e.g. vcf, gvcf). | EDAM’s file types and formats | 1..1 |
| Final_output_file_format_version | Identification of the version of the genomic file format of the final output file (e.g. vcf, gvcf). | String | 0..1 |
| Alignment_software | Identification of the software used for alignment. | --> Digital Resource class | 0..n |
| Alignment_Genome | The specific build of the human genome used as reference for sequence alignment and variant calling. | --> Digital Resource class | 1..n |
| Specific_Settings_Alignment_Genome | Any specific settings regarding alternative contigs or decoys. | String | 0..n |
| Target Gene | In case of targeted sequencing, specify which gene is being targeted. This item points to another class: Target_Gene. | --> Target Gene class | 0..n |
| Target Other | Any other targeted genomic region. | Children of SIO region NULL flavors | 0..n |
| Panel_of_Normals_Included | Indicate whether a panel of normals is included during variant calling. | Boolean | 0..1 |
| Panel_of_Normals_Description | Free text description of panel of normals, if applicable. | String | 0..1 |
| Variant | A detected and reported variant. | --> Variant class | 0..n |
| Variant_calling | Identification of the software used for variant calling. | --> Digital Resource class | 0..n |
| Variant_calling_date | Defines the date of variant calling. | Date (YYYY-MM-DD), ISO 8601 format | 0..1 |
| Variant_Annotation | Identification of the software used for variant annotation. | --> Digital Resource class | 0..n |
| Variant_Annotation_database | Database and version used for variant annotation. | --> Digital Resource class | 0..n |
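
To illustrate how the Average depth of coverage and Breadth of coverage items relate, the sketch below derives both integers from a list of per-position depths. It rests on assumptions that the model does not prescribe: the input is a hypothetical list of read depths over the target positions, and breadth is interpreted as the percentage of target positions covered at or above the rounded average depth.

```python
from typing import Sequence, Tuple


def coverage_summary(depths: Sequence[int]) -> Tuple[int, int]:
    """Return (average depth, breadth in percent), both rounded to integers.

    Assumption: breadth is the share of positions whose depth is at least
    the rounded average depth.
    """
    if not depths:
        raise ValueError("at least one target position is required")
    average = round(sum(depths) / len(depths))
    covered = sum(1 for depth in depths if depth >= average)
    breadth = round(100 * covered / len(depths))
    return average, breadth


# Four target positions with depths 60, 80, 40 and 60.
print(coverage_summary([60, 80, 40, 60]))  # (60, 75)
```
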
2.4.1. Digital Resource
| Item | Definition | Value | Cardinality |
|---|---|---|---|
| Name | The name of the tool/software/database used. | String | 1..1 |
| Website | Link to the website or repository (like GitHub) of the tool/software/database. | URL | 0..n |
| Identifier | bio.tools identifier for the digital resource. | bio.tools identifier | 0..1 |
| Version | The version of the tool/software/database used. | double | 1..1 |
| Date used | The date when the tool/software/database was last used. | xsd:dateTime | 1..1 |
| Settings | Free text account of the settings used in the tool/software/database. | String | 0..n |
| Parameters | Description of parameters used with the specified software. Copy the complete command line (all lines executed) used. | String | 0..n |
2.4.2. Variant
| Item | Definition | Value | Cardinality |
|---|---|---|---|
| Variant_Type | The category or type of variation or abnormality present in an amino acid or nucleic acid sequence. | SNVs, indels, SVs, CNVs, gene fusions, ... (to be extended). | 0..n |
| Variant_Origin | A quality inhering in a variant by virtue of its origin. | somatic, germline, maternal, paternal, pedigree specific, population specific, de novo. | 0..1 |
| Variant_representation | The representation of the variant using HGVS nomenclature. | String following HGVS nomenclature | 1..1 |
| Clinical_Variant_Interpretation_criteria | International criteria (e.g. ACMG, ESMO-ESCAT) met for variant interpretation. | List of versions of ACMG, ESMO-ESCAT, others?? | 0..n |
| Clinical_Variant_Interpretation_result | Indicates result of clinical variant interpretation. | benign, likely benign, VUS, likely pathogenic, pathogenic. | 0..1 |
| Clinical_expert_panel_decision | Decision by clinical expert panel concerning the variant interpretation | String | 0..n |
| Applied_Criteria_of_Evidence | A category which fits with categories provided by expert panels or tools accepted in clinical practice. If such recommendations are not available, the weighted categories provided by freely available tools would be acceptable. | As listed in tables 3 and 4 in Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology | 0..n |
| Clinical_Interpretation_Tool | Identification of the tool used for clinical interpretation | --> Digital Resource class | 0..n |
| Variant_calling_software_deviation | Identification of the software used for variant calling, if different from software stated in Sequence class. | --> Digital Resource class | 0..n |
| Variant_Annotation_tools_deviation | Identification of the software used for variant annotation, if different from software stated in Sequence class. | --> Digital Resource class | 0..n |
| Variant_Annotation_database_deviation | Database and version used for variant annotation, if different from the database stated in Sequence class. | --> Digital Resource class | 0..n |
| Reported_to_patient | Indication if the variant has been reported back to the patient. | Boolean | 0..1 |
2.4.3. Target Gene
| Item | Definition | Value | Cardinality |
|---|---|---|---|
| URI | URI identifying the targeted gene. | URI to either HGNC, NCBI Gene, OMIM, HPO, or HGVS for variants. | 0..1 |
| Label | Label of the target gene, if no URI can be provided. | String | 0..1 |
| Description | Description of target gene. | String | 0..1 |
2.5. Treatment
| Item | Definition | Value | Cardinality |
|---|---|---|---|
| Intention_to_Treat | Indicate the intended disease outcome for which the treatment is given, may be coded as SNOMED-CT. | 373808002 \| Curative - procedure intent; 373847000 \| Neoadjuvant intent; 373846009 \| Adjuvant - intent; 363676003 \| Palliative - procedure intent; 360271000 \| Prophylaxis - procedure intent; 243114000 \| Support (regime/therapy); 129304002 \| Excision - action; 261004008 \| Diagnostic intent; 129428001 \| Preventive - intent; 429892002 \| Guidance intent; 360156006 \| Screening - procedure intent; 447295008 \| Forensic intent; Other - HL7 NULL flavor (OTH, UNK, NI, NA) | 0..1 |
| Setting | Indicate the treatment setting, which describes the treatment’s purpose in relation to the primary treatment (Reference: NCIT C124308) | Adjuvant, Advanced/Metastatic, Neoadjuvant, Not applicable | 0..1 |
| Type | Indicates the type of treatment regimen that the patient completed; coded values can be chosen, SNOMED may be chosen for "procedures". | 23719005 \| Bone marrow transplant; 367336001 \| Chemotherapy; 18629005 \| Medication; 423827005 \| Endoscopic therapy; 169413002 \| Hormonal therapy; 76334006 \| Immunotherapy; 257891001 \| Photodynamic therapy; 1287742003 \| Radiation therapy; 56305001 \| Radionucleotide therapy; 1269349006 \| Stem cell transplant; 387713003 \| Surgery; 394613000 \| Gene therapy; 310341009 or 373818007 \| Watchful waiting; 424313000 \| Active follow-up; 703423002 \| Chemoradiotherapy; HL7 NULL flavor (NI, NA, OTH) \| Other | 1..1 |
| Subtype | Detailed specification of the treatment type. | String | 0..1 |
| Line | Indicate if the treatment was the primary treatment following the initial diagnosis or 2nd, 3rd, … | String | 0..1 |
| Treatment Code | A drug product that contains one or more active and/or inactive ingredients used by the patient intended to treat, prevent or alleviate the symptoms of disease. Any hormone therapies, gender-related or otherwise, should also be recorded here. | ATC codes, IDMP when available. In case of Cancer: HEMONC, RxNorm | 1..n |
| Date_start_overall | The date and time of (the start of) the treatment. | Date (YYYY-MM-DD), ISO 8601 format | 0..1 |
| Date_end_overall | The date and time of the end of the overall treatment. | Date (YYYY-MM-DD), ISO 8601 format | 0..1 |
| Intended Duration | The duration of treatment regimen, in days. | Integer | 0..1 |
| DoseUnits | Indicates the unit in which the total dose is given (e.g. Gray (Gy) for radiation, milligrams/24 hours for medication). | Choice list of the most common dose units, such as Gray and milligrams/24 hours; make use of UCUM. | 0..1 |
| cumulativeDose | The amount of any substance administered over a specific period of time. | Float | 0..1 |
| doseIntervals | The number of times a substance is administered within a specific time period. | ISO8601 Period | 0..1 |
| routeOfAdministration | Designation of the part of the body through which or into which, or the way in which, the medicinal product is intended to be introduced. In some cases a medicinal product can be intended for more than one route and/or method of administration. | Subclass of Anatomy qualifier | 0..n |
| Pre-Treated | Indicates if disease has been pre-treated or is in course of treatment. | Boolean | 0..1 |
| Procedures | Indicates the type of treatment regimen that the patient completed; coded values can be chosen, SNOMED may be chosen for "procedures". | SNOMED Procedures | 0..1 |
| Surgical resection quality | Evaluation of the surgical resection quality based on coded values. | 1222638005 \| American Joint Committee on Cancer R0 (qualifier value); 1222639002 \| American Joint Committee on Cancer R1 (qualifier value); 1222640000 \| American Joint Committee on Cancer R2 (qualifier value); 1222641001 \| American Joint Committee on Cancer RX (qualifier value) | 0..1 |
| Treatment_Status | Indicates the patient’s outcome of the prescribed treatment (coded values, e.g. treatment not completed because of toxicity). | 182992009 \| Treatment completed (situation); 445528004 \| Treatment changed (situation); 405613005 \| Planned procedure (situation); 416406003 \| Procedure discontinued (situation) | 0..1 |
| Reason_for_incomplete_Treatment | Reason for discontinuation of the treatment. | 419099009 \| Dead (finding); 1296859006 \| Procedure declined (situation); 713247000 \| Procedure discontinued by patient (situation); 713246009 \| Procedure discontinued by healthcare professional (situation); 266721009 \| Absent response to treatment (situation); 407563006 \| Treatment not tolerated (situation); Technical or organizational problems \| to be added, or use OTH; Other \| HL7 null flavour, OTH; Not applicable \| HL7 null flavour, NA; No information \| HL7 null flavour, NI | 0..1 |
| Response_to_Treatment | The patient's response to the applied treatment regimen. (Source: RECIST). | Complete Response, Disease progression, NED, Partial Response, Stable Disease | 0..1 |
| Adverse_Events | Reports any treatment related adverse events. (Codelist reference: NCI-CTCAE (v5.0)) | CTCAE Codes | 0..n |
| Toxicity_Type | If the treatment was terminated early due to acute toxicity, indicates the type of toxicity that caused early termination of treatment. | Children of Toxicity (NCIT) | 0..1 |
| Modality | Indicates the method of radiation treatment or modality. | Electron, Heavy Ions, Photon, Proton | 0..n |
| Fractions | Indicates the total number of fractions delivered as part of radiation treatment. | Integer | 0..1 |
| Site | Indicates the body region where radiation therapy was administered. | Children concepts of Anatomical Structure (body structure) | 0..n |
2.6. Biomarker
| Item | Definition | Value | Cardinality |
|---|---|---|---|
| Purpose | Necessary information to indicate the objective of the biomarker used (coded value, e.g. diagnostic, prognostic,…). | Diagnosis, Prognosis, Prediction, Monitoring | 0..1 |
| Type | What type is the biomarker classified as. | Molecular, Imaging, Anthropometric, Cellular, Physiological | 1..1 |
| Subtype | Subtype of the biomarker as classified in the type. | If Type == Molecular: Genetics/Genomics, Epigenetics/Epigenomics, Transcription/Transcriptomics, Metabolites/Metabolomics, Proteins/Proteomics, Microbiomics/Microbiology, Biochemistry (biochemical), Other molecular biomarker, N/P; Imaging: X-Rays, Ultrasound (echography, etc), CT Scan, PET/SPECT, Spectrometry, MRI, Scintigraphy (Gamma), Mammography, Other image biomarker, N/P; Anthropometric: BMI, Body perimeters (circumference), Other anthropometric biomarker, N/P; Cellular: Histology (tissue abnormalities), Cytology (cell types), Other cellular biomarker, N/P; Physiological: Blood Pressure, Ankle-brachial Index, ECG, EEG, Electromyography, Other physiological biomarker, N/P | 1..1 |
| Name | Biomarker name. | String | 0..1 |
| Code | Code of the biomarker. | String | 0..1 |
| Code System | Code System of the Biomarker Code. | String or URI | 0..1 |
| Date | The date of the biomarker being obtained. | Date (YYYY-MM-DD), ISO 8601 format | 0..1 |
2.7. Environmental Exposure
| Item | Definition | Value | Cardinality |
|---|---|---|---|
| Tobacco Use Status | The status of the patient’s tobacco use. | SCTID:449868002 \| Smokes tobacco daily; SCTID:428041000124106 \| Occasional tobacco smoker; SCTID:43381005 \| Passive smoker; SCTID:8517006 \| Ex-smoker; SCTID:405746006 \| Current non smoker but past smoking history unknown; SCTID:266919005 \| Never smoked tobacco; NullFlavor OTH | 0..1 |
| Type Of Tobacco Used | Type of tobacco the patient uses. | Children of tobacco product combined with children of vaporizer, or HL7 NULL flavors | 1..n (conditional) |
| Smoking amount | The number of cigarettes, cigars or grams of rolling tobacco consumed per day, week, month or year. | amount/day, amount/week, amount/month, or amount/year | 1..1 (conditional) |
| Pack years | The unit indicating the smoker’s total exposure to tobacco smoke. For cigarettes, this is calculated using the number of smoked packs of cigarettes per day (one pack = 20 cigarettes) times the number of years of smoking. For other forms of tobacco, this is usually converted to an equivalent cigarette consumption. Often, only the number of pack years is estimated. | Integer | 1..1 (conditional) |
| Alcohol use status | The status of the patient’s alcohol use. | SCTID:219006 \| Current drinker; SCTID:105542008 \| Non - drinker; SCTID:82581004 \| Ex-drinker; SCTID:783261004 \| Lifetime non-drinker of alcohol; NullFlavor OTH | 0..1 |
| Alcohol amount | The extent of the patient’s alcohol use in units of alcohol per time period (day/week/year). | AU per (day/week/year). 1 AU = 125 ml of wine, 330 ml of beer, 80 ml of drink, 40 ml liquor | 0..n |
| Other Type of Exposure | Indicates exposures other than tobacco/smoking and alcohol use, for example asbestos. This exposure can include direct physical contact, inhalation, ingestion, or residing in close proximity to the source. | Inadequate water (quantity and quality), sanitation and solid waste disposal, improper hygiene (handwashing); Improper water resource management, including poor drainage; Crowded housing and poor ventilation of smoke; Exposures to vehicular and industrial air pollution; Population movement and encroachment and construction, which affect feeding and breeding grounds of vectors, such as mosquitoes; Exposure to naturally occurring toxic substances; Natural resources degradation (for example, landslides, poor drainage, erosion); Climate change, partly from combustion of fossil fuel and release of greenhouse gases in transportation, industry, and poor energy conservation in housing, fuel, commerce, and industry; Ozone depletion from industrial and commercial activity; Biological hazard (Bacteria, Fungi, Parasitic worms, Protozoa, Viruses, Prions); Chemical hazards (Air pollutant, Heavy metal (Arsenic, Mercury, Lead), Pesticides, Other (Formaldehyde, Asbestos, PFAS, PCBs, BPA, Phthalates, Radon, DDT)); Physical environmental hazard (Radiation, Natural disaster (narrow matches: exposure to earthquake, exposure to flooding, exposure to tsunami etc.), Extreme weather, Human activities (e.g. traffic accident)); HL7 NULL flavors | 0..n |
| Other_Other Type of Exposure | If the "Other" option is chosen from the list of other exposures, describe the exposure in a free-text field. | String | 1..1 (conditional) |
| Travel History | Description of any travel within 4 weeks before the diagnosis. | String | 0..n |
| Travel History_start_date | Start date of relevant travel history. | Date (YYYY-MM-DD), ISO 8601 format | 0..n |
| Travel History_end_date | End date of relevant travel history. | Date (YYYY-MM-DD), ISO 8601 format | 0..n |
| Residential area at risk | Description of the area of known environmental exposure conditions where the subject/patient resides. | String | 0..n |
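
The Pack years definition above amounts to a simple calculation: packs smoked per day (one pack = 20 cigarettes) multiplied by the number of years of smoking. The sketch below shows that arithmetic for cigarettes only. The function name is hypothetical, and converting other forms of tobacco to a cigarette equivalent is not covered.

```python
CIGARETTES_PER_PACK = 20  # one pack = 20 cigarettes, as stated in the definition


def pack_years(cigarettes_per_day: float, years_smoked: float) -> float:
    """Packs smoked per day multiplied by the number of years of smoking."""
    if cigarettes_per_day < 0 or years_smoked < 0:
        raise ValueError("inputs must be non-negative")
    return (cigarettes_per_day / CIGARETTES_PER_PACK) * years_smoked


# 30 cigarettes a day (1.5 packs) for 10 years.
print(pack_years(30, 10))  # 15.0
```

Because the item's value type is Integer, a submitter would still need to round the result, for example with `round(pack_years(30, 10))`.
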
3. Matches
Different countries use different standards or vocabularies for their data. For example, in the harmonized minimal data model we mostly use terms from SNOMED CT and NCIT, but some countries only use ICD-10, and it’s difficult for them to convert everything to SNOMED. To make things easier, we decided to focus on matches between terms from different systems. This way, a dataset using ICD-10 can still be included in the GDI Catalogue, as long as we can clearly show how its terms connect to SNOMED or other terms.
For example:
- One dataset might use a SNOMED term to describe a disease.
- Another dataset might describe the same disease using an ICD-10 term.
- Without a match between them, our system would think they are two different things.
- If we add an exact match, the system will understand that they are the same, just described with different vocabularies.
This will make data both human and machine readable.
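
The scenario above can be sketched in a few lines. The example below groups datasets by disease concept, first without and then with an exact match. All identifiers (`SNOMED:example-disease`, `ICD10:example-disease`, the dataset names) are placeholders for illustration, not real codes, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def group_by_concept(
    records: Iterable[Tuple[str, str]], exact_matches: Dict[str, str]
) -> Dict[str, List[str]]:
    """Group dataset IDs by concept, rewriting codes that have an exact match."""
    groups: Dict[str, List[str]] = {}
    for dataset_id, code in records:
        concept = exact_matches.get(code, code)  # fall back to the code itself
        groups.setdefault(concept, []).append(dataset_id)
    return groups


records = [
    ("dataset-A", "SNOMED:example-disease"),
    ("dataset-B", "ICD10:example-disease"),
]

# Without a match, the two datasets look like two different diseases.
print(len(group_by_concept(records, {})))  # 2

# With an exact match, both datasets are found under one concept.
matches = {"ICD10:example-disease": "SNOMED:example-disease"}
print(group_by_concept(records, matches))
```
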
3.1. How To Use Matches
We started by adding different types of matches to all items and values in the model. The types of matches (match attribute), their definitions and examples can be found in the table below.
| Match Attribute | Definition | Example |
|---|---|---|
| skos:exactMatch | Two concepts have the same meaning and can be used interchangeably in all contexts. This link is transitive, meaning if A is an exact match to B, and B is an exact match to C, then A is also an exact match to C. | SNOMED Date of birth has an exact match to EFO Date of birth. Both refer to the same concept with no variation in meaning. |
| skos:closeMatch | Two concepts are sufficiently similar that they can be used interchangeably in many applications and schemes, but they are not strictly identical. This link is not meant to be transitive. | SNOMED Patient has a close match to NCIT Study Participant. A patient is an individual receiving medical care, while a study participant is someone enrolled in a research study. In some cases, these terms can be used interchangeably (e.g., in clinical trials involving patients), but not all study participants are patients—some might be healthy controls. |
| skos:relatedMatch | Two concepts are associated but not similar enough to be considered exact or close matches. | The concepts HGNC BRCA1 and NCIT Breast Carcinoma have a skos:relatedMatch relationship because they are strongly associated but not identical. BRCA1 is a gene, while breast cancer is a disease that may develop as a result of certain BRCA1 mutations. However, not all BRCA1 mutations lead to breast cancer, and not all cases of breast cancer are caused by BRCA1 mutations, making them related but not broader/narrower concepts. |
| skos:broadMatch | The concept represented by the external entity is broader than the item. This is the inverse property of skos:narrowMatch. | The concept NCIT Breast Carcinoma has a skos:broadMatch relationship to SNOMED Cancer. This means that Cancer is the broader concept because it represents a general category that includes multiple types of cancer, one of which is Breast Cancer. |
| skos:narrowMatch | The concept represented by the external entity is narrower than the item. This is the inverse property of skos:broadMatch. | The concept SNOMED Cancer has a skos:narrowMatch relationship to NCIT Breast Carcinoma. This means that Breast Cancer is a specific subtype of Cancer. |
Next to the matches, we also record the source of each match. If the mapping has already been done, for example SNOMED to ICD-10 in the SNOMED browser or in the T-Rex browser, we include this information with the match in the spreadsheet.
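
Because skos:exactMatch is transitive, a set of pairwise exact matches implies clusters of interchangeable terms. The sketch below computes those clusters with a simple graph traversal. The term identifiers are placeholders and the function name is hypothetical. Only exact matches should be fed into it, since skos:closeMatch is not meant to be transitive.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Tuple


def exact_match_clusters(
    pairs: Iterable[Tuple[str, str]]
) -> Dict[str, FrozenSet[str]]:
    """Map every term to the set of terms reachable via exact matches."""
    neighbours = defaultdict(set)
    for left, right in pairs:
        neighbours[left].add(right)
        neighbours[right].add(left)  # treat the match as bidirectional

    clusters: Dict[str, FrozenSet[str]] = {}
    for start in neighbours:
        if start in clusters:
            continue
        component = set()
        stack = [start]
        while stack:  # depth-first walk over the match graph
            term = stack.pop()
            if term in component:
                continue
            component.add(term)
            stack.extend(neighbours[term] - component)
        frozen = frozenset(component)
        for term in component:
            clusters[term] = frozen
    return clusters


# A matches B and B matches C, so A and C end up in the same cluster.
clusters = exact_match_clusters([("A", "B"), ("B", "C"), ("D", "E")])
print("C" in clusters["A"])  # True
print("D" in clusters["A"])  # False
```
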
3.2. Using SSSOM
From now on, we follow SSSOM (Simple Standard for Sharing Ontological Mappings), a standard format for describing matches between terms. We will:
- Add all required SSSOM elements.
- Include recommended and optional elements if they are useful.
- Make sure our matches follow the SSSOM rules.
Using this model and the SSSOM standard will help us:
- Make it easier for teams using different vocabularies to share their data.
- Let people and software search and analyze the data more effectively.
- Support more countries and teams to join the GDI Catalogue without needing to change everything.
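
As an illustration of what a match could look like in SSSOM's tab-separated form, the sketch below writes a simplified mapping file using only the Python standard library. It assumes the required mapping columns are `subject_id`, `predicate_id`, `object_id`, and `mapping_justification`, and that the mapping set needs a `mapping_set_id` and a `license`; the SSSOM specification remains the authority on this. The commented header is simplified (a real file would normally also carry a `curie_map`), and the identifiers and URLs are placeholders.

```python
import csv
import io
from typing import Dict, Iterable

# Assumed required mapping columns; verify against the SSSOM specification.
REQUIRED_COLUMNS = ["subject_id", "predicate_id", "object_id", "mapping_justification"]


def to_sssom_tsv(
    mappings: Iterable[Dict[str, str]], mapping_set_id: str, license_url: str
) -> str:
    """Serialise mappings as a simplified SSSOM-style TSV string."""
    buffer = io.StringIO()
    # Mapping-set metadata goes into '#'-prefixed header lines.
    buffer.write(f"# mapping_set_id: {mapping_set_id}\n")
    buffer.write(f"# license: {license_url}\n")
    writer = csv.DictWriter(
        buffer, fieldnames=REQUIRED_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for mapping in mappings:
        missing = [column for column in REQUIRED_COLUMNS if not mapping.get(column)]
        if missing:
            raise ValueError(f"mapping is missing required columns: {missing}")
        writer.writerow({column: mapping[column] for column in REQUIRED_COLUMNS})
    return buffer.getvalue()


example = {
    "subject_id": "SNOMED:example-disease",  # placeholder, not a real code
    "predicate_id": "skos:exactMatch",
    "object_id": "ICD10:example-disease",    # placeholder, not a real code
    "mapping_justification": "semapv:ManualMappingCuration",
}
tsv = to_sssom_tsv(
    [example],
    mapping_set_id="https://example.org/mappings/hmdm",  # hypothetical URL
    license_url="https://creativecommons.org/licenses/by/4.0/",
)
print(tsv)
```
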