Workshop

This section is intended as a training exercise and is not required to use the FAIR Data Station.

Exercise: Data, Metadata Minimal Information Models and Data Interoperability (Version 1)

Online repositories sharing scientific data are vital for the advancement of science. Data sharing improves research transparency, promotes the validation of experimental methods and scientific conclusions, enables data reuse, and facilitates knowledge discovery using new analysis tools. Essential for reusing scientific data is the availability of machine-readable metadata about the scientific experiments conducted with a degree of completeness that reflects the FAIR guiding principles: Findable, Accessible, Interoperable, Reusable.

Several tools have been created to help make data FAIR. The ISA metadata framework standard outlines a model for capturing experiment metadata using 3 levels: Investigation, Study, and Assay. A key feature of properly FAIRified data is a high level of data interoperability. From a data producer/user point of view, two levels are important: structural and semantic interoperability.
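As a rough illustration of the three ISA levels, the nesting can be sketched with plain Python dataclasses. This is a minimal sketch, not the official ISA schema: the class fields (`identifier`, `title`, `measurement_type`) and the example values are assumptions chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    # Hypothetical field: what was measured in this assay.
    measurement_type: str


@dataclass
class Study:
    # Hypothetical field: a local identifier for the study.
    identifier: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    # Hypothetical fields: overarching project information.
    identifier: str
    title: str
    studies: List[Study] = field(default_factory=list)


def count_assays(investigation: Investigation) -> int:
    """Count the assays across all studies of one investigation."""
    return sum(len(study.assays) for study in investigation.studies)


# Illustrative values only; none of them come from a real submission.
investigation = Investigation(
    identifier="example-investigation",
    title="Example investigation",
    studies=[
        Study(
            identifier="example-study",
            assays=[Assay(measurement_type="amplicon sequencing")],
        )
    ],
)
print(count_assays(investigation))
```

The point of the nesting is that project-wide context is recorded once at the top level, while measurement details live at the Assay level.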

Structural interoperability defines the format of the (meta)data, allowing the data to be interpreted by multiple systems. For example, the FASTA format is one of the most widely implemented machine-actionable standards for sequence data and is therefore directly understood by many sequence analysis tools.
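To make the idea of structural interoperability concrete, here is a minimal sketch of a FASTA reader in Python. It assumes the common convention that a header line starts with `>` and that the sequence may span several lines; the example records are invented, and a real pipeline would normally use an established parser rather than this function.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} mapping."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line  # join multi-line sequences
    return records


# Invented example records, for illustration only.
example = """>seq1 hypothetical example record
ACGTACGT
ACGT
>seq2
TTGGCC
"""
records = parse_fasta(example.splitlines())
for name, sequence in records.items():
    print(name, len(sequence))
```

Because the format is this simple and fixed, any tool that follows the same convention can read the same file without further agreement between producer and consumer.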

Semantic interoperability entails the transformation of ambiguous, human-understandable metadata into a standardized, machine-actionable open format, allowing computational support systems to automatically find, access, and reuse data. To ensure that the set of metadata is sufficient to describe the data unambiguously, standardized minimal information models and checklists detailing those requirements have been developed for a wide array of experimental data.

In this exercise, we study the metadata of ENA project PRJDB10485, available at ENA Browser Project PRJDB10485. Here we can learn about the background of this study. In the ISA framework, this overarching information would be placed at the Investigation/Study level.

“The aim of this project is to analyze the dysbiosis of fecal microbiome in HIV-1 infected individuals in Ghana. Gut microbiome dysbiosis has been correlated to the progression of non-AIDS diseases such as cardiovascular and metabolic disorders. Because the microbiome composition is different among races and countries, analyses of the composition in different regions is important to understand the pathogenesis unique to specific regions. In the present study, we examined fecal microbiome compositions in HIV-1 infected individuals in Ghana. In a cross-sectional case-control study, age- and gender-matched HIV-1 infected Ghanaian adults (HIV-1 [+]; n = 55) and seronegative controls (HIV-1 [-]; n = 55) were enrolled.”

Subsequently, we can download the metadata XML files of the 55 HIV-1 infected adults and the 55 seronegative controls and open them in Excel, starting from ENA Browser Sample SAMD00244427, or use the ENA Browser Sample Metadata link to get the associated metadata directly.
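As an alternative to Excel, the attributes can be read programmatically. The sketch below uses Python's standard `xml.etree.ElementTree` module and assumes that the downloaded file stores each attribute as a `SAMPLE_ATTRIBUTE` element with `TAG` and `VALUE` children; this layout is an assumption and should be verified against a file you actually download. The embedded XML is a hand-written stand-in rather than a real download, with tag/value pairs taken from the sample attributes listed in this exercise.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hand-written stand-in for a downloaded sample XML file (assumed layout).
SAMPLE_XML = """<SAMPLE_SET>
  <SAMPLE accession="SAMD00244427">
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE><TAG>collection date</TAG><VALUE>2017-09-13</VALUE></SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE><TAG>geo loc name</TAG><VALUE>Ghana:Koforidua</VALUE></SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE><TAG>sex</TAG><VALUE>female</VALUE></SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>"""


def sample_attributes(xml_text: str) -> Dict[str, str]:
    """Collect TAG/VALUE pairs from every SAMPLE_ATTRIBUTE element."""
    root = ET.fromstring(xml_text)
    attributes: Dict[str, str] = {}
    for attr in root.iter("SAMPLE_ATTRIBUTE"):
        tag = attr.findtext("TAG", default="").strip()
        value = attr.findtext("VALUE", default="").strip()
        if tag:  # ignore attributes without a name
            attributes[tag] = value
    return attributes


attributes = sample_attributes(SAMPLE_XML)
for tag, value in attributes.items():
    print(f"{tag}: {value}")
```

Reading the file into a plain dictionary keeps the attribute names exactly as submitted, which is what the interoperability check in this exercise needs.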

Ontologies and minimal information models are both essential for managing and representing this information, but they have distinct purposes: ontologies focus on capturing domain knowledge and semantics, whereas minimal information models focus on standardizing data reporting for a specific type of research.

In an ontology, an attribute refers to a property or characteristic that is associated with a particular entity or concept within a domain. For example, an ontology describing dogs would include properties such as “breed,” “shoulder height,” and “weight.” These attributes help to distinguish dogs based on their specific characteristics and are typically represented as key-value pairs (e.g., weight: 30 kg), where the key denotes the name or label of the attribute, and the value specifies the corresponding property or characteristic associated with the entity.

So the question is: what are the attributes associated with the samples obtained in this study? Are these attributes understandable by a computer (and are they interoperable)? And how do these attributes align with the proposed minimal information model for this type of study (here human gut: ERC000015)?

In the table below we have collected the attributes (metadata types) associated with the 110 samples obtained in this study. (The values will vary depending on the origin of the sample; shown are the values linked to the first sample.)

| Type | Interoperable? (Yes/No/Alternative) | Value |
| --- | --- | --- |
| ART_drugs_current | No | TDF/3TC/EFV |
| ART_duration_at baseline_months | | 29 |
| ART_start_date | | 2015-04-08 |
| ART_status_at_baseline | | ART |
| CD4_count(cells/ul) | | 473 |
| CD8_count(cells/ul) | | 1155 |
| Co-trimoxazole duration (mths) | | 24 |
| External Id | | SAMD00244427 |
| HIV_risk_exposure | No | Heterosexual |
| INSDC center name | | AIDS Research Center, National Institute of Infectious Diseases |
| INSDC first public | | 2021-03-26T00:00:00Z |
| INSDC last update | | 2024-01-14T05:42:57.197Z |
| INSDC secondary accession | | DRS176868 |
| INSDC status | | live |
| Marital_status | | Single |
| NCBI submission model | | MIMARKS.survey.human-gut |
| NCBI submission package | | MIMARKS.survey.human-gut.6.0 |
| SRA accession | | DRS176868 |
| Viral_load (copies/ml) | | 15746 |
| age | | host age: 23 |
| collection date | | 2017-09-13 |
| description | | Keywords: GSC:MIxS;MIMARKS:6.0 |
| education | | Secondary school |
| env_broad_scale | | human gut |
| env_local_scale | | human gut environment |
| env_medium | | fecal material |
| geo loc name | | Ghana:Koforidua |
| host | | Homo sapiens |
| host_disease_stat | | HIV-1 positive |
| log VL | | 4.197170247 |
| occupation | | Hair dresser |
| organism | | human gut metagenome |
| project name | | Dysbiotic fecal microbiome in HIV-1 infected individuals in Ghana |
| sample name | | 5P |
| sex | | female |
| title | | 16s rDNA sequence from fecal sample of HIV-1 infected female from Koforidua, Ghana, sample ID HG-P-006-KO |

  1. Go to ENA Checklists. How many checklists are currently available?
  2. Using the ENA checklists, select the appropriate minimal information model (human gut: ERC000015). Check the interoperability of the currently used attributes (Column 1) and replace them with the interoperable version when available.
  3. FAIR should go hand in hand with privacy regulations. Which of these attributes, alone or in combination, would invade the privacy of the tested persons?
  4. Building on existing domain-specific principles and workflows on the one hand and reaching a maximum level of cross-domain interoperability on the other are competing goals. Which of the attributes are domain specific? Hint: cross-check with the ENA default checklist -> ERC000011 (ENA Browser ERC000011).
  5. Which attributes are too specific to be useful for data reuse?
  6. Which attributes would be essential for data reuse?
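
The comparison asked for in question 2 can be partly automated. The following is a minimal sketch under stated assumptions: `PLACEHOLDER_CHECKLIST_FIELDS` is an illustrative stand-in, not a verified copy of ERC000015, and should be replaced with the field names from the checklist you download; `Collection_Date` is a hypothetical attribute name added only to show how a near-match is reported. The function names `normalize` and `classify` are likewise made up for this example.

```python
from typing import Dict, Iterable


def normalize(name: str) -> str:
    """Lowercase a field name and treat underscores as spaces."""
    return " ".join(name.replace("_", " ").lower().split())


def classify(attributes: Iterable[str], checklist_fields: Iterable[str]) -> Dict[str, str]:
    """Label each attribute as 'Yes', 'Alternative: <field>', or 'No'."""
    exact = set(checklist_fields)
    by_normalized = {normalize(f): f for f in exact}
    result: Dict[str, str] = {}
    for attr in attributes:
        if attr in exact:
            result[attr] = "Yes"  # identical to a checklist field
        elif normalize(attr) in by_normalized:
            # Same name apart from case or underscores.
            result[attr] = "Alternative: " + by_normalized[normalize(attr)]
        else:
            result[attr] = "No"  # needs manual review
    return result


# Placeholder only: replace with the real field names of the chosen checklist.
PLACEHOLDER_CHECKLIST_FIELDS = ["collection date"]

report = classify(
    ["collection date", "Collection_Date", "Marital_status"],
    PLACEHOLDER_CHECKLIST_FIELDS,
)
for attribute, verdict in report.items():
    print(f"{attribute}: {verdict}")
```

String matching of this kind only catches spelling variants. Attributes that express the same concept under a different name still have to be mapped by hand, which is the core of the exercise.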