Metadata loader

Load dataset dictionaries and access all datasets and fields in a flexible manner.

source

MetaLoader

 MetaLoader (base_path:str='nbs/examples/', cohort:str=None,
             flexible_field_search:bool=False, errors:str='warn',
             **kwargs)

*Class to load multiple data dictionaries and allow easy access to the relevant fields.

Args:

base_path (str, optional): The base path where the data is stored. Defaults to DATASETS_PATH.
cohort (str, optional): The name of the cohort within the dataset. Defaults to COHORT.
flexible_field_search (bool, optional): Whether to allow regex field search. Defaults to False.
errors (str, optional): Whether to raise an error or issue a warning if missing data is encountered.
    Possible values are 'raise', 'warn' and 'ignore'. Defaults to 'warn'.
**kwargs: Additional keyword arguments to pass to a DataLoader class.

Attributes:

dicts (pd.DataFrame): A dictionary of data dictionaries (dataframes) of all available datasets in the base_path.
fields (list): A list of all fields.
cohort (str): The name of the cohort being used.
base_path (str): The base path where the data is stored.
flexible_field_search (bool): Whether to allow regex field search.
errors (str): Whether to raise an error or issue a warning if missing data is encountered.
kwargs (dict): Additional keyword arguments to pass to a DataLoader class.*
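For reference, a hedged sketch of constructing the loader with non-default arguments (the base_path and cohort values below are placeholders, not real paths or cohort names):

ml = MetaLoader(
    base_path='/path/to/datasets',   # placeholder; defaults to DATASETS_PATH
    cohort='10k',                    # placeholder cohort name; defaults to COHORT
    flexible_field_search=True,      # allow regex field search
    errors='ignore',                 # silently skip missing fields
)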
# os.path.join(DATASETS_PATH, '*')
os.listdir(DATASETS_PATH)
['cgm', 'diet_logging', 'events', 'fundus', 'sleep', 'metadata']

The MetaLoader can be used to query all available fields throughout all datasets. In the following example, 4 datasets are available.

ml = MetaLoader()
ml
MetaLoader for: examples/*
with 83 fields
4 datasets:
['cgm'
 'diet_logging'
 'fundus'
 'sleep']
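
The attributes documented above can be inspected directly. A small illustrative sketch (no output shown, since the exact field list depends on the datasets under base_path):

len(ml.fields)   # total number of fields across all datasets (83 in this example)
ml.fields[:5]    # peek at the first few field names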

The object contains only the data dictionaries (metadata) of these datasets; each dictionary's columns correspond to the columns (fields) in that dataset's data tables (e.g., fundus).

ml.dicts['fundus']
tabular_field_name fundus_image_left fundus_image_right collection_date fractal_dimension_left fractal_dimension_right
dataset fundus fundus fundus fundus fundus
field_string Fundus image (left) Fundus image (right) Collection date (YYYY-MM-DD) Fractal dimension (left) Fractal dimension (right)
description_string Fundus image (left) Fundus image (right) Collection date (YYYY-MM-DD) Fractal dimension (left) Fractal dimension (right)
parent_dataframe <NA> <NA> <NA> <NA> <NA>
relative_location fundus/fundus.parquet fundus/fundus.parquet fundus/fundus.parquet fundus/fundus.parquet fundus/fundus.parquet
value_type Text Text Date Continuous Continuous
units <NA> <NA> Time Time Time
sampling_rate <NA> <NA> <NA> <NA> <NA>
field_type Image file (individual) Image file (individual) Data Data Data
array Single Single Single Single Single
cohorts 10K 10K 10K 10K 10K
data_type image image tabular tabular tabular
debut 2021-02-17 2021-02-17 2021-02-17 2021-02-17 2021-02-17
pandas_dtype string string datetime64[ns] float float
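
Assuming dicts is a plain mapping from dataset name to its data dictionary (as the ml.dicts['fundus'] access above suggests), you can summarize every dataset in one pass:

for name, data_dict in ml.dicts.items():
    # in the orientation shown above, each field is one column of the dictionary
    print(name, data_dict.shape[1], 'fields')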

You can query fields from multiple datasets directly.

ml[['glucose', 'fundus_image_left', 'fundus/collection_date']]
tabular_field_name cgm/glucose fundus/fundus_image_left fundus/collection_date
dataset cgm fundus fundus
field_string Glucose Fundus image (left) Collection date (YYYY-MM-DD)
description_string cgm temporal glucose values Fundus image (left) Collection date (YYYY-MM-DD)
parent_dataframe <NA> <NA> <NA>
relative_location cgm/cgm.parquet fundus/fundus.parquet fundus/fundus.parquet
value_type Series data, continous Text Date
units mg/dl <NA> Time
sampling_rate 15min <NA> <NA>
field_type Continuous Image file (individual) Data
array Single Single Single
cohorts 10K 10K 10K
data_type time series image tabular
debut 2018-12-27 2021-02-17 2021-02-17
pandas_dtype float string datetime64[ns]

Note that in the example above, collection_date (a field common to all datasets) is requested with the dataset prefix fundus/, so the loader returns the field from the fundus imaging dataset. Omitting this prefix will return all collection_date fields in the Human Phenotype Project.
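
A short illustration of the difference (the second query's result will contain one collection_date column per dataset that defines it):

ml[['fundus/collection_date']]   # only the fundus dataset's collection_date
ml[['collection_date']]          # collection_date from every dataset that has one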

You can then use the MetaLoader to load the actual data for fields from multiple datasets. Here we load glucose from the CGM dataset and fundus_image_left from the fundus dataset.

df1 = ml.load(['glucose'])
df1
glucose
participant_id collection_timestamp connection_id
0 2020-05-25 10:48:00+03:00 1000001 111.6
2020-05-25 11:03:00+03:00 1000001 79.2
2020-05-25 11:18:00+03:00 1000001 84.6
2020-05-25 11:33:00+03:00 1000001 106.2
2020-05-25 11:48:00+03:00 1000001 102.6
... ... ...
2020-05-30 15:03:00+03:00 1000001 90.0
2020-05-30 15:18:00+03:00 1000001 99.0
2020-05-30 15:33:00+03:00 1000001 106.2
2020-05-30 15:48:00+03:00 1000001 99.0
2020-05-30 16:03:00+03:00 1000001 84.6

502 rows × 1 columns
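
The result is a regular pandas DataFrame indexed by participant_id, collection_timestamp and connection_id, so standard operations apply; for example, a per-participant glucose summary:

df1.groupby(level='participant_id')['glucose'].agg(['mean', 'std', 'count'])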

df2 = ml.load(['fundus_image_left'])
df2
fundus_image_left
participant_id cohort research_stage array_index
0 10k 00_00_visit 0 /path/to/file
1 10k 00_00_visit 0 /path/to/file
2 10k 00_00_visit 0 /path/to/file
3 10k 00_00_visit 0 /path/to/file
4 10k 00_00_visit 0 /path/to/file
df3 = ml.load(['glucose', 'fundus_image_left'])
df3
glucose fundus_image_left
participant_id collection_timestamp connection_id cohort research_stage
0 2020-05-25 10:48:00+03:00 1000001 10k 00_00_visit 111.6 /path/to/file
2020-05-25 11:03:00+03:00 1000001 10k 00_00_visit 79.2 /path/to/file
2020-05-25 11:18:00+03:00 1000001 10k 00_00_visit 84.6 /path/to/file
2020-05-25 11:33:00+03:00 1000001 10k 00_00_visit 106.2 /path/to/file
2020-05-25 11:48:00+03:00 1000001 10k 00_00_visit 102.6 /path/to/file
... ... ... ... ... ...
2020-05-30 16:03:00+03:00 1000001 10k 00_00_visit 84.6 /path/to/file
1 NaT NaN 10k 00_00_visit NaN /path/to/file
2 NaT NaN 10k 00_00_visit NaN /path/to/file
3 NaT NaN 10k 00_00_visit NaN /path/to/file
4 NaT NaN 10k 00_00_visit NaN /path/to/file

506 rows × 2 columns
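
Fundus records that have no matching CGM samples appear with NaN glucose (rows 1-4 above). If only complete rows are needed, standard pandas filtering applies:

df3.dropna(subset=['glucose', 'fundus_image_left'])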

ml.load(['glucose', 'fundus_image_left', 'fundus/collection_date']).head()
glucose fundus_image_left collection_date
participant_id collection_timestamp connection_id cohort research_stage
0 2020-05-25 10:48:00+03:00 1000001 10k 00_00_visit 111.6 /path/to/file 2022-11-16
2020-05-25 11:03:00+03:00 1000001 10k 00_00_visit 79.2 /path/to/file 2022-11-16
2020-05-25 11:18:00+03:00 1000001 10k 00_00_visit 84.6 /path/to/file 2022-11-16
2020-05-25 11:33:00+03:00 1000001 10k 00_00_visit 106.2 /path/to/file 2022-11-16
2020-05-25 11:48:00+03:00 1000001 10k 00_00_visit 102.6 /path/to/file 2022-11-16

You may use more flexible search queries using regex and various properties of the fields. Both the get() and load() methods support the same syntax.

  1. Example: get all bulk data fields.
ml.get('Time series', flexible=True, prop='field_type')
tabular_field_name cgm/cgm_filename
dataset cgm
field_string CGM timeseries
description_string Name of the file containing the participants' ...
parent_dataframe <NA>
relative_location cgm/cgm.parquet
value_type Text
units <NA>
sampling_rate <NA>
field_type Time series file (individual)
array Single
cohorts 10K
data_type text
debut 2018-12-27
pandas_dtype string
ml.get('Time series file (individual)', prop='field_type')
tabular_field_name cgm/cgm_filename
dataset cgm
field_string CGM timeseries
description_string Name of the file containing the participants' ...
parent_dataframe <NA>
relative_location cgm/cgm.parquet
value_type Text
units <NA>
sampling_rate <NA>
field_type Time series file (individual)
array Single
cohorts 10K
data_type text
debut 2018-12-27
pandas_dtype string
ml.load('Image file (individual)', prop='field_type')
fundus_image_left fundus_image_right
participant_id cohort research_stage array_index
0 10k 00_00_visit 0 /path/to/file /path/to/file
1 10k 00_00_visit 0 /path/to/file /path/to/file
2 10k 00_00_visit 0 /path/to/file /path/to/file
3 10k 00_00_visit 0 /path/to/file /path/to/file
4 10k 00_00_visit 0 /path/to/file /path/to/file
  2. Example: get all fields that include “mg” in their units
ml.get('mg', flexible=True, prop='units')
tabular_field_name cgm/1st qu_ cgm/3rd qu_ cgm/auc cgm/ea1c cgm/glucose cgm/gmi cgm/iqr cgm/mad cgm/mag cgm/mage ... cgm/modd cgm/range cgm/sd cgm/sdb cgm/sdbdm cgm/sddm cgm/sdhhmm cgm/sdw cgm/sdwsh diet_logging/sodium_mg
dataset cgm cgm cgm cgm cgm cgm cgm cgm cgm cgm ... cgm cgm cgm cgm cgm cgm cgm cgm cgm diet_logging
field_string 1st quantile 3rd quantile AUC eA1C Glucose GMI IQR MAD MAG MAGE ... MODD Range SD SDb SDbdm SDdm SDhhmm SDw SDwsh Sodium intake per food logged
description_string First quantile of all glucose values. Third quantile of all glucose values. Hourly average AUC. This measure integrates, t... A linear transformation of the mean glucose va... cgm temporal glucose values A linear transformation of the mean glucose va... Interquartile range (IQR), calculated as the d... Median Absolute Deviation (MAD). This is a mea... Mean Absolute Glucose (MAG). This is a measure... Mean Amplitude of Glycemic Excursions (MAGE), ... ... Mean difference between glucose values obtaine... Difference between the maximum and minimum glu... Standard deviation of all glucose values. SD between days, within time points. Mean valu... SD between days, within time points, corrected... Horizontal SD. SD of the mean glucose values, ... SD between time points. Standard deviation of ... Vertical SD within days. Average value of the ... SD within series. Taking hour-long intervals t... Sodium intake per food logged
parent_dataframe <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
relative_location cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet ... cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet diet_logging/diet_logging.parquet
value_type Continuous Continuous Continuous Continuous Series data, continous Continuous Continuous Continuous Continuous Continuous ... Continuous Continuous Continuous Continuous Continuous Continuous Continuous Continuous Continuous Continuous
units mg/dl mg/dl mg/dl*h mg/dl mg/dl mg/dl mg/dl mg/dl mg/dl mg/dl ... mg/dl mg/dl mg/dl mg/dl mg/dl mg/dl mg/dl mg/dl mg/dl mg
sampling_rate <NA> <NA> <NA> <NA> 15min <NA> <NA> <NA> <NA> <NA> ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
field_type Data Data Data Data Continuous Data Data Data Data Data ... Data Data Data Data Data Data Data Data Data Data
array Single Single Single Single Single Single Single Single Single Single ... Single Single Single Single Single Single Single Single Single Single
cohorts 10K 10K 10K 10K 10K 10K 10K 10K 10K 10K ... 10K 10K 10K 10K 10K 10K 10K 10K 10K 10K
data_type tabular tabular tabular tabular time series tabular tabular tabular tabular tabular ... tabular tabular tabular tabular tabular tabular tabular tabular tabular Time Series
debut 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 ... 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2019-09-01
pandas_dtype float float float float float float float float float float ... float float float float float float float float float float

14 rows × 24 columns
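
Since get() returns the matching slice of the data dictionary with one column per field, its column labels can be fed back into load(). A hedged sketch, assuming load() accepts the dataset-prefixed names that appear in those columns (as in the fundus/collection_date example above):

mg_fields = ml.get('mg', flexible=True, prop='units')
ml.load(mg_fields.columns.tolist())   # load the actual data for every matched field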