Metadata loader

Load dataset dictionaries and access all datasets and fields in a flexible manner.

source

MetaLoader

 MetaLoader (base_path:str='nbs/examples/', cohort:str=None,
             flexible_field_search:bool=False, errors:str='warn',
             **kwargs)

*Class to load multiple data dictionaries and allow easy access to the relevant fields.

Args:

base_path (str, optional): The base path where the data is stored. Defaults to DATASETS_PATH.
cohort (str, optional): The name of the cohort within the dataset. Defaults to COHORT.
flexible_field_search (bool, optional): Whether to allow regex field search. Defaults to False.
errors (str, optional): Whether to raise an error or issue a warning if missing data is encountered.
    Possible values are 'raise', 'warn' and 'ignore'. Defaults to 'warn'.
**kwargs: Additional keyword arguments to pass to a DataLoader class.

Attributes:

dicts (pd.DataFrame): A dictionary of data dictionaries (dataframes) of all available datasets in the base_path.
fields (list): A list of all fields.
cohort (str): The name of the cohort being used.
base_path (str): The base path where the data is stored.
flexible_field_search (bool): Whether to allow regex field search.
errors (str): Whether to raise an error or issue a warning if missing data is encountered.
kwargs (dict): Additional keyword arguments to pass to a DataLoader class.*
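For reference, a hedged sketch of constructing the loader with non-default arguments (the base_path and cohort values below are placeholders, not real paths or cohort names):

ml = MetaLoader(
    base_path='/path/to/datasets',   # placeholder; defaults to DATASETS_PATH
    cohort='10k',                    # placeholder cohort name; defaults to COHORT
    flexible_field_search=True,      # allow regex field search
    errors='ignore',                 # silently skip missing fields
)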
# os.path.join(DATASETS_PATH, '*')
os.listdir(DATASETS_PATH)
['cgm', 'diet_logging', 'events', 'fundus', 'sleep', 'metadata']

The MetaLoader can be used to query all available fields throughout all datasets. In the following example, 4 datasets are available.

ml = MetaLoader()
ml
MetaLoader for: examples/*
with 83 fields
4 datasets:
['cgm'
 'diet_logging'
 'fundus'
 'sleep']
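
The attributes documented above can be inspected directly. A small illustrative sketch (no output shown, since the exact field list depends on the datasets under base_path):

len(ml.fields)   # total number of fields across all datasets (83 in this example)
ml.fields[:5]    # peek at the first few field names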

The object contains only the data dictionaries (metadata) of these datasets; each dictionary's columns correspond to the columns (fields) in that dataset's data tables (e.g., fundus).

ml.dicts['fundus']
tabular_field_name fundus_image_left fundus_image_right collection_date fractal_dimension_left fractal_dimension_right
dataset fundus fundus fundus fundus fundus
field_string Fundus image (left) Fundus image (right) Collection date (YYYY-MM-DD) Fractal dimension (left) Fractal dimension (right)
description_string Fundus image (left) Fundus image (right) Collection date (YYYY-MM-DD) Fractal dimension (left) Fractal dimension (right)
parent_dataframe <NA> <NA> <NA> <NA> <NA>
relative_location fundus/fundus.parquet fundus/fundus.parquet fundus/fundus.parquet fundus/fundus.parquet fundus/fundus.parquet
value_type Text Text Date Continuous Continuous
units <NA> <NA> Time Time Time
sampling_rate <NA> <NA> <NA> <NA> <NA>
field_type Image file (individual) Image file (individual) Data Data Data
array Single Single Single Single Single
cohorts 10K 10K 10K 10K 10K
data_type image image tabular tabular tabular
debut 2021-02-17 2021-02-17 2021-02-17 2021-02-17 2021-02-17
pandas_dtype string string datetime64[ns] float float
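
Assuming dicts is a plain mapping from dataset name to its data dictionary (as the ml.dicts['fundus'] access above suggests), you can summarize every dataset in one pass:

for name, data_dict in ml.dicts.items():
    # in the orientation shown above, each field is one column of the dictionary
    print(name, data_dict.shape[1], 'fields')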

You can query fields from multiple datasets directly.

ml[['glucose', 'fundus_image_left', 'fundus/collection_date']]
tabular_field_name cgm/glucose fundus/fundus_image_left fundus/collection_date
dataset cgm fundus fundus
field_string Glucose Fundus image (left) Collection date (YYYY-MM-DD)
description_string cgm temporal glucose values Fundus image (left) Collection date (YYYY-MM-DD)
parent_dataframe <NA> <NA> <NA>
relative_location cgm/cgm.parquet fundus/fundus.parquet fundus/fundus.parquet
value_type Series data, continous Text Date
units mg/dl <NA> Time
sampling_rate 15min <NA> <NA>
field_type Continuous Image file (individual) Data
array Single Single Single
cohorts 10K 10K 10K
data_type time series image tabular
debut 2018-12-27 2021-02-17 2021-02-17
pandas_dtype float string datetime64[ns]

Note that in the example above, collection_date (a field common to all datasets) is requested with the dataset prefix fundus/, so the loader returns the field from the fundus imaging dataset. Omitting this prefix will return all collection_date fields in the Human Phenotype Project.
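
A short illustration of the difference (the second query's result will contain one collection_date column per dataset that defines it):

ml[['fundus/collection_date']]   # only the fundus dataset's collection_date
ml[['collection_date']]          # collection_date from every dataset that has one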

You can then use the MetaLoader to load the actual data for fields from multiple datasets. Here we load glucose from the CGM dataset and fundus_image_left from the fundus dataset.

df1 = ml.load(['glucose'])
df1
glucose
participant_id collection_timestamp connection_id
0 2020-05-25 10:48:00+03:00 1000001 111.6
2020-05-25 11:03:00+03:00 1000001 79.2
2020-05-25 11:18:00+03:00 1000001 84.6
2020-05-25 11:33:00+03:00 1000001 106.2
2020-05-25 11:48:00+03:00 1000001 102.6
... ... ...
2020-05-30 15:03:00+03:00 1000001 90.0
2020-05-30 15:18:00+03:00 1000001 99.0
2020-05-30 15:33:00+03:00 1000001 106.2
2020-05-30 15:48:00+03:00 1000001 99.0
2020-05-30 16:03:00+03:00 1000001 84.6

502 rows × 1 columns
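
The result is a regular pandas DataFrame indexed by participant_id, collection_timestamp and connection_id, so standard operations apply; for example, a per-participant glucose summary:

df1.groupby(level='participant_id')['glucose'].agg(['mean', 'std', 'count'])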

df2 = ml.load(['fundus_image_left'])
df2
fundus_image_left
participant_id cohort research_stage array_index
0 10k 00_00_visit 0 /path/to/file
1 10k 00_00_visit 0 /path/to/file
2 10k 00_00_visit 0 /path/to/file
3 10k 00_00_visit 0 /path/to/file
4 10k 00_00_visit 0 /path/to/file
df3 = ml.load(['glucose', 'fundus_image_left'])
df3
glucose fundus_image_left
participant_id collection_timestamp connection_id cohort research_stage
0 2020-05-25 10:48:00+03:00 1000001 10k 00_00_visit 111.6 /path/to/file
2020-05-25 11:03:00+03:00 1000001 10k 00_00_visit 79.2 /path/to/file
2020-05-25 11:18:00+03:00 1000001 10k 00_00_visit 84.6 /path/to/file
2020-05-25 11:33:00+03:00 1000001 10k 00_00_visit 106.2 /path/to/file
2020-05-25 11:48:00+03:00 1000001 10k 00_00_visit 102.6 /path/to/file
... ... ... ... ... ...
2020-05-30 16:03:00+03:00 1000001 10k 00_00_visit 84.6 /path/to/file
1 NaT NaN 10k 00_00_visit NaN /path/to/file
2 NaT NaN 10k 00_00_visit NaN /path/to/file
3 NaT NaN 10k 00_00_visit NaN /path/to/file
4 NaT NaN 10k 00_00_visit NaN /path/to/file

506 rows × 2 columns
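
Fundus records that have no matching CGM samples appear with NaN glucose (rows 1-4 above). If only complete rows are needed, standard pandas filtering applies:

df3.dropna(subset=['glucose', 'fundus_image_left'])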

ml.load(['glucose', 'fundus_image_left', 'fundus/collection_date']).head()
glucose fundus_image_left collection_date
participant_id collection_timestamp connection_id cohort research_stage
0 2020-05-25 10:48:00+03:00 1000001 10k 00_00_visit 111.6 /path/to/file 2022-11-16
2020-05-25 11:03:00+03:00 1000001 10k 00_00_visit 79.2 /path/to/file 2022-11-16
2020-05-25 11:18:00+03:00 1000001 10k 00_00_visit 84.6 /path/to/file 2022-11-16
2020-05-25 11:33:00+03:00 1000001 10k 00_00_visit 106.2 /path/to/file 2022-11-16
2020-05-25 11:48:00+03:00 1000001 10k 00_00_visit 102.6 /path/to/file 2022-11-16

You may use more flexible search queries using regex and various properties of the fields. Both the get() and load() methods support the same syntax.

  1. Example: get all bulk data fields.
ml.get('Time series', flexible=True, prop='field_type')
tabular_field_name cgm/cgm_filename
dataset cgm
field_string CGM timeseries
description_string Name of the file containing the participants' ...
parent_dataframe <NA>
relative_location cgm/cgm.parquet
value_type Text
units <NA>
sampling_rate <NA>
field_type Time series file (individual)
array Single
cohorts 10K
data_type text
debut 2018-12-27
pandas_dtype string
ml.get('Time series file (individual)', prop='field_type')
tabular_field_name cgm/cgm_filename
dataset cgm
field_string CGM timeseries
description_string Name of the file containing the participants' ...
parent_dataframe <NA>
relative_location cgm/cgm.parquet
value_type Text
units <NA>
sampling_rate <NA>
field_type Time series file (individual)
array Single
cohorts 10K
data_type text
debut 2018-12-27
pandas_dtype string
ml.load('Image file (individual)', prop='field_type')
fundus_image_left fundus_image_right
participant_id cohort research_stage array_index
0 10k 00_00_visit 0 /path/to/file /path/to/file
1 10k 00_00_visit 0 /path/to/file /path/to/file
2 10k 00_00_visit 0 /path/to/file /path/to/file
3 10k 00_00_visit 0 /path/to/file /path/to/file
4 10k 00_00_visit 0 /path/to/file /path/to/file
  2. Example: get all fields that include “mg” in their units
ml.get('mg', flexible=True, prop='units')
tabular_field_name cgm/1st qu_ cgm/3rd qu_ cgm/auc cgm/ea1c cgm/glucose cgm/gmi cgm/iqr cgm/mad cgm/mag cgm/mage ... cgm/modd cgm/range cgm/sd cgm/sdb cgm/sdbdm cgm/sddm cgm/sdhhmm cgm/sdw cgm/sdwsh diet_logging/sodium_mg
dataset cgm cgm cgm cgm cgm cgm cgm cgm cgm cgm ... cgm cgm cgm cgm cgm cgm cgm cgm cgm diet_logging
field_string 1st quantile 3rd quantile AUC eA1C Glucose GMI IQR MAD MAG MAGE ... MODD Range SD SDb SDbdm SDdm SDhhmm SDw SDwsh Sodium intake per food logged
description_string First quantile of all glucose values. Third quantile of all glucose values. Hourly average AUC. This measure integrates, t... A linear transformation of the mean glucose va... cgm temporal glucose values A linear transformation of the mean glucose va... Interquartile range (IQR), calculated as the d... Median Absolute Deviation (MAD). This is a mea... Mean Absolute Glucose (MAG). This is a measure... Mean Amplitude of Glycemic Excursions (MAGE), ... ... Mean difference between glucose values obtaine... Difference between the maximum and minimum glu... Standard deviation of all glucose values. SD between days, within time points. Mean valu... SD between days, within time points, corrected... Horizontal SD. SD of the mean glucose values, ... SD between time points. Standard deviation of ... Vertical SD within days. Average value of the ... SD within series. Taking hour-long intervals t... Sodium intake per food logged
parent_dataframe <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
relative_location cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet ... cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet cgm/cgm.parquet diet_logging/diet_logging.parquet
value_type Continuous Continuous Continuous Continuous Series data, continous Continuous Continuous Continuous Continuous Continuous ... Continuous Continuous Continuous Continuous Continuous Continuous Continuous Continuous Continuous Continuous
units mg/dl mg/dl mg/dl*h mg/dl mg/dl mg/dl mg/dl mg/dl mg/dl mg/dl ... mg/dl mg/dl mg/dl mg/dl mg/dl mg/dl mg/dl mg/dl mg/dl mg
sampling_rate <NA> <NA> <NA> <NA> 15min <NA> <NA> <NA> <NA> <NA> ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
field_type Data Data Data Data Continuous Data Data Data Data Data ... Data Data Data Data Data Data Data Data Data Data
array Single Single Single Single Single Single Single Single Single Single ... Single Single Single Single Single Single Single Single Single Single
cohorts 10K 10K 10K 10K 10K 10K 10K 10K 10K 10K ... 10K 10K 10K 10K 10K 10K 10K 10K 10K 10K
data_type tabular tabular tabular tabular time series tabular tabular tabular tabular tabular ... tabular tabular tabular tabular tabular tabular tabular tabular tabular Time Series
debut 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 ... 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2019-09-01
pandas_dtype float float float float float float float float float float ... float float float float float float float float float float

14 rows × 24 columns
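
Since get() returns the matching slice of the data dictionary with one column per field, its column labels can be fed back into load(). A hedged sketch, assuming load() accepts the dataset-prefixed names that appear in those columns (as in the fundus/collection_date example above):

mg_fields = ml.get('mg', flexible=True, prop='units')
ml.load(mg_fields.columns.tolist())   # load the actual data for every matched field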