# os.path.join(DATASETS_PATH, '*')
os.listdir(DATASETS_PATH)
['cgm', 'diet_logging', 'events', 'fundus', 'sleep', 'metadata']
MetaLoader (base_path:str='nbs/examples/', cohort:str=None, flexible_field_search:bool=False, errors:str='warn', **kwargs)
*Class that loads multiple data dictionaries and provides easy access to the relevant fields.
Args:
base_path (str, optional): The base path where the data is stored. Defaults to DATASETS_PATH.
cohort (str, optional): The name of the cohort within the dataset. Defaults to COHORT.
flexible_field_search (bool, optional): Whether to allow regex field search. Defaults to False.
errors (str, optional): Whether to raise an error or issue a warning if missing data is encountered.
Possible values are 'raise', 'warn' and 'ignore'. Defaults to 'warn'.
**kwargs: Additional keyword arguments to pass to a DataLoader class.
Attributes:
dicts (pd.DataFrame): A dictionary of data dictionaries (dataframes) of all available datasets in the base_path.
fields (list): A list of all fields.
cohort (str): The name of the cohort being used.
base_path (str): The base path where the data is stored.
flexible_field_search (bool): Whether to allow regex field search.
errors (str): Whether to raise an error or issue a warning if missing data is encountered.
kwargs (dict): Additional keyword arguments to pass to a DataLoader class.*
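The three `errors` modes follow a pattern that is common in Python libraries. The sketch below illustrates that pattern only: `report_missing` is a hypothetical helper, not part of `MetaLoader`, and the library's actual handling of missing data may differ.

```python
import warnings


def report_missing(message: str, errors: str = "warn") -> None:
    """Hypothetical helper: react to missing data according to an `errors` mode."""
    if errors == "raise":
        # Strict mode: stop immediately.
        raise ValueError(message)
    if errors == "warn":
        # Lenient mode: report the problem and continue.
        warnings.warn(message)
    elif errors != "ignore":
        # Reject unknown modes instead of silently ignoring them.
        raise ValueError(
            f"errors must be 'raise', 'warn' or 'ignore', got {errors!r}"
        )


report_missing("field 'glucose' not found", errors="ignore")  # does nothing
```

With `errors="warn"` the same call emits a `UserWarning`, and with `errors="raise"` it raises a `ValueError`.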
The `MetaLoader` can be used to query all available fields across all datasets. In the following example, 4 datasets are available.
MetaLoader for: examples/*
with 83 fields
4 datasets:
['cgm' 'diet_logging' 'fundus' 'sleep']
The object contains only the data dictionaries (metadata) of these datasets, where the columns correspond to columns in the data tables of the dataset (e.g., fundus).
tabular_field_name | fundus_image_left | fundus_image_right | collection_date | fractal_dimension_left | fractal_dimension_right |
---|---|---|---|---|---|
dataset | fundus | fundus | fundus | fundus | fundus |
field_string | Fundus image (left) | Fundus image (right) | Collection date (YYYY-MM-DD) | Fractal dimension (left) | Fractal dimension (right) |
description_string | Fundus image (left) | Fundus image (right) | Collection date (YYYY-MM-DD) | Fractal dimension (left) | Fractal dimension (right) |
parent_dataframe | <NA> | <NA> | <NA> | <NA> | <NA> |
relative_location | fundus/fundus.parquet | fundus/fundus.parquet | fundus/fundus.parquet | fundus/fundus.parquet | fundus/fundus.parquet |
value_type | Text | Text | Date | Continuous | Continuous |
units | <NA> | <NA> | Time | Time | Time |
sampling_rate | <NA> | <NA> | <NA> | <NA> | <NA> |
field_type | Image file (individual) | Image file (individual) | Data | Data | Data |
array | Single | Single | Single | Single | Single |
cohorts | 10K | 10K | 10K | 10K | 10K |
data_type | image | image | tabular | tabular | tabular |
debut | 2021-02-17 | 2021-02-17 | 2021-02-17 | 2021-02-17 | 2021-02-17 |
pandas_dtype | string | string | datetime64[ns] | float | float |
You can query fields from multiple datasets directly.
tabular_field_name | cgm/glucose | fundus/fundus_image_left | fundus/collection_date |
---|---|---|---|
dataset | cgm | fundus | fundus |
field_string | Glucose | Fundus image (left) | Collection date (YYYY-MM-DD) |
description_string | cgm temporal glucose values | Fundus image (left) | Collection date (YYYY-MM-DD) |
parent_dataframe | <NA> | <NA> | <NA> |
relative_location | cgm/cgm.parquet | fundus/fundus.parquet | fundus/fundus.parquet |
value_type | Series data, continous | Text | Date |
units | mg/dl | <NA> | Time |
sampling_rate | 15min | <NA> | <NA> |
field_type | Continuous | Image file (individual) | Data |
array | Single | Single | Single |
cohorts | 10K | 10K | 10K |
data_type | time series | image | tabular |
debut | 2018-12-27 | 2021-02-17 | 2021-02-17 |
pandas_dtype | float | string | datetime64[ns] |
Note that in the example above, the dataset for `collection_date` (a field common to all datasets) is specified by the prefix `fundus/`. The loader therefore returns the field from the fundus imaging dataset. Omitting this prefix returns every `collection_date` field in the Human Phenotype Project.
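The prefix rule described above can be pictured with a small standalone sketch. It is not the `MetaLoader` implementation: `CATALOG` and `resolve` are hypothetical names, and the catalog is a toy list rather than a real data dictionary.

```python
# Toy catalog of (dataset, field) pairs; the real loader reads these
# from the data dictionaries of each dataset.
CATALOG = [
    ("cgm", "glucose"),
    ("cgm", "collection_date"),
    ("fundus", "fundus_image_left"),
    ("fundus", "collection_date"),
    ("sleep", "collection_date"),
]


def resolve(query, catalog=CATALOG):
    """Hypothetical lookup: 'dataset/field' matches one dataset, 'field' matches all."""
    if "/" in query:
        dataset, field = query.split("/", 1)
        return [f"{d}/{f}" for d, f in catalog if d == dataset and f == field]
    return [f"{d}/{f}" for d, f in catalog if f == query]


print(resolve("fundus/collection_date"))  # ['fundus/collection_date']
print(resolve("collection_date"))  # one entry per dataset that has the field
```

The prefixed query is unambiguous, while the bare field name returns one match per dataset.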
You can then use the `MetaLoader` to load the actual data of fields from multiple datasets. Here we load `glucose` from the CGM dataset and `fundus_image_left` from the fundus dataset.
participant_id | collection_timestamp | connection_id | glucose |
---|---|---|---|
0 | 2020-05-25 10:48:00+03:00 | 1000001 | 111.6 |
0 | 2020-05-25 11:03:00+03:00 | 1000001 | 79.2 |
0 | 2020-05-25 11:18:00+03:00 | 1000001 | 84.6 |
0 | 2020-05-25 11:33:00+03:00 | 1000001 | 106.2 |
0 | 2020-05-25 11:48:00+03:00 | 1000001 | 102.6 |
... | ... | ... | ... |
0 | 2020-05-30 15:03:00+03:00 | 1000001 | 90.0 |
0 | 2020-05-30 15:18:00+03:00 | 1000001 | 99.0 |
0 | 2020-05-30 15:33:00+03:00 | 1000001 | 106.2 |
0 | 2020-05-30 15:48:00+03:00 | 1000001 | 99.0 |
0 | 2020-05-30 16:03:00+03:00 | 1000001 | 84.6 |
502 rows × 1 columns
participant_id | cohort | research_stage | array_index | fundus_image_left |
---|---|---|---|---|
0 | 10k | 00_00_visit | 0 | /path/to/file |
1 | 10k | 00_00_visit | 0 | /path/to/file |
2 | 10k | 00_00_visit | 0 | /path/to/file |
3 | 10k | 00_00_visit | 0 | /path/to/file |
4 | 10k | 00_00_visit | 0 | /path/to/file |
participant_id | collection_timestamp | connection_id | cohort | research_stage | glucose | fundus_image_left |
---|---|---|---|---|---|---|
0 | 2020-05-25 10:48:00+03:00 | 1000001 | 10k | 00_00_visit | 111.6 | /path/to/file |
0 | 2020-05-25 11:03:00+03:00 | 1000001 | 10k | 00_00_visit | 79.2 | /path/to/file |
0 | 2020-05-25 11:18:00+03:00 | 1000001 | 10k | 00_00_visit | 84.6 | /path/to/file |
0 | 2020-05-25 11:33:00+03:00 | 1000001 | 10k | 00_00_visit | 106.2 | /path/to/file |
0 | 2020-05-25 11:48:00+03:00 | 1000001 | 10k | 00_00_visit | 102.6 | /path/to/file |
... | ... | ... | ... | ... | ... | ... |
0 | 2020-05-30 16:03:00+03:00 | 1000001 | 10k | 00_00_visit | 84.6 | /path/to/file |
1 | NaT | NaN | 10k | 00_00_visit | NaN | /path/to/file |
2 | NaT | NaN | 10k | 00_00_visit | NaN | /path/to/file |
3 | NaT | NaN | 10k | 00_00_visit | NaN | /path/to/file |
4 | NaT | NaN | 10k | 00_00_visit | NaN | /path/to/file |
506 rows × 2 columns
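The combined table has 506 rows, which is consistent with the 502 glucose rows of participant 0 plus one row each for participants 1 to 4, who have a fundus image but no CGM data in this example. That is the row count an outer join on `participant_id` would produce. The pandas sketch below reproduces the effect on toy tables; it assumes outer-join semantics and does not show how `MetaLoader` aligns tables internally.

```python
import pandas as pd

# Toy stand-ins for the two loaded tables (values are illustrative only).
glucose = pd.DataFrame(
    {
        "participant_id": [0, 0, 0],
        "collection_timestamp": pd.to_datetime(
            ["2020-05-25 10:48", "2020-05-25 11:03", "2020-05-25 11:18"]
        ),
        "glucose": [111.6, 79.2, 84.6],
    }
)
fundus = pd.DataFrame(
    {
        "participant_id": [0, 1, 2],
        "fundus_image_left": ["/path/to/file"] * 3,
    }
)

# Outer join on the shared key: keep every participant from either table.
combined = pd.merge(glucose, fundus, on="participant_id", how="outer")
combined = combined.set_index(["participant_id", "collection_timestamp"])

# 3 glucose readings for participant 0, plus 2 participants with images only.
print(len(combined))  # 5
print(combined)
```

Participants without glucose readings keep their image path, while their timestamp and glucose value are filled with `NaT` and `NaN`.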
participant_id | collection_timestamp | connection_id | cohort | research_stage | glucose | fundus_image_left | collection_date |
---|---|---|---|---|---|---|---|
0 | 2020-05-25 10:48:00+03:00 | 1000001 | 10k | 00_00_visit | 111.6 | /path/to/file | 2022-11-16 |
0 | 2020-05-25 11:03:00+03:00 | 1000001 | 10k | 00_00_visit | 79.2 | /path/to/file | 2022-11-16 |
0 | 2020-05-25 11:18:00+03:00 | 1000001 | 10k | 00_00_visit | 84.6 | /path/to/file | 2022-11-16 |
0 | 2020-05-25 11:33:00+03:00 | 1000001 | 10k | 00_00_visit | 106.2 | /path/to/file | 2022-11-16 |
0 | 2020-05-25 11:48:00+03:00 | 1000001 | 10k | 00_00_visit | 102.6 | /path/to/file | 2022-11-16 |
You can run more flexible search queries using regex and various properties of the fields. Both the `get()` method and the `load()` method support the same syntax.
tabular_field_name | cgm/cgm_filename |
---|---|
dataset | cgm |
field_string | CGM timeseries |
description_string | Name of the file containing the participants' ... |
parent_dataframe | <NA> |
relative_location | cgm/cgm.parquet |
value_type | Text |
units | <NA> |
sampling_rate | <NA> |
field_type | Time series file (individual) |
array | Single |
cohorts | 10K |
data_type | text |
debut | 2018-12-27 |
pandas_dtype | string |
fundus_image_left | fundus_image_right | ||||
---|---|---|---|---|---|
participant_id | cohort | research_stage | array_index | ||
0 | 10k | 00_00_visit | 0 | /path/to/file | /path/to/file |
1 | 10k | 00_00_visit | 0 | /path/to/file | /path/to/file |
2 | 10k | 00_00_visit | 0 | /path/to/file | /path/to/file |
3 | 10k | 00_00_visit | 0 | /path/to/file | /path/to/file |
4 | 10k | 00_00_visit | 0 | /path/to/file | /path/to/file |
tabular_field_name | cgm/1st qu_ | cgm/3rd qu_ | cgm/auc | cgm/ea1c | cgm/glucose | cgm/gmi | cgm/iqr | cgm/mad | cgm/mag | cgm/mage | ... | cgm/modd | cgm/range | cgm/sd | cgm/sdb | cgm/sdbdm | cgm/sddm | cgm/sdhhmm | cgm/sdw | cgm/sdwsh | diet_logging/sodium_mg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
dataset | cgm | cgm | cgm | cgm | cgm | cgm | cgm | cgm | cgm | cgm | ... | cgm | cgm | cgm | cgm | cgm | cgm | cgm | cgm | cgm | diet_logging |
field_string | 1st quantile | 3rd quantile | AUC | eA1C | Glucose | GMI | IQR | MAD | MAG | MAGE | ... | MODD | Range | SD | SDb | SDbdm | SDdm | SDhhmm | SDw | SDwsh | Sodium intake per food logged |
description_string | First quantile of all glucose values. | Third quantile of all glucose values. | Hourly average AUC. This measure integrates, t... | A linear transformation of the mean glucose va... | cgm temporal glucose values | A linear transformation of the mean glucose va... | Interquartile range (IQR), calculated as the d... | Median Absolute Deviation (MAD). This is a mea... | Mean Absolute Glucose (MAG). This is a measure... | Mean Amplitude of Glycemic Excursions (MAGE), ... | ... | Mean difference between glucose values obtaine... | Difference between the maximum and minimum glu... | Standard deviation of all glucose values. | SD between days, within time points. Mean valu... | SD between days, within time points, corrected... | Horizontal SD. SD of the mean glucose values, ... | SD between time points. Standard deviation of ... | Vertical SD within days. Average value of the ... | SD within series. Taking hour-long intervals t... | Sodium intake per food logged |
parent_dataframe | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
relative_location | cgm/cgm.parquet | cgm/cgm.parquet | cgm/cgm.parquet | cgm/cgm.parquet | cgm/cgm.parquet | cgm/cgm.parquet | cgm/cgm.parquet | cgm/cgm.parquet | cgm/cgm.parquet | cgm/cgm.parquet | ... | cgm/cgm.parquet | cgm/cgm.parquet | cgm/cgm.parquet | cgm/cgm.parquet | cgm/cgm.parquet | cgm/cgm.parquet | cgm/cgm.parquet | cgm/cgm.parquet | cgm/cgm.parquet | diet_logging/diet_logging.parquet |
value_type | Continuous | Continuous | Continuous | Continuous | Series data, continous | Continuous | Continuous | Continuous | Continuous | Continuous | ... | Continuous | Continuous | Continuous | Continuous | Continuous | Continuous | Continuous | Continuous | Continuous | Continuous |
units | mg/dl | mg/dl | mg/dl*h | mg/dl | mg/dl | mg/dl | mg/dl | mg/dl | mg/dl | mg/dl | ... | mg/dl | mg/dl | mg/dl | mg/dl | mg/dl | mg/dl | mg/dl | mg/dl | mg/dl | mg |
sampling_rate | <NA> | <NA> | <NA> | <NA> | 15min | <NA> | <NA> | <NA> | <NA> | <NA> | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
field_type | Data | Data | Data | Data | Continuous | Data | Data | Data | Data | Data | ... | Data | Data | Data | Data | Data | Data | Data | Data | Data | Data |
array | Single | Single | Single | Single | Single | Single | Single | Single | Single | Single | ... | Single | Single | Single | Single | Single | Single | Single | Single | Single | Single |
cohorts | 10K | 10K | 10K | 10K | 10K | 10K | 10K | 10K | 10K | 10K | ... | 10K | 10K | 10K | 10K | 10K | 10K | 10K | 10K | 10K | 10K |
data_type | tabular | tabular | tabular | tabular | time series | tabular | tabular | tabular | tabular | tabular | ... | tabular | tabular | tabular | tabular | tabular | tabular | tabular | tabular | tabular | Time Series |
debut | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | ... | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2019-09-01 |
pandas_dtype | float | float | float | float | float | float | float | float | float | float | ... | float | float | float | float | float | float | float | float | float | float |
14 rows × 24 columns
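The last table is consistent with a regex search for `mg` over the `units` property, since it returns fields measured in mg/dl, mg/dl*h and mg. The standalone sketch below shows the idea of searching either field names or a chosen property. `FIELDS` and `search` are hypothetical names, the property values are copied from the tables above, and the actual arguments of `get()` and `load()` are not shown in this document.

```python
import re

# Toy data dictionary: field name -> selected properties.
FIELDS = {
    "cgm/glucose": {"units": "mg/dl", "field_type": "Continuous"},
    "cgm/auc": {"units": "mg/dl*h", "field_type": "Data"},
    "cgm/cgm_filename": {
        "units": None,
        "field_type": "Time series file (individual)",
    },
    "fundus/collection_date": {"units": "Time", "field_type": "Data"},
    "diet_logging/sodium_mg": {"units": "mg", "field_type": "Data"},
}


def search(pattern, prop=None, fields=FIELDS):
    """Hypothetical regex search over field names, or over one property if given."""
    regex = re.compile(pattern)
    hits = []
    for name, props in fields.items():
        # Search the field name by default, otherwise the requested property.
        target = name if prop is None else props.get(prop)
        if target is not None and regex.search(target):
            hits.append(name)
    return hits


print(search("mg", prop="units"))
print(search("Time series", prop="field_type"))
print(search("^fundus/"))
```

Fields whose property is missing (shown as `<NA>` in the tables) are skipped rather than matched.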