Pheno loader

Class for loading datasets on the research platform

PhenoLoader

 PhenoLoader (dataset:str, base_path:str='nbs/examples/', cohort:str=None,
              age_sex_dataset:str='events', skip_dfs:List[str]=[],
              unique_index:bool=False, valid_dates:bool=False,
              valid_stage:bool=False, flexible_field_search:bool=False,
              squeeze:bool=False, errors:str='warn',
              read_parquet_kwargs:Dict[str,Any]={},
              preferred_language:str='english',
              keep_undefined_research_stage:bool=False,
              join_non_overlapping:bool=False)

*Class that loads multiple tables from a dataset and provides easy access to their fields.

Args:

dataset (str): The name of the dataset to load.
base_path (str, optional): The base path where the data is stored. Defaults to DATASETS_PATH.
cohort (str, optional): The name of the cohort within the dataset. Defaults to COHORT.
age_sex_dataset (str, optional): The name of the dataset to use for computing age and sex. Defaults to EVENTS_DATASET.
skip_dfs (list, optional): A list of table names (or substrings that match table names) to skip when loading the data. Defaults to [].
unique_index (bool, optional): Whether to ensure the index of the data is unique. Defaults to False.
valid_dates (bool, optional): Whether to ensure that all timestamps in the data are valid dates. Defaults to False.
valid_stage (bool, optional): Whether to ensure that all research stages in the data are valid. Defaults to False.
flexible_field_search (bool, optional): Whether to allow regex field search. Defaults to False.
keep_undefined_research_stage (bool, optional): Whether to keep samples with undefined research stage. Defaults to False.
join_non_overlapping (bool, optional): Whether to join tables with non-overlapping indices. Defaults to False.
errors (str, optional): Whether to raise an error or issue a warning if missing data is encountered.
    Possible values are 'raise', 'warn' and 'ignore'. Defaults to ERROR_ACTION.

Attributes:

dict (pd.DataFrame): The data dictionary for the dataset, containing information about each field.
dfs (dict): A dictionary of dataframes, one for each table in the dataset.
fields (list): A list of all fields in the dataset.
dataset (str): The name of the dataset being used.
cohort (str): The name of the cohort being used.
base_path (str): The base path where the data is stored.
dataset_path (str): The full path to the dataset being used.
age_sex_dataset (str): The name of the dataset being used to compute age and sex.
skip_dfs (list): A list of tables to skip when loading the data.
unique_index (bool): Whether to ensure the index of the data is unique.
valid_dates (bool): Whether to ensure that all timestamps in the data are valid dates.
valid_stage (bool): Whether to ensure that all research stages in the data are valid.
flexible_field_search (bool): Whether to allow regex field search.
keep_undefined_research_stage (bool): Whether to keep samples with undefined research stage.
join_non_overlapping (bool): Whether to join tables with non-overlapping indices.
errors (str): Whether to raise an error or issue a warning if missing data is encountered.
preferred_language (str): The preferred language for the questionnaires.*

Pass the dataset name to load a dataset, which may contain multiple tables. Age and sex are added to the data by default. The default base_path is set to work on the research platform.

pl = PhenoLoader('fundus', errors='warn')
pl

PhenoLoader for fundus with
78 fields
2 tables: ['fundus', 'age_sex']
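The errors argument used above accepts 'raise', 'warn' or 'ignore'. The following is a minimal sketch of how such a policy is commonly dispatched, not the PhenoLoader implementation: handle_missing is a hypothetical helper, and the message text and the choice of ValueError are assumptions made for illustration.

```python
import warnings


def handle_missing(message: str, errors: str = "warn") -> None:
    """Hypothetical helper: react to missing data according to `errors`."""
    if errors == "raise":
        raise ValueError(message)
    if errors == "warn":
        warnings.warn(message)
    elif errors != "ignore":
        raise ValueError(
            f"errors must be 'raise', 'warn' or 'ignore', got {errors!r}"
        )


# 'ignore' stays silent.
handle_missing("age_sex table not found", errors="ignore")

# 'warn' emits a warning and lets execution continue.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    handle_missing("age_sex table not found", errors="warn")
print(len(caught))  # 1

# 'raise' stops execution with an exception.
try:
    handle_missing("age_sex table not found", errors="raise")
except ValueError as exc:
    print(f"raised: {exc}")  # raised: age_sex table not found
```

With 'warn', loading continues and the problem is reported, which suits interactive exploration; 'raise' is the stricter choice for pipelines that should fail early.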

The PhenoLoader class contains several useful attributes.

The data dictionary of the dataset displays the description of each field.

pl.dict.head(3)

	field_string	description_string	parent_dataframe	relative_location	value_type	units	sampling_rate	field_type	array	cohorts	data_type	debut	pandas_dtype
tabular_field_name
fundus_image_left	Fundus image (left)	Fundus image (left)	NaN	fundus/fundus.parquet	Text	NaN	NaN	Image file (individual)	Single	10K	image	2021-02-17	string
fundus_image_right	Fundus image (right)	Fundus image (right)	NaN	fundus/fundus.parquet	Text	NaN	NaN	Image file (individual)	Single	10K	image	2021-02-17	string
collection_date	Collection date (YYYY-MM-DD)	Collection date (YYYY-MM-DD)	NaN	fundus/fundus.parquet	Date	Time	NaN	Data	Single	10K	tabular	2021-02-17	datetime64[ns]

pl.dfs.keys()

dict_keys(['fundus', 'age_sex'])

import pandas as pd

# Example DataFrame with MultiIndex
data = {
    'value': [10, 20, 30, 40]
}
tuples = [('A', 'x'), ('A', 'y'), ('B', 'x'), ('B', 'y')]
index = pd.MultiIndex.from_tuples(tuples, names=['outer', 'inner'])
df = pd.DataFrame(data, index=index)

# Reset the index
df_reset = df.reset_index()

# Re-add the desired index levels ('outer', 'inner' in this case) as index
df_final = df_reset.set_index(['outer', 'inner'])

pl.dfs['fundus'].head(3)

				fundus_image_left	fundus_image_right	collection_date	fractal_dimension_left	fractal_dimension_right	artery_average_width_left	artery_average_width_right	artery_distance_tortuosity_left	artery_distance_tortuosity_right	artery_fractal_dimension_left	...	vein_fractal_dimension_left	vein_fractal_dimension_right	vein_squared_curvature_tortuosity_left	vein_squared_curvature_tortuosity_right	vein_tortuosity_density_left	vein_tortuosity_density_right	vein_vessel_density_left	vein_vessel_density_right	vessel_density_left	vessel_density_right
participant_id	cohort	research_stage	array_index
0	10k	00_00_visit	0	/path/to/file	/path/to/file	2022-11-16	1.564989	1.520885	18430.284751	19038.547771	3.668175	3.271147	1.355673	...	1.410553	1.403108	14.208195	6.098432	0.700187	0.698546	0.046645	0.045864	0.080377	0.078671
1	10k	00_00_visit	0	/path/to/file	/path/to/file	2022-06-30	1.542311	1.534158	17315.398780	19099.489575	2.095461	1.634782	1.368933	...	1.387527	1.332864	8.999069	8.702682	0.740806	0.708911	0.037896	0.046853	0.074197	0.064578
2	10k	00_00_visit	0	/path/to/file	/path/to/file	2021-10-05	1.482051	1.545097	15375.866993	19855.576862	2.776472	2.747015	1.360404	...	1.411881	1.408791	13.119227	9.936669	0.627281	0.675100	0.053022	0.048063	0.079515	0.082102

3 rows × 76 columns

All available fields (columns) in all tables can be listed.

pl.fields[:5]

['artery_average_width_left',
 'artery_average_width_right',
 'artery_distance_tortuosity_left',
 'artery_distance_tortuosity_right',
 'artery_fractal_dimension_left']

pl['vein_average_width_right']

				vein_average_width_right
participant_id	cohort	research_stage	array_index
0	10k	00_00_visit	0	18436.428634
1	10k	00_00_visit	0	18888.160314
2	10k	00_00_visit	0	19013.865043
3	10k	00_00_visit	0	18809.012493
4	10k	00_00_visit	0	19428.986690

Access any of the fields (e.g., vein_average_width_right, age) or indices (e.g., research_stage) from any of the tables via the data loader API.

pl[['research_stage', 'vein_average_width_right', 'age', 'sex']]

				research_stage	vein_average_width_right	age	sex
participant_id	cohort	research_stage	array_index
0	10k	00_00_visit	0	00_00_visit	18436.428634	43.5	0
1	10k	00_00_visit	0	00_00_visit	18888.160314	53.7	1
2	10k	00_00_visit	0	00_00_visit	19013.865043	26.2	0
3	10k	00_00_visit	0	00_00_visit	18809.012493	44.6	1
4	10k	00_00_visit	0	00_00_visit	19428.986690	50.3	0
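To make the idea of a cross-table lookup concrete, here is a plain-Python sketch that resolves each requested name either to an index level or to the table that holds it, then aligns the values on the shared index. It is an illustration under stated assumptions, not the library's code: get_fields, the toy tables and the dictionary layout are all hypothetical, and the values are copied from the output above.

```python
# Index levels shared by all tables.
INDEX_NAMES = ("participant_id", "cohort", "research_stage", "array_index")

# Two toy tables keyed by the same index, mimicking 'fundus' and 'age_sex'.
# Layout: table name -> column name -> {index tuple -> value}.
row0 = (0, "10k", "00_00_visit", 0)
row1 = (1, "10k", "00_00_visit", 0)
tables = {
    "fundus": {"vein_average_width_right": {row0: 18436.428634, row1: 18888.160314}},
    "age_sex": {"age": {row0: 43.5, row1: 53.7}, "sex": {row0: 0, row1: 1}},
}


def get_fields(fields):
    """Hypothetical lookup: collect the requested fields row by row."""
    # Find the table that owns each non-index field.
    columns = {}
    for field in fields:
        if field in INDEX_NAMES:
            continue
        owner = next((t for t in tables.values() if field in t), None)
        if owner is None:
            raise KeyError(field)
        columns[field] = owner[field]

    # Align all columns on the union of their row keys.
    keys = sorted(set().union(*(col.keys() for col in columns.values())))
    rows = {}
    for key in keys:
        row = {}
        for field in fields:
            if field in INDEX_NAMES:
                # Index levels are read from the row key itself.
                row[field] = key[INDEX_NAMES.index(field)]
            else:
                row[field] = columns[field].get(key)
        rows[key] = row
    return rows


result = get_fields(["research_stage", "vein_average_width_right", "age", "sex"])
for key, row in result.items():
    print(key[0], row)
```

The caller never names a table: the field name alone determines where the values come from, which is the convenience the loader API provides.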

Access time series or bulk data that is stored separately for each sample via the data loader API. In the following example, the data loader retrieves the relative path of each sample’s bulk file from the main table (where it is stored in the field fundus_image_left), converts it to an absolute path, and loads the file. This is repeated for 2 samples and the results are returned as a list. For parquet DataFrames, there is no need to define load_func, and multiple DataFrames are concatenated by default.

pl.dfs['fundus']['fundus_image_left']

participant_id  cohort  research_stage  array_index
0               10k     00_00_visit     0              /path/to/file
1               10k     00_00_visit     0              /path/to/file
2               10k     00_00_visit     0              /path/to/file
3               10k     00_00_visit     0              /path/to/file
4               10k     00_00_visit     0              /path/to/file
Name: fundus_image_left, dtype: object

pl.load_bulk_data('fundus_image_left', participant_id=[0, 1])
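The steps described above (look up the relative path, convert it to an absolute path, load the file, repeat per sample) can be sketched with the standard library alone. This is a rough illustration, not the load_bulk_data implementation: load_bulk, the temporary directory layout and the text files standing in for images are all assumptions.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def load_bulk(relative_paths: Dict[int, str], base_dir: Path,
              participant_ids: List[int],
              load_func: Callable[[Path], object]) -> List[object]:
    """Hypothetical helper: resolve each sample's path and load its file."""
    loaded = []
    for pid in participant_ids:
        # Convert the stored relative path to an absolute one.
        absolute = (base_dir / relative_paths[pid]).resolve()
        loaded.append(load_func(absolute))
    return loaded


# Build a throwaway dataset directory with one small file per participant.
with tempfile.TemporaryDirectory() as tmp:
    base_dir = Path(tmp)
    relative_paths = {}
    for pid in (0, 1, 2):
        rel = f"fundus/images/{pid}.txt"
        (base_dir / rel).parent.mkdir(parents=True, exist_ok=True)
        (base_dir / rel).write_text(f"image bytes for participant {pid}")
        relative_paths[pid] = rel

    # Load the files of two participants; the result holds one item per sample.
    samples = load_bulk(relative_paths, base_dir, [0, 1], Path.read_text)

print(samples)
# ['image bytes for participant 0', 'image bytes for participant 1']
```

Passing the loader as a function keeps the path handling independent of the file format, so the same loop works for images, time series or any other per-sample file.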

To enable flexible field search (with regex support), initialize the PhenoLoader as follows:

pl = PhenoLoader('fundus', flexible_field_search=True)

For example, the following command will search for any field starting with “fractal”.

pl['^fractal']

				fractal_dimension_left	fractal_dimension_right
participant_id	cohort	research_stage	array_index
0	10k	00_00_visit	0	1.564989	1.520885
1	10k	00_00_visit	0	1.542311	1.534158
2	10k	00_00_visit	0	1.482051	1.545097
3	10k	00_00_visit	0	1.548773	1.539352
4	10k	00_00_visit	0	1.554922	1.557029
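The effect of the ^ anchor can be reproduced on a plain list of field names with Python's re module. This sketch assumes the search behaves like re.search applied to each field name; search_fields is a hypothetical helper, and the list holds only a few field names taken from the fundus table.

```python
import re
from typing import List

fields = [
    "fractal_dimension_left",
    "fractal_dimension_right",
    "artery_fractal_dimension_left",
    "vein_fractal_dimension_left",
    "vessel_density_left",
]


def search_fields(pattern: str, fields: List[str]) -> List[str]:
    """Hypothetical helper: return the fields whose names match `pattern`."""
    regex = re.compile(pattern)
    return [name for name in fields if regex.search(name)]


# Anchored pattern: only names that start with 'fractal'.
print(search_fields("^fractal", fields))
# ['fractal_dimension_left', 'fractal_dimension_right']

# Unanchored pattern: every name that contains 'fractal'.
print(search_fields("fractal", fields))
# ['fractal_dimension_left', 'fractal_dimension_right',
#  'artery_fractal_dimension_left', 'vein_fractal_dimension_left']
```

Without the anchor, the pattern also picks up the artery and vein variants, so anchoring is worth the extra character when only the top-level measurements are wanted.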

You can summarize a field or a set of fields with the following command:

pl.describe_field(['fundus_image_right', 'collection_date'])

	fundus_image_right	collection_date
field_string	Fundus image (right)	Collection date (YYYY-MM-DD)
description_string	Fundus image (right)	Collection date (YYYY-MM-DD)
parent_dataframe	NaN	NaN
relative_location	fundus/fundus.parquet	fundus/fundus.parquet
value_type	Text	Date
units	NaN	Time
sampling_rate	NaN	NaN
field_type	Image file (individual)	Data
array	Single	Single
cohorts	10K	10K
data_type	image	tabular
debut	2021-02-17	2021-02-17
pandas_dtype	string	datetime64[ns]
count	5	5
unique	1	5
most_frequent	/path/to/file	NaN
min	NaN	2021-10-05 00:00:00
max	NaN	2022-11-16 00:00:00
mean	NaN	NaN
median	NaN	NaN
std	NaN	NaN