13 - gut microbiome

Metagenomics is the study of genetic material from environmental samples, including microbial communities. It involves sequencing the DNA of all microorganisms in the sample, rather than isolating individual organisms. Metagenomics enables the identification and functional analysis of microorganisms in diverse environments, including soil, water, and the human body.

This dataset profiles each participant's gut microbiota by shotgun metagenomic sequencing of stool samples. The sequencing reads are then compared against reference databases of known gut microbes to estimate the relative abundance of specific microbes.

Gut microbiome metagenomics can be used to identify potential biomarkers of disease, develop personalized treatment strategies, and better understand the complex relationship between the gut microbiota and human physiology. It has already provided insights into the role of gut microbes in diseases such as obesity, diabetes, and inflammatory bowel disease (IBD).

To measure the genetic makeup of the human gut flora from stool samples via metagenomics, the following steps are performed:

Collection of stool sample: For each visit, a stool sample is collected from the individual and stored appropriately to preserve the microbial community.
DNA extraction: DNA is extracted from the stool sample using specialized techniques to isolate the microbial DNA from other materials present in the sample.
DNA fragmentation and sequencing: The extracted DNA is then fragmented into small pieces and sequenced using high-throughput sequencing technologies.
Quality control: The resulting raw sequencing data is then pre-processed, removing low-quality reads and artifacts of the sequencing methodology.
Taxonomic classification: The processed sequencing data is then compared to databases of known microbial sequences to identify the microbial species present in the sample and estimate their relative abundances.
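
The last step reports each taxon as a share of the whole community. As a rough illustration only, the sketch below converts hypothetical per-taxon counts into percentages that sum to 100. This is a simplification: MetaPhlAn estimates abundances from clade-specific marker genes rather than from raw read counts, and the taxon names and counts here are invented for the example.

```python
def relative_abundance(counts):
    """Convert per-taxon counts into percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads were assigned to any taxon")
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


# Hypothetical counts, invented for illustration (not taken from the dataset).
toy_counts = {"taxon_a": 600, "taxon_b": 300, "taxon_c": 100}
toy_abundance = relative_abundance(toy_counts)
print(toy_abundance)
```

Expressing abundances as percentages matches the `Percent` units listed for the MetaPhlAn fields in the data dictionary.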

The information is stored in multiple parquet files:

metadata: Sequencing and QC statistics.
urs: Segal Lab relative abundance.
metaphlan_*: Eight tables of MetaPhlAn 4 relative abundances, one per taxonomic level.

from pheno_utils import PhenoLoader

dl = PhenoLoader('gut_microbiome', age_sex_dataset=None)
dl

DataLoader for gut_microbiome with
9903 fields
10 tables: ['metadata', 'urs', 'metaphlan_kingdom', 'metaphlan_phylum', 'metaphlan_class', 'metaphlan_order', 'metaphlan_family', 'metaphlan_genus', 'metaphlan_species', 'metaphlan_strain']
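
The eight `metaphlan_*` table names line up with the rank prefixes that MetaPhlAn uses in its lineage strings (`k__`, `p__`, `c__`, `o__`, `f__`, `g__`, `s__`, `t__`), such as those listed in the data dictionary. The sketch below splits a lineage string into ranks and guesses the table from the deepest rank. The helper names `parse_lineage` and `table_for` are hypothetical, and the idea that every table is keyed by the full lineage down to its own rank is an assumption; the dictionary only confirms it for the strain level.

```python
# Rank prefixes used in MetaPhlAn lineage strings, paired with the
# taxonomic levels named in the metaphlan_* tables.
RANK_PREFIXES = {
    "k": "kingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species", "t": "strain",
}


def parse_lineage(lineage):
    """Split a 'k__...|p__...' lineage string into {rank: name}."""
    # Some renderings escape the separator as '\|'; normalise it first.
    parts = lineage.replace("\\|", "|").split("|")
    ranks = {}
    for part in parts:
        prefix, sep, name = part.partition("__")
        if not sep or prefix not in RANK_PREFIXES:
            raise ValueError(f"unrecognised lineage component: {part!r}")
        ranks[RANK_PREFIXES[prefix]] = name
    return ranks


def table_for(lineage):
    """Guess the metaphlan_* table from the deepest rank (an assumption)."""
    deepest = list(parse_lineage(lineage))[-1]
    return f"metaphlan_{deepest}"


example = (
    "k__Eukaryota|p__Ascomycota|c__Saccharomycetes|o__Saccharomycetales"
    "|f__Saccharomycetaceae|g__Saccharomyces|s__Saccharomyces_cerevisiae"
    "|t__EUK4932"
)
parsed = parse_lineage(example)
print(parsed["species"], table_for(example))
```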

dl.dict

tabular_field_name	field_string	description_string	parent_dataframe	relative_location	value_type	units	sampling_rate	item_type	array	cohorts	data_type	debut	pandas_dtype
collection_timestamp	Sampled Timestamp	Time sample was given	NaN	metadata.parquet	Date	Time	NaN	Data	Single	10K	Metadata	2019-02-14	datetime64[ns]
collection_date	Sampled Date	Date sample was given	NaN	metadata.parquet	Date	Time	NaN	Data	Single	10K	Metadata	2019-02-14	datetime64[ns]
SampleName	Sample Name	Sample Name	NaN	metadata.parquet	Text	NaN	NaN	Data	Single	10K	Metadata	2019-02-14	object
raw_read_count	Raw Reads Count	#Reads in the original FASTQ	NaN	metadata.parquet	Integer	NaN	NaN	Data	Single	10K	Metadata	2019-02-14	int64
trimmed_read_count	Post QC Reads Count	#Reads in the Post QC FASTQ	NaN	metadata.parquet	Integer	NaN	NaN	Data	Single	10K	Metadata	2019-02-14	int64
...	...	...	...	...	...	...	...	...	...	...	...	...	...
k__Bacteria\|p__Verrucomicrobia\|c__Verrucomicrobiae\|o__Verrucomicrobiales\|f__Akkermansiaceae\|g__GGB6529\|s__GGB6529_SGB9222\|t__SGB9222	SGB9222	SGB9222	NaN	metaphlan_strain.parquet	Continuous	Percent	NaN	Data	single	10K	Tabular	2019-02-14	float64
k__Eukaryota\|p__Ascomycota\|c__Saccharomycetes\|o__Saccharomycetales\|f__Saccharomycetaceae\|g__Saccharomyces\|s__Saccharomyces_cerevisiae\|t__EUK4932	EUK4932	EUK4932	NaN	metaphlan_strain.parquet	Continuous	Percent	NaN	Data	single	10K	Tabular	2019-02-14	float64
k__Eukaryota\|p__Eukaryota_unclassified\|c__Eukaryota_unclassified\|o__Eukaryota_unclassified\|f__Entamoebidae\|g__Entamoeba\|s__Entamoeba_dispar\|t__EUK46681	EUK46681	EUK46681	NaN	metaphlan_strain.parquet	Continuous	Percent	NaN	Data	single	10K	Tabular	2019-02-14	float64
k__Eukaryota\|p__Eukaryota_unclassified\|c__Eukaryota_unclassified\|o__Eukaryota_unclassified\|f__Eukaryota_unclassified\|g__Blastocystis\|s__Blastocystis_sp_subtype_1\|t__EUK944036	EUK944036	EUK944036	NaN	metaphlan_strain.parquet	Continuous	Percent	NaN	Data	single	10K	Tabular	2019-02-14	float64
k__Eukaryota\|p__Eukaryota_unclassified\|c__Eukaryota_unclassified\|o__Eukaryota_unclassified\|f__Hexamitidae\|g__Giardia\|s__Giardia_intestinalis\|t__EUK5741	EUK5741	EUK5741	NaN	metaphlan_strain.parquet	Continuous	Percent	NaN	Data	single	10K	Tabular	2019-02-14	float64

9903 rows × 13 columns
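
The two read-count fields in the dictionary above, `raw_read_count` and `trimmed_read_count`, can be combined into a per-sample measure of how many reads survived quality control. The sketch below is only an illustration: it builds a small stand-in DataFrame by hand rather than loading the real `metadata` table, so the sample names, counts, and the 0.8 threshold are all invented for the example.

```python
import pandas as pd

# Hypothetical values for illustration; only the column names come from
# the data dictionary.
metadata = pd.DataFrame(
    {
        "SampleName": ["sample_1", "sample_2", "sample_3"],
        "raw_read_count": [10_000_000, 8_000_000, 12_000_000],
        "trimmed_read_count": [9_000_000, 6_000_000, 11_400_000],
    }
)

# Fraction of reads that survived quality control, per sample.
metadata["qc_retained_fraction"] = (
    metadata["trimmed_read_count"] / metadata["raw_read_count"]
)

# Flag samples that lost more than 20% of their reads (arbitrary threshold).
low_retention = metadata[metadata["qc_retained_fraction"] < 0.8]
print(low_retention[["SampleName", "qc_retained_fraction"]])
```

With the real data, the same two columns would come from the `metadata` table instead of the hand-built frame.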