2023-05-12 Journal article
Combination of whole genome sequencing and supervised machine learning provides unambiguous identification of eae-positive Shiga toxin-producing Escherichia coli
Vorimore, Fabien
Jaudou, Sandra
Tran, Mai-Lan
Richard, Hugues
Fach, Patrick
Delannoy, Sabine
Introduction: The objective of this study was to develop, using a genome-wide
machine learning approach, an unambiguous model to predict the presence of
highly pathogenic STEC in E. coli read assemblies derived from complex samples
potentially containing multiple E. coli strains. Our approach took into account
the high genomic plasticity of E. coli and used the stratification of STEC and
E. coli pathogroups, based on serotype and virulence factors,
to identify specific combinations of biomarkers for improved characterization of
eae-positive STEC (also called EHEC, for enterohemorrhagic E. coli), which are
associated with bloody diarrhea and hemolytic uremic syndrome (HUS) in humans.
Methods: A machine learning (ML) approach was applied to a large
curated dataset composed of 1,493 E. coli genome sequences and 1,178 coding
sequences (CDS). Feature selection was performed using eight classification
algorithms, reducing the number of CDS to six. The eight ML models were then
trained on this reduced dataset with hyperparameter tuning and
cross-validation.
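The general shape of such a pipeline (feature selection down to a handful of CDS, then tuning and cross-validation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the data are synthetic, the frequency-gap ranking and the vote-threshold classifier are simple stand-ins for the eight algorithms mentioned above, and all function names (`make_synthetic`, `select_features`, `cross_validate`) are hypothetical. The threshold is tuned on the training folds only, a simplification of a full nested search.

```python
import random


def make_synthetic(n_genomes=200, n_cds=50, n_informative=6, seed=0):
    """Build a toy presence/absence matrix (rows: genomes, columns: CDS).

    The first `n_informative` columns are enriched in the positive class;
    all other columns are noise. Purely illustrative, not real data.
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_genomes):
        label = rng.randint(0, 1)  # 1 = target pathotype, 0 = other E. coli
        row = []
        for j in range(n_cds):
            if j < n_informative:
                p = 0.9 if label == 1 else 0.1
            else:
                p = 0.5
            row.append(1 if rng.random() < p else 0)
        X.append(row)
        y.append(label)
    return X, y


def select_features(X, y, k):
    """Rank CDS by the gap in presence frequency between classes; keep top k."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    gaps = []
    for j in range(len(X[0])):
        freq_pos = sum(row[j] for row in pos) / len(pos)
        freq_neg = sum(row[j] for row in neg) / len(neg)
        gaps.append(freq_pos - freq_neg)
    ranked = sorted(range(len(gaps)), key=lambda j: abs(gaps[j]), reverse=True)
    selected = sorted(ranked[:k])
    # Direction: does presence (1) or absence (0) point to the positive class?
    directions = {j: 1 if gaps[j] >= 0 else 0 for j in selected}
    return selected, directions


def predict(row, selected, directions, threshold):
    """Call a genome positive if enough selected CDS match the positive pattern."""
    votes = sum(1 for j in selected if row[j] == directions[j])
    return 1 if votes >= threshold else 0


def accuracy(X, y, selected, directions, threshold):
    hits = sum(predict(row, selected, directions, threshold) == label
               for row, label in zip(X, y))
    return hits / len(y)


def cross_validate(X, y, k_features=6, n_folds=5, seed=0):
    """K-fold CV; selection and threshold tuning use the training folds only."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        X_te, y_te = [X[i] for i in fold], [y[i] for i in fold]
        selected, directions = select_features(X_tr, y_tr, k_features)
        # Hyperparameter tuning: pick the vote threshold that fits the training folds best.
        best_t = max(range(1, k_features + 1),
                     key=lambda t: accuracy(X_tr, y_tr, selected, directions, t))
        scores.append(accuracy(X_te, y_te, selected, directions, best_t))
    return scores


if __name__ == "__main__":
    X, y = make_synthetic()
    print("selected CDS columns:", select_features(X, y, 6)[0])
    print("per-fold accuracy:", cross_validate(X, y))
```

Repeating the feature selection inside each fold keeps the held-out genomes from influencing which CDS are chosen, which would otherwise inflate the reported accuracy.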
Results and discussion: Remarkably, using only these six genes, EHEC can
be clearly identified in E. coli read assemblies obtained from in silico mixtures
and complex samples such as milk metagenomes. These various combinations
of discriminative biomarkers can be implemented as novel marker genes for the
unambiguous characterization of EHEC in mixtures of different E. coli strains as well
as in raw milk metagenomes.