PhD Thesis: Prediction of Internal Exposure to Persistent Organic Pollutants and Breast Cancer Risk in the E3N Cohort: Machine Learning and Mixture Analysis Approaches. M/F - Doctorat.Gouv.Fr
- Fixed-term contract (CDD)
- Doctorat.Gouv.Fr
Job description
Institution: Université Paris-Saclay GS Santé publique
Doctoral school: Santé Publique
Research laboratory: Centre de Recherche en épidémiologie et Santé des populations
Thesis supervisor: Francesca Romana MANCINI (ORCID 0000-0003-2297-3869)
Thesis start date: 2026-10-01
Application deadline: 2026-05-08T23:59:59
Persistent organic pollutants (POPs) are bioaccumulative chemicals to which the general population is exposed mainly through diet. Because of their long biological half-lives, internal exposure is best assessed with blood biomarkers; however, biomonitoring is costly, which limits sample sizes and statistical power in large epidemiological studies. Existing indirect exposure assessment methods suffer from measurement error or limited scalability. Machine learning (ML) offers a promising alternative by combining multiple exposure-related variables to predict internal POP concentrations in large populations, thereby enabling more powerful analyses of the health effects of POPs, notably breast cancer.
The main objectives of this project are: 1) to predict internal exposure to POPs in the E3N-Generations cohort using ML models trained on measured biomarker data; and 2) to assess the association between predicted POP exposure and breast cancer risk.
Approximately 1,000 women in the E3N-Generations cohort already have measured blood POP levels. Detailed data on diet, lifestyle, reproductive factors, and anthropometric characteristics are also available.
This subcohort will be split into a training set (90%) and a test set. A large library of models will be evaluated, including linear and penalised regressions, generalised additive models, support vector machines, gradient boosting methods, and neural networks. A data-adaptive Super Learner will combine these models into an optimally weighted ensemble using cross-validation.
Predicted exposures will be assigned to approximately 75,000 women in the E3N cohort, including more than 8,000 incident breast cancer cases, and analysed using Cox models, overall and by oestrogen receptor (ER) status. The effects of POP mixtures will be assessed using several modelling approaches.
This project will deliver a validated, scalable ML-based methodological framework for predicting internal POP exposure in large cohorts, and will generate new knowledge on the association between POPs and breast cancer risk.
The doctoral candidate will be supervised by Francesca Romana Mancini (thesis supervisor) and Germán Cano-Sancho (co-supervisor), who combine expertise in environmental epidemiology and exposure assessment, and will work closely with Vittorio Perduca (co-advisor), who has extensive experience in ML applied to epidemiological studies.
Persistent Organic Pollutants (POPs) are a group of chemicals characterized by their environmental persistence, widespread distribution, tendency to bioaccumulate in human and animal tissues, and documented toxicity to both human health and wildlife. Due to their lipophilic nature, POPs accumulate in adipose tissues and biomagnify along the food chain. For the general population, contaminated food, particularly of animal origin, represents the primary route of exposure (1, 2, 3).
Because most POPs have long biological half-lives, internal exposure levels in humans change slowly over time and are most reliably assessed through biomarkers such as blood concentrations. However, the high costs and logistical complexity of biomonitoring often limit the number of analysed samples, thereby constraining the statistical power of prospective cohort studies, especially for investigating rare health outcomes or performing stratified analyses (4). To overcome this limitation, indirect approaches have been developed to estimate internal exposures across large populations. Traditionally, these approaches rely on dietary exposure-based methods that combine food consumption data with contamination databases, but they are affected by substantial measurement error, variability in food contamination levels, and the inability to capture non-dietary exposures or interindividual toxicokinetic variability. Physiologically based pharmacokinetic (PBPK) models are an alternative: while they offer mechanistic insight into absorption, distribution, metabolism, and excretion processes, their predictive performance and scalability in epidemiological settings are limited by parameter uncertainty, population heterogeneity, complex exposure patterns, and considerable computational demands. These challenges underscore the need for alternative predictive strategies (5).
In recent years, machine learning (ML) has emerged as a powerful tool in exposure science due to its ability to identify complex and nonlinear patterns within multidimensional datasets. Although ML methods have been successfully applied to predicting environmental fate, toxicity, and pollutant-related risk, their use in estimating internal human exposure remains limited. In particular, ML methods are well suited to settings where numerous exposure-related variables, such as dietary habits, lifestyle factors, and anthropometric measures, each have low to moderate predictive value individually but may collectively explain substantial variability in internal concentrations of POPs. Developing ML models capable of leveraging this ensemble of weak predictors to estimate internal POP concentrations would represent a major methodological advance. Such models could enable estimation of internal exposures across entire cohort populations, thereby substantially increasing statistical power and facilitating the investigation of multiple health outcomes, including diseases with relatively low incidence such as breast cancer. To date, no gold standard ML model exists for predicting internal POP exposure; systematic comparison of algorithms is therefore essential to identify the most robust and generalizable approach (6).
Breast cancer incidence has increased significantly since the 1980s in Western countries and continues to rise in many transitioning and high-income Asian countries. Nevertheless, known risk factors do not fully explain this trend. Breast cancer is a highly heterogeneous disease, varying in morphology, biology, clinical behaviour, therapeutic response, and prognosis. The most widely used classification relies on hormone receptor status, including estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) (7).
Exposure to environmental contaminants, particularly endocrine-disrupting chemicals such as certain POPs, is suspected to contribute to breast cancer risk. However, prospective studies with sufficient statistical power to investigate associations between POP exposure and breast cancer risk, including analyses stratified by tumour subtype, remain scarce. Moreover, real-life exposure involves simultaneous contact with multiple chemicals, potentially producing additive or synergistic effects. Traditional epidemiological analyses typically focus on single chemicals, whereas assessing exposure patterns and mixtures is essential given the shared sources and correlated nature of POP exposures (8).
The main objective of this project is to predict internal POP exposure levels for the entire first generation of women in the E3N-Generations cohort using ML models trained and validated on measured POP concentrations and detailed individual data. These predicted exposures, considered both individually and as mixtures, will then be used to investigate the association between POP exposure and breast cancer risk, overall and by estrogen receptor (ER) status (ER-positive vs. ER-negative).
The project is structured around two main research axes, each comprising specific objectives:
Objective 1 - Prediction of internal POPs exposure
1a. To develop and compare statistical and machine learning models for predicting internal concentrations of individual POPs using dietary, lifestyle, and anthropometric variables.
1b. To evaluate model performance, uncertainty, and interpretability, and to extend predictions to exposure mixtures using multivariate machine learning approaches.
Objective 2 - POPs exposure and breast cancer risk
2a. To investigate the association between predicted internal exposure to individual POPs and the risk of breast cancer in the E3N cohort.
2b. To assess the association between predicted exposure to POP mixtures and breast cancer risk, overall and stratified by oestrogen receptor status.
The E3N-Generations cohort is a three-generation family-based study built upon the historical E3N cohort, which was initiated in 1990 with the enrollment of 98,995 women aged 40-65 years, all members of an insurance plan mainly covering people working in the national education system (Mutuelle Générale de l'Éducation Nationale). The cohort has been expanded to include family members of the women, making it one of the few multigenerational prospective cohorts in Europe. Ultimately, the E3N-Generations cohort will encompass three generations: the original E3N women and the fathers of their children, constituting the first generation (E3N-G1); their children as the second generation (E3N-G2); and their grandchildren as the third generation (E3N-G3) (https://www.e3n-generations.fr/).
The detailed protocol of the E3N cohort (which includes only the women of E3N-G1) has been previously described (9, 10). Briefly, self-administered questionnaires are sent by post to participants every two to three years to collect information on lifestyle, reproductive and medical history, and overall health status. Dietary data were also collected twice using a semi-quantitative food frequency questionnaire including over 250 food items and assessing the habitual diet of the previous year (11).
Between 1994 and 1999, participants were invited to donate blood, resulting in the collection of samples from approximately 25,000 women. Each sample was separated into 28 aliquots, including plasma, serum, buffy coat, leukocytes, and erythrocytes, and stored in plastic straws in liquid nitrogen containers (-196°C) within a biobank.
Objective 1 - Prediction of internal POPs exposure
Study population
Among the women who provided a blood sample, a representative subcohort of ~1,000 women has been selected. For all these women, a total of 30 organochlorine pesticides (OCPs), 4 polybrominated diphenyl ethers (PBDEs), 10 polychlorinated biphenyls (PCBs) and 16 per- and polyfluoroalkyl substances (PFAS) have been measured in serum or plasma samples with reference, highly sensitive mass spectrometry-based methods. We will split the subcohort into a training set (90% of the observations) and a test set (the remaining 10%) for estimating the performance of the trained prediction models.
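As an illustration only, the 90/10 split described above could be sketched as follows. The function name `split_subcohort`, the fixed seed, and the use of integer participant identifiers are assumptions made for this example and are not part of the project protocol.

```python
import random


def split_subcohort(participant_ids, train_fraction=0.9, seed=2026):
    """Randomly split participant identifiers into training and test sets.

    The seed is fixed so that the split is reproducible; the value itself
    is arbitrary (an assumption for this sketch).
    """
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)          # shuffle a copy, reproducibly
    n_train = round(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]


# Hypothetical identifiers 0..999 standing in for the ~1,000 women.
train_ids, test_ids = split_subcohort(range(1000))
```

Keeping the test set untouched until the final model has been chosen is what makes its error estimate an honest measure of out-of-sample performance.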
Statistical analyses
Biomarker concentrations will be mainly analysed as continuous variables. Predicting internal exposure to each POP involves estimating the conditional expectation of each POP given selected covariates. One common problem in ML is that the model that will perform best on a given dataset is not known in advance. We address this problem using the data-adaptive strategy for model selection that we previously adopted for another application to medical data (12). We will consider a large library of ML models, spanning parametric and non-parametric approaches, including standard multiple linear regression, elastic net regression, generalised additive models (GAMs), support vector machines (SVMs), extreme gradient boosting (XGBoost), and neural networks (13). Rather than considering individual models only, we will also train and validate the continuous super learner, a meta-algorithm that combines individual models into a weighted ensemble, with coefficients determined through k-fold cross-validation. This approach has strong theoretical guarantees: asymptotically, it performs at least as well as the best individual model in the library (14). A useful feature of the super learner is that each individual model can be coupled with a variety of model selection approaches. The performance of all models (individual models and super learner) will be evaluated and compared using the mean squared error (MSE) estimated through k-fold cross-validation on the training set, and the model minimizing the cross-validated MSE will be retained. Its MSE will finally be estimated on the test set. Additionally, prediction intervals will be obtained via bootstrapping. We will apply ML interpretability techniques (e.g., SHAP values) to elucidate how models use explanatory variables (15). Multivariate approaches for predicting exposure mixtures, such as multi-output random forests, will also be investigated (16).
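The idea behind the continuous super learner can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the three-model library (mean, ordinary least squares, ridge), the simulated data, and the coarse grid search over convex weights are all assumptions chosen to keep the example short. Actual implementations typically solve a constrained least-squares problem instead of a grid search.

```python
import itertools
import numpy as np


def fit_mean(X, y):
    m = y.mean()
    return lambda X_new: np.full(len(X_new), m)


def fit_linear(X, y, ridge=0.0):
    """Least squares with intercept; ridge > 0 penalises the slopes only."""
    Xd = np.column_stack([np.ones(len(X)), X])
    penalty = ridge * np.eye(Xd.shape[1])
    penalty[0, 0] = 0.0  # the intercept is not penalised
    beta = np.linalg.solve(Xd.T @ Xd + penalty, Xd.T @ y)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ beta


# Hypothetical toy library; the project lists a much richer one.
LIBRARY = {
    "mean": fit_mean,
    "ols": fit_linear,
    "ridge": lambda X, y: fit_linear(X, y, ridge=10.0),
}


def cv_predictions(X, y, k=5, seed=0):
    """Out-of-fold predictions of every learner (one column per learner)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    Z = np.empty((len(y), len(LIBRARY)))
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        for j, fit in enumerate(LIBRARY.values()):
            Z[test_idx, j] = fit(X[train_idx], y[train_idx])(X[test_idx])
    return Z


def simplex_grid(n_learners, steps=20):
    """All weight vectors on a grid that are non-negative and sum to 1."""
    for combo in itertools.product(range(steps + 1), repeat=n_learners - 1):
        if sum(combo) <= steps:
            yield np.array(list(combo) + [steps - sum(combo)]) / steps


def super_learner_weights(Z, y, steps=20):
    """Convex weights minimising the cross-validated MSE (grid search)."""
    best_w, best_mse = None, np.inf
    for w in simplex_grid(Z.shape[1], steps):
        mse = np.mean((y - Z @ w) ** 2)
        if mse < best_mse:
            best_w, best_mse = w, mse
    return best_w, best_mse


# Simulated covariates and outcome, for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -0.5, 0.0, 0.0, 0.3]) + rng.normal(size=300)

Z = cv_predictions(X, y)
weights, ensemble_mse = super_learner_weights(Z, y)
individual_mse = ((y[:, None] - Z) ** 2).mean(axis=0)
```

Because the grid contains the weight vectors that put all mass on a single learner, the ensemble's cross-validated MSE in this sketch cannot exceed that of the best individual learner. In practice each learner would then be refitted on the full training set and combined using `weights`.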
We will deploy a similar strategy for binarized biomarkers (e.g., the concentration is below/above a given threshold). In this case we will use the area under the ROC curve (AUC) as the optimisation criterion. For biomarkers that are best analysed as categorical variables with more than two categories, the super learner is not yet a viable option, as the current R implementation was mainly developed for continuous and binary outcomes. In this case, we will instead implement individual models developed for multiclass classification, such as random forests.
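For the binarized case, the AUC can be computed directly from its rank interpretation: the probability that a randomly chosen above-threshold observation receives a higher predicted score than a randomly chosen below-threshold one. The sketch below is illustrative; the threshold, the concentrations, and the predicted scores are made-up values.

```python
def binarize(concentrations, threshold):
    """Return 1 when the concentration is above the threshold, else 0."""
    return [1 if c > threshold else 0 for c in concentrations]


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up measured concentrations and model scores, for illustration only.
labels = binarize([0.2, 0.9, 1.4, 2.0], threshold=1.0)
scores = [0.1, 0.4, 0.35, 0.8]
example_auc = auc(labels, scores)
```

Here three of the four (positive, negative) pairs are ordered correctly, so `example_auc` is 0.75.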
Objective 2 - POPs exposure and breast cancer risk
Study population
All women enrolled in the E3N cohort having answered the semi-quantitative food frequency questionnaire sent in 1993 will be included, representing approximately 75,000 participants.
Breast cancer cases
Breast cancer cases in the E3N cohort were identified primarily through self-administered questionnaires sent to participants every two to three years. Additional cases were ascertained through spontaneous reports from next of kin and by linkage with the national cause-of-death registry. Overall, 93% of reported breast cancer cases have been validated through pathology reports or medical records, ensuring high diagnostic accuracy. Between 1993 and 2018, approximately 8,000 primary incident breast cancer cases were identified and confirmed.
Statistical analyses
Associations between predicted internal POP exposure and breast cancer risk will be assessed using Cox proportional hazards models with age as the time scale. Entry time will correspond to age at completion of the dietary questionnaire, and exit time will be age at breast cancer diagnosis for cases or age at censoring (death, loss to follow-up, diagnosis of another cancer, or end of follow-up), whichever occurred first. Different approaches, such as weighted quantile sum regression and principal component regression, will be applied to investigate the potential mixture effect of exposure to multiple POPs on breast cancer risk.
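Using age as the time scale means that a woman contributes to the risk set at a given age only if she had already entered the study (delayed entry) and had not yet exited at that age. The sketch below computes the Cox partial log-likelihood under this convention. It is an illustration under stated assumptions: the four participants are invented, ties are handled with the Breslow convention, and a real analysis would rely on established survival software in place of this function.

```python
import math
import numpy as np


def cox_partial_loglik(beta, entry_age, exit_age, event, x):
    """Cox partial log-likelihood with age as the time scale.

    A participant is at risk at age t if entry_age < t <= exit_age,
    which accounts for delayed entry (left truncation).
    """
    eta = x @ np.atleast_1d(beta)  # linear predictor for each participant
    loglik = 0.0
    for i in np.flatnonzero(event):
        t = exit_age[i]
        at_risk = (entry_age < t) & (exit_age >= t)
        loglik += eta[i] - math.log(np.exp(eta[at_risk]).sum())
    return loglik


# Invented data: ages at questionnaire completion and at exit,
# case indicator, and one predicted exposure per participant.
entry_age = np.array([50.0, 52.0, 55.0, 60.0])
exit_age = np.array([58.0, 70.0, 62.0, 65.0])
event = np.array([1, 0, 1, 0])
x = np.array([[1.0], [0.0], [1.0], [0.0]])

loglik_at_zero = cox_partial_loglik(0.0, entry_age, exit_age, event, x)
```

In this toy example the fourth participant enters at age 60 and is therefore excluded from the risk set of the case diagnosed at age 58, while the first participant has already exited when the second case occurs at age 62. Each risk set thus contains three women, and at beta = 0 the log-likelihood equals -2 log 3.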
Analyses will be conducted for breast cancer overall and stratified by oestrogen receptor status (ER-positive and ER-negative). Directed acyclic graphs (DAGs) will be used to identify appropriate confounders. Potential interactions with relevant covariates will be examined, and stratified analyses will be performed when appropriate.
Candidate profile
The candidate must hold a Master 2 degree (or equivalent) in biostatistics, statistics, data science, epidemiology, public health, or a related discipline.
A solid background in statistical methods and data analysis is required. Knowledge of machine learning and modelling will be valued, although advanced expertise is not required at the outset. Proficiency in statistical analysis software, notably R and/or Python, is expected.
The candidate should have basic knowledge of epidemiology, ideally environmental epidemiology. An interest in public health issues and in questions related to environmental exposures will be an asset.
Autonomy, scientific rigour, analytical ability, and the capacity to work in a team in an interdisciplinary environment are essential. Good writing skills and a good level of English are required for writing scientific articles and presenting results at international conferences.