PhD Thesis: Data Compression on Graphs, from Signal Processing to Machine Learning (M/F) - Paris (75)
Job description
- Institut Polytechnique de Paris Télécom Paris
- Paris - 75
- Fixed-term contract (CDD)
- Published 20 March 2026
Institution: Institut Polytechnique de Paris Télécom Paris
Doctoral school: Ecole Doctorale de l'Institut Polytechnique de Paris
Research laboratory: Laboratoire de Traitement et Communication de l'Information
Thesis supervisor: Jhony GIRALDO (ORCID 0000-0002-0039-1270)
Thesis start date: 1 October 2026
Application deadline: 10 April 2026, 23:59
This project aims to develop a methodological framework for summarizing large graph-structured datasets while preserving the information needed for machine learning tasks. Graph-structured data arise in many domains, such as social networks, recommender systems, biological networks, and scientific data analysis, but their sheer volume often makes learning and storage computationally expensive. This thesis proposes a novel approach that jointly summarizes the graph topology and the signals defined on it. Instead of relying on traditional node sampling or partition-based graph simplification, the project exploits higher-order structures, such as cliques, to build compact aggregated representations that retain the key structural and spectral properties of the original graph.
This PhD project will develop theoretical guarantees for reconstructing signals from aggregated measurements and for the performance of graph neural networks (GNNs) operating on compressed graphs. By extending classical graph signal sampling theory to aggregation-based operators, the thesis will design scalable algorithms that balance compression against predictive accuracy. Ultimately, the project aims to enable efficient and reliable learning on large relational datasets, providing both theoretical foundations and practical tools for scalable graph machine learning across applications.
Graph signal processing (GSP) [1, 2, 3] and graph machine learning (GML) [4, 5, 6] are research fields that aim to generalize classical concepts from signal processing and machine learning to data defined on graphs. In GSP, classical operations such as filtering, sampling, and reconstruction are extended to graph signals, i.e., functions defined over the vertices of a graph [7, 8]. In GML, neural network architectures are adapted to graph-structured data, resulting in models such as graph neural networks (GNNs) [9] and message-passing neural networks (MPNNs) [10]. GSP and GML are crucial because graph-structured data appear in numerous applications, including social networks [3], the web [11], recommender systems [12], biological networks [13], and knowledge graphs [14], among others. A common challenge across these domains is the massive scale of the underlying graphs, which often contain millions or even billions of nodes and edges [15]. Therefore, it becomes essential to obtain compact representations from the original data, such as graphs with fewer nodes and sampled or aggregated graph signals, enabling more efficient learning and processing.
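As a toy illustration of these definitions (the graph, signal, and filter coefficients below are illustrative choices, not part of the project), a graph signal is simply one value per vertex, and a polynomial graph filter mixes each node's value with its neighbors' via the graph shift operator:

```python
import numpy as np

# Toy undirected graph on 4 nodes: a triangle {0, 1, 2} plus a pendant node 3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# A graph signal assigns one value to each vertex.
x = np.array([1.0, 2.0, 3.0, 4.0])

# A first-order polynomial graph filter h(S) = a0*I + a1*S, with the
# adjacency matrix as shift operator S: each output value mixes a node's
# own value with the sum of its neighbors' values.
S = A
y = 0.5 * x + 0.5 * (S @ x)
```

Higher-order polynomials in S extend this to k-hop neighborhoods, which is the same local computation pattern that message-passing GNNs build on.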
In the GSP literature, a large body of work has addressed the problems of sampling and reconstruction of graph signals [7, 16, 17, 18]. These works provide theoretical guarantees for reconstructing graph signals from samples based on the spectral properties of the graph, guiding the design of optimal or near-optimal sampling sets [8, 19, 20]. However, these methods assume the original graph is not modified and consider only the problem of sampling graph signals. In parallel, the GML community has developed principled graph coarsening methods [21], including recent approaches with message-passing guarantees [22, 23]. In graph coarsening, also called graph summarization in some contexts [24], the general objective is to compress a large graph into a smaller, more manageable representation while preserving the information that matters for a downstream task (e.g., visualization, learning, storage). Typically, these approaches only consider topological summarization, whereas we consider both graph and signal summarization.
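The classical sampling results mentioned above can be sketched in a few lines: when a signal is bandlimited (it lies in the span of the k lowest-frequency Laplacian eigenvectors) and the sampled rows of that eigenbasis have full column rank, least-squares reconstruction is exact. The path graph, coefficients, and sample set below are illustrative assumptions:

```python
import numpy as np

# Path graph on 5 nodes; combinatorial Laplacian L = D - A.
n = 5
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(1)) - A

# Bandlimited signal: lies in the span of the k lowest-frequency eigenvectors.
eigvals, U = np.linalg.eigh(L)
k = 2
Uk = U[:, :k]
x = Uk @ np.array([1.0, -0.5])  # hypothetical spectral coefficients

# Sample two vertices and reconstruct by least squares on the subspace;
# recovery is exact when the sampled rows Uk[idx] have full column rank.
idx = [0, 3]
coeffs, *_ = np.linalg.lstsq(Uk[idx, :], x[idx], rcond=None)
x_hat = Uk @ coeffs
```

The thesis asks what replaces the row-selection matrix `Uk[idx, :]` when measurements aggregate several nodes instead of selecting single ones.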
Establishing a principled framework for jointly summarizing graph signals and topology with (i) recovery guarantees for the signal and (ii) task-level guarantees on the compressed topology remains an open challenge. We address this challenge by estimating higher-order structures (e.g., cliques) and using them to induce an aggregation operator and a coarsened graph. These higher-level structures will enable us (i) to summarize signals and reconstruct them with guarantees, and (ii) to run GNNs directly on the summarized representation while preserving downstream performance. Unlike successive local aggregations [18], which form measurements by repeatedly applying the graph shift operator (e.g., the graph adjacency matrix) while leaving the graph topology unchanged, we define a summarization operator based on higher-order structures, such as cliques, and use the resulting reduced topology for downstream learning tasks. Similarly, graph coarsening with message-passing guarantees [22, 23], which builds on spectral properties [21], is not localized in the vertex domain and thus lacks a local structural interpretation. Our approach instead obtains the aggregation map by sampling higher-order structures in the vertex domain, simplifying operator design while still enabling guarantees for GNN/MPNN propagation. A further difference from previous methodologies is that spectral-based coarsening [21, 22] merges groups of nodes into supernodes where these groups form a partition of the node set, whereas we summarize by sampling higher-order units in the vertex domain; our summarization map therefore need not form a partition of the node set.
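A schematic sketch of what such a clique-induced operator could look like (the graph, the sampled cliques, and the uniform averaging weights are illustrative assumptions, not the project's final design): each sampled clique yields one averaging row of an aggregation matrix C, and supernode connectivity follows from inter-clique edges in the original graph.

```python
import numpy as np

# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

# Sampled higher-order units (here: the two triangles). Unlike partition-
# based coarsening, sampled cliques may overlap or leave nodes uncovered.
cliques = [[0, 1, 2], [3, 4, 5]]

# Aggregation operator C: row c averages the signal over clique c.
C = np.zeros((len(cliques), A.shape[0]))
for c, nodes in enumerate(cliques):
    C[c, nodes] = 1.0 / len(nodes)

# Summarized signal, and a coarsened adjacency linking two supernodes
# whenever any edge connects their cliques in the original graph.
x = np.arange(6, dtype=float)
x_c = C @ x
A_c = (C @ A @ C.T > 0).astype(float)
np.fill_diagonal(A_c, 0.0)
```

A GNN could then run on (`A_c`, `x_c`) directly; the guarantees the thesis targets would bound how much this changes message-passing outputs relative to the original graph.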
A key question we study is: When are such higher-order structures informative and practically useful for graph summarization and learning? We target regimes where cliques are common (e.g., community-like substructures), so that aggregation captures smooth variations in the clique and preserves task performance. However, summarization should be structure-aware; other structures like cycles or paths should be considered with different mapping functions. This problem has broad applications, ranging from graph data compression and storage to efficient processing of relational datasets in domains such as social networks, the web, recommender systems, and biological systems.
The development of a unified graph and graph signal summarization and learning framework presents several challenges. First, a principled formulation that allows for the summarization and reconstruction of graph signals is lacking (Challenge 1). Here, we aim to formalize conditions under which subspace-exact reconstruction is possible, i.e., perfect recovery on a target subspace (typically a low-frequency eigenspace of a reference operator). Specifically, we plan to build upon classical GSP results in sampling and reconstruction, where recovery guarantees are provided when the graph signals lie on a subspace of the set of eigenvectors. Here, we want to extend those results to the setting where the summarization operator aggregates information from multiple nodes forming higher-order structures rather than merely selecting subsets of nodes. In practice, we will focus on graphs with clique structures, a common property in real-world networks [25, 26, 27]. Once this theoretical framework is established, we will integrate smoothness priors (Challenge 2), a typical assumption in GSP [7, 28, 29]. These priors will enable the development of theoretically grounded GML models that address the problem of graph signal aggregation and reconstruction (Challenge 3). An additional benefit of this approach is the possibility of performing efficient learning directly on summarized representations (Challenge 4), reducing computational costs without significant loss of performance for machine learning applications.
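The smoothness prior of Challenge 2 can be sketched as a Tikhonov-regularized recovery from aggregated measurements, min_x ||Cx - y||² + μ xᵀLx; the graph, measurements, ground-truth signal, and μ below are illustrative assumptions:

```python
import numpy as np

# Two-triangle toy graph and its combinatorial Laplacian L = D - A.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(1)) - A

# Aggregated measurements: averages over the two triangles.
C = np.zeros((2, 6))
C[0, [0, 1, 2]] = 1 / 3
C[1, [3, 4, 5]] = 1 / 3
x_true = np.array([1.0, 1.0, 1.2, 3.0, 3.0, 3.2])  # smooth on the graph
y = C @ x_true

# Smoothness-regularized recovery:
#   x_hat = argmin_x ||C x - y||^2 + mu * x' L x
# The system matrix is invertible here because the graph is connected and
# constant signals (the Laplacian null space) are not annihilated by C.
mu = 0.1
x_hat = np.linalg.solve(C.T @ C + mu * L, C.T @ y)
```

Only two measurements constrain six unknowns, so the Laplacian term is what makes the problem well-posed; the theoretical work in Challenges 1 and 2 would characterize when such priors yield exact or stable recovery on a target subspace.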