Seminars
Marie Chion, MRC Biostatistics Unit, Univ. Cambridge, UK.
https://www.mrc-bsu.cam.ac.uk/staff/marie-chion
From multiple imputation to the Bayesian framework in quantitative proteomics
Abstract In this seminar, we will look at the problem of missing values in quantitative mass-spectrometry-based proteomics data. One way of dealing with this problem is to impute missing values, i.e. to replace them with a value defined by the user or by an algorithm. Multiple imputation iterates the imputation process several times to obtain several complete datasets, which are then combined before applying conventional statistical tools. However, the usual software for the statistical analysis of proteomics data uses the averaged complete dataset and ignores the uncertainty induced by the random imputation process. We therefore present a rigorous multiple-imputation method based on Rubin's rules and a variant of the moderated t-test that accounts for the variability arising from both the initial dataset and the multiple imputation process. As the moderated t-test is based on a Bayesian hierarchical model, we also propose a fully Bayesian framework for differential proteomic analysis and discuss the place of multiple imputation in such a framework.
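As a side illustration of the pooling step mentioned above, here is a minimal Python sketch of Rubin's rules for combining per-imputation estimates; the function name rubin_pool and the example numbers are purely illustrative and are not taken from the talk or its associated software.

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Pool per-imputation estimates with Rubin's rules.

    estimates, variances: arrays of shape (M,), one entry per imputed dataset.
    Returns the pooled estimate and its total variance
    T = W + (1 + 1/M) * B, where W is the within-imputation variance
    and B the between-imputation variance.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()          # pooled point estimate
    w = variances.mean()              # within-imputation variance
    b = estimates.var(ddof=1)         # between-imputation variance
    t = w + (1.0 + 1.0 / m) * b       # total variance
    return q_bar, t

# Illustrative example: log-fold-change estimates for one peptide across M = 5 imputed datasets.
est = [0.82, 0.75, 0.91, 0.78, 0.85]
var = [0.04, 0.05, 0.04, 0.06, 0.05]
print(rubin_pool(est, var))
```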
Pierre HUMBERT, LPSM, Sorbonne Univ
https://www.imo.universite-paris-saclay.fr/fr/perso/pierre-humbert/
Title: TBA
Tâm Le Minh, Inria, Laboratoire Jean Kuntzmann, Université Grenoble Alpes, https://tam-leminh.github.io
Exchangeable models for ecological interaction data
Abstract In ecology, the analysis of survey data (presence-absence, abundances, interactions between species) often relies on "null models". However, this approach has limitations that are frequently ignored in ecological studies. Taking plant-pollinator interaction networks as an example, we introduce the BEDD (Bipartite Expected Degree Distribution) model, a null model that overcomes several of these limitations by relying on the assumption that the observed species are exchangeable. The properties of exchangeable models make it possible to use inference methods based on U-statistics, a class of statistics particularly well suited to this type of data structure. I will describe some of the opportunities offered by U-statistics for the analysis of bipartite networks, in particular in the context of ecological interactions. Through examples on simulated and real data, I will highlight the potential of this approach, while discussing its limits, notably those induced by the exchangeability assumption.
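To give a concrete flavour of the kind of statistics involved, here is a toy Python sketch of a U-statistic that averages a kernel over pairs of rows and pairs of columns of a bipartite incidence matrix; the kernel and the simulated network are illustrative assumptions, not the estimators studied in the talk.

```python
import itertools
import numpy as np

def bipartite_u_statistic(Y, kernel):
    """Average a kernel over all pairs of distinct rows and distinct columns
    of a bipartite incidence matrix Y (rows = e.g. plants, columns = pollinators)."""
    n, m = Y.shape
    vals = [
        kernel(Y[np.ix_([i1, i2], [j1, j2])])
        for i1, i2 in itertools.combinations(range(n), 2)
        for j1, j2 in itertools.combinations(range(m), 2)
    ]
    return np.mean(vals)

# Toy kernel: symmetrised product of the "diagonal" entries of each 2x2 sub-matrix.
# Under row/column exchangeability its U-statistic estimates the squared edge density.
kernel = lambda sub: 0.5 * (sub[0, 0] * sub[1, 1] + sub[0, 1] * sub[1, 0])

rng = np.random.default_rng(0)
Y = rng.binomial(1, 0.3, size=(12, 15))   # toy presence/absence interaction matrix
print(bipartite_u_statistic(Y, kernel), 0.3 ** 2)
```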
Aurélien BEAUDE, PhD student, AROB@S team, IBISC, https://forge.ibisc.univ-evry.fr/abeaude/AttOmics
Title: The attention mechanism for omics data
Abstract The increasing availability of high-throughput omics data allows for considering a new medicine centered on individual patients. Precision medicine relies on exploiting these high-throughput data with machine-learning models, especially those based on deep-learning approaches, to improve diagnosis. Due to the high-dimensional, small-sample nature of omics data, current deep-learning models end up with many parameters and have to be fitted with a limited training set. Cellular functions are governed by the combined action of multiple molecular entities specific to a patient. The expression of one gene may impact the expression of other genes differently in different patients. With classical deep-learning approaches, the interactions learned during training are assumed to be identical for all patients in the inference phase. Self-attention can be used to improve the representation of the feature vector by incorporating dynamically computed relationships between elements of the vector, i.e., by computing patient-specific feature interactions. Applying self-attention to high-dimensional vectors such as omics profiles is challenging, as the memory requirements of self-attention scale quadratically with the number of elements. In AttOmics, to reduce the memory footprint of the self-attention computation, we decompose each omics profile into a set of groups, where each group contains related features. Group embeddings are computed by projecting each group with its own FCN, considering only intra-group interactions. Inter-group interactions are computed by applying multi-head self-attention to the set of groups. With this approach, we can reduce the number of parameters compared to an MLP of similar dimension while accurately predicting the type of cancer. We extended this work to a multimodal setting in CrossAttOmics and used cross-attention to compute interactions between two modalities. Instead of computing the interactions between all the modality pairs, we focused on the known regulatory links between the different omics. By using only two or three omics combinations, CrossAttOmics can achieve better accuracy than training on one modality only. When training on small datasets, CrossAttOmics performs better than other architectures.
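A minimal PyTorch sketch of the grouping idea described above, assuming features are already ordered by group: each group is projected by its own fully connected encoder, and multi-head self-attention is then applied across the group embeddings. Group sizes, dimensions, and module names are illustrative and do not reproduce the actual AttOmics implementation.

```python
import torch
import torch.nn as nn

class GroupedSelfAttention(nn.Module):
    """Illustrative sketch: project each feature group with its own FCN,
    then let groups interact through multi-head self-attention."""

    def __init__(self, group_sizes, d_embed=64, n_heads=4):
        super().__init__()
        # One small fully connected encoder per group (intra-group interactions only).
        self.group_encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(g, d_embed), nn.ReLU(), nn.Linear(d_embed, d_embed))
            for g in group_sizes
        )
        # Self-attention over the set of group embeddings (inter-group interactions).
        self.attn = nn.MultiheadAttention(d_embed, n_heads, batch_first=True)
        self.group_sizes = group_sizes

    def forward(self, x):
        # x: (batch, sum(group_sizes)) omics profile, features ordered by group.
        groups = torch.split(x, self.group_sizes, dim=1)
        tokens = torch.stack([enc(g) for enc, g in zip(self.group_encoders, groups)], dim=1)
        out, _ = self.attn(tokens, tokens, tokens)   # (batch, n_groups, d_embed)
        return out.flatten(1)                        # flat representation for a classifier head

# Toy usage: 3 groups of related features instead of one 900-dimensional attention input.
model = GroupedSelfAttention(group_sizes=[300, 300, 300])
profile = torch.randn(8, 900)
print(model(profile).shape)   # torch.Size([8, 192])
```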
Functional data analysis (trajectories, random functions): Adaptive Functional Data Analysis
Valentin Patilea, CREST, ENSAI, France, https://ensai.fr/equipe/valentin-patilea/
Abstract Functional Data Analysis (FDA) depends critically on the regularity of the observed curves or surfaces. Estimating this regularity is a difficult problem in nonparametric statistics. In FDA, however, it is much easier due to the replication nature of the data. After introducing the concept of local regularity for functional data, we provide user-friendly nonparametric methods for investigating it, for which we derive non-asymptotic concentration results. As an application of the local regularity estimation, the implications for functional PCA are shown. Flexible and computationally tractable estimators for the eigenelements of noisy, discretely observed functional data are proposed. These estimators adapt to the local smoothness of the sample paths, which may be non-differentiable and have time-varying regularity. In the course of constructing our estimator, we derive upper bounds on the quadratic risk and obtain the optimal smoothing bandwidth that minimizes these risk bounds. The optimal bandwidth can be different for each of the eigenelements. Simulation results justify our methodological contribution, which is available for use in the R package FDAdapt. Extensions of the adaptive FDA approach to streaming and multivariate functional data are also discussed.
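As a rough illustration of estimating local regularity from replicated curves, here is a sketch based on the classical ratio of mean-squared increments at two scales; this is an assumed, simplified estimator meant for intuition only, not necessarily the one implemented in the FDAdapt package.

```python
import numpy as np

def local_regularity(curves, grid, t0, delta):
    """Estimate the local Hoelder exponent H of a sample of curves near t0 via
        H ~ ( log theta(2*delta) - log theta(delta) ) / (2 * log 2),
    where theta(d) = E[(X(t0 + d) - X(t0))^2] is estimated by averaging over curves."""
    def theta(d):
        i0 = np.argmin(np.abs(grid - t0))
        i1 = np.argmin(np.abs(grid - (t0 + d)))
        return np.mean((curves[:, i1] - curves[:, i0]) ** 2)
    return (np.log(theta(2 * delta)) - np.log(theta(delta))) / (2 * np.log(2.0))

# Toy check on approximate Brownian-motion paths, for which H = 1/2.
rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 1001)
increments = rng.normal(0.0, np.sqrt(np.diff(grid)), size=(500, 1000))
curves = np.concatenate([np.zeros((500, 1)), np.cumsum(increments, axis=1)], axis=1)
print(local_regularity(curves, grid, t0=0.4, delta=0.02))   # close to 0.5
```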
14h00: Hugues Van Assel, ENS de Lyon
Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein Projection
Abstract Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets. Traditionally, this involves using dimensionality reduction methods to project data onto lower-dimensional spaces or organizing points into meaningful clusters. In practice, these methods are used sequentially, without guaranteeing that the clustering aligns well with the conducted dimensionality reduction. In this work, we offer a fresh perspective: that of distributions. Leveraging tools from optimal transport, particularly the Gromov-Wasserstein distance, we unify clustering and dimensionality reduction into a single framework called distributional reduction. This allows us to jointly address clustering and dimensionality reduction with a single optimization problem. Through comprehensive experiments, we highlight the versatility of our method and show that it outperforms existing approaches across a variety of image and genomics datasets.
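A simplified illustration, assuming the POT (Python Optimal Transport) library is available: a Gromov-Wasserstein coupling between the data's distance matrix and a small prototype geometry yields both a cluster assignment and a low-dimensional location for each point. Unlike the distributional reduction method presented in the talk, the target geometry here is fixed rather than jointly optimized.

```python
import numpy as np
import ot  # POT: Python Optimal Transport, assumed installed

rng = np.random.default_rng(0)
# Toy data: two well-separated Gaussian blobs in 10 dimensions.
X = np.vstack([rng.normal(0.0, 1.0, (30, 10)), rng.normal(4.0, 1.0, (30, 10))])

# Source geometry: pairwise distances between data points (normalised for comparability).
C1 = ot.dist(X, X, metric='euclidean')
C1 /= C1.max()

# Target geometry: k prototypes placed on a fixed 1D grid (not jointly optimised here).
k = 2
Z = np.linspace(0.0, 1.0, k).reshape(-1, 1)
C2 = ot.dist(Z, Z, metric='euclidean')

p = np.full(len(X), 1.0 / len(X))   # uniform weights on data points
q = np.full(k, 1.0 / k)             # uniform weights on prototypes

# Gromov-Wasserstein coupling between the two metric-measure spaces.
T = ot.gromov.gromov_wasserstein(C1, C2, p, q, 'square_loss')

# Each point's most-weighted prototype plays the role of both its cluster label
# and its position in the low-dimensional target space.
labels = T.argmax(axis=1)
print(labels[:30].sum(), labels[30:].sum())   # the two blobs typically get distinct labels
```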
15h15: Miguel Atencia, Universidad de Málaga, Espagne
Challenges in Reservoir Computing
Abstract In this expository talk, we will review the Echo State Network (ESN), a recurrent neural network that has achieved good results in time series tasks such as forecasting, classification, and encoding-decoding. However, the lack of a rigorous mathematical foundation makes its application in a general context difficult. On the one hand, strong theoretical results, such as the Echo State Property and Universal Approximation, are non-constructive and require critical simplifying assumptions. On the other hand, the usual heuristics for optimal hyper-parameter selection have turned out to be neither necessary nor sufficient. Some connections of ESN models with ideas from dynamical systems will be presented, together with recent design proposals, as well as a novel application to time series clustering.
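For readers unfamiliar with the model, here is a minimal numpy sketch of an Echo State Network: a fixed random reservoir driven by the input, with only a ridge-regression readout being trained. The spectral-radius rescaling below is the common heuristic alluded to in the abstract; all sizes and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Minimal Echo State Network: fixed random reservoir, trained linear readout.
n_in, n_res = 1, 200
spectral_radius = 0.9   # common heuristic, not a guarantee of the echo state property
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.normal(0.0, 1.0, (n_res, n_res))
W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))

def run_reservoir(inputs):
    """Drive the reservoir with an input sequence and collect its states."""
    x = np.zeros(n_res)
    states = []
    for u in inputs:
        x = np.tanh(W @ x + W_in @ np.atleast_1d(u))
        states.append(x.copy())
    return np.array(states)

# Toy one-step-ahead forecasting task on a sine wave.
t = np.linspace(0, 60, 1500)
series = np.sin(t)
states = run_reservoir(series[:-1])
targets = series[1:]

# Ridge-regression readout (the only trained part of the network).
ridge = 1e-6
W_out = np.linalg.solve(states.T @ states + ridge * np.eye(n_res), states.T @ targets)
pred = states @ W_out
print("train MSE:", np.mean((pred - targets) ** 2))
```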
Title: Molecular Motors: Stochastic Modeling and Statistical Inference.
Speaker: John Fricks, Arizona State University, U.S.A.
Abstract Molecular motors, specifically kinesin and dynein, transport cargos, including vesicles and ion channels, along microtubules in neurons to where they are needed. Such transport is vital to the well-functioning of neurons, and the breakdown in such transport function has been implicated in a number of neurodegenerative diseases. Since their discovery several decades ago, a variety of nano-scale experimental methods have been developed to better understand the function of transport-based molecular motors. In this talk, it will be shown how stochastic modeling techniques, such as functional central limit theorems, and statistical inference techniques for time series, such as particle filtering and EM algorithms, can be combined to better understand these experiments and give insight into the mechanisms behind motor-based intra-cellular transport.
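To illustrate the kind of time-series inference tool mentioned (particle filtering), here is a generic bootstrap particle filter on a toy drift-plus-noise state-space model; the model and its parameters are invented for illustration and are not the molecular-motor models discussed in the talk.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy state-space model (not the motor model from the talk):
# hidden position x_t = x_{t-1} + drift + process noise, observation y_t = x_t + measurement noise.
T_len, drift, sig_x, sig_y = 200, 0.05, 0.1, 0.5
x = np.cumsum(drift + sig_x * rng.normal(size=T_len))
y = x + sig_y * rng.normal(size=T_len)

def bootstrap_filter(y, n_particles=1000):
    """Bootstrap particle filter: propagate particles through the dynamics,
    weight them by the observation likelihood, then resample."""
    particles = np.zeros(n_particles)
    means = []
    for obs in y:
        particles = particles + drift + sig_x * rng.normal(size=n_particles)   # propagate
        logw = -0.5 * ((obs - particles) / sig_y) ** 2                          # Gaussian likelihood
        w = np.exp(logw - logw.max())
        w /= w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w)                    # multinomial resampling
        particles = particles[idx]
        means.append(particles.mean())
    return np.array(means)

est = bootstrap_filter(y)
print("RMSE of filtered position:", np.sqrt(np.mean((est - x) ** 2)))
```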
Title: Boosting Diversity in Regression Ensembles
Speaker: Jean-Michel Poggi (LMO, Orsay, University Paris-Saclay & University Paris Cité, France)
Abstract The practical interest of using ensemble methods has been highlighted in several works. Aggregation estimation as well as sequential prediction provide natural frameworks for studying ensemble methods and for adapting such strategies to time series data. Sequential prediction focuses on how to combine, by weighting, a given set of individual experts, while aggregation is mainly interested in how to generate individual experts to improve prediction performance. We propose, in the regression context, a gradient-boosting-based algorithm that incorporates a diversity term to guide the gradient boosting iterations. The idea is to trade off some individual optimality for global enhancement. The improvement is obtained with progressively generated predictors by boosting diversity. A convergence result is given ensuring that the associated optimisation strategy reaches the global optimum. Finally, we consider simulated and benchmark datasets as well as a real-world electricity demand dataset to show, by means of numerical experiments, the appropriateness of our procedure by examining the behavior not only of the final or aggregated predictor but also of the whole generated sequence. In the experiments we consider a variety of base learners of increasing complexity: stumps, CART trees, purely random forests, and Breiman's random forests. This is joint work with Mathias Bourel (Universidad de la República, Montevideo, Uruguay), Jairo Cugliari (University Lyon 2, France), and Yannig Goude (EDF, France).
Reference: M. Bourel, J. Cugliari, Y. Goude, J.-M. Poggi, Boosting Diversity in Regression Ensembles, Stat. Anal. Data Min.: ASA Data Sci. J. (2023), 1-17, https://doi.org/10.1002/sam.11654
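As a very rough illustration of the idea of trading individual fit for ensemble diversity, here is a toy Python sketch of a boosting loop in which each new base learner is selected among a few candidates by penalising similarity to the current aggregated prediction. This is an assumed scheme for illustration only, not the algorithm of Bourel, Cugliari, Goude and Poggi, which incorporates the diversity term directly into the gradient boosting iterations.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def diversity_boosting(X, y, n_rounds=50, lr=0.1, gamma=0.5, n_candidates=5, seed=0):
    """Illustrative sketch: at each boosting round, fit several candidate base learners
    on bootstrap resamples of the residuals and keep the one that balances residual fit
    against correlation with the current aggregated prediction (encouraging diversity)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    F = np.zeros(n)            # current aggregated prediction
    ensemble = []
    for m in range(n_rounds):
        residuals = y - F
        best, best_pred, best_score = None, None, np.inf
        for _ in range(n_candidates):
            idx = rng.integers(0, n, n)                       # bootstrap resample
            tree = DecisionTreeRegressor(max_depth=2).fit(X[idx], residuals[idx])
            pred = tree.predict(X)
            fit_term = np.mean((residuals - pred) ** 2)
            # Penalise similarity to the ensemble so the generated sequence stays diverse.
            div_term = np.corrcoef(pred, F)[0, 1] if (m > 0 and pred.std() > 0 and F.std() > 0) else 0.0
            score = fit_term + gamma * div_term
            if score < best_score:
                best, best_pred, best_score = tree, pred, score
        ensemble.append(best)
        F += lr * best_pred
    return ensemble, F

# Toy usage on a noisy sine regression problem, with shallow trees as base learners.
rng = np.random.default_rng(3)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.normal(size=300)
_, fitted = diversity_boosting(X, y)
print("train MSE:", np.mean((fitted - y) ** 2))
```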