Abstracts

Below are the abstracts for the talks, in order of presentation.

Oral Presentations

Block 1: Theoretical Statistics

Ichiro Hashimoto

Universality of Benign Overfitting in Binary Linear Classification
The practical success of deep learning has led to the discovery of several surprising phenomena. One of these phenomena, which has spurred intense theoretical research, is "benign overfitting": deep neural networks seem to generalize well in the over-parametrized regime even though they fit noisy training data perfectly. It is now known that benign overfitting also occurs in various classical statistical models. For linear maximum margin classifiers, benign overfitting has been established theoretically in a class of mixture models with very strong assumptions on the covariate distribution. However, even in this simple setting, many questions remain open. For instance, most of the existing literature focuses on the noiseless case where all true class labels are observed without errors, whereas the more interesting noisy case remains poorly understood. We provide a comprehensive study of benign overfitting for linear maximum margin classifiers. We discover a previously unknown phase transition in the test error bounds for the noisy model and provide geometric intuition behind it. We further considerably relax the required covariate assumptions in both the noisy and the noiseless case. Our results demonstrate that benign overfitting of maximum margin classifiers holds in a much wider range of scenarios than was previously known and provide new insights into the underlying mechanisms.
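
For readers less familiar with the setting, the linear maximum margin classifier referred to above is the usual hard-margin solution; in standard notation (not specific to the assumptions of this talk), for training data \((x_i, y_i)\) with \(y_i \in \{-1, +1\}\),

\[
\hat{w} \;=\; \operatorname*{arg\,min}_{w} \|w\|_2 \quad \text{subject to } y_i \langle w, x_i \rangle \ge 1 \text{ for all } i,
\]

and "benign overfitting" refers to the test error of \(x \mapsto \operatorname{sign}(\langle \hat{w}, x \rangle)\) remaining close to optimal even though all training points, including those with noisy labels, are classified correctly.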

Bingqing Li

Regression-Based EM Algorithm
High-dimensional clustering is a crucial yet challenging problem in statistics and machine learning. We propose a novel method that addresses this challenge by capturing the low-rank structure of the data. Our approach employs a Gaussian mixture model on the low-rank signal, enabling efficient and effective multiclass clustering. The method is computationally fast, leveraging Principal Component Analysis (PCA) as an initial transformation to project the data into a lower-dimensional space. This is followed by iterative updates of the projection matrix for clustering in the reduced subspace. Through comprehensive simulations across diverse scenarios, our method consistently outperforms existing techniques on high-dimensional datasets, providing a robust framework for high-dimensional clustering with enhanced efficiency. Inspired by Fisher Discriminant Analysis, we also propose a novel 2D visualization technique. This method projects high-dimensional data into a 2D space using information from the estimated labels, offering a distinctive visualization approach in unsupervised learning scenarios.
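
A minimal sketch of the baseline pipeline the abstract describes (PCA followed by a Gaussian mixture fit in the reduced space); the iterative updates of the projection matrix and the 2D visualization are not shown, and all data below are placeholders.

```python
# Sketch: project to a low-dimensional subspace with PCA, then cluster with a
# Gaussian mixture in that subspace. This shows only the initialization step
# of the kind of pipeline described above, on synthetic placeholder data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))           # placeholder high-dimensional data
X[:250, :5] += 3.0                        # planted low-rank mean shift: two clusters

Z = PCA(n_components=5).fit_transform(X)  # initial projection to a low-dimensional subspace
labels = GaussianMixture(n_components=2, random_state=0).fit_predict(Z)
print(np.bincount(labels))                # cluster sizes should be roughly 250 / 250
```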

Block 2: Applied Statistics

Mandy Yao

Quantifying Uncertainty in Air Pollution Machine Learning Models
Given the ongoing climate crisis and the increase in extreme weather events, it is now more important than ever to study the role of the environment in human health. With the abundance and complexity of environmental data from multiple sources, machine learning (ML) methods have risen in popularity over more traditional statistical methods to explore, understand, and capture spatial and temporal trends. Yet, many ML methods have limited or no ability to quantify uncertainty, which is often needed to make insightful interpretations about predictions. We examine a popular ML method, Extreme Gradient Boosting (XGBoost), and show how a modified quantile regression can be incorporated to construct point-wise prediction intervals for specific quantiles, while allowing XGBoost to perform well by finding solutions rapidly using optimal gradient descent rates. We then compare our method to another modified quantile regression method (which uses an arctan pinball loss function), and to the implementation of quantile regression for XGBoost in the xgboost Python package, by predicting particulate matter air quality exposures in California that capture wildfire events.
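
A hedged sketch of one way to obtain point-wise prediction intervals from XGBoost's built-in quantile (pinball) loss; this is not the authors' modified quantile regression, and the objective and parameter names below assume a recent xgboost release (2.0 or later).

```python
# Sketch: fit one XGBoost model per quantile with the built-in pinball loss
# and stack the predictions into point-wise prediction intervals.
# All data are synthetic placeholders.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(1)
X = rng.uniform(size=(2000, 4))                        # placeholder covariates
y = 10 * X[:, 0] + rng.normal(scale=1 + 2 * X[:, 1])   # heteroscedastic noise

def fit_quantile(alpha):
    model = xgb.XGBRegressor(
        objective="reg:quantileerror",   # pinball loss (recent xgboost versions)
        quantile_alpha=alpha,
        n_estimators=200,
        max_depth=4,
        learning_rate=0.1,
    )
    return model.fit(X, y)

lower, upper = fit_quantile(0.05), fit_quantile(0.95)
intervals = np.column_stack([lower.predict(X[:5]), upper.predict(X[:5])])
print(intervals)   # 90% point-wise prediction intervals for the first rows
```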

Lin Yu

Causal Variance Decompositions for Measuring Health Inequalities
Racial disparities in healthcare are well-documented, but what drives them? Is it differences in hospital quality, unequal access, or hospitals treating patients differently based on race? To answer these questions, we propose a new causal decomposition method that partitions the observed variation in care received (e.g., assignment of a treatment) into five components: 1) variation due to race, 2) variation in hospital quality, 3) effect modification (differential treatment within hospitals), 4) differential access to hospitals, and 5) residual variation. Our method overcomes the limitations of traditional effect modification approaches by enabling overall evaluation of variation across multi-categorical variables, rather than relying on pairwise comparisons, thus allowing for assessment of the validity of the quality indicator for hospital performance comparison. Additionally, our method enhances existing variance decomposition methods by introducing two new causal estimands that interpret the variance from effect modification and differential access. We propose both parametric (generalized linear models, generalized linear mixed-effects models) and nonparametric (random forest and XGBoost) estimators for the causal components. Although initially conceptualized to address racial disparities, our method is generalizable and can be applied to study other disparities in healthcare and in other domains. Simulation results show that the proposed estimators capture the true causal estimands well, apart from a small-sample bias.

Tianyi Pan

Estimating Associations Between Cumulative Exposure and Health via Generalized Distributed Lag Non-Linear Models using Penalized Splines
Quantifying associations between short-term exposure to ambient air pollution and health outcomes is an important public health priority. Historically, studies have restricted attention to single-day exposures or (equally weighted) average exposure over several days. Adaptive cumulative exposure distributed lag non-linear models (ACE-DLNMs), in contrast, quantify associations between health outcomes and cumulative exposure that is specified in a data-adaptive way. While the ACE-DLNM framework is highly interpretable, it is limited to continuous outcomes and does not scale well to large datasets. Motivated by a large analysis of daily pollution and circulatory and respiratory hospitalizations in Canada between 2001 and 2018, we propose a generalized ACE-DLNM incorporating penalized splines, and we propose an efficient estimation strategy based on profile likelihood and Laplace approximate marginal likelihood with Newton-type methods. Our proposed method improves upon existing approaches in three ways: (1) it applies to general response types, including over-dispersed counts; (2) estimation is computationally efficient and readily applies to large datasets; and (3) it treats the exposure process continuously with respect to time. In application to the motivating analysis, the proposed method respects the discrete responses and reduces uncertainty in estimated associations compared to generalized additive models with fixed exposures.
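
As a rough schematic (my paraphrase of the adaptive cumulative exposure idea, not the exact specification in the talk), a model of this kind relates the outcome \(Y_t\) to a weighted cumulative exposure,

\[
g\{\mathrm{E}(Y_t)\} \;=\; f\!\left( \int_0^{L} w(l)\, x(t - l)\, \mathrm{d}l \right) \;+\; \text{confounder terms},
\]

where the exposure process \(x(\cdot)\) is treated continuously in time, and both the lag-weight function \(w\) and the association curve \(f\) are represented with penalized splines and estimated from the data rather than fixed in advance.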

Block 3: Mathematical Finance and Actuarial Science

Hassan Abdelrahman

Simplifying Complexities in IBNR Claims Count Estimation With A Bayesian GLM Approach
Estimating the count of incurred but not reported (IBNR) claims is a fundamental challenge in loss reserving. The Chain Ladder method, a widely used macro-level approach, relies on aggregated claims data and provides a simple framework for reserve estimation. However, it can be inaccurate in many cases as it does not leverage detailed claims information and may introduce biases under certain conditions. To address these limitations, micro-level models have been developed to incorporate individual claim data, capturing claim occurrence and reporting dynamics more effectively. Recent literature has shown that these models outperform the Chain Ladder method in predictive accuracy.

Despite their better performance, micro-level models remain largely unused in practice due to their computational complexity and various modeling challenges. In this paper, we propose a Bayesian framework that builds on the Chain Ladder method while incorporating key micro-level elements, bridging the gap between these two approaches. Through case studies, we demonstrate that our framework not only outperforms the classical Chain Ladder method but also surpasses micro-level models adopted in recent literature, offering a practical and scalable alternative for IBNR claim count estimation.
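
For context, a minimal sketch of the classical Chain Ladder computation referred to above (development factors estimated column by column on a run-off triangle); the proposed Bayesian GLM framework itself is not reproduced here, and the triangle below is purely illustrative.

```python
import numpy as np

# Run-off triangle of cumulative reported claim counts (rows: accident years,
# columns: development years); np.nan marks cells not yet observed.
triangle = np.array([
    [100., 160., 185., 195.],
    [110., 175., 200., np.nan],
    [120., 190., np.nan, np.nan],
    [130., np.nan, np.nan, np.nan],
])

observed = ~np.isnan(triangle)
n_dev = triangle.shape[1]

# Development factors from pairs of adjacent, fully observed columns.
factors = []
for j in range(n_dev - 1):
    rows = observed[:, j] & observed[:, j + 1]
    factors.append(triangle[rows, j + 1].sum() / triangle[rows, j].sum())

# Project the unobserved cells forward column by column.
completed = triangle.copy()
for j in range(n_dev - 1):
    missing = np.isnan(completed[:, j + 1])
    completed[missing, j + 1] = factors[j] * completed[missing, j]

ultimates = completed[:, -1]   # projected ultimate counts per accident year
latest = np.array([triangle[i, observed[i]].max() for i in range(len(triangle))])
print(factors, ultimates - latest)   # development factors and IBNR counts
```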

Brandon Tam

Dimension Reduction of Distributionally Robust Optimization Problems
We study distributionally robust optimization (DRO) problems with uncertainty sets consisting of high-dimensional random vectors that are close in the multivariate Wasserstein distance to a reference random vector. We give conditions under which the images of these sets under scalar-valued aggregation functions are equal to or contained in uncertainty sets of univariate random variables defined via a univariate Wasserstein distance. This allows us to rewrite or bound high-dimensional DRO problems by simpler DRO problems over the space of univariate random variables. We generalize the results to uncertainty sets defined via the Bregman-Wasserstein divergence and the max-sliced Wasserstein and Bregman-Wasserstein divergences. The max-sliced divergences allow us to jointly model distributional uncertainty around the reference random vector and uncertainty in the aggregation function. Finally, we derive explicit bounds for worst-case risk measures that belong to the class of signed Choquet integrals.
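
Schematically, and in my own notation rather than the talk's, the reduction can be stated as follows: for a scalar-valued aggregation function \(g\) and a risk measure \(\rho\),

\[
\sup_{F_X:\; W_p(F_X, F_{X_0}) \le \varepsilon} \rho\big(g(X)\big)
\;\le\;
\sup_{F_Y:\; W_p(F_Y, F_{g(X_0)}) \le \delta} \rho(Y),
\]

where, for example, \(\delta = K\varepsilon\) suffices when \(g\) is \(K\)-Lipschitz, since Lipschitz maps contract the Wasserstein distance. The talk gives conditions under which such containments hold, or hold with equality, including for Bregman-Wasserstein and max-sliced divergences.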

Block 4: Graphical Models

Philip Choi

Inference for graphical models for extremes
Extreme value theory is the study of the tail region of a distribution, where empirical estimation is often impossible due to the lack of extreme data. In real-life situations, such as quantifying the risks of floods or financial crises, we are sometimes interested in the dependence of extremes. For example, how does a high water level at one measuring site relate to water levels at other measuring sites? In high-dimensional settings, tail dependence can become arbitrarily complex. Hence, sparsity is required to obtain interpretable models. The classical notion of graphical models fails for extremes, but an appropriate notion of conditional independence for extremes was proposed in 2020 by Engelke and Hitz. Since then, several data-driven estimators for graphical models for extremes have been developed. In this talk, I will discuss my current progress in building a framework for inference for these estimators. In particular, I will focus on Hüsler–Reiss graphical models for extremes, which can be thought of as analogous to Gaussian graphical models.

Morris Greenberg

Restricted Search Space Graph MCMC via Birth-Death Processes
Inferring a directed acyclic graph (DAG) given data is computationally challenging due to graphs existing in a discrete search space that grows super-exponentially with the number of nodes. A promising class of MCMC methods for graph inference addresses scalability by first restricting the search space to a subset of edges (where partial scores can be calculated in advance), and thereafter incrementally expanding the space until a stopping criterion is met.

In this work, we estimate lower and upper bounds on the error introduced by current methods that operate on restricted spaces instead of the full space. Building on this, we propose a novel restricted-search MCMC method that reduces these errors. Our method is an adaptive algorithm that allows for either expansion or contraction of the search space throughout the chain. This is governed by a birth-death process, which we derive by choosing birth and death rates informed by our error bounds. Additionally, we improve upon the computational costs of previous restricted-search methods by including block-matrix operations in expansion steps and memoization in contraction steps.

We present extensive simulations that characterize the performance and computational efficiency of our algorithm, contrast this with existing methods, and consider applications in the field of imaging proteomics.

Poster Presentations

The abstracts are listed in alphabetical order.

Mei Dong

Average Treatment Effect with Continuous Instrumental Variables
The instrumental variable (IV) approach is a widely used method for estimating the average treatment effect (ATE) in the presence of unmeasured confounders. Existing methods for continuous IVs often rely on structural equation modeling, which imposes strong parametric assumptions and can yield biased estimates, particularly for binary outcomes. In this work, we propose a novel nonparametric identification strategy for the ATE using a continuous IV under the potential outcome framework, leveraging the conditional weighted average derivative effect. For estimation, we assume a partial linear model for the IV-treatment relationship. Under this model, we develop a bounded, locally efficient, and multiply robust estimator that extends the properties of semiparametric efficient estimators for binary IVs to continuous IVs. Notably, our estimator remains consistent even if the partial linear model is misspecified. Simulation results demonstrate that our proposed multiply robust estimator is unbiased and robust to model misspecification. Finally, we apply the proposed estimators to estimate the causal effect of obesity on the two-year mortality rate of non-small cell lung cancer patients.

Arturo Esquivel

Detecting Stellar Flares in Photometric Data Using Hidden Markov Models
We present a hidden Markov model (HMM) for discovering stellar flares in light curve data. HMMs provide a framework to model time series data that are non-stationary; they allow a system to be in different states at different times and consider the probabilities that describe the switching dynamics between states. In the context of stellar flare discovery, we exploit the HMM framework by allowing the light curve of a star to be in one of three states at any given time step: Quiet, Firing, or Decaying. This three-state HMM formulation is designed to enable straightforward identification of stellar flares, their duration, and the associated uncertainty. This is crucial for estimating a flare’s energy and is useful for studies of stellar flare energy distributions. We combine our HMM with a celerite model that accounts for quasi-periodic stellar oscillations. Through an injection-recovery experiment, we demonstrate and evaluate the ability of our method to detect and characterize flares in stellar time series. We also show that the proposed HMM flags fainter and lower-energy flares more easily than traditional sigma-clipping methods. Lastly, we visually demonstrate that simultaneously conducting detrending and flare detection can mitigate biased estimates arising in multi-stage modelling approaches. Thus, this method provides a new way of calculating stellar flare energy. We conclude with an example application to one star observed by TESS, showing how the HMM compares with sigma-clipping when using real data.
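
A toy illustration of the three-state machinery described above: a Quiet/Firing/Decaying HMM with Gaussian emissions and Viterbi decoding. The real model is fit jointly with a celerite quasi-periodic component, and none of the numbers below come from the poster.

```python
import numpy as np

states = ["Quiet", "Firing", "Decaying"]
log_A = np.log(np.array([        # illustrative transition probabilities
    [0.98, 0.02, 0.00],          # Quiet -> Quiet / Firing / Decaying
    [0.00, 0.50, 0.50],          # Firing -> ...
    [0.20, 0.00, 0.80],          # Decaying -> ...
]) + 1e-12)
means, sds = np.array([0.0, 3.0, 1.0]), np.array([0.2, 1.0, 0.5])

def log_emission(y):
    # Gaussian log-densities of observation y under each state
    return -0.5 * ((y - means) / sds) ** 2 - np.log(sds * np.sqrt(2 * np.pi))

def viterbi(y):
    T, K = len(y), len(states)
    delta = np.full((T, K), -np.inf)       # best log-probabilities so far
    psi = np.zeros((T, K), dtype=int)      # backpointers
    delta[0] = np.log([0.98, 0.01, 0.01]) + log_emission(y[0])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # previous state x current state
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emission(y[t])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return [states[s] for s in reversed(path)]

flux = np.array([0.0, 0.1, 2.8, 3.5, 1.4, 0.7, 0.1, 0.0])  # toy light curve
print(viterbi(flux))   # most likely Quiet / Firing / Decaying sequence
```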

Jianhui Gao

Using ML predictions to drive large-scale and robust scientific inquiry
Prediction-based (PB) inference is increasingly used in applications where the outcome of interest is difficult to obtain, but its predictors are readily available. Unlike traditional inference, PB inference utilizes a machine learning (ML) model to generate outcome predictions, which are then leveraged for statistical inference. Motwani and Witten (2023) revisited two key PB inference approaches for ordinary least squares. They found that the method proposed by Wang et al. (2020) yields a consistent estimator for the association of interest when the ML model perfectly captures the underlying regression function. However, the prediction-powered inference (PPI) method introduced by Angelopoulos et al. (2023) offers valid inference regardless of the model’s accuracy. In this poster, we analyzed the statistical efficiency of the PPI estimator and identified a more efficient alternative: the Chen and Chen (CC) estimator, originally proposed in 2002. By incorporating a weight into the PPI estimator, the CC method achieves a superior balance between robustness to ML model specification and statistical efficiency. We further contextualize PB inference by tracing its connections to methods in economics and statistics dating back to the 1960s. Within this framework, we introduce Synthetic Surrogate (SynSurr) Analysis for genome-wide association studies (GWAS). SynSurr ensures robustness to imputation errors by jointly analyzing original and imputed traits rather than replacing missing values. Although SynSurr adopts a joint modeling approach, it is asymptotically equivalent to the CC estimator under homoscedasticity. We apply SynSurr to empower the GWAS of dual-energy X-ray absorptiometry (DXA) traits within the UK Biobank, demonstrating its effectiveness in improving statistical power while maintaining valid inference.
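
To make the weighting idea concrete, here is a hedged sketch for the simplest case of estimating a mean (the poster analyses the regression setting). Setting the weight to one recovers an unweighted PPI-style correction; choosing it to minimise variance gives a weighted estimator in the spirit of the Chen and Chen approach discussed above. All data and names below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 300, 30_000
y_lab = rng.normal(loc=1.0, size=n)                  # observed outcomes (labeled set)
f_lab = y_lab + rng.normal(scale=0.5, size=n)        # ML predictions on labeled set
f_unlab = rng.normal(loc=1.0, size=N) + rng.normal(scale=0.5, size=N)  # predictions on unlabeled set

def weighted_estimator(lam):
    # labeled-data mean plus a weighted correction from the prediction means
    return y_lab.mean() + lam * (f_unlab.mean() - f_lab.mean())

ppi_style = weighted_estimator(1.0)                                  # unweighted correction
lam_opt = np.cov(y_lab, f_lab)[0, 1] / np.var(f_lab, ddof=1)         # variance-minimising weight (large-N heuristic)
weighted = weighted_estimator(lam_opt)
print(ppi_style, weighted)
```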

Ziyi Liu

Sequential Probability Assignment with Contexts: Minimax Regret, Contextual Shtarkov Sums, and Contextual Normalized Maximum Likelihood
We study the fundamental problem of sequential probability assignment, also known as online learning with logarithmic loss, with respect to an arbitrary, possibly nonparametric hypothesis class. Our goal is to obtain a complexity measure for the hypothesis class that characterizes the minimax regret and to determine a general, minimax optimal algorithm. Notably, the sequential \(\ell^{\infty}\) entropy, extensively studied in the literature (Rakhlin and Sridharan, 2015, Bilodeau et al., 2020, Wu et al., 2023), was shown to not characterize minimax risk in general. Inspired by the seminal work of Shtarkov (1987) and Rakhlin, Sridharan, and Tewari (2010), we introduce a novel complexity measure, the contextual Shtarkov sum, corresponding to the Shtarkov sum after projection onto a multiary context tree, and show that the worst case log contextual Shtarkov sum equals the minimax regret. Using the contextual Shtarkov sum, we derive the minimax optimal strategy, dubbed contextual Normalized Maximum Likelihood (cNML). Our results hold for sequential experts, beyond binary labels, which are settings rarely considered in prior work. To illustrate the utility of this characterization, we provide a short proof of a new regret upper bound in terms of sequential \(\ell^{\infty}\) entropy, unifying and sharpening state-of-the-art bounds by Bilodeau et al. (2020) and Wu et al. (2023).
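
For orientation, the classical (context-free) Shtarkov sum for a hypothesis class \(\mathcal{F}\) over sequences \(y_{1:n}\) from a finite outcome alphabet is

\[
S_n(\mathcal{F}) \;=\; \sum_{y_{1:n}} \; \sup_{f \in \mathcal{F}} \; \prod_{t=1}^{n} f(y_t \mid y_{1:t-1}),
\]

and classically its logarithm equals the minimax regret under logarithmic loss, with the normalized maximum likelihood strategy attaining it. The contextual Shtarkov sum introduced in this work plays the analogous role after projection onto a multiary context tree of covariates.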

Jack Longwell

Automated segmentation of subretinal fluid from optical coherence tomography: A vision transformer approach with cross-validation
Purpose: We present an algorithm to segment subretinal fluid (SRF) on individual B-scan slices in patients with rhegmatogenous retinal detachment. Particular attention is paid to robustness, with a five-fold cross-validation approach and a hold-out test set.

Design: Retrospective, cross-sectional study.

Participants: 3819 B-scan slices across 98 time points from 45 patients were used in this study.

Main Outcome Measures: SRF volume following surgical intervention for rhegmatogenous retinal detachment.

Methods: SRF was segmented on all scans. A base SegFormer model, pre-trained on four massive datasets, was further trained on raw B-scans from the ReTOUCH dataset of 4532 slices: an open dataset of intra-retinal fluid, subretinal fluid, and pigment epithelium detachment. When adequate performance was reached, transfer learning was used to train the model on our in-house dataset, to segment SRF. A five-fold cross-validation approach was used, with an additional hold-out test set. All folds were first trained and cross-validated, and then additionally tested on the hold-out set. Mean and total Dice coefficients (F1 score) were calculated for each fold.
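
For reference, a minimal sketch of the Dice coefficient (equivalently, the F1 score) computed on binary segmentation masks; the arrays below are placeholders, not study data.

```python
import numpy as np

def dice(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-8) -> float:
    # Dice = 2 |A ∩ B| / (|A| + |B|) for binary masks A (prediction) and B (ground truth)
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return float(2 * intersection / (pred.sum() + truth.sum() + eps))

pred = np.zeros((8, 8), dtype=int);  pred[2:6, 2:6] = 1
truth = np.zeros((8, 8), dtype=int); truth[3:7, 3:7] = 1
print(dice(pred, truth))   # about 0.56 for these partially overlapping toy masks
```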

Results: The average total Dice coefficient across the validation folds was 0.92, and the average mean Dice coefficient was 0.86. For the test set, the average total Dice coefficient was 0.94, and the average mean Dice coefficient was 0.88. The model showed strong inter-fold consistency on the hold-out set, with a variance of only 0.003.

Conclusions: The SegFormer model for SRF segmentation demonstrates a strong ability to segment SRF. This result holds up to cross-validation and hold-out testing, across all folds. The model is available open-source online.

Yuhan (Evelyn) Pan

Exploring the Dynamics of the Annual Area Burned by Forest Fires in Northeastern Ontario: A BayesGP Analysis
Forest fires are a natural ecological process, yet their increasing frequency and severity due to climate change pose significant risks to ecosystems, communities, and economies. Understanding long-term fire activity trends is essential for effective fire management and policy development. This study focuses on the Northeast Study Area (NESA) in Ontario, analyzing historical fire data from 1930 to 2019 to model and predict annual total area burned. We employ a Bayesian Gaussian Process (BayesGP) framework, leveraging Integrated Wiener Processes (IWP) for smoothing and uncertainty quantification. Our approach improves upon existing frequentist models by explicitly incorporating prior knowledge and capturing nonlinear temporal trends. Through posterior density analysis, residual diagnostics, and forecasting, we validate the model’s ability to estimate fire variability and identify potential future trends. Results indicate a historical decline in burned area but suggest an increasing trend in the near future. Our findings emphasize the importance of probabilistic modeling in wildfire risk assessment and resource allocation, highlighting the advantages of Bayesian approaches for handling high-variability data. The insights from this study contribute to improving fire management strategies and understanding climate-driven fire regime shifts.

Bertrand Sodjahin

Sparse Priors for Bayesian Networks with Application in Bacterial Genomics
Antimicrobial resistance is a pressing modern health problem associated with massive human and capital costs. As a known cause of this resistance, bacterial dormancy is a genetically controlled stochastic process. Its genetic behaviour is governed by a superexponentially complex structure that is methodologically and computationally difficult to model. Bayesian networks have been applied for this purpose, but suffer from limited information in the data and an inconvenient posterior geometry that makes inference challenging. We propose novel sparse network structure priors, based on a spike-and-slab strategy, that incorporate microbiological knowledge of bacterial genetic structure to facilitate posterior inference. The output of our procedure is a ranked list of potential root genes that can be provided to lab scientists to guide the development of treatments against this pathogen. We illustrate our procedure on the motivating application of characterizing dormancy in Pseudomonas aeruginosa, a bacterium that is a common cause of hospital-acquired infections.
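
As a generic illustration (not necessarily the exact prior used in this work), a spike-and-slab prior on the presence of an edge from gene \(j\) to gene \(k\) can be written as

\[
\beta_{jk} \mid \gamma_{jk} \;\sim\; \gamma_{jk}\,\mathcal{N}(0, \tau_1^2) \;+\; (1 - \gamma_{jk})\,\delta_0,
\qquad
\gamma_{jk} \;\sim\; \mathrm{Bernoulli}(\pi_{jk}),
\]

where \(\gamma_{jk}\) indicates whether the edge is present and the prior inclusion probabilities \(\pi_{jk}\) are a natural place to encode microbiological knowledge of the genetic structure.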

Yan Zhang

On a surprising behaviour of the likelihood ratio test in non-parametric mixture models
In the study of non-parametric mixture models, non-parametric maximum likelihood estimation (NPMLE) has become a standard and powerful tool for estimation. NPMLE is known for its remarkable adaptivity—for instance, it can adjust to the tail behavior and the support shape of the mixing distribution in tasks such as density estimation and empirical Bayes methods. However, much less is understood about the corresponding inference procedure: the non-parametric likelihood ratio test (LRT). Previous work has mostly focused on homogeneity testing, where the data-generating distribution is not itself a mixture. In this work, we present the first detailed theoretical analysis of the LRT in general non-parametric mixture models. Interestingly, unlike in the parametric setting—where the LRT statistic tends to be asymptotically distribution-free—we find that the LRT in non-parametric mixtures exhibits strongly adaptive behavior. In particular, for key examples such as Gaussian and Poisson mixtures, we show that the LRT statistic converges when the true distribution is a finite mixture, and diverges to infinity otherwise.