Statistics and Data Science Colloquium, Pontificia Universidad Católica de Chile

The Department of Statistics of the Pontificia Universidad Católica de Chile has one of the largest and most distinguished faculties among Chilean and Latin American universities. As part of its effort to connect researchers in statistics and related fields across the region, the department seeks to organize a local seminar series that offers a close look at the work of regional researchers, as well as at the current research problems of UC faculty, with the aim of building bridges for future collaborations.


2025-10-30
15:30hrs.
Daira Velandia. Universidad de Valparaíso
Estimation methods for a Gaussian process under fixed domain asymptotics
Sala 2
Abstract:
This talk addresses inference tools for Gaussian random fields under the increasing-domain and fixed-domain asymptotic frameworks. First, background concepts and previous results are presented. Then, we discuss results obtained from studying several extensions of the problem of estimating covariance parameters under the two asymptotic frameworks named above.
2025-04-29
15:00hrs.
Ronny Vallejos. Universidad Técnica Federico Santa María
Advances in Agreement Coefficients for Continuous Measurements
Sala de usos múltiples, Felipe Villanueva
Abstract:

Assessing agreement between instruments is fundamental in clinical and observational studies to evaluate how similarly two methods measure the same set of subjects. In this talk, we present two extensions of a widely used coefficient for assessing agreement between continuous variables. The first extension introduces a novel agreement coefficient for lattice sequences observed over the same areal units, motivated by the comparison of poverty measurement methodologies in Chile. The second extension proposes a new coefficient, denoted as ρ1, designed to measure agreement between continuous measurements obtained from two instruments observing the same experimental units. Unlike traditional approaches, ρ1 is based on L1 distances, providing robustness to outliers and avoiding dependence on nuisance parameters. Both proposals are supported by theoretical results, an inference framework, and simulation studies that illustrate their performance and practical relevance.
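
The "widely used coefficient" that both extensions build on is commonly taken to be Lin's concordance correlation coefficient; as a point of reference (not the talk's ρ1 or the lattice version), here is a minimal Python sketch of that classical coefficient on hypothetical paired measurements:

    import numpy as np

    def concordance_correlation(x, y):
        """Lin's concordance correlation coefficient between two
        continuous measurements of the same subjects."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        sxy = np.cov(x, y, bias=True)[0, 1]          # cross-covariance
        sx2, sy2 = x.var(), y.var()                  # marginal variances
        return 2.0 * sxy / (sx2 + sy2 + (x.mean() - y.mean()) ** 2)

    # Example: two instruments measuring the same 100 subjects
    rng = np.random.default_rng(1)
    truth = rng.normal(size=100)
    inst_a = truth + rng.normal(scale=0.1, size=100)
    inst_b = 0.9 * truth + 0.2 + rng.normal(scale=0.1, size=100)
    print(concordance_correlation(inst_a, inst_b))

Per the abstract, ρ1 departs from this moment-based construction by working with L1 distances, which is what provides the robustness to outliers and removes the dependence on nuisance parameters.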

2025-04-10
16:00hrs.
Francisco Cuevas. Universidad Técnica Federico Santa María
Composite likelihood inference for space-time point processes
Sala 1 multiuso, 1° Piso Felipe Villanueva
Abstract:

The dynamics of a rain forest is extremely complex, involving births, deaths and growth of trees with complex interactions between trees, animals, climate, and environment. We consider the patterns of recruits (new trees) and dead trees between rain forest censuses. For a current census we specify regression models for the conditional intensity of recruits and the conditional probabilities of death given the current trees and spatial covariates. We estimate regression parameters using conditional composite likelihood functions that only involve the conditional first-order properties of the data. When constructing assumption-lean estimators of covariance matrices of parameter estimates we only need mild assumptions of decaying conditional correlations in space, while assumptions regarding correlations over time are avoided by exploiting conditional centering of composite likelihood score functions. Time series of point patterns from rain forest censuses are quite short, while each point pattern covers a fairly big spatial region. To obtain asymptotic results we therefore use a central limit theorem for the fixed time span, increasing spatial domain asymptotic setting. This also allows us to handle the challenge of using stochastic covariates constructed from past point patterns. Conveniently, it suffices to impose weak dependence assumptions on the innovations of the space-time process. We investigate the proposed methodology by simulation studies and an application to rain forest data.
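
The core estimating idea, a composite likelihood whose terms only involve conditional first-order properties, can be illustrated with a stripped-down Python sketch: a Bernoulli composite log-likelihood for death indicators with a logistic link and a single hypothetical covariate. The actual models also include the recruit intensity, spatial covariates, and the assumption-lean variance estimation described above.

    import numpy as np
    from scipy.optimize import minimize

    def neg_composite_loglik(beta, X, death):
        """Bernoulli composite log-likelihood: each term only involves the
        conditional first-order property P(death_i = 1 | covariates_i)."""
        eta = X @ beta
        return -np.sum(death * eta - np.logaddexp(0.0, eta))  # stable log(1+exp)

    # Toy data: intercept plus one hypothetical covariate per tree
    rng = np.random.default_rng(0)
    n = 500
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    true_beta = np.array([-1.0, 0.8])
    death = rng.binomial(1, 1 / (1 + np.exp(-(X @ true_beta))))

    fit = minimize(neg_composite_loglik, x0=np.zeros(2), args=(X, death))
    print(fit.x)   # composite-likelihood estimate of beta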

2025-03-07
15:00hrs.
Victor Morales-Oñate. Universidad de Las Américas, Quito, Ecuador
Machine Learning in Credit Risk Models
Salas multiuso, 1° piso Villanueva
Abstract:
Credit risk modeling offers a field of opportunity both for practitioners with a traditional statistical background and for those specialized in Machine Learning. However, the choice between classical methods and machine-learning approaches is not trivial: when, and why, should one technique be preferred over the other?

In this talk, we explore this key question across the credit life cycle, analyzing how Machine Learning is transforming risk assessment and management. We compare traditional approaches with more advanced models, highlighting their advantages, limitations, and the challenges of implementing them in a regulated environment.

Finally, we discuss applications of advanced analytics in the financial industry, identifying opportunities for innovation and the impact of these methodologies on strategic decision-making.
2024-11-26
13:30hrs.
Víctor H. Lachos. University of Connecticut
An EM algorithm for fitting matrix-variate normal distributions on interval-censored and missing data.
Auditorio Ninoslav Bralic
Abstract:

Matrix-variate distributions are powerful tools for modeling three-way datasets that often arise in longitudinal and multidimensional spatio-temporal studies. However, observations in these datasets can be missing or subject to some detection limits because of the restriction of the experimental apparatus. Here, we develop an efficient EM-type algorithm for maximum likelihood estimation of parameters, in the context of interval-censored and/or missing data, utilizing the matrix-variate normal distribution. This algorithm provides closed-form expressions that rely on truncated moments, offering a reliable approach to parameter estimation under these conditions. Results obtained from the analysis of both simulated data and real case studies concerning water quality monitoring are reported to demonstrate the effectiveness of the proposed method.
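
Leaving censoring and missingness aside, the complete-data ML step for a matrix-variate normal can be sketched with the classical flip-flop updates. The Python snippet below is only that simplified baseline, with hypothetical array shapes; the talk's EM-type algorithm replaces these sufficient statistics with truncated conditional moments in the E-step.

    import numpy as np

    def matrix_normal_mle(Y, n_iter=50):
        """Flip-flop MLE for a matrix-variate normal MN(M, U, V)
        from complete data Y of shape (N, r, c)."""
        N, r, c = Y.shape
        M = Y.mean(axis=0)                      # mean matrix
        R = Y - M                               # centered observations
        U, V = np.eye(r), np.eye(c)
        for _ in range(n_iter):                 # U and V identified up to scale
            Vi = np.linalg.inv(V)
            U = sum(Ri @ Vi @ Ri.T for Ri in R) / (N * c)
            Ui = np.linalg.inv(U)
            V = sum(Ri.T @ Ui @ Ri for Ri in R) / (N * r)
        return M, U, V

    # Toy example: 200 complete 4x3 observations
    rng = np.random.default_rng(2)
    Y = rng.normal(size=(200, 4, 3))
    M, U, V = matrix_normal_mle(Y)
    print(U.round(2), V.round(2), sep="\n")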

2024-11-20
16:00hrs.
Debajyoti Sinha. Florida State University
Analysis of spatially clustered survival data with unobserved covariates using SBART
Sala 2 de usos múltiples, 1er piso, Edificio Felipe Villanueva
Abstract:

For large, clustered survival studies, the usual parametric and semi-parametric regressions are inappropriate and inadequate when the appropriate functional forms of the covariates and their interactions in the hazard functions are unknown, and when random cluster effects as well as some unknown cluster-level covariates are spatially correlated. We present a general nonparametric method for such studies under the Bayesian ensemble learning paradigm called Soft Bayesian Additive Regression Trees (SBART for short).
Our additional methodological and computational challenges include a large number of clusters, variable cluster sizes, and proper statistical augmentation of the unobservable cluster-level covariate using a data registry different from the main survival study.
We use an innovative three-step computational tool based on latent variables to address these computational challenges. Using two different data resources, we illustrate the practical implementation of our method and its advantages over existing methods by assessing the impact of interventions on cluster/county-level and patient-level covariates to mitigate existing disparities in breast cancer survival across 67 Florida counties (clusters). The Florida Cancer Registry (FCR) is used to obtain clustered survival data with patient-level covariates, and the Behavioral Risk Factor Surveillance Survey (BRFSS) is used to obtain further information on an unobservable county-level covariate, Screening Mammography Utilization (SMU).

2024-11-08
15:00hrs.
Marie-Hélène Descary. Université du Québec à Montréal
Constructing Ancestral Recombination Graphs through Reinforcement Learning
Sala de usos múltiples, 1er piso, Edificio Felipe Villanueva
Abstract:
Over the years, many approaches have been proposed to build ancestral recombination graphs (ARGs), graphs used to represent the genetic relationships between individuals. Among these methods, many rely on the assumption that the most likely graph is among the shortest ones. In this talk, I will present a new approach to building short ARGs: Reinforcement Learning (RL). Our method exploits the similarities between finding the shortest path between a set of genetic sequences and their most recent common ancestor and finding the shortest path between the entrance and exit of a maze, a classic RL problem. In the maze problem, the learner, called the agent, must learn the directions to take in order to escape as quickly as possible, whereas in our problem the agent must learn which actions to take among coalescence, mutation, and recombination in order to reach the most recent common ancestor as quickly as possible. Our results show that RL can be used to build ARGs as short as those built with a heuristic algorithm optimized to build short ARGs, and sometimes even shorter. Moreover, our method makes it possible to build a distribution of short ARGs for a given sample, and it can also generalize what it has learned to new samples not used during the learning process.
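
To make the maze analogy concrete, here is a toy Python sketch of tabular Q-learning on a 4x4 gridworld. It is only an illustration of the RL mechanism; in the ARG setting the states are sets of sequences and the actions are coalescence, mutation, and recombination events.

    import numpy as np

    # Tiny 4x4 gridworld "maze": start at (0, 0), exit at (3, 3).
    n, actions = 4, [(-1, 0), (1, 0), (0, -1), (0, 1)]
    Q = np.zeros((n, n, len(actions)))      # state-action value table
    rng = np.random.default_rng(0)

    for episode in range(2000):
        r, c = 0, 0
        while (r, c) != (n - 1, n - 1):
            # epsilon-greedy choice of the next action
            a = rng.integers(4) if rng.random() < 0.1 else Q[r, c].argmax()
            dr, dc = actions[a]
            nr, nc = min(max(r + dr, 0), n - 1), min(max(c + dc, 0), n - 1)
            reward = 0.0 if (nr, nc) == (n - 1, n - 1) else -1.0
            # Q-learning update with step size 0.1 and discount 1
            Q[r, c, a] += 0.1 * (reward + Q[nr, nc].max() - Q[r, c, a])
            r, c = nr, nc

    print(Q[0, 0].argmax())   # learned first move from the start cell
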
2024-09-25
15:00hrs.
Jorge Loria. Department of Computer Science, Aalto University
Posterior kernel learning under infinite-variance weight priors
Sala de usos múltiples, 1er piso, Edificio Felipe Villanueva
Abstract:

Neal (1996) showed that infinitely wide single-layer Bayesian neural networks (BNNs) converge to a Gaussian process (GP) when the weights have a finite-variance prior. Cho & Saul (2009) presented a recursive formula for deep kernel processes, relating the covariance matrix of one layer to the covariance matrix of the previous layer; moreover, they obtained an explicit formula for the recursion under several common activation functions, including the ReLU. Later work has strengthened these results for more complex architectures, obtaining similar limits for deeper networks. Nevertheless, recent work, including Aitchison et al. (2021), points out that the covariance kernels obtained in this way are deterministic, which makes it impossible for the limiting network to learn representations; learning representations amounts to learning a posterior kernel that is non-degenerate given the observations. To address this, they propose adding artificial noise so that the kernel retains stochasticity. However, this artificial noise can be criticized because it does not emerge from the limit of a BNN architecture. Seeking to avoid this, we show that a deep Bayesian neural network, in which the width of every layer goes to infinity and all weights have a jointly elliptical distribution with infinite variance, converges to a process with α-stable marginals in each layer that admits a conditionally Gaussian representation. These random covariances can be related recursively in the manner of Cho & Saul (2009), even though the processes exhibit stable behavior and the covariances are therefore not necessarily defined. Our results generalize our previous work, Loría & Bhadra (2024), from single-layer to multi-layer networks while avoiding its heavy computational burden. The computational and statistical advantages over other methods are highlighted in simulations and on benchmark datasets.
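
A toy Python simulation, not taken from the talk, illustrates the contrast being exploited: the output of a wide one-hidden-layer network at a fixed input is light-tailed under finite-variance (Gaussian) weights but remains heavy-tailed under infinite-variance (Cauchy, i.e. α-stable with α = 1) weights.

    import numpy as np

    rng = np.random.default_rng(3)
    width, reps = 1000, 2000
    x = np.array([1.0, -0.5])            # fixed network input

    def one_layer_output(second_layer_draw):
        """f(x) = sum_j v_j * ReLU(w_j . x) / scale for one draw of weights."""
        W = rng.normal(size=(width, 2))                  # first-layer weights
        h = np.maximum(W @ x, 0.0)                       # ReLU features
        v, scale = second_layer_draw(width)
        return (v * h).sum() / scale

    gauss = [one_layer_output(lambda m: (rng.normal(size=m), np.sqrt(m)))
             for _ in range(reps)]
    cauchy = [one_layer_output(lambda m: (rng.standard_cauchy(size=m), m))
              for _ in range(reps)]

    # Gaussian weights: light tails (GP limit). Cauchy weights: heavy tails
    # (stable limit), so the induced kernel stays random, not deterministic.
    print(np.percentile(np.abs(gauss), 99), np.percentile(np.abs(cauchy), 99))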

2024-09-06
09:40hrs.
Hector Araya. Universidad Adolfo Ibáñez
Least squares estimation for the Ornstein-Uhlenbeck process with small Hermite noise and some generalizations
Auditorio Ninoslav Bralic
Abstract:
We consider the problem of drift parameter estimation for a non-Gaussian long-memory Ornstein–Uhlenbeck process driven by a Hermite process. To estimate the unknown parameter, discrete-time high-frequency observations at regularly spaced time points and the least squares estimation method are used. By means of techniques based on Wiener chaos and multiple stochastic integrals, the consistency and the limit distribution of the least squares estimator of the drift parameter are established. To illustrate the computational implementation of the results, several simulation examples are given. Finally, an extension to a type of iterated Ornstein–Uhlenbeck process is discussed.
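
As a rough illustration of the estimator, the discrete least squares estimate of the drift of an Ornstein–Uhlenbeck process observed at regularly spaced times has the closed form sketched below in Python; the toy check uses a Brownian driver as a stand-in, whereas the talk's setting replaces it with a small Hermite process.

    import numpy as np

    def ou_drift_lse(X, delta):
        """Discrete least-squares estimator of theta in dX_t = -theta X_t dt + dZ_t
        from observations at regularly spaced times t_i = i * delta."""
        increments = np.diff(X)
        return -np.sum(X[:-1] * increments) / (delta * np.sum(X[:-1] ** 2))

    # Toy check with a Brownian driver (simplification; not the Hermite case)
    rng = np.random.default_rng(4)
    theta, delta, n = 2.0, 1e-3, 200_000
    X = np.zeros(n)
    for i in range(1, n):
        X[i] = X[i - 1] - theta * X[i - 1] * delta + np.sqrt(delta) * rng.normal()
    print(ou_drift_lse(X, delta))   # should be close to theta = 2.0
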
2024-08-28
15:00hrs.
Wan-Lun Wang. National Cheng Kung University
Multivariate Contaminated Normal Censored Regression Model: Properties and Maximum Likelihood Inference
Auditorio Ninoslav Bralic
Abstract:

The multivariate contaminated normal (MCN) distribution, which adds two parameters to those of the multivariate normal distribution, one controlling the proportion of mild outliers and the other specifying the degree of contamination, has been widely applied in robust statistical modeling of data. This paper extends the MCN model to deal with values that may be censored due to limits of quantification, referred to as the MCN with censoring (MCN-C) model. Further, it establishes the censored multivariate linear regression model in which the random errors follow the MCN distribution, named the MCN censored regression (MCN-CR) model. Two computationally feasible expectation conditional maximization (ECM) algorithms are developed for maximum likelihood estimation of the MCN-C and MCN-CR models. An information-based method is used to approximate the standard errors of location parameters and regression coefficients. The capability and superiority of the proposed models are illustrated by two real-data examples and simulation studies.
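
For reference, one common parametrization of the MCN density (stated here as an assumption, since the talk may parametrize it differently) is a two-component normal scale mixture sharing the same mean; a minimal Python sketch:

    import numpy as np
    from scipy.stats import multivariate_normal

    def mcn_pdf(y, mu, Sigma, alpha, eta):
        """Multivariate contaminated normal density (one common parametrization):
        1 - alpha is the proportion of mild outliers and eta > 1 inflates the
        covariance to set the degree of contamination."""
        good = multivariate_normal(mu, Sigma).pdf(y)
        bad = multivariate_normal(mu, eta * Sigma).pdf(y)
        return alpha * good + (1 - alpha) * bad

    mu, Sigma = np.zeros(2), np.eye(2)
    print(mcn_pdf(np.array([3.0, 3.0]), mu, Sigma, alpha=0.9, eta=10.0))

The MCN-C and MCN-CR models of the talk add censoring and a regression structure on top of this density.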

Keywords: Censored data; EM algorithm; Multivariate models; Outliers; Truncation.

2024-08-28
15:45hrs.
Tsung-I Lin. National Chung Hsing University
A robust factor analysis model utilizing the canonical fundamental skew-t distribution
Auditorio Ninoslav Bralic
Abstract:

Traditional factor analysis, which relies on the assumption of multivariate normality, has been extended by jointly incorporating the restricted multivariate skew-t (rMST) distribution for the unobserved factors and errors. However, the limited utility of the rMST distribution in capturing skewness concentrated in a single direction prompted the development of a more adaptable and robust factor analysis model. A more flexible, robust factor analysis model is introduced based on the broader canonical fundamental skew-t (CFUST) distribution, called the CFUSTFA model. The proposed new model can account for more complex features of skewness in multiple directions. An efficient alternating expectation conditional maximization algorithm fabricated under several reduced complete-data spaces is developed to estimate parameters under the maximum likelihood (ML) perspective. To assess the variability of parameter estimates, an information-based approach is employed to approximate the asymptotic covariance matrix of the ML estimators. The efficacy and practicality of the proposed techniques are demonstrated through the analysis of simulated and real datasets.

 

Keywords: AECM algorithm; Canonical fundamental skew-t distribution; Factor scores; Truncated multivariate distribution; Unrestricted multivariate skew-t distribution

2024-07-03
15:00hrs.
Eloy Alvarado. Universidad Técnica Federico Santa María
Archimedean-like spatial copulas and their applications
Auditorio Ninoslav Bralic
Abstract:

The Gaussian copula is a powerful tool that has been widely used to model spatial and/or temporal correlated data with arbitrary marginal distributions. However, this model can be restrictive as it expresses a reflection symmetric dependence.

Recently, Bevilacqua et al. (2024) proposed a new general class of spatial copula models that allows the generation of random fields with arbitrary marginal distributions and types of dependence that can be reflection symmetric or not, particularly focusing on an instance that can be seen as the spatial generalization of the classical Clayton copula. In this session, we will review this general class of Archimedean-like spatial copulas and explore the various spatial extensions that this construction allows. Specifically, the Clayton-like case will be examined along with two spatial copulas currently in development: the Ali-Mikhail-Haq and Gumbel spatial copulas. Additionally, we will present the ongoing development of an application of this methodology to model geo-referenced operational covariates using Weibull regression, which can be seen as the spatial extension of the widely known proportional hazard model.
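
For orientation, the classical (non-spatial) Clayton copula that the new model generalizes can be written and sampled in a few lines of Python; the lower-tail clustering it produces is the kind of reflection asymmetry the spatial construction is designed to capture.

    import numpy as np

    def clayton_cdf(u, v, theta):
        """Classical bivariate Clayton copula, theta > 0."""
        return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

    def clayton_sample(n, theta, rng):
        """Marshall-Olkin sampler: gamma frailty V and exponential shocks E."""
        V = rng.gamma(shape=1.0 / theta, scale=1.0, size=n)
        E = rng.exponential(size=(n, 2))
        return (1.0 + E / V[:, None]) ** (-1.0 / theta)

    rng = np.random.default_rng(5)
    UV = clayton_sample(10_000, theta=2.0, rng=rng)
    # Reflection asymmetry: points cluster near (0, 0) far more than near (1, 1)
    print(np.mean((UV < 0.05).all(axis=1)), np.mean((UV > 0.95).all(axis=1)))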

References

Bevilacqua, M., Alvarado, E. & Caamaño-Carrillo, C. (2024). A flexible Clayton-like spatial copula with application to bounded support data. Journal of Multivariate Analysis, 201, 105277.

2024-06-19
15:00hrs.
Johan Van Der Molen. PUC
Estimating the Posterior Similarity Matrix for Bayesian Cluster Analysis
Auditorio Ninoslav Bralic
Abstract:

Mixture models, and Dirichlet process mixtures in particular, are widely used in Bayesian cluster analysis. The Posterior Similarity Matrix (PSM) is crucial for understanding the cluster structure of the data, and it is typically estimated with Markov chain Monte Carlo (MCMC) methods. In this context, however, MCMC can be very sensitive to the initialization of the chains, and convergence is often slow, visiting only a very small number of partitions of the data. The result is a restricted version of the posterior, which can adversely affect the estimation of both the PSM and the clusters.

This work proposes a more efficient method for estimating the PSM without MCMC. Based on an analytic formula, we aim to approximate the entries of the PSM directly, particularly for Dirichlet process mixtures, reducing the computational cost and improving the precision of the estimate. In this talk I will present several approximation methods, with preliminary results from simulations and real data that illustrate their advantages over MCMC as well as their own challenges.
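
For contrast with the proposed analytic approximation, the standard MCMC-based estimate of the PSM is simply the pairwise co-clustering frequency across sampled partitions; a minimal Python sketch with hypothetical partition draws:

    import numpy as np

    def posterior_similarity_matrix(partitions):
        """MCMC estimate of the PSM: entry (i, j) is the fraction of sampled
        partitions in which observations i and j share a cluster."""
        partitions = np.asarray(partitions)          # shape (n_samples, n_obs)
        same = partitions[:, :, None] == partitions[:, None, :]
        return same.mean(axis=0)

    # Three sampled partitions of five observations
    draws = [[0, 0, 1, 1, 2],
             [0, 0, 0, 1, 1],
             [0, 1, 1, 2, 2]]
    print(posterior_similarity_matrix(draws))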

2024-06-07
10:00hrs.
Alba Martinez. Universidad Diego Portales
Structural Equation Models and Multiblock Data Analysis: Theory and Applications
Auditorio Ninoslav Bralic
Abstract:

Structural equation models aim to represent and describe relationships between constructs, and between constructs and observed variables, whereas multiblock data analysis focuses on explaining the relationships between several blocks of variables. Multiblock data analysis enables the creation of latent variable scores and the estimation of structural equation models. A general framework is provided by Regularized Generalized Canonical Correlation Analysis (RGCCA). In this talk, I present application examples to illustrate a context for understanding the fundamental concepts of both fields and their interconnections. I review the main definitions related to RGCCA, the optimization problem, the search algorithm, and special cases. Further research is outlined.
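
As a small point of reference (not the RGCCA algorithm itself), classical canonical correlation analysis is the simplest two-block case in this framework; a short Python sketch with simulated blocks:

    import numpy as np
    from sklearn.cross_decomposition import CCA

    # Two blocks of variables observed on the same 100 units, sharing one latent factor
    rng = np.random.default_rng(6)
    latent = rng.normal(size=(100, 1))
    X = latent @ rng.normal(size=(1, 5)) + 0.3 * rng.normal(size=(100, 5))
    Y = latent @ rng.normal(size=(1, 4)) + 0.3 * rng.normal(size=(100, 4))

    cca = CCA(n_components=1)
    x_scores, y_scores = cca.fit_transform(X, Y)   # latent variable scores per block
    print(np.corrcoef(x_scores[:, 0], y_scores[:, 0])[0, 1])

RGCCA generalizes this idea to several blocks, with regularization and different schemes for connecting the blocks, which is what makes it usable for estimating structural equation models.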

2024-05-29
15:00hrs.
Steven MacEachern. The Ohio State University
Familial inference: Tests for hypotheses on a family of centers
Auditorio Ninoslav Bralic
Abstract:
Many scientific disciplines face a replicability crisis.  While these crises have many drivers, we focus on one.  Statistical hypotheses are translations of scientific hypotheses into statements about one or more distributions.  The most basic tests focus on the centers of the distributions.   Such tests implicitly assume a specific center, e.g., the mean or the median.  Yet, scientific hypotheses do not always specify a particular center.  This ambiguity leaves a gap between scientific theory and statistical practice that can lead to rejection of a true null.  The gap is compounded when we consider deficiencies in the formal statistical model.  Rather than testing a single center, we propose testing a family of plausible centers, such as those induced by the Huber loss function (the Huber family).  Each center in the family generates a point null hypothesis and the resulting family of hypotheses constitutes a familial null hypothesis.  A Bayesian nonparametric procedure is devised to test the familial null.  Implementation for the Huber family is facilitated by a novel pathwise optimization routine.  Along the way, we visit the question of what it means to be the center of a distribution.  Surprisingly, we have been unable to find a clear and comprehensive definition of this concept in the literature. 
This is joint work with Ryan Thompson (University of New South Wales), Catherine Forbes (Monash University), and Mario Peruggia (The Ohio State University). 
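
The Huber family of centers mentioned above can be made concrete with a short Python sketch: each value of the Huber threshold defines one plausible center, sweeping from the median (small threshold) to the mean (large threshold), and the familial null asks whether a hypothesized value is compatible with some member of this family. The test itself (a Bayesian nonparametric procedure with a pathwise optimization routine) is not reproduced here.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def huber_center(x, delta):
        """Center of x under the Huber loss with threshold delta; delta -> 0
        recovers the median and delta -> infinity recovers the mean."""
        def loss(c):
            u = np.abs(x - c)
            return np.sum(np.where(u <= delta, 0.5 * u ** 2,
                                   delta * (u - 0.5 * delta)))
        return minimize_scalar(loss, bounds=(x.min(), x.max()),
                               method="bounded").x

    # A skewed sample: the Huber centers sweep from the median toward the mean
    x = np.random.default_rng(7).gamma(shape=2.0, size=200)
    print([round(huber_center(x, d), 3) for d in (0.01, 0.5, 2.0, 100.0)])
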
2024-04-19
11:00hrs.
Garritt Page. Brigham Young University
Informed Bayesian Finite Mixture Models via Asymmetric Dirichlet Priors
Auditorio Ninoslav Bralic
Abstract:
Finite mixture models are flexible methods that are commonly used for model-based clustering. A recent focus in the model-based clustering literature is to highlight the difference between the number of components in a mixture model and the number of clusters. The number of clusters is more relevant from a practical standpoint, but to date the focus of prior distribution formulation has been on the number of components. In light of this, we develop a finite mixture methodology that permits eliciting prior information directly on the number of clusters in an intuitive way. This is done by employing an asymmetric Dirichlet distribution as a prior on the weights of a finite mixture. Further, a penalized complexity motivated prior is employed for the Dirichlet shape parameter. We illustrate the ease with which prior information can be elicited via our construction and the flexibility of the resulting induced prior on the number of clusters. We also demonstrate the utility of our approach using numerical experiments and two real-world data sets.
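
The effect of an asymmetric Dirichlet prior on the induced number of clusters can be previewed with a simple Monte Carlo sketch in Python (the penalized-complexity prior on the shape parameter is not included here):

    import numpy as np

    def prior_num_clusters(alpha, n_obs=200, n_sims=5000, rng=None):
        """Monte Carlo draw from the prior on the number of clusters induced
        by a finite mixture whose weights follow Dirichlet(alpha)."""
        rng = rng or np.random.default_rng(8)
        counts = []
        for _ in range(n_sims):
            w = rng.dirichlet(alpha)
            z = rng.choice(len(alpha), size=n_obs, p=w)
            counts.append(len(np.unique(z)))       # occupied components
        return np.bincount(counts, minlength=len(alpha) + 1)[1:] / n_sims

    K = 10
    symmetric = prior_num_clusters(np.full(K, 0.5))
    asymmetric = prior_num_clusters(np.r_[np.full(3, 2.0), np.full(K - 3, 0.05)])
    print(symmetric.round(3))    # prior mass spread over many cluster counts
    print(asymmetric.round(3))   # prior mass concentrated near 3 clusters
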
2024-04-17
15:00hrs.
Tarik Faouzi. USACH
Marquis de Condorcet: from Condorcet's theory of the vote to the method of partition
Sala de usos múltiples, 1er piso
Abstract:

We introduce two new approaches to clustering categorical and mixed data: Condorcet clustering with a fixed number of groups, denoted $$\alpha$$-Condorcet and Mixed-Condorcet respectively. Like k-modes, this approach is essentially based on similarity and dissimilarity measures. The presentation is divided into three parts: first, we propose a new Condorcet criterion with a fixed number of groups (to assign cases to clusters). In the second part, we propose a heuristic algorithm to carry out the task. In the third part, we compare $$\alpha$$-Condorcet clustering with k-modes clustering and Mixed-Condorcet with k-prototypes. The comparison is made using a quality index, measurement accuracy, and a within-cluster sum-of-squares index.

Our findings are illustrated using real datasets: the feline dataset, the US Census 1990 dataset and other data.
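
One standard form of a Condorcet-type criterion scores each within-cluster pair of records by its attribute agreements minus disagreements; the Python sketch below shows only that generic ingredient with hypothetical categorical records, whereas the talk's α-Condorcet and Mixed-Condorcet versions fix the number of groups and handle mixed data.

    import numpy as np

    def condorcet_pair_score(a, b):
        """Agreements minus disagreements between two categorical records,
        the pairwise ingredient of a Condorcet-type clustering criterion."""
        a, b = np.asarray(a), np.asarray(b)
        return int((a == b).sum()) - int((a != b).sum())

    def condorcet_criterion(data, labels):
        """Sum of pairwise scores over pairs placed in the same cluster."""
        labels = np.asarray(labels)
        total = 0
        for i in range(len(data)):
            for j in range(i + 1, len(data)):
                if labels[i] == labels[j]:
                    total += condorcet_pair_score(data[i], data[j])
        return total

    cats = [["a", "x"], ["a", "x"], ["b", "y"], ["b", "y"]]
    print(condorcet_criterion(cats, labels=[0, 0, 1, 1]))   # well-separated grouping
    print(condorcet_criterion(cats, labels=[0, 0, 0, 1]))   # poorer grouping, lower score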

2024-04-10
15:00hrs.
Ernesto San Martín. PUC
Exploring the Non-Treatment of Missing Data: An Analysis with Chilean Administrative Data
Auditorio Ninoslav Bralic
Abstract:
Statistics, or Statistical Methods as Fisher called them in 1922, has been a fundamental tool of the observational sciences (to use the language of Laplace and Quetelet) since the nineteenth century. Its application rests on three stages outlined by Fisher at a time when the fundamental principles of the discipline were still "shrouded in obscurity": the problem of specification, the problem of estimation, and the problem of distribution. The key lies in the problem of specification: it is not only a matter of defining the probability distribution underlying the observations, but also of representing the questions researchers have about the observed phenomenon through parameters of interest.

An important part of social research is based on observations collected through surveys. No one is forced to take part in a survey unless a law says otherwise, as with the Census or plebiscites. Even under such obligations, participants are not required to answer every question. Missing data are therefore common. These data, together with the answers provided by other participants, are part of what is observed. It is curious, then, that so much effort goes into treating missing data so that they are no longer missing (through imputation) in order to solve the specification problem.

However, a fundamental premise of the observational sciences is to accept what is observed. It is therefore urgent not to treat missing data, but to include them in the specification of the data-generating process. We illustrate this way of proceeding with a small panel: the questionnaires administered to parents and caregivers of schoolchildren in 2004 and 2006. These data contain several missing-data patterns; we make them explicit in order to partially identify the conditional distribution of family income in 2006 given family income in 2004. This parameter of interest formalizes a substantive question of our research team at the Núcleo Milenio MOVI: how the income of sons and daughters has changed relative to the income of their parents.
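
The partial-identification viewpoint can be illustrated with worst-case (no-assumption) bounds in the spirit of Manski: instead of imputing missing responses, one reports the range of parameter values compatible with every possible configuration of the missing data. A minimal Python sketch with hypothetical survey data:

    import numpy as np

    def worst_case_bounds(y, observed):
        """No-assumption (partial-identification) bounds for P(Y = 1) when some
        responses are missing: missing cases are set to all 0s or all 1s
        instead of being imputed."""
        y, observed = np.asarray(y, float), np.asarray(observed, bool)
        p_missing = 1.0 - observed.mean()
        p_obs_and_one = (y * observed).mean()      # uses only observed responses
        return p_obs_and_one, p_obs_and_one + p_missing

    # Hypothetical survey: 1 = income above a threshold, some answers missing
    rng = np.random.default_rng(9)
    y = rng.binomial(1, 0.4, size=1000)
    observed = rng.binomial(1, 0.7, size=1000).astype(bool)
    print(worst_case_bounds(y, observed))
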
2024-01-18
14:30hrs.
Manuel González. Universidad de La Frontera
Regularization Methods Applied to Chemometrics Problems
Sala 2
2024-01-18
11:30hrs.
Christian Caamaño. Universidad del Bío-Bío
A flexible Clayton-like spatial copula with application to bounded support data.
Sala 2
Abstract:
The Gaussian copula is a powerful tool that has been widely used to model spatially and/or temporally correlated data with arbitrary marginal distributions. However, this kind of model can be too restrictive since it expresses a reflection symmetric dependence. In this work, we propose a new spatial copula model that yields random fields with arbitrary marginal distributions and a type of dependence that can be reflection symmetric or not.
In particular, we propose a new random field with uniform marginal distribution that can be viewed as a spatial generalization of the classical Clayton copula model. It is obtained through a power transformation of a specific instance of a beta random field, which in turn is obtained through a transformation of two independent Gamma random fields.
For the proposed random field we study the second-order properties and provide analytic expressions for the bivariate distribution and its correlation. Moreover, in the reflection symmetric case, we study the associated geometrical properties.
As an application of the proposed model we focus on spatial modeling of data with bounded support. Specifically, we consider spatial regression models with marginal distributions of the beta type. In a simulation study, we investigate the use of the weighted pairwise composite likelihood method for the estimation of this model. Finally, the effectiveness of our methodology is illustrated by analyzing point-referenced vegetation index data, using the Gaussian copula as a benchmark. Our developments have been implemented in an open-source package for the R statistical environment.

Keywords: Archimedean Copula, Beta random fields, Composite likelihood, Reflection Asymmetry.