Statistics and Data Science Colloquium, Pontificia Universidad Católica de Chile

The Department of Statistics of the Pontificia Universidad Católica de Chile has one of the largest and most distinguished faculties among Chilean and Latin American universities. As part of its effort to connect researchers in statistics and related fields across the region, the department seeks to organize a local seminar series that offers a close look at the work of regional researchers, as well as at the current research problems of UC faculty, with the aim of building bridges for future collaborations.


2025-10-30
15:30hrs.
Daira Velandia. Universidad de Valparaíso
Estimation methods for a Gaussian process under fixed domain asymptotics
Sala 2
Abstract:
This talk addresses inference tools for Gaussian random fields under the increasing-domain and fixed-domain asymptotic frameworks. First, background concepts and previous results are presented. Then, we discuss results obtained from studying several extensions of the problem of estimating covariance parameters under the two asymptotic frameworks named above.
2025-04-29
15:00hrs.
Ronny Vallejos. Universidad Técnica Federico Santa María
Advances in Agreement Coefficients for Continuous Measurements
Sala de usos múltiples, Felipe Villanueva
Abstract:

Assessing agreement between instruments is fundamental in clinical and observational studies to evaluate how similarly two methods measure the same set of subjects. In this talk, we present two extensions of a widely used coefficient for assessing agreement between continuous variables. The first extension introduces a novel agreement coefficient for lattice sequences observed over the same areal units, motivated by the comparison of poverty measurement methodologies in Chile. The second extension proposes a new coefficient, denoted as ρ1, designed to measure agreement between continuous measurements obtained from two instruments observing the same experimental units. Unlike traditional approaches, ρ1 is based on L1 distances, providing robustness to outliers and avoiding dependence on nuisance parameters. Both proposals are supported by theoretical results, an inference framework, and simulation studies that illustrate their performance and practical relevance.
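
The "widely used coefficient" that both extensions build on is commonly taken to be Lin's concordance correlation coefficient; as a point of reference (not the talk's ρ1 or the lattice version), here is a minimal Python sketch of that classical coefficient on hypothetical paired measurements:

    import numpy as np

    def concordance_correlation(x, y):
        """Lin's concordance correlation coefficient between two
        continuous measurements of the same subjects."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        sxy = np.cov(x, y, bias=True)[0, 1]          # cross-covariance
        sx2, sy2 = x.var(), y.var()                  # marginal variances
        return 2.0 * sxy / (sx2 + sy2 + (x.mean() - y.mean()) ** 2)

    # Example: two instruments measuring the same 100 subjects
    rng = np.random.default_rng(1)
    truth = rng.normal(size=100)
    inst_a = truth + rng.normal(scale=0.1, size=100)
    inst_b = 0.9 * truth + 0.2 + rng.normal(scale=0.1, size=100)
    print(concordance_correlation(inst_a, inst_b))

Per the abstract, ρ1 departs from this moment-based construction by working with L1 distances, which is what provides the robustness to outliers and removes the dependence on nuisance parameters.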

2025-04-10
16:00hrs.
Francisco Cuevas. Universidad Técnica Federico Santa María
Composite likelihood inference for space-time point processes
Sala 1 multiuso, 1° Piso Felipe Villanueva
Abstract:

The dynamics of a rain forest is extremely complex, involving births, deaths and growth of trees with complex interactions between trees, animals, climate, and environment. We consider the patterns of recruits (new trees) and dead trees between rain forest censuses. For a current census we specify regression models for the conditional intensity of recruits and the conditional probabilities of death given the current trees and spatial covariates. We estimate regression parameters using conditional composite likelihood functions that only involve the conditional first-order properties of the data. When constructing assumption-lean estimators of covariance matrices of parameter estimates we only need mild assumptions of decaying conditional correlations in space, while assumptions regarding correlations over time are avoided by exploiting conditional centering of composite likelihood score functions. Time series of point patterns from rain forest censuses are quite short, while each point pattern covers a fairly big spatial region. To obtain asymptotic results we therefore use a central limit theorem for the fixed time span, increasing spatial domain asymptotic setting. This also allows us to handle the challenge of using stochastic covariates constructed from past point patterns. Conveniently, it suffices to impose weak dependence assumptions on the innovations of the space-time process. We investigate the proposed methodology by simulation studies and an application to rain forest data.
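
The core estimating idea, a composite likelihood whose terms only involve conditional first-order properties, can be illustrated with a stripped-down Python sketch: a Bernoulli composite log-likelihood for death indicators with a logistic link and a single hypothetical covariate. The actual models also include the recruit intensity, spatial covariates, and the assumption-lean variance estimation described above.

    import numpy as np
    from scipy.optimize import minimize

    def neg_composite_loglik(beta, X, death):
        """Bernoulli composite log-likelihood: each term only involves the
        conditional first-order property P(death_i = 1 | covariates_i)."""
        eta = X @ beta
        return -np.sum(death * eta - np.logaddexp(0.0, eta))  # stable log(1+exp)

    # Toy data: intercept plus one hypothetical covariate per tree
    rng = np.random.default_rng(0)
    n = 500
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    true_beta = np.array([-1.0, 0.8])
    death = rng.binomial(1, 1 / (1 + np.exp(-(X @ true_beta))))

    fit = minimize(neg_composite_loglik, x0=np.zeros(2), args=(X, death))
    print(fit.x)   # composite-likelihood estimate of beta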

2025-03-07
15:00hrs.
Victor Morales-Oñate. Universidad de Las Américas, Quito, Ecuador
Machine Learning in Credit Risk Models
Salas multiuso, 1° piso Villanueva
Abstract:
Credit risk modeling offers a field of opportunity both for practitioners with a traditional statistical background and for those specialized in Machine Learning. However, the choice between classical methods and machine-learning approaches is not trivial: when, and why, should one technique be preferred over the other?

In this talk, we explore this key question across the credit life cycle, analyzing how Machine Learning is transforming risk assessment and management. We compare traditional approaches with more advanced models, highlighting their advantages, limitations, and the challenges of implementing them in a regulated environment.

Finally, we discuss applications of advanced analytics in the financial industry, identifying opportunities for innovation and the impact of these methodologies on strategic decision-making.
2024-11-26
13:30hrs.
Víctor H. Lachos. University of Connecticut
An EM algorithm for fitting matrix-variate normal distributions on interval-censored and missing data.
Auditorio Ninoslav Bralic
Abstract:

Matrix-variate distributions are powerful tools for modeling three-way datasets that often arise in longitudinal and multidimensional spatio-temporal studies. However, observations in these datasets can be missing or subject to some detection limits because of the restriction of the experimental apparatus. Here, we develop an efficient EM-type algorithm for maximum likelihood estimation of parameters, in the context of interval-censored and/or missing data, utilizing the matrix-variate normal distribution. This algorithm provides closed-form expressions that rely on truncated moments, offering a reliable approach to parameter estimation under these conditions. Results obtained from the analysis of both simulated data and real case studies concerning water quality monitoring are reported to demonstrate the effectiveness of the proposed method.
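
Leaving censoring and missingness aside, the complete-data ML step for a matrix-variate normal can be sketched with the classical flip-flop updates. The Python snippet below is only that simplified baseline, with hypothetical array shapes; the talk's EM-type algorithm replaces these sufficient statistics with truncated conditional moments in the E-step.

    import numpy as np

    def matrix_normal_mle(Y, n_iter=50):
        """Flip-flop MLE for a matrix-variate normal MN(M, U, V)
        from complete data Y of shape (N, r, c)."""
        N, r, c = Y.shape
        M = Y.mean(axis=0)                      # mean matrix
        R = Y - M                               # centered observations
        U, V = np.eye(r), np.eye(c)
        for _ in range(n_iter):                 # U and V identified up to scale
            Vi = np.linalg.inv(V)
            U = sum(Ri @ Vi @ Ri.T for Ri in R) / (N * c)
            Ui = np.linalg.inv(U)
            V = sum(Ri.T @ Ui @ Ri for Ri in R) / (N * r)
        return M, U, V

    # Toy example: 200 complete 4x3 observations
    rng = np.random.default_rng(2)
    Y = rng.normal(size=(200, 4, 3))
    M, U, V = matrix_normal_mle(Y)
    print(U.round(2), V.round(2), sep="\n")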

2024-11-20
16:00hrs.
Debajyoti Sinha. Florida State University
Analysis of spatially clustered survival data with unobserved covariates using SBART
Sala 2 de usos múltiples, 1er piso, Edificio Felipe Villanueva
Abstract:

For large, clustered survival studies, the usual parametric and semi-parametric regressions are inappropriate and inadequate when the appropriate functional forms of the covariates and their interactions in the hazard functions are unknown, and when random cluster effects as well as some unknown cluster-level covariates are spatially correlated. We present a general nonparametric method for such studies under the Bayesian ensemble learning paradigm called Soft Bayesian Additive Regression Trees (SBART for short).
Our additional methodological and computational challenges include a large number of clusters, variable cluster sizes, and proper statistical augmentation of the unobservable cluster-level covariate using a data registry different from the main survival study.
We use an innovative three-step computational tool based on latent variables to address these computational challenges. Using two different data resources, we illustrate the practical implementation of our method and its advantages over existing methods by assessing the impact of interventions on cluster/county-level and patient-level covariates to mitigate existing disparities in breast cancer survival across 67 Florida counties (clusters). The Florida Cancer Registry (FCR) is used to obtain clustered survival data with patient-level covariates, and the Behavioral Risk Factor Surveillance Survey (BRFSS) is used to obtain further information on an unobservable county-level covariate, Screening Mammography Utilization (SMU).

2024-11-08
15:00hrs.
Marie-Hélène Descary. Université du Québec à Montréal
Constructing Ancestral Recombination Graphs through Reinforcement Learning
Sala de usos múltiples, 1er piso, Edificio Felipe Villanueva
Abstract:
Over the years, many approaches have been proposed to build ancestral recombination graphs (ARGs), graphs used to represent the genetic relationships between individuals. Among these methods, many rely on the assumption that the most likely graph is among the shortest ones. In this talk, I will present a new approach to building short ARGs: Reinforcement Learning (RL). Our method exploits the similarities between finding the shortest path between a set of genetic sequences and their most recent common ancestor and finding the shortest path between the entrance and exit of a maze, a classic RL problem. In the maze problem, the learner, called the agent, must learn the directions to take in order to escape as quickly as possible, whereas in our problem the agent must learn which actions to take among coalescence, mutation, and recombination in order to reach the most recent common ancestor as quickly as possible. Our results show that RL can be used to build ARGs as short as those built with a heuristic algorithm optimized to build short ARGs, and sometimes even shorter. Moreover, our method makes it possible to build a distribution of short ARGs for a given sample, and it can also generalize what it has learned to new samples not used during the learning process.
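
To make the maze analogy concrete, here is a toy Python sketch of tabular Q-learning on a 4x4 gridworld. It is only an illustration of the RL mechanism; in the ARG setting the states are sets of sequences and the actions are coalescence, mutation, and recombination events.

    import numpy as np

    # Tiny 4x4 gridworld "maze": start at (0, 0), exit at (3, 3).
    n, actions = 4, [(-1, 0), (1, 0), (0, -1), (0, 1)]
    Q = np.zeros((n, n, len(actions)))      # state-action value table
    rng = np.random.default_rng(0)

    for episode in range(2000):
        r, c = 0, 0
        while (r, c) != (n - 1, n - 1):
            # epsilon-greedy choice of the next action
            a = rng.integers(4) if rng.random() < 0.1 else Q[r, c].argmax()
            dr, dc = actions[a]
            nr, nc = min(max(r + dr, 0), n - 1), min(max(c + dc, 0), n - 1)
            reward = 0.0 if (nr, nc) == (n - 1, n - 1) else -1.0
            # Q-learning update with step size 0.1 and discount 1
            Q[r, c, a] += 0.1 * (reward + Q[nr, nc].max() - Q[r, c, a])
            r, c = nr, nc

    print(Q[0, 0].argmax())   # learned first move from the start cell
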
2024-09-25
15:00hrs.
Jorge Loria. Department of Computer Science, Aalto University
Posterior kernel learning under infinite-variance weight priors
Sala de usos múltiples, 1er piso, Edificio Felipe Villanueva
Abstract:

Neal (1996) showed that infinitely wide single-layer Bayesian neural networks (BNNs) converge to a Gaussian process (GP) when the weights have a finite-variance prior. Cho & Saul (2009) presented a recursive formula for deep kernel processes, relating the covariance matrix of one layer to the covariance matrix of the previous layer; moreover, they obtained an explicit formula for the recursion under several common activation functions, including the ReLU. Later work has strengthened these results for more complex architectures, obtaining similar limits for deeper networks. Nevertheless, recent work, including Aitchison et al. (2021), points out that the covariance kernels obtained in this way are deterministic, which makes it impossible for the limiting network to learn representations; learning representations amounts to learning a posterior kernel that is non-degenerate given the observations. To address this, they propose adding artificial noise so that the kernel retains stochasticity. However, this artificial noise can be criticized because it does not emerge from the limit of a BNN architecture. Seeking to avoid this, we show that a deep Bayesian neural network, in which the width of every layer goes to infinity and all weights have a jointly elliptical distribution with infinite variance, converges to a process with α-stable marginals in each layer that admits a conditionally Gaussian representation. These random covariances can be related recursively in the manner of Cho & Saul (2009), even though the processes exhibit stable behavior and the covariances are therefore not necessarily defined. Our results generalize our previous work, Loría & Bhadra (2024), from single-layer to multi-layer networks while avoiding its heavy computational burden. The computational and statistical advantages over other methods are highlighted in simulations and on benchmark datasets.
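
A toy Python simulation, not taken from the talk, illustrates the contrast being exploited: the output of a wide one-hidden-layer network at a fixed input is light-tailed under finite-variance (Gaussian) weights but remains heavy-tailed under infinite-variance (Cauchy, i.e. α-stable with α = 1) weights.

    import numpy as np

    rng = np.random.default_rng(3)
    width, reps = 1000, 2000
    x = np.array([1.0, -0.5])            # fixed network input

    def one_layer_output(second_layer_draw):
        """f(x) = sum_j v_j * ReLU(w_j . x) / scale for one draw of weights."""
        W = rng.normal(size=(width, 2))                  # first-layer weights
        h = np.maximum(W @ x, 0.0)                       # ReLU features
        v, scale = second_layer_draw(width)
        return (v * h).sum() / scale

    gauss = [one_layer_output(lambda m: (rng.normal(size=m), np.sqrt(m)))
             for _ in range(reps)]
    cauchy = [one_layer_output(lambda m: (rng.standard_cauchy(size=m), m))
              for _ in range(reps)]

    # Gaussian weights: light tails (GP limit). Cauchy weights: heavy tails
    # (stable limit), so the induced kernel stays random, not deterministic.
    print(np.percentile(np.abs(gauss), 99), np.percentile(np.abs(cauchy), 99))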

2024-09-06
09:40hrs.
Hector Araya. Universidad Adolfo Ibáñez
Least squares estimation for the Ornstein-Uhlenbeck process with small Hermite noise and some generalizations
Auditorio Ninoslav Bralic
Abstract:
We consider the problem of drift parameter estimation for a non-Gaussian long-memory Ornstein–Uhlenbeck process driven by a Hermite process. To estimate the unknown parameter, discrete-time high-frequency observations at regularly spaced time points and the least squares estimation method are used. By means of techniques based on Wiener chaos and multiple stochastic integrals, the consistency and the limit distribution of the least squares estimator of the drift parameter are established. To illustrate the computational implementation of the results, several simulation examples are given. Finally, an extension to a type of iterated Ornstein–Uhlenbeck process is discussed.
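
As a rough illustration of the estimator, the discrete least squares estimate of the drift of an Ornstein–Uhlenbeck process observed at regularly spaced times has the closed form sketched below in Python; the toy check uses a Brownian driver as a stand-in, whereas the talk's setting replaces it with a small Hermite process.

    import numpy as np

    def ou_drift_lse(X, delta):
        """Discrete least-squares estimator of theta in dX_t = -theta X_t dt + dZ_t
        from observations at regularly spaced times t_i = i * delta."""
        increments = np.diff(X)
        return -np.sum(X[:-1] * increments) / (delta * np.sum(X[:-1] ** 2))

    # Toy check with a Brownian driver (simplification; not the Hermite case)
    rng = np.random.default_rng(4)
    theta, delta, n = 2.0, 1e-3, 200_000
    X = np.zeros(n)
    for i in range(1, n):
        X[i] = X[i - 1] - theta * X[i - 1] * delta + np.sqrt(delta) * rng.normal()
    print(ou_drift_lse(X, delta))   # should be close to theta = 2.0
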
2024-08-28
15:00hrs.
Wan-Lun Wang. National Cheng Kung University
Multivariate Contaminated Normal Censored Regression Model: Properties and Maximum Likelihood Inference
Auditorio Ninoslav Bralic
Abstract:

The multivariate contaminated normal (MCN) distribution, which adds two parameters to those of the multivariate normal distribution, one controlling the proportion of mild outliers and the other specifying the degree of contamination, has been widely applied in robust statistical modeling of data. This paper extends the MCN model to deal with values that may be censored due to limits of quantification, referred to as the MCN with censoring (MCN-C) model. Further, it establishes the censored multivariate linear regression model in which the random errors follow the MCN distribution, named the MCN censored regression (MCN-CR) model. Two computationally feasible expectation conditional maximization (ECM) algorithms are developed for maximum likelihood estimation of the MCN-C and MCN-CR models. An information-based method is used to approximate the standard errors of location parameters and regression coefficients. The capability and superiority of the proposed models are illustrated by two real-data examples and simulation studies.
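
For reference, one common parametrization of the MCN density (stated here as an assumption, since the talk may parametrize it differently) is a two-component normal scale mixture sharing the same mean; a minimal Python sketch:

    import numpy as np
    from scipy.stats import multivariate_normal

    def mcn_pdf(y, mu, Sigma, alpha, eta):
        """Multivariate contaminated normal density (one common parametrization):
        1 - alpha is the proportion of mild outliers and eta > 1 inflates the
        covariance to set the degree of contamination."""
        good = multivariate_normal(mu, Sigma).pdf(y)
        bad = multivariate_normal(mu, eta * Sigma).pdf(y)
        return alpha * good + (1 - alpha) * bad

    mu, Sigma = np.zeros(2), np.eye(2)
    print(mcn_pdf(np.array([3.0, 3.0]), mu, Sigma, alpha=0.9, eta=10.0))

The MCN-C and MCN-CR models of the talk add censoring and a regression structure on top of this density.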

Keywords: Censored data; EM algorithm; Multivariate models; Outliers; Truncation.

2024-08-28
15:45hrs.
Tsung-I Lin. National Chung Hsing University
A robust factor analysis model utilizing the canonical fundamental skew-t distribution
Auditorio Ninoslav Bralic
Abstract:

Traditional factor analysis, which relies on the assumption of multivariate normality, has been extended by jointly incorporating the restricted multivariate skew-t (rMST) distribution for the unobserved factors and errors. However, the limited utility of the rMST distribution in capturing skewness concentrated in a single direction prompted the development of a more adaptable and robust factor analysis model. A more flexible, robust factor analysis model is introduced based on the broader canonical fundamental skew-t (CFUST) distribution, called the CFUSTFA model. The proposed new model can account for more complex features of skewness in multiple directions. An efficient alternating expectation conditional maximization algorithm fabricated under several reduced complete-data spaces is developed to estimate parameters under the maximum likelihood (ML) perspective. To assess the variability of parameter estimates, an information-based approach is employed to approximate the asymptotic covariance matrix of the ML estimators. The efficacy and practicality of the proposed techniques are demonstrated through the analysis of simulated and real datasets.

 

Keywords: AECM algorithm; Canonical fundamental skew-t distribution; Factor scores; Truncated multivariate distribution; Unrestricted multivariate skew-t distribution

2024-07-03
15:00hrs.
Eloy Alvarado. Universidad Técnica Federico Santa María
Archimedean-like spatial copulas and their applications
Auditorio Ninoslav Bralic
Abstract:

The Gaussian copula is a powerful tool that has been widely used to model spatial and/or temporal correlated data with arbitrary marginal distributions. However, this model can be restrictive as it expresses a reflection symmetric dependence.

Recently, Bevilacqua et al. (2024) proposed a new general class of spatial copula models that allows the generation of random fields with arbitrary marginal distributions and types of dependence that can be reflection symmetric or not, particularly focusing on an instance that can be seen as the spatial generalization of the classical Clayton copula. In this session, we will review this general class of Archimedean-like spatial copulas and explore the various spatial extensions that this construction allows. Specifically, the Clayton-like case will be examined along with two spatial copulas currently in development: the Ali-Mikhail-Haq and Gumbel spatial copulas. Additionally, we will present the ongoing development of an application of this methodology to model geo-referenced operational covariates using Weibull regression, which can be seen as the spatial extension of the widely known proportional hazard model.
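
For orientation, the classical (non-spatial) Clayton copula that the new model generalizes can be written and sampled in a few lines of Python; the lower-tail clustering it produces is the kind of reflection asymmetry the spatial construction is designed to capture.

    import numpy as np

    def clayton_cdf(u, v, theta):
        """Classical bivariate Clayton copula, theta > 0."""
        return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

    def clayton_sample(n, theta, rng):
        """Marshall-Olkin sampler: gamma frailty V and exponential shocks E."""
        V = rng.gamma(shape=1.0 / theta, scale=1.0, size=n)
        E = rng.exponential(size=(n, 2))
        return (1.0 + E / V[:, None]) ** (-1.0 / theta)

    rng = np.random.default_rng(5)
    UV = clayton_sample(10_000, theta=2.0, rng=rng)
    # Reflection asymmetry: points cluster near (0, 0) far more than near (1, 1)
    print(np.mean((UV < 0.05).all(axis=1)), np.mean((UV > 0.95).all(axis=1)))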

References

Bevilacqua, M., Alvarado, E. & Caamaño-Carrillo, C. (2024). A flexible Clayton-like spatial copula with application to bounded support data. Journal of Multivariate Analysis, 201, 105277.

2024-06-19
15:00hrs.
Johan Van Der Molen. PUC
Estimating the Posterior Similarity Matrix for Bayesian Cluster Analysis
Auditorio Ninoslav Bralic
Abstract:

Mixture models, and Dirichlet process mixtures in particular, are widely used in Bayesian cluster analysis. The Posterior Similarity Matrix (PSM) is crucial for understanding the cluster structure of the data, and it is typically estimated with Markov chain Monte Carlo (MCMC) methods. In this context, however, MCMC can be very sensitive to the initialization of the chains, and convergence is often slow, visiting only a very small number of partitions of the data. The result is a restricted version of the posterior, which can adversely affect the estimation of both the PSM and the clusters.

This work proposes a more efficient method for estimating the PSM without MCMC. Based on an analytic formula, we aim to approximate the entries of the PSM directly, particularly for Dirichlet process mixtures, reducing the computational cost and improving the precision of the estimate. In this talk I will present several approximation methods, with preliminary results from simulations and real data that illustrate their advantages over MCMC as well as their own challenges.
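
For contrast with the proposed analytic approximation, the standard MCMC-based estimate of the PSM is simply the pairwise co-clustering frequency across sampled partitions; a minimal Python sketch with hypothetical partition draws:

    import numpy as np

    def posterior_similarity_matrix(partitions):
        """MCMC estimate of the PSM: entry (i, j) is the fraction of sampled
        partitions in which observations i and j share a cluster."""
        partitions = np.asarray(partitions)          # shape (n_samples, n_obs)
        same = partitions[:, :, None] == partitions[:, None, :]
        return same.mean(axis=0)

    # Three sampled partitions of five observations
    draws = [[0, 0, 1, 1, 2],
             [0, 0, 0, 1, 1],
             [0, 1, 1, 2, 2]]
    print(posterior_similarity_matrix(draws))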

2024-06-07
10:00hrs.
Alba Martinez. Universidad Diego Portales
Structural Equation Models and Multiblock Data Analysis: Theory and Applications
Auditorio Ninoslav Bralic
Abstract:

Structural equation models aim to represent and describe relationships between constructs, and between constructs and observed variables, whereas multiblock data analysis focuses on explaining the relationships between several blocks of variables. Multiblock data analysis enables the creation of latent variable scores and the estimation of structural equation models. A general framework is provided by Regularized Generalized Canonical Correlation Analysis (RGCCA). In this talk, I present application examples to illustrate a context for understanding the fundamental concepts of both fields and their interconnections. I review the main definitions related to RGCCA, the optimization problem, the search algorithm, and special cases. Further research is outlined.
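
As a small point of reference (not the RGCCA algorithm itself), classical canonical correlation analysis is the simplest two-block case in this framework; a short Python sketch with simulated blocks:

    import numpy as np
    from sklearn.cross_decomposition import CCA

    # Two blocks of variables observed on the same 100 units, sharing one latent factor
    rng = np.random.default_rng(6)
    latent = rng.normal(size=(100, 1))
    X = latent @ rng.normal(size=(1, 5)) + 0.3 * rng.normal(size=(100, 5))
    Y = latent @ rng.normal(size=(1, 4)) + 0.3 * rng.normal(size=(100, 4))

    cca = CCA(n_components=1)
    x_scores, y_scores = cca.fit_transform(X, Y)   # latent variable scores per block
    print(np.corrcoef(x_scores[:, 0], y_scores[:, 0])[0, 1])

RGCCA generalizes this idea to several blocks, with regularization and different schemes for connecting the blocks, which is what makes it usable for estimating structural equation models.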

2024-05-29
15:00hrs.
Steven MacEachern. The Ohio State University
Familial inference: Tests for hypotheses on a family of centers
Auditorio Ninoslav Bralic
Abstract:
Many scientific disciplines face a replicability crisis.  While these crises have many drivers, we focus on one.  Statistical hypotheses are translations of scientific hypotheses into statements about one or more distributions.  The most basic tests focus on the centers of the distributions.   Such tests implicitly assume a specific center, e.g., the mean or the median.  Yet, scientific hypotheses do not always specify a particular center.  This ambiguity leaves a gap between scientific theory and statistical practice that can lead to rejection of a true null.  The gap is compounded when we consider deficiencies in the formal statistical model.  Rather than testing a single center, we propose testing a family of plausible centers, such as those induced by the Huber loss function (the Huber family).  Each center in the family generates a point null hypothesis and the resulting family of hypotheses constitutes a familial null hypothesis.  A Bayesian nonparametric procedure is devised to test the familial null.  Implementation for the Huber family is facilitated by a novel pathwise optimization routine.  Along the way, we visit the question of what it means to be the center of a distribution.  Surprisingly, we have been unable to find a clear and comprehensive definition of this concept in the literature. 
This is joint work with Ryan Thompson (University of New South Wales), Catherine Forbes (Monash University), and Mario Peruggia (The Ohio State University). 
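
The Huber family of centers mentioned above can be made concrete with a short Python sketch: each value of the Huber threshold defines one plausible center, sweeping from the median (small threshold) to the mean (large threshold), and the familial null asks whether a hypothesized value is compatible with some member of this family. The test itself (a Bayesian nonparametric procedure with a pathwise optimization routine) is not reproduced here.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def huber_center(x, delta):
        """Center of x under the Huber loss with threshold delta; delta -> 0
        recovers the median and delta -> infinity recovers the mean."""
        def loss(c):
            u = np.abs(x - c)
            return np.sum(np.where(u <= delta, 0.5 * u ** 2,
                                   delta * (u - 0.5 * delta)))
        return minimize_scalar(loss, bounds=(x.min(), x.max()),
                               method="bounded").x

    # A skewed sample: the Huber centers sweep from the median toward the mean
    x = np.random.default_rng(7).gamma(shape=2.0, size=200)
    print([round(huber_center(x, d), 3) for d in (0.01, 0.5, 2.0, 100.0)])
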
2024-04-19
11:00hrs.
Garritt Page. Brigham Young University
Informed Bayesian Finite Mixture Models via Asymmetric Dirichlet Priors
Auditorio Ninoslav Bralic
Abstract:
Finite mixture models are flexible methods that are commonly used for model-based clustering. A recent focus in the model-based clustering literature is to highlight the difference between the number of components in a mixture model and the number of clusters. The number of clusters is more relevant from a practical standpoint, but to date the focus of prior distribution formulation has been on the number of components. In light of this, we develop a finite mixture methodology that permits eliciting prior information directly on the number of clusters in an intuitive way. This is done by employing an asymmetric Dirichlet distribution as a prior on the weights of a finite mixture. Further, a penalized complexity motivated prior is employed for the Dirichlet shape parameter. We illustrate the ease with which prior information can be elicited via our construction and the flexibility of the resulting induced prior on the number of clusters. We also demonstrate the utility of our approach using numerical experiments and two real-world data sets.
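
The effect of an asymmetric Dirichlet prior on the induced number of clusters can be previewed with a simple Monte Carlo sketch in Python (the penalized-complexity prior on the shape parameter is not included here):

    import numpy as np

    def prior_num_clusters(alpha, n_obs=200, n_sims=5000, rng=None):
        """Monte Carlo draw from the prior on the number of clusters induced
        by a finite mixture whose weights follow Dirichlet(alpha)."""
        rng = rng or np.random.default_rng(8)
        counts = []
        for _ in range(n_sims):
            w = rng.dirichlet(alpha)
            z = rng.choice(len(alpha), size=n_obs, p=w)
            counts.append(len(np.unique(z)))       # occupied components
        return np.bincount(counts, minlength=len(alpha) + 1)[1:] / n_sims

    K = 10
    symmetric = prior_num_clusters(np.full(K, 0.5))
    asymmetric = prior_num_clusters(np.r_[np.full(3, 2.0), np.full(K - 3, 0.05)])
    print(symmetric.round(3))    # prior mass spread over many cluster counts
    print(asymmetric.round(3))   # prior mass concentrated near 3 clusters
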
2024-04-17
15:00hrs.
Tarik Faouzi. USACH
Marquis de Condorcet: from Condorcet's theory of the vote to the method of partition
Sala de usos múltiples, 1er piso
Abstract:

We introduce two new approaches to clustering categorical and mixed data: Condorcet clustering with a fixed number of groups, denoted $$\alpha$$-Condorcet and Mixed-Condorcet respectively. Like k-modes, this approach is essentially based on similarity and dissimilarity measures. The presentation is divided into three parts: first, we propose a new Condorcet criterion with a fixed number of groups (to assign cases to clusters). In the second part, we propose a heuristic algorithm to carry out the task. In the third part, we compare $$\alpha$$-Condorcet clustering with k-modes clustering and Mixed-Condorcet with k-prototypes. The comparison is made using a quality index, measurement accuracy, and a within-cluster sum-of-squares index.

Our findings are illustrated using real datasets: the feline dataset, the US Census 1990 dataset and other data.
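
One standard form of a Condorcet-type criterion scores each within-cluster pair of records by its attribute agreements minus disagreements; the Python sketch below shows only that generic ingredient with hypothetical categorical records, whereas the talk's α-Condorcet and Mixed-Condorcet versions fix the number of groups and handle mixed data.

    import numpy as np

    def condorcet_pair_score(a, b):
        """Agreements minus disagreements between two categorical records,
        the pairwise ingredient of a Condorcet-type clustering criterion."""
        a, b = np.asarray(a), np.asarray(b)
        return int((a == b).sum()) - int((a != b).sum())

    def condorcet_criterion(data, labels):
        """Sum of pairwise scores over pairs placed in the same cluster."""
        labels = np.asarray(labels)
        total = 0
        for i in range(len(data)):
            for j in range(i + 1, len(data)):
                if labels[i] == labels[j]:
                    total += condorcet_pair_score(data[i], data[j])
        return total

    cats = [["a", "x"], ["a", "x"], ["b", "y"], ["b", "y"]]
    print(condorcet_criterion(cats, labels=[0, 0, 1, 1]))   # well-separated grouping
    print(condorcet_criterion(cats, labels=[0, 0, 0, 1]))   # poorer grouping, lower score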

2024-04-10
15:00hrs.
Ernesto San Martín. PUC
Exploring the Non-Treatment of Missing Data: An Analysis with Chilean Administrative Data
Auditorio Ninoslav Bralic
Abstract:
Statistics, or Statistical Methods as Fisher called them in 1922, has been a fundamental tool of the observational sciences (to use the language of Laplace and Quetelet) since the nineteenth century. Its application rests on three stages outlined by Fisher at a time when the fundamental principles of the discipline were still "shrouded in obscurity": the problem of specification, the problem of estimation, and the problem of distribution. The key lies in the problem of specification: it is not only a matter of defining the probability distribution underlying the observations, but also of representing the questions researchers have about the observed phenomenon through parameters of interest.

An important part of social research is based on observations collected through surveys. No one is forced to take part in a survey unless a law says otherwise, as with the Census or plebiscites. Even under such obligations, participants are not required to answer every question. Missing data are therefore common. These data, together with the answers provided by other participants, are part of what is observed. It is curious, then, that so much effort goes into treating missing data so that they are no longer missing (through imputation) in order to solve the specification problem.

However, a fundamental premise of the observational sciences is to accept what is observed. It is therefore urgent not to treat missing data, but to include them in the specification of the data-generating process. We illustrate this way of proceeding with a small panel: the questionnaires administered to parents and caregivers of schoolchildren in 2004 and 2006. These data contain several missing-data patterns; we make them explicit in order to partially identify the conditional distribution of family income in 2006 given family income in 2004. This parameter of interest formalizes a substantive question of our research team at the Núcleo Milenio MOVI: how the income of sons and daughters has changed relative to the income of their parents.
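
The partial-identification viewpoint can be illustrated with worst-case (no-assumption) bounds in the spirit of Manski: instead of imputing missing responses, one reports the range of parameter values compatible with every possible configuration of the missing data. A minimal Python sketch with hypothetical survey data:

    import numpy as np

    def worst_case_bounds(y, observed):
        """No-assumption (partial-identification) bounds for P(Y = 1) when some
        responses are missing: missing cases are set to all 0s or all 1s
        instead of being imputed."""
        y, observed = np.asarray(y, float), np.asarray(observed, bool)
        p_missing = 1.0 - observed.mean()
        p_obs_and_one = (y * observed).mean()      # uses only observed responses
        return p_obs_and_one, p_obs_and_one + p_missing

    # Hypothetical survey: 1 = income above a threshold, some answers missing
    rng = np.random.default_rng(9)
    y = rng.binomial(1, 0.4, size=1000)
    observed = rng.binomial(1, 0.7, size=1000).astype(bool)
    print(worst_case_bounds(y, observed))
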
2024-01-18
14:30hrs.
Manuel González. Universidad de La Frontera
Regularization Methods Applied to Chemometrics Problems
Sala 2
2024-01-18
11:30hrs.
Christian Caamaño. Universidad del Bío-Bío
A flexible Clayton-like spatial copula with application to bounded support data.
Sala 2
Abstract:
The Gaussian copula is a powerful tool that has been widely used to model spatially and/or temporally correlated data with arbitrary marginal distributions. However, this kind of model can be too restrictive since it expresses a reflection symmetric dependence. In this work, we propose a new spatial copula model that yields random fields with arbitrary marginal distributions and a type of dependence that can be reflection symmetric or not.
In particular, we propose a new random field with uniform marginal distribution that can be viewed as a spatial generalization of the classical Clayton copula model. It is obtained through a power transformation of a specific instance of a beta random field, which in turn is obtained through a transformation of two independent Gamma random fields.
For the proposed random field we study the second-order properties and provide analytic expressions for the bivariate distribution and its correlation. Moreover, in the reflection symmetric case, we study the associated geometrical properties.
As an application of the proposed model we focus on spatial modeling of data with bounded support. Specifically, we consider spatial regression models with marginal distributions of the beta type. In a simulation study, we investigate the use of the weighted pairwise composite likelihood method for the estimation of this model. Finally, the effectiveness of our methodology is illustrated by analyzing point-referenced vegetation index data, using the Gaussian copula as a benchmark. Our developments have been implemented in an open-source package for the R statistical environment.

Keywords: Archimedean Copula, Beta random fields, Composite likelihood, Reflection Asymmetry.