The Department of Statistics at the Pontificia Universidad Católica de Chile has one of the largest and most distinguished faculties among Chilean and Latin American universities. To strengthen regional ties among researchers in statistics and related fields, the department is organizing a local seminar that will showcase the work of researchers from the region as well as the current research problems of UC academics, with the aim of building bridges for future collaborations.
Assessing agreement between instruments is fundamental in clinical and observational studies to evaluate how similarly two methods measure the same set of subjects. In this talk, we present two extensions of a widely used coefficient for assessing agreement between continuous variables. The first extension introduces a novel agreement coefficient for lattice sequences observed over the same areal units, motivated by the comparison of poverty measurement methodologies in Chile. The second extension proposes a new coefficient, denoted as ρ1, designed to measure agreement between continuous measurements obtained from two instruments observing the same experimental units. Unlike traditional approaches, ρ1 is based on L1 distances, providing robustness to outliers and avoiding dependence on nuisance parameters. Both proposals are supported by theoretical results, an inference framework, and simulation studies that illustrate their performance and practical relevance.
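The abstract does not name the "widely used coefficient" it extends. As a point of reference only, and assuming it refers to Lin's concordance correlation coefficient, a minimal sketch of that classical coefficient is given below. The function name is hypothetical, and this is not the proposed ρ1 or the lattice extension.

```python
import numpy as np

def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient (assumed baseline, not rho_1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mean_x, mean_y = x.mean(), y.mean()
    # Population (ddof=0) variances and covariance, as in the moment form of the coefficient.
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mean_x) * (y - mean_y))
    # Penalizes both lack of correlation and location/scale shifts between instruments.
    return 2.0 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)
```

The coefficient equals 1 only when the two instruments agree exactly, and a constant offset between them lowers it even if the correlation is perfect.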
The dynamics of a rain forest are extremely complex, involving births, deaths, and growth of trees with complex interactions between trees, animals, climate, and environment. We consider the patterns of recruits (new trees) and dead trees between rain forest censuses. For a current census we specify regression models for the conditional intensity of recruits and the conditional probabilities of death given the current trees and spatial covariates. We estimate regression parameters using conditional composite likelihood functions that only involve the conditional first-order properties of the data. When constructing assumption-lean estimators of covariance matrices of parameter estimates, we only need mild assumptions of decaying conditional correlations in space, while assumptions regarding correlations over time are avoided by exploiting conditional centering of composite likelihood score functions. Time series of point patterns from rain forest censuses are quite short, while each point pattern covers a fairly large spatial region. To obtain asymptotic results we therefore use a central limit theorem for the fixed-timespan, increasing-spatial-domain asymptotic setting. This also allows us to handle the challenge of using stochastic covariates constructed from past point patterns. Conveniently, it suffices to impose weak dependence assumptions on the innovations of the space-time process. We investigate the proposed methodology through simulation studies and an application to rain forest data.
Matrix-variate distributions are powerful tools for modeling three-way datasets that often arise in longitudinal and multidimensional spatio-temporal studies. However, observations in these datasets can be missing or subject to detection limits because of restrictions of the experimental apparatus. Here, we develop an efficient EM-type algorithm for maximum likelihood estimation of parameters, in the context of interval-censored and/or missing data, utilizing the matrix-variate normal distribution. This algorithm provides closed-form expressions that rely on truncated moments, offering a reliable approach to parameter estimation under these conditions. Results obtained from the analysis of both simulated data and real case studies concerning water quality monitoring are reported to demonstrate the effectiveness of the proposed method.
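To make the matrix-variate normal distribution concrete, the sketch below draws one matrix from it using the standard representation X = M + A Z Bᵀ, where the row covariance is U = A Aᵀ and the column covariance is V = B Bᵀ. The function and argument names are illustrative assumptions. It does not implement the EM-type algorithm or the censoring mechanism described in the abstract.

```python
import numpy as np

def sample_matrix_normal(mean, row_cov, col_cov, rng):
    """Draw one sample from a matrix-variate normal MN(mean, row_cov, col_cov)."""
    mean = np.asarray(mean, dtype=float)
    # Cholesky factors: row_cov = A A^T and col_cov = B B^T (both must be positive definite).
    row_chol = np.linalg.cholesky(np.asarray(row_cov, dtype=float))
    col_chol = np.linalg.cholesky(np.asarray(col_cov, dtype=float))
    # Z has independent standard normal entries with the same shape as the mean.
    z = rng.standard_normal(mean.shape)
    # X = M + A Z B^T, so vec(X) is normal with covariance kron(col_cov, row_cov).
    return mean + row_chol @ z @ col_chol.T
```

For example, `sample_matrix_normal(np.zeros((2, 3)), np.eye(2), np.eye(3), np.random.default_rng(0))` returns a 2 × 3 array.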
For large clustered survival studies, the usual parametric and semiparametric regression methods are inadequate when the appropriate functional forms of the covariates and their interactions in the hazard functions are unknown, and when random cluster effects as well as some unknown cluster-level covariates are spatially correlated. We present a general nonparametric method for such studies under the Bayesian ensemble learning paradigm called Soft Bayesian Additive Regression Trees (SBART). Our additional methodological and computational challenges include a large number of clusters, variable cluster sizes, and proper statistical augmentation of the unobservable cluster-level covariate using a data registry different from the main survival study. We use an innovative three-step computational tool based on latent variables to address these computational challenges. Using two different data resources, we illustrate the practical implementation of our method and its advantages over existing methods by assessing the impact of interventions on some county-level (cluster-level) and patient-level covariates to mitigate existing disparities in breast cancer survival across 67 Florida counties (clusters). The Florida Cancer Registry (FCR) is used to obtain clustered survival data with patient-level covariates, and the Behavioral Risk Factor Surveillance Survey (BRFSS) is used to obtain further information on an unobservable county-level covariate, Screening Mammography Utilization (SMU).
Neal (1996) showed that infinitely wide single-layer Bayesian neural networks (BNNs) converge to a Gaussian process (GP) when the weights have a finite-variance prior. Cho & Saul (2009) presented a recursive formula for deep kernel processes, relating the covariance matrix of one layer to the covariance matrix of the previous layer. Moreover, they obtained an explicit formula for the recursion for several common activation functions, including the ReLU. Subsequent work has extended these results to more complex architectures, obtaining similar limits for deeper networks. Nevertheless, recent work, including Aitchison et al. (2021), highlights that the covariance kernels obtained in this way are deterministic and therefore preclude representation learning in the limiting network, which amounts to learning a posterior kernel that is non-degenerate given the observations. To address this, they propose adding artificial noise so that the kernel retains stochasticity. However, this artificial noise can be criticized because it does not emerge from the limit of a BNN architecture. To avoid this, we show that a deep Bayesian neural network in which the width of each layer goes to infinity and all weights have a joint elliptical distribution with infinite variance converges to a process with α-stable marginals in each layer that admit a conditionally Gaussian representation. These random covariances can be related recursively in the manner of Cho & Saul (2009), even though the processes have stable behavior and hence the covariances are not necessarily defined. Our results generalize our previous work, Loría & Bhadra (2024), from single-layer networks to multilayer networks, while avoiding its heavy computational burden.
The computational and statistical advantages over other methods are demonstrated in simulations and on benchmark datasets.
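As background for the layer-to-layer recursion attributed to Cho & Saul (2009), the sketch below applies the degree-1 arc-cosine (ReLU) kernel map repeatedly, starting from the linear kernel of the inputs. This is the deterministic finite-variance recursion, not the random-covariance α-stable construction proposed in the talk. The function names and the normalization (no extra weight-variance factor) are assumptions.

```python
import numpy as np

def relu_kernel_step(kernel):
    """One step of the degree-1 arc-cosine recursion on a covariance matrix."""
    # Norms implied by the current kernel; inputs must be nonzero to avoid division by zero.
    diag = np.sqrt(np.diag(kernel))
    norm = np.outer(diag, diag)
    # Clip guards against rounding slightly outside [-1, 1] before arccos.
    cos_theta = np.clip(kernel / norm, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    # k_next(x, y) = (1/pi) * |x| |y| * (sin(theta) + (pi - theta) * cos(theta))
    return norm * (np.sin(theta) + (np.pi - theta) * cos_theta) / np.pi

def deep_relu_kernel(inputs, depth):
    """Compose the recursion `depth` times, starting from the linear kernel X X^T."""
    inputs = np.asarray(inputs, dtype=float)
    kernel = inputs @ inputs.T
    for _ in range(depth):
        kernel = relu_kernel_step(kernel)
    return kernel
```

Under this normalization the diagonal of the kernel is unchanged by each step, so only the implied angles between inputs evolve with depth.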
The multivariate contaminated normal (MCN) distribution, which contains two extra parameters relative to the multivariate normal distribution, one controlling the proportion of mild outliers and the other specifying the degree of contamination, has been widely applied in robust statistical modeling. This paper extends the MCN model to deal with possibly censored values due to limits of quantification, referred to as the MCN with censoring (MCN-C) model. Further, it establishes the censored multivariate linear regression model where the random errors have the MCN distribution, named the MCN censored regression (MCN-CR) model. Two computationally feasible expectation conditional maximization (ECM) algorithms are developed for maximum likelihood estimation of the MCN-C and MCN-CR models. An information-based method is used to approximate the standard errors of location parameters and regression coefficients. The capability and superiority of the proposed models are illustrated by two real-data examples and simulation studies.
Keywords: Censored data; EM algorithm; Multivariate models; Outliers; Truncation.
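The role of the two extra MCN parameters can be illustrated by simulation. The sketch below assumes one common parameterization, a two-component mixture in which a fraction `outlier_prop` of the points has its covariance inflated by a factor `inflation` greater than 1. The abstract does not state its parameterization, so these names and this form are assumptions, and censoring is not simulated.

```python
import numpy as np

def sample_contaminated_normal(n, mean, cov, outlier_prop, inflation, rng):
    """Draw n points from a two-component contaminated normal mixture (assumed form)."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    # Each point is a mild outlier with probability outlier_prop.
    is_outlier = rng.random(n) < outlier_prop
    # Outliers share the same mean but have covariance inflation * cov.
    scale = np.where(is_outlier, np.sqrt(inflation), 1.0)
    noise = rng.multivariate_normal(np.zeros(mean.size), cov, size=n)
    return mean + scale[:, None] * noise, is_outlier
```

Under this form the marginal covariance is `(1 - outlier_prop + outlier_prop * inflation) * cov`, which shows how even a small proportion of contaminated points inflates the spread.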
Traditional factor analysis, which relies on the assumption of multivariate normality, has been extended by jointly incorporating the restricted multivariate skew-t (rMST) distribution for the unobserved factors and errors. However, the rMST distribution can only capture skewness concentrated in a single direction, which motivates a more adaptable and robust factor analysis model. Such a model, called the CFUSTFA model, is introduced based on the broader canonical fundamental skew-t (CFUST) distribution. The proposed model can account for more complex features of skewness in multiple directions. An efficient alternating expectation conditional maximization algorithm, constructed under several reduced complete-data spaces, is developed to estimate parameters under the maximum likelihood (ML) perspective. To assess the variability of parameter estimates, an information-based approach is employed to approximate the asymptotic covariance matrix of the ML estimators. The efficacy and practicality of the proposed techniques are demonstrated through the analysis of simulated and real datasets.
Keywords: AECM algorithm; Canonical fundamental skew-t distribution; Factor scores; Truncated multivariate t distribution; Unrestricted multivariate skew-t distribution
The Gaussian copula is a powerful tool that has been widely used to model spatial and/or temporal correlated data with arbitrary marginal distributions. However, this model can be restrictive as it expresses a reflection symmetric dependence.
Recently, Bevilacqua et al. (2024) proposed a new general class of spatial copula models that allows the generation of random fields with arbitrary marginal distributions and types of dependence that can be reflection symmetric or not, particularly focusing on an instance that can be seen as the spatial generalization of the classical Clayton copula. In this session, we will review this general class of Archimedean-like spatial copulas and explore the various spatial extensions that this construction allows. Specifically, the Clayton-like case will be examined along with two spatial copulas currently in development: the Ali-Mikhail-Haq and Gumbel spatial copulas. Additionally, we will present the ongoing development of an application of this methodology to model geo-referenced operational covariates using Weibull regression, which can be seen as the spatial extension of the widely known proportional hazards model.
References
Bevilacqua, M., Alvarado, E. & Caamaño-Carrillo, C. A flexible Clayton-like spatial copula with application to bounded support data. Journal of Multivariate Analysis, 201, 105277 (2024).
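For readers unfamiliar with the classical bivariate Clayton copula that the spatial construction generalizes, a minimal sketch of its distribution function and a conditional-inversion sampler follows. It covers only the classical non-spatial case with `theta > 0`. The function names are illustrative, and nothing here reproduces the spatial model of Bevilacqua et al. (2024).

```python
import numpy as np

def clayton_cdf(u, v, theta):
    """Classical bivariate Clayton copula C(u, v) for theta > 0."""
    return (u ** (-theta) + v ** (-theta) - 1.0) ** (-1.0 / theta)

def sample_clayton(n, theta, rng):
    """Sample n pairs (u, v) from the Clayton copula by conditional inversion."""
    # 1 - random() lies in (0, 1], which avoids raising zero to a negative power.
    u = 1.0 - rng.random(n)
    w = 1.0 - rng.random(n)
    # Solve dC/du (v | u) = w for v in closed form.
    v = (u ** (-theta) * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
    return u, v
```

The dependence is asymmetric: the pairs cluster more tightly near (0, 0) than near (1, 1), which is the lack of reflection symmetry that the Gaussian copula cannot express. For this copula, Kendall's tau equals `theta / (theta + 2)`.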
Mixture models, especially Dirichlet process mixtures, are widely used in Bayesian cluster analysis. The posterior similarity matrix (PSM) is crucial for understanding the cluster structure of the data, and it is typically estimated with Markov chain Monte Carlo (MCMC) methods. In this context, however, MCMC can be very sensitive to the initialization of the chains, and convergence is often slow, with the chains visiting only a very small number of partitions of the data. This results in a restricted version of the posterior, which can negatively affect the estimation of both the PSM and the clusters.
This work proposes a more efficient method for estimating the PSM without using MCMC. Based on an analytical formula, it seeks to approximate the entries of the PSM directly, particularly for Dirichlet process mixtures, reducing the computational cost and improving the accuracy of the estimate. In this presentation I will show several approximation methods, with preliminary results obtained from simulations and real data, illustrating their advantages over MCMC as well as their own challenges.
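For context, the usual MCMC-based estimate that this work aims to replace can be sketched as follows: each PSM entry is the fraction of sampled partitions in which two observations share a cluster. The sketch assumes the cluster-label draws are already available as an array with one row per MCMC iteration. The function name is hypothetical, and the analytical approximation proposed in the talk is not shown.

```python
import numpy as np

def posterior_similarity_matrix(label_samples):
    """Monte Carlo PSM estimate from cluster labels of shape (n_draws, n_obs)."""
    labels = np.asarray(label_samples)
    # same_cluster[t, i, j] is True when observations i and j share a label in draw t.
    same_cluster = labels[:, :, None] == labels[:, None, :]
    # Averaging over draws estimates the posterior co-clustering probability.
    return same_cluster.mean(axis=0)
```

Because the estimate depends only on which partitions the chain visits, a chain stuck in a few partitions yields a PSM with many entries near 0 or 1, which is the restriction described above.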
Structural equation models aim to represent and describe relationships between constructs, and between constructs and observed variables, whereas multiblock data analysis focuses on explaining the relationships between several blocks of variables. Multiblock data analysis enables the creation of latent variable scores and the estimation of structural equation models. A general framework is provided by Regularized Generalized Canonical Correlation Analysis (RGCCA). In this talk, I present application examples to illustrate a context for understanding the fundamental concepts of both fields and their interconnections. I review the main definitions related to RGCCA, the optimization problem, the search algorithm, and special cases. Further research is outlined.
We introduce two new approaches to clustering categorical and mixed data based on Condorcet clustering with a fixed number of groups, denoted $$\alpha$$-Condorcet and Mixed-Condorcet, respectively. Like k-modes, these approaches are essentially based on similarity and dissimilarity measures. The presentation is divided into three parts. First, we propose a new Condorcet criterion with a fixed number of groups (to assign cases to clusters). Second, we propose a heuristic algorithm to carry out the task. Third, we compare $$\alpha$$-Condorcet clustering with k-modes clustering, and Mixed-Condorcet with k-prototypes. The comparison is made using a quality index, an accuracy measure, and a within-cluster sum-of-squares index.
Our findings are illustrated using real datasets: the feline dataset, the US Census 1990 dataset and other data.