Statistics and Data Science Colloquium of the Pontificia Universidad Católica de Chile

We will present several nonparametric methods for estimating a density with compact support, or one defined on a domain whose geometry may be complex. We will see how the quality of classical estimators is affected at the boundary of the domain. In particular, we will propose a new estimator based on local polynomials for estimating the density at a point x. We will see that this estimator can adapt to different geometries and that it has optimal properties in terms of the mean squared error and of the regularity of the function to be estimated. We will compare this method with the sparr method, a popular alternative for density estimation on domains with complicated geometry.
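The boundary effect mentioned in the abstract can be illustrated with a minimal sketch. It assumes a uniform sample on [0, 1] and a plain Gaussian kernel estimator; the function name `gaussian_kde`, the sample size and the bandwidth are hypothetical choices made for illustration, and this is the classical estimator, not the local polynomial estimator proposed in the talk.

```python
import math
import random


def gaussian_kde(sample, x, bandwidth):
    """Classical Gaussian kernel density estimate at the point x."""
    norm = len(sample) * bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)
    return total / norm


rng = random.Random(0)
# Uniform sample: the true density equals 1 everywhere on [0, 1].
sample = [rng.random() for _ in range(2000)]

at_boundary = gaussian_kde(sample, 0.0, bandwidth=0.05)
in_interior = gaussian_kde(sample, 0.5, bandwidth=0.05)

print(f"estimate at x = 0.0: {at_boundary:.3f}")
print(f"estimate at x = 0.5: {in_interior:.3f}")
```

At x = 0 roughly half of the kernel mass falls outside the support, so the estimate there tends to sit near half the true value, while the interior estimate stays close to 1.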

2024-09-06
09:40hrs.

Hector Araya. Universidad Adolfo Ibañez
Least squares estimation for the Ornstein-Uhlenbeck process with small Hermite noise and some generalizations
Auditorio Ninoslav Bralic
Abstract:

We consider the problem of the drift parameter estimation for a non-Gaussian long memory Ornstein–Uhlenbeck process driven by a Hermite process. To estimate the unknown parameter, discrete time high-frequency observations at regularly spaced time points and the least squares estimation method are used. By means of techniques based on Wiener chaos and multiple stochastic integrals, the consistency and the limit distribution of the least squares estimator of the drift parameter have been established. To show the computational implementation of the obtained results, different simulation examples are given. Finally, an extension to a type of iterated Ornstein–Uhlenbeck is discussed.
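As a rough illustration of least squares drift estimation from high-frequency observations, the sketch below simulates dX = -theta X dt + eps dB with an Euler scheme and applies one common discrete form of the least squares estimator. It assumes Brownian noise as a stand-in, because simulating a general Hermite process is beyond a short example; the function names, parameter values and the exact form of the estimator are illustrative assumptions, not taken from the talk.

```python
import math
import random


def simulate_ou(theta, eps, x0, horizon, n, rng):
    """Euler scheme for dX = -theta * X dt + eps dB on [0, horizon]."""
    dt = horizon / n
    path = [x0]
    for _ in range(n):
        x = path[-1]
        noise = eps * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x - theta * x * dt + noise)
    return path, dt


def least_squares_drift(path, dt):
    """Minimize sum (X_{i+1} - X_i + theta * X_i * dt)^2 over theta."""
    num = sum(x * (y - x) for x, y in zip(path, path[1:]))
    den = dt * sum(x * x for x in path[:-1])
    return -num / den


rng = random.Random(1)
# Assumed values: true drift 2.0, small noise level 0.01, 1000 observations.
path, dt = simulate_ou(theta=2.0, eps=0.01, x0=1.0, horizon=1.0, n=1000, rng=rng)
theta_hat = least_squares_drift(path, dt)

# Noise-free path: the estimator recovers the drift up to rounding error.
clean_path, clean_dt = simulate_ou(2.0, 0.0, 1.0, 1.0, 1000, rng)
theta_clean = least_squares_drift(clean_path, clean_dt)

print(f"estimate with small noise: {theta_hat:.3f}")
```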

2024-08-28
15:00hrs.

Wan-Lun Wang. National Cheng Kung University
Multivariate Contaminated Normal Censored Regression Model: Properties and Maximum Likelihood Inference
Auditorio Ninoslav Bralic
Abstract:

The multivariate contaminated normal (MCN) distribution which contains two extra parameters with respect to parameters of the multivariate normal distribution, one for controlling the proportion of mild outliers and the other for specifying the degree of contamination, has been widely applied in robust statistical modeling of the data. This paper extends the MCN model to deal with possibly censored values due to limits of quantification, referred to as the MCN with censoring (MCN-C) model. Further, it establishes the censored multivariate linear regression model where the random errors have the MCN distribution, named the MCN censored regression (MCN-CR) model. Two computationally feasible expectation conditional maximization (ECM) algorithms are developed for maximum likelihood estimation of the MCN-C and MCN-CR models. An information-based method is used to approximate the standard errors of location parameters and regression coefficients. The capability and superiority of the proposed models are illustrated by two real-data examples and simulation studies.

Keywords: Censored data; EM algorithm; Multivariate models; Outliers; Truncation.

2024-08-28
15:45hrs.

Tsung-I Lin. National Chung Hsing University
A robust factor analysis model utilizing the canonical fundamental skew-t distribution
Auditorio Ninoslav Bralic
Abstract:

Traditional factor analysis, which relies on the assumption of multivariate normality, has been extended by jointly incorporating the restricted multivariate skew-t (rMST) distribution for the unobserved factors and errors. However, the limited utility of the rMST distribution in capturing skewness concentrated in a single direction prompted the development of a more adaptable and robust factor analysis model. A more flexible, robust factor analysis model is introduced based on the broader canonical fundamental skew-t (CFUST) distribution, called the CFUSTFA model. The proposed new model can account for more complex features of skewness in multiple directions. An efficient alternating expectation conditional maximization algorithm fabricated under several reduced complete-data spaces is developed to estimate parameters under the maximum likelihood (ML) perspective. To assess the variability of parameter estimates, an information-based approach is employed to approximate the asymptotic covariance matrix of the ML estimators. The efficacy and practicality of the proposed techniques are demonstrated through the analysis of simulated and real datasets.

Keywords: AECM algorithm; Canonical fundamental skew-t distribution; Factor scores; Truncated multivariate t distribution; Unrestricted multivariate skew-t distribution

2024-07-03
15:00hrs.

Eloy Alvarado. Universidad Técnica Federico Santa Maria
Archimedean-like spatial copulas and their applications
Auditorio Ninoslav Bralic
Abstract:

The Gaussian copula is a powerful tool that has been widely used to model spatial and/or temporal correlated data with arbitrary marginal distributions. However, this model can be restrictive as it expresses a reflection symmetric dependence.

Recently, Bevilacqua et al. (2024) proposed a new general class of spatial copula models that allows the generation of random fields with arbitrary marginal distributions and types of dependence that can be reflection symmetric or not, particularly focusing on an instance that can be seen as the spatial generalization of the classical Clayton copula. In this session, we will review this general class of Archimedean-like spatial copulas and explore the various spatial extensions that this construction allows. Specifically, the Clayton-like case will be examined along with two spatial copulas currently in development: the Ali-Mikhail-Haq and Gumbel spatial copulas. Additionally, we will present the ongoing development of an application of this methodology to model geo-referenced operational covariates using Weibull regression, which can be seen as the spatial extension of the widely known proportional hazard model.

References

Bevilacqua, M., Alvarado, E. & Caamaño-Carrillo, C. A flexible Clayton-like spatial copula with application to bounded support data. Journal of Multivariate Analysis, 201, 105277 (2024).

2024-06-19
15:00hrs.

Johan Van Der Molen. PUC
Estimating the Posterior Similarity Matrix for Bayesian cluster analysis
Auditorio Ninoslav Bralic
Abstract:

Mixture models, especially Dirichlet Process mixtures, are widely used in Bayesian cluster analysis. The Posterior Similarity Matrix (PSM) is crucial for understanding the cluster structure of the data, and it is typically estimated with Markov chain Monte Carlo (MCMC) methods. In this context, however, MCMC can be very sensitive to the initialization of the chains, and convergence is often slow, with the chains visiting only a very small number of partitions of the data. This yields a restricted version of the posterior, which can harm the estimation of both the PSM and the clusters.

This work proposes a more efficient method for estimating the PSM without MCMC. Based on an analytic formula, it seeks to approximate the entries of the PSM directly, particularly for Dirichlet Process mixtures, reducing the computational cost and improving the accuracy of the estimate. In this presentation I will show several approximation methods, with preliminary results from simulations and real data, illustrating their advantages over MCMC as well as their own challenges.
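For context, the standard Monte Carlo estimate of the PSM that the abstract contrasts with can be sketched as follows: entry (i, j) is the fraction of sampled partitions in which observations i and j share a cluster. The function name and the toy label vectors are hypothetical, and this is the usual sample-based estimate, not the analytic approximation proposed in the talk.

```python
def posterior_similarity_matrix(partitions):
    """Estimate the PSM from sampled partitions given as label vectors."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(partitions)
    return [[c / m for c in row] for row in counts]


# Toy example: four sampled partitions of four observations (made-up labels).
samples = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
psm = posterior_similarity_matrix(samples)
for row in psm:
    print(row)
```

If the chain visits only a few partitions, as the abstract warns, these fractions are computed from a poorly explored posterior and inherit that restriction.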

2024-06-07
10:00hrs.

Alba Martinez. Universidad Diego Portales
Structural Equation Models and Multiblock Data Analysis: Theory and Applications
Auditorio Ninoslav Bralic
Abstract:

Structural equation models aim to represent and describe relationships between constructs, and between constructs and observed variables, whereas multiblock data analysis focuses on explaining the relationships between several blocks of variables. Multiblock data analysis enables the creation of latent variable scores and the estimation of structural equation models. A general framework is provided by Regularized Generalized Canonical Correlation Analysis (RGCCA). In this talk, I present application examples to illustrate a context for understanding the fundamental concepts of both fields and their interconnections. I review the main definitions related to RGCCA, the optimization problem, the search algorithm, and special cases. Further research is outlined.

2024-05-29
15:00hrs.

Steven MacEachern. The Ohio State University
Familial inference: Tests for hypotheses on a family of centers
Auditorio Ninoslav Bralic
Abstract:

Many scientific disciplines face a replicability crisis. While these crises have many drivers, we focus on one. Statistical hypotheses are translations of scientific hypotheses into statements about one or more distributions. The most basic tests focus on the centers of the distributions. Such tests implicitly assume a specific center, e.g., the mean or the median. Yet, scientific hypotheses do not always specify a particular center. This ambiguity leaves a gap between scientific theory and statistical practice that can lead to rejection of a true null. The gap is compounded when we consider deficiencies in the formal statistical model. Rather than testing a single center, we propose testing a family of plausible centers, such as those induced by the Huber loss function (the Huber family). Each center in the family generates a point null hypothesis and the resulting family of hypotheses constitutes a familial null hypothesis. A Bayesian nonparametric procedure is devised to test the familial null. Implementation for the Huber family is facilitated by a novel pathwise optimization routine. Along the way, we visit the question of what it means to be the center of a distribution. Surprisingly, we have been unable to find a clear and comprehensive definition of this concept in the literature.

This is joint work with Ryan Thompson (University of New South Wales), Catherine Forbes (Monash University), and Mario Peruggia (The Ohio State University).
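To make the idea of a family of centers concrete, the sketch below computes the Huber center of a small sample for several values of the tuning constant k, using a simple iteratively reweighted mean. This is an illustrative assumption about how such centers can be computed; it is not the pathwise optimization routine mentioned in the abstract, and the data and the values of k are made up.

```python
import statistics


def huber_center(data, k, tol=1e-10, max_iter=1000):
    """Minimizer of the summed Huber loss, via iteratively reweighted means."""
    mu = statistics.median(data)
    for _ in range(max_iter):
        # Residuals within k get full weight; larger ones are downweighted.
        weights = [1.0 if abs(x - mu) <= k else k / abs(x - mu) for x in data]
        new_mu = sum(w * x for w, x in zip(weights, data)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


data = [1.0, 2.0, 3.0, 4.0, 100.0]
centers = {k: huber_center(data, k) for k in (0.5, 10.0, 50.0, 1000.0)}
for k, center in centers.items():
    print(f"k = {k:7.1f}  center = {center:.4f}")
```

For small k the center stays near the median, and for large k it coincides with the mean, so a claim about "the center" of this sample depends on which member of the family is meant.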

2024-04-19
11:00hrs.

Garritt Page. Brigham Young University
Informed Bayesian Finite Mixture Models via Asymmetric Dirichlet Priors
Auditorio Ninoslav Bralic
Abstract:
Finite mixture models are flexible methods that are commonly used for model-based clustering. A recent focus in the model-based clustering literature is to highlight the difference between the number of components in a mixture model and the number of clusters. The number of clusters is more relevant from a practical standpoint, but to date, the focus of prior distribution formulation has been on the number of components. In light of this, we develop a finite mixture methodology that permits eliciting prior information directly on the number of clusters in an intuitive way. This is done by employing an asymmetric Dirichlet distribution as a prior on the weights of a finite mixture. Further, a penalized complexity motivated prior is employed for the Dirichlet shape parameter. We illustrate the ease with which prior information can be elicited via our construction and the flexibility of the resulting induced prior on the number of clusters. We also demonstrate the utility of our approach using numerical experiments and two real-world data sets.

2024-04-17
15:00hrs.

Tarik Faouzi. Usach
Marquis de Condorcet: from Condorcet's theory of the vote to the method of partition
Sala de usos multiple 1er piso
Abstract:

We introduce two new approaches to clustering categorical and mixed data: Condorcet clustering with a fixed number of groups, denoted $$\alpha$$-Condorcet and Mixed-Condorcet, respectively. Like k-modes, this approach is essentially based on similarity and dissimilarity measures. The presentation is divided into three parts. First, we propose a new Condorcet criterion with a fixed number of groups (to assign cases to clusters). Second, we propose a heuristic algorithm to carry out the task. Third, we compare $$\alpha$$-Condorcet clustering with k-modes clustering, and Mixed-Condorcet with k-prototypes. The comparison is made with a quality index, an accuracy measure, and a within-cluster sum-of-squares index.

Our findings are illustrated using real datasets: the feline dataset, the US Census 1990 dataset and other data.

2024-04-10
15:00hrs.

Ernesto San Martín. PUC
Exploring the Non-Treatment of Missing Data: An Analysis with Chilean Administrative Data
Auditorio Ninoslav Bralic
Abstract:

Statistics, or Statistical Methods, as Fisher called them in 1922, has been a fundamental tool in the sciences of observation (to use the jargon of Laplace and Quetelet) since the nineteenth century. Its application rests on three stages outlined by Fisher at a time when "the fundamental principles of this discipline [were] still shrouded in obscurity": the problem of specification, the problem of estimation, and the problem of distribution. The key lies in the problem of specification: it is not only a matter of defining the probability distribution underlying the observations, but also of representing, through parameters of interest, the questions that researchers have about the observed phenomenon.

An important part of social research is based on observations collected through surveys. No one is forced to take part in a survey unless a law establishes otherwise, as in the case of the Census or of plebiscites. Even when participation is mandatory, participants are not obliged to answer every question. Missing data are therefore common. These data, together with the answers given by other participants, are part of what is observed. It is curious, then, that there are efforts to treat missing data so that they are no longer missing (through data imputation) in order to solve the problem of specification.

However, a fundamental premise of the sciences of observation is to accept what is observed. It is therefore urgent not to treat missing data, but rather to include them in the specification of the data-generating process. We want to illustrate this way of proceeding using a small panel: the questionnaires administered to parents and caregivers of schoolchildren in 2004 and 2006. These data contain several missing-data patterns: we will make them explicit in order to partially identify the conditional distribution of family income in 2006 given family income in 2004. This parameter of interest formalizes a substantive question of our research team at the Núcleo Milenio MOVI: how the income of sons and daughters has changed compared with the income of their fathers and mothers.

2024-01-18
11:30hrs.

Christian Caamaño. Universidad del Bio-Bio
A flexible Clayton-like spatial copula with application to bounded support data.
Sala 2
Abstract:
The Gaussian copula is a powerful tool that has been widely used to model spatially and/or temporally correlated data with arbitrary marginal distributions. However, this kind of model can be too restrictive, since it expresses a reflection symmetric dependence. In this work, we propose a new spatial copula model that makes it possible to obtain random fields with arbitrary marginal distributions and with a type of dependence that can be reflection symmetric or not.
Particularly, we propose a new random field with uniform marginal distribution, that can be viewed as a spatial generalization of the classical Clayton copula model. It is obtained through a power transformation of a specific instance of a beta random field which in turn is obtained using a transformation of two independent Gamma random fields.
For the proposed random field we study the second-order properties and we provide analytic expressions for the bivariate distribution and its correlation. Finally, in the reflection symmetric case, we study the associated geometrical properties.
As an application of the proposed model we focus on spatial modeling of data with bounded support. Specifically, we focus on spatial regression models with marginal distribution of the beta type. In a simulation study, we investigate the use of the weighted pairwise composite likelihood method for the estimation of this model. Finally, the effectiveness of our methodology is illustrated by analyzing point-referenced vegetation index data using the Gaussian copula as benchmark. Our developments have been implemented in an open-source package for the R statistical environment.

Keywords: Archimedean Copula, Beta random fields, Composite likelihood, Reflection Asymmetry.

2024-01-18
14:30hrs.

Manuel González. Universidad de la Frontera
Regularization Methods Applied to Chemometrics Problems
Sala 2

2023-12-15
15:00hrs.

Felipe Osorio. Utfsm
Robust estimation in generalized linear models based on maximum Lq-likelihood procedure
Sala Multiuso 1Er Piso, Edificio Felipe Villanueva
Abstract:
In this talk we propose a procedure for robust estimation in the context of generalized linear models based on the maximum Lq-likelihood method. Alongside this, we present an estimation algorithm that is a natural extension of the usual iteratively weighted least squares method in generalized linear models. Discussing the asymptotic distribution of the proposed estimator, together with a set of statistics for testing linear hypotheses, makes it possible to define standardized residuals using the mean-shift outlier model. In addition, robust versions of the deviance function and the Akaike information criterion are defined with the aim of providing tools for model selection. Finally, the performance of the proposed methodology is illustrated through a simulation study and the analysis of a real dataset.

2023-11-17
15:00hrs.

Hamdi Raissi. Pontificia Universidad Católica de Valparaíso
Analysis of stocks with time-varying illiquidity levels
Sala Multiuso 1Er Piso/2
Abstract:
The first and higher order serial correlations of illiquid stocks' price changes are studied, allowing for unconditional heteroscedasticity and a time-varying zero returns probability. The dependence structure of the categorical trade/no trade sequence is also studied. Depending on the setup, we investigate how the dependence measures can be adapted to deliver an accurate representation of the serial correlations of the price changes. We shed some light on the properties of the different tools by means of Monte Carlo experiments. The theoretical arguments are illustrated using shares from the Chilean stock market and the intraday returns of the Facebook stock.

2023-11-03
15:00hrs.

Ramsés Mena. Universidad Nacional Autonoma de México
Random probability measures via dependent stick-breaking priors
Sala Multiuso 1Er Piso/2
Abstract:
I will present a general class of stick-breaking processes with either exchangeable or Markovian length variables. This class generalizes well-known Bayesian nonparametric priors in an unexplored direction. An appealing feature of such a new family of nonparametric priors is that we are able to modulate the stochastic ordering of the weights and recover Dirichlet and Geometric priors as extreme cases. A general formula for the distribution of the latent allocation variables is derived and an MCMC algorithm is proposed for density estimation purposes.
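The stick-breaking construction and the two extreme cases named in the abstract can be sketched as follows. Weights are w_k = v_k * prod_{j<k} (1 - v_j): independent Beta(1, theta) length variables give the Dirichlet process weights, while a single shared length variable gives geometric weights. The truncation level, the value of theta and the function name are illustrative assumptions, and the sketch does not cover the exchangeable or Markovian length variables of the talk.

```python
import random


def stick_breaking_weights(lengths):
    """Turn stick-breaking length variables into mixture weights."""
    weights, remaining = [], 1.0
    for v in lengths:
        weights.append(v * remaining)  # break off a fraction v of what is left
        remaining *= 1.0 - v
    return weights


rng = random.Random(0)
theta = 2.0        # assumed concentration parameter
truncation = 50    # assumed truncation level for the illustration

# Dirichlet process case: independent Beta(1, theta) length variables.
dp_lengths = [rng.betavariate(1.0, theta) for _ in range(truncation)]
dp_weights = stick_breaking_weights(dp_lengths)

# Geometric case: one length variable shared by every break.
v = rng.betavariate(1.0, theta)
geo_weights = stick_breaking_weights([v] * truncation)

print(f"DP weights sum to {sum(dp_weights):.4f} after truncation")
print(f"geometric weights sum to {sum(geo_weights):.4f} after truncation")
```

The geometric weights are decreasing by construction, whereas the Dirichlet process weights are only stochastically ordered, which is the kind of ordering the abstract describes as adjustable between the two cases.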

2023-10-20
15:00hrs.

Alfredo Alegria. Universidad Técnica Federico Santa Maria
Simulation Algorithms and Covariance Modeling for Random Fields on Spheres
sala de usos múltiples, 2do. piso Edificio Felipe Villanueva
Abstract:
Random fields on spheres play a fundamental role in various natural sciences. This presentation addresses two key aspects in an integrated way: simulation algorithms and covariance modeling for random fields defined on the d-dimensional unit sphere. We introduce a simulation algorithm inspired by the spectral turning bands method used in Euclidean spaces. This algorithm efficiently generates Gaussian random fields on the sphere using Gegenbauer waves. We also explore the modeling of the covariance function, focusing on the challenges of modeling global data on the Earth's surface. The conventional Matérn family of isotropic covariance functions, although widely used, faces limitations when modeling smooth data on the sphere because of restrictions on the smoothness parameter. To address this, we propose a new family of isotropic covariance functions tailored to spherical random fields. This flexible family introduces a parameter that governs mean-square differentiability and allows for a variety of fractal dimensions. The presentation will highlight the practical implications of these advances through simulation experiments and applications to real data.

2023-09-29
11:00hrs.

Fernando Quintana. Pontificia Universidad Católica de Chile
Childhood obesity in Singapore: A Bayesian nonparametric approach
Sala 1
Abstract:
Overweight and obesity in adults are known to be associated with increased risk of metabolic and cardiovascular diseases. Obesity has now reached epidemic proportions, increasingly affecting children. Therefore, it is important to understand if this condition persists from early life to childhood and if different patterns can be detected to inform intervention policies. Our motivating application is a study of temporal patterns of obesity in children from South Eastern Asia. Our main focus is on clustering obesity patterns after adjusting for the effect of baseline information. Specifically, we consider a joint model for height and weight over time. Measurements are taken every six months from birth. To allow for data-driven clustering of trajectories, we assume a vector autoregressive sampling model with a dependent logit stick-breaking prior. Simulation studies show good performance of the proposed model to capture overall growth patterns, as compared to other alternatives. We also fit the model to the motivating dataset, and discuss the results, in particular highlighting cluster differences and interpretation.

2023-09-13
12:30hrs.

Pedro Ramos. Pontificia Universidad Católica de Chile
A generalized closed-form maximum likelihood estimator
Sala Multiuso 1Er Piso/2
Abstract:
The maximum likelihood estimator plays a fundamental role in statistics. However, for many models, the estimators do not have closed-form expressions. This limitation can be significant in situations where estimates and predictions need to be computed in real time, such as in applications based on embedded technology, in which numerical methods cannot be implemented. Here we provide a generalization of the maximum likelihood estimator that allows us to obtain the estimators in closed form under some conditions. Under mild conditions, the estimator is invariant under one-to-one transformations, strongly consistent, and asymptotically normal. The proposed generalized version of the maximum likelihood estimator is illustrated on the Gamma, Nakagami, and Beta distributions and compared with the standard maximum likelihood estimator.

2023-09-01
12:30hrs.

Jonathan Acosta Salazar. Pontificia Universidad Católica de Chile
Assessing the Estimation of Nearly Singular Covariance Matrices for Modeling Spatial Variables
Sala Multiuso 1Er Piso/2, Facultad de Matemáticas
Abstract:
Spatial analysis commonly relies on the estimation of a covariance matrix associated with a random field. This estimation strongly impacts the prediction where the process has not been observed, which in turn influences the construction of more sophisticated models. If some of the distances between all the possible pairs of observations in the plane are small, then we may have an ill-conditioned problem that results in a nearly singular covariance matrix. In this paper, we suggest a covariance matrix estimation method that works well even when there are very close pairs of locations on the plane. Our method is an extension to a spatial case of a method that is based on the estimation of eigenvalues of the unitary matrix decomposition of the covariance matrix. Several numerical examples are conducted to provide evidence of good performance in estimating the range parameter of the correlation structure of a spatial regression process. In addition, an application to macroalgae estimation in a restricted area of the Pacific Ocean is developed to determine a suitable estimation of the effective sample size associated with the transect sampling scheme.

2023-07-05
11:30hrs.

Riccardo Corradin. University of Nottingham
A journey through model-based clustering with intractable distributions
Sala 2
Abstract:
Model-based clustering represents one of the fundamental procedures in a statistician's toolbox. Within the model-based clustering framework, we consider the case where the kernel distribution of nonparametric mixture models is available only up to an intractable normalizing constant, in which most of the commonly used Markov chain Monte Carlo methods fail to provide posterior inference. To overcome this problem, we propose an approximate Bayesian computational strategy, whereby we approximate the posterior to avoid the intractability of the kernel. By exploiting the structure of the nonparametric prior, our proposal combines the use of predictive distributions as a proposal with transport maps to obtain an efficient and flexible sampling strategy. Further, we illustrate how the specification of our proposal can be relaxed by introducing an adaptive scheme on the degree of approximation of the posterior distribution. Empirical evidence from simulation studies shows that our proposal outperforms its main competitors in terms of computational times while preserving comparable accuracy of the estimates.