The purpose of these seminars is to learn about the research projects in which students of the Doctorado en Estadística program have participated, presented in talk format. The entire UC community is invited to attend.
The choice of a prior distribution is a key aspect of the Bayesian method. However, in many cases, such as the family of power links, this choice is not trivial. In this article, we introduce a penalized complexity prior (PC prior) for the skewness parameter of this family, which is useful for dealing with imbalanced data. We derive a general expression for this density and show its usefulness for particular cases such as the power logit and power probit links. A simulation study and a real data application are used to assess the efficiency of the introduced densities in comparison with the Gaussian and uniform priors. Results show improved point and credible interval estimation for the considered models when using the PC prior instead of other well-known standard priors.
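As a numerical illustration of the PC-prior construction of Simpson et al. (2017), and not the derivation for the power-link family itself, the sketch below computes the distance $d(\lambda)=\sqrt{2\,\mathrm{KLD}}$ and the resulting prior on a grid, using the skew-normal skewness parameter as a stand-in flexible model against a standard normal base; the rate $\theta$ and the grid are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def kld_skewnormal_vs_normal(lam):
    # KLD between the skew-normal SN(0, 1, lam) and N(0, 1):
    # integral of 2*phi(x)*Phi(lam*x) * log(2*Phi(lam*x)) over x.
    integrand = lambda x: 2 * norm.pdf(x) * norm.cdf(lam * x) * np.log(2 * norm.cdf(lam * x))
    return quad(integrand, -12, 12)[0]

theta = 1.0                                   # user-chosen rate of the PC prior (assumption)
lam_grid = np.linspace(0.01, 3.0, 100)
d = np.sqrt(2 * np.array([kld_skewnormal_vs_normal(lam) for lam in lam_grid]))
d_prime = np.gradient(d, lam_grid)
pc_prior = theta * np.exp(-theta * d) * np.abs(d_prime)   # unnormalized PC prior on the grid

print(f"d(lambda) ranges from {d[0]:.4f} to {d[-1]:.4f} on this grid")
```

By construction, this prior is exponential on the distance scale, so it shrinks toward the base (non-skewed) model unless the data support the additional flexibility.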
Scale-free networks play a fundamental role in the study of complex networks and various applied fields due to their ability to model a wide range of real-world systems. A key characteristic of these networks is their degree distribution, which often follows a power law, with probability mass function proportional to $x^{-\alpha}$ and $\alpha$ typically satisfying $2 < \alpha < 3$. In this talk, we introduce Bayesian inference methods that provide more accurate estimates and precise credible intervals than traditional methods, which often yield biased estimates. Through a simulation study, we demonstrate that our approach provides nearly unbiased estimates of the scaling parameter, enhancing the reliability of inferences. We also evaluate new goodness-of-fit tests to improve upon the Kolmogorov-Smirnov test, commonly used for this purpose. Our findings show that the Watson test offers superior power while maintaining a controlled type I error rate, enabling us to better determine whether data adhere to a power-law distribution. Finally, we propose a piecewise extension of this model to provide greater flexibility, evaluating its estimation and goodness-of-fit features as well. In the complex networks field, this extension allows us to model the full degree distribution instead of focusing only on the tail, as is commonly done. We demonstrate the utility of these novel methods through applications to two real-world datasets, showcasing their practical relevance and potential to advance the analysis of power-law behavior.
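As a rough illustration of the kind of estimation problem involved (not the methodology of the talk), the following sketch computes a grid-based posterior for the scaling exponent of a discrete power law under a flat prior; the synthetic data, grid, and threshold $x_{\min}=1$ are illustrative assumptions.

```python
import numpy as np
from scipy.special import zeta

rng = np.random.default_rng(0)
x_min, alpha_true, n = 1, 2.5, 2000

# Synthetic degrees via the standard inverse-transform approximation for a discrete power law.
u = rng.uniform(size=n)
x = np.floor((x_min - 0.5) * (1 - u) ** (-1 / (alpha_true - 1)) + 0.5).astype(int)

grid = np.linspace(1.5, 4.0, 501)                      # candidate alpha values
# Discrete power-law log-likelihood: -alpha * sum(log x) - n * log(Hurwitz zeta(alpha, x_min)).
log_lik = np.array([-a * np.log(x).sum() - n * np.log(zeta(a, x_min)) for a in grid])
post = np.exp(log_lik - log_lik.max())
post /= post.sum()

alpha_hat = grid[post.argmax()]                        # posterior mode
cdf = post.cumsum()
ci = (grid[np.searchsorted(cdf, 0.025)], grid[np.searchsorted(cdf, 0.975)])
print(f"posterior mode: {alpha_hat:.3f}, 95% credible interval: {ci}")
```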
In the search for multivariate distributions that provide greater flexibility in modeling data characterized by high levels of skewness, kurtosis, and the presence of outliers, new families of multivariate distributions have emerged, among which multivariate normal mixture distributions stand out. In this context, we introduce a multivariate normal mixture distribution based on the Birnbaum-Saunders distribution and examine some of its key properties. To estimate the parameters of this normal scale mixture distribution, we propose a maximum likelihood approach implemented via the EM algorithm. To support inferential analyses, we derive the Fisher information matrix. Additionally, we formulate a linear hypothesis on the parameter vector of interest and evaluate it using the likelihood ratio, Wald, score, and gradient statistics. Finally, we illustrate the application of the proposed methodology to real datasets, complementing the analysis with a simulation study to assess its performance.
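For reference, under standard regularity conditions and for a linear hypothesis $H_0: C\theta = d$, the four statistics take the usual generic forms below, where $\hat\theta$ is the unrestricted maximum likelihood estimator, $\tilde\theta$ the estimator restricted by $H_0$, $U(\cdot)$ the score vector, and $I(\cdot)$ the Fisher information matrix (a textbook summary, not specific to the proposed mixture):

$$\mathrm{LR} = 2\{\ell(\hat\theta) - \ell(\tilde\theta)\}, \qquad W = (C\hat\theta - d)^{\top}\{C\,I(\hat\theta)^{-1}C^{\top}\}^{-1}(C\hat\theta - d),$$
$$S = U(\tilde\theta)^{\top} I(\tilde\theta)^{-1} U(\tilde\theta), \qquad T = U(\tilde\theta)^{\top}(\hat\theta - \tilde\theta),$$

each asymptotically $\chi^2$ with $\mathrm{rank}(C)$ degrees of freedom under $H_0$.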
Regulators' procurement of renewable energy capacity is rapidly expanding. While stochastic optimization methods are typically employed to determine the optimal total capacity to procure, a smaller body of research explores an alternative framework borrowed from finance: portfolio optimization. In this study, we apply portfolio optimization to renewable energy, aiming to identify optimal portfolios that balance two objectives: maximizing energy production per dollar invested and minimizing its variance. We use principal component analysis (PCA) and other techniques to identify these portfolios under limited sample sizes. The proposed method is tested using historical Belgian production data spanning five years, with out-of-sample comparisons evaluating portfolio performance under real-world conditions.
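As a sketch of the portfolio step only (not the authors' pipeline), the code below builds a PCA-regularized covariance estimate from synthetic production-per-dollar series and solves a mean-variance problem with a budget constraint; the number of sites, retained components, and risk-aversion parameter are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
T, p, k = 60, 8, 3                                   # months, candidate sites, retained components
X = rng.normal(loc=1.0, scale=0.3, size=(T, p))      # production per dollar (synthetic)

mu = X.mean(axis=0)
S = np.cov(X, rowvar=False)

# PCA shrinkage: keep the k leading eigen-directions, average out the remaining eigenvalues.
vals, vecs = np.linalg.eigh(S)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]
vals_shrunk = vals.copy()
vals_shrunk[k:] = vals[k:].mean()
Sigma = vecs @ np.diag(vals_shrunk) @ vecs.T

# Mean-variance weights with a budget constraint (long/short allowed for simplicity):
# maximize mu'w - (gamma/2) w'Sigma w  subject to  1'w = 1.
gamma = 5.0
Sinv = np.linalg.inv(Sigma)
ones = np.ones(p)
w_unc = Sinv @ mu / gamma
w = w_unc + Sinv @ ones * (1 - ones @ w_unc) / (ones @ Sinv @ ones)

print("weights:", np.round(w, 3), "sum:", w.sum().round(3))
print("expected production per dollar:", (w @ mu).round(3), "std:", np.sqrt(w @ Sigma @ w).round(3))
```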
Technological advances have transformed data collection and analysis, enabling the acquisition of large volumes of real-time information, commonly referred to as big data. In sectors such as the fishing industry, the adoption of modern technologies has introduced new challenges due to the excess of zeros in catch records, reflecting the natural variability in species abundance. In agriculture, satellite imagery has revolutionized crop monitoring, improving decisions related to plant health, resource management, and yield forecasting. Similarly, environmental monitoring using these technologies facilitates tracking of climate change and pollution, which is crucial for public health and sustainability. To address these issues, statistical models must account for spatial and spatio-temporal dependencies in the data, as well as the possibility of zero-inflation. In the literature, both Gaussian and (non)-Gaussian models have been developed for continuous or discrete data structures, but the excess zeros present significant challenges when modeling random fields. Techniques such as logarithmic transformations or constant adjustments have been proposed, though these often distort the data structure or are not feasible. Additionally, large-scale datasets present computational difficulties, as the high cost of likelihood methods often proves prohibitive. To mitigate this, methods such as composite likelihood have been employed, balancing statistical accuracy with computational efficiency in estimation. Furthermore, the concept of effective sample size (ESS) is essential for quantifying the information content in spatial datasets, addressing redundancy issues arising from spatial correlation.
The goal of this research is to propose a new class of continuous spatial and spatio-temporal (non)-Gaussian random fields with positive support and excess zeros. The study develops a hybrid composite likelihood function that combines block likelihood and pairwise likelihood methods to efficiently handle large-scale data estimation, while also aiding in the generation of the proposed random fields from bivariate distributions. Additionally, the effective sample size (ESS) will be defined within the context of this new class of random fields, with particular attention to assessing its asymptotic normality. The proposed methodology will be validated through simulations and comparisons with existing techniques. This work contributes to the advancement of statistical models for high-dimensional spatial and spatio-temporal data with excess zeros, providing an important tool for spatial data analysis in complex real-world scenarios.
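To illustrate the pairwise ingredient of a composite likelihood, here is a minimal sketch for a plain Gaussian random field with exponential covariance on synthetic locations; it is not the proposed hybrid block/pairwise scheme and does not handle zero inflation. The distance cutoff and starting values are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

rng = np.random.default_rng(8)
n = 200
coords = rng.uniform(size=(n, 2))                    # synthetic spatial locations
D = cdist(coords, coords)

sigma2_true, phi_true = 1.0, 0.2
Sigma = sigma2_true * np.exp(-D / phi_true)          # exponential covariance
z = rng.multivariate_normal(np.zeros(n), Sigma)

def neg_pairwise_loglik(params, z, D, d_max=0.2):
    # Sum of bivariate Gaussian log-densities over all pairs closer than d_max.
    sigma2, phi = params
    ii, jj = np.where((D > 0) & (D <= d_max))
    keep = ii < jj                                   # count each pair once
    ii, jj = ii[keep], jj[keep]
    rho = np.exp(-D[ii, jj] / phi)
    det = sigma2 ** 2 * (1 - rho ** 2)
    quad_form = (z[ii] ** 2 + z[jj] ** 2 - 2 * rho * z[ii] * z[jj]) / (sigma2 * (1 - rho ** 2))
    return 0.5 * np.sum(np.log(det) + quad_form)

fit = minimize(neg_pairwise_loglik, x0=[0.5, 0.1], args=(z, D),
               bounds=[(1e-3, None), (1e-3, None)])
print("pairwise-likelihood estimates (sigma2, phi):", np.round(fit.x, 3))
```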
Bayesian nonparametric (BNP) theory is well developed for continuous random variables. For discrete data, the main BNP references rely on the Dirichlet process (DP) or on Poisson DP mixtures. The DP does not allow smooth deviations from its base measure, while a Poisson mixture can never fit under-dispersed data. However, assuming the existence of a continuous underlying variable can help transfer the continuous theory to a discrete setting. Under this approach, the current project develops a flexible regression model endowed with a model selection feature to identify the most relevant structure in the context of binary, ordinal, and count data. In particular, the project has three specific goals: 1) develop a latent dependent DP mixture model for light-tailed discrete data, 2) develop a latent dependent NGGP mixture model for heavy-tailed data, and 3) extend the two models to the multivariate case. The proposed models are expected to identify the true model and to fit better than common alternatives in the literature as the sample size increases, for datasets with zero-inflated and under-, equi-, or over-dispersed behavior.
Heavy-tailed distributions have long been a subject of study due to their numerous applications in fields such as economics, natural disasters, signals, and the social sciences. In particular, there is extensive research on power-law distributions ($p(x) \propto x^{-\alpha}$) and their generalization, regularly varying functions ($\mathcal{RV}_\alpha$), which behave approximately like a power law in the tail of the distribution.
Although multiple approaches have been developed to study tail behavior in both univariate and multivariate data, as well as in the presence of regressors, many of these studies tend to set an arbitrary threshold or percentile from which the fitting process begins. This can result in a loss of information contained in the body of the distribution. On the other hand, some research uses all observed data to estimate heavy-tailed densities, particularly under Bayesian approaches. However, these models tend to be complex to handle, especially when model selection is required.
This project has two main objectives. The first is to propose Bayesian model selection in flexible regression models for heavy-tailed distributions $\mathcal{RV}_\alpha$, using a simple yet flexible model such as the Gaussian mixture model under a dependent Dirichlet process (DDP-GMM), in the logarithmic space of the observations, where $\mathcal{RV}_\alpha$ distributions become light-tailed. This approach facilitates model selection through a spike-and-slab methodology, as it allows for the analytical computation of the marginal likelihood.
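As a one-line illustration of this light-tail effect (a standard calculation, not part of the proposal): if $X$ follows an exact Pareto law with density $p(x) = \alpha x_m^{\alpha} x^{-(\alpha+1)}$ for $x \ge x_m$, then $Y = \log X$ has density

$$f_Y(y) = \alpha\, e^{-\alpha (y - \log x_m)}, \qquad y \ge \log x_m,$$

a shifted exponential with rate $\alpha$, which is light-tailed and therefore well approximated by a Gaussian mixture in the log space.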
The second objective is to develop a model selection strategy using flexible regression for heavy-tailed $\mathcal{RV}_\alpha$ data. To achieve this, a Bayesian quantile regression will be proposed for both low and high percentiles, with errors distributed according to an asymmetric Laplace mixture under a normalized generalized gamma (NGG) process on the scale parameters. A spike-and-slab methodology will be employed for model selection, enabling the analysis of relevant regressors for the quantiles in the tails of the distribution.
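For context, the asymmetric Laplace (AL) working likelihood that underlies Bayesian quantile regression has the standard form (a textbook statement, independent of the NGG mixture proposed here)

$$f_{\mathrm{AL}}(y \mid \mu, \sigma, \tau) = \frac{\tau(1-\tau)}{\sigma} \exp\!\left\{-\rho_\tau\!\left(\frac{y-\mu}{\sigma}\right)\right\}, \qquad \rho_\tau(u) = u\,\{\tau - \mathbb{I}(u < 0)\},$$

so maximizing it in $\mu$ is equivalent to minimizing the check loss, which is why the AL targets the $\tau$-th conditional quantile.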
Circular measurements arise from sources such as clocks, calendars, or compass directions. Developing statistical models for circular responses is essential for addressing diverse applications, including wind directions in meteorology, patient arrival times at healthcare facilities, animal navigation in biology, and periodic data in political science. Because circular data may be mishandled by models that do not account for their cyclical nature, several approaches have been developed to describe their behavior accurately. Unfortunately, there is limited literature on regression models in this context, and even fewer resources address model selection. This presentation introduces a novel Bayesian nonparametric regression model for circular data that incorporates model selection. The proposal uses a mixture of Dirichlet processes with a Projected Normal distribution and discrete spike-and-slab priors for the model selection framework. The methodology is validated through a simulation study and a practical example.
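As a minimal sketch of the projected normal construction itself (not the proposed BNP mixture), the code below draws angles by projecting bivariate normal vectors onto the unit circle; the mean vector is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.5, 0.5])                            # mean of the underlying bivariate normal
z = rng.normal(size=(1000, 2)) + mu                  # latent bivariate normal draws
theta = np.arctan2(z[:, 1], z[:, 0]) % (2 * np.pi)   # project onto the unit circle

# Circular summaries: mean direction and resultant length.
C, S = np.cos(theta).mean(), np.sin(theta).mean()
mean_dir = np.arctan2(S, C) % (2 * np.pi)
R = np.hypot(C, S)
print(f"mean direction: {mean_dir:.3f} rad, resultant length: {R:.3f}")
```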
Discrete data are of particular interest in every area of knowledge. One could easily mention data such as the frequency of seismic events, the number of products sold by a store, the number of cigarettes smoked per person, and the number of cars at an intersection, related to geology, economics, medicine, and urban planning, respectively.
In the context of parametric models, the first model for discrete data, and for count data in particular, is the popular Poisson model. That popularity, unfortunately, comes with its restrictive equidispersion property. Alternatives to the Poisson model are the negative binomial distribution and zero-inflated versions such as the ZIP and ZINB models (see, for example, Agresti, 2002). However, the restrictive nature of parametric models is well known: with a finite-dimensional parameter space, one may run into misspecification problems. Moreover, a parametric model can be viewed as a particular case of a nonparametric one (Ghosal and van der Vaart, 2017).
Bayesian nonparametric (BNP) theory is well developed in the context of continuous random variables. For discrete data, that claim is at least debatable. Incorporating an underlying continuous variable, however, can help transfer the continuous theory to a discrete setting. In this work, we develop a flexible regression model and a model selection methodology for discrete data using the rounding of continuous kernels (Canale and Dunson, 2011). In particular, we develop a rounded LDDP model with a spike-and-slab prior, equipped with an MCMC scheme for straightforward posterior computation. The model is assessed in a preliminary simulation study and applied to a dataset on the performance of a football team over the years.
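A minimal sketch of the rounding idea of Canale and Dunson (2011), not the full rounded LDDP with spike-and-slab prior: a continuous latent mixture is mapped to counts through fixed thresholds, $y = j$ whenever the latent draw $y^*$ lies in $[a_j, a_{j+1})$, with $a_0 = -\infty$ and $a_j = j$ for $j \ge 1$; the latent mixture below is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(3)

# Latent continuous density: a two-component normal mixture (illustrative).
comp = rng.binomial(1, 0.4, size=5000)
y_star = np.where(comp == 1, rng.normal(1.0, 0.5, 5000), rng.normal(4.5, 1.5, 5000))

# Rounding: with a_j = j for j >= 1, everything below 1 maps to the count 0.
y = np.clip(np.floor(y_star), 0, None).astype(int)

vals, freq = np.unique(y, return_counts=True)
print(dict(zip(vals.tolist(), (freq / freq.sum()).round(3).tolist())))
```

The induced count distribution inherits the flexibility of the latent continuous mixture, which is what allows zero-inflated and under-, equi-, or over-dispersed shapes.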
Beta diversity, in the field of ecology, refers to the study of the local composition and abundance of different species, allowing the analysis of communities within ecosystems. Using nonparametric techniques for the exploration and comparison of groups (PERMANOVA) based on dissimilarity coefficients, we explored the beta diversity of the Caucahué channel, Chiloé, in the context of a study of the zooplankton community in that channel, which encompasses salmon and mussel farming centers.
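A minimal sketch of a one-factor PERMANOVA on a Bray-Curtis dissimilarity matrix, following the pseudo-F of Anderson (2001) with a label-permutation test; the abundance table and the two groups below are synthetic, not the Caucahué data.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(4)
n_per, n_sp = 10, 12
groups = np.repeat([0, 1], n_per)                    # two sampling zones (illustrative)
abund = rng.poisson(lam=np.where(groups[:, None] == 0, 5, 8), size=(2 * n_per, n_sp))

D = squareform(pdist(abund, metric="braycurtis"))

def pseudo_F(D, groups):
    # Anderson (2001): total and within-group sums of squared dissimilarities.
    N = D.shape[0]
    ss_total = (D[np.triu_indices(N, 1)] ** 2).sum() / N
    ss_within = 0.0
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        sub = D[np.ix_(idx, idx)]
        ss_within += (sub[np.triu_indices(len(idx), 1)] ** 2).sum() / len(idx)
    a = len(np.unique(groups))
    return ((ss_total - ss_within) / (a - 1)) / (ss_within / (N - a))

F_obs = pseudo_F(D, groups)
perm = np.array([pseudo_F(D, rng.permutation(groups)) for _ in range(999)])
p_val = (np.sum(perm >= F_obs) + 1) / (len(perm) + 1)
print(f"pseudo-F = {F_obs:.3f}, permutation p-value = {p_val:.3f}")
```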
The Rice (Rician) distribution is widely known and used in fields such as magnetic resonance imaging and wireless communications, being particularly useful for describing signal-processing data. In this work, we propose objective Bayesian inference for this distribution, focusing on the Jeffreys prior, the reference prior, and a scoring-rule-based prior. We examine the advantages and disadvantages of these priors and compare them with the classical maximum likelihood estimator through simulations. Our results show that the Bayesian estimators provide estimates with less bias than the classical estimators.
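As a small sketch of the likelihood side of the comparison (not the objective priors discussed in the talk), the following fits the Rice distribution by maximum likelihood to synthetic signal-magnitude data; the true parameter values are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
nu, sigma = 2.0, 1.0                                  # true signal amplitude and noise scale
data = stats.rice.rvs(b=nu / sigma, scale=sigma, size=500, random_state=rng)

# scipy parameterizes Rice(b, scale) with b = nu / sigma; the location is fixed at 0.
b_hat, loc_hat, scale_hat = stats.rice.fit(data, floc=0)
print(f"ML estimates: nu = {b_hat * scale_hat:.3f}, sigma = {scale_hat:.3f}")
```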
Fine particulate matter (PM 2.5) is a type of particle that is harmful to health, and its monitoring aims to establish the air quality of a region of a country. In this work, functional data analysis tools are used to analyze PM 2.5 concentrations during the winter periods from 2018 to 2022 at the Parque O'Higgins monitoring station. The approach consists of a functional analysis of variance to study whether there are differences between the mean curves of each winter, looking for behavioral patterns across years, in contrast with the current decontamination plan for Santiago de Chile.
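A minimal sketch of a permutation-based functional ANOVA on synthetic winter curves, not the Parque O'Higgins data and not necessarily the specific functional ANOVA used in the work; the test statistic here is the integrated between-group variability of the mean curves, and the curve shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
days, n_per_year, years = 90, 20, 5
t = np.linspace(0, 1, days)

# Synthetic daily PM 2.5-like curves: a common seasonal shape plus a year effect and noise.
year_effect = np.array([0.0, 0.5, -0.3, 0.8, 0.2])
curves = np.stack([
    30 + 10 * np.sin(2 * np.pi * t) + year_effect[y] * 5 + rng.normal(0, 3, days)
    for y in range(years) for _ in range(n_per_year)
])
labels = np.repeat(np.arange(years), n_per_year)

def fanova_stat(curves, labels):
    # Integrated between-group sum of squares of the mean curves.
    grand = curves.mean(axis=0)
    means = np.stack([curves[labels == y].mean(axis=0) for y in np.unique(labels)])
    counts = np.array([(labels == y).sum() for y in np.unique(labels)])
    return (counts[:, None] * (means - grand) ** 2).sum(axis=0).mean()

obs = fanova_stat(curves, labels)
perm = np.array([fanova_stat(curves, rng.permutation(labels)) for _ in range(499)])
p_val = (np.sum(perm >= obs) + 1) / (len(perm) + 1)
print(f"statistic = {obs:.2f}, permutation p-value = {p_val:.3f}")
```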
Using data obtained from the monitoring of air-conditioning systems, a density-based algorithm is employed to detect potential system faults, with the aim of issuing early warnings for maintenance planning.
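A minimal sketch of the density-based screening idea on synthetic sensor readings (not the actual monitoring data); the local outlier factor is used here as one representative density-based detector, and the variables, fault pattern, and contamination level are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(7)
normal = rng.normal(loc=[22.0, 5.0], scale=[1.0, 0.5], size=(500, 2))   # supply temp, power draw
faulty = rng.normal(loc=[28.0, 7.5], scale=[1.0, 0.5], size=(10, 2))    # injected fault pattern
X = np.vstack([normal, faulty])

lof = LocalOutlierFactor(n_neighbors=30, contamination=0.03)
flags = lof.fit_predict(X)                           # -1 marks low-density (anomalous) points
print(f"flagged {np.sum(flags == -1)} of {len(X)} readings as potential faults")
```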