The purpose of these seminars is to learn about the research projects in which students of the PhD program in Statistics have participated, presented as talks. The entire UC community is invited to attend.
Heavy-tailed distributions have long been a subject of study due to their numerous applications in fields such as economics, natural disasters, signal processing, and the social sciences. In particular, there is extensive research on power-law distributions ($p(x) \propto x^{-\alpha}$) and their generalization, regularly varying functions ($\mathcal{RV}_\alpha$), which behave approximately like a power law in the tail of the distribution.
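For reference, a standard way to formalize regular variation (the notation $\mathcal{RV}_\alpha$ follows the abstract; some texts write $\mathcal{RV}_{-\alpha}$ for the tail): a survival function $\bar{F}$ is regularly varying with tail index $\alpha > 0$ when

$$\lim_{t \to \infty} \frac{\bar{F}(tx)}{\bar{F}(t)} = x^{-\alpha}, \qquad x > 0,$$

that is, the tail behaves like $x^{-\alpha}$ up to a slowly varying factor.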
Although multiple approaches have been developed to study tail behavior in both univariate and multivariate data, as well as in the presence of regressors, many of these studies tend to set an arbitrary threshold or percentile from which the fitting process begins. This can result in a loss of information contained in the body of the distribution. On the other hand, some research uses all observed data to estimate heavy-tailed densities, particularly under Bayesian approaches. However, these models tend to be complex to handle, especially when model selection is required.
This project has two main objectives. The first is to propose Bayesian model selection in flexible regression models for heavy-tailed distributions $\mathcal{RV}_\alpha$, using a simple yet flexible model such as the Gaussian mixture model under a dependent Dirichlet process (DDP-GMM), in the logarithmic space of the observations, where $\mathcal{RV}_\alpha$ distributions become light-tailed. This approach facilitates model selection through a Spike and Slab methodology, as it allows for the analytical computation of the marginal likelihood.
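As a rough illustration of the log-space idea (not the DDP-GMM itself, which adds covariate dependence and a nonparametric prior): Pareto-tailed data become light-tailed after a log transform, where an ordinary Gaussian mixture fits comfortably. A minimal sketch, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Pareto(alpha) sample: survival function x^{-alpha}, a canonical RV_alpha law.
alpha = 1.5
x = (1.0 - rng.uniform(size=2000)) ** (-1.0 / alpha)

# In log space the Pareto becomes Exponential(alpha): light-tailed,
# so a finite Gaussian mixture is a sensible flexible approximation.
y = np.log(x)

gmm = GaussianMixture(n_components=3, random_state=0).fit(y.reshape(-1, 1))
print("mixture weights:", gmm.weights_.round(3))

# Density estimate back on the original scale via the change of variables
# p_X(x) = p_Y(log x) / x, which inherits the heavy tail automatically.
grid = np.linspace(1.0, 50.0, 5)
log_dens = gmm.score_samples(np.log(grid).reshape(-1, 1)) - np.log(grid)
print("log-density on original scale:", log_dens.round(2))
```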
The second objective is to develop a model selection strategy using flexible regression for heavy-tailed $\mathcal{RV}_\alpha$ data. To achieve this, a Bayesian quantile regression will be proposed for both low and high percentiles, with errors distributed according to an asymmetric Laplace mixture under a normalized generalized gamma (NGG) process on the scale parameters. A Spike and Slab methodology will be employed for model selection, enabling the analysis of relevant regressors for the quantiles in the tails of the distribution.
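A minimal frequentist analogue of the quantile-regression component (deliberately omitting the Bayesian NGG mixture machinery): the $\tau$-th conditional quantile can be estimated by minimizing the check loss, which coincides with maximum likelihood under an asymmetric Laplace error. All names and data below are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n = 500
x = rng.uniform(0, 2, size=n)
y = 1.0 + 2.0 * x + rng.standard_t(df=2, size=n)  # heavy-tailed errors

def check_loss(beta, tau):
    """Pinball (check) loss; its minimizer is the tau-th conditional
    quantile, matching the mode of an asymmetric Laplace likelihood."""
    r = y - (beta[0] + beta[1] * x)
    return np.sum(r * (tau - (r < 0)))

# Fit low, central, and high quantiles, as in the tail-focused analysis.
for tau in (0.05, 0.5, 0.95):
    fit = minimize(check_loss, x0=np.zeros(2), args=(tau,), method="Nelder-Mead")
    print(f"tau={tau}: intercept={fit.x[0]:.2f}, slope={fit.x[1]:.2f}")
```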
Circular measurements result from sources like clocks, calendars, or compass directions. Developing statistical models for circular responses is essential for addressing diverse applications, including wind directions in meteorology, patient arrival times at healthcare facilities, animal navigation in biology, and periodic data in political science. Since circular data may be mishandled by models that do not account for their cyclical nature, several methodologies have been proposed to describe their behavior accurately. Unfortunately, there is limited literature on regression models within this context and even fewer resources addressing model selection. This presentation introduces a novel Bayesian nonparametric regression model for circular data that contemplates model selection. The proposal uses a mixture of Dirichlet processes with a Projected Normal distribution and discrete spike-and-slab priors for the model selection framework. The methodology is validated through a simulation study and a practical example.
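To fix ideas about the Projected Normal: a circular observation is obtained by radially projecting a bivariate normal vector onto the unit circle. A minimal simulation (illustrative only, not the talk's regression model):

```python
import numpy as np

rng = np.random.default_rng(3)

# Projected Normal PN(mu, I): draw Z ~ N2(mu, I) and keep only its direction.
mu = np.array([1.0, 1.5])
z = rng.normal(loc=mu, scale=1.0, size=(1000, 2))
theta = np.arctan2(z[:, 1], z[:, 0]) % (2 * np.pi)  # angles in [0, 2*pi)

# Circular mean direction via the resultant vector, the standard summary
# for data on the circle (an ordinary arithmetic mean would be wrong here).
mean_dir = np.arctan2(np.sin(theta).mean(), np.cos(theta).mean()) % (2 * np.pi)
print("circular mean direction (radians):", round(mean_dir, 3))
print("direction of mu (radians):", round(np.arctan2(mu[1], mu[0]), 3))
```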
In every area of knowledge there is a particular interest in discrete data. One could easily mention examples such as the frequency of seismic events, the number of products sold by a store, the number of cigarettes smoked per person, and the number of cars at an intersection, related to geology, economics, medicine, and urban planning, respectively.
In the context of parametric models, the first model for discrete data, count data in particular, is the popular Poisson model. Unfortunately, such popularity comes with its restrictive equidispersion property. Alternatives to the Poisson model are the Negative Binomial distribution or zero-inflated versions such as the ZIP and ZINB models (see, e.g., Agresti, 2002). However, the restrictive nature of parametric models is well known: with a finite-dimensional parameter space, one may run into a misspecification problem. Moreover, a parametric model can be seen as a particular case of a nonparametric one (Ghosal and van der Vaart, 2017).
Bayesian Nonparametric (BNP) theory is well developed in the context of continuous random variables. For discrete data, that claim is at least debatable. Incorporating an underlying continuous variable, however, can help transfer the continuous theory to a discrete setting. In this work, a flexible regression model and a model selection methodology for discrete data are developed using the rounding of continuous kernels (Canale and Dunson, 2011). In particular, a rounded LDDP model with a spike-and-slab prior is developed, equipped with an MCMC scheme for straightforward posterior computation. The model undergoes a preliminary simulation study and is applied to a dataset on the performance of a football team over the years.
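The rounding idea in Canale and Dunson (2011) can be conveyed in a few lines: counts are generated by thresholding a latent continuous variable, so any flexible continuous density induces a flexible count distribution. A minimal sketch with an arbitrary two-component latent mixture (the component choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# Latent continuous draws: a two-component normal mixture pushed to (0, inf)
# by exponentiation; the mixture stands in for a flexible BNP density.
comp = rng.uniform(size=5000) < 0.6
latent = np.where(comp, rng.normal(0.5, 0.4, 5000), rng.normal(2.0, 0.3, 5000))
latent = np.exp(latent)

# Rounding: the count j is observed when the latent value falls in [j, j+1),
# i.e. y = floor(latent). This maps continuous flexibility onto the integers.
y = np.floor(latent).astype(int)

vals, counts = np.unique(y, return_counts=True)
print(dict(zip(vals.tolist(), (counts / counts.sum()).round(3).tolist())))
```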
Beta diversity, in ecology, refers to the study of the composition and abundance of different species at the local level, allowing the analysis of communities in ecosystems. Using nonparametric techniques for the exploration and comparison of groups (PERMANOVA) based on dissimilarity coefficients, we explored the beta diversity of the Caucahué channel, Chiloé, in the context of a study of the zooplankton community in that channel, which encompasses salmon and mussel farming centers.
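The flavor of a PERMANOVA comparison can be shown with a small permutation test on Bray-Curtis dissimilarities; the sketch below uses synthetic abundance data and the pseudo-F statistic of Anderson (2001), not the study's actual data:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(11)

# Synthetic abundance matrix: 10 sites x 8 species, two groups of 5 sites.
abund = rng.poisson(lam=np.r_[np.full(5, 4.0), np.full(5, 9.0)][:, None]
                    * rng.uniform(0.5, 1.5, size=(10, 8)))
groups = np.array([0] * 5 + [1] * 5)

d = squareform(pdist(abund, metric="braycurtis"))

def pseudo_f(d, groups):
    """PERMANOVA pseudo-F (Anderson, 2001) from a distance matrix."""
    n = len(groups)
    ss_total = (d[np.triu_indices(n, 1)] ** 2).sum() / n
    ss_within = 0.0
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        sub = d[np.ix_(idx, idx)]
        ss_within += (sub[np.triu_indices(len(idx), 1)] ** 2).sum() / len(idx)
    a = len(np.unique(groups))
    return ((ss_total - ss_within) / (a - 1)) / (ss_within / (n - a))

# Permutation test: shuffle group labels to build the null distribution.
f_obs = pseudo_f(d, groups)
perm = [pseudo_f(d, rng.permutation(groups)) for _ in range(999)]
p_val = (1 + sum(f >= f_obs for f in perm)) / 1000
print(f"pseudo-F = {f_obs:.2f}, permutation p-value = {p_val:.3f}")
```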
The Rice distribution is widely known and used in fields such as magnetic resonance imaging and wireless communications, being particularly useful for describing signal-processing data. In this work, we propose objective Bayesian inference, focusing on the Jeffreys prior, the reference prior, and a scoring-rule-based prior. We demonstrate the advantages and disadvantages of these priors and compare them with the classical maximum likelihood estimator through simulations. Our results show that the Bayesian estimators provide estimates with less bias than the classical estimator.
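For context, the classical maximum likelihood benchmark mentioned above is available directly in SciPy; a minimal sketch (the Bayesian estimators under the Jeffreys, reference, or scoring-rule priors would require custom posterior computation not shown here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)

# Rice(b, scale=sigma): the distribution of sqrt(X^2 + Y^2) with
# X ~ N(nu, sigma^2), Y ~ N(0, sigma^2) and b = nu / sigma.
b_true, sigma_true = 2.0, 1.0
sample = stats.rice.rvs(b_true, scale=sigma_true, size=300, random_state=rng)

# Maximum likelihood estimates (location pinned at 0, as is standard
# for signal-magnitude data).
b_hat, loc_hat, scale_hat = stats.rice.fit(sample, floc=0)
print(f"MLE: b = {b_hat:.3f}, sigma = {scale_hat:.3f}")
```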
Fine particulate matter 2.5 (PM2.5) is a type of particle harmful to health, and its monitoring aims to establish the air quality of a region of a country. In this work, functional data analysis tools are used to analyze the PM2.5 concentration during the winter periods from 2018 to 2022 at the Parque O'Higgins monitoring station. The approach consists of a functional analysis of variance to study whether there are differences in the mean curves of each winter, looking for behavioral patterns across the years, in contrast with the current decontamination plan for Santiago de Chile.
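One simple way to operationalize a functional ANOVA is a permutation test on the deviation of group mean curves from the grand mean curve. The sketch below uses synthetic daily curves as a stand-in for the PM2.5 records (which are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(8)

# Synthetic stand-in: 5 "winters" x 30 days x 24 hourly values per curve.
t = np.linspace(0, 1, 24)
curves = np.stack([
    np.sin(2 * np.pi * t) * (1 + 0.1 * g) + rng.normal(0, 0.3, size=(30, 24))
    for g in range(5)
])  # shape (5, 30, 24)

def stat(c):
    """Integrated squared deviation of group mean curves from the grand
    mean, a common test statistic for functional one-way ANOVA."""
    grand = c.mean(axis=(0, 1))
    return sum(((ci.mean(axis=0) - grand) ** 2).sum() for ci in c)

# Permutation test: reassign curves to "winters" at random under the null.
obs = stat(curves)
flat = curves.reshape(-1, 24)
perms = []
for _ in range(499):
    shuffled = flat[rng.permutation(len(flat))].reshape(curves.shape)
    perms.append(stat(shuffled))
p = (1 + sum(s >= obs for s in perms)) / 500
print(f"statistic = {obs:.2f}, permutation p-value = {p:.3f}")
```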
Using data obtained from the monitoring of air-conditioning systems, density-based algorithms are employed to identify potential system faults, with the aim of issuing early alerts in maintenance plans.
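A hedged sketch of the density-based idea, using Local Outlier Factor on made-up sensor features (the actual monitoring variables, units, and thresholds are assumptions):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)

# Hypothetical HVAC features: [supply air temp (C), compressor power (kW)].
normal_ops = rng.normal(loc=[12.0, 5.0], scale=[0.8, 0.4], size=(500, 2))
faults = rng.normal(loc=[17.0, 7.5], scale=[0.5, 0.3], size=(8, 2))
readings = np.vstack([normal_ops, faults])

# LOF flags points lying in regions of low data density relative to
# their neighbors; -1 marks a potential fault worth an early alert.
lof = LocalOutlierFactor(n_neighbors=25, contamination=0.02)
labels = lof.fit_predict(readings)
print("flagged readings:", np.where(labels == -1)[0])
```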
Kidney cancer, a potentially life-threatening malignancy affecting the kidneys, demands early detection and proactive intervention to enhance prognosis and survival. Advancements in medical and health sciences and the emergence of novel treatments are expected to lead to a favorable response in a subset of patients. This, in turn, is anticipated to enhance overall survival and disease-free survival rates. Cure fraction models have become essential for estimating the proportion of individuals considered cured, free from adverse events. This article presents a novel piecewise power-law cure fraction model with a piecewise decreasing hazard function, deviating from the traditional piecewise constant hazard assumption. Through the analysis of real medical data, we evaluate various factors to explain the survival of individuals. Consistently positive outcomes are observed, affirming the significant potential of our approach. Furthermore, we employ a local influence analysis to detect potentially influential individuals and perform a post-deletion analysis to analyze their impact on our inferences.
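For context, the standard mixture formulation of a cure fraction model (given here for orientation; the talk's piecewise power-law model concerns the specification of the latency part) writes the population survival as

$$S_{\mathrm{pop}}(t) = \pi + (1 - \pi)\, S_0(t),$$

where $\pi$ is the cured proportion and $S_0$ is the survival function of the susceptible individuals, so that $S_{\mathrm{pop}}(t) \to \pi > 0$ as $t \to \infty$.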
Generating censored random samples while controlling the desired censoring percentage is a critical task when assessing the performance of a model in the corresponding simulation studies. In this presentation, we will explore an approach to address this challenge, particularly when dealing with random censoring. This method is implemented in the recently available 'rcens' package on CRAN, which also offers functionalities to control the censoring percentage in samples generated under different types of censoring (Types I, II, and III), providing researchers and practitioners with a straightforward tool to simulate datasets according to the desired distributional needs. Lastly, we will discuss a potential scheme for generating interval censoring, also implemented in 'rcens'.
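To make the censoring-control idea concrete: under random censoring with exponential lifetimes and an independent exponential censoring time, the expected censoring proportion has a closed form, so the censoring rate can be solved for exactly. A generic sketch of the principle (this is not the 'rcens' API, which is an R package):

```python
import numpy as np

rng = np.random.default_rng(0)

def rcens_exp(n, lam, p_cens, rng):
    """Random censoring with a controlled expected censoring proportion.
    For T ~ Exp(lam) and C ~ Exp(theta) independent,
    P(C < T) = theta / (lam + theta), so theta = p * lam / (1 - p)."""
    theta = p_cens * lam / (1.0 - p_cens)
    t = rng.exponential(1.0 / lam, size=n)    # true lifetimes
    c = rng.exponential(1.0 / theta, size=n)  # censoring times
    obs = np.minimum(t, c)                    # observed time
    delta = (t <= c).astype(int)              # 1 = event, 0 = censored
    return obs, delta

obs, delta = rcens_exp(10_000, lam=0.5, p_cens=0.3, rng=rng)
print("empirical censoring proportion:", round(1 - delta.mean(), 3))
```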
In survival studies, one often seeks to understand the probability (risk) or the occurrence rate (hazard) of a specific event over a given period. However, in realistic scenarios, multiple competing events can occur. If competing risks are not taken into account, the risk estimates for a particular event might be biased. The effect of a covariate on the hazard function for a specific cause can be estimated using Cox's proportional hazards model, censoring the competing events. Yet the interpretation in terms of the cumulative incidence function (CIF) is limited, since under competing risks there is no one-to-one relationship between the hazard function and the CIF. The Fine-Gray model enables understanding the effect of a covariate on the CIF. However, an interpretation error often occurs when the subdistribution hazard ratio (SHR) is equated with the commonly used hazard ratio (HR). This study aims to clarify the concepts used under competing risks, which can guide us toward an appropriate interpretation, and to apply them in research on cancer relapse, a scenario where this methodology is particularly interesting.
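The lack of a one-to-one relationship can be seen directly from the definition of the cause-$k$ cumulative incidence,

$$\mathrm{CIF}_k(t) = \int_0^t S(u^-)\, h_k(u)\, du, \qquad S(u) = \exp\Big(-\int_0^u \sum_j h_j(v)\, dv\Big),$$

which depends on all the cause-specific hazards $h_j$ through the overall survival $S$, not on $h_k$ alone; hence a covariate can lower $h_k$ and still raise $\mathrm{CIF}_k$.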
Regression analysis aims to explore the relationship between a response variable and predictors. A key aspect of regression analysis is model selection, which allows the researcher to decide which predictors are relevant, considering a parsimony criterion. A standard frequentist strategy is to explore the model space using, for instance, a Stepwise strategy based on some goodness-of-fit criterion. On the other hand, a popular Bayesian strategy is the spike-and-slab methodology, which assigns a specific prior to predictor coefficients by defining a latent binary vector that indicates which predictors are relevant. Such a strategy includes a prior over the binary vector to penalize complex models. In this work, we develop a general Bayesian strategy for model selection in a broad range of regression models, using the spike-and-slab strategy and a data augmentation technique. We show that if the likelihood function satisfies certain conditions, the consistency of the Bayes Factor is guaranteed alongside the availability of closed-form expressions for the posterior distribution. We present regression models based on different choices for the response distribution, providing the necessary implementation details for each model, together with a Monte Carlo simulation study. Applications with health data are also discussed.
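For concreteness, the spike-and-slab prior referenced above can be written, in one common form, as

$$\beta_j \mid \gamma_j \sim \gamma_j\, \mathcal{N}(0, \tau^2) + (1 - \gamma_j)\, \delta_0, \qquad \gamma_j \sim \mathrm{Bernoulli}(\theta),$$

where $\delta_0$ is a point mass at zero (the spike), $\mathcal{N}(0, \tau^2)$ is the slab, and the latent binary vector $\gamma = (\gamma_1, \dots, \gamma_p)$ indicates which predictors enter the model; a prior on $\theta$ (or directly on $\gamma$) penalizes complex models.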
Over the years, circular data have become relevant in various fields. They arise in several ways, for example, through measuring instruments such as clocks. However, given their nature, standard univariate or multivariate methods cannot be applied. This poses multiple challenges, since it is necessary to define a statistical model on a non-Euclidean space, such as the circle or the sphere.
Since the literature on this topic is limited, the objective here is to develop variable selection methodologies through a parametric Bayesian approach in regression models involving circular data, assuming a projected normal distribution. To this end, mixture priors known as spike-and-slab are placed on the regression coefficients. The computational aspects of the study include the implementation of MCMC methods to generate samples from the posterior distributions and to carry out inferences on the model. These procedures are illustrated using simulated and real datasets.
Piecewise models are valuable tools for enhancing pattern adjustment in statistical analysis, offering superior fit compared to standard models. The most commonly used is the piecewise exponential distribution, which assumes a constant hazard rate between changepoints and may therefore be unrealistic in many cases. In this talk, we present a unified approach that introduces a general structure for constructing piecewise models, allowing for different behaviors between the changepoints. The proposed structure yields models that are easier to understand and interpret, while also providing greater accuracy and flexibility than standard parametric models. We discuss the mathematical properties of the proposed approach in detail, along with its application to various baseline models. We discuss inference on the model parameters, employing a profiled-likelihood approach to estimate both the model parameters and the changepoints. Additionally, we provide application examples using different datasets to illustrate the effectiveness of the proposed approach.
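A minimal sketch of the profiled-likelihood idea for a single changepoint in a piecewise exponential model (the talk's general structure allows richer behavior between changepoints; this is only the constant-hazard baseline, without censoring):

```python
import numpy as np

rng = np.random.default_rng(9)

# Simulate from a piecewise exponential: rate 2.0 before tau0, 0.5 after.
tau0, lam1, lam2 = 1.0, 2.0, 0.5
u = rng.uniform(size=2000)
cut = 1 - np.exp(-lam1 * tau0)
# Inverse-CDF sampling for the two-piece hazard.
t = np.where(u < cut,
             -np.log(1 - u) / lam1,
             tau0 - np.log((1 - u) / np.exp(-lam1 * tau0)) / lam2)

def profile_loglik(tau, t):
    """Log-likelihood profiled over the rates: for fixed tau the rate MLEs
    are events / exposure within each segment."""
    d1 = np.sum(t < tau)
    d2 = len(t) - d1
    e1 = np.sum(np.minimum(t, tau))          # exposure before tau
    e2 = np.sum(np.maximum(t - tau, 0.0))    # exposure after tau
    if d1 == 0 or d2 == 0:
        return -np.inf
    return d1 * np.log(d1 / e1) - d1 + d2 * np.log(d2 / e2) - d2

# Grid search over candidate changepoints maximizes the profiled likelihood.
grid = np.linspace(0.2, 3.0, 281)
tau_hat = grid[np.argmax([profile_loglik(tau, t) for tau in grid])]
print(f"profiled changepoint estimate: {tau_hat:.2f} (true {tau0})")
```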
In psychological and educational measurement, the gold standard for the analysis of tests is Item Response Theory (IRT). In IRT, a statistical model is used and the trait score is represented by a latent variable, a non-observable quantity. There are several problems with this approach: the latent variable is not unique, and non-psychometricians find it hard to interpret. The common way of dealing with this problem in psychometrics is denial, arbitrarily choosing a particular latent variable to represent the measured trait. In an attempt to rectify the situation, Ramsay (1996) presented a new approach to IRT models based on differential geometry. In Ramsay's proposal, the trait score arises naturally from the model as a distance measured along a path or arc. This arc length does not have the drawbacks of the latent variable. However, Ramsay's proposal has never been fully developed and therefore did not take hold in the psychometric literature. In this project, I will improve Ramsay's approach to IRT models. A new trait score (called the information arc length) is proposed and its statistical properties are investigated. In addition, the IRT toolbox of techniques (e.g., equating) is extended under this framework. All procedures developed in this project will be made available in open-access statistical software. The results of this project will lead to an easier-to-interpret and invariant trait score.
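One natural way to formalize an information arc length, in the spirit of the Fisher-Rao metric (an illustrative reading of the construction, not necessarily the project's exact definition), is

$$s(\theta_0, \theta_1) = \int_{\theta_0}^{\theta_1} \sqrt{I(\theta)}\, d\theta,$$

where $I(\theta)$ is the test information function; unlike $\theta$ itself, this quantity is invariant under smooth monotone reparameterizations of the latent scale.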