--- title: "BayesChange Tutorial" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{tutorial} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: references.bib --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` Here we provide a brief tutorial of the `BayesChange` package. The `BayesChange` package contains two main functions: one that performs change points detection on univariate and multivariate time series and one that perform clustering of time series and survival functions with common change points. Here we briefly show how to implement these. ```{r setup} library(BayesChange) ``` ## Detecting change points The function `detect_cp` provide a method for detecting change points on univariate and multivariate time series, it is based on the work @MM2014 and on the work @CORRADIN202226. Depending on the structure of the data, `detect_cp` performs change points detection on univariate time series or multivariate time series. For example we can create a vector of 100 observations where the first 50 observations are sampled from a normal distribution with mean 0 and variance 0.1 and the other 50 observations still from a normal distribution with mean 0 but variance 0.25. ```{r} data_uni <- as.numeric(c(rnorm(50,0,0.1), rnorm(50,1,0.25))) ``` Now we can run the function `detect_cp`, as arguments of the function we need to specify the number of iterations, the number of burn-in steps and a list with the the autoregressive coefficient `phi` for the likelihood of the data, the parameters `a`, `b`, `c` for the priors and the probability `q` of performing a split at each step. ```{r} out <- detect_cp(data = data_uni, n_iterations = 1000, n_burnin = 100, params = list(q = 0.25, phi = 0.1, a = 1, b = 1, c = 0.1)) ``` With the methods `print` and `summary` we can get information about the algorithm. ```{r} print(out) summary(out) ``` In order to get a point estimate of the change points we can use the method `posterior_estimate` that uses the method *salso* by @Dahl02102022 to get the final latent order and then detect the change points. ```{r} table(posterior_estimate(out, loss = "binder")) ``` The package also provides a method for plotting the change points. ```{r} plot(out, loss = "binder") ``` In we define instead a matrix of data, `detect_cp` automatically performs a multivariate change points detection method. ```{r} data_multi <- matrix(NA, nrow = 3, ncol = 100) data_multi[1,] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250))) data_multi[2,] <- as.numeric(c(rnorm(50,0,0.125), rnorm(50,1,0.225))) data_multi[3,] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280))) ``` Arguments `k_0`, `nu_0`, `phi_0`, `m_0`, `par_theta_c`, `par_theta_d` and `prior_var_gamma` correspond to the parameters of the prior distributions for the multivariate likelihood. ```{r} out <- detect_cp(data = data_multi, n_iterations = 1000, n_burnin = 100, list(q = 0.25, k_0 = 0.25, nu_0 = 4, phi_0 = diag(1,3,3), m_0 = rep(0,3), par_theta_c = 2, par_theta_d = 0.2, prior_var_gamma = 0.1)) table(posterior_estimate(out, loss = "binder")) ``` ```{r} plot(out, loss = "binder") ``` ## Clustering time dependent data with common change points `BayesChange` contains another function, `clust_cp`, that cluster respectively univariate and multivariate time series and survival functions with common change points. Details about this methods can be found in @corradin2024. 
## Clustering time dependent data with common change points

`BayesChange` contains another function, `clust_cp`, that clusters univariate and multivariate time series and survival functions with common change points. Details about these methods can be found in @corradin2024.

In `clust_cp` the argument `kernel` must be specified: if data are time series then `kernel = "ts"` must be set. The algorithm then automatically detects whether data are univariate or multivariate.

If time series are univariate we need to set a matrix where each row is a time series.

```{r}
data_mat <- matrix(NA, nrow = 5, ncol = 100)

data_mat[1,] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_mat[2,] <- as.numeric(c(rnorm(50,0,0.125), rnorm(50,1,0.225)))
data_mat[3,] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))
data_mat[4,] <- as.numeric(c(rnorm(25,0,0.135), rnorm(75,1,0.225)))
data_mat[5,] <- as.numeric(c(rnorm(25,0,0.155), rnorm(75,1,0.280)))
```

Arguments that need to be specified in `clust_cp` are the number of iterations `n_iterations`, the number of burn-in steps `n_burnin` and a list with the parameters of the algorithm, the likelihood and the priors; the list includes the number of elements `B` used to estimate the normalisation constant and the number of split-and-merge steps `L` performed when a new partition is proposed.

```{r}
out <- clust_cp(data = data_mat, n_iterations = 1000, n_burnin = 100,
                kernel = "ts", params = list(B = 1000, L = 1, gamma = 0.5))

posterior_estimate(out, loss = "binder")
```

The method `plot` for clustered univariate time series represents the data colored according to the assigned cluster.

```{r}
plot(out, loss = "binder")
```

If time series are multivariate, data must be an array, where each element is a multivariate time series represented by a matrix. Each row of the matrix is a component of the time series.

```{r}
data_array <- array(data = NA, dim = c(3,100,5))

data_array[1,,1] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[2,,1] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[3,,1] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))

data_array[1,,2] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[2,,2] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[3,,2] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))

data_array[1,,3] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))
data_array[2,,3] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))
data_array[3,,3] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))

data_array[1,,4] <- as.numeric(c(rnorm(25,0,0.135), rnorm(75,1,0.225)))
data_array[2,,4] <- as.numeric(c(rnorm(25,0,0.135), rnorm(75,1,0.225)))
data_array[3,,4] <- as.numeric(c(rnorm(25,0,0.135), rnorm(75,1,0.225)))

data_array[1,,5] <- as.numeric(c(rnorm(25,0,0.155), rnorm(75,1,0.280)))
data_array[2,,5] <- as.numeric(c(rnorm(25,0,0.155), rnorm(75,1,0.280)))
data_array[3,,5] <- as.numeric(c(rnorm(25,0,0.155), rnorm(75,1,0.280)))
```

```{r}
out <- clust_cp(data = data_array, n_iterations = 1000, n_burnin = 100,
                kernel = "ts", params = list(B = 1000, L = 1, gamma = 0.1,
                                             k_0 = 0.25, nu_0 = 5,
                                             phi_0 = diag(0.1,3,3), m_0 = rep(0,3)))

posterior_estimate(out, loss = "binder")
```

```{r}
plot(out, loss = "binder")
```
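Since the data are simulated, a quick sanity check is to compare the estimated partition with the one implied by the simulation design: series 1-3 share a change point at time 50, while series 4 and 5 change at time 25. Below is a minimal base-R sketch, where `true_clust` is written down by hand from the design above and is not an object produced by the package.

```{r}
# Partition implied by the simulation: series 1-3 change at t = 50, series 4-5 at t = 25
true_clust <- c(1, 1, 1, 2, 2)

# Estimated partition returned by clust_cp
est_clust <- posterior_estimate(out, loss = "binder")

# Cross-tabulation of estimated vs. simulated cluster labels
table(est_clust, true_clust)
```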
Finally, if we set `kernel = "epi"`, `clust_cp` clusters survival functions with common change points; details can again be found in @corradin2024. Data are a matrix where each row reports the number of infected individuals at each time. The package also includes the function `sim_epi_data`, which simulates infection times.

```{r}
data_mat <- matrix(NA, nrow = 5, ncol = 50)

betas <- list(c(rep(0.45, 25),rep(0.14,25)),
              c(rep(0.55, 25),rep(0.11,25)),
              c(rep(0.50, 25),rep(0.12,25)),
              c(rep(0.52, 10),rep(0.15,40)),
              c(rep(0.53, 10),rep(0.13,40)))

inf_times <- list()

for(i in 1:5){

  # simulate infection times for the i-th epidemic
  inf_times[[i]] <- sim_epi_data(S0 = 10000, I0 = 10, max_time = 50,
                                 beta_vec = betas[[i]], gamma_0 = 1/8)

  # count the infections observed in each time unit
  vec <- rep(0,50)
  names(vec) <- as.character(1:50)

  for(j in 1:50){
    if(as.character(j) %in% names(table(floor(inf_times[[i]])))){
      vec[j] <- table(floor(inf_times[[i]]))[which(names(table(floor(inf_times[[i]]))) == j)]
    }
  }
  data_mat[i,] <- vec

}
```

In `clust_cp` we need to specify, besides the usual parameters, the number of Monte Carlo replications `M` for the approximation of the integrated likelihood and the recovery rate `gamma`.

```{r}
out <- clust_cp(data = data_mat, n_iterations = 100, n_burnin = 10,
                kernel = "epi", params = list(M = 100, B = 1000, L = 1,
                                              q = 0.1, gamma = 1/8))

posterior_estimate(out, loss = "binder")
```

```{r}
plot(out, loss = "binder")
```
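As a complement to the `plot` method, the raw epidemic curves can also be inspected directly, coloring each series by its estimated cluster. Below is a minimal base-R sketch, assuming, as above, that `posterior_estimate` returns one cluster label per row of `data_mat`.

```{r}
est_clust <- posterior_estimate(out, loss = "binder")

# Infection counts of each simulated epidemic, colored by estimated cluster
matplot(t(data_mat), type = "l", lty = 1,
        col = as.integer(factor(est_clust)),
        xlab = "time", ylab = "number of infected")
legend("topright", legend = paste("cluster", levels(factor(est_clust))),
       col = seq_along(levels(factor(est_clust))), lty = 1)
```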