BayesChange Tutorial

Here we provide a brief tutorial of the BayesChange package. The package contains two main functions: one that performs change point detection on univariate and multivariate time series, and one that performs clustering of time series and survival functions with common change points. Below we briefly show how to use both.

library(BayesChange)

Detecting change points

The function detect_cp provides a method for detecting change points on univariate and multivariate time series. It is based on the work of Martínez and Mena (2014) and of Corradin, Danese, and Ongaro (2022).

Depending on the structure of the data, detect_cp performs change point detection on univariate or multivariate time series. For example, we can create a vector of 100 observations where the first 50 are sampled from a normal distribution with mean 0 and standard deviation 0.1, and the remaining 50 from a normal distribution with mean 1 and standard deviation 0.25.
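
Since the data are simulated, it can be useful to fix the random seed first, so that runs are reproducible (the seed value below is arbitrary).

set.seed(1234)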

data_uni <- as.numeric(c(rnorm(50,0,0.1), rnorm(50,1,0.25)))

Now we can run detect_cp. As arguments we need to specify the number of iterations, the number of burn-in steps, and a list containing the autoregressive coefficient phi for the likelihood of the data, the parameters a, b and c for the priors, and the probability q of performing a split at each step.

out <- detect_cp(data = data_uni,                             
                 n_iterations = 1000, n_burnin = 100,  
                 params = list(q = 0.25, phi = 0.1, a = 1, b = 1, c = 0.1))
#> Completed:   100/1000 - in 0.025366 sec
#> Completed:   200/1000 - in 0.048452 sec
#> Completed:   300/1000 - in 0.07143 sec
#> Completed:   400/1000 - in 0.093914 sec
#> Completed:   500/1000 - in 0.117454 sec
#> Completed:   600/1000 - in 0.140207 sec
#> Completed:   700/1000 - in 0.163495 sec
#> Completed:   800/1000 - in 0.186382 sec
#> Completed:   900/1000 - in 0.209169 sec
#> Completed:   1000/1000 - in 0.232016 sec

With the methods print and summary we can get information about the algorithm.

print(out)
#> DetectCpObj object
#> Type: change points detection on univariate time series

summary(out)
#> DetectCpObj object
#> Detecting change points on an univariate time series:
#>  Number of burn-in iterations: 100 
#>  Number of MCMC iterations: 900 
#>  Computational time: 0.23 seconds

In order to get a point estimate of the change points we can use the method posterior_estimate, which relies on the salso method of Dahl, Johnson, and Müller (2022) to get the final latent order and then detects the change points.

table(posterior_estimate(out, loss = "binder"))
#> 
#>  1  2 
#> 50 50

The package also provides a method for plotting the change points.

plot(out, loss = "binder")

If we instead define a matrix of data, detect_cp automatically performs multivariate change point detection.

data_multi <- matrix(NA, nrow = 3, ncol = 100)

data_multi[1,] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_multi[2,] <- as.numeric(c(rnorm(50,0,0.125), rnorm(50,1,0.225)))
data_multi[3,] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))

Arguments k_0, nu_0, phi_0, m_0, par_theta_c, par_theta_d and prior_var_gamma correspond to the parameters of the prior distributions for the multivariate likelihood.

out <- detect_cp(data = data_multi, n_iterations = 1000, n_burnin = 100,
                 params = list(q = 0.25, k_0 = 0.25, nu_0 = 4, phi_0 = diag(1,3,3),
                               m_0 = rep(0,3), par_theta_c = 2, par_theta_d = 0.2,
                               prior_var_gamma = 0.1))
#> Completed:   100/1000 - in 0.012738 sec
#> Completed:   200/1000 - in 0.025104 sec
#> Completed:   300/1000 - in 0.037406 sec
#> Completed:   400/1000 - in 0.050324 sec
#> Completed:   500/1000 - in 0.065016 sec
#> Completed:   600/1000 - in 0.079686 sec
#> Completed:   700/1000 - in 0.094396 sec
#> Completed:   800/1000 - in 0.111334 sec
#> Completed:   900/1000 - in 0.147951 sec
#> Completed:   1000/1000 - in 0.206238 sec

table(posterior_estimate(out, loss = "binder"))
#> 
#>  1  2  3  4  5 
#> 50  1 46  1  2

Note that with a short chain like this the point estimate may contain a few spurious small blocks, as in the table above; increasing n_iterations and n_burnin usually removes them.

plot(out, loss = "binder")

Clustering time-dependent data with common change points

BayesChange contains another function, clust_cp, that clusters univariate and multivariate time series, as well as survival functions, with common change points. Details about these methods can be found in Corradin et al. (2024).

In clust_cp the argument kernel must be specified: if data are time series, then kernel = "ts" must be set. The algorithm then automatically detects whether data are univariate or multivariate.

If time series are univariate, we need to provide a matrix where each row is a time series.

data_mat <- matrix(NA, nrow = 5, ncol = 100)

data_mat[1,] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_mat[2,] <- as.numeric(c(rnorm(50,0,0.125), rnorm(50,1,0.225)))
data_mat[3,] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))
data_mat[4,] <- as.numeric(c(rnorm(25,0,0.135), rnorm(75,1,0.225)))
data_mat[5,] <- as.numeric(c(rnorm(25,0,0.155), rnorm(75,1,0.280)))

Besides the number of iterations n_iterations and the number of burn-in steps n_burnin, clust_cp requires a list with the parameters of the algorithm, the likelihood and the priors: among them, B is the number of elements used to approximate the normalisation constant and L is the number of split-and-merge steps performed when a new partition is proposed.

out <- clust_cp(data = data_mat, n_iterations = 1000, n_burnin = 100, 
                kernel = "ts",
                params = list(B = 1000, L = 1, gamma = 0.5))
#> Normalization constant - completed:  100/1000 - in 0.020964 sec
#> Normalization constant - completed:  200/1000 - in 0.043184 sec
#> Normalization constant - completed:  300/1000 - in 0.064528 sec
#> Normalization constant - completed:  400/1000 - in 0.086474 sec
#> Normalization constant - completed:  500/1000 - in 0.108037 sec
#> Normalization constant - completed:  600/1000 - in 0.13033 sec
#> Normalization constant - completed:  700/1000 - in 0.151624 sec
#> Normalization constant - completed:  800/1000 - in 0.172996 sec
#> Normalization constant - completed:  900/1000 - in 0.194675 sec
#> Normalization constant - completed:  1000/1000 - in 0.216976 sec
#> 
#> ------ MAIN LOOP ------
#> 
#> Completed:   100/1000 - in 0.238357 sec
#> Completed:   200/1000 - in 0.414929 sec
#> Completed:   300/1000 - in 0.593679 sec
#> Completed:   400/1000 - in 0.769971 sec
#> Completed:   500/1000 - in 0.936128 sec
#> Completed:   600/1000 - in 1.10731 sec
#> Completed:   700/1000 - in 1.27569 sec
#> Completed:   800/1000 - in 1.43849 sec
#> Completed:   900/1000 - in 1.60661 sec
#> Completed:   1000/1000 - in 1.77736 sec

posterior_estimate(out, loss = "binder")
#> [1] 1 2 2 3 3
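
To see which series end up together, we can group the row indices of data_mat by the estimated labels; a minimal base-R sketch (est is just a temporary name):

est <- posterior_estimate(out, loss = "binder")
# list the time series belonging to each cluster
split(seq_len(nrow(data_mat)), est)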

For clustered univariate time series, the plot method displays the data colored according to the assigned cluster.

plot(out, loss = "binder")

If time series are multivariate, data must be an array where each slice along the third dimension is a multivariate time series, stored as a matrix whose rows are the components of the series.

data_array <- array(data = NA, dim = c(3,100,5))

data_array[1,,1] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[2,,1] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[3,,1] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))

data_array[1,,2] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[2,,2] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[3,,2] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))

data_array[1,,3] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))
data_array[2,,3] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))
data_array[3,,3] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))

data_array[1,,4] <- as.numeric(c(rnorm(25,0,0.135), rnorm(75,1,0.225)))
data_array[2,,4] <- as.numeric(c(rnorm(25,0,0.135), rnorm(75,1,0.225)))
data_array[3,,4] <- as.numeric(c(rnorm(25,0,0.135), rnorm(75,1,0.225)))

data_array[1,,5] <- as.numeric(c(rnorm(25,0,0.155), rnorm(75,1,0.280)))
data_array[2,,5] <- as.numeric(c(rnorm(25,0,0.155), rnorm(75,1,0.280)))
data_array[3,,5] <- as.numeric(c(rnorm(25,0,0.155), rnorm(75,1,0.280)))

out <- clust_cp(data = data_array, n_iterations = 1000, n_burnin = 100,
                kernel = "ts",
                params = list(B = 1000, L = 1, gamma = 0.1, k_0 = 0.25, nu_0 = 5,
                              phi_0 = diag(0.1,3,3), m_0 = rep(0,3)))
#> Normalization constant - completed:  100/1000 - in 0.008352 sec
#> Normalization constant - completed:  200/1000 - in 0.016731 sec
#> Normalization constant - completed:  300/1000 - in 0.024943 sec
#> Normalization constant - completed:  400/1000 - in 0.033201 sec
#> Normalization constant - completed:  500/1000 - in 0.041403 sec
#> Normalization constant - completed:  600/1000 - in 0.049729 sec
#> Normalization constant - completed:  700/1000 - in 0.057937 sec
#> Normalization constant - completed:  800/1000 - in 0.0662 sec
#> Normalization constant - completed:  900/1000 - in 0.07449 sec
#> Normalization constant - completed:  1000/1000 - in 0.082682 sec
#> 
#> ------ MAIN LOOP ------
#> 
#> Completed:   100/1000 - in 0.1103 sec
#> Completed:   200/1000 - in 0.19314 sec
#> Completed:   300/1000 - in 0.27129 sec
#> Completed:   400/1000 - in 0.350164 sec
#> Completed:   500/1000 - in 0.428533 sec
#> Completed:   600/1000 - in 0.505917 sec
#> Completed:   700/1000 - in 0.583387 sec
#> Completed:   800/1000 - in 0.662918 sec
#> Completed:   900/1000 - in 0.740954 sec
#> Completed:   1000/1000 - in 0.822667 sec

posterior_estimate(out, loss = "binder")
#> [1] 1 1 1 2 2
plot(out, loss = "binder")

Finally, if we set kernel = "epi", clust_cp clusters survival functions with common change points. Details, again, can be found in Corradin et al. (2024).

Data must be a matrix where each row contains the number of infections observed at each time. The package also includes the function sim_epi_data, which simulates infection times.

data_mat <- matrix(NA, nrow = 5, ncol = 50)

betas <- list(c(rep(0.45, 25),rep(0.14,25)),
               c(rep(0.55, 25),rep(0.11,25)),
               c(rep(0.50, 25),rep(0.12,25)),
               c(rep(0.52, 10),rep(0.15,40)),
               c(rep(0.53, 10),rep(0.13,40)))

inf_times <- list()

for(i in 1:5){

  inf_times[[i]] <- sim_epi_data(S0 = 10000, I0 = 10, max_time = 50,
                                 beta_vec = betas[[i]], gamma_0 = 1/8)

  # count the infections falling in each unit time interval
  data_mat[i,] <- as.numeric(table(factor(floor(inf_times[[i]]), levels = 1:50)))
}
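
Before clustering, we can look at the simulated epidemic curves; this uses only base R (matplot), not BayesChange:

matplot(t(data_mat), type = "l", lty = 1,
        xlab = "time", ylab = "new infections")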

In clust_cp we need to specify, besides the usual parameters, the number of Monte Carlo replications M for the approximation of the integrated likelihood and the recovery rate gamma.

out <- clust_cp(data = data_mat, n_iterations = 100, n_burnin = 10,
                kernel = "epi",
                params = list(M = 100, B = 1000, L = 1, q = 0.1, gamma = 1/8))
#> Normalization constant - completed:  10/100 - in 0.030684 sec
#> Normalization constant - completed:  20/100 - in 0.06127 sec
#> Normalization constant - completed:  30/100 - in 0.091853 sec
#> Normalization constant - completed:  40/100 - in 0.122398 sec
#> Normalization constant - completed:  50/100 - in 0.152941 sec
#> Normalization constant - completed:  60/100 - in 0.1835 sec
#> Normalization constant - completed:  70/100 - in 0.214116 sec
#> Normalization constant - completed:  80/100 - in 0.244669 sec
#> Normalization constant - completed:  90/100 - in 0.275203 sec
#> Normalization constant - completed:  100/100 - in 0.30575 sec
#> 
#> ------ MAIN LOOP ------
#> 
#> Completed:   10/100 - in 0.283554 sec
#> Completed:   20/100 - in 0.540508 sec
#> Completed:   30/100 - in 0.794049 sec
#> Completed:   40/100 - in 1.03933 sec
#> Completed:   50/100 - in 1.29211 sec
#> Completed:   60/100 - in 1.54851 sec
#> Completed:   70/100 - in 1.84282 sec
#> Completed:   80/100 - in 2.12747 sec
#> Completed:   90/100 - in 2.40579 sec
#> Completed:   100/100 - in 2.69365 sec

posterior_estimate(out, loss = "binder")
#> [1] 1 1 1 2 3
plot(out, loss = "binder")

Corradin, Riccardo, Luca Danese, Wasiur R. KhudaBukhsh, and Andrea Ongaro. 2024. “Model-Based Clustering of Time-Dependent Observations with Common Structural Changes.” https://arxiv.org/abs/2410.09552.
Corradin, Riccardo, Luca Danese, and Andrea Ongaro. 2022. “Bayesian Nonparametric Change Point Detection for Multivariate Time Series with Missing Observations.” International Journal of Approximate Reasoning 143: 26–43. https://doi.org/10.1016/j.ijar.2021.12.019.
Dahl, David B., Devin J. Johnson, and Peter Müller. 2022. “Search Algorithms and Loss Functions for Bayesian Clustering.” Journal of Computational and Graphical Statistics 31 (4): 1189–1201. https://doi.org/10.1080/10618600.2022.2069779.
Martínez, Asael Fabian, and Ramsés H. Mena. 2014. “On a Nonparametric Change Point Detection Model in Markovian Regimes.” Bayesian Analysis 9 (4): 823–58. https://doi.org/10.1214/14-BA878.