Package 'bootcluster' reference manual

Title:	Bootstrapping Estimates of Clustering Stability
Description:	Implementation of the bootstrapping approach for the estimation of clustering stability and its application in estimating the number of clusters, as introduced by Yu et al (2016)<doi:10.1142/9789814749411_0007>. Implementation of the non-parametric bootstrap approach to assessing the stability of module detection in a graph, the extension for the selection of a parameter set that defines a graph from data in a way that optimizes stability and the corresponding visualization functions, as introduced by Tian et al (2021) <doi:10.1002/sam.11495>. Implemented out-of-bag stability estimation function and k-select Smin-based k-selection function as introduced by Liu et al (2022) <doi:10.1002/sam.11593>. Implemented ensemble clustering method based-on k-means clustering method, spectral clustering method and hierarchical clustering method.
Authors:	Han Yu [aut], Mingmei Tian [aut], Tianmou Liu [aut, cre]
Maintainer:	Tianmou Liu <[email protected]>
License:	GPL-2
Version:	0.4.1
Built:	2025-03-13 04:17:15 UTC
Source:	https://github.com/cran/bootcluster

Calculate agreement between two clustering results

Description

Calculate agreement between two clustering results

Usage

agreement(clst1, clst2)
agreement(clst1, clst2)

Arguments

`clst1`	First clustering result
`clst2`	Second clustering result

Value

Vector of agreement values

Calculate agreement between two clustering results with known number of clusters

Description

Calculate agreement between two clustering results with known number of clusters

Usage

agreement_nk(clst1, clst2, nk)
agreement_nk(clst1, clst2, nk)

Arguments

`clst1`	First clustering result
`clst2`	Second clustering result
`nk`	Number of clusters

Value

Vector of agreement values

Multi-Method Ensemble Clustering with Graph-based Consensus

Description

Implements ensemble clustering by combining multiple clustering methods (k-means, hierarchical, and spectral clustering) using a graph-based consensus approach.

Usage

ensemble.cluster.multi(
  x,
  k_km,
  k_hc,
  k_sc,
  n_ref = 3,
  B = 100,
  hc.method = "ward.D",
  dist_method = "euclidean"
)
ensemble.cluster.multi(
  x,
  k_km,
  k_hc,
  k_sc,
  n_ref = 3,
  B = 100,
  hc.method = "ward.D",
  dist_method = "euclidean"
)

Arguments

`x`	data.frame or matrix where rows are observations and columns are features
`k_km`	number of clusters for k-means clustering
`k_hc`	number of clusters for hierarchical clustering
`k_sc`	number of clusters for spectral clustering
`n_ref`	number of reference distributions for stability assessment (default: 3)
`B`	number of bootstrap samples for stability estimation (default: 100)
`hc.method`	hierarchical clustering method (default: "ward.D")
`dist_method`	distance method for spectral clustering (default: "euclidean")

Details

This function implements a multi-method ensemble clustering approach that: 1. Applies multiple clustering methods (k-means, hierarchical, spectral) 2. Assesses stability of each clustering through bootstrapping 3. Constructs a weighted bipartite graph representing all clusterings 4. Uses fast greedy community detection for final consensus

Value

A list containing:

membership: Final cluster assignments from ensemble consensus
k_consensus: Number of clusters found in consensus
individual_results: List of results from individual clustering methods
stability_measures: Stability measures for each method
graph: igraph object of the ensemble graph

Examples


data(iris)
df <- iris[,1:4]
result <- ensemble.cluster.multi(df, k_km=3, k_hc=3, k_sc=3)
plot(df[,1:2], col=result$membership, pch=16)


data(iris)
df <- iris[,1:4]
result <- ensemble.cluster.multi(df, k_km=3, k_hc=3, k_sc=3)
plot(df[,1:2], col=result$membership, pch=16)

Estimate the stability of a clustering based on non-parametric bootstrap out-of-bag scheme, with option for subsampling scheme

Description

Estimate the stability of a clustering based on non-parametric bootstrap out-of-bag scheme, with option for subsampling scheme

Usage

esmbl.stability(
  x,
  k,
  scheme = "kmeans",
  B = 100,
  hc.method = "ward.D",
  cut_ratio = 0.5,
  dist_method = "euclidean"
)
esmbl.stability(
  x,
  k,
  scheme = "kmeans",
  B = 100,
  hc.method = "ward.D",
  cut_ratio = 0.5,
  dist_method = "euclidean"
)

Arguments

`x`	`data.frame` of the data set where rows are observations and columns are features
`k`	number of clusters for which to estimate the stability
`scheme`	clustering method to use ("kmeans", "hc", or "spectral")
`B`	number of bootstrap re-samples
`hc.method`	hierarchical clustering method (default: "ward.D")
`cut_ratio`	ratio for subsampling (default: 0.5)
`dist_method`	distance method for spectral clustering (default: "euclidean")

Details

This function estimates the stability through out-of-bag observations It estimate the stability at the (1) observation level, (2) cluster level, and (3) overall.

Value

membership: vector of membership for each observation from the reference clustering
obs_wise: vector of estimated observation-wise stability
clust_wise: vector of estimated cluster-wise stability
overall: numeric estimated overall stability
Smin: numeric estimated Smin through out-of-bag scheme

Author(s)

Tianmou Liu

Examples


set.seed(123)
data(iris)
df <- iris[,1:4]
result <- esmbl.stability(df, k=3, scheme="kmeans")


set.seed(123)
data(iris)
df <- iris[,1:4]
result <- esmbl.stability(df, k=3, scheme="kmeans")

Estimate number of clusters

Description

Estimate number of clusters by bootstrapping stability

Usage

k.select(x, range = 2:7, B = 20, r = 5, threshold = 0.8, scheme_2 = TRUE)
k.select(x, range = 2:7, B = 20, r = 5, threshold = 0.8, scheme_2 = TRUE)

Arguments

`x`	a `data.frame` of the data set
`range`	a `vector` of `integer` values, of the possible numbers of clusters k
`B`	number of bootstrap re-samplings
`r`	number of runs of k-means
`threshold`	the threshold for determining k
`scheme_2`	`logical` `TRUE` if scheme 2 is used, `FASLE` if scheme 1 is used

Details

This function estimates the number of clusters through a bootstrapping approach, and a measure Smin, which is based on an observation-wise similarity among clusterings. The number of clusters k is selected as the largest number of clusters, for which the Smin is greater than a threshold. The threshold is often selected between 0.8 ~ 0.9. Two schemes are provided. Scheme 1 uses the clustering of the original data as the reference for stability calculations. Scheme 2 searches acrossthe clustering samples that gives the most stable clustering.

Value

profile: a vector of Smin measures for determining k
k: integer estimated number of clusters

Author(s)

Han Yu

References

Bootstrapping estimates of stability for clusters, observations and model selection. Han Yu, Brian Chapman, Arianna DiFlorio, Ellen Eischen, David Gotz, Matthews Jacob and Rachael Hageman Blair.

Examples


set.seed(1)
data(wine)
x0 <- wine[,2:14]
x <- scale(x0)
k.select(x, range = 2:10, B=20, r=5, scheme_2 = TRUE)

set.seed(1)
data(wine)
x0 <- wine[,2:14]
x <- scale(x0)
k.select(x, range = 2:10, B=20, r=5, scheme_2 = TRUE)

Estimate number of clusters

Description

Estimate number of clusters by bootstrapping stability

Usage

k.select_ref(df, k_range = 2:7, n_ref = 5, B = 100, B_ref = 50, r = 5)
k.select_ref(df, k_range = 2:7, n_ref = 5, B = 100, B_ref = 50, r = 5)

Arguments

`df`	`data.frame` of the input dataset
`k_range`	`integer` valued `vector` of the numbers of clusters k to be tested upon
`n_ref`	number of reference distribution to be generated
`B`	number of bootstrap re-samples
`B_ref`	number of bootstrap resamples for the reference distributions
`r`	number of runs of k-means

Details

This function uses the out-of-bag scheme to estimate the number of clusters in a dataset. The function calculate the Smin of the dataset and at the same time, generate a reference dataset with the same range as the original dataset in each dimension and calculate the Smin_ref. The differences between Smin and Smin_ref at each k,Smin_diff(k), is taken into consideration as well as the standard deviation of the differences. We choose the k to be the argmax of ( Smin_diff(k) - ( Smin_diff(k+1) + (Smin_diff(k+1)) ) ). If Smin_diff(k) less than 0.1 for all k in k_range, we say k = 1

Value

profile: vector of ( Smin_diff(k) - ( Smin_diff(k+1) + se(Smin_diff(k+1)) ) ) measures for researchers's inspection
k: estimated number of clusters

Author(s)

Tianmou Liu

References

Bootstrapping estimates of stability for clusters, observations and model selection. Han Yu, Brian Chapman, Arianna DiFlorio, Ellen Eischen, David Gotz, Matthews Jacob and Rachael Hageman Blair.

Examples


set.seed(1)
data(iris)
df <- data.frame(iris[,1:4])
df <- scale(df)
k.select_ref(df, k_range = 2:7, n_ref = 5, B=500, B_ref = 500, r=5)

set.seed(1)
data(iris)
df <- data.frame(iris[,1:4])
df <- scale(df)
k.select_ref(df, k_range = 2:7, n_ref = 5, B=500, B_ref = 500, r=5)

Calculate minimum agreement across clusters

Description

Calculates the minimum average agreement value across all clusters

Usage

min_agreement(clst, agrmt)
min_agreement(clst, agrmt)

Arguments

`clst`	clustering result vector
`agrmt`	agreement values vector

Value

minimum average agreement value across clusters

Estimate of detect module stability

Description

Estimate of detect module stability

Usage

network.stability(
  data.input,
  threshold,
  B = 20,
  cor.method,
  large.size,
  PermuNo,
  scheme_2 = FALSE
)
network.stability(
  data.input,
  threshold,
  B = 20,
  cor.method,
  large.size,
  PermuNo,
  scheme_2 = FALSE
)

Arguments

`data.input`	a `data.frame` of the data set where the rows are observations and columns are covariates
`threshold`	a `numeric` number of threshold for correlation matrix
`B`	number of bootstrap re-samplings
`cor.method`	the correlation method applied to the data set,three method are available: `"pearson", "kendall", "spearman"`.
`large.size`	the smallest set of modules, the `large.size=0` is recommended to use right now.
`PermuNo`	number of random graphs for null
`scheme_2`	`logical` `TRUE` if scheme 2 is used, `FASLE` if scheme 1 is used. Right now, only `FASLE` is recommended.

Details

This function estimates the modules' stability through bootstrapping approach for the given threshold. The approach to stability estimation is to compare the module composition of the reference correlation graph to the various bootstrapped correlation graphs, and to assess the stability at the (1) node-level, (2) module-level, and (3) overall.

Value

stabilityresult: a list of result for nodes-wise stability
modularityresult: list of modularity information with the given threshold
jaccardresult: list estimated unconditional observed stability and the estimates of expected stability under the null
originalinformation: list information for original data, igraph object and adjacency matrix constructed with the given threshold

Author(s)

Mingmei Tian

References

A framework for stability-based module detection in correlation graphs. Mingmei Tian,Rachael Hageman Blair,Lina Mu, Matthew Bonner, Richard Browne and Han Yu.

Examples


set.seed(1)
data(wine)
x0 <- wine[1:50,]

mytest<-network.stability(data.input=x0,threshold=0.7, B=20, 
cor.method='pearson',large.size=0,
PermuNo = 10,
scheme_2 = FALSE)

set.seed(1)
data(wine)
x0 <- wine[1:50,]

mytest<-network.stability(data.input=x0,threshold=0.7, B=20, 
cor.method='pearson',large.size=0,
PermuNo = 10,
scheme_2 = FALSE)

Plot method for objests from threshold.select

Description

Plot method for objests from threshold.select

Usage

network.stability.output(input, optimal.only = FALSE)
network.stability.output(input, optimal.only = FALSE)

Arguments

`input`	a `list` of results from function `threshold.select`
`optimal.only`	a `logical` value indicating whether only plot the network with optimal threshold or not. The default is False, generating all network figures with a large number of nodes could take some time.

Details

network.stability.output is used to generate a series of network plots based on the given threshold.seq,where the nodes are colored by the level of stability. The network with optimal threshold value selected by function threshold.select is colored as red.

Value

Plot of network figures

Author(s)

Mingmei Tian

References

A framework for stability-based module detection in correlation graphs. Mingmei Tian,Rachael Hageman Blair,Lina Mu, Matthew Bonner, Richard Browne and Han Yu.

Examples


set.seed(1)
data(wine)
x0 <- wine[1:50,]

mytest<-threshold.select(data.input=x0,threshold.seq=seq(0.1,0.5,by=0.05), B=20, 
cor.method='pearson',large.size=0,
PermuNo = 10,
no_cores=1,
scheme_2 = FALSE)
network.stability.output(mytest)

set.seed(1)
data(wine)
x0 <- wine[1:50,]

mytest<-threshold.select(data.input=x0,threshold.seq=seq(0.1,0.5,by=0.05), B=20, 
cor.method='pearson',large.size=0,
PermuNo = 10,
no_cores=1,
scheme_2 = FALSE)
network.stability.output(mytest)

Estimate the stability of a clustering based on non-parametric bootstrap out-of-bag scheme, with option for subsampling scheme

Description

Estimate the stability of a clustering based on non-parametric bootstrap out-of-bag scheme, with option for subsampling scheme

Usage

ob.stability(x, k, B = 500, r = 5, subsample = FALSE, cut_ratio = 0.5)
ob.stability(x, k, B = 500, r = 5, subsample = FALSE, cut_ratio = 0.5)

Arguments

`x`	`data.frame` of the data set where the rows as observations and columns as dimensions of features
`k`	number of clusters for which to estimate the stability
`B`	number of bootstrap re-samples
`r`	integer parameter in the kmeansCBI() funtion
`subsample`	logical parameter to use the subsampling scheme option in the resampling process (instead of bootstrap)
`cut_ratio`	numeric parameter between 0 and 1 for subsampling scheme training set ratio

Details

This function estimates the stability through out-of-bag observations It estimate the stability at the (1) observation level, (2) cluster level, and (3) overall.

Value

membership: vector of membership for each observation from the reference clustering
obs_wise: vector of estimated observation-wise stability
clust_wise: vector of estimated cluster-wise stability
overall: numeric estimated overall stability
Smin: numeric estimated Smin through out-of-bag scheme

Author(s)

Tianmou Liu

References

Bootstrapping estimates of stability for clusters, observations and model selection. Han Yu, Brian Chapman, Arianna DiFlorio, Ellen Eischen, David Gotz, Matthews Jacob and Rachael Hageman Blair.

Examples


set.seed(123)
data(iris)
df <- data.frame(iris[,1:4])
# You can choose to scale df before clustering by 
# df <- scale(df)
ob.stability(df, k = 2, B=500, r=5)


set.seed(123)
data(iris)
df <- data.frame(iris[,1:4])
# You can choose to scale df before clustering by 
# df <- scale(df)
ob.stability(df, k = 2, B=500, r=5)

Generate reference distribution for stability assessment

Description

Generates a reference distribution by sampling from uniform distributions with ranges determined by the original data.

Usage

ref_dist(df)
ref_dist(df)

Arguments

`df`	data.frame or matrix of the original dataset

Details

Generate Reference Distribution

Value

A scaled matrix containing the reference distribution

Examples


data(iris)
df <- iris[,1:4]
ref <- ref_dist(df)


data(iris)
df <- iris[,1:4]
ref <- ref_dist(df)

Generate reference distribution for binary data

Description

Generates a reference distribution by randomly permuting each column of the original binary dataset.

Usage

ref_dist_bin(df)
ref_dist_bin(df)

Arguments

`df`	data.frame or matrix of the original binary dataset

Details

Generate Binary Reference Distribution

Value

A matrix containing the permuted binary reference distribution

Examples


binary_data <- matrix(sample(0:1, 100, replace=TRUE), ncol=5)
ref <- ref_dist_bin(binary_data)


binary_data <- matrix(sample(0:1, 100, replace=TRUE), ncol=5)
ref <- ref_dist_bin(binary_data)

Generate PCA-based reference distribution

Description

Generates a reference distribution in PCA space by sampling from uniform distributions with ranges determined by the PCA-transformed data.

Usage

ref_dist_pca(df)
ref_dist_pca(df)

Arguments

`df`	data.frame or matrix of the original dataset

Details

Generate Reference Distribution using PCA

Value

A scaled matrix containing the reference distribution in PCA space

Examples


data(iris)
df <- iris[,1:4]
ref <- ref_dist_pca(df)


data(iris)
df <- iris[,1:4]
ref <- ref_dist_pca(df)

Estimate clustering stability of k-means

Description

Estimate of k-means bootstrapping stability

Usage

stability(x, k, B = 20, r = 5, scheme_2 = TRUE)
stability(x, k, B = 20, r = 5, scheme_2 = TRUE)

Arguments

`x`	a `data.frame` of the data set
`k`	a `integer` number of clusters
`B`	number of bootstrap re-samplings
`r`	number of runs of k-means
`scheme_2`	`logical` `TRUE` if scheme 2 is used, `FASLE` if scheme 1 is used

Details

This function estimates the clustering stability through bootstrapping approach. Two schemes are provided. Scheme 1 uses the clustering of the original data as the reference for stability calculations. Scheme 2 searches acrossthe clustering samples that gives the most stable clustering.

Value

membership: a vector of membership for each observation from the reference clustering
obs_wise: vector of estimated observation-wise stability
overall: numeric estimated overall stability

Author(s)

Han Yu

References

Bootstrapping estimates of stability for clusters, observations and model selection. Han Yu, Brian Chapman, Arianna DiFlorio, Ellen Eischen, David Gotz, Matthews Jacob and Rachael Hageman Blair.

Examples

 
set.seed(1)
data(wine)
x0 <- wine[,2:14]
x <- scale(x0)
stability(x, k = 3, B=20, r=5, scheme_2 = TRUE)

set.seed(1)
data(wine)
x0 <- wine[,2:14]
x <- scale(x0)
stability(x, k = 3, B=20, r=5, scheme_2 = TRUE)

Estimate of the overall Jaccard stability

Description

Estimate of the overall Jaccard stability

Arguments

`data.input`	a `data.frame` of the data set where the rows are observations and columns are covariates
`threshold.seq`	a `numeric` sequence of candidate threshold
`B`	number of bootstrap re-samplings
`cor.method`	the correlation method applied to the data set,three method are available: `"pearson", "kendall", "spearman"`.
`large.size`	the smallest set of modules, the `large.size=0` is recommended to use right now.
`PermuNo`	number of random graphs for the estimation of expected stability
`no_cores`	a `interger` number of CPU cores on the current host (This function can't not be used yet).

Details

threshold.select is used to estimate of the overall Jaccard stability from a sequence of given threshold candidates, threshold.seq.

Value

stabilityresult: a list of result for nodes-wise stability
modularityresult: a list of modularity information with each candidate threshold
jaccardresult: a list estimated unconditional observed stability and the estimates of expected stability under the nul
originalinformation: a list information for original data, igraph object and adjacency matrix constructed with each candidate threshold
threshold.seq: a list of candicate threshold given to the function

Author(s)

Mingmei Tian

References

A framework for stability-based module detection in correlation graphs. Mingmei Tian,Rachael Hageman Blair,Lina Mu, Matthew Bonner, Richard Browne and Han Yu.

Examples


set.seed(1)
data(wine)
x0 <- wine[1:50,]

mytest<-threshold.select(data.input=x0,threshold.seq=seq(0.5,0.8,by=0.05), B=20, 
cor.method='pearson',large.size=0,
PermuNo = 10,
no_cores=1,
scheme_2 = FALSE)

set.seed(1)
data(wine)
x0 <- wine[1:50,]

mytest<-threshold.select(data.input=x0,threshold.seq=seq(0.5,0.8,by=0.05), B=20, 
cor.method='pearson',large.size=0,
PermuNo = 10,
no_cores=1,
scheme_2 = FALSE)

Wine Data Set

Description

These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

Usage

data(wine)
data(wine)

Format

The data set wine contains a data.frame of 14 variables. The first variable is the types of wines. The other 13 variables are quantities of the constituents.

References

https://archive.ics.uci.edu/ml/datasets/wine

Package 'bootcluster'

Help Index

Calculate agreement between two clustering results

Description

Usage

Arguments

Value

Calculate agreement between two clustering results with known number of clusters

Description

Usage

Arguments

Value

Multi-Method Ensemble Clustering with Graph-based Consensus

Description

Usage

Arguments

Details

Value

Examples

Estimate the stability of a clustering based on non-parametric bootstrap out-of-bag scheme, with option for subsampling scheme

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Estimate number of clusters

Description

Usage

Arguments

Details

Value

Author(s)

References

Examples

Estimate number of clusters

Description

Usage

Arguments

Details

Value

Author(s)

References

Examples

Calculate minimum agreement across clusters

Description

Usage

Arguments

Value

Estimate of detect module stability

Description

Usage

Arguments

Details

Value

Author(s)

References

Examples

Plot method for objests from threshold.select

Description

Usage

Arguments

Details

Value

Author(s)

References

Examples

Estimate the stability of a clustering based on non-parametric bootstrap out-of-bag scheme, with option for subsampling scheme

Description

Usage

Arguments

Details

Value

Author(s)

References

Examples

Generate reference distribution for stability assessment

Description

Usage