Title: | Bootstrapping Estimates of Clustering Stability |
---|---|
Description: | Implementation of the bootstrapping approach for the estimation of clustering stability and its application in estimating the number of clusters, as introduced by Yu et al (2016) <doi:10.1142/9789814749411_0007>. Implementation of the non-parametric bootstrap approach to assessing the stability of module detection in a graph, the extension for the selection of a parameter set that defines a graph from data in a way that optimizes stability, and the corresponding visualization functions, as introduced by Tian et al (2021) <doi:10.1002/sam.11495>. Implementation of the out-of-bag stability estimation function and the Smin-based k-selection function, as introduced by Liu et al (2022) <doi:10.1002/sam.11593>. Implementation of an ensemble clustering method based on k-means, spectral, and hierarchical clustering. |
Authors: | Han Yu [aut], Mingmei Tian [aut], Tianmou Liu [aut, cre] |
Maintainer: | Tianmou Liu <[email protected]> |
License: | GPL-2 |
Version: | 0.4.1 |
Built: | 2025-02-11 04:12:53 UTC |
Source: | https://github.com/cran/bootcluster |
Calculate agreement between two clustering results
agreement(clst1, clst2)
clst1 |
First clustering result |
clst2 |
Second clustering result |
Vector of agreement values
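As a rough illustration of what an observation-wise agreement between two clusterings can look like, the base R sketch below scores each observation by the fraction of other observations whose co-membership status (same cluster or not) is identical in both clusterings. The helper name `obs_agreement` and this particular definition are assumptions made for illustration; they are not taken from the package source, and `agreement()` may compute its values differently.

```r
# Hypothetical helper, not part of bootcluster.
# For each observation i, compare "is j in the same cluster as i?"
# under both clusterings and average the matches over all j != i.
obs_agreement <- function(clst1, clst2) {
  stopifnot(length(clst1) == length(clst2))
  same1 <- outer(clst1, clst1, "==")  # co-membership under clustering 1
  same2 <- outer(clst2, clst2, "==")  # co-membership under clustering 2
  agree <- same1 == same2
  diag(agree) <- NA                   # ignore each observation paired with itself
  rowMeans(agree, na.rm = TRUE)
}

a <- c(1, 1, 2, 2)
b <- c(2, 2, 1, 1)  # same partition as a, with the labels swapped
obs_agreement(a, b)
```

Because the comparison is made on co-membership rather than on raw labels, relabeling the clusters does not change the result.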
Calculate agreement between two clustering results with known number of clusters
agreement_nk(clst1, clst2, nk)
clst1 |
First clustering result |
clst2 |
Second clustering result |
nk |
Number of clusters |
Vector of agreement values
Implements ensemble clustering by combining multiple clustering methods (k-means, hierarchical, and spectral clustering) using a graph-based consensus approach.
ensemble.cluster.multi( x, k_km, k_hc, k_sc, n_ref = 3, B = 100, hc.method = "ward.D", dist_method = "euclidean" )
x |
data.frame or matrix where rows are observations and columns are features |
k_km |
number of clusters for k-means clustering |
k_hc |
number of clusters for hierarchical clustering |
k_sc |
number of clusters for spectral clustering |
n_ref |
number of reference distributions for stability assessment (default: 3) |
B |
number of bootstrap samples for stability estimation (default: 100) |
hc.method |
hierarchical clustering method (default: "ward.D") |
dist_method |
distance method for spectral clustering (default: "euclidean") |
This function implements a multi-method ensemble clustering approach that:
1. Applies multiple clustering methods (k-means, hierarchical, spectral)
2. Assesses the stability of each clustering through bootstrapping
3. Constructs a weighted bipartite graph representing all clusterings
4. Uses fast greedy community detection for the final consensus
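To make the bipartite graph in the steps above more concrete, the base R sketch below runs two of the clustering methods and builds the observation-by-cluster incidence matrix that such a graph could be derived from. It is a simplified illustration under stated assumptions: edges are unweighted here (the function weights them using the stability estimates), spectral clustering and the community detection step are omitted, and none of the variable names come from the package.

```r
set.seed(1)
x <- scale(iris[, 1:4])

# Step 1 (partial): two base R clusterings, each with 3 clusters
km <- kmeans(x, centers = 3, nstart = 5)$cluster
hc <- cutree(hclust(dist(x), method = "ward.D"), k = 3)

# Step 3 (simplified): one row per observation, one column per
# (method, cluster) pair; an entry of 1 marks membership.
incidence <- cbind(outer(km, 1:3, "=="), outer(hc, 1:3, "==")) * 1
colnames(incidence) <- c(paste0("km", 1:3), paste0("hc", 1:3))

head(incidence)
```

Each row has exactly one 1 per method, so every observation node is connected to one cluster node from each clustering.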
A list containing:
Final cluster assignments from ensemble consensus
Number of clusters found in consensus
List of results from individual clustering methods
Stability measures for each method
igraph object of the ensemble graph
data(iris)
df <- iris[,1:4]
result <- ensemble.cluster.multi(df, k_km=3, k_hc=3, k_sc=3)
plot(df[,1:2], col=result$membership, pch=16)
Estimate the stability of a clustering using a non-parametric bootstrap out-of-bag scheme, with an optional subsampling scheme
esmbl.stability( x, k, scheme = "kmeans", B = 100, hc.method = "ward.D", cut_ratio = 0.5, dist_method = "euclidean" )
x |
|
k |
number of clusters for which to estimate the stability |
scheme |
clustering method to use ("kmeans", "hc", or "spectral") |
B |
number of bootstrap re-samples |
hc.method |
hierarchical clustering method (default: "ward.D") |
cut_ratio |
ratio for subsampling (default: 0.5) |
dist_method |
distance method for spectral clustering (default: "euclidean") |
This function estimates stability from out-of-bag observations. It estimates stability at the (1) observation level, (2) cluster level, and (3) overall.
vector of membership for each observation from the reference clustering
vector of estimated observation-wise stability
vector of estimated cluster-wise stability
numeric estimated overall stability
numeric estimated Smin through out-of-bag scheme
Tianmou Liu
set.seed(123)
data(iris)
df <- iris[,1:4]
result <- esmbl.stability(df, k=3, scheme="kmeans")
Estimate number of clusters by bootstrapping stability
k.select(x, range = 2:7, B = 20, r = 5, threshold = 0.8, scheme_2 = TRUE)
x |
a |
range |
a |
B |
number of bootstrap re-samplings |
r |
number of runs of k-means |
threshold |
the threshold for determining k |
scheme_2 |
|
This function estimates the number of clusters through a bootstrapping approach and a measure, Smin, which is based on an observation-wise similarity among clusterings. The selected k is the largest number of clusters for which Smin is greater than a threshold. The threshold is often chosen between 0.8 and 0.9. Two schemes are provided. Scheme 1 uses the clustering of the original data as the reference for stability calculations. Scheme 2 searches across the clustering samples for the one that gives the most stable clustering.
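The selection rule described above can be sketched in a few lines of base R. The helper `select_k` and the Smin values below are made up for illustration only; returning NA when no candidate exceeds the threshold is an assumption, not documented behavior of `k.select()`.

```r
# Hypothetical helper: pick the largest k whose Smin exceeds the threshold.
select_k <- function(smin, k_range, threshold = 0.8) {
  stopifnot(length(smin) == length(k_range))
  ok <- k_range[smin > threshold]
  if (length(ok) == 0) NA_integer_ else max(ok)
}

k_range <- 2:6
smin <- c(0.95, 0.91, 0.60, 0.82, 0.40)  # made-up Smin profile, one value per k
select_k(smin, k_range, threshold = 0.8)
```

Note that the rule takes the largest qualifying k, so a dip below the threshold at an intermediate k (here k = 4) does not stop a larger k from being chosen.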
profile |
a vector of Smin measures for determining k |
k |
integer, the estimated number of clusters |
Han Yu
Bootstrapping estimates of stability for clusters, observations and model selection. Han Yu, Brian Chapman, Arianna DiFlorio, Ellen Eischen, David Gotz, Matthews Jacob and Rachael Hageman Blair.
set.seed(1)
data(wine)
x0 <- wine[,2:14]
x <- scale(x0)
k.select(x, range = 2:10, B=20, r=5, scheme_2 = TRUE)
Estimate number of clusters by bootstrapping stability
k.select_ref(df, k_range = 2:7, n_ref = 5, B = 100, B_ref = 50, r = 5)
df |
|
k_range |
|
n_ref |
number of reference distributions to be generated |
B |
number of bootstrap re-samples |
B_ref |
number of bootstrap resamples for the reference distributions |
r |
number of runs of k-means |
This function uses the out-of-bag scheme to estimate the number of clusters in a dataset. The function calculates the Smin of the dataset and, at the same time, generates a reference dataset with the same range as the original dataset in each dimension and calculates Smin_ref. The difference between Smin and Smin_ref at each k, Smin_diff(k), is taken into consideration, as well as the standard deviation of the differences. The selected k is the argmax of ( Smin_diff(k) - ( Smin_diff(k+1) + se(Smin_diff(k+1)) ) ). If Smin_diff(k) is less than 0.1 for all k in k_range, k is taken to be 1.
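The argmax criterion can be illustrated with a small base R sketch. The helper `select_k_ref` and the input values are hypothetical; the sketch assumes the criterion is evaluated for every k in k_range except the last (which has no k+1), and that the standard-error term is supplied as a vector aligned with k_range.

```r
# Hypothetical helper illustrating the selection rule only.
select_k_ref <- function(smin_diff, se_diff, k_range) {
  stopifnot(length(smin_diff) == length(k_range),
            length(se_diff) == length(k_range))
  # Fallback rule: no k separates from the reference, so report one cluster
  if (all(smin_diff < 0.1)) return(1L)
  m <- length(k_range)
  # profile[i] = Smin_diff(k_i) - ( Smin_diff(k_i + 1) + se(Smin_diff(k_i + 1)) )
  profile <- smin_diff[-m] - (smin_diff[-1] + se_diff[-1])
  k_range[which.max(profile)]
}

k_range   <- 2:5
smin_diff <- c(0.05, 0.30, 0.10, 0.08)  # made-up differences Smin - Smin_ref
se_diff   <- c(0.02, 0.03, 0.02, 0.02)  # made-up standard errors
select_k_ref(smin_diff, se_diff, k_range)
```

The criterion favors the k after which the advantage over the reference drops most sharply, which is k = 3 for these illustrative numbers.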
profile |
vector of ( Smin_diff(k) - ( Smin_diff(k+1) + se(Smin_diff(k+1)) ) ) measures for the researcher's inspection |
k |
estimated number of clusters |
Tianmou Liu
Bootstrapping estimates of stability for clusters, observations and model selection. Han Yu, Brian Chapman, Arianna DiFlorio, Ellen Eischen, David Gotz, Matthews Jacob and Rachael Hageman Blair.
set.seed(1)
data(iris)
df <- data.frame(iris[,1:4])
df <- scale(df)
k.select_ref(df, k_range = 2:7, n_ref = 5, B=500, B_ref = 500, r=5)
Calculates the minimum average agreement value across all clusters
min_agreement(clst, agrmt)
clst |
clustering result vector |
agrmt |
agreement values vector |
minimum average agreement value across clusters
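Read literally, the description amounts to averaging the agreement values within each cluster and taking the smallest average. A minimal base R sketch of that reading follows; the helper name and the input values are made up, and the package function may differ in details such as the handling of missing values.

```r
# Hypothetical helper: mean agreement per cluster, then the minimum.
min_avg_agreement <- function(clst, agrmt) {
  stopifnot(length(clst) == length(agrmt))
  min(tapply(agrmt, clst, mean))
}

clst  <- c(1, 1, 2, 2)
agrmt <- c(1.0, 0.5, 0.2, 0.4)  # made-up agreement values
min_avg_agreement(clst, agrmt)  # cluster means are 0.75 and 0.3
```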
Estimate the stability of detected modules
network.stability( data.input, threshold, B = 20, cor.method, large.size, PermuNo, scheme_2 = FALSE )
data.input |
a |
threshold |
a |
B |
number of bootstrap re-samplings |
cor.method |
the correlation method applied to the data set; three methods are available: |
large.size |
the smallest set of modules, the |
PermuNo |
number of random graphs for null |
scheme_2 |
|
This function estimates the stability of modules through a bootstrapping approach for the given threshold. The approach compares the module composition of the reference correlation graph with that of the bootstrapped correlation graphs, and assesses stability at the (1) node level, (2) module level, and (3) overall.
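The first part of this procedure, building a reference correlation graph and a bootstrapped one at a given threshold, can be sketched in base R. The sketch assumes that nodes are the variables (columns), that an edge is drawn when the absolute correlation reaches the threshold, and that observations (rows) are resampled; these are assumptions for illustration, not statements about the package internals. Module detection and the node-level and module-level stability measures are not reproduced here.

```r
set.seed(1)
x <- as.matrix(mtcars)  # built-in data, used only for illustration
threshold <- 0.7

# Hypothetical helper: threshold a correlation matrix into a 0/1 adjacency matrix
build_adjacency <- function(data, threshold, method = "pearson") {
  r <- cor(data, method = method)
  adj <- (abs(r) >= threshold) * 1
  diag(adj) <- 0  # no self-loops
  adj
}

ref_adj <- build_adjacency(x, threshold)

# One bootstrap replicate: resample observations, rebuild the graph
boot_rows <- sample(nrow(x), replace = TRUE)
boot_adj <- build_adjacency(x[boot_rows, ], threshold)

# Fraction of node pairs whose edge status is the same in both graphs
mean(ref_adj[upper.tri(ref_adj)] == boot_adj[upper.tri(boot_adj)])
```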
stabilityresult |
a list of results for node-wise stability |
modularityresult |
a list of modularity information for the given threshold |
jaccardresult |
a list of the estimated unconditional observed stability and the estimates of expected stability under the null |
originalinformation |
a list of information for the original data: igraph object and adjacency matrix constructed with the given threshold |
Mingmei Tian
A framework for stability-based module detection in correlation graphs. Mingmei Tian, Rachael Hageman Blair, Lina Mu, Matthew Bonner, Richard Browne and Han Yu.
set.seed(1)
data(wine)
x0 <- wine[1:50,]
mytest <- network.stability(data.input=x0, threshold=0.7, B=20, cor.method='pearson', large.size=0, PermuNo = 10, scheme_2 = FALSE)
Plot method for objects from threshold.select
network.stability.output(input, optimal.only = FALSE)
input |
a |
optimal.only |
a |
network.stability.output is used to generate a series of network plots based on the given threshold.seq, where the nodes are colored by their level of stability. The network with the optimal threshold value selected by threshold.select is colored red.
Plot of network figures
Mingmei Tian
A framework for stability-based module detection in correlation graphs. Mingmei Tian, Rachael Hageman Blair, Lina Mu, Matthew Bonner, Richard Browne and Han Yu.
set.seed(1)
data(wine)
x0 <- wine[1:50,]
mytest <- threshold.select(data.input=x0, threshold.seq=seq(0.1,0.5,by=0.05), B=20, cor.method='pearson', large.size=0, PermuNo = 10, no_cores=1, scheme_2 = FALSE)
network.stability.output(mytest)
Estimate the stability of a clustering using a non-parametric bootstrap out-of-bag scheme, with an optional subsampling scheme
ob.stability(x, k, B = 500, r = 5, subsample = FALSE, cut_ratio = 0.5)
x |
|
k |
number of clusters for which to estimate the stability |
B |
number of bootstrap re-samples |
r |
integer parameter in the kmeansCBI() function |
subsample |
logical parameter to use the subsampling scheme option in the resampling process (instead of bootstrap) |
cut_ratio |
numeric parameter between 0 and 1 for subsampling scheme training set ratio |
This function estimates stability from out-of-bag observations. It estimates stability at the (1) observation level, (2) cluster level, and (3) overall.
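The out-of-bag idea can be sketched in base R as follows. This is a simplified illustration, not the package's implementation: it uses `kmeans()` from base R rather than `kmeansCBI()`, assigns out-of-bag observations to the nearest bootstrap centroid (an assumption), and measures agreement by co-membership among the out-of-bag observations. The helper `nearest_centre` and all variable names are hypothetical.

```r
set.seed(123)
x <- as.matrix(iris[, 1:4])
k <- 2
B <- 50
n <- nrow(x)

# Hypothetical helper: index of the closest centre for every row of pts
nearest_centre <- function(pts, centres) {
  d <- sapply(seq_len(nrow(centres)), function(j)
    rowSums(sweep(pts, 2, centres[j, ])^2))  # squared Euclidean distances
  max.col(-d, ties.method = "first")
}

ref <- kmeans(x, centers = k, nstart = 5)$cluster  # reference clustering
agree_sum <- numeric(n)
oob_count <- numeric(n)

for (b in seq_len(B)) {
  in_bag <- sample(n, replace = TRUE)
  oob <- setdiff(seq_len(n), in_bag)  # observations left out of this resample
  if (length(oob) < 2) next
  fit <- kmeans(x[in_bag, , drop = FALSE], centers = k, nstart = 5)
  pred <- nearest_centre(x[oob, , drop = FALSE], fit$centers)
  # Compare co-membership among out-of-bag observations (label-invariant)
  agree <- outer(ref[oob], ref[oob], "==") == outer(pred, pred, "==")
  diag(agree) <- NA
  agree_sum[oob] <- agree_sum[oob] + rowMeans(agree, na.rm = TRUE)
  oob_count[oob] <- oob_count[oob] + 1
}

obs_wise   <- agree_sum / oob_count                      # (1) observation level
clust_wise <- tapply(obs_wise, ref, mean, na.rm = TRUE)  # (2) cluster level
overall    <- mean(obs_wise, na.rm = TRUE)               # (3) overall
smin       <- min(clust_wise)                            # least stable cluster
```

Each observation is scored only in the replicates where it was left out, so the clustering being evaluated never saw the observation it is judged on.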
membership |
vector of membership for each observation from the reference clustering |
obs_wise |
vector of estimated observation-wise stability |
clust_wise |
vector of estimated cluster-wise stability |
overall |
numeric, estimated overall stability |
Smin |
numeric, estimated Smin through the out-of-bag scheme |
Tianmou Liu
Bootstrapping estimates of stability for clusters, observations and model selection. Han Yu, Brian Chapman, Arianna DiFlorio, Ellen Eischen, David Gotz, Matthews Jacob and Rachael Hageman Blair.
set.seed(123)
data(iris)
df <- data.frame(iris[,1:4])
# You can choose to scale df before clustering by
# df <- scale(df)
ob.stability(df, k = 2, B=500, r=5)
Generates a reference distribution by sampling from uniform distributions with ranges determined by the original data.
ref_dist(df)
df |
data.frame or matrix of the original dataset |
Generate Reference Distribution
A scaled matrix containing the reference distribution
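A minimal base R sketch of this kind of reference distribution is shown below: each column is replaced by uniform draws between that column's minimum and maximum, and the result is scaled. The helper name `uniform_reference` is hypothetical, and the exact scaling used by `ref_dist()` is an assumption here.

```r
# Hypothetical helper: uniform draws within each column's observed range
uniform_reference <- function(df) {
  m <- as.matrix(df)
  raw <- apply(m, 2, function(col) runif(length(col), min(col), max(col)))
  scale(raw)  # centre and scale each column
}

set.seed(1)
ref <- uniform_reference(iris[, 1:4])
dim(ref)
```

Sampling each column independently removes any cluster structure while keeping the data within the same bounding box.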
data(iris)
df <- iris[,1:4]
ref <- ref_dist(df)
Generates a reference distribution by randomly permuting each column of the original binary dataset.
ref_dist_bin(df)
df |
data.frame or matrix of the original binary dataset |
Generate Binary Reference Distribution
A matrix containing the permuted binary reference distribution
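Column-wise permutation can be sketched in one line of base R with `apply()` and `sample()`. The helper name `permute_columns` is hypothetical and the package function may differ in details.

```r
# Hypothetical helper: shuffle every column independently
permute_columns <- function(df) {
  m <- as.matrix(df)
  out <- apply(m, 2, sample)  # sample(col) returns a random permutation of col
  dimnames(out) <- dimnames(m)
  out
}

set.seed(1)
binary_data <- matrix(sample(0:1, 100, replace = TRUE), ncol = 5)
ref <- permute_columns(binary_data)
colSums(ref)  # same column totals as binary_data
```

Permuting keeps each column's proportion of ones unchanged while breaking the association between columns.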
binary_data <- matrix(sample(0:1, 100, replace=TRUE), ncol=5)
ref <- ref_dist_bin(binary_data)
Generates a reference distribution in PCA space by sampling from uniform distributions with ranges determined by the PCA-transformed data.
ref_dist_pca(df)
df |
data.frame or matrix of the original dataset |
Generate Reference Distribution using PCA
A scaled matrix containing the reference distribution in PCA space
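The PCA variant can be sketched with `prcomp()` from base R: the data are rotated onto their principal components, and uniform draws are taken within the range of each component's scores. The helper name `pca_reference` is hypothetical; centring without unit-variance scaling before PCA, and scaling the result afterwards, are assumptions rather than documented behavior.

```r
# Hypothetical helper: uniform draws within the range of each principal component
pca_reference <- function(df) {
  scores <- prcomp(as.matrix(df), center = TRUE, scale. = FALSE)$x
  raw <- apply(scores, 2, function(pc) runif(length(pc), min(pc), max(pc)))
  scale(raw)
}

set.seed(1)
ref <- pca_reference(iris[, 1:4])
colnames(ref)
```

Sampling along the principal axes lets the reference box follow the main directions of variation in the data instead of the original coordinate axes.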
data(iris)
df <- iris[,1:4]
ref <- ref_dist_pca(df)
Estimate of k-means bootstrapping stability
stability(x, k, B = 20, r = 5, scheme_2 = TRUE)
x |
a |
k |
a |
B |
number of bootstrap re-samplings |
r |
number of runs of k-means |
scheme_2 |
|
This function estimates the clustering stability through a bootstrapping approach. Two schemes are provided. Scheme 1 uses the clustering of the original data as the reference for stability calculations. Scheme 2 searches across the clustering samples for the one that gives the most stable clustering.
membership |
a vector of membership for each observation from the reference clustering |
obs_wise |
vector of estimated observation-wise stability |
overall |
numeric, estimated overall stability |
Han Yu
Bootstrapping estimates of stability for clusters, observations and model selection. Han Yu, Brian Chapman, Arianna DiFlorio, Ellen Eischen, David Gotz, Matthews Jacob and Rachael Hageman Blair.
set.seed(1)
data(wine)
x0 <- wine[,2:14]
x <- scale(x0)
stability(x, k = 3, B=20, r=5, scheme_2 = TRUE)
Estimate of the overall Jaccard stability
data.input |
a |
threshold.seq |
a |
B |
number of bootstrap re-samplings |
cor.method |
the correlation method applied to the data set; three methods are available: |
large.size |
the smallest set of modules, the |
PermuNo |
number of random graphs for the estimation of expected stability |
no_cores |
a |
threshold.select is used to estimate the overall Jaccard stability from a sequence of given threshold candidates, threshold.seq.
stabilityresult |
a list of results for node-wise stability |
modularityresult |
a list of modularity information for each candidate threshold |
jaccardresult |
a list of the estimated unconditional observed stability and the estimates of expected stability under the null |
originalinformation |
a list of information for the original data: igraph object and adjacency matrix constructed with each candidate threshold |
threshold.seq |
a list of the candidate thresholds given to the function |
Mingmei Tian
A framework for stability-based module detection in correlation graphs. Mingmei Tian, Rachael Hageman Blair, Lina Mu, Matthew Bonner, Richard Browne and Han Yu.
set.seed(1)
data(wine)
x0 <- wine[1:50,]
mytest <- threshold.select(data.input=x0, threshold.seq=seq(0.5,0.8,by=0.05), B=20, cor.method='pearson', large.size=0, PermuNo = 10, no_cores=1, scheme_2 = FALSE)
These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
data(wine)
The data set wine contains a data.frame of 14 variables. The first variable is the type of wine. The other 13 variables are quantities of the constituents.
https://archive.ics.uci.edu/ml/datasets/wine