Create an undirected degree corrected stochastic blockmodel object
Source:R/undirected_dcsbm.R
dcsbm.Rd
To specify a degree-corrected stochastic blockmodel, you must specify
the degree-heterogeneity parameters (via n
or theta
),
the mixing matrix (via k
or B
), and the relative block
probabilities (optional, via pi
). We provide defaults for most of these
options to enable rapid exploration, or you can invest the effort
for more control over the model parameters. We strongly recommend
setting the expected_degree
or expected_density
argument
to avoid large memory allocations associated with
sampling large, dense graphs.
Usage
dcsbm(
n = NULL,
theta = NULL,
k = NULL,
B = NULL,
...,
pi = rep(1/k, k),
sort_nodes = TRUE,
force_identifiability = FALSE,
poisson_edges = TRUE,
allow_self_loops = TRUE
)
Arguments
- n
(degree heterogeneity) The number of nodes in the blockmodel. Use when you don't want to specify the degree-heterogeneity parameters
theta
by hand. Whenn
is specified,theta
is randomly generated from aLogNormal(2, 1)
distribution. This is subject to change, and may not be reproducible.n
defaults toNULL
. You must specify eithern
ortheta
, but not both.- theta
(degree heterogeneity) A numeric vector explicitly specifying the degree heterogeneity parameters. This implicitly determines the number of nodes in the resulting graph, i.e. it will have
length(theta)
nodes. Must be positive. Setting to a vector of ones recovers a stochastic blockmodel without degree correction. Defaults toNULL
. You must specify eithern
ortheta
, but not both.- k
(mixing matrix) The number of blocks in the blockmodel. Use when you don't want to specify the mixing-matrix by hand. When
k
is specified, the elements ofB
are drawn randomly from aUniform(0, 1)
distribution. This is subject to change, and may not be reproducible.k
defaults toNULL
. You must specify eitherk
orB
, but not both.- B
(mixing matrix) A
k
byk
matrix of block connection probabilities. The probability that a node in blocki
connects to a node in communityj
isPoisson(B[i, j])
. Must be a square matrix.matrix
andMatrix
objects are both acceptable. IfB
is not symmetric, it will be symmetrized via the updateB := B + t(B)
. Defaults toNULL
. You must specify eitherk
orB
, but not both.- ...
Arguments passed on to
undirected_factor_model
expected_degree
If specified, the desired expected degree of the graph. Specifying
expected_degree
simply rescalesS
to achieve this. Defaults toNULL
. Do not specify bothexpected_degree
andexpected_density
at the same time.expected_density
If specified, the desired expected density of the graph. Specifying
expected_density
simply rescalesS
to achieve this. Defaults toNULL
. Do not specify bothexpected_degree
andexpected_density
at the same time.
- pi
(relative block probabilities) Relative block probabilities. Must be positive, but do not need to sum to one, as they will be normalized internally. Must match the dimensions of
B
ork
. Defaults torep(1 / k, k)
, or a balanced blocks.- sort_nodes
Logical indicating whether or not to sort the nodes so that they are grouped by block and by
theta
. Useful for plotting. Defaults toTRUE
.- force_identifiability
Logical indicating whether or not to normalize
theta
such that it sums to one within each block. Defaults toFALSE
, since this behavior can be surprise whentheta
is set to a vector of all ones to recover the DC-SBM case.- poisson_edges
Logical indicating whether or not multiple edges are allowed to form between a pair of nodes. Defaults to
TRUE
. WhenFALSE
, sampling proceeds as usual, and duplicate edges are removed afterwards. Further, whenFALSE
, we assume thatS
specifies a desired between-factor connection probability, and back-transform thisS
to the appropriate Poisson intensity parameter to approximate Bernoulli factor connection probabilities. See Section 2.3 of Rohe et al. (2017) for some additional details.- allow_self_loops
Logical indicating whether or not nodes should be allowed to form edges with themselves. Defaults to
TRUE
. WhenFALSE
, sampling proceeds allowing self-loops, and these are then removed after the fact.
Value
An undirected_dcsbm
S3 object, a subclass of the
undirected_factor_model()
with the following additional
fields:
theta
: A numeric vector of degree-heterogeneity parameters.z
: The community memberships of each node, as afactor()
. The factor will havek
levels, wherek
is the number of communities in the stochastic blockmodel. There will not always necessarily be observed nodes in each community.pi
: Sampling probabilities for each block.sorted
: Logical indicating where nodes are arranged by block (and additionally by degree heterogeneity parameter) within each block.
Generative Model
There are two levels of randomness in a degree-corrected
stochastic blockmodel. First, we randomly chose a block
membership for each node in the blockmodel. This is
handled by dcsbm()
. Then, given these block memberships,
we randomly sample edges between nodes. This second
operation is handled by sample_edgelist()
,
sample_sparse()
, sample_igraph()
and
sample_tidygraph()
, depending depending on your desired
graph representation.
Block memberships
Let \(z_i\) represent the block membership of node \(i\). To generate \(z_i\) we sample from a categorical distribution (note that this is a special case of a multinomial) with parameter \(\pi\), such that \(\pi_i\) represents the probability of ending up in the ith block. Block memberships for each node are independent.
Degree heterogeneity
In addition to block membership, the DCSBM also allows nodes to have different propensities for edge formation. We represent this propensity for node \(i\) by a positive number \(\theta_i\). Typically the \(\theta_i\) are constrained to sum to one for identifiability purposes, but this doesn't really matter during sampling (i.e. without the sum constraint scaling \(B\) and \(\theta\) has the same effect on edge probabilities, but whether \(B\) or \(\theta\) is responsible for this change is uncertain).
Edge formulation
Once we know the block memberships \(z\) and the degree
heterogeneity parameters \(theta\), we need one more
ingredient, which is the baseline intensity of connections
between nodes in block i
and block j
. Then each edge
\(A_{i,j}\) is Poisson distributed with parameter
$$ \lambda[i, j] = \theta_i \cdot B_{z_i, z_j} \cdot \theta_j. $$
See also
Other stochastic block models:
directed_dcsbm()
,
mmsbm()
,
overlapping_sbm()
,
planted_partition()
,
sbm()
Other undirected graphs:
chung_lu()
,
erdos_renyi()
,
mmsbm()
,
overlapping_sbm()
,
planted_partition()
,
sbm()
Examples
set.seed(27)
lazy_dcsbm <- dcsbm(n = 1000, k = 5, expected_density = 0.01)
#> Generating random degree heterogeneity parameters `theta` from a LogNormal(2, 1) distribution. This distribution may change in the future. Explicitly set `theta` for reproducible results.
#> Generating random mixing matrix `B` with independent Uniform(0, 1) entries. This distribution may change in the future. Explicitly set `B` for reproducible results.
lazy_dcsbm
#> Undirected Degree-Corrected Stochastic Blockmodel
#> -------------------------------------------------
#>
#> Nodes (n): 1000 (arranged by block)
#> Blocks (k): 5
#>
#> Traditional DCSBM parameterization:
#>
#> Block memberships (z): 1000 [factor]
#> Degree heterogeneity (theta): 1000 [numeric]
#> Block probabilities (pi): 5 [numeric]
#>
#> Factor model parameterization:
#>
#> X: 1000 x 5 [dgCMatrix]
#> S: 5 x 5 [dgeMatrix]
#>
#> Poisson edges: TRUE
#> Allow self loops: TRUE
#>
#> Expected edges: 4995
#> Expected degree: 5
#> Expected density: 0.01
# sometimes you gotta let the world burn and
# sample a wildly dense graph
dense_lazy_dcsbm <- dcsbm(n = 500, k = 3, expected_density = 0.8)
#> Generating random degree heterogeneity parameters `theta` from a LogNormal(2, 1) distribution. This distribution may change in the future. Explicitly set `theta` for reproducible results.
#> Generating random mixing matrix `B` with independent Uniform(0, 1) entries. This distribution may change in the future. Explicitly set `B` for reproducible results.
dense_lazy_dcsbm
#> Undirected Degree-Corrected Stochastic Blockmodel
#> -------------------------------------------------
#>
#> Nodes (n): 500 (arranged by block)
#> Blocks (k): 3
#>
#> Traditional DCSBM parameterization:
#>
#> Block memberships (z): 500 [factor]
#> Degree heterogeneity (theta): 500 [numeric]
#> Block probabilities (pi): 3 [numeric]
#>
#> Factor model parameterization:
#>
#> X: 500 x 3 [dgCMatrix]
#> S: 3 x 3 [dgeMatrix]
#>
#> Poisson edges: TRUE
#> Allow self loops: TRUE
#>
#> Expected edges: 99800
#> Expected degree: 199.6
#> Expected density: 0.8
# explicitly setting the degree heterogeneity parameter,
# mixing matrix, and relative community sizes rather
# than using randomly generated defaults
k <- 5
n <- 1000
B <- matrix(stats::runif(k * k), nrow = k, ncol = k)
theta <- round(stats::rlnorm(n, 2))
pi <- c(1, 2, 4, 1, 1)
custom_dcsbm <- dcsbm(
theta = theta,
B = B,
pi = pi,
expected_degree = 50
)
custom_dcsbm
#> Undirected Degree-Corrected Stochastic Blockmodel
#> -------------------------------------------------
#>
#> Nodes (n): 1000 (arranged by block)
#> Blocks (k): 5
#>
#> Traditional DCSBM parameterization:
#>
#> Block memberships (z): 1000 [factor]
#> Degree heterogeneity (theta): 1000 [numeric]
#> Block probabilities (pi): 5 [numeric]
#>
#> Factor model parameterization:
#>
#> X: 1000 x 5 [dgCMatrix]
#> S: 5 x 5 [dgeMatrix]
#>
#> Poisson edges: TRUE
#> Allow self loops: TRUE
#>
#> Expected edges: 50000
#> Expected degree: 50
#> Expected density: 0.1001
edgelist <- sample_edgelist(custom_dcsbm)
edgelist
#> # A tibble: 49,838 × 2
#> from to
#> <int> <int>
#> 1 19 43
#> 2 48 59
#> 3 6 41
#> 4 10 50
#> 5 6 33
#> 6 13 35
#> 7 35 35
#> 8 1 31
#> 9 1 2
#> 10 19 47
#> # … with 49,828 more rows
# efficient eigendecompostion that leverages low-rank structure in
# E(A) so that you don't have to form E(A) to find eigenvectors,
# as E(A) is typically dense. computation is
# handled via RSpectra
population_eigs <- eigs_sym(custom_dcsbm)