Interpreting factors with bff(), the Best Feature Function
Fan Chen
2024-11-05
Source: vignettes/bff.Rmd
Intro
In post-clustering analysis, the Best Feature Function (BFF) is useful for selecting representative features for each cluster, especially when additional covariates are available for each feature. For example, consider a social network of users partitioned into clusters, where each user possesses a collection of text documents (covariates). We want to summarize the words that are representative of each cluster. The BFF is suitable for this type of task.
This document describes the intuition behind the BFF as a follow-up step after vsp (vintage sparse PCA) and touches on several technical issues regarding its implementation.
Methodology
For simplicity, we consider a symmetric square input matrix (e.g., the adjacency matrix of an undirected graph); the analysis of rectangular input is also supported by bff(). Given a data matrix $A \in \mathbb{R}^{n \times n}$, vsp returns an approximation with factorization, $\hat{A} = Z B Y^\top$, where $\hat{A}$ is low-rank (of rank $k$), and $Y \in \mathbb{R}^{n \times k}$ encodes the loadings of each feature (i.e., columns of $A$) with respect to clusters. In particular, when $A$ is the adjacency matrix of an undirected block model graph, each row of $Y$ decodes the block (cluster) membership of the vertex (feature). Generally, the loading $Y_{ij}$ (for $i = 1, \dots, n$ and $j = 1, \dots, k$) can be interpreted as a membership measure of the $i$-th feature to the $j$-th cluster.
Now, suppose in addition that we have covariates on each feature, $X \in \mathbb{R}^{n \times p}$, where $p$ is the dimension of the covariates. For example, $X$ can be a document-term matrix, where all text data associated with the $i$-th feature (for $i = 1, \dots, n$) are pooled into a meta document. In this case, $p$ is the size of the corpus (i.e., the total number of words/terms), and $X_{il}$ is the frequency of word $l$ (for $l = 1, \dots, p$) in the $i$-th document.
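To make the document-term matrix concrete, here is a minimal base R sketch that builds one from a few made-up meta documents. The documents, the whitespace tokenizer, and the names docs, tokens, vocab, and X are illustrative assumptions, not part of vsp; real text would usually need more careful tokenization.

```r
# Hypothetical meta documents, one per feature (e.g., all text pooled per user).
docs <- c(
  u1 = "spectral clustering of sparse graphs",
  u2 = "sparse graphs and sparse matrices",
  u3 = "cooking pasta with tomato sauce"
)

# Tokenize on whitespace after lower-casing (a deliberately naive tokenizer).
tokens <- strsplit(tolower(docs), "\\s+")

# Vocabulary: the sorted set of distinct terms; p is its length.
vocab <- sort(unique(unlist(tokens)))

# X[i, l] counts how often term l appears in meta document i.
X <- t(vapply(
  tokens,
  function(tk) as.numeric(table(factor(tk, levels = vocab))),
  numeric(length(vocab))
))
dimnames(X) <- list(names(docs), vocab)

dim(X)             # n rows (features) by p columns (terms)
X["u2", "sparse"]  # count of "sparse" in the second meta document
```

Each row of X is one pooled document and each column one term, which matches the n-by-p layout assumed in the rest of this section.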
BFF then uses the loadings $Y$ and the covariates $X$ to assess which covariates are “best” for each cluster. To start with, suppose both $Y$ and $X$ have only non-negative entries. Define the importance, $I_{lj}$, of the $l$-th covariate to the $j$-th cluster as the average of the $l$-th covariate (the $l$-th column of $X$), weighted by the $j$-th column of $Y$,
$$I_{lj} = \sum_{i=1}^{n} X_{il} Y_{ij},$$
or compactly (side note: the cross product $\langle X, Y \rangle$ is defined as $X^\top Y$, as is conventional),
$$I = \langle X, Y \rangle \in \mathbb{R}^{p \times k}.$$
As such, a higher value of $I_{lj}$ indicates greater importance. BFF selects the “best” covariates for the $j$-th cluster (for $j = 1, \dots, k$) according to the $j$-th column of $I$.
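The importance calculation can be sketched in a few lines of base R. The toy loadings Y and covariate counts X below are made up for illustration, and the sketch covers only the non-negative case described above, not the full bff() implementation.

```r
# Hypothetical non-negative loadings (n = 6 features, k = 2 clusters):
# rows 1-3 load on cluster 1, rows 4-6 on cluster 2.
Y <- cbind(c(1, 1, 1, 0, 0, 0), c(0, 0, 0, 1, 1, 1))

# Hypothetical covariate counts (n = 6 features, p = 4 terms).
X <- rbind(
  c(3, 0, 1, 0),
  c(2, 0, 1, 0),
  c(4, 1, 1, 0),
  c(0, 5, 1, 0),
  c(0, 3, 1, 1),
  c(1, 4, 1, 0)
)
colnames(X) <- c("graph", "pasta", "the", "rare")

# Importance matrix: crossprod(X, Y) computes t(X) %*% Y, a p x k matrix.
I <- crossprod(X, Y)

# For each cluster (column), list the two covariates with the largest importance.
top2 <- apply(I, 2, function(col) rownames(I)[order(col, decreasing = TRUE)[1:2]])
I
top2
```

In this toy data the term spread evenly over all rows ("the") ranks second in both clusters under the raw score, since the unadjusted cross product does not penalize covariates that are equally common outside the cluster.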
Implementation
Below are a few notes on the implementation of BFF:
Positive skewness. When the covariate matrix $X$ is a document-term matrix (a.k.a. bag of words), all of its elements are non-negative. However, there is no guarantee that the loading matrix $Y$ has only non-negative entries. This motivates the positive-skew transformation, i.e., we flip the signs of those columns of $Y$ that have a negative sample third moment.
Handling negative elements. For now, we adopt a rather ad hoc solution to the existence of negative elements in the loading matrix $Y$: pretending they have little effect. In the above importance calculation, negative weights ($Y_{ij} < 0$) are all treated equally as $-1$. In other words, the negative elements result in a subtraction from (a reduction of) the importance metric.
Weight normalization. In BFF, we use the loading matrix $Y$ to weight the covariates (in $X$). It is therefore natural to expect the columns of $Y$ to be (akin to) probability distributions, i.e., the probability of selecting one member from the cluster at random. Recall, however, that the columns of $Y$ all have unit (or close to unit) $\ell_2$-norm. Hence, an additional transformation is needed: we normalize $Y$ by column so that the weights in each column sum to one. In particular, this is done separately for the positive and the negative elements.
Variance stabilization. If we model $I_{lj}$ with a Poisson rate model, $\mathrm{Poisson}(\lambda_{lj})$, the mean and variance are coupled (i.e., both equal $\lambda_{lj}$). In order to standardize our importance measure $I$, we need to decouple these two statistics. Performing a square-root transformation, $f(x) = \sqrt{x}$, does the trick; it stabilizes the sampling variance, which becomes nearly constant (approximately $1/4$ for a Poisson count with a large rate).
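The four notes above can be combined into a rough base R sketch. It is not the package's implementation: the helper names (third_moment, make_skew_positive, split_normalize, importance_sketch) are hypothetical, and applying the square root to the in-cluster and out-of-cluster parts before subtracting is an assumption made here so that the square root is never taken of a negative number.

```r
# Sample third central moment; its sign is the sign of the sample skewness.
third_moment <- function(v) mean((v - mean(v))^3)

# Positive skewness: flip the sign of columns with a negative third moment.
make_skew_positive <- function(Y) {
  signs <- ifelse(apply(Y, 2, third_moment) < 0, -1, 1)
  sweep(Y, 2, signs, `*`)
}

# Negative elements and weight normalization: keep positive loadings as
# weights, treat all negative loadings equally, and rescale each part so
# that every column sums to one.
split_normalize <- function(Y) {
  pos <- (Y > 0) * Y
  neg <- (Y < 0) * 1
  l1 <- function(M) sweep(M, 2, pmax(colSums(M), .Machine$double.eps), `/`)
  list(pos = l1(pos), neg = l1(neg))
}

# Variance stabilization: square roots of the two weighted averages
# (assumed order of operations), then subtract the out-of-cluster part.
importance_sketch <- function(Y, X) {
  w <- split_normalize(make_skew_positive(Y))
  sqrt(crossprod(X, w$pos)) - sqrt(crossprod(X, w$neg))
}

# Toy data: the first column of Y is negatively skewed and gets flipped.
Y <- cbind(c(-0.7, -0.7, 0.1, 0.1, 0.1, 0.1),
           c(-0.1, -0.1, -0.1, -0.1, 0.7, 0.7))
X <- cbind(graph = c(4, 4, 0, 0, 0, 0),
           pasta = c(0, 0, 0, 0, 4, 4),
           the   = c(2, 2, 2, 2, 2, 2))
S <- importance_sketch(Y, X)
S

# Simulation check of the square-root transform on Poisson counts:
# the raw variance grows with the rate, the transformed one stays near 1/4.
set.seed(42)
vs <- sapply(c(10, 50, 200), function(lambda) {
  x <- rpois(1e5, lambda)
  c(raw = var(x), sqrt = var(sqrt(x)))
})
vs
```

In the toy output, the cluster-specific terms receive positive scores in their own cluster and negative scores in the other, while the term that is equally common everywhere ("the") scores zero in both.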