
Intro

In post-clustering analysis, the Best Feature Function (BFF) is useful for selecting representative features for each cluster, especially when additional covariates are available for each feature. For example, consider a social network of $n$ users partitioned into $k$ clusters, where each user possesses a collection of text documents (covariates). We want to summarize the words that are representative of each cluster. The BFF is suitable for this type of task.

This document describes the intuition behind the BFF as a follow-up step after vsp (vintage sparse PCA) and touches on several technical issues regarding implementation.

Methodology

For simplicity, we consider a symmetric square input matrix (e.g., the adjacency matrix of an undirected graph); rectangular input is also supported by bff(). Given a data matrix $A \in \mathbb{R}^{n \times n}$, vsp returns an approximate factorization $Z B Y^T$, where $B \in \mathbb{R}^{k \times k}$ is low-rank and $Y \in \mathbb{R}^{n \times k}$ encodes the loadings of each feature (i.e., column of $A$) with respect to the clusters. In particular, when $A$ is the adjacency matrix of an undirected block model graph, each row of $Y$ encodes the block (cluster) membership of the corresponding vertex (feature). More generally, the loading $Y_{ij}$ (for $i=1,\dots,n$ and $j=1,\dots,k$) can be interpreted as a measure of the membership of the $i$-th feature in the $j$-th cluster.
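
As a concrete illustration, the sketch below obtains $Y$ from a simulated undirected block model graph. It is a minimal example assuming the vsp and igraph packages; the simulated graph and its parameters are purely illustrative.

``` r
library(vsp)
library(igraph)

set.seed(27)

# simulate a 3-block undirected stochastic block model graph
g <- sample_sbm(
  n           = 300,
  pref.matrix = matrix(0.02, 3, 3) + diag(0.18, 3),
  block.sizes = c(100, 100, 100)
)
A <- as_adjacency_matrix(g)   # 300 x 300 adjacency matrix (sparse by default)

fa <- vsp(A, rank = 3)        # approximate factorization A ~ Z B Y^T
dim(fa$Y)                     # 300 x 3 matrix of feature-by-cluster loadings
```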

Now, suppose that in addition we have covariates on each feature, $D \in \mathbb{R}^{n \times p}$, where $p$ is the dimension of the covariates. For example, $D$ can be a document-term matrix: all text data associated with the $i$-th feature (for $i=1,\dots,n$) are pooled into a meta document, $p$ is the size of the vocabulary (i.e., the total number of distinct words/terms), and $D_{il}$ is the frequency of word $l$ (for $l=1,\dots,p$) in the $i$-th meta document.
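
For concreteness, here is a toy sketch of assembling such a $D$ with base R; the three meta documents below are made up purely for illustration:

``` r
# one pooled "meta document" per feature (toy data for illustration)
docs <- c(
  "network community detection community",
  "sparse matrix factorization",
  "community structure in sparse network"
)

tokens <- strsplit(docs, "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# D[i, l] = frequency of word l in the i-th meta document
D <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
colnames(D) <- vocab
D
```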

The BFF then uses $Y$ and $D$ to assess which covariates are “best” for each cluster. To start, suppose both $Y$ and $D$ have only non-negative entries. Define the importance, $I \in \mathbb{R}^{p \times k}$, of the $l$-th covariate for the $j$-th cluster as the average of the $l$-th covariate (the $l$-th column of $D$), weighted by the $j$-th column of $Y$,

$$I_{lj} = \sum_{i=1}^n D_{il} Y_{ij}, \quad \text{for } l=1,\dots,p \text{ and } j=1,\dots,k,$$

or, compactly (side note: the cross product $\langle D, Y \rangle$ is defined as $D^T Y$, as is conventional),

$$I = \langle D, Y \rangle.$$

As such, a larger value of $I_{lj}$ indicates greater importance. BFF selects the “best” covariates for each cluster according to the $j$-th column of $I$ (for $j=1,\dots,k$).
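
In R, this amounts to a single cross product. A minimal sketch, assuming an $n \times p$ matrix D and an $n \times k$ loading matrix Y as above (and ignoring, for the moment, the sign and normalization issues discussed in the next section):

``` r
# importance of each covariate (rows) for each cluster (columns): I = t(D) %*% Y
importance <- crossprod(D, Y)   # p x k matrix

# the "best" covariates for, say, cluster 1
head(colnames(D)[order(importance[, 1], decreasing = TRUE)])
```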

Implementation

Below are a few notes on the implementation of BFF:

  • Positive skewness. When $D$ is a document-term matrix (a.k.a. bag of words), all of its elements are non-negative. However, there is no guarantee that $Y$ has all non-negative entries. This motivates the positive-skew transformation: we flip the signs of those columns of $Y$ that have a negative sample third moment.

  • Handling negative elements. For now, we adopt a rather ad hoc solution to the presence of negative elements in $Y$: we treat them as if they have little effect. In the importance calculation above, negative weights ($Y_{ij} < 0$) are handled the same way as positive ones but enter with a factor of $-1$. In other words, the negative elements subtract from (contract) the importance metric.

  • Weight normalization. In BFF, we use the $Y$ matrix to weight the covariates (in $D$). It is therefore natural to expect the columns of $Y$ to behave like probability distributions, i.e., the probability of selecting a member of the cluster at random. Recall, however, that the columns of $Y$ all have (approximately) unit $\ell_2$-norm. Hence, an additional transformation is needed: we normalize $Y$ by column. In particular, this is done separately for the positive and negative elements.

  • Variance stabilization. If we model $I_{lj}$ with a Poisson rate model, $\text{Poisson}(\lambda_{lj})$, the sample mean and variance are coupled (i.e., both have expectation $\lambda_{lj}$). In order to standardize our importance measure $I$, we need to decouple these two statistics. A square-root transformation, $f(x)=\sqrt{x}$, does the trick: it stabilizes the sampling variance, which becomes nearly constant. A simplified sketch combining these steps appears after this list.
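
Putting the notes above together, a simplified sketch of the computation might look as follows. This illustrates the steps described in this section; it is not the actual bff() implementation, and the function name bff_sketch and its internals are made up.

``` r
# simplified sketch of the BFF steps (not the actual bff() source)
bff_sketch <- function(Y, D) {
  # 1. positive skewness: flip columns of Y with a negative sample third moment
  skew_sign <- sign(colSums(Y^3))
  Y <- sweep(Y, 2, ifelse(skew_sign == 0, 1, skew_sign), `*`)

  # 2. split Y into its positive part and the magnitude of its negative part
  Y_pos <- pmax(Y, 0)
  Y_neg <- pmax(-Y, 0)

  # 3. weight normalization: rescale each column to sum to one (akin to a
  #    probability distribution), separately for positive and negative parts
  norm_cols <- function(M) sweep(M, 2, pmax(colSums(M), .Machine$double.eps), `/`)
  Y_pos <- norm_cols(Y_pos)
  Y_neg <- norm_cols(Y_neg)

  # 4. variance stabilization via square root; negative weights subtract
  sqrt(crossprod(D, Y_pos)) - sqrt(crossprod(D, Y_neg))
}

# usage: with an n x k loading matrix Y and an n x p covariate matrix D,
# bff_sketch(Y, D) returns a p x k importance matrix
```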