Approximate personalized pageranks

Computes the personalized pagerank for specified seeds using the ApproximatePageRank algorithm of Andersen et al. (2006). Computes degree-adjustments and degree-regularization of personalized pagerank vectors as described in Algorithms 3 and 4 of Chen et al. (2019). These algorithms are randomized; if results are unstable across multiple runs, decrease epsilon.

Usage

appr(
  graph,
  seeds,
  ...,
  alpha = 0.15,
  epsilon = 1e-06,
  tau = NULL,
  max_visits = Inf
)

# S3 method for igraph
appr(graph, seeds, ...)

# S3 method for rtweet_graph
appr(graph, seeds, ...)

Arguments

graph

An abstract_graph() object, such as that created by rtweet_graph(). This argument is required.

seeds

A character vector of seeds for the personalized pagerank. The personalized pagerank will return to each of these seeds with probability alpha at each node transition. At the moment, all seeds are given equal weighting. This argument is required.

...

Ignored. Passing arguments to ... results in a warning.

alpha

Teleportation constant. The teleportation constant is the probability of returning to a seed node at each node transition. alpha must be a valid probabilty; that is, between zero and one. Defaults to 0.15. This is the inverse of the "dampening factor" in the original PageRank paper, so alpha = 0.15 corresponds to a dampening factor of 0.85. Runtime is proportional to 1 / (epsilon * alpha), so small alpha can result in long runtimes.

epsilon

Desired accuracy of approximation. epsilon must be a small positive number. Defaults to 1e-6. aPPR guarantees that approximated personalized pageranks are uniformly within epsilon of their true value. That is, the approximation is guaranteed to be good in an L-infinity sense. This does not guarantee, however, that a ranking of nodes by aPPR is close to a ranking of nodes by PPR.

For Twitter graphs, we recommend testing your code with 1e-4 or 1e-5, using 1e-6 for exploration, and 1e-7 to 1e-8 for final results, although these numbers are very rough. It also perfectly reasonable to run aPPR for a given number of steps (set via max_visits), and then note the approximation accuracy of your results. Internally, aPPR keeps a running estimate of achieved accuracy that is always valid.

Anytime you would like to explore more of the graph, you can simply decrease epsilon. So you can start with epsilon = 1e-5 and then gradually decrease epsilon until you have a sample of the graph that you are happy with.

Also note that runtime is proportional to 1 / (epsilon * alpha), so small epsilon can result in long runtimes.

tau

Regularization term. Additionally inflates the in-degree of each observation by this term by performing the degree adjustment described in Algorithm 3 and Algorithm 4, which are described in vignette("Mathematical details"). Defaults to NULL, in which case tau is set to the average in-degree of the observed nodes. In general, setting it's reasonable to set tau to the average in-degree of the graph.

max_visits

Maximum number of unique nodes to visit. Should be a positive integer. Defaults to Inf, such that there is no upper bound on the number of unique nodes to visit. Useful when you want to specify a fixed amount of computation (or API calls) to use rather than an error tolerance. We recommend debugging with max_visits ~ 20, exploration with max_visits in the hundreds, and max_visits in the thousands to ten of thousands for precise results, although this is a very rough heuristic.

Value

A Tracker() object. Most relevant is the stats field, a tibble::tibble() with the following columns:

name: Name of a node (character).
p: Current estimate of residual per out-degree for a node.
r: Estimated error of pagerank estimate for a node.
in_degree: Number of incoming edges to a node.
out_degree: Number of outcoming edges from a node.
degree_adjusted: The personalized pagerank divided by the node in-degree.
regularized: The personalized pagerank divide by the node in-degree plus tau.

When computing personalized pageranks for Twitter users (either via rtweet_graph(), name is given as a user ID, not a screen name, regardless of how the seed nodes were specified.

References

Chen, Fan, Yini Zhang, and Karl Rohe. “Targeted Sampling from Massive Block Model Graphs with Personalized PageRank.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82, no. 1 (February 2020): 99–126. https://doi.org/10.1111/rssb.12349.
Andersen, Reid, Fan Chung, and Kevin Lang. “Local Graph Partitioning Using PageRank Vectors.” In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), 475–86. Berkeley, CA, USA: IEEE, 2006. https://doi.org/10.1109/FOCS.2006.44.

Examples


library(aPPR)
library(igraph)
#> 
#> Attaching package: ‘igraph’
#> The following object is masked from ‘package:aPPR’:
#> 
#>     neighborhood
#> The following objects are masked from ‘package:stats’:
#> 
#>     decompose, spectrum
#> The following object is masked from ‘package:base’:
#> 
#>     union

set.seed(27)

graph <- rtweet_graph()

if (FALSE) {
appr(graph, "alexpghayes")
}

graph2 <- sample_pa(100)

# this creates a Tracker object
ppr_results <- appr(graph2, seeds = "5")

# the portion of the Tracker object you probably care about
ppr_results$stats
#> # A tibble: 2 × 7
#>   name  regularized     p in_degree out_degree degree_adjusted           r
#>   <chr>       <dbl> <dbl>     <dbl>      <dbl>           <dbl>       <dbl>
#> 1 5          0.0300 0.150         2          1          0.0750 0.000000833
#> 2 4          0.0182 0.127         4          1          0.0319 0.000000667