Invited Speakers
Tiago Peixoto, Bremen - Hierarchical Block Structures and High-Resolution Model Selection in Large Networks
David Choi, Carnegie Mellon - Consistency of Co-clustering exchangeable graph data
Daniel Roy, Cambridge - Probabilistic symmetries and statistical models of graphs
JP Onnela, Harvard - Detecting change-points in correlation-based networks
Johan Ugander, Cornell - Graph cluster randomization: design and analysis for experiments in networks
Simon Lunagomez, Harvard - Valid Inference from Non-Ignorable Network Sampling Designs
Tiago Peixoto, Bremen - Hierarchical Block Structures and High-Resolution Model Selection in Large Networks
Many social, technological, and biological networks are composed of
modules, representing groups of nodes that are assumed to have a similar
role in the functioning of the network. The use of statistical
generative models that formally characterize this modular structure is a
powerful tool that allows the development of well-defined and principled
techniques to extract such information from empirical data. Perhaps
the most popular example is the stochastic block model, which
allows the detection of arbitrary mixing patterns, and has been extended
to include many other properties such as arbitrary degree variability,
multiple group memberships, edge weights, etc. However, even such
well-motivated techniques may possess pitfalls, especially for very large
networks. In particular, when performing model selection for the
stochastic block model via minimum description length, a strong
resolution limit is encountered if simple noninformative prior
assumptions are made, where the minimum size of detectable groups scales
with the square root of the total number of nodes, regardless of how
well-defined the modular structures appear to be. In this talk, I
present a hierarchical generalization of the stochastic block model
which lifts this limitation, while at the same time remaining fully
nonparametric. Even with its increased resolution, the method remains
grounded in the principle of parsimony and is capable of separating signal
from noise, so it does not identify spurious modules even in sparse
networks. Besides serving as a solution to this
practical problem, the model provides the entire multilevel structure of
the data, through a complete description of the entire network hierarchy
at multiple scales. Furthermore, it generalizes other approaches in that
it is not restricted to purely assortative mixing patterns, to either
directed or undirected graphs, or to ad hoc hierarchical structures such
as binary trees. Despite its general character, the approach is
tractable, and yields an efficient algorithm which scales well for very
large networks.
I illustrate the application of the method on several empirical networks, including, among others, the whole internet at the autonomous systems level, and a bipartite network of actors and films with over 10^6 edges.
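As a pointer for practitioners: the nested stochastic block model described here is implemented in Peixoto's graph-tool Python library. The sketch below shows the basic fitting workflow, assuming a current graph-tool installation; the bundled example dataset is an illustrative choice, and any graph-tool graph would work in its place.

```python
# Minimal sketch: fitting the hierarchical (nested) stochastic block model
# with graph-tool, which selects the number of groups at every level of the
# hierarchy by minimizing the description length.
import graph_tool.all as gt

# Example graph from graph-tool's built-in collection (illustrative choice).
g = gt.collection.data["celegansneural"]

# Fit the nested SBM; the hierarchy depth and group counts are inferred
# nonparametrically rather than fixed in advance.
state = gt.minimize_nested_blockmodel_dl(g)

# Description length (in nats): the model selection criterion itself.
print("description length:", state.entropy())

# Group sizes at each level of the inferred hierarchy.
state.print_summary()
```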
David Choi, Carnegie Mellon - Consistency of Co-clustering exchangeable graph data
We analyze the problem of partitioning a 0-1 array or
bipartite graph into subgroups (also known as co-clustering), under the
relatively mild assumption that the data are generated by a general
nonparametric process. This problem can be thought of as co-clustering
under model misspecification; we show that the additional error due to
misspecification can be bounded by O(n^(-1/4)). Our result suggests
that under certain sparsity regimes, community detection algorithms
may be robust to modeling assumptions, and that their usage is
analogous to the usage of histograms in exploratory data analysis.
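To make the problem setup concrete, here is a small sketch of co-clustering a 0-1 array with a planted block structure. It uses scikit-learn's SpectralCoclustering as a convenient stand-in, not the estimator analyzed in the talk, and the block sizes and densities are illustrative assumptions.

```python
# Sketch: co-clustering a binary (0-1) array into row and column groups.
# SpectralCoclustering is a stand-in method, not the talk's estimator.
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)

# Planted structure: 2 row groups x 2 column groups with different
# densities, observed as a noisy 0-1 array.
row_groups = np.repeat([0, 1], 100)
col_groups = np.repeat([0, 1], 100)
density = np.array([[0.30, 0.05],
                    [0.05, 0.25]])
probs = density[np.ix_(row_groups, col_groups)]
X = (rng.random(probs.shape) < probs).astype(int)

model = SpectralCoclustering(n_clusters=2, random_state=0)
model.fit(X)

# Recovered row/column partitions (up to label permutation).
print(model.row_labels_[:10], model.column_labels_[:10])
```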
Daniel Roy, Cambridge - Probabilistic symmetries and statistical models of graphs
I will discuss the challenge of performing statistical inference in sparse graphs from the perspective of probabilistic symmetries, and give some suggestions as to where some progress might be made.
JP Onnela, Harvard - Detecting change-points in correlation-based networks
We consider networks whose structures are inferred from the temporal correlations between the observed characteristics of their nodes. In these types of correlation-based networks, it is often of interest to infer meso-scale changes in the underlying correlation structures. Existing methods for detecting such change-points often require a range of different assumptions. We propose a simple algorithm for detecting change-points that requires no distributional or modeling assumptions. We then demonstrate its application to evolving synthetic and empirical correlation-based networks.
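The abstract does not specify the algorithm, so the following is only a generic baseline in the same assumption-free spirit: compute correlation matrices over sliding windows and score the change between adjacent windows. The window length, the Frobenius-norm statistic, and the synthetic data are all illustrative choices, not the authors' method.

```python
# Sketch of a distribution-free change-point heuristic for
# correlation-based networks: correlate node time series in sliding
# windows and score the change between consecutive windows.
import numpy as np

def correlation_change_scores(ts, window=50):
    """ts: (T, n) array of node time series. Returns one score per
    window boundary: the Frobenius distance between the correlation
    matrices of adjacent windows."""
    T, n = ts.shape
    scores = []
    for start in range(0, T - 2 * window, window):
        c1 = np.corrcoef(ts[start:start + window].T)
        c2 = np.corrcoef(ts[start + window:start + 2 * window].T)
        scores.append(np.linalg.norm(c1 - c2, ord="fro"))
    return np.array(scores)

# Synthetic example: 10 series whose correlation structure switches
# halfway through the observation period.
rng = np.random.default_rng(1)
T, n = 600, 10
common1 = rng.standard_normal((T // 2, 1))
common2 = rng.standard_normal((T // 2, 1))
ts = rng.standard_normal((T, n))
ts[:T // 2, :5] += 2 * common1      # first regime: nodes 0-4 co-move
ts[T // 2:, 5:] += 2 * common2      # second regime: nodes 5-9 co-move

scores = correlation_change_scores(ts)
print("peak change score at window boundary", scores.argmax())
```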
Johan Ugander, Cornell - Graph cluster randomization: design and analysis for experiments in networks
A/B testing is a standard approach for evaluating the effect of online experiments; the goal is to estimate the Average Treatment Effect (ATE) of a new feature or condition by exposing a sample of the overall population to it. A drawback of A/B testing is that it is poorly suited for experiments involving social interference, when the treatment of individuals spills over to neighboring individuals along an underlying social network. In this work, we propose a novel methodology for using graph clustering to analyze average treatment effects under social interference. To begin, we characterize graph-theoretic conditions under which individuals can be considered to be 'network exposed' to an experiment. We then show how clustered graph randomization admits an efficient exact algorithm to compute the probabilities for each vertex being network exposed under several of these exposure conditions, allowing for a straightforward effect estimator using inverse probability weights. This estimator is also unbiased assuming that the exposure model has been properly specified. Given this framework for graph cluster randomization, we analyze the variance of the estimator as a property of cluster design, and the bias/variance trade-offs of the estimator under simulated exposure model misspecifications. Our analysis of the variance includes a novel clustering algorithm for which the variance of the estimator is at most linear in the degrees of the graph, addressing an important challenge. Our analysis of misspecifications highlights when clustering appears to be strongly favorable: when the network has a sufficiently clustered structure, and when social interference is sufficiently strong. We further illustrate our method with real experiments, including a large experiment on Facebook on Thanksgiving Day 2012.
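A minimal sketch of the overall pipeline, under simplifying assumptions: clusters are assigned to treatment independently with probability 1/2, exposure is the 'full neighborhood' condition (a vertex and all its neighbors treated), and the clustering and outcome model are placeholders. Under this design the exposure probability of a vertex is exactly 0.5 raised to the number of distinct clusters touching its closed neighborhood, which yields a Horvitz-Thompson style estimate.

```python
# Sketch of graph cluster randomization with inverse-probability
# weighting under the "full neighborhood" exposure condition.
import numpy as np
import networkx as nx
from networkx.algorithms import community

rng = np.random.default_rng(2)
g = nx.watts_strogatz_graph(1000, 6, 0.1, seed=2)

# Placeholder clustering: greedy modularity communities.
clusters = {}
for i, c in enumerate(community.greedy_modularity_communities(g)):
    for v in c:
        clusters[v] = i
n_clusters = max(clusters.values()) + 1

# Randomize treatment at the cluster level, independently w.p. 1/2.
treated_cluster = rng.random(n_clusters) < 0.5

ate_terms = []
for v in g.nodes:
    # Distinct clusters touching v's closed neighborhood.
    cs = {clusters[u] for u in [v, *g.neighbors(v)]}
    p = 0.5 ** len(cs)  # exact exposure probability under this design
    fully_treated = all(treated_cluster[c] for c in cs)
    fully_control = all(not treated_cluster[c] for c in cs)
    # Placeholder potential outcomes: baseline noise plus a unit
    # treatment effect when fully treatment-exposed.
    y = (1.0 if fully_treated else 0.0) + rng.standard_normal() * 0.1
    if fully_treated:
        ate_terms.append(y / p)
    elif fully_control:
        ate_terms.append(-y / p)

# Horvitz-Thompson style ATE estimate; the true effect here is 1.0.
print("ATE estimate:", sum(ate_terms) / g.number_of_nodes())
```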
Simon Lunagomez, Harvard - Valid Inference from Non-Ignorable Network Sampling Designs
Consider a population whose subjects are susceptible to a disease (e.g., AIDS). The objective is to perform inference on a population quantity (such as the incidence of HIV in a high-risk subpopulation, e.g., intravenous drug users) via sampling mechanisms based on a social network (link-tracing designs, respondent-driven sampling). We phrase this problem in terms of the framework proposed by Rubin for making inferences on a population quantity and, within this context, prove that respondent-driven sampling is non-ignorable. By non-ignorable we mean that the uncertainty of the sampling mechanism must be modeled in the likelihood in order to obtain valid inferences. We develop a general framework for making Bayesian inferences on the population quantity that models the uncertainty in the underlying social network, incorporates dependence among the individual responses according to the network, and deals with the non-ignorability of the sampling design. The framework is general in the sense that it allows a wide range of specifications for each of these model components. Our model is compared with state-of-the-art methods in simulation studies and is applied to real data.
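To see intuitively why respondent-driven sampling can be non-ignorable, here is a toy simulation, not the Bayesian framework from the talk: on a homophilous network, a referral chain seeded among trait carriers oversamples them, so the naive sample mean is biased for the population prevalence. The network parameters and the referral rule are illustrative assumptions.

```python
# Toy illustration of why respondent-driven sampling (RDS) can be
# non-ignorable: the chain oversamples the region where seeds live.
import numpy as np
import networkx as nx

rng = np.random.default_rng(3)

# Two homophilous blocks: trait carriers (block 0) link mostly to each
# other; population trait prevalence is 20%.
sizes = [200, 800]
p = [[0.05, 0.002], [0.002, 0.02]]
g = nx.stochastic_block_model(sizes, p, seed=3)
trait = np.array([1] * 200 + [0] * 800)

def rds_sample(g, seed_node, n_waves=4, coupons=3):
    """Referral chains: each recruit refers up to `coupons` random
    unsampled neighbors, for `n_waves` waves."""
    sampled, frontier = {seed_node}, [seed_node]
    for _ in range(n_waves):
        next_frontier = []
        for v in frontier:
            nbrs = [u for u in g.neighbors(v) if u not in sampled]
            rng.shuffle(nbrs)
            for u in nbrs[:coupons]:
                sampled.add(u)
                next_frontier.append(u)
        frontier = next_frontier
    return list(sampled)

# Seed the chain inside the high-prevalence block.
sample = rds_sample(g, seed_node=0)
print("population prevalence:", trait.mean())          # 0.20
print("naive RDS estimate:   ", trait[sample].mean())  # typically >> 0.20
```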