Sampling methods and estimation of triangle count distributions in large networks

Published online by Cambridge University Press:  26 February 2021

Nelson Antunes*
Affiliation:
Center for Computational and Stochastic Mathematics, University of Lisbon, Avenida Rovisco Pais 1049-001, Lisbon, Portugal; University of Algarve, Faro, Portugal
Tianjian Guo
Affiliation:
Department of Statistics and Operations Research, University of North Carolina, CB 3260, Chapel Hill, NC 27599, USA (e-mails: Tianjian.Guo@mccombs.utexas.edu, pipiras@email.unc.edu)
Vladas Pipiras
Affiliation:
Department of Statistics and Operations Research, University of North Carolina, CB 3260, Chapel Hill, NC 27599, USA (e-mails: Tianjian.Guo@mccombs.utexas.edu, pipiras@email.unc.edu)
*Corresponding author. Email: nantunes@ualg.pt

Abstract

This paper investigates the distributions of triangle counts per vertex and edge, as a means of network description, analysis, model building, and other tasks. The main interest is in estimating these distributions through sampling, especially for large networks. A novel sampling method tailored for the estimation analysis is proposed, with three sampling designs motivated by several network access scenarios. An estimation method based on inversion and an asymptotic method are developed to recover the entire distribution. A single method to estimate the distribution using multiple samples is also considered. Algorithms are presented to sample the network under the various access scenarios. Finally, the estimation methods are evaluated on synthetic and real-world networks in a data study.
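To make the quantities of interest concrete, the following minimal sketch computes the triangle counts per vertex and per edge of a small, fully observed graph, along with their empirical distributions. It is an illustration only and is not the sampling or estimation method of the paper. The function names `triangle_counts` and `distribution` are hypothetical, and the graph is assumed to be simple and undirected (no self-loops or repeated edges), with vertex labels that can be sorted.

```python
from collections import Counter


def triangle_counts(edges):
    """Return (per_vertex, per_edge) triangle counts.

    Assumes a simple undirected graph given as a list of (u, v) pairs.
    """
    # Build adjacency sets.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Triangles containing edge (u, v) = number of common neighbours of u and v.
    per_edge = {tuple(sorted((u, v))): len(adj[u] & adj[v]) for u, v in edges}

    # Every triangle at a vertex uses exactly two of that vertex's edges,
    # so summing edge counts over incident edges double-counts each triangle.
    per_vertex = {v: 0 for v in adj}
    for (u, v), t in per_edge.items():
        per_vertex[u] += t
        per_vertex[v] += t
    per_vertex = {v: t // 2 for v, t in per_vertex.items()}
    return per_vertex, per_edge


def distribution(counts):
    """Empirical distribution: fraction of vertices (or edges) with k triangles."""
    n = len(counts)
    freq = Counter(counts.values())
    return {k: freq[k] / n for k in sorted(freq)}


# Toy graph: one triangle (1, 2, 3) plus a pendant edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
per_vertex, per_edge = triangle_counts(edges)
vertex_dist = distribution(per_vertex)
edge_dist = distribution(per_edge)
print(vertex_dist)
print(edge_dist)
```

In the setting of the abstract the full graph is not available, so distributions like `vertex_dist` and `edge_dist` would be the estimation targets rather than quantities computed directly.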

Type
Research Article
Copyright
© The Author(s), 2021. Published by Cambridge University Press

Footnotes

Action Editor: Hocine Cherifi
