Algorithmic Gems in the Data Miner’s Cave

  • Paolo Boldi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8496)


When I was younger and spent most of my time playing in the field of (more) theoretical computer science, I used to think of data mining as an uninteresting kind of game: I thought that area was a wild jungle of ad hoc techniques with no flesh to sink my teeth into. The truth is, I immediately become kind of skeptical when I see a lot of money flying around: my communist nature pops out and I start seeing flaws everywhere.

I was an idealist, back then, which is good. But in that specific case, I was simply wrong. You may say that I am trying to convince myself just because my soul has been sold already (and they didn’t even give me the thirty pieces of silver they promised, btw). Nonetheless, I will try to offer you evidence that there are some gems, out there in the data miner’s cave, that you yourself may appreciate.

Who knows? Maybe you will decide to sell your soul to the devil too, after all.


Data Miner, Hash Function, Distance Distribution, Target Person, Social Graph
These keywords were added by machine and not by the authors.





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Paolo Boldi (1)
  1. Dipartimento di Informatica, Università degli Studi di Milano, Milano, Italy
