The VLDB Journal, Volume 27, Issue 3, pp 297–320

Spatio-textual user matching and clustering based on set similarity joins

  • Alexandros Belesiotis
  • Dimitrios Skoutas
  • Christodoulos Efstathiades
  • Vassilis Kaffes
  • Dieter Pfoser
Regular Paper


This paper addresses the problem of matching and clustering users based on their geolocated posts. Two individual posts match if they satisfy both a spatial distance threshold and a textual similarity threshold. The similarity of two users is then defined as the ratio of their posts that match each other. Based on these criteria, we introduce efficient algorithms for identifying pairs of matching users in a large dataset, as well as for computing the top-k matching pairs. We then proceed to identify spatio-textual user clusters, using the Louvain method for community detection. The clustering algorithms operate on a user graph whose edge weights represent spatio-textual user similarities. Since the exact user similarity graph can be prohibitively expensive to compute, we exploit the preceding matching algorithms to derive efficient methods that reduce execution time both by avoiding the computation of exact similarity scores and by reducing the number of similarity calculations performed. The presented solution allows a trade-off between computation time and the quality of the detected clusters. The proposed algorithms are evaluated on three real-world datasets.
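The matching criteria described in the abstract can be illustrated with a brute-force sketch. It is not the paper's algorithm, which is designed to avoid exactly this kind of exhaustive comparison. The sketch assumes that a post is a tuple `(x, y, keywords)`, that spatial distance is Euclidean, that textual similarity is Jaccard similarity over keyword sets, and that user similarity is the number of matched posts of both users divided by their total number of posts. All function names and the threshold names `eps_loc`, `eps_doc` and `eps_user` are hypothetical.

```python
from itertools import combinations
from math import hypot


def jaccard(a, b):
    """Jaccard similarity of two keyword sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def posts_match(p, q, eps_loc, eps_doc):
    """Two posts match if they are spatially close AND textually similar."""
    (px, py, pk), (qx, qy, qk) = p, q
    close = hypot(px - qx, py - qy) <= eps_loc  # assumed Euclidean distance
    similar = jaccard(pk, qk) >= eps_doc        # assumed Jaccard similarity
    return close and similar


def user_similarity(posts_u, posts_v, eps_loc, eps_doc):
    """Assumed definition: fraction of both users' posts that have at
    least one matching post on the other side."""
    total = len(posts_u) + len(posts_v)
    if total == 0:
        return 0.0
    matched_u = sum(
        any(posts_match(p, q, eps_loc, eps_doc) for q in posts_v)
        for p in posts_u
    )
    matched_v = sum(
        any(posts_match(q, p, eps_loc, eps_doc) for p in posts_u)
        for q in posts_v
    )
    return (matched_u + matched_v) / total


def similarity_graph(users, eps_loc, eps_doc, eps_user):
    """Exhaustively build the weighted user graph as {(u, v): similarity},
    keeping only edges whose weight reaches eps_user."""
    edges = {}
    for u, v in combinations(sorted(users), 2):
        s = user_similarity(users[u], users[v], eps_loc, eps_doc)
        if s >= eps_user:
            edges[(u, v)] = s
    return edges


# Toy data, invented for illustration only.
users = {
    "alice": [(0.0, 0.0, {"coffee", "museum"}), (5.0, 5.0, {"beach"})],
    "bob": [(0.1, 0.0, {"coffee", "museum", "art"}), (9.0, 9.0, {"hiking"})],
    "carol": [(50.0, 50.0, {"coffee"})],
}

graph = similarity_graph(users, eps_loc=0.5, eps_doc=0.5, eps_user=0.3)
print(graph)
```

On this toy input only the pair (alice, bob) survives: one post of each user matches, giving a similarity of 2/4. The resulting weighted edge set is the kind of user graph that a community detection method such as Louvain would then take as input. The quadratic cost of this exhaustive construction is what motivates the filtering and approximation techniques the abstract refers to.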


Keywords: Spatio-textual join; Set similarity join; Spatio-textual clustering



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. IMIS, R.C. Athena, Athens, Greece
  2. European University Cyprus, Nicosia, Cyprus
  3. George Mason University, Fairfax, USA
