User group analytics: hypothesis generation and exploratory analysis of user data

  • Regular Paper
The VLDB Journal

Abstract

User data is becoming increasingly available in multiple domains ranging from the social Web to retail store receipts. User data is described by user demographics (e.g., age, gender, occupation) and user actions (e.g., rating a movie, publishing a paper, following a medical treatment). The analysis of user data is appealing to scientists who work on population studies, online marketing, recommendations, and large-scale data analytics. User data analytics usually relies on identifying group-level behavior such as “Asian women who publish regularly in databases.” Group analytics addresses peculiarities of user data such as noise and sparsity to enable insights. In this paper, we introduce a framework for user group analytics by developing several components which cover the life cycle of user groups. We provide two different analytical environments to support “hypothesis generation” and “exploratory analysis” on user groups. Experiments on datasets with different characteristics show the usability and efficiency of our group analytics framework.

Notes

  1. http://www.wired.com/2014/04/forget-the-quantified-self-we-need-to-build-the-quantified-us/.

  2. Internet Movie Database: http://www.imdb.com.

  3. Jane Ciabattari: Why is Rumi the best-selling poet in the US? http://www.bbc.com/culture/story/20140414-americas-best-selling-poet.

  4. http://www.bookcrossing.com.

  5. Note that this is just the way we illustrate the concept and of course we do not make this concatenation on the user data in the database.

  6. http://www.imdb.com/title/tt0068646/ratings?ref_=tt_ov_rt.

  7. http://www.imdb.com/title/tt2322441/ratings?ref_=tt_ov_rt.

  8. http://zip4.usps.com.

  9. Based on Google Scholar: https://goo.gl/r4FaLh.

  10. http://dblp.uni-trier.de/db/.

  11. DM-Authors dataset: http://dx.doi.org/10.18709/PERSCIDO.2016.10.DS32.

  12. https://www.mturk.com/.

  13. Overall Non-dominated Vector Generation.

  14. http://choco-solver.org/?q=Choco3.

  15. http://blog.testmunk.com/how-teens-really-use-apps/.

  16. Long short-term memory networks.

References

  1. Amer-Yahia, S., Kleisarchaki, S., Kolloju, N.K., Lakshmanan, L.V.S., Zamar, R.H.: Exploring rated datasets with rating maps. In: WWW (2017)

  2. Amer-Yahia, S., Omidvar-Tehrani, B., Roy, S.B., Shabib, N.: Group recommendation with temporal affinities. In: EDBT (2015)

  3. Omidvar-Tehrani, B., Amer-Yahia, S., Termier, A.: Interactive user group analysis. In: CIKM (2015)

  4. Cao, L.: Behavior informatics to discover behavior insight for active and tailored client management. In: SIGKDD (2017)

  5. Wikipedia. Behavioral Analytics. https://en.wikipedia.org/wiki/behavioral_analytics (2014). Accessed 15 Mar 2018

  6. Abiteboul, S., Bonchi, F., Oliver, N., Yu, B.: Toward personal knowledge bases. In: DSAA (2015)

  7. Gramazio, C.C., Schloss, K.B., Laidlaw, D.H.: The relation between visualization size, grouping, and user performance. TVCG 20, 1953 (2014)

  8. Doodson, J., Gavin, J., Joiner, R.: Information seeking, acquainted with groups and individuals: information seeking, social uncertainty and social network sites. In: ICWSM (2013)

  9. Amer-Yahia, S., Omidvar-Tehrani, B., Comba, J., Moreira, V., Zegarra, F.C.: Exploration of user groups in VEXUS. In: ICDE demo (2018)

  10. Geng, L., Hamilton, H.J.: Interestingness measures for data mining: a survey. ACM Comput. Surv. (CSUR) 38(3), 1–32 (2006)

  11. Vreeken, J., Van Leeuwen, M., Siebes, A.: Krimp: mining itemsets that compress. Data Min. Knowl. Discov. 23(1), 169–214 (2011)

  12. Sidana, S., Mishra, S., Amer-Yahia, S., Clausel, M., Amini, M.-R.: Health monitoring on social media over time. In: SIGIR (2016)

  13. Amer-Yahia, S., Rousset, M.-C.: Toppi: an efficient algorithm for item-centric mining. In: DaWaK (2016)

  14. Harper, F.M., Konstan, J.A.: The movielens datasets: history and context. ACM Trans. Interact. Intell. Syst. (TiiS) 5, 19 (2016)

  15. Bertin-Mahieux, T., Ellis, D.P.W., Whitman, B., Lamere, P.: The million song dataset. In: ISMIR (2011)

  16. Monroe, M., Lan, R., Lee, H., Plaisant, C., Shneiderman, B.: Temporal event sequence simplification. TVCG 9, 2227 (2013)

  17. Ziegler, C.-N., McNee, S.M., Konstan, J.A., Lausen, G.: Improving recommendation lists through topic diversification. In: WWW (2005)

  18. Uno, T., Asai, T., Uchida, Y., Arimura, H.: Lcm: an efficient algorithm for enumerating frequent closed item sets. In: Proceedings of Workshop on Frequent itemset Mining Implementations FIMI03 (2003)

  19. Zhao, Z., De Stefani, L., Zgraggen, E., Binnig, C., Upfal, E., Kraska, T.: Controlling false discoveries during interactive data exploration. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 527–540. ACM (2017)

  20. Xu, C., Brown, S., Grant, C., Weaver, C.: Interactive visual analytics for Simpson's paradox detection. In: HILDA (2018)

  21. Ganguly, S., Hasan, W., Krishnamurthy, R.: Query Optimization for Parallel Execution. ACM, New York (1992)

  22. Trummer, I., Koch, C.: Approximation schemes for many-objective query optimization. In: SIGMOD. ACM (2014)

  23. Omidvar-Tehrani, B., Amer-Yahia, S., Dutot, P.-F., Trystram, D.: Multi-objective group discovery on the social web. Research Report RR-LIG-052, LIG, Grenoble, France (2016)

  24. Russell, S.J., Norvig, P.: Probabilistic reasoning. In: Artificial Intelligence: A Modern Approach. Pearson Education Ltd (2003)

  25. Robinson, D.J.S.: An Introduction to Abstract Algebra. Walter de Gruyter, Berlin (2003)

  26. Liu, A.-A., Su, Y.-T., Nie, W.-Z., Kankanhalli, M.: Hierarchical clustering multi-task learning for joint human action grouping and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 39(1), 102–114 (2017)

  27. Nandi, A., Jagadish, H.V.: Guided interaction: rethinking the query-result paradigm. In: Proceedings of the VLDB Endowment (2011)

  28. Sarawagi, S., Sathe, G.: i3: intelligent, interactive investigation of OLAP data cubes. In: SIGMOD, vol. 29, p. 589. ACM (2000)

  29. Indyk, P., Mahabadi, S., Mahdian, M., Mirrokni, V.S.: Composable core-sets for diversity and coverage maximization. In: SIGART (2014)

  30. Omidvar-Tehrani, B., Amer-Yahia, S., Termier, A.: Interactive user group analysis. Research Report RR-LIG-048, LIG, Grenoble, France (2015)

  31. Kittur, A., Chi, H., Suh, B.: Crowdsourcing user studies with mechanical turk. In: SIGCHI (2008)

  32. Eickhoff, C.: Cognitive biases in crowdsourcing. In: WSDM (2018)

  33. Nah, F.F.-H.: A study on tolerable waiting time: how long are web users willing to wait? Behav. Inf. Technol. 23(3), 153–163 (2004)

  34. Kirchgessner, M., Leroy, V., Amer-Yahia, S., Mishra, S.: Testing interestingness measures in practice: a large-scale analysis of buying patterns. In: DSAA (2016)

  35. Mishra, S., Leroy, V., Amer-Yahia, S.: Colloquial region discovery for retail products: discovery and application. Int. J. Data Sci. Anal. 4, 17 (2017)

  36. Encyclopædia Britannica. Ockham's razor. Encyclopædia Britannica Online. Encyclopædia Britannica Inc, Chicago, IL (2009). Accessed 21 June 2009

  37. Miller, G.: Human memory and the storage of information. IRE Trans. Inf. Theory 2, 129 (1956)

  38. Riquelme, N., Von Lücken, C., Baran, B.: Performance metrics in multi-objective optimization. In: CLEI. IEEE (2015)

  39. Li, K., Deb, K., Yao, X.: R-metric: evaluating the performance of preference-based evolutionary multi-objective optimization using reference points. IEEE Trans. Evol. Comput. (2017)

  40. Omidvar-Tehrani, B., Amer-Yahia, S., Dutot, P.-F., Trystram, D.: Multi-objective group discovery on the social web. In: ECML/PKDD, pp. 296–312. Springer (2016)

  41. Deb, K.: Multi-objective Optimization Using Evolutionary Algorithms, vol. 16. Wiley, New York (2001)

  42. Zitzler, E., Thiele, L.: Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach. IEEE Trans. Evol. Comput. 3(4), 257–271 (1999)

  43. Fekete, J.-D., Primet, R.: Progressive analytics: a computation paradigm for exploratory data analysis (2016). arXiv preprint arXiv:1607.05162

  44. Boley, M., Kang, B., Tokmakov, P., Mampaey, M., Wrobel, S.: One click mining: interactive local pattern discovery through implicit preference and performance learning. IDEAS (ACM SIGKDD Workshop) (2013)

  45. West, R., Leskovec, J.: Automatic versus human navigation in information networks. In: ICWSM (2012)

  46. Mampaey, M., Tatti, N., Vreeken, J.: Tell me what I need to know: succinctly summarizing data with itemsets. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 573–581. ACM (2011)

  47. Newman, M.E.J.: Detecting community structure in networks. Eur. Phys. J. B Condens. Matter Complex Syst. 38(2), 321–330 (2004)

  48. Yang, J., Leskovec, J.: Overlapping communities explain core–periphery organization of networks. In: Proceedings of the IEEE (2014)

  49. Leskovec, J., Lang, K.J., Mahoney, M.: Empirical comparison of algorithms for network community detection. In: WWW (2010)

  50. Cai, H., Zheng, V.W., Zhu, F., Chang, K.C.-C., Huang, Z.: From community detection to community profiling. In: Proceedings of the VLDB Endowment (2017)

  51. Das, M., Amer-Yahia, S., Das, G., Yu, C.: Meaningful interpretations of collaborative ratings. In: VLDB (2011)

  52. Baytas, I.M., Xiao, C., Zhang, X., Wang, F., Jain, A.K., Zhou, J.: Patient subtyping via time-aware LSTM networks. In: SIGKDD, pp. 65–74. ACM (2017)

  53. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications, vol. 27. ACM (1998)

  54. Srikant, R., Agrawal, R.: Mining generalized association rules. ACM (1995)

  55. Pandey, S., Aly, M., Bagherjeiran, A., Hatch, A., Ciccolo, P., Ratnaparkhi, A., Zinkevich, M.: Learning to target: what works for behavioral targeting. In: CIKM (2011)

  56. Kargar, M., An, A., Zihayat, M.: Efficient bi-objective team formation in social networks. In: Machine Learning and Knowledge Discovery in Databases. Springer Berlin, Heidelberg (2012)

  57. Cao, C.C., She, J., Tong, Y., Chen, L.: Whom to ask? Jury selection for decision making tasks on micro-blog services. VLDB 5, 1495 (2012)

  58. Coello Coello, C.A., Lamont, G.B., Van Veldhuizen, D.A.: Evolutionary Algorithms for Solving Multi-objective Problems, vol. 5. Springer, Berlin (2007)

  59. Papadimitriou, C.H., Yannakakis, M.: On the approximability of trade-offs and optimal access of web sources. In: FOCS (2000)

  60. Migdalas, A., Pardalos, P.M., Värbrand, P.: Multilevel Optimization: Algorithms and Applications. Springer, Berlin (1997)

  61. Soulet, A., Raïssi, C., Plantevit, M., Cremilleux, B.: Mining dominant patterns in the sky. In: ICDM. IEEE (2011)

  62. Bonchi, F., Giannotti, F., Lucchese, C., Orlando, S., Perego, R., Trasarti, R.: Conquest: a constraint-based querying system for exploratory pattern discovery. In: ICDE. IEEE (2006)

  63. Bonchi, F., Giannotti, F., Mazzanti, A., Pedreschi, D.: Exante: anticipated data reduction in constrained pattern mining. In: PKDD, vol. 2838, pp. 59–70. Springer (2003)

  64. Kifer, D., Bucila, C., Gehrke, J., White, W.: Dualminer: a dual-pruning algorithm for itemsets with constraints. In: SIGKDD (2002)

  65. Yan, N., Li, C., Roy, S.B., Ramegowda, R., Das, G.: Facetedpedia: enabling query-dependent faceted search for wikipedia. In: CIKM (2010)

  66. Khan, A.R., Garcia-Molina, H.: Crowddqs: dynamic question selection in crowdsourcing systems. In: Proceedings of the 2017 ACM International Conference on Management of Data. ACM (2017)

  67. Mottin, D., Lissandrini, M., Velegrakis, Y., Palpanas, T.: New trends on exploratory methods for data analytics. Proc. VLDB Endow. 10(12), 1977–1980 (2017)

  68. Shneiderman, B.: The eyes have it: a task by data type taxonomy for information visualizations. In: The Craft of Information Visualization, pp. 364–371. Elsevier (2003)

  69. Feige, U., Kortsarz, G., Peleg, D.: The dense k-subgraph problem. Algorithmica 29(3), 410–421 (2001)

  70. Johnson, D.S.: Approximation algorithms for combinatorial problems. In: Proceedings of the 5th Annual ACM Symposium on Theory of Computing (1973)

Author information

Corresponding author

Correspondence to Behrooz Omidvar-Tehrani.

Appendix: NP-hardness proofs

Theorem 2

The decision version of the gDiscover problem is NP-complete.

Proof

It is shown in [51] that a single-objective optimization problem for user group set discovery is NP-complete, by a reduction from the Exact 3-Set Cover problem (EC3). There, homogeneity is maximized subject to a threshold on coverage. In our case, two additional conflicting objectives (diversity and coverage) are optimized. The problem in [51] is therefore a special case of ours, so our problem is at least as hard. \(\square \)

For our proofs of hardness, we consider an infinite time limit in gNavigate since that does not affect the complexity of our problem.

Theorem 3

The exploration operation of gNavigate, i.e., opExplore, is NP-complete.

Proof

The decision version of the problem is as follows: given a group \(g\), a set of groups \({\mathcal {G}}\), a positive integer \(k\), and an overlap threshold \(\mu \), is there a subset of groups \({\mathcal {G}}' \subseteq explore (g,{\mathcal {G}},\mu )\) with \(|{\mathcal {G}}'|=k\) such that (i) every \(g' \in {\mathcal {G}}'\) satisfies \(g' \ne g\) and \(overlap(g,g') \ge \mu \), and (ii) \(\varSigma _{(g_1,g_2) \in {\mathcal {G}}' \times {\mathcal {G}}' | g_1 \ne g_2}(1- overlap(g_1,g_2))\) is maximized? A verifier v that returns true if both conditions (i) and (ii) are satisfied runs in polynomial time in the length of its input.

To establish NP-completeness, we reduce the Maximum Edge Subgraph problem (MES) [69], also known as Dense k-subgraph, to the decision version of our problem. MES is defined as follows: given an instance I consisting of a graph \(G=(V,E)\), a weight function \(w: E \rightarrow {\mathbb {N}}\), and a positive integer k, find a subset \(V' \subseteq V\) with \(|V'|=k\) such that the total weight of the edges induced by \(V'\), i.e., \(\varSigma _{(v_i,v_j)} w(v_i,v_j)\) where \((v_i,v_j) \in V' \times V'\), is maximized. This problem is NP-complete [69] (originally by a reduction from the Clique problem).

Given I, we create an instance J of our problem as follows. J consists of a graph \(G=(V,E)\) whose vertices \(V= explore (g,{\mathcal {G}},\mu )\) are the groups that satisfy (i). Every pair of groups \((g_1,g_2) \in V \times V\) is connected by a weighted edge with \(w(g_1,g_2)=1 - overlap(g_1,g_2)\). A subset \(V' \subseteq V\) with \(|V'|=k\) is then a set of groups for which the sum of the weights over all pairs of groups in \(V'\) is maximized; since every pair is connected, \(V'\) induces \(|E(V')|=\frac{k \times (k-1)}{2}\) edges. The set \(V'\) is the most diverse size-k subset of \({\mathcal {G}}\) whose members satisfy the overlap condition (\(\forall g' \in V', overlap(g,g') \ge \mu \)). Therefore, a set \(V'\) is a solution in instance I of MES if and only if it is a solution in instance J of our problem. Hence, the exploration problem is NP-complete. \(\square \)
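
The objective behind opExplore can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that groups are sets of user identifiers and that overlap is Jaccard similarity, which this appendix does not specify; all function names are hypothetical. The sketch enumerates every k-subset of the candidate groups, which shows why exhaustive search does not scale. It sums over unordered pairs, which selects the same subset as summing over ordered pairs.

```python
from itertools import combinations


def overlap(g1, g2):
    """Jaccard similarity of two groups (an assumed definition of overlap)."""
    union = g1 | g2
    return len(g1 & g2) / len(union) if union else 0.0


def explore_candidates(g, groups, mu):
    """Condition (i): groups other than g whose overlap with g is at least mu."""
    return [h for h in groups if h != g and overlap(g, h) >= mu]


def diversity(subset):
    """Objective (ii): sum of 1 - overlap over all unordered pairs in subset."""
    return sum(1.0 - overlap(a, b) for a, b in combinations(subset, 2))


def best_k_subset(g, groups, mu, k):
    """Exhaustive search over all k-subsets of the candidates.

    The number of subsets grows as C(n, k), so this is only usable
    on very small inputs. Returns None if fewer than k candidates exist.
    """
    candidates = explore_candidates(g, groups, mu)
    if len(candidates) < k:
        return None
    return max(combinations(candidates, k), key=diversity)
```

For example, with g = {1, 2, 3, 4}, candidate groups {1, 2}, {3, 4}, {1, 2, 3} and {9}, and a threshold of 0.5, the group {9} is filtered out by condition (i), and the most diverse pair among the rest is {1, 2} and {3, 4}, which share no users.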

Theorem 4

The exploitation operation of gNavigate, i.e., opExploit, is NP-complete.

Proof

Similar to opExplore, a verifier v for exploitation runs in polynomial time in the length of its input. To establish NP-completeness, we reduce the Maximum Coverage Problem (MCP) [70] to the decision version of our problem. MCP is defined as follows: given an instance I consisting of m sets \(S = \{S_1 \dots S_m\}\) where \(S_i \subseteq S_M\) (\(S_M\) being a reference set), and a positive integer k, find a subset \(S' \subseteq S\) such that \(|S'|=k\) and the fraction of covered elements in \(S_M\), i.e., \(|\cup _{S_i \in S'} S_i| / |S_M|\), is maximized. This problem is NP-complete [70]. Given I, we create an instance J of our problem which consists of m sets \(S = exploit (g,{\mathcal {G}},\mu )\) and a reference group, i.e., \(S_M = g_{in}\). In opExploit, we are interested in finding k groups \(S' \subseteq S\) that cover the maximum number of users in \(S_M\), i.e., \(|\cup _{S_i \in S'} S_i| / |S_M|\) is maximized. Therefore, a set \(S'\) is a solution in instance I of MCP if and only if it is a solution in instance J of opExploit. Hence opExploit is NP-complete. \(\square \)
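
Because exact maximization is intractable, the coverage objective is commonly approached with the standard greedy heuristic for maximum coverage, which is known to achieve a (1 - 1/e) approximation. The sketch below illustrates that heuristic in Python; it is not presented as the method used in this paper. Groups are assumed to be sets of user identifiers, and the function names are hypothetical.

```python
def coverage(selected, reference):
    """Fraction of the reference group covered by the union of selected groups."""
    if not reference:
        return 0.0
    covered = set()
    for s in selected:
        covered |= s & reference
    return len(covered) / len(reference)


def greedy_max_coverage(sets, reference, k):
    """Pick up to k sets, each time taking the one that adds the most
    not-yet-covered users of the reference group (standard greedy heuristic)."""
    chosen = []
    covered = set()
    remaining = list(sets)
    for _ in range(min(k, len(remaining))):
        # Marginal gain: reference users in s that are not covered yet.
        best = max(remaining, key=lambda s: len((s & reference) - covered))
        chosen.append(best)
        covered |= best & reference
        remaining.remove(best)
    return chosen
```

Each round costs time linear in the number of remaining groups, so the heuristic runs in polynomial time, in contrast to the exhaustive search over all k-subsets that exact maximization would require.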

About this article

Cite this article

Omidvar-Tehrani, B., Amer-Yahia, S. & Borromeo, R.M. User group analytics: hypothesis generation and exploratory analysis of user data. The VLDB Journal 28, 243–266 (2019). https://doi.org/10.1007/s00778-018-0527-4

  • DOI: https://doi.org/10.1007/s00778-018-0527-4
