Abstract
User data is becoming increasingly available in multiple domains ranging from the social Web to retail store receipts. User data is described by user demographics (e.g., age, gender, occupation) and user actions (e.g., rating a movie, publishing a paper, following a medical treatment). The analysis of user data is appealing to scientists who work on population studies, online marketing, recommendations, and large-scale data analytics. User data analytics usually relies on identifying group-level behavior such as “Asian women who publish regularly in databases.” Group analytics addresses peculiarities of user data such as noise and sparsity to enable insights. In this paper, we introduce a framework for user group analytics by developing several components which cover the life cycle of user groups. We provide two different analytical environments to support “hypothesis generation” and “exploratory analysis” on user groups. Experiments on datasets with different characteristics show the usability and efficiency of our group analytics framework.
Notes
Internet Movie Database: http://www.imdb.com.
Jane Ciabattari: Why is Rumi the best-selling poet in the US? http://www.bbc.com/culture/story/20140414-americas-best-selling-poet.
Note that this is just the way we illustrate the concept and of course we do not make this concatenation on the user data in the database.
Based on Google Scholar: https://goo.gl/r4FaLh.
DM-Authors dataset: http://dx.doi.org/10.18709/PERSCIDO.2016.10.DS32.
Overall Non-dominated Vector Generation.
Long short-term memory networks.
References
Amer-Yahia, S., Kleisarchaki, S., Kolloju, N.K., Lakshmanan, L.V.S., Zamar, R.H.: Exploring rated datasets with rating maps. In: WWW (2017)
Amer-Yahia, S., Omidvar-Tehrani, B., Roy, S.B., Shabib, N.: Group recommendation with temporal affinities. In: EDBT (2015)
Omidvar-Tehrani, B., Amer-Yahia, S., Termier, A.: Interactive user group analysis. In: CIKM (2015)
Cao, L.: Behavior informatics to discover behavior insight for active and tailored client management. In: SIGKDD (2017)
Wikipedia. Behavioral Analytics. https://en.wikipedia.org/wiki/behavioral_analytics (2014). Accessed 15 Mar 2018
Abiteboul, S., Bonchi, F., Oliver, N., Yu, B.: Toward personal knowledge bases. In: DSAA (2015)
Gramazio, C.C., Schloss, K.B., Laidlaw, D.H.: The relation between visualization size, grouping, and user performance. TVCG 20, 1953 (2014)
Doodson, J., Gavin, J., Joiner, R.: Getting acquainted with groups and individuals: information seeking, social uncertainty and social network sites. In: ICWSM (2013)
Amer-Yahia, S., Omidvar-Tehrani, B., Comba, J., Moreira, V., Zegarra, F.C.: Exploration of user groups in VEXUS. In: ICDE demo (2018)
Geng, L., Hamilton, H.J.: Interestingness measures for data mining: a survey. ACM Comput. Surv. (CSUR) 38(3), 1–32 (2006)
Vreeken, J., Van Leeuwen, M., Siebes, A.: Krimp: mining itemsets that compress. Data Min. Knowl. Discov. 23(1), 169–214 (2011)
Sidana, S., Mishra, S., Amer-Yahia, S., Clausel, M., Amini, M.-R.: Health monitoring on social media over time. In: SIGIR (2016)
Amer-Yahia, S., Rousset, M.-C.: Toppi: an efficient algorithm for item-centric mining. In: DaWaK (2016)
Harper, F.M., Konstan, J.A.: The movielens datasets: history and context. ACM Trans. Interact. Intell. Syst. (TiiS) 5, 19 (2016)
Bertin-Mahieux, T., Ellis, D.P.W., Whitman, B., Lamere, P.: The million song dataset. In: ISMIR (2011)
Monroe, M., Lan, R., Lee, H., Plaisant, C., Shneiderman, B.: Temporal event sequence simplification. TVCG 9, 2227 (2013)
Ziegler, C.-N., McNee, S.M., Konstan, J.A., Lausen, G.: Improving recommendation lists through topic diversification. In: WWW (2005)
Uno, T., Asai, T., Uchida, Y., Arimura, H.: Lcm: an efficient algorithm for enumerating frequent closed item sets. In: Proceedings of Workshop on Frequent itemset Mining Implementations FIMI03 (2003)
Zhao, Z., De Stefani, L., Zgraggen, E., Binnig, C., Upfal, E., Kraska, T.: Controlling false discoveries during interactive data exploration. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 527–540. ACM (2017)
Xu, C., Brown, S., Grant, C., Weaver, C.: Interactive visual analytics for Simpson's paradox detection. In: HILDA (2018)
Ganguly, S., Hasan, W., Krishnamurthy, R.: Query Optimization for Parallel Execution. ACM, New York (1992)
Trummer, I., Koch, C.: Approximation schemes for many-objective query optimization. In: SIGMOD. ACM (2014)
Omidvar-Tehrani, B., Amer-Yahia, S., Dutot, P.-F., Trystram, D.: Multi-objective group discovery on the social web. Research Report RR-LIG-052, LIG, Grenoble, France (2016)
Russell, S.J., Norvig, P.: Probabilistic reasoning. In: Artificial Intelligence: A Modern Approach. Pearson Education Ltd (2003)
Robinson, D.J.S.: An Introduction to Abstract Algebra. Walter de Gruyter, Berlin (2003)
Liu, A.-A., Su, Y.-T., Nie, W.-Z., Kankanhalli, M.: Hierarchical clustering multi-task learning for joint human action grouping and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 39(1), 102–114 (2017)
Nandi, A., Jagadish, H.V.: Guided interaction: rethinking the query-result paradigm. In: Proceedings of the VLDB Endowment (2011)
Sarawagi, S., Sathe, G.: i3: intelligent, interactive investigation of OLAP data cubes. In: SIGMOD, vol. 29, p. 589. ACM (2000)
Indyk, P., Mahabadi, S., Mahdian, M., Mirrokni, V.S.: Composable core-sets for diversity and coverage maximization. In: SIGART (2014)
Omidvar-Tehrani, B., Amer-Yahia, S., Termier, A.: Interactive user group analysis. Research Report RR-LIG-048, LIG, Grenoble, France (2015)
Kittur, A., Chi, H., Suh, B.: Crowdsourcing user studies with mechanical turk. In: SIGCHI (2008)
Eickhoff, C.: Cognitive biases in crowdsourcing. In: WSDM (2018)
Nah, F.F.-H.: A study on tolerable waiting time: how long are web users willing to wait? Behav. Inf. Technol. 23(3), 153–163 (2004)
Kirchgessner, M., Leroy, V., Amer-Yahia, S., Mishra, S.: Testing interestingness measures in practice: a large-scale analysis of buying patterns. In: DSAA (2016)
Mishra, S., Leroy, V., Amer-Yahia, S.: Colloquial region discovery for retail products: discovery and application. Int. J. Data Sci. Anal. 4, 17 (2017)
Encyclopædia Britannica. Ockham's razor. Encyclopædia Britannica Online. Encyclopædia Britannica Inc, Chicago, IL (2009). Accessed 21 June 2009
Miller, G.: Human memory and the storage of information. IRE Trans. Inf. Theory 2, 129 (1956)
Riquelme, N., Von Lücken, C., Baran, B.: Performance metrics in multi-objective optimization. In: CLEI. IEEE (2015)
Li, K., Deb, K., Yao, X.: R-metric: evaluating the performance of preference-based evolutionary multi-objective optimization using reference points. IEEE Trans. Evol. Comput. (2017)
Omidvar-Tehrani, B., Amer-Yahia, S., Dutot, P.-F., Trystram, D.: Multi-objective group discovery on the social web. In: ECML/PKDD, pp. 296–312. Springer (2016)
Deb, K.: Multi-objective Optimization Using Evolutionary Algorithms, vol. 16. Wiley, New York (2001)
Zitzler, E., Thiele, L.: Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach. IEEE Trans. Evol. Comput. 3(4), 257–271 (1999)
Fekete, J.-D., Primet, R.: Progressive analytics: a computation paradigm for exploratory data analysis (2016). arXiv preprint arXiv:1607.05162
Boley, M., Kang, B., Tokmakov, P., Mampaey, M., Wrobel, S.: One click mining: interactive local pattern discovery through implicit preference and performance learning. IDEAS (ACM SIGKDD Workshop) (2013)
West, R., Leskovec, J.: Automatic versus human navigation in information networks. In: ICWSM (2012)
Mampaey, M., Tatti, N., Vreeken, J.: Tell me what I need to know: succinctly summarizing data with itemsets. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 573–581. ACM (2011)
Newman, M.E.J.: Detecting community structure in networks. Eur. Phys. J. B Condens. Matter Complex Syst. 38(2), 321–330 (2004)
Yang, J., Leskovec, J.: Overlapping communities explain core–periphery organization of networks. In: Proceedings of the IEEE (2014)
Leskovec, J., Lang, K.J., Mahoney, M.: Empirical comparison of algorithms for network community detection. In: WWW (2010)
Cai, H., Zheng, V.W., Zhu, F., Chang, K.C.-C., Huang, Z.: From community detection to community profiling. In: Proceedings of the VLDB Endowment (2017)
Das, M., Amer-Yahia, S., Das, G., Yu, C.: Meaningful interpretations of collaborative ratings. In: VLDB (2011)
Baytas, I.M., Xiao, C., Zhang, X., Wang, F., Jain, A.K., Zhou, J.: Patient subtyping via time-aware LSTM networks. In: SIGKDD, pp. 65–74. ACM (2017)
Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications, vol. 27. ACM (1998)
Srikant, R., Agrawal, R.: Mining generalized association rules. ACM (1995)
Pandey, S., Aly, M., Bagherjeiran, A., Hatch, A., Ciccolo, P., Ratnaparkhi, A., Zinkevich, M.: Learning to target: what works for behavioral targeting. In: CIKM (2011)
Kargar, M., An, A., Zihayat, M.: Efficient bi-objective team formation in social networks. In: Machine Learning and Knowledge Discovery in Databases. Springer Berlin, Heidelberg (2012)
Cao, C.C., She, J., Tong, Y., Chen, L.: Whom to ask? Jury selection for decision making tasks on micro-blog services. VLDB 5, 1495 (2012)
Coello Coello, C.A., Lamont, G.B., Van Veldhuizen, D.A.: Evolutionary Algorithms for Solving Multi-objective Problems, vol. 5. Springer, Berlin (2007)
Papadimitriou, C.H., Yannakakis, M.: On the approximability of trade-offs and optimal access of web sources. In: FOCS (2000)
Migdalas, A., Pardalos, P.M., Värbrand, P.: Multilevel Optimization: Algorithms and Applications. Springer, Berlin (1997)
Soulet, A., Raïssi, C., Plantevit, M., Cremilleux, B.: Mining dominant patterns in the sky. In: ICDM. IEEE (2011)
Bonchi, F., Giannotti, F., Lucchese, C., Orlando, S., Perego, R., Trasarti, R.: Conquest: a constraint-based querying system for exploratory pattern discovery. In: ICDE. IEEE (2006)
Bonchi, F., Giannotti, F., Mazzanti, A., Pedreschi, D.: Exante: anticipated data reduction in constrained pattern mining. In: PKDD, vol. 2838, pp. 59–70. Springer (2003)
Kifer, D., Bucila, C., Gehrke, J., White, W.: Dualminer: a dual-pruning algorithm for itemsets with constraints. In: SIGKDD (2002)
Yan, N., Li, C., Roy, S.B., Ramegowda, R., Das, G.: Facetedpedia: enabling query-dependent faceted search for wikipedia. In: CIKM (2010)
Khan, A.R., Garcia-Molina, H.: Crowddqs: dynamic question selection in crowdsourcing systems. In: Proceedings of the 2017 ACM International Conference on Management of Data. ACM (2017)
Mottin, D., Lissandrini, M., Velegrakis, Y., Palpanas, T.: New trends on exploratory methods for data analytics. Proc. VLDB Endow. 10(12), 1977–1980 (2017)
Shneiderman, B.: The eyes have it: a task by data type taxonomy for information visualizations. In: The Craft of Information Visualization, pp. 364–371. Elsevier (2003)
Feige, U., Kortsarz, G., Peleg, D.: The dense k-subgraph problem. Algorithmica 29(3), 410–421 (2001)
Johnson, D.S.: Approximation algorithms for combinatorial problems. In: Proceedings of the 5th Annual ACM Symposium on Theory of Computing (1973)
Appendix: NP-hardness proofs
Theorem 2
The decision version of the gDiscover problem is NP-complete.
Proof
It is shown in [51] that a single-objective optimization problem for user group set discovery is NP-complete, by a reduction from the Exact 3-Set Cover problem (EC3). In that formulation, homogeneity is maximized subject to a threshold on coverage. In our case, two new conflicting dimensions (diversity and coverage) are added. The problem in [51] is therefore a special case of ours, and our problem is at least as hard. \(\square \)
For our proofs of hardness, we consider an infinite time limit in gNavigate since that does not affect the complexity of our problem.
Theorem 3
The exploration operation of gNavigate, i.e., opExplore, is NP-complete.
Proof
The decision version of the problem is as follows: given a group g, a set of groups \({{\mathcal {G}}}\), a positive integer k, and an overlap threshold \(\mu \), is there a subset of groups \({{\mathcal {G}}'} \subseteq explore (g,{{\mathcal {G}}},\mu )\) with \(|{{\mathcal {G}}'}|=k\) such that (i) every \(g' \in {{\mathcal {G}}'}\) satisfies \(g' \ne g \wedge overlap(g,g') \ge \mu \) and (ii) \(\varSigma _{(g_1,g_2) \in {{\mathcal {G}}'} \times {{\mathcal {G}}'} | g_1 \ne g_2}(1- overlap(g_1,g_2) )\) is maximized? A verifier v which returns true if both conditions (i) and (ii) are satisfied runs in time polynomial in the length of its input.
To establish NP-completeness, we reduce the Maximum Edge Subgraph problem (MES) [69] (also known as Dense k-subgraph) to the decision version of our problem. MES is defined as follows. Given an instance I consisting of a graph \(G=(V,E)\), a weight function \(w: E \rightarrow {\mathbb {N}}\), and a positive integer k, find a subset \(V' \subseteq V\) with \(|V'|=k\) such that the total weight of the edges induced by \(V'\), i.e., \(\varSigma _{(v_i,v_j)} w(v_i,v_j)\) (where \((v_i,v_j) \in V' \times V'\)), is maximized. This problem is NP-complete [69], which is shown by a reduction from the Clique problem.
Given I, we create an instance J of our problem as follows. J consists of a graph \(G=(V,E)\) whose set of vertices \(V= explore (g,{{\mathcal {G}}},\mu )\) contains the groups that satisfy (i). Every pair of groups \((g_1,g_2) \in V \times V\) is connected by a weighted edge, i.e., \(w(g_1,g_2)=1 - overlap(g_1,g_2) \). The subset \(V' \subseteq V\) (\(|V'|=k\)) is then a subset of groups for which the sum of the weights between each pair of groups in \(V'\) is maximized; since every pair is connected, \(V'\) induces \(|E(V')|=\frac{k \times (k-1)}{2}\) edges. The set \(V'\) is the most diverse subset of \({{\mathcal {G}}}\) of size k that satisfies the overlap condition (\(\forall g' \in V', overlap(g,g') \ge \mu \)). Therefore, a set \(V'\) is a solution in instance I of MES \( iff \) it is a solution in instance J of our problem. Hence, the exploration problem is NP-complete. \(\square \)
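To make the objective in the proof above concrete, the following minimal Python sketch filters candidate groups by condition (i) and then exhaustively searches all k-subsets for the one that maximizes the pairwise diversity of condition (ii). It is an illustration only, not the algorithm used in gNavigate. The Jaccard definition of `overlap` and the names `explore_candidates`, `diversity`, and `most_diverse_subset` are assumptions introduced here. The sum runs over unordered pairs, which halves the value of the ordered-pair sum but leaves the maximizing subset unchanged.

```python
from itertools import combinations


def overlap(g1, g2):
    """Jaccard similarity between two groups (sets of user ids); an assumed definition."""
    union = g1 | g2
    return len(g1 & g2) / len(union) if union else 0.0


def explore_candidates(g, groups, mu):
    """Groups other than g whose overlap with g is at least mu (condition (i))."""
    return [h for h in groups if h != g and overlap(g, h) >= mu]


def diversity(subset):
    """Sum of (1 - overlap) over unordered pairs of groups (condition (ii))."""
    return sum(1.0 - overlap(a, b) for a, b in combinations(subset, 2))


def most_diverse_subset(g, groups, mu, k):
    """Try every k-subset of the candidates; the number of subsets grows combinatorially."""
    candidates = explore_candidates(g, groups, mu)
    if len(candidates) < k:
        return None  # not enough groups satisfy the overlap threshold
    return max(combinations(candidates, k), key=diversity)


# Hypothetical toy data: groups are frozensets of user ids.
g = frozenset({1, 2, 3, 4})
groups = [
    frozenset({1, 2, 5}),
    frozenset({1, 2, 6}),
    frozenset({3, 4, 7}),
    frozenset({8, 9}),
]
best = most_diverse_subset(g, groups, mu=0.2, k=2)
```

The exhaustive search examines every k-subset of the candidates, which is practical only for toy inputs and is consistent with the hardness result stated in Theorem 3.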
Theorem 4
The exploitation operation of gNavigate, i.e., opExploit, is NP-complete.
Proof
Similar to opExplore, a verifier v for exploitation runs in time polynomial in the length of its input. To establish NP-completeness, we reduce the Maximum Coverage Problem (MCP) [70] to the decision version of our problem. MCP is defined as follows. Given an instance I consisting of m sets \(S = \{S_1, \dots , S_m\}\) where \(S_i \subseteq S_M\) (\(S_M\) being a reference set), and a positive integer k, find a subset \(S' \subseteq S\) such that \(|S'|=k\) and the fraction of covered elements in \(S_M\), i.e., \(|\cup _{S_i \in S'} S_i| / |S_M|\), is maximized. This problem is NP-complete [70].
Given I, we create an instance J of our problem which consists of m sets \(S = exploit (g,{{\mathcal {G}}},\mu )\) and a reference group, i.e., \(S_M = g_{in}\). In opExploit, we seek k groups \(S' \subseteq S\) that cover the maximum number of users in \(S_M\), i.e., that maximize \(|\cup _{S_i \in S'} S_i| / |S_M|\). Therefore, a set \(S'\) is a solution in instance I of MCP \( iff \) it is a solution in instance J of opExploit. Hence, opExploit is NP-complete. \(\square \)
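Because exact maximization of coverage is intractable in general, a common practical fallback is the standard greedy heuristic for maximum coverage, which is known to achieve a \(1 - 1/e\) approximation ratio. The Python sketch below illustrates that generic heuristic under stated assumptions; it is not claimed to be the procedure implemented in opExploit. The function name `greedy_max_coverage` and the toy data are hypothetical.

```python
def greedy_max_coverage(reference, candidates, k):
    """Pick up to k candidate sets, each time taking the one that covers
    the most still-uncovered elements of the reference set."""
    uncovered = set(reference)
    remaining = list(candidates)
    chosen = []
    for _ in range(k):
        if not remaining:
            break
        best = max(remaining, key=lambda s: len(uncovered & s))
        if not uncovered & best:
            break  # no remaining set adds any coverage
        chosen.append(best)
        remaining.remove(best)
        uncovered -= best
    # Fraction of the reference set covered by the chosen sets.
    coverage = 1.0 - len(uncovered) / len(reference) if reference else 0.0
    return chosen, coverage


# Hypothetical toy instance: g_in plays the role of the reference set S_M.
g_in = {1, 2, 3, 4, 5, 6}
cands = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]
chosen, coverage = greedy_max_coverage(g_in, cands, k=2)
```

On this toy instance the greedy choice happens to be optimal; in general it only guarantees the approximation ratio mentioned above.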
Cite this article
Omidvar-Tehrani, B., Amer-Yahia, S. & Borromeo, R.M. User group analytics: hypothesis generation and exploratory analysis of user data. The VLDB Journal 28, 243–266 (2019). https://doi.org/10.1007/s00778-018-0527-4
DOI: https://doi.org/10.1007/s00778-018-0527-4