User group analytics: hypothesis generation and exploratory analysis of user data

Omidvar-Tehrani, Behrooz; Amer-Yahia, Sihem; Borromeo, Ria Mae

doi:10.1007/s00778-018-0527-4

Regular Paper
Published: 26 October 2018

The VLDB Journal, Volume 28, pages 243–266 (2019)

Behrooz Omidvar-Tehrani ORCID: orcid.org/0000-0002-9405-3386¹,
Sihem Amer-Yahia¹ &
Ria Mae Borromeo²

Abstract

User data is becoming increasingly available in multiple domains ranging from the social Web to retail store receipts. User data is described by user demographics (e.g., age, gender, occupation) and user actions (e.g., rating a movie, publishing a paper, following a medical treatment). The analysis of user data is appealing to scientists who work on population studies, online marketing, recommendations, and large-scale data analytics. User data analytics usually relies on identifying group-level behavior such as “Asian women who publish regularly in databases.” Group analytics addresses peculiarities of user data such as noise and sparsity to enable insights. In this paper, we introduce a framework for user group analytics by developing several components which cover the life cycle of user groups. We provide two different analytical environments to support “hypothesis generation” and “exploratory analysis” on user groups. Experiments on datasets with different characteristics show the usability and efficiency of our group analytics framework.

Notes

http://www.wired.com/2014/04/forget-the-quantified-self-we-need-to-build-the-quantified-us/.
Internet Movie Database: http://www.imdb.com.
Jane Ciabattari: Why is Rumi the best-selling poet in the US? http://www.bbc.com/culture/story/20140414-americas-best-selling-poet.
http://www.bookcrossing.com.
Note that this is just the way we illustrate the concept and of course we do not make this concatenation on the user data in the database.
http://www.imdb.com/title/tt0068646/ratings?ref_=tt_ov_rt.
http://www.imdb.com/title/tt2322441/ratings?ref_=tt_ov_rt.
http://zip4.usps.com.
Based on Google Scholar: https://goo.gl/r4FaLh.
http://dblp.uni-trier.de/db/.
DM-Authors dataset: http://dx.doi.org/10.18709/PERSCIDO.2016.10.DS32.
https://www.mturk.com/.
Overall Non-dominated Vector Generation.
http://choco-solver.org/?q=Choco3.
http://blog.testmunk.com/how-teens-really-use-apps/.
Long short-term memory networks.

References

Amer-Yahia, S., Kleisarchaki, S., Kolloju, N.K., Lakshmanan, L.V.S., Zamar, R.H.: Exploring rated datasets with rating maps. In: WWW (2017)
Amer-Yahia, S., Omidvar Tehrani, B., Roy, S.B., Shabib, N.: Group recommendation with temporal affinities. In: EDBT (2015)
Omidvar-Tehrani, B., Amer-Yahia, S., Termier, A.: Interactive user group analysis. In: CIKM (2015)
Cao, L.: Behavior informatics to discover behavior insight for active and tailored client management. In: SIGKDD (2017)
Wikipedia. Behavioral Analytics. https://en.wikipedia.org/wiki/behavioral_analytics (2014). Accessed 15 Mar 2018
Abiteboul, S., Bonchi, F., Oliver, N., Yu, B.: Toward personal knowledge bases. In: DSAA (2015)
Gramazio, C.C., Schloss, K.B., Laidlaw, D.H.: The relation between visualization size, grouping, and user performance. TVCG 20, 1953 (2014)
Doodson, J., Gavin, J., Joiner, R.: Information seeking, acquainted with groups and individuals: information seeking, social uncertainty and social network sites. In: ICWSM (2013)
Amer-Yahia, S., Omidvar-Tehrani, B., Comba, J., Moreira, V., Zegarra, F.C.: Exploration of user groups in VEXUS. In: ICDE demo (2018)
Geng, L., Hamilton, H.J.: Interestingness measures for data mining: a survey. ACM Comput. Surv. (CSUR) 38(3), 1–32 (2006)
Vreeken, J., Van Leeuwen, M., Siebes, A.: Krimp: mining itemsets that compress. Data Min. Knowl. Discov. 23(1), 169–214 (2011)
Sidana, S., Mishra, S., Amer-Yahia, S., Clausel, M., Amini, M.-R.: Health monitoring on social media over time. In: SIGIR (2016)
Amer-Yahia, S., Rousset, M.-C.: Toppi: an efficient algorithm for item-centric mining. In: DaWaK (2016)
Harper, F.M., Konstan, J.A.: The movielens datasets: history and context. ACM Trans. Interact. Intell. Syst. (TiiS) 5, 19 (2016)
Bertin-Mahieux, T., Ellis, D.P.W., Whitman, B., Lamere, P.: The million song dataset. In: ISMIR (2011)
Monroe, M., Lan, R., Lee, H., Plaisant, C., Shneiderman, B.: Temporal event sequence simplification. TVCG 9, 2227 (2013)
Ziegler, C.-N., McNee, S.M., Konstan, J.A., Lausen, G.: Improving recommendation lists through topic diversification. In: WWW (2005)
Uno, T., Asai, T., Uchida, Y., Arimura, H.: Lcm: an efficient algorithm for enumerating frequent closed item sets. In: Proceedings of Workshop on Frequent itemset Mining Implementations FIMI03 (2003)
Zhao, Z., De Stefani, L., Zgraggen, E., Binnig, C., Upfal, E., Kraska, T.: Controlling false discoveries during interactive data exploration. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 527–540. ACM (2017)
Xu, C., Brown, S., Grant, C., Weaver, C.: Interactive visual analytics for Simpson's paradox detection. In: HILDA (2018)
Ganguly, S., Hasan, W., Krishnamurthy, R.: Query Optimization for Parallel Execution. ACM, New York (1992)
Trummer, I., Koch, C.: Approximation schemes for many-objective query optimization. In: SIGMOD. ACM (2014)
Omidvar-Tehrani, B., Amer-Yahia, S., Dutot, P.-F., Trystram, D.: Multi-objective group discovery on the social web. Research Report RR-LIG-052, LIG, Grenoble, France (2016)
Russell, S.J., Norvig, P.: Probabilistic reasoning. In: Artificial Intelligence: A Modern Approach. Pearson Education Ltd (2003)
Robinson, D.J.S.: An Introduction to Abstract Algebra. Walter de Gruyter, Berlin (2003)
Liu, A.-A., Su, Y.-T., Nie, W.-Z., Kankanhalli, M.: Hierarchical clustering multi-task learning for joint human action grouping and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 39(1), 102–114 (2017)
Nandi, A., Jagadish, H.V.: Guided interaction: rethinking the query-result paradigm. In: Proceedings of the VLDB Endowment (2011)
Sarawagi, S., Sathe, G.: i3: intelligent, interactive investigation of OLAP data cubes. In: SIGMOD, vol. 29, p. 589. ACM (2000)
Indyk, P., Mahabadi, S., Mahdian, M., Mirrokni, V.S.: Composable core-sets for diversity and coverage maximization. In: SIGART (2014)
Omidvar-Tehrani, B., Amer-Yahia, S., Termier, A.: Interactive user group analysis. Research Report RR-LIG-048, LIG, Grenoble, France (2015)
Kittur, A., Chi, H., Suh, B.: Crowdsourcing user studies with mechanical turk. In: SIGCHI (2008)
Eickhoff, C.: Cognitive biases in crowdsourcing. In: WSDM (2018)
Nah, F.F.-H.: A study on tolerable waiting time: how long are web users willing to wait? Behav. Inf. Technol. 23(3), 153–163 (2004)
Kirchgessner, M., Leroy, V., Amer-Yahia, S., Mishra, S.: Testing interestingness measures in practice: a large-scale analysis of buying patterns. In: DSAA (2016)
Mishra, S., Leroy, V., Amer-Yahia, S.: Colloquial region discovery for retail products: discovery and application. Int. J. Data Sci. Anal. 4, 17 (2017)
Encyclopædia Britannica. Ockham's razor. Encyclopædia Britannica Online. Encyclopædia Britannica Inc, Chicago, IL (2009). Accessed 21 June 2009
Miller, G.: Human memory and the storage of information. IRE Trans. Inf. Theory 2, 129 (1956)
Riquelme, N., Von Lücken, C., Baran, B.: Performance metrics in multi-objective optimization. In: CLEI. IEEE (2015)
Ke, L., Deb, K., Yao, X.: R-metric: evaluating the performance of preference-based evolutionary multi-objective optimization using reference points. IEEE Trans. Evol. Comput. (2017)
Omidvar-Tehrani, B., Amer-Yahia, S., Dutot, P.-F., Trystram, D.: Multi-objective group discovery on the social web. In: ECML/PKDD, pp. 296–312. Springer (2016)
Deb, K.: Multi-objective Optimization Using Evolutionary Algorithms, vol. 16. Wiley, New York (2001)
Zitzler, E., Thiele, L.: Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach. IEEE Trans. Evol. Comput. 3(4), 257–271 (1999)
Fekete, J.-D., Primet, R.: Progressive analytics: a computation paradigm for exploratory data analysis (2016). arXiv preprint arXiv:1607.05162
Boley, M., Kang, B., Tokmakov, P., Mampaey, M., Wrobel, S.: One click mining: interactive local pattern discovery through implicit preference and performance learning. IDEAS (ACM SIGKDD Workshop) (2013)
West, R., Leskovec, J.: Automatic versus human navigation in information networks. In: ICWSM (2012)
Mampaey, M., Tatti, N., Vreeken, J.: Tell me what I need to know: succinctly summarizing data with itemsets. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 573–581. ACM (2011)
Newman, M.E.J.: Detecting community structure in networks. Eur. Phys. J. B Condens. Matter Complex Syst. 38(2), 321–330 (2004)
Yang, J., Leskovec, J.: Overlapping communities explain core–periphery organization of networks. In: Proceedings of the IEEE (2014)
Leskovec, J., Lang, K.J., Mahoney, M.: Empirical comparison of algorithms for network community detection. In: WWW (2010)
Cai, H., Zheng, V.W., Zhu, F., Chang, K.C.-C., Huang, Z.: From community detection to community profiling. In: Proceedings of the VLDB Endowment (2017)
Das, M., Amer-Yahia, S., Das, G., Yu, C.: Meaningful interpretations of collaborative ratings. In: VLDB (2011)
Baytas, I.M., Xiao, C., Zhang, X., Wang, F., Jain, A.K., Zhou, J.: Patient subtyping via time-aware LSTM networks. In: SIGKDD, pp. 65–74. ACM (2017)
Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications, vol. 27. ACM (1998)
Srikant, R., Agrawal, R.: Mining generalized association rules. ACM (1995)
Pandey, S., Aly, M., Bagherjeiran, A., Hatch, A., Ciccolo, P., Ratnaparkhi, A., Zinkevich, M.: Learning to target: what works for behavioral targeting. In: CIKM (2011)
Kargar, M., An, A., Zihayat, M.: Efficient bi-objective team formation in social networks. In: Machine Learning and Knowledge Discovery in Databases. Springer Berlin, Heidelberg (2012)
Cao, C.C., She, J., Tong, Y., Chen, L.: Whom to ask? Jury selection for decision making tasks on micro-blog services. VLDB 5, 1495 (2012)
Coello Coello, C.A., Lamont, G.B., Van Veldhuizen, D.A.: Evolutionary Algorithms for Solving Multi-objective Problems, vol. 5. Springer, Berlin (2007)
Papadimitriou, C.H., Yannakakis, M.: On the approximability of trade-offs and optimal access of web sources. In: FOCS (2000)
Migdalas, A., Pardalos, P.M., Värbrand, P.: Multilevel Optimization: Algorithms and Applications. Springer, Berlin (1997)
Soulet, A., Raïssi, C., Plantevit, M., Cremilleux, B.: Mining dominant patterns in the sky. In: ICDM. IEEE (2011)
Bonchi, F., Giannotti, F., Lucchese, C., Orlando, S., Perego, R., Trasarti, R.: Conquest: a constraint-based querying system for exploratory pattern discovery. In: ICDE. IEEE (2006)
Bonchi, F., Giannotti, F., Mazzanti, A., Pedreschi, D.: Exante: anticipated data reduction in constrained pattern mining. In: PKDD, vol. 2838, pp. 59–70. Springer (2003)
Kifer, D., Bucila, C., Gehrke, J., White, W.: Dualminer: a dual-pruning algorithm for itemsets with constraints. In: SIGKDD (2002)
Yan, N., Li, C., Roy, S.B., Ramegowda, R., Das, G.: Facetedpedia: enabling query-dependent faceted search for wikipedia. In: CIKM (2010)
Khan, A.R., Garcia-Molina, H.: Crowddqs: dynamic question selection in crowdsourcing systems. In: Proceedings of the 2017 ACM International Conference on Management of Data. ACM (2017)
Mottin, D., Lissandrini, M., Velegrakis, Y., Palpanas, T.: New trends on exploratory methods for data analytics. Proc. VLDB Endow. 10(12), 1977–1980 (2017)
Shneiderman, B.: The eyes have it: a task by data type taxonomy for information visualizations. In: The Craft of Information Visualization, pp. 364–371. Elsevier (2003)
Feige, U., Kortsarz, G., Peleg, D.: The dense k-subgraph problem. Algorithmica 29(3), 410–421 (2001)
Johnson, D.S.: Approximation algorithms for combinatorial problems. In: Proceedings of the 5th Annual ACM Symposium on Theory of Computing (1973)

Author information

Authors and Affiliations

CNRS, Université Grenoble Alpes, Grenoble, France
Behrooz Omidvar-Tehrani & Sihem Amer-Yahia
University of the Philippines Open University, Laguna, Philippines
Ria Mae Borromeo

Corresponding author

Correspondence to Behrooz Omidvar-Tehrani.

Appendix: NP-hardness proofs

Theorem 2

The decision version of the gDiscover problem is NP-complete.

Proof

It is shown in [51] that a single-objective optimization problem for user group set discovery is NP-complete, by a reduction from the Exact 3-Set Cover problem (EC3). In that problem, homogeneity is maximized subject to a threshold on coverage. In our case, two new conflicting dimensions (diversity and coverage) are added. The problem in [51] is therefore a special case of ours, so our problem is at least as hard. \(\square \)

For our proofs of hardness, we consider an infinite time limit in gNavigate since that does not affect the complexity of our problem.

Theorem 3

The exploration operation of gNavigate, i.e., opExplore, is NP-complete.

Proof

The decision version of the problem is as follows: given a group \(g\), a set of groups \({\mathcal {G}}\), a positive integer \(k\), and an overlap threshold \(\mu \), is there a subset of groups \({\mathcal {G}}' \subseteq explore(g,{\mathcal {G}},\mu )\) such that (i) every \(g' \in {\mathcal {G}}'\) satisfies \(g' \ne g \wedge overlap(g,g') \ge \mu \), and (ii) \(\sum _{(g_1,g_2) \in {\mathcal {G}}' \times {\mathcal {G}}' \mid g_1 \ne g_2}(1- overlap(g_1,g_2))\) is maximized? A verifier \(v\) that returns true when both conditions (i) and (ii) are satisfied runs in time polynomial in the length of its input.
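The verifier described above can be sketched in a few lines. This is only an illustration under stated assumptions: the appendix does not define overlap, so Jaccard similarity over member sets is assumed here; the `target` threshold is a hypothetical parameter that turns "is maximized" into a yes/no check; and the sum in (ii) is taken over unordered pairs. All function names are illustrative and do not come from the paper.

```python
from itertools import combinations


def overlap(a, b):
    """ASSUMPTION: overlap is Jaccard similarity of the two member sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def satisfies_overlap(g, candidates, mu):
    """Condition (i): each candidate differs from g and overlaps it by at least mu."""
    return all(c != g and overlap(g, c) >= mu for c in candidates)


def diversity(candidates):
    """Objective (ii): sum of (1 - overlap) over unordered pairs of candidates."""
    return sum(1.0 - overlap(a, b) for a, b in combinations(candidates, 2))


def verify(g, candidates, mu, k, target):
    """Hypothetical decision check: k groups, condition (i), diversity >= target."""
    return (
        len(candidates) == k
        and satisfies_overlap(g, candidates, mu)
        and diversity(candidates) >= target
    )


g = frozenset({1, 2, 3, 4})
candidates = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({1, 4, 5})]
print(verify(g, candidates, mu=0.3, k=3, target=2.0))
```

Both checks iterate over at most k candidates and k(k-1)/2 pairs, which is why the verification step is polynomial in the input size.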

To prove NP-completeness, we reduce the Maximum Edge Subgraph problem (MES) [69] (also known as Dense k-subgraph) to the decision version of our problem. MES is defined as follows. Given an instance I consisting of a graph \(G=(V,E)\), a weight function \(w: E \rightarrow {\mathbb {N}}\), and a positive integer k, find a subset \(V' \subseteq V\) with \(|V'|=k\) such that the total weight of the edges induced by \(V'\), i.e., \(\sum _{(v_i,v_j) \in V' \times V'} w(v_i,v_j)\), is maximized. This is an NP-complete problem [69] (originally reduced from the Clique problem).

Given I, we create an instance J of our problem as follows. J consists of a graph \(G=(V,E)\) where the vertices \(V= explore (g,{\mathcal {G}},\mu )\) are the groups that satisfy (i). Every pair of groups \((g_1,g_2) \in V \times V\) is connected by a weighted edge with \(w(g_1,g_2)=1 - overlap(g_1,g_2) \). The subset \(V' \subseteq V\) (\(|V'|=k\)) is then a subset of groups for which the sum of the weights over all pairs of groups in \(V'\) is maximized, where \(V'\) induces \(|E(V')|=\frac{k \times (k-1)}{2}\) edges. The set \(V'\) is the most diverse subset of \({\mathcal {G}}\) that satisfies the overlap condition (\(\forall g' \in V', overlap(g,g') \ge \mu \)). Therefore, a set \(V'\) is a solution in instance I of MES if and only if it is a solution in instance J of our problem. Hence, the exploration problem is NP-complete. \(\square \)
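To make the graph construction concrete, the following sketch builds the edge weights \(w(g_1,g_2)=1 - overlap(g_1,g_2)\) and searches all k-subsets exhaustively. It is an illustration, not the authors' algorithm: overlap is assumed to be Jaccard similarity (the appendix does not define it), the function names are hypothetical, and the exhaustive search is exponential in k, so it is only usable on toy inputs.

```python
from itertools import combinations


def overlap(a, b):
    """ASSUMPTION: overlap is Jaccard similarity of the two member sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_weights(groups):
    """Edge weight 1 - overlap for every unordered pair of group indices (i < j)."""
    return {
        (i, j): 1.0 - overlap(groups[i], groups[j])
        for i, j in combinations(range(len(groups)), 2)
    }


def most_diverse_subset(groups, k):
    """Try every k-subset of vertices and keep the one with maximum induced weight."""
    weights = build_weights(groups)
    best_subset, best_weight = None, float("-inf")
    for subset in combinations(range(len(groups)), k):
        total = sum(weights[pair] for pair in combinations(subset, 2))
        if total > best_weight:
            best_subset, best_weight = subset, total
    return best_subset, best_weight


groups = [{1, 2, 3}, {1, 2, 4}, {3, 5}, {7, 8}]
print(most_diverse_subset(groups, k=3))
```

The number of k-subsets grows combinatorially with the number of candidate groups, which is consistent with the hardness result and suggests why heuristics or time limits are needed in practice.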

Theorem 4

The exploitation operation of gNavigate, i.e., opExploit, is NP-complete.

Proof

Similar to opExplore, a verifier v for exploitation runs in time polynomial in the length of its input. To prove NP-completeness, we reduce the Maximum Coverage Problem (MCP) [70] to the decision version of our problem. MCP is defined as follows. Given an instance I consisting of m sets \(S = \{S_1, \dots , S_m\}\) where \(S_i \subseteq S_M\) (\(S_M\) being a reference set), and a positive integer k, find a subset \(S' \subseteq S\) such that \(|S'|=k\) and the fraction of covered elements in \(S_M\), i.e., \(|\cup _{S_i \in S'} S_i| / |S_M|\), is maximized. This is an NP-complete problem [70]. Given I, we can create an instance J of our problem which consists of m sets \(S = exploit (g,{\mathcal {G}},\mu )\) and a reference group, i.e., \(S_M = g_{in}\). In opExploit, we seek k groups \(S' \subseteq S\) that cover the maximum number of users in \(S_M\), i.e., \(|\cup _{S_i \in S'} S_i| / |S_M|\) is maximized. Therefore, a set \(S'\) is a solution in instance I of MCP if and only if it is a solution in instance J of opExploit. Hence opExploit is NP-complete. \(\square \)
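As an illustration of the coverage objective \(|\cup _{S_i \in S'} S_i| / |S_M|\), the sketch below computes it and applies the standard greedy heuristic for maximum coverage, which is known to achieve a \(1 - 1/e\) approximation. This is a swapped-in textbook technique, not necessarily what opExploit does; the function names are hypothetical, and the reference set is assumed to be non-empty.

```python
def coverage(selected, reference):
    """Fraction of the (non-empty) reference set covered by the selected sets."""
    covered = set().union(*selected) & reference
    return len(covered) / len(reference)


def greedy_max_coverage(sets, reference, k):
    """Pick up to k sets, each time taking the one covering the most new elements."""
    remaining = list(sets)
    chosen, covered = [], set()
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda s: len((s & reference) - covered))
        if not (best & reference) - covered:
            break  # no remaining set adds any coverage
        chosen.append(best)
        covered |= best & reference
        remaining.remove(best)
    return chosen


reference = set(range(1, 11))  # stands in for the users of the reference group
sets = [{1, 2, 3, 4}, {3, 4, 5}, {5, 6, 7, 8, 9}, {9, 10}]
chosen = greedy_max_coverage(sets, reference, k=2)
print(chosen, coverage(chosen, reference))
```

The greedy choice runs in polynomial time but does not guarantee the optimal k sets, which is the trade-off one would expect given the NP-completeness result above.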

About this article

Cite this article

Omidvar-Tehrani, B., Amer-Yahia, S. & Borromeo, R.M. User group analytics: hypothesis generation and exploratory analysis of user data. The VLDB Journal 28, 243–266 (2019). https://doi.org/10.1007/s00778-018-0527-4

Received: 06 February 2018
Revised: 25 September 2018
Accepted: 13 October 2018
Published: 26 October 2018
Issue Date: 11 April 2019
DOI: https://doi.org/10.1007/s00778-018-0527-4
