Mining Entity Rankings

  • Focus article (Schwerpunktbeitrag)
  • Datenbank-Spektrum

Abstract

In this paper, we propose models, algorithms, and implementation details of an approach that extracts the most relevant entity rankings from large datasets. The approach is fully automated because manual solutions do not scale to large amounts of structured data that go beyond well-understood database schemas. Its core task is to decide which categorical constraints, ranking order (descending or ascending), and length together form an interesting ranking. We use a model based on information entropy to find interesting and relevant categorical constraints, and we devise pruning conditions to avoid generating too many irrelevant rankings. We further investigate the skewness of the value distributions of ranking criteria to find suitable ranking dimensions and ranking orders, and we present an overall scoring model to assess the meaningfulness of a ranking. For each individual step of our approach, we discuss iterative MapReduce-based algorithms. Finally, we report an experimental evaluation on real-world data in which users manually assess the rankings that our approach generates.
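
The two statistics named in the abstract can be illustrated with their textbook definitions. The following is a minimal sketch, not the scoring model of the paper (which is not reproduced in this preview): `shannon_entropy` and `moment_skewness` are hypothetical helper names, and how the paper turns these quantities into constraint relevance or a ranking order is not assumed here.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`.

    Hypothetical helper: standard definition, not the paper's exact model.
    """
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def moment_skewness(xs):
    """Moment-based skewness g1 = m3 / m2**1.5 of a numeric sample.

    Hypothetical helper: returns 0.0 for a constant sample.
    """
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
    if m2 == 0:
        return 0.0
    return m3 / m2 ** 1.5


# Illustrative, made-up data: a categorical attribute and a numeric criterion.
countries = ["DE", "FR", "DE", "FR"]
populations = [1, 1, 1, 10]

print(shannon_entropy(countries))    # evenly split over two values
print(moment_skewness(populations))  # long right tail gives a positive value
```

An evenly distributed categorical attribute yields higher entropy than a dominated one, and a long right tail in a numeric attribute yields positive skewness. Both functions need only counts and power sums, which is the kind of aggregate that can be computed in a MapReduce-style pass.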

Author information

Corresponding author

Correspondence to S. Michel.

Additional information

This work has been partially supported by the German Research Foundation (DFG) in project MI 1794/1-1.

About this article

Cite this article

Pal, K., Reinartz, F. & Michel, S. Mining Entity Rankings. Datenbank Spektrum 16, 27–38 (2016). https://doi.org/10.1007/s13222-015-0205-2

  • DOI: https://doi.org/10.1007/s13222-015-0205-2
