Supporting a Social Media Observatory with Customizable Index Structures: Architecture and Performance

Abstract

The intensive research activity in analysis of social media and micro-blogging data in recent years suggests the necessity and great potential of platforms that can efficiently store, query, analyze, and visualize social media data. To support these “social media observatories” effectively, a storage platform must satisfy special requirements for loading and storage of multi-terabyte datasets, as well as efficient evaluation of queries involving analysis of the text of millions of social updates. Traditional inverted indexing techniques do not meet such requirements. As a solution, we propose a general indexing framework, IndexedHBase, to build specially customized index structures for facilitating efficient queries on an HBase distributed data storage system. IndexedHBase is used to support a social media observatory that collects and analyzes data obtained through the Twitter streaming API. We develop a parallel query evaluation strategy that can explore the customized index structures efficiently, and test it on a set of typical social media data queries. We evaluate the performance of IndexedHBase on FutureGrid and compare it with Riak, a widely adopted commercial NoSQL database system. The results show that IndexedHBase provides a data loading speed that is six times faster than Riak and is significantly more efficient in evaluating queries involving large result sets.

Keywords

  • Index Structure
  • Query Evaluation
  • Index Table
  • Inverted Index
  • Hadoop Distributed File System

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

  • DOI: 10.1007/978-1-4939-1905-5_17
  • Chapter length: 27 pages

Acknowledgements

We would like to thank Onur Varol, Mohsen JafariAsbagh, Alessandro Flammini, Geoffrey Fox, and other colleagues and members of the Center for Complex Networks and Systems Research (cnets.indiana.edu) at Indiana University for helpful discussions and contributions to the Truthy project and the present work. We would also like to personally thank Koji Tanaka and the rest of the FutureGrid team for their continued help. We gratefully acknowledge support from the National Science Foundation (grant CCF-1101743), DARPA (grant W911NF-12-1-0037), and the J. S. McDonnell Foundation. FutureGrid is supported by the National Science Foundation under Grant 0910812 to Indiana University for “An Experimental, High-Performance Grid Test-bed.” IndexedHBase is supported in part by National Science Foundation CAREER Award OCI-1149432.

Corresponding author

Correspondence to Xiaoming Gao.

Copyright information

© 2014 Springer Science+Business Media New York

About this chapter

Cite this chapter

Gao, X. et al. (2014). Supporting a Social Media Observatory with Customizable Index Structures: Architecture and Performance. In: Li, X., Qiu, J. (eds) Cloud Computing for Data-Intensive Applications. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-1905-5_17

  • DOI: https://doi.org/10.1007/978-1-4939-1905-5_17

  • Published:

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4939-1904-8

  • Online ISBN: 978-1-4939-1905-5

  • eBook Packages: Computer Science, Computer Science (R0)