Knowledge and Information Systems, Volume 56, Issue 1, pp 141–163

Tractable queries on big data via preprocessing with logarithmic-size output

  • Jiannan Yang
  • Hanpin Wang
  • Yongzhi Cao
Regular Paper


To provide a dichotomy between queries that are feasible on big data after appropriate preprocessing and those for which preprocessing does not help, Fan et al. developed the \(\sqcap \)-tractability theory, which gives a formal foundation for the tractability of query classes in the context of big data. Inspired by technologies used to deal with big data, we introduce a novel notion of \(\sqcap '\)-tractability in this paper. We restrict preprocessing functions to produce relatively short outputs, of size at most logarithmic in the size of their inputs. We define a complexity class consisting of the classes of Boolean queries that are \(\sqcap '\)-tractable and show that it is properly contained in the class of \(\sqcap \)-tractable query classes, by exhibiting a \(\sqcap \)-tractable query class that is not \(\sqcap '\)-tractable. Using an existing reduction, which does not allow re-factorizing the data and query parts, we define complete query classes for this complexity class and give an efficient way to detect such query classes. We also investigate the query classes that can be made \(\sqcap '\)-tractable and prove that all PTIME classes of Boolean queries can be made \(\sqcap '\)-tractable.
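As an informal illustration of preprocessing with logarithmic-size output, consider the following minimal sketch. It is a toy example of our own and is not taken from the paper: the query class ("does the dataset hold at least k records?"), the function names `preprocess` and `answer`, and the choice of a binary record count as the summary are all assumptions made for illustration only. The point it shows is that a summary of about log2(n) bits can suffice to answer every query in a (very simple) class without touching the data again.

```python
from typing import List


def preprocess(data: List[int]) -> str:
    """Hypothetical preprocessing function (an assumption, not the paper's).

    It discards the data and keeps only the record count, written in binary,
    so the output length grows logarithmically with the number of records.
    """
    return bin(len(data))[2:]


def answer(summary: str, k: int) -> bool:
    """Answer the toy Boolean query 'are there at least k records?'.

    Only the preprocessed summary is consulted; the original data is not.
    """
    return int(summary, 2) >= k


if __name__ == "__main__":
    data = list(range(1_000_000))
    summary = preprocess(data)
    # The summary is tiny compared with the data it was computed from.
    print(f"records: {len(data)}, summary bits: {len(summary)}")
    print(answer(summary, 500_000))
```

Richer query classes generally need more than a count, which is consistent with the abstract's statement that some \(\sqcap \)-tractable query class is not \(\sqcap '\)-tractable; the sketch only makes the size restriction concrete.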


Big data · Complexity class · Preprocessing · Query · Tractability



The authors are very grateful to Professor Wenfei Fan and the anonymous reviewers for their invaluable suggestions. This work was supported by the National Natural Science Foundation of China (Grant Nos. 61370053, 61572003, and 61772035).


  1. Cao Y, Fan W, Wo T, Yu W (2014) Bounded conjunctive queries. PVLDB 7(12):1231–1242
  2. Chen CP, Zhang CY (2014) Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf Sci 275:314–347
  3. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
  4. Fan W, Huai J (2014) Querying big data: bridging theory and practice. J Comput Sci Technol 29(5):849–869
  5. Fan W, Li J, Wang X, Wu Y (2012) Query preserving graph compression. In: Proceedings of the ACM 2012 international conference on management of data, pp 157–168
  6. Fan W, Geerts F, Neven F (2013) Making queries tractable on big data with preprocessing: through the eyes of complexity theory. PVLDB 6(9):685–696
  7. Fan W, Geerts F, Libkin L (2014) On scale independence for querying big data. In: Proceedings of the ACM 33rd symposium on principles of database systems, pp 51–62
  8. Fan W, Wang X, Wu Y (2014) Querying big graphs within bounded resources. In: Proceedings of the ACM 2014 international conference on management of data, pp 301–312
  9. Fiori A, Mignone A, Rospo G (2016) Decoclu: density consensus clustering approach for public transport data. Inf Sci 328:378–388
  10. Gani A, Siddiqa A, Shamshirband S, Hanum F (2016) A survey on indexing techniques for big data: taxonomy and performance evaluation. Knowl Inf Syst 46(2):241–284
  11. Greenlaw R (1993) Breadth-depth search is P-complete. Parallel Process Lett 3(03):209–222
  12. Greenlaw R, Hoover HJ, Ruzzo WL (1995) Limits to parallel computation: P-completeness theory. Oxford University Press, New York
  13. Hamooni H, Mueen A, Neel A (2016) Phoneme sequence recognition via dtw-based classification. Knowl Inf Syst 48(2):253–275
  14. Hashem IAT, Yaqoob I, Anuar NB, Mokhtar S, Gani A, Khan SU (2015) The rise of “big data” on cloud computing: review and open research issues. Inf Syst 47:98–115
  15. Jagadish HV, Gehrke J, Labrinidis A, Papakonstantinou Y, Patel JM, Ramakrishnan R, Shahabi C (2014) Big data and its technical challenges. Commun ACM 57(7):86–94
  16. Jung G, Gnanasambandam N, Mukherjee T (2012) Synchronous parallel processing of big-data analytics services to optimize performance in federated clouds. In: IEEE proceedings of the 5th international conference on cloud computing, pp 811–818
  17. Kang U, Tong H, Sun J, Lin C, Faloutsos C (2011) Gbase: A scalable and general graph management system. In: ACM proceedings of the 17th international conference on knowledge discovery and data mining, pp 1091–1099
  18. Marz N, Warren J (2015) Big data: principles and best practices of scalable realtime data systems. Manning Publications Co, Greenwich
  19. Michael K, Miller KW (2013) Big data: new opportunities and new challenges. Computer 46(6):22–24
  20. Mozafari B, Zeng K, D’Antoni L, Zaniolo C (2013) High-performance complex event processing over hierarchical data. ACM T Database Syst 38(4):21
  21. National Research Council (2013) Frontiers in massive data analysis. The National Academies Press, Washington
  22. Papadimitriou CH (2003) Computational complexity. In: Encyclopedia of computer science. Wiley, Chichester, pp 260–265
  23. Ramentol E, Caballero Y, Bello R, Herrera F (2012) Smote-rsb*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using smote and rough sets theory. Knowl Inf Syst 33(2):245–265
  24. del Río S, López V, Benítez JM, Herrera F (2014) On the use of mapreduce for imbalanced big data using random forest. Inf Sci 285:112–137
  25. Sarma AD, Lee H, Gonzalez H, Madhavan J, Halevy AY (2013) Consistent thinning of large geographical data for map visualization. ACM T Database Syst 38(4):22
  26. Vardi MY (1982) The complexity of relational query languages. In: Proceedings of the 14th Annual ACM Symposium on Theory of Computing, pp 137–146
  27. Wu X, Zhu X, Wu G, Ding W (2014) Data mining with big data. IEEE T Knowl Data En 26(1):97–107
  28. Yang C, Zhang X, Zhong C, Liu C, Pei J, Ramamohanarao K, Chen J (2014) A spatiotemporal compression based approach for efficient big data processing on cloud. J Comput Syst Sci 80(8):1563–1583

Copyright information

© Springer-Verlag London Ltd. 2017

Authors and Affiliations

  1. Key Laboratory of High Confidence Software Technologies (MOE), School of Electronics Engineering and Computer Science, Peking University, Beijing, China
