Generation and Corruption of Semi-Structured and Structured Data

Part of the book series: Lecture Notes in Social Networks ((LNSN))

Abstract

Data must be a reliable source of information if decisions based on its analysis are to provide a competitive edge and reduce the negative impacts that impose significant annual costs on organizations. Data can take more than one form, including semi-structured and structured data. Many factors can corrupt data and degrade its quality, including duplicate records, inaccurate values, inconsistent values, outdated data, and incomplete information. To maintain data quality, the algorithms of different data quality management approaches need to be compared, and this comparison requires common datasets. These datasets can be real or synthetic; synthetic datasets need to satisfy the intrinsic characteristics of data. However, such datasets are not common: real datasets are subject to privacy constraints, and synthetic data generated or corrupted by existing systems may not satisfy the required quality aspects. To address these issues, we present a system that generates semi-structured and structured data. The generated semi-structured data takes the form of XML documents, and the generated structured datasets satisfy a set of integrity constraints. Our system also generates other data values, such as personal data and sensor data, and it allows the generated semi-structured and structured data to be corrupted.
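
To make the idea of generation and corruption concrete, the following is a minimal Python sketch and not the system described in this chapter. It assumes a single hypothetical integrity constraint (the functional dependency zip -> city), a made-up lookup table, and invented function names (generate_records, satisfies_fd, corrupt, to_xml). It generates clean structured records, serializes them as XML, and then injects a duplicate, an inconsistent value, and a missing value.

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical lookup table, used only for illustration.
CITY_BY_ZIP = {"10001": "New York", "60601": "Chicago", "94105": "San Francisco"}


def generate_records(n, rng):
    """Generate n records that satisfy the functional dependency zip -> city."""
    records = []
    for i in range(n):
        zip_code = rng.choice(sorted(CITY_BY_ZIP))
        records.append({"id": str(i), "zip": zip_code, "city": CITY_BY_ZIP[zip_code]})
    return records


def satisfies_fd(records, lhs, rhs):
    """Return True if equal lhs values always come with equal rhs values."""
    seen = {}
    for record in records:
        if seen.setdefault(record[lhs], record[rhs]) != record[rhs]:
            return False
    return True


def corrupt(records):
    """Return a corrupted copy of at least two records; the input is not modified."""
    dirty = [dict(r) for r in records]
    # Duplicate record: same values as record 0 under a new id.
    dirty.append(dict(dirty[0], id=str(len(dirty))))
    # Inconsistent value: record 0 now disagrees with its duplicate on city.
    dirty[0]["city"] = "Unknown"
    # Incomplete information: blank out one value.
    dirty[1]["zip"] = ""
    return dirty


def to_xml(records):
    """Serialize records as a small XML document (semi-structured form)."""
    root = ET.Element("records")
    for record in records:
        element = ET.SubElement(root, "record", id=record["id"])
        for field in ("zip", "city"):
            ET.SubElement(element, field).text = record[field]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    clean = generate_records(5, rng)
    dirty = corrupt(clean)
    print(satisfies_fd(clean, "zip", "city"), satisfies_fd(dirty, "zip", "city"))
    print(to_xml(dirty))
```

Because the duplicate is appended before record 0 is altered, the corrupted set is guaranteed to violate the dependency, which gives a data cleaning algorithm a known error to detect.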

Author information

Corresponding author

Correspondence to Samir Al-janabi.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Al-janabi, S., Janicki, R. (2019). Generation and Corruption of Semi-Structured and Structured Data. In: Karampelas, P., Kawash, J., Özyer, T. (eds) From Security to Community Detection in Social Networking Platforms. ASONAM 2017. Lecture Notes in Social Networks. Springer, Cham. https://doi.org/10.1007/978-3-030-11286-8_7

  • DOI: https://doi.org/10.1007/978-3-030-11286-8_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-11285-1

  • Online ISBN: 978-3-030-11286-8

  • eBook Packages: Computer Science, Computer Science (R0)
