An Automatic Approach to Generate Corpus in Spanish

Part of the Communications in Computer and Information Science book series (CCIS, volume 885)

Abstract

A corpus is an indispensable linguistic resource for any application of natural language processing. Some corpora have been created manually or semi-automatically for a specific domain. In this paper, we present an automatic approach to generating a corpus from digital information sources such as Wikipedia and web pages. Information is extracted from Wikipedia by delimiting the domain: a propagation algorithm determines the categories associated with a domain region, and a set of seeds delimits the search. Information is extracted efficiently from web pages by determining the patterns associated with the structure of each page, with the aim of determining the quality of the extraction.
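
The abstract does not spell out the propagation algorithm, so the following is only a minimal sketch of one plausible reading: a depth-limited, breadth-first expansion from seed categories over a category graph. The function name `propagate_categories`, the `max_depth` parameter, and the toy graph are all assumptions made for illustration; they are not taken from the paper and no real Wikipedia data or API is used.

```python
from collections import deque


def propagate_categories(graph, seeds, max_depth=2):
    """Expand seed categories breadth-first over a category graph.

    graph: dict mapping a category name to a list of its subcategories.
    seeds: iterable of starting categories that delimit the domain.
    max_depth: number of subcategory levels to follow from the seeds.
    Returns a dict mapping each reached category to its distance
    from the nearest seed.
    """
    depth = {}
    queue = deque()
    for seed in seeds:
        if seed not in depth:
            depth[seed] = 0
            queue.append(seed)

    while queue:
        category = queue.popleft()
        # Stop expanding once the depth limit is reached.
        if depth[category] >= max_depth:
            continue
        for sub in graph.get(category, []):
            if sub not in depth:
                depth[sub] = depth[category] + 1
                queue.append(sub)
    return depth


# Hypothetical, hand-written category graph (not real Wikipedia data).
toy_graph = {
    "Medicina": ["Enfermedades", "Farmacología"],
    "Enfermedades": ["Enfermedades infecciosas"],
    "Enfermedades infecciosas": ["Virología"],
    "Farmacología": [],
}

domain = propagate_categories(toy_graph, seeds=["Medicina"], max_depth=2)
print(sorted(domain))
```

With `max_depth=2`, the seed "Medicina" reaches its subcategories and their subcategories, while "Virología" (three levels down) stays outside the domain region. The depth limit is one simple way to keep the propagation from drifting away from the seed domain; the paper may use a different stopping criterion.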

Keywords

  • Text mining
  • Corpus
  • Knowledge extraction
  • Natural language processing
  • Computational linguistics


Acknowledgements

The tool presented here was developed as part of building the research capabilities of the Center for Excellence and Appropriation in Big Data and Data Analytics (CAOBA), led by the Pontificia Universidad Javeriana and funded by the Ministry of Information Technologies and Telecommunications of the Republic of Colombia (MinTIC).

Author information

Corresponding author

Correspondence to Edwin Puertas.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Puertas, E., Alvarado-Valencia, J.A., Moreno-Sandoval, L.G., Pomares-Quimbaya, A. (2018). An Automatic Approach to Generate Corpus in Spanish. In: Serrano C., J., Martínez-Santos, J. (eds) Advances in Computing. CCC 2018. Communications in Computer and Information Science, vol 885. Springer, Cham. https://doi.org/10.1007/978-3-319-98998-3_12

  • DOI: https://doi.org/10.1007/978-3-319-98998-3_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-98997-6

  • Online ISBN: 978-3-319-98998-3

  • eBook Packages: Computer Science, Computer Science (R0)