Internet Data Extraction and Analysis for Profile Generation

  • Álvaro Bartolomé
  • David García-Retuerta
  • Francisco Pinto-Santos
  • Pablo Chamoso
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1006)


Almost everything is stored on the Internet nowadays, and entrusting data to the Internet has become common in recent years, which directly increases the value of data retrieval. Through the Internet, data scientists can access the data available online and turn it into useful information. Because people entrust so much data to the Internet, they sometimes overlook the fact that it can be easily extracted, even when they believe their information is safe or unavailable. In this article, we propose a system in which several data extraction techniques are analysed in order to give an overview of how much data about a person can be extracted from the Internet, and of how that data is turned into information with added value. The proposed system is capable of retrieving large volumes of data about a person and processing it using Artificial Intelligence to classify its content and generate a personal profile containing all the information once it has been analysed. This research focuses on personal profile generation for people from Spain, but it could be implemented for any other country. The proposed system has been implemented and tested on different people, and the results were quite satisfactory.


Information recovery, Information fusion, Big Data, Profile generation



This research has been partially supported by the European Regional Development Fund (FEDER) within the framework of the Interreg program V-A Spain-Portugal 2014–2020 (PocTep) under the IOTEC project grant 0123 IOTEC 3 E.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Álvaro Bartolomé (1)
  • David García-Retuerta (1)
  • Francisco Pinto-Santos (1)
  • Pablo Chamoso (1)
  1. BISITE Research Group, University of Salamanca, Salamanca, Spain
