A Crawler–Parser-Based Approach to Newspaper Scraping and Reverse Searching of Desired Articles

Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 701)

Abstract

How often does it happen that we cannot get enough information from a newspaper? Often an article mentions a name we have not heard before, or simply does not shed enough light on the news and its details. Online newspapers also suffer from webpage noise: every article is filled with HTML, meta tags, JavaScript, and other markup. This paper provides a fast and efficient approach to scraping a newspaper to obtain any desired article without the noise, and to reverse searching the same topic on Google to get a list of the most relevant information regarding that article. The algorithm supports ten languages and works with leading news sources such as CNN and BBC.
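
As a rough illustration of the two steps named in the abstract (removing webpage noise, then building a reverse-search query), the following minimal Python sketch uses only the standard library. It is not the authors' implementation: the names `TextExtractor`, `extract_text`, `top_keywords`, and `search_url`, the set of noise tags, the tiny stop-word list, and the frequency-based keyword choice are all assumptions made for this example.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlencode
import re

# Tags whose contents are treated as noise in this sketch (an assumption).
NOISE_TAGS = {"script", "style", "noscript"}

# A tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "is", "for", "with", "that"}


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of noise tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a noise tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside every noise tag.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML page as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def top_keywords(text, k=3):
    """Pick the k most frequent words that are not stop words."""
    words = [
        w for w in re.findall(r"[a-z]+", text.lower())
        if w not in STOP_WORDS and len(w) > 2
    ]
    return [w for w, _ in Counter(words).most_common(k)]


def search_url(keywords):
    """Build a Google search URL for the given keywords."""
    return "https://www.google.com/search?" + urlencode({"q": " ".join(keywords)})
```

A caller would pass the raw HTML of an article to `extract_text`, feed the result to `top_keywords`, and open the URL returned by `search_url`. Fetching the page and ranking the search results are left out of this sketch.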

Keywords

Reverse searching, Parsing, Crawling, Newspaper

Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. St. Thomas College of Engineering and Technology, Kolkata, India
