International Conference of the Pacific Association for Computational Linguistics

Computational Linguistics, pp. 193–208

Detecting Vital Documents Using Negative Relevance Feedback in Distributed Realtime Computation Framework

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 593)

Abstract

Existing knowledge bases, including Wikipedia, are typically written and maintained by groups of volunteer editors. Meanwhile, numerous web documents are being published, partly due to the popularization of online news and social media. Some of these web documents, called “vital documents”, contain novel information that should be taken into account when updating knowledge base articles. However, it is virtually impossible for editors to manually monitor all the relevant web documents. As a result, there is a considerable time lag between the publication of a web document and the corresponding edit to the knowledge base. This paper proposes a framework for the realtime detection of web documents containing novel information in massive document streams. The framework consists of a two-step filter that uses statistical language models. Further, the framework is implemented on Apache Storm, a distributed and fault-tolerant realtime computation system, in order to process the sheer volume of web documents. The validity of the proposed framework is demonstrated on a publicly available web document data set, the TREC KBA Stream Corpus.
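
The abstract does not spell out the two filtering steps, so the following is only an illustrative sketch of how a two-step language-model filter with negative feedback might look, not the authors' implementation. It assumes that step one scores a document against a unigram model of the target entity, and that step two subtracts a penalty for similarity to a model built from known non-vital documents. The function names, the Jelinek-Mercer smoothing weight `lam`, the penalty weight `beta`, both thresholds, and the toy vocabulary are all hypothetical. The sketch runs in a single process; distributing it over Apache Storm is out of its scope.

```python
import math
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram model: token -> probability."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def log_ratio(doc_tokens, model, background, lam=0.5, floor=1e-9):
    """Average per-token log ratio of a smoothed model to the background.

    The model is interpolated with the background (Jelinek-Mercer smoothing),
    so tokens unseen in the model contribute log(1 - lam) rather than -inf.
    """
    if not doc_tokens:
        return 0.0
    total = 0.0
    for w in doc_tokens:
        bg = background.get(w, floor)
        total += math.log((lam * model.get(w, 0.0) + (1 - lam) * bg) / bg)
    return total / len(doc_tokens)


def classify(doc_tokens, entity_model, negative_model, background,
             relevance_threshold=-0.5, vital_threshold=0.0, beta=1.0):
    """Two-step filter (all thresholds are illustrative assumptions)."""
    # Step 1: discard documents that do not resemble the target entity.
    relevance = log_ratio(doc_tokens, entity_model, background)
    if relevance < relevance_threshold:
        return "irrelevant"
    # Step 2: negative feedback, penalizing similarity to non-vital documents.
    penalty = log_ratio(doc_tokens, negative_model, background)
    if relevance - beta * penalty > vital_threshold:
        return "vital"
    return "non-vital"


# Toy data (hypothetical): an entity profile, non-vital examples, and a corpus.
entity_tokens = ["storm", "apache", "stream", "processing", "storm"]
negative_tokens = ["storm", "weather", "rain", "wind"]
corpus_tokens = (entity_tokens + negative_tokens
                 + ["news", "sports", "music", "weather", "news"])

entity_model = unigram_model(entity_tokens)
negative_model = unigram_model(negative_tokens)
background = unigram_model(corpus_tokens)

for doc in (["apache", "storm", "stream"],
            ["storm", "rain", "wind"],
            ["sports", "music", "news"]):
    print(doc, classify(doc, entity_model, negative_model, background))
```

With this toy data, a document that shares the entity's name but otherwise matches the non-vital examples passes the first step and is rejected in the second, which is the role negative feedback is assumed to play here.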

Keywords

Negative feedback, Realtime processing, Text data streams, Wikipedia

Copyright information

© Springer Science+Business Media Singapore 2016

Authors and Affiliations

  1. Graduate School of System Informatics, Kobe University, Kobe, Japan
  2. Faculty of Intelligence and Informatics, Konan University, Kobe, Japan
