Self-supervised Learning Approach for Extracting Citation Information on the Web

  • Dat T. Huynh
  • Wen Hua
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7235)


In this paper, we propose a framework for automatically training a model to extract citation information on the web. Constructing manually labeled training data to learn an extraction model is tedious and time-consuming, and such data is difficult to apply across several citation styles with different types of entities. To eliminate the need for manually labeled training data, we exploit a knowledge base of the citation domain and web search to derive labeled training data automatically. Our experiments show that the combination of a knowledge base, heuristics, and statistical methods can automate the extraction process and achieve good performance.
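To make the idea of deriving labeled training data from a knowledge base more concrete, here is a minimal sketch. It is not the authors' implementation: the toy knowledge base, the label names (AUTHOR, VENUE, YEAR, OTHER), the year heuristic, and the function names are all assumptions made for illustration. It only shows how dictionary lookup plus a simple heuristic could turn a raw citation string into (token, label) pairs of the kind a sequence model is trained on.

```python
import re

# Hypothetical toy knowledge base (an assumption for illustration only);
# the knowledge base used in the paper is not described in this abstract.
KNOWLEDGE_BASE = {
    "AUTHOR": {"lafferty", "mccallum", "pereira"},
    "VENUE": {"icml", "sigmod", "vldb"},
}

# Heuristic (also an assumption): a four-digit token starting with 19 or 20 is a year.
YEAR_PATTERN = re.compile(r"^(19|20)\d{2}$")


def tokenize(citation):
    """Split a citation string into word and number tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+|\d+", citation)


def auto_label(citation):
    """Label each token via the year heuristic first, then knowledge-base lookup."""
    labeled = []
    for token in tokenize(citation):
        lowered = token.lower()
        if YEAR_PATTERN.match(token):
            label = "YEAR"
        else:
            # Take the first knowledge-base category containing the token,
            # falling back to OTHER when nothing matches.
            label = next(
                (name for name, entries in KNOWLEDGE_BASE.items() if lowered in entries),
                "OTHER",
            )
        labeled.append((token, label))
    return labeled


example = auto_label("Lafferty, J., McCallum, A.: Conditional random fields. ICML, 2001")
for token, label in example:
    print(f"{token}\t{label}")
```

Sequences labeled this way could then serve as training input for a statistical sequence labeler; which model and which features to use is left open here.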


Hidden Markov Model, Conditional Random Field, Reference Table, Text Segment, Extraction Model
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Dat T. Huynh (1)
  • Wen Hua (1)
  1. School of Information Technology and Electrical Engineering, The University of Queensland, Australia
