Two-Tier Machine Learning Using Conditional Random Fields with Constraints

  • Sebastian Lindner
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 454)


This paper presents a novel two-tier machine learning approach for locating bibliographic references in HTML documents and separating them into fields. First, it demonstrates how Conditional Random Fields (CRFs) with constraints can be used to split bibliographic references into fields, e.g. authors and title. To this end, a unique feature set, constraints and a method for automatic keyword extraction are introduced. The output of this reference-tagging CRF, together with Part-Of-Speech (POS) analysis and Named Entity Recognition (NER), forms the first tier; this output is then used to locate the bibliographic reference section in the first place. For this purpose, the documents are split into blocks, which are then classified. For this classification task, a Support Vector Machine (SVM) approach is compared with one using a CRF. We demonstrate that this two-tier approach achieves very good results, while the reference tagging approach is able to compete with other state-of-the-art approaches.
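The data flow of the two-tier idea can be illustrated with a small sketch. This is not the paper's implementation: the regular expressions stand in for the first-tier taggers (CRF, POS, NER), a fixed threshold rule stands in for the second-tier SVM or CRF block classifier, and all names (tag_tokens, block_features, is_reference_block, min_initial_ratio) are hypothetical. It only shows how token-level labels from the first tier can become block-level features for the second tier.

```python
import re
from typing import Dict, List

# Assumption: simple patterns replace the trained first-tier taggers.
YEAR = re.compile(r"^\(?(19|20)\d{2}\)?[.,;:]?$")      # e.g. "(2002)" or "1998,"
INITIAL = re.compile(r"^[A-Z]\.(?:[A-Z]\.)*,?:?$")     # e.g. "F.:" or "K.D.,"


def tag_tokens(block: str) -> List[str]:
    """Tier 1 stand-in: label each whitespace-separated token."""
    labels = []
    for token in block.split():
        if YEAR.match(token):
            labels.append("YEAR")
        elif INITIAL.match(token):
            labels.append("INITIAL")
        else:
            labels.append("OTHER")
    return labels


def block_features(block: str) -> Dict[str, float]:
    """Aggregate the tier-1 labels of one block into block-level features."""
    labels = tag_tokens(block)
    n = max(len(labels), 1)  # avoid division by zero for empty blocks
    return {
        "year_count": labels.count("YEAR"),
        "initial_ratio": labels.count("INITIAL") / n,
    }


def is_reference_block(block: str, min_initial_ratio: float = 0.05) -> bool:
    """Tier 2 stand-in: a threshold rule in place of a trained classifier."""
    feats = block_features(block)
    return feats["year_count"] >= 1 and feats["initial_ratio"] >= min_initial_ratio


# Two example blocks, as they might result from splitting a document.
blocks = [
    "This paper shows a novel approach of two-tier machine learning.",
    "Sebastiani, F.: Machine learning in automated text categorization. "
    "ACM Comput. Surv. 34(1), 1-47 (2002)",
]
flags = [is_reference_block(b) for b in blocks]
```

In a real system, the feature dictionary returned by block_features would be passed to a trained classifier rather than compared against a hand-picked threshold.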


Keywords: Classification; Conditional random fields; Constraint-based learning; Information extraction; Information retrieval; Machine learning; References parsing; Semi-supervised learning; Support vector machines



Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  1. illucIT Software GmbH, Würzburg, Germany
