Named Entity Recognition and Identification for Finding the Owner of a Home Page

  • Vassilis Plachouras
  • Matthieu Rivière
  • Michalis Vazirgiannis
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7301)


Entity-based applications, such as expert search or online social networks in which users search for persons, require high-quality datasets of named entity references. Such datasets can be obtained by automatically extracting metadata from Web pages. In this work, we focus on identifying the named entity that corresponds to the owner of a particular Web page, for example, a home page or an organizational staff Web page. More specifically, given a set of named entities that have already been extracted from a Web page, we identify the one that corresponds to the owner of the page. First, we develop a set of features that are combined in a scoring function to select the named entity of the Web page owner. Second, we formulate the task as a classification problem in which a pair consisting of a Web page and a named entity is classified as associated or not associated. We evaluate the proposed approaches on a set of Web pages in which named entities have previously been identified. Our experimental results show that the named entity corresponding to the owner of a home page can be identified with accuracy over 90%.
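
The scoring-function approach described in the abstract can be illustrated with a minimal Python sketch. The abstract does not list the actual features or weights, so everything below is an assumption made for illustration: the `Page` type, the four features (`in_title`, `in_url`, `frequency`, `early`), the hand-set `WEIGHTS`, and the function names are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical container for the parts of a Web page used below."""
    url: str
    title: str
    text: str


def features(page: Page, entity: str) -> dict:
    """Compute illustrative features for a (page, entity) pair.

    These features are assumptions for this sketch, not the paper's features.
    """
    name = entity.lower()
    tokens = name.split()
    text = page.text.lower()
    first = text.find(name)  # -1 if the entity never occurs in the text
    return {
        # Does the full name occur in the page title?
        "in_title": 1.0 if name in page.title.lower() else 0.0,
        # Does any token of the name occur in the URL?
        "in_url": 1.0 if any(t in page.url.lower() for t in tokens) else 0.0,
        # Occurrences of the name relative to the number of words on the page.
        "frequency": text.count(name) / max(1, len(text.split())),
        # Higher when the first mention is close to the start of the page.
        "early": 1.0 - first / max(1, len(text)) if first >= 0 else 0.0,
    }


# Hand-set weights, chosen only for illustration.
WEIGHTS = {"in_title": 2.0, "in_url": 1.5, "frequency": 1.0, "early": 0.5}


def score(page: Page, entity: str) -> float:
    """Linear combination of the features: one simple form of scoring function."""
    f = features(page, entity)
    return sum(WEIGHTS[k] * f[k] for k in WEIGHTS)


def select_owner(page: Page, candidates: list) -> str:
    """Return the candidate named entity with the highest score."""
    return max(candidates, key=lambda e: score(page, e))


page = Page(
    url="http://example.org/~jsmith/",
    title="Jane Smith - Home Page",
    text="Jane Smith is a researcher. She works with John Doe and Mary Major.",
)
print(select_owner(page, ["John Doe", "Jane Smith", "Mary Major"]))
```

For the classification formulation, the same per-pair feature values could serve as the input vector of a binary classifier that labels each (Web page, named entity) pair as associated or not; that variant is not shown here.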


named entity recognition, entity selection





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Vassilis Plachouras (1, 2)
  • Matthieu Rivière (2)
  • Michalis Vazirgiannis (1, 3)

  1. LIX, École Polytechnique, Palaiseau, France
  2. PRESANS, X-TEC, École Polytechnique, Palaiseau, France
  3. Dept of Informatics, AUEB, Athens, Greece
