PLIDMiner: A Quality Based Approach for Researcher’s Homepage Discovery

  • Junting Ye
  • Yanan Qian
  • Qinghua Zheng
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7675)


High-quality researcher homepages are important resources in academic search because they provide comprehensive and up-to-date information about researchers. Low-quality homepages, however, are widespread. A case study shows that 57.8% of all homepages retrieved among the top 10 Google results are low quality, and that 95% of top researchers own out-of-date homepages. In addition, some academic portals generate dynamic homepages that introduce researchers; these pages are not maintained by the researchers themselves and may contain incorrect information. Existing work cannot ensure the quality of the homepages it discovers, which reduces the efficiency of academic search. Because a high-quality homepage is difficult to define quantitatively, we instead analyze labeled high-quality homepages and propose the “informative researcher’s homepage” as an estimate of one: a homepage that contains at least identifiable information (a researcher’s basic information) and a publication list (his or her publications). Based on the observation that informative researchers’ homepages are organized in two ways, integrated and scattered, we propose an effective discovery model, PLIDMiner, which achieves F1 scores over 0.9 on labeled data. The model can also be applied to verify homepage quality. We crawl thousands of homepage resources from popular academic portals and assess their overall quality. Nearly 25% of the homepage resources in these portals turn out not to be informative, which strengthens our motivation.
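
The definition of an informative homepage given in the abstract can be illustrated with a minimal sketch. This is not the PLIDMiner model itself, which the abstract does not describe in detail. The `Page` type, its two boolean flags, and the reading of "scattered" as "the two components are spread over the entry page and pages it links to" are assumptions made purely for illustration; in practice the flags would come from trained classifiers. The `f1_score` helper only restates the standard F1 formula behind the reported metric.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Page:
    # Hypothetical flags; a real system would predict them with classifiers.
    has_identifiable_info: bool
    has_publication_list: bool
    linked_pages: list[Page] = field(default_factory=list)


def classify_homepage(entry: Page) -> str:
    """Label an entry page as 'integrated', 'scattered', or 'not informative'."""
    # Integrated: both required components appear on the entry page itself.
    if entry.has_identifiable_info and entry.has_publication_list:
        return "integrated"
    # Scattered (assumed reading): the components are split across the
    # entry page and the pages it links to directly.
    pages = [entry, *entry.linked_pages]
    has_info = any(p.has_identifiable_info for p in pages)
    has_pubs = any(p.has_publication_list for p in pages)
    if has_info and has_pubs:
        return "scattered"
    return "not informative"


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, an entry page holding only basic information that links to a separate publications page would be labeled "scattered" under these assumptions, while the same entry page with no such link would be labeled "not informative".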


Keywords: Researcher’s homepage, Quality based, Machine learning





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Junting Ye (1)
  • Yanan Qian (1)
  • Qinghua Zheng (1)
  1. SPKLSTN Lab, Department of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China
