Abstract
High-quality researcher homepages are important resources in academic search because they provide comprehensive, up-to-date information about researchers. Low-quality homepages, however, are widespread. A case study shows that 57.8% of all homepages retrieved among the top 10 Google results are low quality and that 95% of top researchers have out-of-date homepages. In addition, some academic portals dynamically generate homepages that introduce researchers. These homepages are not maintained by the researchers themselves and may contain incorrect information. Existing work cannot ensure the quality of discovered homepages, which reduces the efficiency of academic search. It is difficult to define a high-quality homepage quantitatively. Instead, based on an analysis of labeled high-quality homepages, we propose the “informative researcher’s homepage”, which consists of at least identifiable information (a researcher’s basic information) and a publication list (his or her corresponding publications), as an approximation of a high-quality homepage. Based on the observation that informative researchers’ homepages are organized in two ways, integrated and scattered, we propose an effective discovery model, PLIDMiner, which achieves F1 scores over 0.9 on labeled data. Our model can also be applied to verify homepage quality. We crawl thousands of homepage resources from popular academic portals and assess their overall quality. Nearly 25% of the homepage resources in these portals turn out not to be informative, which reinforces our motivation.
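The definition of an informative homepage can be illustrated with a minimal sketch. This is not PLIDMiner itself: the abstract does not describe the model's features, so the heuristics below (an email address plus the researcher's name as "identifiable information", and a few year-bearing lines as a "publication list") are placeholder assumptions. The sketch also assumes that "integrated" means both components appear on one page and "scattered" means the publication list sits on a directly linked page; the names `Page`, `has_identifiable_info`, `has_publication_list`, and `classify` are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Placeholder heuristics (assumptions, not taken from the paper).
YEAR = re.compile(r"\b(19|20)\d{2}\b")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")


@dataclass
class Page:
    """A page's plain text plus the pages it links to (hypothetical type)."""
    text: str
    linked: List["Page"] = field(default_factory=list)


def has_identifiable_info(text: str, name: str) -> bool:
    # Assumed proxy: the researcher's name and an email address both appear.
    return name.lower() in text.lower() and EMAIL.search(text) is not None


def has_publication_list(text: str, min_entries: int = 3) -> bool:
    # Assumed proxy: at least `min_entries` lines mention a publication year.
    entries = [line for line in text.splitlines() if YEAR.search(line)]
    return len(entries) >= min_entries


def classify(page: Page, name: str) -> str:
    """Label a candidate homepage as integrated, scattered, or not informative."""
    if not has_identifiable_info(page.text, name):
        return "not informative"
    if has_publication_list(page.text):
        return "integrated"   # both components on the same page
    if any(has_publication_list(p.text) for p in page.linked):
        return "scattered"    # publication list on a linked page
    return "not informative"
```

A candidate page that carries contact details but links out to a separate publications page would be labeled `"scattered"` under these assumptions, while a page with neither component is rejected regardless of how highly a search engine ranks it.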
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ye, J., Qian, Y., Zheng, Q. (2012). PLIDMiner: A Quality Based Approach for Researcher’s Homepage Discovery. In: Hou, Y., Nie, JY., Sun, L., Wang, B., Zhang, P. (eds) Information Retrieval Technology. AIRS 2012. Lecture Notes in Computer Science, vol 7675. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35341-3_17
DOI: https://doi.org/10.1007/978-3-642-35341-3_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35340-6
Online ISBN: 978-3-642-35341-3
eBook Packages: Computer Science, Computer Science (R0)