Abstract
Keyword search is widely used in many practical applications. Unfortunately, most keyword-based search engines compute the similarity distance between two Web documents by matching only the keywords at the same positions in the query and document vectors, without considering the impact of keywords at neighbouring positions. Such an approach usually yields incomplete search results. In this paper, we exploit the Earth Mover’s Distance (EMD) as a distance function, which is more flexible than other distance functions such as the Euclidean distance. To overcome the computational complexity of EMD, we use filtering techniques to minimize the number of actual EMD computations. We further develop a novel lower bound as a new EMD filter for partial matching, which is suitable for searching Web documents. The experimental results demonstrate the efficiency of EMD-based search with filtering techniques.
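The filter-and-refine idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's method: the novel partial-matching lower bound is not given in the abstract, so the sketch substitutes the well-known centroid lower bound of Rubner et al., and it assumes equal-mass one-dimensional keyword histograms with ground distance |i - j|. All function names (`emd_1d`, `centroid_lower_bound`, `range_search`) and the toy documents are hypothetical.

```python
from itertools import accumulate


def normalize(weights):
    """Scale keyword weights so that they sum to 1 (equal-mass assumption)."""
    total = sum(weights)
    return [w / total for w in weights]


def emd_1d(p, q):
    """Exact EMD for equal-mass 1-D histograms with ground distance |i - j|.

    In this special case the EMD equals the L1 distance between the two
    cumulative distributions, so no linear program is needed.
    """
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def centroid_lower_bound(p, q):
    """Cheap lower bound: distance between the histogram centroids.

    Assumption: this is the classic centroid bound, used here as a
    stand-in for the paper's own filter.
    """
    mean_p = sum(i * w for i, w in enumerate(p))
    mean_q = sum(i * w for i, w in enumerate(q))
    return abs(mean_p - mean_q)


def range_search(query, documents, threshold):
    """Return (hits, exact_calls) for all documents within `threshold`.

    A document is pruned without an exact EMD computation whenever its
    lower bound already exceeds the threshold.
    """
    q = normalize(query)
    hits, exact_calls = [], 0
    for doc_id, weights in documents.items():
        d = normalize(weights)
        if centroid_lower_bound(q, d) > threshold:
            continue  # filtered out: true EMD cannot be within threshold
        exact_calls += 1  # refinement step: compute the exact distance
        dist = emd_1d(q, d)
        if dist <= threshold:
            hits.append((doc_id, dist))
    return sorted(hits, key=lambda h: h[1]), exact_calls


# Hypothetical toy collection: each vector holds keyword weights by position.
documents = {
    "a": [0, 1, 1, 0],
    "b": [0, 0, 1, 1],
    "c": [1, 0, 0, 1],
    "d": [0, 2, 1, 0],
}
hits, exact_calls = range_search([0, 1, 1, 0], documents, threshold=0.5)
print([doc_id for doc_id, _ in hits], exact_calls)
```

In this toy run, document "b" is discarded by the filter alone, while "c" passes the filter but is rejected by the exact computation. A tighter lower bound prunes more candidates before the expensive step, which is the role the abstract assigns to its new filter.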
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Ma, J., Sheng, Q.Z., Yao, L., Xu, Y., Shemshadi, A. (2014). Keyword Search over Web Documents Based on Earth Mover’s Distance. In: Benatallah, B., Bestavros, A., Manolopoulos, Y., Vakali, A., Zhang, Y. (eds) Web Information Systems Engineering – WISE 2014. WISE 2014. Lecture Notes in Computer Science, vol 8786. Springer, Cham. https://doi.org/10.1007/978-3-319-11749-2_20
DOI: https://doi.org/10.1007/978-3-319-11749-2_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11748-5
Online ISBN: 978-3-319-11749-2
eBook Packages: Computer Science, Computer Science (R0)