A Model of Uncertainty for Near-Duplicates in Document Reference Networks

Hess, Claudia; de Rougemont, Michel

doi:10.1007/978-3-540-74851-9_40

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4675)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

We introduce a model of uncertainty in which documents are not uniquely identified in a reference network and some links may be incorrect. It generalizes the probabilistic approach to databases to graphs and defines subgraphs with a probability distribution. The answer to a relational query is a distribution of documents; we study how to approximate the ranking of the most likely documents and how to quantify the quality of the approximation. The answer to a function query is a distribution of values, and we take the size of the interval between the minimum and maximum values as a measure of the precision of the answer.
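
To make the idea of "subgraphs with a probability distribution" concrete, the following is a minimal sketch and not the authors' construction. It assumes a toy network in which each reference is correct with an independent, made-up probability, and it enumerates every possible subgraph exhaustively. It does not model near-duplicate identification or the approximation studied in the paper. The names `uncertain_edges`, `possible_worlds`, `relational_answer`, and `function_answer` are hypothetical.

```python
import itertools
from collections import defaultdict

# Hypothetical toy network (assumption): each directed reference
# (source, target) has an assumed probability of being correct,
# and the edges are treated as independent.
uncertain_edges = {
    ("a", "b"): 0.9,
    ("a", "c"): 0.5,
    ("b", "c"): 1.0,
}

def possible_worlds(edges):
    """Yield (subgraph, probability) for every subset of the uncertain edges."""
    items = list(edges.items())
    for keep in itertools.product([True, False], repeat=len(items)):
        prob = 1.0
        world = set()
        for (edge, p), kept in zip(items, keep):
            prob *= p if kept else (1.0 - p)
            if kept:
                world.add(edge)
        if prob > 0.0:  # skip subgraphs that cannot occur
            yield world, prob

def relational_answer(edges, source):
    """Distribution of documents: probability that `source` references each one."""
    dist = defaultdict(float)
    for world, prob in possible_worlds(edges):
        for s, t in world:
            if s == source:
                dist[t] += prob
    return dict(dist)

def function_answer(edges, target):
    """Distribution of values for one function query: the in-degree of `target`."""
    dist = defaultdict(float)
    for world, prob in possible_worlds(edges):
        value = sum(1 for _, t in world if t == target)
        dist[value] += prob
    return dict(dist)

# Rank the documents referenced by "a" from most to least likely.
ranking = sorted(relational_answer(uncertain_edges, "a").items(),
                 key=lambda kv: -kv[1])

# Interval between the minimum and maximum possible in-degree of "c".
in_degree = function_answer(uncertain_edges, "c")
interval = (min(in_degree), max(in_degree))
```

Exhaustive enumeration is exponential in the number of uncertain edges, so it is only usable on tiny examples; that cost is one reason an approximate ranking is of interest. In this sketch a narrow `interval` corresponds to a precise answer to the function query.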

The work was supported by the German Academic Exchange Service.

Author information

Authors and Affiliations

Claudia Hess: Laboratory for Semantic Information Technology, Bamberg University
Michel de Rougemont: LRI, Université Paris-Sud 11

Editor information

László Kovács, Norbert Fuhr, Carlo Meghini

About this paper

Cite this paper

Hess, C., de Rougemont, M. (2007). A Model of Uncertainty for Near-Duplicates in Document Reference Networks. In: Kovács, L., Fuhr, N., Meghini, C. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2007. Lecture Notes in Computer Science, vol 4675. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74851-9_40

DOI: https://doi.org/10.1007/978-3-540-74851-9_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74850-2
Online ISBN: 978-3-540-74851-9
eBook Packages: Computer Science (R0)
