Retrieving Customary Web Language to Assist Writers
This paper introduces Netspeak, a Web service which assists writers in finding adequate expressions. To provide statistically relevant suggestions, the service indexes more than 1.8 billion n-grams, n ≤ 5, along with their occurrence frequencies on the Web. If in doubt about a wording, a user can specify a query that has wildcards inserted at those positions where she feels uncertain.
Queries define patterns for which a ranked list of matching n-grams, along with usage examples, is retrieved. The ranking reflects the occurrence frequencies of the n-grams and conveys both absolute and relative usage. Given this choice of customary wordings, one can easily select the most appropriate one. Second-language speakers in particular can learn about style conventions and language usage.
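The query-and-rank idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the Netspeak implementation: the `?` wildcard syntax (matching exactly one word), the function names `matches` and `rank`, and the toy counts in `NGRAM_COUNTS` are all hypothetical and invented for the example.

```python
from typing import Dict, List, Tuple

Ngram = Tuple[str, ...]

# Toy n-gram counts, invented for illustration; these are NOT real Web frequencies.
NGRAM_COUNTS: Dict[Ngram, int] = {
    ("looking", "forward", "to", "seeing"): 600,
    ("looking", "forward", "to", "meeting"): 300,
    ("looking", "forward", "to", "see"): 100,
    ("looking", "forward", "for", "seeing"): 5,
}

WILDCARD = "?"  # assumed syntax: stands for exactly one word


def matches(pattern: Ngram, ngram: Ngram) -> bool:
    """True if the n-gram has the pattern's length and agrees on every literal word."""
    return len(pattern) == len(ngram) and all(
        p == WILDCARD or p == w for p, w in zip(pattern, ngram)
    )


def rank(query: str, counts: Dict[Ngram, int]) -> List[Tuple[str, int, float]]:
    """Return (wording, absolute count, share among all matches), most frequent first."""
    pattern = tuple(query.split())
    hits = [(ng, c) for ng, c in counts.items() if matches(pattern, ng)]
    total = sum(c for _, c in hits)
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return [(" ".join(ng), c, c / total) for ng, c in hits]


if __name__ == "__main__":
    for wording, absolute, relative in rank("looking forward to ?", NGRAM_COUNTS):
        print(f"{wording:30s} {absolute:6d} {relative:6.1%}")
```

The absolute count shows how common a wording is overall, while the relative share shows how it compares with the alternatives that fit the same pattern.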
To guarantee response times within milliseconds, we have developed an index that considers occurrence probabilities, allowing for biased sampling during retrieval. Our analysis shows that the extreme speedup obtained with this strategy (factor 68) comes without significant loss in retrieval quality.
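One possible reading of a frequency-aware index with biased sampling is sketched below. It is an assumption made for illustration, not the authors' index design: postings are kept per (word, position) key and sorted by descending frequency, and retrieval reads only the first `head` entries of each list. The names `build_index`, `retrieve`, and `head` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

Ngram = Tuple[str, ...]
Posting = Tuple[int, Ngram]  # (occurrence count, n-gram)


def build_index(counts: Dict[Ngram, int]) -> Dict[Tuple[str, int], List[Posting]]:
    """Map each (word, position) to its n-grams, most frequent first."""
    index: Dict[Tuple[str, int], List[Posting]] = defaultdict(list)
    for ngram, count in counts.items():
        for pos, word in enumerate(ngram):
            index[(word, pos)].append((count, ngram))
    for postings in index.values():
        postings.sort(key=lambda entry: entry[0], reverse=True)
    return dict(index)


def retrieve(
    index: Dict[Tuple[str, int], List[Posting]],
    pattern: Ngram,
    head: int,
    wildcard: str = "?",
) -> List[Posting]:
    """Intersect only the `head` most frequent postings of each literal word.

    Reading just the head of each list biases retrieval toward frequent
    n-grams: less data is touched, but rare matches may be missed.
    A pattern consisting only of wildcards yields no results in this sketch.
    """
    candidates: Optional[Set[Posting]] = None
    for pos, word in enumerate(pattern):
        if word == wildcard:
            continue
        sampled = set(index.get((word, pos), [])[:head])
        candidates = sampled if candidates is None else candidates & sampled
    hits = [entry for entry in (candidates or set()) if len(entry[1]) == len(pattern)]
    return sorted(hits, key=lambda entry: entry[0], reverse=True)
```

With a small `head`, fewer postings are read and intersected, which is where a speedup would come from in this sketch; the cost is that infrequent n-grams beyond the head are never seen.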
Keywords: Retrieval Time, Retrieval Quality, Statistical Natural Language Processing, Literal Word, Relevant Suggestion