A Structuralist Approach for Personal Knowledge Exploration Systems on Mobile Devices

Bordag, Stefan; Hänig, Christian; Beutenmüller, Christian

doi:10.1007/978-3-319-12655-5_6

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

We describe the design choices, and the reasons behind them, for the architecture of a multilingual Natural Language Processing (NLP) system for mobile devices. The most tangible problems are the limited processing power of mobile devices, the strong influence of idiolect (more generally, the way language usage differs between individual users in their personal communication), the effort required to port the NLP system to multiple languages, and the additional processing layers required when dealing with real-world data as opposed to controlled academic set-ups. Our solution is based on a strict separation between server-side preprocessing and client-side processing, together with maximal use of unsupervised techniques to avoid the problems posed by variation in personal language usage. It thus combines several solutions to provide robust NLP despite all these limitations.

Notes

1. We assume the dichotomy between structured and unstructured data to be loosely defined as follows: information or pieces of data stored in explicit relations with each other (cf. relational databases) are structured data, and information stored without explicit relations within unanalyzed texts is unstructured data. Another possible definition is that “any data base is structured if it is possible to use a simple and precise query system which guarantees to retrieve a particular piece of information if it exists”. No current system is able to achieve that fully with textual information.
2. Input by people educated in a relevant field of linguistics.
3. rtw.ml.cmu.edu/rtw, retrieved on 24.03.2014.
4. www.google.com/insidesearch/features/search/knowledge.html, retrieved on 24.03.2014.
5. www-05.ibm.com/de/watson, retrieved on 24.03.2014.
6. The product is an email client with enhanced NLP capabilities; see http://mailbe.at.
7. trec.nist.gov.
8. This equals an average of more than 18 person name tokens per mail. Apart from many false positives (classified incorrectly due to broken contexts in log file texts, quoted mails and signature parts), most of them can be verified as correct.

Author information

Authors and Affiliations

ExB Research & Development GmbH, Seeburgstr. 100, 04103, Leipzig, Germany
Stefan Bordag, Christian Hänig & Christian Beutenmüller

Corresponding author

Correspondence to Stefan Bordag.

Editor information

Editors and Affiliations

Computer Science Department, Technische Universität Darmstadt FG Language Technology, Darmstadt, Germany
Chris Biemann
Computer Science Department, Goethe University WG Text Technology, Frankfurt am Main, Hessen, Germany
Alexander Mehler

About this chapter

Cite this chapter

Bordag, S., Hänig, C., Beutenmüller, C. (2014). A Structuralist Approach for Personal Knowledge Exploration Systems on Mobile Devices. In: Biemann, C., Mehler, A. (eds) Text Mining. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-12655-5_6

DOI: https://doi.org/10.1007/978-3-319-12655-5_6
Published: 13 December 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12654-8
Online ISBN: 978-3-319-12655-5
eBook Packages: Computer Science, Computer Science (R0)
