A Structuralist Approach for Personal Knowledge Exploration Systems on Mobile Devices

  • Chapter
Text Mining

Abstract

We describe the design choices, and the reasons behind them, for the architecture of a multilingual Natural Language Processing (NLP) system for mobile devices. The most tangible constraints are the limited processing power of mobile devices, the strong influence of idiolect (or, more generally, the differences in language usage between individual users in their personal communication), the effort required to port the NLP system to multiple languages, and the additional processing layers required when dealing with real-world data as opposed to controlled academic set-ups. Our solution is based on a strict separation between server-side preprocessing and client-side processing, and on maximal use of unsupervised techniques to avoid the problems posed by variation in personal language usage. This combination allows the system to provide robust NLP despite these limitations.

Notes

  1. We assume the dichotomy between structured and unstructured data to be loosely defined as follows: information or pieces of data stored in explicit relations with each other (cf. relational databases) are structured data, and information stored without explicit relations within unanalyzed texts is unstructured data. Another possible definition is that “any data base is structured if it is possible to use a simple and precise query system which guarantees to retrieve a particular piece of information if it exists”. No current system fully achieves that with textual information.

  2. Input by people educated in a relevant field of linguistics.

  3. rtw.ml.cmu.edu/rtw, retrieved on 24.03.2014.

  4. www.google.com/insidesearch/features/search/knowledge.html, retrieved on 24.03.2014.

  5. www-05.ibm.com/de/watson, retrieved on 24.03.2014.

  6. The product is an email client with enhanced NLP capabilities; see http://mailbe.at.

  7. trec.nist.gov.

  8. This equals an average of more than 18 person name tokens per mail. Although there are many false positives (classified incorrectly due to broken contexts in log file texts, quoted mails and signature parts), most of the tokens can be verified as correct.

Author information

Corresponding author

Correspondence to Stefan Bordag.

Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Bordag, S., Hänig, C., Beutenmüller, C. (2014). A Structuralist Approach for Personal Knowledge Exploration Systems on Mobile Devices. In: Biemann, C., Mehler, A. (eds) Text Mining. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-12655-5_6

  • DOI: https://doi.org/10.1007/978-3-319-12655-5_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12654-8

  • Online ISBN: 978-3-319-12655-5

  • eBook Packages: Computer Science, Computer Science (R0)
