Demonstration + Natural Language: Multimodal Interfaces for GUI-Based Interactive Task Learning Agents

Li, Toby Jia-Jun; Mitchell, Tom M.; Myers, Brad A.

doi:10.1007/978-3-030-82681-9_15

Part of the book series: Human–Computer Interaction Series (HCIS)

Abstract

We summarize our past five years of work on designing, building, and studying Sugilite, an interactive task learning agent that learns new tasks and relevant concepts interactively from the user’s natural language instructions and demonstrations, leveraging the graphical user interfaces (GUIs) of third-party mobile apps. Through its multi-modal and mixed-initiative approaches to human-AI interaction, Sugilite made important contributions to improving the usability, applicability, generalizability, flexibility, robustness, and shareability of interactive task learning agents. Sugilite also represents a new human-AI interaction paradigm for interactive task learning, in which existing app GUIs serve as a medium for users to communicate their intents to an AI agent, rather than as interfaces for users to interact with the underlying computing services. In this chapter, we describe the Sugilite system, explain the design and implementation of its key features, and show a prototype in the form of a conversational assistant on Android.

Notes

1. Sugilite is named after a purple gemstone, and stands for: Smartphone Users Generating Intelligent Likeable Interfaces Through Examples.
2. A demo video is available at https://www.youtube.com/watch?v=tdHEk-GeaqE.
3. Available at: https://github.com/tobyli/Sugilite_development.
4. Sovite is named after a type of rock. It is also an acronym for System for Optimizing Voice Interfaces to Tackle Errors.
5. Available at: https://github.com/tobyli/screen2vec.
6. Available at: http://interactionmining.org/rico.
7. Since the next screen is always within the same app and therefore shares an app description embedding, the prediction task favors having information about the specific app (i.e., the app store description embedding) dominate the embedding.

References

Adar E, Dontcheva M, Laput G (2014) CommandSpace: modeling the relationships between tasks, descriptions and features. In: Proceedings of the 27th annual ACM symposium on user interface software and technology, UIST ’14, pp 167–176. ACM, New York, NY, USA. https://doi.org/10.1145/2642918.2647395. http://doi.acm.org/10.1145/2642918.2647395
Alharbi K, Yeh T (2015) Collect, decompile, extract, stats, and diff: mining design pattern changes in android apps. In: Proceedings of the 17th international conference on human-computer interaction with mobile devices and services, MobileHCI ’15, pp 515–524. ACM, New York, NY, USA. https://doi.org/10.1145/2785830.2785892. http://doi.acm.org/10.1145/2785830.2785892
Allen J, Chambers N, Ferguson G, Galescu L, Jung H, Swift M, Taysom W (2007) PLOW: a collaborative task learning agent. In: Proceedings of the 22Nd national conference on artificial intelligence - volume 2, AAAI’07, pp 1514–1519. AAAI Press, Vancouver, British Columbia, Canada
Google Scholar
Allen JF, Guinn CI, Horvtz E (1999) Mixed-initiative interaction. IEEE Intell Syst Appl 14(5):14–23
Article Google Scholar
Amazon: Alexa Design Guide (2020). https://developer.amazon.com/en-US/docs/alexa/alexa-design/get-started.html
Antila V, Polet J, Lämsä A, Liikka J (2012) RoutineMaker: towards end-user automation of daily routines using smartphones. In: 2012 IEEE international conference on pervasive computing and communications workshops (PERCOM workshops), pp 399–402. https://doi.org/10.1109/PerComW.2012.6197519
Argall BD, Chernova S, Veloso M, Browning B (2009) A survey of robot learning from demonstration. Robot Auton Syst 57(5):469–483. https://doi.org/10.1016/j.robot.2008.10.024.
Ashktorab Z, Jain M, Liao QV, Weisz JD (2019) Resilient chatbots: repair strategy preferences for conversational breakdowns. In: Proceedings of the 2019 CHI conference on human factors in computing systems, p 254. ACM
Google Scholar
Auer S, Bizer C, Kobilarov G, Lehmann J, Cyganiak R, Ives Z (2007) Dbpedia: a nucleus for a web of open data. The semantic web, pp 722–735. http://www.springerlink.com/index/rm32474088w54378.pdf
Azaria A, Krishnamurthy J, Mitchell TM (2016) Instructable intelligent personal agent. In: Proceedings of the 30th AAAI conference on artificial intelligence (AAAI), vol 4
Google Scholar
Ballard BW, Biermann AW (1979) Programming in natural language “NLC” as a prototype. In: Proceedings of the 1979 annual conference, ACM ’79, pp 228–237. ACM, New York, NY, USA. https://doi.org/10.1145/800177.810072. http://doi.acm.org/10.1145/800177.810072
Banovic N, Grossman T, Matejka J, Fitzmaurice G (2012) Waken: reverse engineering usage information and interface structure from software videos. In: Proceedings of the 25th annual ACM symposium on user interface software and technology, UIST ’12, pp 83–92. ACM, New York, NY, USA. https://doi.org/10.1145/2380116.2380129. http://doi.acm.org/10.1145/2380116.2380129
Barman S, Chasins S, Bodik R, Gulwani S (2016) Ringer: web automation by demonstration. In: Proceedings of the 2016 ACM SIGPLAN international conference on object-oriented programming, systems, languages, and applications, OOPSLA 2016, pp 748–764. ACM, New York, NY, USA. https://doi.org/10.1145/2983990.2984020. http://doi.acm.org/10.1145/2983990.2984020
Beneteau E, Richards OK, Zhang M, Kientz JA, Yip J, Hiniker A (2019) Communication breakdowns between families and alexa. In: Proceedings of the 2019 CHI conference on human factors in computing systems, CHI ’19, pp 243:1–243:13. ACM, New York, NY, USA. https://doi.org/10.1145/3290605.3300473. http://doi.acm.org/10.1145/3290605.3300473
Bentley F, Luvogt C, Silverman M, Wirasinghe R, White B, Lottridge D (2018) Understanding the long-term use of smart speaker assistants. Proc ACM Interact Mob Wearable Ubiquitous Technol 2(3). https://doi.org/10.1145/3264901
Berant J, Chou A, Frostig R, Liang P (2013) Semantic parsing on freebase from question-answer pairs. In: Proceedings of the 2013 conference on empirical methods in natural language processing, pp 1533–1544
Google Scholar
Bergman L, Castelli V, Lau T, Oblinger D (2005) DocWizards: a system for authoring follow-me documentation wizards. In: Proceedings of the 18th annual ACM symposium on user interface software and technology, UIST ’05, pp 191–200. ACM, New York, NY, USA. https://doi.org/10.1145/1095034.1095067. http://doi.acm.org/10.1145/1095034.1095067
Biermann AW (1983) Natural Language Programming. In: Biermann AW, Guiho G (eds) Computer program synthesis methodologies, NATO advanced study institutes series. Springer, Netherlands, pp 335–368
Google Scholar
Bigham JP, Lau T, Nichols J (2009) Trailblazer: enabling blind users to blaze trails through the web. In: Proceedings of the 14th international conference on intelligent user interfaces, IUI ’09, pp 177–186. ACM, New York, NY, USA. https://doi.org/10.1145/1502650.1502677
Billard A, Calinon S, Dillmann R, Schaal S (2008) Robot programming by demonstration. In: Springer handbook of robotics, pp 1371–1394. Springer. http://link.springer.com/10.1007/978-3-540-30301-5_60
Bohus D, Rudnicky AI (2005) Sorry, I didn’t catch that!-An investigation of non-understanding errors and recovery strategies. In: 6th SIGdial workshop on discourse and dialogue
Google Scholar
Bollacker K, Evans C, Paritosh P, Sturge T, Taylor J (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp 1247–1250. ACM. http://dl.acm.org/citation.cfm?id=1376746
Bolt RA (1980) “Put-that-there”: voice and gesture at the graphics interface. In: Proceedings of the 7th annual conference on computer graphics and interactive techniques, SIGGRAPH ’80, pp 262–270. ACM, New York, NY, USA
Google Scholar
Bosselut A, Rashkin H, Sap M, Malaviya C, Celikyilmaz A, Choi Y (2019) COMET: commonsense transformers for automatic knowledge graph construction. In: Proceedings of the 57th annual meeting of the association for computational linguistics, pp 4762–4779. ACL, Florence, Italy. https://doi.org/10.18653/v1/P19-1470. https://www.aclweb.org/anthology/P19-1470
Brennan SE (1991) Conversation with and through computers. User Model User-Adap Int 1(1):67–86. https://doi.org/10.1007/BF00158952
Brennan SE (1998) The grounding problem in conversations with and through computers. Social and cognitive approaches to interpersonal communication, pp 201–225
Google Scholar
Böhmer M, Hecht B, Schöning J, Krüger A, Bauer G (2011) Falling asleep with angry birds, facebook and kindle: a large scale study on mobile application usage. In: Proceedings of the 13th international conference on human computer interaction with mobile devices and services, MobileHCI ’11, pp 47–56. ACM, New York, NY, USA. https://doi.org/10.1145/2037373.2037383. http://doi.acm.org/10.1145/2037373.2037383
Chai JY, Gao Q, She L, Yang S, Saba-Sadiya S, Xu G (2018) Language to action: towards interactive task learning with physical agents. In: IJCAI, pp 2–9
Google Scholar
Chandramouli V, Chakraborty A, Navda V, Guha S, Padmanabhan V, Ramjee R (2015) Insider: towards breaking down mobile app silos. In: TRIOS workshop held in conjunction with the SIGOPS SOSP 2015
Google Scholar
Chen F, Xia K, Dhabalia K, Hong JI (2019) Messageontap: a suggestive interface to facilitate messaging-related tasks. In: Proceedings of the 2019 CHI conference on human factors in computing systems, CHI ’19. ACM, New York, NY, USA. https://doi.org/10.1145/3290605.3300805
Chen J, Chen C, Xing Z, Xu X, Zhu L, Li G, Wang J (2020) Unblind your apps: predicting natural-language labels for mobile gui components by deep learning. In: Proceedings of the 42nd international conference on software engineering, ICSE ’20
Google Scholar
Chen JH, Weld DS (2008) Recovering from errors during programming by demonstration. In: Proceedings of the 13th international conference on intelligent user interfaces, IUI ’08, pp 159–168. ACM, New York, NY, USA. https://doi.org/10.1145/1378773.1378794. http://doi.acm.org/10.1145/1378773.1378794
Chkroun M, Azaria A (2019) Lia: a virtual assistant that can be taught new commands by speech. Int J Hum–Comput Interact 1–12
Google Scholar
Cho J, Rader E (2020) The role of conversational grounding in supporting symbiosis between people and digital assistants. Proc ACM Hum-Comput Interact 4(CSCW1)
Google Scholar
Clark HH, Brennan SE (1991) Grounding in communication. In: Perspectives on socially shared cognition, pp 127–149. APA, Washington, DC, US. https://doi.org/10.1037/10096-006
Cowan BR, Pantidi N, Coyle D, Morrissey K, Clarke P, Al-Shehri S, Earley D, Bandeira N (2017) “what can i help you with?”: Infrequent users’ experiences of intelligent personal assistants. In: Proceedings of the 19th international conference on human-computer interaction with mobile devices and services, MobileHCI ’17, pp 43:1–43:12. ACM, New York, NY, USA. https://doi.org/10.1145/3098279.3098539. http://doi.acm.org/10.1145/3098279.3098539
Cypher A, Halbert DC (1993) Watch what I do: programming by demonstration. MIT Press
Google Scholar
Deka B, Huang Z, Franzen C, Hibschman J, Afergan D, Li Y, Nichols J, Kumar R (2017) Rico: a mobile app dataset for building data-driven design applications. In: Proceedings of the 30th annual ACM symposium on user interface software and technology, UIST ’17, pp 845–854. ACM, New York, NY, USA. https://doi.org/10.1145/3126594.3126651. http://doi.acm.org/10.1145/3126594.3126651
Deka B, Huang Z, Kumar R (2016) ERICA: interaction mining mobile apps. In: Proceedings of the 29th annual symposium on user interface software and technology, UIST ’16, pp 767–776. ACM, New York, NY, USA. https://doi.org/10.1145/2984511.2984581. http://doi.acm.org/10.1145/2984511.2984581
Dixon M, Fogarty J (2010) Prefab: implementing advanced behaviors using pixel-based reverse engineering of interface structure. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’10, pp 1525–1534. ACM, New York, NY, USA. https://doi.org/10.1145/1753326.1753554. http://doi.acm.org/10.1145/1753326.1753554
Dixon M, Leventhal D, Fogarty J (2011) Content and hierarchy in pixel-based methods for reverse engineering interface structure. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’11, pp 969–978. ACM, New York, NY, USA. https://doi.org/10.1145/1978942.1979086. http://doi.acm.org/10.1145/1978942.1979086
Dixon M, Nied A, Fogarty J (2014) Prefab layers and prefab annotations: extensible pixel-based interpretation of graphical interfaces. In: Proceedings of the 27th annual ACM symposium on user interface software and technology, UIST ’14, pp 221–230. ACM, New York, NY, USA. https://doi.org/10.1145/2642918.2647412. http://doi.acm.org/10.1145/2642918.2647412
Fast E, Chen B, Mendelsohn J, Bassen J, Bernstein MS (2018) Iris: a conversational agent for complex tasks. In: Proceedings of the 2018 CHI conference on human factors in computing systems, CHI ’18, pp 473:1–473:12. ACM, New York, NY, USA. https://doi.org/10.1145/3173574.3174047. http://doi.acm.org/10.1145/3173574.3174047
Gao X, Gong R, Zhao Y, Wang S, Shu T, Zhu SC (2020) Joint mind modeling for explanation generation in complex human-robot collaborative tasks. In: 2020 29th IEEE international conference on robot and human interactive communication (RO-MAN), pp 1119–1126. IEEE
Google Scholar
Gluck KA, Laird JE (2019) Interactive task learning: humans, robots, and agents acquiring new tasks through natural interactions, vol 26. MIT Press
Google Scholar
Green TR (1989) Cognitive dimensions of notations. People and Computers V pp 443–460. https://books.google.com/books?hl=en&lr=&id=BTxOtt4X920C&oi=fnd&pg=PA443&dq=Cognitive+dimensions+of+notations&ots=OEqg1By_Rj&sig=dpg1zZFRHpBVC_r0--XLyLr6718
Grudin J, Jacques R (2019) Chatbots, humbots, and the quest for artificial general intelligence. In: Proceedings of the 2019 CHI conference on human factors in computing systems, pp 1–11
Google Scholar
Guo A, Kong J, Rivera M, Xu FF, Bigham JP (2019) StateLens: a reverse engineering solution for making existing dynamic touchscreens accessible. In: Proceedings of the 32nd annual ACM symposium on user interface software and technology (UIST 2019), p 15
Google Scholar
Gur I, Yavuz S, Su Y, Yan X (2018) DialSQL: dialogue based structured query generation. In: Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: long papers), pp 1339–1349. ACL, Melbourne, Australia. https://doi.org/10.18653/v1/P18-1124. https://www.aclweb.org/anthology/P18-1124
Hartmann B, Wu L, Collins K, Klemmer SR (2007) Programming by a sample: rapidly creating web applications with d.mix. In: Proceedings of the 20th annual ACM symposium on user interface software and technology, UIST ’07, pp 241–250. ACM, New York, NY, USA. https://doi.org/10.1145/1294211.1294254. http://doi.acm.org/10.1145/1294211.1294254
Horvitz E (1999) Principles of mixed-initiative user interfaces. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’99, pp 159–166. ACM, New York, NY, USA. https://doi.org/10.1145/302979.303030
Huang F, Canny JF, Nichols J (2019) Swire: sketch-based user interface retrieval. In: Proceedings of the 2019 CHI conference on human factors in computing systems, CHI ’19, pp 1–10. ACM, New York, NY, USA. https://doi.org/10.1145/3290605.3300334
Huang THK, Azaria A, Bigham JP (2016) InstructableCrowd: creating IF-THEN rules via conversations with the crowd, pp 1555–1562. ACM Press. https://doi.org/10.1145/2851581.2892502. http://dl.acm.org/citation.cfm?doid=2851581.2892502
Hutchins EL, Hollan JD, Norman DA (1986) Direct manipulation interfaces
Google Scholar
Iba S, Paredis CJJ, Khosla PK (2005) Interactive multimodal robot programming. Int J Robot Res 24(1):83–104. https://doi.org/10.1177/0278364904049250
IFTTT (2016) IFTTT: connects the apps you love. https://ifttt.com/
Intharah T, Turmukhambetov D, Brostow GJ (2019) Hilc: domain-independent pbd system via computer vision and follow-up questions. ACM Trans Interact Intell Syst 9(2-3):16:1–16:27. https://doi.org/10.1145/3234508. http://doi.acm.org/10.1145/3234508
Jain M, Kumar P, Kota R, Patel SN (2018) Evaluating and informing the design of chatbots. In: Proceedings of the 2018 designing interactive systems conference, pp 895–906. ACM
Google Scholar
Jiang J, Jeng W, He D (2013) How do users respond to voice input errors?: lexical and phonetic query reformulation in voice search. In: Proceedings of the 36th international ACM SIGIR conference on research and development in information retrieval, pp 143–152. ACM
Google Scholar
Kasturi T, Jin H, Pappu A, Lee S, Harrison B, Murthy R, Stent A (2015) The cohort and speechify libraries for rapid construction of speech enabled applications for android. In: Proceedings of the 16th annual meeting of the special interest group on discourse and dialogue, pp 441–443
Google Scholar
Kate RJ, Wong YW, Mooney RJ (2005) Learning to transform natural to formal languages. In: Proceedings of the 20th national conference on artificial intelligence - volume 3, AAAI’05, pp 1062–1068. AAAI Press, Pittsburgh, Pennsylvania. http://dl.acm.org/citation.cfm?id=1619499.1619504
Kim D, Park S, Ko J, Ko SY, Lee SJ (2019) X-droid: a quick and easy android prototyping framework with a single-app illusion. In: Proceedings of the 32nd annual ACM symposium on user interface software and technology, UIST ’19, pp 95–108. ACM, New York, NY, USA. https://doi.org/10.1145/3332165.3347890
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Bengio Y, LeCun Y (eds) 3rd international conference on learning representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1412.6980
Kirk J, Mininger A, Laird J (2016) Learning task goals interactively with visual demonstrations. Biol Inspired Cogn Archit 18:1–8
Google Scholar
Ko AJ, Abraham R, Beckwith L, Blackwell A, Burnett M, Erwig M, Scaffidi C, Lawrance J, Lieberman H, Myers B, Rosson MB, Rothermel G, Shaw M, Wiedenbeck S (2011) The state of the art in end-user software engineering. ACM Comput Surv 43(3), 21:1–21:44. https://doi.org/10.1145/1922649.1922658. http://doi.acm.org/10.1145/1922649.1922658
Kumar R, Satyanarayan A, Torres C, Lim M, Ahmad S, Klemmer SR, Talton JO (2013) Webzeitgeist: design mining the web. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’13, pp 3083–3092. ACM, New York, NY, USA. https://doi.org/10.1145/2470654.2466420
Kurihara K, Goto M, Ogata J, Igarashi T (2006) Speech pen: predictive handwriting based on ambient multimodal recognition. In: Proceedings of the SIGCHI conference on human factors in computing systems, pp 851–860. ACM
Google Scholar
Labutov I, Srivastava S, Mitchell T (2018) Lia: a natural language programmable personal assistant. In: Proceedings of the 2018 conference on empirical methods in natural language processing: system demonstrations, pp 145–150
Google Scholar
Laird JE, Gluck K, Anderson J, Forbus KD, Jenkins OC, Lebiere C, Salvucci D, Scheutz M, Thomaz A, Trafton G, Wray RE, Mohan S, Kirk JR (2017) Interactive task learning. IEEE Intell Syst 32(4):6–21. https://doi.org/10.1109/MIS.2017.3121552
Article Google Scholar
Laput GP, Dontcheva M, Wilensky G, Chang W, Agarwala A, Linder J, Adar E (2013) PixelTone: a multimodal interface for image editing. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’13, pp 2185–2194. ACM, New York, NY, USA. https://doi.org/10.1145/2470654.2481301. http://doi.acm.org/10.1145/2470654.2481301
Lau T (2009) Why programming-by-demonstration systems fail: lessons learned for usable AI. AI Mag 30(4):65–67. http://www.aaai.org/ojs/index.php/aimagazine/article/view/2262
Lee C, Kim S, Han D, Yang H, Park YW, Kwon BC, Ko S (2020) Guicomp: a gui design assistant with real-time, multi-faceted feedback. In: Proceedings of the 2020 CHI conference on human factors in computing systems, CHI ’20, pp 1–13. ACM, New York, NY, USA. https://doi.org/10.1145/3313831.3376327
Lee HY, Yang W, Jiang L, Le M, Essa I, Gong H, Yang MH (2020) Neural design network: graphic layout generation with constraints. In: European conference on computer vision (ECCV)
Google Scholar
Lee TY, Dugan C, Bederson BB (2017) Towards understanding human mistakes of programming by example: an online user study. In: Proceedings of the 22nd international conference on intelligent user interfaces, IUI ’17, pp 257–261. ACM, New York, NY, USA. https://doi.org/10.1145/3025171.3025203. http://doi.acm.org/10.1145/3025171.3025203
Leshed G, Haber EM, Matthews T, Lau T (2008) CoScripter: automating & sharing how-to knowledge in the enterprise. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’08, pp 1719–1728. ACM, New York, NY, USA. https://doi.org/10.1145/1357054.1357323. http://doi.acm.org/10.1145/1357054.1357323
Li F, Jagadish HV (2014) Constructing an interactive natural language interface for relational databases. Proc VLDB Endow 8(1):73–84. https://doi.org/10.14778/2735461.2735468
Li H, Wang YP, Yin J, Tan G (2019) Smartshell: automated shell scripts synthesis from natural language. Int J Softw Eng Knowl Eng 29(02):197–220
Article Google Scholar
Li I, Nichols J, Lau T, Drews C, Cypher A (2010) Here’s What I Did: sharing and reusing web activity with ActionShot. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’10, pp 723–732. ACM, New York, NY, USA. https://doi.org/10.1145/1753326.1753432. http://doi.acm.org/10.1145/1753326.1753432
Li J, Yang J, Hertzmann A, Zhang J, Xu T (2019) Layoutgan: synthesizing graphic layouts with vector-wireframe adversarial networks. IEEE Trans Pattern Anal Mach Intell
Google Scholar
Li TJJ, Azaria A, Myers BA (2017) SUGILITE: creating multimodal smartphone automation by demonstration. In: Proceedings of the 2017 CHI conference on human factors in computing systems, CHI ’17, pp 6038–6049. ACM, New York, NY, USA. https://doi.org/10.1145/3025453.3025483. http://doi.acm.org/10.1145/3025453.3025483
Li TJJ, Chen J, Canfield B, Myers BA (2020) Privacy-preserving script sharing in gui-based programming-by-demonstration systems. Proc ACM Hum-Comput Interact 4(CSCW1). https://doi.org/10.1145/3392869
Li TJJ, Chen J, Xia H, Mitchell TM, Myers BA (2020) Multi-modal repairs of conversational breakdowns in task-oriented dialogs. In: Proceedings of the 33rd annual ACM symposium on user interface software and technology, UIST 2020. ACM. https://doi.org/10.1145/3379337.3415820
Li TJJ, Hecht B (2014) WikiBrain: making computer programs smarter with knowledge from wikipedia
Google Scholar
Li TJJ, Labutov I, Li XN, Zhang X, Shi W, Mitchell TM, Myers BA (2018) APPINITE: a multi-modal interface for specifying data descriptions in programming by demonstration using verbal instructions. In: Proceedings of the 2018 IEEE symposium on visual languages and human-centric computing (VL/HCC 2018)
Google Scholar
Li TJJ, Labutov I, Myers BA, Azaria A, Rudnicky AI, Mitchell TM (2018) Teaching agents when they fail: end user development in goal-oriented conversational agents. In: Studies in conversational UX design. Springer
Google Scholar
Li TJJ, Li Y, Chen F, Myers BA (2017) Programming IoT devices by demonstration using mobile apps. In: Barbosa S, Markopoulos P, Paterno F, Stumpf S, Valtolina S (eds) End-user development. Springer, Cham, pp 3–17
Chapter Google Scholar
Li TJJ, Popowski L, Mitchell TM, Myers BA (2021) Screen2vec: semantic embedding of gui screens and gui components. In: Proceedings of the 2021 CHI conference on human factors in computing systems, CHI ’21. ACM
Google Scholar
Li TJJ, Radensky M, Jia J, Singarajah K, Mitchell TM, Myers BA (2019) PUMICE: a multi-modal agent that learns concepts and conditionals from natural language and demonstrations. In: Proceedings of the 32nd annual ACM symposium on user interface software and technology (UIST 2019), UIST 2019. ACM. https://doi.org/10.1145/3332165.3347899
Li TJJ, Riva O (2018) KITE: building conversational bots from mobile apps. In: Proceedings of the 16th ACM international conference on mobile systems, applications, and services (MobiSys 2018). ACM
Google Scholar
Li Y, He J, Zhou X, Zhang Y, Baldridge J (2020) Mapping natural language instructions to mobile UI action sequences. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 8198–8210. ACL, Online. https://doi.org/10.18653/v1/2020.acl-main.729. https://www.aclweb.org/anthology/2020.acl-main.729
Li Y, Li G, He L, Zheng J, Li H, Guan Z (2020) Widget captioning: generating natural language description for mobile user interface elements. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pp 5495–5510. ACL, Online. https://doi.org/10.18653/v1/2020.emnlp-main.443. https://www.aclweb.org/anthology/2020.emnlp-main.443
Liang P (2016) Learning executable semantic parsers for natural language understanding. Commun ACM 59(9):68–76
Article Google Scholar
Liang P, Jordan MI, Klein D (2013) Learning dependency-based compositional semantics. Comput Linguist 39(2):389–446
Article MathSciNet Google Scholar
Lieberman H (2001) Your wish is my command: programming by example. Morgan Kaufmann
Google Scholar
Lieberman H, Liu H (2006) Feasibility studies for programming in natural language. In: End user development, pp 459–473. Springer
Google Scholar
Lieberman H, Maulsby D (1996) Instructible agents: software that just keeps getting better. IBM Syst J 35(3.4):539–556. https://doi.org/10.1147/sj.353.0539
Lin J, Wong J, Nichols J, Cypher A, Lau TA (2009) End-user programming of mashups with vegemite. In: Proceedings of the 14th international conference on intelligent user interfaces, IUI ’09, pp 97–106. ACM, New York, NY, USA. https://doi.org/10.1145/1502650.1502667. http://doi.acm.org/10.1145/1502650.1502667
Liu EZ, Guu K, Pasupat P, Shi T, Liang P (2018) Reinforcement learning on web interfaces using workflow-guided exploration. CoRR. http://arxiv.org/abs/1802.08802
Liu TF, Craft M, Situ J, Yumer E, Mech R, Kumar R (2018) Learning design semantics for mobile apps. In: Proceedings of the 31st annual ACM symposium on user interface software and technology, UIST ’18, pp 569–579. ACM, New York, NY, USA. https://doi.org/10.1145/3242587.3242650
LlamaLab: Automate: everyday automation for Android (2016). http://llamalab.com/automate/
Luger E, Sellen A (2016) “like having a really bad pa”: the gulf between user expectation and experience of conversational agents. In: Proceedings of the 2016 CHI conference on human factors in computing systems, CHI ’16, pp 5286–5297. ACM, New York, NY, USA. https://doi.org/10.1145/2858036.2858288. http://doi.acm.org/10.1145/2858036.2858288
Maes P (1994) Agents that reduce work and information overload. Commun ACM 37(7):30–40. https://doi.org/10.1145/176789.176792. http://doi.acm.org/10.1145/176789.176792
Mankoff J, Abowd GD, Hudson SE (2000) Oops: a toolkit supporting mediation techniques for resolving ambiguity in recognition-based interfaces. Comput Graph 24(6):819–834
Article Google Scholar
Marin R, Sanz PJ, Nebot P, Wirz R (2005) A multimodal interface to control a robot arm via the web: a case study on remote programming. IEEE Trans Ind Electron 52(6):1506–1520. https://doi.org/10.1109/TIE.2005.858733
Article Google Scholar
Maués RDA, Barbosa SDJ (2013) Keep doing what i just did: automating smartphones by demonstration. In: Proceedings of the 15th international conference on human-computer interaction with mobile devices and services, MobileHCI ’13, pp 295–303. ACM, New York, NY, USA. https://doi.org/10.1145/2493190.2493216. http://doi.acm.org/10.1145/2493190.2493216
McDaniel RG, Myers BA (1999) Getting more out of programming-by-demonstration. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’99, pp 442–449. ACM, New York, NY, USA. https://doi.org/10.1145/302979.303127. http://doi.acm.org/10.1145/302979.303127
McTear M, O’Neill I, Hanna P, Liu X (2005) Handling errors and determining confirmation strategies–an object-based approach. Speech Commun 45(3):249–269. https://doi.org/10.1016/j.specom.2004.11.006. http://www.sciencedirect.com/science/article/pii/S0167639304001426. Special Issue on Error Handling in Spoken Dialogue Systems
Menon A, Tamuz O, Gulwani S, Lampson B, Kalai A (2013) A machine learning framework for programming by example, pp 187–195. http://machinelearning.wustl.edu/mlpapers/papers/ICML2013_menon13
Mihalcea R, Liu H, Lieberman H (2006) NLP (Natural Language Processing) for NLP (Natural Language Programming). In: Gelbukh A (ed) Computational linguistics and intelligent text processing. Lecture notes in computer science. Springer, Berlin, Heidelberg, pp 319–330
Google Scholar
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv:1301.3781 [cs]. http://arxiv.org/abs/1301.3781. ArXiv: 1301.3781
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality
Mohan S, Laird JE (2014) Learning goal-oriented hierarchical tasks from situated interactive instruction. In: Proceedings of the twenty-eighth AAAI conference on artificial intelligence, AAAI’14, pp 387–394. AAAI Press
Google Scholar
Myers B, Malkin R, Bett M, Waibel A, Bostwick B, Miller RC, Yang J, Denecke M, Seemann E, Zhu J et al (2002) Flexi-modal and multi-machine user interfaces. In: Proceedings of the fourth IEEE international conference on multimodal interfaces, pp 343–348. IEEE
Google Scholar
Myers BA (1986) Visual programming, programming by example, and program visualization: a taxonomy. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’86, pp 59–66. ACM, New York, NY, USA. https://doi.org/10.1145/22627.22349. http://doi.acm.org/10.1145/22627.22349
Myers BA, Ko AJ, Scaffidi C, Oney S, Yoon Y, Chang K, Kery MB, Li TJJ (2017) Making end user development more natural. In: New perspectives in end-user development, pp 1–22. Springer, Cham. https://doi.org/10.1007/978-3-319-60291-2_1. https://link.springer.com/chapter/10.1007/978-3-319-60291-2_1
Myers BA, McDaniel R (2001) Sometimes you need a little intelligence, sometimes you need a lot. Your wish is my command: programming by example. Morgan Kaufmann Publishers, San Francisco, CA, pp 45–60. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.2.8085&rep=rep1&type=pdf
Myers C, Furqan A, Nebolsky J, Caro K, Zhu J (2018) Patterns for how users overcome obstacles in voice user interfaces. In: Proceedings of the 2018 CHI conference on human factors in computing systems, pp 1–7
Google Scholar
Norman D (2013) The design of everyday things: revised and expanded edition. Basic Books
Google Scholar
Oviatt S (1999) Mutual disambiguation of recognition errors in a multimodel architecture. In: Proceedings of the SIGCHI conference on human factors in computing systems, pp 576–583. ACM
Google Scholar
Oviatt S (1999) Ten myths of multimodal interaction. Commun ACM 42(11):74–81 https://doi.org/10.1145/319382.319398. http://doi.acm.org/10.1145/319382.319398
Oviatt S, Cohen P (2000) Perceptual user interfaces: multimodal interfaces that process what comes naturally. Commun ACM 43(3):45–53
Pasupat P, Jiang TS, Liu E, Guu K, Liang P (2018) Mapping natural language commands to web elements. In: Proceedings of the 2018 conference on empirical methods in natural language processing, pp 4970–4976. ACL, Brussels, Belgium. https://doi.org/10.18653/v1/D18-1540. https://www.aclweb.org/anthology/D18-1540
Pasupat P, Liang P (2015) Compositional semantic parsing on semi-structured tables. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing. http://arxiv.org/abs/1508.00305. ArXiv: 1508.00305
Porcheron M, Fischer JE, Reeves S, Sharples S (2018) Voice interfaces in everyday life. In: Proceedings of the 2018 CHI conference on human factors in computing systems, CHI ’18. ACM, New York, NY, USA. https://doi.org/10.1145/3173574.3174214
Price D, Riloff E, Zachary J, Harvey B (2000) NaturalJava: a natural language interface for programming in Java. In: Proceedings of the 5th international conference on intelligent user interfaces, IUI ’00, pp 207–211. ACM, New York, NY, USA. https://doi.org/10.1145/325737.325845. http://doi.acm.org/10.1145/325737.325845
Qi S, Jia B, Huang S, Wei P, Zhu SC (2020) A generalized earley parser for human activity parsing and prediction. IEEE Trans Pattern Anal Mach Intell
Ravindranath L, Thiagarajan A, Balakrishnan H, Madden S (2012) Code in the air: simplifying sensing and coordination tasks on smartphones. In: Proceedings of the twelfth workshop on mobile computing systems & applications, HotMobile ’12, pp 4:1–4:6. ACM, New York, NY, USA. https://doi.org/10.1145/2162081.2162087. http://doi.acm.org/10.1145/2162081.2162087
Reimers N, Gurevych I (2019) Sentence-bert: sentence embeddings using siamese bert-networks. In: Proceedings of the 2019 conference on empirical methods in natural language processing. ACL. http://arxiv.org/abs/1908.10084
Rodrigues A (2015) Breaking barriers with assistive macros. In: Proceedings of the 17th international ACM SIGACCESS conference on computers & accessibility, ASSETS ’15, pp 351–352. ACM, New York, NY, USA. https://doi.org/10.1145/2700648.2811322. http://doi.acm.org/10.1145/2700648.2811322
Sahami Shirazi A, Henze N, Schmidt A, Goldberg R, Schmidt B, Schmauder H (2013) Insights into layout patterns of mobile user interfaces by an automatic analysis of android apps. In: Proceedings of the 5th ACM SIGCHI symposium on engineering interactive computing systems, EICS ’13, pp 275–284. ACM, New York, NY, USA. https://doi.org/10.1145/2494603.2480308. http://doi.acm.org/10.1145/2494603.2480308
Sap M, Le Bras R, Allaway E, Bhagavatula C, Lourie N, Rashkin H, Roof B, Smith NA, Choi Y (2019) Atomic: an atlas of machine commonsense for if-then reasoning. Proc AAAI Conf Artif Intell 33:3027–3035
Sereshkeh AR, Leung G, Perumal K, Phillips C, Zhang M, Fazly A, Mohomed I (2020) Vasta: a vision and language-assisted smartphone task automation system. In: Proceedings of the 25th international conference on intelligent user interfaces, pp 22–32
She L, Chai J (2017) Interactive learning of grounded verb semantics towards human-robot communication. In: Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: long papers), pp 1634–1644. ACL, Vancouver, Canada. https://doi.org/10.18653/v1/P17-1150. https://www.aclweb.org/anthology/P17-1150
Shneiderman B (1983) Direct manipulation: a step beyond programming languages. Computer 16(8):57–69. https://doi.org/10.1109/MC.1983.1654471
Shneiderman B, Plaisant C, Cohen M, Jacobs S, Elmqvist N, Diakopoulos N (2016) Designing the user interface: strategies for effective human-computer interaction, 6th edition. Pearson, Boston
Srivastava S, Labutov I, Mitchell T (2017) Joint concept learning and semantic parsing from natural language explanations. In: Proceedings of the 2017 conference on empirical methods in natural language processing, pp 1527–1536
Su Y, Hassan Awadallah A, Wang M, White RW (2018) Natural language interfaces with fine-grained user interaction: a case study on web apis. In: The 41st international ACM SIGIR conference on research and development in information retrieval, SIGIR ’18, pp 855–864. ACM, New York, NY, USA. https://doi.org/10.1145/3209978.3210013
Suhm B, Myers B, Waibel A (2001) Multimodal error correction for speech user interfaces. ACM Trans Comput-Hum Interact 8(1):60–98. https://doi.org/10.1145/371127.371166. http://doi.acm.org/10.1145/371127.371166
Swearngin A, Dontcheva M, Li W, Brandt J, Dixon M, Ko AJ (2018) Rewire: interface design assistance from examples. In: Proceedings of the 2018 CHI conference on human factors in computing systems, CHI ’18, pp 1–12. ACM, New York, NY, USA. https://doi.org/10.1145/3173574.3174078
Ur B, McManus E, Pak Yong Ho M, Littman ML (2014) Practical trigger-action programming in the smart home. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’14, pp 803–812. ACM, New York, NY, USA. https://doi.org/10.1145/2556288.2557420. http://doi.acm.org/10.1145/2556288.2557420
Vadas D, Curran JR (2005) Programming with unrestricted natural language. In: Proceedings of the Australasian language technology workshop 2005, pp 191–199
Vrandečić D, Krötzsch M (2014) Wikidata: a free collaborative knowledgebase. Commun ACM 57(10):78–85. http://dl.acm.org/citation.cfm?id=2629489
Xu Q, Erman J, Gerber A, Mao Z, Pang J, Venkataraman S (2011) Identifying diverse usage behaviors of smartphone apps. In: Proceedings of the 2011 ACM SIGCOMM conference on internet measurement conference, IMC ’11, pp 329–344. ACM, New York, NY, USA. https://doi.org/10.1145/2068816.2068847. http://doi.acm.org/10.1145/2068816.2068847
Yang JJ, Lam MS, Landay JA (2020) Dothishere: multimodal interaction to improve cross-application tasks on mobile devices. In: Proceedings of the 33rd annual ACM symposium on user interface software and technology, UIST ’20, pp 35–44. ACM, New York, NY, USA. https://doi.org/10.1145/3379337.3415841
Yao Z, Su Y, Sun H, Yih WT (2019) Model-based interactive semantic parsing: a unified framework and a text-to-SQL case study. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp 5447–5458. ACL, Hong Kong, China. https://doi.org/10.18653/v1/D19-1547. https://www.aclweb.org/anthology/D19-1547
Yao Z, Tang Y, Yih WT, Sun H, Su Y (2020) An imitation game for learning semantic parsers from user interaction. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pp 6883–6902. ACL, Online. https://doi.org/10.18653/v1/2020.emnlp-main.559. https://www.aclweb.org/anthology/2020.emnlp-main.559
Yeh T, Chang TH, Miller RC (2009) Sikuli: using GUI screenshots for search and automation. In: Proceedings of the 22nd annual ACM symposium on user interface software and technology, UIST ’09, pp 183–192. ACM, New York, NY, USA. https://doi.org/10.1145/1622176.1622213. http://doi.acm.org/10.1145/1622176.1622213
Zhang X, Ross AS, Fogarty J (2018) Robust annotation of mobile application interfaces in methods for accessibility repair and enhancement. In: Proceedings of the 31st annual ACM symposium on user interface software and technology, UIST ’18
Zhang Z, Zhu Y, Zhu SC (2020) Graph-based hierarchical knowledge representation for robot task transfer from virtual to physical world. In: 2020 IEEE/RSJ international conference on intelligent robots and systems (IROS)
Zhao S, Ramos J, Tao J, Jiang Z, Li S, Wu Z, Pan G, Dey AK (2016) Discovering different kinds of smartphone users through their application usage behaviors. In: Proceedings of the 2016 ACM international joint conference on pervasive and ubiquitous computing, UbiComp ’16, pp 498–509. ACM, New York, NY, USA. https://doi.org/10.1145/2971648.2971696. http://doi.acm.org/10.1145/2971648.2971696

Acknowledgements

This research was supported in part by Verizon through the Yahoo! InMind project, a J.P. Morgan Faculty Research Award, NSF grant IIS-1814472, AFOSR grant FA95501710218, and Google Cloud Research Credits. Any opinions, findings or recommendations expressed here are those of the authors and do not necessarily reflect views of the sponsors. We thank Amos Azaria, Yuanchun Li, Fanglin Chen, Igor Labutov, Xiaohan Nancy Li, Xiaoyi Zhang, Wenze Shi, Wanling Ding, Marissa Radensky, Justin Jia, Kirielle Singarajah, Jingya Chen, Brandon Canfield, Haijun Xia, and Lindsay Popowski for their contributions to this project.

Author information

Authors and Affiliations

Carnegie Mellon University, Pittsburgh, PA, USA
Toby Jia-Jun Li, Tom M. Mitchell & Brad A. Myers

Corresponding author

Correspondence to Toby Jia-Jun Li.

Editor information

Editors and Affiliations

Google Research (United States), Mountain View, CA, USA
Yang Li
Advanced Interactive Technologies Lab, ETH Zurich, Zurich, Switzerland
Otmar Hilliges

About this chapter

Cite this chapter

Li, T.JJ., Mitchell, T.M., Myers, B.A. (2021). Demonstration + Natural Language: Multimodal Interfaces for GUI-Based Interactive Task Learning Agents. In: Li, Y., Hilliges, O. (eds) Artificial Intelligence for Human Computer Interaction: A Modern Approach. Human–Computer Interaction Series. Springer, Cham. https://doi.org/10.1007/978-3-030-82681-9_15

DOI: https://doi.org/10.1007/978-3-030-82681-9_15
Published: 05 November 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-82680-2
Online ISBN: 978-3-030-82681-9
eBook Packages: Computer Science (R0)