
1 Introduction: Generations to Generative

Last academic year, the Applied Computer Science program at Woodbury University was invited by The J. Paul Getty Museum, a premier art institution in Los Angeles, to design an installation for the College Night event and exhibition scheduled for April 2019. In conversations with educational specialists at the museum, interest emerged in exploring human emotions as the thematic element of a project that would merge art and technology, a theme that aligned well with the undergraduate program’s mission.

To approach the design and development of this project, we created a collaboration between two sophomore classes and designed a new syllabus for each course to include dialogue with the museum. The syllabi also included guided visits to the museum to study artworks in its collection, learning about the artists’ intentions and what the works communicate in their portrayal of emotions and facial expressions.

While studying the museum’s art collection [5], which comprises Greek, Roman, and Etruscan art from the Neolithic Age to Late Antiquity, and European art from the Middle Ages to the early 20th century, through a contemporary lens, questions arose about static, finished pieces of art in contrast to interactive and responsive artworks [6]. In our attempt to connect the student project to the museum and its art collection, we decided to explore solutions that would allow for the creation of dynamic generative visuals combined with content derived from the existing collection.

A key observation guiding our concept was that artworks in exhibitions and museums are typically accompanied by a title and a brief description. In our case, unique visuals would be generated based on user interaction, so we proposed the creation of new synthetic titles and descriptions to accompany each user engagement. It seemed fitting that the textual output would be generated by machine learning algorithms trained on existing texts from the museum’s art collection, creating a unique connection that bridges carefully curated static content with dynamically generated visuals.

The resulting installation uses a Kinect sensor to analyze users’ movements and a separate camera to read facial expressions [3] via computer vision and Deep Learning algorithms. Their outputs then serve as the basis for text generation by natural language models trained on the text descriptions of the Getty Museum’s art collection.

2 Process: Beyond Codified Interaction

Our goal was to create a new media art [4] piece that establishes an intimate connection with its users while periodically generating new content that is always evolving and changing, never the same.

The installation consists of a vertical video wall composed of three landscape-oriented screens, with two sensors mounted on top: a Microsoft Kinect One and a USB web camera. We used two software platforms: PyCharm as an integrated development environment for the Python programming language, and Derivative TouchDesigner as a real-time rendering, visual programming platform. The two platforms communicated with each other via TCP/IP sockets over the network.
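As a rough illustration of this communication path, the following is a minimal sketch of the Python side sending analysis results to TouchDesigner as newline-delimited JSON over a TCP socket; the port number and message fields are illustrative assumptions, not the project’s exact protocol.

```python
# Minimal sketch of the Python-side sender, assuming TouchDesigner listens
# with a TCP/IP DAT on port 7000 (port and message format are illustrative).
import json
import socket

def send_analysis(host="127.0.0.1", port=7000, payload=None):
    """Send one newline-terminated JSON message with the latest analysis results."""
    payload = payload or {"num_people": 2, "expression": "happy", "age": 24}
    with socket.create_connection((host, port)) as sock:
        sock.sendall((json.dumps(payload) + "\n").encode("utf-8"))

if __name__ == "__main__":
    send_analysis()
```

On the TouchDesigner side, a TCP/IP DAT listening on the same port can receive each line and route the parsed values into the rendering network.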

The title of the project, “WISIWYG”, is a play on the popular acronym “What You See Is What You Get” (WYSIWYG), based on the idea that the installation incorporates computer vision processing and machine learning algorithms, in a sense allowing it to generate outputs according to what it sees from its own perspective.

The user experience was a central component of the project and was carefully crafted. After several sessions of user testing to determine the ideal flow for the installation, we designed a sequence of animated events that guides the user through an experience lasting about two minutes in total (Fig. 1).

Fig. 1. Installation diagram

As a result, two types of media were developed in parallel: media displayed during users’ interaction, and media generated in between interactions (Fig. 2).

Fig. 2. User experience diagrams

Once the user experience had been designed, the following strategies were developed in order to connect and interact with the attendees.

3 User Analysis

The primary driver of the project is the camera input, processed through computer vision and machine learning algorithms. During users’ engagement, computer vision is first used to isolate faces and determine the number of people in the camera’s field of view. This step is accomplished with a traditional face-tracking method from the popular Open Computer Vision (OpenCV) library. The second step uses a Deep Learning model based on a Convolutional Neural Network (CNN) [7], programmed in Python with the Keras and TensorFlow frameworks.
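As a hedged sketch of this first step (not the project’s exact code), face isolation with OpenCV’s bundled Haar cascade detector might look as follows:

```python
# Detect faces in a single webcam frame using OpenCV's Haar cascade detector.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)          # default built-in camera
ok, frame = cap.read()
cap.release()

if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    print(f"Detected {len(faces)} face(s)")   # number of people in view
    for (x, y, w, h) in faces:
        face_crop = gray[y:y+h, x:x+w]        # sub-window passed on to the CNN classifier
```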

We utilized a 5-layer CNN model made popular by the Kaggle [2] facial expression recognition challenge. The model was trained on the FER-2013 dataset distributed with the challenge, which consists of roughly 28,000 training and 3,000 test images of faces stored as 48 × 48-pixel grayscale images. To provide the image data in the format the CNN model expects, sub-windows of the camera feed containing faces detected by the OpenCV library were scaled down to 48 × 48 pixels before being passed to this step. The Python code to construct the facial expression detection model in Keras, as well as the other Deep Learning models and associated weight parameters used in the project and discussed below, is available in an open GitHub code repository.
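The exact network is available in the repository; the following Keras sketch only illustrates the general shape of a small CNN for 48 × 48 grayscale inputs and the seven FER expression classes, with layer sizes chosen for illustration rather than taken from the project.

```python
# Illustrative Keras sketch of a small CNN for 48x48 grayscale expression
# classification (7 FER classes); layer sizes are assumptions, not the
# exact model from the project repository.
from tensorflow.keras import layers, models

def build_expression_model(num_classes=7):
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=(48, 48, 1)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_expression_model()
# model.load_weights("expression_weights.h5")  # hypothetical pre-trained weights file
```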

4 Media Creation

When visitors approach the installation, a unique animation is generated on the video wall based on how many people are in front of the screen, their facial expressions, and their estimated ages. The animation comprises the following elements:

  • Silhouettes: A mirror-effect reflection of users’ silhouettes on the screen featuring real-time generated visuals, playing inside and outside the silhouettes.

  • Scenes: Each student designed four real-time scenes with animated content, driven by hard-coded rules based on the parameters obtained through computer vision algorithms.

  • Color palettes: Students generated parametrized color animations driven by the computer vision inputs (estimated ages, facial expressions, and number of people) (Fig. 3); a simplified mapping is sketched after the figure.

Fig. 3. Gallery showcasing several of the animations designed by the students and the different color palettes applied based on the users’ facial expressions and movements.
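The hard-coded rules differed from scene to scene and from student to student; the following is a simplified, hypothetical sketch of how the computer vision outputs could be mapped to a color palette.

```python
# Hypothetical mapping from computer-vision outputs to an HSV color palette;
# the actual rules were hard-coded per scene in TouchDesigner and varied.
EXPRESSION_HUES = {          # base hue per detected expression (degrees)
    "happy": 50, "sad": 220, "angry": 0, "surprise": 300, "neutral": 180,
}

def palette_for(expression, num_people, age):
    hue = EXPRESSION_HUES.get(expression, 180)
    saturation = min(1.0, 0.4 + 0.15 * num_people)   # more people, richer color
    value = 1.0 - min(0.5, age / 200.0)              # older ages, slightly darker
    return (hue, saturation, value)

print(palette_for("happy", num_people=2, age=24))
```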

5 Synthetic Text Descriptions

This portion of the project began by screening the Getty’s art collection online and selecting all artifacts depicting people. Next, a database was created by recording the artworks’ titles and descriptions, as well as subjectively estimating the number and ages [10] of the people featured.

In analyzing the Getty’s art collection, students experimented with the Deep Learning natural language model GPT-2 [9], short for Generative Pre-trained Transformer 2, released by the research organization OpenAI in February 2019. The language model generates convincing responses to textual prompts based on set parameters such as the maximum length of the response and its ‘temperature’, which controls how closely the output conforms to patterns found in the training data.

Our project utilized the GPT-2 model with 117 million parameters, the largest made publicly available by OpenAI as of April 2019. The model was “fine-tuned” by the students, a process in which an existing machine learning model is re-trained to fit new data, using descriptions of artworks on display at the Getty Center as the training dataset. To accomplish this task, we used the Google Colaboratory [1] notebook environment, which allows Python code to be shared and executed online. A Colaboratory notebook designed for fine-tuning the GPT-2 model allowed each student to individually analyze and modify the Python code to read new text data, re-train the language model, and produce new text descriptions based on interactive text prompts.
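The fine-tuning workflow can be reproduced in a Colaboratory notebook along the following lines; this sketch assumes the open-source gpt-2-simple library and an illustrative training-text file name, and is not necessarily the exact notebook the students used.

```python
# Hedged sketch of fine-tuning and sampling GPT-2 with gpt-2-simple.
import gpt_2_simple as gpt2

MODEL = "124M"   # smallest released GPT-2 (described as 117M parameters at release)

gpt2.download_gpt2(model_name=MODEL)

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="getty_descriptions.txt",   # hypothetical training-text file
              model_name=MODEL,
              steps=500)

# Generate a synthetic description from an interactive prompt.
texts = gpt2.generate(sess,
                      prefix="two happy people",
                      length=60,
                      temperature=0.8,
                      return_as_list=True)
print(texts[0])
```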

The text prompts were pre-generated by the students based on their own analysis of artworks and consisted of short singular and plural descriptions such as “sad young person” or “two happy people.” The prompts were interactively input to the GPT-2 model to generate responses, which were entered into a database whose rows were tagged with columns for age, number of people, and facial expression. The database content was then programmatically correlated with the outputs of computer vision processing and Deep Learning classification using the Pandas library in Python, by selecting a random database cell containing a response that matched the detected facial expression tags. Finally, the selected response was rendered as the text description accompanying the visual output onscreen.
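A minimal sketch of this lookup with Pandas is shown below; the column names and CSV file are illustrative rather than the project’s exact schema.

```python
# Select a random pre-generated description matching the detected tags.
import pandas as pd

db = pd.read_csv("generated_descriptions.csv")   # columns: age, num_people, expression, text

def pick_description(expression, num_people):
    matches = db[(db["expression"] == expression) & (db["num_people"] == num_people)]
    if matches.empty:
        matches = db[db["expression"] == expression]   # fall back to expression only
    return matches["text"].sample(1).iloc[0]           # random matching response

print(pick_description("happy", 2))
```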

It was beyond the scope of the project to build students’ expertise in constructing their own Deep Learning models or to invest resources in training models from scratch, especially in view of the time and computing power that would be expended in the process. Researchers at the University of Massachusetts, Amherst, have shown that training the GPT-2 language model, for example, can consume between $12,902 and $43,008 in cloud computing costs [11]. While training such computationally expensive models was beyond the scope of our project, each student had a chance to install the relevant Python code and libraries, analyze them, and fine-tune and run the models on their personal computers.

In addition to facial expression detection, the project incorporated Deep Learning models for age and gender detection, though we ultimately decided not to pursue gender recognition due to non-technical concerns about bias and non-binary gender identification. The age detection was built on a successful CNN architecture based on the Oxford Visual Geometry Group’s (VGG) model [8], which gained recognition in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2014 [11]. The dataset used to train the age model, known as IMDB-WIKI [13], consists of over 500,000 labeled faces.

To utilize the model effectively, we again made use of pre-trained weights found online [14], applying transfer learning techniques to work with the real-time video data in our project. As a result, each student was able to experiment directly with deploying machine learning models in Python code to process their computers’ built-in camera input and generate predictions of facial expressions and age estimates for each other, initially in the classroom environment and eventually in the public setting of the museum exhibition (Fig. 4).
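A hedged sketch of applying pre-trained age-estimation weights to a detected face crop follows; the weight file name, input size, and 101-class age output follow the common IMDB-WIKI VGG setup and are assumptions rather than the project’s exact code.

```python
# Estimate apparent age from a face crop using a pre-trained Keras model.
import cv2
import numpy as np
from tensorflow.keras.models import load_model

age_model = load_model("age_vgg_imdb_wiki.h5")     # hypothetical saved model + weights

def estimate_age(face_bgr):
    # Real pre-trained weights may expect specific preprocessing
    # (RGB channel order, mean subtraction); this is a simplification.
    face = cv2.resize(face_bgr, (224, 224)).astype("float32")
    probs = age_model.predict(np.expand_dims(face, axis=0))[0]   # 101 class probabilities
    return float(np.dot(probs, np.arange(len(probs))))           # expected (apparent) age

# face_bgr would come from the OpenCV face-detection step shown earlier.
```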

Fig. 4. Example of texts generated by machine learning algorithms overlaid on the real-time generated animations.

6 Thoughts

With the ongoing evolution of computing hardware and software, machine learning is becoming more accessible and widespread. We are now able to expand interaction to more complex layers, allowing the recognition of patterns, movements, facial expressions, and more. Along with the rapid evolution of real-time rendering engines, programmable shaders, and new algorithms, it is now possible to create real-time data-driven media at high resolutions and with great rendering quality. Advances in technology and computing exemplified by Deep Learning can contribute to deeper connections and new layers of interactivity.

Different simultaneous levels of interactivity occurred in this project, some with a direct and transparent effect, and others with more elaborate or indirect interaction. For example, we discovered that when users recognized their own silhouettes, they engaged more actively with the installation: dancing, jumping, moving their hands, or even performing backflips. The facial expression and age-based visualization parameters were not as transparent, since it is harder for users to recognize this relationship during a time-limited engagement (Fig. 5).

Fig. 5. Users experiencing the installation and playing with their silhouettes on the screen.

Even though our database was quite small, the generative algorithms provided widely variable outputs. The uncertainty involved in interpreting the generated textual content created engagement as well, as attested by visitors who inquired on many occasions why a certain title or description appeared on the screen. Further explorations should address a more thorough analysis of users’ feelings or personality, in search of a deeper and more profound connection between the art piece and the subject.

After creating our database, at the beginning of the user testing process, some of the students raised concerns about the use of gender in camera-based analysis and generated texts. Questions came up regarding the role of gender identity in today’s society, opening the door to further discussions involving human-machine interaction. We attempted to address this issue in part by avoiding the use of gender, programmatically manipulating the generated texts, and manually editing our database: screening for male and female pronouns and attempting to “de-gender” the texts by replacing those pronouns with the plural “they” or the neutral “person” wherever appropriate.

The possibility of using gender detection and other forms of identification afforded by Deep Learning algorithms opens the door to a host of significant questions and concerns. Shoshana Zuboff, in her book The Age of Surveillance Capitalism [15], raises key points to reflect on, now that a wide range of software applications and hardware devices we use daily monitor, log, and process a wide variety of data obtained from the user. Zuboff warns that these technologies are often designed to collect user data covertly, in pursuit of individual and societal behavior modification. Users’ privacy while being monitored, and the issues surrounding it, were only briefly addressed in this project, but they are undoubtedly significant questions to be aware of, discussed, and reflected upon when designing interactive experiences.

7 Conclusion

Features that make this project unique include the combination of real-time generative graphics with new machine learning models, the nature of its development within an academic environment, and the opportunity for the students to exhibit at a major art institution. Finding ways to connect the existing art collection to our project, while addressing the student learning outcomes, resulted in a project that fulfills the mission of our University program: hybrid art and technology.

By tapping into different areas of study in one experimental project, we feel that we offered the students an opportunity to understand how diverse disciplines can be intertwined and relevant to each other, while the deployment of a live interactive installation at a world-renowned art institution gave them valuable professional production experience. The numerous constraints, such as addressing multiple curriculum requirements, collaborating across academic and art institutions, and maintaining appropriate workloads, turned out to strengthen and boost creativity, allowing the students to become proficient in several technical skills while working with advanced programming frameworks and computational models.

Perhaps our primary contribution as faculty guiding this project has been to help students synthesize and leverage somewhat disparate technical and creative tools, frameworks, and resources as part of the learning process, as well as to provide a critical view of the state of technology and computer science in today’s culture. We hope this project summary provides a useful reference point for others seeking an approach to creative applications of contemporary technologies.