Let us now return to MbM and ask what optical perspective is implied in it. What lens or lenses are encoded in the computational gaze we set upon the BBC Television archive?
We know the Visual Genome uses images originally sourced from Flickr (Krishna et al. 2017, 47), and that the photo platform hosts many of these images along with their EXIF data,Footnote 9 an international metadata standard for digital images and sound that includes tags for camera settings and lens information.
EXIF is far from perfect. Its metadata structure is borrowed from TIFF files and is now over 30 years old. A notable drawback of working with this type of data is therefore its inconsistency, given the quick pace at which digital cameras have changed over the last decades and the many differences in how they have used the standard over time, even among cameras from the same manufacturer. What is more, some manufacturers such as Nikon use custom format fields not shared by any other brand and encrypt the metadata contained in them. This makes such EXIF data very difficult to extract, disaggregate and process.Footnote 10 Finally, this type of metadata is not usually available for photographs that are not born-digital, i.e. those taken with analogue cameras or scanned from prints.Footnote 11
These caveats notwithstanding, EXIF remains the most widely used metadata standard for photography and, as such, a key resource for researching the equipment and technical practice underlying the creation of photographic images in the digital age. Precisely because of its longevity and pervasive use, it is one of the few ways to trace a technical lineage from lenses to computer vision. It is quite possibly the only way such a lineage can be traced at a larger scale, given the size of the image collections used in deep learning.
We extracted EXIF metadata from all the images whose Flickr IDs matched those present in the Visual Genome. The metadata standard comprises over twenty thousand tags, but we selected only tags general enough to be reported by most cameras. Within these, we focused on the ones containing data about the parameters over which photographers tend to have most choice and control, namely their choice of camera and lens, as well as the aperture, exposure and focal length settings. Table 1 lists the tags that were queried and an example of the values extracted. Table 2 shows an overview of the extraction results.
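As a minimal sketch of this selection step, assuming the raw EXIF of each image has already been read into a Python dictionary (e.g. via a library such as exifread), and with tag names that are illustrative stand-ins rather than the exact tags we queried:

```python
# Five illustrative tag names standing in for the queried EXIF tags;
# real tag names vary between EXIF readers and camera makers.
QUERIED_TAGS = ["Model", "LensInfo", "FNumber", "ExposureTime", "FocalLength"]

def select_exif(raw_tags):
    """Keep only the queried tags that are present with a non-empty value."""
    return {tag: raw_tags[tag] for tag in QUERIED_TAGS if raw_tags.get(tag)}
```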
The extraction process yielded a relatively dense distribution, with over 83% of accessible images returning metadata in at least one of the five queried tags. The one exception was <Lens info>, for which only 10% of accessible images returned values. In light of this, we decided to consolidate data for all tags except <Lens info>, which was kept separate for later analysis. We also parsed apertures and focal lengths to bin them into categories: twelve bins corresponding to full f-stops for apertures, from f1 to f45,Footnote 12 and six focal distance bins corresponding to a commonly used classificationFootnote 13:
Ultra wide (< 24 mm)
Wide (24–35 mm)
Normal (35–85 mm)
Short telephoto (85–135 mm)
Medium telephoto (135–300 mm)
Super telephoto (> 300 mm)
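The two binning steps can be sketched as follows; snapping apertures to the nearest full stop and the handling of the shared bin boundaries (e.g. exactly 35 mm) are our assumptions here, not necessarily the exact rules used:

```python
# Twelve full f-stops, from f1 to f45.
FULL_STOPS = [1, 1.4, 2, 2.8, 4, 5.6, 8, 11, 16, 22, 32, 45]

def aperture_bin(f_number):
    """Snap an f-number to the nearest full stop (assumed rule)."""
    return min(FULL_STOPS, key=lambda stop: abs(stop - f_number))

def focal_length_bin(mm):
    """Bin a focal length in mm into the classification above."""
    if mm < 24:
        return "Ultra wide"
    if mm < 35:
        return "Wide"
    if mm < 85:
        return "Normal"
    if mm < 135:
        return "Short telephoto"
    if mm <= 300:
        return "Medium telephoto"
    return "Super telephoto"
```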
We also parsed exposures to remove faux entries (e.g. a small number of older mobile phones reported infinite or zero exposure values), and manually matched variant manufacturer names (e.g. ‘NIKON’ and ‘Nikon Corporation’). The consolidated data frame includes all values in all remaining tags, for a total of 68,085 entries, i.e. 66% of all images that comprise the Visual Genome (v1.2). An example of our working data frame is shown in Table 3.
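A sketch of this cleaning step, with an illustrative (not exhaustive) alias table for manufacturer names and dictionary keys of our own choosing:

```python
import math

# Illustrative aliases; the full mapping in our pipeline was built by hand.
MAKER_ALIASES = {
    "NIKON": "Nikon",
    "Nikon Corporation": "Nikon",
}

def clean_entry(entry):
    """Drop entries with faux exposures (zero or infinite) and
    normalise the manufacturer name; returns None if unusable."""
    exposure = entry.get("exposure")
    if exposure is not None and (exposure <= 0 or math.isinf(exposure)):
        return None
    make = entry.get("make", "").strip()
    entry["make"] = MAKER_ALIASES.get(make, make)
    return entry
```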
Our analysis of EXIF data shows the clear dominance of DSLR over other types of equipment, with Canon and Nikon being the two major manufacturers combining for over 64% of all cameras, more than eight times the share of the third largest manufacturer, Sony, at 8% (Fig. 4).
Among these, the ten most popular camera models all correspond to Canon EOS and Nikon DX systems, with the sole exception of the Apple iPhone 4, at number nine. The most common camera in our dataset is Nikon’s D90, an entry-level DSLR released in 2008 and the first DSLR with video-recording capabilities. The second most popular is the semi-professional Canon 5D Mark II, released the same year, closely followed by Canon’s 7D, released in 2009.
In terms of how these cameras were used, our analysis identifies the apertures f2.8, f4 and f5.6 as the most popular, together accounting for 74% of photographs (Fig. 5).
For focal length, lenses between 35 and 85 mm are the most common, accounting for 50.7% of the images; the least popular is the super telephoto, used for only 1.6% of the photos in our dataset (Fig. 6).
Exposure was more evenly distributed between the extremes, with the notable exception of 1/60 s, which was significantly more popular than all other shutter speeds. This is possibly due to the common belief that this is the slowest shutter speed at which one can shoot handheld without a tripod.Footnote 14 Figure 7 shows the ten most common combinations of aperture, focal length and exposure in the Visual Genome images, all of which are parameters under the direct control of their photographers.
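Counting the combinations behind a figure like Fig. 7 reduces to a frequency count over (aperture, focal length, exposure) triples; the field names here are our own:

```python
from collections import Counter

def top_settings(entries, n=10):
    """Return the n most common combinations of camera settings."""
    combos = Counter(
        (e["aperture"], e["focal_bin"], e["exposure"]) for e in entries
    )
    return combos.most_common(n)
```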
These findings are consistent with the practices of a “proficient consumer” community of photo enthusiasts working with DSLR equipment.Footnote 15 These are generally non-professional photographers who are nevertheless willing to invest in a bulkier and more expensive camera and to take the time to learn how to operate it manually. Users of the Nikon D90 are often recent converts migrating upwards from common point-and-shoot photography; more established and committed users of a Canon 7D probably own a few lenses already and might be close to going professional. This grouping is also supported by our smaller sample of lens data from the <Lens info> tag, which shows that inexpensive lenses bundled with cameras are very popular, e.g. the 18–55 mm f/3.5–5.6 included in both Nikon and Canon starter kits (camera body + lens), but also includes a few more expensive lenses, particularly at longer focal lengths, e.g. the 100–400 mm f/4.5–5.6L or the EF70-200 mm f/2.8L, both by Canon.Footnote 16 We believe these lenses overlap with professional practice and were probably acquired as second or third lenses for purpose-specific photography, specifically wildlife or sports, both of which featured heavily in a manual sampling we conducted over images taken with these two models.Footnote 17
By identifying the dominant photographic practices of this community of DSLR enthusiasts in the Visual Genome, we show the implicit optical perspective mobilised in MbM. If one were to ask not about the accuracy in detecting what is depicted, but about the latent camera of this particular computer vision system, we could now reply with some degree of confidence that this perspective falls within the focal range of an 18–55 mm lens on an APS-C or APS-H camera, at apertures between f3.5 and f5.6, and a likely exposure of 1/60 s. Casting aside some of the other complexities of MbM for a moment, we could say that in general terms this was the lens through which the BBC archive was seen.
Today, DSLR photography of this kind is a somewhat dying practice, as sales of this type of camera have been steadily declining over the past decade (CIPA 2019). Everyday photographs are now taken with mobile phones and circulated through social media (Herrman 2018). However, while the equipment and the communities that supported this visual regime recede into history, lens aesthetics are anything but history. On the contrary, the standard of photography set by DSLR practitioners is now being reimagined under the logic of digital computation and mobile phones,Footnote 18 pursued through software and through AI (See for example: Yang et al. 2016; Ignatov et al. 2017).
With this in mind, we suggest turning computer vision on itself and asking whether it is possible to engineer a machine that tells us about the becoming of images: not only what they depict but how. If we concede that the “aboutness” with which we invest photographic images—including their epistemic advantage—is a function of the depicted no less than of the depiction modality, such an aesthetic machine, we argue, is as justified as one that distinguishes cats from dogs, or hot-dogs from other sandwiches. Could we not train machines to learn about optical perspectives, as well as what these perspectives are used for at given times in history?
To close this article, we offer a prototype along these lines as a proof-of-concept, which is purposefully designed to be blind to what photographs are of; a type of vision that cares nothing about recognising objects, people or scenes, and is instead programmed to learn only about how its images were made and the visual perspectives they embody, in this case the focal distance of the lenses with which they were taken.
Using the EXIF dataset we assembled and the images from the Visual Genome, we trained a neural network to classify focal lengths, distinguishing photographs taken with a wide-angle lens from those taken with a telephoto. The class boundaries are drawn at under 24 mm for the former and over 135 mm for the latter. Each class was given a little over 12,000 training samples (Fig. 8). The model was trained from scratch using a VGG-based convolutional neural network (Fig. 9).
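A VGG-style architecture for this task can be sketched along the following lines; the layer sizes, input resolution and two-logit head are illustrative stand-ins, not the exact configuration we trained:

```python
import torch
import torch.nn as nn

def vgg_block(c_in, c_out):
    """Two 3x3 convolutions followed by 2x2 max pooling, as in VGG."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(c_out, c_out, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
    )

class FocalLengthNet(nn.Module):
    """Binary classifier: wide angle (< 24 mm) vs telephoto (> 135 mm)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            vgg_block(3, 32), vgg_block(32, 64), vgg_block(64, 128))
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 16 * 16, 256), nn.ReLU(),
            nn.Linear(256, 2),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = FocalLengthNet()
logits = model(torch.zeros(1, 3, 128, 128))  # one dummy 128x128 RGB image
```

Training would then proceed in the standard way, minimising a cross-entropy loss over the two classes.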
Our results show a test accuracy of 83% after fourteen epochs of training. We manually tested the model at this checkpoint by running inference on several photographs not contained in the Visual Genome, to confirm that it performed as expected on out-of-sample images. But we are only at the beginning of our work here. Without probing further into the model and conducting more systematic tests, it is difficult to know what exactly the neural network has learned from these images. One of our working hypotheses is that low-level features like the texture of bokeh or warmer green tones might correlate strongly with longer focal lengths, since both the field of view and the speed of many of these lenses favour their outdoor use. In any case, our initial results already suggest that, with some exceptions such as irregularly shaped images from elongated panoramas, grainy images or images captured with optical zoom, the predictions of our classifier were reasonably accurate for photographs taken with either very long or very wide lenses. Figure 10 shows a comparison of two successfully classified images. For the casual observer who sees these two images all at once, instead of counting them pixel by pixel, there are many apparent differences: one is the Shard in London, the other a baby orangutan in Borneo; one is a landscape, the other a portrait; one is a night scene, the other was taken in broad daylight. However, when it comes to identifying the type of lens used to render these scenes visible, this kind of a posteriori knowledge may in fact be a task for which computer vision is much better suited. In particular, deep convolutional networks can help with their progressive and content-agnostic abstraction of pixel relations.
Going back to MbM, we used our focal length classifier on frames from one of the mislabelled sections mentioned at the beginning (Fig. 11). Comparing the predictions output by the two systems, our ‘telephoto’ classification seems intuitively more accurate than MbM’s ‘reflection in a mirror’. This might be an extreme example, but it points to a fundamental problem that is sometimes overlooked in machine learning: which prediction tells us more about the image? What kind of knowledge is implied by each, and when or why would we prefer one kind over the other?