Toward Quantifying Ambiguities in Artistic Images

It has long been hypothesized that perceptual ambiguities play an important role in aesthetic experience: a work with some ambiguity engages a viewer more than one without it. However, current frameworks for testing this theory are limited by the availability of stimuli and data collection methods. This paper presents an approach to measuring the perceptual ambiguity of a collection of images. Crowdworkers are asked to describe image content after different viewing durations. Experiments are performed using images created with Generative Adversarial Networks, using the Artbreeder website. We show that text processing of viewer responses can provide a fine-grained way to measure and describe image ambiguities.


INTRODUCTION
"When looking at a picture, one should say that the more associations it can open up the better." -Pablo Picasso [7]

When confronted with a new image, the human visual system automatically tries to make sense of it [26,29,33,34]. Some images are easy to interpret, but others require more effort because they are ambiguous, multivalent, or indeterminate. Art theorists have argued that visual art often exploits ambiguity to engage viewers, by simultaneously suggesting and concealing the meaning of a work, or by evoking multiple diverse meanings [12], as Picasso suggests in the quote above. Time plays an important role in these theories: some images are confusing at first, but then lead to an "Aha" moment as the subject is recognized [21,25], whereas images that appear simple at first but become more perplexing over time are sometimes called "indeterminate" [27]. Indeterminacy is a major theme in modern visual art [12,28]; for example, Gerhard Richter, a major contemporary artist, deliberately creates and values indeterminate images [31] (Fig. 1).
Can we analyze these image properties quantitatively? Several recent authors have attempted to categorize them [22,23,27,28], and to put them in the context of neuroscience theories [15,35]. However, experimentation remains difficult. Previous studies have tended to use "handmade" artworks [9,18,19,22,24,36,37]. For example, most studies compare artworks made by one artist to works by another artist, or to works by the researchers. Such images have several limitations when used as experimental stimuli. For example, viewers' judgments may be influenced by historical, stylistic, or contextual factors not of direct relevance to the study. Highly simplified graphics, such as Mooney faces, have also been used as stimuli [21], but generalizing from these results is also challenging. Moreover, previous studies typically ask high-level subjective questions of in-person participants, such as whether or not an image is ambiguous or contains an object. As a result, the conclusions from these methods are promising but necessarily limited.
This paper proposes an approach to measuring perceptual ambiguity in artistic images. Study participants are shown an image for a fixed duration, and asked to describe the contents of the image. We hypothesized that the quantity and diversity of the descriptions would provide a measure of perceived ambiguity. We also hypothesized that the distribution of responses would vary for different viewing durations, reflecting how perception of an image evolves over time.
For stimuli, we use images from Generative Adversarial Networks (GANs) [14]. Specifically, we gather popular images from the website Artbreeder.com, since these images span a range of ambiguity and indeterminacy [16]. These stimuli avoid some of the limitations of previous studies: they are presented in their "native" format as digital images; they minimize art-historical and contextual confounds; they are visually rich and relatively free of stylistic bias; and they can be generated in large numbers.
Understanding ambiguous and indeterminate images is important not only for art theory and history but for our understanding of human image perception more generally. We show that histograms and entropy of viewer response can capture and summarize image ambiguities. Our results suggest how these kinds of studies could help describe, categorize, and measure the space of image ambiguity.

PERCEPTUAL STUDY AND DATA PROCESSING
Image stimuli: To form the image dataset, we manually identified a set of 150 images from Artbreeder.com that appeared to exhibit variations in image ambiguity. All images were taken from Artbreeder's "General" class, which provides images from the BigGAN model [5]. The first 120 images were manually selected from among the most popular images on Artbreeder, and loosely categorized into 4 categories of 30 images each: "Recognizable" (e.g., Fig. 2(top)), "Dichotomous" (depicting two or more distinct objects simultaneously, e.g., Fig. 7(d,e)), "Indeterminate" (open to multiple interpretations, e.g., Fig. 2(bottom)), and "Abstract" (clearly containing no objects, e.g., Fig. 7(a,f)). We manually constructed the remaining 30, "AbstractFlat" images (e.g., Fig. 7(g,h)), to be highly abstract, using the site's "gene editing" interface, by increasing the BigGAN truncation parameter via the "chaos gene" control and setting all presented embedding coordinates to -1.
Task design: We crowdsourced descriptions using Amazon's Mechanical Turk. A task consisted of a sequence of 30 images, all from a single category of the stimuli set, to avoid confounding effects. Participants viewed each image for either 0.5 or 3 seconds; the viewing duration was fixed for a single participant, but randomized across participants. After the image disappeared, participants completed an attention vigilance task whereby they reported the last location on the screen where they looked (based on a similar task design [11]). Then, participants entered a freeform text description. They were instructed to describe the scene and any objects in the image, "even if the image looks abstract at first". We recruited 70 participants for each combination of the two viewing durations and five categories from our stimuli set (launching 700 tasks in total). We filtered participants based on the attention vigilance task. After filtering, we retained on average 20.4 descriptions per image per viewing duration. Sample images and descriptions are shown in Figures 2(a,b) and 3.
Viewing duration: Human perception studies have shown that retention of visual details plateaus by 3 seconds of perception [3,4], while half a second allows for most visual information to enter conceptual working memory without interference [10,30]. We ran a pilot study with viewing durations of 300 ms, 500 ms, 1000 ms, and 3000 ms. Participants complained that 300 ms was too brief to understand what was depicted, while results at 1000 and 3000 ms were very similar. Thus, for our main experiment, we chose to collect data at two durations: 500 ms and 3000 ms.
Post-processing: Given the raw textual descriptions of a given image, we perform some simple text processing to form a histogram of responses (Figure 2(c)). We treat the set of responses as a "bag-of-words" [17]. Specifically, for each text description, we run a part-of-speech tagger [2] and keep only the nouns. We also discard any terms in a predefined set of 12 disallowed words, such as "abstract" and "art". Synonyms are grouped using NLTK [1], yielding a set of tokens. We then form a histogram of the tokens for the image, grouped by viewing duration. This process is performed separately for each image's responses.
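The post-processing step can be sketched in a few lines of Python. This stdlib-only sketch substitutes a toy synonym map and disallowed-word list for the actual NLTK-based part-of-speech tagging and synonym grouping, and keeps all words rather than only nouns; the word lists below are illustrative assumptions, not the study's.

```python
from collections import Counter

# Hypothetical stand-ins: the actual pipeline uses a POS tagger [2] to keep
# only nouns, a 12-word disallowed list, and NLTK synonym grouping [1].
DISALLOWED = {"abstract", "art", "image", "picture"}
SYNONYMS = {"ship": "boat", "sails": "boat", "vessel": "boat"}  # toy synonym groups

def tokens_from_description(text):
    """Lowercase, keep alphabetic words, drop disallowed terms, merge synonyms.
    (The real pipeline keeps only nouns; this sketch keeps all words.)"""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    words = [w for w in words if w.isalpha() and w not in DISALLOWED]
    return [SYNONYMS.get(w, w) for w in words]

def response_histogram(descriptions):
    """Bag-of-words histogram over all descriptions of one image."""
    counts = Counter()
    for d in descriptions:
        counts.update(tokens_from_description(d))
    return counts
```

In the real pipeline, one histogram is built per image and per viewing duration, so each image yields two histograms.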
Given these histograms, we measure ambiguity with two numbers: H_0.5 is the Shannon entropy of the histogram for the 0.5-second viewing duration, and H_3 is the entropy of the 3-second histogram. We report entropy scores in units of bits.
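Concretely, the entropy computation over a response histogram might look like the following; the histograms here are made-up examples, not data from the study.

```python
import math
from collections import Counter

def entropy_bits(histogram):
    """Shannon entropy (in bits) of a token histogram, e.g. H_0.5 or H_3."""
    total = sum(histogram.values())
    probs = [c / total for c in histogram.values()]
    return -sum(p * math.log2(p) for p in probs)

# A determinate image: nearly all viewers say "dog" -> low entropy.
low = entropy_bits(Counter({"dog": 19, "wolf": 1}))
# An indeterminate image: responses spread evenly over four tokens -> 2 bits.
high = entropy_bits(Counter({"coat": 5, "knife": 5, "ship": 5, "building": 5}))
```

A uniform spread over four tokens gives exactly 2 bits, while a near-unanimous response gives close to 0 bits.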

CATEGORIZING AND RANKING AMBIGUITIES
Our key assumption is that the distribution of textual descriptions for a given image and a given time duration reflect the perceptual ambiguity that a single viewer has for that image. This could arise, for example, if, when forced to describe an image, a viewer samples a single possible explanation from their posterior probability distribution. Similar processes have been hypothesized elsewhere in neuroscience [8].
Hence, the histogram of responses acts as an estimate of a typical viewer's probability distribution over image interpretations, for a given viewing duration. We can observe several types of images. In a determinate image, (Figure 2(top row)), most viewers describe the image in the same way at both viewing durations; that is, both H 0.5 and H 3 are low. In an indeterminate image (Figure 2(bottom row)), descriptions are highly varied in both conditions. Figure 3 shows the images with the lowest and highest entropy across descriptions. As can be seen in the figure, this metric reflects the degree of recognizability of the images. Moreover, note that none of the "AbstractFlat" images appear here. Sometimes they have high entropy (e.g., Figure 7(g)), and sometimes they have lower entropy due to a peak of color descriptions like "pink image" (e.g., Figure 7(h)). This suggests that perhaps entropy alone can identify indeterminate images. An image which is too abstract does not conjure many associations, nor does an image which is very realistic. Indeterminacy seems to produce the longest and most varied descriptions. Figure 4 shows high-entropy images sorted by the entropy difference H 3 − H 0.5 under the condition of H 3 > 4. This threshold gives us images which generate more associations after 3s viewing. When sorted by the entropy difference, we see that more complex images tend to have greater decrease in entropy over time. The   complementary examples are shown in Figure 5, where the threshold is set to H 3 < 4. Here the images all seem to depict individual objects. Entropy appears to decrease the most where the object is odd but recognizable. Entropy appears to increase for images that seem to be complex variations on familiar objects. Figure 6 shows a scatterplot of the entropies of images across our dataset. 
As shown in the plot, the image categorization that we used to build the dataset emerges in some regions; for example, recognizable images generally have lower entropy scores (H_0.5, H_3 < 4). Dichotomous images, such as Figure 7(d,e), have simple explanations at first, but diverge as viewers find two or more interpretations. Figure 7 shows a sampling of cases with high entropy. We now discuss some of the phenomena that emerge. The first three images have high entropy (H_3 > 4.5), and entropy decreases over time (H_0.5 > H_3). In Figure 7(a), the most frequent terms in the 0.5s viewing condition are "coat", "knife", and "cloth", but in the 3s condition entirely new terms become prevalent, including "building", "ship", and "sails". In Figure 7(b), most descriptions in the 0.5s condition include "face", and even more do in the 3s condition, while several terms, like "bug", "helmet", and "monster", disappear. In Figure 7(c), "flowers" is rare in the initial condition, but becomes much more common after the longer viewing duration. Observe that entropy appears to reflect image ambiguity/recognizability, and very indeterminate images have the highest entropy.
In Figure 7(d,e,f), entropy increases: viewers first give consistent "first impression" descriptions, such as "buffalo", "dog", and "knife", but the descriptions become considerably more varied and divergent once viewers have spent more time viewing the images.
The final two images could be seen as purely abstract art, yet viewers occasionally perceive objects within them, though less consistently. Most tokens appear only once, as indicated by the large "[other]" category.

DISCUSSION
Our work is the first attempt to go beyond viewers' high-level impressions of ambiguity and instead to uncover and quantify how ambiguous an image is across a population of viewers. It is partly motivated by a desire to investigate whether modern computational methods can be used to address longstanding questions about subjective responses to images, particularly in light of the claims made by artists, such as Picasso, about what makes certain images aesthetically valuable.
As we show, the textual response histograms generated by our method robustly capture image properties like indeterminacy, measuring shades of ambiguity which previous studies were insensitive to.
Our preliminary study is intended to illustrate the potential value of this approach for measuring the subtle and highly subjective phenomenon of visual ambiguity. There are several limitations and directions for future development. First, we have yet to validate the indeterminacy hypothesis suggested by the art-theoretical literature: that there is a positive correlation between image ambiguity and aesthetic appreciation. The tools and methods developed here provide a useful starting point from which to rigorously investigate this claim in future.

[Figure caption, displaced in the text: images with low ∆H tend to be recognizable, eventually, while large ∆H images tend to be variations on familiar concepts.]
Second, on the basis of previous literature, we expected to find an increase in entropy over time for images with high levels of indeterminacy. We did not see this specific effect in our results (Figures 3-5), but we did find that the change in entropy provides some differentiation between types of ambiguity. In fact, our results suggest that entropy alone can measure indeterminacy: if one wishes to follow Picasso's goal of producing the most associations, an image should not be realistic, but it should also not be too abstract.
Third, our text processing procedure is very simple. As a result, it discards some important information in the textual descriptions; e.g., it cannot distinguish between an image having diverse descriptions because the image is confusing ("it's either a horse or a chair"), because it is dichotomous ("it's a chair made to look like a horse"), because it is complex ("it's a horse next to a chair"), or because the description is verbose ("it's a chair sitting on the ground"). Our heuristic list of disallowed words could be replaced with more nuanced filtering. It is also unclear how best to take advantage of semantic similarity between descriptions that use distinct but related words. Our text processing method ignored phrases that indicated difficulty or inability to respond, such as "I'm not sure but..." or "I have no idea", which occurred frequently for images in our indeterminate categories; the frequency of such phrases could be used to further inform the analysis. More sophisticated text processing could yield more reliable and finer-grained insights.
Another important future step is to relate entropies to aesthetic properties. If semantic diversity and uncertainty are regarded as positive aesthetic attributes in artworks, as the art historical literature suggests, then we might expect to find a correlation between these qualities and entropy. Indeed, previous researchers have found such a positive effect [19]. As a preliminary test, we asked a separate sample of 128 crowdworkers to rate the images according to their level of "interestingness", "powerfulness", and "engagement"; however, we found that crowdworkers gave the highest scores to realistic, easily-interpretable images. There may be many reasons for this finding that do not necessarily invalidate the main hypothesis of this study, including viewer expectations for which images constitute art images versus non-art images, the framing and setup of the task, and the expertise of the raters, each of which may need to be controlled for in future experiments. The exposure times used in the present study were relatively short, particularly in the context of art viewing, where a museum-goer might typically spend at least several tens of seconds studying an artwork in order to appreciate its nuances, and may return to look at it more than once [6]. Our crowdworkers often commented that they had too little time to complete their descriptions. Exposures on the order of tens of seconds or minutes may yield more complex and nuanced responses.
We thus far have only studied ambiguity of object recognition, whereas other image properties like figure-ground segmentation may also be ambiguous. However, our methodology suggests a general approach for probing perceptual uncertainty that could be combined with methods for crowdsourcing other image properties [13,20,32], to obtain a rich analysis of image ambiguity.
A suitably rich model of image ambiguity can open up exciting future avenues for image synthesis applications. For instance, such a model may be able to guide GANs and related image synthesis pipelines towards more interesting and unexpected creations. It may be used to automatically curate or filter image streams, or as an alternative metric of diversity to score different image synthesis techniques.
Finally, the methods being developed here could provide a potentially powerful computational framework for studying topics of interest to psychologists and neuroscientists, such as image perception, object recognition, and associative memory.