Most of us would recognise this as the Apple logo with the "bite" replaced with the profile of Steve Jobs.
Let's see how clarifai - a convolutional-neural-network-based image recognition (deep learning) service - does on this task.
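For readers who want to try this themselves, here is a minimal sketch of how one might query such a service. It assumes the Clarifai Python client of that era and a valid API key; the image URL is a placeholder, not the actual test image.

```python
# A sketch using Clarifai's general image-recognition model.
# Assumes the legacy Python client (pip install clarifai==2.*) and a
# valid API key; the URL below is a placeholder.
from clarifai.rest import ClarifaiApp

app = ClarifaiApp(api_key="YOUR_API_KEY")
model = app.public_models.general_model

response = model.predict_by_url(url="https://example.com/jobs-apple.png")
for concept in response["outputs"][0]["data"]["concepts"]:
    print(concept["name"], round(concept["value"], 3))
```

The model returns a ranked list of concept labels with confidence scores, which is exactly the kind of output we examine below.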
Perhaps surprisingly, not very well.
One simple reason this network failed might be that the training set did not include the Apple logo. Maybe a better-trained network would have recognised it.
But even if a network did recognise the image as the Apple logo, this would not be the correct answer. The image is NOT the Apple logo.
Could any network correctly recognise the image? The profile is probably not anatomically correct, and almost certainly not distinctive enough to be recognised as Steve Jobs on its own: if it were presented without the surrounding apple, it would probably be very difficult to identify. I doubt that any network could make the correct identification.
So why is it so easy for humans to recognise? Because humans bring cognitive inference to bear on top of visual features when identifying what an object is. We initially recognise the image as the Apple logo because of its visual features, but then notice what looks like a face inside it. It is not a picture of a face but a shadow, a caricature highlighting the salient features of what we know to be the late Steve Jobs. The caricature's most salient feature is the little round glasses - a feature which has been used to comic effect elsewhere, as in this picture of Bruce Lee "being" Jobs.
It is a pretty impressive feat to see those little lines in front of the silhouetted face and interpret them as Steve Jobs with his round glasses.
This example illustrates that human perception and classification of images does not rely just on visual features. There is a complex relationship between visual features and cognitive interpretive processes. What something looks like doesn't necessarily tell us what it is. The Steve Jobs Apple shows how little accurate visual input we sometimes need to recognise objects. Visual features are a clue to what an object is, but don't define our concept of it.
The psychologists Susan Gelman, Ellen Markman and John Coley studied adults and young children (all the way down to the age of 2) to see how much weight the perceptual features of objects carry in people's understanding of what an object is. (We will only talk about the adult version of the study, but the results were similar in children.) In one study they used picture triads in which two objects from one category looked dissimilar, while a third object from a different category looked similar to one of the first two. For example, the image below shows two dissimilar-looking birds (Flamingo and Blackbird) and a Bat (top right):
Underneath the flamingo was written: “This bird’s heart has a right aortic arch only.” Underneath the bat was written: “This bat’s heart has a left aortic arch only.” Underneath the blackbird was written: “What does this bird’s heart have?”
Overwhelmingly, the students said that the Blackbird's heart had a right aortic arch, just like the heart of the dissimilar-looking Flamingo. Perceptual similarity was ignored in the inference.
But it was not the case that people blindly followed the textual description all the time. In another condition the experimenter asked for judgements about an irrelevant attribute introduced into the image. For example, the experimenter could place a blue dot underneath the Flamingo and a red dot underneath the Bat and ask what colour dot would go below the Blackbird. The students answered this in a random fashion. Finally, the students were asked about a feature that was likely to correlate with perceptual attributes, such as how heavy the object would be. In this condition the participants said that the Bat and the Blackbird would be of similar weight, as predicted by visual similarity.
What these studies show is that people (even 2-year-old children) understand that perceptual similarity is only a clue to what objects are. When the perceptual input is contradicted by explicit and sensible information about what an object is, people readily abandon their perceptual biases and draw inferences based on conceptual knowledge. When that knowledge tells them to use the visual properties (as in weight estimation) they readily return to them, but when the perceptual features are irrelevant or misleading (as with deep biological properties) they simply ignore them.
This complex relationship between visual features and classification can result in vast individual differences when humans are asked to label images. For example, a simple picture of a cat can be labeled as an animal or as a pet. Neither answer is more correct than the other. Being an animal and being a pet are not two different things, and they certainly can't be differentiated by visual features. There is a complicated conceptual relationship between pet and animal which a visual-feature-based neural network does not, and cannot be expected to, understand.

Most researchers know this, and as a result they carefully prepare training sets to avoid overly tricky situations. For example, in a recent paper which presents one of the most successful applications for answering questions about the content of an image, the researchers summarise their training procedure like this: "To make the labeled question-answer pairs diversified, the annotators are free to give any type of questions, as long as these questions are related to the content of the image. The question should be answered by the visual content and common sense (e.g., we are not expecting to get questions such as “What is the name of the person in the image?”). We only select a small number of annotators (195 individuals) whose annotations are satisfactory (i.e. the questions are related to the content of the image and the answers are correct). We pick a set of good and bad examples of the annotated question-answer pairs from the quality monitoring dataset, and give them to the selected annotators as references. We also provide reasons for selecting these examples. After the annotation of all the images is finished, we further refine the dataset and remove a small portion of the images with badly labeled questions and answers." In other words, the training set is carefully curated to ensure success. But this leaves out the richness of human interaction with the world, and creates an artificial distinction between "good" and "bad" annotation.
How can the Semantic Symbiosis view help? Consider again the example of deciding whether a picture of a cat should be labeled as cat or pet. We have already noted that there are no visual features of the animal itself to distinguish between the two. So a neural network will return the interpretation most highly associated with those visual features in the training set. Probably a cat. (Unless it's Google Photos, in which case it might be a dog!)
But it might be possible to infer additional interpretations by making an educated guess. What kinds of visual clues could lead us to conclude that there is a pet in the picture? Is there a human close by? Is the human smiling? Is there contact between the human and the animal? Is the animal in a house? If the answer to some of these questions is "yes", we could infer that the animal in the picture is also a pet. The computer can't ask these questions on its own, because it does not have rich semantics about concepts. But it does have an excellent ability to locate visual features within pictures, and those features could answer the questions. Semantic Symbiosis is about getting those questions into the algorithm, either by encoding them into some sort of knowledge base, or by allowing the machine to ask for help when this is not possible. Either way, human semantics helps the computer, and the computer's sheer power to ingest and generalise over volumes of data helps the human. Semantic Symbiosis.
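To make this concrete, here is a minimal sketch of what such a rule layer might look like. Everything in it is a hypothetical illustration: the detector output is invented, and the clue lists stand in for a human-supplied knowledge base.

```python
# A sketch of the Semantic Symbiosis idea: hand-encoded semantic rules sit
# on top of a visual detector's output and draw conceptual inferences the
# detector itself cannot make. All names and rules here are hypothetical.

# Hypothetical detector output: concept -> confidence score.
detections = {
    "cat": 0.97,
    "person": 0.91,
    "smile": 0.80,
    "indoors": 0.88,
    "physical contact": 0.75,
}

# Human-supplied knowledge: visual clues that, together with an animal,
# suggest the conceptual label "pet" (which no visual feature defines).
PET_CLUES = {"person", "smile", "indoors", "physical contact"}
ANIMALS = {"cat", "dog", "rabbit", "hamster"}

def infer_pet(detections, threshold=0.7, min_clues=2):
    """Infer 'pet' when an animal co-occurs with enough human/home clues."""
    present = {c for c, conf in detections.items() if conf >= threshold}
    has_animal = bool(present & ANIMALS)
    clue_count = len(present & PET_CLUES)
    if has_animal and clue_count >= min_clues:
        return "pet"
    if has_animal:
        return "animal"   # fall back to the purely visual label
    return None           # defer to a human: the symbiosis step

print(infer_pet(detections))  # -> "pet"
```

The point is not these particular rules but the division of labour: the detector supplies perceptual evidence, the human-encoded rules supply the conceptual inference, and when neither suffices the machine asks a human.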