Wednesday, December 22, 2004

What interior designers (and penguins) can teach us about robotic perception

Before the main topic, I'd like to note that Carlos at Neogentronyx has made significant progress on his outrageous Mecha project, the NMX04-1A. Visit the site for pictures of this one-person project (in the snows of Alaska) that is the most serious effort in the world to build a really large humanoid exoskeleton. Carlos is developing his first Mecha as a demonstration of the technology, with an eye to developing a startup company devoted to creating large exoskeletons. The spirit is the same as the late 1970s personal computer geeks - but this time it's robots.

If Neogentronyx can get its 18-foot mecha working, it will join the ranks of the machines from Servo Magazine's Tetsujin Challenge, which competed in 2004 at Robonexus in San Jose, CA. The chances seem good. While the hardware to create an exoskeleton has existed for at least a century, only recently have low-cost computers plus an understanding of biped movement combined to make them practical. I expect that in 2015 we'll see seniors and paraplegics charging around in these "power suits" - instead of the "special" person in the wheelchair we look down on, we'll be looking up to them in their mechanical might and grandeur. Imagine grandma stomping on board the bus in one of these!

Exoskeletons get along because they handle the motor side of movement, but allow a human to do the sensing. And sensing has been one of the biggest issues in robotics. In the past, many robot developers apparently thought that the best way to handle sensation was to use a single sensor, and compute the hell out of whatever it detected in the environment. This led to machines which can only function in laboratory settings. True mobile robots need more sensors, and one might even define a "Moore's Law" for robots - their capability doubles every time the number of sensors doubles. Sensation, rather than computational power, is key.

But even with good senses, it is difficult for robots to move in a human environment, a natural goal for robots that jump. To date, machine vision has been the primary means used to extract features from the environment and determine how the robot will interact with them. I'm going to first tell you about a book you'll need to find them:

Home by Design: Transforming Your House into Home, Sarah Susanka, The Taunton Press, Newtown, CT (2004), ISBN: 1-56158-618-8.

Why am I recommending that robot vision designers read an "artsy" interior design book? Read on...

Just what features to extract from a visual scene is problematic for robots. Some researchers, e.g., Hans Moravec's Seegrid company, try to work from physics-oriented "first principles." The Seegrid vision system fuses input from cameras into a 3-dimensional "evidence grid" - scoring the likelihood of something solid existing at any particular point in 3D space. The evidence grid can then be used as a starting point for identifying objects. Other systems have used a "2 1/2-dimensional" system - laser rangefinders sweeping across the ground, pulling a series of 2D slices. Yet other systems look for purely 2-dimensional patterns in the visual environment - face recognition systems fall into this category.
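
To make the "evidence grid" idea concrete, here is a minimal sketch in Python. It is not Seegrid's actual method, which isn't described here - it just uses the common log-odds way of accumulating evidence that a cell of 3D space is solid. The class name, the cell size, and the two sensor-model constants are all assumptions made up for illustration.

```python
import math
from collections import defaultdict

# Hypothetical sensor model (assumed values, not taken from any real system).
P_HIT = 0.7   # belief assigned to a cell when a reading says "solid here"
P_MISS = 0.4  # belief assigned to a cell when a reading says "empty here"


def log_odds(p):
    """Convert a probability into log-odds so evidence can simply be added."""
    return math.log(p / (1.0 - p))


class EvidenceGrid:
    """Sparse 3D grid; each cell stores accumulated log-odds of being solid."""

    def __init__(self, cell_size=0.1):
        self.cell_size = cell_size
        # Unseen cells default to log-odds 0.0, i.e. probability 0.5 (unknown).
        self.cells = defaultdict(float)

    def _index(self, x, y, z):
        s = self.cell_size
        return (math.floor(x / s), math.floor(y / s), math.floor(z / s))

    def observe(self, x, y, z, solid):
        """Fold one reading about the point (x, y, z) into the grid."""
        self.cells[self._index(x, y, z)] += log_odds(P_HIT if solid else P_MISS)

    def probability(self, x, y, z):
        """Current likelihood that something solid occupies this point."""
        value = self.cells.get(self._index(x, y, z), 0.0)
        return 1.0 - 1.0 / (1.0 + math.exp(value))


grid = EvidenceGrid()
for _ in range(3):
    grid.observe(0.05, 0.05, 0.05, solid=True)   # three cameras agree
print(round(grid.probability(0.05, 0.05, 0.05), 2))
```

Each extra reading that agrees pushes the cell further from "unknown," which is the sense in which the grid "scores the likelihood" of something solid being there.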

The challenge is getting fast recognition of all the objects in a human environment. People instantly recognize things like lamps and chairs, but robots are always slow - it is very difficult to identify physical objects purely from shape in 2D and 2 1/2 D, and the value of 3D for identifying objects is yet to be proven. Making these identifications is going to be necessary for any future robots making critical decisions. Humans can easily see a kid about to dart out into the street, even if only part of his body is showing between two parked cars. Current robotic vision systems have trouble here - if you tune them to pick up the kid, they pick up numerous false positives. The same might be said for, say, a napkin lying on the floor beside a table - we see it instantly for what it is, but a robot won't. What are people using that the robots aren't?

My suggestion here is that, in our desire to build a "first principles" vision system we have ignored the obvious cues in human-centric visual spaces. That is, humans have evolved to see certain patterns easily in nature. Furthermore, and more important, they pattern their artificial environments with cues designed to set off their "feature detectors" to the max. Rather than trying to reason from an "atomic" model of vision, robots can be designed to use these cues.

The first examples I give are from biology. Biologists have long puzzled over the apparent "beauty" of plants and animals. Take sexual displays, for example. While it is essential that males and females look different in sexual displays, there is no reason for symmetry, nice colors, elegant lines, etc. Any random pattern of body coloring should do. But without exception, animals show these "beautiful" properties. Penguins have neatly defined regions of black and white, a smooth, elegant series of curves to their head, beak, and body, and lines of bright color, literally outlining parts of their body. What are these for?

I suggest that our perception of "beauty" is actually closer to "minimum energy detection." Penguins have a particular body shape and pattern which is easily recognized by the feature detectors in the brain. Furthermore, these feature detectors have co-evolved with the body shape, so that the bird's brain uses less energy to detect particularly "perfect" or "beautiful" specimens of its kind. The patterns, lines, and body curves are like outlines drawn in chalk around the important features of the animal's form. Minimum energy is important since brain tissue is the most energy-intensive part of an animal, and less energy spent on perception means more spent on reproduction.

When given the option, people do the same thing. While there are many reasons for hairstyles and makeup, one is to augment and accentuate body appearance. Compared to un-modified skin, makeup smooths and increases contrast of major facial features. It literally makes it easier to look at the face - with interfering features smoothed out and lines drawn on/about important ones, there is less "noise" from non-face feature detectors and the face detection is accomplished with a minimum of energy.

Another example is found with the lighting systems used in film and television. Movies and TV both display a 2-dimensional image of a 3-D world. Both media have lower contrast than the real world (200-500:1 versus 10,000:1), so there is less information available to reconstruct this 3D world. To help people see the 3D, standard 3-point lighting is used. A primary light illuminates the subject from the upper left, and a second light fills in the shadows. This helps to scale the light/dark to the contrast of the media. But the real surprise is the back light, whose purpose is simply to put a halo around the subject. Back-lighting literally draws a white line around the subject, making boundaries clear with a completely artificial boundary. This helps to separate people in a film from their background.

What's important to remember is that the lighting in film has no counterpart in the real world - it is completely, utterly, unnatural. But people accept the unnatural lighting methods because they allow them to process the scene quickly, with minimum energy use by their brain. In fact, the energy used to process a film scene is probably lower than for the real world, which helps to contribute to the "dreamlike" quality that film has relative to, say, a security camera.

One of the best places to see deliberate line-drawing to aid perception is in design. Good design involves (among other things) creating pointers and lines on the design object which make it easily processed by human visual centers. A good example is "whitewall" tires. Normally, a dark tire sits in a dark wheelwell on a car. Adding a whitewall helps to define the position and orientation of the car. Car-owners may perceive whitewalls as "better looking", which in all probability means that it takes less energy to "take in" their car's visual appearance.

For this reason, roboticists should study interior design. A robot designed to move around in a human environment (meaning an environment created by humans, like a home) doesn't have to invoke first principles to recognize a lamp - instead it needs to respond to the same features that humans use. I don't mean any old features - just those put there by artists to make the object "artistic" or "styled."

Currently, a machine vision programmer would probably think of whitewalls on a tire, tassels on a lamp, or lacy ironwork on a cabinet as one more impediment making it harder for a robot to perceive the object. On this view, tassels are something that has to be "removed" by processing before the lamp can be recognized. This is flat-out wrong.

Instead, roboticists should see the tassels for what they are - cues introduced by the artist/designer deliberately to make it easier for the human vision system to "process" the lamp with a minimum of energy. An advanced robotic system would have an easier time (less cpu time) recognizing the tasseled lamp, as opposed to a "functional" lamp without tassels.

Thinking this way changes one's idea about object recognition radically. It is often thought that scene recognition is difficult because people use so many materials, colors, textures, etc. in building artificial environments like homes. This is the opposite of the truth. Instead, the particular colors, patterns, building materials, etc. are used because they increase the ability to move smoothly in a home environment without crashing into walls.

Which brings us to the Susanka interior design book. Her analysis is notable in presenting a theory of design based indirectly on biology, and divides the process of making a house "homey" (meaning I can relax there and my neurons can run at lower average processing) into several categories: Space, Light, and Order. These in turn are divided into sub-principles, e.g. for Light, "light to walk toward", "light intensity variation", "reflecting surfaces", "visual weight", and more.

For example, Susanka explains the common practice of placing a window, or lighted object, at the end of a long hall. The reason, according to the author, is that people instinctively shrink away from a corridor which becomes darker as you go in deeper - a cave bear might be waiting in there! So good interior designers create numerous windows or lighted alcoves at the ends of halls. This is an unconscious cue to the person walking through the home that things are well-lit and "open" at the end of each hall. A well-designed house will have halls that "invite" people to walk through them. Put into our terms, the pattern of a tunnel with light at the end is probably coded by a basic human visual feature detector.

How could a roboticist use this information? Imagine a mobile robot is about to enter a hallway. It can, of course, compute the size and orientation of the walls and ceiling. But it could also recognize the "light at the end of the tunnel" interior design pattern with a feature detector mimicking the human response, and proceed down the hall.
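
As a rough illustration of what such a feature detector might look like, here is a toy Python sketch. It is only a guess at the idea, not a tested vision algorithm: it treats the camera frame as a small grid of grayscale values (0-255), takes the middle third of the frame as "the end of the hall," and compares its brightness with the surrounding walls. The function names and the 1.5 threshold are hypothetical.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def light_at_end_score(image):
    """Ratio of center brightness to surround brightness.

    `image` is a list of rows of grayscale values and is assumed to be
    at least 3x3, so that the middle third is never empty.
    """
    rows, cols = len(image), len(image[0])
    r0, r1 = rows // 3, rows - rows // 3
    c0, c1 = cols // 3, cols - cols // 3
    center, surround = [], []
    for r in range(rows):
        for c in range(cols):
            inside = r0 <= r < r1 and c0 <= c < c1
            (center if inside else surround).append(image[r][c])
    # The +1.0 avoids dividing by zero when the walls are pitch black.
    return mean(center) / (mean(surround) + 1.0)


def invites_entry(image, threshold=1.5):
    """True when the far end of the hall is clearly brighter than the walls."""
    return light_at_end_score(image) >= threshold


# A 6x6 "hallway": dim walls (20) with a lit window (200) in the middle.
lit_hall = [[200 if 2 <= r < 4 and 2 <= c < 4 else 20 for c in range(6)]
            for r in range(6)]
# The same hall with a dark far end: the cave-bear case.
dark_hall = [[5 if 2 <= r < 4 and 2 <= c < 4 else 20 for c in range(6)]
             for r in range(6)]

print(invites_entry(lit_hall), invites_entry(dark_hall))
```

The point is that this check is cheap compared with reconstructing the hall's geometry, and it fires on exactly the cue the designer put there.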

The presence of reflecting surfaces along the hall (another interior design feature) shouldn't be a source of additional complex processing - in fact, they should simplify processing, if the reflecting surfaces are in the right place.

In a similar fashion, when looking at a tasseled lampshade, a robot could use the presence of tassels to reinforce the "lamplike" pattern - seeing the tassels as the equivalent of serifs on a font style.
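
One way to picture "reinforcement" is as extra evidence added on top of a shape-based guess. The Python sketch below assumes a shape recognizer has already produced a probability that the object is a lamp, and lets each detected designer cue raise the log-odds. The cue names and their weights are invented for illustration; a real system would have to learn them.

```python
import math

# Hypothetical weights: how much each designer cue raises the log-odds of "lamp".
CUE_WEIGHTS = {"tassels": 1.2, "pleated_shade": 0.8, "finial": 0.5}


def lamp_confidence(shape_probability, detected_cues):
    """Combine a shape-only estimate (strictly between 0 and 1) with cues."""
    log_odds = math.log(shape_probability / (1.0 - shape_probability))
    for cue in detected_cues:
        # Cues we have no weight for neither help nor hurt.
        log_odds += CUE_WEIGHTS.get(cue, 0.0)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Shape alone leaves the robot undecided; the tassels tip the balance.
print(lamp_confidence(0.5, []))
print(lamp_confidence(0.5, ["tassels"]) > 0.75)
```

Here the tassels are never "removed" before recognition; they are counted as evidence for it.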

Another example is the frequent "dropped ceilings" used in home design. Most machine vision programmers probably wish that home designers would pick one standard ceiling height and keep it. But designers change ceiling height, and frequently drop the ceiling moving from one room to another. This is not done for the sake of complexity. Instead, the dropped ceiling, with its greater complexity, actually makes it easier to see the room.

A robot using such "interior design" derived feature detectors would act a lot more like humans. The sterile "box world" often used by machine vision programmers would be more unpleasant for the robot (meaning that it would actually take more cpu processing at the high end) than a cluttered Victorian room filled with tables and chairs groaning under the weight of a riot of ceramic figures.

One might object to this scheme by noting that robots would have a harder time moving around poorly designed homes. Exactly. The hallmark of the far superior human vision system is that it, too, has more problems with poorly designed homes.

Face it. Our brains are evolved to compute as little as possible, since computing wastes energy that can be used elsewhere. If it were simpler to analyze a "toy world" filled with geometric shapes we would build our homes that way. We do not make our home structure complex because we have brains complex enough to process the scene - this would be perceived as "uncomfortable" or "ugly" since we would be wasting glucose firing neurons. Instead, our home structure is complex in order to make it easier to navigate through it.

To repeat: the reason we find a big, boxy collection of buildings in an industrial park less beautiful than a Gothic church is because it is actually easier for us to look at the church - we use less neural energy. The boxy buildings, so close to the block world of classic computer vision, are literally harder for us to look at.

Now, some of you may be saying that it takes way more processing to look at a Victorian parlor than an empty room. This is true, but only at the low level. At the first level of visual processing (contrast enhancement, line segment extraction, color boundary extraction) there is indeed more computation in the parlor. But the higher levels, which assemble the low-level patterns, don't work as hard. In contrast, a block world has less computation at the low levels of processing, but - I contend - more work necessary at higher levels of visual abstraction. This feature explains why block worlds have been so popular. Standard von Neumann computers have to chug every bit of visual data through the cpu, so processing goes up rapidly if complexity is increased. But recent developments in the EU and at the University of Pennsylvania with partly analog "vision chips" will soon remove this constraint on low-level scene processing - it will be near instantaneous. The challenge will be to do high-level processing on the elementary patterns detected in this way. One could suggest that it will involve some aspects of fractals. The reason for this is once again seen in interior design. Check the book mentioned above - you'll see numerous examples of motifs which repeat at different scales - a hallmark of fractal patterning.

This is a radical idea. It supposes that machine vision and applied art are aiming at similar goals - the long-sought fusion of art and science. It could be that anyone trying to create a domestic robot needs to have interior designers on the case - and that those building software to recognize their artistic conventions will fare better than those working from "physics" first principles.

