Computers are learning to see the world like we do


It is surprisingly difficult to build computers that can recognise the many different objects we see every day, but they are getting better all the time


WHAT animal is in the picture above? Glance at the page, wheels in the brain spin: yeah, that's a bird. The response comes so fast that you barely notice the processing behind it.


If only machines found it that easy. Object recognition is surprisingly tricky for computers. The webcomic xkcd recently poked fun at the problem, bemoaning how arduous it would be to build a system that could determine whether a photo was taken in a national park and contained a bird.


Artist Randall Munroe had thrown down the gauntlet: last week, image host Flickr launched Park or Bird, a website designed to solve the exact problem in the comic. Just drag a photograph into the page, and it will make an educated guess. Gerry Pesavento, senior director of product management at Yahoo in San Francisco, which owns Flickr, says the site doubles as a playful introduction to a genuine problem.


"It's showing that image intelligence is happening very quickly," he says.


Park or Bird wasn't the result of a sudden whim. For the past year, Flickr has been training neural networks to figure out whether a given picture contains one of 1000 different objects, from a cat to a sunset. If Flickr can nail this problem, it will vastly improve the search function for its billions of photos, the firm says – letting people find shots of any item even if the photographer hasn't bothered to tag it.
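To give a flavour of what "classifying a picture into one of many categories" means in code, here is a toy sketch in Python. It is emphatically not Flickr's system – the real networks are deep and trained on millions of labelled photos – just a minimal softmax classifier over a tiny made-up "image", with randomly initialised weights standing in for learned ones; the category count and image size are illustrative assumptions.

```python
import numpy as np

# Toy multi-class image classifier, in the spirit of the 1000-category
# networks described in the article. NOT Flickr's actual system: weights
# here are random, whereas a real network learns them from labelled photos.

rng = np.random.default_rng(0)

NUM_CLASSES = 5        # stand-in for the article's 1000 categories
IMAGE_PIXELS = 8 * 8   # a tiny 8x8 greyscale "image"

weights = rng.normal(scale=0.1, size=(IMAGE_PIXELS, NUM_CLASSES))
bias = np.zeros(NUM_CLASSES)

def softmax(scores):
    """Turn raw per-category scores into probabilities that sum to 1."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def classify(image):
    """Return one probability per category (e.g. cat, sunset, ...)."""
    scores = image.flatten() @ weights + bias
    return softmax(scores)

probs = classify(rng.random((8, 8)))
predicted_category = int(probs.argmax())
```

The output is a probability for every category; the system's "guess" is simply the category with the highest probability, and the error rates quoted below measure how often that guess is wrong.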


The promise of object recognition extends far beyond better photo search. Autonomous cars and blind people alike stand to benefit greatly from a system that can quickly and accurately identify people and street signs.


"The sort of algorithms that we're using at Flickr right now are exactly the sort of algorithms that are going to be helping robots see and navigate visually," says Yahoo's Simon Osindero.


Google made waves two years ago with the announcement that it had trained neural networks to spot cats in YouTube videos. Chinese search giant Baidu is also in on the game, offering a translation app that tries to provide the right English word for whatever you have taken a picture of. Amazon's Fire phone comes with a feature that recognises the front covers of books or CDs in the real world and directs you to the relevant shopping website.


Olga Russakovsky at Stanford University in California co-organises the ImageNet Large Scale Visual Recognition Challenge, an annual competition with 1000 categories of objects to identify. In the four years since the challenge began, the quality of the entries has improved remarkably quickly. The winner in 2010 made mistakes 28.2 per cent of the time. Last year's Google-led winning program had an error rate of only 6.7 per cent – just a smidge behind an actual human annotator.


Still, certain objects consistently trip computers up. Competitors in the ImageNet challenge struggle to differentiate between small, slender hand tools, like spoons and screwdrivers. Things with a metallic or reflective surface can also be hard to identify. Even so, Russakovsky is confident that we are only a few years away from a machine that can parse the world as well as a human.


"I don't think there's anything fundamental stopping us," she says.


This article appeared in print under the headline "It's a bird, right?"


Issue 2993 of New Scientist magazine





