DeepMind’s ‘Gato’ is mediocre, so why did they build it? | ZDNet


DeepMind’s “Gato” neural network excels at many tasks, including controlling robot arms that stack blocks, playing Atari 2600 games, and captioning images.


The world is used to seeing headlines about the latest breakthrough by deep learning forms of artificial intelligence. The latest achievement of the DeepMind division of Google, however, might be summed up as “an AI program that does a so-so job at a lot of things.”

Gato, as DeepMind’s program is called, was revealed this week as a so-called multimodal program, one that can play video games, chat, write compositions, caption images, and control a robot arm that stacks blocks. It is one neural network that can work with multiple kinds of data to perform multiple kinds of tasks.

“With a single set of weights, Gato can engage in dialogue, caption images, stack blocks with a real robot arm, outperform humans at playing Atari games, navigate in simulated 3D environments, follow instructions, and more,” writes lead author Scott Reed and colleagues in their paper, “A Generalist Agent,” posted on the arXiv preprint server.

DeepMind co-founder Demis Hassabis cheered the team on, exclaiming in a tweet, “Our most general agent to date!! Fantastic work from the team!”

Also: A new experiment: Does AI really know cats or dogs – or something?

The only catch is that Gato is actually not that good at several tasks.

On the one hand, the program is able to do better than a dedicated machine learning program at controlling a robotic Sawyer arm that stacks blocks. On the other hand, it produces captions for images that in many cases are quite poor. Its ability at standard chat dialogue with a human interlocutor is similarly mediocre, sometimes eliciting contradictory and nonsensical utterances.

And its play of Atari 2600 video games falls below that of most dedicated ML programs designed to compete in the benchmark Arcade Learning Environment.

Why would you build a program that does some things pretty well and a bunch of other things not so well? Because of precedent and expectation, according to the authors.

There are precedents for more general types of programs becoming state-of-the-art in AI, and there is an expectation that increasing amounts of computing power in the future will compensate for shortcomings.

There is a tendency for general approaches to eventually triumph in AI. As the authors note, citing AI scholar Richard Sutton, “Historically, generic models that are better at leveraging computation have also tended to overtake more specialized domain-specific approaches eventually.”

As Sutton wrote in his own blog post, “The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.”

Stated as a formal hypothesis, Reed and team write that “we test the hypothesis that training an agent which is generally capable on a large number of tasks is possible; and that this general agent can be adapted with little extra data to succeed at an even larger number of tasks.”

Also: Meta’s AI luminary LeCun explores deep learning’s energy limits

The model, in this case, is indeed very general. It is a version of the Transformer, the dominant kind of attention-based model that has become the basis of numerous programs including GPT-3. A Transformer models the probability of some element given the elements that surround it, such as a word given the words around it in a sentence.
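That modeling target, the conditional probability of the next token given its context, can be illustrated with a minimal sketch. The toy corpus below is invented, and counting bigrams is a crude one-token-context stand-in for a Transformer's learned attention over all preceding tokens, but the quantity being estimated is the same:

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; a real Transformer learns these conditional
# probabilities with attention layers rather than counting, but the
# modeling target is the same: P(next token | preceding tokens).
corpus = "the robot stacks the block the robot lifts the block".split()

# Count bigrams to estimate P(w_t | w_{t-1}).
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_probs(prev):
    """Return the estimated distribution over tokens following `prev`."""
    c = counts[prev]
    total = sum(c.values())
    return {tok: n / total for tok, n in c.items()}

print(next_token_probs("the"))  # "the" is followed by "robot" or "block"
```

Sampling repeatedly from such a distribution, feeding each choice back in as context, is what generates text, and, in Gato's case, actions.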

In the case of Gato, the DeepMind researchers apply the same conditional-probability modeling to many types of data.

As Reed and colleagues describe the task of training Gato,

During the training phase of Gato, data from different tasks and modalities are serialized into a flat sequence of tokens, batched, and processed by a transformer neural network similar to a large language model. The loss is masked so that Gato only predicts action and text targets.

Gato, in other words, doesn’t treat tokens differently depending on whether they are words in a chat or movement vectors in a block-stacking exercise. It’s all the same.
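The serialize-then-mask recipe the paper describes can be sketched in a few lines. Everything here is hypothetical, the token IDs, modality tags, and the `serialize` helper are invented for illustration, but it captures the stated idea: all modalities land in one flat token stream, and a mask selects only text and action positions for the loss:

```python
# Hypothetical sketch of Gato-style serialization: episodes from different
# modalities are flattened into one token stream, and a loss mask marks
# which positions are trained on (text and action tokens only, per the
# paper's description). Token IDs and modality names are invented.
def serialize(episode):
    """episode: list of (modality, token_id) pairs -> (tokens, loss_mask)."""
    tokens, mask = [], []
    for modality, token_id in episode:
        tokens.append(token_id)
        # Observations (images, proprioception) are context only;
        # predictions are trained only for text and action tokens.
        mask.append(1 if modality in ("text", "action") else 0)
    return tokens, mask

episode = [
    ("image", 9001), ("proprio", 512),  # observation tokens: context only
    ("text", 17), ("action", 3),        # target tokens: included in loss
]
tokens, mask = serialize(episode)
print(tokens, mask)
```

In training, the masked positions would simply be zeroed out of the cross-entropy sum, so the network conditions on observations without being asked to reproduce them.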


Gato training scenario.

Reed et al. 2022

Buried in Reed and team’s hypothesis is a corollary, namely that ever greater computing power will win out, eventually. Right now, Gato is limited by the response time of the Sawyer robot arm that does the block stacking. At 1.18 billion network parameters, Gato is much smaller than very large AI models such as GPT-3. As deep learning models get bigger, inference incurs latency that can cause failures in the non-deterministic world of a real-world robot.

But Reed and colleagues expect that limit to be surpassed as AI hardware gets faster at processing.

“We focus our training at the operating point of model scale that allows real-time control of real robots, currently around 1.2B parameters in the case of Gato,” they write. “As hardware and model architectures improve, this operating point will naturally increase the feasible model size, pushing generalist models higher up the scaling-law curve.”

Therefore, Gato is truly a model for how computational scale will continue to be the main vector for the development of machine learning, by making general models larger and larger. Bigger is better, in other words.


Gato gets better as the size of the neural network in parameters increases.

Reed et al. 2022

And the authors have some evidence for this. Gato does seem to get better as it gets bigger. They compare average scores across all the benchmark tasks for three sizes of model: 79 million parameters, 364 million, and the main model at 1.18 billion. “We can see that for an equivalent token count, there is a significant performance improvement with increased scale,” the authors write.

An interesting question for the future is whether a program that is a generalist is more dangerous than other kinds of AI programs. The authors spend a good deal of the paper discussing the fact that there are potential dangers that are not yet well understood.

The idea of a program that handles multiple tasks suggests to the layperson a kind of human adaptability, but that may be a dangerous misconception. “For example, physical embodiment could lead to users anthropomorphizing the agent, leading to misplaced trust in the case of a malfunctioning system, or be exploitable by bad actors,” Reed and team write.

“Additionally, while cross-domain knowledge transfer is often a goal in ML research, it could create unexpected and undesired outcomes if certain behaviors (e.g. arcade game fighting) are transferred to the wrong context.”

Hence, they write, “ethics and safety considerations of knowledge transfer may require substantial new research as generalist systems advance.”

(As an interesting side note, the Gato paper uses a scheme for describing risk developed by former Google AI researcher Margaret Mitchell and colleagues, called Model Cards. Model Cards provide a concise summary of what an AI program is, what it does, and what factors affect how it operates. Mitchell wrote last year that she was forced out of Google for supporting her former colleague, Timnit Gebru, whose ethical concerns about AI ran afoul of Google’s AI leadership.)

Gato is by no means unique in its generalizing tendency. It is part of a broad trend toward generalization, and toward bigger models that use buckets of horsepower. The world got its first taste of Google’s tilt in this direction last summer, with Google’s “Perceiver” neural network, which combined text Transformer tasks with images, sound, and LiDAR spatial coordinates.

Also: Google’s supermodel: DeepMind Perceiver is a step towards an AI machine that can process anything and everything

Its peers include PaLM, or Pathways Language Model, introduced this year by Google researchers, a 540-billion-parameter model that makes use of a new technology for coordinating thousands of chips, known as Pathways, also invented at Google. A neural network released in January by Meta, called “data2vec,” uses Transformers for image data, speech audio waveforms, and text language representations, all in one.

What’s new about Gato, it would seem, is the intention to take AI used for non-robotics tasks and push it into the realm of robotics.

Gato’s creators, noting the achievements of Pathways and other generalist approaches, see the ultimate achievement in AI that can work in the real world, with all kinds of tasks.

“Future work should consider how to unify these text capabilities into one fully generalist agent that can also act in real time in the real world, in diverse environments and embodiments,” they write.

You can thus consider Gato as an important step on the way to solving AI’s most difficult problem, robotics.
