From the VPT video showing the crafting of a diamond pickaxe in Minecraft. The computer program achieved the feat in ten minutes, half the time it would take a competent human player.
How important can it be to master the “diamond tool” in Minecraft?
Important enough to spend $160,000, according to OpenAI, the artificial intelligence startup.
That’s the amount of money an OpenAI team spent hiring Minecraft players on the Upwork online job platform to submit videos of themselves playing the game.
In a paper presented this week, “Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos,” OpenAI researchers Bowen Baker and team describe using large data sets to train a neural network to imitate human keystrokes in order to solve different tasks in the video game. (OpenAI also published a blog post.)
Many neural networks have conquered various types of games through so-called reinforcement learning in recent years, including DeepMind’s AlphaZero, which mastered chess, Go and shogi, and the later MuZero program, which added the ability to handle Atari games.
Baker and team wanted to develop a neural network for Minecraft’s more complex “open world” game environment, where a range of keystrokes gives players far more degrees of freedom than in chess or Atari games.
The research literature, the authors write, includes a “large amount” of work in Minecraft. But the VPT work is unique, they write, for its scope and scale: “To the best of our knowledge, there is no published work that operates in the full, unmodified human action space, which includes drag-and-drop inventory management and item crafting.”
The neural network, called VPT, was built in two stages. The first stage required human players, or contractors, who recorded 4,500 hours of play. The researchers later found that they needed only about 2,000 of those hours.
Baker and the team describe the process:
We had the applications open for a day and then randomly selected 10 applicants for the first round of contractors. Later in the project, as we needed more data and as some contractors asked to terminate their contracts, we added more applicants from the original pool as well as referrals from the contractors currently working. Contractors were paid $20 per hour (minus Upwork platform fees and applicable taxes). All of the results presented in this paper are based on about 4,500 hours of data (including data recorded to gather statistics of human play that was not used for training), which cost us about $90,000. Over the course of the project, we collected some data that we did not use because of bugs in the recorder and for some ideas that we ultimately did not pursue. In total, we spent about $160,000 on contractor compensation over the course of the project. However, as discussed in Sec. 4.6, we could probably obtain most of our results with an IDM trained using only $2,000 worth of data, i.e. the foundation VPT model, the BC fine-tuning to the earlygame_keyword dataset, and the RL fine-tuning results. Collecting the contractor_house dataset cost about $8,000. Because we used the IDM trained on about 2,000 hours of contractor data, the actual cost of contractor data for these results was about $40,000.
During those 4,500 hours, the video frames of the game were labeled with the actions the player took, such as “inventory,” to check a player’s collection of items, using the “E” key; and “sneak,” to move “carefully” in the current direction, using the SHIFT key. These actions are recorded as JSON text strings at each point in the game and stored with the video frames.
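As a rough illustration of what such per-frame records could look like, the sketch below parses one JSON string per frame. The field names (`tick`, `keys`, `mouse_dx`, `mouse_dy`) and the sample log are hypothetical and are not taken from OpenAI's recorder; only the E and SHIFT key bindings come from the description above.

```python
import json

# Hypothetical key-to-action table; only the E and SHIFT bindings are from the article.
KEY_TO_ACTION = {"E": "inventory", "SHIFT": "sneak"}

# Hypothetical log: one JSON object per video frame.
LOG_LINES = [
    '{"tick": 0, "keys": ["SHIFT"], "mouse_dx": 0.0, "mouse_dy": 0.0}',
    '{"tick": 1, "keys": ["E"], "mouse_dx": 1.5, "mouse_dy": -0.5}',
    '{"tick": 2, "keys": [], "mouse_dx": 0.0, "mouse_dy": 0.0}',
]

def parse_actions(lines):
    """Return a list of (tick, [action names]) pairs, one per frame."""
    records = []
    for line in lines:
        record = json.loads(line)
        # Unknown keys fall back to their lowercase name.
        actions = [KEY_TO_ACTION.get(key, key.lower()) for key in record["keys"]]
        records.append((record["tick"], actions))
    return records

labeled_frames = parse_actions(LOG_LINES)
```

Each resulting pair ties a frame index to the actions pressed at that moment, which is the kind of frame-to-action pairing the next training step relies on.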
Game frames with their actions labeled were used to train a neural network called the inverse dynamics model, or IDM, which learns which actions go with which frames. The IDM is a combination of several types of neural networks, including a 3D convolutional neural network and a ResNet to analyze the video frames, and several Transformer attention networks to predict the action taken at each frame.
The trained IDM is then applied to a much larger set of video footage, a total of 70,000 hours of unlabeled Minecraft footage collected from the web. The IDM applies “pseudo-labels” to this much larger collection. In other words, the IDM, and the contractor fees behind it, are a way to bootstrap a huge video training set.
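The pseudo-labeling idea can be sketched with a toy example. This is not the authors' neural network: the “video” here is a list of integer positions in a one-dimensional world, and the stand-in IDM is a simple lookup table learned by majority vote. All names and data are hypothetical.

```python
from collections import Counter, defaultdict

# Toy stand-in for video: each "frame" is an integer position in a 1-D world.
# Contractor data: (frame_before, frame_after, recorded_action).
contractor_data = [
    (0, 1, "right"), (1, 2, "right"), (2, 1, "left"),
    (1, 1, "stay"), (5, 4, "left"), (4, 5, "right"),
]

def train_idm(labeled):
    """Learn which action most often explains each observed frame change."""
    votes = defaultdict(Counter)
    for before, after, action in labeled:
        votes[after - before][action] += 1
    return {delta: counts.most_common(1)[0][0] for delta, counts in votes.items()}

def pseudo_label(idm, frames):
    """Attach an inferred action to every consecutive pair of unlabeled frames."""
    return [(a, idm.get(b - a, "unknown")) for a, b in zip(frames, frames[1:])]

idm = train_idm(contractor_data)
web_video = [3, 4, 5, 5, 4]  # unlabeled footage: frames only, no actions
pseudo_labeled = pseudo_label(idm, web_video)
```

The point of the sketch is the division of labor: a small labeled set trains the model that infers actions, and that model then labels footage nobody was paid to annotate.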
The training regime for VPT.
OpenAI
As expensive as the contractor payments may seem, the approach represents a large cost saving, the authors write. Had they needed to collect contractor data equivalent to the 70,000 hours of web video, it would have been far more expensive.
“If we could cheaply collect a labeled contractor data set of a similar order of magnitude to web_clean, then this would not be important; however, collecting data at that scale would have cost millions of dollars.”
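The cost figures quoted from the paper can be checked with simple arithmetic. The sketch below assumes the quoted $20 hourly rate applies uniformly and ignores platform fees, taxes and discarded recordings; the 70,000-hour line is a hypothetical extrapolation, not a number from the paper.

```python
HOURLY_RATE_USD = 20  # contractor pay quoted in the paper excerpt

def labeling_cost(hours, rate=HOURLY_RATE_USD):
    """Cost in dollars of paying contractors for the given hours of play."""
    return hours * rate

reported_hours_cost = labeling_cost(4_500)   # hours behind the reported results
idm_hours_cost = labeling_cost(2_000)        # hours used to train the IDM
web_scale_cost = labeling_cost(70_000)       # hypothetical: contractor data at web scale

print(reported_hours_cost, idm_hours_cost, web_scale_cost)
```

The first two values match the $90,000 and $40,000 the authors report. The third comes to $1.4 million under these assumptions, which is at least the same order of magnitude as the “millions of dollars” the authors cite.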
Using those 70,000 hours, the authors train a second neural network, also made up of Transformer layers, to mimic users’ actions in the videos, a common practice known as “behavioral cloning.”
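Behavioral cloning can be illustrated in miniature. The sketch below is a tabular toy, not the Transformer the authors train: it assumes a hypothetical one-dimensional world in which each frame is an integer position, and the cloned policy repeats whichever action was demonstrated most often for that frame.

```python
from collections import Counter, defaultdict

# Hypothetical pseudo-labeled demonstrations: (frame, action) pairs,
# where each "frame" is an integer position in a 1-D toy world.
pseudo_labeled = [
    (0, "right"), (1, "right"), (2, "right"), (3, "stay"),
    (4, "left"), (5, "left"), (1, "right"), (4, "left"), (3, "stay"),
]

def clone_behavior(demonstrations):
    """Fit a policy that repeats the most common demonstrated action per frame."""
    votes = defaultdict(Counter)
    for frame, action in demonstrations:
        votes[frame][action] += 1
    return {frame: counts.most_common(1)[0][0] for frame, counts in votes.items()}

def act(policy, frame, default="stay"):
    """Choose an action for a frame, falling back to a default for unseen frames."""
    return policy.get(frame, default)

policy = clone_behavior(pseudo_labeled)

# Roll the cloned policy forward from position 0 for five steps.
MOVES = {"left": -1, "stay": 0, "right": 1}
trajectory = [0]
for _ in range(5):
    trajectory.append(trajectory[-1] + MOVES[act(policy, trajectory[-1])])
```

The cloned policy never sees a reward; it only reproduces what the demonstrations did, which is why the quantity and quality of the labeled footage matter so much.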
The aim of the work is to find a way to train a general-purpose computer “agent” that can use the wealth of unlabeled data on the internet to solve tasks involving causality, that is, sequences of actions in which each step necessarily depends on the one before.
“The results presented in this paper help pave the way to using the wealth of unlabeled data on the web for sequential decision domains,” they write.
The work can be used for numerous computer tasks that require mouse click sequences and other human operator controls, they suggest.
“Although we only experiment with Minecraft, we believe that VPT provides a general recipe for training behavioral priors in hard but generic action spaces, in any domain that has a large amount of freely available unlabeled data, such as computer usage.”
OpenAI is best known for the large language program called GPT-3, which also uses a “pre-training” approach based on vast amounts of unlabeled web data. In a sense, the Minecraft work extends that approach to imitating behavior in the domain of sequential computer tasks captured on video.
The payoff is that the program can, in some cases, beat the time a human needs to accomplish one of the game’s most difficult tasks, obtaining a diamond pickaxe.
In Minecraft, most diamond tools simply last longer and do more damage. Diamond pickaxes are the only ones that are especially important to most players. You need a diamond pickaxe to mine obsidian and a fictional material called netherite, both of which are important for late-game activities such as building enchanting tables and making netherite gear.
After training VPT to learn all sorts of Minecraft tasks, the authors used a “fine-tuning” approach, applying reinforcement learning to the neural network so that it could craft a diamond pickaxe faster than normal.
“To demonstrate the effectiveness of RL fine-tuning, we chose the challenging goal of obtaining a diamond pickaxe within 10 minutes, starting from a fresh Minecraft survival world,” they write.
This is a challenge for humans, who usually take twice as long to do so, if they can:
Doing so involves acquiring a sequence of hard-to-obtain items that require complex skills such as mining, inventory management, crafting with and without a crafting table, tool use, operating a furnace, and mining at the lowest depths, where many hazards such as enemies and lava exist (Fig. 6). Adding to the difficulty, progress can easily be lost by dropping items, destroying them, or dying. Obtaining a diamond pickaxe more often than not takes a proficient human more than 20 minutes (24,000 actions).
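Reading the parenthetical 24,000 figure as a count of actions, the numbers in the excerpt imply a steady action rate. The short calculation below assumes actions are issued at a constant rate over the whole session, which the excerpt does not state explicitly.

```python
HUMAN_MINUTES = 20       # time quoted for a proficient human
HUMAN_ACTIONS = 24_000   # action count quoted alongside it

# Implied rate, assuming actions are evenly spaced in time.
actions_per_second = HUMAN_ACTIONS / (HUMAN_MINUTES * 60)

# Action budget for the 10-minute goal at the same rate.
ten_minute_budget = int(actions_per_second * 10 * 60)

print(actions_per_second, ten_minute_budget)
```

Under that assumption the figures work out to 20 actions per second, so the 10-minute target leaves a budget of 12,000 actions.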
In gathering both the contractor data and the 70,000 hours of unlabeled web video, the authors considered the possibility of offensive content. “Contractors could, in theory, use Minecraft’s open-world property to generate personally identifiable information and/or offensive content (e.g., using Minecraft blocks to write their name or offensive messages, and then finding a spot from which the message would be visible),” they write, though they saw no such content in the contractor videos they reviewed.
“Of course, we train our BC [behavioral cloning] models on internet videos of people playing Minecraft, and if such behavior appears in those videos, our model could also learn it, although we expect such behavior to be rare enough that our model would not reproduce it,” they write.
Where does such a general agent go next? The idea is that, having conquered diamond pickaxes, VPT or its descendants could do all sorts of things a person can do with a mouse and keyboard, such as booking tickets, browsing social media, or navigating maps.