Scaling Imitation Learning In Minecraft


Imitation learning is a powerful family of techniques for learning sensorimotor coordination in immersive environments. We apply imitation learning to attain state-of-the-art performance on hard exploration problems in the Minecraft environment. We report experiments that highlight the influence of network architecture, loss function, and data augmentation. An early version of our approach reached second place in the MineRL competition at NeurIPS 2019. Here we report stronger results that can be used as a starting point for future competition entries and related research. Our code is available at https://github.com/amiranas/minerl_imitation_learning.



Keywords Imitation learning, reinforcement learning, Minecraft, immersive environments, exploration



1 Introduction



Reinforcement learning (RL) has been used to achieve outstanding performance in many challenging domains, such as Atari (Mnih et al., 2015), Go (Silver et al., 2017), StarCraft II (Vinyals et al., 2019), and immersive 3D environments (Jaderberg et al., 2019; Wijmans et al., 2020; Petrenko et al., 2020). However, RL algorithms require billions of interaction steps with the environment to achieve these results. This makes RL difficult to apply to slow environments that cannot be run for billions of time steps with the available computational resources.



In RL training, the agent needs to encounter rewards in the environment before policy optimization can begin. Therefore, environments with very sparse rewards and large action spaces, such as Minecraft, require a huge amount of exploration, which exacerbates the sample inefficiency of RL (Guss et al., 2019b).



Another approach to obtain a policy for challenging domains is to use imitation learning, which trains directly from expert demonstrations. It does not require any interaction with the environment during training and is unhindered by the sparsity of rewards, making it a desirable alternative to RL. We evaluate the performance of imitation learning on complex Minecraft tasks, for which a large amount of demonstration data is available.



Minecraft is a first-person open-world game where the agent interacts with a procedurally generated 3D environment. We focus on the ObtainIronPickaxe task, which consists of 11 distinct subtasks and features very sparse rewards. A large-scale expert demonstration dataset has been made available for this task by Guss et al. (2019b), making it a suitable and challenging testbed for imitation learning.



We first introduce state- and action-space representations together with demonstration data processing that enables successful imitation learning in Minecraft. We use this setup to investigate how factors such as network architecture, data augmentation, and loss function affect imitation learning performance.



We applied an early form of the presented approach in the MineRL competition at NeurIPS 2019 (Guss et al., 2019a), which deliberately constrained computational resources and environment interactions. Our entry reached second place in the competition without using the environment during training (Milani et al., 2020). Here we present a stronger version of our approach that attains higher performance and can serve as a starting point for future competitions and related research.



2 Minecraft Environment



In Minecraft the player interacts with a procedurally generated, 3D, open-world environment. Depending on the specified task, different goals must be achieved, such as finding and gathering important resources or using the obtained resources to craft better tools that grant access to further resources. The agent observes the world from a first-person perspective and additionally has information about the obtained resources, making the environment partially observable. The Malmo simulator was created by Johnson et al. (2016) to support research on the Minecraft domain. Previous works have used the environment to tackle navigation problems (Matiisen et al., 2019) or block-stacking tasks (Shu et al., 2017). Malmo has also been used by Guss et al. (2019b) to create an array of defined scenarios within the Minecraft environment, such as the ObtainIronPickaxe task. Additionally, Guss et al. (2019b) released the MineRL-v0 demonstration dataset with over 500 hours of human trajectories on the introduced tasks. This is the first time a large-scale demonstration dataset has been provided for an image-based environment, which makes it a suitable testbed for imitation learning approaches.



2.1 Competition



The MineRL competition at NeurIPS 2019 used the Minecraft environment as a testbed (Guss et al., 2019a). The goal was to solve the ObtainDiamond task, in which an agent has to complete many subtasks in order to reach a diamond. To do so, the agent was allowed to learn from the MineRL-v0 dataset and was further allowed 8 million interaction steps with the Minecraft environment. In the final round, the agent had to be trained remotely on a single machine with a single GPU within 4 days. Custom environment textures that are not public were used during training and evaluation. The restricted compute power, training time, and environment interactions made it an interesting and challenging setting for imitation learning.



2.2 Minecraft Tasks



We focus on two Minecraft tasks: ObtainIronPickaxe and Treechop.



ObtainIronPickaxe.



In this task the agent has to complete a sequence of 11 subtasks in order to craft an iron pickaxe. The subtasks fall into two categories: gathering resources such as wood, stone, and iron, and using the collected resources to craft tools such as pickaxes. With better tools the agent is able to gather better resources and build better pickaxes. Some of the crafting subtasks require a special tool, like the crafting table, to be placed in front of the agent. The full sequence of subtasks and the associated rewards are shown in Figure 1. The task is identical to the ObtainDiamond task from the MineRL competition at NeurIPS 2019, except that the final subtask of obtaining a diamond with the iron pickaxe has been removed. We removed this step because so far none of the tested methods or competition entries have been able to obtain a diamond. We use the same maximum episode length as in the ObtainDiamond task to keep the same episode length constraint as in the competition.



Treechop.



For this task, the agent always starts in a forest biome, where enough trees are present, and has to navigate through the world, find trees, and “attack” them to obtain logs of wood. For each log obtained in this manner the agent receives a reward of 1. The task is considered solved once 64 logs are collected. This task is only used for a comparison with reinforcement learning in Section 4.4.



3 Method



3.1 Imitation Learning



We work with environments that are discretized in timesteps $t$. At every step the agent receives an observation of the environment $s_t$ and has to choose an action $a_t$. Thereafter the environment is advanced by one timestep and the agent receives feedback in the form of a reward $r_t$ and the next observation $s_{t+1}$, until it reaches a terminal state. The objective is to find a policy $\pi(s)$ that collects the highest return over an episode: $R = \sum_{t=0}^{T} r_t$.



Imitation learning aims to maximize the return by imitating the behavior demonstrated in human trajectories. Typically, a network is trained on the demonstration data to predict the expert's action given an observation. This turns imitation learning into a classification problem, with the different actions as classes.



Imitation learning has been successfully applied to other domains, such as Atari games (Bogdanovic et al., 2015), Mario (Chen and Yi, 2017), and StarCraft II (Vinyals et al., 2019; Justesen and Risi, 2017). We evaluate different approaches to training an imitation learning policy. First we consider a classification-based approach with a policy defined by a neural network with a softmax activation function after the last layer:



$$\pi(s,a) = p_\pi(a|s) = \mathrm{Softmax}_a\big(f(s,a)\big), \qquad (1)$$

where $f(s,\cdot)$ denotes the features of the last layer and has a length equal to the number of possible actions. We train the policy $\pi(s,a)$ to predict the expert action with a cross-entropy loss. At test time, the action is either sampled from the distribution ($a \sim p_\pi(a|s)$) or the action with the highest probability is selected ($a = \operatorname{argmax}_a p_\pi(a|s)$). This supervised-learning-based policy training is often referred to as Behavior Cloning.
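
As a concrete illustration, here is a minimal PyTorch sketch of this behavior cloning setup (not the exact code from our repository); `policy_net` is assumed to map a batch of observations to the action logits $f(s,\cdot)$:

```python
import torch
import torch.nn.functional as F

def behavior_cloning_step(policy_net, optimizer, observations, expert_actions):
    """One supervised update: cross-entropy between the predicted logits
    f(s, .) and the expert action labels (Eq. 1)."""
    logits = policy_net(observations)                 # (batch, num_actions)
    loss = F.cross_entropy(logits, expert_actions)    # expert_actions: (batch,) of action indices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def select_action(policy_net, observation, stochastic=True):
    """Act with the trained policy: sample from p_pi(a|s) or take the argmax action."""
    with torch.no_grad():
        logits = policy_net(observation.unsqueeze(0))
        probs = F.softmax(logits, dim=-1)
    if stochastic:
        return torch.multinomial(probs, num_samples=1).item()
    return probs.argmax(dim=-1).item()
```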



We also evaluate the performance of the pre-training process of Deep Q-learning from Demonstrations (DQfD) (Hester et al., 2018). There the reward signal is incorporated into the training process together with imitation learning. In this case a greedy policy is used, which selects the action with the highest action-value: $\pi(s) = \operatorname{argmax}_a Q(s,a)$. The action-value function is trained with the Q-learning loss (Mnih et al., 2015):



$$Q^{\mathrm{Target}}(s,a) = R_n + \gamma^n \max_{a'} Q(s', a'). \qquad (2)$$

The expert action information is incorporated through an additional margin loss. In a state $s$, let $a_E$ be the action of the expert and $m(a_E, a) := b \cdot \mathbb{1}\{a \neq a_E\}$ a margin function that is $0$ when $a = a_E$ and takes a value $b > 0$ otherwise. The large-margin classification loss (Piot et al., 2014) is defined as



$$L_S = \max_{a \in A} \big[ Q(s,a) + m(a_E, a) \big] - Q(s, a_E). \qquad (3)$$

The margin loss is minimized if $a_E$ is the maximizer of $Q(s,\cdot)$ and if its value at $a_E$ exceeds that of any other action $a$ by at least the margin $b$. As a result, minimizing this loss pushes the Q-function to assign higher values to the expert actions. In the experiments we compare the empirical performance of the margin loss, with and without the TD loss, to that of the cross-entropy loss. Without the reward incorporation, the margin classification loss becomes



$$L_S = \max_{a \in A} \big[ f(s,a) + m(a_E, a) \big] - f(s, a_E). \qquad (4)$$
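
The margin loss of Eqs. (3) and (4) and the n-step TD target of Eq. (2) can be implemented in a few lines. The sketch below is a hedged PyTorch illustration rather than our exact implementation; the margin value of 0.8 and the use of a Huber TD loss are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def margin_loss(logits, expert_actions, b=0.8):
    """Large-margin classification loss (Eqs. 3/4):
    max_a [f(s,a) + m(a_E,a)] - f(s,a_E), with m = b for a != a_E and 0 otherwise.
    The margin b = 0.8 is an illustrative choice."""
    margins = torch.full_like(logits, b)
    margins.scatter_(1, expert_actions.unsqueeze(1), 0.0)   # zero margin at the expert action
    augmented_max = (logits + margins).max(dim=1).values
    expert_values = logits.gather(1, expert_actions.unsqueeze(1)).squeeze(1)
    return (augmented_max - expert_values).mean()

def n_step_td_loss(q_net, target_net, obs, actions, n_step_returns, next_obs, gamma_n):
    """n-step Q-learning loss towards the target of Eq. (2); terminal-state
    handling is omitted for brevity."""
    q_sa = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = n_step_returns + gamma_n * target_net(next_obs).max(dim=1).values
    return F.smooth_l1_loss(q_sa, target)
```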



3.2 Training Setup



In this section we describe the details of applying imitation learning to the Minecraft domain and the variations we consider. We want to investigate how much the training setup, such as the choice of network architecture or the use of data augmentation, influences the performance of imitation learning.



State and Action Space.



The main part of the state is the observation of the environment, which consists of a 64×64 RGB image. In the ObtainIronPickaxe task an additional vectorial state is provided that contains information about the collected resources, the crafted tools, and the currently held item. We encode the held item as a one-hot vector and the items in the inventory as multi-hot vectors (the number of ones in a sub-vector equals the count of the corresponding item in the inventory).
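
The sketch below illustrates this encoding; the item lists, the per-item cap, and the dictionary format are illustrative assumptions rather than the exact encoding used in our code:

```python
import numpy as np

# Illustrative item lists; the actual inventory keys and sub-vector sizes
# are listed in the supplementary material.
INVENTORY_ITEMS = ["log", "planks", "stick", "crafting_table", "cobblestone",
                   "stone_pickaxe", "furnace", "iron_ore", "iron_ingot", "iron_pickaxe"]
HELD_ITEMS = ["none", "wooden_pickaxe", "stone_pickaxe", "iron_pickaxe"]
MAX_COUNT = 8  # cap of the multi-hot sub-vector length per item (assumption)

def encode_vector_state(inventory, held_item):
    """One-hot vector for the held item plus a multi-hot sub-vector per inventory
    item, where the number of ones equals the (capped) item count."""
    held = np.zeros(len(HELD_ITEMS), dtype=np.float32)
    held[HELD_ITEMS.index(held_item)] = 1.0
    inv = np.zeros(len(INVENTORY_ITEMS) * MAX_COUNT, dtype=np.float32)
    for i, item in enumerate(INVENTORY_ITEMS):
        count = min(int(inventory.get(item, 0)), MAX_COUNT)
        inv[i * MAX_COUNT : i * MAX_COUNT + count] = 1.0
    return np.concatenate([held, inv])
```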



The action space consists of three parts. First, there are 8 binary actions related to movement in the environment: forward, backward, left, right, jump, sprint, attack, and sneak. Multiple movement actions can be used in the same timestep, resulting in 256 combinations. The second part is the continuous yaw and pitch control of the agent's camera orientation. The last part consists of the actions related to crafting, equipping, and placing items. Some items, like the crafting table, must be placed on the ground before they can be used.



The combination of these basic actions results in a massive action space. We discretize the continuous camera control by quantization, which is common practice for first-person environments (Jaderberg et al., 2019; Kempka et al., 2016). We use a single rotation value of 22.5 degrees for each direction.



After quantization of the camera movement, there are 1280 possible movement action combinations. We allow only up to 3 simultaneous movement actions and remove redundant actions like turning left and right at the same time. In the end 112 different movement actions remain. A full description of the movement actions and the state encoding is available in the supplementary material.
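
A possible way to enumerate such a discretized action set is sketched below; the exact pruning rules that yield the 112 movement actions are given in the supplementary material, so the filter here is only illustrative:

```python
from itertools import combinations

MOVEMENT_KEYS = ["forward", "back", "left", "right", "jump", "sprint", "attack", "sneak"]
CAMERA_DELTA = 22.5  # degrees per discretized camera turn

def build_movement_actions(max_simultaneous=3):
    """Enumerate combinations of up to `max_simultaneous` movement keys,
    dropping contradictory pairs such as pressing left and right together."""
    contradictory = [{"left", "right"}, {"forward", "back"}]
    actions = []
    for k in range(max_simultaneous + 1):
        for combo in combinations(MOVEMENT_KEYS, k):
            keys = set(combo)
            if any(pair <= keys for pair in contradictory):
                continue
            actions.append(keys)
    return actions

# Camera control quantized to single turns of +/- 22.5 degrees in yaw or pitch.
CAMERA_ACTIONS = [(CAMERA_DELTA, 0.0), (-CAMERA_DELTA, 0.0),
                  (0.0, CAMERA_DELTA), (0.0, -CAMERA_DELTA)]
```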



Training Setup.



The policy neural network consists of three parts: a convolutional perceptual part processes the image input, a fully-connected part processes the vectorial part of the state, and their outputs are concatenated and passed through a subsequent fully-connected part (Figure 2). The last layer has either a softmax or a linear output, for the cross-entropy or the margin loss, respectively. We investigate how network size correlates with performance by testing three different architectures for the perceptual part of the network: the DQN architecture with 3 convolutional layers (Mnih et al., 2015), the Impala architecture with 6 residual blocks (Espeholt et al., 2018), and the Deep Impala architecture with 8 residual blocks and Fixup initialization (Nichol, 2019; Zhang et al., 2019). In the last tested architecture we double the channel size of all fully-connected and convolutional layers (Double Deep Impala).
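
A simplified PyTorch version of this two-stream layout is shown below; the DQN-style convolutional trunk is only a stand-in for the deeper Impala-style trunks, and the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class MinecraftPolicy(nn.Module):
    """Simplified two-stream policy: a convolutional trunk for the 64x64 image,
    an MLP for the vectorial state, concatenation, then fully-connected layers.
    The paper's best results use a deeper residual (Impala-style) trunk with
    Fixup initialization instead of this DQN-style trunk."""

    def __init__(self, vector_dim, num_actions):
        super().__init__()
        self.conv = nn.Sequential(                       # input: (B, 3, 64, 64)
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.vec = nn.Sequential(nn.Linear(vector_dim, 128), nn.ReLU())
        conv_out = 64 * 4 * 4                            # 64x64 input -> 4x4 feature map
        self.head = nn.Sequential(
            nn.Linear(conv_out + 128, 512), nn.ReLU(),
            nn.Linear(512, num_actions),                 # logits f(s, .); softmax applied in the loss
        )

    def forward(self, image, vector):
        return self.head(torch.cat([self.conv(image), self.vec(vector)], dim=1))
```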



We train the networks with the Adam optimizer, a learning rate of $6.25 \times 10^{-5}$, and a weight decay of $10^{-5}$ for up to $3 \times 10^6$ steps.
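
In code, this optimizer configuration amounts to the following sketch (the helper name is ours, not part of the released code):

```python
import torch

def make_optimizer(policy_net):
    """Adam with the learning rate and weight decay stated above."""
    return torch.optim.Adam(policy_net.parameters(), lr=6.25e-5, weight_decay=1e-5)

MAX_TRAINING_STEPS = 3_000_000  # upper bound on the number of gradient steps
```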



From the demonstration dataset we use the human trajectories that successfully reach the target of the environment within the timestep limit of the respective task (ObtainIronPickaxe and ObtainDiamond). We also remove all states where the human did not perform any action. For full network architectures see the supplementary material.



We also test multiple augmentations: horizontal flipping (where the left and right actions are flipped as well), rectangle removal, and brightness, sharpness, contrast, and posterization adjustments.
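
For instance, the horizontal-flip augmentation can be sketched as follows; representing the action as a set of pressed keys plus a yaw value is an assumption for illustration:

```python
import numpy as np

def flip_left_right(image, pressed_keys, yaw_delta):
    """Mirror the image horizontally and adjust the action accordingly:
    swap the left/right movement keys and negate the yaw camera turn."""
    flipped_image = image[:, ::-1, :].copy()          # (H, W, C) image, mirrored along width
    swap = {"left": "right", "right": "left"}
    flipped_keys = {swap.get(key, key) for key in pressed_keys}
    return flipped_image, flipped_keys, -yaw_delta
```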



Besides evaluating the policy performance, we tested two additional measures of performance, the training loss and the test loss on unseen human trajectories, in order to evaluate how well these losses correlate with the actual performance of the policy.



Additional Data Incorporation.



In the default training setup we used the human trajectories from the ObtainIronPickaxe and ObtainDiamond tasks. The amount of available training data can be increased by also incorporating the trajectories from the Treechop task (where the agent has to collect logs, which is also the first step of the ObtainIronPickaxe and ObtainDiamond tasks). However, the observation space of the Treechop trajectories consists only of the RGB images; no inventory information is available. This makes them incompatible with the trajectories from the other tasks. To create realistic observations for the additional data, we first sample a random state from the ObtainIronPickaxe and ObtainDiamond trajectories in which the reward of 2 has not yet been reached. Then we use the vectorial observation part of the sampled state to complete a Treechop observation. This process is repeated until all Treechop states have a complete observation.
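
The sketch below illustrates this completion step; the dictionary layout of the stored states and the field names are assumptions made for the example:

```python
import random

def complete_treechop_observations(treechop_states, pickaxe_states, reward_threshold=2):
    """Pair each image-only Treechop frame with a vectorial observation sampled
    from early ObtainIronPickaxe/ObtainDiamond states (cumulative reward below
    the threshold), so that all training states share the same observation format."""
    early_vectors = [s["vector"] for s in pickaxe_states
                     if s["cumulative_reward"] < reward_threshold]
    completed = []
    for state in treechop_states:
        completed.append({"image": state["image"],
                          "vector": random.choice(early_vectors)})
    return completed
```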



4 Results



4.1 Evaluation



Figure 3 shows the training and test losses across three training runs on the ObtainIronPickaxe task. The test loss is the cross-entropy loss of the policy network evaluated on unseen human trajectories. The figure reveals a clear difference between imitation learning and standard supervised classification: the test loss increases during training, usually a clear indication of heavy overfitting, yet the policy performance in terms of reward keeps improving. Also, even though the training and test losses were nearly identical between different random seeds, the performance of the snapshots varies over time and is not correlated with either of the losses. Therefore, even when interaction with the environment is not required during training, it cannot be avoided during evaluation.



We evaluated imitation learning performance as follows. During training, 8 snapshots of the network were saved. For each snapshot, the average reward was computed over 100 episodes. The performance of the best snapshot was used as the score of the training run. Each training configuration was run three times, and the average score across the three training runs is considered the overall result of that configuration. The variance was calculated over the scores of the three training runs.
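
In pseudocode form, the evaluation protocol looks as follows; `evaluate_episode` stands in for a rollout of one episode in the Minecraft environment:

```python
import numpy as np

def score_training_run(snapshots, evaluate_episode, episodes_per_snapshot=100):
    """Average each saved snapshot's return over 100 episodes and report the
    best snapshot's average as the score of the training run."""
    snapshot_scores = []
    for policy in snapshots:
        returns = [evaluate_episode(policy) for _ in range(episodes_per_snapshot)]
        snapshot_scores.append(float(np.mean(returns)))
    return max(snapshot_scores)
```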



In the following plots of this section each point represents the score of the best performing snapshot until that time point, always averaged across three training runs. The shaded area always shows the standard deviation across the three training runs.



4.2 Architecture and Augmentations



The performance of the different architectures and image augmentation options is shown in Table 1 (all trained with the horizontal image flipping augmentation). Larger networks yielded better performance. The Deep Impala architecture improved performance by 78%, and the Double Deep Impala architecture yielded another 7% improvement.



Image flipping was the only effective augmentation and improved performance by 73%. Applying additional augmentations had no significant impact and sometimes reduced performance.



4.3 Margin Loss and Treechop Data Incorporation



We compare the cross-entropy loss against the margin loss in Figure 4 and investigate whether reward incorporation through the additional TD loss improves performance. The cross-entropy loss outperformed the margin loss on both tasks. However, it was very important to sample actions from the softmax distribution of the cross-entropy-based policy: with a deterministic argmax policy, the performance of the cross-entropy-based policy fell below that of the margin-loss-based policy. This is relevant in cases where a deterministic policy is required. The combination of the margin and TD losses, as used in the DQfD pre-training phase, diminished performance.



The additional Treechop data improved the performance of the agent by a large margin. This shows that even with the massive human dataset the amount of available trajectories is still a potential bottleneck for the imitation learning approach.



The best policy (the Deep Impala agent trained with the cross-entropy loss and additional Treechop data) was able to reach the stone pickaxe in 82% of the episodes. Sometimes it could obtain an iron ingot. An iron pickaxe was only built in rare cases (ca. 1%). Typical failure cases included getting stuck in biomes without trees or being buried underground without sufficient resources to finish the task. A histogram of the attained returns is shown in Figure 4.



4.4 Comparison to Reinforcement Learning



For the action space used in this work, the only task (out of Treechop, ObtainIronPickaxe, and ObtainDiamond) on which Rainbow (Hessel et al., 2018), an RL approach that does not use demonstrations, was able to outperform a random policy was Treechop. We compare the results of Rainbow to imitation learning with the cross-entropy loss and to the DQfD algorithm in Figure 5. For all methods we show a variant with the DQN and the Deep Impala network architecture. The larger network always improved results. Imitation learning was able to reach near-optimal performance after just 50,000 training steps. The RL approach obtained non-zero rewards in only two of the three training runs.



5 MineRL Competition Results



An early form of the presented imitation learning approach placed 2nd in the MineRL competition at NeurIPS 2019, out of 41 participating teams, and received the “Surprising Research Result” award for attaining this level of performance without using the Minecraft environment at all during training (Milani et al., 2020). The final round was scored by the best-performing policy out of four separate training runs. Our best policy at the time yielded an average return of 42.4, while the top entry in the competition achieved an average return of 61.6. The setup presented in this technical report yields an average return of 74.5 (best-performing policy out of 3 training runs with the cross-entropy loss and Treechop data incorporation). These results are obtained on a different texture set, since the competition textures are not publicly available.



6 Discussion



Imitation learning can work very well when sufficient demonstrations are provided. However, imitation learning has to deal with some of the same problems as reinforcement learning: the losses are weakly correlated with the actual performance of the policy at test time, and the returns are unstable over time. The performance fluctuates considerably during training (Figure 3).



This distinguishes imitation learning from other supervised learning tasks, such as image classification on a fixed dataset. While single misclassified samples due to stochastic gradient descent have little influence on the overall performance of a standard classifier, in temporally correlated processes such fluctuations can cause the neural network to predict a poor action in an essential early state. This can collapse the performance of the entire trajectory, because future states in which the network performs well are never reached.



Minecraft presents a challenging and exciting domain for sensorimotor learning. We hope that the strong imitation learning baseline described in this technical report can support future progress.