Learning from Demonstration in the Wild

Learning from demonstration (LfD), also known as imitation learning, is a machine learning technique that learns complex behaviours from a dataset of expert demonstrations. LfD is particularly useful in settings where hand-coding behaviour, or engineering a suitable reward function, is too difficult or labour intensive. While LfD has succeeded in a wide range of problems, most methods assume access to a dataset that already contains explicit demonstrations in a convenient form, such as sequences of state-action pairs. This assumption greatly limits the practical applicability of LfD, which to date has largely been unable to leverage the abundant implicit demonstrations available “in the wild”, such as videos on the internet.

Consider the problem of training autonomous vehicles to navigate in the presence of humans. Since physical road tests are expensive and dangerous, simulation is an essential part of the training process. However, such training requires a realistic simulator which, in turn, requires realistic models of the road users, such as vehicles, cyclists, and pedestrians, that the autonomous vehicle interacts with. Hand-coded models of road users are labour intensive to create, do not generalise to new contexts or settings, and do not capture the diversity of behaviours produced by humans. LfD is an attractive alternative. In principle, explicit demonstrations could be collected from such road users, but setting up scenarios that faithfully capture the range of situations that could arise in the real world is not feasible. Meanwhile, there is an abundance of readily available data, such as traffic camera footage, that contains implicit demonstrations of road user behaviour in diverse settings, yet no existing LfD methods can learn from such raw traffic data.

Video to Behaviour (ViBe)

In our recent paper, we introduce ViBe, a new approach to learning models of behaviour that requires as input only unlabelled raw video data of a scene, collected from a single, monocular, uncalibrated camera with ordinary resolution. Our approach works by calibrating the camera, detecting the relevant objects, and tracking them through time to form trajectories. Each trajectory, together with the static and dynamic context of that road user at each moment in time, is then fed as a demonstration to our LfD system, which learns robust behaviour models for each type of road user.

Extracting trajectories

Our approach calibrates the camera, detects relevant objects, tracks them from one frame to the next, and uses the resulting trajectories to perform LfD, yielding models of naturalistic driving behaviour.

We start with a video stream of a traffic scene collected from an ordinary traffic camera available online.

  • Calibration: We obtain the satellite image of this scene using Google Maps and, by identifying corresponding landmarks in both the camera and satellite images, we estimate the camera matrix and distortion parameters (a simplified sketch of this step follows the list).
  • Detection: We then use Mask R-CNN to detect bounding boxes of the objects in the scene and project them to 3D.
  • Tracking: Finally, detected objects are tracked through time with our extended Deep Sort model using a Kalman filter in 3D.
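
As a concrete illustration of the calibration and projection steps, the sketch below fits a ground-plane homography from hand-picked landmark correspondences and projects a detection's foot point into world coordinates. This is a simplification of the full camera-model fit described above, and all coordinates are made up for illustration:

```python
import cv2
import numpy as np

# Hypothetical landmark correspondences: pixel coordinates in the camera frame
# and matching ground-plane coordinates (in metres) read off the satellite image.
image_pts = np.array([[412, 688], [903, 651], [781, 203], [334, 245]],
                     dtype=np.float32)
world_pts = np.array([[0.0, 0.0], [12.5, 0.0], [12.5, 30.0], [0.0, 30.0]],
                     dtype=np.float32)

# Homography from the image plane to the ground plane (assumes the landmarks,
# and the objects we later project, lie on the road surface).
H, _ = cv2.findHomography(image_pts, world_pts)

# Project the bottom-centre of a detected bounding box into world coordinates.
foot_px = np.array([[[598.0, 421.0]]], dtype=np.float32)
foot_world = cv2.perspectiveTransform(foot_px, H)[0, 0]
print(f"detection at ({foot_world[0]:.1f} m, {foot_world[1]:.1f} m)")
```
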
Building a learning environment

After extracting trajectories, ViBe recreates the scene in a simulator to play back these trajectories and provide an environment for our learning agents to interact with. Given a start frame in the dataset, the simulator plays back tracked trajectories from that frame onwards, produces observations, and accepts actions from agents controlled by neural network policies. In other words, it provides exactly the environment needed both to perform LfD on the extracted trajectories and to evaluate the resulting learned policies.
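
To make this interface concrete, here is a toy sketch of such a playback environment (the class and method names are ours, not the paper's API, and the dynamics and observations are deliberately simplistic):

```python
import numpy as np

class PlaybackEnv:
    """Toy playback environment: background road users are replayed from
    extracted trajectories while one agent is driven by a learned policy."""

    def __init__(self, tracks, start_frame, horizon, dt=0.1):
        # tracks: frame index -> (n_users, 2) array of positions; assumed to
        # cover every frame from start_frame to start_frame + horizon.
        self.tracks = tracks
        self.start_frame = start_frame
        self.horizon = horizon
        self.dt = dt                    # assumed frame period in seconds

    def reset(self):
        self.t = self.start_frame
        # Take over one tracked road user; the rest are replayed as recorded.
        self.agent = self.tracks[self.t][0].copy()
        return self._observe()

    def step(self, action):
        # Integrate a 2D velocity action; real vehicle dynamics would be richer.
        self.agent += np.asarray(action) * self.dt
        self.t += 1
        done = self.t >= self.start_frame + self.horizon
        return self._observe(), done

    def _observe(self):
        # Positions of replayed road users relative to the agent form the
        # dynamic context; static context is omitted in this sketch.
        others = self.tracks[self.t] - self.agent
        return np.concatenate([self.agent, others.ravel()])
```
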

Our simulator generates observations from both the static and dynamic context: pseudo-LiDAR readings represent static features (e.g. zebra crossings and roads) and dynamic features (e.g. the distance and velocity of other agents) around the agent. In addition, we provide information such as the agent’s heading, distance from goal, and velocity.
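
As an illustration of how such readings can be generated, the sketch below casts rays from the agent and records the distance at which each ray first hits an obstacle, with obstacles crudely modelled as circles (our simplification, not the paper's implementation):

```python
import numpy as np

def pseudo_lidar(agent_xy, heading, obstacles, n_rays=32, max_range=50.0):
    """Distance along each of n_rays to the nearest circular obstacle.
    `obstacles` is an (n, 3) array of (x, y, radius), a crude stand-in for
    the static and dynamic context around the agent."""
    angles = heading + np.linspace(-np.pi, np.pi, n_rays, endpoint=False)
    readings = np.full(n_rays, max_range)
    for i, a in enumerate(angles):
        d = np.array([np.cos(a), np.sin(a)])      # unit ray direction
        for ox, oy, r in obstacles:
            rel = np.array([ox, oy]) - agent_xy
            proj = rel @ d                         # centre's distance along ray
            if proj <= 0:
                continue                           # obstacle is behind this ray
            perp_sq = rel @ rel - proj ** 2        # squared distance off the ray
            if perp_sq <= r ** 2:
                hit = proj - np.sqrt(r ** 2 - perp_sq)
                readings[i] = min(readings[i], max(hit, 0.0))
    return readings

# One obstacle 10 m ahead: the forward-facing ray reads 9 m, the rest 50 m.
print(pseudo_lidar(np.array([0.0, 0.0]), 0.0, np.array([[10.0, 0.0, 1.0]])))
```
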

Learning to imitate behaviour

Given the trajectories extracted by our visual processing module, ViBe uses the simulator to learn a policy that imitates those trajectories. To do so, we developed Horizon GAIL, a novel curriculum-based method that builds on Generative Adversarial Imitation Learning (GAIL), a state-of-the-art LfD technique.

GAIL aims to learn a deep neural network policy  \pi_\theta  that cannot be distinguished from the expert policy  \pi_E . To this end, it trains a discriminator  D_{\phi} , also a deep neural network, to distinguish between state-action pairs coming from the expert and from the agent. GAIL optimises  \pi_\theta  to make this distinction as difficult as possible. Here,  D_{\phi}  outputs the probability that  (s, a)  originated from the agent,  \pi_\theta . Formally, the GAIL objective is:

\min_\theta \max_\phi \; \mathbb{E}_{\pi_\theta}\left[\log D_{\phi}(s, a)\right] + \mathbb{E}_{\pi_E}\left[\log\left(1 - D_{\phi}(s, a)\right)\right]
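
In code, the discriminator update implied by this objective is a binary cross-entropy loss. Below is a minimal PyTorch sketch under our own assumptions: `disc` is a stand-in network, the feature dimension is arbitrary, and the labelling convention (agent = 1, expert = 0) follows the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in discriminator: any network mapping (state, action) features to a logit.
disc = nn.Sequential(nn.Linear(10, 64), nn.Tanh(), nn.Linear(64, 1))

def discriminator_loss(agent_sa, expert_sa):
    """Binary cross-entropy form of the inner maximisation: push D_phi
    towards 1 on agent pairs and towards 0 on expert pairs."""
    agent_logit = disc(agent_sa)
    expert_logit = disc(expert_sa)
    return (F.binary_cross_entropy_with_logits(agent_logit,
                                               torch.ones_like(agent_logit))
            + F.binary_cross_entropy_with_logits(expert_logit,
                                                 torch.zeros_like(expert_logit)))

# Illustrative batch: 32 concatenated state-action feature vectors per source.
loss = discriminator_loss(torch.randn(32, 10), torch.randn(32, 10))
loss.backward()
```

The policy is then trained against a reward derived from the discriminator, e.g.  -\log D_{\phi}(s, a) , so the agent is pushed towards state-action pairs the discriminator believes are expert-like.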

Horizon GAIL bootstraps learning by starting rollouts from the expert’s states, which ensures a reliable reward signal from the discriminator. It further stabilises learning with a novel horizon curriculum that slowly increases the number of timesteps for which the agent interacts with the simulator.

Gradually moving from single-step state-action pairs to more difficult multi-step trajectories allows the generator and discriminator to jointly learn to generalise to longer sequences of behaviour and to match the expert data more closely, while ensuring the discriminator does not collapse early in training. We found that Horizon GAIL was critical to successfully reproducing naturalistic behaviour in our complex traffic intersection problem.
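
The curriculum itself can be as simple as a step schedule on the rollout length. The sketch below is illustrative; the constants are not the paper's hyperparameters:

```python
def horizon_schedule(iteration, start=1, step_every=500, max_horizon=200):
    """Rollout length at a given training iteration: begin with single-step
    rollouts and grow gradually towards full-length trajectories."""
    return min(start + iteration // step_every, max_horizon)

for it in (0, 500, 2500, 10**6):
    print(it, horizon_schedule(it))   # -> 1, 2, 6, 200
```
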

Simulating the scene

The following video shows the behaviour produced by our ViBe approach when simulating all of the cars in the scene. Our method yields stable, plausible trajectories with fewer collisions than any of the baseline methods.

Below you can see a bird’s-eye view of trajectories taken by agents trained using our Horizon GAIL approach, compared to the results achieved using baseline methods: Behavioural Cloning (BC) and variants of the state-of-the-art GAIL method. The right-hand image shows the ‘expert data’, i.e. the real-world behaviour extracted by our ViBe vision module. As you can see, only Horizon GAIL succeeds at producing realistic human behaviour. Furthermore, according to several quantitative metrics, our LfD method exhibits better and more stable learning than the baselines (please refer to the paper for the quantitative analysis).

Future directions

We have been extending this work to tackle increasingly complex and varied settings, to learn policies that generalise to new road layouts and contexts, and to learn different styles of behaviour, reflecting the diversity of behaviour in the real world. We are convinced that learning to model human behaviour from real, readily available data is the best way to ensure that simulation environments can be used to safely and scalably test and validate autonomous vehicles. If you share our excitement for research into these challenging problems at the cutting edge of machine learning, please get in touch!

Acknowledgements

Thanks to those who contributed to this paper, as well as our brilliant interns, in particular Rishabh Agarwal and Daniel Marta.   


This post is based on our recent paper:

Learning from Demonstration in the Wild
Feryal Behbahani, Kyriacos Shiarlis, Xi Chen, Vitaly Kurin, Sudhanshu Kasewa, Ciprian Stirbu, João Gomes, Supratik Paul, Frans A. Oliehoek, João Messias, Shimon Whiteson

See the accompanying video for a brief summary.