Module 0: Introduction
- Turing (1950) Computing Machinery and Intelligence
- Pomerleau (1988) ALVINN: An Autonomous Land Vehicle in a Neural Network
- [Video] History Channel (1998) Driverless Car Technology Overview at Carnegie Mellon University
- Smith & Gasser (2005) The Development of Embodied Cognition: Six Lessons from Babies
 
Module 1: Deep Learning for Structured Outputs
- Suggested readings:
 - LeCun (2006) A Tutorial on Energy-Based Learning
 - Girshick et al. (2013) Rich feature hierarchies for accurate object detection and semantic segmentation
 - Long et al. (2014) Fully Convolutional Networks for Semantic Segmentation
 - Zheng et al. (2015) Conditional Random Fields as Recurrent Neural Networks
 - Chen et al. (2016) DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
 - Kingma & Dhariwal (2018) Glow: Generative Flow with Invertible 1x1 Convolutions
 - Ho et al. (2020) Denoising Diffusion Probabilistic Models
 
- Additional readings:
 - Carion et al. (2020) End-to-End Object Detection with Transformers
 - Kamath et al. (2021) MDETR – Modulated Detection for End-to-End Multi-Modal Understanding
 - Cheng et al. (2021) Per-Pixel Classification is Not All You Need for Semantic Segmentation
 - Rombach et al. (2022) High-Resolution Image Synthesis with Latent Diffusion Models
 - Kirillov et al. (2023) Segment Anything
 - Bai et al. (2023) Sequential Modeling Enables Scalable Learning for Large Vision Models
 - Chi et al. (2023) Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
 
 
Module 2: 3D Vision and Mapping
- Suggested readings:
 - Fischer et al. (2015) FlowNet: Learning Optical Flow with Convolutional Networks
 - Godard et al. (2016) Unsupervised Monocular Depth Estimation with Left-Right Consistency
 - Qi et al. (2016) PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation
 - Tamar et al. (2016) Value Iteration Networks
 - Parisotto et al. (2017) Neural Map: Structured Memory for Deep Reinforcement Learning
 - Gupta et al. (2017) Cognitive Mapping and Planning for Visual Navigation
 
- Additional readings:
 - Chaplot et al. (2020) Neural Topological SLAM for Visual Navigation
 - Huang et al. (2022) FlowFormer: A Transformer Architecture for Optical Flow
 - Wu et al. (2023) Policy Pre-training for Autonomous Driving via Self-supervised Geometric Modeling
 - Sun et al. (2023) Dynamo-Depth: Fixing Unsupervised Depth Estimation for Dynamical Scenes
 - Yang et al. (2024) Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
 - Wang et al. (2025) Continuous 3D Perception Model with Persistent State
 
 
Module 3: Self-Supervised Representation Learning and Object Discovery
- Suggested readings:
 - Sermanet et al. (2017) Time-Contrastive Networks: Self-Supervised Learning from Video
 - Van den Oord et al. (2018) Representation Learning with Contrastive Predictive Coding
 - Wu et al. (2018) Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination
 - Chen et al. (2020) A Simple Framework for Contrastive Learning of Visual Representations
 - Grill et al. (2020) Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning
 - He et al. (2021) Masked Autoencoders Are Scalable Vision Learners
 
- Additional readings:
 - Weinzaepfel et al. (2022) CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow
 - Wang et al. (2022) Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut
 - Seo et al. (2022) Masked World Models for Visual Control
 - Venkataramanan et al. (2023) Is ImageNet Worth 1 Video? Learning Strong Image Encoders from 1 Long Unlabelled Video
 - van Steenkiste et al. (2024) Moving Off-the-Grid: Scene-Grounded Video Representations
 - Cui et al. (2024) DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control
 - Wang et al. (2024) PooDLe: Pooled and Dense Self-Supervised Learning from Naturalistic Videos
 
 
Module 4: World Models and End-to-End Planning
- Suggested readings:
 - Ross et al. (2011) A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
 - Kalchbrenner et al. (2016) Video Pixel Networks
 - Ha & Schmidhuber (2018) World Models
 - Haarnoja et al. (2018) Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
 - Srinivas et al. (2018) Universal Planning Networks
 - Sukhbaatar et al. (2018) Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play
 - Amos et al. (2018) Differentiable MPC for End-to-end Planning and Control
 - Hafner et al. (2019) Dream to Control: Learning Behaviors by Latent Imagination
 - Zeng et al. (2019) End-to-end Interpretable Neural Motion Planner
 
- Additional readings:
 - Liang et al. (2020) Learning Lane Graph Representations for Motion Forecasting
 - Casas et al. (2021) MP3: A Unified Model to Map, Perceive, Predict and Plan
 - Chaplot et al. (2021) Differentiable Spatial Planning using Transformers
 - Wu et al. (2022) DayDreamer: World Models for Physical Robot Learning
 - Yu et al. (2022) MAGVIT: Masked Generative Video Transformer
 - Hu et al. (2022) Planning-oriented Autonomous Driving
 - Dinev et al. (2022) Differentiable Optimal Control via Differential Dynamic Programming
 - Hafner et al. (2023) Mastering Diverse Domains through World Models
 - Hansen et al. (2023) TD-MPC2: Scalable, Robust World Models for Continuous Control
 - Hu et al. (2023) GAIA-1: A Generative World Model for Autonomous Driving
 - Chi et al. (2023) Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
 - Zhang et al. (2024) Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion
 - Casas et al. (2024) DeTra: A Unified Model for Object Detection and Trajectory Forecasting
 - Bruce et al. (2024) Genie: Generative Interactive Environments
 - Psenka et al. (2024) Learning a Diffusion Model Policy from Rewards via Q-Score Matching
 
 
Module 5: Continual Learning and Meta-Learning
- Suggested readings:
 - Marsland (2002) A Self-Organising Network that Grows when Required
 - Kirkpatrick et al. (2016) Overcoming catastrophic forgetting in neural networks
 - Rebuffi et al. (2016) iCaRL: Incremental Classifier and Representation Learning
 - Yoon et al. (2017) Lifelong Learning with Dynamically Expandable Networks
 - Nguyen et al. (2017) Variational Continual Learning
 - Van de Ven et al. (2020) Brain-Inspired Replay for Continual Learning with Artificial Neural Networks
 - Fei-Fei et al. (2006) One-Shot Learning of Object Categories
 - Lake et al. (2011) One-Shot Learning of Simple Visual Concepts
 - Snell et al. (2017) Prototypical Networks for Few-shot Learning
 - Finn et al. (2017) Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
 - James et al. (2018) Task-Embedded Control Networks for Few-Shot Imitation Learning
 - Brown et al. (2020) Language Models are Few-Shot Learners
 - Chen et al. (2021) Exploring Simple Meta-Learning for Few-Shot Learning
 
- Additional readings:
 - Javed & White (2019) Meta-Learning Representations for Continual Learning
 - Lake (2019) Compositional Generalization through Meta Sequence-to-Sequence Learning
 - Dohare et al. (2021) Continual Backprop: Stochastic Gradient Descent with Persistent Randomness
 - Wang et al. (2021) Learning to Prompt for Continual Learning
 - Ren et al. (2021) Wandering Within a World: Online Contextualized Few-Shot Learning
 - Alayrac et al. (2022) Flamingo: a Visual Language Model for Few-Shot Learning
 - Song et al. (2022) LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models
 - Powers et al. (2023) Evaluating Continual Learning on a Home Robot
 - Zhang et al. (2023) A Novel Visual Question Answering Continual Learning Setting
 - Lee et al. (2023) STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment
 - Majumder et al. (2023) CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization
 
 
Module 6: LLM Agents
- Suggested readings:
 - Langley et al. (2009) Cognitive architectures: Research issues and challenges
 - Misra et al. (2017) Mapping Instructions and Visual Observations to Actions with Reinforcement Learning
 - Anderson et al. (2018) Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments
 - Andreas (2022) Language Models as Agent Models
 - Sridhar et al. (2023) Cognitive Neuroscience Perspective on Memory: Overview and Summary
 
- Additional readings:
 - Ahn et al. (2022) Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
 - Sumers et al. (2023) Cognitive Architectures for Language Agents
 - Schick et al. (2023) Toolformer: Language Models Can Teach Themselves to Use Tools
 - Rana et al. (2023) SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning
 - Kim et al. (2024) ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments
 - Li et al. (2024) Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making