Module 0: Introduction
- Turing (1950) Computing Machinery and Intelligence
- Pomerleau (1988) ALVINN: An Autonomous Land Vehicle in a Neural Network
- [Video] History Channel (1998) Driverless Car Technology Overview at Carnegie Mellon University
- Smith & Gasser (2005) The Development of Embodied Cognition: Six Lessons from Babies
Module 1: Deep Learning for Structured Outputs
- Suggested readings:
- LeCun et al. (2006) A Tutorial on Energy-Based Learning
- Girshick et al. (2013) Rich feature hierarchies for accurate object detection and semantic segmentation
- Long et al. (2014) Fully Convolutional Networks for Semantic Segmentation
- Zheng et al. (2015) Conditional Random Fields as Recurrent Neural Networks
- Chen et al. (2016) DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
- Kingma & Dhariwal (2018) Glow: Generative Flow with Invertible 1x1 Convolutions
- Ho et al. (2020) Denoising Diffusion Probabilistic Models
- Additional readings:
- Carion et al. (2020) End-to-End Object Detection with Transformers
- Kamath et al. (2021) MDETR – Modulated Detection for End-to-End Multi-Modal Understanding
- Cheng et al. (2021) Per-Pixel Classification is Not All You Need for Semantic Segmentation
- Rombach et al. (2022) High-Resolution Image Synthesis with Latent Diffusion Models
- Kirillov et al. (2023) Segment Anything
- Bai et al. (2023) Sequential Modeling Enables Scalable Learning for Large Vision Models
- Chi et al. (2023) Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
Module 2: 3D Vision and Mapping
- Suggested readings:
- Fischer et al. (2015) FlowNet: Learning Optical Flow with Convolutional Networks
- Godard et al. (2016) Unsupervised Monocular Depth Estimation with Left-Right Consistency
- Qi et al. (2016) PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation
- Tamar et al. (2016) Value Iteration Networks
- Parisotto & Salakhutdinov (2017) Neural Map: Structured Memory for Deep Reinforcement Learning
- Gupta et al. (2017) Cognitive Mapping and Planning for Visual Navigation
- Additional readings:
- Chaplot et al. (2020) Neural Topological SLAM for Visual Navigation
- Huang et al. (2022) FlowFormer: A Transformer Architecture for Optical Flow
- Wu et al. (2023) Policy Pre-training for Autonomous Driving via Self-supervised Geometric Modeling
- Sun & Hariharan (2023) Dynamo-Depth: Fixing Unsupervised Depth Estimation for Dynamical Scenes
- Yang et al. (2024) Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
- Wang et al. (2025) Continuous 3D Perception Model with Persistent State
Module 3: Self-Supervised Representation Learning and Object Discovery
- Suggested readings:
- Sermanet et al. (2017) Time-Contrastive Networks: Self-Supervised Learning from Video
- Van den Oord et al. (2018) Representation Learning with Contrastive Predictive Coding
- Wu et al. (2018) Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination
- Chen et al. (2020) A Simple Framework for Contrastive Learning of Visual Representations
- Grill et al. (2020) Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning
- He et al. (2021) Masked Autoencoders Are Scalable Vision Learners
- Additional readings:
- Weinzaepfel et al. (2022) CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow
- Wang et al. (2022) Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut
- Seo et al. (2022) Masked World Models for Visual Control
- Venkataramanan et al. (2023) Is ImageNet Worth 1 Video? Learning Strong Image Encoders from 1 Long Unlabelled Video
- van Steenkiste et al. (2024) Moving Off-the-Grid: Scene-Grounded Video Representations
- Cui et al. (2024) DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control
- Wang et al. (2024) PooDLe: Pooled and Dense Self-Supervised Learning from Naturalistic Videos
Module 4: World Models and End-to-End Planning
- Suggested readings:
- Ross et al. (2011) A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
- Kalchbrenner et al. (2016) Video Pixel Networks
- Ha & Schmidhuber (2018) World Models
- Haarnoja et al. (2018) Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
- Srinivas et al. (2018) Universal Planning Networks
- Sukhbaatar et al. (2018) Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play
- Amos et al. (2018) Differentiable MPC for End-to-end Planning and Control
- Hafner et al. (2019) Dream to Control: Learning Behaviors by Latent Imagination
- Zeng et al. (2019) End-to-end Interpretable Neural Motion Planner
- Additional readings:
- Liang et al. (2020) Learning Lane Graph Representations for Motion Forecasting
- Casas et al. (2021) MP3: A Unified Model to Map, Perceive, Predict and Plan
- Chaplot et al. (2021) Differentiable Spatial Planning using Transformers
- Wu et al. (2022) DayDreamer: World Models for Physical Robot Learning
- Yu et al. (2022) MAGVIT: Masked Generative Video Transformer
- Hu et al. (2022) Planning-oriented Autonomous Driving
- Dinev et al. (2022) Differentiable Optimal Control via Differential Dynamic Programming
- Hafner et al. (2023) Mastering Diverse Domains through World Models
- Hansen et al. (2023) TD-MPC2: Scalable, Robust World Models for Continuous Control
- Hu et al. (2023) GAIA-1: A Generative World Model for Autonomous Driving
- Chi et al. (2023) Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
- Zhang et al. (2024) Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion
- Casas et al. (2024) DeTra: A Unified Model for Object Detection and Trajectory Forecasting
- Bruce et al. (2024) Genie: Generative Interactive Environments
- Psenka et al. (2024) Learning a Diffusion Model Policy from Rewards via Q-Score Matching
Module 5: Continual Learning and Meta-Learning
- Suggested readings:
- Marsland et al. (2002) A Self-Organising Network that Grows when Required
- Kirkpatrick et al. (2016) Overcoming catastrophic forgetting in neural networks
- Rebuffi et al. (2016) iCaRL: Incremental Classifier and Representation Learning
- Yoon et al. (2017) Lifelong Learning with Dynamically Expandable Networks
- Nguyen et al. (2017) Variational Continual Learning
- Van de Ven et al. (2020) Brain-Inspired Replay for Continual Learning with Artificial Neural Networks
- Fei-Fei et al. (2006) One-Shot Learning of Object Categories
- Lake et al. (2011) One-Shot Learning of Simple Visual Concepts
- Snell et al. (2017) Prototypical Networks for Few-shot Learning
- Finn et al. (2017) Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
- James et al. (2018) Task-Embedded Control Networks for Few-Shot Imitation Learning
- Brown et al. (2020) Language Models are Few-Shot Learners
- Chen et al. (2021) Meta-Baseline: Exploring Simple Meta-Learning for Few-Shot Learning
- Additional readings:
- Javed & White (2019) Meta-Learning Representations for Continual Learning
- Lake (2019) Compositional Generalization through Meta Sequence-to-Sequence Learning
- Dohare et al. (2021) Continual Backprop: Stochastic Gradient Descent with Persistent Randomness
- Wang et al. (2021) Learning to Prompt for Continual Learning
- Ren et al. (2021) Wandering Within a World: Online Contextualized Few-Shot Learning
- Alayrac et al. (2022) Flamingo: a Visual Language Model for Few-Shot Learning
- Song et al. (2022) LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models
- Powers et al. (2023) Evaluating Continual Learning on a Home Robot
- Zhang et al. (2023) VQACL: A Novel Visual Question Answering Continual Learning Setting
- Lee et al. (2023) STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment
- Majumder et al. (2023) CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization
Module 6: LLM Agents
- Suggested readings:
- Langley et al. (2009) Cognitive architectures: Research issues and challenges
- Misra et al. (2017) Mapping Instructions and Visual Observations to Actions with Reinforcement Learning
- Anderson et al. (2018) Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments
- Andreas (2022) Language Models as Agent Models
- Sridhar et al. (2023) Cognitive Neuroscience Perspective on Memory: Overview and Summary
- Additional readings:
- Ahn et al. (2022) Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
- Sumers et al. (2023) Cognitive Architectures for Language Agents
- Schick et al. (2023) Toolformer: Language Models Can Teach Themselves to Use Tools
- Rana et al. (2023) SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning
- Kim et al. (2024) ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments
- Li et al. (2024) Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making