VLAM-24

Material for the course Vision-Language-Action Models for Robotics and Autonomous Vehicles.

Alice Plebe, March 2024.

Slides, Recordings, and Exam

The lesson slides are available here. Video recordings of the lessons are available as well.

The exam is an oral presentation of a recent paper. The list of papers to choose from is available here.

Software

  • attention.ipynb contains:
    • an example of word embedding using GloVe,
    • an implementation of a multi-head attention layer from scratch (sketched below).
  • transformer.ipynb contains:
    • an example of a Tokenizer using SimpleBooks,
    • an implementation of a Transformer block with visualization of the attention matrix,
    • an implementation of a simple GPT model producing completions with different Samplers (see the sampling sketch below).
  • pendulum.ipynb contains an implementation of a Transformer for time-series prediction using a Time2Vec embedding (see the Time2Vec sketch below).
  • gpt4v.ipynb contains an example of using the OpenAI API to run inference with GPT-4V (see the API call sketch below).
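The multi-head attention implementation in attention.ipynb reduces to a handful of matrix operations. Below is a minimal NumPy sketch of a single forward pass (no masking, dropout, or batching); the function and parameter names are illustrative, and the notebook may structure the code differently.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """x: (seq_len, d_model); W_*: (d_model, d_model) projection matrices."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    # Project the inputs, then split the feature dimension into heads.
    def split(z):  # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return z.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ W_q), split(x @ W_k), split(x @ W_v)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq, seq)
    attn = softmax(scores, axis=-1)
    heads = attn @ v                                       # (n_heads, seq, d_head)

    # Concatenate the heads and apply the output projection.
    out = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ W_o, attn
```

Returning the attention weights alongside the output is what makes visualizing the attention matrix straightforward.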
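The different Samplers in the GPT example differ mainly in how logits are turned into a probability distribution before drawing the next token. A hedged sketch of temperature and top-k sampling (names are illustrative, not the notebook's API):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, seed=None):
    """Draw one token id from a vector of unnormalized logits."""
    rng = np.random.default_rng(seed)
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    if top_k is not None:
        # Mask out everything except the k most likely tokens.
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

Low temperatures sharpen the distribution towards greedy decoding, while top-k truncates the long tail of unlikely tokens.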
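Time2Vec (Kazemi et al., 2019), used in pendulum.ipynb, maps a scalar timestamp to one linear channel plus several periodic ones. A minimal NumPy sketch, with omega and phi standing in for the parameters that are learned in the actual model:

```python
import numpy as np

def time2vec(tau, omega, phi):
    """tau: (n,) timestamps; omega, phi: (k,) frequencies and phases.
    Returns an (n, k) embedding: channel 0 is linear, channels 1..k-1 are sinusoidal."""
    z = np.outer(tau, omega) + phi   # (n, k)
    z[:, 1:] = np.sin(z[:, 1:])      # keep channel 0 linear to capture trend
    return z

# Example: embed 100 timestamps into 8 channels with dummy parameters.
emb = time2vec(np.linspace(0.0, 10.0, 100), omega=np.ones(8), phi=np.zeros(8))
```

In a time-series Transformer, such an embedding typically plays the role of the positional encoding.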
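The GPT-4V example essentially amounts to one chat-completions call that includes an image in the message content. A minimal sketch assuming the openai>=1.0 Python client and an OPENAI_API_KEY environment variable; the model name, prompt, and image URL below are placeholders, not necessarily those used in the notebook:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",   # assumed model name; check the notebook / OpenAI docs
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the driving scene in this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/scene.jpg"}},
        ],
    }],
    max_tokens=200,
)
print(response.choices[0].message.content)
```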

Downloads

Files used in the software:

Papers

List of papers cited during the lessons, and more.

Foundations of Language models

  • T. Mikolov et al., "Distributed Representations of Words and Phrases and their Compositionality", NeurIPS (2013).
  • J. Pennington et al., "GloVe: Global Vectors for Word Representation", EMNLP (2014).
  • D. Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", arXiv (2014).
  • P. Christiano et al., "Deep Reinforcement Learning from Human Preferences", NeurIPS (2017).
  • A. Vaswani et al., "Attention Is All You Need", NeurIPS (2017).
  • A. Radford et al., "Improving Language Understanding by Generative Pre-Training" (2018).
  • A. Radford et al., "Language Models are Unsupervised Multitask Learners" (2019).
  • S. Kazemi et al., "Time2Vec: Learning a Vector Representation of Time", arXiv (2019).
  • T. Brown et al., "Language Models are Few-Shot Learners", NeurIPS (2020).
  • D. Ziegler et al., "Fine-Tuning Language Models from Human Preferences", arXiv (2020).
  • L. Ouyang et al., "Training language models to follow instructions with human feedback", arXiv (2022).
  • G. Franceschelli & M. Musolesi, "Reinforcement Learning for Generative AI: State of the Art, Opportunities and Open Research Challenges", Journal of Artificial Intelligence Research (2024).

Foundations of Multimodal Language models

  • A. Ramesh et al., "Zero-Shot Text-to-Image Generation", PMLR (2021).
  • A. Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR (2021).
  • C. Jia et al., "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision", PMLR (2021).
  • A. Radford et al., "Learning Transferable Visual Models From Natural Language Supervision", PMLR (2021).
  • M. Tsimpoukelli et al., "Multimodal Few-Shot Learning with Frozen Language Models", NeurIPS (2021).
  • Z. Yang et al., "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)", arXiv (2023).

Language models for Autonomous driving

  • X. Li et al., "Towards Knowledge-driven Autonomous Driving", arXiv (2023).
  • S. Wang et al., "ChatGPT as Your Vehicle Co-Pilot: An Initial Attempt", IEEE T-IV (2023).
  • L. Wen et al., "On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving", arXiv (2023).
  • C. Cui et al., "A Survey on Multimodal Large Language Models for Autonomous Driving", WACV (2024).
  • D. Fu et al., "Drive Like a Human: Rethinking Autonomous Driving with Large Language Models", WACV (2024).
  • S. Luo et al., "Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives", arXiv (2024).

Language models for Robotics

  • M. Ahn et al., "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances", arXiv (2022).
  • W. Huang et al., "Inner Monologue: Embodied Reasoning through Planning with Language Models", arXiv (2022).
  • A. Brohan et al., "RT-1: Robotics Transformer for Real-world Control at Scale", arXiv (2023).
  • A. Brohan et al., "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control", arXiv (2023).
  • R. Firoozi et al., "Foundation Models in Robotics: Applications, Challenges, and the Future", arXiv (2023).
  • S. Vemprala et al., "ChatGPT for Robotics: Design Principles and Model Abilities", arXiv (2023).
  • M. Ahn et al., "AutoRT: Embodied Foundation Models For Large Scale Orchestration of Robotic Agents", arXiv (2024).
