Material for the course Vision-Language-Action Models for Robotics and Autonomous Vehicles.
Alice Plebe, March 2024.
The lesson slides are available here. Video recordings of the lessons are available as well.
The exam is an oral presentation of a recent paper. The list of papers to choose from is available here.
attention.ipynb
contains:
- an example of word embedding using GloVe,
- an implementation of a multi-head attention layer from scratch (a NumPy sketch follows).
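For reference, below is a minimal NumPy sketch of scaled dot-product attention and its multi-head extension; the shapes and variable names are illustrative and need not match the notebook's own implementation.

```python
# Minimal sketch of (multi-head) scaled dot-product attention in NumPy.
# Illustrative only: names and shapes are assumptions, not the notebook's code.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with a numerically stable softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (batch, seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                           # output, attention matrix

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Project X into n_heads subspaces, attend in each, concatenate, project."""
    batch, seq, d_model = X.shape
    d_head = d_model // n_heads
    def split(T):  # (batch, seq, d_model) -> (batch * n_heads, seq, d_head)
        return (T.reshape(batch, seq, n_heads, d_head)
                 .transpose(0, 2, 1, 3)
                 .reshape(batch * n_heads, seq, d_head))
    out, _ = scaled_dot_product_attention(split(X @ W_q), split(X @ W_k), split(X @ W_v))
    out = (out.reshape(batch, n_heads, seq, d_head)       # undo the head split
              .transpose(0, 2, 1, 3)
              .reshape(batch, seq, d_model))
    return out @ W_o

# Example: batch of 2 sequences of length 10, model width 64, 4 heads.
X = np.random.randn(2, 10, 64)
W_q, W_k, W_v, W_o = (0.1 * np.random.randn(64, 64) for _ in range(4))
Y = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads=4)  # shape (2, 10, 64)
```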
transformer.ipynb
contains:
- an example of a Tokenizer using SimpleBooks,
- an implementation of a Transformer block with visualization of the attention matrix,
- an implementation of a simple GPT model producing completions with different Samplers (a sampler sketch follows the list).
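To illustrate what the Samplers do, here is a minimal sketch of three common decoding strategies (greedy, temperature, top-k) applied to a vector of next-token logits; the notebook's sampler classes may be organized differently.

```python
# Minimal sketch of next-token sampling strategies over a logits vector.
# Illustrative only: the notebook's Sampler classes may differ in structure.
import numpy as np

rng = np.random.default_rng()

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def greedy(logits):
    """Always pick the most likely token (deterministic)."""
    return int(np.argmax(logits))

def temperature_sample(logits, temperature=0.8):
    """Lower temperature sharpens the distribution; higher flattens it."""
    p = softmax(logits / temperature)
    return int(rng.choice(len(p), p=p))

def top_k_sample(logits, k=10):
    """Restrict sampling to the k most likely tokens, renormalized."""
    top = np.argsort(logits)[-k:]
    return int(top[rng.choice(k, p=softmax(logits[top]))])
```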
pendulum.ipynb
contains an implementation of a Transformer for time-series prediction using Time2Vec embedding (a minimal Time2Vec layer is sketched below).
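For reference, a minimal Keras sketch of a Time2Vec layer (one linear component plus k periodic sine components, following Kazemi et al., 2019) could look as follows; the notebook's version may differ in details.

```python
# Minimal sketch of a Time2Vec layer (Kazemi et al., 2019) in Keras.
# Illustrative only: hyperparameters and naming are assumptions.
import tensorflow as tf

class Time2Vec(tf.keras.layers.Layer):
    def __init__(self, k, **kwargs):
        super().__init__(**kwargs)
        self.k = k  # number of periodic (sine) components

    def build(self, input_shape):
        # One (omega, phi) pair for the linear term plus k pairs for the sines.
        self.w = self.add_weight(name="w", shape=(1, self.k + 1),
                                 initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(1, self.k + 1),
                                 initializer="zeros")

    def call(self, t):
        # t: (batch, steps, 1) scalar times -> (batch, steps, k + 1) embedding
        v = t * self.w + self.b
        return tf.concat([v[..., :1], tf.sin(v[..., 1:])], axis=-1)

# Example: embed 100 timestamps per sequence into 16-dimensional vectors.
t = tf.random.uniform((8, 100, 1))
emb = Time2Vec(k=15)(t)  # shape (8, 100, 16)
```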
gpt4v.ipynb
contains an example of using the OpenAI API to run inference with GPT-4V (a minimal call is sketched below).
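A minimal sketch of such a call with the official openai Python client (v1 API) is shown below; the model id and prompt are illustrative, the image is one of the sample files listed in the next section, and OPENAI_API_KEY must be set in the environment.

```python
# Minimal sketch of a GPT-4V call via the OpenAI Python client (v1 API).
# Illustrative only: model id and prompt are assumptions; requires the
# OPENAI_API_KEY environment variable to be set.
import base64
from openai import OpenAI

with open("image_01.jpg", "rb") as f:  # one of the sample images below
    b64_image = base64.b64encode(f.read()).decode()

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the scene in this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}},
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```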
Files used by the notebooks:
glove.6B.50d.txt
simplebooks-92-raw_train.txt
simplebooks-92-raw_valid.txt
simplebooks-92-raw_test.txt
vocab_10000.txt
transf_v10000_s128_h4_e50.h5
transf_v10000_s128_h4_e200.h5
gpt_v10000_s128_l2_h4_e50.h5
gpt_v10000_s128_l2_h4_e100.h5
pend_t2v_e2000.h5
pend.zip
image_01.jpg
image_02.jpg
image_03.jpg
image_04.jpg
List of papers cited during the lessons, plus further reading.
- T. Mikolov et al., "Distributed Representations of Words and Phrases and their Compositionality", NeurIPS (2013).
- J. Pennington et al., "GloVe: Global Vectors for Word Representation", EMNLP (2014).
- D. Bahdanau et al., "Neural machine translation by jointly learning to align and translate", arXiv (2014).
- P. Christiano et al., "Deep Reinforcement Learning from Human Preferences", NeurIPS (2017).
- A. Vaswani et al., "Attention Is All You Need", NeurIPS (2017).
- A. Radford et al., "Improving Language Understanding by Generative Pre-Training" (2018).
- A. Radford et al., "Language Models are Unsupervised Multitask Learners" (2019).
- S. Kazemi et al., "Time2Vec: Learning a Vector Representation of Time", arXiv (2019).
- T. Brown et al., "Language Models are Few-Shot Learners", NeurIPS (2020).
- D. Ziegler et al., "Fine-Tuning Language Models from Human Preferences", arXiv (2019).
- L. Ouyang et al., "Training language models to follow instructions with human feedback", arXiv (2022).
- G. Franceschelli & M. Musolesi, "Reinforcement Learning for Generative AI: State of the Art, Opportunities and Open Research Challenges", Journal of Artificial Intelligence Research (2024).
- A. Ramesh et al., "Zero-Shot Text-to-Image Generation", PMLR (2021).
- A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale", ICLR (2021).
- C. Jia et al., "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision", PMLR (2021).
- A. Radford et al., "Learning Transferable Visual Models From Natural Language Supervision", PMLR (2021).
- M. Tsimpoukelli et al., "Multimodal Few-Shot Learning with Frozen Language Models", NeurIPS (2021).
- Z. Yang et al., "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)", arXiv (2023).
- X. Li et al., "Towards Knowledge-driven Autonomous Driving", arXiv (2023).
- S. Wang et al., "ChatGPT as Your Vehicle Co-Pilot: An Initial Attempt", IEEE T-IV (2023).
- L. Wen et al., "On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving", arXiv (2023).
- C. Cui et al., "A Survey on Multimodal Large Language Models for Autonomous Driving", WACV (2024).
- D. Fu et al., "Drive like a human: Rethinking autonomous driving with large language models", WACV (2024).
- S. Luo et al., "Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives", arXiv (2024).
- M. Ahn et al., "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances", arXiv (2022).
- W. Huang et al., "Inner Monologue: Embodied Reasoning through Planning with Language Models", arXiv (2022).
- A. Brohan et al., "RT-1: Robotics Transformer for Real-world Control at Scale", arXiv (2023).
- A. Brohan et al., "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control", arXiv (2023).
- R. Firoozi et al., "Foundation Models in Robotics: Applications, Challenges, and the Future", arXiv (2023).
- S. Vemprala et al., "ChatGPT for Robotics: Design Principles and Model Abilities", arXiv (2023).
- M. Ahn et al., "AutoRT: Embodied Foundation Models For Large Scale Orchestration of Robotic Agents", arXiv (2024).