Welcome to CS 294-43: Large Scale Vision, Language and Action Models (Fall, 2025)
Course Logistics
- Lectures: Monday 3:00PM - 5:00PM PDT, Berkeley Way West, Room 8019
- Lecture Videos / Remote Participation: Lecture/seminar information will be posted on the course website, but to encourage discussion, lectures will not be recorded. We will provide a way for remote students to participate in the live lectures.
- Contact: Please use email for all course-related questions. To reach the course staff privately, use the addresses listed below.
Course Description
Perception, language, and action are becoming increasingly integrated in modern AI systems, enabling machines to interpret the world, reason visually, and interact through complex behaviors. The interplay of vision, language, and action has long tantalized AI researchers, and has recently begun to bear considerable fruit. As time and the interest of participants permit, this course will delve into advances in vision, language, and action models, focusing on the development of large-scale systems such as vision understanding and generation models, vision-language models (VLMs), and vision-language-action models (VLAs). It will explore cutting-edge techniques and topics including vision pre-training and its integration with language and robotics (as in VLMs and VLAs), long-context modeling, image and video generation, world models and their synergy with robotics, agents, and visual reasoning. Additionally, the course will cover data collection and analysis for large-scale models, and address critical ethical, interpretability, and security issues in AI systems. The overarching goal of the course is to equip students with the background to contribute to research in the rapidly evolving field of foundational vision, language, and action models.
Previous Offerings
Prerequisites
Instructor permission required. Students must have completed graduate-level computer vision, NLP, or robotics courses and be engaged in active research. Enrollment is limited to 30 participants. This course may be taken for variable units (2-4), may NOT be audited, and may be retaken for credit in different semesters, as the material changes from term to term. Priority for registration will be given to those taking the course for credit for at least two units, if they are eligible to do so; no priority distinction will be made between those requesting two versus more than two units. Postdocs and others who are not able to register may still be considered priority participants. Please fill out this request form to summarize your background and express your interest in joining the course. Permission codes will be sent to selected students to register for the course.
Course Format
Each week's session is structured around a central topic and question that guides both the readings and class discussion. The goal is to develop a deep, critical understanding of the latest research in large-scale vision, language, and action models, with a strong emphasis on student engagement and evolving areas.
Weekly Structure
- Lead Presenters: Each week, 1-2 students are responsible for leading the session. Leads will:
  - Introduce the week's central topic and framing question.
  - Prepare a 10-minute background overview.
  - Help select 3 key papers for the week.
  - Coordinate with volunteer presenters for each paper.
- Paper Presentations: For each of the 3 selected papers, a student volunteer presents a 15-minute deep dive (summary + analysis), followed by 5 minutes of discussion/questions.
- Class Discussion: After the paper presentations, the class will have a 20-minute open discussion focused on the week's framing question and connections between the papers.
  - Expect to be randomly called on if you haven't spoken.
- Preview & Wrap-Up: The last 10 minutes preview the next week's topic and central question, including a quick check-in about the following week's presenters and paper choices.
Class Schedule (110 minutes total)
| Time | Activity |
|---|---|
| 5 min | Introduction: Weekly Question / Topic |
| 10 min | Background Overview (Leads) |
| 3 x 20 min | Paper Presentations + Discussion (15+5 min each) |
| 20 min | Group Discussion (Framing Question) |
| 10 min | Next Week's Topic & Question Preview |
| 5 min | Logistics & Announcements |
Coursework
The following are the requirements for students taking the course for credit.
Points
Every student will be required to earn a total of at least two points throughout the semester. Points can be earned in the following ways:
- Leading a discussion on a week's topic (2 points)
- Presenting a paper in detail (1 point)
- Additional points may be granted by the instructor for class participation, helping with organization, and other efforts and contributions beyond leading and presenting.
For all students
In addition to earning at least two points, all students are required to complete the following:
- Active participation in class discussions
- Completion of a short response form before each class that summarizes the key idea of one or two assigned key papers and poses one critical question or suggested extension to the work. Additional optional papers will also be covered each week, but no response form is required for them.
For students taking the course for more than two units
In addition to the above requirements, students taking the course for more than two units will be required to complete a course project. The project can be completed individually or in groups of up to three students; expectations depend on the number of units:
- [For 3 units] A course project of one of the following types: new research results and a report judged suitable for submission to a CV, NLP, or NeurIPS workshop; a solid replication or reimplementation of existing work; an evaluation of existing work on a new dataset; or a literature survey. (Other formats are possible with permission of the instructor.)
- [For 4 units] A course project with new research results and a report judged suitable for acceptance at a top CV or NLP conference or journal venue, or a major new open source repository or dataset with high impact for the community.
Auditing
Unfortunately, we do not have the capacity to accommodate auditors in the course; however, course materials will be made available online after each meeting. You are welcome to take the course for two units on an S/U basis, which has a limited workload (see the coursework requirements above).
All students are welcome
We are committed to doing what we can to work for equity and to create an inclusive learning environment that actively values the diversity of backgrounds, identities, and experiences of everyone in the course. It is our expectation that all interactions with course staff and other students will demonstrate appropriate respect, consideration, and compassion for others. Please remember to be friendly and thoughtful; our community draws from a wide spectrum of valuable experiences. For further reading, please reference the Berkeley Principles of Community and Berkeley Campus Code of Student Conduct.
Special Accommodations
We will provide appropriate accommodations to all students enrolled in Berkeley's Disabled Students Program (DSP). To ensure that you receive the appropriate accommodations, have your DSP specialist submit a letter confirming your status and accommodations. If you're not enrolled in DSP, or are in the process of being onboarded by DSP, you may still be eligible for accommodations (such as extended time on exams or extended deadlines). You may also be eligible for accommodations if serious extenuating circumstances should come up during the semester. If you believe you may require accommodations, please contact us. All DSP and accommodations-related materials for this course are kept in a repository separate from the rest of the course materials that is visible only to the instructors, selected staff, and staff course managers. For any DSP and accommodations-related communications, please reach out to an instructor directly.
Well-Being and Mental Health
If you are experiencing personal, academic, or relationship problems and would like to talk to someone with training and experience, reach out to Counseling and Psychological Services (CAPS) on campus. CAPS is the university's counseling center dedicated to student mental health and wellbeing. Phone appointments can be made by calling (510) 642-9494; for more information, please visit the webpage at https://uhs.berkeley.edu/counseling. If you are in crisis, please call the 24/7 crisis line at (855) 817-5667.
AI Tools and Ethics
We expect that all material generated in this class, including code, reports, and presentations will adhere to the ACL policy on publication ethics. In particular, authors are responsible for all content submitted, and any use of generative AI tools and technologies to create content should be fully disclosed in the Acknowledgements section - for instance, "Section 3 was written with inputs from ChatGPT."
Schedule
| Week | Date | Topic | Leads | Papers |
|---|---|---|---|---|
| 1 | 09/01 | Labor Day | — | — |
| 2 | 09/08 | Vision Pre-training | Tony Lian, Baifeng Shi | DINOv3 TULIP (2503.15485) Perception Encoder (2504.13181) |
| 3 | 09/15 | Vision-Language / Unified Multimodal Models | Jiaxin Ge | Emerging Properties (2505.14683) BAGEL Mogao (2505.05472) Omni-Video (2507.06119) |
| 4 | 09/22 | Visual Reasoning | Pranav Atreya | DeepEyes (2505.14362) Mini-o3 (2509.07969) VLM-R1 (2504.07615) Learning Only with Images (2507.20766) |
| 5 | 09/29 | Visual Agents | Nathan McNaughton | VisualWebArena (2401.13649) PPTAgent (2501.03936) AutoPresent (2501.00912) ScreenCoder (2507.22827) |
| 6 | 10/06 | Image/Video Generation | Ayaan Haque | Rectified Flow Transformers (2403.03206) Seedance 1.0 (2506.09113) Qwen-Image (2508.02324) |
| 7 | 10/13 | Vision for Robotics (Humanoid/Manipulation) | Hiya Shah | EgoVLA Visual Imitation (2505.03729) HRP (2407.18911) |
| 8 | 10/20 | Vision-Language-Action (Humanoid/Manip.) | Ritvik Singh, Haritheja Etukuru | π0 Gemini Robotics 1.5 MolmoAct (2508.07917) |
| 9 | 10/27 | World Models for Robotics | William Liang, Jagdeep Bhatia | Scalable World Models (2509.24527) DINO-WM (2411.04983) GR-2 (2410.06158) V-JEPA 2 |
| 10 | 11/03 | RL Post-Training for VLM/VLAs | Kumar Krishna Agrawal | GRAPE (2411.19309) ConRFT (2502.05450) Self-Improving Embodied FMs (2509.15155) |
| 11 | 11/10 | 3D/4D Vision + Language, Efficient Models | Bhawna Paliwal, Anne Harrington | Brick Structures (2505.05469) Thinking in Space (2412.14171) SpatialVLM (2401.12168) Looped Language Models |
| 12 | 11/17 | Ethics & Bias | Marianna Elia, Genevieve Smith | Gender Shades Linguistic Bias in ChatGPT Gender Biases in T2I Models Women Also Snowboard |
| 13 | 11/24 | Human-VLM/VLA Interaction & Dialog | Téa Wright | VLM Common Ground Talk Less, Interact Better Shared Autonomy Agents M+ (MemoryLLM) |
| 14 | 12/01 | Project Presentations | — | — |
Contact
Please contact us by email:
Instructor
- Prof. Trevor Darrell (trevordarrell@berkeley.edu)
Discussant Provocateur
- Dr. David Chan (davidchan@berkeley.edu)
Seminar Coordinators
- Baifeng Shi (baifeng_shi@berkeley.edu)
- Long (Tony) Lian (longlian@berkeley.edu)