The Multimodal Web Agent Challenge
Explore the fundamental challenges of web automation, including the limitations of text-only agents and the critical “grounding problem” that arises with multimodal models.
Problem space: Multimodal web agents
So far in this course, we’ve seen agents designed for specific, relatively contained environments. Now, we’re going to explore a new and incredibly complex environment for agents to operate in: the live web.
What is a web agent?
A web agent is an autonomous system designed to navigate and interact with real-world websites to complete tasks on behalf of a user. Unlike a simple web scraper that only extracts data, a true web agent can act as a human user would: clicking buttons, typing into forms, and making decisions based on what it sees.
Our goal in this chapter is to design an agent that, given a high-level instruction like “Find the cheapest flight from New York to London next Tuesday,” can autonomously browse a website and return a final answer.
The foundational tool: HTML
To perform any action, an agent must first understand the structure of a webpage. The primary source for this understanding is the website’s HTML (Hypertext Markup Language).
HTML is the skeleton of a webpage. For an agent, it’s a valuable source of signals because it explicitly defines the interactive elements on a page. By parsing the HTML, an agent can do the following:
Identify elements: It can see every button, input field, link, and image on the page.
Understand element types: It knows the difference between a clickable <button> and a fillable <input type="text">. ...
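To make the points above concrete, here is a minimal sketch of how an agent might pull elements out of raw HTML using Python's standard-library html.parser module. The names ElementCollector, find_elements, TRACKED_TAGS, CLICKABLE_INPUT_TYPES, and the sample flight-search markup are all hypothetical, chosen only for illustration; the set of tracked tags and the click/type labels are simplifying assumptions, not a complete model of what is interactive on a real page.

```python
from html.parser import HTMLParser

# Assumption: a small, non-exhaustive set of tags the agent cares about.
TRACKED_TAGS = {"a", "button", "input", "img"}
# Assumption: input types that are clicked rather than typed into.
CLICKABLE_INPUT_TYPES = {"button", "submit", "reset", "checkbox", "radio"}


class ElementCollector(HTMLParser):
    """Collects tracked elements and labels each with a likely action."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag not in TRACKED_TAGS:
            return
        attr_map = dict(attrs)
        self.elements.append(
            {"tag": tag, "action": self._action_for(tag, attr_map), "attrs": attr_map}
        )

    @staticmethod
    def _action_for(tag, attr_map):
        if tag in ("a", "button"):
            return "click"
        if tag == "input":
            # An <input> with no type attribute defaults to a text field.
            input_type = (attr_map.get("type") or "text").lower()
            return "click" if input_type in CLICKABLE_INPUT_TYPES else "type"
        return None  # e.g. images: visible, but no action assigned here


def find_elements(html):
    """Return the tracked elements found in an HTML string, in document order."""
    collector = ElementCollector()
    collector.feed(html)
    collector.close()
    return collector.elements


# Hypothetical markup for a flight-search form.
SAMPLE_HTML = """
<form>
  <input type="text" name="origin">
  <input type="text" name="destination">
  <button id="search">Search flights</button>
  <a href="/deals">Deals</a>
  <img src="/logo.png" alt="Airline logo">
</form>
"""

if __name__ == "__main__":
    for element in find_elements(SAMPLE_HTML):
        print(element["tag"], element["action"], element["attrs"])
```

Each entry records the tag, a suggested action, and the element's attributes, which is roughly the information an agent would need before deciding what to click or fill in. A real agent would more likely work from the browser's live DOM than from a static HTML string, so treat this as a sketch of the idea only.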