
The Multimodal Web Agent Challenge

Explore the fundamental challenges of web automation, including the limitations of text-only agents and the critical “grounding problem” that arises with multimodal models.

Problem space: Multimodal web agents

So far in this course, we’ve seen agents designed for specific, relatively contained environments. Now, we’re going to explore a new and incredibly complex environment for agents to operate in: the live web.

What is a web agent?

A web agent is an autonomous system designed to navigate and interact with real-world websites to complete tasks on behalf of a user. Unlike a simple web scraper, which only extracts data, a true web agent can act as a human user would: clicking buttons, typing into forms, and making decisions based on what it sees.

A web agent must be able to perform actions like a human, such as clicking buttons, to complete its tasks

Our goal in this chapter is to design an agent that, given a high-level instruction like “Find the cheapest flight from New York to London next Tuesday,” can autonomously browse a website and return a final answer.

The foundational tool: HTML

To perform any action, an agent must first understand the structure of a webpage. The primary source for this understanding is the website’s HTML (Hypertext Markup Language).

HTML is the skeleton of a webpage. For an agent, it is a valuable source of signals because it explicitly defines the interactive elements on a page. By parsing the HTML, an agent can do the following:

  • Identify elements: It can see every button, input field, link, and image on the page.

  • Understand element types: It knows the difference between a clickable <button> and a fillable <input type="text">. ...
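As a rough illustration of the two capabilities above, here is a minimal sketch that uses Python's standard-library html.parser to list the elements on a page and label each with its type. The class name InteractiveElementCollector, the helper find_interactive_elements, the TRACKED_TAGS set, and the sample page are all hypothetical names chosen for this example. A real agent would more likely read the live DOM through a browser automation tool than parse a static HTML string.

```python
from html.parser import HTMLParser

# Tags this sketch looks for (an assumption for illustration, not an exhaustive list).
TRACKED_TAGS = {"a", "button", "img", "input", "select", "textarea"}


class InteractiveElementCollector(HTMLParser):
    """Collects tracked elements and labels each one with a simple type string."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag not in TRACKED_TAGS:
            return
        attr_map = dict(attrs)
        if tag == "input":
            # An <input> with no type attribute behaves as a text field.
            kind = "input:" + (attr_map.get("type") or "text").lower()
        else:
            kind = tag
        self.elements.append({"kind": kind, "attrs": attr_map})


def find_interactive_elements(html):
    """Return the tracked elements found in an HTML string, in document order."""
    collector = InteractiveElementCollector()
    collector.feed(html)
    collector.close()
    return collector.elements


# Hypothetical page fragment, loosely modeled on a flight search form.
SAMPLE_PAGE = """
<form>
  <input type="text" name="origin">
  <input name="destination">
  <button id="search">Search flights</button>
  <a href="/deals">Deals</a>
</form>
"""

elements = find_interactive_elements(SAMPLE_PAGE)
for element in elements:
    print(element["kind"], element["attrs"])
```

The type label is what lets the agent choose between a click action (for a button or link) and a typing action (for a text input).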