Search engines have long been the entry point for vast amounts of information on the Internet. They represent content in an index to answer the question, "What information is on this page?" Yet, webpages are not just static repositories of information. They are dynamic interfaces designed for interaction.
At Silverstream, we are deploying agents at internet scale to understand the modern web through action-based knowledge rather than content-oriented indexing. We are building a digital world model that our agents can leverage to understand the consequences of their actions. This is a critical element in developing safe, reliable web agents.
Search engines have indexed all kinds of web content - text, photos, and multimedia - and ranked it using metrics like authority and novelty. Initially, they relied on simple keyword matching and content analysis. Google's PageRank algorithm introduced link analysis to rank pages based on their authority.
In the 2000s, Google added new indexing features like "freshness"—to prefer recent content—and personalized results based on user history. The "Caffeine" update (2010) allowed for faster, incremental indexing and "Universal Search" blended results from multiple formats (videos, images, etc.). The "Knowledge Graph" (2012) and "Hummingbird" (2013) updates helped to better understand entities, relationships, and user intent.
Subsequent updates have continued this trajectory.
Despite these advances, indexing algorithms still focus only on content. They answer, "What information is on this page?" but not, "What can I do with this page?"
The web's time as a static knowledge repository is long gone. Web pages are now designed for interaction. Software as a Service has strengthened the web as an interface for doing things, where actions and workflows are critical to the user experience. Traditional indexing methods overlook this dynamic nature.
We wondered:
"Why not systematically organize behaviors, workflows, and traces in a new indexing system - one was designed to answer the question, "What can I do with this page?"
Achieving this at scale raises several questions.
Rich Sutton's "The Bitter Lesson" points to a fundamental truth: general methods that scale with computation ultimately outperform manually engineered solutions.
Applying this lesson to our agent architecture, we avoid local, manually engineered optimizations in favor of methods that improve with data and compute.
Web-browsing agents will be our "Google Bots" for understanding web interfaces. These agents actively interact with web pages, mirroring actual user behavior, and in doing so they enable the creation of a dynamic knowledge graph that captures the cause-and-effect links between what users do and how websites respond.
The distinction between web browsing agents and traditional scrapers is straightforward. These agents have both intelligence and the ability to take action. This capability enables them to build world models that understand what can be done on a webpage and the consequences of different interactions.
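As a rough sketch of what such a cause-and-effect graph could look like in code (illustrative only, not our production data model), assume each page state has an identifier and each observed transition is recorded by a browsing agent:

```python
from collections import defaultdict

class ActionEffectGraph:
    """Illustrative graph linking (page state, action) pairs to observed outcomes."""

    def __init__(self):
        # (state_id, action) -> list of resulting state_ids observed by agents
        self.edges = defaultdict(list)

    def record(self, state_id: str, action: str, next_state_id: str) -> None:
        """Store one observed transition collected during browsing."""
        self.edges[(state_id, action)].append(next_state_id)

    def likely_outcomes(self, state_id: str, action: str) -> dict:
        """Empirical distribution over outcomes seen for this action."""
        outcomes = self.edges.get((state_id, action), [])
        total = len(outcomes)
        return {s: outcomes.count(s) / total for s in set(outcomes)} if total else {}

graph = ActionEffectGraph()
graph.record("cart", "click:#checkout", "payment_form")
graph.record("cart", "click:#checkout", "login_wall")
print(graph.likely_outcomes("cart", "click:#checkout"))
# e.g. {'payment_form': 0.5, 'login_wall': 0.5}
```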
Tracking user behavior has been important for years, as shown by tools like tracking pixels, Google Analytics, Clearbit, and Clarity. This information has typically been used for advertising and has been collected separately by the owners of different touchpoints.
To illustrate the difference, consider the analogy of shopping in a new supermarket. A traditional crawler is like someone who, before buying anything, makes a complete inventory of the store. They walk through every aisle, note where each item is, and create a detailed map of the store's layout.
In contrast, agents are like experienced shoppers who know how supermarkets work. When given a task—like buying ingredients for a recipe—they look at the signs and navigate straight to the right items using their prior knowledge of typical store layouts.
They don't need to explore every aisle; they choose where to look based on how supermarkets usually organize their products.
This difference in approach extends beyond simple navigation. Agents adapt to environmental changes, understand the context of their actions, and change their behavior based on feedback from the environment around them.
Web agents can complete tasks by interacting directly with web interfaces, without relying on APIs. They gain a deep understanding of web page structure through DOM distillation and vision models, and they rely on language models for reasoning and for handling multiple types of input.
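As a toy illustration of what we mean by DOM distillation, the sketch below reduces raw HTML to the interactive elements an agent could act on. It assumes BeautifulSoup for parsing and is far simpler than what a production agent would actually use:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

INTERACTIVE_TAGS = ["a", "button", "input", "select", "textarea"]

def distill_dom(html: str) -> list:
    """Toy DOM distillation: keep only interactive elements, plus the
    attributes an agent would need to refer back to them."""
    soup = BeautifulSoup(html, "html.parser")
    elements = []
    for tag in soup.find_all(INTERACTIVE_TAGS):
        elements.append({
            "tag": tag.name,
            "id": tag.get("id"),
            "text": tag.get_text(strip=True),
            "type": tag.get("type"),
        })
    return elements

html = '<div><p>Welcome</p><button id="submit-order">Place order</button></div>'
print(distill_dom(html))
# [{'tag': 'button', 'id': 'submit-order', 'text': 'Place order', 'type': None}]
```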
The timing of this technology's advancement is particularly significant. Thanks to substantial improvements in the available models, these agents have only reached production-level readiness in recent months.
Agents now have broader context, make better decisions, and learn from their interactions using memory mechanisms in ways that were not possible before.
Attribution algorithms used in advertising assign value to the specific actions a user takes in a workflow that leads to a conversion. They don't focus on a single action on a single page; instead, they look at all the touchpoints involved in reaching the conversion. Typically, all credit for the critical action, called a key event, goes to the last ad a customer clicked. But did that ad alone convince them to take that key step on the path to conversion? What about the other ads they clicked before it?
An attribution algorithm distributes credit for a conversion (such as a purchase or sign-up) across the marketing touchpoints a customer interacts with. Marketers can improve campaign effectiveness by identifying which channels played the most significant role in the conversion.
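For example, a simple position-based (U-shaped) rule splits credit as in the sketch below. The weights and channel names are illustrative, one common scheme among several (last-click, linear, data-driven), not a prescription:

```python
def position_based_attribution(touchpoints: list) -> dict:
    """Split conversion credit with a U-shaped rule:
    40% to the first touchpoint, 40% to the last, 20% spread over the middle."""
    n = len(touchpoints)
    if n == 0:
        return {}
    if n == 1:
        return {touchpoints[0]: 1.0}
    if n == 2:
        weights = [0.5, 0.5]
    else:
        weights = [0.4] + [0.2 / (n - 2)] * (n - 2) + [0.4]
    credit = {}
    for tp, w in zip(touchpoints, weights):
        credit[tp] = credit.get(tp, 0.0) + w  # accumulate if a channel repeats
    return credit

print(position_based_attribution(["search_ad", "newsletter", "retargeting_ad"]))
# {'search_ad': 0.4, 'newsletter': 0.2, 'retargeting_ad': 0.4}
```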
The same approach can be used to prioritize actionable touchpoints and elements in a web trace. For example, knowledge workers typically perform many end-to-end workflows that depend on a few key page elements.
As a guiding example, consider an IT support specialist who is tasked with onboarding new hires.
Each day, he logs into the organization's ERP platform. On the landing page, he sees a dashboard showing critical statistics of the work assigned to him. Using the workspace's menu, he navigates to a list of requests to be fulfilled. He then filters the list to extract all assigned requests and sorts them by priority.
Finally, he processes each request by filling out forms to create new user profiles and using the service catalog to order laptops. The dashboard, the list, and the forms will be the key, high-priority elements in this flow of interactions.
Our paper, "WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?", which we co-authored with ServiceNow, extensively explores this ERP application.
By taking inspiration from Prioritized Experience Replay (PER), we can use these key events as an underlying weighting method for building the index. PER is a reinforcement learning technique that prioritizes more valuable experiences, such as those with higher prediction errors, for training. In standard experience replay, experiences are sampled uniformly from the buffer, giving all of them equal weight. PER, in contrast, focuses on experiences that offer greater learning value, making the learning process faster and more efficient.
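Here is a minimal sketch of how such key-event weights could drive prioritized sampling, loosely following PER's proportional scheme. The trace steps and priorities are invented for illustration, and a full PER implementation would also correct the resulting bias with importance-sampling weights:

```python
import random

# Hypothetical priorities: steps that touch key events (dashboard, request
# list, profile and order forms) get larger weights, echoing how PER favours
# high-error transitions over routine ones.
trace_priorities = {
    "open_dashboard": 3.0,
    "filter_request_list": 2.0,
    "fill_user_profile_form": 4.0,
    "order_laptop_from_catalog": 4.0,
    "browse_news_widget": 0.5,
}

def sample_trace_steps(priorities: dict, k: int, alpha: float = 0.6) -> list:
    """Sample k steps with probability proportional to priority**alpha,
    as in proportional prioritized experience replay."""
    names = list(priorities)
    weights = [priorities[name] ** alpha for name in names]
    return random.choices(names, weights=weights, k=k)

print(sample_trace_steps(trace_priorities, k=5))
# random, but dominated by the high-priority key events
```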
World models bridge the gap between perception and action. A "world model" is an internal representation that enables an agent to understand, predict, and plan within its environment. Just as humans build mental models of how the world works, AI agents develop these models to navigate complex environments.
At their core, world models serve three critical functions: prediction, planning, and generalization. Prediction allows agents to anticipate the consequences of their actions. Planning enables agents to simulate different scenarios before acting. Generalization helps them use what they've learned in new situations.
World models are powerful because they can summarize experience in a way that makes prediction and planning easier. When an agent faces a new situation, it doesn't have to learn from scratch; it can make intelligent decisions based on similar past experiences.
World models have mostly been applied to game environments, which are controlled spaces with clear physical rules. Our approach extends the concept to the vast and dynamic web, where the rules are less explicit but equally important.
In the web context, a world model must capture the states a page can be in, the actions available in each state, and the outcomes those actions produce.
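As a rough illustration (not our implementation), even a small tabular transition model over page states is enough to show the prediction and planning functions described above, using the onboarding flow from earlier as the example:

```python
from typing import Optional

# Illustrative tabular transition model learned from agent traces:
# transitions[(state, action)] -> predicted next state.
transitions = {
    ("request_list", "filter:assigned_to_me"): "filtered_list",
    ("filtered_list", "open:top_request"): "request_form",
    ("request_form", "submit"): "confirmation_page",
}

def simulate(start: str, goal: str, actions: list) -> Optional[list]:
    """Roll an action sequence forward inside the model, without touching the
    real website, and check whether it is predicted to reach the goal."""
    state, path = start, []
    for action in actions:
        state = transitions.get((state, action))
        if state is None:
            return None  # the model predicts this action is unavailable here
        path.append(state)
    return path if state == goal else None

print(simulate("request_list", "confirmation_page",
               ["filter:assigned_to_me", "open:top_request", "submit"]))
# ['filtered_list', 'request_form', 'confirmation_page']
```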
The result is a dynamic, growing understanding of the web as an interactive space, not just a set of static pages. This web model allows our agents to interact meaningfully with the web, making decisions and taking actions that mimic human-like understanding of web interfaces.
What are the scaling laws for agents? And what's the trade-off between natural and synthetic data for training effective web-browsing agents? These questions are essential as we move from proof-of-concept to large-scale deployment.
Just as Google changed from an "Information Engine" to a "Knowledge Engine" with its Knowledge Graph, we are on the brink of another significant transformation.
This approach will advance multimodal agents from basic models of actions and outputs to strong models of behaviors and outcomes. It will create web knowledge that is not about searchability and knowability but about actionability and predictability.