Deep dive on Pasta-1T: an open dataset of web trajectories to train browser agents

We're releasing Pasta-1T, a live, open dataset of real-world web interactions that will enable better fine-tuning for web agents.

Written by Silverstream AI Team

Published on April 4, 2025

Pasta-1T captures billions of tokens generated from agentic interactions across diverse websites in several languages and categories. It provides a rich source of machine-validated web trajectories that reflect heterogeneous, multicultural browsing behavior. We aim for it to be the largest public dataset dedicated specifically to web trajectories.

Behaviors, not tokens

The evolution of autonomous agents represents a fundamental shift in how we think about automation. While current pre-training datasets treat the web as a sequence of tokens, the agentic web is driven by behavior that emerges from the interaction between the agent and its environment. Behavioral data captures the consequences of agentic actions rather than providing a static description of the world. Pasta-1T addresses this gap by providing a comprehensive dataset of validated web interaction trajectories, capturing the full context and behavioral patterns needed for training reliable agents. By making this resource available to the community, we want to support the emergence of more trustworthy and reliable open web agents.

Benchmarks vs Real-World Performance

Today's agents typically report close to 80% accuracy on academic benchmarks, which are not representative of real-world applications. Just as self-driving vehicles must meet a much higher standard than human-driven cars, autonomous agents must reach a similarly high level of trustworthiness, 99.9% or better, to be truly hands-off.

By releasing and supporting Pasta-1T, we aim to help the community bridge the usefulness gap so that agents can graduate from simple benchmarks to reliably solving complex tasks autonomously.

Specifically, this release includes:

An ever-increasing dataset of billions of tokens and unique traces across the whole live internet
Multi-modal data capturing complete web interactions, including DOM structure, visual elements, and agent reasoning
An LLM-as-a-judge validation process to ensure high-quality, reliable training data

All data processing scripts are open source and available in our repository. The dataset will be available through the open standard Minari. This is the first of what we hope will be many contributions to the open community focused on improving AI agents.

Request Access vs Open Access

For now, users must complete a form to request access, as we need to track usage and address potential ethical and copyright issues. After peer review, the dataset will be made openly accessible through the proper academic channels.

Dataset Composition and Methodology

The dataset consists of web trajectories generated by web agents interacting with websites. These interactions are collected through Silverstream AI's autonomous web agents in a setup that uses a language model-driven exploration policy to perform a self-defined curriculum of tasks within a browser environment. Each data point includes a self-defined task, its expected consequences, and the set of states and actions to achieve it. We ensure that the task is feasible and valuable and that the agent successfully completes it. This approach helps maintain consistency across tasks while capturing diverse interaction patterns.

Validation and Quality Control

We filter trajectories so that the dataset contains only high-quality, meaningful interactions. We discard trajectories that terminate too quickly, before reaching a meaningful number of steps or state changes. This ensures that only trajectories demonstrating substantial interaction and complexity are retained to inform learning signals such as rewards.
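As a rough illustration of this kind of length filter, here is a minimal Python sketch. The thresholds, the `Step` record, and the idea of comparing state fingerprints are all assumptions made for the example; the post does not publish the actual values or data layout.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds; the actual values are not stated in the post.
MIN_STEPS = 3
MIN_STATE_CHANGES = 2


@dataclass
class Step:
    action: str        # e.g. 'page.click("#home")'
    state_before: str  # hypothetical fingerprint of the page state, e.g. a DOM hash
    state_after: str


def count_state_changes(steps: List[Step]) -> int:
    """Count the steps whose action actually altered the page state."""
    return sum(1 for s in steps if s.state_before != s.state_after)


def is_long_enough(steps: List[Step]) -> bool:
    """Keep a trajectory only if it has enough steps and enough state changes."""
    return (
        len(steps) >= MIN_STEPS
        and count_state_changes(steps) >= MIN_STATE_CHANGES
    )
```

A trajectory that stops after one or two actions, or whose actions leave the page unchanged, would be dropped by `is_long_enough` before any further scoring.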

Quality Threshold Filter

Each trajectory is evaluated by an LLM acting as a judge, an approach inspired by generalized value functions. The LLM judge assesses the trajectory against:

Task adherence
Acceptance criteria
Error analysis

Trajectories are scored on a 1–5 scale:

A score of 4 if there are minor errors or if the task was more than 70% completed
A score of 3 or below if the agent made very little progress toward the given instruction or if there are significant errors

Only trajectories scoring 4 or 5 are included in the dataset. To ensure non-triviality, tasks must require at least three meaningful and independent steps. A differential world model analyzes changes in the browser state at each step to confirm that interactions involve distinct and significant actions, avoiding superficial or repetitive tasks.
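The acceptance rule described above can be sketched in a few lines of Python. The score cutoff and the three-step minimum come from the text; everything else is an assumption. In particular, counting transitions into previously unseen states is only a simple stand-in for the differential world model, and `JudgedTrajectory` and `state_hashes` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

MIN_JUDGE_SCORE = 4       # from the post: only scores of 4 or 5 are kept
MIN_MEANINGFUL_STEPS = 3  # from the post: at least three meaningful steps


@dataclass
class JudgedTrajectory:
    judge_score: int         # 1-5 score assigned by the LLM judge
    state_hashes: List[str]  # hypothetical: one fingerprint per visited state


def meaningful_steps(state_hashes: List[str]) -> int:
    """Count steps that reach a state not seen earlier in the trajectory.

    This is a simplified proxy: no-op steps and loops back to an earlier
    state are not counted.
    """
    seen = set(state_hashes[:1])
    count = 0
    for h in state_hashes[1:]:
        if h not in seen:
            count += 1
            seen.add(h)
    return count


def passes_quality_filter(t: JudgedTrajectory) -> bool:
    """Apply both the judge-score cutoff and the non-triviality check."""
    return (
        t.judge_score >= MIN_JUDGE_SCORE
        and meaningful_steps(t.state_hashes) >= MIN_MEANINGFUL_STEPS
    )
```

Under this sketch, a trajectory that bounces between two pages fails the non-triviality check even with a top judge score, because only the first visit to each state counts.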

Dataset Structure

The dataset is a collection of episodes. Each episode includes the following components:

1. Title: The title of the task (e.g., "Buying socks")

2. Task Goal: The goal driving the interaction in a user story format (As a..., I..., so that...)

3. Success definition: The definition of a successful episode in the acceptance test criteria format (Given..., When..., Then...)

4. Trajectory: A sequence of transitions detailing the agent's interaction process.
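To make the four components concrete, here is a hypothetical Python record for an episode, with a loose check that the goal and success definition follow the two formats named above. The field names, the regular expressions, and the sample text are assumptions for illustration only, not the dataset's published schema.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Loose patterns for the two formats; the exact wording rules are an assumption.
USER_STORY = re.compile(
    r"^As an? .+?, I .+?,? so that .+", re.IGNORECASE | re.DOTALL
)
ACCEPTANCE = re.compile(
    r"^Given .+?,? when .+?,? then .+", re.IGNORECASE | re.DOTALL
)


@dataclass
class Episode:
    title: str               # e.g. "Buying socks"
    task_goal: str           # user story: As a..., I..., so that...
    success_definition: str  # acceptance criteria: Given..., When..., Then...
    trajectory: List[Dict[str, Any]] = field(default_factory=list)  # transitions

    def format_problems(self) -> List[str]:
        """Return a description of each field that does not match its format."""
        problems = []
        if not USER_STORY.match(self.task_goal):
            problems.append("task_goal is not in user story format")
        if not ACCEPTANCE.match(self.success_definition):
            problems.append("success_definition is not in Given/When/Then format")
        return problems


episode = Episode(
    title="Buying socks",
    task_goal="As a shopper, I want to buy socks, so that my feet stay warm.",
    success_definition=(
        "Given an empty cart, When I add socks and check out, "
        "Then an order confirmation is shown."
    ),
)
```

Calling `episode.format_problems()` on the sample returns an empty list, while a goal such as "Buy socks" would be flagged.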

Trajectory Structure

A trajectory represents a step-by-step interaction between the agent and the webpage. A trajectory is a sequence of transitions, with each transition consisting of:

1. Observation:

URL Transitions (Captures before-and-after URLs)
Detailed DOM structure and accessibility information
Network Requests
Screenshots
Recorded Videos (contact us for video access)

2. Action:

Reasoning (e.g., the rationale behind why a button was clicked)
Action Syntax (Code-level representation, e.g., page.click("#home"))
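One way to picture a single transition is as a pair of small records, sketched below in Python. The class and field names are assumptions chosen to mirror the lists above, and the on-disk format of the real dataset may differ; recorded videos are left out because they are provided separately.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Observation:
    url_before: str                  # URL before the action
    url_after: str                   # URL after the action
    dom: str                         # serialized DOM and accessibility information
    network_requests: List[str] = field(default_factory=list)
    screenshot_path: Optional[str] = None  # hypothetical: path to a stored image


@dataclass
class Action:
    reasoning: str  # why the agent chose this action
    syntax: str     # code-level representation, e.g. 'page.click("#home")'


@dataclass
class Transition:
    observation: Observation
    action: Action


def to_json(transition: Transition) -> str:
    """Serialize a transition, including nested records, to a JSON string."""
    return json.dumps(asdict(transition))


transition = Transition(
    observation=Observation(
        url_before="https://example.com/",
        url_after="https://example.com/home",
        dom="<html></html>",
    ),
    action=Action(
        reasoning="The Home link leads back to the landing page.",
        syntax='page.click("#home")',
    ),
)
```

A trajectory is then just a list of such `Transition` objects, which `to_json` can write out one per line.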

Access the Dataset

Want to explore the Pasta-1T dataset and join the Silverstream AI community? Fill out this form to request access. For collaboration inquiries or more information, feel free to contact us.
