paper/position/paper.tex
7 additions & 7 deletions
@@ -200,7 +200,7 @@ \subsection{A Taxonomy of Browser Agent Benchmarks}
\paragraph{Pattern 4: Domain concentration.} Existing benchmarks heavily favor a small set of domains: e-commerce, content management, developer tools, and travel booking appear repeatedly, while vast categories of economically important web work (financial services, healthcare portals, government services, enterprise SaaS, professional services) remain largely uncovered.
-\paragraph{Pattern 5: The live-web evaluation problem.} Benchmarks that evaluate on live websites (Mind2Web \citep{zhang2023mind2web}, BrowseComp \citep{wei2025browsecomp}, BEARCUBS \citep{song2025bearcubs}, WebVoyager \citep{he2024webvoyager}) face continuous validity challenges as the web changes. Those that avoid this through replicas (WebArena \citep{zhou2023webarena}, REAL \citep{garg2025real}) gain reproducibility but lose coverage and realism.
+\paragraph{Pattern 5: The live-web evaluation problem.} Benchmarks that evaluate on live websites (Mind2Web \citep{zhang2023mind2web}, BrowseComp \citep{wei2025browsecomp}, BEARCUBS \citep{song2025bearcubs}, WebVoyager \citep{he2024webvoyager}) face continuous validity challenges as the web changes. Those that avoid this through replicas (WebArena \citep{zhou2023webarena}, REAL \citep{garg2025real}) gain reproducibility but lose coverage and realism. Moreover, sensitive operations (payments, authentication, account creation) cannot be retried freely without real consequences.
\subsection{The Cost Structure of Environment Construction}

\paragraph{Stage 1: Tool-Call Parsing.} Raw browser events are converted to a standardized Domain-Specific Language (DSL):
-\begin{table*}[h]
+\begin{table*}[t]
\caption{Tool-call DSL mapping from browser events.}
\small
\centering
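
To make Stage 1 concrete, here is a minimal sketch of how a recorded browser event might be normalized into a DSL call. The event fields (`type`, `selector`, `value`, `url`) and the DSL verbs are assumptions for illustration, not the paper's actual schema:

```python
# Hypothetical sketch: normalize one raw browser event into a DSL call.
# Field names and DSL verbs are assumed, not taken from the paper.

def event_to_dsl(event: dict) -> str:
    kind = event["type"]
    if kind == "click":
        return f'click("{event["selector"]}")'
    if kind == "input":
        return f'type("{event["selector"]}", "{event["value"]}")'
    if kind == "navigation":
        return f'goto("{event["url"]}")'
    raise ValueError(f"unmapped browser event type: {kind}")

# e.g. {"type": "click", "selector": "#submit"}  ->  click("#submit")
```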
@@ -586,9 +586,9 @@ \subsection{Replay Engine}
\item\textbf{Exact Match:} The HAR file is searched for entries with identical method and URL base (scheme + host + path). If a single match exists, it is used directly.
\item\textbf{Character-Based Matching:} For URLs with dynamic query parameters, a character-frequency similarity score is computed:
\begin{equation*}
\text{score}(\text{tgt}, \text{cand}) = \frac{\sum_{c} \min(\text{tgt}[c], \text{cand}[c])}{\sum_{c} \text{tgt}[c]}
\end{equation*}
-URLs are normalized before matching (removing timestamp parameters, sorting query strings). Candidates with >90\% character overlap and matching all target characters are treated as perfect matches.
+where $\text{tgt}[c]$ and $\text{cand}[c]$ are the character counts for character $c$ in the target and candidate URLs respectively. URLs are normalized before matching (removing timestamp parameters, sorting query strings). Candidates with greater than 90\% character overlap and matching all target characters are treated as perfect matches.
\item\textbf{LLM Disambiguation:} When multiple candidates remain after character-based filtering, the top-5 candidates (ranked by match score) are sent to an LLM for selection. The prompt provides: target request details (method, normalized URL, headers, POST data), candidate request details with response MIME types, and character match scores as additional context.
\end{enumerate}
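
A minimal sketch of the three-tier cascade above, assuming HAR entries reduced to `(method, url)` pairs; the helper names, the simplified normalization, and the LLM stub are illustrative, not TRACE's actual API:

```python
# Illustrative sketch of the three-tier request-matching cascade.
# Entry layout, helper names, and the LLM stub are assumptions.
from collections import Counter
from urllib.parse import urlsplit

def url_base(url: str) -> str:
    """Scheme + host + path, ignoring query string and fragment."""
    p = urlsplit(url)
    return f"{p.scheme}://{p.netloc}{p.path}"

def char_score(tgt_url: str, cand_url: str) -> float:
    """Fraction of the target's characters covered by the candidate."""
    tgt, cand = Counter(tgt_url), Counter(cand_url)
    return sum(min(n, cand[c]) for c, n in tgt.items()) / sum(tgt.values())

def llm_disambiguate(candidates):
    """Stub for tier 3: TRACE prompts an LLM with request details and
    match scores; here we simply take the best-scoring candidate."""
    return candidates[0] if candidates else None

def match(method: str, url: str, har_entries: list[tuple[str, str]]):
    # Tier 1: exact match on method + URL base.
    exact = [e for e in har_entries
             if e[0] == method and url_base(e[1]) == url_base(url)]
    if len(exact) == 1:
        return exact[0]
    # Tier 2: character-frequency similarity (URL normalization omitted here).
    pool = exact or [e for e in har_entries if e[0] == method]
    scored = sorted(pool, key=lambda e: char_score(url, e[1]), reverse=True)
    if scored and char_score(url, scored[0][1]) > 0.90:
        return scored[0]
    # Tier 3: send the top-5 candidates to an LLM for disambiguation.
    return llm_disambiguate(scored[:5])
```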
@@ -616,7 +616,7 @@ \section{Comparison with HTTP Record-Replay Tools}
Table~\ref{tab:replay-comparison} compares TRACE's replay approach with existing HTTP record-replay tools. While these tools share the fundamental capture-replay paradigm, they differ significantly in request matching sophistication and intended use case.
-\begin{table*}[h]
+\begin{table*}[t]
\caption{Comparison of HTTP record-replay tools. TRACE adds semantic matching capabilities designed for the non-determinism inherent in web agent evaluation.}

To validate replay correctness, we manually traversed each captured environment following the original trajectory. Table~\ref{tab:replay-validation} reports matching outcomes across the multi-tier system.
-\begin{table*}[h]
+\begin{table*}[t]
\caption{Replay matching breakdown for human traversal of captured environments. Deterministic matches require no LLM disambiguation; LLM-required matches were resolved correctly; failed matches had no suitable HAR entry.}