Commit a443b1d: rendering fixes
1 parent 48ebd4a

2 files changed: 7 additions & 7 deletions

paper/position/paper.pdf (49 Bytes)
Binary file not shown.

paper/position/paper.tex

@@ -200,7 +200,7 @@ \subsection{A Taxonomy of Browser Agent Benchmarks}
 
 \paragraph{Pattern 4: Domain concentration.} Existing benchmarks heavily favor a small set of domains: e-commerce, content management, developer tools, and travel booking appear repeatedly, while vast categories of economically important web work (financial services, healthcare portals, government services, enterprise SaaS, professional services) remain largely uncovered.
 
-\paragraph{Pattern 5: The live-web evaluation problem.} Benchmarks that evaluate on live websites (Mind2Web \citep{zhang2023mind2web}, BrowseComp \citep{wei2025browsecomp}, BEARCUBS \citep{song2025bearcubs}, WebVoyager \citep{he2024webvoyager}) face continuous validity challenges as the web changes. Those that avoid this through replicas (WebArena \citep{zhou2023webarena}, REAL \citep{garg2025real}) gain reproducibility but lose coverage and realism.
+\paragraph{Pattern 5: The live-web evaluation problem.} Benchmarks that evaluate on live websites (Mind2Web \citep{zhang2023mind2web}, BrowseComp \citep{wei2025browsecomp}, BEARCUBS \citep{song2025bearcubs}, WebVoyager \citep{he2024webvoyager}) face continuous validity challenges as the web changes. Those that avoid this through replicas (WebArena \citep{zhou2023webarena}, REAL \citep{garg2025real}) gain reproducibility but lose coverage and realism. Moreover, sensitive operations (payments, authentication, account creation) cannot be retried freely without real consequences.
 
 \subsection{The Cost Structure of Environment Construction}
 
@@ -465,7 +465,7 @@ \subsection{Collection Architecture}
 
 \paragraph{Event Recording.} The \texttt{Recorder} class captures events across multiple categories:
 
-\begin{table*}[h]
+\begin{table*}[t]
 \caption{Event categories captured by TRACE.}
 \small
 \centering
@@ -510,7 +510,7 @@ \subsection{Post-Processing Pipeline}
 
 \paragraph{Stage 1: Tool-Call Parsing.} Raw browser events are converted to a standardized Domain-Specific Language (DSL):
 
-\begin{table*}[h]
+\begin{table*}[t]
 \caption{Tool-call DSL mapping from browser events.}
 \small
 \centering
@@ -586,9 +586,9 @@ \subsection{Replay Engine}
 \item \textbf{Exact Match:} The HAR file is searched for entries with identical method and URL base (scheme + host + path). If a single match exists, it is used directly.
 \item \textbf{Character-Based Matching:} For URLs with dynamic query parameters, a character-frequency similarity score is computed:
 \begin{equation*}
-\text{score} = \sum_c \min(\text{target\_char\_count}[c], \text{candidate\_char\_count}[c])
+\text{score} = \sum_c \min(\text{tgt}[c], \text{cand}[c])
 \end{equation*}
-URLs are normalized before matching (removing timestamp parameters, sorting query strings). Candidates with >90\% character overlap and matching all target characters are treated as perfect matches.
+where $\text{tgt}[c]$ and $\text{cand}[c]$ are the character counts for character $c$ in the target and candidate URLs respectively. URLs are normalized before matching (removing timestamp parameters, sorting query strings). Candidates with greater than 90\% character overlap and matching all target characters are treated as perfect matches.
 \item \textbf{LLM Disambiguation:} When multiple candidates remain after character-based filtering, the top-5 candidates (ranked by match score) are sent to an LLM for selection. The prompt provides: target request details (method, normalized URL, headers, POST data), candidate request details with response MIME types, and character match scores as additional context.
 \end{enumerate}
 
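The character-based matching tier in the hunk above can be sketched in Python. All helper names (`normalize_url`, `char_score`, `is_perfect_match`), the list of timestamp-like parameter names, and the denominator used for the 90% overlap threshold are assumptions for illustration; the diff does not show TRACE's actual implementation.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed set of timestamp-like query parameters to strip;
# the real list is not specified in the paper diff.
TIMESTAMP_PARAMS = {"t", "ts", "timestamp", "_"}

def normalize_url(url: str) -> str:
    """Drop timestamp-like query parameters and sort the rest."""
    parts = urlsplit(url)
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TIMESTAMP_PARAMS
    )
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(query), "")
    )

def char_score(target: str, candidate: str) -> int:
    """score = sum over characters c of min(tgt[c], cand[c])."""
    tgt, cand = Counter(target), Counter(candidate)
    return sum(min(n, cand[c]) for c, n in tgt.items())

def is_perfect_match(target: str, candidate: str) -> bool:
    """Perfect match: every target character is covered and character
    overlap exceeds 90%. The denominator (the longer URL) is an
    assumption, since the paper does not specify it."""
    t, c = normalize_url(target), normalize_url(candidate)
    tgt, cand = Counter(t), Counter(c)
    covers_all = all(cand[ch] >= n for ch, n in tgt.items())
    overlap = char_score(t, c) / max(len(t), len(c), 1)
    return covers_all and overlap > 0.9
```

Taking the minimum of per-character counts makes the score insensitive to parameter ordering, which is why sorting the query string during normalization pairs naturally with this tier.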
@@ -616,7 +616,7 @@ \section{Comparison with HTTP Record-Replay Tools}
 
 Table~\ref{tab:replay-comparison} compares TRACE's replay approach with existing HTTP record-replay tools. While these tools share the fundamental capture-replay paradigm, they differ significantly in request matching sophistication and intended use case.
 
-\begin{table*}[h]
+\begin{table*}[t]
 \caption{Comparison of HTTP record-replay tools. TRACE adds semantic matching capabilities designed for the non-determinism inherent in web agent evaluation.}
 \label{tab:replay-comparison}
 \vskip 0.1in
@@ -723,7 +723,7 @@ \subsection{Replay Validation Results}
 
 To validate replay correctness, we manually traversed each captured environment following the original trajectory. Table~\ref{tab:replay-validation} reports matching outcomes across the multi-tier system.
 
-\begin{table*}[h]
+\begin{table*}[t]
 \caption{Replay matching breakdown for human traversal of captured environments. Deterministic matches require no LLM disambiguation; LLM-required matches were resolved correctly; failed matches had no suitable HAR entry.}
 \label{tab:replay-validation}
 \vskip 0.1in
