Fix XML scraping data loss, auth failure handling, and add tests#577
Conversation
…tests - Fix _extract_tag_value silently dropping content for XML tags named template/script/style by respecting the extract parameter instead of short-circuiting with tag.string - Fix ContentRequestManager swallowing 401/403 auth failures that would cause scraping unreliable data (login pages) without warning - Clear cookie jar on reauth invalidation to prevent stale cookies from interfering with fresh authentication - Log warning when stale form_variables persist after failed reauth - Fix misleading parser name "html (lxml-xml)" -> "xml (lxml-xml)" - Add 110 new tests for XML scraping, cookie persistence, and auth failure/re-auth flows Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Warning Rate limit exceeded
To continue reviewing without waiting, purchase usage credits in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR enhances the multiscrape component with XML parser support and improved authentication error handling. It adds XML labeling to the parser, refines text extraction for HTML-like elements, implements defensive error handling in form submission and HTTP requests with a new authentication invalidation method, and provides comprehensive test coverage for authentication failures, cookie persistence, and XML scraping scenarios. ChangesXML Scraping Support
Authentication Error Handling & Invalidation
Sequence DiagramsequenceDiagram
participant Client as Client/Coordinator
participant FA as FormAuthenticator
participant CRM as ContentRequestManager
participant Server as Server
participant HS as HttpSession
Client->>HS: ensure_authenticated()
HS->>FA: ensure_authenticated()
FA->>Server: GET /login (fetch form)
Server-->>FA: form page + cookies
FA->>FA: extract form inputs
FA->>Server: POST /login (submit form)<br/>(with credentials)
alt Submission Success
Server-->>FA: 200 OK + session cookies
FA->>Server: GET /data (fetch content)
Server-->>FA: 200 OK + data
FA-->>HS: authenticated
else Submission Fails (any error)
Server-->>FA: error (4xx/5xx)
FA->>FA: preserve form_variables<br/>log warning
FA-->>HS: error (re-raise)
HS-->>Client: error propagates
end
alt HTTPStatusError (401/403)
Client->>CRM: get_content()
CRM->>Server: POST /submit (form data)
Server-->>CRM: 401/403 Unauthorized
CRM->>CRM: log + re-raise
CRM-->>Client: HTTPStatusError
else HTTPStatusError (other)
Client->>CRM: get_content()
CRM->>Server: POST /submit
Server-->>CRM: 4xx/5xx error
CRM->>CRM: log HTTP error
CRM-->>Client: continue (no raise)
end
Client->>HS: invalidate_auth()
HS->>FA: invalidate()
HS->>HS: clear cookie jar
HS->>HS: log cleared
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…guard - Fix _extract_tag_value to handle extract=None (non-schema path) the same as extract="text", preventing regression for internally-constructed selectors that bypass schema validation - Guard cookie jar clearing in invalidate_auth() behind _should_submit check so cookies are preserved when resubmit_on_error=False (clearing cookies without ability to re-auth would break the session) - Add test verifying cookies are preserved when resubmit is disabled Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
_extract_tag_valuesilently returned empty strings for XML tags namedtemplate,script, orstylebecause an HTML-specific short-circuit bypassed theextractparameter. Now theextractmode is always respected.ContentRequestManager.get_content()caught all exceptions from form auth (including 401/403) and fell through to scraping the data page without authentication — potentially scraping login pages as sensor data. Now 401/403 propagate as errors.invalidate_auth()now clears the httpx cookie jar to prevent old session cookies from interfering with fresh authentication."html (lxml-xml)"→"xml (lxml-xml)"in debug logs.Test plan
lxml-xmlparser on a real RSS/Atom feed🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Tests