
Fix XML scraping data loss, auth failure handling, and add tests#577

Merged
danieldotnl merged 2 commits into master from fix/xml-scraping-auth-bugs-test-coverage
May 6, 2026

@danieldotnl
Owner

@danieldotnl danieldotnl commented May 6, 2026

Summary

  • Fix data loss in XML scraping: _extract_tag_value silently returned empty strings for XML tags named template, script, or style because an HTML-specific short-circuit bypassed the extract parameter. Now the extract mode is always respected.
  • Fix silent auth failure swallowing: ContentRequestManager.get_content() caught all exceptions from form auth (including 401/403) and fell through to scraping the data page without authentication — potentially scraping login pages as sensor data. Now 401/403 propagate as errors.
  • Fix stale cookies on reauth: invalidate_auth() now clears the httpx cookie jar to prevent old session cookies from interfering with fresh authentication.
  • Add stale form_variables warning: When reauth fails and stale variables from a prior successful auth remain, a warning is logged.
  • Fix misleading parser name: "html (lxml-xml)" → "xml (lxml-xml)" in debug logs.
  • Add 110 new tests covering XML scraping (RSS, weather, SOAP, CDATA, namespaces, HTML-colliding tag names), cookie persistence (accumulation, form auth flow, reauth clearing, edge cases), and auth failure/re-auth flows (401/403 rejection, fallthrough behavior, state management).
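
The first fix above can be illustrated with a minimal sketch. It does not reproduce the code in extractors.py: `FakeTag`, `extract_before`, `extract_after`, and the `"content"` mode name are hypothetical stand-ins, and the tag model only mimics the relevant behaviour of a parsed element (a `string` that is `None` when the element has nested children).

```python
from dataclasses import dataclass
from typing import Optional

# Tag names that are special in HTML but ordinary elements in XML.
RAW_TEXT_TAGS = {"script", "style", "template"}


@dataclass
class FakeTag:
    """Hypothetical stand-in for a parsed element (not the BeautifulSoup API)."""

    name: str
    string: Optional[str]  # direct text when the element holds only text, else None
    text: str  # concatenated text of all descendants
    inner: str  # serialized child markup


def extract_before(tag: FakeTag, extract: Optional[str] = "text") -> str:
    """Old behaviour as described: the short-circuit ignores `extract`."""
    if tag.name in RAW_TEXT_TAGS:
        return tag.string or ""
    return tag.inner if extract == "content" else tag.text


def extract_after(tag: FakeTag, extract: Optional[str] = "text") -> str:
    """Fixed behaviour as described: `extract` is consulted first."""
    if extract in (None, "text"):
        if tag.name in RAW_TEXT_TAGS and tag.string is not None:
            return tag.string  # keeps "" for an empty <script>/<style>
        return tag.text
    return tag.inner  # assumed non-text mode, e.g. extract="content"


# An XML <template> element with a nested child, so `string` is None.
xml_template = FakeTag(
    name="template",
    string=None,
    text="Daily report",
    inner="<title>Daily report</title>",
)

print(repr(extract_before(xml_template)))  # ''
print(repr(extract_after(xml_template)))  # 'Daily report'
print(repr(extract_after(xml_template, "content")))  # '<title>Daily report</title>'
```

The old path returns an empty string for the nested XML element, which is the data loss described in the summary; the new path falls back to ordinary text extraction.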

Test plan

  • All 457 tests pass (347 existing + 110 new)
  • Pre-commit hooks (ruff, isort, codespell, prettier, pytest-check) pass
  • Verify XML scraping with lxml-xml parser on a real RSS/Atom feed
  • Verify form auth with a site that returns 401 on bad credentials
  • Verify cookie-based auth flow where server rotates cookie names

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added ability to explicitly reset authentication state.
    • Enhanced parser to distinguish XML from HTML markup.
  • Bug Fixes

    • Improved error handling for HTTP authentication failures (401/403).
    • Enhanced form submission error handling with state preservation.
    • Refined text extraction for special elements.
  • Tests

    • Added comprehensive test coverage for authentication failures and recovery.
    • Added cookie persistence test suite.
    • Added XML scraping test suite with fixtures.

…tests

- Fix _extract_tag_value silently dropping content for XML tags named
  template/script/style by respecting the extract parameter instead of
  short-circuiting with tag.string
- Fix ContentRequestManager swallowing 401/403 auth failures that would
  cause scraping unreliable data (login pages) without warning
- Clear cookie jar on reauth invalidation to prevent stale cookies from
  interfering with fresh authentication
- Log warning when stale form_variables persist after failed reauth
- Fix misleading parser name "html (lxml-xml)" -> "xml (lxml-xml)"
- Add 110 new tests for XML scraping, cookie persistence, and auth
  failure/re-auth flows

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai Bot commented May 6, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c229ec0a-9002-452f-b3d9-a1add49e350e

📥 Commits

Reviewing files that changed from the base of the PR and between 8fed0ee and a6d1657.

📒 Files selected for processing (3)
  • custom_components/multiscrape/extractors.py
  • custom_components/multiscrape/http_session.py
  • tests/test_auth_failure.py
📝 Walkthrough

This PR enhances the multiscrape component with XML parser support and improved authentication error handling. It adds XML labeling to the parser, refines text extraction for HTML-like elements, implements defensive error handling in form submission and HTTP requests with a new authentication invalidation method, and provides comprehensive test coverage for authentication failures, cookie persistence, and XML scraping scenarios.

Changes

XML Scraping Support

| Layer | File(s) | Summary |
| --- | --- | --- |
| Parser Behavior | custom_components/multiscrape/parsers.py | HtmlParser now labels lxml-xml parser as "xml (lxml-xml)" to distinguish XML parsing mode from HTML parsing. |
| Text Extraction Refinement | custom_components/multiscrape/extractors.py | _extract_tag_value now conditionally returns tag.string for style/script/template elements when extract mode is "text", preserving empty-string cases while falling back to standard text extraction otherwise. |
| Test Fixtures | tests/fixtures/xml_samples.py | Nine XML sample constants added: SAMPLE_XML_RSS, SAMPLE_XML_WEATHER, SAMPLE_XML_SOAP, SAMPLE_XML_ATTRIBUTES, SAMPLE_XML_CDATA, SAMPLE_XML_EMPTY, SAMPLE_XML_NAMESPACES, SAMPLE_XML_HTML_TAG_NAMES, SAMPLE_XML_MALFORMED for various XML scenarios. |
| XML Scraping Tests | tests/test_xml_scraping.py | Comprehensive test suite covering basic XML parsing, RSS feeds, weather data, attribute extraction, CDATA handling, extract modes, templating, namespaces, SOAP, HTML-colliding tags, error cases, and parser factory integration with lxml-xml. |
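
To make the "HTML-colliding tags" case concrete, here is a small sketch of the kind of document such a fixture might contain. The sample below is invented for illustration (it is not the content of SAMPLE_XML_HTML_TAG_NAMES), and it uses the standard library's `xml.etree.ElementTree` rather than the lxml-xml parser the component uses, only to show that an XML parser treats these names as ordinary elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical fixture: element names that collide with HTML's special tags.
SAMPLE = """<config>
  <template><name>daily</name></template>
  <script>backup.sh</script>
  <style>compact</style>
</config>"""

root = ET.fromstring(SAMPLE)

# Collect the full text of each child, including text of nested elements.
values = {child.tag: "".join(child.itertext()).strip() for child in root}

print(values)  # {'template': 'daily', 'script': 'backup.sh', 'style': 'compact'}
```

In XML none of these elements is a raw-text container, so each one should yield its text like any other element.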

Authentication Error Handling & Invalidation

| Layer | File(s) | Summary |
| --- | --- | --- |
| Error Handling in Content Retrieval | custom_components/multiscrape/coordinator.py | Added httpx import and wrapped form-submit/content retrieval in try/except to catch httpx.HTTPStatusError; 401/403 errors are logged and re-raised; other HTTP errors are logged and execution continues; general exceptions are also caught and logged. |
| Defensive Form Submission | custom_components/multiscrape/form_auth.py | ensure_authenticated now wraps form submission in try/except to preserve previously scraped form variables and log a warning before re-raising errors. New private method _submit_form encapsulates the form submission flow. |
| Authentication Invalidation API | custom_components/multiscrape/http_session.py | New public method invalidate_auth clears authentication state by invoking FormAuthenticator invalidate, clearing the httpx client cookie jar, and logging the action to ensure fresh authentication on next use. |
| Authentication Failure Tests | tests/test_auth_failure.py | Extensive test suite covering Basic Auth failures, Form Auth failures, re-authentication flows, ContentRequestManager error recovery, 401/403 rejection semantics, cookie jar handling on reauth, and stale form variable preservation using respx mocking and httpx simulation. |
| Cookie Persistence Tests | tests/test_cookie_persistence.py | Test module validating single and multiple cookie persistence across requests, server-driven updates, persistence across scraping cycles, cookie flow during form-auth workflows, and edge cases (empty values, special characters, cross-session isolation, long tokens). |
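
The error-handling rule summarized for coordinator.py (re-raise on 401/403, log and continue otherwise) might look roughly like the sketch below. It is a synchronous, dependency-free approximation: `StatusError` is a hypothetical stand-in for httpx.HTTPStatusError, and `get_content`, `submit_form`, and `fetch_page` are illustrative names, not the signatures used in the component.

```python
import logging

_LOGGER = logging.getLogger(__name__)

AUTH_FAILURE_CODES = {401, 403}


class StatusError(Exception):
    """Hypothetical stand-in for httpx.HTTPStatusError."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def get_content(submit_form, fetch_page):
    """Run form auth, then fetch the data page."""
    try:
        submit_form()
    except StatusError as err:
        if err.status_code in AUTH_FAILURE_CODES:
            # Credentials were rejected: do not scrape the data page.
            _LOGGER.error("Form auth rejected with HTTP %s", err.status_code)
            raise
        _LOGGER.warning("Form submit failed with HTTP %s", err.status_code)
    except Exception as err:  # any other failure is logged, not fatal
        _LOGGER.warning("Form submit failed: %s", err)
    return fetch_page()


def rejected():
    raise StatusError(401)


def server_error():
    raise StatusError(500)


# A 500 during form submit is logged and the data page is still fetched.
print(get_content(server_error, lambda: "<data/>"))  # <data/>

# A 401 propagates, so a login page is never returned as sensor data.
try:
    get_content(rejected, lambda: "<html>login</html>")
except StatusError as err:
    print("propagated:", err.status_code)  # propagated: 401
```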

Sequence Diagram

sequenceDiagram
    participant Client as Client/Coordinator
    participant FA as FormAuthenticator
    participant CRM as ContentRequestManager
    participant Server as Server
    participant HS as HttpSession

    Client->>HS: ensure_authenticated()
    HS->>FA: ensure_authenticated()
    FA->>Server: GET /login (fetch form)
    Server-->>FA: form page + cookies
    FA->>FA: extract form inputs
    FA->>Server: POST /login (submit form)<br/>(with credentials)
    alt Submission Success
        Server-->>FA: 200 OK + session cookies
        FA->>Server: GET /data (fetch content)
        Server-->>FA: 200 OK + data
        FA-->>HS: authenticated
    else Submission Fails (any error)
        Server-->>FA: error (4xx/5xx)
        FA->>FA: preserve form_variables<br/>log warning
        FA-->>HS: error (re-raise)
        HS-->>Client: error propagates
    end

    alt HTTPStatusError (401/403)
        Client->>CRM: get_content()
        CRM->>Server: POST /submit (form data)
        Server-->>CRM: 401/403 Unauthorized
        CRM->>CRM: log + re-raise
        CRM-->>Client: HTTPStatusError
    else HTTPStatusError (other)
        Client->>CRM: get_content()
        CRM->>Server: POST /submit
        Server-->>CRM: 4xx/5xx error
        CRM->>CRM: log HTTP error
        CRM-->>Client: continue (no raise)
    end

    Client->>HS: invalidate_auth()
    HS->>FA: invalidate()
    HS->>HS: clear cookie jar
    HS->>HS: log cleared

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • danieldotnl/ha-multiscrape#568: Modifies _extract_tag_value handling for script/style/template elements and parser naming behavior, directly overlapping with extractors.py and parsers.py changes in this PR.
  • danieldotnl/ha-multiscrape#561: Updates ContentRequestManager.get_content flow in coordinator.py with HTTP error handling and form-auth request processing, matching the error-handling enhancements in this PR.
  • danieldotnl/ha-multiscrape#567: Modifies form submission flow in form_auth.py and introduces HttpSession.invalidate_auth, directly related to the defensive error handling and auth invalidation features in this PR.

Poem

🐰 XML flows through the parser, clean and bright,
Auth errors caught and logged just right,
Cookies persist through every request's flight,
With tests aplenty, the future burns white! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: fixing XML scraping data loss, fixing auth failure handling, and adding comprehensive tests. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


…guard

- Fix _extract_tag_value to handle extract=None (non-schema path) the
  same as extract="text", preventing regression for internally-constructed
  selectors that bypass schema validation
- Guard cookie jar clearing in invalidate_auth() behind _should_submit
  check so cookies are preserved when resubmit_on_error=False (clearing
  cookies without ability to re-auth would break the session)
- Add test verifying cookies are preserved when resubmit is disabled

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
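
The cookie guard from this commit can be sketched as follows. The class and attribute names are hypothetical, the cookie jar is modelled as a plain dict, and the way `should_submit` is derived from `resubmit_on_error` is an assumption made for illustration; the real logic lives in http_session.py and form_auth.py.

```python
class FakeFormAuthenticator:
    """Hypothetical stand-in for the form authenticator."""

    def __init__(self, resubmit_on_error: bool) -> None:
        self.resubmit_on_error = resubmit_on_error
        self.should_submit = False  # already authenticated

    def invalidate(self) -> None:
        # Assumption: a resubmit is only scheduled when it is allowed.
        self.should_submit = self.resubmit_on_error


class FakeSession:
    """Hypothetical stand-in for the HTTP session wrapper."""

    def __init__(self, authenticator: FakeFormAuthenticator) -> None:
        self.authenticator = authenticator
        self.cookies = {"session": "old-token"}  # stand-in for the cookie jar

    def invalidate_auth(self) -> None:
        self.authenticator.invalidate()
        # Only drop cookies when a fresh login will follow; otherwise the
        # existing session would be lost with no way to re-authenticate.
        if self.authenticator.should_submit:
            self.cookies.clear()


with_resubmit = FakeSession(FakeFormAuthenticator(resubmit_on_error=True))
with_resubmit.invalidate_auth()
print(with_resubmit.cookies)  # {}

without_resubmit = FakeSession(FakeFormAuthenticator(resubmit_on_error=False))
without_resubmit.invalidate_auth()
print(without_resubmit.cookies)  # {'session': 'old-token'}
```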
@danieldotnl danieldotnl merged commit 480ba31 into master May 6, 2026
7 checks passed
@danieldotnl danieldotnl deleted the fix/xml-scraping-auth-bugs-test-coverage branch May 6, 2026 20:58