Fix XML scraping data loss, auth failure handling, and add tests by danieldotnl · Pull Request #577 · danieldotnl/ha-multiscrape

danieldotnl · 2026-05-06T20:35:52Z

Summary

Fix data loss in XML scraping: _extract_tag_value silently returned empty strings for XML tags named template, script, or style because an HTML-specific short-circuit bypassed the extract parameter. Now the extract mode is always respected.
Fix silent auth failure swallowing: ContentRequestManager.get_content() caught all exceptions from form auth (including 401/403) and fell through to scraping the data page without authentication — potentially scraping login pages as sensor data. Now 401/403 propagate as errors.
Fix stale cookies on reauth: invalidate_auth() now clears the httpx cookie jar to prevent old session cookies from interfering with fresh authentication.
Add stale form_variables warning: When reauth fails and stale variables from a prior successful auth remain, a warning is logged.
Fix misleading parser name: "html (lxml-xml)" → "xml (lxml-xml)" in debug logs.
Add 110 new tests covering XML scraping (RSS, weather, SOAP, CDATA, namespaces, HTML-colliding tag names), cookie persistence (accumulation, form auth flow, reauth clearing, edge cases), and auth failure/re-auth flows (401/403 rejection, fallthrough behavior, state management).
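
The auth-failure fix can be sketched as below. This is a minimal illustration under stated assumptions, not the integration's actual code: `StatusError`, `submit_form`, and `fetch_page` are hypothetical stand-ins (the walkthrough says the real code catches `httpx.HTTPStatusError`), chosen so the snippet runs on the standard library alone.

```python
import logging

_LOGGER = logging.getLogger(__name__)


class StatusError(Exception):
    """Hypothetical stand-in for httpx.HTTPStatusError."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def get_content(submit_form, fetch_page):
    """Run form auth, then fetch the data page.

    A 401/403 from the form submit is re-raised so that a login page is
    never scraped as sensor data. Other HTTP errors are logged and the
    data page is still fetched.
    """
    try:
        submit_form()
    except StatusError as err:
        if err.status_code in (401, 403):
            _LOGGER.error("Form auth rejected with HTTP %s", err.status_code)
            raise
        _LOGGER.warning(
            "Form submit failed with HTTP %s, continuing", err.status_code
        )
    return fetch_page()
```

Re-raising only 401/403 keeps the previous tolerant behavior for transient errors while making credential rejections visible to the caller.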

Test plan

All 457 tests pass (347 existing + 110 new)
Pre-commit hooks (ruff, isort, codespell, prettier, pytest-check) pass
Verify XML scraping with lxml-xml parser on a real RSS/Atom feed
Verify form auth with a site that returns 401 on bad credentials
Verify cookie-based auth flow where server rotates cookie names

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added ability to explicitly reset authentication state.
- Enhanced parser to distinguish XML from HTML markup.
Bug Fixes
- Improved error handling for HTTP authentication failures (401/403).
- Enhanced form submission error handling with state preservation.
- Refined text extraction for special elements.
Tests
- Added comprehensive test coverage for authentication failures and recovery.
- Added cookie persistence test suite.
- Added XML scraping test suite with fixtures.

…tests

- Fix _extract_tag_value silently dropping content for XML tags named template/script/style by respecting the extract parameter instead of short-circuiting with tag.string
- Fix ContentRequestManager swallowing 401/403 auth failures, which could cause unreliable data (login pages) to be scraped without warning
- Clear cookie jar on reauth invalidation to prevent stale cookies from interfering with fresh authentication
- Log warning when stale form_variables persist after failed reauth
- Fix misleading parser name "html (lxml-xml)" -> "xml (lxml-xml)"
- Add 110 new tests for XML scraping, cookie persistence, and auth failure/re-auth flows

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-06T20:36:04Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c229ec0a-9002-452f-b3d9-a1add49e350e

📥 Commits

Reviewing files that changed from the base of the PR and between 8fed0ee and a6d1657.

📒 Files selected for processing (3)

custom_components/multiscrape/extractors.py
custom_components/multiscrape/http_session.py
tests/test_auth_failure.py

📝 Walkthrough

This PR enhances the multiscrape component with XML parser support and improved authentication error handling. It adds XML labeling to the parser, refines text extraction for HTML-like elements, implements defensive error handling in form submission and HTTP requests with a new authentication invalidation method, and provides comprehensive test coverage for authentication failures, cookie persistence, and XML scraping scenarios.

Changes

XML Scraping Support

Layer / File(s)	Summary
Parser Behavior `custom_components/multiscrape/parsers.py`	HtmlParser now labels lxml-xml parser as "xml (lxml-xml)" to distinguish XML parsing mode from HTML parsing.
Text Extraction Refinement `custom_components/multiscrape/extractors.py`	`_extract_tag_value` now conditionally returns `tag.string` for style/script/template elements when extract mode is "text", preserving empty-string cases while falling back to standard text extraction otherwise.
Test Fixtures `tests/fixtures/xml_samples.py`	Nine XML sample constants added: SAMPLE_XML_RSS, SAMPLE_XML_WEATHER, SAMPLE_XML_SOAP, SAMPLE_XML_ATTRIBUTES, SAMPLE_XML_CDATA, SAMPLE_XML_EMPTY, SAMPLE_XML_NAMESPACES, SAMPLE_XML_HTML_TAG_NAMES, SAMPLE_XML_MALFORMED for various XML scenarios.
XML Scraping Tests `tests/test_xml_scraping.py`	Comprehensive test suite covering basic XML parsing, RSS feeds, weather data, attribute extraction, CDATA handling, extract modes, templating, namespaces, SOAP, HTML-colliding tags, error cases, and parser factory integration with lxml-xml.
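
The extraction rule described in the table can be illustrated with a small sketch. It is an analogy, not the integration's code: it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, and the function name `extract_value` and the mode names `"text"`, `"content"`, and `"tag"` are assumptions made for the example. The point it shows is that no tag name is special-cased, so XML elements called `template`, `script`, or `style` are extracted like any other element.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def extract_value(elem: ET.Element, extract: Optional[str] = "text") -> str:
    """Return an element's value according to the extract mode.

    None is treated the same as "text". The element's name plays no role,
    so <template>, <script> and <style> are not short-circuited.
    """
    if extract in (None, "text"):
        # Concatenate the text of the element and all its descendants.
        return "".join(elem.itertext()).strip()
    if extract == "content":
        # Inner markup: leading text plus each serialized child.
        children = "".join(ET.tostring(c, encoding="unicode") for c in elem)
        return (elem.text or "") + children
    if extract == "tag":
        # The whole element, including its own start and end tags.
        return ET.tostring(elem, encoding="unicode")
    raise ValueError(f"Unknown extract mode: {extract!r}")
```

For example, given `<template>Hello <b>world</b></template>`, the `"text"` mode yields `Hello world` and the `"content"` mode yields `Hello <b>world</b>`.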

Authentication Error Handling & Invalidation

Layer / File(s)	Summary
Error Handling in Content Retrieval `custom_components/multiscrape/coordinator.py`	Added httpx import and wrapped form-submit/content retrieval in try/except to catch `httpx.HTTPStatusError`; 401/403 errors are logged and re-raised; other HTTP errors are logged and execution continues; general exceptions are also caught and logged.
Defensive Form Submission `custom_components/multiscrape/form_auth.py`	`ensure_authenticated` now wraps form submission in try/except to preserve previously scraped form variables and log a warning before re-raising errors. New private method `_submit_form` encapsulates the form submission flow.
Authentication Invalidation API `custom_components/multiscrape/http_session.py`	New public method `invalidate_auth` clears authentication state by invoking FormAuthenticator invalidate, clearing the httpx client cookie jar, and logging the action to ensure fresh authentication on next use.
Authentication Failure Tests `tests/test_auth_failure.py`	Extensive test suite covering Basic Auth failures, Form Auth failures, re-authentication flows, ContentRequestManager error recovery, 401/403 rejection semantics, cookie jar handling on reauth, and stale form variable preservation using respx mocking and httpx simulation.
Cookie Persistence Tests `tests/test_cookie_persistence.py`	Test module validating single and multiple cookie persistence across requests, server-driven updates, persistence across scraping cycles, cookie flow during form-auth workflows, and edge cases (empty values, special characters, cross-session isolation, long tokens).
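
The defensive form-submission behavior can be sketched as follows. This is a simplified model under assumptions, not the code in `form_auth.py`: the class `FormAuthSketch` and the injected `submit` callable are hypothetical, and the real `_submit_form` performs HTTP requests rather than calling a function.

```python
import logging

_LOGGER = logging.getLogger(__name__)


class FormAuthSketch:
    """Hypothetical, simplified model of a form authenticator."""

    def __init__(self, submit) -> None:
        self._submit = submit        # callable returning scraped variables
        self.form_variables = {}     # variables from the last successful login
        self._should_submit = True   # True until a login succeeds

    def ensure_authenticated(self) -> None:
        if not self._should_submit:
            return
        try:
            variables = self._submit()
        except Exception:
            # Keep the old variables, but tell the user they may be stale.
            if self.form_variables:
                _LOGGER.warning(
                    "Re-auth failed; form_variables from a previous login are stale"
                )
            raise
        self.form_variables = variables
        self._should_submit = False

    def invalidate(self) -> None:
        """Force a new form submit on the next ensure_authenticated()."""
        self._should_submit = True
```

State is only updated after a successful submit, so a failed re-authentication leaves the earlier variables in place and the error still reaches the caller.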

Sequence Diagram

sequenceDiagram
    participant Client as Client/Coordinator
    participant FA as FormAuthenticator
    participant CRM as ContentRequestManager
    participant Server as Server
    participant HS as HttpSession

    Client->>HS: ensure_authenticated()
    HS->>FA: ensure_authenticated()
    FA->>Server: GET /login (fetch form)
    Server-->>FA: form page + cookies
    FA->>FA: extract form inputs
    FA->>Server: POST /login (submit form)<br/>(with credentials)
    alt Submission Success
        Server-->>FA: 200 OK + session cookies
        FA->>Server: GET /data (fetch content)
        Server-->>FA: 200 OK + data
        FA-->>HS: authenticated
    else Submission Fails (any error)
        Server-->>FA: error (4xx/5xx)
        FA->>FA: preserve form_variables<br/>log warning
        FA-->>HS: error (re-raise)
        HS-->>Client: error propagates
    end

    alt HTTPStatusError (401/403)
        Client->>CRM: get_content()
        CRM->>Server: POST /submit (form data)
        Server-->>CRM: 401/403 Unauthorized
        CRM->>CRM: log + re-raise
        CRM-->>Client: HTTPStatusError
    else HTTPStatusError (other)
        Client->>CRM: get_content()
        CRM->>Server: POST /submit
        Server-->>CRM: 4xx/5xx error
        CRM->>CRM: log HTTP error
        CRM-->>Client: continue (no raise)
    end

    Client->>HS: invalidate_auth()
    HS->>FA: invalidate()
    HS->>HS: clear cookie jar
    HS->>HS: log cleared

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

danieldotnl/ha-multiscrape#568: Modifies _extract_tag_value handling for script/style/template elements and parser naming behavior, directly overlapping with extractors.py and parsers.py changes in this PR.
danieldotnl/ha-multiscrape#561: Updates ContentRequestManager.get_content flow in coordinator.py with HTTP error handling and form-auth request processing, matching the error-handling enhancements in this PR.
danieldotnl/ha-multiscrape#567: Modifies form submission flow in form_auth.py and introduces HttpSession.invalidate_auth, directly related to the defensive error handling and auth invalidation features in this PR.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main changes: fixing XML scraping data loss, fixing auth failure handling, and adding comprehensive tests.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

…guard

- Fix _extract_tag_value to handle extract=None (non-schema path) the same as extract="text", preventing regression for internally-constructed selectors that bypass schema validation
- Guard cookie jar clearing in invalidate_auth() behind _should_submit check so cookies are preserved when resubmit_on_error=False (clearing cookies without ability to re-auth would break the session)
- Add test verifying cookies are preserved when resubmit is disabled

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
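
The cookie-jar guard from this commit can be sketched like this. The classes `FormAuthStub` and `SessionSketch` are hypothetical, a plain `dict` stands in for the httpx cookie jar, and the way `invalidate()` interacts with `resubmit_on_error` is an assumption inferred from the commit message rather than taken from the code.

```python
import logging

_LOGGER = logging.getLogger(__name__)


class FormAuthStub:
    """Hypothetical stand-in for FormAuthenticator."""

    def __init__(self, resubmit_on_error: bool) -> None:
        self.resubmit_on_error = resubmit_on_error
        self.should_submit = False  # False once a login has succeeded

    def invalidate(self) -> None:
        # Assumption: a new login is only scheduled when resubmitting is allowed.
        if self.resubmit_on_error:
            self.should_submit = True


class SessionSketch:
    """Hypothetical, simplified model of the HTTP session wrapper."""

    def __init__(self, form_auth: FormAuthStub, cookies: dict) -> None:
        self.form_auth = form_auth
        self.cookies = dict(cookies)  # stand-in for the httpx cookie jar

    def invalidate_auth(self) -> None:
        self.form_auth.invalidate()
        # Only drop cookies if a fresh login will actually follow.
        if self.form_auth.should_submit:
            self.cookies.clear()
            _LOGGER.debug("Cleared cookie jar for fresh authentication")
```

With `resubmit_on_error=False` the session keeps its cookies, since clearing them without being able to log in again would leave it unauthenticated.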

danieldotnl merged commit 480ba31 into master May 6, 2026
7 checks passed

danieldotnl deleted the fix/xml-scraping-auth-bugs-test-coverage branch May 6, 2026 20:58

danieldotnl merged 2 commits into master from fix/xml-scraping-auth-bugs-test-coverage
