Refactor/remove bs4 dependency #161

Merged
thalissonvs merged 4 commits into autoscrape-labs:main from Marcusfam-RB:Refactor/remove-bs4-dependency
Jun 18, 2025

Conversation

@Marcusfam-RB
Contributor

Refactoring Pull Request

Refactoring Scope

Refactored the internal HTML text extraction logic by removing the BeautifulSoup dependency and replacing it with a custom implementation.

Related Issue(s)

Resolves #154

Description

Removed the BeautifulSoup dependency. The previously used get_text() method is replaced by a custom extract_text_from_html method in utils.py, which uses a new TextExtractor class to extract text from HTML. The new class is based on HTMLParser from the built-in html.parser module. It replicates the behavior of BeautifulSoup.get_text(), including:

  • Skipping tags: <script>, <style>, and <template>
  • Supporting strip to trim whitespace from each text node
  • Supporting separator to join text fragments
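
The behavior listed above can be sketched roughly as follows. This is a minimal illustration and not the code merged in this PR: only the names extract_text_from_html, TextExtractor, and HTMLParser come from the description, while the constructor parameters, the private attributes, and the get_text() helper are assumptions made for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside skipped tags.

    Hypothetical sketch; the attribute and parameter names are assumed.
    """

    SKIP_TAGS = {"script", "style", "template"}

    def __init__(self, separator="", strip=False):
        super().__init__()
        self._separator = separator
        self._strip = strip
        self._parts = []
        self._skip_depth = 0  # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # text belongs to <script>, <style>, or <template>
        text = data.strip() if self._strip else data
        if text:
            self._parts.append(text)

    def get_text(self):
        # Join the collected fragments with the configured separator.
        return self._separator.join(self._parts)


def extract_text_from_html(html, separator="", strip=False):
    """Return the visible text of an HTML string (assumed signature)."""
    parser = TextExtractor(separator=separator, strip=strip)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()
```

Under these assumptions, extract_text_from_html(html, separator=" ", strip=True) would be the counterpart of a call like soup.get_text(separator=" ", strip=True).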

Motivation

Reduce external dependencies. BeautifulSoup was only used for its get_text() method.

Performance Impact

  • Performance improved
  • Performance potentially decreased
  • No significant performance change
  • Performance impact unknown

Technical Debt

API Changes

  • No changes to public API
  • Public API changed, but backward compatible
  • Breaking changes to public API

Testing Strategy

Testing Checklist

  • Existing tests updated
  • New tests added for previously uncovered cases
  • All tests pass
  • Code coverage maintained or improved

Checklist before requesting a review

  • My code follows the style guidelines of this project
  • I have performed a thorough self-review of the refactored code
  • I have commented my code, particularly in complex areas
  • I have updated documentation if needed
  • I have run poetry run task lint and fixed any issues
  • I have run poetry run task test and all tests pass
  • My commits follow the conventional commits style

@codecov

codecov Bot commented Jun 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

@thalissonvs
Member

Thanks @Marcusfam-RB, it's perfect.

@thalissonvs thalissonvs merged commit 6511d0d into autoscrape-labs:main Jun 18, 2025
6 checks passed
Development

Successfully merging this pull request may close these issues.

[Refactor]: Remove bs4 dependency

2 participants