Refactor/remove bs4 dependency by Marcusfam-RB · Pull Request #161 · autoscrape-labs/pydoll

Marcusfam-RB · 2025-06-17T18:49:18Z

Refactoring Pull Request

Refactoring Scope

Refactored the internal HTML text extraction logic by removing the BeautifulSoup dependency and replacing it with a custom implementation.

Related Issue(s)

Resolves #154

Description

Removed the BeautifulSoup dependency. The previously used get_text() method is replaced with a custom extract_text_from_html function in utils.py, which uses a new TextExtractor class to extract text from HTML. The new class is based on HTMLParser from the standard library's html.parser module and replicates the behavior of BeautifulSoup.get_text(), including:

Skipping tags: <script>, <style>, and <template>
Supporting strip to trim whitespace from each text node
Supporting separator to join text fragments
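
For readers without the diff at hand, the three behaviors listed above can be sketched roughly as follows. This is a minimal illustration built on html.parser.HTMLParser, not the code merged in this PR: only the names TextExtractor and extract_text_from_html come from the description, while the parameter defaults, the SKIP_TAGS attribute, and the internal bookkeeping are assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside skipped tags.

    Illustrative sketch only; the class merged in the PR may differ.
    """

    # Tags whose contents are dropped, per the PR description.
    SKIP_TAGS = frozenset({"script", "style", "template"})

    def __init__(self, separator="", strip=False):
        super().__init__(convert_charrefs=True)
        self._separator = separator
        self._strip = strip
        self._parts = []
        self._skip_depth = 0  # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = data.strip() if self._strip else data
        if text:  # drop nodes that are empty after stripping
            self._parts.append(text)

    def get_text(self):
        return self._separator.join(self._parts)


def extract_text_from_html(html, separator="", strip=False):
    """Return the visible text of `html`, joined with `separator`."""
    parser = TextExtractor(separator=separator, strip=strip)
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.get_text()
```

Under these assumptions, extract_text_from_html("<p> Hello </p><script>x</script><p>World</p>", separator=" ", strip=True) yields "Hello World". The depth counter rather than a boolean flag is a design choice so that nested skipped tags, such as a script inside a template, do not re-enable collection too early.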

Motivation

Reduce external dependencies; BeautifulSoup was only used for its get_text() method.

Performance Impact

Performance improved
Performance potentially decreased
No significant performance change
Performance impact unknown

Technical Debt

API Changes

No changes to public API
Public API changed, but backward compatible
Breaking changes to public API

Testing Strategy

Testing Checklist

Existing tests updated
New tests added for previously uncovered cases
All tests pass
Code coverage maintained or improved

Checklist before requesting a review

My code follows the style guidelines of this project
I have performed a thorough self-review of the refactored code
I have commented my code, particularly in complex areas
I have updated documentation if needed
I have run poetry run task lint and fixed any issues
I have run poetry run task test and all tests pass
My commits follow the conventional commits style

codecov · 2025-06-17T18:50:27Z

Codecov Report

All modified and coverable lines are covered by tests ✅


thalissonvs · 2025-06-18T03:06:24Z

Thanks @Marcusfam-RB, it's perfect.

Marcusfam-RB added 4 commits June 17, 2025 12:38

refactor: replace BeautifulSoup with custom HTML text extractor

dbac69f

chore: remove beautifulsoup4 from dependencies

a42ad8f

test: add tests for extract_text_from_html function

84e9bee

style: fix docstring line length

a21a405

thalissonvs approved these changes Jun 18, 2025


thalissonvs assigned Marcusfam-RB Jun 18, 2025

thalissonvs merged commit 6511d0d into autoscrape-labs:main Jun 18, 2025
6 checks passed

thalissonvs merged 4 commits into autoscrape-labs:main from Marcusfam-RB:Refactor/remove-bs4-dependency
