
Commit 75beb85

Merge pull request #387 from autoscrape-labs/feat/extractor
Implement declarative extractors
2 parents f6c202a + 597a914 commit 75beb85

26 files changed

Lines changed: 2997 additions & 314 deletions

README.md

Lines changed: 120 additions & 80 deletions
Original file line number | Diff line number | Diff line change
@@ -22,9 +22,9 @@
2222
<a href="#support">Support</a>
2323
</p>
2424

25-
Pydoll automates Chromium-based browsers (Chrome, Edge) by connecting directly to the Chrome DevTools Protocol over WebSocket. No WebDriver binary, no `navigator.webdriver` flag, no compatibility issues.
25+
Pydoll automates Chromium-based browsers (Chrome, Edge) by connecting directly to the Chrome DevTools Protocol over WebSocket. **No WebDriver binary, no `navigator.webdriver` flag, no compatibility issues.**
2626

27-
It combines a high-level API for common tasks with low-level CDP access for fine-grained control over network, fingerprinting, and browser behavior. The entire codebase is async-native and fully type-checked with mypy.
27+
It combines a high-level API for stealthy automation with low-level CDP access for fine-grained control over network, fingerprinting, and browser behavior. Its new **Pydantic-powered extraction engine** maps the DOM directly to structured, validated Python objects.
2828

2929
### Top Sponsors
3030

@@ -48,11 +48,11 @@ It combines a high-level API for common tasks with low-level CDP access for fine
4848

4949
### Why Pydoll
5050

51-
- **Stealth-first**: Human-like mouse movement, realistic typing, and granular [browser preference](https://pydoll.tech/docs/features/configuration/browser-preferences/) control for fingerprint management.
51+
- **Structured extraction**: Define a [Pydantic](https://docs.pydantic.dev/) model, call `tab.extract()`, get typed and validated data back. No manual element-by-element querying.
5252
- **Async and typed**: Built on `asyncio` from the ground up, 100% type-checked with `mypy`. Full IDE autocompletion and static error checking.
53+
- **Stealth built in**: Human-like mouse movement, realistic typing, and granular [browser preference](https://pydoll.tech/docs/features/configuration/browser-preferences/) control for fingerprint management.
5354
- **Network control**: [Intercept](https://pydoll.tech/docs/features/network/interception/) requests to block ads/trackers, [monitor](https://pydoll.tech/docs/features/network/monitoring/) traffic for API discovery, and make [authenticated HTTP requests](https://pydoll.tech/docs/features/network/http-requests/) that inherit the browser session.
5455
- **Shadow DOM and iframes**: Full support for [shadow roots](https://pydoll.tech/docs/deep-dive/architecture/shadow-dom/) (including closed) and cross-origin iframes. Discover, query, and interact with elements inside them using the same API.
55-
- **Ergonomic API**: `tab.find()` for most cases, `tab.query()` for complex [CSS/XPath selectors](https://pydoll.tech/docs/deep-dive/guides/selectors-guide/).
5656

5757
## Installation
5858

@@ -62,55 +62,124 @@ pip install pydoll-python
6262

6363
No WebDriver binaries or external dependencies required.
6464

65-
## What's New
65+
## Getting Started
6666

67-
<details>
68-
<summary><b>HAR Network Recording</b></summary>
69-
<br>
67+
### 1. Stateful Automation & Evasion
7068

71-
Record network activity during a browser session and export as HAR 1.2. Replay recorded requests to reproduce exact API sequences.
69+
When you need to navigate, bypass challenges, or interact with dynamic UI, Pydoll's imperative API handles it with humanized timing by default.
7270

7371
```python
74-
from pydoll.browser.chromium import Chrome
72+
import asyncio
73+
from pydoll.browser import Chrome
74+
from pydoll.constants import Key
7575

76-
async with Chrome() as browser:
77-
    tab = await browser.start()
76+
async def google_search(query: str):
77+
    async with Chrome() as browser:
78+
        tab = await browser.start()
79+
        await tab.go_to('https://www.google.com')
7880

79-
    async with tab.request.record() as capture:
80-
        await tab.go_to('https://example.com')
81+
        # Find elements and interact with human-like timing
82+
        search_box = await tab.find(tag_name='textarea', name='q')
83+
        await search_box.insert_text(query)
84+
        await tab.keyboard.press(Key.ENTER)
8185

82-
    capture.save('flow.har')
83-
    print(f'Captured {len(capture.entries)} requests')
86+
        first_result = await tab.find(
87+
            tag_name='h3',
88+
            text='autoscrape-labs/pydoll',
89+
            timeout=10,
90+
        )
91+
        await first_result.click()
92+
        print(f"Page loaded: {await tab.title}")
8493

85-
    responses = await tab.request.replay('flow.har')
94+
asyncio.run(google_search('pydoll site:github.com'))
8695
```
8796

88-
Filter by resource type:
97+
### 2. Structured Data Extraction
98+
99+
Once you reach the target page, switch to the declarative engine. Define what you want with a model, and Pydoll extracts it — typed, validated, and ready to use.
89100

90101
```python
91-
from pydoll.protocol.network.types import ResourceType
102+
import asyncio
from pydoll.browser.chromium import Chrome
103+
from pydoll.extractor import ExtractionModel, Field
104+
105+
class Quote(ExtractionModel):
106+
    text: str = Field(selector='.text', description='The quote text')
107+
    author: str = Field(selector='.author', description='Who said it')
108+
    tags: list[str] = Field(selector='.tag', description='Tags')
109+
    year: int | None = Field(selector='.year', description='Year', default=None)
92110

93-
async with tab.request.record(
94-
    resource_types=[ResourceType.FETCH, ResourceType.XHR]
95-
) as capture:
96-
    await tab.go_to('https://example.com')
111+
async def extract_quotes():
112+
    async with Chrome() as browser:
113+
        tab = await browser.start()
114+
        await tab.go_to('https://quotes.toscrape.com')
115+
116+
        quotes = await tab.extract_all(Quote, scope='.quote', timeout=5)
117+
118+
        for q in quotes:
119+
            print(f'{q.author}: {q.text}')  # fully typed, IDE autocomplete works
120+
            print(q.tags)  # list[str], not a raw element
121+
            print(q.model_dump_json())  # pydantic serialization built-in
122+
123+
asyncio.run(extract_quotes())
97124
```
98125

99-
[HAR Recording Docs](https://pydoll.tech/docs/features/network/network-recording/)
126+
Models support CSS/XPath auto-detection, HTML attribute targeting, custom transforms, and nested models.
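
The diff does not show how the CSS/XPath auto-detection decides between the two. One common heuristic, sketched here purely as an assumption and not as Pydoll's actual logic, treats selectors that start with `/`, `./`, `../`, or `(` as XPath and everything else as CSS. The function name `guess_selector_kind` is hypothetical.

```python
def guess_selector_kind(selector: str) -> str:
    """Return 'xpath' or 'css' using a prefix heuristic (assumed, not Pydoll's own rule)."""
    s = selector.strip()
    # XPath expressions usually begin with a path step or a parenthesised group.
    if s.startswith(('/', './', '../', '(')):
        return 'xpath'
    return 'css'


print(guess_selector_kind('.author'))            # css
print(guess_selector_kind('//div[@class="q"]'))  # xpath
```

A prefix check like this is cheap and predictable, but it cannot classify every expression; a real implementation may use different rules.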
127+
128+
<details>
129+
<summary><b>Nested models, transforms, and attribute extraction</b></summary>
130+
<br>
131+
132+
```python
133+
from datetime import datetime
134+
from pydoll.extractor import ExtractionModel, Field
135+
136+
def parse_date(raw: str) -> datetime:
137+
    return datetime.strptime(raw.strip(), '%B %d, %Y')
138+
139+
class Author(ExtractionModel):
140+
    name: str = Field(selector='.author-title')
141+
    born: datetime = Field(
142+
        selector='.author-born-date',
143+
        transform=parse_date,
144+
    )
145+
146+
class Article(ExtractionModel):
147+
    title: str = Field(selector='h1')
148+
    url: str = Field(selector='.source-link', attribute='href')
149+
    author: Author = Field(selector='.author-card', description='Nested model')
150+
151+
article = await tab.extract(Article, timeout=5)
152+
article.author.born.year # int — types are preserved all the way down
153+
```
100154
</details>
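
The `parse_date` transform in the nested-model example is plain standard-library code, so it can be checked on its own without a browser. The sample string below is an assumed input in "Month day, year" form and is not taken from the diff; `%B` matches English month names only under the default C locale.

```python
from datetime import datetime


def parse_date(raw: str) -> datetime:
    # Same transform as in the README example: trim whitespace, then parse "Month day, year".
    return datetime.strptime(raw.strip(), '%B %d, %Y')


born = parse_date('  March 14, 1879 ')
print(born.year, born.month, born.day)  # 1879 3 14
```

Because the transform returns a `datetime`, a field annotated as `datetime` receives a real date object and not a raw string.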
101155

156+
## Features
157+
102158
<details>
103-
<summary><b>Page Bundles</b></summary>
159+
<summary><b>Humanized Mouse Movement</b></summary>
104160
<br>
105161

106-
Save the current page and all its assets (CSS, JS, images, fonts) as a `.zip` bundle for offline viewing. Optionally inline everything into a single HTML file.
162+
Mouse operations produce human-like cursor movement by default:
163+
164+
- **Bezier curve paths** with asymmetric control points
165+
- **Fitts's Law timing**: duration scales with distance
166+
- **Minimum-jerk velocity**: bell-shaped speed profile
167+
- **Physiological tremor**: Gaussian noise scaled with velocity
168+
- **Overshoot correction**: ~70% chance on fast movements, then corrects back
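
Three of the ingredients listed above can be illustrated with the standard library alone. This is a minimal sketch under stated assumptions, not Pydoll's implementation: the constants `a` and `b`, the control-point offsets, and all function names are hypothetical, and tremor and overshoot are left out.

```python
import math


def fitts_duration(distance: float, width: float, a: float = 0.1, b: float = 0.15) -> float:
    """Fitts's Law (Shannon form): T = a + b * log2(D / W + 1). a and b are assumed constants."""
    return a + b * math.log2(distance / width + 1)


def min_jerk(t: float) -> float:
    """Minimum-jerk progress for normalised time t in [0, 1]; its derivative is bell-shaped."""
    return 10 * t**3 - 15 * t**4 + 6 * t**5


def cubic_bezier(p0, p1, p2, p3, s: float):
    """Point on a cubic Bezier curve at parameter s in [0, 1]."""
    u = 1 - s
    return tuple(
        u**3 * a + 3 * u**2 * s * b + 3 * u * s**2 * c + s**3 * d
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def path(start, end, steps: int = 20):
    """Sample a curved path whose point spacing follows the minimum-jerk profile."""
    dx, dy = end[0] - start[0], end[1] - start[1]
    # Asymmetric control points: 25% and 70% along the line, pushed sideways by different amounts.
    c1 = (start[0] + 0.25 * dx - 0.10 * dy, start[1] + 0.25 * dy + 0.10 * dx)
    c2 = (start[0] + 0.70 * dx - 0.05 * dy, start[1] + 0.70 * dy + 0.05 * dx)
    return [cubic_bezier(start, c1, c2, end, min_jerk(i / steps)) for i in range(steps + 1)]


points = path((0, 0), (500, 300))
print(points[0], points[-1])
print(fitts_duration(distance=583.1, width=40))
```

Feeding the minimum-jerk progress into the Bezier parameter makes the sampled points bunch up near the start and end and spread out in the middle, which is the slow-fast-slow pattern the bullet list describes.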
107169

108170
```python
109-
await tab.save_bundle('page.zip')
110-
await tab.save_bundle('page-inline.zip', inline_assets=True)
171+
await tab.mouse.move(500, 300)
172+
await tab.mouse.click(500, 300)
173+
await tab.mouse.drag(100, 200, 500, 400)
174+
175+
button = await tab.find(id='submit')
176+
await button.click()
177+
178+
# Opt out when speed matters
179+
await tab.mouse.click(500, 300, humanize=False)
111180
```
112181

113-
[Screenshots, PDFs & Bundles Docs](https://pydoll.tech/docs/features/automation/screenshots-and-pdfs/)
182+
[Mouse Control Docs](https://pydoll.tech/docs/features/automation/mouse-control/)
114183
</details>
115184

116185
<details>
@@ -139,75 +208,46 @@ Highlights:
139208
- `deep=True` traverses cross-origin iframes (OOPIFs)
140209
- Standard `find()`, `query()`, `click()` API inside shadow roots
141210

142-
```python
143-
# Cloudflare Turnstile inside a cross-origin iframe
144-
shadow_roots = await tab.find_shadow_roots(deep=True, timeout=10)
145-
for sr in shadow_roots:
146-
    checkbox = await sr.query('input[type="checkbox"]', raise_exc=False)
147-
    if checkbox:
148-
        await checkbox.click()
149-
```
150-
151211
[Shadow DOM Docs](https://pydoll.tech/docs/deep-dive/architecture/shadow-dom/)
152212
</details>
153213

154214
<details>
155-
<summary><b>Humanized Mouse Movement</b></summary>
215+
<summary><b>HAR Network Recording</b></summary>
156216
<br>
157217

158-
Mouse operations produce human-like cursor movement by default:
159-
160-
- **Bezier curve paths** with asymmetric control points
161-
- **Fitts's Law timing**: duration scales with distance
162-
- **Minimum-jerk velocity**: bell-shaped speed profile
163-
- **Physiological tremor**: Gaussian noise scaled with velocity
164-
- **Overshoot correction**: ~70% chance on fast movements, then corrects back
218+
Record network activity during a browser session and export as HAR 1.2. Replay recorded requests to reproduce exact API sequences.
165219

166220
```python
167-
await tab.mouse.move(500, 300)
168-
await tab.mouse.click(500, 300)
169-
await tab.mouse.drag(100, 200, 500, 400)
170-
171-
button = await tab.find(id='submit')
172-
await button.click()
173-
174-
# Opt out when speed matters
175-
await tab.mouse.click(500, 300, humanize=False)
176-
```
221+
from pydoll.browser.chromium import Chrome
177222

178-
[Mouse Control Docs](https://pydoll.tech/docs/features/automation/mouse-control/)
179-
</details>
223+
async with Chrome() as browser:
224+
    tab = await browser.start()
180225

181-
## Getting Started
226+
    async with tab.request.record() as capture:
227+
        await tab.go_to('https://example.com')
182228

183-
```python
184-
import asyncio
185-
from pydoll.browser import Chrome
186-
from pydoll.constants import Key
229+
    capture.save('flow.har')
230+
    print(f'Captured {len(capture.entries)} requests')
187231

188-
async def google_search(query: str):
189-
    async with Chrome() as browser:
190-
        tab = await browser.start()
191-
        await tab.go_to('https://www.google.com')
232+
    responses = await tab.request.replay('flow.har')
233+
```
192234

193-
        search_box = await tab.find(tag_name='textarea', name='q')
194-
        await search_box.insert_text(query)
195-
        await tab.keyboard.press(Key.ENTER)
235+
[HAR Recording Docs](https://pydoll.tech/docs/features/network/network-recording/)
236+
</details>
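
A HAR 1.2 file is ordinary JSON with a top-level `log` object whose `entries` each hold a `request` and a `response`, so a saved capture can be inspected with the standard library. The sketch below uses a small hand-written document as assumed sample data in place of a real `flow.har`; the helper name `json_endpoints` is hypothetical.

```python
import json

# A tiny hand-written HAR 1.2 document standing in for a saved capture (assumed sample data).
har_text = json.dumps({
    'log': {
        'version': '1.2',
        'creator': {'name': 'example', 'version': '0'},
        'entries': [
            {'request': {'method': 'GET', 'url': 'https://example.com/'},
             'response': {'status': 200, 'content': {'mimeType': 'text/html'}}},
            {'request': {'method': 'GET', 'url': 'https://example.com/api/items'},
             'response': {'status': 200, 'content': {'mimeType': 'application/json'}}},
        ],
    }
})


def json_endpoints(har: dict) -> list[str]:
    """Return 'METHOD url' for every entry whose response body is JSON."""
    return [
        f"{e['request']['method']} {e['request']['url']}"
        for e in har['log']['entries']
        if e['response']['content'].get('mimeType', '').startswith('application/json')
    ]


endpoints = json_endpoints(json.loads(har_text))
print(endpoints)  # ['GET https://example.com/api/items']
```

Filtering on the response MIME type is one simple way to surface API calls in a recording; it relies only on fields defined by the HAR format itself.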
196237

197-
        first_result = await tab.find(
198-
            tag_name='h3',
199-
            text='autoscrape-labs/pydoll',
200-
            timeout=10,
201-
        )
202-
        await first_result.click()
238+
<details>
239+
<summary><b>Page Bundles</b></summary>
240+
<br>
203241

204-
        await tab.find(id='repository-container-header', timeout=10)
205-
        print(f"Page loaded: {await tab.title}")
242+
Save the current page and all its assets (CSS, JS, images, fonts) as a `.zip` bundle for offline viewing. Optionally inline everything into a single HTML file.
206243

207-
asyncio.run(google_search('pydoll site:github.com'))
244+
```python
245+
await tab.save_bundle('page.zip')
246+
await tab.save_bundle('page-inline.zip', inline_assets=True)
208247
```
209248

210-
## Features
249+
[Screenshots, PDFs & Bundles Docs](https://pydoll.tech/docs/features/automation/screenshots-and-pdfs/)
250+
</details>
211251

212252
<details>
213253
<summary><b>Hybrid Automation (UI + API)</b></summary>
