13 changes: 11 additions & 2 deletions docs/01_introduction/index.mdx
@@ -9,7 +9,16 @@ import CodeBlock from '@theme/CodeBlock';

import IntroductionExample from '!!raw-loader!./code/01_introduction.py';

The Apify SDK for Python is the official library for creating [Apify Actors](https://docs.apify.com/platform/actors) in Python. It provides useful features like Actor lifecycle management, local storage emulation, and Actor event handling.
The Apify SDK for Python is the official library for creating [Apify Actors](https://docs.apify.com/platform/actors) in Python. It gives you everything you need to build an Actor and run it both locally and on the [Apify platform](https://docs.apify.com/platform), including:

- **Actor lifecycle management** — initialization, graceful shutdown, status messages, rebooting, and metamorphing.
- **Storage access** — datasets, key-value stores, and request queues, with automatic local emulation when running outside the platform.
- **Actor input** — convenient access to the Actor input, including automatic decryption of secret fields.
- **Events & state persistence** — react to platform events (system info, migration, abort) and persist state across migrations and restarts.
- **Proxy management** — Apify Proxy and custom proxies, with session and tiered-proxy support.
- **Platform interaction** — start, call, and abort other Actors and tasks, create webhooks, and reach the full Apify API client.
- **Monetization** — charge users with the pay-per-event pricing model.
- **Framework integrations** — first-class support for [Crawlee](../guides/crawlee) and [Scrapy](../guides/scrapy).

<CodeBlock className="language-python">
{IntroductionExample}
@@ -29,7 +38,7 @@ Explore the Guides section in the sidebar for a deeper understanding of the SDK'

## Installation

The Apify SDK for Python requires Python version 3.10 or above. It is typically installed when you create a new Actor project using the [Apify CLI](https://docs.apify.com/cli). To install it manually in an existing project, use:
The Apify SDK for Python requires Python version 3.11 or above. It is typically installed when you create a new Actor project using the [Apify CLI](https://docs.apify.com/cli). To install it manually in an existing project, use:

```bash
pip install apify
1 change: 1 addition & 0 deletions docs/01_introduction/quick-start.mdx
@@ -86,6 +86,7 @@ To learn more about the features of the Apify SDK and how to use them, check out
- [Actor lifecycle](../concepts/actor-lifecycle)
- [Actor input](../concepts/actor-input)
- [Working with storages](../concepts/storages)
- [Storage clients](../concepts/storage-clients)
- [Actor events & state persistence](../concepts/actor-events)
- [Proxy management](../concepts/proxy-management)
- [Interacting with other Actors](../concepts/interacting-with-other-actors)
2 changes: 1 addition & 1 deletion docs/02_concepts/01_actor_lifecycle.mdx
@@ -106,4 +106,4 @@ Update the status only when the user's understanding of progress changes - avoid

## Conclusion

This page has presented the full Actor lifecycle: initialization, execution, error handling, rebooting, shutdown and status messages. You've seen how the SDK supports both context-based and manual control patterns. For deeper dives, explore the <ApiLink to="">reference docs</ApiLink>, [guides](https://docs.apify.com/sdk/python/docs/guides/beautifulsoup-httpx), and [platform documentation](https://docs.apify.com/platform).
This page has presented the full Actor lifecycle: initialization, execution, error handling, rebooting, shutdown and status messages. You've seen how the SDK supports both context-based and manual control patterns. For deeper dives, explore the <ApiLink to="class/Actor">`Actor` API reference</ApiLink>, [guides](https://docs.apify.com/sdk/python/docs/guides/beautifulsoup-httpx), and [platform documentation](https://docs.apify.com/platform).
2 changes: 1 addition & 1 deletion docs/02_concepts/03_storages.mdx
@@ -183,6 +183,6 @@ To check if all the requests in the queue are handled, you can use the <ApiLink

## Storage clients

Behind the scenes, the SDK uses storage clients to communicate with the storage backend. The appropriate client is selected automatically based on the runtime environment — on the Apify platform, data is persisted via the Apify API, while local runs use the filesystem. For most use cases, you don't need to think about storage clients at all. If you want to learn more about how storage clients work, the available implementations, or how to configure them, see the [Crawlee storage clients guide](https://crawlee.dev/python/docs/guides/storage-clients). The Apify-specific clients are available in the `apify.storage_clients` module.
Behind the scenes, the SDK uses storage clients to communicate with the storage backend. The appropriate client is selected automatically based on the runtime environment — on the Apify platform, data is persisted via the Apify API, while local runs use the filesystem. For most use cases, you don't need to think about storage clients at all. To learn about the available implementations, how to switch between a single and shared request queue, or how to configure a custom client, see the [Storage clients](./storage-clients) page. For a deeper look at how storage clients work internally, see the [Crawlee storage clients guide](https://crawlee.dev/python/docs/guides/storage-clients).

For comprehensive information about storage on the Apify platform, see the [storage documentation](https://docs.apify.com/platform/storage), including the pages on [datasets](https://docs.apify.com/platform/storage/dataset), [key-value stores](https://docs.apify.com/platform/storage/key-value-store), and [request queues](https://docs.apify.com/platform/storage/request-queue).
28 changes: 14 additions & 14 deletions docs/02_concepts/04_actor_events.mdx
@@ -14,6 +14,8 @@ During its runtime, the Actor receives Actor events sent by the Apify platform o

## Event types

A listener can optionally receive a single argument — a Pydantic model with the event's data. The table below lists the events, the type of that data object, and when each event is emitted.

<table>
<thead>
<tr>
@@ -25,25 +27,23 @@ During its runtime, the Actor receives Actor events sent by the Apify platform o
<tbody>
<tr>
<td><code>SYSTEM_INFO</code></td>
<td><pre>{`{
"created_at": datetime,
"cpu_current_usage": float,
"mem_current_bytes": int,
"is_cpu_overloaded": bool
}`}
</pre></td>
<td><ApiLink to="class/EventSystemInfoData"><code>EventSystemInfoData</code></ApiLink></td>
<td>
<p>This event is emitted regularly and it indicates the current resource usage of the Actor.</p>
The <code>is_cpu_overloaded</code> argument indicates whether the current CPU usage is higher than <code>Config.max_used_cpu_ratio</code>
<p>Emitted regularly to report the Actor's current resource usage. The
<code>cpu_info.used_ratio</code> field reports the fraction of CPU currently in use
(a float between <code>0.0</code> and <code>1.0</code>), and <code>memory_info.current_size</code>
reports the current memory usage. Compare <code>cpu_info.used_ratio</code> against
<code>Configuration.max_used_cpu_ratio</code> to detect CPU overload.</p>
</td>
</tr>
<tr>
<td><code>MIGRATING</code></td>
<td><code>None</code></td>
<td><ApiLink to="class/EventMigratingData"><code>EventMigratingData</code></ApiLink></td>
<td>
<p>Emitted when the Actor running on the Apify platform
is going to be <a href="https://docs.apify.com/platform/actors/development/state-persistence#what-is-a-migration">migrated</a>
{' '}to another worker server soon.</p>
{' '}to another worker server soon. The <code>time_remaining</code> field reports how much time
the Actor has left before it is force-migrated.</p>
You can use it to persist the state of the Actor so that once it is executed again on the new server,
it doesn't have to start over from the beginning.
Once you have persisted the state of your Actor, you can call <ApiLink to="class/Actor#reboot">`Actor.reboot`</ApiLink>
@@ -52,7 +52,7 @@ During its runtime, the Actor receives Actor events sent by the Apify platform o
</tr>
<tr>
<td><code>ABORTING</code></td>
<td><code>None</code></td>
<td><ApiLink to="class/EventAbortingData"><code>EventAbortingData</code></ApiLink></td>
<td>
When a user aborts an Actor run on the Apify platform,
they can choose to abort gracefully to allow the Actor some time before getting killed.
@@ -61,7 +61,7 @@ During its runtime, the Actor receives Actor events sent by the Apify platform o
</tr>
<tr>
<td><code>PERSIST_STATE</code></td>
<td><pre>{`{ "is_migrating": bool }`}</pre></td>
<td><ApiLink to="class/EventPersistStateData"><code>EventPersistStateData</code></ApiLink></td>
<td>
<p>Emitted at regular intervals (by default 60 seconds) to notify the Actor that it should persist its state,
in order to avoid repeating all work when the Actor restarts.</p>
@@ -73,7 +73,7 @@ During its runtime, the Actor receives Actor events sent by the Apify platform o
</tr>
<tr>
<td><code>EXIT</code></td>
<td><code>None</code></td>
<td><ApiLink to="class/EventExitData"><code>EventExitData</code></ApiLink></td>
<td>
Emitted by the SDK (not the platform) when the Actor is about to exit. You can use this event to perform final cleanup tasks,
such as closing external connections or sending notifications, before the Actor shuts down.
21 changes: 18 additions & 3 deletions docs/02_concepts/10_configuration.mdx
@@ -27,14 +27,29 @@ This will cause the Actor to persist its state every 10 seconds:

## Configuring via environment variables

All the configuration options can be set via environment variables. The environment variables are prefixed with `APIFY_`, and the configuration options are in uppercase, with underscores as separators. See the <ApiLink to="class/Configuration">`Configuration`</ApiLink> API reference for the full list of configuration options.
All configuration options can also be set via environment variables. Most options are read from an environment variable named after the option in uppercase; many options accept several aliases — commonly with an `APIFY_`, `ACTOR_`, or `CRAWLEE_` prefix. See the <ApiLink to="class/Configuration">`Configuration`</ApiLink> API reference for the full list of configuration options.

This Actor run will not persist its local storages to the filesystem:
For example, this Actor run will keep the contents of its local storages instead of purging them on start:

```bash
APIFY_PERSIST_STORAGE=0 apify run
APIFY_PURGE_ON_START=0 apify run
```

### Commonly used options

The table below lists a few options you are most likely to set yourself. When running on the Apify platform or via the Apify CLI, the platform-related options are populated automatically.

| Option | Environment variable | Default | Description |
| --- | --- | --- | --- |
| `token` | `APIFY_TOKEN` | `None` | API token used to authenticate calls to the Apify API. |
| `proxy_password` | `APIFY_PROXY_PASSWORD` | `None` | Password for [Apify Proxy](https://docs.apify.com/proxy). |
| `purge_on_start` | `APIFY_PURGE_ON_START` | `True` | Whether to purge local storages when the Actor starts. |
| `persist_state_interval` | `APIFY_PERSIST_STATE_INTERVAL_MILLIS` | `1 min` | How often the `PERSIST_STATE` event is emitted (the variable is in milliseconds). |
| `log_level` | `APIFY_LOG_LEVEL` | `'INFO'` | Minimum severity of log messages that are printed. |
| `headless` | `APIFY_HEADLESS` | `True` | Whether to run browsers in headless mode. |
| `storage_dir` | `APIFY_LOCAL_STORAGE_DIR` | `'./storage'` | Directory holding local storages when running outside the platform. |
| `is_at_home` | `APIFY_IS_AT_HOME` | `False` | Set by the platform — `True` when the Actor runs on Apify. |

## Reading the runtime environment

The <ApiLink to="class/Actor#get_env">`Actor.get_env`</ApiLink> method returns a dictionary with all `APIFY_*` environment variables parsed into their typed values. This is useful for inspecting the Actor's runtime context, such as the Actor ID, run ID, or default storage IDs. Variables that are not set or are invalid will have a value of `None`.
64 changes: 64 additions & 0 deletions docs/02_concepts/12_storage_clients.mdx
@@ -0,0 +1,64 @@
---
id: storage-clients
title: Storage clients
description: Choose and configure the backend the Actor uses for datasets, key-value stores, and request queues.
---

import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
import ApiLink from '@theme/ApiLink';

import SharedRequestQueueExample from '!!raw-loader!roa-loader!./code/12_shared_request_queue.py';
import CustomStorageClientExample from '!!raw-loader!roa-loader!./code/12_custom_storage_client.py';

Storage clients are the components that actually read and write your [storages](./storages) — datasets, key-value stores, and request queues. The Apify SDK selects an appropriate client automatically based on where the Actor runs, so for most Actors you never need to think about them. This page explains the available clients and how to customize them when you do.

## How the Actor selects a storage client

By default, the Actor uses a <ApiLink to="class/SmartApifyStorageClient">`SmartApifyStorageClient`</ApiLink> — a hybrid client that delegates to one of two underlying clients depending on the environment:

- When running **on the Apify platform** (detected automatically), or when you pass `force_cloud=True`, it uses the **cloud** client — <ApiLink to="class/ApifyStorageClient">`ApifyStorageClient`</ApiLink>, which persists data through the Apify API.
- When running **locally**, it uses the **local** client — <ApiLink to="class/FileSystemStorageClient">`FileSystemStorageClient`</ApiLink>, which emulates platform storages on your filesystem under the `storage` folder.

This is what lets the same Actor code run unchanged both locally and on the platform.
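The routing rule above can be pictured with a minimal, self-contained sketch. It does not use the SDK: `FakeStorageClient` and `HybridStorageClient` are hypothetical stand-ins, and the real `SmartApifyStorageClient` may be structured differently.

```python
from dataclasses import dataclass, field


@dataclass
class FakeStorageClient:
    """Stand-in for a real storage client; it only records what was opened."""

    label: str
    opened: list[str] = field(default_factory=list)

    def open_dataset(self, name: str) -> str:
        self.opened.append(name)
        return f'{self.label}:{name}'


@dataclass
class HybridStorageClient:
    """Hypothetical hybrid client that routes each call to one of two clients."""

    cloud: FakeStorageClient
    local: FakeStorageClient
    is_at_home: bool = False  # True when running on the platform

    def pick(self, force_cloud: bool = False) -> FakeStorageClient:
        # Use the cloud client on the platform or when explicitly forced.
        return self.cloud if (self.is_at_home or force_cloud) else self.local

    def open_dataset(self, name: str, force_cloud: bool = False) -> str:
        return self.pick(force_cloud).open_dataset(name)


client = HybridStorageClient(
    cloud=FakeStorageClient('cloud'),
    local=FakeStorageClient('local'),
)
print(client.open_dataset('results'))  # routed to the local client
print(client.open_dataset('results', force_cloud=True))  # routed to the cloud client
```

Keeping the decision in one place (`pick`) is what allows calling code to stay identical in both environments.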

## Available storage clients

The `apify.storage_clients` module provides the following clients:

- <ApiLink to="class/SmartApifyStorageClient">`SmartApifyStorageClient`</ApiLink> — the default hybrid client described above. It wraps a `cloud_storage_client` and a `local_storage_client` and routes each call to the right one.
- <ApiLink to="class/ApifyStorageClient">`ApifyStorageClient`</ApiLink> — talks to the Apify API. Used as the cloud client.
- <ApiLink to="class/FileSystemStorageClient">`FileSystemStorageClient`</ApiLink> — persists data to the local filesystem. Used as the default local client.
- <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink> — keeps everything in memory only; nothing is persisted. Useful for tests and short-lived runs.

## Single vs. shared request queue

`ApifyStorageClient` supports two ways of accessing the Apify request queue, selected via its `request_queue_access` argument:

- **`'single'`** (default) — optimized for a single consumer. It makes far fewer API calls, so it is cheaper and faster, but it does not support multiple clients consuming the same queue concurrently. This is the right choice for the majority of Actors.
- **`'shared'`** — supports multiple consumers working on the same queue at the same time, at the cost of more API calls.

To opt into the shared client, set it as the cloud client of the `SmartApifyStorageClient` in the [service locator](https://crawlee.dev/python/docs/guides/service-locator) before entering the Actor context:

<RunnableCodeBlock className="language-python" language="python">
{SharedRequestQueueExample}
</RunnableCodeBlock>

## Using cloud storage while running locally

When developing locally, storages are read from and written to the local filesystem by default. To work with a storage on the Apify platform instead — for example, to read the output of a remote Actor run — pass `force_cloud=True` to <ApiLink to="class/Actor#open_dataset">`Actor.open_dataset`</ApiLink>, <ApiLink to="class/Actor#open_key_value_store">`Actor.open_key_value_store`</ApiLink>, or <ApiLink to="class/Actor#open_request_queue">`Actor.open_request_queue`</ApiLink>. This requires an Apify token, provided via the `APIFY_TOKEN` environment variable.

## Customizing the storage client

You can replace either of the underlying clients — for example, to keep all local data in memory instead of on disk. To do this, set a `SmartApifyStorageClient` with your chosen sub-clients in the service locator **before** entering the Actor context (or awaiting <ApiLink to="class/Actor#init">`Actor.init`</ApiLink>):

<RunnableCodeBlock className="language-python" language="python">
{CustomStorageClientExample}
</RunnableCodeBlock>

:::note

The Actor's storage client must be a `SmartApifyStorageClient`. Setting a bare `ApifyStorageClient` or `MemoryStorageClient` directly in the service locator raises an error — wrap it in a `SmartApifyStorageClient` as shown above.

:::

For a deeper look at how storage clients work and how to write your own, see the [Crawlee storage clients guide](https://crawlee.dev/python/docs/guides/storage-clients).
9 changes: 5 additions & 4 deletions docs/02_concepts/code/04_actor_events.py
@@ -1,7 +1,6 @@
import asyncio
from typing import Any

from apify import Actor, Event
from apify import Actor, Event, EventPersistStateData


async def main() -> None:
@@ -15,9 +14,11 @@ async def main() -> None:
processed_items = actor_state

# Save the state when the `PERSIST_STATE` event happens
async def save_state(event_data: Any) -> None:
async def save_state(event_data: EventPersistStateData) -> None:
nonlocal processed_items
Actor.log.info('Saving Actor state', extra=event_data)
Actor.log.info(
'Persisting Actor state (migrating=%s)', event_data.is_migrating
)
await Actor.set_value('STATE', processed_items)

Actor.on(Event.PERSIST_STATE, save_state)
22 changes: 22 additions & 0 deletions docs/02_concepts/code/12_custom_storage_client.py
@@ -0,0 +1,22 @@
import asyncio

from crawlee import service_locator

from apify import Actor
from apify.storage_clients import MemoryStorageClient, SmartApifyStorageClient


async def main() -> None:
    # Keep all local data in memory instead of writing it to the filesystem
    # when running outside the Apify platform.
    service_locator.set_storage_client(
        SmartApifyStorageClient(local_storage_client=MemoryStorageClient()),
    )

    async with Actor:
        store = await Actor.open_key_value_store()
        await store.set_value('example', {'hello': 'world'})


if __name__ == '__main__':
    asyncio.run(main())
24 changes: 24 additions & 0 deletions docs/02_concepts/code/12_shared_request_queue.py
@@ -0,0 +1,24 @@
import asyncio

from crawlee import service_locator

from apify import Actor
from apify.storage_clients import ApifyStorageClient, SmartApifyStorageClient


async def main() -> None:
    # Use the shared Apify request queue client, which supports multiple
    # consumers working on the same queue at the cost of more API calls.
    service_locator.set_storage_client(
        SmartApifyStorageClient(
            cloud_storage_client=ApifyStorageClient(request_queue_access='shared'),
        )
    )

    async with Actor:
        request_queue = await Actor.open_request_queue()
        await request_queue.add_request('https://crawlee.dev')


if __name__ == '__main__':
    asyncio.run(main())