README.md: 33 additions & 7 deletions

…into NumPy arrays and PyTorch tensors with minimal overhead.

## Known limitations

- Data must not contain null values
- Destination NumPy arrays and PyTorch tensors must be column-major (Fortran-style); see the sketch below

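For example, compatible destinations can be allocated as follows (a minimal sketch; the shape is a placeholder, and the transpose trick is one common way to get column-major strides in PyTorch, which has no Fortran-order flag):

```python
import numpy as np
import torch

n_rows, n_cols = 1_000, 64  # placeholder shape

# NumPy supports Fortran (column-major) order directly.
np_array = np.zeros((n_rows, n_cols), dtype="f", order="F")

# Transposing a (n_cols, n_rows) tensor yields an (n_rows, n_cols)
# view whose strides are column-major.
torch_tensor = torch.zeros(n_cols, n_rows, dtype=torch.float32).t()
```
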
## Selecting a reader backend

…your workload is I/O-bound or memory-/CPU-bound.

- Reusing NumPy arrays or PyTorch tensors avoids repeated memory allocation (see the sketch after this list).
- While allocation itself is fast, it can trigger kernel contention and degrade performance.

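A minimal sketch of reusing one destination buffer across files, assuming `jollyjack` as the module name and the `source=`, `metadata=`, and `np_array=` parameter names (only the index and prefetch parameters appear verbatim in this README):

```python
import jollyjack as jj  # assumed module name for this library
import numpy as np
import pyarrow.parquet as pq

paths = ["part-0.parquet", "part-1.parquet"]  # placeholder inputs
n_rows, n_columns = 1_000_000, 64             # placeholder shape

# Allocate one Fortran-order destination and reuse it for every file.
np_array = np.zeros((n_rows, n_columns), dtype="f", order="F")

for path in paths:
    pr = pq.ParquetReader()
    pr.open(path)
    jj.read_into_numpy(
        source=path,          # assumed parameter name
        metadata=pr.metadata,  # assumed parameter name
        np_array=np_array,    # assumed parameter name
        row_group_indices=range(pr.metadata.num_row_groups),
        column_indices=range(pr.metadata.num_columns),
    )
    # ... consume np_array before the next file overwrites it ...
```
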
### Large datasets (exceeding filesystem cache)

For datasets larger than the available page cache, performance is typically
I/O-bound. Enabling either `pre_buffer=True` or `prefetch_page_cache=True`…

Recommended configuration:
- `use_threads = True`, `prefetch_page_cache = True`, `pre_buffer = False`,
  with the default reader backend (see the sketch below).

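Applied to `read_into_numpy`, the recommended configuration looks roughly like this (the `source=`, `metadata=`, and `np_array=` parameter names are assumptions for illustration):

```python
jj.read_into_numpy(
    source=path,          # assumed parameter name
    metadata=pr.metadata,  # assumed parameter name
    np_array=np_array,    # assumed parameter name
    row_group_indices=range(pr.metadata.num_row_groups),
    column_indices=range(pr.metadata.num_columns),
    use_threads=True,
    prefetch_page_cache=True,
    pre_buffer=False,
)
```
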
### Small datasets (fitting in filesystem cache)

For datasets that comfortably fit in RAM, performance is typically CPU- or
memory-bound. Using `pre_buffer` is not recommended because it leads to an…

…threads on different cores later consume cold data.
This is only useful for local or network-mounted file systems that have a
page cache. Remote file systems such as S3 will not benefit from this.

The Linux kernel's [`force_page_cache_ra`](https://github.com/torvalds/linux/blob/57b8e2d666a31fa201432d58f5fe3469a0dd83ba/mm/readahead.c#L353-L358)
caps the number of pages read per `posix_fadvise` call to the block device's
[readahead window](https://www.kernel.org/doc/html/v5.3/block/queue-sysfs.html#read-ahead-kb-rw).
Any bytes beyond this cap are silently ignored. The readahead window is
typically 128 KB or higher. Check the value for your device:

```bash
cat /sys/block/<device>/queue/read_ahead_kb
```

If `range_size_limit` exceeds this value, most of each coalesced range will
not be prefetched. Set `range_size_limit` to match or stay below the device's
`read_ahead_kb`:

```python
import pyarrow as pa

cache_options = pa.CacheOptions(
    hole_size_limit=8192,
    range_size_limit=128*1024,  # must not exceed read_ahead_kb
    lazy=False,
)
```

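Since `read_ahead_kb` varies by device, the limit can also be derived at runtime. A minimal sketch, assuming the data lives on a device named `sda`; `read_ahead_bytes` is a hypothetical helper:

```python
from pathlib import Path

import pyarrow as pa

def read_ahead_bytes(device: str = "sda") -> int:
    """Return the device's readahead window in bytes, read from sysfs."""
    kb = int(Path(f"/sys/block/{device}/queue/read_ahead_kb").read_text())
    return kb * 1024

cache_options = pa.CacheOptions(
    hole_size_limit=8192,
    range_size_limit=read_ahead_bytes("sda"),  # stays at the kernel cap
    lazy=False,
)
```
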
There are two ways to enable page cache prefetching:

**As a parameter on `read_into_numpy`:**
```python
jj.read_into_numpy(
    # …
    row_group_indices=range(pr.metadata.num_row_groups),
    column_indices=range(pr.metadata.num_columns),
    prefetch_page_cache=True,
    cache_options=cache_options,
)
```

**As a standalone call:**

```python
jj.prefetch_page_cache(
    # …
    metadata=pr.metadata,
    row_group_indices=range(pr.metadata.num_row_groups),
    column_indices=range(pr.metadata.num_columns),
    cache_options=cache_options,
)
```

The standalone call is useful for sliding-window prefetching, where you
prefetch the next files while processing the current one:

```python
# Prime the pump
# …
```
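
In outline, the pattern looks like this. A sketch under assumptions: `jollyjack` as the module name, a `source=` parameter, placeholder `paths`, and a hypothetical `process()` step:

```python
import jollyjack as jj  # assumed module name for this library
import pyarrow.parquet as pq

def prefetch(path):
    pr = pq.ParquetReader()
    pr.open(path)
    jj.prefetch_page_cache(
        source=path,  # assumed parameter name
        metadata=pr.metadata,
        row_group_indices=range(pr.metadata.num_row_groups),
        column_indices=range(pr.metadata.num_columns),
        cache_options=cache_options,
    )

prefetch(paths[0])  # prime the pump
for i, path in enumerate(paths):
    if i + 1 < len(paths):
        prefetch(paths[i + 1])  # warm the cache for the next file
    process(path)               # hypothetical processing step
```
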
```python
np_array = np.zeros((n_rows, n_columns), dtype="f", order="F")

pr = pq.ParquetReader()
pr.open(path)

# cache_options controls which byte ranges are prefetched into the page cache.
# range_size_limit should not exceed the device's read_ahead_kb,
# because the kernel silently ignores readahead beyond that cap.
cache_options = pa.CacheOptions(
    hole_size_limit=8192,
    range_size_limit=128*1024,  # must not exceed read_ahead_kb
    lazy=False,
)

# …
```