Commit b713461

PR remarks
1 parent 3398281 commit b713461

3 files changed

Lines changed: 78 additions & 66 deletions

README.md

Lines changed: 75 additions & 63 deletions
@@ -20,9 +20,9 @@ into NumPy arrays and PyTorch tensors with minimal overhead.
 By default, the reader uses the regular file API via
 `parquet::ParquetFileReader`. In most cases, this is the recommended choice.
 
-An alternative reader backend based on **io_uring** is also available. It can
-provide better performance, especially for very large datasets and when used
-together with **O_DIRECT**.
+An alternative reader backend based on **io_uring** is also available. It may
+provide better performance for some workloads, particularly when used together
+with **O_DIRECT**.
 
 To enable the alternative backend, set the `JJ_READER_BACKEND` environment
 variable to one of the following values:
@@ -50,29 +50,93 @@ your workload is I/O-bound or memory-/CPU-bound.
 
 For datasets larger than the available page cache, performance is typically
 I/O-bound. Enabling either `pre_buffer=True` or `prefetch_page_cache=True`
-brings throughput close to the raw I/O ceiling.
+brings throughput close to the raw I/O ceiling, but `prefetch_page_cache`
+avoids the increased LLC miss rate caused by `pre_buffer`
+(see [Page cache prefetching](#page-cache-prefetching-with-prefetch_page_cache) below).
 
 Recommended configuration:
 
 - `use_threads = True`, `prefetch_page_cache = True`, `pre_buffer = False`,
   with the default reader backend.
 
-Both options reach near-identical throughput. `prefetch_page_cache` avoids the
-temporary buffer copies that `pre_buffer` uses (see section below) and the
-increased LLC miss rate.
-
 ### Small datasets (fit in filesystem cache)
 
 For datasets that comfortably fit in RAM, performance is typically CPU- or
-memory-bound.
+memory-bound. Using `pre_buffer` is not recommended because it leads to an
+increased LLC miss rate and suboptimal performance
+(see [Page cache prefetching](#page-cache-prefetching-with-prefetch_page_cache) below).
 
 Recommended configuration:
 
 - `use_threads = True`, `prefetch_page_cache = True`, `pre_buffer = False`,
   with the default reader backend.
 
+### Page cache prefetching with `prefetch_page_cache`
+
+The `prefetch_page_cache` option calls `posix_fadvise(POSIX_FADV_WILLNEED)` to tell
+the kernel to start loading the relevant byte ranges into the page cache.
+Each worker thread then reads directly via `pread` into its own
+locally-allocated buffer, keeping data hot in its local CPU caches.
+
+This avoids the LLC (Last Level Cache) miss problem with `pre_buffer=True`,
+where Arrow's IO thread pool fills temporary buffers on one core and worker
+threads on different cores later consume cold data.
+
+This is only useful for local or network-mounted file systems that have a
+page cache. Remote file systems such as S3 will not benefit from this.
+
+There are two ways to enable page cache prefetching:
+
+**As a parameter on `read_into_numpy`:**
+
+```python
+jj.read_into_numpy(
+    source=path,
+    metadata=pr.metadata,
+    np_array=np_array,
+    row_group_indices=range(pr.metadata.num_row_groups),
+    column_indices=range(pr.metadata.num_columns),
+    prefetch_page_cache=True,
+)
+```
+
+**As a standalone call:**
+
+```python
+jj.prefetch_page_cache(
+    source=path,
+    metadata=pr.metadata,
+    row_group_indices=range(pr.metadata.num_row_groups),
+    column_indices=range(pr.metadata.num_columns),
+)
+```
+
+Useful for sliding-window prefetching, where you prefetch the next files
+while processing the current one:
+
+```python
+# Prime the pump
+for path in file_paths[:PREFETCH_DEPTH]:
+    jj.prefetch_page_cache(source=path, ...)
+
+# Main loop
+for i, path in enumerate(file_paths):
+    # Slide the window
+    ahead_index = i + PREFETCH_DEPTH
+    if ahead_index < len(file_paths):
+        jj.prefetch_page_cache(source=file_paths[ahead_index], ...)
+
+    # Page cache should already be warm
+    jj.read_into_numpy(source=path, np_array=np_array, ...)
+
+    process(np_array)
+```
+
 ### Pre-buffering and `cache_options`
 
+If you use `pre_buffer=True` instead of `prefetch_page_cache`, the following
+tuning applies.
+
 When `pre_buffer=True`, Arrow merges nearby column ranges and reads them into
 temporary buffers. The default maximum merged range is 32 MB
 (`range_size_limit`).
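The `posix_fadvise` + `pread` mechanism described in the prefetch section above can be sketched with Python's thin wrappers over the same syscalls. This is an illustration of the kernel-level pattern only, not the jollyjack API; the file name, sizes, and offsets are arbitrary examples:

```python
import os

# Create a small file so the sketch is self-contained (illustrative only).
with open("demo.bin", "wb") as f:
    f.write(b"x" * 1_048_576)

fd = os.open("demo.bin", os.O_RDONLY)
try:
    # Step 1: hint the kernel to start loading the byte range into the
    # page cache (what prefetch_page_cache does per column-chunk range).
    os.posix_fadvise(fd, 0, 1_048_576, os.POSIX_FADV_WILLNEED)

    # Step 2: a worker later reads its own range via pread into a
    # locally-allocated buffer; the read is served from the warm page
    # cache, and the data lands hot in that worker's CPU caches.
    chunk = os.pread(fd, 4096, 512 * 1024)  # 4 KiB at offset 512 KiB
finally:
    os.close(fd)
    os.remove("demo.bin")
```

`os.posix_fadvise` is Linux/POSIX-only, which matches the README's note that this only helps file systems backed by a page cache.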
@@ -110,61 +174,9 @@ To debug allocator issues with mimalloc, run with `MIMALLOC_SHOW_STATS=1` and
 ### Pre-buffering and `ARROW_IO_THREADS`
 
 When `pre_buffer=True`, Arrow dispatches reads to its IO thread pool,
-configured via the `ARROW_IO_THREADS` environment variable (default: 8).
+configured via the `ARROW_IO_THREADS` environment variable (default: 8).
 Tuning this value may improve performance.
 
-### Page cache prefetching with `prefetch_page_cache`
-
-With `pre_buffer=True`, Arrow's IO thread pool allocates temporary buffers
-and fills them on the IO thread's core. When worker threads on different
-cores later consume those buffers, the data is cold in their caches,
-causing LLC misses.
-
-`prefetch_page_cache` provides an alternative: it calls
-`posix_fadvise(POSIX_FADV_WILLNEED)` to tell the kernel to start loading
-the relevant byte ranges into the page cache. Each worker thread then
-reads directly via `pread` into its own locally-allocated buffer, keeping
-data hot in its local CPU caches.
-
-Two ways to use it:
-
-**As a parameter on `read_into_numpy`:**
-
-```python
-jj.read_into_numpy(
-    source=path,
-    metadata=pr.metadata,
-    np_array=np_array,
-    row_group_indices=range(pr.metadata.num_row_groups),
-    column_indices=range(pr.metadata.num_columns),
-    prefetch_page_cache=True,
-)
-```
-
-This is only useful for local or network-mounted file systems that have a
-page cache. Remote file systems such as S3 will not benefit from this.
-
-**As a standalone call** (when you want to prefetch ahead of time, e.g.
-from a different thread):
-
-```python
-jj.prefetch_page_cache(
-    source=path,
-    metadata=pr.metadata,
-    row_group_indices=range(pr.metadata.num_row_groups),
-    column_indices=range(pr.metadata.num_columns),
-)
-
-jj.read_into_numpy(
-    source=path,
-    metadata=pr.metadata,
-    np_array=np_array,
-    row_group_indices=range(pr.metadata.num_row_groups),
-    column_indices=range(pr.metadata.num_columns),
-    pre_buffer=False,
-)
-```
-
 ## Requirements
 
 - pyarrow ~= 24.0.0
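As a concrete illustration of the `ARROW_IO_THREADS` tuning mentioned in the hunk above (the value 16 is an arbitrary example, not a recommendation): Arrow reads the variable when it first creates its IO thread pool, so it must be set before `pyarrow` is imported.

```python
import os

# Size Arrow's IO thread pool (default: 8). Arrow reads this once, when
# the pool is created, so set it before the first pyarrow import.
os.environ["ARROW_IO_THREADS"] = "16"

# Any pyarrow import after this point picks up the new pool size, e.g.:
# import pyarrow.parquet as pq
```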
@@ -373,7 +385,7 @@ jj.read_into_torch(
     tensor=tensor,
     row_group_indices=range(pr.metadata.num_row_groups),
     column_indices=range(pr.metadata.num_columns),
-    pre_buffer=True,
+    prefetch_page_cache=True,
     use_threads=True,
 )

jollyjack/jollyjack.cc

Lines changed: 2 additions & 2 deletions
@@ -338,7 +338,7 @@ void ReadIntoMemory (std::shared_ptr<arrow::io::RandomAccessFile> source
   ).ValueOrDie();
   auto status = source->WillNeed(read_ranges);
   if (!status.ok()) {
-    throw std::logic_error(status.message());
+    throw std::runtime_error(status.message());
   }
 }
 
@@ -694,7 +694,7 @@ void PrefetchPageCache(
 
   auto status = source->WillNeed(read_ranges);
   if (!status.ok()) {
-    throw std::logic_error(status.message());
+    throw std::runtime_error(status.message());
   }
 }

test/test_readme.py

Lines changed: 1 addition & 1 deletion
@@ -203,7 +203,7 @@
 # ```
 import torch
 
-# Create a tesnsor and transpose it to get Fortran-style order
+# Create a tensor and transpose it to get Fortran-style order
 tensor = torch.zeros(n_columns, n_rows, dtype=torch.float32).transpose(0, 1)
 # ```
 
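The transpose trick in the line this fix touches is worth a note: allocating `(columns, rows)` row-major and transposing yields a `(rows, columns)` view whose memory layout is Fortran-style (column-major). A quick self-contained illustration with NumPy; the shapes are arbitrary examples:

```python
import numpy as np

n_rows, n_columns = 4, 3
# Allocate (columns, rows) C-contiguous, then transpose: the result is a
# (rows, columns) view that is Fortran-contiguous (column-major), so each
# column occupies a contiguous run of memory.
arr = np.zeros((n_columns, n_rows), dtype=np.float32).T

assert arr.shape == (n_rows, n_columns)
assert arr.flags["F_CONTIGUOUS"] and not arr.flags["C_CONTIGUOUS"]
```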