
Commit 98b8562

PR remarks
1 parent 3398281 commit 98b8562

2 files changed

Lines changed: 47 additions & 35 deletions


README.md

Lines changed: 45 additions & 33 deletions
@@ -20,9 +20,9 @@ into NumPy arrays and PyTorch tensors with minimal overhead.
 By default, the reader uses the regular file API via
 `parquet::ParquetFileReader`. In most cases, this is the recommended choice.
 
-An alternative reader backend based on **io_uring** is also available. It can
-provide better performance, especially for very large datasets and when used
-together with **O_DIRECT**.
+An alternative reader backend based on **io_uring** is also available. It may
+provide better performance for some workloads, particularly when used together
+with **O_DIRECT**.
 
 To enable the alternative backend, set the `JJ_READER_BACKEND` environment
 variable to one of the following values:
@@ -50,21 +50,21 @@ your workload is I/O-bound or memory-/CPU-bound.
 
 For datasets larger than the available page cache, performance is typically
 I/O-bound. Enabling either `pre_buffer=True` or `prefetch_page_cache=True`
-brings throughput close to the raw I/O ceiling.
+brings throughput close to the raw I/O ceiling, but `prefetch_page_cache`
+avoids the increased LLC miss rate caused by `pre_buffer`
+(see [Page cache prefetching](#page-cache-prefetching-with-prefetch_page_cache) below).
 
 Recommended configuration:
 
 - `use_threads = True`, `prefetch_page_cache = True`, `pre_buffer = False`,
   with the default reader backend.
 
-Both options reach near-identical throughput. `prefetch_page_cache` avoids the
-temporary buffer copies that `pre_buffer` uses (see section below) and the
-increased LLC miss rate.
-
 ### Small datasets (fit in filesystem cache)
 
 For datasets that comfortably fit in RAM, performance is typically CPU- or
-memory-bound.
+memory-bound. Using `pre_buffer` is not recommended because it leads to an
+increased LLC miss rate and suboptimal performance
+(see [Page cache prefetching](#page-cache-prefetching-with-prefetch_page_cache) below).
 
 Recommended configuration:
 
@@ -73,6 +73,9 @@ Recommended configuration:
 
 ### Pre-buffering and `cache_options`
 
+If you use `pre_buffer=True` instead of `prefetch_page_cache`, the following
+tuning applies.
+
 When `pre_buffer=True`, Arrow merges nearby column ranges and reads them into
 temporary buffers. The default maximum merged range is 32 MB
 (`range_size_limit`).
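As an aside, the range-coalescing idea behind `pre_buffer` can be illustrated with a small standalone sketch. This is plain Python mimicking the concept, not Arrow's actual algorithm; the 32 MB cap comes from the `range_size_limit` default above, while the 8 KB gap tolerance is an assumed value for the illustration:

```python
# Sketch: coalesce nearby (offset, length) byte ranges under a size cap,
# the way pre-buffering merges adjacent column-chunk reads.

RANGE_SIZE_LIMIT = 32 * 1024 * 1024  # default maximum merged range: 32 MB
HOLE_SIZE_LIMIT = 8 * 1024           # assumed gap tolerance for this sketch

def merge_ranges(ranges, hole_limit=HOLE_SIZE_LIMIT, size_limit=RANGE_SIZE_LIMIT):
    """Merge sorted ranges whose gaps are at most hole_limit, capping size."""
    merged = []
    for offset, length in sorted(ranges):
        if merged:
            prev_offset, prev_length = merged[-1]
            gap = offset - (prev_offset + prev_length)
            new_length = offset + length - prev_offset
            if gap <= hole_limit and new_length <= size_limit:
                # Close enough and still under the cap: extend the last range.
                merged[-1] = (prev_offset, new_length)
                continue
        merged.append((offset, length))
    return merged

# Two nearby column chunks merge into one read; a distant one stays separate.
print(merge_ranges([(0, 1000), (1500, 1000), (10_000_000, 1000)]))
# → [(0, 2500), (10000000, 1000)]
```

Fewer, larger reads amortize per-request overhead, at the cost of also reading the small "holes" between the merged ranges.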
@@ -110,21 +113,22 @@ To debug allocator issues with mimalloc, run with `MIMALLOC_SHOW_STATS=1` and
 ### Pre-buffering and `ARROW_IO_THREADS`
 
 When `pre_buffer=True`, Arrow dispatches reads to its IO thread pool,
-configured via the `ARROW_IO_THREADS` environment variable (default: 8). 
+configured via the `ARROW_IO_THREADS` environment variable (default: 8).
 Tuning this value may improve performance.
 
 ### Page cache prefetching with `prefetch_page_cache`
 
-With `pre_buffer=True`, Arrow's IO thread pool allocates temporary buffers
-and fills them on the IO thread's core. When worker threads on different
-cores later consume those buffers, the data is cold in their caches,
-causing LLC misses.
+The `prefetch_page_cache` option calls `posix_fadvise(POSIX_FADV_WILLNEED)` to tell
+the kernel to start loading the relevant byte ranges into the page cache.
+Each worker thread then reads directly via `pread` into its own
+locally-allocated buffer, keeping data hot in its local CPU caches.
+
+This avoids the LLC (Last Level Cache) miss problem with `pre_buffer=True`,
+where Arrow's IO thread pool fills temporary buffers on one core and worker
+threads on different cores later consume cold data.
 
-`prefetch_page_cache` provides an alternative: it calls
-`posix_fadvise(POSIX_FADV_WILLNEED)` to tell the kernel to start loading
-the relevant byte ranges into the page cache. Each worker thread then
-reads directly via `pread` into its own locally-allocated buffer, keeping
-data hot in its local CPU caches.
+This is only useful for local or network-mounted file systems that have a
+page cache. Remote file systems such as S3 will not benefit from this.
 
 Two ways to use it:
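As an aside, the `posix_fadvise` / `pread` mechanism described in the hunk above can be sketched with Python's `os` module. This illustrates the Linux syscalls involved, not this library's C++ implementation; the function names are made up for the sketch:

```python
import os

def prefetch_range(path, offset, length):
    """Hint the kernel to start loading a byte range into the page cache.

    POSIX_FADV_WILLNEED returns immediately; the kernel reads ahead in the
    background, so this is cheap to issue from any thread.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)

def read_range(path, offset, length):
    """Worker-side read: pread into a buffer owned by the calling thread.

    If the prefetch hint already pulled the range into the page cache,
    this pread is a cache hit and the bytes land in the worker's own CPU
    caches rather than on a remote core.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)
```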

@@ -141,11 +145,7 @@ jj.read_into_numpy(
 )
 ```
 
-This is only useful for local or network-mounted file systems that have a
-page cache. Remote file systems such as S3 will not benefit from this.
-
-**As a standalone call** (when you want to prefetch ahead of time, e.g.
-from a different thread):
+**As a standalone call:**
 
 ```python
 jj.prefetch_page_cache(
@@ -154,15 +154,27 @@ jj.prefetch_page_cache(
     row_group_indices=range(pr.metadata.num_row_groups),
     column_indices=range(pr.metadata.num_columns),
 )
+```
 
-jj.read_into_numpy(
-    source=path,
-    metadata=pr.metadata,
-    np_array=np_array,
-    row_group_indices=range(pr.metadata.num_row_groups),
-    column_indices=range(pr.metadata.num_columns),
-    pre_buffer=False,
-)
+Useful for sliding-window prefetching, where you prefetch the next files
+while processing the current one:
+
+```python
+# Prime the pump
+for path in file_paths[:PREFETCH_DEPTH]:
+    jj.prefetch_page_cache(source=path, ...)
+
+# Main loop
+for i, path in enumerate(file_paths):
+    # Slide the window
+    ahead_index = i + PREFETCH_DEPTH
+    if ahead_index < len(file_paths):
+        jj.prefetch_page_cache(source=file_paths[ahead_index], ...)
+
+    # Page cache should already be warm
+    jj.read_into_numpy(source=path, np_array=np_array, ...)
+
+    process(np_array)
 ```
 
 ## Requirements
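The sliding-window pattern added in the hunk above can also be written as a self-contained, runnable sketch, with a plain `POSIX_FADV_WILLNEED` hint standing in for `jj.prefetch_page_cache` and an ordinary file read standing in for `jj.read_into_numpy`. All names here are illustrative:

```python
import os

PREFETCH_DEPTH = 2  # how many files ahead of the current one to keep warm

def hint_willneed(path):
    """Stand-in for jj.prefetch_page_cache: hint the whole file."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, 0, os.fstat(fd).st_size, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)

def process_files(file_paths, process):
    """Read each file while keeping PREFETCH_DEPTH files ahead warm."""
    # Prime the pump
    for path in file_paths[:PREFETCH_DEPTH]:
        hint_willneed(path)

    for i, path in enumerate(file_paths):
        # Slide the window
        ahead = i + PREFETCH_DEPTH
        if ahead < len(file_paths):
            hint_willneed(file_paths[ahead])

        # Page cache should already be warm for this read
        with open(path, "rb") as f:
            process(f.read())
```

The depth trades memory pressure against latency hiding: a deeper window tolerates slower storage but pins more of the page cache.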
@@ -373,7 +385,7 @@ jj.read_into_torch(
     tensor=tensor,
     row_group_indices=range(pr.metadata.num_row_groups),
     column_indices=range(pr.metadata.num_columns),
-    pre_buffer=True,
+    prefetch_page_cache=True,
     use_threads=True,
 )
 
jollyjack/jollyjack.cc

Lines changed: 2 additions & 2 deletions
@@ -338,7 +338,7 @@ void ReadIntoMemory (std::shared_ptr<arrow::io::RandomAccessFile> source
     ).ValueOrDie();
     auto status = source->WillNeed(read_ranges);
     if (!status.ok()) {
-      throw std::logic_error(status.message());
+      throw std::runtime_error(status.message());
     }
   }
 
@@ -694,7 +694,7 @@ void PrefetchPageCache(
 
   auto status = source->WillNeed(read_ranges);
   if (!status.ok()) {
-    throw std::logic_error(status.message());
+    throw std::runtime_error(status.message());
   }
 }
 