
### Large datasets (exceed filesystem cache)

For datasets larger than the available page cache, performance is typically I/O-bound. Enabling either `pre_buffer=True` or `prefetch_page_cache=True` brings throughput close to the raw I/O ceiling.

Recommended configuration:

- `use_threads=True`, `prefetch_page_cache=True`, `pre_buffer=False`, with the default reader backend.

Both options reach near-identical throughput, but `prefetch_page_cache` avoids the temporary buffer copies that `pre_buffer` makes (see the section below) and the resulting increase in LLC miss rate.

### Small datasets (fit in filesystem cache)

For datasets that comfortably fit in RAM, performance is typically CPU- or memory-bound.

Recommended configuration:

- `use_threads=True`, `prefetch_page_cache=True`, `pre_buffer=False`, with the default reader backend.

### Pre-buffering and `cache_options`

### Pre-buffering and `ARROW_IO_THREADS`

When `pre_buffer=True`, Arrow dispatches reads to its IO thread pool, configured via the `ARROW_IO_THREADS` environment variable (default: 8). Tuning this value may improve performance.

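Since the value is typically read once, when Arrow first creates its global IO thread pool, it generally needs to be set before `pyarrow` is imported. A minimal sketch (the value `16` is illustrative, not a recommendation):

```python
import os

# Set before importing pyarrow: Arrow reads this once, when it first
# creates its global IO thread pool.
os.environ["ARROW_IO_THREADS"] = "16"  # illustrative value
```

Setting it in the shell instead (`ARROW_IO_THREADS=16 python script.py`) achieves the same thing for the whole process.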
### Page cache prefetching with `prefetch_page_cache`

With `pre_buffer=True`, Arrow's IO thread pool allocates temporary buffers and fills them on the IO thread's core. When worker threads on different cores later consume those buffers, the data is cold in their caches, causing LLC misses.

`prefetch_page_cache` provides an alternative: it calls `posix_fadvise(POSIX_FADV_WILLNEED)` to tell the kernel to start loading the relevant byte ranges into the page cache. Each worker thread then reads directly via `pread` into its own locally allocated buffer, keeping data hot in its local CPU caches.

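The hint-then-read pattern can be sketched with Python's standard library; this is a toy illustration of the mechanism, not the library's actual implementation (the temporary file and sizes are made up):

```python
import os
import tempfile

# Create a small file to stand in for a Parquet column chunk.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 4096)
    path = f.name

fd = os.open(path, os.O_RDONLY)
try:
    # Ask the kernel to start loading bytes [0, 4096) into the page cache.
    os.posix_fadvise(fd, 0, 4096, os.POSIX_FADV_WILLNEED)
    # A worker thread later reads the (now warm) range directly into its
    # own buffer via pread.
    buf = os.pread(fd, 4096, 0)
finally:
    os.close(fd)
os.unlink(path)
```

`posix_fadvise` is advisory: the kernel may begin readahead asynchronously, so the later `pread` ideally hits the page cache instead of the device.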
Two ways to use it:

**As a parameter on `read_into_numpy`:**

```python
jj.read_into_numpy(
    source=path,
    metadata=pr.metadata,
    np_array=np_array,
    row_group_indices=range(pr.metadata.num_row_groups),
    column_indices=range(pr.metadata.num_columns),
    prefetch_page_cache=True,
)
```

This is only useful for local or network-mounted file systems that have a page cache. Remote file systems such as S3 will not benefit from this.

**As a standalone call** (when you want to prefetch ahead of time, e.g. from a different thread):

```python
jj.prefetch_page_cache(
    source=path,
    metadata=pr.metadata,
    row_group_indices=range(pr.metadata.num_row_groups),
    column_indices=range(pr.metadata.num_columns),
)

jj.read_into_numpy(
    source=path,
    metadata=pr.metadata,
    np_array=np_array,
    row_group_indices=range(pr.metadata.num_row_groups),
    column_indices=range(pr.metadata.num_columns),
    pre_buffer=False,
)
```

## Requirements

- pyarrow ~= 24.0.0
print(np_array)
```

### Using page cache prefetching
```python
np_array = np.zeros((n_rows, n_columns), dtype="f", order="F")
pr = pq.ParquetReader()
pr.open(path)

# cache_options controls which byte ranges are prefetched into the page cache
cache_options = pa.CacheOptions(
    hole_size_limit=8192,
    range_size_limit=16 * 1024 * 1024,
    lazy=False,
)

# Prefetch and read in one call
jj.read_into_numpy(
    source=path,
    metadata=pr.metadata,
    np_array=np_array,
    row_group_indices=range(pr.metadata.num_row_groups),
    column_indices=range(pr.metadata.num_columns),
    cache_options=cache_options,
    prefetch_page_cache=True,
)

# Or prefetch separately, then read
jj.prefetch_page_cache(
    source=path,
    metadata=pr.metadata,
    row_group_indices=range(pr.metadata.num_row_groups),
    column_indices=range(pr.metadata.num_columns),
    cache_options=cache_options,
)
jj.read_into_numpy(
    source=path,
    metadata=pr.metadata,
    np_array=np_array,
    row_group_indices=range(pr.metadata.num_row_groups),
    column_indices=range(pr.metadata.num_columns),
    pre_buffer=False,
)
```

### Generating a PyTorch tensor to read into
```python
import torch