@@ -20,9 +20,9 @@ into NumPy arrays and PyTorch tensors with minimal overhead.
 By default, the reader uses the regular file API via
 `parquet::ParquetFileReader`. In most cases, this is the recommended choice.
 
-An alternative reader backend based on **io_uring** is also available. It can
-provide better performance, especially for very large datasets and when used
-together with **O_DIRECT**.
+An alternative reader backend based on **io_uring** is also available. It may
+provide better performance for some workloads, particularly when used together
+with **O_DIRECT**.
 
 To enable the alternative backend, set the `JJ_READER_BACKEND` environment
 variable to one of the following values:
@@ -50,29 +50,93 @@ your workload is I/O-bound or memory-/CPU-bound.
 
 For datasets larger than the available page cache, performance is typically
 I/O-bound. Enabling either `pre_buffer=True` or `prefetch_page_cache=True`
-brings throughput close to the raw I/O ceiling.
+brings throughput close to the raw I/O ceiling, but `prefetch_page_cache`
+avoids the increased LLC miss rate caused by `pre_buffer`
+(see [Page cache prefetching](#page-cache-prefetching-with-prefetch_page_cache) below).
 
 Recommended configuration:
 
 - `use_threads=True`, `prefetch_page_cache=True`, `pre_buffer=False`,
   with the default reader backend.
 
-Both options reach near-identical throughput. `prefetch_page_cache` avoids the
-temporary buffer copies that `pre_buffer` uses (see section below) and the
-increased LLC miss rate.
-
 ### Small datasets (fit in filesystem cache)
 
 For datasets that comfortably fit in RAM, performance is typically CPU- or
-memory-bound.
+memory-bound. Using `pre_buffer` is not recommended because it leads to an
+increased LLC miss rate and suboptimal performance
+(see [Page cache prefetching](#page-cache-prefetching-with-prefetch_page_cache) below).
 
 Recommended configuration:
 
 - `use_threads=True`, `prefetch_page_cache=True`, `pre_buffer=False`,
   with the default reader backend.
 
+### Page cache prefetching with `prefetch_page_cache`
+
+The `prefetch_page_cache` option calls `posix_fadvise(POSIX_FADV_WILLNEED)` to tell
+the kernel to start loading the relevant byte ranges into the page cache.
+Each worker thread then reads directly via `pread` into its own
+locally-allocated buffer, keeping data hot in its local CPU caches.
+
+This avoids the LLC (Last Level Cache) miss problem with `pre_buffer=True`,
+where Arrow's IO thread pool fills temporary buffers on one core and worker
+threads on different cores later consume cold data.
+
+This is only useful for local or network-mounted file systems that have a
+page cache. Remote file systems such as S3 will not benefit from this.
+
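The kernel hint described above can be sketched with the Python stdlib binding for `posix_fadvise` (a simplified illustration of the mechanism on Linux, not the library's actual implementation, which would issue one hint per column-chunk byte range):

```python
import os
import tempfile

# A small throwaway file stands in for a Parquet column chunk.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 4096)
    path = f.name

fd = os.open(path, os.O_RDONLY)
try:
    # Hint the kernel to start loading this byte range into the page cache.
    os.posix_fadvise(fd, 0, 4096, os.POSIX_FADV_WILLNEED)
    # A later pread of the same range should then find the pages warm.
    data = os.pread(fd, 4096, 0)
finally:
    os.close(fd)
    os.unlink(path)
```

The hint is asynchronous and advisory: the read still succeeds even if the kernel has not finished (or chooses not to do) the readahead.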
+There are two ways to enable page cache prefetching:
+
+**As a parameter on `read_into_numpy`:**
+
+```python
+jj.read_into_numpy(
+    source=path,
+    metadata=pr.metadata,
+    np_array=np_array,
+    row_group_indices=range(pr.metadata.num_row_groups),
+    column_indices=range(pr.metadata.num_columns),
+    prefetch_page_cache=True,
+)
+```
+
+**As a standalone call:**
+
+```python
+jj.prefetch_page_cache(
+    source=path,
+    metadata=pr.metadata,
+    row_group_indices=range(pr.metadata.num_row_groups),
+    column_indices=range(pr.metadata.num_columns),
+)
+```
+
+Useful for sliding-window prefetching, where you prefetch the next files
+while processing the current one:
+
+```python
+# Prime the pump
+for path in file_paths[:PREFETCH_DEPTH]:
+    jj.prefetch_page_cache(source=path, ...)
+
+# Main loop
+for i, path in enumerate(file_paths):
+    # Slide the window
+    ahead_index = i + PREFETCH_DEPTH
+    if ahead_index < len(file_paths):
+        jj.prefetch_page_cache(source=file_paths[ahead_index], ...)
+
+    # Page cache should already be warm
+    jj.read_into_numpy(source=path, np_array=np_array, ...)
+
+    process(np_array)
+```
+
 ### Pre-buffering and `cache_options`
 
+If you use `pre_buffer=True` instead of `prefetch_page_cache`, the following
+tuning applies.
+
 When `pre_buffer=True`, Arrow merges nearby column ranges and reads them into
 temporary buffers. The default maximum merged range is 32 MB
 (`range_size_limit`).
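Arrow exposes these coalescing knobs via `pyarrow.CacheOptions`; the option names below are Arrow's own, but how (or whether) the reader forwards a `cache_options` argument is an assumption here:

```python
import pyarrow as pa

# Sketch: widen the merged reads. Ranges separated by holes up to
# hole_size_limit are coalesced, and each merged read is capped at
# range_size_limit (32 MB by default).
cache_options = pa.CacheOptions(
    hole_size_limit=4 * 1024 * 1024,
    range_size_limit=64 * 1024 * 1024,
)
```

Larger merged ranges mean fewer, bigger I/O requests, which tends to help on high-latency storage at the cost of reading more unused bytes.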
@@ -110,61 +174,9 @@ To debug allocator issues with mimalloc, run with `MIMALLOC_SHOW_STATS=1` and
 ### Pre-buffering and `ARROW_IO_THREADS`
 
 When `pre_buffer=True`, Arrow dispatches reads to its IO thread pool,
-configured via the `ARROW_IO_THREADS` environment variable (default: 8).
+configured via the `ARROW_IO_THREADS` environment variable (default: 8).
 Tuning this value may improve performance.
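For example (a minimal sketch: the variable is read when Arrow's IO pool is first used, so set it early in the process, before any Arrow I/O happens):

```python
import os

# Raise the Arrow IO thread pool size for this process. Must be set
# before Arrow's IO pool is created, i.e. before the first Arrow read
# (exporting it in the shell before launch works as well).
os.environ["ARROW_IO_THREADS"] = "16"
```

Alternatively, `pyarrow.set_io_thread_count()` adjusts the same pool at runtime.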
 
-### Page cache prefetching with `prefetch_page_cache`
-
-With `pre_buffer=True`, Arrow's IO thread pool allocates temporary buffers
-and fills them on the IO thread's core. When worker threads on different
-cores later consume those buffers, the data is cold in their caches,
-causing LLC misses.
-
-`prefetch_page_cache` provides an alternative: it calls
-`posix_fadvise(POSIX_FADV_WILLNEED)` to tell the kernel to start loading
-the relevant byte ranges into the page cache. Each worker thread then
-reads directly via `pread` into its own locally-allocated buffer, keeping
-data hot in its local CPU caches.
-
-Two ways to use it:
-
-**As a parameter on `read_into_numpy`:**
-
-```python
-jj.read_into_numpy(
-    source=path,
-    metadata=pr.metadata,
-    np_array=np_array,
-    row_group_indices=range(pr.metadata.num_row_groups),
-    column_indices=range(pr.metadata.num_columns),
-    prefetch_page_cache=True,
-)
-```
-
-This is only useful for local or network-mounted file systems that have a
-page cache. Remote file systems such as S3 will not benefit from this.
-
-**As a standalone call** (when you want to prefetch ahead of time, e.g.
-from a different thread):
-
-```python
-jj.prefetch_page_cache(
-    source=path,
-    metadata=pr.metadata,
-    row_group_indices=range(pr.metadata.num_row_groups),
-    column_indices=range(pr.metadata.num_columns),
-)
-
-jj.read_into_numpy(
-    source=path,
-    metadata=pr.metadata,
-    np_array=np_array,
-    row_group_indices=range(pr.metadata.num_row_groups),
-    column_indices=range(pr.metadata.num_columns),
-    pre_buffer=False,
-)
-```
-
 ## Requirements
 
 - pyarrow ~= 24.0.0
@@ -373,7 +385,7 @@ jj.read_into_torch(
     tensor=tensor,
     row_group_indices=range(pr.metadata.num_row_groups),
     column_indices=range(pr.metadata.num_columns),
-    pre_buffer=True,
+    prefetch_page_cache=True,
     use_threads=True,
 )
 