Commit b713461

PR remarks
1 parent 3398281 commit b713461

3 files changed

Lines changed: 78 additions & 66 deletions

README.md

Lines changed: 75 additions & 63 deletions
@@ -20,9 +20,9 @@ into NumPy arrays and PyTorch tensors with minimal overhead.
 By default, the reader uses the regular file API via
 `parquet::ParquetFileReader`. In most cases, this is the recommended choice.
 
-An alternative reader backend based on **io_uring** is also available. It can
-provide better performance, especially for very large datasets and when used
-together with **O_DIRECT**.
+An alternative reader backend based on **io_uring** is also available. It may
+provide better performance for some workloads, particularly when used together
+with **O_DIRECT**.
 
 To enable the alternative backend, set the `JJ_READER_BACKEND` environment
 variable to one of the following values:
@@ -50,29 +50,93 @@ your workload is I/O-bound or memory-/CPU-bound.
 
 For datasets larger than the available page cache, performance is typically
 I/O-bound. Enabling either `pre_buffer=True` or `prefetch_page_cache=True`
-brings throughput close to the raw I/O ceiling.
+brings throughput close to the raw I/O ceiling, but `prefetch_page_cache`
+avoids the increased LLC miss rate caused by `pre_buffer`
+(see [Page cache prefetching](#page-cache-prefetching-with-prefetch_page_cache) below).
 
 Recommended configuration:
 
 - `use_threads = True`, `prefetch_page_cache = True`, `pre_buffer = False`,
   with the default reader backend.
 
-Both options reach near-identical throughput. `prefetch_page_cache` avoids the
-temporary buffer copies that `pre_buffer` uses (see section below) and the
-increased LLC miss rate.
-
 ### Small datasets (fit in filesystem cache)
 
 For datasets that comfortably fit in RAM, performance is typically CPU- or
-memory-bound.
+memory-bound. Using `pre_buffer` is not recommended because it leads to an
+increased LLC miss rate and suboptimal performance
+(see [Page cache prefetching](#page-cache-prefetching-with-prefetch_page_cache) below).
 
 Recommended configuration:
 
 - `use_threads = True`, `prefetch_page_cache = True`, `pre_buffer = False`,
   with the default reader backend.
 
+### Page cache prefetching with `prefetch_page_cache`
+
+The `prefetch_page_cache` option calls `posix_fadvise(POSIX_FADV_WILLNEED)` to tell
+the kernel to start loading the relevant byte ranges into the page cache.
+Each worker thread then reads directly via `pread` into its own
+locally-allocated buffer, keeping data hot in its local CPU caches.
+
+This avoids the LLC (Last Level Cache) miss problem with `pre_buffer=True`,
+where Arrow's IO thread pool fills temporary buffers on one core and worker
+threads on different cores later consume cold data.
+
+This is only useful for local or network-mounted file systems that have a
+page cache. Remote file systems such as S3 will not benefit from this.
+
+There are two ways to enable page cache prefetching:
+
+**As a parameter on `read_into_numpy`:**
+
+```python
+jj.read_into_numpy(
+    source=path,
+    metadata=pr.metadata,
+    np_array=np_array,
+    row_group_indices=range(pr.metadata.num_row_groups),
+    column_indices=range(pr.metadata.num_columns),
+    prefetch_page_cache=True,
+)
+```
+
+**As a standalone call:**
+
+```python
+jj.prefetch_page_cache(
+    source=path,
+    metadata=pr.metadata,
+    row_group_indices=range(pr.metadata.num_row_groups),
+    column_indices=range(pr.metadata.num_columns),
+)
+```
+
+Useful for sliding-window prefetching, where you prefetch the next files
+while processing the current one:
+
+```python
+# Prime the pump
+for path in file_paths[:PREFETCH_DEPTH]:
+    jj.prefetch_page_cache(source=path, ...)
+
+# Main loop
+for i, path in enumerate(file_paths):
+    # Slide the window
+    ahead_index = i + PREFETCH_DEPTH
+    if ahead_index < len(file_paths):
+        jj.prefetch_page_cache(source=file_paths[ahead_index], ...)
+
+    # Page cache should already be warm
+    jj.read_into_numpy(source=path, np_array=np_array, ...)
+
+    process(np_array)
+```
+
 ### Pre-buffering and `cache_options`
 
+If you use `pre_buffer=True` instead of `prefetch_page_cache`, the following
+tuning applies.
+
 When `pre_buffer=True`, Arrow merges nearby column ranges and reads them into
 temporary buffers. The default maximum merged range is 32 MB
 (`range_size_limit`).
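The `posix_fadvise` + `pread` mechanism described in the prefetch section above can be sketched with Python's thin wrappers over the same syscalls. This is an illustration of the kernel-level pattern only, not the jollyjack API; the file name, sizes, and offsets are arbitrary examples:

```python
import os

# Create a small file so the sketch is self-contained (illustrative only).
with open("demo.bin", "wb") as f:
    f.write(b"x" * 1_048_576)

fd = os.open("demo.bin", os.O_RDONLY)
try:
    # Step 1: hint the kernel to start loading the byte range into the
    # page cache (what prefetch_page_cache does per column-chunk range).
    os.posix_fadvise(fd, 0, 1_048_576, os.POSIX_FADV_WILLNEED)

    # Step 2: a worker later reads its own range via pread into a
    # locally-allocated buffer; the read is served from the warm page
    # cache, and the data lands hot in that worker's CPU caches.
    chunk = os.pread(fd, 4096, 512 * 1024)  # 4 KiB at offset 512 KiB
finally:
    os.close(fd)
    os.remove("demo.bin")
```

`os.posix_fadvise` is Linux/POSIX-only, which matches the README's note that this only helps file systems backed by a page cache.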
@@ -110,61 +174,9 @@ To debug allocator issues with mimalloc, run with `MIMALLOC_SHOW_STATS=1` and
 ### Pre-buffering and `ARROW_IO_THREADS`
 
 When `pre_buffer=True`, Arrow dispatches reads to its IO thread pool,
-configured via the `ARROW_IO_THREADS` environment variable (default: 8).
+configured via the `ARROW_IO_THREADS` environment variable (default: 8).
 Tuning this value may improve performance.
 
-### Page cache prefetching with `prefetch_page_cache`
-
-With `pre_buffer=True`, Arrow's IO thread pool allocates temporary buffers
-and fills them on the IO thread's core. When worker threads on different
-cores later consume those buffers, the data is cold in their caches,
-causing LLC misses.
-
-`prefetch_page_cache` provides an alternative: it calls
-`posix_fadvise(POSIX_FADV_WILLNEED)` to tell the kernel to start loading
-the relevant byte ranges into the page cache. Each worker thread then
-reads directly via `pread` into its own locally-allocated buffer, keeping
-data hot in its local CPU caches.
-
-Two ways to use it:
-
-**As a parameter on `read_into_numpy`:**
-
-```python
-jj.read_into_numpy(
-    source=path,
-    metadata=pr.metadata,
-    np_array=np_array,
-    row_group_indices=range(pr.metadata.num_row_groups),
-    column_indices=range(pr.metadata.num_columns),
-    prefetch_page_cache=True,
-)
-```
-
-This is only useful for local or network-mounted file systems that have a
-page cache. Remote file systems such as S3 will not benefit from this.
-
-**As a standalone call** (when you want to prefetch ahead of time, e.g.
-from a different thread):
-
-```python
-jj.prefetch_page_cache(
-    source=path,
-    metadata=pr.metadata,
-    row_group_indices=range(pr.metadata.num_row_groups),
-    column_indices=range(pr.metadata.num_columns),
-)
-
-jj.read_into_numpy(
-    source=path,
-    metadata=pr.metadata,
-    np_array=np_array,
-    row_group_indices=range(pr.metadata.num_row_groups),
-    column_indices=range(pr.metadata.num_columns),
-    pre_buffer=False,
-)
-```
-
 ## Requirements
 
 - pyarrow ~= 24.0.0
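As a concrete illustration of the `ARROW_IO_THREADS` tuning mentioned in the hunk above (the value 16 is an arbitrary example, not a recommendation): Arrow reads the variable when it first creates its IO thread pool, so it must be set before `pyarrow` is imported.

```python
import os

# Size Arrow's IO thread pool (default: 8). Arrow reads this once, when
# the pool is created, so set it before the first pyarrow import.
os.environ["ARROW_IO_THREADS"] = "16"

# Any pyarrow import after this point picks up the new pool size, e.g.:
# import pyarrow.parquet as pq
```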
@@ -373,7 +385,7 @@ jj.read_into_torch(
     tensor=tensor,
     row_group_indices=range(pr.metadata.num_row_groups),
     column_indices=range(pr.metadata.num_columns),
-    pre_buffer=True,
+    prefetch_page_cache=True,
     use_threads=True,
 )

jollyjack/jollyjack.cc

Lines changed: 2 additions & 2 deletions
@@ -338,7 +338,7 @@ void ReadIntoMemory (std::shared_ptr<arrow::io::RandomAccessFile> source
   ).ValueOrDie();
   auto status = source->WillNeed(read_ranges);
   if (!status.ok()) {
-    throw std::logic_error(status.message());
+    throw std::runtime_error(status.message());
   }
 }
 
@@ -694,7 +694,7 @@ void PrefetchPageCache(
 
   auto status = source->WillNeed(read_ranges);
   if (!status.ok()) {
-    throw std::logic_error(status.message());
+    throw std::runtime_error(status.message());
   }
 }

test/test_readme.py

Lines changed: 1 addition & 1 deletion
@@ -203,7 +203,7 @@
 # ```
 import torch
 
-# Create a tesnsor and transpose it to get Fortran-style order
+# Create a tensor and transpose it to get Fortran-style order
 tensor = torch.zeros(n_columns, n_rows, dtype=torch.float32).transpose(0, 1)
 # ```
 
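The transpose trick in the line this fix touches is worth a note: allocating `(columns, rows)` row-major and transposing yields a `(rows, columns)` view whose memory layout is Fortran-style (column-major). A quick self-contained illustration with NumPy; the shapes are arbitrary examples:

```python
import numpy as np

n_rows, n_columns = 4, 3
# Allocate (columns, rows) C-contiguous, then transpose: the result is a
# (rows, columns) view that is Fortran-contiguous (column-major), so each
# column occupies a contiguous run of memory.
arr = np.zeros((n_columns, n_rows), dtype=np.float32).T

assert arr.shape == (n_rows, n_columns)
assert arr.flags["F_CONTIGUOUS"] and not arr.flags["C_CONTIGUOUS"]
```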