Randomly draw rows (with optional seed) from a CSV using ten different sampling methods - reservoir (default), indexed, bernoulli, systematic, stratified, weighted, varopt, mergeable-reservoir, cluster & timeseries sampling. Supports sampling from CSVs on remote URLs. Uses the stats cache to skip unnecessary scanning and inform its sampling strategies.
Table of Contents | Source: src/cmd/sample.rs | 📇🏎️🌐🎲🪄
Description ↩
Randomly samples CSV data.
It supports ten sampling methods:
- RESERVOIR: the default sampling method when NO INDEX is present and no sampling method is specified. Visits every CSV record exactly once, using MEMORY PROPORTIONAL to the sample size (k) - O(k). https://en.wikipedia.org/wiki/Reservoir_sampling
- INDEXED: the default sampling method when an INDEX is present and no sampling method is specified. Uses random I/O to sample efficiently, as it only visits records selected by random indexing, using MEMORY PROPORTIONAL to the sample size (k) - O(k). https://en.wikipedia.org/wiki/Random_access
- BERNOULLI: the sampling method when the --bernoulli option is specified. Each record has an independent probability p of being selected, where p is specified by the <sample-size> argument. For example, if p=0.1, then each record has a 10% chance of being selected, regardless of the other records. The final sample size is random and follows a binomial distribution. Uses CONSTANT MEMORY - O(1). When sampling from a remote URL, processes the file in chunks without downloading it entirely, making it especially efficient for sampling large remote files. https://en.wikipedia.org/wiki/Bernoulli_sampling
- SYSTEMATIC: the sampling method when the --systematic option is specified. Selects every nth record from the input, where n is the integer part of <sample-size> and the fractional part is the percentage of the population to sample. For example, if <sample-size> is 10.5, it will select every 10th record and 50% of the population. If <sample-size> is a whole number (no fractional part), it will select every nth record for the whole population. Uses CONSTANT memory - O(1). The starting point can be specified as "random" or "first". Useful for time series data or when you want evenly spaced samples. https://en.wikipedia.org/wiki/Systematic_sampling
- STRATIFIED: the sampling method when the --stratified option is specified. Stratifies the population by the specified column and then samples from each stratum. Particularly useful when a population has distinct subgroups (strata) that are homogeneous within but heterogeneous between in terms of the variable of interest. For example, if you want to sample 1,000 records from a population of 100,000 across the US, you can stratify the population by US state and then sample 20 records from each stratum. This will ensure that you have a representative sample from each of the 50 states. The sample size must be a whole number. Uses MEMORY PROPORTIONAL to the number of strata (s) and samples per stratum (k) as specified by <sample-size> - O(s*k). https://en.wikipedia.org/wiki/Stratified_sampling
- WEIGHTED: the sampling method when the --weighted option is specified. Samples records with probabilities proportional to values in a specified weight column. Records with higher weights are more likely to be selected. For example, if you have sales data and want to sample transactions weighted by revenue, high-value transactions will have a higher chance of being included. Non-numeric weights are treated as zero. The weights are automatically normalized using the maximum weight in the dataset. Specify the desired sample size with <sample-size>. Uses MEMORY PROPORTIONAL to the sample size (k) - O(k). "Weighted random sampling with a reservoir" https://doi.org/10.1016/j.ipl.2005.11.003
- VAROPT: the sampling method when the --varopt option is specified. Variance-bounded weighted reservoir sampling using the A-ExpJ keying scheme of Efraimidis and Spirakis (2006). For each record, computes a key u^(1/w) and retains the items with the largest keys. Unlike the --weighted method, it does NOT require a stats cache, runs in a single pass, and supports merging across partitions through the --sketch-out and --sketch-in options. Suitable for heavy-tailed weight distributions where bounded-variance estimators are needed. Uses MEMORY PROPORTIONAL to the sample size (k) - O(k). This is a native Rust implementation written from the original paper; the analogous VarOpt sketches in the Apache DataSketches library use the same family of algorithms but are NOT used here. Algorithm: "Weighted random sampling with a reservoir" https://doi.org/10.1016/j.ipl.2005.11.003
- MERGEABLE-RESERVOIR: the sampling method when the --mergeable-reservoir flag is set. Uniform reservoir sample using Vitter's Algorithm R. Same statistical distribution as the default RESERVOIR method, but the sampler state is mergeable: a sketch written by one run can be combined with sketches from other runs via the --sketch-out and --sketch-in options, producing a uniform sample of the combined stream WITHOUT re-reading the input files. Useful for sharded or incremental sampling pipelines. Uses MEMORY PROPORTIONAL to the sample size (k) - O(k). Native Rust implementation; the analogous ReservoirItemsSketch in the Apache DataSketches library implements the same algorithm but is NOT used here. https://en.wikipedia.org/wiki/Reservoir_sampling
- CLUSTER: the sampling method when the --cluster option is specified. Samples entire groups of records together based on a cluster identifier column. The number of clusters is specified by the <sample-size> argument. Useful when records are naturally grouped (e.g., by household, neighborhood, etc.). For example, if you have records grouped by neighborhood and specify a sample size of 10, it will randomly select 10 neighborhoods and include ALL records from those neighborhoods in the output. This ensures that natural groupings in the data are preserved. Uses MEMORY PROPORTIONAL to the number of clusters (c) - O(c). https://en.wikipedia.org/wiki/Cluster_sampling
- TIMESERIES: the sampling method when the --timeseries option is specified. Samples records based on time intervals from a time-series dataset. Groups records by time windows (e.g., hourly, daily, weekly) and selects one record per interval. Supports adaptive sampling (e.g., prefer business hours or weekends) and aggregation (e.g., mean, sum, min, max) within each interval. The starting point can be "first" (earliest), "last" (most recent), or "random". Particularly useful for time-series data where simple row-based sampling would always return the same records due to sorting. Uses MEMORY PROPORTIONAL to the number of records - O(n).
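To make the O(k) memory claim for RESERVOIR sampling concrete, here is a minimal Python sketch of Algorithm R. It is illustrative only and is not qsv's Rust implementation; the function name `reservoir_sample` and its parameters are assumptions made for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return a uniform sample of up to k records in a single pass.

    Hypothetical helper for illustration; at most k records are held in memory.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Record i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir
```

Every record is visited exactly once and the reservoir never grows past k entries, which is why memory is proportional to the sample size rather than to the input size.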
Supports sampling from CSVs on remote URLs. For all sampling methods except Bernoulli, the entire file is first downloaded to a temporary file before sampling begins. Bernoulli sampling instead streams the file as it samples, stopping when the desired sample size is reached or the end of the file is reached.
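The reason Bernoulli sampling can work on a stream is that each record is decided independently, so nothing has to be buffered. The sketch below illustrates this in Python; it is not qsv's code, and `bernoulli_sample` is a hypothetical name used only here.

```python
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def bernoulli_sample(records: Iterable[T], p: float, seed: Optional[int] = None) -> Iterator[T]:
    """Yield each record independently with probability p (hypothetical helper)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    rng = random.Random(seed)
    for record in records:
        # One coin flip per record; no record is stored, so memory stays constant.
        if rng.random() < p:
            yield record
```

Because the function is a generator, it can consume records as they arrive from a chunked download. The number of records it yields varies from run to run unless a seed is fixed, with an expected size of p times the number of input records.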
Sampling from stdin is also supported for all sampling methods. Stdin is first copied to an in-memory buffer before sampling begins.
If a stats cache is available, it will be used to do extra checks on systematic, weighted and cluster sampling, and to speed up sampling in general.
This command is intended to provide a means to sample from a CSV data set that is too big to fit into memory (for example, for use with commands like 'qsv stats' with the '--everything' option).
Examples ↩
Take a sample of 1000 records from data.csv using RESERVOIR or INDEXED sampling depending on whether an INDEX is present.
qsv sample 1000 data.csv

Take a sample of approximately 10% of the records from data.csv using RESERVOIR or INDEXED sampling depending on whether an INDEX is present.
qsv sample 0.1 data.csv

Take a sample using BERNOULLI sampling where each record has a 5% chance of being selected.
qsv sample --bernoulli 0.05 data.csv

Take a sample using SYSTEMATIC sampling where every 10th record is selected and approximately 50% of the population is sampled, starting from a random point.
qsv sample --systematic random 10.5 data.csv

Take a sample using STRATIFIED sampling where 20 records are sampled from each stratum defined by the 'State' column.
qsv sample --stratified State 20 data.csv

Take a sample using WEIGHTED sampling where records are sampled with probabilities proportional to the 'Revenue' column, for a total sample size of 1000 records.
qsv sample --weighted Revenue 1000 data.csv

Take a sample using CLUSTER sampling where 10 clusters defined by the 'Neighborhood' column are randomly selected and all records from those clusters are included in the sample.
qsv sample --cluster Neighborhood 10 data.csv

Take a sample using VAROPT (A-ExpJ weighted reservoir) sampling, weighted by the 'Revenue' column, for a sample size of 1000 records. Unlike --weighted, this does NOT require a stats cache.
qsv sample --varopt Revenue 1000 data.csv

Sample two shards and merge their sketches into a single uniform sample without re-reading the inputs.
qsv sample --mergeable-reservoir --sketch-out a.sk 1000 shard_a.csv
qsv sample --mergeable-reservoir --sketch-out b.sk 1000 shard_b.csv
qsv sample --sketch-in a.sk,b.sk 1000 -o merged.csv

For more examples, see tests.
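As a rough picture of what the STRATIFIED example computes, the Python sketch below keeps one small reservoir per stratum, which is where the O(s*k) memory bound comes from. This is an assumption-laden illustration, not qsv's implementation: the function name `stratified_sample`, the `stratum_of` callback, and the use of a per-stratum reservoir are all choices made for this example.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


def stratified_sample(
    rows: Iterable[T],
    stratum_of: Callable[[T], Hashable],
    k: int,
    seed: Optional[int] = None,
) -> Dict[Hashable, List[T]]:
    """Keep up to k uniformly chosen rows per stratum (hypothetical helper)."""
    rng = random.Random(seed)
    reservoirs: Dict[Hashable, List[T]] = defaultdict(list)
    seen: Dict[Hashable, int] = defaultdict(int)
    for row in rows:
        stratum = stratum_of(row)
        seen[stratum] += 1
        bucket = reservoirs[stratum]
        if len(bucket) < k:
            bucket.append(row)
        else:
            # Per-stratum Algorithm R: replace a slot with probability k / seen[stratum].
            j = rng.randrange(seen[stratum])
            if j < k:
                bucket[j] = row
    return dict(reservoirs)
```

With a 'State' column and k=20, `stratified_sample(rows, lambda r: r["State"], 20)` would return at most 20 rows for each state, and fewer for a state with fewer than 20 rows.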
Usage ↩
qsv sample [options] <sample-size> [<input>]
qsv sample --help
Arguments ↩
| Argument | Description |
|---|---|
| <input> | The CSV file to sample. This can be a local file, stdin, or a URL (http and https schemes supported). |
| <sample-size> | When using INDEXED, RESERVOIR or WEIGHTED sampling, the sample size. Can either be a whole number or a value between 0 and 1. If a fraction, specifies the sample size as a percentage of the population (e.g. 0.15 - 15 percent of the CSV). When using BERNOULLI sampling, the probability of selecting each record (between 0 and 1). When using SYSTEMATIC sampling, the integer part is the interval between records to sample & the fractional part is the percentage of the population to sample. When there is no fractional part, it will select every nth record for the entire population. When using STRATIFIED sampling, the stratum sample size. When using CLUSTER sampling, the number of clusters. When using TIMESERIES sampling, the interval number (treated as hours by default, e.g., 1 = 1 hour). Use --ts-interval for custom intervals like "1d" (daily), "1w" (weekly), "1m" (monthly), "1y" (yearly), etc. |
Sample Options ↩
| Option | Type | Description | Default |
|---|---|---|---|
| --seed | integer | Random Number Generator (RNG) seed. | |
| --rng | string | The Random Number Generator (RNG) algorithm to use. | standard |
Sampling Methods Options ↩
| Option | Type | Description | Default |
|---|---|---|---|
| --bernoulli | flag | Use Bernoulli sampling instead of indexed or reservoir sampling. When this flag is set, <sample-size> must be between 0 and 1 and represents the probability of selecting each record. | |
| --systematic | string | Use systematic sampling (every nth record as specified by <sample-size>). If the option's value is "random", the starting point is randomly chosen between 0 & n. If it is "first", the starting point is the first record. The sample size must be a whole number. Uses CONSTANT memory - O(1). | |
| --stratified | string | Use stratified sampling. The strata column is specified by the option's value. | |
| --weighted | string | Use weighted sampling. The weight column is specified by the option's value. | |
| --varopt | string | Use VAROPT weighted reservoir sampling (A-ExpJ keying). The weight column is specified by the option's value. | |
| --mergeable-reservoir | flag | Use a mergeable Algorithm-R reservoir sampler. Distribution is identical to the default RESERVOIR method, but the resulting sketch is mergeable via --sketch-out / --sketch-in. Cannot be combined with another sampling-method flag. | |
| --cluster | string | Use cluster sampling. The cluster column is specified by the option's value. | |
| --timeseries | string | Use time-series sampling. The time column is specified by the option's value. | |
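The two starting-point choices for --systematic can be pictured with a short Python sketch. It covers only the whole-number case (every nth record) and is not qsv's code; the name `systematic_sample` is hypothetical, and reading "between 0 & n" as an offset from 0 up to n-1 is an assumption of this example.

```python
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def systematic_sample(
    records: Iterable[T], n: int, start: str = "first", seed: Optional[int] = None
) -> Iterator[T]:
    """Yield every nth record, beginning at the first record or a random offset."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if start == "first":
        offset = 0
    elif start == "random":
        # Assumption: the random offset is drawn from 0 .. n-1.
        offset = random.Random(seed).randrange(n)
    else:
        raise ValueError('start must be "first" or "random"')
    for i, record in enumerate(records):
        # Only a counter is kept, so memory use is constant.
        if i >= offset and (i - offset) % n == 0:
            yield record
```

For instance, `list(systematic_sample(range(25), 10))` returns `[0, 10, 20]`, while `start="random"` shifts that evenly spaced pattern by the chosen offset.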
Time-Series Sampling Options ↩
| Option | Type | Description | Default |
|---|---|---|---|
| --ts-interval | string | Time interval for grouping records. Format: a number followed by a unit, where unit is h (hour), d (day), w (week), m (month), y (year). Examples: "1h", "1d", "1w", "2d" (every 2 days). If not specified, <sample-size> is treated as hours. | |
| --ts-start | string | Starting point for time-series sampling. Options: "first" (earliest timestamp), "last" (most recent timestamp), "random" (random starting point). | first |
| --ts-adaptive | string | Adaptive sampling mode for time-series data. Options: "business-hours" (prefer 9am-5pm Mon-Fri), "weekends" (prefer weekends), "business-days" (prefer weekdays), "both" (combine business-hours and weekends). | |
| --ts-aggregate | string | Aggregation function to apply within each time interval. Options: "first", "last", "mean", "sum", "count", "min", "max", "median". When specified, aggregates all records in each interval instead of selecting a single record. | |
| --ts-input-tz | string | Timezone for parsing input timestamps. Can be an IANA timezone name or "local" for the local timezone. | UTC |
| --ts-prefer-dmy | flag | Prefer to parse dates in dmy format. Otherwise, use mdy format. | |
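The idea of grouping records into fixed time windows and keeping one record per window can be sketched as follows. This Python example is an illustration under stated assumptions, not qsv's implementation: windows are anchored at the earliest timestamp (loosely mirroring a "first" starting point), the earliest record in each window is kept, and the name `sample_per_interval` is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Row = Tuple[datetime, str]  # (timestamp, payload)


def sample_per_interval(rows: List[Row], interval: timedelta) -> List[Row]:
    """Keep the earliest row in each fixed-width time window (hypothetical helper)."""
    if not rows:
        return []
    # All rows are held and sorted in memory, hence O(n) memory.
    ordered = sorted(rows, key=lambda r: r[0])
    origin = ordered[0][0]
    picked: Dict[int, Row] = {}
    for row in ordered:
        # timedelta // timedelta yields the integer index of the window.
        bucket = (row[0] - origin) // interval
        picked.setdefault(bucket, row)
    return [picked[b] for b in sorted(picked)]
```

With timestamps at 00:00, 00:30, 01:10 and 03:00 and a one-hour interval, the rows at 00:00, 01:10 and 03:00 are kept, and the empty 02:00 window contributes nothing.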
Sketch Options ↩
| Option | Type | Description | Default |
|---|---|---|---|
| --sketch-out | string | After sampling, also write a binary sketch describing the internal sampler state to the specified file. The blob can later be merged into another run via --sketch-in. Only valid with --varopt or --mergeable-reservoir. The format is qsv-specific and is not interoperable with serialized sketches from other tools. | |
| --sketch-in | string | Comma-separated list of sketch files produced by --sketch-out. CSV input is NOT read; the listed sketches (which must all be of the same sampler kind) are merged and the resulting sample is emitted as CSV. <sample-size> may be used to cap the merged sample below the sketches' own k. | |
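Why a key-based weighted sampler can be merged without re-reading its inputs is easiest to see in code. The Python sketch below uses the u^(1/w) keying described for VAROPT and keeps the k largest keys; merging is then just taking the k largest keys across sketches. It is an illustration only: it does not reflect qsv's binary sketch format or its Rust implementation, the names `weighted_sketch` and `merge_sketches` are hypothetical, and skipping non-positive weights is an assumption of this example.

```python
import heapq
import random
from typing import Iterable, List, Optional, Tuple

Entry = Tuple[float, str]  # (key, record)


def weighted_sketch(
    rows: Iterable[Tuple[str, float]], k: int, seed: Optional[int] = None
) -> List[Entry]:
    """Keep the k records with the largest keys u ** (1 / w) (hypothetical helper)."""
    if k <= 0:
        return []
    rng = random.Random(seed)
    heap: List[Entry] = []  # min-heap: the smallest retained key sits at heap[0]
    for record, weight in rows:
        if weight <= 0:
            continue  # assumption: records with non-positive weight are never sampled
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, record))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, record))
    return heap


def merge_sketches(sketches: Iterable[List[Entry]], k: int) -> List[Entry]:
    """Merge sketches by keeping the k largest keys across all of them."""
    return heapq.nlargest(k, (entry for sketch in sketches for entry in sketch))
```

Since each record's key depends only on its own weight and random draw, the k largest keys of the combined stream are always among the per-shard top-k sets, so the merge needs only the sketches. Each shard should use independent randomness (for example, different seeds) for the merged result to behave like a single-pass sample.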
Remote File Options ↩
| Option | Type | Description | Default |
|---|---|---|---|
| --user-agent | string | Specify a custom user agent to use when the input is a URL. It supports the following variables - $QSV_VERSION, $QSV_TARGET, $QSV_BIN_NAME, $QSV_KIND and $QSV_COMMAND. Try to follow the syntax here - https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent | |
| --timeout | integer | Timeout for downloading URLs in seconds. If 0, no timeout is used. | 30 |
| --max-size | integer | Maximum size of the file to download in MB before sampling. Will download the entire file if not specified. If the CSV is partially downloaded, the sample will be taken only from the downloaded portion. | |
| --force | flag | Do not use the stats cache, even if it's available. | |
Common Options ↩
| Option | Type | Description | Default |
|---|---|---|---|
| -h, --help | flag | Display this message | |
| -o, --output | string | Write output to the specified file instead of stdout. | |
| -n, --no-headers | flag | When set, the first row will be considered as part of the population to sample from. (When not set, the first row is the header row and will always appear in the output.) | |
| -d, --delimiter | string | The field delimiter for reading/writing CSV data. Must be a single character. | , |