sample

Randomly draw rows (with optional seed) from a CSV using ten different sampling methods - reservoir (default), indexed, bernoulli, systematic, stratified, weighted, varopt, mergeable-reservoir, cluster & timeseries sampling. Supports sampling from CSVs on remote URLs. Uses the stats cache to skip unnecessary scanning and inform its sampling strategies.

Table of Contents | Source: src/cmd/sample.rs | 📇🏎️🌐🎲🪄

Description ↩

Randomly samples CSV data.

It supports ten sampling methods:

RESERVOIR: the default sampling method when NO INDEX is present and no sampling method is specified. Visits every CSV record exactly once, using MEMORY PROPORTIONAL to the sample size (k) - O(k). https://en.wikipedia.org/wiki/Reservoir_sampling
INDEXED: the default sampling method when an INDEX is present and no sampling method is specified. Uses random I/O to sample efficiently, as it only visits records selected by random indexing, using MEMORY PROPORTIONAL to the sample size (k) - O(k). https://en.wikipedia.org/wiki/Random_access
BERNOULLI: the sampling method when the --bernoulli option is specified. Each record has an independent probability p of being selected, where p is specified by the <sample-size> argument. For example, if p=0.1, then each record has a 10% chance of being selected, regardless of the other records. The final sample size is random and follows a binomial distribution. Uses CONSTANT MEMORY - O(1). When sampling from a remote URL, processes the file in chunks without downloading it entirely, making it especially efficient for sampling large remote files. https://en.wikipedia.org/wiki/Bernoulli_sampling
SYSTEMATIC: the sampling method when the --systematic option is specified. Selects every nth record from the input, where n is the integer part of <sample-size> and the fractional part is the percentage of the population to sample. For example, if <sample-size> is 10.5, it will select every 10th record and 50% of the population. If <sample-size> is a whole number (no fractional part), it will select every nth record for the whole population. Uses CONSTANT memory - O(1). The starting point can be specified as "random" or "first". Useful for time series data or when you want evenly spaced samples. https://en.wikipedia.org/wiki/Systematic_sampling
STRATIFIED: the sampling method when the --stratified option is specified. Stratifies the population by the specified column and then samples from each stratum. Particularly useful when a population has distinct subgroups (strata) that are homogeneous within but heterogeneous between in terms of the variable of interest. For example, if you want to sample 1,000 records from a population of 100,000 across the US, you can stratify the population by US state and then sample 20 records from each stratum. This will ensure that you have a representative sample from each of the 50 states. The sample size must be a whole number. Uses MEMORY PROPORTIONAL to the number of strata (s) and samples per stratum (k) as specified by <sample-size> - O(s*k). https://en.wikipedia.org/wiki/Stratified_sampling
WEIGHTED: the sampling method when the --weighted option is specified. Samples records with probabilities proportional to values in a specified weight column. Records with higher weights are more likely to be selected. For example, if you have sales data and want to sample transactions weighted by revenue, high-value transactions will have a higher chance of being included. Non-numeric weights are treated as zero. The weights are automatically normalized using the maximum weight in the dataset. Specify the desired sample size with <sample-size>. Uses MEMORY PROPORTIONAL to the sample size (k) - O(k). "Weighted random sampling with a reservoir" https://doi.org/10.1016/j.ipl.2005.11.003
VAROPT: the sampling method when the --varopt option is specified. Variance-bounded weighted reservoir sampling using the A-ExpJ keying scheme of Efraimidis and Spirakis (2006). For each record, computes a key u^(1/w) and retains the items with the largest keys. Unlike the --weighted method, it does NOT require a stats cache, runs in a single pass, and supports merge across partitions through the --sketch-out and --sketch-in options. Suitable for heavy-tailed weight distributions where bounded-variance estimators are needed. Uses MEMORY PROPORTIONAL to the sample size (k) - O(k). This is a native Rust implementation written from the original paper; the analogous VarOpt sketches in the Apache DataSketches library use the same family of algorithms but are NOT used here. "Weighted random sampling with a reservoir" https://doi.org/10.1016/j.ipl.2005.11.003
MERGEABLE-RESERVOIR: the sampling method when the --mergeable-reservoir flag is set. Uniform reservoir sample using Vitter's Algorithm R. Same statistical distribution as the default RESERVOIR method, but the sampler state is mergeable: a sketch written by one run can be combined with sketches from other runs via the --sketch-out and --sketch-in options, producing a uniform sample of the combined stream WITHOUT re-reading the input files. Useful for sharded or incremental sampling pipelines. Uses MEMORY PROPORTIONAL to the sample size (k) - O(k). Native Rust implementation; the analogous ReservoirItemsSketch in the Apache DataSketches library implements the same algorithm but is NOT used here. https://en.wikipedia.org/wiki/Reservoir_sampling
CLUSTER: the sampling method when the --cluster option is specified. Samples entire groups of records together based on a cluster identifier column. The number of clusters is specified by the <sample-size> argument. Useful when records are naturally grouped (e.g., by household, neighborhood, etc.). For example, if you have records grouped by neighborhood and specify a sample size of 10, it will randomly select 10 neighborhoods and include ALL records from those neighborhoods in the output. This ensures that natural groupings in the data are preserved. Uses MEMORY PROPORTIONAL to the number of clusters (c) - O(c). https://en.wikipedia.org/wiki/Cluster_sampling
TIMESERIES: the sampling method when the --timeseries option is specified. Samples records based on time intervals from a time-series dataset. Groups records by time windows (e.g., hourly, daily, weekly) and selects one record per interval. Supports adaptive sampling (e.g., prefer business hours or weekends) and aggregation (e.g., mean, sum, min, max) within each interval. The starting point can be "first" (earliest), "last" (most recent), or "random". Particularly useful for time-series data where simple row-based sampling would always return the same records due to sorting. Uses MEMORY PROPORTIONAL to the number of records - O(n).

Supports sampling from CSVs on remote URLs. For all sampling methods except Bernoulli, the entire file is first downloaded to a temporary file before sampling begins. Bernoulli sampling streams the file as it samples, stopping when the desired sample size is reached or the end of the file is reached.

Sampling from stdin is also supported for all sampling methods; stdin is copied to an in-memory buffer before sampling begins.

If a stats cache is available, it will be used to do extra checks on systematic, weighted and cluster sampling, and to speed up sampling in general.

This command is intended to provide a means to sample from a CSV data set that is too big to fit into memory (for example, for use with commands like 'qsv stats' with the '--everything' option).

Examples ↩

Take a sample of 1000 records from data.csv using RESERVOIR or INDEXED sampling depending on whether an INDEX is present.

qsv sample 1000 data.csv
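
To make the RESERVOIR idea concrete, here is a minimal Python sketch of Algorithm R, one common reservoir scheme. It is an illustration only and not qsv's Rust implementation; the function name `reservoir_sample` and its parameters are hypothetical. It visits each record once and holds at most k records in memory.

```python
import random

def reservoir_sample(records, k, seed=None):
    """Return a uniform sample of up to k items from an iterable (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # The first k records fill the reservoir directly.
            reservoir.append(record)
        else:
            # Record i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir

# Usage: sample 5 of 10,000 record numbers with a fixed seed for repeatability.
print(reservoir_sample(range(10_000), 5, seed=42))
```

Passing a seed makes the draw repeatable, which is the role `--seed` plays on the command line.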

Take a sample of approximately 10% of the records from data.csv using RESERVOIR or INDEXED sampling depending on whether an INDEX is present.

qsv sample 0.1 data.csv

Take a sample using BERNOULLI sampling where each record has a 5% chance of being selected

qsv sample --bernoulli 0.05 data.csv
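
As a rough illustration of why BERNOULLI sampling needs only constant memory, the following Python sketch makes an independent keep/skip decision per record. The generator name `bernoulli_sample` is hypothetical and this is not qsv's code.

```python
import random

def bernoulli_sample(records, p, seed=None):
    """Yield each record independently with probability p; O(1) extra memory."""
    rng = random.Random(seed)
    for record in records:
        # rng.random() is uniform on [0, 1), so this test succeeds with probability p.
        if rng.random() < p:
            yield record

# Usage: the sample size varies from run to run (binomially distributed around n * p).
picked = list(bernoulli_sample(range(100_000), 0.05, seed=7))
print(len(picked))
```

Because nothing is buffered, the same loop works on a stream that is still arriving, which matches the chunked handling of remote files described above.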

Take a sample using SYSTEMATIC sampling where every 10th record is selected and approximately 50% of the population is sampled, starting from a random point.

qsv sample --systematic random 10.5 data.csv
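
The whole-number case of SYSTEMATIC sampling can be sketched in a few lines of Python. This is an assumption-laden illustration: the name `systematic_sample` is hypothetical, and the fractional part of the sample size (the population percentage) is deliberately not modeled because this document does not spell out how it is applied.

```python
import random

def systematic_sample(records, n, start="first", seed=None):
    """Yield every nth record, starting at the first record or a random offset below n."""
    if n < 1:
        raise ValueError("interval n must be at least 1")
    rng = random.Random(seed)
    # "random" picks an offset in [0, n); "first" starts at record 0.
    offset = rng.randrange(n) if start == "random" else 0
    for i, record in enumerate(records):
        if i >= offset and (i - offset) % n == 0:
            yield record

# Usage: every 10th record of 25, starting from the first.
print(list(systematic_sample(range(25), 10)))
```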

Take a sample using STRATIFIED sampling where 20 records are sampled from each stratum defined by the 'State' column.

qsv sample --stratified State 20 data.csv
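
One way to picture STRATIFIED sampling with O(s*k) memory is to keep a small reservoir per stratum. The Python sketch below assumes rows are dictionaries keyed by column name; `stratified_sample` is a hypothetical helper and may differ from how qsv does it internally.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, k, seed=None):
    """Keep up to k uniformly chosen rows for each distinct value of rows[key]."""
    rng = random.Random(seed)
    reservoirs = defaultdict(list)  # stratum -> sampled rows (at most k each)
    seen = defaultdict(int)         # stratum -> number of rows seen so far
    for row in rows:
        stratum = row[key]
        seen[stratum] += 1
        bucket = reservoirs[stratum]
        if len(bucket) < k:
            bucket.append(row)
        else:
            # Per-stratum Algorithm R: replace a slot with probability k / seen.
            j = rng.randrange(seen[stratum])
            if j < k:
                bucket[j] = row
    return dict(reservoirs)

# Usage: 20 rows per state from a toy dataset of three states.
rows = [{"State": s, "id": i} for i in range(300) for s in ("CA", "NY", "TX")]
sample = stratified_sample(rows, "State", 20, seed=7)
print({state: len(picked) for state, picked in sample.items()})
```

A stratum with fewer than k rows simply contributes all of its rows.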

Take a sample using WEIGHTED sampling where records are sampled with probabilities proportional to the 'Revenue' column, for a total sample size of 1000 records.

qsv sample --weighted Revenue 1000 data.csv

Take a sample using CLUSTER sampling where 10 clusters defined by the 'Neighborhood' column are randomly selected and all records from those clusters are included in the sample.

qsv sample --cluster Neighborhood 10 data.csv
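
The CLUSTER method can be pictured as two passes: collect the distinct cluster identifiers, pick some at random, then keep every row that belongs to a picked cluster. The Python sketch below assumes `rows` is a list of dictionaries that can be scanned twice; `cluster_sample` is a hypothetical name, and the pass structure is an assumption rather than a description of qsv's internals.

```python
import random

def cluster_sample(rows, key, num_clusters, seed=None):
    """Pick num_clusters cluster ids at random and return ALL rows in those clusters."""
    rng = random.Random(seed)
    # Pass 1: distinct cluster ids in first-seen order, O(c) memory.
    cluster_ids = list(dict.fromkeys(row[key] for row in rows))
    chosen = set(rng.sample(cluster_ids, min(num_clusters, len(cluster_ids))))
    # Pass 2: emit every row whose cluster was chosen.
    return [row for row in rows if row[key] in chosen]

# Usage: 5 neighborhoods with 4 rows each; sample 2 whole neighborhoods.
rows = [{"Neighborhood": f"N{n}", "id": i} for n in range(5) for i in range(4)]
picked = cluster_sample(rows, "Neighborhood", 2, seed=3)
print(sorted({r["Neighborhood"] for r in picked}), len(picked))
```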

Take a sample using VAROPT (A-ExpJ weighted reservoir) sampling, weighted by the 'Revenue' column, for a sample size of 1000 records. Unlike --weighted, this does NOT require a stats cache.

qsv sample --varopt Revenue 1000 data.csv
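
The key u^(1/w) mentioned in the VAROPT description can be illustrated with A-Res, the simpler sibling of A-ExpJ from the same Efraimidis and Spirakis paper: it computes the same key for every record but omits the exponential jumps that A-ExpJ uses to skip records. The Python sketch below is therefore not the algorithm qsv runs, and `weighted_reservoir` and `weight_of` are hypothetical names. It skips non-numeric and non-positive weights, as the --varopt option description states.

```python
import heapq
import math
import random

def weighted_reservoir(rows, weight_of, k, seed=None):
    """Keep the k rows with the largest keys u ** (1 / w), u uniform on [0, 1)."""
    rng = random.Random(seed)
    heap = []  # min-heap of (key, index, row); the smallest key is evicted first
    for idx, row in enumerate(rows):
        try:
            w = float(weight_of(row))
        except (TypeError, ValueError):
            continue  # non-numeric weight: skip the record
        if not (w > 0 and math.isfinite(w)):
            continue  # non-positive, NaN, or infinite weight: skip the record
        key = rng.random() ** (1.0 / w)
        if len(heap) < k:
            heapq.heappush(heap, (key, idx, row))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, idx, row))
    return [row for _, _, row in heap]

# Usage: one very large transaction among many small ones is almost always kept.
rows = [{"Revenue": "1000000", "id": "big"}]
rows += [{"Revenue": "1", "id": i} for i in range(50)]
sample = weighted_reservoir(rows, lambda r: r["Revenue"], 5, seed=11)
print([r["id"] for r in sample])
```

The index stored next to each key only breaks ties in the heap so that rows themselves are never compared.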

Sample two shards and merge their sketches into a single uniform sample without re-reading the inputs.

qsv sample --mergeable-reservoir --sketch-out a.sk 1000 shard_a.csv

qsv sample --mergeable-reservoir --sketch-out b.sk 1000 shard_b.csv

qsv sample --sketch-in a.sk,b.sk 1000 -o merged.csv
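
For intuition about how two uniform reservoirs can be merged without re-reading their inputs, the Python sketch below draws each merged slot from shard A or shard B in proportion to how many unseen records each side still represents. It assumes each sketch carries its sample plus the count of records it saw, and that the merged k does not exceed either sketch's own k. The function `merge_reservoirs` and this in-memory representation are hypothetical; qsv's binary sketch format is not described here.

```python
import random

def merge_reservoirs(sample_a, n_a, sample_b, n_b, k, seed=None):
    """Merge two uniform reservoirs drawn from n_a and n_b records into one of size k."""
    rng = random.Random(seed)
    pool_a, pool_b = list(sample_a), list(sample_b)
    # Shuffling makes pop() a uniform draw without replacement from each reservoir.
    rng.shuffle(pool_a)
    rng.shuffle(pool_b)
    rem_a, rem_b = n_a, n_b  # records each side still "represents"
    merged = []
    while len(merged) < k and rem_a + rem_b > 0:
        # Choose side A with probability rem_a / (rem_a + rem_b).
        if rng.randrange(rem_a + rem_b) < rem_a:
            merged.append(pool_a.pop())
            rem_a -= 1
        else:
            merged.append(pool_b.pop())
            rem_b -= 1
    return merged

# Usage: shard A saw 900 records, shard B saw 100; both kept 5.
a = [("a", i) for i in range(5)]
b = [("b", i) for i in range(5)]
print(merge_reservoirs(a, 900, b, 100, 5, seed=1))
```

Weighting by the record counts rather than by the reservoir sizes is what keeps the merged sample uniform over the combined stream when the shards differ in size.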

For more examples, see tests.

Usage ↩

qsv sample [options] <sample-size> [<input>]
qsv sample --help

Arguments ↩

Argument	Description
`<input>`	The CSV file to sample. This can be a local file, stdin, or a URL (http and https schemes supported).
`<sample-size>`	When using INDEXED, RESERVOIR or WEIGHTED sampling, the sample size. Can either be a whole number or a value between 0 and 1. If a fraction, specifies the sample size as a percentage of the population (e.g. 0.15 - 15 percent of the CSV). When using BERNOULLI sampling, the probability of selecting each record (between 0 and 1). When using SYSTEMATIC sampling, the integer part is the interval between records to sample & the fractional part is the percentage of the population to sample. When there is no fractional part, it will select every nth record for the entire population. When using STRATIFIED sampling, the stratum sample size. When using CLUSTER sampling, the number of clusters. When using TIMESERIES sampling, the interval number (treated as hours by default, e.g., 1 = 1 hour). Use --ts-interval for custom intervals like "1d" (daily), "1w" (weekly), "1m" (monthly), "1y" (yearly), etc.

Sample Options ↩

Option	Type	Description	Default
`‑‑seed`	integer	Random Number Generator (RNG) seed.
`‑‑rng`	string	The Random Number Generator (RNG) algorithm to use.	`standard`

Sampling Methods Options ↩

Option	Type	Description
`‑‑bernoulli`	flag	Use Bernoulli sampling instead of indexed or reservoir sampling. When this flag is set, `<sample-size>` must be between 0 and 1 and represents the probability of selecting each record.
`‑‑systematic`	string	Use systematic sampling (every nth record as specified by `<sample-size>`). If the option value is "random", the starting point is randomly chosen between 0 & n. If it is "first", the starting point is the first record. The sample size must be a whole number. Uses CONSTANT memory - O(1).
`‑‑stratified`	string	Use stratified sampling. The strata column is specified by the option value. Can be either a column name or 0-based column index. The sample size must be a whole number. Uses MEMORY PROPORTIONAL to the number of strata (s) and samples per stratum (k) - O(s*k).
`‑‑weighted`	string	Use weighted sampling. The weight column is specified by the option value. Can be either a column name or 0-based column index. The column will be parsed as a number. Records with non-number weights will be skipped. Uses MEMORY PROPORTIONAL to the sample size (k) - O(k).
`‑‑varopt`	string	Use VAROPT weighted reservoir sampling (A-ExpJ keying). The weight column is specified by the option value (column name or 0-based index). Variance-bounded, single-pass, no stats-cache required, mergeable via --sketch-out / --sketch-in. Records with non-positive or non-numeric weights are silently skipped. Uses MEMORY PROPORTIONAL to the sample size (k) - O(k).
`‑‑mergeable‑reservoir`	flag	Use a mergeable Algorithm-R reservoir sampler. Distribution is identical to the default RESERVOIR method, but the resulting sketch is mergeable via --sketch-out / --sketch-in. Cannot be combined with another sampling-method flag.
`‑‑cluster`	string	Use cluster sampling. The cluster column is specified by the option value. Can be either a column name or 0-based column index. Uses MEMORY PROPORTIONAL to the number of clusters (c) - O(c).
`‑‑timeseries`	string	Use time-series sampling. The time column is specified by the option value. Can be either a column name or 0-based column index. Sorts records by the specified time column and then groups by time intervals and selects one record per interval. Supports various date formats (19 formats recognized by qsv-dateparser). Uses MEMORY PROPORTIONAL to the number of records - O(n).

Time-Series Sampling Options ↩

Option	Type	Description	Default
`‑‑ts‑interval`	string	Time interval for grouping records. Format: a number followed by a unit, where unit is h (hour), d (day), w (week), m (month), y (year). Examples: "1h", "1d", "1w", "2d" (every 2 days). If not specified, `<sample-size>` is treated as hours.
`‑‑ts‑start`	string	Starting point for time-series sampling. Options: "first" (earliest timestamp, default), "last" (most recent timestamp), "random" (random starting point).	`first`
`‑‑ts‑adaptive`	string	Adaptive sampling mode for time-series data. Options: "business-hours" (prefer 9am-5pm Mon-Fri), "weekends" (prefer weekends), "business-days" (prefer weekdays), "both" (combine business-hours and weekends).
`‑‑ts‑aggregate`	string	Aggregation function to apply within each time interval. Options: "first", "last", "mean", "sum", "count", "min", "max", "median". When specified, aggregates all records in each interval instead of selecting a single record.
`‑‑ts‑input‑tz`	string	Timezone for parsing input timestamps. Can be an IANA timezone name or "local" for the local timezone.	`UTC`
`‑‑ts‑prefer‑dmy`	flag	Prefer to parse dates in dmy format. Otherwise, use mdy format.
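
The basic grouping behind time-series sampling (sort by time, bucket into fixed windows, keep one record per window) can be sketched as follows. This Python illustration assumes rows already hold parsed `datetime` values, that windows are anchored at the earliest timestamp (the "first" starting point), and that the first record in each window is the one kept; `sample_per_interval` is a hypothetical name, and adaptive modes and aggregation are not modeled.

```python
from datetime import datetime, timedelta

def sample_per_interval(rows, time_key, interval):
    """Sort rows by time and keep the first row that falls in each interval window."""
    ordered = sorted(rows, key=lambda r: r[time_key])  # O(n) memory, as documented
    if not ordered:
        return []
    origin = ordered[0][time_key]  # windows start at the earliest timestamp
    picked = {}
    for row in ordered:
        # Integer window number: timedelta // timedelta yields an int.
        bucket = (row[time_key] - origin) // interval
        picked.setdefault(bucket, row)  # keep only the first row per window
    return list(picked.values())

# Usage: hourly windows over five readings; one reading per non-empty hour survives.
t0 = datetime(2024, 1, 1, 0, 0)
rows = [{"ts": t0 + timedelta(minutes=m), "v": m} for m in (0, 30, 60, 90, 180)]
print([r["v"] for r in sample_per_interval(rows, "ts", timedelta(hours=1))])
```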

Sketch Options ↩

Option	Type	Description	Default
`‑‑sketch‑out`	string	After sampling, also write a binary sketch describing the internal sampler state to the specified file. The blob can later be merged into another run via --sketch-in. Only valid with --varopt or --mergeable-reservoir. The format is qsv-specific and is not interoperable with serialized sketches from other tools.
`‑‑sketch‑in`	string	Comma-separated list of sketch files produced by --sketch-out. CSV input is NOT read; the listed sketches (which must all be of the same sampler kind) are merged and the resulting sample is emitted as CSV. `<sample-size>` may be used to cap the merged sample below the sketches' own k.

Remote File Options ↩

Option	Type	Description	Default
`‑‑user‑agent`	string	Specify custom user agent to use when the input is a URL. It supports the following variables - $QSV_VERSION, $QSV_TARGET, $QSV_BIN_NAME, $QSV_KIND and $QSV_COMMAND. Try to follow the syntax here - https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
`‑‑timeout`	integer	Timeout for downloading URLs in seconds. If 0, no timeout is used.	`30`
`‑‑max‑size`	integer	Maximum size of the file to download in MB before sampling. Will download the entire file if not specified. If the CSV is partially downloaded, the sample will be taken only from the downloaded portion.
`‑‑force`	flag	Do not use the stats cache, even if it's available.

Common Options ↩

Option	Type	Description
`‑h,` `‑‑help`	flag	Display this message
`‑o,` `‑‑output`	string	Write output to the specified file instead of stdout.
`‑n,` `‑‑no‑headers`	flag	When set, the first row will be considered as part of the population to sample from. (When not set, the first row is the header row and will always appear in the output.)
`‑d,` `‑‑delimiter`	string	The field delimiter for reading/writing CSV data. Must be a single character. (default: ,)
