S3 Log Extraction

License: MIT

Fast extraction of access summary data from S3 logs.

Originally developed for the DANDI Archive.

Read more about S3 logging on AWS.

⚠️ This package currently only supports processing of access data (GET-type requests); if you wish to use this package for other types of requests (PUT/DELETE/HEAD, etc.) please reach out by raising an issue. ⚠️

Installation

pip install s3-log-extraction

Note for Windows users: This package requires GAWK and is not natively supported on Windows. Windows users should use Windows Subsystem for Linux (WSL) to run this package.

Workflow

flowchart TD
    A[Configure cache<br/><br/>Initialize home and cache directories]
    B[Extract logs<br/><br/>Process raw S3 logs and store minimal extracted data]
    C[Update IP indexes<br/><br/>Generate anonymized indexes for each IP address]
    D[Update region codes<br/><br/>Map IPs to ISO 3166 region codes using external API]
    E[Update coordinates<br/><br/>Convert region codes to latitude/longitude for mapping]
    F[Generate summaries<br/><br/>Create per-dataset summaries for reporting]
    G[Generate totals<br/><br/>Aggregate statistics across datasets or archive]
    H[Share!<br/><br/>Post the summaries and totals in a public data repository]

    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    G --> H

Generic Usage

[Optional] Configure a non-default cache directory on a mounted disk that has sufficient space (the default is placed under ~/.cache). This will be the main location where extracted logs and other useful information will be stored.

s3logextraction config cache set <new cache directory>

To extract the logs:

s3logextraction extract <log directory>

To override the cache directory for a single extraction run (without changing global config):

s3logextraction extract <log directory> --cache <cache directory>

NOTE: If this command seems to be taking a long time on your system, DO NOT interrupt it via ctrl+C or pkill. Instead, stop it safely by running:

s3logextraction stop

This will allow it to finish processing the current batch of logs and then exit gracefully.

After your logs are extracted, generate anonymized indexes for each IP address:

s3logextraction update ip indexes

Next, ensure some required environment variables related to external services are set:

  1. IPINFO_API_KEY
    • Access token for the ipinfo.io service.
    • Extracts geographic region information in ISO 3166 format (e.g. "US/California") for anonymized statistics.
  2. OPENCAGE_API_KEY
    • Access token for the opencagedata.com service.
    • Maps the ISO 3166 codes from the first step to latitude and longitude coordinates for the geographic heat maps used in visualizations.

export IPINFO_API_KEY="your_token_here"
export OPENCAGE_API_KEY="your_token_here"
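
A quick pre-flight check can catch a missing key before the update commands run. The following is a minimal sketch only: the helper name check_required_env is hypothetical and is not part of s3logextraction.

```shell
# Hypothetical helper (not part of s3logextraction): report any environment
# variables, among the names passed as arguments, that are unset or empty.
check_required_env() {
    missing=0
    for name in "$@"; do
        # printenv only sees exported variables, which is what a child
        # process such as the CLI would receive
        if [ -z "$(printenv "$name")" ]; then
            echo "Missing environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

It could be chained in front of the next step, for example: check_required_env IPINFO_API_KEY OPENCAGE_API_KEY && s3logextraction update ip regions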

To update the region codes and their coordinates:

s3logextraction update ip regions
s3logextraction update ip coordinates

To generate top-level summaries and totals (that is, per dataset):

s3logextraction update summaries
s3logextraction update totals

Finally, to generate archive-wide summaries and totals:

s3logextraction update summaries --mode archive
s3logextraction update totals --mode archive
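
For a scheduled job, the steps above could be chained so that a failure in one step stops the rest. This is a sketch under that assumption: the function name run_s3_log_pipeline is hypothetical, and it uses only the commands already shown in this section.

```shell
# Hypothetical wrapper (not part of s3logextraction): run the documented
# steps in order and stop at the first command that fails.
run_s3_log_pipeline() {
    log_directory="$1"
    if [ -z "$log_directory" ]; then
        echo "usage: run_s3_log_pipeline <log directory>" >&2
        return 2
    fi
    # Each && only continues when the previous command exited successfully
    s3logextraction extract "$log_directory" &&
    s3logextraction update ip indexes &&
    s3logextraction update ip regions &&
    s3logextraction update ip coordinates &&
    s3logextraction update summaries &&
    s3logextraction update totals &&
    s3logextraction update summaries --mode archive &&
    s3logextraction update totals --mode archive
}
```

The region and coordinate steps still need IPINFO_API_KEY and OPENCAGE_API_KEY to be exported beforehand.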

Remote S3 Bucket Extraction

To extract logs from a remote S3 bucket, use the --mode remote flag. For large buckets, we strongly recommend setting up AWS S3 Inventory and downloading the inventory locally before running the extraction. Scanning the bucket directly via live network calls (the default when no --inventory path is given) can be extremely slow for buckets with millions of objects.

Using S3 Inventory (recommended)

AWS S3 Inventory generates periodic snapshots of all objects in your bucket as gzip-compressed CSV files. Once downloaded locally, the inventory lets s3logextraction enumerate all log files without making any live S3 listing calls, providing a significant performance improvement over direct bucket scanning.

Expected inventory directory layout:

<inventory_directory>/
├── <timestamp>/               # e.g. 2026-05-03T01-00Z/
│   ├── manifest.json
│   └── manifest.checksum
├── data/
│   └── <uuid>.csv.gz          # gzip-compressed CSV inventory files
└── hive/
    └── dt=<YYYY-MM-DD-HH-MM>/ # e.g. dt=2026-05-03-01-00/
        └── symlink.txt        # references to data/*.csv.gz
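
Before pointing the extraction at a downloaded inventory, it may help to confirm that the directory matches the layout above. The helper below is a hypothetical sketch, not part of s3logextraction; it only checks that the expected entries exist and does not validate their contents.

```shell
# Hypothetical helper (not part of s3logextraction): check that a downloaded
# inventory directory contains the entries described in the layout above.
check_inventory_layout() {
    inventory_directory="$1"
    result=0
    # At least one <timestamp>/manifest.json
    if ! ls "$inventory_directory"/*/manifest.json >/dev/null 2>&1; then
        echo "No <timestamp>/manifest.json found" >&2
        result=1
    fi
    # At least one gzip-compressed CSV inventory file
    if ! ls "$inventory_directory"/data/*.csv.gz >/dev/null 2>&1; then
        echo "No data/*.csv.gz files found" >&2
        result=1
    fi
    if [ ! -d "$inventory_directory/hive" ]; then
        echo "No hive/ directory found" >&2
        result=1
    fi
    return "$result"
}
```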

Pass the path to the downloaded inventory directory via the --inventory option:

s3logextraction extract s3://my-logs-bucket --mode remote --inventory /path/to/inventory

To check how many log files are in the inventory and the total size:

s3logextraction stats --inventory /path/to/inventory

To report what percentage of log files have already been processed:

s3logextraction completion --inventory /path/to/inventory

Without an S3 Inventory (not recommended)

If you do not have an S3 Inventory available, do not use --mode remote without --inventory — the live bucket scan will be extremely slow for large buckets. Instead, use s5cmd to manually download the unprocessed log files to a local directory and then run the local extraction on those files:

s5cmd cp "s3://my-logs-bucket/*" /path/to/local/logs/
s3logextraction extract /path/to/local/logs/

If you're new to using AWS S3 buckets and haven't yet enabled the logging this project utilizes, you can follow these simple instructions to get started.

  1. Log into your AWS console.
  2. Create a new PRIVATE S3 bucket. Typically, the new bucket is named after the bucket you wish to enable logging on, with -logs appended; for example, dandiarchive-logs.
    • NEVER share this bucket publicly as it contains sensitive information.
  3. Navigate back to the S3 bucket you wish to enable logging on.
  4. Under the Properties tab, scroll down to the section called Server access logging and select Edit.
  5. Toggle the selection to Enable, then specify the destination where logs will be stored as the new S3 bucket you created in step (2).
  6. Recommended:
    • Specify the Log object key format as the nested pattern shown below.
    • Ensure the Source of date used in log object key format is the S3 event time.

[Image: recommended Server access logging settings]
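
The same settings can probably be applied from the command line instead of the console. The sketch below rests on two assumptions that the steps above do not confirm: that the "nested pattern" corresponds to the date-partitioned key format (PartitionedPrefix) of S3 server access logging, and that the AWS CLI is installed with suitable permissions. The function name make_logging_status is hypothetical.

```shell
# Hypothetical helper: print a BucketLoggingStatus document for
# `aws s3api put-bucket-logging`.
# Assumption: the recommended "nested pattern" is S3's date-partitioned key
# format, with the S3 event time as the date source.
make_logging_status() {
    target_bucket="$1"
    cat <<EOF
{
  "LoggingEnabled": {
    "TargetBucket": "$target_bucket",
    "TargetPrefix": "",
    "TargetObjectKeyFormat": {
      "PartitionedPrefix": {
        "PartitionDateSource": "EventTime"
      }
    }
  }
}
EOF
}

# Example (not run here; replace the placeholders with your bucket names):
#   make_logging_status <logs bucket> > logging.json
#   aws s3api put-bucket-logging --bucket <source bucket> \
#       --bucket-logging-status file://logging.json
```

When configuring logging outside the console, the destination bucket may also need a policy that allows the S3 logging service to write to it.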

Developer Notes

Throughout the codebase, various processes are referred to in the following ways:

  • parallelized: The process can be run in parallel across multiple workers, which increases throughput.
  • interruptible: The process can be safely interrupted (ctrl+C or pkill) with only a very low chance of causing corruption. To interrupt a parallelized process, you may have to either pkill the main dispatch process or press ctrl+C several times.
  • updatable: The process can be resumed from the last checkpoint without losing any progress. It can also be run fresh at different times, such as on a CRON cycle, and it will only interact with unprocessed data.

Performance

By leveraging GAWK, this version of the S3 log handling is considerably more efficient than the previous attempts.

The previous attempt used a multistep process which took several days to run (even on multiple workers). It also required an additional ~200 GB cache to allow lazy updates of the per-object bins.

This version requires no intermediate cache, stores only the minimal amount of data to be shared, and completes a fresh run in less than a day. Daily CRON updates are lazy: they only process logs that have not yet been extracted.

Validation

In lieu of attempting fully validated parsing of each and every line from the log files (which is a hard, unsolved problem - see s3-log-parser), we instead validate the heuristics in a targeted manner through specific validation scripts.

These can also be used to verify the current state of the extraction process, such as warning about corrupt records or incomplete cache files.

Excluded IP regex configuration

The extraction heuristic pre-validator uses an excluded-IP regex. By default, no IPs are excluded.

Example custom regex:

export S3_LOG_EXTRACTION_EXCLUDED_IP_REGEX='^(192\.0\.2\.1|198\.51\.100\.2)$'
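
A custom regex can be tried against sample addresses before running an extraction. This is only a sketch: the helper name ip_is_excluded is hypothetical, and it assumes that grep -E (POSIX extended regular expressions) is a close enough stand-in for the regex engine the extraction itself uses.

```shell
# Hypothetical helper (not part of s3logextraction): succeed when the given
# IP address matches the excluded-IP regex.
# Assumption: grep -E approximates the regex engine used by the tool.
ip_is_excluded() {
    # An unset or empty regex means nothing is excluded (the documented default)
    [ -n "${S3_LOG_EXTRACTION_EXCLUDED_IP_REGEX:-}" ] || return 1
    printf '%s\n' "$1" | grep -Eq -- "$S3_LOG_EXTRACTION_EXCLUDED_IP_REGEX"
}
```

With the example regex above, ip_is_excluded 192.0.2.1 succeeds, while ip_is_excluded 192.0.2.10 fails because the pattern is anchored at both ends.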

Submission of line decoding errors

Should you discover any lines in your S3 log files that cause failures in the codebase, please email them to the core maintainer (cody.c.baker.phd@gmail.com) before raising an issue or submitting a PR that includes them as examples. This makes it easier to correct anything that might require anonymization first.
