MilkMp/CIA-World-Factbooks-Archive-1990-2025
CIA World Factbooks Archive 1990-2025

A complete, structured archive of the CIA World Factbook spanning 36 years (1990-2025), covering 281 entities with 1,071,603 data fields in a normalized SQLite database.

The CIA World Factbook was discontinued on February 4, 2026. This archive preserves every edition published since 1990 and creates a structured, queryable dataset.

Search the interactive archive | Project documentation | Executive Summary

Data Integrity: No Factbook content is added or altered. The parsing process structures the CIA's raw text into queryable fields — removing formatting artifacts, sectioning headers, and deduplicating noise lines — but the actual data values are exactly as the CIA published them. The only additions to the source data are reference lookup tables (FIPS-to-ISO code mappings, entity classifications, COCOM regional assignments) that sit alongside the original data, not inside it. In FieldValues, a small number of rows are derived by computation from neighboring sub-values (e.g. total life expectancy averaged from male/female in pre-1995 data); these are clearly flagged with IsComputed = 1.
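For example, filtering on IsComputed separates verbatim CIA values from parser-derived ones. A minimal sketch against a mock FieldValues table (the sample rows are invented; column names follow this README):

```python
import sqlite3

# Miniature mock of the FieldValues table described in this README
# (SubField, NumericVal, SourceFragment, IsComputed). Rows are invented.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE FieldValues (
    FieldID INTEGER, SubField TEXT, NumericVal REAL,
    SourceFragment TEXT, IsComputed INTEGER)""")
db.executemany(
    "INSERT INTO FieldValues VALUES (?, ?, ?, ?, ?)",
    [
        (1, "life_expectancy_male",   70.1, "male: 70.1 years",   0),
        (1, "life_expectancy_female", 76.3, "female: 76.3 years", 0),
        # Derived by averaging the two rows above, so flagged IsComputed = 1
        (1, "life_expectancy_total",  73.2, None,                 1),
    ],
)

# Exclude parser-derived rows to see only what the CIA published verbatim
rows = db.execute(
    "SELECT SubField, NumericVal FROM FieldValues WHERE IsComputed = 0"
).fetchall()
print(rows)  # [('life_expectancy_male', 70.1), ('life_expectancy_female', 76.3)]
```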

Database Statistics

| Metric | Value |
|---|---|
| Years covered | 1990-2025 (36 editions) |
| Entities | 281 (192 sovereign states, 65 territories, 6 disputed, and more) |
| Country-year records | 9,536 |
| Category records | 83,682 |
| Data fields | 1,071,603 |
| Content size | ~662 MB (with FieldValues + FTS) |
| Field name variants | 1,132 mapped to 416 canonical names |
| Structured sub-values | 1,775,588 parsed from raw text (2,599 sub-fields) |

Data Sources

| Years | Source | Method |
|---|---|---|
| 1990-1995, 1997-1999 | Project Gutenberg | Plain-text parsing (4 format variants across the decade) |
| 1996 | CIA original + Gutenberg | CIA's own text file from the Wayback Machine (replaced truncated Gutenberg data for 7 countries) |
| 2000 | Wayback Machine | HTML zip download + classic-format parser |
| 2001 | Project Gutenberg | Text fallback (HTML zip was corrupted) |
| 2002-2020 | Wayback Machine | HTML zip archives from cia.gov, 4 parser generations |
| 2021-2025 | factbook/cache.factbook.json | Git history with year-end commit snapshots |

Year-by-Year Breakdown

| Year | Source | Countries | Fields |
|---|---|---|---|
| 1990 | Text | 249 | 15,750 |
| 1991 | Text | 247 | 14,903 |
| 1992 | Text | 264 | 17,372 |
| 1993 | Text | 266 | 18,509 |
| 1994 | Text | 266 | 28,633 |
| 1995 | Text | 266 | 19,599 |
| 1996 | Text | 266 | 21,118 |
| 1997 | Text | 266 | 23,405 |
| 1998 | Text | 266 | 23,524 |
| 1999 | Text | 266 | 25,178 |
| 2000 | HTML | 267 | 25,724 |
| 2001 | Text | 265 | 27,281 |
| 2002 | HTML | 268 | 27,430 |
| 2003 | HTML | 268 | 28,676 |
| 2004 | HTML | 271 | 28,958 |
| 2005 | HTML | 271 | 28,728 |
| 2006 | HTML | 274 | 28,962 |
| 2007 | HTML | 266 | 29,103 |
| 2008 | HTML | 263 | 30,755 |
| 2009 | HTML | 260 | 30,818 |
| 2010 | HTML | 262 | 30,805 |
| 2011 | HTML | 263 | 33,635 |
| 2012 | HTML | 271 | 35,192 |
| 2013 | HTML | 269 | 36,731 |
| 2014 | HTML | 268 | 36,680 |
| 2015 | HTML | 268 | 36,870 |
| 2016 | HTML | 268 | 36,804 |
| 2017 | HTML | 268 | 37,046 |
| 2018 | HTML | 268 | 37,285 |
| 2019 | HTML | 268 | 37,394 |
| 2020 | HTML | 268 | 36,687 |
| 2021 | JSON | 260 | 39,714 |
| 2022 | JSON | 260 | 37,344 |
| 2023 | JSON | 260 | 37,558 |
| 2024 | JSON | 260 | 34,838 |
| 2025 | JSON | 260 | 32,594 |

Repository Structure

data/
  master_countries.sql       # 281 canonical entities
  countries.sql              # 9,536 country-year records
  categories.sql             # 83,682 category records
  field_name_mappings.sql    # 1,132 field name standardization rules
  fields/
    country_fields_1990.sql.gz  # Split by year (36 gzipped files)
    ...
    country_fields_2025.sql.gz
schema/
  create_tables.sql          # DDL for all 5 tables
  create_field_values.sql    # DDL for FieldValues (structured sub-values)
etl/
  build_archive.py           # HTML parser (2000-2020)
  load_gutenberg_years.py    # Text parser (1990-2001)
  reload_json_years.py       # JSON loader (2021-2025)
  build_field_mappings.py    # Field name standardization
  classify_entities.py       # Entity type classification
  repair_1996_truncated.py   # CIA original text parser for 7 truncated 1996 countries
  validate_integrity.py      # Data quality checks
  export_to_sqlite.py        # SQL Server -> SQLite export (with FTS5)
  structured_parsing/
    parse_field_values.py    # Decompose text blobs into typed sub-values (55 parsers)
    validate_field_values.py # Validation: spot checks, coverage, numeric stats
    export_field_values_to_sqlite.py  # Export FieldValues + refs to SQLite
    dashboard_preview.py     # Local preview dashboard (8 chart panels)
    DESIGN.md                # Architecture and design document
  stardict/
    build_stardict.py        # Generate StarDict dictionaries for KOReader/GoldenDict
scripts/
  factbook_search.py         # Command-line search utility
  validate_cocom.py          # COCOM region validation
  validate_integrity.py      # Data integrity checks
queries/
  sample_queries.sql         # 18 analytical queries for Power BI / analysis
  search_cli.py              # Command-line search tool
docs/
  DATABASE_SCHEMA.md         # Table definitions and relationships
  ETL_PIPELINE.md            # How the archive was built
  FIELD_EVOLUTION.md         # How CIA field names changed over time
  METHODOLOGY.md             # Complete methodology: parsing, standardization, validation
  screenshots/               # 49 PNGs + 9 animated GIFs of the web application
  index.html                 # GitHub Pages static landing page

ETL Pipeline & Python Scripts

There are two ways to get the data:

  1. Import the pre-built SQL dumps (in data/) — no Python required. See How to Restore below.
  2. Re-run the ETL pipeline from scratch — the Python scripts in etl/ download raw Factbook data directly from their original sources (Wayback Machine, Project Gutenberg, GitHub), parse every format variant, and load the structured results into SQL Server via pyodbc. This is how the archive was originally built.

The raw CIA World Factbook changed format at least 10 times between 1990 and 2025. Every script in etl/ exists because a previous version of the parser broke on a new year's data. The pipeline handles 4 plain-text format variants, 5 HTML layout generations, and a final JSON era — each requiring its own parser.

Scripts

| Script | Lines | Years | What it does |
|---|---|---|---|
| build_archive.py | 986 | 2000-2020 | Downloads HTML zips from the Wayback Machine, detects which of 5 HTML layouts each year uses, and parses fields. Handles CDX API fallback when downloads fail, filters template/print pages, and deduplicates entries. |
| load_gutenberg_years.py | 1,043 | 1990-2001 | Parses plain-text Gutenberg editions with 4 distinct format variants (tagged, asterisk, at-sign, colon). Includes fuzzy country name matching to handle aliases like "Burma" and "Ivory Coast." |
| reload_json_years.py | 413 | 2021-2025 | Checks out year-end git commits from the factbook/cache.factbook.json repo and loads structured JSON. Strips embedded HTML from content fields. |
| build_field_mappings.py | 783 | All | Maps 1,132 raw field name variants to 416 canonical names using a 7-rule system: identity, dash normalization, CIA renames, consolidation, country-specific entries, noise detection, and manual review. Uses case-sensitive collation (COLLATE Latin1_General_CS_AS) so case variants get separate mapping rows, and writes pipe-delimited sub-field boundaries into Content across all eras. |
| classify_entities.py | 283 | All | Auto-classifies 281 entities into 9 types (sovereign, territory, disputed, etc.) based on Dependency Status and Government Type fields, with hardcoded overrides for edge cases. |
| repair_1996_truncated.py | 189 | 1996 | Parses CIA's original wfb-96.txt.gz (page-header format with centered country/section names) and replaces truncated Gutenberg entries for 7 countries: Venezuela, Armenia, Greece, Luxembourg, Malta, Monaco, Tuvalu. |
| validate_integrity.py | 296 | All | Read-only validation suite with 10 checks: field count benchmarks, US population/GDP ground truth, year-over-year consistency, source provenance, NULL detection, and unmapped field name gap analysis. |
| structured_parsing/parse_field_values.py | 1,400+ | All | Decomposes 1,071,603 raw text blobs into 1,775,588 typed sub-values using 55 field-specific parsers plus a generic fallback. Each row includes a SourceFragment column showing the exact text slice that produced the value, and an IsComputed flag distinguishing source-extracted values (0) from parser-derived values (1). Extracts land/water area, male/female life expectancy, age brackets, sex ratios, literacy, budget, elevation, dependency ratios, GDP composition, CO2 emissions, water/sanitation, and more. |
| structured_parsing/validate_field_values.py | 231 | All | Validates FieldValues against the source database: row counts, coverage, numeric extraction rate, spot checks against known ground-truth values (US population, Russia area, Japan military). |
| structured_parsing/export_field_values_to_sqlite.py | 269 | All | Exports FieldValues plus all reference tables to a self-contained SQLite database (factbook.db, ~662 MB with FTS5 + ISOCountryCodes when using --webapp). Includes a post-export integrity check that detects and backfills any unmapped field names. |

Why parsing was so difficult

The CIA never maintained a stable schema. Every few years the HTML layout changed completely, field names were renamed without notice, and entire categories were restructured. Examples:

  • 1990-1999 (plain text): Four different formatting conventions across the decade. 1990-1993 used `Country: Name` headers with indented fields. 1994 introduced tagged markers (`_@_`, `_*_`, `_#_`). 1996 switched to bare section headers with no markers at all. 1999 changed the delimiter scheme again. Each variant required its own regex-based parser.

  • 2000-2020 (HTML): The CIA redesigned the Factbook website at least 5 times. The 2000 edition used `<b>FieldName:</b>` inline formatting. By 2004 it had switched to `<td class="FieldLabel">` table layouts. 2008 introduced CollapsiblePanel JavaScript widgets. 2014 changed to expand/collapse `<h2>` sections. 2017 moved to field-anchor `<div>` structures. A parser that worked on 2006 data would produce garbage on 2010 data.

  • 2001: The Wayback Machine's HTML zip for 2001 was corrupted, so parsing fell back to the Project Gutenberg plain-text edition — the only year where the HTML and text pipelines had to swap.

  • Field name drift: The CIA renamed fields silently over the decades. "GDP - real growth rate" became "Real GDP growth rate." "Telephones" split into "Telephones - fixed lines" and "Telephones - mobile cellular." Oil sub-fields were consolidated into "Petroleum." The build_field_mappings.py script maps all 1,132 variants through 7 rule layers to maintain time-series continuity.

  • Country-specific noise: 1990s editions embedded government body names (parliaments, assemblies, courts) as top-level fields. Dissolved countries (Serbia and Montenegro, Netherlands Antilles) appear and disappear. Turkish Cyprus, Malaysian states, and Netherlands Antilles islands show up as sub-entries that need special handling.

See docs/METHODOLOGY.md and docs/ETL_PIPELINE.md for the full technical details.
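In practice, time-series continuity is recovered by joining through FieldNameMappings, so the old and new spellings of a field both resolve to one canonical name. A self-contained sketch with mock rows (table and column names follow the schema described in this README):

```python
import sqlite3

# Mock of how FieldNameMappings bridges a CIA rename: two raw spellings,
# one canonical name, one unbroken time series. Data rows are invented.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE FieldNameMappings (OriginalName TEXT, CanonicalName TEXT);
CREATE TABLE CountryFields (Year INTEGER, FieldName TEXT, Content TEXT);
INSERT INTO FieldNameMappings VALUES
  ('GDP - real growth rate', 'Real GDP growth rate'),
  ('Real GDP growth rate',   'Real GDP growth rate');
INSERT INTO CountryFields VALUES
  (2000, 'GDP - real growth rate', '5%'),
  (2020, 'Real GDP growth rate',   '2.2%');
""")

# One canonical name spans both naming eras
rows = db.execute("""
    SELECT cf.Year, cf.Content
    FROM CountryFields cf
    JOIN FieldNameMappings fm ON cf.FieldName = fm.OriginalName
    WHERE fm.CanonicalName = 'Real GDP growth rate'
    ORDER BY cf.Year
""").fetchall()
print(rows)  # [(2000, '5%'), (2020, '2.2%')]
```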

How to Restore

Prerequisites

  • Microsoft SQL Server (any edition)
  • The sqlcmd command-line tool (used by every step below)

Steps

  1. Create the database:

    CREATE DATABASE CIA_WorldFactbook;
  2. Run the schema script:

    sqlcmd -S localhost -d CIA_WorldFactbook -i schema/create_tables.sql
  3. Import data (in order):

    sqlcmd -S localhost -d CIA_WorldFactbook -i data/master_countries.sql
    sqlcmd -S localhost -d CIA_WorldFactbook -i data/countries.sql
    sqlcmd -S localhost -d CIA_WorldFactbook -i data/categories.sql
    sqlcmd -S localhost -d CIA_WorldFactbook -i data/field_name_mappings.sql
  4. Import field data (decompress and import each year):

    cd data/fields
    # On Linux/macOS:
    for f in *.sql.gz; do gunzip -k "$f" && sqlcmd -S localhost -d CIA_WorldFactbook -i "${f%.gz}"; done
    # On Windows (PowerShell):
    Get-ChildItem *.sql.gz | ForEach-Object {
        $sql = $_.FullName -replace '\.gz$',''
        $in  = [System.IO.File]::OpenRead($_.FullName)
        $gz  = [System.IO.Compression.GZipStream]::new($in, [System.IO.Compression.CompressionMode]::Decompress)
        $out = [System.IO.File]::Create($sql)
        $gz.CopyTo($out)
        # Close the streams so the .sql file is flushed before sqlcmd reads it
        $gz.Dispose(); $out.Dispose(); $in.Dispose()
        sqlcmd -S localhost -d CIA_WorldFactbook -i $sql
    }
  5. Verify:

    SELECT COUNT(*) FROM CountryFields;  -- Should return 1,071,603
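If neither shell loop is convenient, the decompression in step 4 can be sketched portably with Python's standard library (data/fields is the path from the repository layout above; running each .sql through sqlcmd is left as the final comment):

```python
import gzip
import shutil
import sys
from pathlib import Path

def decompress_year_files(fields_dir: str) -> list[Path]:
    """Decompress every *.sql.gz in fields_dir, keeping the originals.
    Returns the paths of the .sql files written."""
    written = []
    for gz_path in sorted(Path(fields_dir).glob("*.sql.gz")):
        sql_path = gz_path.with_suffix("")  # strip the trailing .gz
        with gzip.open(gz_path, "rb") as src, open(sql_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(sql_path)
    return written

if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "data/fields"
    for path in decompress_year_files(target):
        # Then import each file, e.g.:
        # sqlcmd -S localhost -d CIA_WorldFactbook -i <path>
        print(f"wrote {path}")
```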

Structured Field Values Database (NEW)

The raw text in CountryFields.Content has been decomposed into 1,775,588 typed sub-values across 2,599 distinct sub-fields using 55 dedicated parsers. Each row includes a SourceFragment showing the exact text slice that produced the value, and an IsComputed flag distinguishing values extracted directly from source text (0) from values derived by computation (1, e.g. averaging male/female life expectancy for pre-1995 data). Sub-field boundaries in Content use pipe (|) delimiters for unambiguous parsing. This enables SQL queries that were previously impossible without per-query regex — for example, ranking countries by land-vs-water ratio, comparing male vs female life expectancy, or charting budget deficit trends.

Download: factbook.db (~662 MB) from Release v3.5 — single self-contained database with all tables, FTS5 search index, and ISO country codes.

Live dashboard: worldfactbookarchive.org/analysis/structured-data — interactive charts showing new queries with SQL and source data tabs.

The database is self-contained with all reference tables included. Example query:

```sql
-- Top 10 countries by land area (2025) with land/water split
SELECT c.Name,
  MAX(CASE WHEN fv.SubField = 'land' THEN fv.NumericVal END) AS land_sqkm,
  MAX(CASE WHEN fv.SubField = 'water' THEN fv.NumericVal END) AS water_sqkm
FROM FieldValues fv
JOIN CountryFields cf ON fv.FieldID = cf.FieldID
JOIN Countries c ON cf.CountryID = c.CountryID
JOIN FieldNameMappings fnm ON cf.FieldName = fnm.OriginalName
WHERE c.Year = 2025 AND fnm.CanonicalName = 'Area'
GROUP BY c.Name
ORDER BY land_sqkm DESC LIMIT 10;
```

See etl/structured_parsing/DESIGN.md for the full architecture, parser registry, and sub-field catalog.

StarDict Dictionaries (for KOReader / GoldenDict)

Offline dictionaries in StarDict format for use with KOReader, GoldenDict-ng, and other StarDict-compatible apps. Each (country, field) pair is its own dictionary entry (e.g., "Afghanistan - Population"), matching the per-field format used by KOReader's existing factbook dictionary. Entries are searchable by country name, ISO Alpha-2 code, or FIPS 10-4 code combined with the field name.

36 dictionaries (one per year, 1990-2025), ~2.1 million entries total, ~130 MB compressed.

Download: StarDict tarballs from Release v3.5-stardict — grab individual years or all 36.

Generate locally:

```shell
pip install pyglossary python-idzip
python etl/stardict/build_stardict.py                    # all 36 years
python etl/stardict/build_stardict.py --years 2025       # just 2025
```

Output goes to data/stardict/. Each dictionary is a directory with .ifo, .idx, .dict.dz, and .syn files. Copy the directory into your dictionary app's data folder (e.g., koreader/data/dict/ for KOReader).
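To sanity-check a generated dictionary, note that the .ifo file is plain text: a magic first line followed by key=value pairs. A small parsing sketch (the sample values below are invented, not from an actual build):

```python
def parse_ifo(text: str) -> dict[str, str]:
    """Parse a StarDict .ifo file (a magic first line followed by
    key=value pairs) into a dict of metadata."""
    meta = {}
    for line in text.splitlines()[1:]:  # skip the magic header line
        if "=" in line:
            key, _, value = line.partition("=")
            meta[key.strip()] = value.strip()
    return meta

# Invented sample for illustration
sample = """StarDict's dict ifo file
version=3.0.0
bookname=CIA World Factbook 2025
wordcount=58000
"""
info = parse_ifo(sample)
print(info["bookname"], info["wordcount"])  # CIA World Factbook 2025 58000
```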

See etl/stardict/README.md for format details, synonym resolution, validation (16/16 tests), and installation guides.

Alternative: SQLite (No SQL Server Required)

A pre-built SQLite database (factbook.db, ~662 MB) is available as a release download for users who don't need SQL Server. The file exceeds GitHub's 100 MB file-size limit, so it is not included in the repository itself.

Download: factbook.db from the latest release

Place the downloaded file at data/factbook.db in the project root. SQLite requires no installation — Python's built-in sqlite3 module can query it directly.

| | SQL Server | SQLite |
|---|---|---|
| Setup | Install SQL Server + ODBC driver, run schema + import scripts | Download one .db file |
| Size | ~263 MB across 36 gzipped SQL files | ~662 MB single file |
| Query tool | SSMS, sqlcmd, pyodbc | Python sqlite3, DB Browser, any SQLite client |
| Best for | Power BI, enterprise analytics, large-scale joins | Quick exploration, scripting, lightweight apps |
| Schema | Identical 5-table structure | Identical 5-table structure |

The SQLite database contains the same tables (including FieldValues with IsComputed), same indexes, and same 1,071,603 fields as the SQL Server version, plus an FTS5 full-text search index for fast keyword and boolean search. This is what the live webapp runs on.
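The FTS5 index supports boolean keyword queries directly in SQL. A minimal, self-contained sketch of the same mechanism (mock documents; the real factbook.db ships with its index prebuilt):

```python
import sqlite3

# Tiny FTS5 demo mirroring the archive's keyword search. Rows are invented;
# factbook.db's actual FTS table covers all 1,071,603 fields.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE field_fts USING fts5(Country, FieldName, Content)")
db.executemany(
    "INSERT INTO field_fts VALUES (?, ?, ?)",
    [
        ("Norway",  "Natural resources", "petroleum, copper, natural gas, fish"),
        ("Iceland", "Natural resources", "fish, hydropower, geothermal power"),
    ],
)

# Boolean operators go straight into the MATCH expression
rows = db.execute(
    "SELECT Country FROM field_fts WHERE field_fts MATCH 'fish AND petroleum'"
).fetchall()
print(rows)  # [('Norway',)]
```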

Note on numeric values: The CountryFields table stores raw text content exactly as published by the CIA. There is no pre-computed Value column. The CSV and Excel exports on the webapp parse numeric values at download time using regex extraction. To extract numeric values yourself, join through FieldNameMappings and apply a pattern for the field you need:

```python
import sqlite3, re

db = sqlite3.connect("factbook.db")
db.row_factory = sqlite3.Row

# Example: extract Population as an integer
rows = db.execute("""
    SELECT c.Year, mc.CanonicalName, cf.Content
    FROM CountryFields cf
    JOIN Countries c ON cf.CountryID = c.CountryID
    JOIN MasterCountries mc ON c.MasterCountryID = mc.MasterCountryID
    JOIN FieldNameMappings fm ON cf.FieldName = fm.OriginalName
    WHERE fm.CanonicalName = 'Population'
    ORDER BY mc.CanonicalName, c.Year
""").fetchall()

for r in rows:
    m = re.search(r'[\d,]{5,}', r["Content"])
    value = int(m.group().replace(",", "")) if m else None
    if value:
        print(f"{r['CanonicalName']} ({r['Year']}): {value:,}")
```

Common extraction patterns for other fields:

| Field | Pattern | Example |
|---|---|---|
| Population | `re.search(r'[\d,]{5,}', text)` → remove commas, cast to int | 331,449,281 |
| Real GDP per capita | `re.search(r'\$([\d,]+)', text)` → remove commas, cast to int | $63,500 |
| Life expectancy at birth | `re.search(r'total population:\s*([\d.]+)', text)` → float | 78.5 years |
| Population growth rate | `re.search(r'(-?[\d.]+)%', text)` → float | 0.7% |
| Military expenditures | `re.search(r'([\d.]+)%\s*of\s*G[DN]P', text)` → float | 3.45% of GDP |
| Area | `re.search(r'total\s*:?\s*([\d,]+)\s*sq\s*km', text)` → remove commas, cast to int | 9,833,517 sq km |
| Birth/Death rate | `re.search(r'([\d.]+)\s*(?:births\|deaths)\s*/\s*1,000', text)` → float | 11.0 births/1,000 |

Live Web Application

The archive is served as a FastAPI + Jinja2 web application at worldfactbookarchive.org. Key features:

  • Full-text search with Library of Congress boolean syntax (AND, OR, NOT, "phrase", truncation)
  • Library — card-based overview of the entire archive collection
  • Browse archive by year (1990-2025) and country (281 entities)
  • Country profiles with category drill-down, field time series, and data export (CSV, Excel, PDF)
  • Text diff — side-by-side year comparison with highlighted changes for any country
  • Country dictionary with ISO codes, entity types, and COCOM region assignments
  • Interactive maps gallery — browsable country maps with search and filtering
  • Data export — CSV, Excel, print-ready PDF reports, and a complete analysis-ready dataset download
  • Intelligence analysis dashboards powered by Mapbox GL JS and Apache ECharts:
    • Regional Dashboard — global choropleth with 6 COCOM regions, hover popups, capital city markers, and click-to-zoom
    • COCOM Region Detail — per-region map with ranked bar charts for GDP, population, military spending
    • Timeline Map — animated choropleth across 36 years with multi-country time series and playback speed controls
    • Map Compare — two synced maps for side-by-side year comparison with shared color scale
    • Communications Analysis — internet, mobile, broadband penetration with digital divide indicators
    • Global Trends — multi-indicator time series for any country
    • Global Rankings — sortable country rankings by any indicator with year selection
    • Change Detection — year-over-year field changes with trend charts and region filtering
    • Field Explorer — browse all 414 canonical data fields with coverage statistics across 36 years
    • Advanced Analytics Explorer — correlation scatter, heatmap matrix, ranking race, and treemap visualizations
    • Scatter Plot Analysis — multi-indicator scatter plots with regression lines and outlier detection
    • Query Builder — custom analytical queries across all indicators with flexible filters
    • Dashboard Builder — custom multi-panel dashboards with drag-and-drop indicator selection
    • Trade Networks — geographic and force-directed graph of global import/export relationships
    • Organization Networks — international organization memberships and alliance visualization
    • Geopolitical Atlas — territorial disputes, infrastructure mapping, and missile site tracking
    • Political Stability — government type choropleth, regime change tracking, and regional peer comparison
    • Natural Resources & Economy — resource production maps, commodity scatter plots, and country profiles
    • Demographics — population pyramids with country comparison and year-over-year animation
    • Dissolved States — historical entities no longer in the Factbook with archived indicator data
    • Data Coverage — field availability matrix across all 36 years and 281 entities
    • Structured Field Data — interactive dashboard of 1,775,588 parsed sub-values with Chart/SQL/Source tabs showing exactly where each number was extracted from
  • CIA Studies in Intelligence — searchable archive of declassified CIA analytical journal articles with dashboard and browse views
  • World Leaders Database — comprehensive leadership data with governance analysis, power concentration metrics, security apparatus tracking, and interactive choropleth maps
  • Intelligence dossiers following ICD 203 analytic standards
  • Regional threat briefs with instability and security indicators
  • Factbook Quiz — 4 modes: country identification, capital cities, higher-or-lower, and flag recognition
  • Bug reporting via GitHub Issues integration

Stack: Python 3.12, FastAPI, Jinja2, SQLite (FTS5), Mapbox GL JS v3, Apache ECharts 5, deployed on Fly.io.

The web application source code is maintained in a separate private repository.

Entity Types

| Type | Count | Description |
|---|---|---|
| sovereign | 192 | Independent states |
| territory | 65 | Dependencies, overseas territories |
| misc | 7 | Oceans, World, EU |
| disputed | 6 | Kosovo, Gaza Strip, West Bank, etc. |
| crown_dependency | 3 | Jersey, Guernsey, Isle of Man |
| freely_associated | 3 | Marshall Islands, Micronesia, Palau |
| special_admin | 2 | Hong Kong, Macau |
| dissolved | 2 | Netherlands Antilles, Serbia and Montenegro |
| antarctic | 1 | Antarctica |

Field Name Standardization

The CIA renamed many fields over the 36-year span. The FieldNameMappings table maps 1,132 raw field name variants to 416 canonical names (case-sensitive, so Natural hazards and natural hazards get separate rows):

| Mapping Type | Count | Description |
|---|---|---|
| identity | 184 | Modern field names (unchanged) |
| rename | 159 | CIA renamed the field (e.g. "GDP - real growth rate" -> "Real GDP growth rate") |
| dash_format | 64 | Formatting differences (single vs double dashes) |
| consolidation | 48 | Sub-fields merged into parents (e.g. Oil production/consumption -> Petroleum) |
| country_specific | 354 | Regional sub-entries, government body names |
| noise | 281 | Parser artifacts, fragments (flagged IsNoise=1) |
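The case-sensitivity point can be demonstrated with SQLite's default BINARY collation, which behaves like the case-sensitive SQL Server collation this table uses (mock rows):

```python
import sqlite3

# With BINARY collation, 'Natural hazards' and 'natural hazards' are
# distinct OriginalName keys, each with its own mapping row. Mock data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE FieldNameMappings (OriginalName TEXT, CanonicalName TEXT)")
db.executemany("INSERT INTO FieldNameMappings VALUES (?, ?)", [
    ("Natural hazards", "Natural hazards"),
    ("natural hazards", "Natural hazards"),
])
n = db.execute(
    "SELECT COUNT(*) FROM FieldNameMappings WHERE OriginalName = 'Natural hazards'"
).fetchone()[0]
print(n)  # 1 -- only the exact-case row matches
```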

Sample Queries

See queries/sample_queries.sql for 18 ready-to-use analytical queries, including:

  • Population time series (single country)
  • GDP comparison across G7 nations
  • Military expenditure trends
  • Internet users growth
  • Countries that appear/disappear over time
  • Consolidated petroleum sub-field view

Screenshots

  • Homepage — Database statistics, navigation, and live search
  • Full-Text Search — Keyword search across 1,071,603 fields
  • Country Profile — Complete factbook data by category
  • Library — Card-based overview of the entire archive collection
  • Regional Dashboard — Mapbox GL JS choropleth with COCOM drill-down
  • Geopolitical Atlas — Territorial disputes, infrastructure, and missile sites
  • Timeline Map — Animated Mapbox choropleth 1990-2025 with time series
  • Communications Analysis — Internet, mobile, broadband indicators
  • Trade Networks — Geographic and force-directed graph of global trade
  • Global Rankings — Sortable country rankings by any indicator
  • Political Stability — Government type choropleth and regime change tracking
  • Demographics — Population pyramids with country comparison and animation
  • CIA Studies in Intelligence — Searchable archive of declassified CIA journal articles
  • World Leaders Map — Interactive choropleth of governance metrics

49 screenshots and 9 animated GIF demos are available in docs/screenshots/. See the full gallery on the project page.

License

The CIA World Factbook is a work of the United States Government and is in the public domain (17 U.S.C. § 105). This archive is released under CC0 1.0 Universal (Public Domain Dedication).

About

The CIA World Factbook was first published in 1962 as a classified document and made publicly available starting in 1971. The online edition launched in the 1990s and was updated weekly until its discontinuation on February 4, 2026.

This archive was built as a preservation project to ensure this unique longitudinal dataset remains accessible for researchers, analysts, and the public.
