A complete, structured archive of the CIA World Factbook spanning 36 years (1990-2025), covering 281 entities with 1,071,603 data fields in a normalized SQLite database.
The CIA World Factbook was discontinued on February 4, 2026. This archive preserves every edition published since 1990 and creates a structured, queryable dataset.
Search the interactive archive | Project documentation | Executive Summary
Data Integrity: No Factbook content is added or altered. The parsing process structures the CIA's raw text into queryable fields — removing formatting artifacts, sectioning headers, and deduplicating noise lines — but the actual data values are exactly as the CIA published them. The only additions to the source data are reference lookup tables (FIPS-to-ISO code mappings, entity classifications, COCOM regional assignments) that sit alongside the original data, not inside it. In FieldValues, a small number of rows are derived by computation from neighboring sub-values (e.g. total life expectancy averaged from male/female in pre-1995 data); these are clearly flagged with `IsComputed = 1`.
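As a minimal sketch of the `IsComputed` convention (table shape abbreviated and sample numbers invented for illustration — see the FieldValues schema for the real column set):

```python
import sqlite3

# Minimal illustration of the IsComputed convention: a parser-derived
# total (average of male/female life expectancy) is stored alongside
# source-extracted values and flagged IsComputed = 1.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE FieldValues (
    FieldID INTEGER, SubField TEXT, NumericVal REAL, IsComputed INTEGER)""")

male, female = 70.1, 76.3  # invented sample values
rows = [
    (1, "male", male, 0),                            # extracted from source text
    (1, "female", female, 0),                        # extracted from source text
    (1, "total", round((male + female) / 2, 2), 1),  # derived by computation
]
db.executemany("INSERT INTO FieldValues VALUES (?, ?, ?, ?)", rows)

# Analysts who want only CIA-published numbers can filter the flag out:
published = db.execute(
    "SELECT SubField, NumericVal FROM FieldValues WHERE IsComputed = 0"
).fetchall()
print(published)  # [('male', 70.1), ('female', 76.3)]
```

Filtering `WHERE IsComputed = 0` restricts any query to values taken verbatim from the published text.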
| Metric | Value |
|---|---|
| Years covered | 1990-2025 (36 editions) |
| Entities | 281 (192 sovereign states, 65 territories, 6 disputed, and more) |
| Country-year records | 9,536 |
| Category records | 83,682 |
| Data fields | 1,071,603 |
| Content size | ~662 MB (with FieldValues + FTS) |
| Field name variants | 1,132 mapped to 416 canonical names |
| Structured sub-values | 1,775,588 parsed from raw text (2,599 sub-fields) |
| Years | Source | Method |
|---|---|---|
| 1990-1995, 1997-1999 | Project Gutenberg | Plain text parsing (4 format variants across the decade) |
| 1996 | CIA original + Gutenberg | CIA's own text file from Wayback Machine (replaced truncated Gutenberg data for 7 countries) |
| 2000 | Wayback Machine | HTML zip download + classic format parser |
| 2001 | Project Gutenberg | Text fallback (HTML zip was corrupted) |
| 2002-2020 | Wayback Machine | HTML zip archives from cia.gov, 4 parser generations |
| 2021-2025 | factbook/cache.factbook.json | Git history with year-end commit snapshots |
| Year | Source | Countries | Fields |
|---|---|---|---|
| 1990 | Text | 249 | 15,750 |
| 1991 | Text | 247 | 14,903 |
| 1992 | Text | 264 | 17,372 |
| 1993 | Text | 266 | 18,509 |
| 1994 | Text | 266 | 28,633 |
| 1995 | Text | 266 | 19,599 |
| 1996 | Text | 266 | 21,118 |
| 1997 | Text | 266 | 23,405 |
| 1998 | Text | 266 | 23,524 |
| 1999 | Text | 266 | 25,178 |
| 2000 | HTML | 267 | 25,724 |
| 2001 | Text | 265 | 27,281 |
| 2002 | HTML | 268 | 27,430 |
| 2003 | HTML | 268 | 28,676 |
| 2004 | HTML | 271 | 28,958 |
| 2005 | HTML | 271 | 28,728 |
| 2006 | HTML | 274 | 28,962 |
| 2007 | HTML | 266 | 29,103 |
| 2008 | HTML | 263 | 30,755 |
| 2009 | HTML | 260 | 30,818 |
| 2010 | HTML | 262 | 30,805 |
| 2011 | HTML | 263 | 33,635 |
| 2012 | HTML | 271 | 35,192 |
| 2013 | HTML | 269 | 36,731 |
| 2014 | HTML | 268 | 36,680 |
| 2015 | HTML | 268 | 36,870 |
| 2016 | HTML | 268 | 36,804 |
| 2017 | HTML | 268 | 37,046 |
| 2018 | HTML | 268 | 37,285 |
| 2019 | HTML | 268 | 37,394 |
| 2020 | HTML | 268 | 36,687 |
| 2021 | JSON | 260 | 39,714 |
| 2022 | JSON | 260 | 37,344 |
| 2023 | JSON | 260 | 37,558 |
| 2024 | JSON | 260 | 34,838 |
| 2025 | JSON | 260 | 32,594 |
```
data/
  master_countries.sql             # 281 canonical entities
  countries.sql                    # 9,536 country-year records
  categories.sql                   # 83,682 category records
  field_name_mappings.sql          # 1,132 field name standardization rules
  fields/
    country_fields_1990.sql.gz     # Split by year (36 gzipped files)
    ...
    country_fields_2025.sql.gz
schema/
  create_tables.sql                # DDL for all 5 tables
  create_field_values.sql          # DDL for FieldValues (structured sub-values)
etl/
  build_archive.py                 # HTML parser (2000-2020)
  load_gutenberg_years.py          # Text parser (1990-2001)
  reload_json_years.py             # JSON loader (2021-2025)
  build_field_mappings.py          # Field name standardization
  classify_entities.py             # Entity type classification
  repair_1996_truncated.py         # CIA original text parser for 7 truncated 1996 countries
  validate_integrity.py            # Data quality checks
  export_to_sqlite.py              # SQL Server -> SQLite export (with FTS5)
  structured_parsing/
    parse_field_values.py          # Decompose text blobs into typed sub-values (55 parsers)
    validate_field_values.py       # Validation: spot checks, coverage, numeric stats
    export_field_values_to_sqlite.py  # Export FieldValues + refs to SQLite
    dashboard_preview.py           # Local preview dashboard (8 chart panels)
    DESIGN.md                      # Architecture and design document
  stardict/
    build_stardict.py              # Generate StarDict dictionaries for KOReader/GoldenDict
scripts/
  factbook_search.py               # Command-line search utility
  validate_cocom.py                # COCOM region validation
  validate_integrity.py            # Data integrity checks
queries/
  sample_queries.sql               # 18 analytical queries for Power BI / analysis
  search_cli.py                    # Command-line search tool
docs/
  DATABASE_SCHEMA.md               # Table definitions and relationships
  ETL_PIPELINE.md                  # How the archive was built
  FIELD_EVOLUTION.md               # How CIA field names changed over time
  METHODOLOGY.md                   # Complete methodology: parsing, standardization, validation
  screenshots/                     # 49 PNGs + 9 animated GIFs of the web application
index.html                         # GitHub Pages static landing page
```
There are two ways to get the data:

- **Import the pre-built SQL dumps** (in `data/`) — no Python required. See How to Restore below.
- **Re-run the ETL pipeline from scratch** — the Python scripts in `etl/` download raw Factbook data directly from their original sources (Wayback Machine, Project Gutenberg, GitHub), parse every format variant, and load the structured results into SQL Server via pyodbc. This is how the archive was originally built.
The raw CIA World Factbook changed format at least 10 times between 1990 and 2025. Every script in etl/ exists because a previous version of the parser broke on a new year's data. The pipeline handles 4 plain-text format variants, 5 HTML layout generations, and a final JSON era — each requiring its own parser.
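Layout detection for the HTML era can be sketched as content sniffing (an illustrative sketch only — the real detection logic lives in `etl/build_archive.py`, and the marker strings here are assumptions based on the layout generations described in this README):

```python
# Illustrative HTML layout sniffing: each parser generation is selected by
# a marker characteristic of that era's page structure. Marker strings are
# assumptions, not the actual detection rules in build_archive.py.
def detect_layout(html: str) -> str:
    if 'class="FieldLabel"' in html:
        return "2004-table"          # <td class="FieldLabel"> table layouts
    if "CollapsiblePanel" in html:
        return "2008-collapsible"    # JavaScript widget era
    if "<b>" in html and ":</b>" in html:
        return "2000-inline"         # <b>FieldName:</b> inline formatting
    return "unknown"

print(detect_layout('<td class="FieldLabel">Population</td>'))  # 2004-table
```

Routing each year's pages through a detector like this is what lets one pipeline survive five incompatible redesigns.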
| Script | Lines | Years | What it does |
|---|---|---|---|
| `build_archive.py` | 986 | 2000-2020 | Downloads HTML zips from the Wayback Machine, detects which of 5 HTML layouts each year uses, and parses fields. Handles CDX API fallback when downloads fail, filters template/print pages, and deduplicates entries. |
| `load_gutenberg_years.py` | 1,043 | 1990-2001 | Parses plain-text Gutenberg editions with 4 distinct format variants (tagged, asterisk, at-sign, colon). Includes fuzzy country name matching to handle aliases like "Burma" and "Ivory Coast." |
| `reload_json_years.py` | 413 | 2021-2025 | Checks out year-end git commits from the factbook/cache.factbook.json repo and loads structured JSON. Strips embedded HTML from content fields. |
| `build_field_mappings.py` | 783 | All | Maps 1,132 raw field name variants to 416 canonical names using a 7-rule system: identity, dash normalization, CIA renames, consolidation, country-specific entries, noise detection, and manual review. Uses case-sensitive collation (`COLLATE Latin1_General_CS_AS`) so case variants get separate mapping rows. Standardizes pipe delimiters in Content as sub-field boundaries across all eras. |
| `classify_entities.py` | 283 | All | Auto-classifies 281 entities into 9 types (sovereign, territory, disputed, etc.) based on Dependency Status and Government Type fields, with hardcoded overrides for edge cases. |
| `repair_1996_truncated.py` | 189 | 1996 | Parses CIA's original wfb-96.txt.gz (page-header format with centered country/section names) and replaces truncated Gutenberg entries for 7 countries: Venezuela, Armenia, Greece, Luxembourg, Malta, Monaco, Tuvalu. |
| `validate_integrity.py` | 296 | All | Read-only validation suite with 10 checks: field count benchmarks, US population/GDP ground truth, year-over-year consistency, source provenance, NULL detection, and unmapped field name gap analysis. |
| `structured_parsing/parse_field_values.py` | 1,400+ | All | Decomposes 1,071,603 raw text blobs into 1,775,588 typed sub-values using 55 field-specific parsers plus a generic fallback. Each row includes a SourceFragment column showing the exact text slice that produced the value, and an IsComputed flag distinguishing source-extracted values (0) from parser-derived values (1). Extracts land/water area, male/female life expectancy, age brackets, sex ratios, literacy, budget, elevation, dependency ratios, GDP composition, CO2 emissions, water/sanitation, and more. |
| `structured_parsing/validate_field_values.py` | 231 | All | Validates FieldValues against the source database: row counts, coverage, numeric extraction rate, spot checks against known ground truth values (US population, Russia area, Japan military). |
| `structured_parsing/export_field_values_to_sqlite.py` | 269 | All | Exports FieldValues plus all reference tables to a self-contained SQLite database (factbook.db, ~662 MB with FTS5 + ISOCountryCodes when using `--webapp`). Includes a post-export integrity check that detects and backfills any unmapped field names. |
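The fuzzy country-name matching described for `load_gutenberg_years.py` can be sketched as an alias lookup with a typo-tolerant fallback (an illustrative sketch only — the master list, alias table, and cutoff here are small invented samples, not the script's actual data or logic):

```python
import difflib

# Illustrative fuzzy country-name matching: explicit aliases resolve first,
# then difflib catches near-miss spellings. All sample data is invented.
MASTER = ["Burma", "Cote d'Ivoire", "Germany", "Venezuela"]
ALIASES = {"myanmar": "Burma", "ivory coast": "Cote d'Ivoire"}

def match_country(raw):
    key = raw.strip().lower()
    if key in ALIASES:                       # explicit alias table first
        return ALIASES[key]
    # Fall back to fuzzy matching against the master list (typo tolerance)
    hits = difflib.get_close_matches(raw.strip(), MASTER, n=1, cutoff=0.8)
    return hits[0] if hits else None

print(match_country("Ivory Coast"))   # Cote d'Ivoire
print(match_country("Venezuala"))     # Venezuela (one-letter typo)
```

Putting the alias table before the fuzzy pass keeps known renames deterministic; fuzzy matching only handles genuine misspellings.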
The CIA never maintained a stable schema. Every few years the HTML layout changed completely, field names were renamed without notice, and entire categories were restructured. Examples:
- **1990-1999 (plain text):** Four different formatting conventions across the decade. 1990-1993 used `Country: Name` headers with indented fields. 1994 introduced tagged markers (`_@_`, `_*_`, `_#_`). 1996 switched to bare section headers with no markers at all. 1999 changed the delimiter scheme again. Each variant required its own regex-based parser.
- **2000-2020 (HTML):** The CIA redesigned the Factbook website at least 5 times. The 2000 edition used `<b>FieldName:</b>` inline formatting. By 2004 it had switched to `<td class="FieldLabel">` table layouts. 2008 introduced CollapsiblePanel JavaScript widgets. 2014 changed to expand/collapse `<h2>` sections. 2017 moved to field-anchor `<div>` structures. A parser that worked on 2006 data would produce garbage on 2010 data.
- **2001:** The Wayback Machine's HTML zip for 2001 was corrupted, so parsing fell back to the Project Gutenberg plain-text edition — the only year where the HTML and text pipelines had to swap.
- **Field name drift:** The CIA renamed fields silently over the decades. "GDP - real growth rate" became "Real GDP growth rate." "Telephones" split into "Telephones - fixed lines" and "Telephones - mobile cellular." Oil sub-fields were consolidated into "Petroleum." The `build_field_mappings.py` script maps all 1,132 variants through 7 rule layers to maintain time-series continuity.
- **Country-specific noise:** 1990s editions embedded government body names (parliaments, assemblies, courts) as top-level fields. Dissolved countries (Serbia and Montenegro, Netherlands Antilles) appear and disappear. Turkish Cyprus, Malaysian states, and Netherlands Antilles islands show up as sub-entries that need special handling.
See docs/METHODOLOGY.md and docs/ETL_PIPELINE.md for the full technical details.
- SQL Server 2017+ (or Azure SQL)
- ODBC Driver 18 for SQL Server
- Python 3.8+ with `pyodbc` (for the ETL scripts and CLI search tool)
1. Create the database:

   ```sql
   CREATE DATABASE CIA_WorldFactbook;
   ```

2. Run the schema script:

   ```
   sqlcmd -S localhost -d CIA_WorldFactbook -i schema/create_tables.sql
   ```

3. Import data (in order):

   ```
   sqlcmd -S localhost -d CIA_WorldFactbook -i data/master_countries.sql
   sqlcmd -S localhost -d CIA_WorldFactbook -i data/countries.sql
   sqlcmd -S localhost -d CIA_WorldFactbook -i data/categories.sql
   sqlcmd -S localhost -d CIA_WorldFactbook -i data/field_name_mappings.sql
   ```

4. Import field data (decompress and import each year):

   ```bash
   cd data/fields
   # On Linux/macOS:
   for f in *.sql.gz; do
       gunzip -k "$f" && sqlcmd -S localhost -d CIA_WorldFactbook -i "${f%.gz}"
   done
   ```

   ```powershell
   # On Windows (PowerShell):
   Get-ChildItem *.sql.gz | ForEach-Object {
       $sql = $_.FullName -replace '\.gz$',''
       [System.IO.Compression.GZipStream]::new(
           [System.IO.File]::OpenRead($_.FullName),
           [System.IO.Compression.CompressionMode]::Decompress
       ).CopyTo([System.IO.File]::Create($sql))
       sqlcmd -S localhost -d CIA_WorldFactbook -i $sql
   }
   ```

5. Verify:

   ```sql
   SELECT COUNT(*) FROM CountryFields;  -- Should return 1,071,603
   ```
The raw text in CountryFields.Content has been decomposed into 1,775,588 typed sub-values across 2,599 distinct sub-fields using 55 dedicated parsers. Each row includes a SourceFragment showing the exact text slice that produced the value, and an IsComputed flag distinguishing values extracted directly from source text (0) from values derived by computation (1, e.g. averaging male/female life expectancy for pre-1995 data). Sub-field boundaries in Content use pipe (|) delimiters for unambiguous parsing. This enables SQL queries that were previously impossible without per-query regex — for example, ranking countries by land-vs-water ratio, comparing male vs female life expectancy, or charting budget deficit trends.
Download: factbook.db (~662 MB) from Release v3.5 — single self-contained database with all tables, FTS5 search index, and ISO country codes.
Live dashboard: worldfactbookarchive.org/analysis/structured-data — interactive charts showing new queries with SQL and source data tabs.
The database is self-contained with all reference tables included. Example query:
```sql
-- Top 10 countries by land area (2025) with land/water split
SELECT c.Name,
       MAX(CASE WHEN fv.SubField = 'land'  THEN fv.NumericVal END) AS land_sqkm,
       MAX(CASE WHEN fv.SubField = 'water' THEN fv.NumericVal END) AS water_sqkm
FROM FieldValues fv
JOIN CountryFields cf ON fv.FieldID = cf.FieldID
JOIN Countries c ON cf.CountryID = c.CountryID
JOIN FieldNameMappings fnm ON cf.FieldName = fnm.OriginalName
WHERE c.Year = 2025 AND fnm.CanonicalName = 'Area'
GROUP BY c.Name
ORDER BY land_sqkm DESC LIMIT 10;
```

See `etl/structured_parsing/DESIGN.md` for the full architecture, parser registry, and sub-field catalog.
Offline dictionaries in StarDict format for use with KOReader, GoldenDict-ng, and other StarDict-compatible apps. Each (country, field) pair is its own dictionary entry (e.g., "Afghanistan - Population"), matching the per-field format used by KOReader's existing factbook dictionary. Entries are searchable by country name, ISO Alpha-2 code, or FIPS 10-4 code combined with the field name.
36 dictionaries (one per year, 1990-2025), ~2.1 million entries total, ~130 MB compressed.
Download: StarDict tarballs from Release v3.5-stardict — grab individual years or all 36.
Generate locally:

```bash
pip install pyglossary python-idzip
python etl/stardict/build_stardict.py               # all 36 years
python etl/stardict/build_stardict.py --years 2025  # just 2025
```

Output goes to `data/stardict/`. Each dictionary is a directory with `.ifo`, `.idx`, `.dict.dz`, and `.syn` files. Copy the directory into your dictionary app's data folder (e.g., `koreader/data/dict/` for KOReader).
See etl/stardict/README.md for format details, synonym resolution, validation (16/16 tests), and installation guides.
A pre-built SQLite database (factbook.db, ~662 MB) is available as a release download for users who don't need SQL Server. The file exceeds GitHub's 100 MB file-size limit, so it is not included in the repository itself.
Download: factbook.db from the latest release
Place the downloaded file at data/factbook.db in the project root. SQLite requires no installation — Python's built-in sqlite3 module can query it directly.
| | SQL Server | SQLite |
|---|---|---|
| Setup | Install SQL Server + ODBC driver, run schema + import scripts | Download one .db file |
| Size | ~263 MB across 36 gzipped SQL files | ~662 MB single file |
| Query tool | SSMS, sqlcmd, pyodbc | Python sqlite3, DB Browser, any SQLite client |
| Best for | Power BI, enterprise analytics, large-scale joins | Quick exploration, scripting, lightweight apps |
| Schema | Identical 5-table structure | Identical 5-table structure |
The SQLite database contains the same tables (including FieldValues with IsComputed), same indexes, and same 1,071,603 fields as the SQL Server version, plus an FTS5 full-text search index for fast keyword and boolean search. This is what the live webapp runs on.
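FTS5 keyword queries work through plain `sqlite3`. As a self-contained sketch against an in-memory database (the table name `field_fts`, its columns, and the sample rows here are invented for illustration — the archive's actual FTS table may be shaped differently):

```python
import sqlite3

# Minimal FTS5 demonstration: virtual table, sample rows, boolean MATCH.
# Table name, columns, and rows are invented illustrations.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE field_fts USING fts5(country, field, content)")
db.executemany("INSERT INTO field_fts VALUES (?, ?, ?)", [
    ("United States", "Population", "331,449,281 (July 2021 est.)"),
    ("France", "Climate", "generally cool winters and mild summers"),
    ("Chad", "Climate", "tropical in south, desert in north"),
])

# Boolean keyword search: rows matching both "climate" AND "desert"
hits = db.execute(
    "SELECT country FROM field_fts WHERE field_fts MATCH ?",
    ("climate AND desert",),
).fetchall()
print(hits)  # [('Chad',)]
```

The same `MATCH` syntax (AND, OR, NOT, quoted phrases) backs the webapp's boolean search.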
Note on numeric values: The CountryFields table stores raw text content exactly as published by the CIA. There is no pre-computed Value column. The CSV and Excel exports on the webapp parse numeric values at download time using regex extraction. To extract numeric values yourself, join through FieldNameMappings and apply a pattern for the field you need:
```python
import sqlite3, re

db = sqlite3.connect("factbook.db")
db.row_factory = sqlite3.Row

# Example: extract Population as an integer
rows = db.execute("""
    SELECT c.Year, mc.CanonicalName, cf.Content
    FROM CountryFields cf
    JOIN Countries c ON cf.CountryID = c.CountryID
    JOIN MasterCountries mc ON c.MasterCountryID = mc.MasterCountryID
    JOIN FieldNameMappings fm ON cf.FieldName = fm.OriginalName
    WHERE fm.CanonicalName = 'Population'
    ORDER BY mc.CanonicalName, c.Year
""").fetchall()

for r in rows:
    m = re.search(r'[\d,]{5,}', r["Content"])
    value = int(m.group().replace(",", "")) if m else None
    if value:
        print(f"{r['CanonicalName']} ({r['Year']}): {value:,}")
```

Common extraction patterns for other fields:
| Field | Pattern | Example |
|---|---|---|
| Population | `re.search(r'[\d,]{5,}', text)` → remove commas, cast to int | 331,449,281 |
| Real GDP per capita | `re.search(r'\$([\d,]+)', text)` → remove commas, cast to int | $63,500 |
| Life expectancy at birth | `re.search(r'total population:\s*([\d.]+)', text)` → float | 78.5 years |
| Population growth rate | `re.search(r'(-?[\d.]+)%', text)` → float | 0.7% |
| Military expenditures | `re.search(r'([\d.]+)%\s*of\s*G[DN]P', text)` → float | 3.45% of GDP |
| Area | `re.search(r'total\s*:?\s*([\d,]+)\s*sq\s*km', text)` → remove commas, cast to int | 9,833,517 sq km |
| Birth/Death rate | `re.search(r'([\d.]+)\s*(?:births\|deaths)\s*/\s*1,000', text)` → float | 11.0 births/1,000 |
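Applied to raw Content strings, those patterns look like this (the sample strings below are invented examples in the Factbook's typical phrasing, not rows pulled from the database):

```python
import re

# Applying the extraction patterns to invented sample Content strings.
samples = {
    "Population": "331,449,281 (July 2021 est.)",
    "Real GDP per capita": "$63,500 (2021 est.)",
    "Life expectancy at birth": "total population: 78.5 years",
    "Military expenditures": "3.45% of GDP (2021 est.)",
}

population = int(re.search(r'[\d,]{5,}', samples["Population"])
                 .group().replace(",", ""))
gdp_pc = int(re.search(r'\$([\d,]+)', samples["Real GDP per capita"])
             .group(1).replace(",", ""))
life = float(re.search(r'total population:\s*([\d.]+)',
                       samples["Life expectancy at birth"]).group(1))
milex = float(re.search(r'([\d.]+)%\s*of\s*G[DN]P',
                        samples["Military expenditures"]).group(1))

print(population, gdp_pc, life, milex)  # 331449281 63500 78.5 3.45
```

In bulk use, guard each `re.search` result for `None` as in the Population example above, since early editions phrase many fields differently.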
The archive is served as a FastAPI + Jinja2 web application at worldfactbookarchive.org. Key features:
- Full-text search with Library of Congress boolean syntax (AND, OR, NOT, "phrase", truncation)
- Library — card-based overview of the entire archive collection
- Browse archive by year (1990-2025) and country (281 entities)
- Country profiles with category drill-down, field time series, and data export (CSV, Excel, PDF)
- Text diff — side-by-side year comparison with highlighted changes for any country
- Country dictionary with ISO codes, entity types, and COCOM region assignments
- Interactive maps gallery — browsable country maps with search and filtering
- Data export — CSV, Excel, print-ready PDF reports, and a complete analysis-ready dataset download
- Intelligence analysis dashboards powered by Mapbox GL JS and Apache ECharts:
- Regional Dashboard — global choropleth with 6 COCOM regions, hover popups, capital city markers, and click-to-zoom
- COCOM Region Detail — per-region map with ranked bar charts for GDP, population, military spending
- Timeline Map — animated choropleth across 36 years with multi-country time series and playback speed controls
- Map Compare — two synced maps for side-by-side year comparison with shared color scale
- Communications Analysis — internet, mobile, broadband penetration with digital divide indicators
- Global Trends — multi-indicator time series for any country
- Global Rankings — sortable country rankings by any indicator with year selection
- Change Detection — year-over-year field changes with trend charts and region filtering
- Field Explorer — browse all 414 canonical data fields with coverage statistics across 36 years
- Advanced Analytics Explorer — correlation scatter, heatmap matrix, ranking race, and treemap visualizations
- Scatter Plot Analysis — multi-indicator scatter plots with regression lines and outlier detection
- Query Builder — custom analytical queries across all indicators with flexible filters
- Dashboard Builder — custom multi-panel dashboards with drag-and-drop indicator selection
- Trade Networks — geographic and force-directed graph of global import/export relationships
- Organization Networks — international organization memberships and alliance visualization
- Geopolitical Atlas — territorial disputes, infrastructure mapping, and missile site tracking
- Political Stability — government type choropleth, regime change tracking, and regional peer comparison
- Natural Resources & Economy — resource production maps, commodity scatter plots, and country profiles
- Demographics — population pyramids with country comparison and year-over-year animation
- Dissolved States — historical entities no longer in the Factbook with archived indicator data
- Data Coverage — field availability matrix across all 36 years and 281 entities
- Structured Field Data — interactive dashboard of 1,775,588 parsed sub-values with Chart/SQL/Source tabs showing exactly where each number was extracted from
- CIA Studies in Intelligence — searchable archive of declassified CIA analytical journal articles with dashboard and browse views
- World Leaders Database — comprehensive leadership data with governance analysis, power concentration metrics, security apparatus tracking, and interactive choropleth maps
- Intelligence dossiers following ICD 203 analytic standards
- Regional threat briefs with instability and security indicators
- Factbook Quiz — 4 modes: country identification, capital cities, higher-or-lower, and flag recognition
- Bug reporting via GitHub Issues integration
Stack: Python 3.12, FastAPI, Jinja2, SQLite (FTS5), Mapbox GL JS v3, Apache ECharts 5, deployed on Fly.io.
The web application source code is maintained in a separate private repository.
| Type | Count | Description |
|---|---|---|
| sovereign | 192 | Independent states |
| territory | 65 | Dependencies, overseas territories |
| misc | 7 | Oceans, World, EU |
| disputed | 6 | Kosovo, Gaza Strip, West Bank, etc. |
| crown_dependency | 3 | Jersey, Guernsey, Isle of Man |
| freely_associated | 3 | Marshall Islands, Micronesia, Palau |
| special_admin | 2 | Hong Kong, Macau |
| dissolved | 2 | Netherlands Antilles, Serbia and Montenegro |
| antarctic | 1 | Antarctica |
The CIA renamed many fields over the 36-year span. The FieldNameMappings table maps 1,132 raw field name variants to 416 canonical names (case-sensitive, so `Natural hazards` and `natural hazards` get separate rows):
| Mapping Type | Count | Description |
|---|---|---|
| identity | 184 | Modern field names (unchanged) |
| rename | 159 | CIA renamed the field (e.g. "GDP - real growth rate" -> "Real GDP growth rate") |
| dash_format | 64 | Formatting differences (single vs double dashes) |
| consolidation | 48 | Sub-fields merged into parents (e.g. Oil production/consumption -> Petroleum) |
| country_specific | 354 | Regional sub-entries, government body names |
| noise | 281 | Parser artifacts, fragments (flagged IsNoise=1) |
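Resolving a raw field name through the mapping table amounts to a case-sensitive lookup. A minimal sketch (the mapping rows below are invented samples in the same OriginalName → CanonicalName shape, paired with their rule type):

```python
# Illustrative resolution through FieldNameMappings. The rows here are
# invented samples; the real table holds 1,132 mappings.
MAPPINGS = {
    "Population": ("Population", "identity"),
    "GDP - real growth rate": ("Real GDP growth rate", "rename"),
    "GDP--real growth rate": ("Real GDP growth rate", "dash_format"),
    "Oil - production": ("Petroleum", "consolidation"),
}

def canonicalize(raw_name):
    # Case-sensitive by design, matching the table's collation.
    entry = MAPPINGS.get(raw_name)
    return entry[0] if entry else None

print(canonicalize("GDP - real growth rate"))  # Real GDP growth rate
```

Because every era's spelling maps to one canonical name, a single canonical key yields an unbroken time series across renames and reformatting.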
See queries/sample_queries.sql for 18 ready-to-use analytical queries, including:
- Population time series (single country)
- GDP comparison across G7 nations
- Military expenditure trends
- Internet users growth
- Countries that appear/disappear over time
- Consolidated petroleum sub-field view
49 screenshots and 9 animated GIF demos are available in docs/screenshots/. See the full gallery on the project page.
The CIA World Factbook is a work of the United States Government and is in the public domain (17 U.S.C. § 105). This archive is released under CC0 1.0 Universal (Public Domain Dedication).
The CIA World Factbook was first published in 1962 as a classified document and made publicly available starting in 1971. The online edition launched in the 1990s and was updated weekly until its discontinuation on February 4, 2026.
This archive was built as a preservation project to ensure this unique longitudinal dataset remains accessible for researchers, analysts, and the public.