Exploratory Bibliographic and Text Network Analysis of British Comedy Dramas Across 17th-19th Centuries Using OpenRefine, Python and Gephi
Course: Introduction to Digital Humanities (2025-26), taught by Dr. Margherita Fantoli at KU Leuven (MSc Digital Humanities)
This project explores the British Library’s bibliographic dataset of digitized British dramas from the 17th–19th centuries, using an exploratory distant reading approach.
The workflow of this notebook proceeds in three stages:

- Descriptive Statistics and Visualization of the Dataset: provides a structured overview of the dataset for both the researcher and the reader, helping to understand its size, structure, and suitability for further analysis.
- Exploratory Data Analysis on the Comedy Subdataset: examines temporal trends in publications, spatial networks of publishing locations, authorship distribution, and lexical patterns in comedy titles.
- Text Mining and Network Analysis: constructs a syntactic dependency network of frequent lemmas in comedy titles to address the central research question: How do 17th–19th century British comedy titles encode gendered roles and archetypes, and what do they reveal about the social expectations of the period?
Repository structure:

- data/
- data_analysis/
- network_analysis/
- data_visualization/ # charts and data visualizations from Python
- project_management/ # project management plan + Gantt chart
- visual_assets/ # visual aesthetic images for the Python HTML notebook
- Original Source: British Library digitized drama metadata, cleaned as part of a group assignment for the Introduction to Digital Humanities course, taught by Dr. Margherita Fantoli in academic year 2025-26 at KU Leuven (MSc Digital Humanities).
- Data Wrangling Process:
- OpenRefine used to wrangle data while preserving original information.
- Facet filtering to isolate subsets and identify inconsistencies.
- GREL scripting to standardize and transform columns.
  - Wikidata reconciliation to enrich author information with contextual metadata.
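The trim/standardize transformations above were done with GREL inside OpenRefine; a roughly equivalent step can be sketched in pandas (the column names and sample rows below are hypothetical, not the actual dataset schema):

```python
import pandas as pd

# Hypothetical rows resembling raw British Library metadata
df = pd.DataFrame({
    "Title": ["  the country wife. A COMEDY ", "The Rivals, a comedy  "],
    "Date of publication": ["[1675]", "1775?"],
})

# Trim whitespace and normalize casing (GREL: value.trim().toTitlecase())
df["Title"] = df["Title"].str.strip().str.title()

# Extract a four-digit year from messy date strings like "[1675]" or "1775?"
df["Year"] = df["Date of publication"].str.extract(r"(\d{4})").astype("Int64")

print(df[["Title", "Year"]])
```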
- Cleaned Datasets: Main dataset: dh_group7_drama.csv. Three subdatasets (Comedy, Tragedy, Plays) cleaned in OpenRefine by Chahna Ahuja, Xinran Liu, and Liangyu Gan respectively, as a group component of this assignment (check this GitHub repository to know more!).
- Data Analysis Environment: Jupyter Notebook combining code, output, and narrative; interactive charts via Plotly, word clouds via WordCloud, and author enrichment via the Wikidata API.
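Author lookups against the Wikidata API can be scripted; a minimal sketch using the public `wbsearchentities` endpoint (the helper names and the example author are illustrative, not the notebook's actual code):

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def parse_search_results(payload: dict) -> list[dict]:
    """Extract (id, description) pairs from a wbsearchentities JSON response."""
    return [
        {"id": hit["id"], "description": hit.get("description", "")}
        for hit in payload.get("search", [])
    ]

def wikidata_search(name: str) -> list[dict]:
    """Search Wikidata entities by label (performs a network request)."""
    resp = requests.get(
        WIKIDATA_API,
        params={"action": "wbsearchentities", "search": name,
                "language": "en", "format": "json"},
        headers={"User-Agent": "dh-drama-project/0.1 (course assignment)"},
        timeout=10,
    )
    resp.raise_for_status()
    return parse_search_results(resp.json())

# Example (network call), e.g. a Restoration playwright:
# for hit in wikidata_search("Aphra Behn")[:3]:
#     print(hit["id"], "-", hit["description"])
```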
- Text Mining Using NLP: spaCy (tokenization, lemmatization, dependency parsing), NLTK (preprocessing & exploratory analysis)
- Network Visualization: Gephi used to construct the comedy title syntactic dependency network to explore the hypothesis.
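Gephi's spreadsheet importer expects a nodes table (`Id`, `Label`) and an edges table (`Source`, `Target`, `Weight`). A hedged sketch of exporting dependency pairs in that shape (the sample pairs are invented for illustration):

```python
import csv
from collections import Counter

# Hypothetical (dependent_lemma, head_lemma) pairs from the dependency parse
dependency_pairs = [
    ("merry", "wife"), ("country", "wife"), ("merry", "wife"),
    ("old", "bachelor"),
]

# Weight each edge by how often the pair occurs across titles
edge_weights = Counter(dependency_pairs)
nodes = sorted({lemma for pair in dependency_pairs for lemma in pair})

# Gephi recognizes these exact header names on import
with open("nodes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Id", "Label"])
    for lemma in nodes:
        writer.writerow([lemma, lemma])

with open("edges.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in edge_weights.items():
        writer.writerow([source, target, weight])
```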
- To reproduce the Python notebook, first install the packages listed in requirements.txt with pip.
- Go to https://gephi.org/users/download/ and download Gephi. Open Gephi, and import nodes and edges CSV files from here
- To view the HTML notebook without downloading it, go here