
chahna-ahuja/dh_project


ALL THE METADATA'S A STAGE

Exploratory Bibliographic and Text Network Analysis of British Comedy Dramas Across 17th-19th Centuries Using OpenRefine, Python and Gephi

Author: Chahna Ahuja

Course: Introduction to Digital Humanities class (2025-26), taught by Dr. Margherita Fantoli at KU Leuven (MSc Digital Humanities)

---

Introduction to the Project

This project explores the British Library’s bibliographic dataset of digitized British dramas from the 17th–19th centuries, using an exploratory distant reading approach.

The workflow of this notebook proceeds in three stages:

  1. Descriptive Statistics and Visualization of the Dataset
    Provides a structured overview of the dataset for both researcher and reader, clarifying its size, structure and suitability for further analysis.

  2. Exploratory Data Analysis on Comedy Subdataset
    Examines temporal trends in publication, spatial networks of publishing locations, authorship distribution, and lexical patterns in comedy titles.

  3. Text Mining and Network Analysis
    Constructs a syntactic dependency network of frequent lemmas in comedy titles to address the central research question:

    How do 17th–19th century British comedy titles encode gendered roles and archetypes, and what do they reveal about the social expectations of the period?
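One way the title network in stage 3 could be assembled is to turn each parsed title into head-to-dependent lemma pairs and count them. The sketch below is not the notebook's actual code: the function names `dependency_edges` and `count_edges` and the filtering choices (alphabetic, non-stop-word tokens only) are assumptions made for illustration. It relies only on the spaCy token attributes `lemma_`, `head`, `is_alpha` and `is_stop`.

```python
from collections import Counter


def dependency_edges(doc):
    """Yield (head lemma, dependent lemma) pairs for one parsed title.

    `doc` is expected to be a spaCy Doc, i.e. an iterable of tokens.
    """
    for token in doc:
        head = token.head
        # In spaCy the root token is its own head, so it has no incoming edge.
        if head == token:
            continue
        # Assumed filter: keep only alphabetic tokens on both ends.
        if not (token.is_alpha and head.is_alpha):
            continue
        # Assumed filter: drop edges that involve a stop word.
        if token.is_stop or head.is_stop:
            continue
        yield head.lemma_.lower(), token.lemma_.lower()


def count_edges(docs):
    """Count how often each head-dependent lemma pair occurs across titles."""
    counts = Counter()
    for doc in docs:
        counts.update(dependency_edges(doc))
    return counts
```

With spaCy and an English model installed (for example `nlp = spacy.load("en_core_web_sm")`, where the model name is an assumption), the counts for a list of titles could be obtained with `count_edges(nlp.pipe(titles))`.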


GitHub Repository Structure


Dataset

  • Original Source: British Library digitized drama metadata, cleaned as part of a group assignment for the Introduction to Digital Humanities class taught by Dr. Margherita Fantoli in the 2025-26 academic year at KU Leuven (MSc Digital Humanities).
  • Data Wrangling Process:
    • OpenRefine used to wrangle data while preserving original information.
    • Facet filtering to isolate subsets and identify inconsistencies.
    • GREL scripting to standardize and transform columns.
    • Wikidata reconciliation to enrich author information with contextual metadata.
  • Cleaned Datasets: The main dataset is dh_group7_drama.csv. Three subdatasets (Comedy, Tragedy and Plays) were cleaned in OpenRefine by Chahna Ahuja, Xinran Liu and Liangyu Gan, respectively, as the group component of this assignment. (See this GitHub repository for more details.)
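
The facet step above amounts to counting the distinct values in a column and looking for spellings that differ only in whitespace or case. For readers who want to repeat that check outside OpenRefine, here is a minimal Python sketch. The column name `place` and the sample rows are invented for illustration and are not taken from dh_group7_drama.csv.

```python
import csv
import io
from collections import Counter, defaultdict

# Invented sample rows; the real file and its column names may differ.
SAMPLE_CSV = """title,place
A Comedy,London
Another Comedy, london
A Third Comedy,Dublin
"""


def facet(rows, column):
    """Count the raw values in one column, much like a text facet."""
    return Counter(row[column] for row in rows)


def variant_groups(values):
    """Group raw values that become identical after trimming and case-folding."""
    groups = defaultdict(set)
    for value in values:
        groups[value.strip().casefold()].add(value)
    # Keep only keys with more than one spelling: these need standardizing.
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
counts = facet(rows, "place")
print(counts)
print(variant_groups(counts))
```

Values flagged by `variant_groups` are candidates for a standardizing transform; the original spellings stay available in `counts`, in keeping with the goal of preserving original information.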

Methods

  • Data Analysis Environment: Jupyter Notebook combining code, output, and narrative, with interactive charts via Plotly, word clouds via WordCloud, and queries to the Wikidata API.
  • Text Mining Using NLP: spaCy (tokenization, lemmatization, dependency parsing), NLTK (preprocessing & exploratory analysis)
  • Network Visualization: Gephi used to construct the comedy title syntactic dependency network to explore the hypothesis.
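
Gephi can import an edge list from a CSV file with `Source`, `Target` and `Weight` columns, so the hand-off from Python to Gephi can be quite small. The sketch below shows one possible export; the function name `write_gephi_edges`, the `min_weight` threshold and the sample counts are hypothetical, and the notebook may export its edges differently.

```python
import csv
import io
from collections import Counter


def write_gephi_edges(edge_counts, handle, min_weight=1):
    """Write weighted edges as a Source,Target,Weight CSV for Gephi."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    # most_common() lists the heaviest edges first.
    for (source, target), weight in edge_counts.most_common():
        if weight >= min_weight:
            writer.writerow([source, target, weight])


# Invented counts of (head lemma, dependent lemma) pairs, for illustration only.
edge_counts = Counter({("wife", "country"): 3, ("lover", "jealous"): 1})

buffer = io.StringIO()
write_gephi_edges(edge_counts, buffer, min_weight=2)
print(buffer.getvalue())
```

To produce a file instead of an in-memory string, pass a handle opened with `open("edges.csv", "w", newline="", encoding="utf-8")`, where the file name is a placeholder. Raising `min_weight` is one simple way to keep only frequent pairs in the graph.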

Tools for Python and Gephi
