
ALL THE METADATA'S A STAGE

Exploratory Bibliographic and Text Network Analysis of British Comedy Dramas Across 17th-19th Centuries Using OpenRefine, Python and Gephi

Author: Chahna Ahuja

Course: Introduction to Digital Humanities (2025–26), taught by Dr. Margherita Fantoli at KU Leuven (MSc Digital Humanities)

---

Introduction to the Project

This project explores the British Library’s bibliographic dataset of digitized British dramas from the 17th–19th centuries, using an exploratory distant reading approach.

The workflow of this notebook proceeds in three stages:

  1. Descriptive Statistics and Visualization of the Dataset
    Provides a structured overview of the dataset for both the researcher and the reader, clarifying its size, structure, and suitability for further analysis.

  2. Exploratory Data Analysis on Comedy Subdataset
    Investigates temporal trends in publications, spatial networks of publishing locations, authorship distribution, and lexical patterns in comedy titles.

  3. Text Mining and Network Analysis
    Constructs a syntactic dependency network of frequent lemmas in comedy titles to address the central research question:

    How do 17th–19th century British comedy titles encode gendered roles and archetypes, and what do they reveal about the social expectations of the period?
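The final stage feeds a weighted edge list into Gephi. A minimal sketch of that step, assuming (child lemma, head lemma) dependency pairs have already been extracted from the titles with spaCy; the pairs below are hypothetical examples, and the output follows Gephi's default `Source,Target,Weight` edge CSV layout:

```python
import csv
import io
from collections import Counter

# Hypothetical (child_lemma, head_lemma) pairs, as a spaCy dependency
# parse of comedy titles might yield. Real pairs come from the notebook.
dependency_pairs = [
    ("country", "wife"), ("old", "bachelor"), ("provoked", "wife"),
    ("country", "wife"), ("beaux", "stratagem"),
]

# Aggregate repeated pairs into weighted edges.
edge_weights = Counter(dependency_pairs)

# Write a Gephi-compatible edge list (Source, Target, Weight).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Source", "Target", "Weight"])
for (source, target), weight in sorted(edge_weights.items()):
    writer.writerow([source, target, weight])

print(buffer.getvalue())
```

Importing such a CSV via Gephi's Data Laboratory produces the syntactic dependency network explored in stage 3.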


GitHub Repository Structure


Dataset

  • Original Source: British Library digitized drama metadata, cleaned as part of a group assignment for the Introduction to Digital Humanities class taught by Dr. Margherita Fantoli in the 2025–26 academic year at KU Leuven (MSc Digital Humanities).
  • Data Wrangling Process:
    • OpenRefine used to wrangle data while preserving original information.
    • Facet filtering to isolate subsets and identify inconsistencies.
    • GREL scripting to standardize and transform columns.
    • Wikidata reconciliation to enrich author information with contextual metadata.
  • Cleaned Datasets: Main dataset: dh_group7_drama.csv. Three subdatasets (Comedy, Tragedy, Plays) were cleaned in OpenRefine by Chahna Ahuja, Xinran Liu, and Liangyu Gan, respectively, as the group component of this assignment. (Check this GitHub repository to learn more!)
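Loading the main dataset and isolating the comedy subset can be sketched as follows; in the notebook this would be `pd.read_csv("dh_group7_drama.csv")`, while the inline sample and the column names (`Title`, `Genre`, `Year`) here are hypothetical stand-ins, not the actual headers of the cleaned file:

```python
import io

import pandas as pd

# Stand-in for pd.read_csv("dh_group7_drama.csv"); rows and column
# names are illustrative only.
sample = io.StringIO(
    "Title,Genre,Year\n"
    "The Country Wife,Comedy,1675\n"
    "The Duchess of Malfi,Tragedy,1623\n"
    "The Beaux' Stratagem,Comedy,1707\n"
)
drama = pd.read_csv(sample)

# Isolate the comedy subdataset for exploratory analysis.
comedy = drama[drama["Genre"] == "Comedy"]
print(len(comedy))  # number of comedy records in this sample
```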

Methods

  • Data Analysis Environment: Jupyter Notebook combining code, output, and narrative. Interactive charts via Plotly, word clouds via WordCloud. Use of Wikidata API.
  • Text Mining Using NLP: spaCy (tokenization, lemmatization, dependency parsing), NLTK (preprocessing & exploratory analysis)
  • Network Visualization: Gephi used to construct the comedy title syntactic dependency network to explore the hypothesis.

Tools for Python and Gephi