- Student: Nino Gerber
- Supervisor: Emanuela Boros
- Semester Project in Data Science 2024 EPFL
Large Language Models (LLMs) often struggle to answer questions about historical facts accurately, a challenge compounded by the poor quality of historical documents caused by OCR errors. This project addresses this issue by creating an instruction-based dataset tailored to historical Swiss newspapers. To achieve this, we generate Question-Answer (QA) pairs using two small LLMs with 7B and 8B parameters and evaluate their validity through a semi-automatic process. Our approach leverages established metrics such as ROUGE, BLEU, and semantic similarity, and introduces a novel veridicity metric to quantify the quality of the generated dataset. We assess the performance of these metrics through manual annotation and compare various methods to identify the best-performing strategies. Our results show that the proposed metric reliably indicates whether a dataset is of higher or lower quality and correlates measurably with manual annotations. However, the metric is insufficient for classification tasks because the baseline accuracy of the generated QA pairs is already high. Using the best-performing methods, we generate a high-quality dataset and fine-tune an LLM, achieving a significant improvement on a benchmark for historical data: accuracy rises from 4.14% before fine-tuning to 40.98% after fine-tuning. These results are particularly notable given that our final dataset contains only 8,000 QA pairs spanning a short time period and focuses exclusively on Swiss newspapers, whereas the benchmark covers a broader time span and includes diverse, non-European information. This highlights the potential of our approach to generalize beyond the scope of the dataset and to significantly improve LLM performance on challenging historical data.
We created a custom dataset from subsets of two Impresso collections: NZZ-1945 (Neue Zürcher Zeitung) and GDL-1945 (Gazette de Lausanne). We combined this data subset with different prompting strategies, based either on the entity "Person" or on Open Question Answering (Open QA), and used Mistral-7B-Instruct-v0.3 or Llama-3.1-8B-Instruct as data generators to produce Question-Answer (QA) pairs.
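For illustration, here is a minimal sketch of how such QA-pair generation could look with the Hugging Face transformers chat pipeline. The actual prompts, decoding settings, and post-processing live in the scripts of this repository; the prompt wording and token budget below are assumptions.

```python
# Hedged sketch of QA-pair generation with one of the data-generator models.
# The prompt text and generation parameters are assumptions, not the project's
# exact configuration.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.3",
    device_map="auto",  # requires the accelerate package
)

def generate_qa(context: str) -> str:
    """Ask the model for one QA pair grounded in a newspaper passage."""
    messages = [{
        "role": "user",
        "content": (
            "Given the following historical newspaper excerpt, write one "
            "question and its answer based only on the text.\n\n" + context
        ),
    }]
    out = generator(messages, max_new_tokens=256, do_sample=False)
    # The pipeline returns the full chat; the last turn is the model's reply.
    return out[0]["generated_text"][-1]["content"]
```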
These QA pairs were evaluated with a veridicity metric based on ROUGE, BLEU, and semantic similarity scores. The metric enabled us to rank the prompting strategies, and we validated it through manual annotation. However, the veridicity scores could not efficiently classify individual QA pairs as valid or invalid, because the best prompting strategies already produced a high proportion of valid pairs.
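As a rough illustration of how the three signals could be combined into a single veridicity score, here is a hedged sketch. The equal-weight average, the choice of ROUGE-L, and the all-MiniLM-L6-v2 embedder are assumptions; the project's exact weighting and reference texts may differ.

```python
# Hedged sketch of a combined veridicity score; the exact combination used in
# the project is not reproduced here (equal weights are an assumption).
from rouge_score import rouge_scorer
from sacrebleu import sentence_bleu
from sentence_transformers import SentenceTransformer, util

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
_embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def veridicity(answer: str, context: str) -> float:
    """Score how well a generated answer is grounded in the source context."""
    rouge = _scorer.score(context, answer)["rougeL"].fmeasure
    bleu = sentence_bleu(answer, [context]).score / 100.0  # sacrebleu is 0-100
    emb = _embedder.encode([answer, context], convert_to_tensor=True)
    semantic = util.cos_sim(emb[0], emb[1]).item()
    return (rouge + bleu + semantic) / 3.0
```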
We then fine-tuned a LLaMA-3.2-8B model on the data produced by the best prompting strategy from our evaluation, Prompt 4 (zero-shot) with a 10-sentence context as model input. This drastically improved the model's performance on the ArchivalQA benchmark, which we translated into French. The data for the final dataset came from the Impresso collections GDL-1916 and JDG-1916 (Journal de Genève).
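A minimal sketch of what the fine-tuning step could look like with TRL's SFTTrainer and LoRA follows. The base checkpoint identifier, the qa_pairs.jsonl file name, and all hyperparameters below are assumptions, not the project's actual training recipe.

```python
# Hedged fine-tuning sketch using TRL + PEFT; model name, data file, and
# hyperparameters are assumptions for illustration only.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Assumed: each record has a "text" field with a formatted QA example.
dataset = load_dataset("json", data_files="qa_pairs.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed base checkpoint
    train_dataset=dataset,
    args=SFTConfig(output_dir="llama-historical", num_train_epochs=3),
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```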
Dependencies: python3.11; required libraries are listed in requirements.txt
Usage:
To generate a new dataset, run the following command with one or more GPUs:
CUDA_VISIBLE_DEVICES=0 python -m scripts.new_data.no_entities.use_text_generator_no_entity
historical-llm Copyright (c) 2024 EPFL
Licensed under the GNU Affero General Public License v3.0.
