- Student: Nino Gerber
- Supervisor: Emanuela Boros
- Semester Project in Data Science 2024 EPFL
Large Language Models (LLMs) often struggle to answer questions about historical facts accurately, a challenge compounded by the poor quality of historical documents caused by OCR errors. This project addresses this issue by creating an instruction-based dataset tailored to historical Swiss newspapers. To achieve this, we generate Question-Answer (QA) pairs using two small LLMs with 7B and 8B parameters and evaluate their validity through a semi-automatic process. Our approach leverages established metrics such as ROUGE, BLEU, and semantic similarity, and introduces a novel veridicity metric to quantify the quality of the generated dataset. We assess the performance of these metrics through manual annotation and compare various methods to identify the best-performing strategies. Our results show that the proposed metric reliably indicates whether a dataset is of higher or lower quality and correlates measurably with manual annotations. However, the metric is insufficient for classification tasks because the baseline accuracy of the generated QA pairs is already high. Using the best-performing methods, we generate a high-quality dataset and fine-tune an LLM, achieving a significant improvement on a benchmark for historical data: accuracy rises from 4.14% before fine-tuning to 40.98% after fine-tuning. These results are particularly notable given that our final dataset contains only 8,000 QA pairs spanning a short time period and focuses exclusively on Swiss newspapers, whereas the benchmark covers a broader time span and includes diverse, non-European information. This highlights the potential of our approach to generalize beyond the scope of the dataset and to significantly improve LLM performance on challenging historical data.
We created a custom dataset from subsets of two Impresso collections: NZZ-1945 (Neue Zürcher Zeitung) and GDL-1945 (Gazette de Lausanne). We combined this data subset with different prompting strategies, based either on the entity "Person" or on Open Question Answering (Open QA), and used Mistral-7B-Instruct-v0.3 or Llama-3.1-8B-Instruct as data generators to produce Question-Answer (QA) pairs.
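For illustration, here is a minimal sketch of how such QA-pair generation could look with the Hugging Face transformers chat pipeline. The actual prompts, decoding settings, and post-processing live in the scripts of this repository; the prompt wording and token budget below are assumptions.

```python
# Hedged sketch of QA-pair generation with one of the data-generator models.
# The prompt text and generation parameters are assumptions, not the project's
# exact configuration.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.3",
    device_map="auto",  # requires the accelerate package
)

def generate_qa(context: str) -> str:
    """Ask the model for one QA pair grounded in a newspaper passage."""
    messages = [{
        "role": "user",
        "content": (
            "Given the following historical newspaper excerpt, write one "
            "question and its answer based only on the text.\n\n" + context
        ),
    }]
    out = generator(messages, max_new_tokens=256, do_sample=False)
    # The pipeline returns the full chat; the last turn is the model's reply.
    return out[0]["generated_text"][-1]["content"]
```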
These QA pairs were evaluated with a veridicity metric based on ROUGE, BLEU, and semantic similarity scores. The metric enabled us to rank the prompting strategies, and we validated it through manual annotation. However, the veridicity scores could not efficiently classify individual QA pairs as valid or invalid, because the best prompting strategies already produced a high proportion of valid pairs.
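As a rough illustration of how the three signals could be combined into a single veridicity score, here is a hedged sketch. The equal-weight average, the choice of ROUGE-L, and the all-MiniLM-L6-v2 embedder are assumptions; the project's exact weighting and reference texts may differ.

```python
# Hedged sketch of a combined veridicity score; the exact combination used in
# the project is not reproduced here (equal weights are an assumption).
from rouge_score import rouge_scorer
from sacrebleu import sentence_bleu
from sentence_transformers import SentenceTransformer, util

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
_embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def veridicity(answer: str, context: str) -> float:
    """Score how well a generated answer is grounded in the source context."""
    rouge = _scorer.score(context, answer)["rougeL"].fmeasure
    bleu = sentence_bleu(answer, [context]).score / 100.0  # sacrebleu is 0-100
    emb = _embedder.encode([answer, context], convert_to_tensor=True)
    semantic = util.cos_sim(emb[0], emb[1]).item()
    return (rouge + bleu + semantic) / 3.0
```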
We then fine-tuned a LLaMA-3.2-8B model on the data produced by the best prompting strategy from our evaluation, Prompt 4 (zero-shot) with a 10-sentence context as model input. This drastically improved the model's performance on the ArchivalQA benchmark, which we translated into French. The data for the final dataset came from the Impresso collections GDL-1916 and JDG-1916 (Journal de Genève).
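A minimal sketch of what the fine-tuning step could look like with TRL's SFTTrainer and LoRA follows. The base checkpoint identifier, the qa_pairs.jsonl file name, and all hyperparameters below are assumptions, not the project's actual training recipe.

```python
# Hedged fine-tuning sketch using TRL + PEFT; model name, data file, and
# hyperparameters are assumptions for illustration only.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Assumed: each record has a "text" field with a formatted QA example.
dataset = load_dataset("json", data_files="qa_pairs.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed base checkpoint
    train_dataset=dataset,
    args=SFTConfig(output_dir="llama-historical", num_train_epochs=3),
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```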
Dependencies: python3.11; required libraries are listed in requirements.txt
Usage:
To generate a new dataset, run the following command with one or more GPUs:
CUDA_VISIBLE_DEVICES=0 python -m scripts.new_data.no_entities.use_text_generator_no_entity
historical-llm Copyright (c) 2024 EPFL
Licensed under the GNU Affero General Public License v3.0.
