A comprehensive Python package for extracting composition-property data from scientific articles for building databases
ComProScanner is a multi-agent framework designed to extract composition-property relationships from scientific articles in materials science. It automates the entire workflow from metadata collection to data extraction, evaluation, and visualization.
Key Features:
- 📚 Multi-publisher support (Elsevier, Springer, Wiley, IOP, local PDFs)
- 🤖 Agentic extraction using CrewAI framework
- 🔍 RAG-powered context retrieval for cost-effective, accurate automation
- 📊 Comprehensive evaluation and visualization tools
- 🎯 Customizable extraction workflows
- 🌐 Knowledge graph generation
Install from PyPI:
```bash
pip install comproscanner
```
Or install from source:
```bash
git clone https://github.com/slimeslab/ComProScanner.git
cd ComProScanner
pip install -e .
```
Here's a complete example extracting piezoelectric coefficient data, using `d33` as the extraction keyword:
```python
from comproscanner import ComProScanner

# Initialize scanner
scanner = ComProScanner(main_property_keyword="piezoelectric")

# Collect metadata
scanner.collect_metadata(
    base_queries=["piezoelectric", "piezoelectricity"],
    extra_queries=["ceramics", "applications"]
)

# Process articles
property_keywords = {
    "exact_keywords": ["d33"],
    "substring_keywords": [" d 33 "]
}
scanner.process_articles(
    property_keywords=property_keywords,
    source_list=["elsevier", "springer"]
)

# Extract composition-property data
scanner.extract_composition_property_data(
    main_extraction_keyword="d33"
)
```
The ComProScanner workflow consists of four main stages:
- Metadata Retrieval - Find relevant scientific articles
- Article Collection - Extract full text from various publishers
- Information Extraction - Use LLM agents to extract structured data
- Post Processing & Dataset Creation - Evaluate, clean, and visualize results
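The first three stages map onto the three calls in the quick-start example, so they can be wrapped in one helper to keep the order explicit. The following is a minimal sketch: `run_quick_start` is a hypothetical name, not part of the package, and it only repeats the method names and arguments shown in the example above. It accepts any object exposing those methods, so it runs without the package installed. The fourth stage (post-processing) is not covered here.

```python
from typing import Any


def run_quick_start(scanner: Any) -> None:
    """Run the three documented quick-start calls in workflow order.

    `scanner` is assumed to expose the methods used in the README example.
    """
    # Stage 1: metadata retrieval
    scanner.collect_metadata(
        base_queries=["piezoelectric", "piezoelectricity"],
        extra_queries=["ceramics", "applications"],
    )

    # Stage 2: article collection, filtered by property keywords
    property_keywords = {
        "exact_keywords": ["d33"],
        "substring_keywords": [" d 33 "],
    }
    scanner.process_articles(
        property_keywords=property_keywords,
        source_list=["elsevier", "springer"],
    )

    # Stage 3: information extraction
    scanner.extract_composition_property_data(main_extraction_keyword="d33")
```

With the package installed and API keys configured, passing `ComProScanner(main_property_keyword="piezoelectric")` to this helper should be equivalent to running the quick-start example step by step.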
📖 Full documentation is available at slimeslab.github.io/ComProScanner
Supported sources:
- Elsevier (via TDM API)
- Springer Nature (via TDM API)
- Wiley (via TDM API)
- IOP Publishing (via SFTP bulk access)
- Local PDFs (any publication)

Extracted information:
- Composition-property relationships
- Material families
- Synthesis methods and precursors
- Characterization techniques
- Synthesis steps

Evaluation methods:
- Semantic Evaluation - Using semantic similarity measures
- Agentic Evaluation - LLM-powered contextual analysis

Visualization:
- Data Visualization
- Evaluation Visualization
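Semantic evaluation scores an extracted value against a reference by meaning rather than by exact string match. This README does not say which similarity measure or embedding model ComProScanner uses, so the following is only a minimal sketch assuming cosine similarity over embedding vectors. The function name `cosine_similarity` and the toy vectors are illustrative and not part of the package.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 if either vector has zero length, since the angle is undefined.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy stand-ins for embeddings of an extracted value and a reference value.
# In practice these would come from an embedding model (an assumption here).
extracted_vec = [0.9, 0.1, 0.3]
reference_vec = [0.8, 0.2, 0.4]
score = cosine_similarity(extracted_vec, reference_vec)
```

A score near 1 suggests the two texts are close in meaning. Any threshold for counting a match is a design choice and is not specified here.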
Process articles from multiple publishers:
```python
scanner.process_articles(
    property_keywords=property_keywords,
    source_list=["elsevier", "springer", "wiley"]
)
```
Configure RAG for extraction:
```python
scanner.extract_composition_property_data(
    main_extraction_keyword="d33",
    rag_chat_model="gemini-2.5-pro",
    rag_max_tokens=2048,
    rag_top_k=5
)
```
Visualize data and evaluation results:
```python
from comproscanner import data_visualizer, eval_visualizer

# Create knowledge graph
data_visualizer.create_knowledge_graph(result_file="results.json")

# Plot evaluation metrics
eval_visualizer.plot_multiple_radar_charts(
    result_sources=["model1.json", "model2.json"],
    model_names=["GPT-4o", "Claude-3.5"]
)
```
Requirements:
- Python 3.12 or 3.13
- TDM API keys for desired publishers (Elsevier, Springer, Wiley)
- LLM API keys (OpenAI, Anthropic, Google, etc.)
- Optional: Neo4j for knowledge graph visualization
If you use ComProScanner in your research, please cite:
```bibtex
@Article{roy2026comproscannermultiagentbasedframework,
  author    = "Roy, Aritra and Grisan, Enrico and Buckeridge, John and Gattinoni, Chiara",
  title     = "ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature",
  journal   = "Digital Discovery",
  year      = "2026",
  pages     = "Accepted",
  publisher = "RSC",
  doi       = "10.1039/D5DD00521C",
  url       = "https://doi.org/10.1039/D5DD00521C"
}
```
See the CHANGELOG for details on what has changed in each version.
We welcome contributions! Please see our Contributing Guidelines for details.
This project is licensed under the MIT License - see the LICENSE file for details.
Copyright © 2025 SLIMES Lab
Author: Aritra Roy
- 🌐 Website: aritraroy.live
- 🐙 GitHub: @aritraroy24
- 𝕏 Twitter: @aritraroy24
Project Links:
- 📦 PyPI: pypi.org/project/comproscanner
- 📖 Documentation: slimeslab.github.io/ComProScanner
- 🐛 Issues: github.com/slimeslab/ComProScanner/issues
Made with ❤️ by SLIMES Lab

