- Updated README.md, CITATION.cff and docs with the published version (advance article) of the ComProScanner paper in Digital Discovery as fully open access.
- Guide for API key creation for various LLM providers and publisher APIs added to the documentation at `docs/getting-started/api-key-guide.md` with detailed instructions for each provider.
- Data related to comparison with other agentic data extraction frameworks added for the ComProScanner paper in the `examples/piezo_test/comparing_existing_frameworks` folder.
- New parameter `apply_advanced_cleaning` added to data cleaning methods in `data_cleaner.py`. When set to `True`, it triggers the advanced cleaning pipeline.
- Advanced composition cleaning methods in `data_cleaner.py`:
    - `_remove_miller_indices()` - Removes crystal plane notations from chemical formulas
    - `_remove_zero_coefficient_elements()` - Removes elements with zero coefficients
    - `_normalize_coefficients()` - Removes trailing zeros from coefficients
    - `_expand_leading_and_trailing_coefficients()` - Expands leading/trailing coefficient patterns
    - `_expand_parenthetical_coefficients()` - Expands nested bracket coefficients
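
As a rough illustration of two of the cleaning steps listed above (dropping zero-coefficient elements and stripping trailing zeros), the sketch below works on flat formulas only. The helper names `normalize_coefficient()` and `clean_formula()` are hypothetical and are not the implementations in `data_cleaner.py`, whose exact behavior may differ.

```python
import re

# Hypothetical helpers for illustration only; they are NOT the
# implementations shipped in data_cleaner.py.

# An element symbol followed by an optional numeric coefficient, e.g. "Ti0.50".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def normalize_coefficient(coeff: str) -> str:
    """Strip trailing zeros after the decimal point: '0.50' -> '0.5', '1.0' -> '1'."""
    if "." in coeff:
        coeff = coeff.rstrip("0").rstrip(".")
    return coeff


def clean_formula(formula: str) -> str:
    """Drop zero-coefficient elements and normalize the remaining coefficients."""
    parts = []
    for symbol, coeff in _TOKEN.findall(formula):
        if coeff and float(coeff) == 0:
            continue  # a zero coefficient means the element is absent
        parts.append(symbol + normalize_coefficient(coeff))
    return "".join(parts)


print(clean_formula("Pb1.0Zr0.50Ti0.50O3"))    # Pb1Zr0.5Ti0.5O3
print(clean_formula("Ba0.90Ca0.10Sr0.0TiO3"))  # Ba0.9Ca0.1TiO3
```

Brackets, Miller indices, and leading/trailing multipliers are deliberately left out of this sketch; in the package those cases are covered by the other methods listed above.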
- Enhanced documentation in `docs/usage/data-cleaning.md`:
    - Added `apply_advanced_cleaning` parameter documentation
    - Added Mermaid process flow diagram showing cleaning stages
    - Added advanced cleaning examples with tables for each transformation type
- Templates for GitHub issues added to .github/ISSUE_TEMPLATE for the following topics:
    - bug reports
    - feature requests
    - documentation improvements
    - support questions
- Changelog page added to the documentation. CHANGELOG.md is also linked in README.md.
- DeepWiki integration badge added to README.md for community Q&A support.
- arXiv preprint badge added to README.md.
- CITATION.cff added for standardized citation information based on the latest release and arXiv preprint.
- OAWorks API replaced with OpenAlex API, as OAWorks is no longer available.
- Empty/corrupted PDFs handled in `pdf_processor.py` and `wiley_processor.py` to avoid GLYPH errors during text extraction.
- Data extraction failures fixed when the composition-property text data is empty.
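
This changelog does not spell out how empty or corrupted PDFs are detected. One plausible pre-check, sketched here with a hypothetical helper `looks_like_pdf()` (an assumption, not code from `pdf_processor.py` or `wiley_processor.py`), is to reject files that are empty or lack the PDF header before attempting text extraction.

```python
import os
import tempfile


def looks_like_pdf(path: str) -> bool:
    """Hypothetical pre-check: a non-empty file that starts with the PDF marker."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    with open(path, "rb") as handle:
        # PDF files begin with a header of the form "%PDF-<version>".
        return handle.read(5) == b"%PDF-"


with tempfile.TemporaryDirectory() as tmp:
    empty = os.path.join(tmp, "empty.pdf")
    open(empty, "wb").close()  # zero-byte file

    fake = os.path.join(tmp, "fake.pdf")
    with open(fake, "wb") as handle:
        handle.write(b"<html>not a pdf</html>")

    minimal = os.path.join(tmp, "minimal.pdf")
    with open(minimal, "wb") as handle:
        handle.write(b"%PDF-1.7\n")

    results = [looks_like_pdf(p) for p in (empty, fake, minimal)]
    print(results)  # [False, False, True]
```

A header check like this only catches empty or obviously mislabeled files. A PDF with a valid header but damaged contents would still need error handling around the extraction call itself.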
- CSV progress tracking in `elsevier_processor.py`:
    - DtypeWarning resolved by adding `dtype=str, low_memory=False` to `pd.read_csv()`
    - Data loss issue fixed with immediate CSV persistence for processed articles
    - Sleep delays optimized for batch writes
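
The two CSV fixes above can be pictured with the following minimal sketch. The `pd.read_csv()` arguments are the ones named in this entry; the column names, file name, and the helpers `record_progress()` and `load_progress()` are hypothetical and do not reflect the actual layout used by `elsevier_processor.py`.

```python
import csv
import os
import tempfile

import pandas as pd

# Hypothetical column names, for illustration only.
FIELDS = ["doi", "status"]


def record_progress(csv_path: str, doi: str, status: str) -> None:
    """Append one processed article to the progress file right away."""
    write_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({"doi": doi, "status": status})


def load_progress(csv_path: str) -> pd.DataFrame:
    """Read the progress file with every column parsed as a string."""
    # dtype=str skips per-column type inference; low_memory=False parses the
    # file in one piece instead of in internal chunks.
    return pd.read_csv(csv_path, dtype=str, low_memory=False)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "progress.csv")
    record_progress(path, "10.1000/example.1", "processed")  # placeholder DOI
    record_progress(path, "10.1000/example.2", "failed")     # placeholder DOI
    progress = load_progress(path)
    print(progress)
```

Writing each row as soon as an article is processed means an interrupted run loses at most the article in flight, at the cost of more frequent file writes.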
- Type annotation warnings in documentation build (griffe/mkdocstrings):
    - Added return type annotations to function signatures in `comproscanner.py`
    - Added return type annotations to all visualization functions in `data_visualizer.py` and `eval_visualizer.py`
    - Fixed parameter type format in docstrings from colon to comma notation
    - Added `TYPE_CHECKING` conditional imports for matplotlib Figure type
    - Fixed `**kwargs` type annotations across multiple modules
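
For readers unfamiliar with the `TYPE_CHECKING` pattern mentioned above, a minimal sketch follows. The function `set_figure_title()` is a hypothetical example and does not exist in `data_visualizer.py` or `eval_visualizer.py`; it only shows how a return annotation, a `**kwargs` annotation, and a type-only import can fit together.

```python
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Evaluated only by type checkers and static documentation tools;
    # this line does not import matplotlib at runtime.
    from matplotlib.figure import Figure


def set_figure_title(fig: "Figure", title: str, **kwargs: Any) -> "Figure":
    """Set a figure-level title and return the same figure (hypothetical helper)."""
    # The quoted "Figure" annotations are never evaluated at runtime, so the
    # name does not need to exist outside the TYPE_CHECKING block.
    fig.suptitle(title, **kwargs)
    return fig
```

The annotation on `**kwargs` describes the type of each keyword value, so `**kwargs: Any` accepts arbitrary keyword arguments without declaring a dictionary type.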
- Numbered list formatting in `docs/about/contribution.md`:
    - Fixed list continuation by using 4-space indentation for code blocks and nested lists
    - Disabled format on save for Markdown files in `.vscode/settings.json`
- GitHub Actions CI disk space issue:
    - Added `--no-cache-dir` flag to pip install to reduce disk usage
- README badges section converted from HTML to markdown format for better compatibility across platforms.
- New function `clean_data()` added for improved data cleaning and preprocessing, as a separate step instead of being integrated into the data extraction function.
- New documentation page for Data Cleaning added:
    - docs/usage/data-cleaning.md
    - Added to mkdocs.yml navigation.
- New API overview documentation page added:
    - docs/api.md
    - Added to mkdocs.yml navigation.
- New mkdocstrings configuration added to mkdocs.yml for automatic API documentation generation.
- New tests added for remaining utils functions.
- Added pytest coverage tracking (50%) using `pytest-cov` and coverage report generation using codecov.
- Tests updated to reflect changes in data cleaning process.
- Arguments related to data cleaning removed from data extraction function.
- README images updated with raw GitHub links for better reliability.
- RecursiveCharacterTextSplitter import updated for the latest langchain version to avoid import errors:
    - Changed from `from langchain.text_splitter import RecursiveCharacterTextSplitter`
    - To `from langchain.text_splitter.recursive_character import RecursiveCharacterTextSplitter`
- Link to ComProScanner preprint on arXiv in the documentation index page and README.md.
- README images updated with external image link to fix PyPI rendering issue.
- Initial release of ComProScanner.