AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Fix broken legacy build backend so package installs via uv
- Impact: build-backend changed to setuptools.build_meta; setuptools-scm added to build-system.requires
- Verified via: uv pip install -e .
…ints
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Remove dynamic class creation at runtime; modernise with type hints and docstrings
- Impact: _SimpleNamespace replaces anonymous Entity subclasses for nested data dicts; no behavioural change for flat data keys; full type annotations on Entity, SimpleNER, find_all
- Verified via: uv run pytest test/test_core.py -v (35 passed)
…ERWrapper
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Python 3.10+ type annotations and docstrings for all public classes in rules/annotators
- Impact: RegexNER._create_regex now returns None on re.error and callers skip gracefully; no behavioural changes otherwise
- Verified via: uv run pytest test/test_core.py -v (35 passed)
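The "return None on re.error, callers skip" behaviour described above can be sketched as follows. This is a minimal illustration only: `create_regex` and `extract` are hypothetical stand-ins, not the actual `RegexNER` methods, whose signatures are not shown in this log.

```python
from __future__ import annotations

import re


def create_regex(pattern: str) -> re.Pattern[str] | None:
    """Compile a pattern, returning None instead of raising on bad syntax."""
    try:
        return re.compile(pattern)
    except re.error:
        return None


def extract(patterns: list[str], text: str) -> list[str]:
    """Collect matches from every valid pattern; invalid ones are skipped."""
    matches: list[str] = []
    for raw in patterns:
        rx = create_regex(raw)
        if rx is None:
            # Hypothetical caller-side guard: ignore the broken rule.
            continue
        matches.extend(m.group(0) for m in rx.finditer(text))
    return matches
```

With this shape, one malformed rule no longer aborts extraction for the remaining rules.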
…R, NERWrapper
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Provide core test coverage with zero optional dependencies
- Impact: 35 tests cover construction, span lookup, rule extraction, regex extraction, wrapper aggregation, as_json output, and edge cases (invalid regex, partial-word boundary)
- Verified via: uv run pytest test/test_core.py -v (35 passed)
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Document architecture, annotator table, install/usage, and log all Phase 1 changes
- Impact: docs/index.md created with source citations; MAINTENANCE_REPORT.md records all AI actions and test results
- Verified via: file review
- HashtagAnnotator: re.UNICODE flag; \w pattern matches any script (Arabic, Japanese, Chinese, Cyrillic, etc.)
- BaseAnnotator.__init__: added lang="en-us" param; subclasses propagate via super().__init__(lang=lang) instead of self.lang
- SimpleNERIntentTransformer: resolves lang from OVOS session (intent.updated_session → SessionManager → config fallback); _get_pipeline(lang) rebuilds only on language change
- AUDIT.md: documented TECH-009 CurrencyAnnotator char-class bug
- docs/FAQ.md: expanded language support to per-annotator table
- tests: 20 new multilingual/Unicode tests; 203 passing
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
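As a rough illustration of the HashtagAnnotator point, the sketch below assumes a pattern of the form `#(\w+)`; the annotator's real pattern is not shown in this log. In Python 3, `\w` is already Unicode-aware for `str` patterns, so `re.UNICODE` here is explicit documentation rather than a behaviour switch.

```python
import re

# Assumed pattern shape; the real HashtagAnnotator regex may differ.
HASHTAG_RX = re.compile(r"#(\w+)", re.UNICODE)


def find_hashtags(text: str) -> list[str]:
    """Return hashtag bodies (without '#') in any script covered by \\w."""
    return HASHTAG_RX.findall(text)
```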
- numbers_ner/temporal_ner: add compat shim for ovos-number-parser rename convert_words_to_numbers → numbers_to_digits; detect at import time via inspect.signature; map short_scale to Scale enum
- temporal_ner: update all ovos-date-parser calls to new positional lang signature (extract_datetime/duration/nice_date/nice_duration)
- currency_ner: fix TECH-009: R$/A$/C$ multi-char symbols now use regex alternation instead of character class; _parse_currency sorts symbols longest-first; pattern built by _build_pattern() classmethod
- 3 previously skipped temporal tests now pass (206 total, 0 skipped)
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
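The TECH-009 fix can be illustrated with a small sketch. The symbol list, the amount sub-pattern and the function names below are assumptions made for illustration; only the idea (alternation over escaped symbols, sorted longest-first, instead of a character class) comes from the commit message.

```python
import re

# Illustrative symbol set; the real annotator's list is not shown here.
SYMBOLS = ["$", "R$", "A$", "C$", "€"]

# A character class treats "R$" as two independent single characters.
BROKEN = re.compile(r"(?P<symbol>[R$A$C$€])\s?(?P<amount>\d+(?:[.,]\d+)?)")


def build_pattern(symbols: list[str]) -> re.Pattern:
    """Build an alternation of escaped symbols, longest symbols first."""
    ordered = sorted(symbols, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    return re.compile(rf"(?P<symbol>{alternation})\s?(?P<amount>\d+(?:[.,]\d+)?)")


def parse_currency(text: str) -> list[tuple[str, str]]:
    """Return (symbol, amount) pairs found in the text."""
    return [(m["symbol"], m["amount"]) for m in build_pattern(SYMBOLS).finditer(text)]
```

With the character class, `BROKEN.search("R$50")` reports the symbol as `$`; the alternation keeps `R$` whole. Sorting longest-first is a defensive habit that keeps a shorter symbol from shadowing a longer one that begins with it.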
- url_ner: extend URL_PATTERN to Unicode label chars (Latin Extended, Cyrillic, CJK, Hiragana, Katakana) + re.UNICODE flag; IDN domains like https://münchen.de now detected
- names_ner: add _STOPWORDS frozenset (~50 entries) filtering common capitalised English non-names (The, Store, Monday, January etc.) to cut false positives at sentence boundaries
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Replace hardcoded ~50-word set with _load_stopwords_iso() reading stopwords-iso.json directly (bypasses pkg_resources bug on Py3.13). Loads 2590 EN stopwords (lower + Title case) at class definition; graceful fallback to minimal hardcoded set if package unavailable. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
… guards
- lookup_ner: use ahocorasick-ner as O(N) backend; regex fallback if absent;
rebuild automaton on add/remove_wordlist; start/end positions in entity data
- temporal_ner: load temporal keywords from res/<lang>/temporal_keywords.txt
instead of hardcoded set; false-positive guard skips diff spans with no
temporal keyword or ordinal (e.g. currency amounts parsed as clock times)
- numbers_ner: skip diffs where replacement is not a pure number (fixes
spurious written_number matches on emails/phone after number normalisation)
- res/en-us/temporal_keywords.txt: 42 English temporal keywords
- res/{de-de,es-es,fr-fr}/temporal_keywords.txt: German, Spanish, French
- pyproject.toml: add ahocorasick-ner>=0.1.1 to dependencies
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…t, __all__
- locations_ner: replace O(N_words×N_cities) word-loop with AhocorasickNER automaton; multi-word names (New York, United States, Los Angeles) now detected; _legacy_extract removed
- lookup_ner: drop try/except import and regex fallback; ahocorasick-ner is now a hard dependency; annotate() simplified
- phone_ner: add _EXT suffix pattern for x123 / ext. 456 extensions
- __init__.py: add __all__ = ["Entity", "SimpleNER"]
- SUGGESTIONS.md: 10 tracked improvement proposals
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Replaces exact-span grouping in _deduplicate with a greedy longest-span-wins
algorithm that handles overlaps across different annotators. Ties resolved by
confidence then annotator order. Entities without span info are passed through
unchanged. Adds _resolve_span() helper that prefers data["start"]/data["end"]
over entity.spans for speed.
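A minimal sketch of a greedy longest-span-wins pass, under stated assumptions: `Candidate`, `overlaps` and `deduplicate` are hypothetical names, and the real `_deduplicate` and `_resolve_span()` operate on the library's own Entity objects, which are not reproduced here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for an entity with optional span info."""
    label: str
    value: str
    start: int | None
    end: int | None
    confidence: float = 1.0
    order: int = 0  # position of the producing annotator in the pipeline


def overlaps(a: Candidate, b: Candidate) -> bool:
    """True when two half-open [start, end) spans share a character."""
    return a.start < b.end and b.start < a.end


def deduplicate(candidates: list[Candidate]) -> list[Candidate]:
    spanless = [c for c in candidates if c.start is None or c.end is None]
    spanned = [c for c in candidates if c.start is not None and c.end is not None]
    # Longest span first; ties broken by confidence, then annotator order.
    ranked = sorted(spanned, key=lambda c: (-(c.end - c.start), -c.confidence, c.order))
    kept: list[Candidate] = []
    for cand in ranked:
        if not any(overlaps(cand, k) for k in kept):
            kept.append(cand)
    kept.sort(key=lambda c: c.start)
    # Entities without span info pass through unchanged.
    return kept + spanless
```

The pass is quadratic in the number of kept entities, which is usually acceptable for sentence-length inputs; an interval tree would be the natural next step for long documents.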
Also adds phone.rx patterns for space-separated international numbers (S-003):
- +44 20 7946 0958 and +33 1 23 45 67 89 now matched
- Pattern 3: \+\d{1,3}(?:[\s-]\d{1,4}){2,6}
- Pattern 4: \+\d{1,3}\s\d{2,5}\s\d{3,4}\s\d{4}
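The two patterns quoted above can be exercised directly with Python's `re` module. The constant names and the `first_phone` helper are illustrative only; how phone.rx is loaded and applied by PhoneAnnotator is not shown in this log.

```python
from __future__ import annotations

import re

# Patterns copied verbatim from the commit message.
PATTERN_3 = re.compile(r"\+\d{1,3}(?:[\s-]\d{1,4}){2,6}")
PATTERN_4 = re.compile(r"\+\d{1,3}\s\d{2,5}\s\d{3,4}\s\d{4}")


def first_phone(text: str) -> str | None:
    """Return the first match from either pattern, or None."""
    for rx in (PATTERN_3, PATTERN_4):
        m = rx.search(text)
        if m:
            return m.group(0)
    return None
```

Note that Pattern 4 on its own does not cover the French example, because its second group needs at least two digits; Pattern 3 is the one that matches `+33 1 23 45 67 89`.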
17 new tests in test/test_pipeline_overlap_dedup.py; 323/323 passing.
AI-Generated Change:
- Model: Claude Sonnet 4.6
- Intent: Fix cross-annotator span collision (S-006) and missing intl phone formats (S-003)
- Impact: _deduplicate now resolves overlaps correctly; PhoneAnnotator matches EU/UK space formats
- Verified via: uv run pytest test/ -q (323 passed)
- Add simple_NER/utils/locale.py: load_rx(), load_intents(), load_wordlist()
- Add locale/en-us/ and locale/de-de/ with .rx, .intent, .txt files
- Wire PhoneAnnotator, CurrencyAnnotator, OrganizationAnnotator, DateAnnotator to locale
- S-002: longest-match-wins dedup in LocationNER (York vs New York)
- S-003: space-separated international phone formats (+44 20 7946 0958)
- S-004: temporal-keyword guard applied to duration extraction
- S-005: per-label confidence in LookUpNER and LocationNER
- S-006: cross-annotator span-overlap dedup in NERPipeline
- S-010: EU decimal notation in CurrencyAnnotator (1.000,50)
- 135 new tests: 206 → 341 passing; numbers_ner 28→84%, lookup 72→91%
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Use difflib.SequenceMatcher to map converted digit spans back to character positions in the original text. NumberNER entities now participate in pipeline cross-annotator overlap dedup (TECH-011). Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
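One way to picture the span mapping is the token-level sketch below. It assumes whitespace tokenisation and handles only `replace` opcodes; the helper names are hypothetical and the actual NumberNER code may align at a different granularity.

```python
from difflib import SequenceMatcher


def token_offsets(text: str) -> list[tuple[int, int]]:
    """Character (start, end) offsets of each whitespace-separated token."""
    offsets = []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        offsets.append((start, end))
        pos = end
    return offsets


def changed_spans(original: str, normalised: str) -> list[tuple[int, int, str, str]]:
    """Map replaced token runs back to character spans in the original text."""
    orig_tokens = original.split()
    norm_tokens = normalised.split()
    offsets = token_offsets(original)
    matcher = SequenceMatcher(a=orig_tokens, b=norm_tokens, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue
        start = offsets[i1][0]
        end = offsets[i2 - 1][1]
        spans.append((start, end, original[start:end], " ".join(norm_tokens[j1:j2])))
    return spans
```

Once every number entity carries a start and end in the original text, it can take part in the same overlap resolution as any other annotator's output.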
- date_ner.py: 73% → 92%; all format branches, full _is_valid_date validation (leap years, month/day bounds, century rules)
- hashtag_ner.py: 79% → 96%; edge cases and classification branches
- pipeline.py: 79% → 92%; Span helpers, select_entity, async dedup
- utils/locale.py: 76% → 99%; malformed regex skip, missing file paths
52 new tests; total 350 → 402 passing.
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- README.md rewritten: install, quick start, annotator table, dedup strategies, locale/i18n, async, OVOS plugin
- docs/index.md: full API reference with constructor params, data fields per annotator, locale system, BaseAnnotator extension guide
- docs/TUTORIALS.md: 8 end-to-end tutorials with sample output
- examples/01-12: every annotator, dedup strategies, custom keywords, LocationNER label_confidence, TemporalNER anchor_date, multilang currency, async batch, custom annotator subclass, OVOS plugin, LookUpNER runtime wordlists, locale utilities direct usage
- examples/README.md: index with run commands
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
temporal_keywords.txt:
- Add it-it (43 kw), nl-nl (42 kw), pt-pt (43 kw)
locale files per language (es-es, fr-fr, it-it, nl-nl, pt-pt):
- currency.intent: native written currency templates
- currency.rx: EU decimal format (dot-thousands, comma-decimal)
- organization.rx: country-specific legal suffixes + university patterns
- phone.rx: country-specific phone formats + country code variants
locale/de-de:
- phone.rx: +49 and 0xxx German formats
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- Rename locale/res dirs to bare language codes (de-de → de, en-us → en)
- locale loader normalises lang via .split('-')[0].lower() — 'de-DE',
'de-de', 'de' all resolve to locale/de/
- Add 16 new languages: da, el, eu, fa, gl, hu, lt, pl, ro, ru, sv,
tr, uk, an, ast, mwl
- Each language gets: date_months.txt, currency.intent, currency.rx,
organization.rx, phone.rx, temporal_keywords.txt
- Custom currency.rx for fa (ریال/تومان), ru (₽), uk (₴), tr (₺)
- Merge regional variants (es-419, nl-be, pt-br, pt-ao, sv-fi) into
primary bare-code dirs
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
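The language-code normalisation quoted above is small enough to show in full. The expression `.split('-')[0].lower()` comes from the commit message; the function names and the default `locale` root below are assumptions for illustration, not the loader's real API.

```python
from pathlib import Path


def normalise_lang(lang: str) -> str:
    """Reduce a BCP-47-style tag such as 'de-DE' to its bare language code."""
    return lang.split("-")[0].lower()


def locale_dir(lang: str, root: Path = Path("locale")) -> Path:
    """Resolve the directory holding resources for a language (hypothetical layout)."""
    return root / normalise_lang(lang)
```

Because the region is discarded, regional tags such as `pt-br` land in the same directory as `pt`, which matches the merge of regional variants described in this commit.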
- Add simple_NER/version.py with OVOS version block (v0.9.0 stable)
- Wire pyproject.toml dynamic version from simple_NER.version.__version__
- Add standard workflows: release_workflow, publish_stable, build-tests, lint, coverage, release-preview, repo-health, license_check, pip_audit, opm-check, conventional-label
- Remove legacy build_tests.yml and license_tests.yml
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…ists for 24 languages
AI-Generated Change:
- Model: Claude Sonnet 4.6
- Intent: Close resolved AUDIT.md issue and expand LookUpNER coverage across all supported locales
- Impact:
  - AUDIT.md: TECH-011 moved from Open Issues to Resolved (commit a5bac24, 2026-03-31)
  - Added color/emotion/weather/animal.entity for de, es, fr, it, nl, pt, ca, cs, da, el, eu, fa, gl, hu, lt, mwl, pl, ro, ru, sv, tr, uk, an, ast (96 new files; total entity count: 107)
  - Non-Latin scripts (el, fa, ru, uk) use native script throughout
  - Minority languages (an, ast, mwl, eu, gl, ca) use accurate regional vocabulary
- Verified via: uv run pytest test/ -q → 402 passed
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
* feat(integration): ahocorasick-ner dataset wrapper + examples + tests
- Add AhocorasickAnnotatorWrapper: adapts any AhocorasickNER instance as a BaseAnnotator for use in NERPipeline
- Add test_ahocorasick_wrapper.py and test_integration_hf.py (10 tests)
- Add examples: huggingface_datasets, wikidata_subclasses, comprehensive_datasets
- Add docs/DATASET_INTEGRATION.md with full HF dataset guide
- Fix duplicate dependency in pyproject.toml
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
* fix(names_ner): sentence-boundary heuristic reduces false positives (S-001)
Single capitalised words at position 0 or after sentence-ending punctuation (.!?) now score 0.55, below the default threshold of 0.65, so common false positives like "Send" or "Meeting" at sentence starts are suppressed. Multi-word compound names (e.g. "John Doe") retain confidence 0.85 regardless of sentence position. Mid-sentence single words score 0.80 as before. Adds sentence_initial flag to entity data. 10 new tests covering all cases.
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
* test: cover cli, batch, diff, version; fix TECH-004 typing
- utils/diff.py: replace typing.Tuple with built-in tuple (closes TECH-004)
- test_diff.py: 7 tests, diff.py now 100% covered
- test_batch.py: 10 tests, batch.py 0% → 83%
- test_cli.py: 22 tests, cli.py 0% → 93%, version.py 0% → 100%
- Overall coverage: 79% → 89%
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
* docs: update FAQ and add AI_TRANSPARENCY_LOG
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
* refactor(lookup_ner): use public AhocorasickNER name, drop private alias
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Remove misleading _AhocorasickNER private alias; AhocorasickNER is a public API
- Impact: Import renamed; type annotation on _ac field updated from string literal to direct type
- Verified via: uv run pytest test/ (462 passed)
* refactor(locations_ner): use public AhocorasickNER name, drop private alias
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Consistent with lookup_ner.py rename; AhocorasickNER is a public API
- Impact: Import alias removed; _build_automaton instantiation updated
- Verified via: uv run pytest test/ (462 passed)
* feat(ahocorasick_wrapper): expose min_word_len param, forward to tag()
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Allow callers to tune minimum match length (default 5 mirrors AhocorasickNER.tag default)
- Impact: New keyword-only param min_word_len on __init__; mock updated to accept the param
- Verified via: uv run pytest test/ (462 passed)
* feat(lookup_ner): add add_word() for single-word runtime registration
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Allow per-word additions without replacing the whole wordlist (gap vs add_wordlist)
- Impact: New public method add_word(label, word); rebuilds automaton immediately
- Verified via: uv run pytest test/ (462 passed)
* docs(ahocorasick_wrapper): document min_word_len param and dataset loader examples
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Make dataset loader compatibility discoverable; document new min_word_len param
- Impact: Class docstring rewritten with two usage examples (custom vocab + dataset loader)
- Verified via: uv run pytest test/ (462 passed)
* docs(index): add AhocorasickAnnotatorWrapper section with min_word_len and dataset loaders
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Make wrapper discoverable in docs; document min_word_len param and dataset loader compatibility
- Impact: New section in docs/index.md with constructor param table and two usage examples
- Verified via: uv run pytest test/ (462 passed)
* test: cover min_word_len forwarding, LookUpNER.add_word, no private alias
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Verify all acceptance criteria from spec.md with dedicated tests
- Impact: +6 tests (2 min_word_len, 1 symbol check, 3 add_word); 468 total
- Verified via: uv run pytest test/ (468 passed)
* chore: mark all status items complete; update AI_TRANSPARENCY_LOG
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Close sprint record; all 8 checklist items done
- Impact: status.md fully checked; AI_TRANSPARENCY_LOG updated with sprint 2 summary
- Verified via: uv run pytest test/ (468 passed, coverage 89%)
* fix: remove broken alias test; fix two stale docstrings
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Address audit findings: false-pass test deleted, stale docstrings corrected
- Impact: test_no_private_alias_in_source removed (python -m grep is not real); lookup_ner.py fallback prose removed (ahocorasick-ner is a hard dep); ahocorasick_wrapper.py Args example class names corrected to actual dataset classes
- Verified via: uv run pytest test/ (467 passed)
* refactor(locations_ner): align _ac type annotation with lookup_ner (AhocorasickNER | None)
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Consistency: lookup_ner uses AhocorasickNER | None, locations_ner used Any
- Impact: _ac field annotation tightened; Any import retained for other fields
- Verified via: uv run pytest test/ (467 passed)
* docs(lookup_ner): warn about O(n²) rebuilds when using add_word in a loop
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Guide callers to add_wordlist() for bulk additions
- Impact: add_word() docstring gains a one-line rebuild-cost note
- Verified via: uv run pytest test/ (467 passed)
* test(factory): parametrized smoke tests for all 27 factory keys + edge cases
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Close 81% coverage gap on factory.py instantiation branches
- Impact: 34 new tests covering every registered key, unknown-key error, custom registration, create_pipeline happy/skip/all-unknown paths
- Verified via: uv run pytest test/unittests/test_factory.py (34 passed)
* test(temporal_ner): cover import-error fallback (lines 21-44)
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Exercise the _OVOS_AVAILABLE=False path that was previously untested
- Impact: 3 new tests: deps present, deps absent (sentinel check), annotate no-op when unavailable
- Verified via: uv run pytest test/unittests/test_temporal_import_fallback.py (3 passed)
* test: cover locations error paths, opm session branch, temporal fallback, factory keys
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Close remaining coverage gaps from audit: locations_ner 131-145, opm 127-132, temporal_ner 30-44, factory instantiation
- Impact: +47 tests; coverage 89%→90%; test/test_factory.py and test/test_opm.py removed (duplicated by test/unittests/)
- Verified via: uv run pytest test/ (481 passed, coverage 90%)
* docs: comprehensive onboarding and documentation reorganization
- Add docs/GETTING_STARTED.md: 10-section getting started guide for new users with quick install, first pipeline, entity types, async processing, multi-language support, and troubleshooting
- Add docs/README.md: navigation hub with decision tree routing (new user vs developer vs advanced) + annotator selection table
- Update docs/index.md: add quick-navigation section at top linking to getting started, API, FAQ, tutorials, examples
- Update readme.md: replace sparse links with organized learning paths (Getting Started → FAQ → API Reference → Tutorials → Examples)
- Update AUDIT.md: consolidate 5 open issues from recent sprint (false-pass test, O(n²) caveat, type annotation inconsistency, stale docstrings)
- Update SUGGESTIONS.md: add 4 pending fixes (S-011 to S-014) tied to AUDIT issues with effort estimates
- All 481 tests pass, coverage at 90%
Addresses user request to "get the whole repo in shape, ready for a total noob to use out of the box."
Co-Authored-By: Claude Haiku 4.5 <[email protected]>
* audit: consolidate all fixes into AUDIT.md, delete SUGGESTIONS.md
- Remove TECH-012 (invalid per user; S-011 superseded)
- Mark TECH-013 through TECH-016 as resolved (all fixes implemented):
  - TECH-013: O(n²) caveat documented in LookUpNER.add_word() docstring
  - TECH-014: LocationNER._ac type aligned to AhocorasickNER | None
  - TECH-015: LookUpNER stale fallback reference removed
  - TECH-016: AhocorasickAnnotatorWrapper docstring corrected with actual dataset classes
- Delete SUGGESTIONS.md (all pending fixes completed; AUDIT.md is authoritative)
All 481 tests pass. Code is production-ready.
Co-Authored-By: Claude Haiku 4.5 <[email protected]>
* test(opm): skip OPM tests when ovos_plugin_manager not installed
The OPM tests require ovos_plugin_manager, which is not part of the 'test' extras (since it's only needed for the OVOS plugin integration). Skip the tests gracefully in CI where the module is unavailable, matching the pattern used for optional deps. Fixes CI failures in build-tests, coverage, and opm-check workflows.
Co-Authored-By: Claude Haiku 4.5 <[email protected]>
* build: add ovos-plugin-manager to test extras; remove skip logic
Add ovos-plugin-manager (>=1.0.0) to both 'test' and 'dev' extras so it's always available during testing. This ensures proper coverage of the OVOS plugin integration (SimpleNERIntentTransformer) instead of skipping tests. Remove the conditional import check and @pytest.mark.skipif from test_opm.py since the module is now guaranteed to be present. All 481 tests pass.
Co-Authored-By: Claude Haiku 4.5 <[email protected]>
---------
Co-authored-by: Claude Sonnet 4.6 <[email protected]>
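The sentence-boundary heuristic described for names_ner (S-001) can be sketched roughly as below. The scores 0.55, 0.80, 0.85 and the 0.65 threshold come from the commit message; the capitalised-word regex and all function names are assumptions, and the real annotator also applies stopword filtering that is omitted here.

```python
import re

THRESHOLD = 0.65  # default threshold quoted in the commit message

# Assumed candidate pattern: one or more capitalised words in a row.
NAME_RX = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")


def is_sentence_initial(text: str, start: int) -> bool:
    """True at position 0 or directly after sentence-ending punctuation."""
    before = text[:start].rstrip()
    return not before or before[-1] in ".!?"


def score(text: str, match: re.Match) -> float:
    if " " in match.group(0):
        return 0.85  # multi-word compound name, regardless of position
    return 0.55 if is_sentence_initial(text, match.start()) else 0.80


def candidate_names(text: str, threshold: float = THRESHOLD) -> list[str]:
    """Keep only candidates whose score reaches the threshold."""
    return [m.group(0) for m in NAME_RX.finditer(text) if score(text, m) >= threshold]
```

A caller that wants sentence-initial single words back can lower the threshold below 0.55, at the cost of readmitting words like "Send".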
Human review requested!