Release 0.9.1a2 #10

Open
github-actions[bot] wants to merge 48 commits into master from release-0.9.1a2

github-actions bot commented Apr 1, 2026

Human review requested!

JarbasAl and others added 30 commits April 12, 2020 19:46
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Fix broken legacy build backend so package installs via uv
- Impact: build-backend changed to setuptools.build_meta; setuptools-scm added to build-system.requires
- Verified via: uv pip install -e .
…ints

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Remove dynamic class creation at runtime; modernise with type hints and docstrings
- Impact: _SimpleNamespace replaces anonymous Entity subclasses for nested data dicts; no behavioural change for flat data keys; full type annotations on Entity, SimpleNER, find_all
- Verified via: uv run pytest test/test_core.py -v (35 passed)
…ERWrapper

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Python 3.10+ type annotations and docstrings for all public classes in rules/annotators
- Impact: RegexNER._create_regex now returns None on re.error and callers skip gracefully; no behavioural changes otherwise
- Verified via: uv run pytest test/test_core.py -v (35 passed)
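
The "return None on re.error, callers skip" behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `compile_or_none` and `extract` are hypothetical names, and the `re.IGNORECASE` flag is an assumption.

```python
from __future__ import annotations

import re


def compile_or_none(pattern: str) -> re.Pattern[str] | None:
    """Return a compiled pattern, or None when the pattern is invalid."""
    try:
        return re.compile(pattern, re.IGNORECASE)
    except re.error:
        return None


def extract(text: str, patterns: dict[str, str]) -> list[tuple[str, str]]:
    """Collect (label, matched text) pairs, skipping patterns that fail to compile."""
    found: list[tuple[str, str]] = []
    for label, raw in patterns.items():
        compiled = compile_or_none(raw)
        if compiled is None:
            continue  # invalid regex: skip it instead of raising
        found.extend((label, m.group(0)) for m in compiled.finditer(text))
    return found
```

With this shape, one malformed rule no longer aborts extraction for the remaining rules.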
…R, NERWrapper

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Provide core test coverage with zero optional dependencies
- Impact: 35 tests cover construction, span lookup, rule extraction, regex extraction, wrapper aggregation, as_json output, and edge cases (invalid regex, partial-word boundary)
- Verified via: uv run pytest test/test_core.py -v (35 passed)
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Document architecture, annotator table, install/usage, and log all Phase 1 changes
- Impact: docs/index.md created with source citations; MAINTENANCE_REPORT.md records all AI actions and test results
- Verified via: file review
- HashtagAnnotator: re.UNICODE flag; \w pattern matches any script
  (Arabic, Japanese, Chinese, Cyrillic, etc.)
- BaseAnnotator.__init__: added lang="en-us" param; subclasses
  propagate via super().__init__(lang=lang) instead of self.lang
- SimpleNERIntentTransformer: resolves lang from OVOS session
  (intent.updated_session → SessionManager → config fallback);
  _get_pipeline(lang) rebuilds only on language change
- AUDIT.md: documented TECH-009 CurrencyAnnotator char-class bug
- docs/FAQ.md: expanded language support to per-annotator table
- tests: 20 new multilingual/Unicode tests; 203 passing

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
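
A minimal sketch of the Unicode hashtag matching mentioned in the first bullet above. The pattern and the `find_hashtags` helper are assumptions made for illustration; the actual HashtagAnnotator pattern is not shown in this PR text. In Python 3, `\w` is already Unicode-aware for `str` patterns, so the `re.UNICODE` flag only makes that explicit.

```python
import re

# Assumed pattern: a '#' followed by one or more word characters in any script.
HASHTAG_RE = re.compile(r"#(\w+)", re.UNICODE)


def find_hashtags(text: str) -> list[str]:
    """Return hashtag bodies (without the leading '#') in order of appearance."""
    return HASHTAG_RE.findall(text)
```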
- numbers_ner/temporal_ner: add compat shim for ovos-number-parser
  rename convert_words_to_numbers → numbers_to_digits; detect at
  import time via inspect.signature; map short_scale to Scale enum
- temporal_ner: update all ovos-date-parser calls to new positional
  lang signature (extract_datetime/duration/nice_date/nice_duration)
- currency_ner: fix TECH-009 — R$/A$/C$ multi-char symbols now use
  regex alternation instead of character class; _parse_currency sorts
  symbols longest-first; pattern built by _build_pattern() classmethod
- 3 previously skipped temporal tests now pass (206 total, 0 skipped)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
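
To illustrate the TECH-009 fix described above: a character class such as `[R$A$C$]` matches one character at a time, so `R$` can never be captured as a unit, while an alternation can. The sketch below is an assumption-laden illustration; `SYMBOLS`, `BUGGY` and `build_pattern` are hypothetical and do not reproduce the real `_build_pattern()` classmethod.

```python
import re

SYMBOLS = ["$", "R$", "A$", "C$", "€"]  # illustrative subset

# Buggy shape: inside [...] each symbol degrades to its individual characters.
BUGGY = re.compile(r"(?P<symbol>[R$A$C$€])\s?(?P<amount>\d+(?:\.\d+)?)")


def build_pattern(symbols: list[str]) -> re.Pattern[str]:
    """Build an alternation of escaped symbols, longest first."""
    ordered = sorted(symbols, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    return re.compile(rf"(?P<symbol>{alternation})\s?(?P<amount>\d+(?:\.\d+)?)")


FIXED = build_pattern(SYMBOLS)
```

Sorting longest-first matters mainly when one symbol is a prefix of another, because regex alternation takes the first alternative that matches at a given position.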
- url_ner: extend URL_PATTERN to Unicode label chars (Latin Extended,
  Cyrillic, CJK, Hiragana, Katakana) + re.UNICODE flag; IDN domains
  like https://münchen.de now detected
- names_ner: add _STOPWORDS frozenset (~50 entries) filtering common
  capitalised English non-names (The, Store, Monday, January etc.)
  to cut false positives at sentence boundaries

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Replace hardcoded ~50-word set with _load_stopwords_iso() reading
stopwords-iso.json directly (bypasses pkg_resources bug on Py3.13).
Loads 2590 EN stopwords (lower + Title case) at class definition;
graceful fallback to minimal hardcoded set if package unavailable.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
… guards

- lookup_ner: use ahocorasick-ner as O(N) backend; regex fallback if absent;
  rebuild automaton on add/remove_wordlist; start/end positions in entity data
- temporal_ner: load temporal keywords from res/<lang>/temporal_keywords.txt
  instead of hardcoded set; false-positive guard skips diff spans with no
  temporal keyword or ordinal (e.g. currency amounts parsed as clock times)
- numbers_ner: skip diffs where replacement is not a pure number (fixes
  spurious written_number matches on emails/phone after number normalisation)
- res/en-us/temporal_keywords.txt: 42 English temporal keywords
- res/{de-de,es-es,fr-fr}/temporal_keywords.txt: German, Spanish, French
- pyproject.toml: add ahocorasick-ner>=0.1.1 to dependencies

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
JarbasAl and others added 18 commits March 31, 2026 01:17
…t, __all__

- locations_ner: replace O(N_words×N_cities) word-loop with AhocorasickNER
  automaton; multi-word names (New York, United States, Los Angeles) now
  detected; _legacy_extract removed
- lookup_ner: drop try/except import and regex fallback; ahocorasick-ner
  is now a hard dependency; annotate() simplified
- phone_ner: add _EXT suffix pattern for x123 / ext. 456 extensions
- __init__.py: add __all__ = ["Entity", "SimpleNER"]
- SUGGESTIONS.md: 10 tracked improvement proposals

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Replaces exact-span grouping in _deduplicate with a greedy longest-span-wins
algorithm that handles overlaps across different annotators. Ties resolved by
confidence then annotator order. Entities without span info are passed through
unchanged. Adds _resolve_span() helper that prefers data["start"]/data["end"]
over entity.spans for speed.
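
A greedy longest-span-wins pass of the kind described above might look like the sketch below. The `Span` dataclass and `dedup_longest_wins` function are hypothetical stand-ins for the pipeline's entity objects and `_deduplicate`; the pass-through of entities without span info is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int         # inclusive character offset
    end: int           # exclusive character offset
    label: str
    confidence: float
    order: int         # position of the producing annotator in the pipeline


def dedup_longest_wins(spans: list[Span]) -> list[Span]:
    """Keep non-overlapping spans, preferring length, then confidence, then order."""
    ranked = sorted(
        spans, key=lambda s: (-(s.end - s.start), -s.confidence, s.order)
    )
    kept: list[Span] = []
    for cand in ranked:
        # Accept the candidate only if it overlaps none of the spans already kept.
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda s: s.start)
```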

Also adds phone.rx patterns for space-separated international numbers (S-003):
- +44 20 7946 0958 and +33 1 23 45 67 89 now matched
- Pattern 3: \+\d{1,3}(?:[\s-]\d{1,4}){2,6}
- Pattern 4: \+\d{1,3}\s\d{2,5}\s\d{3,4}\s\d{4}
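
The two patterns listed above can be checked directly with Python's `re` module. The constant names and the `find_phone` helper are illustrative only; how phone.rx is actually loaded and applied is not shown here.

```python
from __future__ import annotations

import re

# Patterns copied verbatim from the commit message above.
PATTERN_3 = re.compile(r"\+\d{1,3}(?:[\s-]\d{1,4}){2,6}")
PATTERN_4 = re.compile(r"\+\d{1,3}\s\d{2,5}\s\d{3,4}\s\d{4}")


def find_phone(text: str) -> str | None:
    """Return the first substring matched by either pattern, or None."""
    for pattern in (PATTERN_3, PATTERN_4):
        match = pattern.search(text)
        if match:
            return match.group(0)
    return None
```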

17 new tests in test/test_pipeline_overlap_dedup.py; 323/323 passing.

AI-Generated Change:
- Model: Claude Sonnet 4.6
- Intent: Fix cross-annotator span collision (S-006) and missing intl phone formats (S-003)
- Impact: _deduplicate now resolves overlaps correctly; PhoneAnnotator matches EU/UK space formats
- Verified via: uv run pytest test/ -q (323 passed)
- Add simple_NER/utils/locale.py: load_rx(), load_intents(), load_wordlist()
- Add locale/en-us/ and locale/de-de/ with .rx, .intent, .txt files
- Wire PhoneAnnotator, CurrencyAnnotator, OrganizationAnnotator, DateAnnotator to locale
- S-002: longest-match-wins dedup in LocationNER (York vs New York)
- S-003: space-separated international phone formats (+44 20 7946 0958)
- S-004: temporal-keyword guard applied to duration extraction
- S-005: per-label confidence in LookUpNER and LocationNER
- S-006: cross-annotator span-overlap dedup in NERPipeline
- S-010: EU decimal notation in CurrencyAnnotator (1.000,50)
- 135 new tests: 206 → 341 passing; numbers_ner 28→84%, lookup 72→91%

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
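
For S-010, the EU notation (dot as thousands separator, comma as decimal separator) can be normalised as in the sketch below. `EU_AMOUNT` and `parse_eu_amount` are hypothetical; the real CurrencyAnnotator may accept more shapes than this strict pattern.

```python
import re
from decimal import Decimal

# Assumed strict shape: 1-3 leading digits, dot-separated thousands, two decimals.
EU_AMOUNT = re.compile(r"\d{1,3}(?:\.\d{3})*,\d{2}")


def parse_eu_amount(raw: str) -> Decimal:
    """Convert an EU-formatted amount such as '1.000,50' to a Decimal."""
    if not EU_AMOUNT.fullmatch(raw):
        raise ValueError(f"not an EU-formatted amount: {raw!r}")
    return Decimal(raw.replace(".", "").replace(",", "."))
```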
Use difflib.SequenceMatcher to map converted digit spans back to
character positions in the original text. NumberNER entities now
participate in pipeline cross-annotator overlap dedup (TECH-011).

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
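
One way to map converted digit spans back to the original text with `difflib.SequenceMatcher` is sketched below. It aligns whitespace-separated tokens rather than characters, which is an assumption; `map_number_spans` is a hypothetical helper, not the NumberNER implementation.

```python
import re
from difflib import SequenceMatcher


def map_number_spans(original: str, converted: str) -> list[tuple[int, int, str]]:
    """Return (start, end, replacement) for each replaced run of tokens.

    start/end are character offsets into the original text.
    """
    orig_tokens = list(re.finditer(r"\S+", original))
    conv_tokens = converted.split()
    matcher = SequenceMatcher(
        None, [m.group(0) for m in orig_tokens], conv_tokens, autojunk=False
    )
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue  # only replaced runs correspond to converted numbers
        start = orig_tokens[i1].start()
        end = orig_tokens[i2 - 1].end()
        spans.append((start, end, " ".join(conv_tokens[j1:j2])))
    return spans
```

Because the result carries character offsets, such entities can take part in span-overlap comparisons alongside those from other annotators.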
- date_ner.py: 73% → 92% — all format branches, full _is_valid_date
  validation (leap years, month/day bounds, century rules)
- hashtag_ner.py: 79% → 96% — edge cases and classification branches
- pipeline.py: 79% → 92% — Span helpers, select_entity, async dedup
- utils/locale.py: 76% → 99% — malformed regex skip, missing file paths

52 new tests; total 350 → 402 passing.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
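
The leap-year, century and month/day bounds exercised by the date_ner tests above follow the usual Gregorian rules, sketched below. `is_leap_year` and `is_valid_date` are illustrative; the signature of the project's `_is_valid_date` is not given in this PR text.

```python
def is_leap_year(year: int) -> bool:
    """Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def is_valid_date(year: int, month: int, day: int) -> bool:
    """Check month and day bounds, accounting for leap years."""
    if not 1 <= month <= 12:
        return False
    days_in_month = [
        31, 29 if is_leap_year(year) else 28, 31, 30, 31, 30,
        31, 31, 30, 31, 30, 31,
    ]
    return 1 <= day <= days_in_month[month - 1]
```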
- README.md rewritten — install, quick start, annotator table, dedup
  strategies, locale/i18n, async, OVOS plugin
- docs/index.md — full API reference with constructor params, data
  fields per annotator, locale system, BaseAnnotator extension guide
- docs/TUTORIALS.md — 8 end-to-end tutorials with sample output
- examples/01-12: every annotator, dedup strategies, custom keywords,
  LocationNER label_confidence, TemporalNER anchor_date, multilang
  currency, async batch, custom annotator subclass, OVOS plugin,
  LookUpNER runtime wordlists, locale utilities direct usage
- examples/README.md — index with run commands

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
temporal_keywords.txt:
- Add it-it (43 kw), nl-nl (42 kw), pt-pt (43 kw)

locale files per language (es-es, fr-fr, it-it, nl-nl, pt-pt):
- currency.intent — native written currency templates
- currency.rx     — EU decimal format (dot-thousands, comma-decimal)
- organization.rx — country-specific legal suffixes + university patterns
- phone.rx        — country-specific phone formats + country code variants

locale/de-de:
- phone.rx        — +49 and 0xxx German formats

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- Rename locale/res dirs to bare language codes (de-de → de, en-us → en)
- locale loader normalises lang via .split('-')[0].lower() — 'de-DE',
  'de-de', 'de' all resolve to locale/de/
- Add 16 new languages: da, el, eu, fa, gl, hu, lt, pl, ro, ru, sv,
  tr, uk, an, ast, mwl
- Each language gets: date_months.txt, currency.intent, currency.rx,
  organization.rx, phone.rx, temporal_keywords.txt
- Custom currency.rx for fa (ریال/تومان), ru (₽), uk (₴), tr (₺)
- Merge regional variants (es-419, nl-be, pt-br, pt-ao, sv-fi) into
  primary bare-code dirs

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
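
The language normalisation described above, `.split('-')[0].lower()`, behaves as in this short sketch. The `normalise_lang` and `locale_dir` names and the `locale` base directory argument are assumptions for illustration.

```python
from pathlib import Path


def normalise_lang(lang: str) -> str:
    """Reduce a language tag such as 'de-DE' to its bare lowercase code."""
    return lang.split("-")[0].lower()


def locale_dir(base: Path, lang: str) -> Path:
    """Resolve the locale directory for a language tag (no existence check)."""
    return base / normalise_lang(lang)
```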
- Add simple_NER/version.py with OVOS version block (v0.9.0 stable)
- Wire pyproject.toml dynamic version from simple_NER.version.__version__
- Add standard workflows: release_workflow, publish_stable, build-tests,
  lint, coverage, release-preview, repo-health, license_check, pip_audit,
  opm-check, conventional-label
- Remove legacy build_tests.yml and license_tests.yml

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…ists for 24 languages

AI-Generated Change:
- Model: Claude Sonnet 4.6
- Intent: Close resolved AUDIT.md issue and expand LookUpNER coverage across all supported locales
- Impact:
  - AUDIT.md: TECH-011 moved from Open Issues to Resolved (commit a5bac24, 2026-03-31)
  - Added color/emotion/weather/animal.entity for de, es, fr, it, nl, pt, ca, cs, da, el, eu, fa,
    gl, hu, lt, mwl, pl, ro, ru, sv, tr, uk, an, ast (96 new files; total entity count: 107)
  - Non-Latin scripts (el, fa, ru, uk) use native script throughout
  - Minority languages (an, ast, mwl, eu, gl, ca) use accurate regional vocabulary
- Verified via: uv run pytest test/ -q → 402 passed

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
* feat(integration): ahocorasick-ner dataset wrapper + examples + tests

- Add AhocorasickAnnotatorWrapper: adapts any AhocorasickNER instance as
  a BaseAnnotator for use in NERPipeline
- Add test_ahocorasick_wrapper.py and test_integration_hf.py (10 tests)
- Add examples: huggingface_datasets, wikidata_subclasses,
  comprehensive_datasets
- Add docs/DATASET_INTEGRATION.md with full HF dataset guide
- Fix duplicate dependency in pyproject.toml

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* fix(names_ner): sentence-boundary heuristic reduces false positives (S-001)

Single capitalised words at position 0 or after sentence-ending punctuation
(.!?) now score 0.55 — below the default threshold of 0.65 — so common
false positives like "Send" or "Meeting" at sentence starts are suppressed.
Multi-word compound names (e.g. "John Doe") retain confidence 0.85 regardless
of sentence position. Mid-sentence single words score 0.80 as before.

Adds sentence_initial flag to entity data. 10 new tests covering all cases.
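
The scoring rules above can be sketched as a small function. The confidence values (0.55, 0.80, 0.85) and the 0.65 threshold come from the commit message; the `score_name` helper and the way the sentence boundary is detected are assumptions.

```python
SENTENCE_END = ".!?"
DEFAULT_THRESHOLD = 0.65


def score_name(text: str, start: int, end: int) -> dict:
    """Score a capitalised candidate text[start:end] by its sentence position."""
    candidate = text[start:end]
    preceding = text[:start].rstrip()
    sentence_initial = not preceding or preceding[-1] in SENTENCE_END
    if len(candidate.split()) > 1:
        confidence = 0.85  # multi-word names keep their score anywhere
    elif sentence_initial:
        confidence = 0.55  # below the default threshold, so suppressed
    else:
        confidence = 0.80  # mid-sentence single word
    return {"confidence": confidence, "sentence_initial": sentence_initial}
```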

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* test: cover cli, batch, diff, version; fix TECH-004 typing

- utils/diff.py: replace typing.Tuple with built-in tuple (closes TECH-004)
- test_diff.py: 7 tests, diff.py now 100% covered
- test_batch.py: 10 tests, batch.py 0% → 83%
- test_cli.py: 22 tests, cli.py 0% → 93%, version.py 0% → 100%
- Overall coverage: 79% → 89%

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* docs: update FAQ and add AI_TRANSPARENCY_LOG

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* refactor(lookup_ner): use public AhocorasickNER name, drop private alias

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Remove misleading _AhocorasickNER private alias; AhocorasickNER is a public API
- Impact: Import renamed; type annotation on _ac field updated from string literal to direct type
- Verified via: uv run pytest test/ (462 passed)

* refactor(locations_ner): use public AhocorasickNER name, drop private alias

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Consistent with lookup_ner.py rename; AhocorasickNER is a public API
- Impact: Import alias removed; _build_automaton instantiation updated
- Verified via: uv run pytest test/ (462 passed)

* feat(ahocorasick_wrapper): expose min_word_len param, forward to tag()

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Allow callers to tune minimum match length (default 5 mirrors AhocorasickNER.tag default)
- Impact: New keyword-only param min_word_len on __init__; mock updated to accept the param
- Verified via: uv run pytest test/ (462 passed)

* feat(lookup_ner): add add_word() for single-word runtime registration

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Allow per-word additions without replacing the whole wordlist (gap vs add_wordlist)
- Impact: New public method add_word(label, word); rebuilds automaton immediately
- Verified via: uv run pytest test/ (462 passed)

* docs(ahocorasick_wrapper): document min_word_len param and dataset loader examples

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Make dataset loader compatibility discoverable; document new min_word_len param
- Impact: Class docstring rewritten with two usage examples (custom vocab + dataset loader)
- Verified via: uv run pytest test/ (462 passed)

* docs(index): add AhocorasickAnnotatorWrapper section with min_word_len and dataset loaders

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Make wrapper discoverable in docs; document min_word_len param and dataset loader compatibility
- Impact: New section in docs/index.md with constructor param table and two usage examples
- Verified via: uv run pytest test/ (462 passed)

* test: cover min_word_len forwarding, LookUpNER.add_word, no private alias

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Verify all acceptance criteria from spec.md with dedicated tests
- Impact: +6 tests (2 min_word_len, 1 symbol check, 3 add_word); 468 total
- Verified via: uv run pytest test/ (468 passed)

* chore: mark all status items complete; update AI_TRANSPARENCY_LOG

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Close sprint record; all 8 checklist items done
- Impact: status.md fully checked; AI_TRANSPARENCY_LOG updated with sprint 2 summary
- Verified via: uv run pytest test/ (468 passed, coverage 89%)

* fix: remove broken alias test; fix two stale docstrings

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Address audit findings — false-pass test deleted, stale docstrings corrected
- Impact: test_no_private_alias_in_source removed (python -m grep is not real);
  lookup_ner.py fallback prose removed (ahocorasick-ner is a hard dep);
  ahocorasick_wrapper.py Args example class names corrected to actual dataset classes
- Verified via: uv run pytest test/ (467 passed)

* refactor(locations_ner): align _ac type annotation with lookup_ner (AhocorasickNER | None)

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Consistency — lookup_ner uses AhocorasickNER | None, locations_ner used Any
- Impact: _ac field annotation tightened; Any import retained for other fields
- Verified via: uv run pytest test/ (467 passed)

* docs(lookup_ner): warn about O(n²) rebuilds when using add_word in a loop

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Guide callers to add_wordlist() for bulk additions
- Impact: add_word() docstring gains a one-line rebuild-cost note
- Verified via: uv run pytest test/ (467 passed)
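
The rebuild-cost note can be made concrete with a toy model. `WordRegistry` below is a hypothetical stand-in that only counts work; it does not use ahocorasick-ner, and it assumes each rebuild costs time proportional to the number of stored words.

```python
class WordRegistry:
    """Toy model of a lookup annotator that rebuilds its index on every change."""

    def __init__(self) -> None:
        self.words: dict[str, list[str]] = {}
        self.rebuilds = 0
        self.work = 0  # total words processed across all rebuilds

    def _rebuild(self) -> None:
        self.rebuilds += 1
        self.work += sum(len(ws) for ws in self.words.values())

    def add_word(self, label: str, word: str) -> None:
        """Add one word, then rebuild immediately."""
        self.words.setdefault(label, []).append(word)
        self._rebuild()

    def add_wordlist(self, label: str, words: list[str]) -> None:
        """Replace the whole list for a label, then rebuild once."""
        self.words[label] = list(words)
        self._rebuild()
```

Adding n words one at a time triggers n rebuilds over a growing list, roughly n²/2 units of work in this model, whereas a single `add_wordlist` call does n units.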

* test(factory): parametrized smoke tests for all 27 factory keys + edge cases

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Close 81% coverage gap on factory.py instantiation branches
- Impact: 34 new tests covering every registered key, unknown-key error, custom registration, create_pipeline happy/skip/all-unknown paths
- Verified via: uv run pytest test/unittests/test_factory.py (34 passed)

* test(temporal_ner): cover import-error fallback (lines 21-44)

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Exercise the _OVOS_AVAILABLE=False path that was previously untested
- Impact: 3 new tests — deps present, deps absent (sentinel check), annotate no-op when unavailable
- Verified via: uv run pytest test/unittests/test_temporal_import_fallback.py (3 passed)

* test: cover locations error paths, opm session branch, temporal fallback, factory keys

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Close remaining coverage gaps from audit — locations_ner 131-145, opm 127-132, temporal_ner 30-44, factory instantiation
- Impact: +47 tests; coverage 89%→90%; test/test_factory.py and test/test_opm.py removed (duplicated by test/unittests/)
- Verified via: uv run pytest test/ (481 passed, coverage 90%)

* docs: comprehensive onboarding and documentation reorganization

- Add docs/GETTING_STARTED.md: 10-section getting started guide for new users with quick install, first pipeline, entity types, async processing, multi-language support, and troubleshooting
- Add docs/README.md: navigation hub with decision tree routing (new user vs developer vs advanced) + annotator selection table
- Update docs/index.md: add quick-navigation section at top linking to getting started, API, FAQ, tutorials, examples
- Update readme.md: replace sparse links with organized learning paths (Getting Started → FAQ → API Reference → Tutorials → Examples)
- Update AUDIT.md: consolidate 5 open issues from recent sprint (false-pass test, O(n²) caveat, type annotation inconsistency, stale docstrings)
- Update SUGGESTIONS.md: add 4 pending fixes (S-011 to S-014) tied to AUDIT issues with effort estimates
- All 481 tests pass, coverage at 90%

Addresses user request to "get the whole repo in shape, ready for a total noob to use out of the box."

Co-Authored-By: Claude Haiku 4.5 <[email protected]>

* audit: consolidate all fixes into AUDIT.md, delete SUGGESTIONS.md

- Remove TECH-012 (invalid per user — S-011 superseded)
- Mark TECH-013 through TECH-016 as resolved (all fixes implemented):
  - TECH-013: O(n²) caveat documented in LookUpNER.add_word() docstring
  - TECH-014: LocationNER._ac type aligned to AhocorasickNER | None
  - TECH-015: LookUpNER stale fallback reference removed
  - TECH-016: AhocorasickAnnotatorWrapper docstring corrected with actual dataset classes
- Delete SUGGESTIONS.md (all pending fixes completed; AUDIT.md is authoritative)

All 481 tests pass. Code is production-ready.

Co-Authored-By: Claude Haiku 4.5 <[email protected]>

* test(opm): skip OPM tests when ovos_plugin_manager not installed

The OPM tests require ovos_plugin_manager, which is not part of the 'test' extras
(since it's only needed for the OVOS plugin integration). Skip the tests gracefully
in CI where the module is unavailable, matching the pattern used for optional deps.

Fixes CI failures in build-tests, coverage, and opm-check workflows.

Co-Authored-By: Claude Haiku 4.5 <[email protected]>

* build: add ovos-plugin-manager to test extras; remove skip logic

Add ovos-plugin-manager (>=1.0.0) to both 'test' and 'dev' extras so it's always
available during testing. This ensures proper coverage of the OVOS plugin integration
(SimpleNERIntentTransformer) instead of skipping tests.

Remove the conditional import check and @pytest.mark.skipif from test_opm.py since
the module is now guaranteed to be present.

All 481 tests pass.

Co-Authored-By: Claude Haiku 4.5 <[email protected]>

---------

Co-authored-by: Claude Sonnet 4.6 <[email protected]>