Multi-signal file support and NEOARM TEM enhancements by jat255 · Pull Request #21 · datasophos/NexusLIMS

jat255 · 2025-12-22T15:19:57Z

Summary

This PR implements multi-signal file support and enhances the Digital Micrograph extractor to improve metadata extraction from DM3/DM4 files, particularly JEOL NEOARM TEM images.

Key Changes

Multi-signal file support: Files containing multiple signals (like DM3/DM4) are now automatically expanded into separate datasets in experimental records. Each signal gets its own metadata extraction, preview image, and XML dataset element.
NEOARM TEM enhancements: Enhanced Digital Micrograph extractor now captures JEOL-specific metadata including signal names (ADF, BF), aperture settings (condenser, objective, selected area), and pixel dwell time measurements.
Preview naming: Multi-signal files generate indexed previews (e.g., file.dm4_signal0.thumb.png). Single-signal files use traditional naming for backward compatibility.
Type safety improvement: Refactored try_getting_dict_value() to return None instead of the magic string "not found", which is safer for type checking and more idiomatic Python.
TIFF extractors refactoring: Standardized TIFF-based extractors to use consistent FieldDefinition configuration, reducing code duplication.
Documentation updates: Comprehensive updates to extractor documentation reflecting plugin architecture and multi-signal support.

Architecture Changes

The implementation maintains backward compatibility through:

Transparent expansion: Multi-signal files are expanded at the Activity layer, maintaining the 1:1 mapping between metadata/preview/warnings lists internally while supporting multiple datasets per file.
Fallback behavior: Single-signal files work exactly as before, using traditional preview naming (no signal suffix).
Signal detection: Detection of multi-signal structure happens at the extractor level via "nx_meta_list" key vs traditional "nx_meta" key.
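A minimal sketch of the key-based detection described above. It assumes the extractor result is a plain dict carrying either an "nx_meta_list" or an "nx_meta" key; the helper name normalize_extractor_output is hypothetical and is not taken from the NexusLIMS code base.

```python
from typing import Any


def normalize_extractor_output(result: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one metadata dict per signal, whichever key the extractor used.

    Hypothetical helper: the real detection logic may differ.
    """
    if "nx_meta_list" in result:
        # Multi-signal structure: one entry per signal
        return list(result["nx_meta_list"])
    # Traditional single-signal structure: wrap it in a 1-element list
    return [result["nx_meta"]]
```

With a shape like this, downstream code can iterate over the returned list without caring whether the file held one signal or several.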

Testing

All changes are validated with comprehensive test coverage:

Unit tests for multi-signal extraction using neoarm_gatan_si_file fixture
Integration tests for Activity layer handling of multi-signal files
XML generation tests verifying correct dataset element creation
Backward compatibility tests with existing single-signal DM3/DM4 files
End-to-end workflow tests with multiple file types

Frontend Requirements

Note: Proper display and download of multi-signal records in the CDCS frontend requires the updated XSLT stylesheet from NexusLIMS-CDCS commit 240a7f9.

Files Modified

Core extractors:

nexusLIMS/extractors/plugins/digital_micrograph.py - Multi-signal extraction and NEOARM support
nexusLIMS/extractors/__init__.py - Signal-indexed preview generation
nexusLIMS/extractors/plugins/preview_generators/ - Enhanced preview handling
nexusLIMS/extractors/plugins/{quanta_tif,tescan_tif,orion_HIM_tif}.py - Standardized field definitions

Activity/record building:

nexusLIMS/schemas/activity.py - Multi-signal file handling in AcquisitionActivity
nexusLIMS/builder/record_builder.py - End-to-end integration

Utilities:

nexusLIMS/utils.py - Type-safe try_getting_dict_value()

Documentation:

docs/extractors.md - Architecture and plugin system documentation
docs/writing_extractor_plugins.md - Plugin development guide
CLAUDE.md - Project guidelines
Changelog entries for features and improvements

Changelog Entries

14.feature.md - NEOARM TEM metadata extraction enhancement
14.feature.2.md - Multi-signal file support
21.misc.md - TIFF extractor refactoring
+1.misc.md - Type safety improvement

Backward Compatibility

✅ All existing single-signal file tests pass
✅ Existing preview naming preserved for single signals
✅ New multi-signal structure transparent to downstream code
✅ Fallback extractors work identically

Performance

No performance regressions expected:

Multi-signal extraction uses same underlying mechanisms as single-signal
Preview generation parallelized as before
Activity expansion is O(n) where n = signal count per file

Related Issue

Fixes #14 - Multi-signal file support for improved microscopy data organization

- Create tar.gz archives for quanta-fei_2_dataZeroed.tif (6.2K), neoarm-gatan_SI_dataZeroed.dm4 (71K), and neoarm-gatan_image_dataZeroed.dm4 (12K)
- Add QUANTA_FEI_2, NEOARM_GATAN_SI, and NEOARM_GATAN_IMAGE entries to tests/unit/utils.py tars dictionary
- Create pytest fixtures (quanta_fei_2_file, neoarm_gatan_si_file, neoarm_gatan_image_file) in tests/unit/conftest.py for automatic extraction and cleanup
- Test files use zeroed data for size optimization (metadata preserved), and metadata has been sanitized to remove potentially identifying information

- Add FieldDefinition NamedTuple to base.py for standardized field configuration
- Supports unit conversion, string/numeric handling, and zero-value suppression
- Refactor Quanta, Tescan, and Orion HIM extractors to use FieldDefinition
- Reduces code duplication and improves maintainability across TIFF extractors
- Update extractor tests and fixtures to reflect new extraction methodology
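To illustrate the idea of a declarative field configuration, here is a small sketch. The field names (source_key, output_key, factor, is_string, suppress_zero) and the apply_field helper are assumptions made for this example; the actual FieldDefinition in base.py may be shaped differently.

```python
from typing import Any, NamedTuple


class FieldDefinition(NamedTuple):
    """Hypothetical layout; the attributes in base.py may differ."""

    source_key: str              # key in the raw TIFF metadata
    output_key: str              # key in the extracted metadata
    factor: float = 1.0          # unit conversion multiplier
    is_string: bool = False      # copy as text instead of converting
    suppress_zero: bool = False  # drop numeric values equal to zero


def apply_field(raw: dict[str, str], field: FieldDefinition, out: dict[str, Any]) -> None:
    """Copy one field from raw into out according to its definition."""
    value = raw.get(field.source_key)
    if value is None:
        return  # field absent from this file
    if field.is_string:
        out[field.output_key] = value
        return
    try:
        number = float(value) * field.factor
    except ValueError:
        return  # not numeric; skip rather than store a bad value
    if field.suppress_zero and number == 0:
        return
    out[field.output_key] = number
```

Each extractor would then only need a list of such definitions and a loop over them, which is where the reduction in duplicated code would come from.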

Replace 'not found' sentinel value with None for better type safety and Pythonic code. Updated all 58 call sites across extractors and utilities to use 'is None' and 'is not None' checks. Fixed exception handling in numeric field conversion to properly handle None values.
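The sentinel change can be pictured with the sketch below. It is a simplified stand-in written for this example, not the implementation in nexusLIMS/utils.py, and the accepted key types are an assumption.

```python
from __future__ import annotations

from typing import Any


def try_getting_dict_value(dct: dict, key: str | list[str]) -> Any | None:
    """Look up a (possibly nested) key, returning None when it is absent."""
    keys = [key] if isinstance(key, str) else list(key)
    current: Any = dct
    for k in keys:
        if not isinstance(current, dict) or k not in current:
            return None  # previously the string "not found"
        current = current[k]
    return current
```

Call sites then test the result with `is None` or `is not None` rather than comparing against a string. One consequence of this design is that a key explicitly stored as None looks the same as a missing key.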

…on to DM3/DM4 extractor

Enhanced the Digital Micrograph extractor to capture three additional metadata fields from JEOL NEOARM TEM images and other DM3/DM4 files:

- Signal Name: Detector signal type (e.g., ADF, BF) from DataBar metadata
- Aperture settings: Condenser, Objective, and Selected Area aperture values from Microscope Info
- Sample Time: Pixel dwell time in microseconds from DigiScan metadata

Added comprehensive test coverage using neoarm_gatan_image_file fixture to verify all three fields are extracted correctly.

Fixes #14

… metadata dicts

BREAKING CHANGE: All extractors must now return list[dict] instead of dict

This change establishes the foundation for multi-signal file support by standardizing the extractor return type:

- extract() methods now return list[dict[str, Any]] instead of dict[str, Any]
- Single-signal files return a 1-element list for consistency
- Multi-signal files return one dict per signal/dataset
- Updated docstrings and examples to reflect new contract

This allows the Activity layer to automatically expand multi-signal files (e.g., DM3/DM4 with multiple signals) into separate datasets in the experimental record.

Related to multi-signal file handling initiative.
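The list-based contract can be illustrated with a toy sketch. The class names, the path argument to extract(), and the metadata keys are hypothetical stand-ins and do not reproduce the NexusLIMS plugin API.

```python
from typing import Any


class SingleSignalExtractor:
    """Hypothetical extractor for a format with one dataset per file."""

    def extract(self, path: str) -> list[dict[str, Any]]:
        # Single-signal formats wrap their one dict in a 1-element list
        return [{"source": path, "signal": "image"}]


class MultiSignalExtractor:
    """Hypothetical extractor for a file holding several signals."""

    def extract(self, path: str) -> list[dict[str, Any]]:
        # One dict per signal; the signal names are placeholders
        return [{"source": path, "signal": name} for name in ("ADF", "BF")]


def count_datasets(extractor: Any, path: str) -> int:
    """Callers iterate the list instead of branching on the return type."""
    return sum(1 for _ in extractor.extract(path))
```

Because both extractors return a list, the caller needs no special case for single-signal formats.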

…eturn lists

Update all extractor plugins for single-signal file formats to conform to the new list-based return contract:

- BasicFileInfoExtractor: Returns [metadata_dict]
- QuantaTiffExtractor: Returns [metadata_dict]
- EdaxSpcExtractor & EdaxMsaExtractor: Returns [metadata_dict]
- SerEmiExtractor: Returns [metadata_dict]
- OrionTiffExtractor: Returns [metadata_dict]
- TescanTiffExtractor: Returns [metadata_dict]

For these formats, the list always contains exactly one element since each file represents a single dataset. This provides a consistent interface across all extractors and prepares for multi-signal handling in DM3/DM4 files.

Updated docstrings to reflect the new return type.

Enable extraction of all signals from multi-signal DM3/DM4 files:

- get_dm3_metadata() now returns list of metadata dicts (one per signal)
- Previously returned only first signal, now returns all signals
- Single-signal files return 1-element list for consistency
- Multi-signal files (e.g., spectrum images with multiple channels) return one metadata dict per signal

This allows proper representation of complex DM3/DM4 files that contain multiple datasets, such as:

- Files with multiple image/spectrum signals
- Spectrum images with separate energy loss and thickness maps
- Combined STEM/EELS acquisition sessions

Each signal gets its own metadata extraction and can be displayed as a separate dataset in the experimental record.

…d preview generation

Update the extraction orchestration layer to handle multi-signal files:

parse_metadata() changes:

- Now processes list of metadata dicts returned by extractors
- For multi-signal files, writes separate JSON files with _signalN suffix
- Generates one preview per signal with _signalN.thumb.png naming
- Returns list of metadata dicts and list of preview paths
- Single-signal files maintain backward-compatible naming (no suffix)

create_preview() changes:

- Added signal_index parameter for multi-signal file preview generation
- Generates preview filename with _signalN suffix when index provided
- Passes signal_index through ExtractionContext to preview generators

HyperSpyPreviewGenerator changes:

- Handles multi-signal files by selecting appropriate signal via index
- Uses context.signal_index to choose which signal to preview
- Falls back to first signal for backward compatibility

This enables complete multi-signal workflow:

1. DM3/DM4 file with 4 signals → 4 JSON files + 4 preview PNGs
2. Each signal gets unique metadata and preview
3. Single-signal files unaffected (no _signal0 suffix for compatibility)
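The naming rule can be sketched as follows. The function name preview_path is hypothetical, and the sketch assumes the suffix is appended to the full source filename, matching the file.dm4_signal0.thumb.png example in the PR description; the single-signal form (file.dm4.thumb.png) is likewise an assumption about the traditional naming.

```python
from __future__ import annotations

from pathlib import Path


def preview_path(file_path: Path, signal_index: int | None = None) -> Path:
    """Build a preview filename, adding _signalN only when an index is given."""
    # No index means a single-signal file: keep the unsuffixed name
    suffix = "" if signal_index is None else f"_signal{signal_index}"
    return file_path.with_name(f"{file_path.name}{suffix}.thumb.png")
```

Passing None for single-signal files keeps previously generated previews valid, since their names do not change.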

…tivity records

Update AcquisitionActivity to handle multi-signal files by creating one dataset entry per signal:

add_file_by_path() changes:

- Processes list of metadata dicts returned by parse_metadata()
- For multi-signal files, adds one entry per signal to parallel lists
- Repeats filename for each signal but uses different preview paths
- Each signal gets its own metadata, preview, and warnings

_add_dataset_element() changes:

- Added preview_path parameter for explicit preview file specification
- Added signal_index and total_signals for multi-signal naming
- Dataset names include signal index: "filename.ext (X of Y)"
- Uses provided preview_path instead of computing from filename

as_xml() changes:

- Tracks file occurrence counts to identify multi-signal files
- Passes signal index and preview path to _add_dataset_element()
- Each signal becomes a separate <dataset> element in XML

Example: A DM3 file with 4 signals creates:

- 4 <dataset> elements with names "file.dm3 (1 of 4)", "file.dm3 (2 of 4)", etc.
- 4 unique preview paths: file_signal0.thumb.png, file_signal1.thumb.png, etc.
- 4 metadata entries with signal-specific information
- All share the same source file location
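The occurrence-count approach to naming can be sketched like this. The function dataset_names is a hypothetical illustration that operates on a plain list of filenames; it is not the code in AcquisitionActivity.as_xml().

```python
from collections import Counter


def dataset_names(files: list[str]) -> list[str]:
    """Name each entry, adding "(X of Y)" when a filename repeats."""
    totals = Counter(files)        # how many signals each file contributed
    seen: Counter[str] = Counter() # running position within each file
    names = []
    for filename in files:
        seen[filename] += 1
        if totals[filename] > 1:
            names.append(f"{filename} ({seen[filename]} of {totals[filename]})")
        else:
            # Single-signal files keep their plain name
            names.append(filename)
    return names
```

This relies on the filename being repeated once per signal in the parallel lists, as described for add_file_by_path() above, and runs in a single pass over the entries.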

…return

Update unit tests to handle the new list-based return format from extractors:

All extractor plugin tests updated:

- test_basic_metadata.py: Assert metadata is list, access first element
- test_digital_micrograph.py: Handle multi-signal DM3/DM4 test cases
- test_edax.py: Update SPC/MSA extractor tests for list returns
- test_fei_emi.py: Update SER/EMI extractor tests
- test_orion_HIM.py: Update Orion TIFF extractor tests
- test_quanta_tif.py: Update Quanta TIFF extractor tests
- test_tescan_tif.py: Update Tescan TIFF extractor tests

test_extractor_module.py extensive updates:

- parse_metadata() now returns (list[dict], list[Path])
- Added multi-signal specific tests for signal_index handling
- test_parse_metadata_multi_signal_no_preview(): Verify [None] list returned
- test_create_preview_multi_signal_list_with_index(): Test signal selection
- test_create_preview_multi_signal_list_without_index(): Test legacy mode
- Updated cleanup helpers to handle list of preview paths

test_plugins.py updates:

- Mock extractors return lists for compatibility
- Registry tests verify list-based returns

test_thumbnail_generator.py updates:

- Added multi-signal preview generation tests

All tests verify:

1. Extractors return lists (even for single-signal files)
2. First element accessed for single-signal validation
3. Multi-signal files properly handled with multiple elements
4. Backward compatibility maintained

Update record builder tests to verify multi-signal file handling:

test_activity.py changes:

- Added test_activity_multi_signal_file() to verify expansion behavior
- Verifies that multi-signal files create multiple dataset entries
- Confirms dataset names include signal indices
- Validates parallel lists contain repeated filenames with unique metadata

test_record_builder.py changes:

- Updated XML validation tests for multi-signal datasets
- Mock parse_metadata returns list format
- Verified dataset naming with signal indices in XML output
- Confirmed preview path handling for multi-signal files

These tests ensure:

1. Multi-signal files expand into N dataset elements
2. Each dataset has unique name with (X of Y) format
3. All datasets share same file location
4. Each dataset has its own preview path and metadata
5. XML structure remains valid with multi-signal datasets

Add comprehensive integration tests for multi-signal file handling:

test_end_to_end_workflow.py additions:

- test_multi_signal_record_generation_and_structure(): Verifies complete multi-signal workflow from file discovery through XML generation
- Tests DM4 file with 4 signals and DM3 file with 2 signals
- Validates dataset naming: "filename.ext (X of Y)" format
- Confirms each signal gets unique preview but shares file location
- Verifies XML schema compliance with multi-signal datasets

test_nemo_integration.py updates:

- Updated for list-based metadata returns

tests/integration/conftest.py additions:

- Added multi_signal_integration_record fixture
- Helper functions for URL validation and metadata verification
- Seed data for multi-signal test files

tests/integration/docker/nemo/fixtures/seed_data.json:

- Added test data for multi-signal integration tests

tests/conftest.py updates:

- Fixture updates for multi-signal test support

Test fixtures added:

- multi_signal_test_files.py: Fixture definitions for multi-signal test files
- test_hyperspy_preview_generator_multi_signal_no_index.png: Baseline image for multi-signal preview generation tests

Other test updates:

- test_nemo_api.py: Updated for list-based returns
- test_utils.py: Minor updates for compatibility

These integration tests verify the complete pipeline:

1. Multi-signal files discovered in session
2. Metadata extracted for each signal
3. Previews generated for each signal
4. XML record created with separate datasets
5. Record uploaded to CDCS successfully

- Ensure that PR runs use a PR-tagged docker image
- Collect Docker logs from all services (nemo, cdcs, mongo, postgres, redis, mailpit, caddy) after tests complete
- Detect which compose file was used (ci.yml or docker-compose.yml) to ensure correct log collection
- Display collected logs in action output when tests fail for immediate visibility
- Upload full logs as artifacts for all runs (success and failure)
- Provides comprehensive troubleshooting information without requiring artifact download

codecov · 2025-12-22T15:23:37Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


github-actions · 2025-12-22T15:24:13Z

📚 Documentation Preview

The documentation for this PR has been deployed to:

Docs: https://datasophos.github.io/NexusLIMS/pr-21/
Coverage: https://datasophos.github.io/NexusLIMS/pr-21/coverage/

This preview will be updated on each push to this PR.

jat255 added 15 commits December 18, 2025 15:58

plan

f69d008

docs: Add changelog entry for multi-signal file support (issue #14)

4586b0c

github-actions Bot added a commit that referenced this pull request Dec 22, 2025

Deploy docs preview for PR #21 and update switcher.json

c913922

github-actions Bot added a commit that referenced this pull request Dec 22, 2025

Deploy docs preview for PR #21 and update switcher.json

895a9a0

jat255 force-pushed the feature/validate_penn_dm4_quanta branch from 635278e to 2fbad05 Compare December 22, 2025 16:42

github-actions Bot added a commit that referenced this pull request Dec 22, 2025

Deploy docs preview for PR #21 and update switcher.json

4deb89f

jat255 force-pushed the feature/validate_penn_dm4_quanta branch from 2fbad05 to 2d78fe2 Compare December 22, 2025 16:53

github-actions Bot added a commit that referenced this pull request Dec 22, 2025

Deploy docs preview for PR #21 and update switcher.json

db9fca2

jat255 force-pushed the feature/validate_penn_dm4_quanta branch 2 times, most recently from a7abaf1 to 216ba9e Compare December 22, 2025 17:06

github-actions Bot added a commit that referenced this pull request Dec 22, 2025

Deploy docs preview for PR #21 and update switcher.json

12f3958

github-actions Bot added a commit that referenced this pull request Dec 22, 2025

Deploy docs preview for PR #21 and update switcher.json

3f469d1

jat255 merged commit 58d0dbf into main Dec 23, 2025
13 checks passed

jat255 deleted the feature/validate_penn_dm4_quanta branch December 23, 2025 00:22

github-actions Bot added a commit that referenced this pull request Dec 23, 2025

Remove docs preview for closed PR #21

e690a0c

jat255 merged 15 commits into main from feature/validate_penn_dm4_quanta
