Multi-signal file support and NEOARM TEM enhancements#21

Merged
jat255 merged 15 commits into main from
feature/validate_penn_dm4_quanta
Dec 23, 2025

@jat255 jat255 commented Dec 22, 2025

Summary

This PR adds multi-signal file support and enhances the Digital Micrograph extractor to improve metadata extraction from DM3/DM4 files, particularly JEOL NEOARM TEM images.

Key Changes

  • Multi-signal file support: Files containing multiple signals (like DM3/DM4) are now automatically expanded into separate datasets in experimental records. Each signal gets its own metadata extraction, preview image, and XML dataset element.

  • NEOARM TEM enhancements: Enhanced Digital Micrograph extractor now captures JEOL-specific metadata including signal names (ADF, BF), aperture settings (condenser, objective, selected area), and pixel dwell time measurements.

  • Preview naming: Multi-signal files generate indexed previews (e.g., file.dm4_signal0.thumb.png). Single-signal files use traditional naming for backward compatibility.

  • Type safety improvement: Refactored try_getting_dict_value() to return None instead of magic string "not found" for better type safety and Python idiomaticity.

  • TIFF extractors refactoring: Standardized TIFF-based extractors to use consistent FieldDefinition configuration, reducing code duplication.

  • Documentation updates: Comprehensive updates to extractor documentation reflecting plugin architecture and multi-signal support.

Architecture Changes

The implementation maintains backward compatibility through:

  1. Transparent expansion: Multi-signal files are expanded at the Activity layer, maintaining the 1:1 mapping between metadata/preview/warnings lists internally while supporting multiple datasets per file.

  2. Fallback behavior: Single-signal files work exactly as before, using traditional preview naming (no signal suffix).

  3. Signal detection: Detection of multi-signal structure happens at the extractor level via "nx_meta_list" key vs traditional "nx_meta" key.
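The signal-detection rule in item 3 can be sketched as follows. This is a minimal illustration, assuming the extractor result is a plain dict that carries either the "nx_meta_list" key or the traditional "nx_meta" key; the helper name `normalize_extractor_output` is hypothetical and is not taken from the NexusLIMS code base.

```python
from typing import Any


def normalize_extractor_output(result: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one metadata dict per signal, whichever key the extractor used.

    Hypothetical helper: assumes a multi-signal result stores its per-signal
    metadata under "nx_meta_list" and a single-signal result under "nx_meta".
    """
    if "nx_meta_list" in result:
        # Multi-signal structure: fan out into one dict per signal
        return [{"nx_meta": meta} for meta in result["nx_meta_list"]]
    # Traditional single-signal structure: wrap it so callers always get a list
    return [result]


single = {"nx_meta": {"Signal Name": "BF"}}
multi = {"nx_meta_list": [{"Signal Name": "ADF"}, {"Signal Name": "BF"}]}

single_out = normalize_extractor_output(single)
multi_out = normalize_extractor_output(multi)
```

With this shape, downstream code can iterate over the result without caring whether the file held one signal or several.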

Testing

All changes are validated with comprehensive test coverage:

  • Unit tests for multi-signal extraction using neoarm_gatan_si_file fixture
  • Integration tests for Activity layer handling of multi-signal files
  • XML generation tests verifying correct dataset element creation
  • Backward compatibility tests with existing single-signal DM3/DM4 files
  • End-to-end workflow tests with multiple file types

Frontend Requirements

Note: Proper display and download of multi-signal records in the CDCS frontend requires the updated XSLT stylesheet from NexusLIMS-CDCS commit 240a7f9.

Files Modified

Core extractors:

  • nexusLIMS/extractors/plugins/digital_micrograph.py - Multi-signal extraction and NEOARM support
  • nexusLIMS/extractors/__init__.py - Signal-indexed preview generation
  • nexusLIMS/extractors/plugins/preview_generators/ - Enhanced preview handling
  • nexusLIMS/extractors/plugins/{quanta_tif,tescan_tif,orion_HIM_tif}.py - Standardized field definitions

Activity/record building:

  • nexusLIMS/schemas/activity.py - Multi-signal file handling in AcquisitionActivity
  • nexusLIMS/builder/record_builder.py - End-to-end integration

Utilities:

  • nexusLIMS/utils.py - Type-safe try_getting_dict_value()

Documentation:

  • docs/extractors.md - Architecture and plugin system documentation
  • docs/writing_extractor_plugins.md - Plugin development guide
  • CLAUDE.md - Project guidelines
  • Changelog entries for features and improvements

Changelog Entries

  • 14.feature.md - NEOARM TEM metadata extraction enhancement
  • 14.feature.2.md - Multi-signal file support
  • 21.misc.md - TIFF extractor refactoring
  • +1.misc.md - Type safety improvement

Backward Compatibility

✅ All existing single-signal file tests pass
✅ Existing preview naming preserved for single signals
✅ New multi-signal structure transparent to downstream code
✅ Fallback extractors work identically

Performance

No performance regressions expected:

  • Multi-signal extraction uses same underlying mechanisms as single-signal
  • Preview generation parallelized as before
  • Activity expansion is O(n) where n = signal count per file

Related Issue

Fixes #14 - Multi-signal file support for improved microscopy data organization

- Create tar.gz archives for quanta-fei_2_dataZeroed.tif (6.2K),
  neoarm-gatan_SI_dataZeroed.dm4 (71K), and
  neoarm-gatan_image_dataZeroed.dm4 (12K)
- Add QUANTA_FEI_2, NEOARM_GATAN_SI, and NEOARM_GATAN_IMAGE entries
  to tests/unit/utils.py tars dictionary
- Create pytest fixtures (quanta_fei_2_file, neoarm_gatan_si_file,
  neoarm_gatan_image_file) in tests/unit/conftest.py for automatic
  extraction and cleanup
- Test files use zeroed data for size optimization (metadata preserved)
  and metadata has been sanitized to remove potentially identifying
  information

- Add FieldDefinition NamedTuple to base.py for standardized field configuration
- Supports unit conversion, string/numeric handling, and zero-value suppression
- Refactor Quanta, Tescan, and Orion HIM extractors to use FieldDefinition
- Reduces code duplication and improves maintainability across TIFF extractors
- Update extractor tests and fixtures to reflect new extraction methodology
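
To illustrate the idea of a declarative field table, here is a rough sketch. The field names (`source_key`, `output_key`, `factor`, `is_string`, `suppress_zero`) and the `apply_field` helper are assumptions made for this example; the actual FieldDefinition in base.py may be laid out differently.

```python
from typing import Any, NamedTuple, Optional


class FieldDefinition(NamedTuple):
    """Hypothetical layout; the real NamedTuple in base.py may differ."""

    source_key: str               # key in the raw TIFF metadata
    output_key: str               # key to write into the extracted metadata
    factor: float = 1.0           # unit-conversion multiplier
    is_string: bool = False       # keep the raw string instead of a number
    suppress_zero: bool = False   # drop the field when the value is zero


def apply_field(raw: dict[str, str], field: FieldDefinition) -> Optional[tuple[str, Any]]:
    """Return (output_key, value), or None if the field should be skipped."""
    value = raw.get(field.source_key)
    if value is None:
        return None
    if field.is_string:
        return field.output_key, value
    try:
        number = float(value) * field.factor
    except ValueError:
        # Non-numeric text in a numeric field: skip rather than fail
        return None
    if field.suppress_zero and number == 0:
        return None
    return field.output_key, number


# Illustrative field table and raw metadata (keys are made up for the example)
FIELDS = [
    FieldDefinition("HV", "Voltage (kV)", factor=1e-3),
    FieldDefinition("Spot", "Spot Size", suppress_zero=True),
    FieldDefinition("Mode", "Scan Mode", is_string=True),
]
raw = {"HV": "15000", "Spot": "0", "Mode": "SE"}
meta = dict(item for f in FIELDS if (item := apply_field(raw, f)) is not None)
```

One table per extractor replaces a block of near-identical parsing code for each field, which is where the reduction in duplication comes from.
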
Replace 'not found' sentinel value with None for better type safety and
Pythonic code. Updated all 58 call sites across extractors and utilities
to use 'is None' and 'is not None' checks. Fixed exception handling in
numeric field conversion to properly handle None values.
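
The sentinel change can be sketched as below. `try_get` is a simplified stand-in written for this example, not the actual try_getting_dict_value() implementation; it assumes the key is either a single string or a sequence of nested keys.

```python
from functools import reduce
from operator import getitem
from typing import Any, Optional, Sequence, Union


def try_get(dict_: dict, key: Union[str, Sequence[str]]) -> Optional[Any]:
    """Simplified stand-in: return the value at key, or None if it is absent."""
    try:
        if isinstance(key, str):
            return dict_[key]
        # Walk a path of nested keys, e.g. ["Microscope Info", "Voltage"]
        return reduce(getitem, key, dict_)
    except (KeyError, TypeError):
        return None


meta = {"Microscope Info": {"Voltage": 200000}}

voltage = try_get(meta, ["Microscope Info", "Voltage"])
kv = voltage / 1000 if voltage is not None else None

missing = try_get(meta, ["DigiScan", "Sample Time"])
```

Call sites then test `is None` / `is not None` instead of comparing against a string. One trade-off of this design is that a stored value of None cannot be told apart from a missing key.
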
…on to DM3/DM4 extractor

Enhanced the Digital Micrograph extractor to capture three additional metadata
fields from JEOL NEOARM TEM images and other DM3/DM4 files:

- Signal Name: Detector signal type (e.g., ADF, BF) from DataBar metadata
- Aperture settings: Condenser, Objective, and Selected Area aperture values
  from Microscope Info
- Sample Time: Pixel dwell time in microseconds from DigiScan metadata

Added comprehensive test coverage using neoarm_gatan_image_file fixture to
verify all three fields are extracted correctly.

Fixes #14
… metadata dicts

BREAKING CHANGE: All extractors must now return list[dict] instead of dict

This change establishes the foundation for multi-signal file support by
standardizing the extractor return type:

- extract() methods now return list[dict[str, Any]] instead of dict[str, Any]
- Single-signal files return a 1-element list for consistency
- Multi-signal files return one dict per signal/dataset
- Updated docstrings and examples to reflect new contract

This allows the Activity layer to automatically expand multi-signal files
(e.g., DM3/DM4 with multiple signals) into separate datasets in the
experimental record.

Related to multi-signal file handling initiative.
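
A minimal sketch of the list-based contract, assuming an extract() method that takes a file path. Both classes are hypothetical examples and do not reproduce the plugin base class or the ExtractionContext used in NexusLIMS.

```python
from pathlib import Path
from typing import Any


class SingleSignalExtractor:
    """Hypothetical extractor for a format with exactly one dataset per file."""

    def extract(self, path: Path) -> list[dict[str, Any]]:
        metadata = {"nx_meta": {"File": path.name}}
        # Always a list, even though there is only one dataset
        return [metadata]


class MultiSignalExtractor:
    """Hypothetical extractor that yields one metadata dict per signal."""

    def __init__(self, signal_names: list[str]) -> None:
        self.signal_names = signal_names

    def extract(self, path: Path) -> list[dict[str, Any]]:
        return [
            {"nx_meta": {"File": path.name, "Signal Name": name}}
            for name in self.signal_names
        ]


# Callers treat both kinds of extractor identically
one = SingleSignalExtractor().extract(Path("image.tif"))
two = MultiSignalExtractor(["ADF", "BF"]).extract(Path("scan.dm4"))
```
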
…eturn lists

Update all extractor plugins for single-signal file formats to conform to
the new list-based return contract:

- BasicFileInfoExtractor: Returns [metadata_dict]
- QuantaTiffExtractor: Returns [metadata_dict]
- EdaxSpcExtractor & EdaxMsaExtractor: Returns [metadata_dict]
- SerEmiExtractor: Returns [metadata_dict]
- OrionTiffExtractor: Returns [metadata_dict]
- TescanTiffExtractor: Returns [metadata_dict]

For these formats, the list always contains exactly one element since
each file represents a single dataset. This provides a consistent
interface across all extractors and prepares for multi-signal handling
in DM3/DM4 files.

Updated docstrings to reflect the new return type.

Enable extraction of all signals from multi-signal DM3/DM4 files:

- get_dm3_metadata() now returns list of metadata dicts (one per signal)
- Previously returned only first signal, now returns all signals
- Single-signal files return 1-element list for consistency
- Multi-signal files (e.g., spectrum images with multiple channels) return
  one metadata dict per signal

This allows proper representation of complex DM3/DM4 files that contain
multiple datasets, such as:
- Files with multiple image/spectrum signals
- Spectrum images with separate energy loss and thickness maps
- Combined STEM/EELS acquisition sessions

Each signal gets its own metadata extraction and can be displayed as a
separate dataset in the experimental record.
…d preview generation

Update the extraction orchestration layer to handle multi-signal files:

parse_metadata() changes:
- Now processes list of metadata dicts returned by extractors
- For multi-signal files, writes separate JSON files with _signalN suffix
- Generates one preview per signal with _signalN.thumb.png naming
- Returns list of metadata dicts and list of preview paths
- Single-signal files maintain backward-compatible naming (no suffix)

create_preview() changes:
- Added signal_index parameter for multi-signal file preview generation
- Generates preview filename with _signalN suffix when index provided
- Passes signal_index through ExtractionContext to preview generators

HyperSpyPreviewGenerator changes:
- Handles multi-signal files by selecting appropriate signal via index
- Uses context.signal_index to choose which signal to preview
- Falls back to first signal for backward compatibility

This enables complete multi-signal workflow:
1. DM3/DM4 file with 4 signals → 4 JSON files + 4 preview PNGs
2. Each signal gets unique metadata and preview
3. Single-signal files unaffected (no _signal0 suffix for compatibility)
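
The naming rule can be sketched as follows. The functions are hypothetical, and the sketch assumes previews are named by appending to the full source filename, as in the file.dm4_signal0.thumb.png example from the PR description.

```python
from pathlib import Path
from typing import Optional


def preview_name(source: Path, signal_index: Optional[int] = None) -> str:
    """Build a preview filename, adding _signalN only when an index is given."""
    suffix = "" if signal_index is None else f"_signal{signal_index}"
    return f"{source.name}{suffix}.thumb.png"


def preview_names(source: Path, n_signals: int) -> list[str]:
    """One preview name per signal; single-signal files keep the legacy name."""
    if n_signals == 1:
        return [preview_name(source)]
    return [preview_name(source, i) for i in range(n_signals)]


multi = preview_names(Path("file.dm4"), 2)
# multi == ["file.dm4_signal0.thumb.png", "file.dm4_signal1.thumb.png"]
single = preview_names(Path("scan.tif"), 1)
# single == ["scan.tif.thumb.png"]
```
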
…tivity records

Update AcquisitionActivity to handle multi-signal files by creating one
dataset entry per signal:

add_file_by_path() changes:
- Processes list of metadata dicts returned by parse_metadata()
- For multi-signal files, adds one entry per signal to parallel lists
- Repeats filename for each signal but uses different preview paths
- Each signal gets its own metadata, preview, and warnings

_add_dataset_element() changes:
- Added preview_path parameter for explicit preview file specification
- Added signal_index and total_signals for multi-signal naming
- Dataset names include signal index: "filename.ext (X of Y)"
- Uses provided preview_path instead of computing from filename

as_xml() changes:
- Tracks file occurrence counts to identify multi-signal files
- Passes signal index and preview path to _add_dataset_element()
- Each signal becomes a separate <dataset> element in XML

Example: A DM3 file with 4 signals creates:
- 4 <dataset> elements with names "file.dm3 (1 of 4)", "file.dm3 (2 of 4)", etc.
- 4 unique preview paths: file.dm3_signal0.thumb.png, file.dm3_signal1.thumb.png, etc.
- 4 metadata entries with signal-specific information
- All share the same source file location
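
The occurrence-counting approach described for as_xml() could look roughly like this. `dataset_names` is a hypothetical helper that only illustrates the "(X of Y)" naming; it assumes the activity holds a flat list of filenames in which a multi-signal file appears once per signal.

```python
from collections import Counter


def dataset_names(filenames: list[str]) -> list[str]:
    """Label repeated filenames as 'name (X of Y)'; leave unique ones alone."""
    totals = Counter(filenames)   # how many signals each file contributes
    seen: Counter = Counter()     # running index per file
    names = []
    for fname in filenames:
        if totals[fname] == 1:
            # Single-signal file: keep the plain filename
            names.append(fname)
        else:
            seen[fname] += 1
            names.append(f"{fname} ({seen[fname]} of {totals[fname]})")
    return names


names = dataset_names(["file.dm3", "file.dm3", "file.dm3", "other.tif"])
# names == ["file.dm3 (1 of 3)", "file.dm3 (2 of 3)", "file.dm3 (3 of 3)", "other.tif"]
```
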
…return

Update unit tests to handle the new list-based return format from extractors:

All extractor plugin tests updated:
- test_basic_metadata.py: Assert metadata is list, access first element
- test_digital_micrograph.py: Handle multi-signal DM3/DM4 test cases
- test_edax.py: Update SPC/MSA extractor tests for list returns
- test_fei_emi.py: Update SER/EMI extractor tests
- test_orion_HIM.py: Update Orion TIFF extractor tests
- test_quanta_tif.py: Update Quanta TIFF extractor tests
- test_tescan_tif.py: Update Tescan TIFF extractor tests

test_extractor_module.py extensive updates:
- parse_metadata() now returns (list[dict], list[Path])
- Added multi-signal specific tests for signal_index handling
- test_parse_metadata_multi_signal_no_preview(): Verify [None] list returned
- test_create_preview_multi_signal_list_with_index(): Test signal selection
- test_create_preview_multi_signal_list_without_index(): Test legacy mode
- Updated cleanup helpers to handle list of preview paths

test_plugins.py updates:
- Mock extractors return lists for compatibility
- Registry tests verify list-based returns

test_thumbnail_generator.py updates:
- Added multi-signal preview generation tests

All tests verify:
1. Extractors return lists (even for single-signal files)
2. First element accessed for single-signal validation
3. Multi-signal files properly handled with multiple elements
4. Backward compatibility maintained

Update record builder tests to verify multi-signal file handling:

test_activity.py changes:
- Added test_activity_multi_signal_file() to verify expansion behavior
- Verifies that multi-signal files create multiple dataset entries
- Confirms dataset names include signal indices
- Validates parallel lists contain repeated filenames with unique metadata

test_record_builder.py changes:
- Updated XML validation tests for multi-signal datasets
- Mock parse_metadata returns list format
- Verified dataset naming with signal indices in XML output
- Confirmed preview path handling for multi-signal files

These tests ensure:
1. Multi-signal files expand into N dataset elements
2. Each dataset has unique name with (X of Y) format
3. All datasets share same file location
4. Each dataset has its own preview path and metadata
5. XML structure remains valid with multi-signal datasets

Add comprehensive integration tests for multi-signal file handling:

test_end_to_end_workflow.py additions:
- test_multi_signal_record_generation_and_structure(): Verifies complete
  multi-signal workflow from file discovery through XML generation
- Tests DM4 file with 4 signals and DM3 file with 2 signals
- Validates dataset naming: "filename.ext (X of Y)" format
- Confirms each signal gets unique preview but shares file location
- Verifies XML schema compliance with multi-signal datasets

test_nemo_integration.py updates:
- Updated for list-based metadata returns

tests/integration/conftest.py additions:
- Added multi_signal_integration_record fixture
- Helper functions for URL validation and metadata verification
- Seed data for multi-signal test files

tests/integration/docker/nemo/fixtures/seed_data.json:
- Added test data for multi-signal integration tests

tests/conftest.py updates:
- Fixture updates for multi-signal test support

Test fixtures added:
- multi_signal_test_files.py: Fixture definitions for multi-signal test files
- test_hyperspy_preview_generator_multi_signal_no_index.png: Baseline image
  for multi-signal preview generation tests

Other test updates:
- test_nemo_api.py: Updated for list-based returns
- test_utils.py: Minor updates for compatibility

These integration tests verify the complete pipeline:
1. Multi-signal files discovered in session
2. Metadata extracted for each signal
3. Previews generated for each signal
4. XML record created with separate datasets
5. Record uploaded to CDCS successfully

- Ensure that PR runs use a PR-tagged docker image
- Collect Docker logs from all services (nemo, cdcs, mongo, postgres, redis, mailpit, caddy) after tests complete
- Detect which compose file was used (ci.yml or docker-compose.yml) to ensure correct log collection
- Display collected logs in action output when tests fail for immediate visibility
- Upload full logs as artifacts for all runs (success and failure)
- Provides comprehensive troubleshooting information without requiring artifact download

codecov Bot commented Dec 22, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

github-actions Bot commented Dec 22, 2025

📚 Documentation Preview

The documentation for this PR has been deployed to:

This preview will be updated on each push to this PR.

@jat255 jat255 force-pushed the feature/validate_penn_dm4_quanta branch from 635278e to 2fbad05 Compare December 22, 2025 16:42
@jat255 jat255 force-pushed the feature/validate_penn_dm4_quanta branch from 2fbad05 to 2d78fe2 Compare December 22, 2025 16:53
@jat255 jat255 force-pushed the feature/validate_penn_dm4_quanta branch 2 times, most recently from a7abaf1 to 216ba9e Compare December 22, 2025 17:06
@jat255 jat255 merged commit 58d0dbf into main Dec 23, 2025
13 checks passed
@jat255 jat255 deleted the feature/validate_penn_dm4_quanta branch December 23, 2025 00:22
github-actions Bot added a commit that referenced this pull request Dec 23, 2025

Development

Successfully merging this pull request may close these issues.

Validate Metadata Extraction for Penn's DM4 and Quanta SEM Datasets

1 participant