Closed
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1805 +/- ##
===========================================
- Coverage 70.26% 32.47% -37.80%
===========================================
Files 175 185 +10
Lines 30192 30805 +613
===========================================
- Hits 21215 10003 -11212
- Misses 8977 20802 +11825
Draft
Description
TLDR: Abstracts source control to support different data ingestion patterns.
Motivation
BaseSourceControls today conflates two concerns: lifecycle management (progress bars, error messages, load triggering, source registration) and file-processing mechanics (file cards, format detection, CSV/Parquet/JSON parsing). The previous design also covers only one data ingestion pattern, in which the user provides raw files (upload or URL download), but there are many more. Each pattern has a different UI, a different user interaction flow, and a different metadata shape that matters for agent-driven discovery.
What this PR does
Refactors the source controls into a proper class hierarchy:
The key changes:
BaseSourceControls becomes a clean lifecycle base: just progress, messages, _run_load, and source registration. No file knowledge.
CatalogSourceControls is agent-ready by design. It accepts a vector_store param. When the catalog loads, entries are embedded in the background (same asyncio.create_task pattern as SourceCatalog._sync_metadata_to_vector_store). The search_columns param controls what text gets embedded. This means a future CatalogLookupTool can semantically search catalogs without any additional plumbing.
ParametricSourceControls auto-generates widgets from param.Parameter definitions. Subclasses just declare params and implement _fetch_data(). The _get_parameter_schema() method exposes parameter structure for future agent-driven parameter filling.
File-specific code moved to FileSourceControls: _process_files, _generate_file_cards, _add_table, _read_json_file, _read_geo_file, and all the UploadedFileRow management. Upload and Download controls inherit from this instead of the base.
Toward agent-driven data discovery
This refactoring is explicitly designed as the foundation for a conversational discovery agent. The control hierarchy exposes clean hooks:
_entry_to_text() → vector store embedding → agent can semantically search ("find me mouse brain spatial transcriptomics data")
_get_parameter_schema() → agent can fill parameters from natural language ("get me weather data for Seattle last week")
_fetch_entry() / _fetch_data() → agent can trigger loading programmatically
The agent layer itself is not in this PR; it will come separately and build on top of these controls.
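To make the first hook concrete, here is a minimal sketch of the catalog flow: entries are turned into text, embedded in a background task, and then searched. It is not the code from this PR. `CatalogControlsSketch`, `InMemoryVectorStore`, `load_catalog`, `_embed_entries`, and the store's `add`/`query` methods are hypothetical names chosen for illustration, and the keyword match stands in for real semantic search. Only `_entry_to_text`, `search_columns`, `vector_store`, and the use of `asyncio.create_task` come from the description above.

```python
import asyncio


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector store; keeps raw text, no embeddings."""

    def __init__(self):
        self.items = []

    async def add(self, items):
        await asyncio.sleep(0)  # yield control, as a real embedding call would
        self.items.extend(items)

    def query(self, text):
        # Naive keyword overlap instead of semantic similarity.
        words = set(text.lower().split())
        return [it for it in self.items if words & set(it["text"].lower().split())]


class CatalogControlsSketch:
    """Illustrative only; not the CatalogSourceControls class from the PR."""

    def __init__(self, entries, search_columns, vector_store):
        self.entries = entries
        self.search_columns = search_columns
        self.vector_store = vector_store
        self._embed_task = None

    def _entry_to_text(self, entry):
        # Only the configured columns contribute to the embedded text.
        return " ".join(str(entry[c]) for c in self.search_columns if c in entry)

    async def _embed_entries(self):
        items = [{"text": self._entry_to_text(e), "metadata": e} for e in self.entries]
        await self.vector_store.add(items)

    def load_catalog(self):
        # Schedule embedding in the background; requires a running event loop.
        self._embed_task = asyncio.create_task(self._embed_entries())


async def main():
    entries = [
        {"id": "a1", "title": "Mouse brain spatial transcriptomics", "size_mb": 120},
        {"id": "b2", "title": "Seattle weather observations", "size_mb": 3},
    ]
    controls = CatalogControlsSketch(entries, ["title"], InMemoryVectorStore())
    controls.load_catalog()
    await controls._embed_task  # a real UI would not block on this
    return [hit["metadata"]["id"] for hit in controls.vector_store.query("mouse brain")]


result = asyncio.run(main())
print(result)
```

The point of the sketch is that the controls only need to produce text and hand it to the store. Anything that can query the store, such as a future lookup tool, can then find entries without knowing how the catalog UI works.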
How Has This Been Tested?
Will try actual use-case scenarios, starting with AnnData (holoviz-topics/lumen-anndata#45), then others.
AI Disclosure
Opus for back and forth planning and Sonnet for implementation
Checklist