Goal

Add support for ingesting an HCX (hierarchical CX2) network, resolve the chromosome for every gene referenced in the hierarchy, and annotate each hierarchy node with a tally of chromosome counts for all genes contained in that node’s subtree. Emit the updated hierarchy as CX2 wrapped in an updateNetwork action.

Current State (quick read)

  • chromloc/annotate.py is being added to replace the former updatenetwork demo logic.
  • cli.py will expose HCX-specific options (interaction UUID, gene attrs, chrom map path).
  • Hierarchy handling, NDEx fetching, chromosome tallies, and styling are implemented per design.

Assumptions & Open Questions

  • HCX hierarchy nodes carry HCX::members that list node IDs belonging to a separate “interaction network”.
  • The NDEx UUID of that interaction network is read in priority order: node attribute HCX::interactionNetworkUUID, else network attribute HCX::interactionNetworkUUID.
  • Gene identifiers come from those member node records in the interaction network (preferred attribute order: represents, then name, then configurable fallback).
  • Hierarchy structure follows the HCX spec (parent/child via hierarchy aspect). We treat leaves as biological entities (genes) and internal nodes as groupings.
  • Species defaults to human (GRCh38); chromosome set assumed human (chr1–chr22, chrX, chrY, chrM). Keep option to override species for future-proofing, but default behavior is human.
  • Chromosome resolution should work offline when given a local annotation file, and online (optional) via MyGene.info or Ensembl REST when permitted.

Proposed CLI/User Inputs

  • --interaction-gene-attrs: comma list of candidate attributes on the interaction network nodes (default represents,name).
  • --interaction-uuid-attr: attribute name to read NDEx UUID when present on hierarchy nodes (default HCX::interactionNetworkUUID).
  • --interaction-uuid-network-attr: network-level fallback attribute for UUID (default same as above).
  • --gene-list-delim: delimiter when a single string holds multiple genes (default ,).
  • --species: organism code (default human / GRCh38; controls chromosome list used for per-chromosome attributes).
  • --chrom-map: path to local TSV/JSON gene→chromosome map (default to packaged JSON built by build_gene_chr_map.py).
  • --cache-file: optional JSON cache for resolved genes → chromosome (primarily speeds repeated runs).
  • --ndex-server / --ndex-username / --ndex-password or token env var for fetching interaction networks when not embedded.
  • --progress: keep existing progress messages; extend with additional checkpoints.
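A minimal argparse sketch of the options above, assuming the defaults listed in this section. The function name build_parser and the helper _csv_list are hypothetical; --progress is left out because its existing form is not described here, and --chrom-map defaults to None to stand for "use the packaged map".

```python
import argparse
from typing import List


def _csv_list(value: str) -> List[str]:
    """Split a comma-separated option into a list of non-empty names."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser layout; only the flag names and defaults come from the plan.
    parser = argparse.ArgumentParser(prog="cytoscape-chromloc")
    parser.add_argument("--interaction-gene-attrs", type=_csv_list,
                        default="represents,name",
                        help="candidate gene attributes on interaction nodes")
    parser.add_argument("--interaction-uuid-attr",
                        default="HCX::interactionNetworkUUID")
    parser.add_argument("--interaction-uuid-network-attr",
                        default="HCX::interactionNetworkUUID")
    parser.add_argument("--gene-list-delim", default=",")
    parser.add_argument("--species", default="human")
    # None is a placeholder meaning "fall back to the packaged JSON map".
    parser.add_argument("--chrom-map", default=None)
    parser.add_argument("--cache-file", default=None)
    parser.add_argument("--ndex-server", default=None)
    parser.add_argument("--ndex-username", default=None)
    parser.add_argument("--ndex-password", default=None)
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args([]))
```

Because argparse applies the type converter to string defaults, the default for --interaction-gene-attrs arrives as a list, the same as a user-supplied value.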

Data Source Strategy

  • Implement a single resolver:
    • LocalResolver loads a gene→chromosome map from TSV (symbol\tchr) or JSON dict (the default path generated by scripts/build_gene_chr_map.py).
  • Add normalization: uppercase symbols, strip version suffixes, handle aliases via optional HGNC alias file when provided.
  • Caching layer writes/reads JSON to minimize repeat lookups within a run (mainly to avoid re-reading large maps).
  • NDEx fetcher utility: small wrapper around ndex2.client.Ndex2 to pull interaction networks by UUID; respects server/auth flags and caches results on disk (optional) to avoid repeated downloads.
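The local resolver and its normalization could look roughly like the sketch below. The method names (from_file, resolve), the alias dictionary shape, and the version-suffix rule (a trailing ".<digits>") are assumptions made for illustration; only the class name LocalResolver, the TSV/JSON inputs, and the chrUnknown fallback come from this plan.

```python
import csv
import json
import re
from pathlib import Path
from typing import Dict, Optional

# Assumed rule: a version suffix is a trailing ".<digits>", e.g. "ENSG00000141510.17".
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def normalize_symbol(symbol: str) -> str:
    """Uppercase the symbol and strip an assumed trailing version suffix."""
    return _VERSION_SUFFIX.sub("", symbol.strip()).upper()


class LocalResolver:
    """Resolve gene symbols to chromosomes from an in-memory map."""

    def __init__(self, mapping: Dict[str, str],
                 aliases: Optional[Dict[str, str]] = None) -> None:
        self._map = {normalize_symbol(k): v for k, v in mapping.items()}
        # Hypothetical alias table: alias symbol -> primary symbol.
        self._aliases = {normalize_symbol(k): normalize_symbol(v)
                         for k, v in (aliases or {}).items()}

    @classmethod
    def from_file(cls, path: str,
                  aliases: Optional[Dict[str, str]] = None) -> "LocalResolver":
        """Load a JSON dict, or a two-column TSV for any other extension."""
        file_path = Path(path)
        if file_path.suffix.lower() == ".json":
            with file_path.open(encoding="utf-8") as handle:
                mapping = json.load(handle)
        else:
            with file_path.open(encoding="utf-8", newline="") as handle:
                mapping = {row[0]: row[1]
                           for row in csv.reader(handle, delimiter="\t")
                           if len(row) >= 2}
        return cls(mapping, aliases)

    def resolve(self, symbol: str) -> str:
        """Return the chromosome, trying aliases, else 'chrUnknown'."""
        key = normalize_symbol(symbol)
        if key not in self._map:
            key = self._aliases.get(key, key)
        return self._map.get(key, "chrUnknown")


if __name__ == "__main__":
    resolver = LocalResolver({"TP53": "chr17"}, aliases={"P53": "TP53"})
    print(resolver.resolve("tp53"), resolver.resolve("p53"), resolver.resolve("nope"))
```

Direct matches take priority over aliases here, so an alias that collides with a primary symbol never overrides it.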

Hierarchy Processing Design

  1. Load hierarchy network with RawCX2NetworkFactory.
  2. Determine interaction network UUID for each hierarchy node:
    • If node has HCX::interactionNetworkUUID, use it; else use network attribute HCX::interactionNetworkUUID.
    • If no UUID is found, error (unless --allow-missing-uuid is set).
  3. Fetch interaction network (CX/HCX) from NDEx using the UUID (respect server/auth CLI options). Cache the fetched network by UUID to avoid repeat downloads.
  4. Index interaction nodes → genes:
    • Build a map interaction_node_id -> gene_id using the first non-empty attribute from --interaction-gene-attrs.
    • Support gene-list-delim when the attribute is a string containing multiple symbols.
  5. Extract hierarchy structure: read HCX hierarchy aspect to build parent→children adjacency and find root(s). Validate that metaData lists hierarchy elements.
  6. Collect node genes via HCX::members:
    • For each hierarchy node, read HCX::members (list of interaction node IDs). Translate each ID to gene(s) via the interaction map. Drop missing IDs with a warning counter.
  7. Post-order aggregation:
    • Traverse the hierarchy bottom-up, accumulating each child’s genes into the parent. Use a multiset (counts) rather than a plain union so multiplicity is retained; store both a set for uniqueness and a count for frequency.
    • Store per-node: gene_ids (set), gene_counts (Counter), and chrom_counts (dict of chr → count).
  8. Chromosome resolution:
    • For every unique gene encountered across the hierarchy, resolve chromosome once via resolver + cache.
    • Treat unknowns as chrUnknown and track separately to avoid data loss.
  9. Tally building:
    • For each node, translate its gene_counts to chromosome counts using the shared resolution table.
    • Representation (CX2-friendly):
      • Per-chromosome numeric node attributes: chr1_count, chr2_count, …, chr22_count, chrX_count, chrY_count, chrM_count (integers, 0 if none). If species overrides human, generate the chromosome list from the resolver’s metadata instead.
      • Summary attribute chromosomeCounts (list_of_string) e.g., chr1=12, chrX=3, chrUnknown=1.
      • Optional chromosomeCountsJson (string) with JSON-encoded map for consumers that prefer structured data.
    • Update attributeDeclarations to declare all new per-chromosome attributes plus the summary attributes.
  10. Emit updated CX2 preserving all existing aspects; only add/modify node attributes and metaData counts.
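Steps 7 to 9 can be sketched on plain dictionaries, independent of any CX2 classes. The function names, the input shapes (a parent→children dict and a node→genes dict), and the assumption that the hierarchy is acyclic are all illustrative; the attribute names follow the representation described above.

```python
import json
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Mapping, Sequence

HUMAN_CHROMS: List[str] = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY", "chrM"]


def aggregate_gene_counts(children: Mapping[Hashable, Sequence[Hashable]],
                          direct_genes: Mapping[Hashable, Iterable[str]],
                          roots: Iterable[Hashable]) -> Dict[Hashable, Counter]:
    """Iterative post-order pass: each node's Counter includes its whole subtree.

    Assumes the hierarchy is acyclic; a node reachable twice is computed once.
    """
    totals: Dict[Hashable, Counter] = {}
    for root in roots:
        stack = [(root, False)]
        while stack:
            node, expanded = stack.pop()
            if node in totals:
                continue
            if not expanded:
                # Revisit this node after all of its children are done.
                stack.append((node, True))
                stack.extend((child, False) for child in children.get(node, ()))
            else:
                counts = Counter(direct_genes.get(node, ()))
                for child in children.get(node, ()):
                    counts.update(totals[child])
                totals[node] = counts
    return totals


def chromosome_attributes(gene_counts: Mapping[str, int],
                          gene_to_chrom: Mapping[str, str],
                          chroms: Sequence[str] = tuple(HUMAN_CHROMS)) -> Dict[str, object]:
    """Turn gene counts into per-chromosome and summary node attributes."""
    tally: Counter = Counter()
    for gene, count in gene_counts.items():
        tally[gene_to_chrom.get(gene, "chrUnknown")] += count
    # Fixed chromosome order first, then anything else (e.g. chrUnknown) sorted.
    ordered = list(chroms) + sorted(c for c in tally if c not in chroms)
    attrs: Dict[str, object] = {f"{c}_count": tally[c] for c in chroms}
    attrs["chromosomeCounts"] = [f"{c}={tally[c]}" for c in ordered if tally[c]]
    attrs["chromosomeCountsJson"] = json.dumps({c: tally[c] for c in ordered if tally[c]})
    return attrs


if __name__ == "__main__":
    tree = {"root": ["a", "b"]}
    genes = {"a": ["TP53", "BRCA1"], "b": ["TP53", "XIST"]}
    totals = aggregate_gene_counts(tree, genes, ["root"])
    table = {"TP53": "chr17", "BRCA1": "chr17", "XIST": "chrX"}
    print(chromosome_attributes(totals["root"], table)["chromosomeCounts"])
```

An explicit stack is used instead of recursion so that deep hierarchies do not hit Python's recursion limit.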

Error Handling & Validation

  • Emit a warning (progress message) when the hierarchy aspect is missing or malformed, then fail fast unless --allow-nonhierarchy is passed.
  • Error when HCX::interactionNetworkUUID is missing at both node and network level unless --allow-missing-uuid is set.
  • Warn when HCX::members references an interaction node ID not found in the fetched network; count and report these.
  • Warn when a member node lacks any usable gene attribute; count unresolved genes separately.
  • Record the number of unresolved genes; add a network attribute chromosomeMappingUnresolved for transparency.
  • Validate that attributeDeclarations include the new attributes; adjust metaData elementCount where needed.
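The UUID and member checks above might be expressed as the following sketch, which works on plain attribute dictionaries rather than CX2 objects. The function names, the exception class, and the returned missing-ID counter are hypothetical; the attribute name and the node-then-network priority come from this plan.

```python
from typing import Dict, Hashable, Iterable, List, Mapping, Optional, Tuple

DEFAULT_UUID_ATTR = "HCX::interactionNetworkUUID"


class MissingInteractionUUID(ValueError):
    """Raised when no interaction network UUID is found (hypothetical class)."""


def resolve_interaction_uuid(node_attrs: Mapping[str, object],
                             network_attrs: Mapping[str, object],
                             node_attr: str = DEFAULT_UUID_ATTR,
                             network_attr: str = DEFAULT_UUID_ATTR,
                             allow_missing: bool = False) -> Optional[str]:
    """Prefer the node-level UUID, then the network-level one."""
    for attrs, key in ((node_attrs, node_attr), (network_attrs, network_attr)):
        value = attrs.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip()
    if allow_missing:
        return None
    raise MissingInteractionUUID(
        f"no {node_attr!r} on the node and no {network_attr!r} on the network")


def translate_members(member_ids: Iterable[Hashable],
                      node_to_genes: Mapping[Hashable, List[str]]) -> Tuple[List[str], int]:
    """Map member node IDs to genes; return the genes and a count of unknown IDs."""
    genes: List[str] = []
    missing = 0
    for member_id in member_ids:
        if member_id in node_to_genes:
            genes.extend(node_to_genes[member_id])
        else:
            missing += 1  # reported later as a warning, not an error
    return genes, missing


if __name__ == "__main__":
    print(resolve_interaction_uuid({}, {DEFAULT_UUID_ATTR: "abc-123"}))
    print(translate_members([1, 2, 99], {1: ["TP53"], 2: ["BRCA1", "BRCA2"]}))
```

Returning the missing count instead of logging inside the helper keeps the warning policy in one place in the caller.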

Testing Plan

  • Unit tests (pytest) for:
    • Extraction of NDEx UUID from node vs network attributes.
    • Mapping HCX::members IDs to genes given an interaction network fixture.
    • Gene extraction from nodes (single vs list attribute, delimiter handling).
    • Resolver behavior with local map (including alias handling and normalization).
    • Hierarchy aggregation (post-order) on a small synthetic HCX fixture.
    • CX2 mutation: ensure node attributes are added and declared, metaData updated.
  • Integration smoke test: run CLI on foo.cx2 (non-hierarchical) expecting a clear error message.

Package/Module Renaming

  • Rename Python package directory from updatenetwork to chromloc (or similar concise, descriptive name aligned with the repo).
  • Rename update.py → annotate.py (core logic) to reflect the chromosome-location annotation purpose; cli.py keeps its name but needs updated imports.
  • Update pyproject.toml entry points and any imports to the new package/module names.
  • Keep console script name user-facing, e.g., cytoscape-chromloc.

Implementation Steps

  1. Add resolver module (chromloc/resolvers.py) with Local/MyGene/Ensembl implementations + cache wrapper.
  2. Add NDEx fetcher helper (chromloc/ndex_fetch.py) to download and cache interaction networks by UUID.
  3. Extend annotate.py to:
    • Parse CLI options (passed through run_update).
    • Build hierarchy graph, aggregate genes, resolve chromosomes, and annotate nodes.
    • Keep existing progress messaging pattern.
    • Generate per-chromosome count attributes for the human chromosome list by default (extendable for other species).
  4. Update cli.py for new arguments and pass them into run_update (now importing from chromloc.annotate).
  5. Update requirements.txt to include ndex2 client (and requests if the fetcher needs it explicitly).
  6. Write pytest suite under tests/ with fixtures for hierarchy CX2 snippets.
  7. Document usage in README.rst (new options, expected outputs, example command).
  8. Add a small helper script scripts/build_gene_chr_map.py that:
    • Downloads/reads a public gene annotation source (e.g., NCBI gene info or Ensembl BioMart export) for human.
    • Produces a UTF-8 TSV symbol\tchromosome and a JSON dictionary {symbol: chromosome}.
    • Normalizes symbols to uppercase, strips version suffixes, and skips non-standard chromosomes unless --include-alt is passed.
    • Accepts --input (path or URL), --output-tsv, --output-json, --species (default human), and --allow-ambiguous (keep first mapping when multiple chromosomes exist).
    • Includes a brief README section on how to refresh the dataset and recommended refresh cadence.
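The core of the helper script could be sketched as below, leaving out downloading and argument parsing. The input is assumed to be an iterable of (symbol, chromosome) pairs already extracted from the annotation source, and two behaviors are assumptions not stated in this plan: genes mapped to multiple chromosomes are dropped unless allow_ambiguous is set, and chromosome names are normalized to a "chr" prefix with MT rewritten as chrM. Function names are hypothetical.

```python
import json
import re
from typing import Dict, Iterable, Set, Tuple

STANDARD_CHROMS: Set[str] = ({f"chr{i}" for i in range(1, 23)}
                             | {"chrX", "chrY", "chrM"})
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def normalize_chrom(raw: str) -> str:
    """Normalize names like '17', 'chr17', 'MT' to 'chr17' / 'chrM' (assumed rule)."""
    name = raw.strip()
    if name[:3].lower() == "chr":
        name = name[3:]
    name = name.upper()
    return "chrM" if name in {"M", "MT"} else f"chr{name}"


def build_map(rows: Iterable[Tuple[str, str]],
              include_alt: bool = False,
              allow_ambiguous: bool = False) -> Dict[str, str]:
    """Build {SYMBOL: chromosome} from (symbol, chromosome) pairs."""
    mapping: Dict[str, str] = {}
    ambiguous: Set[str] = set()
    for symbol, chrom in rows:
        key = _VERSION_SUFFIX.sub("", symbol.strip()).upper()
        value = normalize_chrom(chrom)
        if not key or (value not in STANDARD_CHROMS and not include_alt):
            continue
        if key in mapping and mapping[key] != value:
            ambiguous.add(key)  # the first mapping stays in place for now
            continue
        mapping.setdefault(key, value)
    if not allow_ambiguous:
        # Assumption: without the flag, ambiguous genes are dropped entirely.
        for key in ambiguous:
            mapping.pop(key, None)
    return mapping


def write_outputs(mapping: Dict[str, str], tsv_path: str, json_path: str) -> None:
    """Write the map as UTF-8 TSV (symbol<TAB>chromosome) and as a JSON dict."""
    with open(tsv_path, "w", encoding="utf-8", newline="") as handle:
        for symbol in sorted(mapping):
            handle.write(f"{symbol}\t{mapping[symbol]}\n")
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(mapping, handle, indent=2, sort_keys=True)


if __name__ == "__main__":
    sample = [("tp53", "17"), ("MT-CO1", "MT"), ("DUP", "1"), ("DUP", "2")]
    print(build_map(sample))
    print(build_map(sample, allow_ambiguous=True))
```

Sorting the output keys keeps refreshed files diff-friendly, which helps when the dataset is regenerated on a schedule.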

Visualization (pie chart style)

  • Add/modify visualProperties aspect to define a node fill pie chart using the per-chromosome count attributes.
    • Use Cytoscape pie-chart custom graphics: map chr1_count … chr22_count, chrX_count, chrY_count, chrM_count to pie slices in a fixed order for consistent color mapping.
    • Define a discrete palette (e.g., ColorBrewer qualitative set) and store in style properties.
    • Ensure the style references attributes by name; a slice whose attribute is missing or zero should render with zero size.
  • If an existing style is present, append a new style entry (do not overwrite). Provide a network attribute flag chromosomePieStyleApplied=true to avoid reapplying.
  • Keep sizing/opacity unchanged; only set node fill (custom graphics) to the pie chart. Text/labels remain as-is.

Risks & Mitigations

  • Chromosome map staleness: provide documented script to refresh the local map and recommend cadence.
  • Hierarchy variability: handle missing aspects with a clear error; make attribute names configurable.
  • CX2 type constraints: use list_of_string + JSON string to stay within supported types.
  • Performance on large trees: single-pass post-order traversal with memoized resolutions; batch NDEx fetches via caching.