Column names in cpm_gene_counts_colnames do not match expected aliases #166

@anaperes95

Description

Operating System

Windows 10

Other Linux

No response

Workflow Version

25.10.0

Workflow Execution

Command line (Local)

Other workflow execution

No response

EPI2ME Version

No response

CLI command run

nextflow run epi2me-labs/wf-transcriptomes --de_analysis --direct_rna --fastq '/mnt/d/ALL_WTS_FASTQ_04112025/DGE_input/' --minimap2_index_opts '-k 15' --transcriptome_source 'reference-guided' --ref_annotation '/mnt/d/ALL_WTS_FASTQ_04112025/DGE_input/Homo_sapiens.GRCh38.115.gtf.gz' --ref_genome '/mnt/d/ALL_WTS_FASTQ_04112025/DGE_input/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz' --sample_sheet '/mnt/d/ALL_WTS_FASTQ_04112025/DGE_input/sample_sheet.csv' -profile standard

Workflow Execution - CLI Execution Profile

standard (default)

What happened?

We hit an error in pipeline:differential_expression:deAnalysis when running differential expression on 6 samples (3 control, 3 treated). The cause appears to be that read.csv() applies make.names() to the column names of the counts file by default (check.names = TRUE), while the sample IDs in the sample sheet are left unmodified. The sample IDs in the two files then no longer match, and the process fails with: Column names in cpm_gene_counts_colnames do not match expected aliases.
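One way to test this hypothesis without starting R is to flag any alias that is not already a syntactically valid R name, since those are the names make.names() rewrites (for example by replacing invalid characters with "." or prepending "X"). The following is a minimal sketch, assuming ASCII aliases and ignoring R reserved words. The helper name flag_mangled_aliases and the two extra aliases ALL-14 and 14_ALL are hypothetical and are only included to show a positive result.

```shell
# Print the aliases that R's make.names() would rewrite, i.e. names that are
# not already syntactically valid R identifiers.
# Assumptions: ASCII names only; reserved words such as "if" are not checked.
flag_mangled_aliases() {
  # A valid name starts with a letter, or with a dot not followed by a digit,
  # and otherwise contains only letters, digits, dots and underscores.
  grep -Ev '^([A-Za-z]|\.[A-Za-z._])[A-Za-z0-9._]*$|^\.$' || true
}

# The six aliases reported in this issue, plus two hypothetical ones
# (ALL-14 and 14_ALL) that make.names() would change.
mangled=$(printf '%s\n' ALL_14 ALL_22 ALL_13 ALL_63 ALL_65 ALL_66 ALL-14 14_ALL |
  flag_mangled_aliases)
printf '%s\n' "$mangled"
```

Only the two hypothetical aliases are printed. The six aliases from this run pass the check.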

sample_sheet.csv

user:/mnt/d/ALL_WTS_FASTQ_04112025/work/95/7ac5d50106a84cdb26fa3f9bc25e30$ cat cpm_gene_counts_colnames
ALL_14 ALL_22 ALL_13 ALL_63 ALL_65 ALL_66
user:/mnt/d/ALL_WTS_FASTQ_04112025/work/95/7ac5d50106a84cdb26fa3f9bc25e30$ cat expected_colnames
control treated
user:/mnt/d/ALL_WTS_FASTQ_04112025/work/95/7ac5d50106a84cdb26fa3f9bc25e30$ cat merged_counts_colnames
ALL_14 ALL_22 ALL_13 ALL_63 ALL_65 ALL_66
user:/mnt/d/ALL_WTS_FASTQ_04112025/work/95/7ac5d50106a84cdb26fa3f9bc25e30$ cat merged_filtered_colnames
ALL_14 ALL_22 ALL_13 ALL_63 ALL_65 ALL_66
user:/mnt/d/ALL_WTS_FASTQ_04112025/work/95/7ac5d50106a84cdb26fa3f9bc25e30$ head all_counts.tsv
ALL_13 ALL_14 ALL_22 ALL_63 ALL_65 ALL_66 Reference
323.0 3217.0 674.0 3962.0 10268.0 2911.0 MSTRG.5826.2
295.0 3255.0 688.0 3911.0 10254.0 2949.0 ENST00000534336
297.0 3154.0 705.0 3883.0 10216.0 2865.0 ENST00000619449
298.0 3175.0 702.0 3824.0 9953.0 2840.0 MSTRG.5826.14

Relevant log output

ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)'

Caused by:
  Process `pipeline:differential_expression:deAnalysis (1)` terminated with an error exit status (70)


Command executed:

  de_analysis.R         --annotation annotation.gtf         --min_samps_gene_expr 3         --min_samps_feature_expr 1         --min_gene_expr 10         --min_feature_expr 3         --sample_sheet sample_sheet.csv         --all_counts all_counts.tsv         --de_out_dir de_analysis         --merged_out_dir merged

  # Check that the original aliases in the input TSV have not been mangled by R's read.csv or other functions
  # Sample column order should be in the same order as the input sample sheet
  alias_col=$(awk -v RS=',' '/alias/{print NR; exit}' "sample_sheet.csv")
  cut -d',' -f3 sample_sheet.csv | tail -n +2 | paste -sd '     ' - | sed 's/   *$//' > expected_colnames

  head -1 de_analysis/cpm_gene_counts.tsv | cut -f2-  > cpm_gene_counts_colnames
  head -1 merged/all_gene_counts.tsv  > merged_counts_colnames
  head -1 merged/filtered_transcript_counts_with_genes.tsv | cut -f3-  > merged_filtered_colnames

  # Check for mismatches in sample column names
  for file in cpm_gene_counts_colnames merged_counts_colnames merged_filtered_colnames; do
  if ! diff -q $file expected_colnames > /dev/null; then
      echo "Column names in $file do not match expected aliases."
      exit 70
  fi
  done

Command exit status:
  70

Command output:
  Loading counts, conditions and parameters.
  Checking annotation file type.
  Annotation file type is gtf.
  Checking annotation file for presence of transcript_id versions.
  Annotation file transcript_ids include versions.
  Loading annotation database.
  Filtering counts using DRIMSeq.
  Building model matrix.
  Sum transcript counts into gene counts.
  Running differential gene expression analysis using edgeR.
  Running differential transcript usage analysis using DEXSeq.
  null device
            1
  stageR analysis
  Running stageR analysis on the differential transcript usage results.
  Column names in cpm_gene_counts_colnames do not match expected aliases.

Command error:
  Loading counts, conditions and parameters.
  Checking annotation file type.
  Annotation file type is gtf.
  Checking annotation file for presence of transcript_id versions.
  Annotation file transcript_ids include versions.
  Loading annotation database.
  Import genomic features from the file as a GRanges object ... OK
  Prepare the 'metadata' data frame ... OK
  Make the TxDb object ... OK
  'select()' returned 1:many mapping between keys and columns
  Filtering counts using DRIMSeq.
  Building model matrix.
  Sum transcript counts into gene counts.
  Running differential gene expression analysis using edgeR.
  Running differential transcript usage analysis using DEXSeq.
  converting counts to integer mode
  Warning message:
  In DESeqDataSet(rse, design, ignoreRank = TRUE) :
    some variables in design formula are characters, converting to factors
  -- note: fitType='parametric', but the dispersion trend was not well captured by the
     function: y = a/x + b, and a local regression fit was automatically substituted.
     specify fitType='local' or 'mean' to avoid this message next time.
  Fit for gene/exon MSTRG.2096 threw the next warning(s): Too much damping - convergence tolerance not achievable
  Warning message:
  In vst(exp(alleffects), object) :
    Dispersion function not parametric, applying log2(x+ 1) instead of vst...

  null device
            1
  stageR analysis

  Attaching package: 'stageR'

  The following object is masked from 'package:methods':

      getMethod

  Running stageR analysis on the differential transcript usage results.
  The returned adjusted p-values are based on a stage-wise testing approach and are only valid for the provided target OFDR level of 10%. If a different target OFDR level is of interest,the entire adjustment should be re-run.

  Column names in cpm_gene_counts_colnames do not match expected aliases.

Work dir:
  /mnt/d/ALL_WTS_FASTQ_04112025/work/95/7ac5d50106a84cdb26fa3f9bc25e30

Container:
  ontresearch/wf-transcriptomes:shaaaf20a5a0e76f9e18bad21af639a6b69e4a31a2f

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details
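One detail in the check quoted under "Command executed" may be relevant: it assigns alias_col, but then builds expected_colnames from a fixed field (cut -d',' -f3), and expected_colnames in this work directory contains "control treated" rather than sample aliases. Below is a minimal sketch of a lookup that selects the alias column by header name instead. It is not the workflow's code. The function name expected_aliases, the tab separator, the CRLF stripping, and the sample sheet contents are all assumptions made for illustration.

```shell
# Build the expected header from the sample sheet's "alias" column, looked up
# by header name rather than by a fixed field number.
# Assumptions: comma-separated sheet with a header row, no quoted fields,
# and tab-separated output to match the TSV count files.
expected_aliases() {
  # Strip carriage returns first so a sheet saved with Windows line endings
  # still matches the "alias" header when it is the last column.
  tr -d '\r' < "$1" | awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "alias") col = i; next }
    col     { printf "%s%s", sep, $col; sep = "\t" }
    END     { print "" }
  '
}

# Hypothetical two-sample sheet with CRLF line endings.
workdir=$(mktemp -d)
printf 'barcode,alias,condition\r\nbarcode01,ALL_14,control\r\nbarcode02,ALL_63,treated\r\n' \
  > "$workdir/sample_sheet.csv"

expected=$(expected_aliases "$workdir/sample_sheet.csv")
printf '%s\n' "$expected"
```

With this hypothetical sheet the function prints the two aliases separated by a tab, regardless of which position the alias column occupies.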

Application activity log entry

Were you able to successfully run the latest version of the workflow with the demo data?

yes

Other demo data information
