Column names in cpm_gene_counts_colnames do not match expected aliases

### Operating System

Windows 10

### Other Linux

_No response_

### Workflow Version

25.10.0

### Workflow Execution

Command line (Local)

### Other workflow execution

_No response_

### EPI2ME Version

_No response_

### CLI command run

nextflow run epi2me-labs/wf-transcriptomes --de_analysis --direct_rna --fastq '/mnt/d/ALL_WTS_FASTQ_04112025/DGE_input/' --minimap2_index_opts '-k 15' --transcriptome_source 'reference-guided' --ref_annotation '/mnt/d/ALL_WTS_FASTQ_04112025/DGE_input/Homo_sapiens.GRCh38.115.gtf.gz' --ref_genome '/mnt/d/ALL_WTS_FASTQ_04112025/DGE_input/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz' --sample_sheet '/mnt/d/ALL_WTS_FASTQ_04112025/DGE_input/sample_sheet.csv' -profile standard

### Workflow Execution - CLI Execution Profile

standard (default)

### What happened?

We encountered an error during pipeline:differential_expression:deAnalysis when running differential expression with 6 samples (3 control, 3 treated). The issue seems related to read.csv() automatically applying make.names() to column names in the counts file, while sample IDs in the samplesheet are not modified. This causes mismatched sample IDs between the two files and leads to the error: Column names in cpm_gene_counts_colnames do not match expected aliases.
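As a quick way to test the `make.names()` hypothesis, the sketch below flags aliases that are not syntactically valid R names, since those are the ones `make.names()` would rewrite. The pattern is an ASCII-only approximation (reserved words such as `if` or `TRUE` are not handled), and `1_ctrl` and `sample-A` are hypothetical aliases added only to show what a flagged name looks like.

```shell
# Approximate pattern for names that R's make.names() leaves unchanged
# (ASCII only; reserved words such as "if" or "TRUE" are not handled).
valid_r_name='^([A-Za-z]|\.([A-Za-z._]|$))[A-Za-z0-9._]*$'

changed=""
# The first six aliases come from this run; the last two are hypothetical.
for alias in ALL_14 ALL_22 ALL_13 ALL_63 ALL_65 ALL_66 1_ctrl sample-A; do
    if ! printf '%s\n' "$alias" | grep -Eq "$valid_r_name"; then
        changed="$changed $alias"
    fi
done
changed=${changed# }
echo "aliases make.names() would likely alter: $changed"
```

Under this approximation, none of the six aliases from this run are flagged, only the two hypothetical ones.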

[sample_sheet.csv](https://github.com/user-attachments/files/23459699/sample_sheet.csv)

---------------------------------------------------------------------------------------------------------

user:/mnt/d/ALL_WTS_FASTQ_04112025/work/95/7ac5d50106a84cdb26fa3f9bc25e30$ cat cpm_gene_counts_colnames
ALL_14 ALL_22 ALL_13 ALL_63 ALL_65 ALL_66
user:/mnt/d/ALL_WTS_FASTQ_04112025/work/95/7ac5d50106a84cdb26fa3f9bc25e30$ cat expected_colnames
control treated
user:/mnt/d/ALL_WTS_FASTQ_04112025/work/95/7ac5d50106a84cdb26fa3f9bc25e30$ cat merged_counts_colnames
ALL_14 ALL_22 ALL_13 ALL_63 ALL_65 ALL_66
user:/mnt/d/ALL_WTS_FASTQ_04112025/work/95/7ac5d50106a84cdb26fa3f9bc25e30$ cat merged_filtered_colnames
ALL_14 ALL_22 ALL_13 ALL_63 ALL_65 ALL_66
user:/mnt/d/ALL_WTS_FASTQ_04112025/work/95/7ac5d50106a84cdb26fa3f9bc25e30$ head all_counts.tsv
ALL_13 ALL_14 ALL_22 ALL_63 ALL_65 ALL_66 Reference
323.0 3217.0 674.0 3962.0 10268.0 2911.0 MSTRG.5826.2
295.0 3255.0 688.0 3911.0 10254.0 2949.0 ENST00000534336
297.0 3154.0 705.0 3883.0 10216.0 2865.0 ENST00000619449
298.0 3175.0 702.0 3824.0 9953.0 2840.0 MSTRG.5826.14
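
For what it is worth, `expected_colnames` above holds condition labels rather than aliases. That may point at the fixed `-f3` in the check script: the log shows `alias_col` being computed, but the `cut` call uses `-f3`. Below is a minimal sketch of looking the column up by header name instead. The two-row sample sheet and its column order (`barcode,alias,condition`) are assumptions for illustration, not the contents of the attached file.

```shell
# Hypothetical sample sheet; the real column order is an assumption here.
workdir=$(mktemp -d)
printf 'barcode,alias,condition\n' > "$workdir/sample_sheet.csv"
printf 'barcode01,ALL_14,control\nbarcode02,ALL_63,treated\n' >> "$workdir/sample_sheet.csv"

# Find the 1-based index of the "alias" column from the header row.
alias_col=$(head -1 "$workdir/sample_sheet.csv" | tr ',' '\n' | grep -nx 'alias' | cut -d: -f1)

# Build the expected header from that column, and from a fixed field for comparison.
expected=$(cut -d',' -f"$alias_col" "$workdir/sample_sheet.csv" | tail -n +2 | paste -sd ' ' -)
fixed_f3=$(cut -d',' -f3 "$workdir/sample_sheet.csv" | tail -n +2 | paste -sd ' ' -)

echo "alias column: $alias_col"
echo "by header:    $expected"
echo "fixed -f3:    $fixed_f3"
rm -rf "$workdir"
```

With this layout the header lookup yields the aliases, while the fixed third field yields the condition labels, which would fail the `diff` against the counts headers even though no alias was altered.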

### Relevant log output

```shell
ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)'

Caused by:
  Process `pipeline:differential_expression:deAnalysis (1)` terminated with an error exit status (70)


Command executed:

  de_analysis.R         --annotation annotation.gtf         --min_samps_gene_expr 3         --min_samps_feature_expr 1         --min_gene_expr 10         --min_feature_expr 3         --sample_sheet sample_sheet.csv         --all_counts all_counts.tsv         --de_out_dir de_analysis         --merged_out_dir merged

  # Check that the original aliases in the input TSV have not been mangled by R's read.csv or other functions
  # Sample column order should be in the same order as the input sample sheet
  alias_col=$(awk -v RS=',' '/alias/{print NR; exit}' "sample_sheet.csv")
  cut -d',' -f3 sample_sheet.csv | tail -n +2 | paste -sd '     ' - | sed 's/   *$//' > expected_colnames

  head -1 de_analysis/cpm_gene_counts.tsv | cut -f2-  > cpm_gene_counts_colnames
  head -1 merged/all_gene_counts.tsv  > merged_counts_colnames
  head -1 merged/filtered_transcript_counts_with_genes.tsv | cut -f3-  > merged_filtered_colnames

  # Check for mismatches in sample column names
  for file in cpm_gene_counts_colnames merged_counts_colnames merged_filtered_colnames; do
  if ! diff -q $file expected_colnames > /dev/null; then
      echo "Column names in $file do not match expected aliases."
      exit 70
  fi
  done

Command exit status:
  70

Command output:
  Loading counts, conditions and parameters.
  Checking annotation file type.
  Annotation file type is gtf.
  Checking annotation file for presence of transcript_id versions.
  Annotation file transcript_ids include versions.
  Loading annotation database.
  Filtering counts using DRIMSeq.
  Building model matrix.
  Sum transcript counts into gene counts.
  Running differential gene expression analysis using edgeR.
  Running differential transcript usage analysis using DEXSeq.
  null device
            1
  stageR analysis
  Running stageR analysis on the differential transcript usage results.
  Column names in cpm_gene_counts_colnames do not match expected aliases.

Command error:
  Loading counts, conditions and parameters.
  Checking annotation file type.
  Annotation file type is gtf.
  Checking annotation file for presence of transcript_id versions.
  Annotation file transcript_ids include versions.
  Loading annotation database.
  Import genomic features from the file as a GRanges object ... OK
  Prepare the 'metadata' data frame ... OK
  Make the TxDb object ... OK
  'select()' returned 1:many mapping between keys and columns
  Filtering counts using DRIMSeq.
  Building model matrix.
  Sum transcript counts into gene counts.
  Running differential gene expression analysis using edgeR.
  Running differential transcript usage analysis using DEXSeq.
  converting counts to integer mode
  Warning message:
  In DESeqDataSet(rse, design, ignoreRank = TRUE) :
    some variables in design formula are characters, converting to factors
  -- note: fitType='parametric', but the dispersion trend was not well captured by the
     function: y = a/x + b, and a local regression fit was automatically substituted.
     specify fitType='local' or 'mean' to avoid this message next time.
  Fit for gene/exon MSTRG.2096 threw the next warning(s): Too much damping - convergence tolerance not achievable
  Warning message:
  In vst(exp(alleffects), object) :
    Dispersion function not parametric, applying log2(x+ 1) instead of vst...

  null device
            1
  stageR analysis

  Attaching package: 'stageR'

  The following object is masked from 'package:methods':

      getMethod

  Running stageR analysis on the differential transcript usage results.
  The returned adjusted p-values are based on a stage-wise testing approach and are only valid for the provided target OFDR level of 10%. If a different target OFDR level is of interest,the entire adjustment should be re-run.

  Column names in cpm_gene_counts_colnames do not match expected aliases.

Work dir:
  /mnt/d/ALL_WTS_FASTQ_04112025/work/95/7ac5d50106a84cdb26fa3f9bc25e30

Container:
  ontresearch/wf-transcriptomes:shaaaf20a5a0e76f9e18bad21af639a6b69e4a31a2f

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details
```

### Application activity log entry

```shell

```

### Were you able to successfully run the latest version of the workflow with the demo data?

yes

### Other demo data information

```shell

```

Column names in cpm_gene_counts_colnames do not match expected aliases #166