More clusters from hierarchical clustering (down to 30%) than from a direct 40% CD-HIT run

I’ve observed an unexpected result when comparing direct clustering with CD-HIT at a 40% identity threshold against hierarchical clustering down to 30%.

➤ Direct clustering (-c 0.4):
I ran cd-hit directly on my sequences with a 0.4 identity threshold and the recommended word size for that threshold (-n 2):
```
cd-hit -i input.fasta -o clustr40.fasta -c 0.4 -n 2 -d 0 -T 8 -M 16000
```
This produced 1,754 clusters (statistics for the representative sequences):
```
file  format  type     num_seqs  sum_len  min_len  avg_len  max_len
-     FASTA   Protein     1,754  101,317       18     57.8      100
```
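As a sanity check on that number, here is a minimal sketch for counting clusters straight from the CD-HIT output. It assumes the cluster file sits next to the representatives as `clustr40.fasta.clstr` (the `-o` name plus `.clstr`); the helper names `count_clusters` and `count_fasta` are made up for illustration.

```shell
# Count clusters in a CD-HIT .clstr file: every cluster begins with a ">Cluster N" line.
count_clusters() {
    grep -c '^>Cluster' "$1"
}

# Count records in a FASTA file: every record begins with a ">" header line.
count_fasta() {
    grep -c '^>' "$1"
}

# File names assumed from the -o value used above; skipped if the files are absent.
if [ -f clustr40.fasta.clstr ] && [ -f clustr40.fasta ]; then
    echo "clusters:        $(count_clusters clustr40.fasta.clstr)"
    echo "representatives: $(count_fasta clustr40.fasta)"
fi
```

Both counts should agree, since each cluster contributes exactly one representative sequence.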

➤ Hierarchical clustering:
However, when I used hierarchical clustering, I got 5,242 clusters.
The commands I ran are shown below:
```
cd-hit -i input.fasta -o nr80 -c 0.8 -n 5 -d 0 -M 16000 -T 16
cd-hit -i nr80 -o nr60 -c 0.6 -n 4 -d 0 -M 16000 -T 16
psi-cd-hit.pl -i nr60 -o nr30 -c 0.3
clstr_rev.pl nr80.clstr nr60.clstr > nr80-60.clstr
clstr_rev.pl nr80-60.clstr nr30.clstr > nr80-60-30.clstr
```
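To see at which stage the reduction stalls, the sequence count after each step can be compared. This is only a sketch: it assumes each step wrote its representatives to the `-o` name used above (`nr80`, `nr60`, `nr30`), and `stage_counts` is a hypothetical helper, not part of CD-HIT.

```shell
# Print "<file><TAB><number of FASTA records>" for each file given,
# or "missing" when a stage did not produce its output file.
stage_counts() {
    for f in "$@"; do
        if [ -f "$f" ]; then
            printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
        else
            printf '%s\tmissing\n' "$f"
        fi
    done
}

# Input plus the representatives from each clustering stage (names assumed from the commands above).
stage_counts input.fasta nr80 nr60 nr30

# The merged cluster file should contain one ">Cluster" entry per nr30 representative.
if [ -f nr80-60-30.clstr ]; then
    printf 'nr80-60-30.clstr\t%s\n' "$(grep -c '^>Cluster' nr80-60-30.clstr)"
fi
```

If the count barely drops between `nr60` and `nr30`, the psi-cd-hit.pl step is the place to look; if it matches the 5,242 figure at `nr30`, the clstr_rev.pl merging is not what inflates the number.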
I am using CD-HIT version 4.8.1 (built on Apr  7 2021).

🙋 Question:
Why does hierarchical clustering with a final 30% cutoff produce more clusters than a direct 40% clustering?
Shouldn't a lower threshold produce fewer, larger clusters (i.e., be more inclusive)?
How can I get fewer representative sequences (i.e., fewer clusters) when clustering at a 30% threshold?

💡 Additional Context:
The input dataset is protein sequences in FASTA format.
I’m following the standard workflow for hierarchical clustering as described in the documentation.
I'd appreciate any clarification on whether this is expected behavior, or if I might have misunderstood the intended use of clstr_rev.pl or psi-cd-hit.pl.

Thank you for developing and maintaining CD-HIT; it’s an incredibly useful tool for large-scale sequence analysis.

Best regards,
Gabriel
