I’ve observed an unexpected result when comparing direct clustering using CD-HIT at 40% threshold versus hierarchical clustering down to 30%.
➤ Direct clustering (-c 0.4):
I ran cd-hit directly on my sequences with an identity threshold of 0.4 and the recommended word size for that threshold (-n 2):
cd-hit -i input.fasta -o clustr40.fasta -c 0.4 -n 2 -d 0 -T 8 -M 16000
This produced 1,754 clusters (one representative sequence per cluster):
file  format  type     num_seqs  sum_len  min_len  avg_len  max_len
-     FASTA   Protein     1,754  101,317       18     57.8      100
➤ Hierarchical clustering:
However, when I used hierarchical clustering (80% → 60% → 30%), I got 5,242 clusters.
My commands are shown below:
cd-hit -i input.fasta -o nr80 -c 0.8 -n 5 -d 0 -M 16000 -T 16
cd-hit -i nr80 -o nr60 -c 0.6 -n 4 -d 0 -M 16000 -T 16
psi-cd-hit.pl -i nr60 -o nr30 -c 0.3
clstr_rev.pl nr80.clstr nr60.clstr > nr80-60.clstr
clstr_rev.pl nr80-60.clstr nr30.clstr > nr80-60-30.clstr
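For reference, the counts I quote can be checked with something like the sketch below. It assumes the file names from my commands above, that every cluster in a .clstr file begins with a ">Cluster" header line, and that every FASTA record begins with ">". The helper names count_clusters and count_seqs are my own and are not part of CD-HIT.

```shell
#!/bin/sh
# Hypothetical helpers (not part of CD-HIT) for checking cluster counts.

# Count clusters in a .clstr file: one ">Cluster N" header line per cluster.
count_clusters() {
    grep -c '^>Cluster' "$1"
}

# Count records in a FASTA file: one ">" header line per sequence.
count_seqs() {
    grep -c '^>' "$1"
}

# Report clusters at each stage, skipping files that are not present.
for f in nr80.clstr nr60.clstr nr30.clstr nr80-60-30.clstr; do
    if [ -f "$f" ]; then
        printf '%s\t%s clusters\n' "$f" "$(count_clusters "$f")"
    fi
done

# Report representative sequences for the direct and hierarchical results.
for f in clustr40.fasta nr30; do
    if [ -f "$f" ]; then
        printf '%s\t%s sequences\n' "$f" "$(count_seqs "$f")"
    fi
done
```

If everything ran as intended, the number of representatives in nr30 should equal the number of clusters in both nr30.clstr and nr80-60-30.clstr, since clstr_rev.pl only merges member lists and should not change the cluster count.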
I am using CD-HIT version 4.8.1 (built on Apr 7 2021).
🙋 Question:
Why does hierarchical clustering with a final 30% cutoff produce more clusters than a direct 40% clustering?
Shouldn't a lower threshold produce fewer, larger clusters (i.e., be more inclusive)?
How can I get fewer clusters (i.e., fewer representative sequences) when clustering at a 30% threshold?
💡 Additional Context:
The input dataset is protein sequences in FASTA format.
I’m following the standard workflow for hierarchical clustering as described in the documentation.
I'd appreciate any clarification on whether this is expected behavior, or if I might have misunderstood the intended use of clstr_rev.pl or psi-cd-hit.pl.
Thank you for developing and maintaining CD-HIT — it’s an incredibly useful tool for large-scale sequence analysis.
Best regards,
Gabriel