More clusters from hierarchical clustering down to 30% than from a direct 40% CD-HIT run #148

@Gabriel-QIN

I’ve observed an unexpected result when comparing direct clustering with CD-HIT at a 40% identity threshold against hierarchical clustering down to 30%.

➤ Direct clustering (-c 0.4):
I clustered my sequences directly with cd-hit, using an identity threshold of 0.4 and the recommended word size (-n 2).

cd-hit -i input.fasta -o clustr40.fasta -c 0.4 -n 2 -d 0 -T 8 -M 16000

This gave 1,754 clusters:

file  format  type     num_seqs  sum_len  min_len  avg_len  max_len
-     FASTA   Protein     1,754  101,317       18     57.8      100
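
This count can be cross-checked independently of the stats table by counting FASTA header lines in the output file, since cd-hit writes one representative sequence per cluster. A minimal sketch; the function name `count_fasta_records` is my own and not part of CD-HIT:

```shell
# Count FASTA records by counting header lines (lines starting with '>').
# Hypothetical helper, not a CD-HIT tool.
count_fasta_records() {
    grep -c '^>' "$1"
}
```

For example, `count_fasta_records clustr40.fasta` should print the same number as num_seqs in the table above.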

➤ Hierarchical clustering:
However, hierarchical clustering gave 5,242 clusters.
These are the commands I ran:

cd-hit -i input.fasta -o nr80 -c 0.8 -n 5 -d 0 -M 16000 -T 16
cd-hit -i nr80 -o nr60 -c 0.6 -n 4 -d 0 -M 16000 -T 16
psi-cd-hit.pl -i nr60 -o nr30 -c 0.3
clstr_rev.pl nr80.clstr nr60.clstr > nr80-60.clstr
clstr_rev.pl nr80-60.clstr nr30.clstr > nr80-60-30.clstr
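
To make the comparison concrete, the number of clusters at each level can be read from the .clstr files. A minimal sketch, assuming the usual .clstr layout in which every cluster starts with a line beginning with `>Cluster`; `count_clusters` and `report_cluster_counts` are hypothetical helper names, not CD-HIT tools:

```shell
# Count clusters in a CD-HIT .clstr file.
# Assumption: each cluster starts with a header line beginning with ">Cluster".
count_clusters() {
    grep -c '^>Cluster' "$1"
}

# Print "<file><TAB><cluster count>" for every file given as an argument.
report_cluster_counts() {
    for f in "$@"; do
        printf '%s\t%s\n' "$f" "$(count_clusters "$f")"
    done
}
```

Running `report_cluster_counts nr80.clstr nr60.clstr nr30.clstr nr80-60-30.clstr` would show how the cluster count changes at each step of the workflow above.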

I am using CD-HIT version 4.8.1 (built on Apr 7 2021).

🙋 Question:
Why does hierarchical clustering with a final 30% cutoff produce more clusters than a direct 40% clustering?
Shouldn't a lower threshold produce fewer, larger clusters (i.e., be more inclusive)?
How can I get fewer representative sequences when clustering at a 30% threshold?

💡 Additional Context:
The input dataset is protein sequences in FASTA format.
I’m following the standard workflow for hierarchical clustering as described in the documentation.
I'd appreciate any clarification on whether this is expected behavior, or if I might have misunderstood the intended use of clstr_rev.pl or psi-cd-hit.pl.

Thank you for developing and maintaining CD-HIT — it’s an incredibly useful tool for large-scale sequence analysis.

Best regards,
Gabriel
