Alignments extending into non-homologous regions in cluster and linclust #1104

@alephreish

Sorry if this has been reported before.

I find the following behavior very counter-intuitive. Take the following two sequences:

>HBB
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
KEFTPPVQAAYQKVVAGVANALAHKYH
>HBB_alt
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
VKVYKVTYRGAHPPAEHFQWQPRKLAQ

They are identical in the first 120 residues and differ in the last 27 residues:

HBB:     MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
         mvhltpeeksavtalwgkvnvdevggealgrllvvypwtqrffesfgdlstpdavmgnpkvkahgkkvlgafsdglahldnlkgtfatlselhcdklhvdpenfrllgnvlvcvlahhfg
HBB_alt: MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGVKVYKVTYRGAHPPAEHFQWQPRKLAQ
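For anyone who wants to double-check the premise, here is a small Python sketch that measures the shared prefix of the two sequences above and counts coincidental matches in the divergent tails. The helper name `shared_prefix_len` is just something I made up for illustration; it is not part of MMseqs2.

```python
# The two sequences from seqs.faa, joined into single strings.
HBB = (
    "MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK"
    "VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG"
    "KEFTPPVQAAYQKVVAGVANALAHKYH"
)
HBB_ALT = (
    "MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK"
    "VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG"
    "VKVYKVTYRGAHPPAEHFQWQPRKLAQ"
)


def shared_prefix_len(a: str, b: str) -> int:
    """Return the number of leading positions at which a and b are identical."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


prefix = shared_prefix_len(HBB, HBB_ALT)
# Position-by-position matches in the tails (ungapped comparison).
tail_matches = sum(x == y for x, y in zip(HBB[prefix:], HBB_ALT[prefix:]))

print(f"lengths: {len(HBB)} and {len(HBB_ALT)}")
print(f"identical prefix: {prefix} residues")
print(f"tail length: {len(HBB) - prefix}, identical tail positions: {tail_matches}")
```

Both sequences are 147 residues long, the identical prefix is 120 residues, and none of the 27 tail positions match in an ungapped comparison.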

I would expect the local alignment to cover only the first 120 residues, yet it evidently spans the full length of both sequences: when clustering at -c 1, the two sequences end up in the same cluster:

$ mmseqs | grep Version
MMseqs2 Version: 45111b641859ed0ddd875b94d6fd1aef1a675b7e
$ mmseqs easy-cluster seqs.faa clust tmp -c 1 --min-seq-id 0.5
$ awk '{_[$1]=1}END{print length(_) " cluster(s)"}' clust_cluster.tsv
1 cluster(s)

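but when clustering with a relaxed coverage threshold and a stricter identity threshold, they do not, even though an alignment restricted to the first 120 residues would have 100% identity and cover 120/147 ≈ 82% of each sequence: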

$ mmseqs easy-cluster seqs.faa clust tmp -c 0.5 --min-seq-id 0.95
$ awk '{_[$1]=1}END{print length(_) " cluster(s)"}' clust_cluster.tsv
2 cluster(s)
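To spell out why these two results point to a full-length alignment, here is a rough Python sketch that checks both thresholds under two hypothetical alignments: one that stops at residue 120 and one that spans all 147 columns. It assumes a simplified model in which identity is identical columns divided by alignment columns and coverage is aligned residues divided by sequence length; MMseqs2 may compute or estimate these values differently, so treat this as a back-of-the-envelope check only. The function `passes` and the hypothesis labels are mine.

```python
def passes(aligned_cols, identical_cols, seq_len, min_cov, min_id):
    """Simplified check (an assumption, not MMseqs2's actual code):
    coverage = aligned columns / sequence length,
    identity = identical columns / aligned columns (ungapped)."""
    coverage = aligned_cols / seq_len
    identity = identical_cols / aligned_cols
    return coverage >= min_cov and identity >= min_id


SEQ_LEN = 147  # length of both sequences
SHARED = 120   # identical N-terminal residues

# (aligned columns, identical columns) for each hypothetical alignment
hypotheses = {
    "prefix-only": (SHARED, SHARED),
    "full-length": (SEQ_LEN, SHARED),
}
# (min coverage, min identity) for the two runs shown above
runs = {
    "-c 1 --min-seq-id 0.5": (1.0, 0.5),
    "-c 0.5 --min-seq-id 0.95": (0.5, 0.95),
}

for name, (aligned, identical) in hypotheses.items():
    for flags, (min_cov, min_id) in runs.items():
        merged = passes(aligned, identical, SEQ_LEN, min_cov, min_id)
        print(f"{name:12s} {flags:26s} -> {'1 cluster' if merged else '2 clusters'}")
```

Under this model only the full-length hypothesis reproduces the observed pattern (1 cluster in the first run, 2 clusters in the second); a prefix-only alignment would give the opposite result in both runs.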

Coverage mode and gap penalties have no influence on the outcome. The same happens with linclust.

Is there any way of controlling the alignment extension? I noticed that this does not happen with --single-step-clustering.
