Proposal to Switch Clustering Method from k-means to Leiden in RNAseqShinyApp PCA #34

@charleschuang1993

Description:

This issue proposes replacing the k-means clustering currently applied to the PCA component of RNAseqShinyApp with the Leiden algorithm. The comparison below covers the advantages and drawbacks of each method in our application context, followed by the rationale for choosing Leiden.


Current Approach: k-means Clustering in RNAseqShinyApp PCA

Pros:

  • Simplicity: k-means is straightforward to implement and widely understood, which eases both integration and interpretation in the PCA workflow.
  • Computational Efficiency: Performs well with lower-dimensional data and when clusters are expected to be similar in size and shape.
  • Availability: Extensive support in many libraries ensures quick deployment.

Cons:

  • Predefined Cluster Count: Requires specifying the number of clusters in advance, which can be problematic when the true number of clusters is uncertain after PCA.
  • Sensitivity to Initialization: Results can differ between runs that start from different centroid positions, so cluster assignments may be inconsistent unless the random seed is fixed or several restarts are used.
  • Cluster Assumptions: Assumes clusters are spherical and similar in scale, which may not align with the complex biological variability captured by PCA in RNA-seq data.
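
To make the first two drawbacks concrete, here is a minimal pure-Python sketch of Lloyd's k-means. It is not the code used in RNAseqShinyApp; the `kmeans` helper and the `pc_scores` values are hypothetical and stand in for PC1/PC2 sample scores. It shows that `k` has to be supplied by the caller and that the result is a function of the random seed used to pick the starting centroids.

```python
import random

def kmeans(points, k, seed, n_iter=50):
    """Plain Lloyd's algorithm; returns (labels, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids depend on the seed
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    # Within-cluster sum of squares for the final labels and centroids.
    inertia = sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )
    return labels, inertia

# Hypothetical PC1/PC2 scores for nine samples (illustrative values only).
pc_scores = [
    (-4.1, 0.2), (-3.8, -0.3), (-4.4, 0.5),
    (0.1, 3.9), (0.4, 4.2), (-0.2, 3.6),
    (4.0, -0.1), (3.7, 0.4), (4.3, -0.5),
]

# k must be chosen up front; each seed gives its own starting centroids.
for seed in range(5):
    labels, inertia = kmeans(pc_scores, k=3, seed=seed)
    print(seed, labels, round(inertia, 3))
```

Comparing the printed labels and inertia across seeds is a quick way to check how stable a given `k` is on real PCA scores.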

Proposed Approach: Leiden Clustering for PCA

Pros:

  • Adaptive Clustering: Does not require the number of clusters to be specified in advance. The number of communities follows from the neighbour graph and the resolution parameter, which is advantageous in exploratory PCA.
  • Enhanced Quality: Leiden guarantees that every community it returns is internally connected, and it tends to produce robust clusters that can better capture the complexity and heterogeneity of RNA-seq data.
  • Community Detection: As a graph-based method, it works on a neighbour graph built from the PCA scores rather than on distances to centroids, so it can identify non-spherical and irregularly sized clusters and offer a more nuanced view of the relationships in the PCA space.

Cons:

  • Higher Computational Demand: May require more computational resources than k-means, especially for larger datasets.
  • Parameter Tuning: Involves parameters specific to graph-based algorithms, such as the number of neighbours used to build the graph and the resolution, which adds complexity to the pipeline.
  • Preprocessing Complexity: Requires building a neighbour graph from the PCA scores before clustering, an extra data-preparation step compared with applying k-means directly to the scores.
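
The extra preprocessing and tuning can be sketched as follows. This is not a Leiden implementation, only the two ingredients around it: a k-nearest-neighbour graph built from PCA scores, and modularity with a resolution parameter, which is one of the quality functions Leiden can optimize. The helper names, the `pc_scores` values, and the choice of an unweighted kNN graph are assumptions made for illustration.

```python
import math

def knn_edges(points, k):
    """Undirected kNN graph: link each point to its k nearest neighbours."""
    edges = set()
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))  # store each edge once
    return sorted(edges)

def modularity(edges, labels, resolution=1.0):
    """Modularity of a partition on an unweighted graph:
    sum over communities of e_c / m - resolution * (K_c / 2m)^2."""
    m = len(edges)
    internal = {}   # e_c: number of edges inside each community
    total_deg = {}  # K_c: summed degree of each community
    for i, j in edges:
        if labels[i] == labels[j]:
            internal[labels[i]] = internal.get(labels[i], 0) + 1
        total_deg[labels[i]] = total_deg.get(labels[i], 0) + 1
        total_deg[labels[j]] = total_deg.get(labels[j], 0) + 1
    return sum(
        internal.get(c, 0) / m - resolution * (total_deg[c] / (2 * m)) ** 2
        for c in total_deg
    )

# Hypothetical PC1/PC2 scores for nine samples (illustrative values only).
pc_scores = [
    (-4.1, 0.2), (-3.8, -0.3), (-4.4, 0.5),
    (0.1, 3.9), (0.4, 4.2), (-0.2, 3.6),
    (4.0, -0.1), (3.7, 0.4), (4.3, -0.5),
]

edges = knn_edges(pc_scores, k=2)            # tuning knob 1: neighbours
three_groups = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # candidate partition
one_group = [0] * len(pc_scores)             # everything in one community

for resolution in (0.5, 1.0, 2.0):           # tuning knob 2: resolution
    print(resolution,
          modularity(edges, three_groups, resolution),
          modularity(edges, one_group, resolution))
```

Both `k` and `resolution` change the scores a graph-based method optimizes, so they would need sensible defaults, and probably UI controls, if Leiden is adopted in the app.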

Rationale for Adopting Leiden Clustering in RNAseqShinyApp PCA

Given the challenges associated with k-means, specifically the need to predefine the number of clusters and its reliance on spherical cluster assumptions, the Leiden algorithm presents a compelling alternative for our RNAseqShinyApp PCA workflow. The key benefits include:

  • Flexibility and Adaptation: Not having to fix the number of clusters in advance aligns well with the exploratory nature of PCA in RNA-seq data.
  • Improved Cluster Resolution: Leiden is more effective at detecting complex, non-spherical, and heterogeneous clusters, which are common in biological datasets.
  • Enhanced Interpretation: By better reflecting the true structure in our data, Leiden can provide deeper insights into the underlying biological processes represented in the PCA.

Despite the potential increase in computational overhead and tuning complexity, we expect the gains in clustering quality and robustness to make Leiden the better choice for the PCA analysis in RNAseqShinyApp.
