Commit 819f465

yaooqinn and Copilot committed

skills: enhance spark-advisor with DataFlint-inspired diagnostics

Add SQL-level, resource utilization, and lakehouse-specific diagnostics:

- Small files read/written detection
- Broadcast too large / SortMergeJoin→BroadcastHashJoin conversion
- Large cross join detection
- Long filter condition detection
- Full scan on partitioned/clustered tables
- Wasted cores / over-provisioned cluster
- Executor/driver memory sizing alerts
- Iceberg inefficient replace detection
- Delta Lake full scan detection

Inspired by the DataFlint OSS alert system (github.com/dataflint/spark).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

1 parent 48ce9cb commit 819f465

2 files changed: 233 additions & 0 deletions

skills/spark-advisor/SKILL.md (47 additions, 0 deletions)
@@ -142,6 +142,53 @@ These are the most impactful things to check. For the full diagnostic ruleset, s

| Bad config | Partition count, executor sizing | `env`, `summary` |
| AQE ineffective | Initial vs final plan difference | `sql-plan <id> --view initial/final` |
| Gluten fallback | Non-Transformer nodes in final plan | `sql-plan <id> --view final` |
| Small files read | Avg file size < 3MB, files > 100 | `sql <exec-id>` node metrics |
| Small files written | Avg file size < 3MB, files > 100 | `sql <exec-id>` node metrics |
| Broadcast too large | Broadcast data > 1GB | `sql <exec-id>` node metrics |
| SMJ→BHJ conversion | SMJ with small input side | `sql-plan <id> --view final` |
| Large cross join | Cross join rows > 10B | `sql <exec-id>` node metrics |
| Long filter condition | Filter condition > 1000 chars | `sql-plan <id> --view final` |
| Full scan on partitioned | Missing partition/cluster filters | `sql-plan <id> --view final` |
| Large partition size | Max partition > 5GB | `stage-summary <id>` |
| Wasted cores | Idle cores > 50% | `executors --all` |
| Memory over-provisioned | Max usage < 70% | `executors --all` |
| Driver memory risk | Driver heap > 95% | `executors --all` |
| Iceberg inefficient replace | Files replaced > 30%, records < 30% | `sql <exec-id>` node metrics |
## SQL Plan Analysis

When diagnosing specific SQL queries, analyze the SQL plan nodes for these patterns:

- **File I/O efficiency**: Check scan/write node metrics for `files read`, `bytes read`, `files written`, `bytes written`. Calculate average file size — small files (< 3MB) are a common hidden bottleneck.
- **Join strategy**: Look for `SortMergeJoin` nodes where one input is significantly smaller than the other. These may benefit from broadcast hints or AQE tuning.
- **Broadcast sizing**: Check the `BroadcastExchange` node's `data size` metric. Broadcasts > 1 GB cause excessive memory pressure and network overhead.
- **Cross joins**: Identify `BroadcastNestedLoopJoin` or `CartesianProduct` nodes. Calculate total scanned rows from input sizes — cross joins on large tables are extremely dangerous.
- **Filter complexity**: Inspect `Filter` node conditions. Very long conditions (> 1000 chars) with large IN-lists or OR chains should be converted to joins.
- **Partition pruning**: For Delta Lake and Iceberg tables, verify that scan nodes show partition filters being applied. Full scans on partitioned tables waste I/O.
- **Partition sizing**: Check stage task distribution for oversized partitions (> 5GB). These cause OOM risk, long-tail tasks, and GC pressure.

Use `sql <exec-id>` for node-level metrics and `sql-plan <exec-id> --view final` for post-AQE plan structure.
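Several of these checks reduce to simple arithmetic over node metrics. A minimal triage sketch (the metric names and dict shape here are assumptions for illustration, not the tool's actual output format):

```python
# Hypothetical node-metrics triage: flags small files, oversized
# broadcasts, and dangerous cross joins from SQL plan node metrics.
MB = 1024 * 1024
GB = 1024 * MB

def triage_node(node_type: str, metrics: dict) -> list:
    findings = []
    files = metrics.get("files read", 0)
    data = metrics.get("bytes read", 0)
    # Small files: > 100 files averaging under 3 MB each
    if files > 100 and data / max(files, 1) < 3 * MB:
        findings.append("small files read (avg < 3MB)")
    # Broadcast sizing: > 1 GB of broadcast data
    if node_type == "BroadcastExchange" and metrics.get("data size", 0) > 1 * GB:
        findings.append("broadcast too large (> 1GB)")
    # Cross joins: > 10 billion scanned rows
    if node_type in ("CartesianProduct", "BroadcastNestedLoopJoin") \
            and metrics.get("scanned rows", 0) > 10_000_000_000:
        findings.append("large cross join (> 10B rows)")
    return findings

# 500 files averaging ~1 MB each trips the small-files check
print(triage_node("Scan parquet", {"files read": 500, "bytes read": 500 * MB}))
```

Running a helper like this over every node returned by `sql <exec-id>` turns the checklist above into a mechanical pass.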
## Lakehouse Awareness

When analyzing workloads on Delta Lake or Apache Iceberg tables:

### Delta Lake
- **OPTIMIZE**: Recommend `OPTIMIZE` for tables with small file problems detected in scan metrics
- **Z-ORDER**: Check if queries filter on z-ordered columns; if not, the z-ordering provides no benefit
- **Liquid Clustering**: For Databricks, check if cluster key filters are being applied in scans
- **Full scans**: Flag scans on partitioned Delta tables without partition filters

### Apache Iceberg
- **Copy-on-Write overhead**: For update/delete workloads, check if files replaced >> records changed — this indicates COW overhead
- **Merge-on-Read**: Recommend `write.merge-mode=merge-on-read` for update-heavy tables
- **Table maintenance**: Recommend `rewrite_data_files` for small file compaction
- **Bulk replace detection**: If > 60% of table files are replaced in a single operation, flag potential misuse

### General Lakehouse Checks
- File sizes in scan/write metrics (target ~128MB per file)
- Partition filter pushdown in scan nodes
- Table statistics availability for cost-based optimization

## Gluten/Velox Awareness
skills/spark-advisor/references/diagnostics.md (186 additions, 0 deletions)
@@ -168,6 +168,183 @@ When a fallback occurs, data must be converted between columnar (Velox) and row
- `RowToVeloxColumnar` → Spark to native conversion
- These conversions add overhead; minimize them by ensuring contiguous native execution

## SQL-Level Diagnostics

### Small Files Read
**Detection**: From SQL plan node metrics (`files read` and `bytes read`):
- Average file size < 3 MB AND files read > 100 → small files problem

**Root causes**:
- Data written with too many partitions
- High-cardinality partition keys
- Frequent small-batch writes

**Recommendations**:
- Ask the data owner to compact/repartition the source data
- Reduce executors to amortize small-file overhead
- Use table maintenance (`OPTIMIZE` for Delta, `rewrite_data_files` for Iceberg)
### Small Files Written
**Detection**: From SQL plan node metrics (`files written` and `bytes written`):
- Average file size < 3 MB AND files written > 100 → writing small files
- Ideal target: ~128 MB per file
- For partitioned writes: check files per partition

**Root causes**:
- Too many output partitions
- High-cardinality partition keys

**Recommendations**:
- For unpartitioned writes: `.repartition(N)` before the write, where N = total_bytes / 128 MB
- For partitioned writes: `.repartition("partition_key")` before the write
- Choose partition keys with lower cardinality
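The N for `.repartition(N)` falls straight out of the target file size. A quick sketch of that arithmetic (the byte count is whatever your write metrics report):

```python
import math

TARGET_FILE_BYTES = 128 * 1024 * 1024  # ~128 MB per output file

def output_partitions(total_bytes: int) -> int:
    """Partition count so each output file lands near the target size."""
    return max(1, math.ceil(total_bytes / TARGET_FILE_BYTES))

# A 10 GB write -> 80 partitions of ~128 MB each
print(output_partitions(10 * 1024 ** 3))  # 80
```

In practice this becomes something like `df.repartition(output_partitions(estimated_bytes)).write...`, with the byte estimate taken from a previous run's metrics.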
### Broadcast Too Large
**Detection**: From SQL plan BroadcastExchange node metrics (`data size`):
- Broadcast data size > 1 GB → too large for broadcast

**Root causes**:
- `spark.sql.autoBroadcastJoinThreshold` set too high
- Broadcast hint on a large table

**Recommendations**:
- Lower `spark.sql.autoBroadcastJoinThreshold`
- Remove broadcast hints from large DataFrames
- Prefer SortMergeJoin for tables > 1 GB

### SortMergeJoin Should Be BroadcastHashJoin
**Detection**: From the SQL plan — a SortMergeJoin node where one input is much smaller:
- Small table < 10 MB (AQE should have caught this)
- Small table < 100 MB AND large table > 10 GB
- Small table < 1 GB AND large table > 300 GB
- Small table < 5 GB AND large table > 1 TB

**Root causes**:
- AQE disabled or unable to estimate sizes
- Missing statistics

**Recommendations**:
- Use a `broadcast(small_df)` hint
- Increase `spark.sql.autoBroadcastJoinThreshold`
- Ensure AQE is enabled: `spark.sql.adaptive.enabled=true`
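The size-pair rules above can be encoded as a small predicate. A sketch with the thresholds taken from the detection list (extracting the two input sizes from the plan is up to the caller):

```python
MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4

# (max small-side size, min large-side size) pairs from the rules above
_BROADCAST_RULES = [
    (10 * MB, 0),           # tiny side: AQE should have broadcast it
    (100 * MB, 10 * GB),
    (1 * GB, 300 * GB),
    (5 * GB, 1 * TB),
]

def should_broadcast(small_bytes: int, large_bytes: int) -> bool:
    """True if a SortMergeJoin with these input sizes is a broadcast candidate."""
    return any(small_bytes < s and large_bytes > l
               for s, l in _BROADCAST_RULES)

print(should_broadcast(50 * MB, 20 * GB))  # True: 100 MB vs 10 GB rule
print(should_broadcast(2 * GB, 50 * GB))   # False: no rule matches
```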
### Large Cross Join
**Detection**: From BroadcastNestedLoopJoin or CartesianProduct node metrics:
- Cross join scanned rows > 10 billion → dangerous cross join

**Root causes**:
- Missing join conditions
- Accidental Cartesian product

**Recommendations**:
- Add specific join conditions
- Avoid cross joins on large datasets
- Consider alternatives (window functions, explode + join)

### Long Filter Conditions
**Detection**: From the Filter node's plan condition:
- Condition string length > 1000 characters → performance risk

**Root causes**:
- Large IN-lists
- Complex OR chains
- Programmatically generated filters

**Recommendations**:
- Convert the filter to a join (create a DataFrame of filter values, inner join)
- Rewrite the filter to be shorter
- Use a temp table for large value lists
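To see how quickly generated IN-lists cross the 1000-character threshold, a toy sketch (pure string arithmetic; the column name and value range are made up for illustration):

```python
def in_list_condition(column: str, values) -> str:
    """Render a SQL IN-list filter the way generated code often does."""
    return f"{column} IN ({', '.join(repr(v) for v in values)})"

# A mere 250 numeric IDs already exceeds the 1000-char threshold
cond = in_list_condition("customer_id", range(250))
print(len(cond))  # 1155
```

The join rewrite is the usual pattern: put the 250 values in their own DataFrame and inner-join on `customer_id` instead of filtering with `cond`.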
### Full Scan on Partitioned/Clustered Tables
**Detection**: From scan nodes with Delta Lake / Iceberg metadata:
- Partitioned table scanned without partition filters
- Liquid Clustering table scanned without cluster key filters
- Z-Ordered table scanned without z-order column filters

**Root causes**:
- Missing WHERE clauses on partition/cluster keys

**Recommendations**:
- Add filters on the partition key(s)
- Add filters on the clustering key(s)
- Review the query to ensure predicate pushdown works

### Large Partition Size
**Detection**: From stage task distribution metrics:
- Max partition size > 5 GB (input, output, shuffle read, or shuffle write)

**Root causes**:
- Uneven data distribution
- Too few partitions

**Recommendations**:
- Increase the number of partitions
- Use more specific partitioning keys
- Enable AQE auto-coalesce
## Resource Utilization Diagnostics

### Wasted Cores / Over-Provisioned Cluster
**Detection**: From executor metrics:
- Idle cores rate > 50% → cluster over-provisioned

**Root causes**:
- Too many executors/cores for the workload size

**Recommendations**:
- For static allocation: lower `spark.executor.cores` or `spark.executor.instances`
- For dynamic allocation: tune `spark.dynamicAllocation.executorAllocationRatio` or increase `spark.dynamicAllocation.schedulerBacklogTimeout`
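One plausible way to estimate the idle-core rate from executor metrics, assuming per-executor core count, uptime, and total task time are available (field names here are illustrative; the real accounting may differ):

```python
def idle_cores_rate(executors: list) -> float:
    """Fraction of total core-time that ran no task: 1 - busy / capacity."""
    capacity = sum(e["cores"] * e["uptime_ms"] for e in executors)
    busy = sum(e["task_time_ms"] for e in executors)
    return 1.0 - busy / capacity if capacity else 0.0

# 4 cores up for 100s = 400s of core-time, only 120s of task time
fleet = [{"cores": 4, "uptime_ms": 100_000, "task_time_ms": 120_000}]
print(f"{idle_cores_rate(fleet):.0%}")  # 70% idle -> over-provisioned
```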
### Executor Memory Over-Provisioned
**Detection**: From executor memory metrics:
- Max executor memory usage < 70% → over-provisioned (wasting money)
- Max executor memory usage > 95% → under-provisioned (risk of OOM/spill)

**Root causes**:
- Wrong `spark.executor.memory` sizing

**Recommendations**:
- Over-provisioned: decrease `spark.executor.memory` to max_usage * 1.2
- Under-provisioned: increase `spark.executor.memory` by 20%
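The sizing rules above in code form, as a sketch (inputs are the configured heap and the max observed usage from `executors --all`):

```python
def recommend_executor_memory(configured_bytes: int, max_used_bytes: int) -> int:
    """Apply the 70%/95% rules: shrink to 1.2x peak, or grow by 20%."""
    usage = max_used_bytes / configured_bytes
    if usage < 0.70:                       # over-provisioned: wasting money
        return int(max_used_bytes * 1.2)   # keep ~20% headroom above peak
    if usage > 0.95:                       # under-provisioned: OOM/spill risk
        return int(configured_bytes * 1.2)
    return configured_bytes                # within the healthy band

GB = 1024 ** 3
# 16 GB configured, 8 GB peak usage -> recommend roughly 9.6 GB
print(round(recommend_executor_memory(16 * GB, 8 * GB) / GB, 1))  # 9.6
```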
### Driver Memory Under-Provisioned
**Detection**: From driver heap memory usage:
- Driver heap usage > 95% of Xmx → risk of driver OOM

**Root causes**:
- Large `collect()` calls
- Too many broadcast variables
- Driver-side aggregations

**Recommendations**:
- Increase `spark.driver.memory`
- Avoid `collect()` on large datasets
- Reduce broadcast variable sizes
## Lakehouse-Specific Diagnostics

### Inefficient Iceberg Table Replace
**Detection**: From Iceberg commit metrics on ReplaceData operations:
- Table files replaced > 30% BUT records changed < 30% → rewriting too many files

**Root causes**:
- Copy-on-write mode rewriting entire files for small updates

**Recommendations**:
- Switch to merge-on-read mode (`write.merge-mode=merge-on-read`)
- Partition the table so updates touch fewer partitions
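The detection boils down to two ratios from the commit metrics. A sketch combining this rule with the 60% bulk-replace rule below (the argument names are illustrative, not actual Iceberg snapshot property keys):

```python
def cow_overhead(total_files: int, files_replaced: int,
                 total_records: int, records_changed: int):
    """Flag copy-on-write overhead per the 30%/60% rules."""
    file_ratio = files_replaced / total_files
    record_ratio = records_changed / total_records
    if file_ratio > 0.60:
        return "replaced most of table: partition to localize updates"
    if file_ratio > 0.30 and record_ratio < 0.30:
        return "inefficient replace: consider write.merge-mode=merge-on-read"
    return None

# 40% of files rewritten to change 0.5% of records -> COW overhead
print(cow_overhead(total_files=1000, files_replaced=400,
                   total_records=10_000_000, records_changed=50_000))
```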
### Replaced Most of Iceberg Table
**Detection**: From Iceberg commit metrics:
- Table files replaced > 60% → potential misuse of Iceberg

**Root causes**:
- Bulk updates/deletes that rewrite most of the table

**Recommendations**:
- Partition the table to localize updates
- Consider whether Iceberg is the right format for this workload
## Thresholds Summary

| Metric | OK | Warning | Critical |

@@ -179,3 +356,12 @@

| Shuffle/Input ratio | < 1x | 1-3x | > 3x |
| Partition size | 64-256MB | 32-512MB | < 16MB or > 1GB |
| Memory per core | 4-8GB | 2-4GB or 8-16GB | < 2GB or > 16GB |
| Avg file size (read/write) | > 64MB | 3-64MB | < 3MB |
| Broadcast data size | < 256MB | 256MB-1GB | > 1GB |
| Cross join rows | < 1B | 1-10B | > 10B |
| Filter condition length | < 500 chars | 500-1000 chars | > 1000 chars |
| Max partition size | < 2GB | 2-5GB | > 5GB |
| Idle cores rate | < 20% | 20-50% | > 50% |
| Executor memory usage | 70-90% | 50-70% or 90-95% | < 50% or > 95% |
| Driver heap usage | < 80% | 80-95% | > 95% |
| Iceberg files replaced | < 30% | 30-60% | > 60% |
