perf: enhance dataset handling, training logic, and reporting features by cristofima · Pull Request #3 · cristofima/AWS-AutoML-Lite

cristofima · 2025-11-30T22:47:17Z

This PR enhances the AutoML platform's performance and maintainability by replacing heavyweight dependencies (Sweetviz) with custom HTML/CSS report generators, standardizing timezone handling, improving feature preprocessing with intelligent ID detection, and adding comprehensive training reports. The changes span infrastructure (S3 CORS), backend (training pipeline enhancements), and frontend (graceful handling of optional metadata).

Key Changes

Custom report generators: Replaced Sweetviz with pure HTML/CSS EDA and training reports, reducing container dependencies
Enhanced preprocessing: Added feature-engine library for robust feature selection, comprehensive ID column detection patterns, and automatic exclusion of constant/duplicate/high-cardinality features
Timezone standardization: Migrated from datetime.utcnow() to datetime.now(timezone.utc) throughout backend for consistent timezone-aware timestamps
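
The exclusion of constant, duplicate, and high-cardinality features listed under "Enhanced preprocessing" can be illustrated with a small standard-library sketch. This is not the feature-engine based code from the PR: the function name `find_useless_columns`, the dict-of-lists input, and the 0.5 cardinality threshold are all assumptions made for illustration.

```python
from typing import Dict, Hashable, Sequence


def find_useless_columns(
    columns: Dict[str, Sequence[Hashable]],
    max_cardinality_ratio: float = 0.5,  # hypothetical threshold
) -> Dict[str, str]:
    """Map each column that should be dropped to the reason for dropping it."""
    dropped: Dict[str, str] = {}
    seen: Dict[tuple, str] = {}  # column contents -> first column with them
    for name, values in columns.items():
        n_total = len(values)
        n_unique = len(set(values))
        key = tuple(values)
        if n_unique <= 1:
            # Covers empty columns too, so the division below is safe.
            dropped[name] = "constant"
        elif key in seen:
            dropped[name] = f"duplicate of {seen[key]}"
        elif all(isinstance(v, str) for v in values) and n_unique / n_total > max_cardinality_ratio:
            dropped[name] = "high cardinality"
        else:
            seen[key] = name
    return dropped


data = {
    "country": ["US", "US", "US", "US"],
    "age": [31, 45, 27, 52],
    "age_copy": [31, 45, 27, 52],
    "email": ["a@x", "b@x", "c@x", "d@x"],
    "plan": ["free", "pro", "free", "free"],
}
print(find_useless_columns(data))
```

Returning the reason alongside each column name mirrors the idea of capturing dropped-column information so that it can be surfaced in a report.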

…roblem type detection

Improved the ConfigurePage component by integrating dataset metadata fetching and displaying additional information such as filename, row count, and column types. Added logic to detect problem types based on column types, enhancing user experience during training configuration.

Modified files (2):
- frontend/app/configure/[datasetId]/page.tsx: Updated metadata handling
- frontend/lib/api.ts: Added getDatasetMetadata function

Added new ID detection patterns and improved column exclusion logic in the AutoPreprocessor class. Updated model training to exclude xgboost due to a bug with best_iteration without early stopping.

Modified files (5):
- backend/training/preprocessor.py: Enhanced ID detection
- backend/training/model_trainer.py: Excluded xgboost
- backend/training/train.py: Captured dropped columns info
- backend/training/requirements.txt: Updated pandas version
- backend/training/.dockerignore: Added ignore patterns

Updated datetime handling across multiple files to ensure consistent use of timezone-aware datetime objects. This change improves the accuracy of timestamps in dataset uploads and training job records.

Modified files (4):
- backend/api/routers/datasets.py: Updated uploaded_at
- backend/api/routers/training.py: Updated created_at and updated_at
- backend/api/services/dynamo_service.py: Updated updated_at
- backend/api/utils/helpers.py: Added local_training setting

…ator

Replaced the Sweetviz library with a custom EDA report generator that creates a comprehensive HTML report using only HTML and CSS. This change removes the dependency on Sweetviz, allowing for more control over the report's appearance and content.

Modified files (1):
- backend/training/eda.py: Implemented EDAReportGenerator class and associated methods for generating the report.

Additionally, removed Sweetviz from requirements.txt to reflect this change.

Implemented a new training report generation feature that creates an HTML report summarizing the training results, including metrics, feature importance, preprocessing details, and configuration info. The report is saved to a specified path and uploaded to S3 for accessibility.

Modified files (5):
- backend/training/train.py: Added report generation and upload logic
- backend/api/models/schemas.py: Updated JobDetails and JobResponse schemas
- backend/api/routers/models.py: Adjusted job status handling for reports
- backend/training/training_report.py: New module for report generation
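
To give a sense of what a report built from only HTML and CSS can look like, here is a minimal sketch that renders feature importances as CSS-sized bars. It is not the code from training_report.py or eda.py: the function name `render_importance_report`, the class names, and the styling are hypothetical.

```python
from html import escape
from typing import Mapping

# Inline stylesheet: bars are plain <div> elements sized with CSS widths.
_STYLE = (
    "body{font-family:sans-serif;margin:2rem}"
    ".row{display:flex;align-items:center;gap:.5rem;margin:.25rem 0}"
    ".label{width:12rem}"
    ".track{flex:1;background:#eee}"
    ".bar{height:1rem;background:#4c78a8}"
)


def render_importance_report(title: str, importances: Mapping[str, float]) -> str:
    """Render feature importances as a standalone HTML page (no JavaScript)."""
    top = max(importances.values(), default=0.0)
    rows = []
    for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
        # Scale each bar relative to the largest importance; clamp at zero.
        width = 100.0 * max(value, 0.0) / top if top > 0 else 0.0
        rows.append(
            f'<div class="row"><span class="label">{escape(name)}</span>'
            f'<div class="track"><div class="bar" style="width:{width:.1f}%"></div></div>'
            f"<span>{value:.3f}</span></div>"
        )
    return (
        '<!DOCTYPE html><html><head><meta charset="utf-8">'
        f"<title>{escape(title)}</title><style>{_STYLE}</style></head>"
        f"<body><h1>{escape(title)}</h1>{''.join(rows)}</body></html>"
    )


report_html = render_importance_report("Training report", {"income": 0.6, "<b>": 0.3})
print(len(report_html))
```

Escaping column names with `html.escape` matters here because the names come from user-uploaded datasets and end up inside markup.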

Implemented CORS rules for models and reports S3 buckets to allow cross-origin requests. This change enables frontend applications to access resources stored in these buckets.

Modified files:
- infrastructure/terraform/s3.tf: Added CORS configuration for models

Enhances compatibility with web applications requiring S3 access.

Copilot

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 15 comments.


File	Description
infrastructure/terraform/s3.tf	Added CORS configuration for models and reports buckets to enable frontend downloads
frontend/lib/utils.ts	Enhanced datetime formatting with timezone awareness and null-safe problem type helpers
frontend/lib/api.ts	Updated interfaces to support optional job metadata fields and dual report paths
frontend/app/training/[jobId]/page.tsx	Added graceful handling of optional metadata during job execution
frontend/app/results/[jobId]/page.tsx	Added null checks for optional job fields
frontend/app/history/page.tsx	Implemented delete job functionality with confirmation modal
frontend/app/configure/[datasetId]/page.tsx	Added real dataset metadata fetching and improved problem type detection
backend/training/training_report.py	New module for generating CSS-based training performance reports
backend/training/train.py	Updated to generate both EDA and training reports, use timezone-aware timestamps
backend/training/requirements.txt	Removed Sweetviz/plotly/xgboost, added feature-engine, updated pandas version
backend/training/preprocessor.py	Added comprehensive ID detection, feature-engine integration, and useless column filtering
backend/training/model_trainer.py	Excluded xgboost due to best_iteration bug, improved feature importance extraction
backend/training/eda.py	Complete rewrite using pure HTML/CSS instead of Sweetviz
backend/training/.dockerignore	Added to reduce Docker context size
backend/api/utils/helpers.py	Added aws_endpoint_url and local_training settings for development
backend/api/services/dynamo_service.py	Updated to use timezone-aware UTC timestamps
backend/api/routers/training.py	Updated to use timezone-aware UTC timestamps
backend/api/routers/models.py	Added support for dual report paths (EDA + training) with backward compatibility
backend/api/routers/datasets.py	Updated to use timezone-aware UTC timestamps
backend/api/models/schemas.py	Extended schemas to support optional fields and dual report paths

Copilot · 2025-11-30T22:53:35Z

backend/training/training_report.py

+
+    def generate(self) -> str:
+        """Generate complete HTML report"""
+        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S UTC")


Missing timezone awareness. The timestamp is generated with datetime.now() which returns a naive datetime (without timezone info), but the label says "UTC". This is inconsistent with the rest of the codebase which uses datetime.now(timezone.utc) for timezone-aware UTC timestamps.

Change to:

timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")

Note: You'll also need to import timezone from datetime at the top of the file.
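
The difference between the two calls can be seen in a short sketch. The helper name `report_timestamp` is hypothetical and only mirrors the format string quoted above.

```python
from datetime import datetime, timezone


def report_timestamp() -> str:
    """Format the current time as a UTC string (hypothetical helper)."""
    # datetime.now(timezone.utc) returns an aware datetime pinned to UTC,
    # so the "UTC" label in the format string is accurate on any host.
    return datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")


naive = datetime.now()              # local wall-clock time, tzinfo is None
aware = datetime.now(timezone.utc)  # UTC time, tzinfo is timezone.utc

print(naive.tzinfo)   # None
print(aware.tzinfo)   # UTC
print(report_timestamp())
```

On a machine whose local timezone is not UTC, the naive value would be off by the local offset while still being labeled "UTC", which is the inconsistency the comment points out.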

Copilot · 2025-11-30T22:53:35Z

infrastructure/terraform/s3.tf

+  cors_rule {
+    allowed_headers = ["*"]
+    allowed_methods = ["GET"]
+    allowed_origins = ["*"]


Security concern: The CORS configuration uses allowed_origins = ["*"] which allows any website to request resources from these S3 buckets. While this might be acceptable for development, in production this should be restricted to specific domain(s) where your frontend is hosted.

Consider using a variable for allowed origins:

allowed_origins = var.cors_allowed_origins # e.g., ["https://yourdomain.com"]

This is especially important for the models bucket which contains trained ML models.

Copilot · 2025-11-30T22:53:35Z

infrastructure/terraform/s3.tf

+  cors_rule {
+    allowed_headers = ["*"]
+    allowed_methods = ["GET"]
+    allowed_origins = ["*"]


Security concern: Same as the models bucket - allowed_origins = ["*"] allows any website to request reports from this S3 bucket. This should be restricted to specific domain(s) in production.

Consider using a variable for allowed origins:

allowed_origins = var.cors_allowed_origins # e.g., ["https://yourdomain.com"]

Suggested change

-    allowed_origins = ["*"]
+    allowed_origins = var.cors_allowed_origins

Copilot · 2025-11-30T22:53:36Z

frontend/lib/utils.ts

+    if (dateStr.endsWith('Z') || dateStr.includes('+') || dateStr.includes('-', 10)) {
+      return new Date(dateStr).getTime();
+    }
+    return new Date(dateStr + 'Z').getTime();
+  };


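Same timezone detection issue as in formatDateTime. The condition dateStr.includes('-', 10) detects an offset by position rather than by shape. For a well-formed string such as 2024-11-28T10:00:00 the search starts at index 10 (the T), so the date hyphens are skipped, but the check silently depends on the date portion occupying exactly the first 10 characters, and any later hyphen or any plus sign in the string is treated as a timezone offset.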

Use a more specific regex pattern to detect timezone offsets:

if (dateStr.endsWith('Z') || /[+-]\d{2}:\d{2}$/.test(dateStr)) {
  return new Date(dateStr).getTime();
}
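
For readers more at home in the backend language, the same suffix-based check can be sketched in Python. This is an illustration of the technique only, not code from the PR; the names `has_utc_offset` and `to_epoch_ms` are hypothetical.

```python
import re
from datetime import datetime, timezone

# Matches a trailing "Z" or a numeric offset such as "+02:00" / "-05:00".
_TZ_SUFFIX = re.compile(r"(?:Z|[+-]\d{2}:\d{2})$")


def has_utc_offset(date_str: str) -> bool:
    """Return True if the ISO-8601 string ends with an explicit offset."""
    return _TZ_SUFFIX.search(date_str) is not None


def to_epoch_ms(date_str: str) -> int:
    """Parse an ISO-8601 string, treating offset-less values as UTC."""
    if date_str.endswith("Z"):
        # Older Python versions do not accept "Z" in fromisoformat.
        date_str = date_str[:-1] + "+00:00"
    parsed = datetime.fromisoformat(date_str)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)


print(has_utc_offset("2024-11-28T10:00:00"))        # False
print(has_utc_offset("2024-11-28T10:00:00-05:00"))  # True
```

Anchoring the pattern at the end of the string means hyphens in the date portion can never be mistaken for an offset, regardless of where they fall.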

Copilot · 2025-11-30T22:53:36Z

backend/training/preprocessor.py

+            n_unique = series.nunique()
+            n_total = len(series)
+
+            # If almost all values are unique, likely an ID


Potential division by zero issue. If n_total (length of series) is 0, this will raise a ZeroDivisionError. While this is unlikely in practice for a valid dataset, it's good defensive programming to handle this edge case.

Consider adding a check:

if n_total == 0:
    return False
if n_unique / n_total > 0.95:
    # ...

Suggested change

             # If almost all values are unique, likely an ID
+            if n_total == 0:
+                return False
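
The guarded uniqueness check can be tried in isolation with plain Python. This is a stand-in for the pandas-based code, not a copy of it: `looks_like_id` is a hypothetical name, and `len(set(values))` replaces `series.nunique()` (which, unlike a set, ignores missing values). The 0.95 threshold is taken from the snippet in the comment above.

```python
from typing import Hashable, Sequence

# Threshold taken from the review snippet; treat it as an assumption.
UNIQUE_RATIO_THRESHOLD = 0.95


def looks_like_id(
    values: Sequence[Hashable],
    threshold: float = UNIQUE_RATIO_THRESHOLD,
) -> bool:
    """Return True when almost every value is unique, which suggests an ID column."""
    n_total = len(values)
    if n_total == 0:
        # An empty column carries no evidence either way; avoid dividing by zero.
        return False
    n_unique = len(set(values))
    return n_unique / n_total > threshold


print(looks_like_id([]))                # False
print(looks_like_id(list(range(100))))  # True
print(looks_like_id(["a", "b"] * 50))   # False
```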

Copilot · 2025-11-30T22:53:37Z

backend/training/eda.py

+
+    def _detect_problem_type(self) -> str:
+        """Detect if classification or regression"""
+        if pd.api.types.is_numeric_dtype(self.target):


Potential division by zero issue. If len(self.target) is 0 (empty dataset), this will raise a ZeroDivisionError. While this is unlikely for a valid dataset, it's good defensive programming to handle this edge case.

Consider adding a check:

if len(self.target) == 0:
    return 'classification'  # or raise an appropriate error
unique_ratio = self.target.nunique() / len(self.target)

Suggested change

         if pd.api.types.is_numeric_dtype(self.target):
+            if len(self.target) == 0:
+                return 'classification'
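
The comment leaves open whether an empty target should default to classification or raise. A minimal sketch of the second option follows, written against plain Python sequences rather than pandas. The function name `detect_problem_type` and the 0.05 unique-ratio cutoff are assumptions; the actual threshold used in eda.py is not shown in this review.

```python
from numbers import Number
from typing import Sequence


def detect_problem_type(target: Sequence[object], max_unique_ratio: float = 0.05) -> str:
    """Guess 'classification' or 'regression' from the target values.

    max_unique_ratio is a hypothetical cutoff chosen for illustration.
    """
    if len(target) == 0:
        # Fail loudly instead of silently guessing a problem type.
        raise ValueError("target column is empty")
    is_numeric = all(isinstance(v, Number) and not isinstance(v, bool) for v in target)
    if not is_numeric:
        return "classification"
    unique_ratio = len(set(target)) / len(target)
    # Few distinct numeric values relative to row count: treat as class labels.
    return "classification" if unique_ratio < max_unique_ratio else "regression"


print(detect_problem_type([0, 1] * 50))                     # classification
print(detect_problem_type([i * 0.5 for i in range(100)]))   # regression
print(detect_problem_type(["cat", "dog", "cat"]))           # classification
```

Raising makes an empty upload visible early in the pipeline, whereas returning a default keeps report generation from failing; which is preferable depends on where the caller validates input.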

Copilot · 2025-11-30T22:53:38Z

backend/training/eda.py

+            imbalance_ratio = class_counts.max() / class_counts.min()
+            if imbalance_ratio > 3:
+                self.warnings.append(f"Class imbalance detected (ratio: {imbalance_ratio:.1f}:1)")


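Potential division by zero issue. If class_counts.min() returns 0 (unlikely after value_counts, but possible, for example when a categorical target has an unused category), the ratio is meaningless. Because pandas returns NumPy scalars here, the division typically yields inf with a RuntimeWarning rather than raising ZeroDivisionError, so the warning message would report an infinite ratio.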

Add a guard check:

min_count = class_counts.min()
if min_count > 0:
    imbalance_ratio = class_counts.max() / min_count
    if imbalance_ratio > 3:
        self.warnings.append(f"Class imbalance detected (ratio: {imbalance_ratio:.1f}:1)")

Suggested change

-            imbalance_ratio = class_counts.max() / class_counts.min()
-            if imbalance_ratio > 3:
-                self.warnings.append(f"Class imbalance detected (ratio: {imbalance_ratio:.1f}:1)")
+            min_count = class_counts.min()
+            if min_count > 0:
+                imbalance_ratio = class_counts.max() / min_count
+                if imbalance_ratio > 3:
+                    self.warnings.append(f"Class imbalance detected (ratio: {imbalance_ratio:.1f}:1)")
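
The guarded ratio check is easy to exercise on its own. The sketch below uses `collections.Counter` in place of `value_counts`; the function name `imbalance_warning` is hypothetical, while the threshold of 3 and the message format come from the snippet above.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional

IMBALANCE_THRESHOLD = 3  # same cutoff as in the reviewed snippet


def imbalance_warning(labels: Iterable[Hashable]) -> Optional[str]:
    """Return a warning string when the largest class dwarfs the smallest."""
    counts = Counter(labels)
    if not counts:
        # No labels at all: nothing to compare.
        return None
    min_count = min(counts.values())
    if min_count <= 0:
        # Counter built from labels never stores zero, but keep the guard
        # so the function stays safe if counts come from another source.
        return None
    ratio = max(counts.values()) / min_count
    if ratio > IMBALANCE_THRESHOLD:
        return f"Class imbalance detected (ratio: {ratio:.1f}:1)"
    return None


print(imbalance_warning(["a"] * 9 + ["b"] * 2))  # Class imbalance detected (ratio: 4.5:1)
print(imbalance_warning(["a", "b"]))             # None
```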

Copilot · 2025-11-30T22:53:38Z

backend/training/eda.py

+
+        for col in self.df.columns:
+            series = self.df[col]
+            dtype = str(series.dtype)


Variable dtype is not used.

Suggested change

-            dtype = str(series.dtype)

Copilot · 2025-11-30T22:53:38Z

backend/training/eda.py

 import pandas as pd
-import sweetviz as sv
+import numpy as np
+from typing import Dict, List, Tuple, Any


Import of 'Dict' is not used.
Import of 'Any' is not used.

Suggested change

-from typing import Dict, List, Tuple, Any
+from typing import List, Tuple

Copilot · 2025-11-30T22:53:38Z

backend/training/model_trainer.py

+            except Exception:
+                pass


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change

-            except Exception:
-                pass
+            except Exception as e:
+                print(f"Could not extract feature importances from FLAML model: {str(e)}")
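
One way to take this suggestion a step further is to route the message through the standard logging module instead of print, so it carries a severity level. A minimal sketch, assuming the fitted estimator exposes a scikit-learn style `feature_importances_` attribute; the function name `extract_feature_importance` and its signature are hypothetical and not taken from model_trainer.py.

```python
import logging
from typing import Any, Dict, Optional, Sequence

logger = logging.getLogger(__name__)


def extract_feature_importance(
    model: Any, feature_names: Sequence[str]
) -> Optional[Dict[str, float]]:
    """Return {feature: importance}, or None if the model does not expose them."""
    try:
        # Assumption: scikit-learn style attribute on the fitted estimator.
        importances = model.feature_importances_
        return {name: float(value) for name, value in zip(feature_names, importances)}
    except Exception as exc:
        # Importances are optional report content, so log and continue.
        logger.warning("Could not extract feature importances: %s", exc)
        return None


class FakeModel:
    """Stand-in estimator used only for this example."""
    feature_importances_ = [0.7, 0.3]


print(extract_feature_importance(FakeModel(), ["age", "income"]))
print(extract_feature_importance(object(), ["age", "income"]))
```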

cristofima added 6 commits November 30, 2025 14:23

cristofima requested a review from Copilot November 30, 2025 22:47

Copilot started reviewing on behalf of cristofima November 30, 2025 22:47

Copilot finished reviewing on behalf of cristofima November 30, 2025 22:51

Copilot AI reviewed Nov 30, 2025


cristofima merged commit 3237228 into main Nov 30, 2025
13 checks passed

cristofima merged 6 commits into main from dev
