perf: enhance dataset handling, training logic, and reporting features#3
perf: enhance dataset handling, training logic, and reporting features#3cristofima merged 6 commits intomainfrom
Conversation
…roblem type detection Improved the ConfigurePage component by integrating dataset metadata fetching and displaying additional information such as filename, row count, and column types. Added logic to detect problem types based on column types, enhancing user experience during training configuration. Modified files (2): - frontend/app/configure/[datasetId]/page.tsx: Updated metadata handling - frontend/lib/api.ts: Added getDatasetMetadata function
Added new ID detection patterns and improved column exclusion logic in the AutoPreprocessor class. Updated model training to exclude xgboost due to a bug with best_iteration without early stopping. Modified files (5): - backend/training/preprocessor.py: Enhanced ID detection - backend/training/model_trainer.py: Excluded xgboost - backend/training/train.py: Captured dropped columns info - backend/training/requirements.txt: Updated pandas version - backend/training/.dockerignore: Added ignore patterns
Updated datetime handling across multiple files to ensure consistent use of timezone-aware datetime objects. This change improves the accuracy of timestamps in dataset uploads and training job records. Modified files (4): - backend/api/routers/datasets.py: Updated uploaded_at - backend/api/routers/training.py: Updated created_at and updated_at - backend/api/services/dynamo_service.py: Updated updated_at - backend/api/utils/helpers.py: Added local_training setting
…ator Replaced the Sweetviz library with a custom EDA report generator that creates a comprehensive HTML report using only HTML and CSS. This change removes the dependency on Sweetviz, allowing for more control over the report's appearance and content. Modified files (1): - backend/training/eda.py: Implemented EDAReportGenerator class and associated methods for generating the report. Additionally, removed Sweetviz from requirements.txt to reflect this change.
Implemented a new training report generation feature that creates an HTML report summarizing the training results, including metrics, feature importance, preprocessing details, and configuration info. The report is saved to a specified path and uploaded to S3 for accessibility. Modified files (5): - backend/training/train.py: Added report generation and upload logic - backend/api/models/schemas.py: Updated JobDetails and JobResponse schemas - backend/api/routers/models.py: Adjusted job status handling for reports - backend/training/training_report.py: New module for report generation
Implemented CORS rules for models and reports S3 buckets to allow cross-origin requests. This change enables frontend applications to access resources stored in these buckets. Modified files: - infrastructure/terraform/s3.tf: Added CORS configuration for models Enhances compatibility with web applications requiring S3 access.
There was a problem hiding this comment.
Pull request overview
This PR enhances the AutoML platform's performance and maintainability by replacing heavyweight dependencies (Sweetviz) with custom HTML/CSS report generators, standardizing timezone handling, improving feature preprocessing with intelligent ID detection, and adding comprehensive training reports. The changes span infrastructure (S3 CORS), backend (training pipeline enhancements), and frontend (graceful handling of optional metadata).
Key Changes
- Custom report generators: Replaced Sweetviz with pure HTML/CSS EDA and training reports, reducing container dependencies
- Enhanced preprocessing: Added feature-engine library for robust feature selection, comprehensive ID column detection patterns, and automatic exclusion of constant/duplicate/high-cardinality features
- Timezone standardization: Migrated from
datetime.utcnow()todatetime.now(timezone.utc)throughout backend for consistent timezone-aware timestamps
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| infrastructure/terraform/s3.tf | Added CORS configuration for models and reports buckets to enable frontend downloads |
| frontend/lib/utils.ts | Enhanced datetime formatting with timezone awareness and null-safe problem type helpers |
| frontend/lib/api.ts | Updated interfaces to support optional job metadata fields and dual report paths |
| frontend/app/training/[jobId]/page.tsx | Added graceful handling of optional metadata during job execution |
| frontend/app/results/[jobId]/page.tsx | Added null checks for optional job fields |
| frontend/app/history/page.tsx | Implemented delete job functionality with confirmation modal |
| frontend/app/configure/[datasetId]/page.tsx | Added real dataset metadata fetching and improved problem type detection |
| backend/training/training_report.py | New module for generating CSS-based training performance reports |
| backend/training/train.py | Updated to generate both EDA and training reports, use timezone-aware timestamps |
| backend/training/requirements.txt | Removed Sweetviz/plotly/xgboost, added feature-engine, updated pandas version |
| backend/training/preprocessor.py | Added comprehensive ID detection, feature-engine integration, and useless column filtering |
| backend/training/model_trainer.py | Excluded xgboost due to best_iteration bug, improved feature importance extraction |
| backend/training/eda.py | Complete rewrite using pure HTML/CSS instead of Sweetviz |
| backend/training/.dockerignore | Added to reduce Docker context size |
| backend/api/utils/helpers.py | Added aws_endpoint_url and local_training settings for development |
| backend/api/services/dynamo_service.py | Updated to use timezone-aware UTC timestamps |
| backend/api/routers/training.py | Updated to use timezone-aware UTC timestamps |
| backend/api/routers/models.py | Added support for dual report paths (EDA + training) with backward compatibility |
| backend/api/routers/datasets.py | Updated to use timezone-aware UTC timestamps |
| backend/api/models/schemas.py | Extended schemas to support optional fields and dual report paths |
|
|
||
| def generate(self) -> str: | ||
| """Generate complete HTML report""" | ||
| timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S UTC") |
There was a problem hiding this comment.
Missing timezone awareness. The timestamp is generated with datetime.now() which returns a naive datetime (without timezone info), but the label says "UTC". This is inconsistent with the rest of the codebase which uses datetime.now(timezone.utc) for timezone-aware UTC timestamps.
Change to:
timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")Note: You'll also need to import timezone from datetime at the top of the file.
| cors_rule { | ||
| allowed_headers = ["*"] | ||
| allowed_methods = ["GET"] | ||
| allowed_origins = ["*"] |
There was a problem hiding this comment.
Security concern: The CORS configuration uses allowed_origins = ["*"] which allows any website to request resources from these S3 buckets. While this might be acceptable for development, in production this should be restricted to specific domain(s) where your frontend is hosted.
Consider using a variable for allowed origins:
allowed_origins = var.cors_allowed_origins # e.g., ["https://yourdomain.com"]This is especially important for the models bucket which contains trained ML models.
| cors_rule { | ||
| allowed_headers = ["*"] | ||
| allowed_methods = ["GET"] | ||
| allowed_origins = ["*"] |
There was a problem hiding this comment.
Security concern: Same as the models bucket - allowed_origins = ["*"] allows any website to request reports from this S3 bucket. This should be restricted to specific domain(s) in production.
Consider using a variable for allowed origins:
allowed_origins = var.cors_allowed_origins # e.g., ["https://yourdomain.com"]| allowed_origins = ["*"] | |
| allowed_origins = var.cors_allowed_origins |
| if (dateStr.endsWith('Z') || dateStr.includes('+') || dateStr.includes('-', 10)) { | ||
| return new Date(dateStr).getTime(); | ||
| } | ||
| return new Date(dateStr + 'Z').getTime(); | ||
| }; |
There was a problem hiding this comment.
Same timezone detection issue as in formatDateTime. The condition dateStr.includes('-', 10) will incorrectly match ISO date strings like 2024-11-28T10:00:00 where - appears after position 10 in the date portion.
Use a more specific regex pattern to detect timezone offsets:
if (dateStr.endsWith('Z') || /[+-]\d{2}:\d{2}$/.test(dateStr)) {
return new Date(dateStr).getTime();
}| n_unique = series.nunique() | ||
| n_total = len(series) | ||
|
|
||
| # If almost all values are unique, likely an ID |
There was a problem hiding this comment.
Potential division by zero issue. If n_total (length of series) is 0, this will raise a ZeroDivisionError. While this is unlikely in practice for a valid dataset, it's good defensive programming to handle this edge case.
Consider adding a check:
if n_total == 0:
return False
if n_unique / n_total > 0.95:
# ...| # If almost all values are unique, likely an ID | |
| # If almost all values are unique, likely an ID | |
| if n_total == 0: | |
| return False |
|
|
||
| def _detect_problem_type(self) -> str: | ||
| """Detect if classification or regression""" | ||
| if pd.api.types.is_numeric_dtype(self.target): |
There was a problem hiding this comment.
Potential division by zero issue. If len(self.target) is 0 (empty dataset), this will raise a ZeroDivisionError. While this is unlikely for a valid dataset, it's good defensive programming to handle this edge case.
Consider adding a check:
if len(self.target) == 0:
return 'classification' # or raise an appropriate error
unique_ratio = self.target.nunique() / len(self.target)| if pd.api.types.is_numeric_dtype(self.target): | |
| if pd.api.types.is_numeric_dtype(self.target): | |
| if len(self.target) == 0: | |
| return 'classification' |
| imbalance_ratio = class_counts.max() / class_counts.min() | ||
| if imbalance_ratio > 3: | ||
| self.warnings.append(f"Class imbalance detected (ratio: {imbalance_ratio:.1f}:1)") |
There was a problem hiding this comment.
Potential division by zero issue. If class_counts.min() returns 0 (which could happen if there's a class with no samples after value_counts, though unlikely), this will raise a ZeroDivisionError.
Add a guard check:
min_count = class_counts.min()
if min_count > 0:
imbalance_ratio = class_counts.max() / min_count
if imbalance_ratio > 3:
self.warnings.append(f"Class imbalance detected (ratio: {imbalance_ratio:.1f}:1)")| imbalance_ratio = class_counts.max() / class_counts.min() | |
| if imbalance_ratio > 3: | |
| self.warnings.append(f"Class imbalance detected (ratio: {imbalance_ratio:.1f}:1)") | |
| min_count = class_counts.min() | |
| if min_count > 0: | |
| imbalance_ratio = class_counts.max() / min_count | |
| if imbalance_ratio > 3: | |
| self.warnings.append(f"Class imbalance detected (ratio: {imbalance_ratio:.1f}:1)") |
|
|
||
| for col in self.df.columns: | ||
| series = self.df[col] | ||
| dtype = str(series.dtype) |
There was a problem hiding this comment.
Variable dtype is not used.
| dtype = str(series.dtype) |
| import pandas as pd | ||
| import sweetviz as sv | ||
| import numpy as np | ||
| from typing import Dict, List, Tuple, Any |
There was a problem hiding this comment.
Import of 'Dict' is not used.
Import of 'Any' is not used.
| from typing import Dict, List, Tuple, Any | |
| from typing import List, Tuple |
| except Exception: | ||
| pass |
There was a problem hiding this comment.
'except' clause does nothing but pass and there is no explanatory comment.
| except Exception: | |
| pass | |
| except Exception as e: | |
| print(f"Could not extract feature importances from FLAML model: {str(e)}") |
This PR enhances the AutoML platform's performance and maintainability by replacing heavyweight dependencies (Sweetviz) with custom HTML/CSS report generators, standardizing timezone handling, improving feature preprocessing with intelligent ID detection, and adding comprehensive training reports. The changes span infrastructure (S3 CORS), backend (training pipeline enhancements), and frontend (graceful handling of optional metadata).
Key Changes