Skip to content

perf: enhance dataset handling, training logic, and reporting features#3

Merged
cristofima merged 6 commits intomainfrom
dev
Nov 30, 2025
Merged

perf: enhance dataset handling, training logic, and reporting features#3
cristofima merged 6 commits intomainfrom
dev

Conversation

@cristofima
Copy link
Copy Markdown
Owner

@cristofima cristofima commented Nov 30, 2025

This PR enhances the AutoML platform's performance and maintainability by replacing heavyweight dependencies (Sweetviz) with custom HTML/CSS report generators, standardizing timezone handling, improving feature preprocessing with intelligent ID detection, and adding comprehensive training reports. The changes span infrastructure (S3 CORS), backend (training pipeline enhancements), and frontend (graceful handling of optional metadata).

Key Changes

  • Custom report generators: Replaced Sweetviz with pure HTML/CSS EDA and training reports, reducing container dependencies
  • Enhanced preprocessing: Added feature-engine library for robust feature selection, comprehensive ID column detection patterns, and automatic exclusion of constant/duplicate/high-cardinality features
  • Timezone standardization: Migrated from datetime.utcnow() to datetime.now(timezone.utc) throughout backend for consistent timezone-aware timestamps

…roblem type detection

Improved the ConfigurePage component by integrating dataset metadata fetching and displaying additional information such as filename, row count, and column types. Added logic to detect problem types based on column types, enhancing user experience during training configuration.

Modified files (2):
- frontend/app/configure/[datasetId]/page.tsx: Updated metadata handling
- frontend/lib/api.ts: Added getDatasetMetadata function
Added new ID detection patterns and improved column exclusion logic in the AutoPreprocessor class.
Updated model training to exclude xgboost due to a bug with best_iteration without early stopping.

Modified files (5):
- backend/training/preprocessor.py: Enhanced ID detection
- backend/training/model_trainer.py: Excluded xgboost
- backend/training/train.py: Captured dropped columns info
- backend/training/requirements.txt: Updated pandas version
- backend/training/.dockerignore: Added ignore patterns
Updated datetime handling across multiple files to ensure consistent use of timezone-aware datetime objects. This change improves the accuracy of timestamps in dataset uploads and training job records.

Modified files (4):
- backend/api/routers/datasets.py: Updated uploaded_at
- backend/api/routers/training.py: Updated created_at and updated_at
- backend/api/services/dynamo_service.py: Updated updated_at
- backend/api/utils/helpers.py: Added local_training setting
…ator

Replaced the Sweetviz library with a custom EDA report generator that creates a comprehensive HTML report using only HTML and CSS. This change removes the dependency on Sweetviz, allowing for more control over the report's appearance and content.

Modified files (1):
- backend/training/eda.py: Implemented EDAReportGenerator class and
  associated methods for generating the report.

Additionally, removed Sweetviz from requirements.txt to reflect this change.
Implemented a new training report generation feature that creates an HTML report summarizing the training results, including metrics, feature importance, preprocessing details, and configuration info.
The report is saved to a specified path and uploaded to S3 for accessibility.

Modified files (5):
- backend/training/train.py: Added report generation and upload logic
- backend/api/models/schemas.py: Updated JobDetails and JobResponse schemas
- backend/api/routers/models.py: Adjusted job status handling for reports
- backend/training/training_report.py: New module for report generation
Implemented CORS rules for models and reports S3 buckets to allow cross-origin requests. This change enables frontend applications to access resources stored in these buckets.

Modified files:
- infrastructure/terraform/s3.tf: Added CORS configuration for models

Enhances compatibility with web applications requiring S3 access.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances the AutoML platform's performance and maintainability by replacing heavyweight dependencies (Sweetviz) with custom HTML/CSS report generators, standardizing timezone handling, improving feature preprocessing with intelligent ID detection, and adding comprehensive training reports. The changes span infrastructure (S3 CORS), backend (training pipeline enhancements), and frontend (graceful handling of optional metadata).

Key Changes

  • Custom report generators: Replaced Sweetviz with pure HTML/CSS EDA and training reports, reducing container dependencies
  • Enhanced preprocessing: Added feature-engine library for robust feature selection, comprehensive ID column detection patterns, and automatic exclusion of constant/duplicate/high-cardinality features
  • Timezone standardization: Migrated from datetime.utcnow() to datetime.now(timezone.utc) throughout backend for consistent timezone-aware timestamps

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
infrastructure/terraform/s3.tf Added CORS configuration for models and reports buckets to enable frontend downloads
frontend/lib/utils.ts Enhanced datetime formatting with timezone awareness and null-safe problem type helpers
frontend/lib/api.ts Updated interfaces to support optional job metadata fields and dual report paths
frontend/app/training/[jobId]/page.tsx Added graceful handling of optional metadata during job execution
frontend/app/results/[jobId]/page.tsx Added null checks for optional job fields
frontend/app/history/page.tsx Implemented delete job functionality with confirmation modal
frontend/app/configure/[datasetId]/page.tsx Added real dataset metadata fetching and improved problem type detection
backend/training/training_report.py New module for generating CSS-based training performance reports
backend/training/train.py Updated to generate both EDA and training reports, use timezone-aware timestamps
backend/training/requirements.txt Removed Sweetviz/plotly/xgboost, added feature-engine, updated pandas version
backend/training/preprocessor.py Added comprehensive ID detection, feature-engine integration, and useless column filtering
backend/training/model_trainer.py Excluded xgboost due to best_iteration bug, improved feature importance extraction
backend/training/eda.py Complete rewrite using pure HTML/CSS instead of Sweetviz
backend/training/.dockerignore Added to reduce Docker context size
backend/api/utils/helpers.py Added aws_endpoint_url and local_training settings for development
backend/api/services/dynamo_service.py Updated to use timezone-aware UTC timestamps
backend/api/routers/training.py Updated to use timezone-aware UTC timestamps
backend/api/routers/models.py Added support for dual report paths (EDA + training) with backward compatibility
backend/api/routers/datasets.py Updated to use timezone-aware UTC timestamps
backend/api/models/schemas.py Extended schemas to support optional fields and dual report paths


def generate(self) -> str:
"""Generate complete HTML report"""
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S UTC")
Copy link

Copilot AI Nov 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing timezone awareness. The timestamp is generated with datetime.now() which returns a naive datetime (without timezone info), but the label says "UTC". This is inconsistent with the rest of the codebase which uses datetime.now(timezone.utc) for timezone-aware UTC timestamps.

Change to:

timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")

Note: You'll also need to import timezone from datetime at the top of the file.

Copilot uses AI. Check for mistakes.
cors_rule {
allowed_headers = ["*"]
allowed_methods = ["GET"]
allowed_origins = ["*"]
Copy link

Copilot AI Nov 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security concern: The CORS configuration uses allowed_origins = ["*"] which allows any website to request resources from these S3 buckets. While this might be acceptable for development, in production this should be restricted to specific domain(s) where your frontend is hosted.

Consider using a variable for allowed origins:

allowed_origins = var.cors_allowed_origins  # e.g., ["https://yourdomain.com"]

This is especially important for the models bucket which contains trained ML models.

Copilot uses AI. Check for mistakes.
cors_rule {
allowed_headers = ["*"]
allowed_methods = ["GET"]
allowed_origins = ["*"]
Copy link

Copilot AI Nov 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security concern: Same as the models bucket - allowed_origins = ["*"] allows any website to request reports from this S3 bucket. This should be restricted to specific domain(s) in production.

Consider using a variable for allowed origins:

allowed_origins = var.cors_allowed_origins  # e.g., ["https://yourdomain.com"]
Suggested change
allowed_origins = ["*"]
allowed_origins = var.cors_allowed_origins

Copilot uses AI. Check for mistakes.
Comment on lines +46 to +50
if (dateStr.endsWith('Z') || dateStr.includes('+') || dateStr.includes('-', 10)) {
return new Date(dateStr).getTime();
}
return new Date(dateStr + 'Z').getTime();
};
Copy link

Copilot AI Nov 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same timezone detection issue as in formatDateTime. The condition dateStr.includes('-', 10) will incorrectly match ISO date strings like 2024-11-28T10:00:00 where - appears after position 10 in the date portion.

Use a more specific regex pattern to detect timezone offsets:

if (dateStr.endsWith('Z') || /[+-]\d{2}:\d{2}$/.test(dateStr)) {
  return new Date(dateStr).getTime();
}

Copilot uses AI. Check for mistakes.
n_unique = series.nunique()
n_total = len(series)

# If almost all values are unique, likely an ID
Copy link

Copilot AI Nov 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential division by zero issue. If n_total (length of series) is 0, this will raise a ZeroDivisionError. While this is unlikely in practice for a valid dataset, it's good defensive programming to handle this edge case.

Consider adding a check:

if n_total == 0:
    return False
if n_unique / n_total > 0.95:
    # ...
Suggested change
# If almost all values are unique, likely an ID
# If almost all values are unique, likely an ID
if n_total == 0:
return False

Copilot uses AI. Check for mistakes.

def _detect_problem_type(self) -> str:
"""Detect if classification or regression"""
if pd.api.types.is_numeric_dtype(self.target):
Copy link

Copilot AI Nov 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential division by zero issue. If len(self.target) is 0 (empty dataset), this will raise a ZeroDivisionError. While this is unlikely for a valid dataset, it's good defensive programming to handle this edge case.

Consider adding a check:

if len(self.target) == 0:
    return 'classification'  # or raise an appropriate error
unique_ratio = self.target.nunique() / len(self.target)
Suggested change
if pd.api.types.is_numeric_dtype(self.target):
if pd.api.types.is_numeric_dtype(self.target):
if len(self.target) == 0:
return 'classification'

Copilot uses AI. Check for mistakes.
Comment on lines +111 to +113
imbalance_ratio = class_counts.max() / class_counts.min()
if imbalance_ratio > 3:
self.warnings.append(f"Class imbalance detected (ratio: {imbalance_ratio:.1f}:1)")
Copy link

Copilot AI Nov 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential division by zero issue. If class_counts.min() returns 0 (which could happen if there's a class with no samples after value_counts, though unlikely), this will raise a ZeroDivisionError.

Add a guard check:

min_count = class_counts.min()
if min_count > 0:
    imbalance_ratio = class_counts.max() / min_count
    if imbalance_ratio > 3:
        self.warnings.append(f"Class imbalance detected (ratio: {imbalance_ratio:.1f}:1)")
Suggested change
imbalance_ratio = class_counts.max() / class_counts.min()
if imbalance_ratio > 3:
self.warnings.append(f"Class imbalance detected (ratio: {imbalance_ratio:.1f}:1)")
min_count = class_counts.min()
if min_count > 0:
imbalance_ratio = class_counts.max() / min_count
if imbalance_ratio > 3:
self.warnings.append(f"Class imbalance detected (ratio: {imbalance_ratio:.1f}:1)")

Copilot uses AI. Check for mistakes.

for col in self.df.columns:
series = self.df[col]
dtype = str(series.dtype)
Copy link

Copilot AI Nov 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable dtype is not used.

Suggested change
dtype = str(series.dtype)

Copilot uses AI. Check for mistakes.
import pandas as pd
import sweetviz as sv
import numpy as np
from typing import Dict, List, Tuple, Any
Copy link

Copilot AI Nov 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'Dict' is not used.
Import of 'Any' is not used.

Suggested change
from typing import Dict, List, Tuple, Any
from typing import List, Tuple

Copilot uses AI. Check for mistakes.
Comment on lines +173 to +174
except Exception:
pass
Copy link

Copilot AI Nov 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
except Exception:
pass
except Exception as e:
print(f"Could not extract feature importances from FLAML model: {str(e)}")

Copilot uses AI. Check for mistakes.
@cristofima cristofima merged commit 3237228 into main Nov 30, 2025
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants