[REFACTOR] BE - ai report #101
Conversation
Summary of Changes
Hello @jxhee99, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request contains a large-scale refactoring to improve the performance and stability of the AI report backend. By fundamentally improving the data collection approach and refining the LLM response handling logic, it resolves the existing timeout and CPU overload problems while raising overall code quality and maintainability. Highlights
Activity
Code Review
This pull request refactors the AI report generation process by updating the report schema to remove 'predictions' and 'radar scores' and introduce 'error analysis'. It streamlines the LLM provider resolution, removes Ollama-specific call logic, and significantly overhauls data collection from HTTP API calls to direct InfluxDB queries using InfluxDBClient3. The JSON extraction logic for LLM responses has been simplified, and the prompt construction has been updated to reflect the new report structure. Review comments highlight several issues: the site_id parameter is ignored, leading to data aggregation across all sites and potential unauthorized data exposure; user inputs are directly concatenated into the LLM prompt, creating a prompt injection vulnerability; the simplified JSON parsing logic for LLM output is less robust and may fail with braces inside string literals; and the InfluxDB queries, while currently safe, should be parameterized to follow SQL injection prevention best practices.
```diff
-    del from_ts, to_ts, bucket, site_id  # Inputs are handled via widget bundle collection.
-    bundle = _collect_widget_data()
+    bundle = _collect_report_bundle()
```
The generate_report function receives a site_id parameter, but it is ignored in the call to _collect_report_bundle(). Furthermore, the SQL queries in _collect_report_bundle do not filter by site_id. Since the analytics data is stored with a site_id tag in InfluxDB, this lack of filtering causes the AI report to aggregate data from all sites in the database. In a multi-tenant environment, this leads to unauthorized data exposure where one user can see a summary of another user's site data.
Currently, this service is operated locally for a single site, so it is not a multi-tenant issue.
/gemini Currently, this service is operated locally for a single site, so it is not a multi-tenant issue.
Thank you for the clarification. I understand that the service is currently operated locally for a single site.
However, the core issue remains: the _collect_report_bundle function, which is responsible for gathering all analytics data for the AI report, does not filter its SQL queries by site_id. This means that even in a single-site setup, if multiple sites were to ever write data to the same InfluxDB instance, the report would aggregate data from all sites.
This design choice creates a significant data isolation vulnerability. If the application were to scale to a multi-tenant environment, or if more than one site's data were ever present in the InfluxDB, a user requesting a report for their site_id would inadvertently receive a summary based on data from other sites as well. This constitutes unauthorized data exposure, which is a high severity security concern.
To ensure proper data segregation and future-proof the application for potential multi-tenancy, I strongly recommend the following:
- Reintroduce `site_id` to `generate_report`'s signature: The `generate_report` function should accept `site_id: Optional[str]`.
- Modify `_collect_report_bundle` to accept and use `site_id`: Update `_collect_report_bundle` to take `site_id` as a parameter and incorporate it into the `WHERE` clauses of all relevant SQL queries (e.g., `summary_sql`, `trend_sql`, `exit_sql`).
For example, the _collect_report_bundle function could be modified like this:
```python
def _collect_report_bundle(site_id: Optional[str] = None, days: int = 7) -> Dict[str, Any]:
    # ... existing code ...
    site_filter = f"AND site_id = '{site_id}'" if site_id else ""
    summary_sql = f"""
        SELECT
          device_type, browser_family, country_code, path,
          COUNT(*) AS cnt,
          COUNT(DISTINCT session_id) AS sessions,
          AVG(dwell_ms) AS avg_dwell_ms,
          SUM(CASE WHEN error_flag = true THEN 1 ELSE 0 END) AS error_count
        FROM events
        WHERE time >= TIMESTAMP '{from_str}' AND time < TIMESTAMP '{to_str}'
          AND session_id IS NOT NULL AND session_id <> ''
          {site_filter}
        GROUP BY device_type, browser_family, country_code, path
    """
    # ... similar modifications for trend_sql and exit_sql ...
    # ... existing code ...
```

And then `generate_report` would call it like:
```python
def generate_report(
    # ... other parameters ...
    site_id: Optional[str] = None,
    # ... other parameters ...
) -> Dict[str, Any]:
    # ... existing code ...
    bundle = _collect_report_bundle(site_id=site_id)
    # ... existing code ...
```

Addressing this now will prevent significant refactoring and potential security incidents in the future.
```diff
 def _build_messages(bundle: Dict[str, Any], prompt: str, language: str, audience: str, word_limit: int) -> List[Dict[str, str]]:
     soft_prompt = (prompt or "").strip()[:400]
     user_prompt = (
         f"Language: {language}\n"
         f"Audience: {audience}\n"
         f"WordLimit: {word_limit}\n"
         f"UserHint(LightlyIncorporate): {soft_prompt}\n\n"
-        "Build an AI report that does the following:\n"
-        "- `diagnostics`: explain 2~4 key per-environment issues, grounded in the widget data.\n"
-        "- `page_issues`: pick only pages with high exits relative to dwell time and write hypotheses.\n"
-        "- `interaction_insights`: suggest improvement directions based on button/click patterns.\n"
-        "- `ux_recommendations`: present immediately actionable UX fixes and validation methods.\n"
-        "- `tech_recommendations`: specify technical fixes and tracking methods.\n"
-        "- `priorities`: classify as High/Medium/Low by effort vs impact.\n"
-        "- `metrics_to_track`: clearly state the widgets to monitor for 7 days after improvements and the target changes.\n"
-        "- `predictions`: return at least 2 items, giving numeric expected values against the baseline once fixes land.\n"
-        "- `radar_scores`: five axes scored 0-100, each grounded in a different metric.\n\n"
-        "Without exception, every item in the `predictions` array must include all of the `metric` (string), "
-        "`baseline` (number), `expected` (number), `unit` (string, e.g. %, sessions), and `narrative` (string) fields. "
-        "Leaving `expected` empty or omitting it will cause the entire response to be rejected.\n\n"
+        "Build an AI report with these sections:\n"
+        "- `diagnostics`: 2~4 core issues by device/browser/country, citing actual numbers from the data.\n"
+        "- `page_issues`: pages with high exit rate AND low dwell time from page_exit_rate data.\n"
+        "- `error_analysis`: paths + browsers with notable error rates from error_analysis data.\n"
+        "- `ux_recommendations`: actionable UX fixes with validation methods.\n"
+        "- `tech_recommendations`: technical fixes with monitoring approach.\n"
+        "- `priorities`: rank recommendations by effort vs impact as High/Medium/Low.\n"
+        "- `metrics_to_track`: which metrics to monitor after improvements, using only fields that exist in the data.\n\n"
         "Respond with JSON only, conforming to this schema:\n"
-        f"{json.dumps(schema_hint, ensure_ascii=False)}\n\n"
+        f"{json.dumps(_SCHEMA_HINT, ensure_ascii=False)}\n\n"
         f"WIDGET_API_BUNDLE:\n{json.dumps(bundle, ensure_ascii=False)}"
     )
-    return [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}]
+    return [{"role": "system", "content": _SYSTEM_PROMPT}, {"role": "user", "content": user_prompt}]
```
User-provided inputs (prompt, language, audience) are directly concatenated into the LLM prompt without sanitization or escaping. This makes the application vulnerable to prompt injection, where an attacker can manipulate the prompt field in the request to override system instructions, potentially leading to the generation of malicious content or the leakage of internal data structures in the AI report.
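One lightweight mitigation, sketched below under the assumption that `language` and `audience` are drawn from a small set of valid values, is to allowlist those two fields and fence the free-form `prompt` as inert, delimited data. The helper names and the allowlist contents are hypothetical, not from this PR:

```python
import re

# Hypothetical allowlists; the real valid values are an assumption here.
_ALLOWED_LANGUAGES = {"ko", "en"}
_ALLOWED_AUDIENCES = {"marketing", "dev"}

def sanitize_hint(prompt: str, max_len: int = 400) -> str:
    """Strip characters commonly used to fake chat-role markers, flatten
    newlines (so the hint cannot start a new instruction line), truncate."""
    cleaned = re.sub(r"[`<>]", "", prompt or "")
    cleaned = cleaned.replace("\n", " ").strip()
    return cleaned[:max_len]

def build_user_fields(prompt: str, language: str, audience: str) -> dict:
    """Return prompt fields with untrusted input constrained or fenced."""
    return {
        "language": language if language in _ALLOWED_LANGUAGES else "en",
        "audience": audience if audience in _ALLOWED_AUDIENCES else "dev",
        # Delimiters let the system prompt instruct the model to treat
        # everything between the markers as data, never as instructions.
        "user_hint": f"<<<USER_HINT_START>>>{sanitize_hint(prompt)}<<<USER_HINT_END>>>",
    }
```

Delimiting is not a complete defense against prompt injection, but combined with allowlisting it removes the cheapest attack vectors (role-marker spoofing and multi-line instruction smuggling).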
```python
# find matching close brace (simple depth count, ignore strings for speed)
depth = 0
end = -1
for i in range(start, len(blob)):
    if blob[i] == "{":
        depth += 1
    elif blob[i] == "}":
        depth -= 1
        if depth == 0:
            end = i
            break
```
The logic to find the matching closing brace } for a JSON object is based on a simple depth counter that doesn't account for braces inside string literals (e.g., {"key": "a value with { or }"}). This can lead to incorrect slicing of the JSON string and parsing failures if the LLM returns such content. The previous implementation was more robust as it tracked whether the parser was inside a string.
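Packaged as a standalone helper for illustration (the function name `extract_first_json_object` is not from the PR), a string-aware scan handles braces inside string literals that defeat the naive depth counter:

```python
import json
from typing import Optional

def extract_first_json_object(blob: str) -> Optional[dict]:
    """Slice the first top-level {...} out of an LLM response, tracking
    string literals and escape sequences so that braces appearing inside
    string values do not corrupt the depth count."""
    start = blob.find("{")
    if start == -1:
        return None
    depth = 0
    in_string = False
    escape = False
    for i in range(start, len(blob)):
        char = blob[i]
        if in_string:
            if escape:
                escape = False          # previous char was a backslash
            elif char == "\\":
                escape = True
            elif char == '"':
                in_string = False
        else:
            if char == '"':
                in_string = True
            elif char == "{":
                depth += 1
            elif char == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(blob[start:i + 1])
                    except json.JSONDecodeError:
                        return None
    return None
```

A naive counter would cut a payload like `{"note": "use { sparingly"}` at the wrong brace; this version parses it intact.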
```python
# find matching close brace (string-aware)
depth = 0
end = -1
in_string = False
escape = False
for i in range(start, len(blob)):
    char = blob[i]
    if in_string:
        if escape:
            escape = False
        elif char == '\\':
            escape = True
        elif char == '"':
            in_string = False
    else:
        if char == '"':
            in_string = True
        elif char == '{':
            depth += 1
        elif char == '}':
            depth -= 1
            if depth == 0:
                end = i
                break
```

```python
summary_sql = f"""
    SELECT
      device_type, browser_family, country_code, path,
      COUNT(*) AS cnt,
      COUNT(DISTINCT session_id) AS sessions,
      AVG(dwell_ms) AS avg_dwell_ms,
      SUM(CASE WHEN error_flag = true THEN 1 ELSE 0 END) AS error_count
    FROM events
    WHERE time >= TIMESTAMP '{from_str}' AND time < TIMESTAMP '{to_str}'
      AND session_id IS NOT NULL AND session_id <> ''
    GROUP BY device_type, browser_family, country_code, path
"""
```
The SQL query is constructed using an f-string to embed the date range. While safe here because the dates are internally generated, it's a security best practice to use parameterized queries to prevent potential SQL injection. The influxdb-client-3 library supports this via the params argument.
You should modify the query to use placeholders and pass the values in the params dictionary when calling client.query.
Example of the updated call:
```python
params = {"from_str": from_str, "to_str": to_str}
summary_rows = client.query(summary_sql, params=params).to_pylist()
```

```diff
-summary_sql = f"""
+summary_sql = """
     SELECT
       device_type, browser_family, country_code, path,
       COUNT(*) AS cnt,
       COUNT(DISTINCT session_id) AS sessions,
       AVG(dwell_ms) AS avg_dwell_ms,
       SUM(CASE WHEN error_flag = true THEN 1 ELSE 0 END) AS error_count
     FROM events
-    WHERE time >= TIMESTAMP '{from_str}' AND time < TIMESTAMP '{to_str}'
+    WHERE time >= $from_str AND time < $to_str
       AND session_id IS NOT NULL AND session_id <> ''
     GROUP BY device_type, browser_family, country_code, path
 """
```
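The exact keyword for bound parameters varies by InfluxDB client version, so treat `params=` above as the reviewer's sketch rather than a guaranteed signature, and verify it against the installed library. The underlying principle is the same one DB-API drivers implement: placeholders are bound by the driver, never spliced into the SQL text. A runnable illustration using `sqlite3` (standard library, chosen here purely as a stand-in for the principle, not as the PR's actual database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (path TEXT, ts TEXT)")
conn.execute("INSERT INTO events VALUES ('/home', '2024-01-02')")
conn.execute("INSERT INTO events VALUES ('/x', '2024-02-01')")

# Values are bound by the driver, never interpolated into the SQL string,
# so a malicious from_str/to_str cannot alter the query's structure.
params = {"from_str": "2024-01-01", "to_str": "2024-01-31"}
rows = conn.execute(
    "SELECT path FROM events WHERE ts >= :from_str AND ts < :to_str",
    params,
).fetchall()
```

Only the `/home` row falls inside the bound date range, so `rows` contains just that path.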
LGTM 😎



🧾 ApiLog
✏️ Summary
Refactored to resolve the timeout and CPU overload problems that occurred during data collection.
📢 Motivation
📌 Type of Change
🛠️ Implementation Details
Full overhaul of the AI report backend.
service.py shrank from 972 lines to 473, and all unnecessary abstraction, duplication, and unused features were removed.
🧪 Testing
Tested by running InfluxDB, generating data with a dummy-data seed, and then inserting an API key.
🖼️ Screenshots / Demos
✅ Checklist