Skip to content

ci: add Python data tools workflow#714

Draft
tianjianjiang wants to merge 2 commits intorefactor/py3_migrationfrom
ci/python_data_workflow
Draft

ci: add Python data tools workflow#714
tianjianjiang wants to merge 2 commits intorefactor/py3_migrationfrom
ci/python_data_workflow

Conversation

@tianjianjiang
Copy link
Copy Markdown
Member

@tianjianjiang tianjianjiang commented Oct 22, 2025

User description

Summary

Creates dedicated workflow for Python dictionary data tools that runs independently from Xcode builds.

Problem

Python data tool changes currently trigger expensive macOS Xcode runners unnecessarily. Need separate workflow for Python-only changes.

Changes

Python Data Tools Workflow (NEW: python-data-check.yml)

Triggers when:

  • Python files in Source/Data/ change
  • curation/ package files change
  • Makefile or pyproject.toml change
  • Workflow file itself changes

Runs on:

  • Ubuntu (cost-effective)
  • Tests: make check in Source/Data directory

Benefits

  • Independent CI for Python changes
  • Faster feedback (no Xcode build wait)
  • Cost optimization (Linux vs macOS)
  • Proper validation for data tool changes

Status

DRAFT - Needs fixing. Current make check may have issues that need to be resolved.


🤖 Generated with Claude Code


PR Type

Enhancement, Tests


Description

  • Add CI workflow for Python data tools

  • Trigger on Data, curation, scripts changes

  • Run make check on Ubuntu runners

  • Separate from macOS Xcode builds


Diagram Walkthrough

flowchart LR
  A["Git push/PR affecting Source/Data"] -- "triggers" --> B["Python Data Tools workflow"]
  B -- "runs on" --> C["ubuntu-latest"]
  C -- "execute" --> D["make check (Source/Data)"]
Loading

File Walkthrough

Relevant files
Configuration changes
python-data-check.yml
Add Python data CI workflow                                                           

.github/workflows/python-data-check.yml

  • Introduce new GitHub Actions workflow
  • Trigger on Python/Data, curation, scripts, config changes
  • Run make check in Source/Data on ubuntu-latest
  • Applies to push and pull_request events
+28/-0   

Comment on lines +23 to +28
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Test data files
run: make check
working-directory: Source/Data

Check warning

Code scanning / CodeQL

Workflow does not contain permissions

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {{contents: read}}

Copilot Autofix

AI 6 months ago

To fix this problem, add an explicit permissions block to limit the default permissions of the GITHUB_TOKEN used by the workflow.
The single best way is to add this block at the top-level of the workflow (YAML root), so it applies to all jobs unless overridden. The minimal starting point for most workflows is contents: read, which suffices for read-only actions such as checkout and running code without modifying repository contents or interacting with issues or pull requests.

You should add:

permissions:
  contents: read

immediately after the name: field and before the on: field, to be clear and conventional.

No changes to imports, methods, or definitions are required since this is GitHub workflow YAML.

Suggested changeset 1
.github/workflows/python-data-check.yml

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/python-data-check.yml b/.github/workflows/python-data-check.yml
--- a/.github/workflows/python-data-check.yml
+++ b/.github/workflows/python-data-check.yml
@@ -1,4 +1,6 @@
 name: Python Data Tools
+permissions:
+  contents: read
 
 on:
   push:
EOF
@@ -1,4 +1,6 @@
name: Python Data Tools
permissions:
contents: read

on:
push:
Copilot is powered by AI and may make mistakes. Always verify output.
@github-actions
Copy link
Copy Markdown

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Trigger Scope

The path filters may miss non-Python files that affect checks (e.g., requirements files, lockfiles, config like .flake8, .editorconfig, setup scripts) or include overly broad dirs like scripts without extensions. Confirm the intended coverage and whether additional files (requirements.txt, poetry.lock, .env.example, .pre-commit-config.yaml) should trigger the workflow.

  paths:
    - 'Source/Data/**/*.py'
    - 'Source/Data/curation/**'
    - 'Source/Data/scripts/**'
    - 'Source/Data/pyproject.toml'
    - 'Source/Data/Makefile'
    - '.github/workflows/python-data-check.yml'
pull_request:
  paths:
    - 'Source/Data/**/*.py'
    - 'Source/Data/curation/**'
    - 'Source/Data/scripts/**'
    - 'Source/Data/pyproject.toml'
    - 'Source/Data/Makefile'
    - '.github/workflows/python-data-check.yml'
Tooling Setup

The job runs make check without setting up Python, caching, or dependencies. If make check relies on a specific Python version or pip/poetry deps, add actions/setup-python and dependency install steps to avoid flaky failures.

runs-on: ubuntu-latest
steps:
  - uses: actions/checkout@v4
  - name: Test data files
    run: make check
    working-directory: Source/Data
Permissions/Concurrency

Consider adding minimal permissions and concurrency/group cancellation to reduce CI load and avoid stale runs on rapid pushes, especially for data tool changes.

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Test data files
        run: make check
        working-directory: Source/Data

@github-actions
Copy link
Copy Markdown

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Security
Reduce workflow token privileges

Add permissions: read-all or least-privilege permissions at the job level to
minimize token scope. GitHub defaults can be overly permissive and pose a security
risk.

.github/workflows/python-data-check.yml [21-28]

 jobs:
   test:
+    permissions:
+      contents: read
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4
       - name: Test data files
         run: make check
         working-directory: Source/Data
Suggestion importance[1-10]: 8

__

Why: Tightening GITHUB_TOKEN permissions to least privilege meaningfully reduces security risk while preserving functionality; the change is accurate and directly applicable to this job.

Medium
General
Stabilize CI environment and shell

Pin the runner or core actions to a specific major/minor to prevent unexpected
breakages from environment changes. Also add defaults.run.shell: bash to ensure
consistent shell behavior for make check.

.github/workflows/python-data-check.yml [21-28]

 jobs:
   test:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-24.04
+    defaults:
+      run:
+        shell: bash
     steps:
       - uses: actions/checkout@v4
       - name: Test data files
         run: make check
         working-directory: Source/Data
Suggestion importance[1-10]: 6

__

Why: Pinning the runner (e.g., ubuntu-24.04) and setting a default shell improves reproducibility with minimal risk; it's a good practice though not critical for a simple make check job.

Low
Avoid unnecessary workflow runs

Include paths-ignore for common non-code changes (e.g., docs) or add a workflow_call
trigger if this workflow is reused by other pipelines. This prevents unnecessary
runs and avoids missing triggers when invoked from composite workflows.

.github/workflows/python-data-check.yml [3-19]

 on:
   push:
     paths:
       - 'Source/Data/**/*.py'
       - 'Source/Data/curation/**'
       - 'Source/Data/scripts/**'
       - 'Source/Data/pyproject.toml'
       - 'Source/Data/Makefile'
       - '.github/workflows/python-data-check.yml'
+    paths-ignore:
+      - '**/*.md'
+      - '**/*.rst'
   pull_request:
     paths:
       - 'Source/Data/**/*.py'
       - 'Source/Data/curation/**'
       - 'Source/Data/scripts/**'
       - 'Source/Data/pyproject.toml'
       - 'Source/Data/Makefile'
       - '.github/workflows/python-data-check.yml'
+    paths-ignore:
+      - '**/*.md'
+      - '**/*.rst'
+  workflow_call:
Suggestion importance[1-10]: 5

__

Why: Adding paths-ignore and workflow_call can reduce needless CI runs and improve reuse, but it's optional and context-dependent; the proposed YAML is syntactically reasonable and aligns with the shown new hunk.

Low

@claude
Copy link
Copy Markdown

claude bot commented Oct 22, 2025

Code Review: Python Data Tools Workflow

Overview

This PR introduces a dedicated CI workflow for Python data tools, which is a good optimization strategy to avoid triggering expensive macOS runners for Python-only changes. The implementation is straightforward and follows GitHub Actions best practices.

Positive Aspects

Clear separation of concerns: Python-only changes now have their own workflow
Cost optimization: Using Ubuntu instead of macOS for Python validation
Appropriate path filters: Triggers on relevant file changes only
Proper checkout action: Uses up-to-date actions/checkout@v4
Follows Conventional Commits: PR title correctly formatted as ci: add Python data tools workflow

Critical Issues

1. Missing Python Setup Step ⚠️

Issue: The workflow assumes Python 3 is available but doesn't explicitly set it up. While Ubuntu runners come with Python pre-installed, this creates:

  • Unpredictable Python version (could be 3.8, 3.9, 3.10, 3.11, or 3.12)
  • Potential compatibility issues with the codebase

From pyproject.toml: The project requires >=3.9 and AGENTS.md specifies Python 3.9.

Recommendation: Add explicit Python setup:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.9'
      - name: Test data files
        run: make check
        working-directory: Source/Data

2. Missing Python Package Installation ⚠️

Issue: The make check target runs successfully without dependencies (it's just shell commands), but the workflow name "Python Data Tools" is misleading—it only validates data file integrity, not Python code.

From Makefile analysis (line 56-61):

check:
    @awk 'length($$1)/3!=NF-1' BPMFMappings.txt
    @awk '{print $$2,$$3,$$4,$$5,$$6,$$7}' BPMFMappings.txt | awk '/一/'
    @diff -u <(awk '{print $$1}' BPMFMappings.txt|sort -u) \
             <(awk 'length($$1)>3{print $$1}' phrase.occ|sort -u)

Current behavior: make check only validates data file format, it doesn't test Python code.

If the intent is to validate Python tools, consider:

  1. Installing the Python package:
- name: Install Python dependencies
  run: |
    pip install -e .
  working-directory: Source/Data
  1. Running additional validation:
- name: Run Python code style checks
  run: make _mycodecheck
  working-directory: Source/Data
  1. If tests exist:
- name: Run Python tests
  run: pytest
  working-directory: Source/Data

3. Path Filter Redundancy

Issue: Some paths are redundant:

paths:
  - 'Source/Data/**/*.py'        # This already covers the next two
  - 'Source/Data/curation/**'    # Redundant
  - 'Source/Data/scripts/**'     # Redundant

Recommendation: Simplify to:

paths:
  - 'Source/Data/**/*.py'
  - 'Source/Data/pyproject.toml'
  - 'Source/Data/Makefile'
  - '.github/workflows/python-data-check.yml'

Or if you want to catch changes to non-Python files in those directories:

paths:
  - 'Source/Data/**'
  - '!Source/Data/**/*.txt'  # Exclude data files
  - '!Source/Data/**/*.list' # Exclude data files
  - '.github/workflows/python-data-check.yml'

Minor Issues

4. Workflow Naming

Issue: The workflow is named "Python Data Tools" but only runs make check, which validates data files (not Python tools).

Recommendation: Either:

  • Rename to "Data File Validation" to match actual behavior
  • Or expand scope to include Python linting/testing

5. Missing Shell Specification for make check

Issue: The Makefile's check target requires bash (line 56: check: SHELL:=/bin/bash), but the workflow doesn't guarantee bash is used.

Recommendation: While Ubuntu uses bash by default, be explicit:

- name: Test data files
  run: make check
  working-directory: Source/Data
  shell: bash

Best Practice Suggestions

6. Add Status Badge

Consider adding a status badge to README to show this workflow's status:

[![Python Data Tools](https://github.com/openvanilla/McBopomofo/actions/workflows/python-data-check.yml/badge.svg)](https://github.com/openvanilla/McBopomofo/actions/workflows/python-data-check.yml)

7. Consider Caching

If you expand this workflow to install Python packages, consider caching:

- name: Set up Python
  uses: actions/setup-python@v5
  with:
    python-version: '3.9'
    cache: 'pip'

8. Align with Main Build Workflow

Observation: The main build workflow (continuous-integration-workflow-xcode-latest.yml:60-62) also runs make check but on macOS.

Recommendation: Consider:

  • Should this Python workflow replace that step in the main workflow for Python changes?
  • Or should both run (redundant but ensures consistency)?

Document the decision in PR description.

Security Considerations

✅ No security concerns identified
✅ No credential handling
✅ No external script execution
✅ Uses official GitHub Actions only

Test Coverage

⚠️ Gap: No Python code testing, only data file validation

Recommendation: Add Python tests if they exist, or create basic smoke tests for the curation package.

Performance Considerations

✅ Efficient: Ubuntu runners start faster than macOS
✅ Good: Path filters prevent unnecessary runs
✅ Fast: make check is lightweight (just awk/diff commands)

Summary and Recommendations

Must Fix Before Merge

  1. ✅ Add explicit Python version setup (actions/setup-python@v5)
  2. ⚠️ Clarify workflow purpose: data validation vs Python tool testing

Should Fix

  1. Simplify redundant path filters
  2. Consider adding Python package installation if needed
  3. Add explicit bash shell specification

Nice to Have

  1. Add status badge
  2. Add pip caching if installing packages
  3. Clarify relationship with main build workflow's make check

Questions for PR Author

  1. Intent: Is this workflow meant to validate data files only, or also test Python code?
  2. Scope: Should this include make _mycodecheck for Python linting?
  3. Replacement: Should this replace the make check step in the main build workflow for Python-only changes?
  4. Future: Are there plans to add Python unit tests that should run here?

Overall Assessment

Status: Needs revision before merge

Reasoning: While the core idea is sound, the workflow needs explicit Python setup and clarification of scope. The missing Python version specification could lead to inconsistent behavior across runner updates.

Estimated effort to fix: Low (10-15 minutes to add Python setup and clarify scope)


Great initiative on separating Python CI! This will definitely improve feedback speed for data tool changes.

Please address the Python setup issue and clarify the workflow's intended scope, then this will be ready to merge.

Copy link
Copy Markdown
Contributor

@zonble zonble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unit tests are failing.

test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can use v5

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zonble @xatier Thank you for the comments. This pull request is a draft because I will have to fix Python 3 compatibility issue first. And then I will get back to this pull request.

tianjianjiang and others added 2 commits October 25, 2025 16:49
This commit migrates the Source/Data directory to Python 3 standards
and modern tooling:

Python 3 Migration:
- Converted all scripts to Python 3 (shebangs, unichr→chr, print functions)
- Fixed string comparisons (is → ==)
- Removed coding declarations (unnecessary in Python 3)
- Migrated to modern type hints (dict/list/tuple instead of Dict/List/Tuple)
- Applied Ruff auto-fixes (IOError→OSError, formatting)

Code Refactoring:
- Completely refactored frequency_builder.py with functions and type hints
- Rewrote main_compiler.py with structured functions and documentation
- Modernized plain_bpmf_compiler.py with proper function decomposition
- Refactored analyze_data.py with type aliases and clean structure
- Improved count_occurrences.py with better organization
- Rewrote map_bpmf.py to be more pythonic
- Created audit_encoding.py (Python replacement for Swift version)
- Converted score_validator.py to f-strings

Modern Tooling:
- Added Ruff configuration for linting/formatting
- Replaced pycodestyle with Ruff in dev dependencies
- Improved .gitignore with official Python template
- Documented Ruff usage in README.md

File Handling Modernization:
- Replaced while True: readline() with for line in file
- Ensured all file operations use with statements
- Converted string paths to pathlib.Path throughout
- Improved error handling with specific exceptions

Documentation:
- Consolidated 3 documentation files into single README.md
- Updated AGENTS.md navigation to Source/Data/README.md
- Added Ruff documentation and usage examples
- Fixed console script naming (mcbpmf_audit_encoding→mcbpmf-audit-encoding)

Legacy Cleanup:
- Deleted entire bin_legacy/ directory (29 files)
- Removed obsolete DEPRECATED.md, old README files
- Removed curation/README.md (merged into main README.md)

All changes maintain backward compatibility with existing workflows
and Makefile targets. Code formatted with Ruff for consistency.

Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Create dedicated workflow for Python dictionary data tools that runs
independently from Xcode builds. This ensures Python-only changes get
proper CI validation without triggering expensive macOS runners.

Triggers when:
- Python files in Source/Data/ change
- curation/ package files change
- Makefile or pyproject.toml change

Runs on:
- Ubuntu (cost-effective)
- Only tests data file validation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@tianjianjiang tianjianjiang force-pushed the ci/python_data_workflow branch from 8912cb4 to 96db181 Compare October 25, 2025 07:51
@tianjianjiang tianjianjiang changed the base branch from master to refactor/py3_migration October 25, 2025 07:54
@tianjianjiang tianjianjiang force-pushed the refactor/py3_migration branch 6 times, most recently from 1ef9514 to 2203bf3 Compare November 1, 2025 16:32
@tianjianjiang tianjianjiang force-pushed the refactor/py3_migration branch from 2203bf3 to c1b7f61 Compare November 3, 2025 20:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants