nicdun · Aug 30, 2025
diff --git a/.cursor/rules/commit-message.mdc b/.cursor/rules/commit-message.mdc
@@ -0,0 +1,129 @@
+---
+description: Use when user wants to git commit
+globs: 
+alwaysApply: false
+---
+# Commit
+
+Create well-formatted commits that comply with Conventional Commits v1.0.0 and pass `@commitlint/config-conventional`.
+
+## Features:
+- Runs pre-commit checks by default (lint, build, generate docs)
+- Automatically stages files if none are staged
+- Uses the Conventional Commits format and validates message structure
+- Suggests splitting commits for different concerns
+
+## Usage:
+- `/commit` - Standard commit with pre-commit checks
+- `/commit --no-verify` - Skip pre-commit checks
+
+## Message Format
+
+```
+<type>[optional scope][!]: <description>
+
+[optional body]
+
+[optional footer(s)]
+```
+
+### Unified Rules (Spec + Commitlint)
+1. Start with `type[optional scope][!]: subject`.
+2. `type` MUST be lower-case and one of: `build`, `chore`, `ci`, `docs`, `feat`, `fix`, `perf`, `refactor`, `revert`, `style`, `test`.
+3. Use `feat` for new features; use `fix` for bug fixes.
+4. `scope` MAY be provided in parentheses and SHOULD be lower-case (e.g., `fix(parser):`).
+5. `subject` MUST be present, written in imperative mood, and SHOULD NOT end with a period.
+6. Keep header length ≤ 72 characters where practical.
+7. If a body is present, it MUST be separated by a blank line; body is free-form.
+8. If footers are present, they MUST be separated by a blank line; use tokens like `Refs`, `Closes`, `Reviewed-by`.
+9. Footer tokens MUST use `-` instead of spaces, except `BREAKING CHANGE`, which MAY contain a space. `BREAKING-CHANGE` is synonymous with `BREAKING CHANGE`.
+10. Breaking changes MUST be indicated either by `!` in the header or by a `BREAKING CHANGE:` footer. If `!` is used, the subject SHOULD describe the breaking change.
+
+### Types
+- feat: Introduces a new feature (MINOR)
+- fix: Patches a bug (PATCH)
+- build: Build system or external dependencies
+- chore: Other changes that don’t modify src or test files
+- ci: CI configuration files and scripts
+- docs: Documentation only changes
+- perf: Improves performance without functional change
+- refactor: Code change that neither fixes a bug nor adds a feature
+- revert: Reverts a previous commit
+- style: Changes that do not affect the meaning of the code
+- test: Adding or correcting tests
+
+### Breaking Changes
+- Indicate with `!` after type/scope, e.g., `feat(api)!: ...`, or add a footer:
+  - `BREAKING CHANGE: <description>`
+- `BREAKING-CHANGE` is synonymous with `BREAKING CHANGE` in footers.
+
+### Canonical Examples
+```
+feat: allow provided config object to extend other configs
+
+BREAKING CHANGE: `extends` key in config file is now used for extending other config files
+```
+
+```
+feat!: send an email to the customer when a product is shipped
+```
+
+```
+feat(api)!: send an email to the customer when a product is shipped
+```
+
+```
+chore!: drop support for Node 6
+
+BREAKING CHANGE: use JavaScript features not available in Node 6.
+```
+
+```
+docs: correct spelling of CHANGELOG
+```
+
+```
+fix: prevent racing of requests
+
+Introduce a request id and a reference to latest request. Dismiss
+incoming responses other than from latest request.
+
+Remove timeouts which were used to mitigate the racing issue but are
+obsolete now.
+
+Reviewed-by: Z
+Refs: #123
+```
+
+## Process:
+1. Check for staged changes (`git status`)
+2. If no staged changes, review and stage appropriate files
+3. Run pre-commit checks (unless --no-verify)
+4. Analyze changes to determine commit type
+5. Generate descriptive commit message
+6. Include scope if applicable: `type(scope): description`
+7. Add body for complex changes explaining why
+8. Add footers (e.g., `BREAKING CHANGE:`) if needed
+9. Execute commit
+
+## Quick Validation Checklist
+- [ ] Type is allowed and lower-case
+- [ ] Optional scope is lower-case and in parentheses
+- [ ] Subject present, imperative, no trailing period
+- [ ] Header ideally ≤ 72 chars
+- [ ] Blank line before body (if any)
+- [ ] Blank line before footer(s) (if any)
+- [ ] Breaking change indicated via `!` or `BREAKING CHANGE:` footer
+
+## Best Practices:
+- Keep commits atomic and focused
+- Write in imperative mood ("Add feature" not "Added feature")
+- Explain why, not just what
+- Reference issues/PRs when relevant
+- Split unrelated changes into separate commits
+- Prefer recognized types; avoid nonstandard types that hinder tooling
+- Follow the spec for breaking changes and footers; prefer concise, clear subjects
+
+References:
+- Conventional Commits 1.0.0 — https://www.conventionalcommits.org/en/v1.0.0/
+- @commitlint/config-conventional — https://github.com/conventional-changelog/commitlint/tree/master/%40commitlint/config-conventional
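Review note (illustrative, not part of this change): the header rules listed in `commit-message.mdc` can be approximated in a few lines of code. The sketch below uses a hypothetical `checkHeader` helper and a hypothetical `HEADER_PATTERN` regex; neither exists in this repository, and this is not how commitlint works internally. For simplicity it treats the SHOULD-level rules (lower-case scope, no trailing period, 72-character header) as errors, and it cannot check imperative mood.

```typescript
// Hypothetical helper: checks a commit header against the rules in commit-message.mdc.
const ALLOWED_TYPES = new Set<string>([
  "build", "chore", "ci", "docs", "feat", "fix",
  "perf", "refactor", "revert", "style", "test",
]);

// type, optional (scope), optional "!", then ": " and a non-empty subject.
const HEADER_PATTERN = /^([a-zA-Z]+)(?:\(([^()]+)\))?(!)?: (.+)$/;

interface HeaderCheck {
  valid: boolean;
  errors: string[];
}

function checkHeader(header: string, maxLength = 72): HeaderCheck {
  const match = HEADER_PATTERN.exec(header);
  if (!match) {
    return { valid: false, errors: ["header must look like type(scope)!: subject"] };
  }

  const errors: string[] = [];
  const type: string = match[1];
  const scope: string | undefined = match[2];
  const subject: string = match[4];

  if (!ALLOWED_TYPES.has(type)) {
    errors.push(`type "${type}" must be lower-case and one of the allowed types`);
  }
  if (scope !== undefined && scope !== scope.toLowerCase()) {
    errors.push(`scope "${scope}" should be lower-case`);
  }
  if (subject.endsWith(".")) {
    errors.push("subject should not end with a period");
  }
  if (header.length > maxLength) {
    errors.push(`header is longer than ${maxLength} characters`);
  }
  return { valid: errors.length === 0, errors };
}

console.log(checkHeader("feat(api)!: send an email to the customer when a product is shipped"));
console.log(checkHeader("Feat: add thing"));
```

The first call uses one of the canonical examples from the rule file and passes; the second fails because `Feat` is not a lower-case allowed type.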
diff --git a/.cursor/rules/lessons-learned.mdc b/.cursor/rules/lessons-learned.mdc
@@ -0,0 +1,178 @@
+---
+alwaysApply: true
+---
+
+# Lessons Learned - Kindle Parsing Bug Fix
+
+## Character Encoding Issues
+
+### Problem
+- HTML content from Kindle exports contains special characters that don't match test expectations
+- Non-breaking spaces (`&nbsp;` / `\u00A0`) appear identical to regular spaces but are different characters
+- Curly quotes (`‘’` / `“”`) vs straight quotes (`'` / `"`)
+- En/em dashes (`–` / `—`) vs regular hyphens (`-`)
+
+### Solution
+- Always implement text normalization for HTML parsing
+- Create a `normalizeText()` function that handles common character encoding differences
+- Apply normalization to both content text and metadata (chapter names, etc.)
+
+### Code Pattern
+```typescript
+const normalizeText = (text: string): string => {
+  return text
+    .replace(/\u00A0/g, " ") // Non-breaking spaces
+    .replace(/\u2019/g, "'") // Right single quote / apostrophe
+    .replace(/\u2018/g, "'") // Left single quote
+    .replace(/\u201D/g, '"') // Right double quote
+    .replace(/\u201C/g, '"') // Left double quote
+    .replace(/\u2013/g, "-") // En dashes
+    .replace(/\u2014/g, "-") // Em dashes
+    .trim();
+};
+```
+
+## Testing Best Practices
+
+### Always Test Features/Fixes
+- **Never skip testing** - every feature or fix must have corresponding tests
+- **Test with real data** - use actual HTML fixtures from different languages/formats
+- **Test edge cases** - empty content, missing elements, malformed HTML
+- **Test character encoding** - especially for international content
+
+### Test Structure
+- Test the first N items (e.g., first 5 highlights) to ensure parsing works consistently
+- Validate all fields: text, color, page, location, chapter, notes
+- Test both positive cases (valid data) and negative cases (missing/invalid data)
+
+### Test Setup
+- Use Vitest with jsdom environment for DOM parsing tests
+- Disable plugins that conflict with test environment (e.g., logseq plugin)
+- Use absolute paths for fixture files to avoid path resolution issues
+
+## Parsing Robustness
+
+### Language Agnostic Design
+- Don't rely on specific text strings like "Highlight" or "Note"
+- Use structural elements (CSS classes, DOM hierarchy) for identification
+- Support multilingual labels for page numbers, locations, etc.
+
+### Error Handling
+- Always check for null/undefined before accessing properties
+- Use optional chaining (`?.`) instead of non-null assertions (`!`)
+- Provide fallback values for missing data
+
+### DOM Navigation
+- Use `nextElementSibling` and `previousElementSibling` for related content
+- Check element classes rather than text content for identification
+- Handle cases where expected elements might be missing
+
+## Code Quality
+
+### Linting
+- Always run linter after making changes
+- Fix regex usage (use `exec()` instead of `match()`)
+- Remove unnecessary type assertions
+- Use optional chaining where appropriate
+
+### Documentation
+- Document complex parsing logic with comments
+- Explain the reasoning behind structural decisions
+- Note language-specific considerations
+
+## Key Takeaways
+
+1. **Character encoding is critical** for international content
+2. **Always test with real data** from the target environment
+3. **Structural parsing** is more reliable than text-based parsing
+4. **Error handling** should be defensive and graceful
+5. **Test coverage** should include multiple languages and edge cases
+6. **Code quality** (linting, type safety) prevents future bugs
+
+# Lessons Learned - Kindle Parsing Bug Fix
+
+## Character Encoding Issues
+
+### Problem
+- HTML content from Kindle exports contains special characters that don't match test expectations
+- Non-breaking spaces (`&nbsp;` / `\u00A0`) appear identical to regular spaces but are different characters
+- Curly quotes (`‘’` / `“”`) vs straight quotes (`'` / `"`)
+- En/em dashes (`–` / `—`) vs regular hyphens (`-`)
+
+### Solution
+- Always implement text normalization for HTML parsing
+- Create a `normalizeText()` function that handles common character encoding differences
+- Apply normalization to both content text and metadata (chapter names, etc.)
+
+### Code Pattern
+```typescript
+const normalizeText = (text: string): string => {
+  return text
+    .replace(/\u00A0/g, " ") // Non-breaking spaces
+    .replace(/\u2019/g, "'") // Right single quote / apostrophe
+    .replace(/\u2018/g, "'") // Left single quote
+    .replace(/\u201D/g, '"') // Right double quote
+    .replace(/\u201C/g, '"') // Left double quote
+    .replace(/\u2013/g, "-") // En dashes
+    .replace(/\u2014/g, "-") // Em dashes
+    .trim();
+};
+```
+
+## Testing Best Practices
+
+### Always Test Features/Fixes
+- **Never skip testing** - every feature or fix must have corresponding tests
+- **Test with real data** - use actual HTML fixtures from different languages/formats
+- **Test edge cases** - empty content, missing elements, malformed HTML
+- **Test character encoding** - especially for international content
+
+### Test Structure
+- Test the first N items (e.g., first 5 highlights) to ensure parsing works consistently
+- Validate all fields: text, color, page, location, chapter, notes
+- Test both positive cases (valid data) and negative cases (missing/invalid data)
+
+### Test Setup
+- Use Vitest with jsdom environment for DOM parsing tests
+- Disable plugins that conflict with test environment (e.g., logseq plugin)
+- Use absolute paths for fixture files to avoid path resolution issues
+
+## Parsing Robustness
+
+### Language Agnostic Design
+- Don't rely on specific text strings like "Highlight" or "Note"
+- Use structural elements (CSS classes, DOM hierarchy) for identification
+- Support multilingual labels for page numbers, locations, etc.
+
+### Error Handling
+- Always check for null/undefined before accessing properties
+- Use optional chaining (`?.`) instead of non-null assertions (`!`)
+- Provide fallback values for missing data
+
+### DOM Navigation
+- Use `nextElementSibling` and `previousElementSibling` for related content
+- Check element classes rather than text content for identification
+- Handle cases where expected elements might be missing
+
+## Code Quality
+
+### Linting
+- Always run linter after making changes
+- Fix regex usage (use `exec()` instead of `match()`)
+- Remove unnecessary type assertions
+- Use optional chaining where appropriate
+
+### Documentation
+- Document complex parsing logic with comments
+- Explain the reasoning behind structural decisions
+- Note language-specific considerations
+
+## Key Takeaways
+
+1. **Character encoding is critical** for international content
+2. **Always test with real data** from the target environment
+3. **Structural parsing** is more reliable than text-based parsing
+4. **Error handling** should be defensive and graceful
+5. **Test coverage** should include multiple languages and edge cases
+6. **Code quality** (linting, type safety) prevents future bugs
+
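Review note (illustrative, not part of this change): the `normalizeText()` pattern in `lessons-learned.mdc` chains one `replace` call per character. A table-driven variant is sketched below as an alternative, not as the author's implementation; the names `REPLACEMENTS`, `LOOKALIKES`, and `normalizeTextSinglePass` are hypothetical. It maps the same seven code points in a single pass, which may be easier to extend when new look-alike characters turn up in fixtures.

```typescript
// Look-alike characters and their plain ASCII replacements
// (same mappings as normalizeText() in lessons-learned.mdc).
const REPLACEMENTS: Record<string, string> = {
  "\u00A0": " ", // non-breaking space
  "\u2018": "'", // left single quote
  "\u2019": "'", // right single quote / apostrophe
  "\u201C": '"', // left double quote
  "\u201D": '"', // right double quote
  "\u2013": "-", // en dash
  "\u2014": "-", // em dash
};

// One character class covering every key of REPLACEMENTS.
const LOOKALIKES = /[\u00A0\u2018\u2019\u201C\u201D\u2013\u2014]/g;

function normalizeTextSinglePass(text: string): string {
  // Replace each look-alike via the table, then trim surrounding whitespace.
  return text.replace(LOOKALIKES, (ch) => REPLACEMENTS[ch] ?? ch).trim();
}

const raw = "\u201CIt\u2019s\u00A0fine\u201D \u2013 she said\u00A0";
console.log(normalizeTextSinglePass(raw)); // prints: "It's fine" - she said
```

Comparing normalized strings on both sides of a test assertion (fixture text and expected text) avoids failures caused by characters that render identically but differ in code point.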
diff --git a/.github/workflows/pr.yaml b/.github/workflows/pr.yaml
@@ -0,0 +1,39 @@
+name: PR CI
+
+on:
+  pull_request:
+    types: [opened, synchronize, reopened, edited, ready_for_review]
+
+jobs:
+  ci:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+
+      - name: Setup Node.js
+        uses: actions/setup-node@v4
+        with:
+          node-version: '20'
+          cache: 'npm'
+
+      - name: Install dependencies
+        run: npm install
+
+      - name: Commitlint
+        uses: wagoid/commitlint-github-action@v6
+        with:
+          configFile: commitlint.config.mjs
+
+      - name: Lint
+        run: npm run lint
+
+      - name: Test
+        run: npm run test
+
+      - name: Build
+        run: npm run build
+
+
diff --git a/.github/workflows/publish.yaml b/.github/workflows/publish.yaml
@@ -20,7 +20,10 @@ jobs:
       - name: install dependencies
         run: |
           npm install
-      - name: build and test
+      - name: test
+        run: |
+          npm run test
+      - name: build
         run: |
           npm run build
       - name: Install zip

diff --git a/README.md b/README.md
@@ -29,3 +29,5 @@ If you find that something isn't working right then I'm always happy to hear it
 
 ## ☕ Thank you!
 A big thank you to the creators of the awesome logseq application :)
+
+<a href="https://www.buymeacoffee.com/nicdun" rel="nofollow"><img src="https://user-images.githubusercontent.com/3909046/150683481-be070424-7bb0-4dd7-a3cb-43b5605163f5.png" alt="buymeacoffee-button" style="max-width: 100%;"></a>
diff --git a/commitlint.config.js b/commitlint.config.js
@@ -0,0 +1,3 @@
+export default {
+  extends: ["@commitlint/config-conventional"],
+};
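Review note (illustrative, not part of this change): `commit-message.mdc` asks for headers of at most 72 characters where practical, while `@commitlint/config-conventional` enforces a 100-character limit by default. If the project wanted commitlint to flag longer headers, a config along these lines could do it. This is a sketch only; the warning level (`1`) is an assumption chosen to match the "where practical" wording.

```typescript
// Sketch: extend the conventional config and warn (level 1) when the
// header exceeds 72 characters, instead of the default 100-character error.
export default {
  extends: ["@commitlint/config-conventional"],
  rules: {
    "header-max-length": [1, "always", 72],
  },
};
```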