Skip to content

fix: dollar-quote over-capture and plpgsql NULL truncation#18

Merged
gmr merged 3 commits into
mainfrom
fix/dollar-quote-and-null-scanner
May 10, 2026
Merged

fix: dollar-quote over-capture and plpgsql NULL truncation#18
gmr merged 3 commits into
mainfrom
fix/dollar-quote-and-null-scanner

Conversation

@gmr
Copy link
Copy Markdown
Owner

@gmr gmr commented May 10, 2026

Addresses the two parser bugs reported in #10 by @AlexS778.

1. Dollar-quoted string over-capture (postgres)

dollar_quoted_string was a token(/\$[a-zA-Z_0-9]*\$[\s\S]*?\$[a-zA-Z_0-9]*\$/). Tree-sitter compiles token() regexes into a lexer DFA that does not honor non-greedy *?, so the match was effectively greedy and consumed across multiple $$...$$ strings in a single file — two CREATE FUNCTION ... $$ ... $$; statements collapsed into one CreateFunctionStmt whose dollar_quoted_string spanned both bodies. The regex also did not enforce matched open/close tags ($a$body$b$ would have matched).

Fix: add a postgres external scanner (postgres/src/scanner.c) that mirrors PostgreSQL's lexer — record the opening tag, scan forward, only succeed on a matching $tag$. Codegen emits an externals block and drops the lexer-level token rule, so the fix survives tree-sitter generate regeneration when PostgreSQL releases a new version. All five binding manifests (binding.gyp, build.rs, binding.go, setup.py, Package.swift) now compile the scanner.

2. plpgsql scanner truncates SQL at NULL

The plpgsql _sql_body external scanner treated null as a structural delimiter at every depth-0 occurrence, so IF _kind IS NULL OR length(_kind) = 0 THEN parsed as sql_expression = "_kind IS " with an ERROR node covering NULL OR length(_kind) = 0.

Fix: track an expecting_value flag in the scanner. Operators (+ - * / % < > = ~ ! @ # ^ & | ? ,), opening parens, and value-expecting keywords (IS, NOT, AND, OR, IN, LIKE, ILIKE, BETWEEN, SIMILAR, AS) set it true. The flag defaults to false so a bare leading NULL still falls through to kw_null and stmt_null matches — preserving every existing corpus test.

INTO is intentionally left untouched. As @AlexS778 noted, SELECT col INTO var FROM t requires knowing whether a select-list has already been consumed, which is messier than the NULL case and deserves its own pass.

Test plan

  • tree-sitter test — postgres corpus: 38/38 (3 new dollar-quote tests)
  • tree-sitter test — plpgsql corpus: 56/56 (3 new IS NULL / IS NOT NULL / NULL comparison tests)
  • cargo test — Rust bindings load both grammars
  • Original bug reproductions parse correctly: two $$...$$ functions in one file produce two CreateFunctionStmts; IF x IS NULL THEN ... END IF parses with full sql_expression

Summary by CodeRabbit

  • New Features

    • External scanner for PostgreSQL dollar-quoted strings now correctly handles tag matching without regex limitations.
  • Bug Fixes

    • Improved PL/pgSQL null keyword handling with better value-expectation tracking to prevent misinterpretation in SQL expressions.
  • Tests

    • Added test cases for dollar-quoted string expressions.
    • Added test cases for NULL comparison handling in control flow.

Review Change Stack

gmr and others added 2 commits May 10, 2026 16:07
Tree-sitter compiles token() regexes into a DFA that ignores non-greedy
*?, so the previous `$[a-zA-Z_0-9]*$[\s\S]*?$[a-zA-Z_0-9]*$` token was
effectively greedy and consumed across multiple dollar-quoted strings in
a single file — a two-function file collapsed into one CreateFunctionStmt
whose dollar_quoted_string spanned both bodies. The regex also did not
enforce matched open/close tags.

Add a postgres external scanner that mirrors PostgreSQL's own lexer:
record the opening tag, scan forward, and only succeed on a matching
$tag$. Codegen now emits an externals block and drops the lexer-level
token rule. All binding manifests (binding.gyp, build.rs, binding.go,
setup.py, Package.swift) compile postgres/src/scanner.c.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The plpgsql external scanner treated NULL as a structural delimiter at
every depth-0 occurrence, so \`IF x IS NULL THEN\` truncated the SQL
expression to \`x IS \` and produced an ERROR node around NULL. NULL is a
delimiter only when it stands alone as the NULL statement; inside an
expression — after IS, IS NOT, =, <>, !=, IN, AND, OR, LIKE, BETWEEN,
NOT, AS, or a binary operator — it is the SQL literal and must be
consumed.

Track an \`expecting_value\` flag that operators, opening parens, comma,
and value-expecting keywords set true. Default false so a leading bare
\`NULL\` still falls through to kw_null and \`stmt_null\` matches. INTO is
intentionally left for a separate pass per the report in #10.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 10, 2026

Warning

Rate limit exceeded

@gmr has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 27 minutes and 7 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5f43572a-ba65-41d8-a6ef-fedb53c68bb9

📥 Commits

Reviewing files that changed from the base of the PR and between 1e17682 and 939b3c0.

📒 Files selected for processing (4)
  • plpgsql/src/parser.c
  • plpgsql/src/scanner.c
  • plpgsql/test/corpus/control_flow.txt
  • postgres/src/scanner.c
📝 Walkthrough

Walkthrough

This PR implements an external scanner for PostgreSQL dollar-quoted strings to enforce exact tag matching, refactors the grammar to delegate this tokenization, improves PL/pgSQL null-value parsing through an expecting_value heuristic, integrates the scanner across all language bindings, and validates both with new corpus tests.

Changes

Dollar-Quoted String Tokenization & Null Handling

Layer / File(s) Summary
Grammar Contract & External Declaration
postgres/grammar.js, postgres/src/grammar.json, script/codegen.js
PostgreSQL grammar declares dollar_quoted_string as external token provided by scanner; removed inline regex pattern. Code generators updated to emit externals block into grammar.js and grammar.json.
PostgreSQL External Scanner
postgres/src/scanner.c
New C scanner validates opening $tag$ delimiters (63-char limit), scans forward for exact matching closing tag. Self-contained ASCII helpers avoid libc. All tree-sitter scanner interface functions implemented (create, destroy, serialize, deserialize, scan).
PL/pgSQL Scanner Null Handling
plpgsql/src/scanner.c
Adds expecting_value heuristic flag to differentiate null-as-SQL from null-as-delimiter. Sets flag on opening brackets, clears on closing brackets and strings. For alphabetic words at depth 0, only value-expecting keywords update flag; null treated as SQL_BODY when flag true. Removes structural keyword early-return logic.
Build System Integration
Package.swift, binding.gyp, bindings/go/binding.go, bindings/rust/build.rs, setup.py
Scanner source file postgres/src/scanner.c registered in Swift package target, Node binding config, Go cgo preamble, Rust build with rerun-if-changed triggers, and Python C extension sources.
Parser Regeneration
postgres/src/parser.c
Git LFS pointer updated (oid and size) reflecting parser rebuild from new grammar with externals.
Corpus Tests
postgres/test/corpus/expressions.txt, plpgsql/test/corpus/control_flow.txt
PostgreSQL: three dollar-quoted cases (untagged, tagged with nesting, multiple per line). PL/pgSQL: three IF-condition cases testing IS NULL, IS NOT NULL, and NULL in boolean expressions.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • gmr/tree-sitter-postgres#13: Modifies plpgsql/src/scanner.c external scanner logic and helpers with overlapping changes to scanner behavior.
  • gmr/tree-sitter-postgres#4: Implements coordinated grammar and scanner changes for dollar-quoted string support and boolean operator / IS NULL parsing fixes.
  • gmr/tree-sitter-postgres#5: Adds injection rules that depend on the dollar_quoted_string external token introduced by this PR.

Poem

A scanner born to match the tags so bright,
Dollar-quotes now dance with perfect might.
Null grows wise, expects what comes its way,
Through parenthesis and strings to play. 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: dollar-quote over-capture and plpgsql NULL truncation' accurately and concisely summarizes the two main bugs being fixed in this changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/dollar-quote-and-null-scanner

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gmr
Copy link
Copy Markdown
Owner Author

gmr commented May 10, 2026

@CodeRabbit review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 10, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
plpgsql/src/scanner.c (1)

94-109: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Single < doesn't update expecting_value, so x < NULL truncates the expression.

The catch-all at lines 372–376 lists < as an operator that should set expecting_value = true, but that branch is unreachable: the << block-label check at lines 94–109 always handles < first and continues without updating the flag. Trace for IF x < NULL THEN:

  • xexpecting_value = false
  • < → lines 94–109 consume the char and continue; flag unchanged
  • NULLexpecting_value is still false, so the consume-as-value branch (lines 261–265) is skipped and null matches the delimiter list at line 303, stopping the scanner before NULL.

<>, <=, !=, > all happen to work because the second char (>/=) goes through the catch-all and sets the flag, but a bare < followed by NULL is broken. Worth fixing while you're touching this logic, since the rest of the heuristic is consistent.

🛠️ Proposed fix
       if (lexer->lookahead == '<') {
         /* << found — stop before it */
         if (has_content) {
           lexer->result_symbol = SQL_BODY;
           return true;
         }
         return false;
       }
       /* Single < — part of SQL operator, continue */
-
       has_content = true;
+      if (depth == 0) expecting_value = true;
       continue;
     }

Consider also adding a corpus case like IF x < NULL THEN ... END IF; to lock the behavior in.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plpgsql/src/scanner.c` around lines 94 - 109, The single-'<'' branch in the
scanner loop (the depth == 0 && lexer->lookahead == '<' block that calls
lexer->mark_end and lexer->advance) consumes a lone '<' but never updates the
expecting_value flag, causing constructs like "IF x < NULL" to truncate; modify
that branch so that when you detect a single '<' (i.e., not '<<') you set
expecting_value = true (preserving the existing early-return behavior for '<<'
and the has_content/SQL_BODY logic), and add a corpus test such as "IF x < NULL
THEN ... END IF;" to lock the behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@postgres/src/scanner.c`:
- Around line 33-35: Split the single is_tag_char check into two functions and
use the start-specific check for the first tag byte: add is_tag_start(int c)
which returns true for ASCII letters and underscore and for any non-ASCII byte
((unsigned char)c >= 0x80) but explicitly rejects ASCII digits, and keep/rename
the existing is_tag_char (or create is_tag_continue) to allow ASCII letters,
digits, underscore and non-ASCII bytes for subsequent bytes; then change the
tag-parsing code that currently calls is_tag_char for the first byte to call
is_tag_start instead (referencing is_tag_char/is_tag_start and the scanner
tag-parsing code).

---

Outside diff comments:
In `@plpgsql/src/scanner.c`:
- Around line 94-109: The single-'<'' branch in the scanner loop (the depth == 0
&& lexer->lookahead == '<' block that calls lexer->mark_end and lexer->advance)
consumes a lone '<' but never updates the expecting_value flag, causing
constructs like "IF x < NULL" to truncate; modify that branch so that when you
detect a single '<' (i.e., not '<<') you set expecting_value = true (preserving
the existing early-return behavior for '<<' and the has_content/SQL_BODY logic),
and add a corpus test such as "IF x < NULL THEN ... END IF;" to lock the
behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e175e4ee-ed72-4e2c-932f-930368fecab6

📥 Commits

Reviewing files that changed from the base of the PR and between 184255d and 1e17682.

📒 Files selected for processing (13)
  • Package.swift
  • binding.gyp
  • bindings/go/binding.go
  • bindings/rust/build.rs
  • plpgsql/src/scanner.c
  • plpgsql/test/corpus/control_flow.txt
  • postgres/grammar.js
  • postgres/src/grammar.json
  • postgres/src/parser.c
  • postgres/src/scanner.c
  • postgres/test/corpus/expressions.txt
  • script/codegen.js
  • setup.py

Comment thread postgres/src/scanner.c Outdated
- postgres scanner: split is_tag_char into is_tag_start_char (letter,
  underscore, or non-ASCII >=0x80) and is_tag_char (start chars + digits).
  Previously the first byte accepted digits ($1$...$1$ — invalid per PG
  identifier rules) and rejected non-ASCII letters ($é$...$é$ — valid).
- plpgsql scanner: the single-`<` branch consumed the char and continued
  without updating expecting_value, so `IF x < NULL THEN` truncated at
  NULL because the catch-all that flips the flag for `<` was unreachable.
  Set expecting_value = true on the single-`<` path so the next token is
  treated as a value.
- Add corpus case `IF x < NULL THEN` to lock the behavior.

Addresses CodeRabbit review on PR #18 (inline thread on postgres
scanner.c:35 and outside-diff comment on plpgsql scanner.c:94-109).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@gmr gmr merged commit d355237 into main May 10, 2026
2 checks passed
@gmr gmr deleted the fix/dollar-quote-and-null-scanner branch May 10, 2026 23:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant