fix: dollar-quote over-capture and plpgsql NULL truncation#18
Conversation
Tree-sitter compiles token() regexes into a DFA that ignores non-greedy *?, so the previous `$[a-zA-Z_0-9]*$[\s\S]*?$[a-zA-Z_0-9]*$` token was effectively greedy and consumed across multiple dollar-quoted strings in a single file — a two-function file collapsed into one CreateFunctionStmt whose dollar_quoted_string spanned both bodies. The regex also did not enforce matched open/close tags. Add a postgres external scanner that mirrors PostgreSQL's own lexer: record the opening tag, scan forward, and only succeed on a matching $tag$. Codegen now emits an externals block and drops the lexer-level token rule. All binding manifests (binding.gyp, build.rs, binding.go, setup.py, Package.swift) compile postgres/src/scanner.c. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The plpgsql external scanner treated NULL as a structural delimiter at every depth-0 occurrence, so \`IF x IS NULL THEN\` truncated the SQL expression to \`x IS \` and produced an ERROR node around NULL. NULL is a delimiter only when it stands alone as the NULL statement; inside an expression — after IS, IS NOT, =, <>, !=, IN, AND, OR, LIKE, BETWEEN, NOT, AS, or a binary operator — it is the SQL literal and must be consumed. Track an \`expecting_value\` flag that operators, opening parens, comma, and value-expecting keywords set true. Default false so a leading bare \`NULL\` still falls through to kw_null and \`stmt_null\` matches. INTO is intentionally left for a separate pass per the report in #10. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughThis PR implements an external scanner for PostgreSQL dollar-quoted strings to enforce exact tag matching, refactors the grammar to delegate this tokenization, improves PL/pgSQL null-value parsing through an ChangesDollar-Quoted String Tokenization & Null Handling
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@CodeRabbit review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
plpgsql/src/scanner.c (1)
94-109:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winSingle
<doesn't updateexpecting_value, sox < NULLtruncates the expression.The catch-all at lines 372–376 lists
<as an operator that should setexpecting_value = true, but that branch is unreachable: the<<block-label check at lines 94–109 always handles<first andcontinues without updating the flag. Trace forIF x < NULL THEN:
x→expecting_value = false<→ lines 94–109 consume the char andcontinue; flag unchangedNULL→expecting_valueis stillfalse, so the consume-as-value branch (lines 261–265) is skipped andnullmatches the delimiter list at line 303, stopping the scanner beforeNULL.
<>,<=,!=,>all happen to work because the second char (>/=) goes through the catch-all and sets the flag, but a bare<followed byNULLis broken. Worth fixing while you're touching this logic, since the rest of the heuristic is consistent.🛠️ Proposed fix
if (lexer->lookahead == '<') { /* << found — stop before it */ if (has_content) { lexer->result_symbol = SQL_BODY; return true; } return false; } /* Single < — part of SQL operator, continue */ - has_content = true; + if (depth == 0) expecting_value = true; continue; }Consider also adding a corpus case like
IF x < NULL THEN ... END IF;to lock the behavior in.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plpgsql/src/scanner.c` around lines 94 - 109, The single-'<'' branch in the scanner loop (the depth == 0 && lexer->lookahead == '<' block that calls lexer->mark_end and lexer->advance) consumes a lone '<' but never updates the expecting_value flag, causing constructs like "IF x < NULL" to truncate; modify that branch so that when you detect a single '<' (i.e., not '<<') you set expecting_value = true (preserving the existing early-return behavior for '<<' and the has_content/SQL_BODY logic), and add a corpus test such as "IF x < NULL THEN ... END IF;" to lock the behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@postgres/src/scanner.c`:
- Around line 33-35: Split the single is_tag_char check into two functions and
use the start-specific check for the first tag byte: add is_tag_start(int c)
which returns true for ASCII letters and underscore and for any non-ASCII byte
((unsigned char)c >= 0x80) but explicitly rejects ASCII digits, and keep/rename
the existing is_tag_char (or create is_tag_continue) to allow ASCII letters,
digits, underscore and non-ASCII bytes for subsequent bytes; then change the
tag-parsing code that currently calls is_tag_char for the first byte to call
is_tag_start instead (referencing is_tag_char/is_tag_start and the scanner
tag-parsing code).
---
Outside diff comments:
In `@plpgsql/src/scanner.c`:
- Around line 94-109: The single-'<'' branch in the scanner loop (the depth == 0
&& lexer->lookahead == '<' block that calls lexer->mark_end and lexer->advance)
consumes a lone '<' but never updates the expecting_value flag, causing
constructs like "IF x < NULL" to truncate; modify that branch so that when you
detect a single '<' (i.e., not '<<') you set expecting_value = true (preserving
the existing early-return behavior for '<<' and the has_content/SQL_BODY logic),
and add a corpus test such as "IF x < NULL THEN ... END IF;" to lock the
behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e175e4ee-ed72-4e2c-932f-930368fecab6
📒 Files selected for processing (13)
Package.swiftbinding.gypbindings/go/binding.gobindings/rust/build.rsplpgsql/src/scanner.cplpgsql/test/corpus/control_flow.txtpostgres/grammar.jspostgres/src/grammar.jsonpostgres/src/parser.cpostgres/src/scanner.cpostgres/test/corpus/expressions.txtscript/codegen.jssetup.py
- postgres scanner: split is_tag_char into is_tag_start_char (letter, underscore, or non-ASCII >=0x80) and is_tag_char (start chars + digits). Previously the first byte accepted digits ($1$...$1$ — invalid per PG identifier rules) and rejected non-ASCII letters ($é$...$é$ — valid). - plpgsql scanner: the single-`<` branch consumed the char and continued without updating expecting_value, so `IF x < NULL THEN` truncated at NULL because the catch-all that flips the flag for `<` was unreachable. Set expecting_value = true on the single-`<` path so the next token is treated as a value. - Add corpus case `IF x < NULL THEN` to lock the behavior. Addresses CodeRabbit review on PR #18 (inline thread on postgres scanner.c:35 and outside-diff comment on plpgsql scanner.c:94-109). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Addresses the two parser bugs reported in #10 by @AlexS778.
1. Dollar-quoted string over-capture (postgres)
dollar_quoted_stringwas atoken(/\$[a-zA-Z_0-9]*\$[\s\S]*?\$[a-zA-Z_0-9]*\$/). Tree-sitter compilestoken()regexes into a lexer DFA that does not honor non-greedy*?, so the match was effectively greedy and consumed across multiple$$...$$strings in a single file — twoCREATE FUNCTION ... $$ ... $$;statements collapsed into oneCreateFunctionStmtwhosedollar_quoted_stringspanned both bodies. The regex also did not enforce matched open/close tags ($a$body$b$would have matched).Fix: add a postgres external scanner (
postgres/src/scanner.c) that mirrors PostgreSQL's lexer — record the opening tag, scan forward, only succeed on a matching$tag$. Codegen emits anexternalsblock and drops the lexer-level token rule, so the fix survivestree-sitter generateregeneration when PostgreSQL releases a new version. All five binding manifests (binding.gyp, build.rs, binding.go, setup.py, Package.swift) now compile the scanner.2. plpgsql scanner truncates SQL at NULL
The plpgsql
_sql_bodyexternal scanner treatednullas a structural delimiter at every depth-0 occurrence, soIF _kind IS NULL OR length(_kind) = 0 THENparsed assql_expression = "_kind IS "with an ERROR node coveringNULL OR length(_kind) = 0.Fix: track an
expecting_valueflag in the scanner. Operators (+ - * / % < > = ~ ! @ # ^ & | ? ,), opening parens, and value-expecting keywords (IS,NOT,AND,OR,IN,LIKE,ILIKE,BETWEEN,SIMILAR,AS) set it true. The flag defaults to false so a bare leadingNULLstill falls through tokw_nullandstmt_nullmatches — preserving every existing corpus test.INTOis intentionally left untouched. As @AlexS778 noted,SELECT col INTO var FROM trequires knowing whether a select-list has already been consumed, which is messier than the NULL case and deserves its own pass.Test plan
tree-sitter test— postgres corpus: 38/38 (3 new dollar-quote tests)tree-sitter test— plpgsql corpus: 56/56 (3 new IS NULL / IS NOT NULL / NULL comparison tests)cargo test— Rust bindings load both grammars$$...$$functions in one file produce twoCreateFunctionStmts;IF x IS NULL THEN ... END IFparses with fullsql_expressionSummary by CodeRabbit
New Features
Bug Fixes
Tests