feat: reverse Expr for matching "before" context #2934

Open
hippietrail wants to merge 12 commits into Automattic:master from hippietrail:reverse-expr

Conversation

@hippietrail
Collaborator

Issues

N/A

Description

It would make things easier for more complex/advanced linters that need to check the context inside match_to_lint_with_context if there were a reverse equivalent of Expr, so we can check the previous token, then the one before that, etc.

This PR provides that in what I think is a minimal way by introducing _rev versions of a few trait methods such as run_rev() for Expr and step_rev() for Step.
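To make the shape of this concrete, here is a minimal, self-contained sketch of a forward method paired with a _rev counterpart. It is not the PR's code: the Step trait below is a simplified stand-in that works on plain string tokens, and ExactWord and AnyToken are hypothetical matchers invented for illustration. The default step_rev returns None, which is one possible way to signal that reverse matching is unsupported.

```rust
/// Simplified stand-in for the matcher trait (tokens are plain strings here).
pub trait Step {
    /// Match forward starting at `cursor`; returns the number of tokens consumed.
    fn step(&self, tokens: &[&str], cursor: usize) -> Option<usize>;

    /// Match backward over the tokens that end just before `cursor`.
    /// Default: reverse matching is not supported, so nothing matches.
    fn step_rev(&self, _tokens: &[&str], _cursor: usize) -> Option<usize> {
        None
    }
}

/// Hypothetical matcher for one specific word, ignoring ASCII case.
pub struct ExactWord(pub &'static str);

impl Step for ExactWord {
    fn step(&self, tokens: &[&str], cursor: usize) -> Option<usize> {
        let tok = tokens.get(cursor)?;
        tok.eq_ignore_ascii_case(self.0).then_some(1)
    }

    fn step_rev(&self, tokens: &[&str], cursor: usize) -> Option<usize> {
        // The token "before" the cursor sits at index `cursor - 1`.
        let prev = cursor.checked_sub(1)?;
        let tok = tokens.get(prev)?;
        tok.eq_ignore_ascii_case(self.0).then_some(1)
    }
}

/// Hypothetical forward-only matcher: it keeps the default `step_rev`.
pub struct AnyToken;

impl Step for AnyToken {
    fn step(&self, tokens: &[&str], cursor: usize) -> Option<usize> {
        (cursor < tokens.len()).then_some(1)
    }
}
```

With tokens ["with", " ", "the"], ExactWord("the").step_rev at cursor 3 matches one token, while AnyToken.step_rev falls back to the default and matches nothing.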

I also noticed that the default implementation of run for Expr had a conditional check that step returned a positive value but also had code for what to do with a negative value, but as far as I can tell a negative value there should be impossible. I verified this by running all the tests with asserts there and they never got triggered.

I removed the code for the false paths in here but I'm not sure that's the right thing to do, or that the old code was the right thing to do either. It should be a sanity check but I believe we're not supposed to output to stderr or crash. What's the right thing to do here?
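One option that neither panics nor writes to stderr is to treat an impossible result as "no match". The sketch below is an assumption about how that could look, not the existing run implementation: span_from_step is a hypothetical helper that turns a cursor and a signed step into a token range and returns None whenever the arithmetic would leave the token slice.

```rust
use std::ops::Range;

/// Turn a cursor plus a signed step into a token range.
/// Returns `None` instead of panicking when the result would be out of bounds.
pub fn span_from_step(cursor: usize, step: isize, len: usize) -> Option<Range<usize>> {
    // `checked_add_signed` yields `None` on underflow or overflow.
    let end = cursor.checked_add_signed(step)?;
    if cursor > len || end > len {
        return None;
    }
    // A negative step means the matched tokens lie before the cursor.
    Some(if step >= 0 { cursor..end } else { end..cursor })
}
```

A positive step of 3 from cursor 2 gives 2..5, a step of -2 from cursor 5 gives 3..5, and a step of -2 from cursor 1 gives None.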

Somewhat similarly, in all the Patterns and Exprs that do not yet support a _rev version, I currently output a message with 🛑 to stderr and return None. Ideally, I was thinking each one might need a way to report whether or not it supports reverse usage, but maybe just returning None is correct. It's only going to be hit when a contributor is working on a new reverse Expr.

The only Expr that supports reverse mode so far is SequenceExpr, which I have a test for. There was a choice about whether the code would express reverse sequences "forward" or "backward". They'll be evaluated backward, but they might be easier to reason about written forward like other SequenceExprs, so that's what I went with. But maybe I chose wrong, or maybe both are needed.
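As a rough illustration of "written forward, evaluated backward", here is a self-contained sketch. It does not use the real SequenceExpr: Elem, run and run_rev are hypothetical names, and tokens are plain strings. The element list is kept in reading order, and the reverse runner simply walks it from the last element to the first while moving the position leftward.

```rust
/// One element of a hypothetical sequence: a fixed word or one whitespace token.
pub enum Elem {
    Word(&'static str),
    Ws,
}

impl Elem {
    fn matches(&self, tok: &str) -> bool {
        match self {
            Elem::Word(w) => tok.eq_ignore_ascii_case(w),
            Elem::Ws => !tok.is_empty() && tok.chars().all(char::is_whitespace),
        }
    }
}

/// Forward: match `elems` against the tokens starting at `cursor`.
pub fn run(elems: &[Elem], tokens: &[&str], cursor: usize) -> Option<usize> {
    let mut pos = cursor;
    for elem in elems {
        if !elem.matches(tokens.get(pos)?) {
            return None;
        }
        pos += 1;
    }
    Some(pos - cursor)
}

/// Reverse: `elems` is still written in reading order, but it is checked
/// last-to-first against the tokens that end just before `cursor`.
pub fn run_rev(elems: &[Elem], tokens: &[&str], cursor: usize) -> Option<usize> {
    let mut pos = cursor;
    for elem in elems.iter().rev() {
        pos = pos.checked_sub(1)?;
        if !elem.matches(tokens.get(pos)?) {
            return None;
        }
    }
    Some(cursor - pos)
}
```

For the tokens of "with the except of", the sequence [Word("with"), Ws, Word("the"), Ws] matches forward from index 0 and backward from index 4, covering the same four tokens either way.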

Please critique and give feedback!

How Has This Been Tested?

I introduced tests for both the forward and backward versions.

Checklist

  • I have performed a self-review of my own code
  • I have added tests to cover my changes

@hippietrail hippietrail changed the title from 'feat: rever Expr for matching "before" context' to 'feat: reverse Expr for matching "before" context' on Mar 17, 2026
@hippietrail
Collaborator Author

@86xsk I don't seem to be able to request your review of this but it would be great to get some opinions from you whenever you might have time.

Contributor

@86xsk 86xsk left a comment


> It would make things easier for more complex/advanced linters that need to check the context inside match_to_lint_with_context if there was a reverse equivalent of Expr so we can check the previous token, then the one before that, etc.
>
> This PR provides that in what I think is a minimal way by introducing _rev versions of a few trait methods such as run_rev() for Expr and step_rev() for Step.

I'm not too experienced with writing linters so it's a little difficult to wrap my head around this. Would you please give an example of where this might be useful in practice?

> I also noticed that the default implementation of run for Expr had a conditional check that step returned a positive value but also had code for what to do with a negative value, but as far as I can tell a negative value there should be impossible. I verified this by running all the tests with asserts there and they never got triggered.
>
> I removed the code for the false paths in here but I'm not sure that's the right thing to do, or that the old code was the right thing to do either. It should be a sanity check but I believe we're not supposed to output to stderr or crash. What's the right thing to do here?

I'm not super familiar with this part of the code myself, but I think it is technically possible for a Step to return a negative value. Albeit it doesn't look like that's currently used. I believe the functionality was intentionally added in #1393.

> Somewhat similarly, in all the Patterns and Exprs that do not yet support a _rev version I currently do output a message with 🛑 to stderr and return None. Ideally I was thinking a way for each one to share whether or not they support reverse usage might be needed, but maybe just returning None is correct. It's only going to be hit when a contributor is working on a new reverse Expr.

I think it might be better to use something like todo!() instead. That way it would panic, making it harder to accidentally use such an implementation without realizing it.

Another alternative might be to implement a separate trait for it instead, something like ExprRev perhaps, then implement it for Expr that are reversible. I think this is likely a more idiomatic way to go about it, but I guess it would cause some downstream complexity by requiring types like SequenceExpr to be generic over whether all the contained expressions support ExprRev or not. Because of this, I'm wondering if there's a simpler alternative that fulfills the same need, but it's hard to say since I'm not too experienced with actually writing linters myself.
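A minimal sketch of the separate-trait idea, under stated assumptions: the Expr and ExprRev traits below are simplified stand-ins over plain string tokens, and Word, Rest and RevSequence are hypothetical types made up for this example. Because RevSequence only accepts elements that implement ExprRev, a forward-only expression is rejected at compile time rather than panicking or silently returning None at run time.

```rust
/// Simplified forward-matching trait (not the crate's real signature).
pub trait Expr {
    fn run(&self, tokens: &[&str], cursor: usize) -> Option<usize>;
}

/// Opt-in trait for expressions that can also match backward.
pub trait ExprRev: Expr {
    fn run_rev(&self, tokens: &[&str], cursor: usize) -> Option<usize>;
}

/// Hypothetical reversible expression: one word, ignoring ASCII case.
pub struct Word(pub &'static str);

impl Expr for Word {
    fn run(&self, tokens: &[&str], cursor: usize) -> Option<usize> {
        tokens.get(cursor)?.eq_ignore_ascii_case(self.0).then_some(1)
    }
}

impl ExprRev for Word {
    fn run_rev(&self, tokens: &[&str], cursor: usize) -> Option<usize> {
        let prev = cursor.checked_sub(1)?;
        tokens.get(prev)?.eq_ignore_ascii_case(self.0).then_some(1)
    }
}

/// Hypothetical forward-only expression: consumes every remaining token.
pub struct Rest;

impl Expr for Rest {
    fn run(&self, tokens: &[&str], cursor: usize) -> Option<usize> {
        tokens.len().checked_sub(cursor)
    }
}

/// A sequence that can only be built from reversible expressions.
#[derive(Default)]
pub struct RevSequence {
    elems: Vec<Box<dyn ExprRev>>,
}

impl RevSequence {
    pub fn then(mut self, expr: impl ExprRev + 'static) -> Self {
        self.elems.push(Box::new(expr));
        self
    }

    /// Elements are stored in reading order and evaluated last-to-first.
    pub fn run_rev(&self, tokens: &[&str], cursor: usize) -> Option<usize> {
        let mut pos = cursor;
        for expr in self.elems.iter().rev() {
            pos = pos.checked_sub(expr.run_rev(tokens, pos)?)?;
        }
        Some(cursor - pos)
    }
}
```

Here RevSequence::default().then(Word("with")).then(Word("the")) builds fine, whereas .then(Rest) would fail to compile because Rest does not implement ExprRev. The cost mentioned above shows up as soon as one container type has to hold both reversible and forward-only children.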

@hippietrail
Collaborator Author

> It would make things easier for more complex/advanced linters that need to check the context inside match_to_lint_with_context if there was a reverse equivalent of Expr so we can check the previous token, then the one before that, etc.
> This PR provides that in what I think is a minimal way by introducing _rev versions of a few trait methods such as run_rev() for Expr and step_rev() for Step.
>
> I'm not too experienced with writing linters so it's a little difficult to wrap my head around this. Would you please give an example of where this might be useful in practice?

Parts-of-speech of words are often ambiguous and you have to look at surrounding words to figure out which one applies in context. Normally we try to make the SequenceExpr in the linter take this into account. But it's not as powerful as a full regex engine and can't do things like positive and negative lookbehind. Even in regex, designing regexes that work backwards before the main match is really hard.

I wrote a Rust linter just in the past week that does some manual walking backward through the "before" context in match_to_lint_with_context(), but I forget which one, and I'm not sure if it got merged or is still a pending PR. It's hard to just think up concrete examples on the spot!

> I also noticed that the default implementation of run for Expr had a conditional check that step returned a positive value but also had code for what to do with a negative value, but as far as I can tell a negative value there should be impossible. I verified this by running all the tests with asserts there and they never got triggered.
> I removed the code for the false paths in here but I'm not sure that's the right thing to do, or that the old code was the right thing to do either. It should be a sanity check but I believe we're not supposed to output to stderr or crash. What's the right thing to do here?
>
> I'm not super familiar with this part of the code myself, but I think it is technically possible for a Step to return a negative value. Albeit it doesn't look like that's currently used. I believe the functionality was intentionally added in #1393.

I think that's the PR where Expr was introduced. Before that we only had Pattern, which turned out to be insufficient to identify lots of kinds of grammar linting patterns. I don't know the code inside-out though.

> Somewhat similarly, in all the Patterns and Exprs that do not yet support a _rev version I currently do output a message with 🛑 to stderr and return None. Ideally I was thinking a way for each one to share whether or not they support reverse usage might be needed, but maybe just returning None is correct. It's only going to be hit when a contributor is working on a new reverse Expr.
>
> I think it might be better to use something like todo!() instead. That way it would panic, making it harder to accidentally use such an implementation without realizing it.

I got in trouble for using unreachable! a few months ago in a code path that assumed hyphens couldn't occur inside Harper's concept of a word. But it turned out there's some logic that processes identifiers somehow in some languages under some conditions where hyphens can end up in words. This caused crashes inside the LSP, which is apparently a very bad thing.

> Another alternative might be to implement a separate trait for it instead, something like ExprRev perhaps, then implement it for Expr that are reversible. I think this is likely a more idiomatic way to go about it, but I guess it would cause some downstream complexity by requiring types like SequenceExpr to be generic over whether all the contained expressions support ExprRev or not. Because of this, I'm wondering if there's a simpler alternative that fulfills the same need, but it's hard to say since I'm not too experienced with actually writing linters myself.

That's exactly the way I tried to implement it first. As I built it out and its tendrils spread, they hit something they tangled with in ugly ways that I no longer remember, but doing so helped me understand the code better to trim it way back and find which parts were essential.

One problem was the interplay between backward and forward elements, which there shouldn't be a need for. The case I'm still not sure about, but which I think this implementation might be able to handle, is when a SequenceExpr contains some other Expr, such as FirstMatchOf, that in turn contains a SequenceExpr: the inner one needs to retain the original direction and not switch to the default. I'm not sure, but I think SequenceExpr is the only Expr that has a sequence and needs a direction. Any others that do actually have a SequenceExpr inside. I could be wrong.

I appreciate the feedback!

@86xsk
Contributor

86xsk commented Mar 18, 2026

> I'm not too experienced with writing linters so it's a little difficult to wrap my head around this. Would you please give an example of where this might be useful in practice?
>
> Parts-of-speech of words are often ambiguous and you have to look at surrounding words to figure out which one applies in context. Normally we try to make the SequenceExpr in the linter take this into account. But it's not as powerful as a full regex engine and can't do things like positive and negative lookbehind. Even in regex, designing regexes that work backwards before the main match is really hard.

About half a year ago, I tried to implement (but never finished) some form of lookahead/lookbehind by implementing functions like start_capture() and end_capture() on SequenceExpr. The idea was to be able to write something like:

SequenceExpr::default()
    .then(...) // Isn't captured in the match, but must precede for the match to succeed.
    .start_capture()
    .then(...) // What we actually want to have contained in the match.
    .end_capture()
    .then(...) // Isn't captured in the match, but must follow for the match to succeed.

(If you're familiar with vim regex, the idea is basically identical to \zs and \ze. For instance, the regex some\zsthing would match the 'thing' in 'something' but would not match 'thing' on its own.)

Do you think something like that would work well for the use cases you have in mind? If so, I could try taking another crack at it. If it ends up working, I feel it could help avoid some of the complexity with reverse matching.
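For readers who want to see the capture idea in a runnable form, here is a small sketch. It is only an assumption about how such markers could behave, not the unfinished implementation mentioned above: Item and run_captured are hypothetical, and the matcher works on whole-word string tokens rather than characters, so it is a token-level analogue of \zs and \ze.

```rust
use std::ops::Range;

/// A sequence item: either a word to match or a zero-width capture marker.
pub enum Item {
    Word(&'static str),
    StartCapture,
    EndCapture,
}

/// Match every `Word` in order starting at `cursor`, but report only the
/// token range between the capture markers.
pub fn run_captured(items: &[Item], tokens: &[&str], cursor: usize) -> Option<Range<usize>> {
    let mut pos = cursor;
    let mut start = cursor; // No StartCapture: capture from the beginning.
    let mut end = None; // No EndCapture: capture up to the end of the match.
    for item in items {
        match item {
            Item::Word(w) => {
                if !tokens.get(pos)?.eq_ignore_ascii_case(w) {
                    return None;
                }
                pos += 1;
            }
            Item::StartCapture => start = pos,
            Item::EndCapture => end = Some(pos),
        }
    }
    Some(start..end.unwrap_or(pos))
}
```

With the items [Word("some"), StartCapture, Word("thing"), EndCapture, Word("else")], the tokens ["some", "thing", "else"] yield the range 1..2, so only "thing" is reported. The tokens ["thing", "else"] do not match at all, because the required preceding word is missing.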

> I'm not sure but I think SequenceExpr is the only Expr that has a sequence and needs a direction. Any others that do actually have a SequenceExpr inside. I could be wrong.

The only other one that I can think of off the top of my head is Repeating.

@hippietrail
Collaborator Author

> Normally we try to make the SequenceExpr in the linter take this into account. But it's not as powerful as a full regex engine and can't do things like positive and negative lookbehind. Even in regex, designing regexes that work backwards before the main match is really hard.
>
> About half a year ago, I tried to implement (but never finished) some form of lookahead/lookbehind by implementing functions like start_capture() and end_capture() on SequenceExpr. The idea was to be able to write something like:
>
> SequenceExpr::default()
>     .then(...) // Isn't captured in the match, but must precede for the match to succeed.
>     .start_capture()
>     .then(...) // What we actually want to have contained in the match.
>     .end_capture()
>     .then(...) // Isn't captured in the match, but must follow for the match to succeed.
>
> (If you're familiar with vim regex, the idea is basically identical to \zs and \ze. For instance, the regex some\zsthing would match the 'thing' in 'something' but would not match 'thing' on its own.)
>
> Do you think something like that would work well for the use cases you have in mind? If so, I could try taking another crack at it. If it ends up working, I feel it could help avoid some of the complexity with reverse matching.

Could do. I also tried to add some kind of capture groups last year and tried to add anchors before we got the ones we have now, which I still can't figure out how to use. I gave up on both because there was too much about the overall architecture I didn't understand well enough, and probably too much Rust I hadn't yet learned either.

I do keep thinking though about taking another crack at making an alternative to Expr or perhaps an alternative to ExprLinter using lower-level stuff. I always think it's better when I have a concrete use case than an abstract idea though.

Another thing I think is missing is the ability to skip over things that are not matched. Say you identify a pattern as a thing you don't want and you know how many tokens long it is, only for part of it to get matched again a token or two later as a shorter sequence. This may already be doable. Unfortunately I also don't remember the concrete case, but I know it's come up at least twice.
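A rough sketch of what such skipping could look like, outside of any existing API: match_at and find_with_skips are hypothetical helpers over plain string tokens, with whitespace tokens left out for brevity. At each position the longer "not wanted" pattern is tried first, and when it matches the cursor jumps past all of it, so the shorter pattern never gets a chance to match inside it.

```rust
/// Returns how many tokens `pattern` covers at `cursor`, if it matches there.
fn match_at(pattern: &[&str], tokens: &[&str], cursor: usize) -> Option<usize> {
    let end = cursor.checked_add(pattern.len())?;
    let window = tokens.get(cursor..end)?;
    window
        .iter()
        .zip(pattern)
        .all(|(tok, pat)| tok.eq_ignore_ascii_case(pat))
        .then_some(pattern.len())
}

/// Collect the start indices where `target` matches, skipping over every
/// occurrence of the longer `skip` pattern in one jump.
pub fn find_with_skips(tokens: &[&str], target: &[&str], skip: &[&str]) -> Vec<usize> {
    let mut hits = Vec::new();
    let mut cursor = 0;
    while cursor < tokens.len() {
        if let Some(n) = match_at(skip, tokens, cursor) {
            // Known exception: step over all of its tokens at once.
            cursor += n.max(1);
        } else if let Some(n) = match_at(target, tokens, cursor) {
            hits.push(cursor);
            cursor += n.max(1);
        } else {
            cursor += 1;
        }
    }
    hits
}
```

Given the tokens ["with", "the", "except", "of", "x", "except", "of"], a target of ["except", "of"] and a skip pattern of ["the", "except", "of"], only the second occurrence (index 5) is reported.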

> I'm not sure but I think SequenceExpr is the only Expr that has a sequence and needs a direction. Any others that do actually have a SequenceExpr inside. I could be wrong.
>
> The only other one that I can think of off the top of my head is Repeating.

Interesting! That one shouldn't care about direction but maybe it does...

@hippietrail
Collaborator Author

@86xsk I just ran into a case where this would be useful. So hopefully I can make a clear case for it.

I just read "except of" in a GitHub repo and wondered if it's common. It is. So I tried to think of edge cases that would prevent a naive "except of" -> "except for" Weir rule from being sufficient.

I thought people might also use "except" instead of "exception", and indeed they do. Various such phrases can include "except of":

(Not to mention "except of course", but the after-context is much easier to check.)

To check manually and perfectly:

  • -1 is space, -2 is "the", -3 is space, -4 is "with"
  • -1 is space, -2 is "possible" or "notable", -3 is space, -4 is "the", -5 is space, -6 is "with"

Or a reverse Expr something like

SequenceExpr::aco("with").t_ws().t_aco("the").then_optional(
    SequenceExpr::ws().t_set(&["notable", "possible"])
)

Or if it had to be expressed backwards in full something like

RevSequenceExpr::optional(RevSequenceExpr::word_set(&["notable", "possible"]).back_ws())
.back_aco("the").back_ws().back_aco("with")

Still not a perfect example as it's two different mistakes clashing, rather than a mistake with edge cases that are not mistakes. But hopefully illustrative nonetheless.
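For comparison, the manual check listed above could be sketched roughly as follows. This is a self-contained illustration, not code from the PR: the "before" context is modelled as a slice of plain string tokens in reading order (so the last element is offset -1), and at, is_ws, is_word and is_exception_context are hypothetical helpers.

```rust
/// Token at negative offset `back` (1 means "-1"), if there is one.
fn at<'a>(before: &[&'a str], back: usize) -> Option<&'a str> {
    let idx = before.len().checked_sub(back)?;
    before.get(idx).copied()
}

fn is_ws(before: &[&str], back: usize) -> bool {
    at(before, back).is_some_and(|t| !t.is_empty() && t.chars().all(char::is_whitespace))
}

fn is_word(before: &[&str], back: usize, words: &[&str]) -> bool {
    at(before, back).is_some_and(|t| words.iter().any(|w| t.eq_ignore_ascii_case(w)))
}

/// True when the tokens before "except of" look like "with the " or
/// "with the possible/notable ", suggesting "exception of" was intended.
pub fn is_exception_context(before: &[&str]) -> bool {
    let short = is_ws(before, 1)
        && is_word(before, 2, &["the"])
        && is_ws(before, 3)
        && is_word(before, 4, &["with"]);
    let long = is_ws(before, 1)
        && is_word(before, 2, &["possible", "notable"])
        && is_ws(before, 3)
        && is_word(before, 4, &["the"])
        && is_ws(before, 5)
        && is_word(before, 6, &["with"]);
    short || long
}
```

Every offset has to be spelled out and bounds-checked by hand, which is the bookkeeping a reverse Expr would take over.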


2 participants