Detect duplicate recipients accross to,cc,bcc #1543

Open
Rashmi1613 wants to merge 7 commits into jenkinsci:main from Rashmi1613:detect-duplicate-recipients

@Rashmi1613

Duplicate recipients within each list (to, cc, bcc) are already handled using sets, but the same email address can still appear across different recipient types.
Changes in this PR:

  1. Detects duplicate recipients across TO, CC, and BCC.
  2. Logs each duplicate along with the lists it appears in (TO, CC, BCC).

This improves observability and helps identify misconfigurations.

@Rashmi1613 Rashmi1613 requested a review from a team as a code owner March 28, 2026 16:07
@Vinamra-tech
Contributor

Hi @Rashmi1613, I was wondering if there's a better approach: first create a set of seen emails, then apply removeIf() to each of the to, cc, and bcc lists, like this:

Set<String> seenEmails = new HashSet<>();

to.removeIf(addr -> !seenEmails.add(addr.getAddress().toLowerCase()));

cc.removeIf(addr -> !seenEmails.add(addr.getAddress().toLowerCase()));

bcc.removeIf(addr -> !seenEmails.add(addr.getAddress().toLowerCase()));

I'd be happy to hear if you think my approach misses something.
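
For readers following along, here is a minimal, self-contained sketch of the shared seen-set idea. It uses plain strings instead of the plugin's address objects, and the class name, method name, and sample addresses are hypothetical, chosen only for illustration. It relies on `Set.add` returning `false` for an element that is already present, which is what gives the earlier lists priority.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not the plugin's code: addresses are plain strings here.
class SharedSeenSetSketch {

    /** Drops any address already seen in an earlier list; mutates the lists in place. */
    static void dedupeInPriorityOrder(List<String> to, List<String> cc, List<String> bcc) {
        Set<String> seen = new HashSet<>();
        // Set.add returns false when the key is already present, so removeIf
        // removes every occurrence after the first one across all three lists.
        to.removeIf(addr -> !seen.add(addr.toLowerCase(Locale.ROOT)));
        cc.removeIf(addr -> !seen.add(addr.toLowerCase(Locale.ROOT)));
        bcc.removeIf(addr -> !seen.add(addr.toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        List<String> to = new ArrayList<>(List.of("dev@example.com"));
        List<String> cc = new ArrayList<>(List.of("Dev@example.com", "qa@example.com"));
        List<String> bcc = new ArrayList<>(List.of("qa@example.com", "audit@example.com"));

        dedupeInPriorityOrder(to, cc, bcc);

        // prints: [dev@example.com] [qa@example.com] [audit@example.com]
        System.out.println(to + " " + cc + " " + bcc);
    }
}
```

Note that `qa@example.com` was listed in both CC and BCC and survives only in CC, which is the implicit TO > CC > BCC priority of this approach.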

@Rashmi1613
Author

Thanks for the suggestion, that approach is definitely clean and enforces a clear TO > CC > BCC priority. One thing I was considering is whether duplicates across TO/CC/BCC might sometimes be an intentional configuration; removing them could break user intent and change behaviour. That’s why I initially went with detection + logging, to preserve user intent. Happy to switch if we want to enforce removal + logging. Let me know which direction you’d prefer.

@ArpanC6

ArpanC6 commented Mar 29, 2026

Great work on the detection and logging approach. Rashmi’s point about preserving user intent makes a lot of sense: silently removing duplicates could mess with configurations where the same address needs to receive both a CC and a BCC copy.

A few technical things to keep in mind:

1) Missing Unit Tests: It looks like the new duplicate detection logic in ExtendedEmailPublisher.java doesn’t have any test coverage. It’d be good to add a test to ensure that when an address appears in both the TO and CC fields the log shows the expected warning, and the location is correctly set.

2) CI Failure on Windows-21: The build on windows-21 is failing due to an error in the 'bat' step. This needs a bit of investigation to figure out what’s going wrong and it should be fixed before the PR can be merged.

3) Non-Deterministic Log Output: Since emailLocations uses a HashMap and the location set is a HashSet, the output order of entry.getValue() isn't guaranteed. This could result in inconsistent logging like [CC, TO] or [TO, CC]. Switching to a LinkedHashSet or sorting the set before logging will give you a predictable order every time.

4) Typo: In a few places (the PR title, description, and commit message) “accross” should be spelled “across.”
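
To make point 3 concrete, here is a rough sketch of detection with a `HashMap` of `HashSet`s, where the locations are copied into a list and sorted just before the message is built. Only the `emailLocations` name comes from the discussion above; the class, method names, message wording, and sample data are assumptions and not the PR's actual code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of cross-list duplicate detection; not the PR's code.
class DuplicateLocationSketch {

    /** Maps each lower-cased address to the recipient types it appears in, keeping only duplicates. */
    static Map<String, Set<String>> findCrossListDuplicates(Map<String, List<String>> recipientsByType) {
        Map<String, Set<String>> emailLocations = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : recipientsByType.entrySet()) {
            for (String addr : entry.getValue()) {
                emailLocations
                        .computeIfAbsent(addr.toLowerCase(Locale.ROOT), k -> new HashSet<>())
                        .add(entry.getKey());
            }
        }
        // Keep only addresses that show up in two or more recipient types.
        emailLocations.values().removeIf(locations -> locations.size() < 2);
        return emailLocations;
    }

    /** Builds the log message with the locations sorted, so the text is stable across runs. */
    static String describe(String email, Set<String> locations) {
        List<String> sorted = new ArrayList<>(locations);
        Collections.sort(sorted); // natural (alphabetical) order, e.g. [CC, TO]
        return "Duplicate recipient " + email + " found in " + sorted;
    }

    public static void main(String[] args) {
        Map<String, List<String>> recipients = Map.of(
                "TO", List.of("a@example.com"),
                "CC", List.of("A@example.com", "b@example.com"),
                "BCC", List.of("c@example.com"));
        findCrossListDuplicates(recipients)
                .forEach((email, locations) -> System.out.println(describe(email, locations)));
    }
}
```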

@Vinamra-tech
Contributor

Hey @Rashmi1613 and @ArpanC6, thanks for the responses!

@Rashmi1613 that's a fair point about user intent, but I'd push back a little. In practice, having the same address in both CC and BCC is almost always a misconfiguration rather than intentional — most SMTP servers deduplicate at delivery anyway, so the "intentional duplicate" scenario is largely theoretical. Can you point to a concrete real-world case where the same address genuinely needs to be in both fields simultaneously?

That said, I do see one valid concern with my removeIf() approach — the implicit TO > CC > BCC priority. If an address exists in both CC and BCC, my approach retains it in CC and drops it from BCC, which could expose a BCC recipient to others — a potential privacy issue. That part I'll concede.

So here's what I'd propose as a middle ground:

  • Keep Rashmi's detection + logging for the CC/BCC overlap case specifically, to avoid the privacy risk.
  • Use removeIf() for TO/CC and TO/BCC overlaps, where there's no privacy concern and silent removal is safe and clean.
  • This way we get the best of both — clean deduplication where it's safe, and cautious logging where it isn't.
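
A rough sketch of how this middle ground could look, assuming plain string addresses and a returned list of warnings standing in for the build log. All names here are hypothetical. Within-list duplicates are assumed to be handled already, as the PR description states.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of the proposed middle ground; not the PR's code.
class MiddleGroundSketch {

    private static String key(String addr) {
        return addr.toLowerCase(Locale.ROOT);
    }

    /**
     * Removes from CC and BCC any address that is already in TO, and only
     * reports (never removes) addresses that appear in both CC and BCC.
     */
    static List<String> applyMiddleGround(Collection<String> to,
                                          Collection<String> cc,
                                          Collection<String> bcc) {
        Set<String> toKeys = to.stream().map(MiddleGroundSketch::key).collect(Collectors.toSet());

        // TO/CC and TO/BCC overlaps: the recipient is already visible in TO.
        cc.removeIf(addr -> toKeys.contains(key(addr)));
        bcc.removeIf(addr -> toKeys.contains(key(addr)));

        // CC/BCC overlaps: ambiguous intent, so warn and leave both lists alone.
        Set<String> ccKeys = cc.stream().map(MiddleGroundSketch::key).collect(Collectors.toSet());
        List<String> warnings = new ArrayList<>();
        for (String addr : bcc) {
            if (ccKeys.contains(key(addr))) {
                warnings.add(key(addr) + " appears in both CC and BCC; left unchanged");
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        List<String> to = new ArrayList<>(List.of("lead@example.com"));
        List<String> cc = new ArrayList<>(List.of("lead@example.com", "author@example.com"));
        List<String> bcc = new ArrayList<>(List.of("LEAD@example.com", "author@example.com", "hidden@example.com"));

        List<String> warnings = applyMiddleGround(to, cc, bcc);

        System.out.println("cc=" + cc + " bcc=" + bcc);
        warnings.forEach(System.out::println);
    }
}
```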

@ArpanC6 on your points — unit tests and the typo fix are fair, agreed on those. On the Windows-21 failure, worth checking the actual bat step error log first — it could easily be an infra flake rather than anything in this PR's code, so I wouldn't treat it as a hard blocker just yet.

On the LinkedHashMap/LinkedHashSet suggestion though — I'd actually push back here. Swapping the data structures just to guarantee log order adds accidental complexity without any functional benefit. Log output order is a debugging convenience, not a correctness requirement. A simpler alternative would be to just sort the location list inline before logging — one line, no structural changes, achieves the same result without introducing unfamiliar collection types that future contributors need to reason about.

Happy to discuss further!

@ArpanC6

ArpanC6 commented Mar 29, 2026

That's a fair pushback: sorting inline before logging is indeed simpler and avoids introducing unfamiliar collection types. The middle-ground approach makes sense. Happy to see it implemented that way.

@Rashmi1613
Author

I agree with the middle-ground approach. Removing duplicates for TO/CC and TO/BCC makes sense since the recipient is already visible, so there’s no change in behavior. For CC/BCC overlaps, I’ll keep it as detection + logging only, since it’s ambiguous and we shouldn’t risk changing visibility.

A real-world case I had in mind is when recipients come from multiple sources. For example, Jenkins might automatically add commit authors to CC, while a user manually adds someone to BCC to keep them hidden. In that situation, the same email can appear in both CC and BCC with different intent, and removing it from BCC could unintentionally expose that recipient and potentially raise privacy concerns.
Also, TO feels like a strong signal that the recipient is meant to be visible, so removing duplicates from CC/BCC when the same email is already in TO is safe. But between CC and BCC, it’s not clear which one was intended, so it's better not to modify either.

For the ordering part, I’ll keep the change minimal and use inline sorting with a fixed order (TO → CC → BCC) just for logging, instead of changing the data structure. A LinkedHashSet preserves insertion order with O(1) insertion, whereas sorting costs O(n log n), but since the set holds at most three locations, the cost is effectively constant either way. So sorting is preferable to a LinkedHashSet, which would introduce unnecessary complexity.
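
As an illustration of the fixed-order inline sort, here is one possible sketch. It assumes the locations are stored as the strings "TO", "CC", and "BCC"; the class, constant, and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of sorting locations in a fixed TO -> CC -> BCC order for logging.
class FixedOrderLogSketch {

    private static final List<String> DISPLAY_ORDER = List.of("TO", "CC", "BCC");

    /** Returns the locations as a new list ordered TO, CC, BCC; the input set is not modified. */
    static List<String> inDisplayOrder(Set<String> locations) {
        List<String> sorted = new ArrayList<>(locations);
        // Sort by position in DISPLAY_ORDER. A label that is not listed gets
        // index -1 and therefore sorts before the known ones.
        sorted.sort(Comparator.comparingInt((String loc) -> DISPLAY_ORDER.indexOf(loc)));
        return sorted;
    }

    public static void main(String[] args) {
        Set<String> locations = new HashSet<>(List.of("BCC", "TO", "CC"));
        // prints: [TO, CC, BCC]
        System.out.println(inDisplayOrder(locations));
    }
}
```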

@ArpanC6

ArpanC6 commented Mar 29, 2026

The new unit test looks good. However, the latest commit seems to have accidentally deleted a large portion of ExtendedEmailPublisherTest.java: linux-25 is now showing only 301 tests instead of the expected 366. Could you check whether the test file was accidentally truncated during the commit?

@Vinamra-tech
Contributor

Hey @Rashmi1613, that's actually a great real-world example — the Jenkins auto-CC + manual BCC case makes the ambiguity very concrete and I think that fully justifies keeping detection + logging for the CC/BCC overlap. Good call!!!

The final approach looks solid:

  • removeIf() for TO/CC and TO/BCC — safe, clean deduplication.
  • Detection + logging only for CC/BCC — preserves intent, avoids privacy risk.
  • Inline sort for deterministic log output — minimal and clean.

LGTM!!, Nice work!
