Fix broken link checker to await link validations before closing the browser#5965
tkuhemiya wants to merge 2 commits into wso2:master
Conversation
Walkthrough
The link checking logic in the script was refactored from a nested callback-based approach to a promise-based concurrent approach, and a new checkLink() helper was introduced.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@en/check-broken-links.js`:
- Around line 81-84: The code currently calls puppeteer.launch() then
browser.newPage() before entering the guarded try, which can leak a browser
process if newPage() throws; move the browser initialization (await
puppeteer.launch()) into the try block so both launch and newPage are protected,
and in the finally block check that browser is non-null/defined before calling
browser.close(); apply the same change to the second occurrence that initializes
a browser (the block around lines 125-126) so both browser launch usages are
created inside their respective try blocks and safely null-checked in finally.
- Around line 87-90: The anchor hrefs collected in linksWithInfo include
fragment hashes which cause duplicate visits (e.g., /page#intro vs /page#api);
normalize by stripping the URL.hash (or anything after '#') when mapping anchor
tags in page.evaluate so returned links are fragment-free, and also ensure you
strip fragments again before deduping against visitedUrls and before calling
checkLinksOnPage; update the same logic used around the checkLinksOnPage
invocation (lines referenced 117-118) to use the normalized (fragment-stripped)
URL.
- Around line 53-66: Update the markdownLinkCheck call and results handling to
use the documented API: replace the options object { retry: true } with the
supported retry options (e.g., retryOn429, retryCount, fallbackRetryDelay) when
invoking markdownLinkCheck, and in the results.forEach callback use the
documented result fields (status, statusCode, err, link) instead of the
undocumented dead field; specifically, mark a link broken if result.err is
truthy or result.statusCode (or result.status) indicates an HTTP error (>=400 or
equals 404), and continue calling logBrokenLink(link, sourceUrl) and console.log
for those cases. Ensure you reference the markdownLinkCheck invocation and the
results.forEach block when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8d77c0bf-03a0-40e6-bb0a-15b0f0b149d6
📒 Files selected for processing (1)
en/check-broken-links.js
```js
return new Promise((resolve, reject) => {
    markdownLinkCheck(link, { retry: true }, (err, results) => {
        if (err) {
            reject(err);
            return;
        }

        results.forEach(result => {
            const { dead, statusCode } = result;

            if (dead || statusCode === 404) {
                logBrokenLink(link, sourceUrl);
                console.log(`\n[Broken Link] Found: ${link}\n`);
            }
```
🧩 Analysis chain
🌐 Web query:
What fields does the markdown-link-check JavaScript API expose on each result object, and what are the documented retry-related option names?
💡 Result:
Result object fields (JS API)
Each element in the results array has these documented properties: link, status, statusCode, err. (github.com)
Retry-related option names (documented)
The documented retry options are: retryOn429, retryCount, fallbackRetryDelay. (github.com)
Use the documented markdown-link-check API fields and retry options.
The JS API result objects expose status, statusCode, err, and link. The supported retry options are retryOn429, retryCount, and fallbackRetryDelay. The current code uses the undocumented dead field and { retry: true } option, causing only explicit 404s to be reported and no retry logic to be applied.
♻️ Proposed fix

```diff
-            markdownLinkCheck(link, { retry: true }, (err, results) => {
+            markdownLinkCheck(link, { retryOn429: true }, (err, results) => {
@@
-                const { dead, statusCode } = result;
-
-                if (dead || statusCode === 404) {
+                if (result.status === 'dead') {
                     logBrokenLink(link, sourceUrl);
                     console.log(`\n[Broken Link] Found: ${link}\n`);
                 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```js
return new Promise((resolve, reject) => {
    markdownLinkCheck(link, { retryOn429: true }, (err, results) => {
        if (err) {
            reject(err);
            return;
        }
        results.forEach(result => {
            if (result.status === 'dead') {
                logBrokenLink(link, sourceUrl);
                console.log(`\n[Broken Link] Found: ${link}\n`);
            }
```
These behaviors were already present before this PR. The purpose of this change was only to ensure that link checks are properly awaited so validation completes before the browser closes.
If maintainers prefer, I can address the additional issues mentioned here in this PR as well.
@tkuhemiya, that's a fair clarification — the undocumented dead field and { retry: true } option were indeed pre-existing and are outside the scope of this PR's fix. The async/await correction is the right focus here.
If you'd prefer to keep this PR scoped, I'd suggest addressing the API correctness improvements in a separate follow-up. Would you like me to open a GitHub issue to track that?
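Should the follow-up be taken up, a checkLink() helper using the documented markdown-link-check result shape (link, status, statusCode, err) and retry options (retryOn429, retryCount) might look like the sketch below. Note that the `checker` and `logBrokenLink` parameters are injected purely for illustration; in the actual script they would be the markdownLinkCheck function and the script's own logger.

```javascript
// Sketch: promise wrapper over a markdown-link-check-style callback API.
// `checker(link, options, cb)` stands in for markdownLinkCheck so the
// wrapper can be exercised without network access.
function checkLink(link, sourceUrl, checker, logBrokenLink) {
  return new Promise((resolve, reject) => {
    checker(link, { retryOn429: true, retryCount: 2 }, (err, results) => {
      if (err) {
        reject(err);
        return;
      }
      for (const result of results) {
        // A link is broken when the documented status field is 'dead'.
        if (result.status === 'dead') {
          logBrokenLink(result.link, sourceUrl);
        }
      }
      resolve(results);
    });
  });
}
```

In the script this would be called as `await checkLink(url, pageUrl, markdownLinkCheck, logBrokenLink)`, keeping the retry behavior and broken-link detection aligned with the documented API.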
```js
const browser = await puppeteer.launch();
const page = await browser.newPage();

try {
```
Move browser creation inside the guarded block.
If browser.newPage() fails after puppeteer.launch() succeeds, control never reaches finally and the browser process is left running. Initialize browser inside try and null-check it in finally.
♻️ Proposed fix

```diff
-    const browser = await puppeteer.launch();
-    const page = await browser.newPage();
-
-    try {
+    let browser;
+    let page;
+
+    try {
+        browser = await puppeteer.launch();
+        page = await browser.newPage();
@@
-        await browser.close(); // Ensure the browser is closed
+        if (browser) {
+            await browser.close();
+        }
```

Also applies to: 125-126
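The leak-safe pattern the review describes can be sketched generically, with the launch function injected as a stand-in for puppeteer.launch so the control flow is testable without a real browser:

```javascript
// Sketch: acquire the browser inside `try` so a failure in newPage()
// still reaches `finally`, and close only if launch actually succeeded.
// `launch` stands in for puppeteer.launch in this illustration.
async function withBrowser(launch, work) {
  let browser;
  try {
    browser = await launch();            // would be puppeteer.launch()
    const page = await browser.newPage();
    return await work(page);
  } finally {
    if (browser) {
      await browser.close();             // skipped when launch itself threw
    }
  }
}
```

The null check in `finally` is what prevents a second error (calling close() on undefined) from masking the original launch failure.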
```js
const linksWithInfo = await page.evaluate(() => {
    const anchorTags = Array.from(document.querySelectorAll('a'));
    return anchorTags.map(tag => tag.href).filter(link => link.startsWith('http'));
});
```
Normalize fragment-only variants before crawling.
tag.href includes #fragment, so the crawler treats /page#intro and /page#api as different entries in visitedUrls. That re-visits the same document, duplicates broken-link reports, and inflates the link counters. Strip hash before deduping and before calling checkLinksOnPage.
♻️ Proposed fix

```diff
 async function checkLinksOnPage(url, depth = 2) {
+    const normalizedUrl = new URL(url);
+    normalizedUrl.hash = '';
+    url = normalizedUrl.toString();
+
     if (isShuttingDown || depth < 0 || visitedUrls.has(url)) return; // Stop if shutting down, depth limit reached, or already visited
@@
-    const linksWithInfo = await page.evaluate(() => {
+    const linksWithInfo = (await page.evaluate(() => {
         const anchorTags = Array.from(document.querySelectorAll('a'));
         return anchorTags.map(tag => tag.href).filter(link => link.startsWith('http'));
-    });
+    })).map(link => {
+        const normalizedLink = new URL(link);
+        normalizedLink.hash = '';
+        return normalizedLink.toString();
+    });
@@
-    for (const link of linksWithInfo) {
+    for (const link of new Set(linksWithInfo)) {
```

Also applies to: 117-118
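The normalization step relies only on the WHATWG URL class built into Node; a minimal helper pair (function names here are illustrative) could look like:

```javascript
// Strip the #fragment so /page#intro and /page#api dedupe to one URL.
function stripFragment(link) {
  const u = new URL(link);
  u.hash = ''; // clearing the hash also removes the trailing '#'
  return u.toString();
}

// Deduplicate a crawled link list after normalization.
function normalizeLinks(links) {
  return [...new Set(links.map(stripFragment))];
}
```

Applying `normalizeLinks` to the crawler's collected hrefs before checking `visitedUrls` would keep fragment-only variants from being visited twice.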
Purpose
Fix the async flow in the broken link checker so link validations complete before the Puppeteer browser is closed.
Previously, markdown-link-check was triggered inside the page crawl loop without being awaited, which could allow the crawl to finish and the browser to close before all link checks had completed. This change extracts link validation into a dedicated checkLink() helper and waits for all pending checks before closing the browser.

Related Issue
No related issue
Related PRs
No related PRs
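The fix described in Purpose can be sketched as follows; `checkLink` and the link list are simplified stand-ins for the script's crawl loop:

```javascript
// Collect every pending validation promise during the crawl, then
// await them all before tearing the browser down.
async function runChecks(links, checkLink) {
  const pending = [];
  for (const link of links) {
    pending.push(checkLink(link));   // start the check, do not await here
  }
  // allSettled so one failed check cannot abort the remaining ones.
  const outcomes = await Promise.allSettled(pending);
  return outcomes.filter(o => o.status === 'rejected').length;
}
```

Only after `runChecks` resolves would `browser.close()` be safe to call, which is the ordering guarantee this PR restores.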
Test environment

- Verified with a test link that returns 404 after a 3-second delay
- Ran `node check-broken-links.js`

Security checks
Summary by CodeRabbit