[GH-ISSUE #513] Bulk processing of markdown files for validating local links #418

Closed
opened 2026-03-03 01:26:44 +03:00 by kerem · 7 comments

Originally created by @holamgadol on GitHub (Mar 31, 2022).
Original GitHub issue: https://github.com/DavidAnson/markdownlint/issues/513

I want to make a markdownlint extension for validating local links in [Foliant](https://foliant-docs.github.io/docs/) projects, especially those using [MkDocs](https://foliant-docs.github.io/docs/backends/mkdocs/) and [CustomIDs](https://foliant-docs.github.io/docs/preprocessors/customids/).

[remark-validate-links](https://github.com/remarkjs/remark-validate-links) offers similar functionality, but it is tailored to projects hosted on GitHub/GitLab. MkDocs (the SSG backend for Foliant) differs enough that adding MkDocs support to remark-validate-links would be unreasonable.

I've decided to try markdownlint instead. It has a clear structure and great VS Code compatibility. Unfortunately, markdownlint has one restriction: it parses only one MD file at a time, and I'm looking for a way to process MD files in bulk.

I want to apply the algorithm from remark-validate-links, where headings and anchors are parsed into a list of refs and links are parsed into a list of links. The two lists are then compared to find broken local links.

So is it possible to bulk process MD files? I think it could be done with a custom runner script, not only by rules.

I have read the discussion about [feat(rules): add valid-link-fragments](https://github.com/DavidAnson/markdownlint/pull/495). That PR tried to solve a similar problem, but in a different way.
kerem closed this issue and added the question label (2026-03-03 01:26:44 +03:00).

@DavidAnson commented on GitHub (Mar 31, 2022):

That remark project looks very similar to the proposed MD051 you link to and which I hope to finish off soon. As far as custom rules are concerned, they analyze one file at a time and don't really know about how many other files are going to be looked at. That's a pretty fundamental part of the system and I don't think you'll be able to work around that very cleanly. However, I assume the remark tool works similarly and so I'd want to understand how the behavior you want has been implemented there. Maybe that can provide some guidance? (FYI, I don't look at code for similar projects to avoid the chance of duplicating their code.)


@holamgadol commented on GitHub (Mar 31, 2022):

Well, I'll try to describe the algorithm from remark-validate-links in simple words:

  1. Collect all links and resolve their paths relative to the document
  2. Parse all headings and slug them into valid refs
  3. Try to find each link in the refs
  4. If the corresponding ref hasn't been found, check whether the file exists
  5. If the file doesn't exist, throw an error.

Links and refs are collected only from the input files. So for a link into a file outside the input we can't check the anchor, only the existence of the document.
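
The five steps above can be sketched as a two-pass script over an in-memory set of files. This is only an illustration of the description in this comment, not the actual remark-validate-links implementation: the function names (`slugify`, `collectRefs`, `findBrokenLinks`), the simplified GitHub-style slug rule, and the line-based regular expressions are all assumptions, and a real tool would use a Markdown parser (and MkDocs/CustomIDs slug rules) instead.

```javascript
const path = require("node:path");

// Assumed, simplified slug rule: lowercase, drop punctuation, spaces to hyphens.
function slugify(headingText) {
  return headingText
    .trim()
    .toLowerCase()
    .replace(/[^\w\- ]/g, "")
    .replace(/ /g, "-");
}

// Pass 1 (step 2): map every input file to the anchors its ATX headings define.
// Simplification: headings inside fenced code blocks are not excluded.
function collectRefs(files) {
  const refs = new Map();
  for (const [name, content] of Object.entries(files)) {
    const anchors = new Set();
    for (const line of content.split(/\r?\n/)) {
      const match = /^#{1,6}\s+(.+?)\s*#*\s*$/.exec(line);
      if (match) anchors.add(slugify(match[1]));
    }
    refs.set(path.posix.normalize(name), anchors);
  }
  return refs;
}

// Pass 2 (steps 1, 3, 4, 5): resolve each inline link and look it up in the refs.
// fileExists is a caller-supplied check for files outside the input.
function findBrokenLinks(files, fileExists = () => false) {
  const refs = collectRefs(files);
  const broken = [];
  const linkRe = /\[[^\]]*\]\(([^)\s]+)\)/g;
  for (const [name, content] of Object.entries(files)) {
    content.split(/\r?\n/).forEach((line, index) => {
      for (const match of line.matchAll(linkRe)) {
        const target = match[1];
        if (/^[a-z][a-z0-9+.-]*:/i.test(target)) continue; // skip external URLs
        const [filePart, fragment] = target.split("#");
        const resolved = filePart
          ? path.posix.normalize(path.posix.join(path.posix.dirname(name), filePart))
          : path.posix.normalize(name);
        const anchors = refs.get(resolved);
        if (anchors) {
          // Known input file: the anchor can be verified.
          if (fragment && !anchors.has(fragment)) {
            broken.push({ file: name, line: index + 1, target, reason: "missing anchor" });
          }
        } else if (!fileExists(resolved)) {
          // Unknown file: only its existence can be checked.
          broken.push({ file: name, line: index + 1, target, reason: "missing file" });
        }
      }
    });
  }
  return broken;
}
```

Because `collectRefs` runs over every input file before any link is checked, the order in which files are listed does not matter. That ordering guarantee is exactly what a one-file-at-a-time rule does not get for free.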


@DavidAnson commented on GitHub (Apr 1, 2022):

Thanks! What part of this do you think would be hard to translate to markdownlint?


@holamgadol commented on GitHub (Apr 1, 2022):

Collecting all refs before checking links.
For example, take two adjacent files (both are in the input):

- article.md

```md
You can see authors in [authors](readme.md#author) section in readme
```

- readme.md

```md
# Authors

- John Doe
- Bob Foo
```

If we try to check the link in `article.md` before collecting refs from `readme.md`, we won't have any information about the anchors in `readme.md`. So we'll miss the mistake in the link (`#author` instead of `#authors`) and only check that `readme.md` exists.

@DavidAnson commented on GitHub (Apr 1, 2022):

Agreed. But you could check README.md on demand right then, I think? (At the cost of reading/parsing it.)

Does remark make the content of all files available to rules at once? Or does it let them go back and report issues for files that have already been scanned?


@holamgadol commented on GitHub (Apr 1, 2022):

Does checking on demand mean we can read a file that wasn't in the input?

In the case of remark-validate-links, all files are available to rules at once, I suppose. I need to check this more carefully.


@DavidAnson commented on GitHub (Apr 1, 2022):

Under Node.js, the fs APIs are available and can be used. Under VS Code, the situation is more awkward because the file system MAY be virtualized.
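
As a rough sketch of the on-demand approach under Node.js, a custom rule could read and parse a linked file with `fs` the first time a link points at it, and cache the result so the cost is paid once per file. The rule object shape (`names`, `description`, `tags`, `parser`, `function(params, onError)`) follows my reading of markdownlint's custom rule documentation, and the `parser` property may not apply to the version current when this thread was written. The rule name, the `anchorsOf` helper, the slug rule, and the regex-based link matching are hypothetical. It would not work as written on a virtualized VS Code file system.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Anchors per absolute file path; null marks a file that could not be read.
const anchorCache = new Map();

// Assumed, simplified slug rule; MkDocs/CustomIDs would need their own.
function slugify(text) {
  return text.trim().toLowerCase().replace(/[^\w\- ]/g, "").replace(/ /g, "-");
}

// Read and parse a linked file on demand, at most once per file.
function anchorsOf(filePath) {
  if (!anchorCache.has(filePath)) {
    let anchors = null;
    try {
      const content = fs.readFileSync(filePath, "utf8");
      anchors = new Set();
      for (const line of content.split(/\r?\n/)) {
        const match = /^#{1,6}\s+(.+?)\s*#*\s*$/.exec(line);
        if (match) anchors.add(slugify(match[1]));
      }
    } catch {
      anchors = null; // missing or unreadable file
    }
    anchorCache.set(filePath, anchors);
  }
  return anchorCache.get(filePath);
}

// Hypothetical custom rule: checks links that point at other local files.
const localLinkRule = {
  names: ["local-link-fragments"],
  description: "Local links should point at existing files and headings",
  tags: ["links"],
  parser: "none",
  function: (params, onError) => {
    const baseDir = path.dirname(params.name);
    params.lines.forEach((line, index) => {
      for (const match of line.matchAll(/\[[^\]]*\]\(([^)\s]+)\)/g)) {
        const target = match[1];
        if (/^[a-z][a-z0-9+.-]*:/i.test(target)) continue; // external URL
        const [filePart, fragment] = target.split("#");
        if (!filePart) continue; // same-file fragments are out of scope here
        const anchors = anchorsOf(path.resolve(baseDir, filePart));
        if (anchors === null) {
          onError({
            lineNumber: index + 1,
            detail: `Cannot read linked file: ${filePart}`,
            context: match[0]
          });
        } else if (fragment && !anchors.has(fragment)) {
          onError({
            lineNumber: index + 1,
            detail: `No heading matches #${fragment} in ${filePart}`,
            context: match[0]
          });
        }
      }
    });
  }
};

module.exports = localLinkRule;
```

The trade-off compared with a bulk runner is that the rule never needs to know which other files are in the input, but it can only report problems in the file currently being linted, and the module-level cache goes stale if linked files change while the process stays alive (for example, in an editor session).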
