[GH-ISSUE #1243] [$20 bounty] Bug: reader view and katex / math rendering #807

Open
opened 2026-03-02 11:52:53 +03:00 by kerem · 9 comments

Originally created by @thiswillbeyourgithub on GitHub (Apr 12, 2025).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/1243

Describe the Bug

Hi,

On this research page by Anthropic, the equations are not rendered; raw LaTeX (?) formulas are shown instead.

Image

Steps to Reproduce

  1. Go to https://transformer-circuits.pub/2025/attribution-graphs/methods.html
  2. Save to karakeep
  3. Search the text for "refer to the output of the original" and notice the uninterpreted LaTeX formula.

Expected Behaviour

Image

Screenshots or Additional Context

No response

Device Details

No response

Exact Hoarder Version

v0.23.2

Have you checked the troubleshooting guide?

  • I have checked the troubleshooting guide and I haven't found a solution to my problem

@thiswillbeyourgithub commented on GitHub (Apr 26, 2025):

Hello again. I see that karakeep uses monolith, which has an issue mentioning that MathJax cannot be rendered in monolith (https://github.com/Y2Z/monolith/issues/278). They themselves mention that SingleFile works flawlessly for that. Since SingleFile is simple to embed using Docker or in many other ways (https://github.com/gildas-lormeau/single-file-cli), what do you think of switching from monolith to SingleFile? Or at least making the archival backend customizable?

No MathJax is a total dealbreaker unfortunately, so I am willing to put up a $20 bounty for this :)

Here's the line: github.com/karakeep-app/karakeep@1880a59f2c/apps/workers/crawlerWorker.ts (L515)

Edit: related to #329, where it is mentioned that we can actually use the SingleFile browser extension directly (https://docs.karakeep.app/Guides/singlefile/). That seems like a usable workaround, but it requires a desktop computer (Android cannot trigger that crawling).

Somewhat related to #1125


@MohamedBassem commented on GitHub (Apr 26, 2025):

I have plans to replace monolith, but it's not easy because SingleFile requires a browser running within the container, which we currently don't have. For now, as you figured out, the advice is to follow the SingleFile extension guide.


@thiswillbeyourgithub commented on GitHub (Apr 26, 2025):

Thanks for the follow-up. I'm not sure I understand: it seems that SingleFile can be run as its own container (https://github.com/screenbreak/SingleFile-dockerized) or as a standalone CLI (https://github.com/gildas-lormeau/single-file-cli), similarly to how monolith is called:
github.com/karakeep-app/karakeep@1880a59f2c/apps/workers/crawlerWorker.ts (L515)

Could you be a bit more specific about what the exact issue is?


@MohamedBassem commented on GitHub (Apr 26, 2025):

@thiswillbeyourgithub if you check the README of the CLI repo, you'll see:

> Make sure Chrome or a Chromium-based browser is installed in the default folder. Otherwise you might need to set the --browser-executable-path option to help SingleFile locating the path of the executable file.

This is because it spawns Chrome and does the crawling itself. Monolith, on the other hand, takes the serialized HTML, so it doesn't require Chrome to be available.
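As a rough illustration of that difference, the sketch below builds (but does not run) the two kinds of command line. Only `--browser-executable-path` comes from the README quoted above; the other arguments (`-` for stdin, `-b`, `-o`, and the positional URL and output file) are recalled from each tool's documented usage and should be treated as assumptions to verify against the installed versions. The function names are made up for this example, and karakeep's actual call in crawlerWorker.ts may differ.

```python
from typing import List, Optional


def monolith_argv(base_url: str, output_path: str) -> List[str]:
    """Command line for monolith reading already-serialized HTML from stdin.

    "-" asks monolith to read the document from stdin, so no browser is
    needed. The flags are assumptions recalled from monolith's README.
    """
    return ["monolith", "-", "-b", base_url, "-o", output_path]


def single_file_argv(
    url: str, output_path: str, browser_path: Optional[str] = None
) -> List[str]:
    """Command line for the SingleFile CLI, which fetches the URL itself.

    SingleFile drives a Chrome/Chromium instance, so the executable must be
    installed, or pointed to with --browser-executable-path.
    """
    argv = ["single-file", url, output_path]
    if browser_path is not None:
        argv.append(f"--browser-executable-path={browser_path}")
    return argv


print(monolith_argv("https://example.com/", "archive.html"))
print(single_file_argv("https://example.com/", "archive.html", "/usr/bin/chromium"))
```

The first command only works if something else has already produced the HTML to pipe in; the second needs a browser binary inside the same container, which is the packaging cost described above.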


@thiswillbeyourgithub commented on GitHub (Apr 26, 2025):

Thanks.

If you ask me, I'd prefer an oversized image that bundles both Chrome and SingleFile but is certain to keep the data I want to karakeep, over something lighter but less reliable. Until we find a better way, of course.

I highlighted quite a bit of text before I figured out that the MathJax was not parsed. I tried using the SingleFile extension on that URL, but although the archive is now present in karakeep, there is no Cached Content because of this error:

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory

That's even though I set this in my web env, as advised in https://github.com/karakeep-app/karakeep/issues/902:


      NODE_OPTIONS: --max-old-space-size=16384

Any pointers? Should I try a more recent version than release?

Edit: actually, I had restarted my env incorrectly, so now it only fails because of my timeout. Will update this comment.


@thiswillbeyourgithub commented on GitHub (Apr 26, 2025):

I was finally able to get the SingleFile extension working, and I now have both an archive and Cached Content. Unfortunately, the MathJax is still not rendered in the cached content, so I can't highlight the math.

In conclusion: the SingleFile extension is not an acceptable workaround for getting MathJax.


@thiswillbeyourgithub commented on GitHub (Apr 30, 2025):

Question: I was browsing the API documentation (https://docs.karakeep.app/API/update-a-bookmark). Do you know if I could render the MathJax externally and then update the karakeep content through the API? Any reason that wouldn't work? Should I update the text or the assetContent? Thanks!


@thiswillbeyourgithub commented on GitHub (May 12, 2025):

After taking a second look it appears to be KaTeX that is not being rendered.

For additional context:

Here's a page where math is not rendered (neither in archive nor in reader view):
https://transformer-circuits.pub/2025/attribution-graphs/methods.html

Image

Appears like this:

Image

And the corresponding line in html is:


<p>where <d-math>W_{enc}^{\ell}</d-math> is the CLT encoder matrix at layer <d-math>\ell</d-math>.</p>

And here is a page that works well:
https://www.gilesthomas.com/2025/03/llm-from-scratch-9-causal-attention

This is rendered correctly (in both archive and reader view):

Image

Here's the corresponding code:


<li>Firstly, we project the inputs into the query, key and value spaces:</li>
</ol>

<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><mrow><mi>Q</mi><mo>&#x0003D;</mo><mi>X</mi><msub><mi>W</mi><mi>q</mi></msub></mrow></math>

<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><mrow><mi>K</mi><mo>&#x0003D;</mo><mi>X</mi><msub><mi>W</mi><mi>k</mi></msub></mrow></math>

<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><mrow><mi>V</mi><mo>&#x0003D;</mo><mi>X</mi><msub><mi>W</mi><mi>v</mi></msub></mrow></math>

<ol start="2">

Edit: after tweaking some SingleFile settings, the "archive" page now renders the KaTeX fine, but I can't highlight it.
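To make the difference between the two pages concrete, here is a minimal Python sketch that pulls the LaTeX source out of `<d-math>` elements and checks for native MathML. It rests on an assumption not confirmed in this thread: that `<d-math>` is a custom element filled in by client-side JavaScript (presumably KaTeX), so an archiver that does not run the page's scripts only sees the raw source, whereas `<math>` is rendered by the browser directly. The function names are made up for illustration, and a regex is used only because the markup shown above is simple; a real HTML parser would be safer.

```python
import re
from typing import List

# Captures the text between <d-math ...> and </d-math> (non-greedy, may span lines).
D_MATH = re.compile(r"<d-math\b[^>]*>(.*?)</d-math>", re.DOTALL)

# Native MathML elements start with "<math"; "<d-math" does not match this.
MATHML = re.compile(r"<math\b")


def extract_d_math(html: str) -> List[str]:
    """Return the raw LaTeX source of every <d-math> element, in document order."""
    return [source.strip() for source in D_MATH.findall(html)]


def has_native_mathml(html: str) -> bool:
    """True if the HTML contains at least one native MathML <math> element."""
    return MATHML.search(html) is not None


# The line quoted above from the page that fails to render.
snippet = (
    r"<p>where <d-math>W_{enc}^{\ell}</d-math> is the CLT encoder matrix "
    r"at layer <d-math>\ell</d-math>.</p>"
)

for latex in extract_d_math(snippet):
    print(latex)
print("native MathML present:", has_native_mathml(snippet))
```

The extracted strings could then be handed to an external renderer, which is the workaround idea raised in this thread.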


@thiswillbeyourgithub commented on GitHub (May 12, 2025):

Can anyone tell me what's used to create the reader view? I could try digging deeper to find a workaround.

IIRC monolith is used to download the archive, then karakeep locally turns it into the reader view? And if I use singlefile extension it will bypass the archive but the reader view will still be computed locally?

Lastly, if I use the API's update-a-bookmark endpoint (https://docs.karakeep.app/API/update-a-bookmark/), I can apparently update text or assetContent, which are both strings. My intuition is that assetContent refers to the "raw download" as fetched by SingleFile or monolith, and that text refers to the "reader view". If that's the case, I could maybe use an external service to render the equations and then update the text value so that they display correctly in the reader view. Is that correct?

Edit: actually, might the API page contain an error? Looking at the output of search-a-bookmark, I see that a bookmark has a content key; for content of type link there is an htmlContent key, and no text.
So:

  1. it's not clear what assetContent refers to for a bookmark, especially if a bookmark can contain several assets
  2. it's not clear what text the bookmark API refers to.

Am I missing something, or is that an API doc error? If so, do you want me to open a tracking issue?
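For reference, the external-rendering idea could be sketched as below. This only builds the HTTP request and does not send it. The `/api/v1/bookmarks/<id>` path, the `PATCH` method, and bearer-token auth are assumptions to be checked against the API documentation; the host, key, and bookmark ID are placeholders; and whether writing to `text` actually changes the reader view is exactly the open question above.

```python
import json
import urllib.request
from typing import Dict


def build_update_request(
    base_url: str, api_key: str, bookmark_id: str, fields: Dict[str, str]
) -> urllib.request.Request:
    """Build (without sending) a request that updates a bookmark.

    Assumed, not confirmed in this thread: endpoint path, PATCH method,
    and bearer-token authentication.
    """
    url = f"{base_url.rstrip('/')}/api/v1/bookmarks/{bookmark_id}"
    body = json.dumps(fields).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; the HTML would come from an external math renderer.
req = build_update_request(
    "http://localhost:3000",
    "hypothetical-api-key",
    "hypothetical-bookmark-id",
    {"text": "<p>externally rendered content</p>"},
)
print(req.get_method(), req.full_url)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen`, ideally against a test bookmark first.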
