[GH-ISSUE #491] Bug: Images Are Not Loaded Correctly - All Images Appear as "Broken" in the Webpage #317

Closed
opened 2026-03-02 11:48:46 +03:00 by kerem · 5 comments
Owner

Originally created by @ljq29 on GitHub (Oct 6, 2024).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/491

Bug: Images Are Not Loaded Correctly - All Images Appear as "Broken" in the Webpage

Description

When using Hoarder to scrape images from a webpage, none of the images are loaded correctly. Instead, all the images appear as "broken" or missing. It seems that Hoarder is failing to fetch the images properly from the webpage.

Steps to Reproduce

  1. Use Hoarder to scrape a webpage with images.
  2. Check the output for images.
  3. Notice that all images are displayed as broken or missing.

Expected Behavior

The images should be properly fetched and displayed in the output without being broken.

Actual Behavior

All images on the webpage show up as broken, indicating that the image URLs or fetching process might not be working as expected.

Possible Causes

  • The image URLs could be relative paths, and Hoarder might not be converting them to absolute URLs.
  • There could be an issue with how Hoarder processes image tags in the webpage's HTML.
  • There might be network or permission issues preventing image fetching.
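The first suspected cause above — relative `src` values never being resolved — can be sketched in a few lines. This is illustrative only; `resolveImageSrc` is a hypothetical name, not Hoarder's actual code:

```typescript
// Hypothetical sketch of resolving relative <img src> values against the
// page URL before reusing them. A src left as "/photo.png" renders broken
// from any other origin; resolved, it becomes a fetchable absolute URL.
function resolveImageSrc(src: string, pageUrl: string): string {
  // The WHATWG URL constructor handles absolute, protocol-relative,
  // root-relative, and path-relative srcs uniformly.
  return new URL(src, pageUrl).toString();
}
```

For example, `resolveImageSrc("/img/a.png", "https://example.com/post/1")` yields `https://example.com/img/a.png`.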

Environment

  • Hoarder version: [version]
  • Operating system: [OS]
  • Any relevant details or logs: [include logs if necessary]

Additional Context

Any webpage with images will have this issue, making it impossible to scrape or view images correctly.

kerem closed this issue 2026-03-02 11:48:46 +03:00

@kamtschatka commented on GitHub (Oct 6, 2024):

So this is definitely not a general problem, as it works just fine elsewhere and you are the first one to report it.
Considering that you had issues setting up your environment before (https://github.com/hoarder-app/hoarder/issues/487), did you check that everything is correct?

  • Are the log files showing any errors?
  • Can you provide a sample page where this actually happens?
  • Can you show a screenshot of what you see?

@ljq29 commented on GitHub (Oct 6, 2024):

Description

I found an example at [this link](https://baijiahao.baidu.com/s?id=1801620175325254434&wfr=spider&for=pc), where the original image is:

![Image from Baijiahao](https://github.com/user-attachments/assets/38009ee7-8b8d-4a33-84f0-f6e20e2682a0)

And the screenshot in Hoarder is:

![Image in Hoarder](https://github.com/user-attachments/assets/123d4fcf-fa7c-4e06-89b6-c4bfdb4237dc)

Additionally, I just noticed that Hoarder uses the original image link, rather than caching the image to its own server like Cubox does. Would it be possible to improve this aspect?

Benefits

  • Data Integrity: Caching the image on the server ensures that the image is always available, even if the original link becomes unavailable.
  • Security: Storing images locally on the server can prevent issues related to linking to external content that may change or be compromised.
  • Performance: Serving images from the same server could potentially improve the loading speed and provide a smoother user experience.

Proposed Solution

  • Update Hoarder to cache the linked images to its own server, similar to how Cubox handles images.
  • Ensure that cached images are stored securely and can be accessed efficiently within the application.
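One piece of the proposal above can be sketched as deriving a stable local filename for each remote image, so repeated crawls of the same URL reuse the cached copy. This is a minimal illustration, not Hoarder's API; `cacheKey` and the `.img` suffix are made-up names:

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch of the caching proposal: map each remote image URL to
// a stable local filename, so a bookmark can be served from the app's own
// server even if the original link later dies.
function cacheKey(imageUrl: string): string {
  // Hashing the URL is the simplest scheme that stays stable across
  // re-crawls; content-addressing the bytes would be stronger.
  return createHash("sha256").update(imageUrl).digest("hex").slice(0, 16) + ".img";
}
```

On crawl, the worker would fetch the image once, store it under this key, and rewrite the page's `src` to the local path.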

@kamtschatka commented on GitHub (Oct 6, 2024):

Please check out the config flags in the [documentation](https://docs.hoarder.app/configuration).
You can already enable archiving by setting CRAWLER_FULL_PAGE_ARCHIVE to true.

From what I can tell, those requests are first blocked by the browser due to Opaque Response Blocking. Even if we were to work around that with some changes, Baidu simply does not want its images embedded in other webpages, so hotlinking will not work. For the preview you'll have to live with that. With archiving configured, everything is downloaded correctly, because the download is no longer constrained by the browser's rules.
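For a docker-compose deployment, the flag would go into the crawler's environment. A minimal sketch, assuming the older split-container layout mentioned later in this thread (service names vary by setup):

```yaml
# docker-compose.yml excerpt (illustrative; service names vary by setup).
# In the split web/worker layout, crawler flags belong on the worker service.
services:
  worker:
    environment:
      - CRAWLER_FULL_PAGE_ARCHIVE=true
```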


@ljq29 commented on GitHub (Oct 6, 2024):

> Please check out the config flags in the [documentation](https://docs.hoarder.app/configuration). You can already enable archiving by setting CRAWLER_FULL_PAGE_ARCHIVE to true.
>
> From what I can tell, those requests are first blocked by the browser due to Opaque Response Blocking. Even if we were to work around that with some changes, Baidu simply does not want its images embedded in other webpages, so hotlinking will not work. For the preview you'll have to live with that. With archiving configured, everything is downloaded correctly, because the download is no longer constrained by the browser's rules.

As for your issue with the `env` configuration, where exactly should this be placed within the containers? I tried deploying it in the worker container, but it did not work.

![image](https://github.com/user-attachments/assets/1d66689d-5679-416e-9bda-befedf714a5e)

When I placed the `env` configuration in the web container, several web pages repeatedly failed to fetch:

![image](https://github.com/user-attachments/assets/b7711eb3-c69e-4ca5-9b64-adb942316d38)

@kamtschatka commented on GitHub (Oct 6, 2024):

Yes, in the worker environment variables (btw: you are using the old setup, where web and worker are separate docker containers). You probably did not look at the "Archive" tab in the preview, but at the same screen as above.
