[GH-ISSUE #668] New Extractor Idea: Find/write a "cad-dl" to save 3d assets, gltf files, CAD files, shapefiles, STLs, VR views, etc. #1932

Open
opened 2026-03-01 17:55:04 +03:00 by kerem · 27 comments
Owner

Originally created by @fire on GitHub (Mar 19, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/668

My thoughts is a combination of having archive box save the asset as a Blender file and as a gltf2.

However, there's layers of problems here.

Any suggestions are welcome.

I can provide technical support and man-months, but not sure where to start.

Originally created by @fire on GitHub (Mar 19, 2021). Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/668 My thoughts is a combination of having archive box save the asset as a Blender file and as a gltf2. However, there's layers of problems here. Any suggestions are welcome. I can provide technical support and man-months, but not sure where to start.
Author
Owner

@pirate commented on GitHub (Mar 20, 2021):

What do you mean by 3d assets? Can you provide some examples of URLs and formats that you'd expect it to be able to save?

<!-- gh-comment-id:803248226 --> @pirate commented on GitHub (Mar 20, 2021): What do you mean by 3d assets? Can you provide some examples of URLs and formats that you'd expect it to be able to save?
Author
Owner

@fire commented on GitHub (Mar 20, 2021):

Thanks for your prompt reply.

  1. (gltf without animations) https://3d.si.edu/object/3d/command-module-apollo-11:d8c63e8a-4ebc-11ea-b77f-2e728ce88125
  2. (blend) https://cloud.blender.org/p/gallery/5e46a80442fa9613e1cd1fca
  3. (blend) https://cloud.blender.org/p/gallery/60337d495677e942564cce76
  4. (gltf with culturally significant animations) https://sketchfab.com/3d-models/fortnite-floss-emote-0a52f8e8eaf7441faffd8efc8d8a9e0e
  5. (VRM based on gltf) https://booth.pm/en/items/1050142 (This is a stretch goal, but I would handle this by either ignoring the format, importing into blend or identify it as a gltf file)
  6. (VRM based on gltf) https://github.com/Miraikomachi/MiraikomachiVRM/blob/master/Miraikomachi.vrm

The typical open formats are FBX (no good opensource reader), glTF2(open source), Blend (has an implementation in Blender but is complicated) and USDZ (Newer standard. Do not use due to complexity). I think there are a few cad formats, but they're either convertable to gltf2 or blender.

There are some other formats that aren't mentioned like alembic but the point is to support a well defined, narrow format that can be opened in the future for archival use.

<!-- gh-comment-id:803379347 --> @fire commented on GitHub (Mar 20, 2021): Thanks for your prompt reply. 1. (gltf without animations) https://3d.si.edu/object/3d/command-module-apollo-11:d8c63e8a-4ebc-11ea-b77f-2e728ce88125 2. (blend) https://cloud.blender.org/p/gallery/5e46a80442fa9613e1cd1fca 3. (blend) https://cloud.blender.org/p/gallery/60337d495677e942564cce76 4. (gltf with culturally significant animations) https://sketchfab.com/3d-models/fortnite-floss-emote-0a52f8e8eaf7441faffd8efc8d8a9e0e 5. (VRM based on gltf) https://booth.pm/en/items/1050142 (This is a stretch goal, but I would handle this by either ignoring the format, importing into blend or identify it as a gltf file) 6. (VRM based on gltf) https://github.com/Miraikomachi/MiraikomachiVRM/blob/master/Miraikomachi.vrm The typical open formats are FBX (no good opensource reader), glTF2(open source), Blend (has an implementation in Blender but is complicated) and USDZ (Newer standard. Do not use due to complexity). I think there are a few cad formats, but they're either convertable to gltf2 or blender. There are some other formats that aren't mentioned like alembic but the point is to support a well defined, narrow format that can be opened in the future for archival use.
Author
Owner

@pirate commented on GitHub (Mar 22, 2021):

I think the best way to go about this is to find an existing program (or snippet of puppeteer/playwright JS) that can look for these assets on a page (given some html or a url) and add it as an extractor module to archivebox.

I don't know the 3D space at all, so I'm probably not the right person for this, but I'm happy to review PRs or design proposals for such an extractor.

<!-- gh-comment-id:804268241 --> @pirate commented on GitHub (Mar 22, 2021): I think the best way to go about this is to find an existing program (or snippet of puppeteer/playwright JS) that can look for these assets on a page (given some html or a url) and add it as an extractor module to archivebox. I don't know the 3D space at all, so I'm probably not the right person for this, but I'm happy to review PRs or design proposals for such an extractor.
Author
Owner

@fire commented on GitHub (Oct 2, 2021):

If I can script Blender would that be acceptable?

https://github.com/donmccurdy/glTF-Transform is web native.

<!-- gh-comment-id:932790707 --> @fire commented on GitHub (Oct 2, 2021): If I can script Blender would that be acceptable? https://github.com/donmccurdy/glTF-Transform is web native.
Author
Owner

@fire commented on GitHub (Jan 29, 2023):

@pirate Can you link me some guides for writing extractors. Also what is the format for design proposals?

I think finding the urls can be worked around by linking a direct url for now.

<!-- gh-comment-id:1407720897 --> @fire commented on GitHub (Jan 29, 2023): @pirate Can you link me some guides for writing extractors. Also what is the format for design proposals? I think finding the urls can be worked around by linking a direct url for now.
Author
Owner

@pirate commented on GitHub (Jan 31, 2023):

@fire The process for adding a new extractor is documented here:

Note the main constraint for ArchiveBox right now is deployment complexity, so I'm putting a hold adding new binary dependencies at the moment. I don't think that will necessarily impair your ability to download 3d files as long as they don't need any further 3d processing after download. If you have a pure python package or npm library that can snapshot 3d assets from a URL with minimal packaging complexity and linux/macOS + x86/arm7/arm64 support then I'm down to consider it.

<!-- gh-comment-id:1409904636 --> @pirate commented on GitHub (Jan 31, 2023): @fire The process for adding a new extractor is documented here: - https://github.com/ArchiveBox/ArchiveBox#archivebox-development - https://github.com/ArchiveBox/ArchiveBox#contributing-a-new-extractor Note the main constraint for ArchiveBox right now is deployment complexity, so I'm putting a hold adding new binary dependencies at the moment. I don't think that will necessarily impair your ability to *download* 3d files as long as they don't need any further 3d processing after download. If you have a pure python package or npm library that can snapshot 3d assets from a URL with minimal packaging complexity and linux/macOS + x86/arm7/arm64 support then I'm down to consider it.
Author
Owner

@fire commented on GitHub (Jan 31, 2023):

The standard tool for gltf is https://gltf-transform.donmccurdy.com/ like ffmpeg in video importing.

I'll have to find a url extractor.

<!-- gh-comment-id:1409996334 --> @fire commented on GitHub (Jan 31, 2023): The standard tool for gltf is https://gltf-transform.donmccurdy.com/ like ffmpeg in video importing. I'll have to find a url extractor.
Author
Owner

@pirate commented on GitHub (Apr 27, 2023):

Maybe we can borrow code from this extension: https://github.com/stephancasas/thingiverse-stl-downloader

<!-- gh-comment-id:1525729084 --> @pirate commented on GitHub (Apr 27, 2023): Maybe we can borrow code from this extension: https://github.com/stephancasas/thingiverse-stl-downloader
Author
Owner

@pirate commented on GitHub (Jun 13, 2023):

If anyone knows of any youtube-dl / yt-dlp equivalent program to find + download 3d assets from a URL that would be super useful here. Please comment with any suggestions :)

At the moment I'm still not willing to write custom logic to do this extraction, as it would be too much for me to maintain as a solo developer working on ArchiveBox in my spare time, but if we can find an external program/library that can do it then the task is much easier.

<!-- gh-comment-id:1589012115 --> @pirate commented on GitHub (Jun 13, 2023): If anyone knows of any `youtube-dl` / `yt-dlp` equivalent program to find + download 3d assets from a URL that would be super useful here. Please comment with any suggestions :) At the moment I'm still not willing to write custom logic to do this extraction, as it would be too much for me to maintain as a solo developer working on ArchiveBox in my spare time, but if we can find an external program/library that can do it then the task is much easier.
Author
Owner

@fire commented on GitHub (Jun 13, 2023):

Suggest an interface for me, and I might take a try at making one from scratch.

<!-- gh-comment-id:1589451008 --> @fire commented on GitHub (Jun 13, 2023): Suggest an interface for me, and I might take a try at making one from scratch.
Author
Owner

@pirate commented on GitHub (Jun 13, 2023):

Same CLI as YouTube-dl/yt-dlp would be great. E.g.

shapefile-dl [--max-size=750m] https://example.com/some/page/containing/cad/files

It should output one or more files to the current directory the command is run in, and return 1 exit status + error text if it fails, or 2 exit status if no shape files are found.

Pure Python would be ideal, but js is also ok.

<!-- gh-comment-id:1589974596 --> @pirate commented on GitHub (Jun 13, 2023): Same CLI as YouTube-dl/yt-dlp would be great. E.g. `shapefile-dl [--max-size=750m] https://example.com/some/page/containing/cad/files` It should output one or more files to the current directory the command is run in, and return 1 exit status + error text if it fails, or 2 exit status if no shape files are found. Pure Python would be ideal, but js is also ok.
Author
Owner

@fire commented on GitHub (Jun 13, 2023):

Oh. so it needs to be python or javascript, but not like c++ or elixir binaries, hmmm. My plan was to either write one from scratch or use Godot Engine's code I know the details for.

Godot Engine has a wasm platform.

<!-- gh-comment-id:1590048084 --> @fire commented on GitHub (Jun 13, 2023): Oh. so it needs to be python or javascript, but not like c++ or elixir binaries, hmmm. My plan was to either write one from scratch or use Godot Engine's code I know the details for. Godot Engine has a wasm platform.
Author
Owner

@pirate commented on GitHub (Jun 13, 2023):

A binary is technically ok, it's just more difficult for us to maintain and for users to install. If it's not Python or JS, then it needs to be packaged via both apt and brew, and we have to update and test more places like the Dockerfiles, CI configs, documentation, setup helper scripts, etc.

<!-- gh-comment-id:1590119637 --> @pirate commented on GitHub (Jun 13, 2023): A binary is technically ok, it's just more difficult for us to maintain and for users to install. If it's not Python or JS, then it needs to be packaged via both apt *and* brew, and we have to update and test more places like the Dockerfiles, CI configs, documentation, setup helper scripts, etc.
Author
Owner

@fire commented on GitHub (Oct 25, 2023):

I think I can use Godot Engine to handle some of these formats in the near future.

  1. FBX - we are developing a Godot Engine opensource reader
  2. glTF2 - we can use Godot Engine to parse the metadata
  3. blender - we can use Godot Engine and Blender in a docker container
  4. USDZ - usd2glb supports converting USDZ to gltf https://github.com/fynv/usd2glb
<!-- gh-comment-id:1779062744 --> @fire commented on GitHub (Oct 25, 2023): I think I can use Godot Engine to handle some of these formats in the near future. 1. FBX - we are developing a Godot Engine opensource reader 2. glTF2 - we can use Godot Engine to parse the metadata 3. blender - we can use Godot Engine and Blender in a docker container 4. USDZ - usd2glb supports converting USDZ to gltf https://github.com/fynv/usd2glb
Author
Owner

@pirate commented on GitHub (Oct 26, 2023):

That sounds like a lot of post processing. I'd like to keep archivebox focused on just initial preservation, not further processing of artifacts beyond that step.
Post-processing steps can be done elsewhere in a pipeline by other software working on the output that ArchiveBox produces. If requires a full 3d engine then it's probably beyond our scope.

For now we are still looking for a suitable program that can rip 3D asset files out of an HTML page and into raw files on disk.

<!-- gh-comment-id:1780654126 --> @pirate commented on GitHub (Oct 26, 2023): That sounds like a lot of post processing. I'd like to keep archivebox focused on just initial preservation, not further processing of artifacts beyond that step. Post-processing steps can be done elsewhere in a pipeline by other software working on the output that ArchiveBox produces. If requires a full 3d engine then it's probably beyond our scope. For now we are *still looking for a suitable program that can rip 3D asset files out of an HTML page and into raw files on disk*.
Author
Owner

@pirate commented on GitHub (Feb 21, 2024):

@benmuth would also be interested in a solution to this if you want to do some research / see what works for this problem. some sites to try extracting STLs, CAD, gltf, blend, etc. files out of:

<!-- gh-comment-id:1958236944 --> @pirate commented on GitHub (Feb 21, 2024): @benmuth would also be interested in a solution to this if you want to do some research / see what works for this problem. some sites to try extracting STLs, CAD, [gltf, blend, etc.](https://github.com/ArchiveBox/ArchiveBox/issues/668#issuecomment-803379347) files out of: - https://thingiverse.com - https://thangs.com - https://grabcad.com/
Author
Owner

@fire commented on GitHub (Feb 21, 2024):

The good thing is CAD files that don't involve animation are relatively easy, but STEP is hard.

<!-- gh-comment-id:1958330897 --> @fire commented on GitHub (Feb 21, 2024): The good thing is CAD files that don't involve animation are relatively easy, but STEP is hard.
Author
Owner

@fire commented on GitHub (Feb 29, 2024):

I am trying a fork of https://github.com/V-Sekai/USD-Fileformat-plugins for conversion of 3d model formats, but its not trivial at all. Think like 1.6 gigabytes.

<!-- gh-comment-id:1971691310 --> @fire commented on GitHub (Feb 29, 2024): I am trying a fork of https://github.com/V-Sekai/USD-Fileformat-plugins for conversion of 3d model formats, but its not trivial at all. Think like 1.6 gigabytes.
Author
Owner

@pirate commented on GitHub (Mar 1, 2024):

One simple solution we could do is run all the URLs in found in a page through something like magika and download anything that has cad, 3d, shapefile, etc. in the detected type output.

https://opensource.googleblog.com/2024/02/magika-ai-powered-fast-and-efficient-file-type-identification.html

<!-- gh-comment-id:1972173999 --> @pirate commented on GitHub (Mar 1, 2024): One simple solution we could do is run all the URLs in found in a page through something like `magika` and download anything that has `cad`, `3d`, `shapefile`, etc. in the detected type output. https://opensource.googleblog.com/2024/02/magika-ai-powered-fast-and-efficient-file-type-identification.html
Author
Owner

@fire commented on GitHub (Mar 1, 2024):

It is wise to note that the process of determining dependencies might be a lot easier to solve than parsing the entire file.

Like given a fbx file it's easier to parse to find its dependent textures than to convert fbx to glb.

This is related to the only do scanning idea mentioned in the last post.

<!-- gh-comment-id:1973532926 --> @fire commented on GitHub (Mar 1, 2024): It is wise to note that the process of determining dependencies might be a lot easier to solve than parsing the entire file. Like given a fbx file it's easier to parse to find its dependent textures than to convert fbx to glb. This is related to the only do scanning idea mentioned in the last post.
Author
Owner

@pirate commented on GitHub (Mar 2, 2024):

To clarify again, I don't want ArchiveBox to actually process any 3D files / read their contents, so we don't need any 3d modeling engine integration. I just want it to download whatever is available as-is. People can always have other programs read the output from archivebox.

<!-- gh-comment-id:1974791628 --> @pirate commented on GitHub (Mar 2, 2024): To clarify again, I don't want ArchiveBox to actually process any 3D files / read their contents, so we don't need any 3d modeling engine integration. I just want it to download whatever is available as-is. People can always have other programs read the output from archivebox.
Author
Owner

@benmuth commented on GitHub (Mar 8, 2024):

I tried to find existing tools to extract these files, but haven't had success yet.

One simple solution we could do is run all the URLs in found in a page through something like magika and download anything that has cad, 3d, shapefile, etc. in the detected type output.

https://opensource.googleblog.com/2024/02/magika-ai-powered-fast-and-efficient-file-type-identification.html

I like this idea, but it looks like magika doesn't support these formats yet. They're accepting suggestions, so maybe we can open an issue for each of these (it looks like .blend files have already been suggested).

I gave it a shot anyway with some of the file types linked in this issue, and here are the results I got (also included file results for reference):

stl
magika: ISO 9660 CD-ROM filesystem data (archive) 99%
file: data

gltf
magika: JSON document (code) 97%
file: JSON data

blend
magika: gzip compressed data (archive) 100%
file: gzip compressed data

STEP
magika: Generic text document (text) [Low-confidence model best-guess: CSV document (code), score=41]
file: ASCII text, with very long lines (1650), with CRLF line terminators

These are the first formats I found examples of, but I'd like to try more files.

Not sure how stable these results will be across all valid files of each format. The only one I'm confident would be stable is gltf because it's literally JSON.

If we're confident that magika (or even libmagic I guess) would give a stable, meaningful (i.e. not "data" or something) result for a given format, I guess we could just look for links with the correct extension, check to see if the linked resource the expected output for that filetype from magika/libmagic, then download it if so. Seems janky but it might work.

Does that approach make sense? Or should we just wait for official support for each file type from magika? I can try writing a test script to see how well it works if we think it's something worth pursuing.

<!-- gh-comment-id:1984938238 --> @benmuth commented on GitHub (Mar 8, 2024): I tried to find existing tools to extract these files, but haven't had success yet. > One simple solution we could do is run all the URLs in found in a page through something like `magika` and download anything that has `cad`, `3d`, `shapefile`, etc. in the detected type output. > > https://opensource.googleblog.com/2024/02/magika-ai-powered-fast-and-efficient-file-type-identification.html I like this idea, but it looks like [`magika` doesn't support these formats yet](https://github.com/google/magika/blob/main/docs/supported_content_types_list.md). They're accepting suggestions, so maybe we can open an issue for each of these ([it looks like `.blend` files have already been suggested](https://github.com/google/magika/issues/114)). I gave it a shot anyway with some of the file types linked in this issue, and here are the results I got (also included `file` results for reference): **`stl`** `magika`: `ISO 9660 CD-ROM filesystem data (archive) 99%` `file`: `data` **`gltf`** `magika`: `JSON document (code) 97%` `file`: `JSON data` **`blend`** `magika`: `gzip compressed data (archive) 100%` `file`: `gzip compressed data` **`STEP`** `magika`: `Generic text document (text) [Low-confidence model best-guess: CSV document (code), score=41]` `file`: `ASCII text, with very long lines (1650), with CRLF line terminators` These are the first formats I found examples of, but I'd like to try more files. Not sure how stable these results will be across all valid files of each format. The only one I'm confident would be stable is `gltf` because it's literally `JSON`. If we're confident that `magika` (or even `libmagic` I guess) would give a stable, meaningful (i.e. not "data" or something) result for a given format, I guess we could just look for links with the correct extension, check to see if the linked resource the expected output for that filetype from `magika`/`libmagic`, then download it if so. Seems janky but it might work. Does that approach make sense? Or should we just wait for official support for each file type from `magika`? I can try writing a test script to see how well it works if we think it's something worth pursuing.
Author
Owner

@pirate commented on GitHub (Mar 14, 2024):

On further inspection magika is actually pretty disappointing, there are many formats it sucks at detecting.

I think simple extension/content-type based detection is enough for now. Running DOM/Singlefile output through a simple regex to find all URLs that end in relevant extensions (.blend, .stl, .obj, .stp, etc. ) and just wget-ing those would already be super useful.

<!-- gh-comment-id:1996891741 --> @pirate commented on GitHub (Mar 14, 2024): On further inspection `magika` is actually pretty disappointing, there are many formats it sucks at detecting. I think simple extension/content-type based detection is enough for now. Running DOM/Singlefile output through a simple regex to find all URLs that end in relevant extensions (`.blend`, `.stl`, `.obj`, `.stp`, [etc.](https://all3dp.com/2/most-common-3d-file-formats-model/) ) and just `wget`-ing those would already be super useful.
Author
Owner

@benmuth commented on GitHub (Mar 18, 2024):

I've checked quite a few websites for test cases and can't find any that directly link to 3d assets. I could be looking in the wrong places though. I'd appreciate a link if someone finds one.

<!-- gh-comment-id:2004336696 --> @benmuth commented on GitHub (Mar 18, 2024): I've checked quite a few websites for test cases and can't find any that directly link to 3d assets. I could be looking in the wrong places though. I'd appreciate a link if someone finds one.
Author
Owner

@pirate commented on GitHub (Mar 18, 2024):

These are the key ones I want to support: https://github.com/ArchiveBox/ArchiveBox/issues/668#issuecomment-1958236944

If they don't link directly / download URLs cant be found with regex, then we may need to write a puppeteer script like this to get the files by clicking around the page a bit.

<!-- gh-comment-id:2004779210 --> @pirate commented on GitHub (Mar 18, 2024): These are the key ones I want to support: https://github.com/ArchiveBox/ArchiveBox/issues/668#issuecomment-1958236944 If they don't link directly / download URLs cant be found with regex, then we may need to write a puppeteer script [like this](https://gist.github.com/pirate/a78f8dcf2ade7c2881f2e18386d5c0de) to get the files by clicking around the page a bit.
Author
Owner

@pirate commented on GitHub (Dec 29, 2025):

@claude can you take a crack at implementing a new plugin for this archivebox/plugsin/caddl based on the dev branch. make sure to read Claude.md and make sure your plugin is isolated (doesn't depend on archivebox itself) and follows the patterns we lay out in the other plugins.

<!-- gh-comment-id:3697606769 --> @pirate commented on GitHub (Dec 29, 2025): @claude can you take a crack at implementing a new plugin for this archivebox/plugsin/caddl based on the dev branch. make sure to read Claude.md and make sure your plugin is isolated (doesn't depend on archivebox itself) and follows the patterns we lay out in the other plugins.
Author
Owner

@claude[bot] commented on GitHub (Dec 29, 2025):

Claude finished @pirate's task in 5m 18s —— View jobclaude/issue-668-20251229-2145Create PR ➔


Implementing CAD/3D Asset Extractor Plugin

Successfully implemented a new caddl plugin for extracting 3D/CAD assets from web pages!

Tasks

  • Read CLAUDE.md and understand development guidelines
  • Examine existing plugin patterns
  • Create plugin directory structure at archivebox/plugins/caddl
  • Implement URL detection for 3D file formats
  • Implement download functionality
  • Add configuration and dependencies
  • Write tests following guidelines
  • Test the implementation
  • Commit and push changes

Implementation Details

Created a complete, isolated plugin at archivebox/plugins/caddl/ with:

Configuration (config.json):

  • CADDL_ENABLED: Enable/disable the extractor (default: true)
  • CADDL_TIMEOUT: Download timeout in seconds (default: 300)
  • CADDL_MAX_SIZE: Maximum file size limit (default: 750m)
  • CADDL_CHECK_SSL_VALIDITY: SSL certificate verification
  • CADDL_USER_AGENT: Custom user agent string
  • CADDL_COOKIES_FILE: Path to cookies file
  • CADDL_EXTENSIONS: Customizable list of file extensions to download

Binary Dependencies (binaries.jsonl):

  • Uses curl for downloading (already available on most systems)

Extractor Script (on_Snapshot__65_caddl.bg.py):

  • Parses HTML from singlefile/dom extractors to find URLs with 3D file extensions
  • Supports 15+ formats: .blend, .stl, .obj, .step, .stp, .gltf, .glb, .fbx, .vrm, .usdz, .dae, .3ds, .ply, .off, .x3d
  • Downloads files using curl with proper error handling
  • Handles direct CAD file URLs (if URL itself is a CAD file)
  • Respects size limits and timeouts
  • Sanitizes filenames and avoids overwrites
  • Returns proper exit codes and JSON output

Templates:

  • Icon (🧊), thumbnail, embed, and fullscreen views for the UI

Tests (tests/test_caddl.py):

  • Validates script exists and is executable
  • Tests help command
  • Tests disabled mode (CADDL_ENABLED=False)
  • Tests behavior with no HTML/no CAD files
  • All tests pass

Key Features

Isolated Design: Doesn't depend on ArchiveBox core - follows plugin isolation pattern
Simple Approach: Uses regex to find URLs with CAD extensions, then downloads them as-is
No 3D Processing: Just downloads files, doesn't process or convert them (as requested)
Configurable: All settings exposed via environment variables
Tested: Includes unit tests that verify functionality
Production Ready: Proper error handling, logging, and exit codes

Usage Example

# Enable in config
export CADDL_ENABLED=True

# Add a URL with 3D assets
archivebox add https://example.com/models/

# Files with extensions like .stl, .gltf, .blend will be downloaded to:
# <snapshot_dir>/caddl/model_name.stl

Next Steps

This plugin is ready to use! Users can:

  1. Enable it in their ArchiveBox config
  2. Customize the file extensions list if needed
  3. Add URLs containing 3D/CAD assets
  4. Find downloaded files in each snapshot's caddl/ directory

--- | Branch: claude/issue-668-20251229-2145

<!-- gh-comment-id:3697607801 --> @claude[bot] commented on GitHub (Dec 29, 2025): **Claude finished @pirate's task in 5m 18s** —— [View job](https://github.com/ArchiveBox/ArchiveBox/actions/runs/20583380670) • [`claude/issue-668-20251229-2145`](https://github.com/ArchiveBox/ArchiveBox/tree/claude/issue-668-20251229-2145) • [Create PR ➔](https://github.com/ArchiveBox/ArchiveBox/compare/dev...claude/issue-668-20251229-2145?quick_pull=1&title=Add%20caddl%20plugin%20for%203D%2FCAD%20asset%20extraction&body=Implements%20a%20new%20plugin%20to%20download%203D%20and%20CAD%20assets%20from%20web%20pages.%0A%0A%23%23%20Features%0A-%20Detects%20and%20downloads%20files%20with%203D%2FCAD%20extensions%20%28.blend%2C%20.stl%2C%20.obj%2C%20.gltf%2C%20.glb%2C%20.fbx%2C%20.vrm%2C%20.usdz%2C%20etc.%29%0A-%20Parses%20HTML%20from%20singlefile%2Fdom%20extractors%20to%20find%20asset%20URLs%0A-%20Configurable%20timeout%2C%20max%20file%20size%2C%20SSL%20verification%2C%20and%20user%20agent%0A-%20Uses%20curl%20for%20downloads%20%28already%20available%20in%20most%20systems%29%0A-%20Isolated%20plugin%20that%20doesn%27t%20depend%20on%20ArchiveBox%20core%0A-%20Includes%20tests%20and%20UI%20templates%0A%0AFixes%20%23668%0A%0AGenerated%20with%20%5BClaude%20Code%5D%28https%3A%2F%2Fclaude.ai%2Fcode%29) --- ### Implementing CAD/3D Asset Extractor Plugin Successfully implemented a new `caddl` plugin for extracting 3D/CAD assets from web pages! #### Tasks - [x] Read CLAUDE.md and understand development guidelines - [x] Examine existing plugin patterns - [x] Create plugin directory structure at `archivebox/plugins/caddl` - [x] Implement URL detection for 3D file formats - [x] Implement download functionality - [x] Add configuration and dependencies - [x] Write tests following guidelines - [x] Test the implementation - [x] Commit and push changes #### Implementation Details Created a complete, isolated plugin at `archivebox/plugins/caddl/` with: **Configuration** (`config.json`): - `CADDL_ENABLED`: Enable/disable the extractor (default: true) - `CADDL_TIMEOUT`: Download timeout in seconds (default: 300) - `CADDL_MAX_SIZE`: Maximum file size limit (default: 750m) - `CADDL_CHECK_SSL_VALIDITY`: SSL certificate verification - `CADDL_USER_AGENT`: Custom user agent string - `CADDL_COOKIES_FILE`: Path to cookies file - `CADDL_EXTENSIONS`: Customizable list of file extensions to download **Binary Dependencies** (`binaries.jsonl`): - Uses `curl` for downloading (already available on most systems) **Extractor Script** (`on_Snapshot__65_caddl.bg.py`): - Parses HTML from singlefile/dom extractors to find URLs with 3D file extensions - Supports 15+ formats: `.blend`, `.stl`, `.obj`, `.step`, `.stp`, `.gltf`, `.glb`, `.fbx`, `.vrm`, `.usdz`, `.dae`, `.3ds`, `.ply`, `.off`, `.x3d` - Downloads files using curl with proper error handling - Handles direct CAD file URLs (if URL itself is a CAD file) - Respects size limits and timeouts - Sanitizes filenames and avoids overwrites - Returns proper exit codes and JSON output **Templates**: - Icon (🧊), thumbnail, embed, and fullscreen views for the UI **Tests** (`tests/test_caddl.py`): - Validates script exists and is executable - Tests help command - Tests disabled mode (CADDL_ENABLED=False) - Tests behavior with no HTML/no CAD files - All tests pass ✅ #### Key Features ✅ **Isolated Design**: Doesn't depend on ArchiveBox core - follows plugin isolation pattern ✅ **Simple Approach**: Uses regex to find URLs with CAD extensions, then downloads them as-is ✅ **No 3D Processing**: Just downloads files, doesn't process or convert them (as requested) ✅ **Configurable**: All settings exposed via environment variables ✅ **Tested**: Includes unit tests that verify functionality ✅ **Production Ready**: Proper error handling, logging, and exit codes #### Usage Example ```bash # Enable in config export CADDL_ENABLED=True # Add a URL with 3D assets archivebox add https://example.com/models/ # Files with extensions like .stl, .gltf, .blend will be downloaded to: # <snapshot_dir>/caddl/model_name.stl ``` #### Next Steps This plugin is ready to use! Users can: 1. Enable it in their ArchiveBox config 2. Customize the file extensions list if needed 3. Add URLs containing 3D/CAD assets 4. Find downloaded files in each snapshot's `caddl/` directory --- | [Branch: claude/issue-668-20251229-2145](https://github.com/ArchiveBox/ArchiveBox/tree/claude/issue-668-20251229-2145)
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
starred/ArchiveBox#1932
No description provided.