mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #668] New Extractor Idea: Find/write a "cad-dl" to save 3d assets, gltf files, CAD files, shapefiles, STLs, VR views, etc. #3442
Originally created by @fire on GitHub (Mar 19, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/668
My idea is to have ArchiveBox save the asset both as a Blender file and as glTF 2.0.
However, there are layers of problems here.
Any suggestions are welcome.
I can provide technical support and man-months, but not sure where to start.
@pirate commented on GitHub (Mar 20, 2021):
What do you mean by 3d assets? Can you provide some examples of URLs and formats that you'd expect it to be able to save?
@fire commented on GitHub (Mar 20, 2021):
Thanks for your prompt reply.
The typical open formats are FBX (no good open-source reader), glTF 2.0 (open source), Blend (has an implementation in Blender but is complicated), and USDZ (a newer standard; do not use due to complexity). I think there are a few CAD formats, but they're convertible to either glTF 2.0 or Blender.
There are some other formats that aren't mentioned, like Alembic, but the point is to support a well-defined, narrow format that can be opened in the future for archival use.
@pirate commented on GitHub (Mar 22, 2021):
I think the best way to go about this is to find an existing program (or snippet of puppeteer/playwright JS) that can look for these assets on a page (given some html or a url) and add it as an extractor module to archivebox.
I don't know the 3D space at all, so I'm probably not the right person for this, but I'm happy to review PRs or design proposals for such an extractor.
@fire commented on GitHub (Oct 2, 2021):
If I can script Blender would that be acceptable?
https://github.com/donmccurdy/glTF-Transform is web native.
@fire commented on GitHub (Jan 29, 2023):
@pirate Can you link me some guides for writing extractors? Also, what is the format for design proposals?
I think finding the urls can be worked around by linking a direct url for now.
@pirate commented on GitHub (Jan 31, 2023):
@fire The process for adding a new extractor is documented here:
Note the main constraint for ArchiveBox right now is deployment complexity, so I'm putting a hold on adding new binary dependencies at the moment. I don't think that will necessarily impair your ability to download 3d files as long as they don't need any further 3d processing after download. If you have a pure Python package or npm library that can snapshot 3d assets from a URL with minimal packaging complexity and linux/macOS + x86/arm7/arm64 support, then I'm down to consider it.
@fire commented on GitHub (Jan 31, 2023):
The standard tool for glTF is https://gltf-transform.donmccurdy.com/, which plays the role that ffmpeg plays for video importing.
I'll have to find a url extractor.
@pirate commented on GitHub (Apr 27, 2023):
Maybe we can borrow code from this extension: https://github.com/stephancasas/thingiverse-stl-downloader
@pirate commented on GitHub (Jun 13, 2023):
If anyone knows of any youtube-dl/yt-dlp equivalent program to find + download 3d assets from a URL that would be super useful here. Please comment with any suggestions :)
At the moment I'm still not willing to write custom logic to do this extraction, as it would be too much for me to maintain as a solo developer working on ArchiveBox in my spare time, but if we can find an external program/library that can do it then the task is much easier.
@fire commented on GitHub (Jun 13, 2023):
Suggest an interface for me, and I might take a try at making one from scratch.
@pirate commented on GitHub (Jun 13, 2023):
Same CLI as youtube-dl/yt-dlp would be great. E.g.
shapefile-dl [--max-size=750m] https://example.com/some/page/containing/cad/files
It should output one or more files to the current directory the command is run in, and return 1 exit status + error text if it fails, or 2 exit status if no shape files are found.
Pure Python would be ideal, but js is also ok.
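The interface described in this comment could be sketched roughly as below. This is only an illustration under stated assumptions: find_and_download is a hypothetical placeholder (no such tool exists yet in this thread), and treating the "m" suffix in 750m as 1024-based mebibytes is a guess, not something the comment specifies.

```python
import argparse
import re
import sys

# Exit statuses as requested in the comment above.
EXIT_OK, EXIT_ERROR, EXIT_NOTHING_FOUND = 0, 1, 2

# Assumption: suffixes are 1024-based; the thread does not say.
_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(text):
    """Convert a size such as '750m' into a byte count."""
    match = re.fullmatch(r"(\d+)([kmg]?)b?", text.strip().lower())
    if not match:
        raise argparse.ArgumentTypeError(f"invalid size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def find_and_download(url, max_size):
    """Hypothetical placeholder: would save files to the current
    directory and return their paths. Here it finds nothing."""
    return []


def main(argv=None, downloader=find_and_download):
    parser = argparse.ArgumentParser(prog="shapefile-dl")
    parser.add_argument("--max-size", type=parse_size, default=parse_size("750m"))
    parser.add_argument("url")
    args = parser.parse_args(argv)
    try:
        saved = downloader(args.url, args.max_size)
    except Exception as err:
        # Failure: error text on stderr plus exit status 1.
        print(f"shapefile-dl: error: {err}", file=sys.stderr)
        return EXIT_ERROR
    if not saved:
        # Nothing matched: exit status 2.
        print("shapefile-dl: no shape files found", file=sys.stderr)
        return EXIT_NOTHING_FOUND
    for path in saved:
        print(path)
    return EXIT_OK
```

To run it as a script, the file would end with sys.exit(main()). One caveat: argparse itself exits with status 2 on a usage error, which collides with the "nothing found" status, so a real tool would need to handle that case separately.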
@fire commented on GitHub (Jun 13, 2023):
Oh, so it needs to be Python or JavaScript, but not something like C++ or Elixir binaries, hmm. My plan was to either write one from scratch or use Godot Engine's code, which I know in detail.
Godot Engine has a wasm platform.
@pirate commented on GitHub (Jun 13, 2023):
A binary is technically ok, it's just more difficult for us to maintain and for users to install. If it's not Python or JS, then it needs to be packaged via both apt and brew, and we have to update and test more places like the Dockerfiles, CI configs, documentation, setup helper scripts, etc.
@fire commented on GitHub (Oct 25, 2023):
I think I can use Godot Engine to handle some of these formats in the near future.
@pirate commented on GitHub (Oct 26, 2023):
That sounds like a lot of post processing. I'd like to keep archivebox focused on just initial preservation, not further processing of artifacts beyond that step.
Post-processing steps can be done elsewhere in a pipeline by other software working on the output that ArchiveBox produces. If it requires a full 3d engine then it's probably beyond our scope.
For now we are still looking for a suitable program that can rip 3D asset files out of an HTML page and into raw files on disk.
@pirate commented on GitHub (Feb 21, 2024):
@benmuth would also be interested in a solution to this if you want to do some research / see what works for this problem. some sites to try extracting STLs, CAD, gltf, blend, etc. files out of:
@fire commented on GitHub (Feb 21, 2024):
The good thing is CAD files that don't involve animation are relatively easy, but STEP is hard.
@fire commented on GitHub (Feb 29, 2024):
I am trying a fork of https://github.com/V-Sekai/USD-Fileformat-plugins for conversion of 3d model formats, but it's not trivial at all. Think like 1.6 gigabytes.
@pirate commented on GitHub (Mar 1, 2024):
One simple solution we could do is run all the URLs found in a page through something like magika and download anything that has cad, 3d, shapefile, etc. in the detected type output.
https://opensource.googleblog.com/2024/02/magika-ai-powered-fast-and-efficient-file-type-identification.html
@fire commented on GitHub (Mar 1, 2024):
It is wise to note that the process of determining dependencies might be a lot easier to solve than parsing the entire file.
Like given a fbx file it's easier to parse to find its dependent textures than to convert fbx to glb.
This is related to the scan-only idea mentioned in the last post.
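To illustrate the point about dependency scanning being cheaper than conversion, here is a minimal sketch that uses glTF rather than FBX, since a .gltf file is plain JSON whose "buffers" and "images" entries may carry a "uri" pointing at external files. The function name is made up for this example, and nothing in this thread says ArchiveBox would adopt such a step.

```python
import json


def external_gltf_uris(gltf_text):
    """Return the external file URIs referenced by a .gltf JSON document.

    Embedded data: URIs and entries that point at a bufferView instead
    of a uri are skipped, since they need no extra download.
    """
    doc = json.loads(gltf_text)
    uris = []
    for section in ("buffers", "images"):
        for entry in doc.get(section, []):
            uri = entry.get("uri")
            if uri and not uri.startswith("data:"):
                uris.append(uri)
    return uris
```

The returned URIs are relative to the .gltf file's own URL, so a downloader would still have to resolve them before fetching.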
@pirate commented on GitHub (Mar 2, 2024):
To clarify again, I don't want ArchiveBox to actually process any 3D files / read their contents, so we don't need any 3d modeling engine integration. I just want it to download whatever is available as-is. People can always have other programs read the output from archivebox.
@benmuth commented on GitHub (Mar 8, 2024):
I tried to find existing tools to extract these files, but haven't had success yet.
I like this idea, but it looks like magika doesn't support these formats yet. They're accepting suggestions, so maybe we can open an issue for each of these (it looks like .blend files have already been suggested).
I gave it a shot anyway with some of the file types linked in this issue, and here are the results I got (also included file results for reference):
stl
magika: ISO 9660 CD-ROM filesystem data (archive) 99%
file: data
gltf
magika: JSON document (code) 97%
file: JSON data
blend
magika: gzip compressed data (archive) 100%
file: gzip compressed data
STEP
magika: Generic text document (text) [Low-confidence model best-guess: CSV document (code), score=41]
file: ASCII text, with very long lines (1650), with CRLF line terminators
These are the first formats I found examples of, but I'd like to try more files.
Not sure how stable these results will be across all valid files of each format. The only one I'm confident would be stable is gltf because it's literally JSON.
If we're confident that magika (or even libmagic I guess) would give a stable, meaningful (i.e. not "data" or something) result for a given format, I guess we could just look for links with the correct extension, check whether the linked resource gives the expected output for that filetype from magika/libmagic, then download it if so. Seems janky but it might work.
Does that approach make sense? Or should we just wait for official support for each file type from magika? I can try writing a test script to see how well it works if we think it's something worth pursuing.
@pirate commented on GitHub (Mar 14, 2024):
On further inspection magika is actually pretty disappointing, there are many formats it sucks at detecting.
I think simple extension/content-type based detection is enough for now. Running DOM/Singlefile output through a simple regex to find all URLs that end in relevant extensions (.blend, .stl, .obj, .stp, etc.) and just wget-ing those would already be super useful.
@benmuth commented on GitHub (Mar 18, 2024):
I've checked quite a few websites for test cases and can't find any that directly link to 3d assets. I could be looking in the wrong places though. I'd appreciate a link if someone finds one.
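The extension-based detection proposed in the Mar 14, 2024 comment could look something like the sketch below. It is an assumption-laden illustration: the function and constant names are made up, the extension tuple adds .step, .gltf, and .glb to the ones listed in that comment, and HTML entities inside attribute values are not decoded.

```python
import re
from urllib.parse import urljoin, urlsplit

# Assumed extension list; the comment names .blend, .stl, .obj, .stp "etc."
CAD_EXTENSIONS = (".blend", ".stl", ".obj", ".stp", ".step", ".gltf", ".glb")

# Candidate URLs: href/src attribute values, plus bare absolute http(s) URLs.
_ATTR_RE = re.compile(r"""(?:href|src)\s*=\s*["']([^"']+)["']""", re.IGNORECASE)
_BARE_RE = re.compile(r"""https?://[^\s"'<>]+""", re.IGNORECASE)


def find_cad_urls(html, base_url, extensions=CAD_EXTENSIONS):
    """Return absolute URLs in html whose path ends in a CAD/3D extension."""
    candidates = _ATTR_RE.findall(html) + _BARE_RE.findall(html)
    found = []
    for raw in candidates:
        absolute = urljoin(base_url, raw)
        # Compare against the path only, so query strings do not hide the extension.
        path = urlsplit(absolute).path.lower()
        if path.endswith(tuple(extensions)) and absolute not in found:
            found.append(absolute)
    return found
```

Each returned URL could then be handed to wget or a similar downloader as-is, without reading the file contents.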
@pirate commented on GitHub (Mar 18, 2024):
These are the key ones I want to support: https://github.com/ArchiveBox/ArchiveBox/issues/668#issuecomment-1958236944
If they don't link directly / download URLs can't be found with regex, then we may need to write a puppeteer script like this to get the files by clicking around the page a bit.
@pirate commented on GitHub (Dec 29, 2025):
@claude can you take a crack at implementing a new plugin for this at archivebox/plugins/caddl based on the dev branch? Make sure to read Claude.md and make sure your plugin is isolated (doesn't depend on archivebox itself) and follows the patterns we lay out in the other plugins.
@claude[bot] commented on GitHub (Dec 29, 2025):
Claude finished @pirate's task in 5m 18s on branch claude/issue-668-20251229-2145
Implementing CAD/3D Asset Extractor Plugin
Successfully implemented a new caddl plugin for extracting 3D/CAD assets from web pages!
Tasks
archivebox/plugins/caddl
Implementation Details
Created a complete, isolated plugin at archivebox/plugins/caddl/ with:
Configuration (config.json):
CADDL_ENABLED: Enable/disable the extractor (default: true)
CADDL_TIMEOUT: Download timeout in seconds (default: 300)
CADDL_MAX_SIZE: Maximum file size limit (default: 750m)
CADDL_CHECK_SSL_VALIDITY: SSL certificate verification
CADDL_USER_AGENT: Custom user agent string
CADDL_COOKIES_FILE: Path to cookies file
CADDL_EXTENSIONS: Customizable list of file extensions to download
Binary Dependencies (binaries.jsonl):
curl for downloading (already available on most systems)
Extractor Script (on_Snapshot__65_caddl.bg.py):
.blend, .stl, .obj, .step, .stp, .gltf, .glb, .fbx, .vrm, .usdz, .dae, .3ds, .ply, .off, .x3d
Templates:
Tests (tests/test_caddl.py):
Key Features
✅ Isolated Design: Doesn't depend on ArchiveBox core - follows plugin isolation pattern
✅ Simple Approach: Uses regex to find URLs with CAD extensions, then downloads them as-is
✅ No 3D Processing: Just downloads files, doesn't process or convert them (as requested)
✅ Configurable: All settings exposed via environment variables
✅ Tested: Includes unit tests that verify functionality
✅ Production Ready: Proper error handling, logging, and exit codes
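As a rough illustration of settings exposed via environment variables, the sketch below reads a few of the CADDL_* names listed above with their stated defaults. It is not the plugin's actual code: the CaddlConfig class, the load_config function, the accepted truthy strings, and the comma-separated format for CADDL_EXTENSIONS are all assumptions made for this example.

```python
import os
from dataclasses import dataclass

# Default extensions as listed for the extractor script above.
DEFAULT_EXTENSIONS = (
    ".blend", ".stl", ".obj", ".step", ".stp", ".gltf", ".glb", ".fbx",
    ".vrm", ".usdz", ".dae", ".3ds", ".ply", ".off", ".x3d",
)


@dataclass
class CaddlConfig:
    enabled: bool
    timeout: int
    max_size: str
    extensions: tuple


def load_config(env=None):
    """Build a config from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: these strings count as "enabled"; anything else disables.
    enabled = env.get("CADDL_ENABLED", "true").strip().lower() in ("1", "true", "yes", "on")
    timeout = int(env.get("CADDL_TIMEOUT", "300"))
    max_size = env.get("CADDL_MAX_SIZE", "750m")
    # Assumption: extensions are given as a comma-separated string.
    raw = env.get("CADDL_EXTENSIONS", "")
    extensions = tuple(e.strip().lower() for e in raw.split(",") if e.strip())
    return CaddlConfig(enabled, timeout, max_size, extensions or DEFAULT_EXTENSIONS)
```

Passing a plain dict instead of os.environ keeps the helper easy to test without touching the real environment.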
Usage Example
Next Steps
This plugin is ready to use! Users can:
caddl/ directory
Branch: claude/issue-668-20251229-2145