mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #963] Ability to set title based on variables and regex and cascade to filenames for filesystem or REST API access or export #3619
Originally created by @ga-it on GitHub (Apr 10, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/963
Type
What is the problem that your feature request solves
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
@rickcecil commented on GitHub (May 4, 2022):
I want to second this feature request. My use case is very similar. I want to be able to copy the saved PDFs to my filesystem and make them available to Filerun, Open Semantic Search, and an internally hosted Hypothesis server. Being able to add meaningful filenames to the archived files is critical to this working.
I am also willing to contribute beers/moneyforbeer to make this happen.
@pirate commented on GitHub (May 10, 2022):
I'm not really willing to consider renaming the filename locations on disk, many things would break (but adding symlinks might be ok). I was more considering going the opposite direction and using unique UUIDs or hashes for the filenames and then referencing them in the DB. Either way the answer is always going to be to query the ArchiveBox status output or REST API to get the file paths and then fetch them from there.
@rickcecil commented on GitHub (May 10, 2022):
Here's the use case that I am going for. Sharing not to try to change your mind, but maybe you will see a different way to achieve this with Archivebox than what I am currently envisioning...
There are two things I would like to do.
First, I would like to add all of the content that I have archived to a local instance of Open Semantic Search. Now I could point it to the Archivebox file structure itself (I am only trying to save a single file type -- either PDF or HTML, so I don't need to worry about duplicate data showing up in the search results), but the files won't have meaningful names. I think this is going to be a real issue given the amount of content I have in Archivebox.
Second, I am looking for a way to annotate my archived content with a local instance of Hypothesis. The sticking point is a combination of mobile Safari and Cross-origin script access. I've explored two possible solutions with Archivebox.
#2a - If I use PDFs, Archivebox opens the PDFs in the browser default PDF reader. Mobile Safari will not activate bookmarklets when viewing a PDF in the default viewer. It will, though, allow the bookmarklet to run if I open the PDF via PDF.js. So my thought was to copy the files from Archivebox into another directory that would be managed by Filerun, where I have created a plugin that opens PDFs in PDF.js on all platforms. The issue here is, of course, the filenames. Allowing us to rename the files solves Problem #1 and #2a.
#2b - Alternatively, I have tried using the SingleFile format instead of the PDF format. The issue here is that SingleFile, in order to prevent malicious JS from loading, by default inserts a meta tag that limits cross-origin scripts from running, which prevents my bookmarklet for Hypothesis from running. There is an option to turn this off (see: https://github.com/gildas-lormeau/SingleFile/discussions/926 -- added at my request), but I am not sure how this would be exposed in Archivebox. I thought being able to reference a SingleFile binary would make this possible, but it seems that the command has to be run when the file is archived and I am not sure how to pass that kind of command to SingleFile from Archivebox.
I am exploring other options besides Archivebox as it might not be the exact tool for my use case, but so far, it is the closest. AFAICT, it is the only tool that I can programmatically feed a list of links and have it go out and save exact replicas of web pages.
I do understand why you wouldn't implement the functionality to rename posts, but if you had any guidance on how to use Archivebox to do what I want it to do, I would be most appreciative. I feel like I am so close to having my research system in place, but these last few hurdles are proving very stubborn. Thanks!
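For anyone landing here with the same need, the symlink idea mentioned earlier in the thread can be sketched outside ArchiveBox itself. This is only a minimal sketch, not anything ArchiveBox ships. It assumes each snapshot folder under archive/ contains an index.json with "title" and "url" keys and a PDF saved as output.pdf; check those names against your own data directory. The function names, the export directory, and the link naming scheme are all made up for illustration.

```python
import json
import re
from pathlib import Path


def slugify(title: str, max_len: int = 80) -> str:
    """Reduce a page title to a filesystem-safe name."""
    slug = re.sub(r"[^A-Za-z0-9]+", "-", title).strip("-").lower()
    return slug[:max_len].rstrip("-") or "untitled"


def link_snapshots(archive_dir: Path, export_dir: Path,
                   output_name: str = "output.pdf") -> list[Path]:
    """Create one titled symlink per snapshot that has the requested output file.

    Assumed layout: <archive_dir>/<snapshot>/index.json next to <output_name>.
    The archived files themselves are never renamed or moved.
    """
    export_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for index_file in sorted(archive_dir.glob("*/index.json")):
        snapshot_dir = index_file.parent
        target = snapshot_dir / output_name
        if not target.exists():
            continue  # this snapshot has no such output
        meta = json.loads(index_file.read_text(encoding="utf-8"))
        title = meta.get("title") or meta.get("url") or snapshot_dir.name
        # Append the snapshot folder name so identical titles never collide.
        link = export_dir / f"{slugify(title)}--{snapshot_dir.name}{target.suffix}"
        if link.is_symlink():
            link.unlink()  # refresh a link left by an earlier run
        link.symlink_to(target.resolve())
        links.append(link)
    return links
```

Pointing Filerun or Open Semantic Search at the export directory would then show readable names while the original paths stay untouched. Passing output_name="singlefile.html" should do the same for SingleFile output, assuming that is the filename your version writes.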
@pirate commented on GitHub (May 11, 2022):
You can pass arguments to SingleFile for ArchiveBox to use during archiving with the config option SINGLEFILE_ARGS. You could also try to add the URL and page title as PDF metadata fields so at least they're visible somehow; even though it may not be the file name, it's better than nothing.
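A rough sketch of what setting that option could look like, with several assumptions that need checking against your own install: that list-valued ArchiveBox options are written as a JSON array, that your ArchiveBox build actually honors SINGLEFILE_ARGS, and that the SingleFile flag from the linked discussion is spelled --insert-meta-csp (confirm with single-file --help). The helper name is made up; it only builds the command string and does not run anything.

```python
import json
import shlex


def singlefile_args_command(args: list[str]) -> str:
    """Build a shell-ready `archivebox config --set` command for SINGLEFILE_ARGS.

    Assumption: ArchiveBox parses list-type config values as a JSON array.
    """
    value = json.dumps(args)
    # Quote the whole KEY=VALUE pair so the shell leaves the brackets and quotes alone.
    return "archivebox config --set " + shlex.quote(f"SINGLEFILE_ARGS={value}")


# Assumed flag spelling; verify it with `single-file --help` before relying on it.
print(singlefile_args_command(["--insert-meta-csp=false"]))
```

Generating the command this way mostly avoids hand-quoting mistakes when the value contains brackets and double quotes.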
@rickcecil commented on GitHub (Aug 28, 2022):
I've been working on updating my setup, trying to get SINGLEFILE_ARGS to work. Here's what I've learned: The feature has not yet been committed to the dev branch AFAICT. Given the status, there may need to be more testing on it. Just wanted to let folks know in case they come across this page.
This post, which was posted a couple of weeks after mine, indicates that SINGLEFILE_ARGS still needs some work:
https://github.com/ArchiveBox/ArchiveBox/issues/981
However, at the end of the post, a reference was made to this post, which indicates that the work was done and committed:
github.com/renaisun/ArchiveBox@8899fe0b92
While I am on dev, I could not get my version of archivebox to use the branch that has this code. It was just four lines, so I added them myself. Fingers crossed that I didn't screw anything else up. So far it is working.
Now, what is incredibly cool -- and I am going to test next: SingleFile can parse userscripts before and after it has run, which, in theory, will allow you to modify pages before they are saved.