[GH-ISSUE #963] Ability to set title based on variables and regex and cascade to filenames for filesystem or REST API access or export #2109

Open
opened 2026-03-01 17:56:33 +03:00 by kerem · 5 comments

Originally created by @ga-it on GitHub (Apr 10, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/963

Type

  • Propose a brand new feature

What is the problem that your feature request solves

  1. The current title field sometimes fails to pull the title correctly.
  2. The title could be sourced from elsewhere, such as the URL (e.g. https://foo.bar/title-1-2).
  3. Filenames within each archive snapshot number are currently generic (e.g. output.pdf), which makes filesystem operations difficult (e.g. access from another application such as a CMS).

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

  1. At a user level, and confirmed at the add snapshot stage, allow the title source to be specified as a variable subject to regex operations.
  2. This could make available items such as the <title> markup as %titlemeta%, and the URL filename as %URLfilename%.
  3. Regex operations could allow manipulation to add combinations of variables and operations on variables (such as substitution of spaces with hyphens, addition of dates, etc.).
  4. The filename could then repeat the options, either taking the derived title and allowing further operations on it, or allowing a fresh set of operations on the variables.
  5. Defaults at the add snapshot stage should come from templates specified at the user account level.
  6. Expose results and subsets of results via filesystem, REST API, and/or export. Subsets should allow filtering by type (e.g. PDF, tag combinations).

Using the above, I want to be able to download from various external sources and expose the results (at the filesystem level, via API, or export) into our CMS to allow them to be integrated into search results, etc.

What hacks or alternative solutions have you tried to solve the problem?

Without this, I would have to rely on manual download from the web interface, renaming, and upload into the CMS, or perform a filesystem scan and rename and then upload into the CMS.

How badly do you want this new feature?

This would allow ArchiveBox to perform a valuable function as a means of adding content into our CMS.

  • I'm willing to contribute (beers? ;) ) money (https://github.com/sponsors/pirate) to fix this issue
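To make the proposal concrete, here is a minimal sketch of how %titlemeta% and %URLfilename% templates with regex substitutions might behave. Nothing here exists in ArchiveBox: the function names `template_variables` and `render`, and the way substitutions are passed as (pattern, replacement) pairs, are assumptions made purely for illustration.

```python
import re
from urllib.parse import unquote, urlparse


def template_variables(url, html_title):
    """Build the variables named in the proposal (hypothetical helper)."""
    path = urlparse(url).path.rstrip("/")
    # Last path segment of the URL, e.g. "title-1-2" for https://foo.bar/title-1-2
    last_segment = unquote(path.rsplit("/", 1)[-1]) if path else ""
    return {"titlemeta": html_title.strip(), "URLfilename": last_segment}


def render(template, variables, substitutions=()):
    """Fill %name% placeholders, then apply regex substitutions in order."""
    # Unknown placeholders are left untouched rather than raising an error.
    text = re.sub(
        r"%(\w+)%",
        lambda match: variables.get(match.group(1), match.group(0)),
        template,
    )
    for pattern, replacement in substitutions:
        text = re.sub(pattern, replacement, text)
    return text


variables = template_variables("https://foo.bar/title-1-2", "  Some Page Title ")
title = render("%titlemeta%", variables)
# Derive a filename from the URL, replacing any whitespace with hyphens.
filename = render("%URLfilename%.pdf", variables, [(r"\s+", "-")])
```

The same `render` call could be used a second time for the filename, either on the derived title or on a fresh template, which is roughly what point 4 describes.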

@rickcecil commented on GitHub (May 4, 2022):

I want to second this feature request. My use case is very similar. I want to be able to copy the saved PDFs to my filesystem and make them available to Filerun, Open Semantic Search, and an internally hosted Hypothesis server. Being able to add meaningful filenames to the archived files is critical to this working.

I am also willing to contribute beers/moneyforbeer to make this happen.


@pirate commented on GitHub (May 10, 2022):

I'm not really willing to consider renaming the file locations on disk, as many things would break (but adding symlinks might be OK). I was more considering going in the opposite direction: using unique UUIDs or hashes for the filenames and then referencing them in the DB. Either way, the answer is always going to be to query the ArchiveBox `status` output or REST API to get the file paths and then fetch them from there.
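The symlink idea could be sketched outside ArchiveBox roughly as follows. This is not an ArchiveBox feature. It assumes each snapshot folder under the archive directory contains an `index.json` with `title` and `url` keys next to `output.pdf`, and the names `slugify`, `link_snapshots`, and the `links_dir` location are hypothetical.

```python
import json
import re
from pathlib import Path


def slugify(text, max_length=80):
    """Reduce a title to lowercase letters, digits, and hyphens."""
    slug = re.sub(r"[^A-Za-z0-9]+", "-", text).strip("-").lower()
    return slug[:max_length].rstrip("-") or "untitled"


def link_snapshots(archive_dir, links_dir, output_name="output.pdf"):
    """Create one readable symlink per snapshot; the originals stay untouched."""
    archive_dir, links_dir = Path(archive_dir), Path(links_dir)
    links_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for snapshot in sorted(p for p in archive_dir.iterdir() if p.is_dir()):
        target = snapshot / output_name
        index = snapshot / "index.json"  # assumed per-snapshot metadata file
        if not target.is_file() or not index.is_file():
            continue
        meta = json.loads(index.read_text(encoding="utf-8"))
        title = meta.get("title") or meta.get("url") or snapshot.name
        # Append the snapshot folder name so that equal titles cannot collide.
        link = links_dir / f"{slugify(title)}--{snapshot.name}{target.suffix}"
        if link.is_symlink() or link.exists():
            link.unlink()  # refresh links left over from an earlier run
        link.symlink_to(target.resolve())
        created.append(link)
    return created
```

Calling `link_snapshots("archive", "named-pdfs")` from the data directory would leave the snapshot folders as they are and fill `named-pdfs` with links that another application can index.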


@rickcecil commented on GitHub (May 10, 2022):

Here's the use case that I am going for. Sharing not to try to change your mind, but maybe you will see a different way to achieve this with Archivebox than what I am currently envisioning...

There are two things I would like to do.

First, I would like to add all of the content that I have archived to a local instance of Open Semantic Search. Now I could point it to the Archivebox file structure itself (I am only trying to save a single file type -- either PDF or HTML, so I don't need to worry about duplicate data showing up in the search results), but the files won't have meaningful names. I think this is going to be a real issue given the amount of content I have in Archivebox.

Second, I am looking for a way to annotate my archived content with a local instance of Hypothesis. The sticking point is a combination of mobile Safari and Cross-origin script access. I've explored two possible solutions with Archivebox.

#2a - If I use PDFs, Archivebox opens the PDFs in the browser default PDF reader. Mobile Safari will not activate bookmarklets when viewing a PDF in the default viewer. It will, though, allow the bookmarklet to run if I open the PDF via PDF.js. So my thought was to copy the files from Archivebox into another directory that would be managed by Filerun, where I have created a plugin that opens PDFs in PDF.js on all platforms. The issue here is, of course, the filenames. Allowing us to rename the files solves Problem #1 and #2a.

#2b - Alternatively, I have tried using the SingleFile format instead of the PDF format. The issue here is that SingleFile, in order to prevent malicious JS from loading, by default inserts a meta tag that limits cross-origin scripts from running, which prevents my bookmarklet for Hypothesis from running. There is an option to turn this off (see: https://github.com/gildas-lormeau/SingleFile/discussions/926 -- added at my request), but I am not sure how this would be exposed in Archivebox. I thought being able to reference a SingleFile binary would make this possible, but it seems that the command has to be run when the file is archived and I am not sure how to pass that kind of command to SingleFile from Archivebox.

I am exploring other options besides Archivebox, as it might not be the exact tool for my use case, but so far it is the closest. AFAICT, it is the only tool that I can programmatically feed a list of links and have it go out and save exact replicas of web pages.

I do understand why you wouldn't implement the functionality to rename posts, but if you had any guidance on how to use Archivebox to do what I want it to do, I would be most appreciative. I feel like I am so close to having my research system in place, but these last few hurdles are proving very stubborn. Thanks!


@pirate commented on GitHub (May 11, 2022):

You can pass arguments to SingleFile during archiving with the ArchiveBox config option `SINGLEFILE_ARGS`.

You could also try adding the URL and page title as PDF metadata fields so that they are at least visible somehow. Even though it isn't the file name, it's better than nothing.
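One possible way to try the PDF metadata suggestion is sketched below. It assumes the third-party pypdf package (not part of ArchiveBox) and writes a stamped copy rather than modifying the archived file. The helper names `build_info` and `stamp_pdf`, and the choice of the `/Subject` field for the URL, are illustrative assumptions.

```python
def build_info(title, url):
    """Map snapshot metadata onto standard PDF document-info keys."""
    info = {"/Subject": url}  # assumption: keep the source URL in /Subject
    if title:
        info["/Title"] = title
    return info


def stamp_pdf(src, dst, title, url):
    """Write a copy of src to dst with the title and URL set as metadata."""
    # Imported lazily so the rest of the sketch works without pypdf installed.
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    reader = PdfReader(src)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    writer.add_metadata(build_info(title, url))
    with open(dst, "wb") as handle:
        writer.write(handle)
```

Copying page by page keeps the page content but may drop extras such as bookmarks, so it is worth checking the stamped copy before relying on it.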


@rickcecil commented on GitHub (Aug 28, 2022):

I've been working on updating my setup, trying to get SINGLEFILE_ARGS to work. Here's what I've learned: the feature has not yet been committed to the dev branch, AFAICT. Given the status, it may need more testing. Just wanted to let folks know in case they come across this page.

This issue, which was posted a couple of weeks after my comment, indicates that SINGLEFILE_ARGS still needs some work:

https://github.com/ArchiveBox/ArchiveBox/issues/981

However, at the end of that issue, there is a reference to this commit, which indicates that the work was done and committed:

https://github.com/renaisun/ArchiveBox/commit/8899fe0b9259748da2ef19d37028c317f39f37d3

Although I am on dev, I could not get my version of archivebox to use the branch that has this code. It was just four lines, so I added them myself. Fingers crossed that I didn't screw anything else up. So far it is working.

Now, what is incredibly cool -- and I am going to test next: SingleFile can parse userscripts before and after it has run, which, in theory, will allow you to modify pages before they are saved.
