[GH-ISSUE #143] Adding a new URL via plaintext causes it to be the only URL in the archive #1607

Closed
opened 2026-03-01 17:52:10 +03:00 by kerem · 2 comments
Owner

Originally created by @sbrl on GitHub (Feb 14, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/143

Describe the bug
Found another one!

If I add a new URL to ArchiveBox, then it replaces any content I had there previously.

Steps to reproduce
Steps to reproduce the behavior:

  1. Do this: echo "https://www.robmiles.com/journal/2019/02/06/options" | ONLY_NEW=true ./archive
  2. Check the archive to see the page present
  3. Now do this: echo "https://unix.stackexchange.com/questions/87138/ssh-tunnelling-explanation/119719" | ONLY_NEW=true ./archive
  4. Check the archive again
  5. See that it's now replaced the previous entry
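
For reference, the behavior these steps expect is that a newly added link is merged into the existing index rather than replacing it. Below is a minimal sketch of that expectation. It assumes the index is a list of dicts with a `url` field, and `merge_links` is a hypothetical helper written for illustration, not ArchiveBox's actual code:

```python
def merge_links(existing, new):
    """Return the existing links plus any new links whose URL is not already present."""
    seen = {link["url"] for link in existing}
    merged = list(existing)  # copy, so the caller's list is left untouched
    for link in new:
        if link["url"] not in seen:
            seen.add(link["url"])
            merged.append(link)
    return merged


# Hypothetical index contents mirroring the two steps above.
existing = [{"url": "https://www.robmiles.com/journal/2019/02/06/options"}]
new = [{"url": "https://unix.stackexchange.com/questions/87138/ssh-tunnelling-explanation/119719"}]

index = merge_links(existing, new)
assert len(index) == 2  # both entries should survive; the bug leaves only the newest
```

With the bug described here, the equivalent of `index` ends up containing only the entry from step 3.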

Screenshots or log output
If applicable, use screenshots or copy/pasted terminal output to help explain your problem.

$ ./archive-url 'https://unix.stackexchange.com/questions/87138/ssh-tunnelling-explanation/119719'
[*] [2019-02-14 21:00:59] Parsing new links from output/sources/stdin-1550178059.txt and fetching titles...
    .
    > Adding 1 new links to index from output/sources/stdin-1550178059.txt (parsed as Plain Text format)
[*] [2019-02-14 21:00:59] Updating main index files...
    > output/index.json
    > output/index.html
[▶] [2019-02-14 21:00:59] Updating content for 1 pages in archive...
[+] [2019-02-14 21:01:01] "x11 - ssh tunnelling explanation - Unix & Linux Stack Exchange"
    https://unix.stackexchange.com/questions/87138/ssh-tunnelling-explanation/119719
    > output/archive/1550178059 (new)
      > favicon
      > wget                                                                    
      > archive_org                                                             
      > git                                                                     
      > media
      √ index.json                                                              
      √ index.html
[√] [2019-02-14 21:01:16] Update of 1 pages complete (16.36 sec)
    - 1 entries skipped
    - 4 entries updated
    - 0 errors
    To view your archive, open: output/index.html
$ cat ./archive-url
#!/usr/bin/env bash
export ONLY_NEW=true;
echo $1 | ./archive-custom
$ ./archive-url https://www.robmiles.com/journal/2019/02/06/options
[*] [2019-02-14 21:03:01] Parsing new links from output/sources/stdin-1550178181.txt and fetching titles...
    .
    > Adding 1 new links to index from output/sources/stdin-1550178181.txt (parsed as Plain Text format)
[*] [2019-02-14 21:03:02] Updating main index files...
    > output/index.json
    > output/index.html
[▶] [2019-02-14 21:03:02] Updating content for 1 pages in archive...
[+] [2019-02-14 21:03:03] "https://www.robmiles.com/journal/2019/02/06/options"
    https://www.robmiles.com/journal/2019/02/06/options
    > output/archive/1550178181 (new)
      > favicon
      > wget                                                                    
        Got wget response code 8:                                               
          https://www.robmiles.com/journal/2019/02/06/options:
          2019-02-14 21:03:04 ERROR 404: Not Found.
        Some resources were skipped: 404 Not Found                              
        Run to see full output:
            cd /path/to/ArchiveBox/output/archive/1550178181;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --restrict-file-names=unix --timeout=60 --warc-file=warc/1550178183 --page-requisites --user-agent="ArchiveBox/74b99fe9e (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://www.robmiles.com/journal/2019/02/06/options
      > archive_org
      > git                                                                     
      > media
      √ index.json                                                              
      √ index.html
[√] [2019-02-14 21:03:14] Update of 1 pages complete (12.80 sec)
    - 1 entries skipped
    - 3 entries updated
    - 1 errors
    To view your archive, open: output/index.html

Software versions (please complete the following information):

  • ArchiveBox version: 74b99fe9eb68cd57e64648690a2e158952b6b18e
  • Python version: Python 3.5.3
  • OS: Raspbian GNU/Linux 9.6 (stretch)
  • Chrome version: Not installed
kerem closed this issue 2026-03-01 17:52:11 +03:00

@pirate commented on GitHub (Feb 19, 2019):

Try the latest master (fixed in 3571ef2), comment back here if it doesn't work and I'll reopen the ticket.


@sbrl commented on GitHub (Feb 19, 2019):

Yep, looks like it's fixed! Thanks again, @pirate 😺
