mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #336] archivebox 0.4.2 init fails parsing old json (ValueError: year 1586476777 is out of range/dateutil.parser._parser.ParserError: year 1586476777 is out of range) #242
Originally created by @terxw on GitHub (Apr 10, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/336
Describe the bug
archivebox init produces error
ValueError: year 1586476777 is out of range
dateutil.parser._parser.ParserError: year 1586476777 is out of range
Steps to reproduce
create virtual environment
mkcd /home/kangus/src/archivebox0.4/
pew new -p /usr/bin/python3.8 -a $(pwd) archivebox0.4
clone
git clone https://github.com/pirate/ArchiveBox
cd ArchiveBox
git branch -a
checkout relevant
git checkout remotes/origin/v0.4.3
install dependencies
pip install -e .
config ENV
eval export $(grep -v '^#' /home/kangus/.ArchiveBox.conf)
migration
`
/home/kangus/src/archivebox0.4/ArchiveBox/bin/archivebox init
`
Screenshots or log output
Software versions
374dd39)
@terxw commented on GitHub (Apr 10, 2020):
To reproduce the error:
throws error:
@pirate commented on GitHub (Apr 10, 2020):
I know exactly where it is, it's caused by removing the old dateparsing code we used to have in favor of dateutil. Turns out my old custom logic actually is needed to handle this case, so I'll have to put it back.
@terxw commented on GitHub (Apr 10, 2020):
util.py line 144
if I change it to
it handles Unix epoch timestamps correctly, but all other formats, such as ISO time strings, incorrectly
@terxw commented on GitHub (Apr 10, 2020):
if I change line 144 in util.py from
to:
it works!
@pirate commented on GitHub (Apr 10, 2020):
Yeah the downside unfortunately is that many times it's not normal Unix timestamps, even when it matches some of those conditions. A lot of the programs generating these timestamps have custom offsets (cough cough macOS).
There's no perfect solution here; for now we can just choose something as stable as possible.
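A minimal sketch of the kind of heuristic discussed above, not the code that was in util.py: numeric input is treated as a Unix epoch value, and anything else is treated as an ISO 8601 string. The function name `parse_date`, the magnitude thresholds, and the use of the standard library's `datetime.fromisoformat` as a stand-in for dateutil are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def parse_date(value):
    """Best-effort parse of a Unix epoch number or ISO 8601 string (hypothetical helper)."""
    text = str(value).strip()
    try:
        number = float(text)
    except ValueError:
        # Not numeric: treat it as ISO 8601 (stand-in for dateutil in this sketch).
        parsed = datetime.fromisoformat(text)
        if parsed.tzinfo is None:
            # Assumption: naive strings are interpreted as UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)

    # Numeric: guess the unit from the magnitude. This is only a heuristic and
    # will misread values from programs that use a custom epoch or offset.
    if number > 1e14:       # probably microseconds
        number /= 1_000_000
    elif number > 1e11:     # probably milliseconds
        number /= 1_000
    return datetime.fromtimestamp(number, tz=timezone.utc)
```

With this sketch, the value from the error message, `parse_date("1586476777")`, comes out as 2020-04-09 23:59:37 UTC instead of being read as a year.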
@terxw commented on GitHub (Apr 11, 2020):
Ok, my last attempt, all my links are now parsed
@mdhowle commented on GitHub (Apr 20, 2020):
It's surprising dateutil.parse doesn't handle timestamps on its own. I did a quick search but didn't see any issues about parsing timestamps.
I think using re is overkill, but I don't have a Mac or Windows PC on hand. Which programs are returning non-UNIX epoch timestamps?
This was my solution
@pirate commented on GitHub (Apr 20, 2020):
See here for an explanation for why it's not as simple as using dateutil.parse: https://github.com/pirate/ArchiveBox/issues/119
Timestamp handling has been a long-running issue in this project, and unfortunately, there's no one simple solution that works for all timestamps from all programs, because many programs use custom timestamp offsets. This is why we always save the original parsed timestamp in str format next to any parsed version, so that we can later re-parse it if we get more information about the timestamps or if the user wants to set a custom offset.
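The idea of keeping the raw string next to the parsed value can be sketched roughly as follows. `StoredTimestamp`, `parse_with_offset`, and `reparse` are hypothetical names chosen for this example, not ArchiveBox's actual schema or API, and the sketch only handles numeric timestamps.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class StoredTimestamp:
    raw: str          # original value exactly as imported, never modified
    parsed: datetime  # best-effort interpretation, can be recomputed later


def parse_with_offset(raw, offset_seconds=0.0):
    """Interpret a numeric timestamp string as seconds, shifted by a custom offset."""
    seconds = float(raw) + offset_seconds
    return StoredTimestamp(raw=raw, parsed=datetime.fromtimestamp(seconds, tz=timezone.utc))


def reparse(stored, offset_seconds):
    """Re-derive the parsed value from the untouched raw string with a new offset."""
    return parse_with_offset(stored.raw, offset_seconds)
```

For example, a value such as "608169577" first parses to a date in 1989. If it later turns out the source counted seconds from 2001-01-01 UTC, `reparse(stored, 978307200)` yields 2020-04-09 23:59:37 UTC without losing the original string.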
@mdhowle commented on GitHub (Apr 20, 2020):
Thanks, I understand now. I wasn't thinking of browser histories.
If archivebox-export-browser-history is exporting browser history, it would know the browser and the epoch it uses internally. Is there any reason why that script couldn't convert the timestamps to a standard epoch, or pass the browser name/source to archivebox so it knows how to convert it? From a quick look, the internal epochs the browsers use don't change between OSes.
@pirate commented on GitHub (Apr 27, 2020):
It could, that's a good idea, moving the offset fixing into the browser export scripts would greatly simplify the timestamp handling in archivebox. It wouldn't fix other sources of timestamp issues, but at least browser imports would work well.
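As a rough illustration of what per-browser conversion might look like (this is not the export script's actual code), the sketch below maps each browser's internal timestamp to Unix seconds. The table reflects commonly documented conventions rather than anything stated in this thread: Chrome history uses microseconds since 1601-01-01 UTC, Firefox uses microseconds since 1970-01-01 UTC, and Safari uses seconds since 2001-01-01 UTC. The names `BROWSER_EPOCHS` and `browser_time_to_unix` are hypothetical.

```python
# Assumed conventions: (epoch start expressed in Unix seconds, ticks per second).
BROWSER_EPOCHS = {
    "chrome": (-11644473600, 1_000_000),  # microseconds since 1601-01-01 UTC
    "firefox": (0, 1_000_000),            # microseconds since 1970-01-01 UTC
    "safari": (978307200, 1),             # seconds since 2001-01-01 UTC
}


def browser_time_to_unix(browser, value):
    """Convert a browser-internal timestamp to Unix seconds (hypothetical helper)."""
    epoch_start, ticks_per_second = BROWSER_EPOCHS[browser]
    return value / ticks_per_second + epoch_start
```

Doing this in the export step means the importer only ever sees ordinary Unix timestamps from browser sources, though other sources with custom offsets would still need separate handling.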
@mdhowle commented on GitHub (Apr 30, 2020):
Here's what I changed in the browser export script: github.com/mdhowle/ArchiveBox@414d5e6189. I haven't tested it thoroughly, but it does output correctly on my machine.
Like you've mentioned, it's difficult to interpret an integer/float as a timestamp confidently. Guessing the datetime format is good for the user's experience until it is wrong.
One idea is to only accept the common formats you'd know like Unix timestamp and anything dateutil can parse. Otherwise require the user to define the format. Maybe via command line argument
/archive --date-format="%Y-%m-%d %H:%M:%S" https://example.com/rss/feed.xml
or defined in the config like
@pirate commented on GitHub (Apr 30, 2020):
We can also warn the user or bail out if the parsed date is outside of something like this:
1960-01-01 < date < $CURRENT_YEAR+1.
@mdhowle commented on GitHub (Apr 30, 2020):
I like that idea of warning. Since you are still recording the raw value, it's really a superficial problem.
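The proposed sanity check could look something like this sketch. It reads `$CURRENT_YEAR+1` as "before January 1 of next year", which is one interpretation of the suggestion; `check_date` and `MIN_DATE` are hypothetical names.

```python
import warnings
from datetime import datetime, timezone

MIN_DATE = datetime(1960, 1, 1, tzinfo=timezone.utc)


def check_date(date, now=None):
    """Return True if the date is plausible, otherwise warn and return False."""
    now = now or datetime.now(timezone.utc)
    # Assumed reading of "$CURRENT_YEAR+1": the start of the next calendar year.
    upper = datetime(now.year + 1, 1, 1, tzinfo=timezone.utc)
    if MIN_DATE < date < upper:
        return True
    warnings.warn(f"parsed date {date.isoformat()} is outside the plausible range")
    return False
```

Note that a check like this only catches wildly wrong values. A timestamp misread by a few decades, such as one landing in 1989 instead of 2020, still passes.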
On the frontend, if the parsed date fails the 1960-01-01 < date < $CURRENT_YEAR+1 check, it could display a placeholder value in addition to logging a warning to stdout/file.
I understand you are busy. I am waiting for the rewrite to be merged before I start hacking away. :)
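For the explicit-format suggestion made earlier in this comment thread, a user-supplied format string could be applied with the standard library's `strptime`. This is only a sketch: `parse_with_format` is a hypothetical helper, no `--date-format` option is known to exist in ArchiveBox, and treating the result as UTC is an assumption.

```python
from datetime import datetime, timezone


def parse_with_format(text, date_format):
    """Parse text using a user-supplied strptime format (hypothetical helper)."""
    parsed = datetime.strptime(text, date_format)
    # Assumption: the source does not carry a timezone, so treat it as UTC.
    return parsed.replace(tzinfo=timezone.utc)


# Example with the format string from the suggestion above.
example = parse_with_format("2020-04-10 00:00:00", "%Y-%m-%d %H:%M:%S")
```

If the text does not match the format, `strptime` raises `ValueError`, which gives the caller a clear point at which to warn or bail out instead of guessing.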
@cdvv7788 commented on GitHub (Jul 20, 2020):
@terxw are you still experiencing this issue? We recently merged something that should have helped with it.
@pirate commented on GitHub (Jul 20, 2020):
If you see anything, comment back here and I can reopen the ticket.