mirror of
https://github.com/healthchecks/healthchecks.git
synced 2026-04-25 23:15:49 +03:00
[GH-ISSUE #261] System broke after upgrade to 1.7.0 #193
Originally created by @stevenmcastano on GitHub (Jun 21, 2019).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/261
I'm getting an error while trying to look at the checks... I can get to other pages, but anytime I try to look at the checks I get the following:
@stevenmcastano commented on GitHub (Jun 21, 2019):
It looks like if I create a new project, everything seems to work under that new one... but not my original project that was migrated in.
@cuu508 commented on GitHub (Jun 21, 2019):
One quick thing to check before digging deeper – did you run the database migrations (./manage.py migrate)?
@stevenmcastano commented on GitHub (Jun 21, 2019):
I did run the migrate, yes... and I've got it "maybe" fixed. I dug a little deeper: after looking at the sort that was actually failing, which is based on check status, I went into the Django admin console and took a look at the data to see if there was something weird...
It turns out one check somehow had no status... it wasn't up, down, late, paused... it was just blank. Apparently that null value was what was throwing the whole thing off. When I manually set it to "down" in the admin console, I was able to log in and see all my checks, and the error went away.
It took until about 3am last night to figure that out though... I was pretty wiped out. So right now, I'm pulling a copy of my production data and putting it back into my staging server. I'm going to run the migrate process again and compare the SQL exports from yesterday and today, to try to figure out whether this was just an insane stroke of luck (catching a check in between status changes) or whether the migration is somehow messing with it.
If it's reproducible, I'll capture as much data as possible... but it could have just been a fluke of freak timing!
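The error output itself is not shown in this thread, so the exact failure is unknown. As a minimal sketch of how a single blank status could break a status-based sort, assume a list of checks sorted by a `status` key. The data, `sort_by_status` and `sort_by_status_safe` below are hypothetical and are not taken from the Healthchecks source.

```python
# Hypothetical illustration; not the Healthchecks source code.
checks = [
    {"name": "backup", "status": "up"},
    {"name": "cron", "status": None},  # stands in for the blank status
    {"name": "mail", "status": "down"},
]


def sort_by_status(items):
    # Python 3 cannot order None against str, so one blank status
    # makes the whole sort raise TypeError.
    return sorted(items, key=lambda c: c["status"])


def sort_by_status_safe(items):
    # Treat a missing status as an empty string so the sort always succeeds.
    return sorted(items, key=lambda c: c["status"] or "")


try:
    sort_by_status(checks)
    naive_failed = False
except TypeError:
    naive_failed = True

safe_names = [c["name"] for c in sort_by_status_safe(checks)]
```

In this sketch one bad row is enough to fail the whole listing, which would be consistent with the page working again once the status was set by hand.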
@stevenmcastano commented on GitHub (Jun 23, 2019):
As it turns out, this is reproducible... it seems to happen to one check in particular. I've got one check with a 1 minute timeout and a 1 minute grace period... and I've tried resetting the data to the old version and doing the migration again a few times, and every time it seems to hang up on that one check. Even in my production system that one check sits there with no status in it.
@cuu508 commented on GitHub (Jun 24, 2019):
That's the culprit here. If you manually changed it to, say, "down" in the production database I'm pretty sure the issues would go away.
But the obvious question of course is how it ended up with a blank / null status in the first place. I'm not sure how that could have happened. How old is that check? Has it received any pings? If you have database backups, would it be possible to look at backups and pin down the time when it went from a valid status to null?
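Checking a backup for the moment a status went blank comes down to a query for rows whose status is NULL or empty. The sketch below runs that query against an in-memory SQLite database. The table `checks_demo` and its columns are hypothetical stand-ins, not the real Healthchecks schema, and a restored backup would need the actual table and column names.

```python
import sqlite3

# In-memory database with a simplified, hypothetical checks table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE checks_demo (id INTEGER PRIMARY KEY, name TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO checks_demo (name, status) VALUES (?, ?)",
    [("backup", "up"), ("every-minute", None), ("mail", "down"), ("legacy", "")],
)


def find_blank_status(connection):
    # Match both SQL NULL and the empty string, since "blank" could be either.
    rows = connection.execute(
        "SELECT name FROM checks_demo "
        "WHERE status IS NULL OR status = '' ORDER BY id"
    )
    return [name for (name,) in rows]


blank = find_blank_status(conn)
```

Running the same kind of query against successive backups would narrow down when the row first lost its status.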
@stevenmcastano commented on GitHub (Jun 24, 2019):
That's the weird part... it's an active check that actually gets pings constantly. I did change it to down, and up... and it just changes back to NULL again. It seems to be something weird with the 1 minute and 1 minute configuration.
The good news though... is that once I switch it in the Django admin site everything goes back to normal, and on v1.7.0 it seems to reflect its status properly now.
@cuu508 commented on GitHub (Jun 24, 2019):
Still trying to wrap my head around this, some clarifying questions:
Did you do that through the Django admin or some other way?
Do you reckon that was done by sendalerts? i.e., did it switch back to null instantly, or a minute or so later?
Is this the only check with that configuration, and also the only check having an issue?
So if you update the status in Django admin, it doesn't flip back to null any more?
Did you get the problematic check fixed "permanently" before upgrading to v1.7.0 or only after?
@stevenmcastano commented on GitHub (Jun 24, 2019):
The manual changes I made were through django admin, yes
As for it switching back... it did as soon as it hit a status change away from "Up". It's a check that happens "almost" every minute, so it does come in a few seconds late sometimes. So once it was late and changed status, and then a new ping came in, it would switch back to NULL... and yes, this is the only check that did. It's also the only check in my system that had such low times.
Yes, after the upgrade to v1.7.0 it does seem to switch status properly now. As I look closer, there doesn't appear to be a status for "late"... I wonder if it just kept getting caught in between? Was there one in a previous version?
The thing I'm finding weird here is that I've created a "Testing" ping with 1 minute interval and a 1 minute grace... and it shows in the normal web interface when it's late, and that it's down... but if I look in django admin, and directly in the database, both still say it's up. It seems like it's just not updating the database. Is it supposed to? Or does it cache it in memory somewhere?
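One way the web interface and the database could disagree, offered as an assumption rather than anything confirmed in this thread, is that the displayed status is derived at request time from the last ping, the timeout and the grace period, while the stored column only changes when something writes to it. A minimal sketch with a hypothetical `derived_status` function and the 1 minute / 1 minute settings described above:

```python
from datetime import datetime, timedelta


def derived_status(last_ping, timeout, grace, now):
    # Hypothetical helper: computes a display status without touching storage.
    if last_ping is None:
        return "new"
    grace_start = last_ping + timeout
    if now >= grace_start + grace:
        return "down"
    if now >= grace_start:
        return "late"
    return "up"


one_minute = timedelta(minutes=1)
last_ping = datetime(2019, 6, 24, 12, 0, 0)

# 30 seconds after the ping: still within the timeout.
status_early = derived_status(last_ping, one_minute, one_minute, last_ping + timedelta(seconds=30))
# 90 seconds after the ping: past the timeout, inside the grace period.
status_late = derived_status(last_ping, one_minute, one_minute, last_ping + timedelta(seconds=90))
# 120 seconds after the ping: the grace period has run out.
status_down = derived_status(last_ping, one_minute, one_minute, last_ping + timedelta(seconds=120))
```

Under that assumption, a check with such short intervals would show as late or down in the interface within two minutes of a missed ping, while a stored "up" value stays in place until a separate process updates it.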