All tasks error out after reboot

zombie67 [MM]
Joined: 1 Jul 20
Posts: 34
Credit: 26,118,410
RAC: 0
Message 27 - Posted: 2 Jul 2020, 4:08:54 UTC
Last modified: 2 Jul 2020, 4:49:19 UTC

I saw that the tasks were checkpointing, which is great, so I went ahead and rebooted my machines for an unrelated issue. When they came back up, all the tasks errored out. Yikes.

Edit: The tasks are now listed as validation pending. That is a different problem already reported.
Reno, NV
Team: SETI.USA
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 29 - Posted: 2 Jul 2020, 4:46:47 UTC - in response to Message 27.  

Hmm, continuing after reboot works for me. Is this one of the workunits?

https://www.mlcathome.org/mlcathome_ops/db_action.php?table=result&id=8910

That's a crash deep in libtorch, which is well below any code I wrote. So either something really bad is going on, or for some reason the snapshot got corrupted. I'll see if I can dig up the stdout and see what happened.
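
To illustrate the corrupted-snapshot possibility, here is a minimal sketch (not the MLC@Home client code) of one common way to keep a reboot from leaving a half-written checkpoint behind: write to a temporary file, flush it to disk, then rename it over the old snapshot, and verify a checksum when loading. The function names `save_checkpoint` and `load_checkpoint`, the JSON layout, and the assumption that the state is a JSON-serializable dict with string keys are all hypothetical.

```python
import hashlib
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write `state` so that `path` always holds either the old or the new snapshot."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    record = {"sha256": hashlib.sha256(payload).hexdigest(), "state": state}
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(record, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # rename over the old snapshot (atomic on POSIX)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or None if the file is missing, truncated, or altered."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            record = json.load(handle)
        payload = json.dumps(record["state"], sort_keys=True).encode("utf-8")
        if hashlib.sha256(payload).hexdigest() != record["sha256"]:
            return None  # checksum mismatch: treat as corrupted
        return record["state"]
    except (OSError, ValueError, KeyError, TypeError):
        return None
```

With this pattern, a task that finds `load_checkpoint` returning `None` can restart from scratch instead of crashing on a damaged snapshot.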
zombie67 [MM]
Joined: 1 Jul 20
Posts: 34
Credit: 26,118,410
RAC: 0
Message 31 - Posted: 2 Jul 2020, 4:54:38 UTC - in response to Message 29.  
Last modified: 2 Jul 2020, 4:56:17 UTC

pianoman wrote:
> Hmm, continuing after reboot works for me. Is this one of the workunits?
>
> https://www.mlcathome.org/mlcathome_ops/db_action.php?table=result&id=8910
>
> That's a crash deep in libtorch, which is well below any code I wrote. So either something really bad is going on, or for some reason the snapshot got corrupted. I'll see if I can dig up the stdout and see what happened.


I am not allowed to log into that URL.

But no, that single task is not what I am referring to.

The errored-out tasks I am talking about were still sitting on my machines. Oddly, they had not been reported yet. Anyway, they are now reported and available to look at. Here is a sample:

https://www.mlcathome.org/mlcathome/result.php?resultid=9662

There were about 60 tasks on each machine that errored out after reboot.

(note: The errored-out tasks are now listed as validation pending. That is a different problem already reported.)
Reno, NV
Team: SETI.USA
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 33 - Posted: 2 Jul 2020, 5:07:50 UTC - in response to Message 31.  

Understood. I'll look into it, thanks for reporting.
Sergey Kovalchuk
Joined: 1 Jul 20
Posts: 31
Credit: 123,959
RAC: 0
Message 34 - Posted: 2 Jul 2020, 6:13:54 UTC

Same here:
https://www.mlcathome.org/mlcathome/results.php?hostid=99
Suspending a task at a checkpoint and then resuming it makes it jump to 100% and complete successfully. Those tasks have been validated.
I suspended them after roughly 10-15 minutes, so apparently the task does not really do anything for the remaining hour or two ;o)
zombie67 [MM]
Joined: 1 Jul 20
Posts: 34
Credit: 26,118,410
RAC: 0
Message 35 - Posted: 2 Jul 2020, 6:42:15 UTC

I think the "pending validation" pool of tasks is going to grow large very fast. Reboots are going to create a lot of false "pending validation" tasks, which will then need a third task to be issued (and hopefully completed properly) to get the WU done. This is especially problematic now, because the recent Linux updates require a reboot.
Reno, NV
Team: SETI.USA
Sergey Kovalchuk
Joined: 1 Jul 20
Posts: 31
Credit: 123,959
RAC: 0
Message 36 - Posted: 2 Jul 2020, 7:00:39 UTC - in response to Message 35.  

My "short" tasks have already been validated by other, different hosts.
If there are no differences between the two results, then a third is superfluous.
zombie67 [MM]
Joined: 1 Jul 20
Posts: 34
Credit: 26,118,410
RAC: 0
Message 37 - Posted: 2 Jul 2020, 7:25:33 UTC - in response to Message 36.  

Sergey Kovalchuk wrote:
> my "short" tasks were already validated by other, different hosts
> if there are no differences between the two, then the third is superfluous


If interrupted tasks that error out and move to pending validation still validate eventually, then we should all just interrupt every task to maximize points. What is the point of taking the time to run tasks to completion, right?
Reno, NV
Team: SETI.USA
Sergey Kovalchuk
Joined: 1 Jul 20
Posts: 31
Credit: 123,959
RAC: 0
Message 38 - Posted: 2 Jul 2020, 7:36:16 UTC - in response to Message 37.  
Last modified: 2 Jul 2020, 7:44:37 UTC

The admin could simply limit the run time to 15 minutes instead of tormenting a task for a couple of hours.

This is not fixed-volume math: after 15 minutes of training, the AI stops responding to stimuli ;)
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 43 - Posted: 2 Jul 2020, 18:24:39 UTC

I'm still tracking this down. I know the suspend/resume code needed some more testing, but jumping to 100% shouldn't happen. The way validation works will also need to be tweaked to catch this case. I'm hoping to get a new release out tonight to address this. Thanks for reporting, and for your patience. Luckily the actual science won't be bothered too much by this due to how the backend works, but the credit system can be gamed.
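
As a rough idea of what "catching this case" could look like, here is a minimal sketch of a plausibility check that rejects a result which claims to be finished after training for far fewer epochs than requested. It is not the BOINC validator API or the project's actual validator. The `ReportedResult` fields, and the assumption that a task may legitimately stop early once its loss falls below a threshold, are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReportedResult:
    """Hypothetical summary a client might upload with a finished task."""
    epochs_requested: int
    loss_history: List[float]  # one entry per epoch actually trained
    loss_threshold: float      # assumed early-stop criterion: loss <= threshold


def is_plausible(result: ReportedResult) -> bool:
    """Return True if the result looks like a genuinely completed training run."""
    if not result.loss_history:
        return False  # no training recorded at all
    epochs_run = len(result.loss_history)
    if epochs_run > result.epochs_requested:
        return False  # more epochs than the workunit asked for
    if epochs_run == result.epochs_requested:
        return True  # ran to the end
    # Stopped early: only acceptable if the loss actually reached the threshold.
    return result.loss_history[-1] <= result.loss_threshold


# A task that "jumped to 100%" after two epochs without converging is rejected.
suspicious = ReportedResult(epochs_requested=10, loss_history=[0.9, 0.8], loss_threshold=0.01)
print(is_plausible(suspicious))
```

A check like this would run before any credit is granted, so interrupting tasks early would no longer be a way to earn points.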

©2022 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)