All tasks error out after reboot
| Author | Message |
|---|---|
|
Joined: 1 Jul 20 Posts: 34 Credit: 26,118,410 RAC: 0 |
I saw that the tasks were checkpointing, which is great. So I decided to go ahead and reboot my machines for an unrelated issue. When they started back up, all the tasks errored out. Yikes. Edit: The tasks are now listed as validation pending. That is a different problem already reported. Reno, NV Team: SETI.USA
|
|
Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
Hmm, continuing after reboot works for me. Is this one of the workunits? https://www.mlcathome.org/mlcathome_ops/db_action.php?table=result&id=8910 That's a crash deep in libtorch, which is well below any code I wrote. So either something really bad is going on, or for some reason the snapshot got corrupted. I'll see if I can dig up the stdout and see what happened. |
|
Joined: 1 Jul 20 Posts: 34 Credit: 26,118,410 RAC: 0 |
"Hmm, continuing after reboot works for me. Is this one of the workunits?" I am not allowed to log into that URL. But no, that single task is not what I am referring to. The errored-out tasks I am talking about were still sitting on my machines. It is also weird that they had not reported yet. Anyway, the errored-out tasks are now reported and available to look at. Here is a sample: https://www.mlcathome.org/mlcathome/result.php?resultid=9662 There were about 60 tasks on each machine that errored out after reboot. (Note: The errored-out tasks are now listed as validation pending. That is a different problem already reported.) Reno, NV Team: SETI.USA
|
|
Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
Understood. I'll look into it, thanks for reporting. |
|
Joined: 1 Jul 20 Posts: 31 Credit: 123,959 RAC: 0 |
Similar here: https://www.mlcathome.org/mlcathome/results.php?hostid=99 Suspending at a checkpoint and then resuming leads to a jump to 100% and successful completion. The tasks have been validated. I suspended after approximately 10-15 minutes, which suggests the task does nothing much for the remaining one or two hours ;o) |
|
Joined: 1 Jul 20 Posts: 34 Credit: 26,118,410 RAC: 0 |
I think the "pending validation" pool of tasks is going to grow large very fast. Reboots are going to create a lot of false "pending validation" tasks, which will then need a third task to be issued (and hopefully completed properly) to get the WU done. This is especially problematic now, because the recent Linux updates require a reboot. Reno, NV Team: SETI.USA
|
|
Joined: 1 Jul 20 Posts: 31 Credit: 123,959 RAC: 0 |
My "short" tasks were already validated by other, different hosts. If there are no differences between the two results, then a third is superfluous. |
|
Joined: 1 Jul 20 Posts: 34 Credit: 26,118,410 RAC: 0 |
"my 'short' tasks were already validated by other, different hosts" If interrupted tasks that error out and move to pending validation still validate eventually, then we should all just interrupt all tasks to maximize points. What is the point in taking the time to run tasks to completion? Right? Reno, NV Team: SETI.USA
|
|
Joined: 1 Jul 20 Posts: 31 Credit: 123,959 RAC: 0 |
The admin can simply limit the run time to 15 minutes instead of tormenting a task for a couple of hours. This is not fixed-volume math; after 15 minutes of training the AI stops responding to stimuli ;) |
|
Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
I'm still tracking this down. I know the suspend/resume code needed some more testing, but jumping to 100% shouldn't happen. And the way validation works will also need to be tweaked to catch this case. I'm hoping to get a new release out tonight to address this. Thanks for reporting, and for your patience. Luckily the actual science won't be bothered too much by this due to how the backend works, but the credit system can be gamed. |
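
As a rough illustration of the kind of validation check discussed in this thread (this is not the project's actual validator), a result could be rejected when it exited with an error or when it reports less training than the workunit asked for. The `Result` class and the fields `epochs_required`, `epochs_completed`, and `exit_status` are hypothetical names chosen for this sketch, and it assumes that a task which jumps to 100% after a resume reports fewer completed epochs than required:

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical summary of what a client reports for one task."""
    result_id: int
    epochs_required: int   # assumed: epochs the workunit asked for
    epochs_completed: int  # assumed: epochs the client actually trained
    exit_status: int       # 0 means the application exited cleanly


def is_plausible(result: Result) -> bool:
    """Return True only if the task exited cleanly and did all its work."""
    if result.exit_status != 0:
        return False
    return result.epochs_completed >= result.epochs_required


def filter_for_credit(results: list[Result]) -> list[Result]:
    """Keep only the results that should go on to normal validation."""
    return [r for r in results if is_plausible(r)]
```

Under these assumptions, a task that was interrupted and then finished early would be filtered out before credit is granted, rather than validating against another host's result.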
©2022 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)