Validate errors

Questions and Answers : Issue Discussion : Validate errors

Werinbert

Joined: 30 Nov 20
Posts: 14
Credit: 7,958,883
RAC: 16
Message 938 - Posted: 19 Dec 2020, 5:20:12 UTC

I am now seeing quite a few tasks ending up as validate errors and the wing-men are also showing validate errors. I think something got messed up somewhere.
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 939 - Posted: 19 Dec 2020, 6:17:12 UTC - in response to Message 938.  

Shoot! I think I know what's up. Dang it. I can fix it, and my apologies.
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 940 - Posted: 19 Dec 2020, 6:24:46 UTC - in response to Message 939.  

Fixed.

Sigh... I made a change to double the length of each GPU WU, trying to get through the ParityModified WUs faster. A simple change, but I forgot to update the validation criteria to account for the new length of the returned data. Shame on me for not testing it in "test" first.
Werinbert

Joined: 30 Nov 20
Posts: 14
Credit: 7,958,883
RAC: 16
Message 943 - Posted: 19 Dec 2020, 16:11:35 UTC

As the length of the task has doubled, has the credit reward doubled as well?
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 944 - Posted: 19 Dec 2020, 16:33:23 UTC - in response to Message 943.  
Last modified: 19 Dec 2020, 16:35:25 UTC

Yes, any of the new, longer WUs get double the credit.

There are still plenty of older, non-doubled-length WUs; those, of course, get the non-doubled credit as expected.
Luigi R.

Joined: 8 Jul 20
Posts: 10
Credit: 1,128,059
RAC: 0
Message 978 - Posted: 28 Dec 2020, 18:06:34 UTC
Last modified: 28 Dec 2020, 18:14:01 UTC

I noticed that many invalid tasks have a stderr.txt containing lines like this one.

[2020-12-28 18:48:49	                main:581]	:	INFO	:	Epoch 2 | loss: nan | val_loss: nan | Time: 1735.95 ms


If we see "nan" (not a number) at some point, the task will not validate, and we could have aborted it much earlier instead of running it to completion.

Example: https://www.mlcathome.org/mlcathome/result.php?resultid=3453877
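For anyone who wants to check a running task by hand, the test is just a grep over the slot's stderr.txt. A minimal sketch (the slot path below assumes a default Linux BOINC install; adjust it for your setup):

```shell
# succeeds (exit 0) if the given stderr.txt contains a NaN loss line
has_nan() {
    grep -q 'loss: nan' "$1"
}

# example: check one slot's stderr.txt (path depends on your BOINC install)
if has_nan /var/lib/boinc-client/slots/0/stderr.txt 2>/dev/null; then
    echo "NaN found: this task will likely fail validation"
fi
```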
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 980 - Posted: 29 Dec 2020, 3:13:13 UTC - in response to Message 978.  

You are correct. Sadly, it's a bit trickier than it seems on the surface. First, we're working (right now, actually) on revamping the validation process so it doesn't penalize a user's credit when they hit a NaN. Overall, NaNs are still relatively uncommon, but we're currently concentrating on Parity WUs in the GPU queue, which tend to generate them at a higher rate.

BOINC also doesn't respond well to WUs ending early. The problem is that the BOINC client itself tries to be smart about WUs and about estimating its computer's speed, independent of what the server says. Each WU is specified with the number of FLOPs a full WU should take. If a WU ends early without erroring out, the BOINC client assumes it completed all those FLOPs in a much shorter time, so the client starts to think the computer is faster than it is. Over time it starts believing it should complete all WUs quickly, and begins killing perfectly good WUs that run the full time with a "WU time out" error. Our understanding is that there's not much we can do at the server level to combat this.

Long story short, the particular work queue we're running now exacerbates this problem more than before. Our new GPU WUs run for 2048 epochs, while the older ones only went up to 128 or 192. So a NaN encountered early means more "wasted" work until the end of the WU than it did before. We'll continue to work on ways to improve this.

I know it's not a great answer, but it's the one we can offer at the moment.
alex

Joined: 4 Dec 20
Posts: 32
Credit: 47,319,359
RAC: 0
Message 984 - Posted: 29 Dec 2020, 6:45:22 UTC

I have checked my account. Currently I have 86 invalid results; 4 of them are CPU WUs: tasks 3262846, 3259758, 3291712, and 3294610.
I also have 31 WUs that report as errors, about 50% CPU and 50% GPU.
Since my first report of failing WUs, the numbers have been rising more slowly.
Hope this helps a little in figuring out the source.
Luigi R.

Joined: 8 Jul 20
Posts: 10
Credit: 1,128,059
RAC: 0
Message 987 - Posted: 29 Dec 2020, 17:36:00 UTC - in response to Message 980.  

OK pianoman, I wrote a bash script to catch these NaN WUs.
The expected behaviour is to suspend faulty tasks so I can manually abort them after a "human check".
If you are not against it, I will post my code on Pastebin.
alex

Joined: 4 Dec 20
Posts: 32
Credit: 47,319,359
RAC: 0
Message 989 - Posted: 29 Dec 2020, 19:16:09 UTC - in response to Message 987.  

I can manually abort them after "human-check".

Do you mean that they are cancelled during runtime, so as not to waste CPU/GPU time? Sounds good ...
There should be a way to do that from your script; BOINC Manager is remotely controllable.

Are you sure that a NaN automatically signals a failed WU?
Luigi R.

Joined: 8 Jul 20
Posts: 10
Credit: 1,128,059
RAC: 0
Message 990 - Posted: 29 Dec 2020, 20:26:02 UTC - in response to Message 989.  

Do you mean that they are cancelled during runtime, so as not to waste CPU/GPU time? Sounds good ...
There should be a way to do that from your script; BOINC Manager is remotely controllable.

Yeah, we could also abort them automatically with boinccmd.

Are you sure that a NaN automatically signals a failed WU?

Not at all, but if you go to the top hosts and compare valid tasks with invalid ones, you can observe that pattern.
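For reference, controlling a single task from a script goes through boinccmd's `--task` operation. A sketch, with the project URL and task name below as placeholders (substitute your own, and note the command only runs if boinccmd is installed):

```shell
# Control one task: boinccmd --task <project_url> <task_name> <op>
# where <op> is one of: suspend, resume, abort.
# PROJECT_URL and TASK_NAME are placeholders -- substitute your own values.
PROJECT_URL="https://www.mlcathome.org/mlcathome/"
TASK_NAME="example_wu_name_0"

cmd="boinccmd --task $PROJECT_URL $TASK_NAME suspend"
echo "$cmd"
# only execute the command if boinccmd is actually installed on this host
command -v boinccmd >/dev/null 2>&1 && $cmd || true
```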
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 991 - Posted: 29 Dec 2020, 20:32:00 UTC

Sure, you don't need my permission to post a script like that, go right ahead.
alex

Joined: 4 Dec 20
Posts: 32
Credit: 47,319,359
RAC: 0
Message 992 - Posted: 29 Dec 2020, 22:23:41 UTC

Some time ago I asked a similar question; the thread is here: https://www.mlcathome.org/mlcathome/forum_thread.php?id=133
My initial assumption was that it was related to Windows.
There one can find this answer from pianoman:
However, the presence of NaNs in your result should not cause a validation failure. It's not your fault the algorithm happened to search down an unfruitful path in the error plane, so you should still get credit for the computation. Instead, we simply don't generate a follow-on WU to continue searching down that path. I'll need to check the DB later tonight to see why this particular WU failed validation, but the NaN isn't the reason.

So maybe it's a good idea to wait for a clear statement from the developers. It makes no sense to kill WUs that are useful and would validate.
Luigi R.

Joined: 8 Jul 20
Posts: 10
Credit: 1,128,059
RAC: 0
Message 994 - Posted: 29 Dec 2020, 23:38:58 UTC - in response to Message 991.  

Sure, you don't need my permission to post a script like that, go right ahead.
Well, I have to ask, because it could encourage suspending/aborting tasks. Thanks anyway. :)

So... this script is made for users (like me) who run GPU tasks one at a time.
I tested it on various customized stderr files, but I have not run into NaN/faulty tasks yet.
Yesterday I got 3 consecutive validate errors, today nothing.

https://pastebin.com/efiEDfRh
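The pastebin has the actual code; purely to illustrate the idea (this is an editorial sketch, not Luigi's script), a watcher could look roughly like this, assuming the default Linux BOINC data directory:

```shell
#!/bin/sh
# Sketch of a NaN watcher (NOT the pastebin script): every INTERVAL seconds,
# scan each BOINC slot's stderr.txt and report tasks whose loss went NaN,
# so a human can suspend/abort them (e.g. with boinccmd).
BOINC_DIR=${BOINC_DIR:-/var/lib/boinc-client}   # default Linux path; adjust
INTERVAL=${INTERVAL:-60}

watch_once() {
    for f in "$BOINC_DIR"/slots/*/stderr.txt; do
        [ -e "$f" ] || continue
        grep -q 'loss: nan' "$f" && echo "NaN in $f"
    done
    return 0
}

# pass "run" to loop forever; without it, nothing runs (safe to source)
[ "${1:-}" = run ] && while :; do watch_once; sleep "$INTERVAL"; done || true
```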
Luigi R.

Joined: 8 Jul 20
Posts: 10
Credit: 1,128,059
RAC: 0
Message 995 - Posted: 29 Dec 2020, 23:44:38 UTC - in response to Message 992.  

So maybe it's a good idea to wait for a clear statement from the program developers. It makes no sense to kill wu's that are useful and would validate.

I have found validate errors without NaNs, but I have not yet found a valid task with NaNs.
My three invalid results contain NaNs as well.
bozz4science

Joined: 9 Jul 20
Posts: 142
Credit: 11,536,204
RAC: 3
Message 997 - Posted: 30 Dec 2020, 10:09:20 UTC - in response to Message 995.  

Thanks for providing this resource, Luigi. Recently I have seen a bump in invalid results on my host as well, and even though I only have ~100 GPU tasks done on my host, I am approaching nearly 10% invalids. That's a bit high, IMO, to continue computing those tasks without intervention.

Isn't there a way to catch those exceptions (NaN errors in a given epoch) as a conditional check, so that whenever it is triggered, before training a new epoch, the task is aborted and reassigned, or reinitialized? I don't see any point in letting those tasks compute to the very end. The NaN error is just carried through to the last epoch and shows up in the result as a NaN as well. I can't imagine anything valuable coming from such a task; surely it won't help the current research. Is there any way to trim down those invalid rates that could be built directly into the application? (Not circumvented altogether, as the stochastic nature of ML/NN has been discussed here many times.)
Luigi R.

Joined: 8 Jul 20
Posts: 10
Credit: 1,128,059
RAC: 0
Message 1008 - Posted: 31 Dec 2020, 15:34:08 UTC

@pianoman

This task was suspended by my script at 99.6% (some NaNs occurred), but I resumed it because it was so close to 100%.
The task resumed from a checkpoint before the first NaN, and no NaNs appeared afterwards. So it completed successfully. How do you explain that?
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 1011 - Posted: 31 Dec 2020, 19:51:25 UTC - in response to Message 1008.  
Last modified: 31 Dec 2020, 19:54:56 UTC

@Luigi R

It's easy to see how that could happen, but the reasons get a little technical:

Unlike most BOINC projects, ML training is a stochastic search. You're searching for an optimal set of weights for the network that minimize the difference between the correct results and the ones the neural network is producing at that time. It's a bit random how it searches the state space for the optimal values; the official term is "stochastic gradient descent". The algorithm looks for a path that minimizes the loss function (the maximum downward gradient), but sometimes randomly takes a slightly different path (the stochastic part). This guarantees that you search more of the state space, and helps prevent you from getting stuck in a local minimum when there might be a better global minimum somewhere else.

To get a little further into the weeds, the algorithm that guides this search is called the "optimizer", because it's searching for the optimal result. Our client uses the "Adam" optimizer, which is a variant of stochastic gradient descent. At each epoch, this optimizer keeps a bit of history about the previous places it has searched. While we checkpoint the state of the trained network, we do not checkpoint the internal state of the optimizer. This is by design, as it's been shown that sometimes the optimizer gets stuck in an unfruitful part of the search space, and "restarting" the optimizer will send it down a new path. So this "reset" happens at two times during our WUs: whenever the client checkpoints and restarts, and when a new "continuation" WU is created on the server because the loss isn't below our threshold.

When you hit a NaN, it means you've been unlucky and hit a *really* bad patch of the search space, and will need to back up and start again, with a reset optimizer, to look elsewhere. NaNs happen as part of training; they have to do with a computer's inability to represent certain very small or very large numbers effectively. Your recourse at that moment is to restart the training at a known good point.

In your case, the last snapshot happened to be good, and you restarted with a fresh optimizer, which chose a different path that didn't lead to NaNs. This is analogous to a user returning a WU with NaNs to the server, the server seeing this, and re-sending the initial WU back out with a fresh optimizer to start over. In short, you're just short-circuiting a process that normally happens on the server.

Also, the harder a network is to learn, the more likely you are to hit NaNs. The ParityMachine/ParityModified WUs are by far the hardest systems we have to model to date. The reason is that the networks are so tiny there's not a lot of "expressibility" for the optimizer to play with when trying to find a response. Yet if we change the size of those networks, they won't match the rest of the DS1/DS2 networks, which breaks one of the things we're trying to measure.

I hope that helps explain what's going on.

Edit: It does suggest that there might be a way to detect a NaN and restart automatically, and at least retry a few times. That would change the amount of work done in a WU (there would be more work if a NaN was generated, and BOINC doesn't like that), but it shouldn't throw the numbers off *too* badly, given that NaNs still only appear in a small subset of results.
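That detect-and-retry idea could, in spirit, look like the wrapper below. This is only a sketch: the training command, its flags, and the exact log format are hypothetical stand-ins, not the real MLC client.

```shell
#!/bin/sh
# Hypothetical retry wrapper: run a training command, and if its log shows a
# NaN loss, rerun it (from its last checkpoint) up to a maximum number of times.
run_with_retries() {
    max=$1; shift                     # first arg: max attempts; rest: command
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        "$@" > run.log 2>&1
        if grep -q 'loss: nan' run.log; then
            echo "attempt $attempt hit NaN; restarting from last checkpoint"
            attempt=$((attempt + 1))
        else
            return 0                  # clean run, keep the result
        fi
    done
    return 1                          # gave up: every attempt went NaN
}
```

Invoked as, say, `run_with_retries 3 ./train --resume checkpoint.pt` (binary and flag both hypothetical), it would rerun the training up to three times before giving up.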
Luigi R.

Joined: 8 Jul 20
Posts: 10
Credit: 1,128,059
RAC: 0
Message 1013 - Posted: 1 Jan 2021, 9:20:25 UTC

Happy new year!

Is it just a coincidence that my task turned to NaNs right after Epoch 2020?
[2020-12-31 15:42:49	                main:581]	:	INFO	:	Epoch 2020 | loss: 0.0368034 | val_loss: 0.0325741 | Time: 1645.75 ms
[2020-12-31 15:42:51	                main:581]	:	INFO	:	Epoch 2021 | loss: nan | val_loss: nan | Time: 1645.33 ms

I don't think so. :D


So, if I got it right, your magnificent explanation is saying that I don't destroy the randomness when I suspend/resume a task.
My manual intervention does not introduce determinism, because it's like accidentally changing the seed, and a linear combination of N different solutions belongs to the same space.

I'm going to decrease the interval in my script to 30 seconds to see if I can avoid even more validation errors.
Werinbert

Joined: 30 Nov 20
Posts: 14
Credit: 7,958,883
RAC: 16
Message 1120 - Posted: 8 Mar 2021, 2:12:36 UTC

The validate errors are still a bit frustrating... how are things coming along on fixing or mitigating them?

The problem, as I see it, is that the computers processed the tasks just fine, but the stochastic model went off the deep end. Unfortunately, we volunteers are getting dinged for our troubles. I am still unclear whether the randomness comes from setup or from run time. It seems to come from run time (such as optimizers changing after a restart), so for any given task a set of computers can return different results, some succeeding and some failing. The computers did not do anything wrong, yet the "failed" computers get no credit for work done. And if that result from Luigi's task has any validity (message 1008), I think restarting tasks from a known good checkpoint would help a lot.

And regarding another comment above, I would prefer that a known-to-fail task be aborted rather than waste precious processing time to compensate for the bizarreness of BOINC task management. Trying to force this project's tasks to conform to BOINC's quirks is a waste, when other projects will mess up BOINC's task-management machinations easily enough.

©2024 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)