6 invalid wu's

Questions and Answers : Windows : 6 invalid wu's

alex

Joined: 4 Dec 20
Posts: 32
Credit: 47,319,359
RAC: 0
Message 916 - Posted: 7 Dec 2020, 10:36:21 UTC

Hi,
I have 6 invalid WUs on different PCs. What they have in common: in stderr, some or all of the loss or val_loss entries are nan.
One example is https://www.mlcathome.org/mlcathome/result.php?resultid=3189474
Is the value 'nan' the reason for the invalid WU? If so, does it make sense to finish the calculation? Would it be possible to simply restart the WU?
ID: 916 · Rating: 0
Werinbert

Joined: 30 Nov 20
Posts: 14
Credit: 7,958,883
RAC: 16
Message 917 - Posted: 8 Dec 2020, 4:12:34 UTC

nan is probably the output code for "not a number", such as the value resulting from the square root of a negative number. I don't think it is a problem with your machines; more likely the simulation parameters are causing it. Restarting the WU would more than likely just repeat the error.
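A short Python sketch of how NaN behaves, for illustration only (it is not taken from the project's code):

```python
import math

# inf - inf has no defined value, so IEEE 754 arithmetic returns NaN.
nan = float("inf") - float("inf")

print(math.isnan(nan))        # True
print(nan == nan)             # False: NaN never compares equal, even to itself
print(math.isnan(nan + 1.0))  # True: NaN propagates through later arithmetic

# Note: Python's math.sqrt(-1.0) raises ValueError instead of returning NaN;
# C's sqrt() returns NaN for that input.
```

The propagation in the last print is why a single NaN early in a run tends to turn every later loss value into nan as well.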
ID: 917 · Rating: 0
alex

Joined: 4 Dec 20
Posts: 32
Credit: 47,319,359
RAC: 0
Message 918 - Posted: 8 Dec 2020, 10:13:57 UTC

The thing is that these WUs validate on other PCs, so a reinit could probably help.
Another idea: once the first nan occurs, does it make sense to complete the WU and waste GPU time? If a reinit is not an option, the WU could be stopped with an error code other than 0 (as my failed WUs do).
If this is a more common problem, it might be worth looking into the code. But I did not find similar posts. Is it a Windows-only problem?

The number of failed WUs (validation errors) is now up to 9. At least this WU also failed on another PC, a Windows server: https://www.mlcathome.org/mlcathome/show_host_detail.php?hostid=5082
This one validated on another Win10 PC: https://www.mlcathome.org/mlcathome/workunit.php?wuid=1470598
ID: 918 · Rating: 0
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 921 - Posted: 9 Dec 2020, 1:49:56 UTC - in response to Message 918.  

NaN ("not a number") is what the floating point machinery returns when a result is undefined, such as 0/0 or infinity minus infinity. In training it usually shows up after values have grown too large to be represented and have overflowed to infinity. Getting NaNs is sadly not all that unusual when training neural networks, especially ones that are difficult to learn like ParityMachine. Training a neural network means searching a high-dimensional error surface for a global minimum (the point with the minimal error given the training set). The search is stochastic in nature: sometimes you just get unlucky and get stuck in a search path that spirals out of control. That's a vastly simplified description of a complex topic, but I hope it at least makes some sense.
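To make the "spirals out of control" idea concrete, here is a toy Python sketch. It is not the ParityMachine training code: the function `train`, the objective w^4, the starting point, and both learning rates are assumptions chosen only so that one run converges and the other diverges.

```python
import math

def train(learning_rate, steps=50):
    """Minimise f(w) = w^4 with plain gradient descent; return the loss after each step."""
    w = 1.5
    losses = []
    for _ in range(steps):
        grad = 4.0 * w * w * w          # derivative of w^4
        w = w - learning_rate * grad
        losses.append(w * w * w * w)    # plain multiplication overflows to inf instead of raising
    return losses

stable = train(learning_rate=0.01)
diverged = train(learning_rate=0.5)

print(math.isfinite(stable[-1]))   # True
print(math.isnan(diverged[-1]))    # True
```

In the diverging run each step overshoots further than the last, the weight overflows to infinity within a handful of steps, and the next update computes infinity minus infinity, which is NaN. From then on every loss is nan.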

However, the presence of NaNs in your result should not cause a validation failure. It's not your fault the algorithm happened to search down an unfruitful path in the error surface, so you should still get credit for the computation. Instead, we simply don't generate a follow-on WU to continue searching down that path. I'll need to check the DB later tonight to see why this particular WU failed validation, but the NaN isn't the reason.

Ideally, we would stop computation at the first sign of a NaN, but due to a quirk of the way BOINC works, ending WUs early causes more harm than good (it's a long story).

NaNs are relatively rare overall, so it's not worth the headache to stop early. However, it may be worth re-evaluating that policy now, since a) we're sending a lot of extra Parity WUs to try to get the number of completed networks up quickly, and b) the new GPU-based WUs run 8x the number of epochs of the CPU ones, so if a NaN is encountered early, the penalty to the science is larger.
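A rough Python sketch of the trade-off, assuming the per-epoch losses are available as a list. The helper names `first_bad_epoch` and `wasted_fraction` and the sample loss values are hypothetical; only the 8x epoch ratio comes from the post above.

```python
import math

def first_bad_epoch(losses):
    """Return the index of the first non-finite loss, or None if all are finite."""
    for epoch, loss in enumerate(losses):
        if not math.isfinite(loss):
            return epoch
    return None

def wasted_fraction(losses):
    """Fraction of epochs computed from the first NaN/inf onward (0.0 if none)."""
    bad = first_bad_epoch(losses)
    if bad is None:
        return 0.0
    return (len(losses) - bad) / len(losses)

cpu_run = [0.9, 0.7, float("nan"), float("nan")]
gpu_run = cpu_run + [float("nan")] * 28   # hypothetical: 8x as many epochs

print(first_bad_epoch(cpu_run))   # 2
print(wasted_fraction(cpu_run))   # 0.5
print(wasted_fraction(gpu_run))   # 0.9375
```

With the NaN appearing at the same epoch, the longer run spends a much larger share of its time on epochs that can no longer produce a usable result.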

Thanks for reporting this and let us think about it for a bit and see if we can tweak the algorithm going forward.
ID: 921 · Rating: 0
alex

Joined: 4 Dec 20
Posts: 32
Credit: 47,319,359
RAC: 0
Message 927 - Posted: 9 Dec 2020, 11:13:48 UTC - in response to Message 921.  

Thank you for that explanation.
It's a young project and failures are an option; otherwise progress would be too slow. I accept that. I'm happy that you are aware of these facts. I have seen projects where the admins were not aware (for at least half a year) that the formula they used was wrong.
ID: 927 · Rating: 0
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 950 - Posted: 21 Dec 2020, 2:05:38 UTC - in response to Message 927.  

As a follow-up here, I have noticed some evidence in the log that NaNs may be causing a small subset of WUs to be marked as invalid. Not explicitly: the code that translates the results to JSON produces a file that the JSON parser fails to parse (it expects a number, not a NaN), and that causes a failure in another part of the validation process. I'll work on fixing that over the next few days.
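The underlying issue is that standard JSON has no representation for NaN. A small Python illustration follows; it uses Python's standard `json` module rather than the project's validator, and the `record` values and the `sanitize` helper are made up for the example.

```python
import json
import math

record = {"loss": float("nan"), "val_loss": 0.42}

# By default Python writes the non-standard token NaN, which strict parsers reject.
print(json.dumps(record))  # {"loss": NaN, "val_loss": 0.42}

# With allow_nan=False the standard library refuses to emit invalid JSON.
try:
    json.dumps(record, allow_nan=False)
except ValueError as err:
    print("rejected:", err)

def sanitize(values):
    """Replace non-finite floats with None so they serialise as JSON null."""
    return {k: (v if math.isfinite(v) else None) for k, v in values.items()}

print(json.dumps(sanitize(record)))  # {"loss": null, "val_loss": 0.42}
```

Writing null keeps the file valid JSON, but the reader then has to accept null wherever it would otherwise expect a number.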
ID: 950 · Rating: 0
alex

Joined: 4 Dec 20
Posts: 32
Credit: 47,319,359
RAC: 0
Message 952 - Posted: 21 Dec 2020, 19:42:34 UTC

Murphy said: If something can go wrong, it will. Even if there is only a little chance.

Thank you for keeping us informed!
ID: 952 · Rating: 0
Luigi R.

Joined: 8 Jul 20
Posts: 10
Credit: 1,128,059
RAC: 0
Message 1014 - Posted: 2 Jan 2021, 9:26:15 UTC - in response to Message 950.  

"As a follow-up here, I have noticed some evidence in the log that NaNs may be causing a small subset of WUs to be marked as invalid. Not explicitly: the code that translates the results to JSON produces a file that the JSON parser fails to parse (it expects a number, not a NaN), and that causes a failure in another part of the validation process. I'll work on fixing that over the next few days."
Could this line help?
terminate called after throwing an instance of 'nlohmann::detail::type_error'
  what():  [json.exception.type_error.302] type must be number, but is null
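That message is consistent with a NaN having been written out as JSON null (which, as far as I know, is how the nlohmann/json library serialises NaN) and then read back by code that insists on a number. Whether that is what actually happens in the validator is an assumption. The Python sketch below only mimics the pattern; the functions `read_loss_strict` and `read_loss_tolerant` and the "loss" key layout are hypothetical.

```python
import json
import math

def read_loss_strict(text):
    """Mimic a reader that insists on a number and fails on null."""
    value = json.loads(text)["loss"]
    if value is None:
        raise TypeError("type must be number, but is null")
    return float(value)

def read_loss_tolerant(text):
    """Treat JSON null as NaN instead of failing."""
    value = json.loads(text)["loss"]
    return math.nan if value is None else float(value)

doc = '{"loss": null}'

try:
    read_loss_strict(doc)
except TypeError as err:
    print("strict reader failed:", err)

print(math.isnan(read_loss_tolerant(doc)))  # True
```

A tolerant reader of this kind lets the validator see the NaN and decide what to do with it, rather than aborting while parsing.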
ID: 1014 · Rating: 0

©2024 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)