Computation errors on 2080 Ti
| Author | Message |
|---|---|
|
Joined: 13 Feb 21 Posts: 2 Credit: 592,909 RAC: 0 |
I'm getting errors on every single GPU task. I've got a 2080 Ti. They're cuda10200 tasks, v9.75. I'll suspend the gpu tasks for now until I know what the problem is. No sense in spending cycles crunching for four hours on tasks that are going to error out. I've recently crunched F@H and GPUGRID tasks, so I don't think it's my system. Thank you for any advice you can give. |
|
Joined: 4 Dec 20 Posts: 32 Credit: 47,319,359 RAC: 0 |
I too had problems when I started to crunch MLC. Driver updates helped in some cases. I still have a lot of WUs that fail. Looking at the stderr output, I can see a lot of 'NaN' (Not a Number) entries, which happens here from time to time. But all WUs? Could you please post some of the error codes from the results? Maybe they give hints about the problem. Is Windows up to date? Are you crunching multiple WUs on the card? From my experience I can say: 3 different Windows systems on 20H2, fully updated, with 3 different CUDA GPUs and up-to-date drivers, work fine; the error rate is well below 5%. Two laptops with mobile Nvidia GPUs also work fine. |
|
Joined: 5 Dec 20 Posts: 1 Credit: 2,552,082 RAC: 0 |
Application: Machine Learning Dataset Generator (GPU) 9.75 (cuda10200); Name: ParityModified-1611091940-16978-7; Status: Active; Received: 25.02.2021 01:56:36; Deadline: 04.03.2021 01:56:36; Resources: 0.997 CPUs + 1 NVIDIA GPU; Estimated computation size: 640,200 GFLOPs; CPU time: 01:43:32; CPU time since last checkpoint: 00:51:34; Elapsed time: 02:01:48; Estimated time remaining: 173d 03:50:45; Progress: 0.049%; Memory required: 4.25 GB; Working set size: 2.08 GB; Directory: slots/5; Process ID: 13924; Executable: mlds-gpu_9.75_windows-x86_64__cuda10200.exe. 0.244 after 2 hours on an Nvidia 3090? Something is going wrong here, so I will abort this. CPU tasks are OK, but on the GPU every single one has errors. |
|
Joined: 4 Dec 20 Posts: 32 Credit: 47,319,359 RAC: 0 |
Every failed WU should give you something like this: Stderr output <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> (unknown error) - exit code 3765269347 (0xe06d7363)</message> <stderr_txt> 0.0309688 | val_loss: 0.0312789 | Time: 1741.63 ms [2021-02-24 03:23:01 main:574] : INFO : Epoch 1361 | loss: 0.0309637 | val_loss: 0.0312642 | Time: 1752.23 ms [2021-02-24 03:23:03 main:574] : INFO : Epoch 1362 | loss: 0.0309647 | val_loss: 0.0312205 | Time: 1745.28 ms [2021-02-24 03:23:05 main:574] : INFO : Epoch 1363 | loss: 0.0309646 | val_loss: 0.031285 | Time: 1731.2 ms [2021-02-24 03:23:06 main:574] : INFO : Epoch 1364 | loss: 0.0309602 | val_loss: 0.0312414 | Time: 1771.33 ms [2021-02-24 03:23:08 main:574] : INFO : Epoch 1365 | loss: 0.0309603 | val_loss: 0.0312525 | Time: 1717.28 ms The exit code might give Pianoman information about why it failed. The RTX 3090 is a brand-new product, and it may be that MLC has not yet been tested on it. |
|
Joined: 13 Feb 21 Posts: 2 Credit: 592,909 RAC: 0 |
I've got a 2080 Ti, not a 3090 (I wish, though). I still get the same error that was present in February when I last tried to run it. Task description from: https://www.mlcathome.org/mlcathome/result.php?resultid=5638050 Task: 5638050; Name: ParityMachine-1624423002-31433-2_0; Workunit: 3478983; Created: 5 Jul 2021, 6:30:13 UTC; Sent: 10 Jul 2021, 22:45:29 UTC; Report deadline: 17 Jul 2021, 22:45:29 UTC; Received: 11 Jul 2021, 4:46:50 UTC; Server state: Over; Outcome: Computation error; Client state: Compute error; Exit status: 197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED; Computer ID: 9215; Run time: 3 hours 43 min 55 sec; CPU time: 3 hours 32 min 50 sec; Validate state: Invalid; Credit: 0.00; Device peak FLOPS: 15,279.60 GFLOPS; Application version: Machine Learning Dataset Generator (GPU) v9.75 (cuda10200) windows_x86_64; Peak working set size: 2.08 GB; Peak swap size: 4.41 GB; Peak disk usage: 1.54 GB. Here's an excerpt of the stderr output from the failed WU (due to character limits I apparently can't paste the whole thing here, but it is available at https://www.mlcathome.org/mlcathome/result.php?resultid=5638050). exceeded elapsed time limit 13433.76 (4000000.00G/297.76G)</message> Unhandled Exception Detected... - Unhandled Exception Record - Reason: Breakpoint Encountered (0x80000003) at address 0x00007FFC36BC9A92 Engaging BOINC Windows Runtime Debugger... |
|
Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
Well, I'm about to turn my attention back to the Windows build anyway for the new client, so I'll take a look within the next week. I'm sorry lots of you are having issues! |
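
For anyone reading the numbers quoted in this thread, here is a small Python sketch that may help. It is not project code, and the function names are made up for illustration. It shows two things: how the decimal exit code in the stderr header maps to the hex value printed next to it (0xe06d7363 is the code Windows reports for an unhandled Microsoft C++ exception), and how the figure in "exceeded elapsed time limit 13433.76 (4000000.00G/297.76G)" appears to be derived. The second part assumes the limit is simply the work unit's FLOP bound divided by the estimated device speed, which is what the two values in parentheses suggest.

```python
# Sketch only: helper names are hypothetical, not part of BOINC or MLC@Home.

def exit_code_hex(code: int) -> str:
    """Format a process exit code as the 32-bit hex value shown in a task report."""
    return f"0x{code & 0xFFFFFFFF:08x}"


def elapsed_time_limit(fpops_bound_gflop: float, speed_gflops: float) -> float:
    """Seconds allowed for a task, ASSUMING limit = FLOP bound / estimated speed."""
    return fpops_bound_gflop / speed_gflops


# Exit codes quoted in the posts above.
print(exit_code_hex(3765269347))  # the "(unknown error)" code
print(exit_code_hex(197))         # EXIT_TIME_LIMIT_EXCEEDED

# Values taken from the "exceeded elapsed time limit" message.
limit = elapsed_time_limit(4_000_000.00, 297.76)
print(f"limit is roughly {limit:.0f} s, or {limit / 3600:.2f} h")
```

The computed limit differs from the logged 13433.76 by a fraction of a second because the logged inputs are rounded. If the assumption holds, the 2080 Ti task was not crashing on bad data: it was stopped after about 3 hours 44 minutes, which matches the reported run time, because it ran longer than the limit allowed.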
©2022 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)