Computation errors on 2080 Ti

joeybuddy96

Joined: 13 Feb 21
Posts: 2
Credit: 592,909
RAC: 0
Message 1091 - Posted: 15 Feb 2021, 4:16:01 UTC

I'm getting errors on every single GPU task. I've got a 2080 Ti. They're cuda10200 tasks, v9.75. I'll suspend the GPU tasks for now until I know what the problem is. No sense in spending cycles crunching for four hours on tasks that are going to error out. I've recently crunched F@H and GPUGRID tasks, so I don't think it's my system. Thank you for any advice you can give.
ID: 1091 · Rating: 0
alex

Joined: 4 Dec 20
Posts: 32
Credit: 47,319,359
RAC: 0
Message 1092 - Posted: 16 Feb 2021, 13:15:52 UTC - in response to Message 1091.  

I too had problems when I started to crunch MLC. Driver updates helped in some cases.
I still have a lot of WUs that fail. Looking into the stderr output, I can see a lot of 'NaN' (Not a Number) entries, which happens here from time to time. But all WUs? Could you please post some of the error codes from the results? Maybe they give a hint about the problem. Is Windows up to date? Are you crunching multiple WUs on the card at once?
From my experience I can say: 3 different Windows systems (20H2, up to date) with 3 different CUDA GPUs and up-to-date drivers work fine, and the error rate is well below 5%. Two laptops with mobile Nvidia GPUs also work fine.
ID: 1092 · Rating: 0
conf [MM]

Joined: 5 Dec 20
Posts: 1
Credit: 2,552,082
RAC: 0
Message 1106 - Posted: 25 Feb 2021, 10:18:05 UTC

Application	Machine Learning Dataset Generator (GPU) 9.75 (cuda10200)
Name	ParityModified-1611091940-16978-7
Status	Active
Received	25.02.2021 01:56:36
Report deadline	04.03.2021 01:56:36
Resources	0.997 CPUs + 1 NVIDIA GPU
Estimated computation size	640,200 GFLOPs
CPU time	01:43:32
CPU time since last checkpoint	00:51:34
Elapsed time	02:01:48
Estimated time remaining	173d 03:50:45
Progress	0.049%
Memory required	4.25 GB
Working set size	2.08 GB
Directory	slots/5
Process ID	13924
Executable	mlds-gpu_9.75_windows-x86_64__cuda10200.exe

0.244 after 2 hours on an Nvidia 3090?
Something is going wrong here, so I will abort this.
CPU tasks are OK; on the GPU, every single one has errors.
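
For anyone who wants to sanity-check the remaining-time estimate in the task properties above, here is a minimal sketch. It assumes a plain linear extrapolation from elapsed time and progress, which is not necessarily how the BOINC client computes its estimate; the function name `remaining_days` is made up for illustration.

```python
# Hypothetical helper: linear extrapolation of remaining run time.
# Assumption: progress grows linearly with elapsed time.
def remaining_days(elapsed_s: float, progress_pct: float) -> float:
    total_s = elapsed_s / (progress_pct / 100.0)  # projected total run time
    return (total_s - elapsed_s) / 86400.0        # seconds -> days

# Values from the task properties: elapsed 02:01:48, progress 0.049%.
elapsed = 2 * 3600 + 1 * 60 + 48
print(f"about {remaining_days(elapsed, 0.049):.0f} days remaining")
```

Under that assumption the result lands in the same range as the 173 days shown by the client, so the estimate itself is consistent with the tiny progress value; the real problem is how slowly the progress is advancing.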
ID: 1106 · Rating: 0
alex

Joined: 4 Dec 20
Posts: 32
Credit: 47,319,359
RAC: 0
Message 1107 - Posted: 25 Feb 2021, 11:57:01 UTC - in response to Message 1106.  

Every failed WU should give you something like this:
Stderr output

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 3765269347 (0xe06d7363)</message>
<stderr_txt>
0.0309688 | val_loss: 0.0312789 | Time: 1741.63 ms
[2021-02-24 03:23:01 main:574] : INFO : Epoch 1361 | loss: 0.0309637 | val_loss: 0.0312642 | Time: 1752.23 ms
[2021-02-24 03:23:03 main:574] : INFO : Epoch 1362 | loss: 0.0309647 | val_loss: 0.0312205 | Time: 1745.28 ms
[2021-02-24 03:23:05 main:574] : INFO : Epoch 1363 | loss: 0.0309646 | val_loss: 0.031285 | Time: 1731.2 ms
[2021-02-24 03:23:06 main:574] : INFO : Epoch 1364 | loss: 0.0309602 | val_loss: 0.0312414 | Time: 1771.33 ms
[2021-02-24 03:23:08 main:574] : INFO : Epoch 1365 | loss: 0.0309603 | val_loss: 0.0312525 | Time: 1717.28 ms

The exit code might tell pianoman why it failed. The RTX 3090 is a brand-new product, and it may be that MLC has not yet been tested on it.
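
As a side note on the exit code in the stderr header above, here is a small sketch that decodes it. The value 0xE06D7363 is the exception code the Microsoft Visual C++ runtime uses for a thrown C++ exception, and its low three bytes are the ASCII letters "msc". So it suggests the application died from an unhandled C++ exception, but it does not say which one. The variable names are for illustration only.

```python
# Decode the exit code from the stderr header.
EXIT_CODE = 3765269347

hex_code = f"0x{EXIT_CODE:08x}"  # same value in hexadecimal
# The low three bytes of the MSVC C++ exception code spell an ASCII tag.
tag = (EXIT_CODE & 0xFFFFFF).to_bytes(3, "big").decode("ascii")

print(hex_code, tag)
```

The hexadecimal form matches the one BOINC prints next to the decimal code, and the tag is what marks this as a C++ exception rather than, say, an access violation.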
ID: 1107 · Rating: 0
joeybuddy96

Joined: 13 Feb 21
Posts: 2
Credit: 592,909
RAC: 0
Message 1248 - Posted: 11 Jul 2021, 6:06:58 UTC
Last modified: 11 Jul 2021, 6:10:59 UTC

I've got a 2080 Ti, not a 3090 (I wish though). Still have the same error that was present in February when I tried to run it last, though.
Task description from: https://www.mlcathome.org/mlcathome/result.php?resultid=5638050
Task 5638050
Name	ParityMachine-1624423002-31433-2_0
Workunit	3478983
Created	5 Jul 2021, 6:30:13 UTC
Sent	10 Jul 2021, 22:45:29 UTC
Report deadline	17 Jul 2021, 22:45:29 UTC
Received	11 Jul 2021, 4:46:50 UTC
Server state	Over
Outcome	Computation error
Client state	Compute error
Exit status	197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED
Computer ID	9215
Run time	3 hours 43 min 55 sec
CPU time	3 hours 32 min 50 sec
Validate state	Invalid
Credit	0.00
Device peak FLOPS	15,279.60 GFLOPS
Application version	Machine Learning Dataset Generator (GPU) v9.75 (cuda10200)
windows_x86_64
Peak working set size	2.08 GB
Peak swap size	4.41 GB
Peak disk usage	1.54 GB

Here's an excerpt of the stderr output from the failed WU (due to character limits I apparently can't paste the whole thing here, but it is available at https://www.mlcathome.org/mlcathome/result.php?resultid=5638050).
exceeded elapsed time limit 13433.76 (4000000.00G/297.76G)</message>
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x00007FFC36BC9A92

Engaging BOINC Windows Runtime Debugger...
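
The numbers in the "exceeded elapsed time limit" line can be checked against the run time in the task description with a few lines of arithmetic. This is only a sketch: reading the two "G" figures as the work unit's operation bound and the client's estimated speed for this app version (both in units of 1e9) is an assumption based on the message format, and the variable names are made up.

```python
# Figures from the stderr line: (4000000.00G/297.76G)
FPOPS_BOUND_G = 4000000.00  # assumed: upper bound on operations, in 1e9 ops
EST_SPEED_G = 297.76        # assumed: estimated speed, in 1e9 ops per second

limit_s = FPOPS_BOUND_G / EST_SPEED_G  # elapsed time limit in seconds

# Run time from the task description: 3 hours 43 min 55 sec.
run_time_s = 3 * 3600 + 43 * 60 + 55

print(f"limit ~{limit_s:.0f} s, run time {run_time_s} s")
```

The quotient agrees with the reported limit of 13433.76 s to within rounding of the printed speed, and the run time is just past it. That fits exit status 197 (EXIT_TIME_LIMIT_EXCEEDED): the task was stopped for hitting the time limit, not for producing a wrong result.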
ID: 1248 · Rating: 0
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 1259 - Posted: 14 Jul 2021, 14:52:47 UTC - in response to Message 1248.  

Well, I'm about to turn my attention back to the Windows build anyway for the new client, so I'll take a look within the next week. I'm sorry that so many of you are having issues!
ID: 1259 · Rating: 0

©2022 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)