GPU running idle most of the time

erich56

Joined: 22 Apr 21
Posts: 12
Credit: 2,876,582
RAC: 0
Message 1162 - Posted: 22 Apr 2021, 6:40:09 UTC

I just joined this project with a host that has two RTX 3070s. So besides several CPU tasks (on a 10-core / 20-thread CPU), two GPU tasks are running concurrently.
However, when watching the crunching process in MSI Afterburner, I notice that the GPUs work at full load only rarely, with long idle periods in between. Is this "normal" behaviour of MLC GPU tasks, or is something wrong?
ID: 1162
erich56

Joined: 22 Apr 21
Posts: 12
Credit: 2,876,582
RAC: 0
Message 1163 - Posted: 22 Apr 2021, 8:19:26 UTC

So, the first GPU task has finished and ended up with a validation error:
https://www.mlcathome.org/mlcathome/result.php?resultid=5012881

Can anyone tell me what's going wrong?
ID: 1163
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 1164 - Posted: 22 Apr 2021, 13:54:08 UTC - in response to Message 1163.  
Last modified: 22 Apr 2021, 13:57:57 UTC

Hello and welcome to the project!

First, it doesn't look like you're doing anything wrong. I should turn this into a FAQ.

I'll cover the validation error in a second (and no, it is not ideal), but first, on GPU utilization: the current batch and network sizes for the WUs in the GPU queue are rather small, especially for higher-end cards. This can lead to "less than 100%" utilization on the GPU. It's not that the GPU isn't busy; it's that it spends far more time moving data into and out of the GPU than it does doing the matrix multiplications (which is what shows up as utilization). We're looking at ways to better utilize higher-end GPUs for DS4 (in development), which is a CNN rather than an RNN, so it should take better advantage of GPUs. Even with these inefficiencies, GPU training is still faster than CPU training (although CPU training is no slouch!). This seems to be especially an issue on Windows; Linux seems to do a better job overall of keeping the GPU busy, but not by much.
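
To illustrate the effect (a rough PyTorch sketch, not our actual training code; the network, batch sizes, and shapes here are placeholders), per-step overhead from host-to-device copies and kernel launches dwarfs the compute when the batch is small, and reported GPU utilization drops accordingly:

    import time
    import torch

    device = torch.device("cuda")
    # Placeholder RNN; the real DS1/DS2 networks are different.
    model = torch.nn.LSTM(input_size=64, hidden_size=128, num_layers=2).to(device)

    def avg_step_time(batch_size, steps=50):
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(steps):
            # Fresh CPU tensors each step force a host-to-device copy,
            # like a task streaming its training data would.
            x = torch.randn(32, batch_size, 64)      # (seq_len, batch, features)
            out, _ = model(x.to(device))             # copy + forward pass on the GPU
            out.sum().backward()                     # backward pass
        torch.cuda.synchronize()
        return (time.time() - start) / steps

    print("small batch :", avg_step_time(16))    # mostly transfer/launch overhead -> low GPU load
    print("large batch :", avg_step_time(512))   # compute dominates -> higher GPU load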

As for the validation error, this is a huge problem for us at the moment. ML training is, in general, a randomly-guided search for an optimal solution. Sometimes that search leads down paths that push the bounds of floating-point representation, which can lead to NaNs (invalid numbers). When this happens there's really no way to recover[1], and the result is invalid. Now, why not just give you credit anyway, since you did the work and shouldn't be penalized for getting unlucky? We agree. The problem is that we can't (yet) tell the difference between someone who did the computation and got an unlucky, invalid result (who deserves credit) and someone cheating by returning a random result to game the credit system. Even worse, if we detect the condition, return early, and mark the result as valid, BOINC's resource management gets screwed up: it thinks you completed the full work much faster than you did and starts mismanaging your resources, so you'll get results ending with TIMEOUT because it assumes your machine is much faster than it is (BOINC accounting is... very powerful and very convoluted). Overall, our validation error rate is under 0.5%, so it hasn't been too much of an issue.
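
To make that concrete, here's the kind of check a training loop can do (a simplified sketch, not our actual client or validator code):

    import torch

    def training_step(model, optimizer, loss_fn, x, y):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        # Once the loss goes NaN/Inf, every later update is garbage; the only
        # options are to abort (and fail validation) or re-kick the optimizer.
        # Returning early here is exactly what confuses BOINC's accounting.
        if not torch.isfinite(loss):
            raise RuntimeError("loss is no longer finite; this result would fail validation")
        loss.backward()
        optimizer.step()
        return loss.item()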

Additionally, certain network constructs are more susceptible to this behaviour than others, and the work in the GPU queue at the moment is filled with some of those networks. As we finish up DS1/DS2, the remaining Parity networks are the hardest ones to search for; that's why they're still not complete. They also have a much higher rate of these wayward searches, up to 5%. That's too high, and we haven't fixed it yet. We need to do something about it, probably by just granting credit for NaN networks and accepting that there's a chance people can cheat, at least until those networks are cleared out. The CPU queue doesn't have this problem to the same extent.

This is a very long answer; the TL;DR is: you've done everything right, the random nature of ML training makes this hard, and we'll attempt to tweak the validation criteria in the next few days so that your result at least gets credit.

Thanks for volunteering and I hope this hasn't scared you off from the project!



[1] To make matters even more complicated, if the task is suspended/resumed while it's stuck in a NaN state, that re-kicks the optimizer and the WU can (sometimes) recover and start searching valid values again. We may be able to update the client to do this automatically, but it is not implemented yet.
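
Roughly, a resume amounts to something like this (a hypothetical sketch; the file name, optimizer, and learning rate are placeholders, not what the client actually uses):

    import torch

    def restart_from_checkpoint(model, path="last_good.pt", lr=1e-3):
        # Reload the last finite weights and build a fresh optimizer; the fresh
        # optimizer loses its accumulated state (e.g. Adam's moment estimates),
        # which is the "re-kick" that sometimes lets the search recover.
        model.load_state_dict(torch.load(path))
        return torch.optim.Adam(model.parameters(), lr=lr)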
ID: 1164
erich56

Joined: 22 Apr 21
Posts: 12
Credit: 2,876,582
RAC: 0
Message 1165 - Posted: 22 Apr 2021, 17:58:59 UTC - in response to Message 1164.  

hello pianoman,

Many thanks for your thorough explanations.
It seems the invalid task may have been a one-off anyway; I haven't had any others so far :-) So let me just keep my fingers crossed!
Best regards, Erich
ID: 1165


©2024 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)