Parity Modified GPU WUs often fail
| Author | Message |
|---|---|
|
Joined: 9 Jul 20 Posts: 142 Credit: 11,536,204 RAC: 3 |
The rand automata GPU tasks have finished and validated reliably for about a week, apart from a few WUs I had to abort because the initial dataset loading (handled exclusively by the CPU) got stuck and the WU never started. The story is different for the new dataset 1+2 GPU versions. I now see rather strange behaviour: the Parity Modified WUs start reliably but crash after a very short runtime of only 30-70 sec. For reference, the usual average runtime of a rand GPU WU on this GPU (2 WUs simultaneously) was ~1,730 sec, so this seems very odd to me. Most of the WUs that failed on my host had also crashed on prior hosts. Unfortunately, I can't make much sense of the error message. They all basically read like this, and the error "-529697949 (0xE06D7363) Unknown error code" doesn't help much in understanding the behaviour:

[2020-11-19 19:51:03 load:106] : INFO : Successfully loaded dataset of 512 examples into memory.
[2020-11-19 19:51:03 main:494] : INFO : Creating Model
[2020-11-19 19:51:03 main:507] : INFO : Preparing config file
[2020-11-19 19:51:03 main:519] : INFO : Creating new config file
[2020-11-19 19:51:04 main:559] : INFO : Loading DataLoader into Memory
[2020-11-19 19:51:05 main:562] : INFO : Starting Training
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x00007FFF3BC73B29
Engaging BOINC Windows Runtime Debugger...

See f.ex. WU 1302119, WU 1302124, WU 1305925, WU 1306231, WU 1306653.

As some of them got validated on other hosts, I wonder what it is on my system that causes them to fail. I didn't change any settings in the meantime: the same project's CPU tasks are running in the background, the GPU is running at the same clock speeds, etc. And some of the Parity Modified WUs have validated on my system so far; they took roughly 2.7x as long (4,800 sec) at 2.2x the credit. Given this, the fact that some validate while others error out immediately has me stumped. I also don't know whether others see this error as well or if it is just me... Any idea what might cause this? |
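For anyone puzzled by the error code itself, the snippet below is only an illustration (not MLC@Home code): -529697949 and 0xE06D7363 are the same 32-bit value, and 0xE06D7363 is simply the generic Windows/MSVC code raised for any uncaught C++ exception (the low bytes spell "msc"). The informative part of the report is therefore the "Out Of Memory" text, i.e. an allocation failure, not the numeric code.

```cpp
// Minimal sketch: why "-529697949" and "0xE06D7363" are the same value, and why
// the code itself carries little information. 0xE06D7363 is the SEH code MSVC
// raises for *any* uncaught C++ exception; the low three bytes spell "msc".
#include <cstdint>
#include <cstdio>

int main() {
    const uint32_t seh_code = 0xE06D7363u;  // MSVC "C++ exception" SEH code
    std::printf("unsigned: 0x%08X\n", seh_code);                        // 0xE06D7363
    std::printf("signed  : %d\n", static_cast<int32_t>(seh_code));      // -529697949
    std::printf("ASCII   : %c%c%c\n",                                   // m s c
                static_cast<int>((seh_code >> 16) & 0xFF),
                static_cast<int>((seh_code >> 8) & 0xFF),
                static_cast<int>(seh_code & 0xFF));
    return 0;
}
```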
|
Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
It's a little counter-intuitive, but Dataset 1+2 WUs use significantly more memory on the GPU than Dataset 3 WUs. This is true on the CPU too, but the GPU case exacerbates the issue. |
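A rough illustration of why this can happen, using entirely hypothetical fully connected layer sizes (not the project's actual model definitions): during training, every layer's output is kept for the backward pass, so a deeper stack can need noticeably more memory than a wider one even at a similar parameter count.

```cpp
// Back-of-the-envelope estimate (hypothetical sizes): parameters + gradients +
// stored activations for one batch, for a wide/shallow vs a narrow/deep stack.
#include <cstddef>
#include <cstdio>
#include <vector>

static double est_floats(const std::vector<int>& widths, int batch) {
    double params = 0.0, activations = 0.0;
    for (std::size_t i = 1; i < widths.size(); ++i) {
        params      += double(widths[i - 1]) * widths[i] + widths[i]; // weights + biases
        activations += double(widths[i]) * batch;                     // kept for backprop
    }
    return 2.0 * params + activations; // params + grads + activations
}

int main() {
    const int batch = 512;
    std::vector<int> wide_shallow = {128, 1024, 1024, 16}; // few, wide layers (~1.2M params)
    std::vector<int> narrow_deep(73, 128);                 // input + 72 hidden layers, width 128
    narrow_deep.push_back(16);                             // output layer (~1.2M params too)

    std::printf("wide/shallow: %.1f MB\n", est_floats(wide_shallow, batch) * 4 / 1e6);
    std::printf("narrow/deep : %.1f MB\n", est_floats(narrow_deep,  batch) * 4 / 1e6);
    // The deeper stack ends up needing roughly twice the memory here, despite a
    // comparable parameter count, because activation storage grows with depth.
    return 0;
}
```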
|
Joined: 22 Aug 20 Posts: 7 Credit: 18,867,632 RAC: 0 |
My two cents:

1. I had two 750 Ti video adapters in one box and they were working just fine. For some reason I replaced both of them with two GTX 770s, and all GPU tasks have failed after roughly 15 seconds with this error message. Both the 750 Ti and the 770 have 2 GB of video memory, but the GTX 770 has more CUDA cores. Do you mean that a GTX 770 needs, say, twice as much memory because it has double the number of CUDA cores compared with the 750 Ti?
2. Is it possible to make the allocation conditional, or catch bad_alloc, for example? In that case a clearer message like "This video card does not have enough memory. Put it in your scrap box and go buy another one." could be written to the log (a sketch of the idea follows below). |
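On point 2, something along these lines is indeed possible. The sketch below is only an illustration of the idea, not the MLC@Home client's actual code, and kRequiredBytes is a made-up placeholder for whatever a task really needs: check free VRAM up front with cudaMemGetInfo and wrap the run in catch handlers, so an out-of-memory condition is logged in plain language with a distinct exit code instead of surfacing as an unhandled 0xE06D7363 exception.

```cpp
// Illustrative guard against GPU out-of-memory conditions (not project code).
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>
#include <exception>
#include <new>

// Hypothetical per-task requirement; the real figure depends on the model/dataset.
constexpr size_t kRequiredBytes = 1100ull * 1024 * 1024;   // ~1.1 GB

int main() {
    size_t free_b = 0, total_b = 0;
    if (cudaMemGetInfo(&free_b, &total_b) != cudaSuccess) {
        std::fprintf(stderr, "ERROR: could not query GPU memory\n");
        return 1;
    }
    std::printf("GPU memory: %zu MB free of %zu MB\n", free_b >> 20, total_b >> 20);

    if (free_b < kRequiredBytes) {
        std::fprintf(stderr,
            "ERROR: this video card does not have enough free memory "
            "(%zu MB free, ~%zu MB needed). Reduce concurrent GPU tasks "
            "or use a card with more VRAM.\n",
            free_b >> 20, kRequiredBytes >> 20);
        return 2;   // distinct exit code the BOINC client can report
    }

    try {
        // ... create model, load data, train ...
    } catch (const std::bad_alloc&) {
        std::fprintf(stderr, "ERROR: ran out of host memory during training\n");
        return 3;
    } catch (const std::exception& e) {
        // GPU allocation failures typically arrive as exceptions derived from std::exception
        std::fprintf(stderr, "ERROR: training aborted: %s\n", e.what());
        return 4;
    }
    return 0;
}
```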
|
Joined: 9 Jul 20 Posts: 142 Credit: 11,536,204 RAC: 3 |
First of all, thanks for your feedback. I lowered the memory clock slightly, and fewer tasks seem to fail. I also checked the task properties of the currently running tasks, and it really does seem to be an issue with VRAM being close to 100% loaded. Surprisingly, I saw a couple of instances where 2 Parity/Parity Modified tasks actually finished together overnight; that's when VRAM hit 98% load (2 GB). The observed error rate has definitely improved overnight: the ratio was 14:2, with the 2 failed tasks erroring out almost immediately. What is interesting, though, is that at least on a card with low bandwidth and a small VRAM size, runtimes suffer some more if 2 Parity/Parity Modified tasks are running concurrently.

Interestingly, I see large discrepancies in the per-epoch training time (2 WUs of the same type are always running concurrently, so the ratio between the runtimes should be fairly expressive; a quick arithmetic cross-check follows after this post):

- rand automata: 8.9-9.0 sec/epoch, or ø 1,730 sec per WU (192 epochs)
- parity: 4.6-4.7 sec/epoch, or ø 4,820 sec per WU (1024 epochs)
- parity modified: 4.9-5.1 sec/epoch, or ø 5,190 sec per WU (1024 epochs)

I am glad that my intuition about the different network types seems to be correct: wider networks (dataset 3/rand) are indeed easier to learn than deeper networks (dataset 1+2/parity + parity modified). Depending on how reliable these short-term average estimates turn out to be, the average epoch training time for ds 1+2 sees an uplift of ~40-50% over the ds 3 WUs. It will be interesting to see where the long-term average numbers end up. |
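As a quick sanity check on the figures above, the snippet below (just the arithmetic from the post, using the midpoints of the quoted ranges) multiplies sec/epoch by the epoch count and compares the result with the reported average runtimes.

```cpp
// Seconds per epoch times epoch count should roughly reproduce the reported WU runtimes.
#include <cstdio>

int main() {
    struct { const char* name; double sec_per_epoch; int epochs; double reported; } wu[] = {
        {"rand automata",   8.95, 192,  1730.0},
        {"parity",          4.65, 1024, 4820.0},
        {"parity modified", 5.00, 1024, 5190.0},
    };
    for (const auto& w : wu) {
        const double est = w.sec_per_epoch * w.epochs;
        std::printf("%-16s %6.0f sec estimated vs %6.0f sec reported\n",
                    w.name, est, w.reported);
    }
    return 0;
}
```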
|
Joined: 9 Jul 20 Posts: 142 Credit: 11,536,204 RAC: 3 |
My WU stats quickly deteriorated overnight. For comparison:

- 4 errors with ~700 WUs
- 32 errors with ~830 WUs

To my surprise, I now saw many rand WUs fail as well. They failed, even though they had started training successfully, every time a DS 1/2 WU was selected to start running in tandem with an already running rand WU; it was just too much for the card to bear. Upon seeing this, I reverted to running only 1 WU at a time. CUDA compute load for those WUs now reaches 85-100% with an average of 92% on my card (65% for rand WUs), and I no longer see any errors, as VRAM is loaded at only 56-60% and kept well within its 2 GB capacity. So it wasn't exceeded by much before, but an overload of 5-10%, or just 200 MB beyond the VRAM's capacity, made those WUs fail. With average runtimes of 2,800 sec (2.65 sec/epoch) I do see a drop in throughput and credits, though, due to lower overall utilisation: ~2,650 credits/h as opposed to ~4,000 credits/h with 2 rand GPU WUs running concurrently at all times. |
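For reference, limiting BOINC to one GPU task at a time is usually done by setting <gpu_usage> to 1 in the project's app_config.xml. The sketch below is a separate, purely illustrative calculation (not part of BOINC or the MLC@Home app) of how many DS 1/2 tasks a card could host at once; the ~1.15 GB per-task footprint is only an estimate read off this post (one task loading a 2 GB card to 56-60%), which also explains why two such tasks overshoot 2 GB.

```cpp
// Illustrative concurrency sizing: how many DS 1/2 GPU tasks fit in VRAM,
// given an estimated per-task footprint and a small reserve for the driver.
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

int main() {
    const double per_task_gb = 1.15;   // estimated footprint of one DS 1/2 GPU WU
    const double reserve_gb  = 0.15;   // headroom for driver/display/other apps

    size_t free_b = 0, total_b = 0;
    if (cudaMemGetInfo(&free_b, &total_b) != cudaSuccess) {
        std::fprintf(stderr, "could not query GPU memory\n");
        return 1;
    }
    const double total_gb = double(total_b) / (1 << 30);
    const int fits = int((total_gb - reserve_gb) / per_task_gb);

    // On a 2 GB card this prints "at most 1 concurrent task", matching the post.
    std::printf("%.2f GB VRAM -> at most %d concurrent DS 1/2 task(s)\n",
                total_gb, fits < 1 ? 1 : fits);
    return 0;
}
```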
|
Joined: 12 Jul 20 Posts: 48 Credit: 73,492,193 RAC: 0 |
"To my surprise, I now saw many rand WU fail as well." It is not you. https://www.mlcathome.org/mlcathome/forum_thread.php?id=111&postid=886#886 |