WUs fail with error message "Out Of Memory"

alex

Joined: 4 Dec 20
Posts: 32
Credit: 47,319,359
RAC: 0
Message 919 - Posted: 8 Dec 2020, 21:22:25 UTC

This wu https://www.mlcathome.org/mlcathome/result.php?resultid=3201137
failed with
- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x00007FFC6C79D759

PC: Coprocessor NVIDIA GeForce GTX 1060 6GB (4095MB) driver: 457.51

So does this mean GPU memory or main memory? Main memory can be increased, GPU memory cannot.
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 920 - Posted: 9 Dec 2020, 1:32:50 UTC - in response to Message 919.  

The error indicates the system ran out of GPU RAM.

Each WU takes on the order of 1.6GB-1.9GB of GPU memory when computing, and we developed the CUDA app on a system with a 1650 with only 4GB of RAM, so your 1060 6GB should have plenty of memory headroom.
Were you running anything else graphics-intensive at the time, maybe a game? Or are you trying to run multiple WUs at the same time on one GPU? If so, you could easily run out of GPU memory in total.
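
As a rough illustration of that headroom, here is a minimal Python sketch of the arithmetic. It assumes the worst case of the 1.6GB-1.9GB range (taken as 1900 MB) and nominal card capacities; the function name and the `reserved_mb` allowance are hypothetical and not part of BOINC or the MLC@Home app.

```python
def max_concurrent_wus(vram_mb: int, per_wu_mb: int, reserved_mb: int = 0) -> int:
    """Number of WUs that fit in VRAM at once, given a per-WU footprint.

    reserved_mb is a hypothetical allowance for the desktop or other GPU apps.
    """
    usable_mb = vram_mb - reserved_mb
    if usable_mb <= 0 or per_wu_mb <= 0:
        return 0
    return usable_mb // per_wu_mb


# Upper end of the 1.6GB-1.9GB range quoted above, taken here as 1900 MB.
WORST_CASE_WU_MB = 1900

# Nominal capacities in MB (assumption); the usable amount is somewhat lower.
cards = {"4GB card (e.g. the 1650)": 4096, "GTX 1060 6GB": 6144}

for name, vram_mb in cards.items():
    count = max_concurrent_wus(vram_mb, WORST_CASE_WU_MB)
    print(f"{name}: up to {count} WU(s) at once")
```

Under these assumptions a single WU fits comfortably on either card, so an out-of-memory error on a 6GB card would point to something else occupying GPU memory at the same time.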

Hope that helps, and thanks for crunching!
alex

Joined: 4 Dec 20
Posts: 32
Credit: 47,319,359
RAC: 0
Message 928 - Posted: 9 Dec 2020, 11:40:56 UTC - in response to Message 920.  
Last modified: 9 Dec 2020, 11:41:43 UTC

The PC is a live backup, running nothing other than BOINC, with no special setup to increase GPU load. CPU load is around 96%. Everything is out of the box. But BOINC CPU WUs are also running, including Rosetta WUs, which are very, very memory hungry. Sometimes they are suspended with the status 'waiting for memory'. Knowing this behaviour is what prompted me to ask.

In the meantime I have 3 failed WUs on PC 5172 with the same message, and
one on PC 5173, which runs an NVIDIA GeForce GTX 750 Ti (2048MB) driver: 457.51 OpenCL: 1.2
https://www.dropbox.com/s/5wehq33maxx12d6/gpu-z1.PNG?dl=0
bozz4science

Joined: 9 Jul 20
Posts: 142
Credit: 11,536,204
RAC: 3
Message 929 - Posted: 9 Dec 2020, 12:08:40 UTC - in response to Message 928.  
Last modified: 9 Dec 2020, 12:09:07 UTC

I happened to see that error very often myself. My host went a bit rogue one night when the ds1+2 GPU WUs were deployed: my cc config file specified running 2 WUs on my 750Ti simultaneously, which triggered an error whenever the host didn't start 2 rand WUs together (two rand WUs stay well below the 2 GB of VRAM). Since then I have stayed at 1 task only, with a somewhat lower compute load. As the ds1+2 GPU WUs only load VRAM to about 52-54% on my 750Ti, the VRAM capacity wasn't exceeded by much, but that was still enough to make the already started rand WUs crash as soon as the ds1+2 WUs were read into GPU memory.
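
The numbers above can be checked with a small Python sketch. It assumes the 52-54% figure refers to the 2048 MB reported for the 750 Ti; the function name is hypothetical, and the footprint of a rand WU is left out because the thread does not state it.

```python
from typing import Sequence


def vram_headroom_mb(vram_mb: float, task_fractions: Sequence[float]) -> float:
    """VRAM left over when the given tasks run together.

    Each entry in task_fractions is one task's share of total VRAM
    (0.54 means 54%). A negative result means the card is over capacity.
    """
    return vram_mb * (1.0 - sum(task_fractions))


VRAM_MB = 2048      # GTX 750 Ti, as reported in the thread
DS_FRACTION = 0.54  # upper end of the 52-54% quoted above

# Two ds1+2 WUs side by side: negative headroom, i.e. over capacity.
print(vram_headroom_mb(VRAM_MB, [DS_FRACTION, DS_FRACTION]))

# One ds1+2 WU alone: what remains for any task scheduled next to it.
print(vram_headroom_mb(VRAM_MB, [DS_FRACTION]))
```

With these inputs, two ds1+2 WUs overshoot the card by roughly 164 MB, and a single ds1+2 WU leaves roughly 942 MB for whatever runs beside it, which is consistent with the capacity being exceeded only slightly.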

By the way, I saw similar behaviour when overclocking the GPU memory clock too aggressively: the tasks threw an error in the middle of the network training.

©2022 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)