All my GPU applications have crashed.

Sid
Joined: 22 Aug 20
Posts: 7
Credit: 18,815,424
RAC: 16,743
Message 727 - Posted: 28 Oct 2020, 13:37:46 UTC

The error message is the same:

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process got signal 11</message>
<stderr_txt>
DEBUG: Args: ../../projects/www.mlcathome.org_mlcathome/mldstest_9.72_x86_64-pc-linux-gnu__cuda_fermi -a LSTM -w 64 -b 2 -s 32 --lr 0.001 --maxepoch 192 --device 0
nthreads: 1 gpudev: 0
Re-exec()-ing to set number of threads correctly...

</stderr_txt>
]]>

Hardware:
GenuineIntel
Intel(R) Xeon(R) CPU L5640 @ 2.27GHz [Family 6 Model 44 Stepping 2]
(24 processors)

Linux Mint 20.
Nvidia driver 450.80.02
Nvidia gtx 750TI

Gunnar Hjern
Joined: 12 Aug 20
Posts: 20
Credit: 22,054,115
RAC: 21,401
Message 733 - Posted: 29 Oct 2020, 0:16:40 UTC - in response to Message 727.  

Same here!

<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process got signal 11</message>
<stderr_txt>
DEBUG: Args: ../../projects/www.mlcathome.org_mlcathome/mldstest_9.72_x86_64-pc-linux-gnu__cuda_fermi -c -a LSTM --lr 0.001 -w 64 -b 2 -s 32 --maxepoch 192 --device 0
nthreads: 1 gpudev: 0
Re-exec()-ing to set number of threads correctly...

</stderr_txt>
]]>

(signal 11 == segfault)
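The number-to-name mapping can be checked from Python's standard `signal` module. This is only a small sketch, assuming a Linux host (signal numbers are platform-specific); the helper name `describe_signal` is made up for illustration.

```python
import signal

def describe_signal(number: int) -> str:
    """Return the symbolic name and the system's description of a signal number."""
    sig = signal.Signals(number)        # raises ValueError for an unknown number
    text = signal.strsignal(number)     # human-readable description (Python 3.8+)
    return f"{sig.name}: {text}"

if __name__ == "__main__":
    print(describe_signal(11))
```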

I have run the nvidia-smi command, and from its output I conclude that my Nvidia driver is 390.116.
OS: Xubuntu 18.04
GPU: GTX750Ti

//Gunnar

pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Jun 20
Posts: 405
Credit: 10,650,439
RAC: 40,338
Message 738 - Posted: 29 Oct 2020, 6:17:30 UTC

Known issue, something with libhdf5 of all things. Working on a fix.

pianoman [MLC@Home Admin]
Message 740 - Posted: 31 Oct 2020, 6:55:56 UTC

Minor update on this:

I'm fairly certain the issue isn't in our app code: I placed an infinite loop before the part that's crashing so I could attach a debugger, and it still crashed while sitting in the infinite loop. That makes me suspect the BOINC client is killing it for some reason.

Still working it.

Sid
Message 742 - Posted: 31 Oct 2020, 10:35:50 UTC - in response to Message 740.  

Thank you for the update.
As far as I can see, the error is "process got signal 11", which usually means: "A signal 11 error, commonly known as a segmentation fault, means that the program accessed a memory location that was not assigned."
From my point of view, it is unlikely that the BOINC client would kill the process this way.
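One detail worth keeping in mind here: a parent process only sees that its child died from signal 11, not why. The sketch below (plain Python on a POSIX system, not a reproduction of the MLC@Home crash) starts a child that sends SIGSEGV to itself with `kill()`, so no bad memory access is involved, and the parent still reports "signal 11". The names `run_and_report` and `SELF_SIGNAL` are hypothetical.

```python
import signal
import subprocess
import sys

def run_and_report(child_code: str) -> str:
    """Run a Python child process and describe how it ended."""
    proc = subprocess.run([sys.executable, "-c", child_code])
    if proc.returncode < 0:
        # On POSIX, subprocess reports death-by-signal as a negative return code.
        return f"process got signal {-proc.returncode}"
    return f"process exited with status {proc.returncode}"

# The child delivers SIGSEGV to itself via kill(); nothing touches invalid memory.
SELF_SIGNAL = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"

if __name__ == "__main__":
    print(run_and_report(SELF_SIGNAL))
```

So the message alone does not distinguish a genuine invalid memory access from a signal delivered from elsewhere.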

floyd
Joined: 24 Jul 20
Posts: 30
Credit: 3,485,605
RAC: 0
Message 743 - Posted: 31 Oct 2020, 11:27:18 UTC - in response to Message 742.  

Not intentionally, but as far as I know the BOINC client and the science application communicate through shared memory. If the application tried to communicate after the memory had already been released for some reason, I think a segfault could (or should?) happen. Maybe the client thought the application was already dead, intentionally or not. Just a thought. One could look at the BOINC log around the time when the crash happened, maybe with some additional log flags on. app_msg_* and heartbeat_debug look interesting.
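For anyone who wants to try those flags, here is a minimal sketch that generates a `cc_config.xml` with them switched on. It assumes that "app_msg_*" refers to the `app_msg_send` and `app_msg_receive` log flags; that reading, and the helper name `build_cc_config`, are assumptions, so check the BOINC client configuration documentation before relying on it. The file would go in the BOINC data directory, and the client has to re-read its config files afterwards.

```python
import xml.etree.ElementTree as ET

# Assumed expansion of "app_msg_*" plus the heartbeat flag named above.
DEBUG_FLAGS = ["app_msg_send", "app_msg_receive", "heartbeat_debug"]

def build_cc_config(flags):
    """Build a cc_config.xml document that switches the given log flags on."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    print(build_cc_config(DEBUG_FLAGS))
```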

I could offer a 750 Ti for testing but the system has the 418.x driver installed and I would like to keep it that way. That means CUDA 10.1 and CC 5.0, but I'm not sure if that is worth a try as I got the impression that CUDA 10.2 is required.

pianoman [MLC@Home Admin]
Message 745 - Posted: 31 Oct 2020, 15:19:19 UTC - in response to Message 743.  

I suspect the issue might be with conflicting signals and signal handlers. SEGV just means that something accessed memory it shouldn't have.
It seems BOINC on Linux uses SIGALRM for things. If PyTorch/CUDA *also* wants to use SIGALRM and overwrote the BOINC SIGALRM handler with its own version, then when BOINC sent a SIGALRM it would trigger the wrong handler, which might not be prepared for that. And it wouldn't matter what the main code is doing.
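The overwrite scenario can be illustrated with a small Python stand-in. This is only a sketch: the real handlers would be C code inside the BOINC runtime and the libraries, and `client_handler` / `library_handler` are hypothetical names. It installs one SIGALRM handler, replaces it with a second, raises the signal, and records which one ran.

```python
import signal

calls = []

def client_handler(signum, frame):
    """Stand-in for the handler the BOINC runtime would install."""
    calls.append("client")

def library_handler(signum, frame):
    """Stand-in for a handler that a library installs later."""
    calls.append("library")

def demo_overwrite():
    """Install two SIGALRM handlers in turn and report which one fires."""
    calls.clear()
    original = signal.signal(signal.SIGALRM, client_handler)
    # signal.signal() returns the handler it replaced, so the new handler
    # could chain to it; here it is dropped, as in the suspected scenario.
    replaced = signal.signal(signal.SIGALRM, library_handler)
    try:
        signal.raise_signal(signal.SIGALRM)
    finally:
        signal.signal(signal.SIGALRM, original)  # put back what was there before
    return replaced is client_handler, list(calls)

if __name__ == "__main__":
    print(demo_overwrite())
```

Only the second handler runs; the first never sees the signal. One common way to avoid this is for the later handler to call the one it replaced.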

Development is fun!

pianoman [MLC@Home Admin]
Message 761 - Posted: 5 Nov 2020, 6:32:57 UTC

I think I got it. It took some doing: compiling pytorch/cuda from source takes nearly three hours, which is why it's taken a few days of trial and error (make a change, wait 3 hours, try, repeat; nvcc doesn't like ccache, apparently). But the Linux client now appears to at least get past the "sig11 within 2 seconds of starting" issue.

The issue appears to be something related to MKL_DNN and threading. Once again, compiling pytorch from source and removing Intel's special performance library makes things work as intended (we had a similar issue with older CPUs on the CPU client).

I'm going to run a few more tests before pushing an update for all to mldstest sometime tomorrow (hopefully). Stay tuned.

I'm also going to be setting up proper app plans to filter hosts correctly. I'm going to set a minimum of CUDA v10.2 (at least driver version 440+) and a card with compute capability 3.5 or higher. I'm also going to require GLIBC 2.27 or higher (Ubuntu 18.04+, RHEL/CentOS 8+) for the CUDA clients.

Thanks for your patience.

Sid
Message 762 - Posted: 5 Nov 2020, 11:33:05 UTC - in response to Message 761.  

Thank you for the update. Looking forward to running it.

pianoman [MLC@Home Admin]
Message 769 - Posted: 8 Nov 2020, 23:18:49 UTC - in response to Message 762.  

OK, we ran into a few more issues, but we pushed out a new mldstest 9.75 CUDA app a few minutes ago.
I'm also going to release a new batch of GPU testing WUs.

A few things about this:


  • We've created a new "AppPlan" that should only send you WUs if you meet the minimum requirements (see below).
  • The new Linux app is still quite large.
  • We're now compiling pytorch/cuda from source. That may lead to more issues; please report any you see.



As for requirements, here's what you need to get GPU WUs:


Linux/CUDA


    • 64-bit CPU with SSE support
    • NVIDIA card w/ compute capability 3.5 or greater
    • nvidia binary driver version 440+
    • GLIBC 2.27+ (Ubuntu 18.04 equivalent)



Windows/CUDA


    • Windows 10 64-bit
    • NVIDIA card w/ compute capability 3.5 or greater
    • CUDA 10.2+
      ∘ nvidia driver version 440+



It is quite possible we have some bugs here. If you were getting (and successfully crunching!) WUs before and aren't anymore, please let us know.
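To check a host against the Linux/CUDA list above by hand, the comparison can be sketched as below. This is an illustration only, not the project's actual AppPlan logic; the function names are hypothetical, the SSE requirement is not checked, and the glibc versions in the example calls (2.27 for an Ubuntu 18.04 based system, 2.31 for Linux Mint 20) are assumptions about those distributions.

```python
def parse_version(text):
    """Turn '450.80.02' into (450, 80, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))

# Minimums quoted in the post for the Linux/CUDA app.
MIN_DRIVER = (440,)
MIN_COMPUTE_CAPABILITY = (3, 5)
MIN_GLIBC = (2, 27)

def linux_cuda_problems(driver, compute_capability, glibc):
    """Return the unmet requirements; an empty list means the host qualifies."""
    problems = []
    if parse_version(driver) < MIN_DRIVER:
        problems.append(f"driver {driver} is older than 440")
    if parse_version(compute_capability) < MIN_COMPUTE_CAPABILITY:
        problems.append(f"compute capability {compute_capability} is below 3.5")
    if parse_version(glibc) < MIN_GLIBC:
        problems.append(f"GLIBC {glibc} is older than 2.27")
    return problems

if __name__ == "__main__":
    # GTX 750 Ti (compute capability 5.0) with the two driver versions
    # reported earlier in this thread.
    print(linux_cuda_problems("390.116", "5.0", "2.27"))
    print(linux_cuda_problems("450.80.02", "5.0", "2.31"))
```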

pianoman [MLC@Home Admin]
Message 770 - Posted: 9 Nov 2020, 2:53:54 UTC

nvidia driver 450 might be an issue.

I'm seeing a few hosts with something like "CUDA error: forward compatibility was attempted on non supported HW". This appears to be a bug in certain versions of the 450 driver; see https://forums.developer.nvidia.com/t/forward-compatibility-runtime-error-after-installing-cuda-11-0/128503/4. Try updating to a newer version of the driver (even a later 450 driver) if this happens. It apparently can also be a driver/version mismatch; see https://github.com/pytorch/pytorch/issues/40671.
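To see which driver a host is actually running, something like the following sketch could be used. It assumes `nvidia-smi` is on the PATH and returns `None` otherwise; the function names are hypothetical, and the code only reports the driver branch. It does not know which 450 releases are affected.

```python
import shutil
import subprocess

def installed_driver_version():
    """Ask nvidia-smi for the driver version; return None if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0 or not result.stdout.strip():
        return None
    # nvidia-smi prints one line per GPU; the driver is the same for all of them.
    return result.stdout.splitlines()[0].strip()

def driver_branch(version):
    """Return the major branch number, e.g. 450 for '450.80.02'."""
    return int(version.split(".")[0])

if __name__ == "__main__":
    version = installed_driver_version()
    if version is None:
        print("nvidia-smi not available")
    else:
        print(f"driver {version} (branch {driver_branch(version)})")
```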

floyd
Message 773 - Posted: 9 Nov 2020, 10:53:49 UTC - in response to Message 769.  

"OK, ran into a few more issues but we've pushed out a new mlds test 9.75 cuda app a few minutes ago.
I'm also going to release a new batch of GPU testing WUs."
Looks like a few hosts quickly sucked up all the available tasks. You may want to temporarily set up additional restrictions, such as a limit on the number of tasks per host, to get a broader distribution.

pianoman [MLC@Home Admin]
Message 774 - Posted: 9 Nov 2020, 14:59:19 UTC - in response to Message 773.  

New test tasks are flowing, and CPU tasks have been deprecated in testing.

Assuming no regressions with the Windows client, I'll at least move that into the new, non-test "mlds-gpu" app and graduate it from testing to production. I want to let the Linux CUDA client bake a little longer.

It's the first time a Windows client has gone smoother than a Linux one. Weird times, 2020.

bozz4science
Joined: 9 Jul 20
Posts: 113
Credit: 3,900,296
RAC: 30,996
Message 776 - Posted: 9 Nov 2020, 15:20:16 UTC - in response to Message 774.  

Thanks for the update. Looking forward to trying out the GPU version, as I haven't caught any GPU test tasks so far. Will try on a Linux (Ubuntu 20.04 LTS) and a Windows-based machine with my 750 Ti. Curious what performance increase I will be able to observe.

Jim1348
Joined: 12 Jul 20
Posts: 35
Credit: 51,968,353
RAC: 75,285
Message 777 - Posted: 9 Nov 2020, 15:53:35 UTC - in response to Message 774.  

No luck on a GTX 1060 under Ubuntu 20.04.1.

I started with the 450 drivers that came with the OS. But when the tasks all failed, I upgraded to 455 (CUDA 11.1) from the ppa:graphics-drivers/ppa repository. They all failed too.

Dataman
Joined: 1 Jul 20
Posts: 32
Credit: 22,436,564
RAC: 0
Message 778 - Posted: 9 Nov 2020, 16:13:40 UTC

I am seeing success here. Only one problem with a Linux machine which is on my end (driver).

So far, I am running:

Windows:

1 x RTX 2080Ti
4 x GTX 1080Ti
2 x GTX 1080
1 x GTX 1070
1 x GTX 1060
1 x GTX 750Ti [if it runs on this ancient card it will run on anything ;) ]

Linux:

1 x GTX 980Ti (I need to install a new driver; will do later.)

Running multiple GPUs now appears to work. Project initiation time is >10 min (for me), but new work download time is nominal.

Good work, pianoman & Team.

pianoman [MLC@Home Admin]
Message 781 - Posted: 9 Nov 2020, 18:25:41 UTC - in response to Message 777.  

I wonder if a lot of the issues we see are with people running cuda 11 instead of 10. I'll need to go back and check the data later tonight.

Thanks again for testing.

bozz4science
Message 783 - Posted: 9 Nov 2020, 21:46:33 UTC - in response to Message 778.  
Last modified: 9 Nov 2020, 21:48:37 UTC

"GTX 750Ti [if it runs on this ancient card it will run on anything ;)]"
I wouldn't write off a GTX 750 Ti altogether, however. Even though it is arguably ancient technology, at $30 apiece second-hand on eBay and a 60 W TDP it still performs rather solidly, and it is a great solution for SFF cases or cases with bad airflow, as the low TDP tends to keep the card on the cool end anyway. With the recent CUDA support that has been rolled out at Folding@Home on some cores, specific workloads such as the weekly sprint WUs finish considerably faster. It manages ~24 WUs a day, which equates to ~250k credits. On GPUGrid this card can also manage ~120k credits. I'd say this is still some science done at the end of the day, even if it is not the most efficient card out there now. Compared to modern cards, that is of course nothing, but I tend to look at projects that offer both a CPU and a GPU app version for comparison, and my GTX 750Ti always wins performance- and efficiency-wise against the 95 W Xeon X5660 that accompanies it.

Anyway, I just hope that the new RTX 30xx Ampere and Radeon RX 6000 series cards will push prices of last-generation cards further down. I'm still looking for a GPU upgrade myself but haven't figured out yet what offers the best value. RTX 2060/1660Ti cards seem rather affordable right now, as do the lower-end new Ampere cards. I don't know about the Radeon cards, as I would likely miss CUDA for some projects.

Will run the first GPU WUs soon and I am already excited to see the results.

Dataman
Message 784 - Posted: 9 Nov 2020, 22:12:26 UTC - in response to Message 783.  
Last modified: 9 Nov 2020, 22:33:12 UTC

It's sloooooow. The only reason it has not been pitched into the recycle bin is that it is in an Alienware R2 with a wimpy, non-modular PSU. Upgrading the PSU to a bigger, modular one so I can upgrade the GPU has not made the top of the "to do" list. I rarely crunch with it but thought it would be worth testing for this project. It is working OK but takes nearly 2 hours.
Cheers.

@ Pianoman: I upgraded the driver on the Linux machine with the GTX 980Ti but have a 100% failure rate (as opposed to Win10 with a 100% success rate). I need to do more investigation. Anyone else having problems with Linux/GPU?

bozz4science
Message 785 - Posted: 9 Nov 2020, 22:21:13 UTC
Last modified: 9 Nov 2020, 22:38:14 UTC

My first rand GPU WU just finished crunching a few minutes ago at 1,218.45 sec, compared to roughly 10.5 hrs for the CPU version. That's roughly 31 times faster, if they are the same length.
https://www.mlcathome.org/mlcathome/workunit.php?wuid=1259521
https://www.mlcathome.org/mlcathome/result.php?resultid=2796956
It was assigned to me after it threw an error almost immediately on 2 prior hosts, but mine finished without even a hiccup. GPU loaded on average at 75% (range: 62-79%), compute load at ø 72% (60-75%), power limit at ø 48% (42-54%), fans at 1299 RPM or 37%, mem load at 38% constantly, bus interface at ø 18% (16-19%), frame buffer at ø 15% (4-17%). Thought it'd be interesting to share. This is what I could read off the MSI afterburner log.

But anyway, gotta hand it to you. Yeah, it's painfully slow. That's why I'm considering the upgrade and am still looking into my options.

Edit: I just saw a couple of GPU test WUs on your computers, and somehow this seems wrong. Your 1080 Ti averaged ~1,350 sec for the GPU 9.75 Windows CUDA version. How, or rather why? My next WU is right on track to deliver another ~1,250 sec runtime. Update: it did indeed finish at 1,214 sec, so rather reliably, computing one epoch every ~6 sec.
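The figures quoted in this post can be reproduced with a few lines of arithmetic. The epoch count of 192 is an assumption, taken from the `--maxepoch 192` argument in the test-app command lines earlier in the thread; this particular WU may have used a different value.

```python
CPU_HOURS = 10.5             # approximate CPU runtime quoted in the post
FIRST_GPU_SECONDS = 1218.45  # first GPU work unit
SECOND_GPU_SECONDS = 1214.0  # second GPU work unit
ASSUMED_EPOCHS = 192         # assumption: --maxepoch from the earlier test-app args

def speedup(cpu_seconds, gpu_seconds):
    """How many times faster the GPU run is than the CPU run."""
    return cpu_seconds / gpu_seconds

def seconds_per_epoch(total_seconds, epochs):
    """Average wall-clock time spent on one epoch."""
    return total_seconds / epochs

if __name__ == "__main__":
    print(f"speedup: {speedup(CPU_HOURS * 3600, FIRST_GPU_SECONDS):.1f}x")
    print(f"per epoch: {seconds_per_epoch(SECOND_GPU_SECONDS, ASSUMED_EPOCHS):.1f} s")
```

Under that assumption the result is about 6.3 seconds per epoch, which is consistent with the "~6 sec" estimate.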

Here's the second valid WU: https://www.mlcathome.org/mlcathome/result.php?resultid=2796261 For context, the GPU is overclocked at 1361 MHz core and 2820 MHz mem clock and paired with a Xeon X5660. But I still cannot figure out how mine could run at 1/4 of what your 750Ti apparently delivered... Completely clueless here.

©2021 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)