GPU update

Questions and Answers : Unix/Linux : GPU update

Dataman
Joined: 1 Jul 20
Posts: 32
Credit: 22,436,564
RAC: 0
Message 691 - Posted: 25 Oct 2020, 13:25:57 UTC
Last modified: 25 Oct 2020, 13:29:20 UTC

Just to let you know there is progress. This is with a rand_automata/DS3 datafile. Want to get it into testing tonight or tomorrow.

"Use GPU's" should not default to "Yes". The option should be opt-in, not opt-out.
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 440
Credit: 13,636,524
RAC: 37,984
Message 694 - Posted: 25 Oct 2020, 16:02:16 UTC - in response to Message 691.  

Will look at fixing the GPU option. The option showed up as soon as we uploaded the mldstest Windows client (which is *huge*: 1GB in size).

So it's whatever the BOINC server defaults to now.
Dataman
Joined: 1 Jul 20
Posts: 32
Credit: 22,436,564
RAC: 0
Message 697 - Posted: 25 Oct 2020, 17:12:14 UTC - in response to Message 694.  

Will look at fixing the GPU option.


Fixed. Thanks!
Gunnar Hjern

Joined: 12 Aug 20
Posts: 21
Credit: 37,774,242
RAC: 189,638
Message 698 - Posted: 25 Oct 2020, 22:06:49 UTC - in response to Message 694.  

Will there also be a CUDA app for Linux soon?
//Gunnar
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 440
Credit: 13,636,524
RAC: 37,984
Message 699 - Posted: 25 Oct 2020, 22:31:32 UTC - in response to Message 698.  
Last modified: 26 Oct 2020, 5:52:43 UTC

Yes. The only reason it isn't out yet is that I have only one NVIDIA card, and I started with Windows thinking it would be the hardest.
Once I'm able to verify the GPU client works with BOINC/Windows (I know I can run it by hand just fine), I'll reboot into Linux and compile a Linux CUDA client.

I still have the ROCm version (Linux) too, but that will require a custom app plan, which I still haven't learned how to do.

Please note, there are going to be issues (which is why it's in mldstest). Here's a rundown of things I expect to be a problem:


  • The Windows CUDA client is a huge ~1GB download and uses 1.6GB on disk. Some of the bloat may be unnecessary, as I'm basically shipping about 500MB of CUDA 10.2 libs with the client that might already be on the system.
  • The Windows CUDA client is compiled against CUDA 10.2, so you'll need that and a driver that supports it (some information for Linux here: https://docs.nvidia.com/deploy/cuda-compatibility/index.html).
  • It needs compute capability 3.5 or higher, so that's higher-end Kepler or later. See Wikipedia for info on mapping card names to compute capability: https://en.wikipedia.org/wiki/CUDA
  • I'm unsure right now how the client behaves if any of the above requirements aren't met. It may crash, it may fall back to CPU mode, or it may do something completely different.
  • Memory usage also shoots up. While the CPU client needs ~300MB for DS3 WUs, the ROCm client needs ~760MB (similar to DS1/DS2), but I measured 1.8GB memory usage when running the client on Windows/CUDA.
  • Linux CUDA will likely require a newer build and a newer minimum glibc (an Ubuntu 16.04 base vs. the Ubuntu 14.04 base), since CUDA no longer supports 14.04.
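The capability check and CPU fallback described in the list above can be sketched as a small decision function. This is a sketch, not the project's actual client code: `choose_device` and its arguments are hypothetical names, with the real inputs coming from something like PyTorch's `torch.cuda.is_available()` and `torch.cuda.get_device_capability()`.

```python
# Hypothetical sketch of the "compute capability 3.5 or higher, else CPU"
# rule from the list above; not the actual MLC@Home client code.

def choose_device(cuda_available, compute_capability, min_capability=(3, 5)):
    """Pick "cuda" only when a usable GPU is present.

    cuda_available:      bool, e.g. torch.cuda.is_available()
    compute_capability:  (major, minor) tuple, e.g.
                         torch.cuda.get_device_capability(0),
                         or None when no device is present.
    """
    if cuda_available and compute_capability is not None:
        if compute_capability >= min_capability:  # tuple comparison
            return "cuda"
    return "cpu"

print(choose_device(True, (7, 5)))   # Turing-class card -> cuda
print(choose_device(True, (3, 0)))   # early Kepler, too old -> cpu
print(choose_device(False, None))    # no CUDA at all -> cpu
```

Whether the shipped client actually falls back like this is exactly the open question in the list; the sketch only shows the intended check.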



GPU support is messy, which is part of why it's taken so long.

Also, as for requests for Intel or OpenCL clients: we're limited by what PyTorch natively supports well, much like biology projects such as Rosetta@home are limited by AutoDock or whatever other common program they rely on. We're not hand-writing these clients from scratch. Currently, PyTorch supports CPU, CUDA, and ROCm, so that's what MLC@Home will support.

pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 440
Credit: 13,636,524
RAC: 37,984
Message 709 - Posted: 27 Oct 2020, 1:50:44 UTC

I've got good news and bad news.

The good news is that I have (finally) received the first test GPU WU served up by the project, and it's crunching now.

The bad news is that it uses up to 2GB of RAM and 2GB of disk space for DS3 WUs, which is well beyond the limits CPU crunching requires, so if anyone else gets a GPU WU, it will immediately error out with an "exceeded disk space" error. This leads to a complicated problem on my end....

Do I a) raise the limits on all existing WUs so they can run on a GPU or CPU, meaning pure CPU crunchers will suffer because the client will refuse to run unless it has 2GB free per WU (despite actually using only ~300MB when not using a GPU), making something like an RPi3 with 1GB of RAM impossible to schedule; or b) somehow create two sets of WUs, one for GPUs and one for CPUs, and figure out how to tell the BOINC server which is which (hint: that ain't easy)?
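For context, the per-WU limits in question live in each workunit's input template on the BOINC server. A sketch using the element names from the stock BOINC template format; the 2GB values are only illustrative:

```xml
<input_template>
    <workunit>
        <!-- both bounds are in bytes; raising them for GPUs (option a)
             would make every host, CPU or GPU, reserve this much -->
        <rsc_memory_bound>2000000000</rsc_memory_bound>
        <rsc_disk_bound>2000000000</rsc_disk_bound>
    </workunit>
</input_template>
```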

Option b) is really the only answer, but will require a bit more thought.

Perhaps it would be best to create a separate "mlds-gpu" application with a separate WU pool. Hmm.
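A separate application would mostly be an extra entry in the server's project.xml, registered with `bin/xadd`. A sketch only: the `mlds-gpu` name is the one floated above, and the user-friendly name is made up.

```xml
<boinc>
    <app>
        <name>mlds-gpu</name>
        <user_friendly_name>MLDS (GPU)</user_friendly_name>
    </app>
</boinc>
```

The GPU app versions and their WU pool would then be attached to this app, leaving the CPU pool's limits untouched.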
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 440
Credit: 13,636,524
RAC: 37,984
Message 710 - Posted: 27 Oct 2020, 5:41:02 UTC

The Linux CUDA client rolled out; not much success by the looks of the early returns: lots of library/driver mismatches.

Will let it run overnight and assess the damage tomorrow.
zombie67 [MM]
Joined: 1 Jul 20
Posts: 34
Credit: 25,497,808
RAC: 2,873
Message 711 - Posted: 27 Oct 2020, 5:41:29 UTC
Last modified: 27 Oct 2020, 5:46:08 UTC

This is what I am getting with my Linux machines. It doesn't seem like a memory/storage issue, but what do I know. :)

<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process got signal 6</message>
<stderr_txt>
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: forward compatibility was attempted on non supported HW
Exception raised from current_device at ../c10/cuda/CUDAFunctions.h:40 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7f7feff56eb9 in /tmp/.mount_mldsterozvD4/usr/bin/../lib/libc10.so)
frame #1: at::cuda::getCurrentDeviceProperties() + 0x175 (0x7f7fa2fb7355 in /tmp/.mount_mldsterozvD4/usr/bin/../lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x8ec3c (0x561fc54b1c3c in ../../projects/www.mlcathome.org_mlcathome/mldstest_9.70_x86_64-pc-linux-gnu__cuda_fermi)
frame #3: __libc_start_main + 0xe7 (0x7f7f9f177b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x8d55a (0x561fc54b055a in ../../projects/www.mlcathome.org_mlcathome/mldstest_9.70_x86_64-pc-linux-gnu__cuda_fermi)


</stderr_txt>
]]>

I haven't received a task for my Windows machines yet.

But if the RAM/storage requirements are significantly different, then yes, perhaps a different app is required, so that people can select one and/or the other.
Reno, NV
Team: SETI.USA
zombie67 [MM]
Joined: 1 Jul 20
Posts: 34
Credit: 25,497,808
RAC: 2,873
Message 712 - Posted: 27 Oct 2020, 5:49:53 UTC

Also, if you send out tasks that any resource is allowed to run, they'll get consumed by CPUs, and you won't get testing just on GPUs, assuming that's your goal for this batch.
Reno, NV
Team: SETI.USA
zombie67 [MM]
Joined: 1 Jul 20
Posts: 34
Credit: 25,497,808
RAC: 2,873
Message 714 - Posted: 27 Oct 2020, 7:35:46 UTC

This task worked on a Windows machine. It's a single-GPU machine; I wonder if that is why my dual-GPU Windows machine gives errors.

https://www.mlcathome.org/mlcathome/result.php?resultid=2592308
Reno, NV
Team: SETI.USA
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 440
Credit: 13,636,524
RAC: 37,984
Message 719 - Posted: 27 Oct 2020, 13:21:50 UTC

First, thanks again for testing.

According to this question on the NVIDIA forums, it might be related to the specific driver version you have installed (450?). I note that the one that worked uses a later driver (452):
https://forums.developer.nvidia.com/t/forward-compatibility-runtime-error-after-installing-cuda-11-0/128503/4. Proprietary drivers are fun!

Also, I'm fairly certain this app currently ignores multiple GPUs and only uses GPU 0. I haven't implemented that yet.
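For what it's worth, classic BOINC GPU apps are told which card to use via a `--device N` command-line argument (newer API versions pass it through the init data instead). Here is a hypothetical sketch of honoring that argument instead of hard-coding GPU 0; this is not the actual client code.

```python
# Sketch only: map BOINC's "--device N" argument to a torch-style device
# string instead of always using "cuda:0". Hypothetical helper, not the
# real MLC@Home client.
import sys

def assigned_device(argv):
    """Return e.g. "cuda:1" when BOINC passed "--device 1", else "cuda:0"."""
    if "--device" in argv:
        idx = argv.index("--device")
        if idx + 1 < len(argv):
            return "cuda:" + argv[idx + 1]
    return "cuda:0"  # current behavior: first GPU only

# In a real client this would be assigned_device(sys.argv).
print(assigned_device(["app", "--device", "1"]))  # -> cuda:1
print(assigned_device(["app"]))                   # -> cuda:0
```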
ProDigit

Joined: 20 Jul 20
Posts: 23
Credit: 968,634
RAC: 74
Message 731 - Posted: 28 Oct 2020, 17:22:00 UTC - in response to Message 709.  


The bad news is that it uses up to 2GB of RAM and 2GB of disk space for DS3 WUs, which is well beyond the limits CPU crunching requires, so if anyone else gets a GPU WU, it will immediately error out with an "exceeded disk space" error. This leads to a complicated problem on my end....

Do I a) raise the limits on all existing WUs so they can run on a GPU or CPU, meaning pure CPU crunchers will suffer because the client will refuse to run unless it has 2GB free per WU (despite actually using only ~300MB when not using a GPU), making something like an RPi3 with 1GB of RAM impossible to schedule; or b) somehow create two sets of WUs, one for GPUs and one for CPUs, and figure out how to tell the BOINC server which is which (hint: that ain't easy)?

Option b) is really the only answer, but will require a bit more thought.

Perhaps it would be best to create a separate "mlds-gpu" application with a separate WU pool. Hmm.

Option b. Most of my MLC units run on Atom boards that have only 2GB (1.5GB usable) shared between all 4 cores, and 8GB of remaining disk space.
CUDA also shouldn't be this big.
I was hoping you'd start with Intel, as most Intel GPUs sit unused (or are doing Collatz).
If GPU acceleration is slow, it would make no sense to have the big, heavy GPUs do the job; better to focus on the smaller ones.
On the other hand, if you can improve performance on big GPUs, getting them 90-100% utilized, then a separate pool is necessary, as those GPU systems usually do meet the RAM requirements.
swiftmallard
Joined: 23 Sep 20
Posts: 12
Credit: 6,674,304
RAC: 29,175
Message 732 - Posted: 28 Oct 2020, 19:42:54 UTC - in response to Message 731.  

I was hoping you'd start with Intel, as most intel GPUs are unused (or, doing the collatz).

I would love to use my onboard Intel GPU.
Gunnar Hjern

Joined: 12 Aug 20
Posts: 21
Credit: 37,774,242
RAC: 189,638
Message 734 - Posted: 29 Oct 2020, 0:24:43 UTC - in response to Message 719.  

Hi!

I got my first CUDA task now (task nr. 2629681), but it ended up in a SIGSEGV and a computation error (signal 11).

OS: Ubuntu 18.04
Nvidia driver: 390.116 ?? (got it via the "nvidia-smi" command)
GPU: GTX750Ti

I'm standing by for more tests! :-)

Good luck with the CUDA app!!!

//Gunnar
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 440
Credit: 13,636,524
RAC: 37,984
Message 735 - Posted: 29 Oct 2020, 2:14:35 UTC
Last modified: 29 Oct 2020, 2:39:20 UTC

Honestly, it's very bizarre. The segv isn't coming from the new CUDA code; it's coming from the training-data loading code, which is unmodified and runs without issue on every other client. It also only happens when run by the BOINC client, not when run by hand on the exact same system with the same input.

I can reproduce it here, but it's a real head-scratcher.

I won't be sending out more tests until I solve it.
zombie67 [MM]
Joined: 1 Jul 20
Posts: 34
Credit: 25,497,808
RAC: 2,873
Message 873 - Posted: 23 Nov 2020, 14:34:33 UTC

CUDA on Linux has never worked for me to date, but with 9.8 it is working now. Nice!
Reno, NV
Team: SETI.USA
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 440
Credit: 13,636,524
RAC: 37,984
Message 874 - Posted: 23 Nov 2020, 16:02:44 UTC - in response to Message 873.  

CUDA on Linux has never worked for me to date, but with 9.8 it is working now. Nice!


woohoo!

I'm going to post an update in a new thread to capture current status.
bozz4science

Joined: 9 Jul 20
Posts: 138
Credit: 8,773,476
RAC: 11,583
Message 875 - Posted: 23 Nov 2020, 16:08:08 UTC - in response to Message 874.  

Looking forward to testing on my system and verifying. Great accomplishment!

©2021 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)