GPU tasks runtime comparison

bozz4science

Joined: 9 Jul 20
Posts: 142
Credit: 11,533,376
RAC: 487
Message 819 - Posted: 12 Nov 2020, 14:58:53 UTC
Last modified: 12 Nov 2020, 15:01:29 UTC

As the other thread was getting cluttered with runtime comparisons of the tasks that didn't actually fail, I thought it would be better to collect them in a dedicated thread. Let me share my experience so far.
My system: https://www.mlcathome.org/mlcathome/show_host_detail.php?hostid=574
- CPU: Xeon X5660 @95W TDP
- GPU: GTX 750Ti@60W TDP
- OC setting: +120 MHz core + mem (1360 MHz core / 2820 MHz mem)
- OS: Win 10 (dual boot with Ubuntu 20.04 LTS) --> going to try this soon

Edit: in my initial wattage estimates I had forgotten to include the power of the dedicated CPU thread(s); the figures below account for it.

1 GPU / 1 thread + 5 threads CPU tasks (MLC):
ø runtime: 1,250 sec
ø compute load (GPU): 65%
ø power load (GPU): 48%
Wattage per task: 10W (GPU) + 5.5W (CPU) = 15.5W (total)

1 GPU / 2 threads + 4 threads CPU tasks (MLC)
--> just one thread loaded for GPU task
ø runtime: 1,100 sec
ø compute load (GPU): 85%
ø power load (GPU): 58%
Wattage per task: 10.6W (GPU) + 4.8W (CPU) = 15.4W (total)

1 GPU / 2 threads + 4 threads CPU tasks (TNGrid)
--> just one thread loaded for GPU task
ø runtime: 1,030 sec
ø compute load (GPU): 85%
ø power load (GPU): 58%
Wattage per task: 9.95W (GPU) + 4.5W (CPU) = 14.5W (total)

1 GPU / 2 threads + 4 threads CPU tasks (TNGrid)
--> just one thread loaded for GPU task
ø runtime: 1,000 sec
ø compute load (GPU): 85%
ø power load (GPU): 62%
OC setting: aggressive +150 MHz core and mem vs. 120 MHz prior
Wattage per task: 10.35W (GPU) + 4.4W (CPU) = 14.75W (total)

0.5 GPU / 1 thread + 0.5 GPU / 1 thread (F@H GPU task) + 3 threads CPU tasks (TNGrid)
ø runtime: 1,550 sec
ø compute load (GPU): 99%
ø power load (GPU): 70%
Wattage per task is less intuitive here: 16.8W (GPU combined) + 6.8W (CPU combined) = 23.6W (total combined). As the runtime is roughly 155% of the normal 1 GPU / 2 thread scenario, and the same F@H WUs take 2.5-3.0x longer here, I estimate the GPU compute load splits roughly 65:35 between MLC and F@H, which gives ~15.35W for the MLC GPU WU.
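
For anyone who wants to reproduce that estimate, a minimal sketch (the wattages are the measured figures above; the 65:35 split is itself only an assumption derived from the runtime ratios):

```python
# Rough per-WU wattage for two tasks sharing one GPU (figures from above).
gpu_watts_combined = 16.8  # combined GPU draw with both tasks running
cpu_watts_combined = 6.8   # combined draw of the dedicated CPU threads
mlc_share = 0.65           # assumed MLC share of the 65:35 load split

mlc_wu_watts = (gpu_watts_combined + cpu_watts_combined) * mlc_share
print(f"estimated MLC WU draw: {mlc_wu_watts:.2f} W")  # ~15.3 W, as above
```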

Comparison of runtime vs. efficiency
1 GPU / 1 thread + 5 threads CPU tasks (MLC) --> 1,250 sec @ 15.5W / 6.51 sec per epoch
1 GPU / 2 threads + 4 threads CPU tasks (MLC) --> 1,100 sec @ 15.4W / 5.73 sec per epoch
1 GPU / 2 threads + 4 threads CPU tasks (TNGrid) --> 1,030 sec @ 14.5W / 5.36 sec per epoch
1 GPU / 2 threads + 4 threads CPU tasks (TNGrid) (higher OC) --> 1,000 sec @ 14.75W / 5.21 sec per epoch
0.5 GPU / 1 thread + 0.5 GPU / 1 thread (F@H GPU task) + 3 threads CPU tasks (TNGrid) --> 1,550 sec @ ~15.35W / 8.07 sec per epoch

rand WU - CPU version, 1 thread each @ 6 concurrent tasks --> ~35,000 sec (rough estimate) @ 166.25W / 182.3 sec per epoch

Efficiency gain: ~166W / 15W ≈ 11x
Performance gain: 35,000 sec / 1,000 sec = 35x


It took me quite a while to compile these estimates, especially as I had to wait for enough WUs to finish and validate to get a representative average. I thought it'd be interesting to share these numbers with you and compare.

Just realized I posted this in the wrong forum, but I can no longer move it to the Café section of the discussion forum.
Jim1348

Joined: 12 Jul 20
Posts: 44
Credit: 71,449,633
RAC: 668
Message 820 - Posted: 12 Nov 2020, 15:55:37 UTC - in response to Message 819.  

It took me quite a while to compile these estimates, especially as I had to wait for enough WUs to finish and validate to get a representative average. I thought it'd be interesting to share these numbers with you and compare.

Wonderful. Efficiency first and speed second are just what I look for.
But we now have to make some trade-offs between projects for our CPU cores. At least now we have the data for Win10.
Thanks.
bozz4science

Joined: 9 Jul 20
Posts: 142
Credit: 11,533,376
RAC: 487
Message 821 - Posted: 12 Nov 2020, 16:43:49 UTC - in response to Message 820.  
Last modified: 12 Nov 2020, 17:01:41 UTC

Sure, definitely on the same page with you; efficiency is top priority for me as well. My curiosity was finally sparked enough to try running 2 WUs in parallel again, and I succeeded by dedicating 2 CPU threads to each task. (However, still only 1 of the 2 dedicated threads is loaded per task.) I don't know what to do with the results yet, as CPU resource allocation definitely is an issue with this setup. However, I see an increase not only in efficiency but also in throughput, as seen below.

2 CPUs / 0.5 GPU
- 100% compute load (short bursts down to 92/94% after every epoch cycle)
- ø 25% bus interface load
- ø 63% memory load
- ø ~60% power load (peak at 70%)
- ø 5% copy engine load (vs. 2% prior) --> more data handoff seems to be going on
- ø 1.5% 3D engine load (vs. <1% prior)
- same OC setting

Power consumption:
4.8 W (CPU) + 17 W (GPU) = 21.8 W (total)
--> W per WU = 10.9 W or 8.85 sec. per epoch

This compares to the best-case scenario of 1,000 sec per WU as follows:
Runtime per epoch: 5.21 sec vs. 8.85 sec --> ~70% longer
Efficiency: 14.75W vs. 10.9W per WU --> ~26% better --> reasonable, remembering that GPU compute load with a single WU was only around ~65%
Standardized throughput (assuming 959.40 credits per WU):
1 GPU task: ~691 epochs trained —> 3.60 WU —> 3,453 credits/hr @ 53.1W
2 GPU tasks in parallel: ~814 epochs trained —> 4.24 WU —> 4,065 credits/hr @ 46.2W
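
Those credits/hr figures follow mechanically from the per-epoch times. A quick sketch to reproduce them (959.40 credits/WU is given above; ~192 epochs per WU is implied by the runtimes, e.g. 1,000 sec / 5.21 sec per epoch):

```python
CREDITS_PER_WU = 959.40  # assumed per-WU credit from above
EPOCHS_PER_WU = 192      # implied: 1,000 sec / 5.21 sec per epoch

def throughput(sec_per_epoch: float, concurrent: int = 1):
    """Epochs, WUs and credits per hour for `concurrent` identical tasks."""
    epochs_per_hr = concurrent * 3600 / sec_per_epoch
    wus_per_hr = epochs_per_hr / EPOCHS_PER_WU
    return epochs_per_hr, wus_per_hr, wus_per_hr * CREDITS_PER_WU

print(throughput(5.21))     # ~691 epochs/hr, ~3.60 WU/hr, ~3,453 credits/hr
print(throughput(8.85, 2))  # ~814 epochs/hr, ~4.24 WU/hr, ~4,065 credits/hr
```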

3 successful concurrent runs so far delivered:

Run 1)
WU #1 = 1,702 sec
WU #2 = 1,699 sec
∆ = 3 sec or 0.2%

Run 2)
WU #1 = 1,714 sec
WU #2 = 1,710 sec
∆ = 4 sec or 0.2%

Run 3)
WU #1 = 1,704 sec
WU #2 = 1,702 sec
∆ = 2 sec or 0.2%

It seems the workload is not distributed perfectly evenly across the available CUDA cores, but very nearly so; hence the comparable runtimes for both WUs in each run and the very small variance. The runtimes look quite robust, with a std. dev. of only ~5 sec. And all this with the card running at only 55C and 39% fan load.
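
The ~5 sec figure can be checked directly from the six runtimes above:

```python
import statistics

# The six concurrent-run WU runtimes (sec) listed above.
runtimes = [1702, 1699, 1714, 1710, 1704, 1702]
print(f"{statistics.stdev(runtimes):.1f} sec")  # ~5.7 sec sample std. dev.
```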
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 456
Credit: 14,368,944
RAC: 939
Message 825 - Posted: 12 Nov 2020, 21:22:12 UTC

Great analysis. I'm still trying to figure out why multiple threads make a difference at all. I will look this over in the next few days.

Would you prefer I move this thread to another forum? I think it's probably fine here, but I could also see Café, Windows, or Linux.
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 456
Credit: 14,368,944
RAC: 939
Message 826 - Posted: 12 Nov 2020, 21:29:54 UTC

Ohhhhh... wait a minute. The Windows build uses the pre-compiled PyTorch from the pytorch.org website, which uses MKL-DNN, and MKL-DNN is notorious for ignoring thread limits and doing whatever it dang well pleases.

For Linux, we compile from scratch to disable MKL-DNN and use OpenBLAS instead (which actually gets more performance on non-Intel CPUs, and about the same on Intel).
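
For anyone who wants to experiment on their own machine, these are the standard knobs for capping PyTorch/MKL-DNN threading. This is only a generic sketch, not the project's actual app code, and whether MKL-DNN honours these limits is exactly the open question:

```python
import os

# Thread-pool env vars must be set before the math libraries initialise.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import torch

torch.set_num_threads(1)          # cap intra-op parallelism
torch.set_num_interop_threads(1)  # cap inter-op parallelism
```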

That might explain some of what's happening.
bozz4science

Joined: 9 Jul 20
Posts: 142
Credit: 11,533,376
RAC: 487
Message 827 - Posted: 12 Nov 2020, 21:38:20 UTC
Last modified: 12 Nov 2020, 21:43:52 UTC

Thanks, I was just curious!

But I think you might be on to something here. Appreciate your input! I could imagine the ignored thread limits being part of the problem, especially if MKL-DNN is notorious for this anyway. However, I still don't understand how others and I observe this speedup, as if more resources were being allocated, when I cannot actually see it: everything remains unchanged for the GPU app WUs (CPU utilisation, active threads, bus interface load) except for the near-instantaneous increase in CUDA compute load, even though the additionally reserved CPU thread stays untouched...

If it makes a difference and could potentially solve the CPU thread allocation problem, I guess most of us would be willing to test it if you decide to push another 100 or so test WUs.

I am fine with leaving this thread here; I just thought it would fit better somewhere else, as this wasn't really an issue report but rather meant to inform others. But as we might be on to something here together, I guess it's fine where it is.
bozz4science

Joined: 9 Jul 20
Posts: 142
Credit: 11,533,376
RAC: 487
Message 836 - Posted: 13 Nov 2020, 20:11:07 UTC - in response to Message 827.  
Last modified: 13 Nov 2020, 20:14:05 UTC

Usually I consult the GPU comparison page to analyse how well my ancient card is holding up against today's wattage beasts: https://www.mlcathome.org/mlcathome/gpu_list.php Surely, as more and more hosts crunch these GPU WUs, the comparison data will become more accurate over time. However, it seems odd that the newer-generation, more powerful RTX 20xx cards lag far behind much older cards. Could the Tensor cores of NVIDIA's RTX cards perhaps be utilised for this?

For the sake of completeness, I'd also like to see the 750 Ti in that comparison list, but I guess with the current layout the table only lists the top 21 cards, right? For example, the Milkyway@Home comparison page lists all GPUs, which is more than 30 models. Could you expand the list or allow more than 21 models to be listed?

Especially given that I am among the top 20 users (GPU RAC) with only one 750 Ti, the numbers provided on this page still seem out of proportion. Could you briefly elaborate on this?
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 456
Credit: 14,368,944
RAC: 939
Message 844 - Posted: 14 Nov 2020, 20:34:46 UTC - in response to Message 836.  

I think there are too many variables to draw many conclusions from the GPU list yet. The WUs aren't very stressful, and the number of GPU users and the sample size are still too small.
bozz4science

Joined: 9 Jul 20
Posts: 142
Credit: 11,533,376
RAC: 487
Message 848 - Posted: 17 Nov 2020, 1:11:43 UTC
Last modified: 17 Nov 2020, 1:11:58 UTC

I guess that's true – at least for now. Do you know whether these lists only analyse per-WU runtime, or can they also detect that a user is running many WUs concurrently on a particular card and switch to another metric such as credits/hr per card? I only thought of this because it would drastically change the interpretation of the statistics.

Edit to the 2nd runtime post: I want to correct some mistakes I made in the wattage estimates. I forgot to count the energy requirement of the 2nd thread in the 2-WUs-in-parallel option. The runtimes converged to ~1,770 sec, so I now get a slightly worse picture: 16.4W/WU and 3,924.8 credits/hr.
Running 3 WUs concurrently makes matters even worse efficiency-wise while still increasing throughput slightly: average runtimes are about 2,610 sec, which results in 20.2W/WU and 3,970 credits/hr.
At least for a very low-power card where the average compute load was already high but not at 100%, running WUs in tandem offers potential to increase throughput while lowering overall efficiency: the "theoretical" combined compute load of 2 WUs already pushes the card beyond 100% of its compute capability, so inefficiencies arise from resource sharing/allocation, but the load now sits at 100% constantly.

The tradeoff: going from 1 to 2 concurrent WUs increases throughput by ~17% while raising the power requirement by ~13%. Anything beyond that doesn't offer any measurable benefit.
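
Summarised as a quick calculation (runtimes and W/WU are the corrected figures above; the credits/hr come out close to the numbers quoted):

```python
CREDITS_PER_WU = 959.40  # from the earlier post

# (concurrent WUs, avg runtime per WU in sec, corrected W per WU)
configs = [(1, 1000, 14.75), (2, 1770, 16.4), (3, 2610, 20.2)]

for n, runtime_s, watts_per_wu in configs:
    credits_per_hr = n * 3600 / runtime_s * CREDITS_PER_WU
    print(f"{n} WU: {credits_per_hr:,.0f} credits/hr at {watts_per_wu} W/WU")
```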
bozz4science

Joined: 9 Jul 20
Posts: 142
Credit: 11,533,376
RAC: 487
Message 890 - Posted: 25 Nov 2020, 12:19:54 UTC - in response to Message 848.  

I recently realized that, according to nvidia-smi, my particular 750 Ti has a power limit of only 38.5W instead of the reference card's 60W TDP rating. So, in retrospect, I overestimated the GPU wattage by ~35%. With 2 tasks running concurrently at ~60% average power load, the card draws just over 23 W. That is quite decent, even compared to newer cards that run at a higher TDP but achieve lower runtimes per WU.
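
In case anyone wants to check their own card: power.limit and power.draw are standard nvidia-smi query fields. A minimal sketch (the sample output is illustrative):

```python
import subprocess

# Read the board power limit and current draw from nvidia-smi.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,power.limit,power.draw",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())  # e.g. "GeForce GTX 750 Ti, 38.50 W, 23.10 W"
```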


©2022 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)