All my GPU applications have crashed.
Joined: 12 Jul 20 · Posts: 48 · Credit: 73,492,193 · RAC: 0
Anyway, I just hope that the new RTX 30xx Ampere and new Radeon RX 6000 series cards will push prices of last-generation cards further down. I have learned to go with Nvidia, as much as I like my RX 570s. But the AMD drivers always cause problems sooner or later, on either Windows or Linux. However, Nvidia used the Samsung 8nm process for the RTX 30xx series and apparently is not all that happy with it; the efficiency gain is not that large. So it is said that they are going with TSMC (their traditional supplier) for a new 7nm version before long. That could be worth waiting for.
https://www.tweaktown.com/news/75679/nvidia-will-shift-over-to-tsmc-for-new-7nm-ampere-gpus-in-2021/index.html

> Still looking for a GPU upgrade myself but haven't figured out yet what offers the best value. RTX 2060/1660Ti cards seem rather affordable right now, as do the lower-end new Ampere cards.

For low-power use, I have bought a couple of GTX 1650 SUPERs recently (the "Super" version has 1280 CUDA cores). They have almost the output of a GTX 1060 at lower power (100 watts vs. 120 watts). But I would not trade my two remaining GTX 750 Tis for the world.
Joined: 24 Jul 20 · Posts: 30 · Credit: 3,485,605 · RAC: 0
> Linux/CUDA

I wonder why host 4162 doesn't get tasks. It should meet the requirements and there are tasks ready to send, yet nothing.
Joined: 9 Jul 20 · Posts: 142 · Credit: 11,536,204 · RAC: 3
Have you checked whether you ticked the option to receive test applications in the computing preferences? Otherwise I don’t see why it shouldn’t. |
Joined: 9 Jul 20 · Posts: 142 · Credit: 11,536,204 · RAC: 3
> Would be worth testing for this project. It is working OK but takes nearly 2 hours.

That just confused the heck out of me, but after looking at the stats of your host with the 750 Ti right now, it seems the runtimes are very similar. Definitely a net improvement for me, and thus I'll keep my GPU busy here occasionally. Interestingly, the WU on your host with the 1060 is considerably faster than the one on the host with the 2080 Ti.

> I have learned to go with Nvidia, as much as I like my RX 570s.

I've heard about the driver issues before, and earlier-generation AMD Radeon cards were, on top of that, not very power efficient, though cheaper. Interesting read about NVIDIA's apparent problem with Samsung's 8nm process. So I guess early adopters of the RTX 30xx series are really upset if they were "lucky" enough to get one in spite of the sparse supply.

> For low-power use, I have bought a couple of GTX 1650 SUPER

Honestly, they seem like a better value deal than the 1660 or 1660 Ti. The 1660 Ti's performance relative to the 1650 Super is ~130%, and at 120 W its power draw is ~120%, so power almost scales linearly with performance. Price, however, doesn't: a 1660 Ti is usually 50-75% more expensive than a 1650 Super for only a ~30% increase in performance, depending on the make and model of course. So there is roughly a 20-45% premium on the higher-end 16xx series cards. That gives me some food for thought....
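As a rough sketch of that value arithmetic (the performance ratio, TDPs, and price spread are just the approximate figures quoted above, not measured values):

```python
# Rough value comparison of a GTX 1660 Ti vs. a GTX 1650 Super.
# All inputs are the approximate figures quoted above, not measurements.
perf_ratio = 1.30                # 1660 Ti ~130% of 1650 Super performance
power_ratio = 120 / 100          # 120 W vs. 100 W TDP
price_low, price_high = 1.50, 1.75  # typical street-price spread vs. 1650 Super

perf_per_watt = perf_ratio / power_ratio
premium_low = price_low - perf_ratio    # extra price not matched by performance
premium_high = price_high - perf_ratio

print(f"perf/W of 1660 Ti relative to 1650 Super: {perf_per_watt:.2f}x")
print(f"price premium beyond the performance gain: {premium_low:.0%} to {premium_high:.0%}")
```

With these assumed inputs, performance per watt is nearly identical (~1.08x), while the price premium beyond the performance gain comes out at roughly 20-45%.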
Joined: 24 Jul 20 · Posts: 30 · Credit: 3,485,605 · RAC: 0
> Have you checked whether you ticked the option to receive test applications

Yes, everything is ticked. That computer actually got a v9.72 task on Sunday and the settings haven't changed since. It's getting late, I'll check again tomorrow.
Joined: 1 Jul 20 · Posts: 34 · Credit: 26,118,410 · RAC: 0
FWIW, I have only 1 of 4 machines that are working with GPU tasks.

Working:
- Win10 with a single 1660 Ti

Errors:
- Win10 with dual 1660 Ti
- Linux with dual 1660 Ti (same machine, dual boot)
- Win10 with one 2080 Ti and one 1080 Ti
- Linux with one 2080 Ti and one 1080 Ti (same machine, dual boot)
- Win10 with dual 1660 Ti (this is the one I sent the output file for)

All are running the latest drivers available.

Reno, NV · Team: SETI.USA
Joined: 3 Aug 20 · Posts: 8 · Credit: 7,650,164 · RAC: 0
Not all my GPU apps have crashed, but a decent number have for me. I am running an RTX 2070 Super and also a GTX 950. They don't seem to get much load whatsoever; my 2070 Super seems to get maybe 1% load according to Task Manager. I find that my readings tend to be somewhat unreliable because I'm running the RTX 2070 Super together with the 950, but I don't really know. Most of my WUs complete successfully, but I just wanted to make sure something isn't wrong. Also, on Win10 with the NVIDIA Control Panel, I am rather curious: would the option under 3D settings, "optimize for compute performance", help with GPU WUs at all? I noticed it a while back after a driver update.
Joined: 30 Jun 20 · Posts: 462 · Credit: 21,406,548 · RAC: 0
These WUs aren't going to tax a GPU very much, especially high-end ones. What we don't get is why there are issues on some machines and not others, especially on the Linux side.

It also doesn't help that the stats reported by the BOINC server are heavily skewed negative when you first deploy an app. Some clients (for some reason) continue to grab WUs even when they fail, causing a huge wave of failures right off the bat. Machines that work take a while to compute WUs, and the good results start trickling in over the next few days. The stats the BOINC server shows us project owners the most are "percentage of returned WUs that passed" and "percentage that failed". If one user grabs 60 WUs in the first 24 hours and they all fail immediately, while another grabs 60 and computes the results fine, it takes tens of hours before the stats balance out.

That's a long-winded way of saying that when I checked the stats yesterday morning, I saw a Linux pass rate < 10% and thought there was no way these were ready to release. But now that rate is up to 50-ish % and climbing, which is probably good enough to release. The Windows app is up to ~70% passing.

That doesn't mean we're going to stop trying to solve the issues we do have; we know there are several(!) machines that seem to have preventable errors, especially on the Linux side. What's also nice is that most of the errors now at least provide useful error messages and are almost all related to CUDA, unlike the first Linux client, which was just failing without any error messages. So we'll continue to release updated clients on the test channel, but maybe we'll spend tonight setting up the release GPU app for Linux and Windows.
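To see why the early numbers look so bleak, here is a toy sketch of that skew (the 60-WU batches and the ~30-hour trickle are the hypothetical figures from the post, not server data):

```python
# Toy model of the early pass-rate skew described above: failed WUs come back
# almost immediately, while good results take many hours to return.
failed_wus = 60                      # all returned (as errors) within the first hour
good_wus = 60                        # returned successfully, spread over ~30 hours
good_return_rate = good_wus / 30.0   # successful results trickling in per hour

for hour in (1, 6, 12, 24, 30):
    passed = min(good_wus, good_return_rate * hour)
    pass_rate = passed / (passed + failed_wus)
    print(f"hour {hour:2d}: reported pass rate ~{pass_rate:.0%}")
```

Under these assumptions the reported pass rate sits around 3% after the first hour and only climbs to 50% once all the good results are back, even though half the hosts were working fine the whole time.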
Joined: 6 Jul 20 · Posts: 7 · Credit: 2,082,893 · RAC: 9
Good morning. I happen to have one machine with a sufficient driver version, versus several machines that have the standard driver version that Debian provides. For me it would be nice to be able to use those machines as well, without having to go through the trouble of updating GPU drivers manually. I tried it on one machine, and it didn't like it, so I have kept away from it since. The machine with the appropriate driver version keeps erroring out, so to me the GPU application makes no difference at the moment. ;-)

> Some clients (for some reason), continue to grab WUs even when they fail, causing a huge wave of failures right off the bat.

You could set a quota of tasks per machine that gets reset daily. That way you should get broader feedback earlier.

I really appreciate you working on things, giving timely feedback, and providing us with regular news. Apart from credits and badges, to us volunteers that is a great way of saying you care about us! Thank you!

Greetings, Jens
Joined: 1 Jul 20 · Posts: 32 · Credit: 22,436,564 · RAC: 0
I did a check this morning on the GPUs I have running here and the results are quite dismal. While I have a 100% success rate on the Windows machines (I turned off the Linux GPUs), the GPU utilization rate is horrible. My RTX 2080 Ti is at <1% and twin GTX 1080 Tis are between 1-2%. Even my lowly GTX 750 Ti is barely managing 5% utilization. Wasting one CPU thread to control a GPU running at <1% will not work for me, considering the extra expense of running GPUs. I may, at some future time, try configuring one to run multiple WUs per GPU, but I doubt even that would get the utilization up to something acceptable. I have never tried running more than 4 per GPU before, and I do not know if there is some upper limit.

It appears you have moved GPUs into production. If so, mine would be better served running other projects. I like this project and will continue to run some CPU WUs. Cheers!
Joined: 12 Jul 20 · Posts: 48 · Credit: 73,492,193 · RAC: 0
> I did a check this morning on the GPUs I have running here and the results are quite dismal. While I have a 100% success rate on the Windows machines (I turned off the Linux GPUs), the GPU utilization rate is horrible. My RTX 2080 Ti is at <1% and twin GTX 1080 Tis are between 1-2%. Even my lowly GTX 750 Ti is barely managing 5% utilization. Wasting one CPU thread to control a GPU running at <1% will not work for me, considering the extra expense of running GPUs. I may, at some future time, try configuring one to run multiple WUs per GPU, but I doubt even that would get the utilization up to something acceptable. I have never tried running more than 4 per GPU before, and I do not know if there is some upper limit.

If it is really that low (and I am not sure that you are not looking at the CPU utilization), I would consider that a great success. The card would run very cool, and you would still get a great speedup compared to the CPU work units. So it would be very efficient. In fact, it looks too good. But I won't be able to put a card on Windows for a few more days to get a good comparison with Linux (either CPU or GPU). Thanks for the numbers though.
Joined: 30 Jun 20 · Posts: 462 · Credit: 21,406,548 · RAC: 0
Well, I'm sure you all would appreciate it more if the clients worked better, but thanks for the kind words. I'm going to try something different with the next spin of the Linux/CUDA client in mldstest: I'm going to (a) stop using AppImage and (b) try shipping without the CUDA runtime libraries. I'm not sure it will work, but look for that in the coming days.
Joined: 30 Jun 20 · Posts: 462 · Credit: 21,406,548 · RAC: 0
The current WUs are definitely bound more by transfer bandwidth than compute; that has to do with the size of the current networks and the current batch size. On my test machine, using a GTX 1650 and monitoring with nvidia-smi, I see brief spikes of 90% utilization followed by 1-2% for a few moments, then up over 90% again briefly, then back down again. Depending on how you're measuring utilization (instantaneous or average), I can imagine seeing wildly different results. This is consistent with what I would expect from ML jobs, as the spikes are the batch-size × network-size matrix multiplies (which for these WUs are small).

I could increase utilization easily (just raise the batch size, which makes the matrix multiplications bigger), but that means there are fewer chances for back-propagation (updating the weights) per epoch, so we'd need to compensate by adding more epochs, and now we're starting to change hyperparameters compared to the CPU WUs, which is something I want to hold constant for the integrity of the generated dataset. Even with all that, we still see a significant speedup in total runtime. Also, while we tried to optimize GPU usage as much as possible, there's still probably plenty of room for improvement. I would not at all claim we're doing things in the most efficient way possible.

All that said, I would never tell anyone how to allocate their own resources! I certainly appreciate all the help and support everyone has given this project, whether they choose to contribute by GPU or CPU. If some feel that our GPU utilization (and high CPU usage during GPU compute) isn't high enough to justify it, then I hope we've provided enough knobs in the preferences and separate GPU apps to easily control how and with what you volunteers contribute.
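As a rough illustration of that batch-size trade-off, here is a minimal PyTorch-style sketch (not the project's actual training code; the network, dataset size, and batch sizes are made up):

```python
# Minimal sketch of the batch-size trade-off described above: a bigger batch
# means bigger matrix multiplies (better GPU load) but fewer weight updates
# per epoch. Not MLC@Home's training loop; all sizes here are made up.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 8))
opt = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()

data = torch.randn(4096, 64)    # pretend training set
target = torch.randn(4096, 8)

for batch_size in (64, 1024):
    updates = 0
    for start in range(0, len(data), batch_size):
        x, y = data[start:start + batch_size], target[start:start + batch_size]
        opt.zero_grad()
        loss_fn(model(x), y).backward()  # larger batch => larger matmul per step
        opt.step()                        # ...but fewer of these updates per epoch
        updates += 1
    print(f"batch_size={batch_size:4d}: {updates} weight updates per epoch")
```

With a batch size of 64 this toy epoch performs 64 weight updates; at 1024 it performs only 4, which is why simply raising the batch size to improve utilization would effectively change the training hyperparameters.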
Joined: 9 Jul 20 · Posts: 142 · Credit: 11,536,204 · RAC: 3
Don't be humble! What you have built here is IMO already a great accomplishment and proof of your relentless effort. It definitely does speed things up quite a bit, and at least for me results in a speedup of ~30x compared to the CPU version. While my ancient 750 Ti, as already discussed in an earlier thread, appears to be loaded rather high at ø 65% by comparison (thus, I cannot really believe the single-digit compute load on your 750 Ti), I can easily imagine the disappointment of someone with much more powerful hardware. If those numbers are true, they do seem rather low, and they mean that the WUs never utilise the GPU's compute power to its full extent. While this is certainly true, it also means that not much power is needed to run a WU at such low utilisation, yet you still see the speedup. I do see the reasoning behind wanting to utilise the compute power to the full extent, Dataman, and thus why you would rather move your GPU compute power to projects such as F@H, GPUGrid, Einstein@Home or the like that tend to reach much higher utilisation. I would probably do so myself if my CPU alone delivered as much throughput as my current 750 Ti :) I will crunch away those rand WUs 30 times faster than I did with the CPU alone.

For an efficiency comparison, see my quick estimate here (referring to CPU vs. GPU rand WUs):
- CPU only: ~10.5 hrs on 1 thread out of 6 on a Xeon X5660 @ 95 W gives an estimate of ~166.25 Wh per WU
- GPU + 1 CPU thread: ~1,250 sec @ 60 W TDP and an ø power load of 48% gives an estimate of ~10 Wh per WU (≈6.5 sec per epoch)

I don't know if I am allowed to do the calculation the way I just did, but if it holds, these numbers are a clear indication of the efficiency gain (~16x) of the GPU version over the CPU-only version. I must admit, Jim, that looks a bit too good to be true... And that is besides the 30x performance boost in runtimes, at least for low-powered GPUs.

I tried to run multiple WUs in parallel, but my experience so far is that it fails whenever the single-WU compute load is already over 50%. And it did fail again. Still, I can easily forgive an average compute load of only 65% on my 750 Ti given the aforementioned improvements. I can also imagine that future workloads, in which we train multiple networks in parallel with different sets of inputs or hyperparameters, or in general more complex workloads such as the RNN WUs that experiment run 3 will provide, would raise the compute load considerably. Anyway, I love the low power draw while still seeing the speedup, but I would still appreciate squeezing that remaining 35% of compute capacity out of the box.

Letting 2 tasks compute simultaneously unfortunately doesn't work for me. From what I could read in the stderr output, it has to do with a memory error. At this fast pace of learning, with an epoch taking only ~6.5 sec, I can easily see why memory bandwidth could impede a card's compute capability. My error was the following: "- Unhandled Exception Record - Reason: Out Of Memory (C++ Exception)", though I am not sure what ultimately caused it...
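For reference, the energy arithmetic above works out as follows (a quick sketch using the quoted runtimes, TDPs, and load figures; splitting the CPU's TDP evenly per thread is the same rough assumption made in the estimate, not a measurement):

```python
# Rough per-WU energy estimate, CPU vs. GPU, using the figures quoted above.
# Dividing the CPU's TDP evenly across threads is a rough assumption.
cpu_tdp_w, cpu_threads, cpu_hours = 95.0, 6, 10.5
cpu_energy_wh = cpu_tdp_w / cpu_threads * cpu_hours          # ~166 Wh per WU

gpu_tdp_w, gpu_load, gpu_seconds = 60.0, 0.48, 1250.0
gpu_energy_wh = gpu_tdp_w * gpu_load * gpu_seconds / 3600.0  # ~10 Wh per WU

print(f"CPU-only: ~{cpu_energy_wh:.1f} Wh, GPU: ~{gpu_energy_wh:.1f} Wh, "
      f"efficiency gain ~{cpu_energy_wh / gpu_energy_wh:.1f}x")
```

Note the units are watt-hours (energy per work unit), not watts; the ~16x ratio comes from ~166 Wh vs. ~10 Wh.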
Joined: 12 Jul 20 · Posts: 48 · Credit: 73,492,193 · RAC: 0
I think the gain is not as great as it appeared at first. I installed a GTX 750 Ti on my Win7 64-bit machine, and with the latest (457.30) drivers I am seeing compute times of 80 minutes (4800 seconds) and a power draw of 9.4% of TDP, or about 5.6 watts. That is better than a CPU, but nowhere near as good as what you report. Either the earlier drivers (451.67) you use are better, or there is something in Win10 that makes it faster than Win7. But there seems to be quite a wide range of variation here that will take some time to explore. Thanks for your input.
Joined: 30 Jun 20 · Posts: 462 · Credit: 21,406,548 · RAC: 0
One other change to note: Make sure you're comparing rand_automata WUs to rand_automata WUs. The ParityMachine and ParityModified WUs in the GPU app queue are set to run longer (1024 epochs vs. 128 epochs in the test and CPU channels). This is *not* changing the hyperparameters like we discussed above; it's simply working longer on the same task with the same parameters before sending it back for evaluation. This was done for a few reasons:

a) Dataset 1+2 WUs (as configured for 128 epochs) finish in about 5-7 minutes on a GPU. That seems like a waste.

b) As anyone who has monitored progress can see, we're struggling to get ParityMachine/ParityModified networks trained and finished. Part of that is because parity is just hard to learn, taking 2000-3000 epochs (on average!). Currently, we do 128 epochs, wait for 2 hosts to complete that (which could take 4 days), and then schedule it again to continue. This was an important compromise when dealing with multiple machine types, but since we only have ParityMachine/ParityModified left, we might as well bite it off in bigger chunks.

We might also change the CPU WUs to use more epochs at once, and we may send out some EightBitModified WUs as well to help get those over the finish line. Note that the credit for these longer WUs has increased proportionally. The rand_automata WUs haven't changed.
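For illustration, the work-in-chunks pattern described above might look roughly like this (a hypothetical sketch, not the actual MLC@Home client; the model, chunk size, and checkpoint file name are placeholders):

```python
# Hypothetical sketch of training in fixed-size epoch chunks with a resumable
# checkpoint, as described above. Not the actual MLC@Home client code; the
# model, data, chunk size, and file name are placeholders.
import os
import torch
import torch.nn as nn

CHECKPOINT = "wu_checkpoint.pt"            # state that travels with the WU between hosts
model = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
start_epoch = 0

if os.path.exists(CHECKPOINT):             # resume where the previous host stopped
    state = torch.load(CHECKPOINT)
    model.load_state_dict(state["model"])
    opt.load_state_dict(state["opt"])
    start_epoch = state["epoch"]

epochs_per_chunk = 1024                    # was 128 for the smaller chunks
x, y = torch.randn(256, 8), torch.randn(256, 1)
for epoch in range(start_epoch, start_epoch + epochs_per_chunk):
    opt.zero_grad()
    nn.functional.mse_loss(model(x), y).backward()
    opt.step()

torch.save({"model": model.state_dict(), "opt": opt.state_dict(),
            "epoch": start_epoch + epochs_per_chunk}, CHECKPOINT)
```

The only thing the larger chunk changes is how many epochs run before the checkpoint is written and returned; the model, data, and optimizer settings stay the same.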
Joined: 9 Jul 20 · Posts: 142 · Credit: 11,536,204 · RAC: 3
If I were to see those kinds of numbers on mine I would agree, but the GPU WUs come in reliably at the 1,250 sec mark with little variance, depending on which overhead task might be running in the background besides the other 5 out of 6 dedicated CPU threads. 80 min is very far off my 20 min mark, so I really begin to wonder what causes this large discrepancy between two identical card models. Like you pointed out, it might be the driver/OS combination, but I remain clueless.

Meanwhile I tried to load my GPU to 100% of its CUDA compute capability and managed to get there by running F@H units of the latest COVID-19 sprint (project 13429) alongside the rand GPU WUs. These WUs typically load the 750 Ti's compute capacity at only ~65%; now I am still at ~60% power capacity. Each GPU task has a dedicated CPU thread. I see a slight penalty of ~15% on the rand WUs (1,440 sec, ~7.5 sec per epoch), and currently an estimated ~50% runtime penalty on the F@H tasks. I don't know if I will continue this setup, but it remains to be seen whether it is worthwhile if the penalty on the rand WUs really turns out to be this small. It would still expedite my computational effort at this project by a factor of 24x.

And thanks for sharing this news!
Joined: 12 Jul 20 · Posts: 48 · Credit: 73,492,193 · RAC: 0
> One other change to note: Make sure you're comparing rand_automata WUs to rand_automata WUs.

Yes, I have completed only the rand_automata WUs, and I see that is true also for bozz4science on his Win10 machine with the GTX 750 Ti. By the way, I went back to the 446.14 drivers, which are the last ones with CUDA 10.2, but I don't see any difference; the work units are still completing in about 80 minutes. I am really more interested in Linux though, and this is just a pastime until those are available, so I probably won't push this much further. But maybe someone can find out if there really is a difference between Win7 and Win10. I would not have suspected it.

PS - I am supporting the GPU with two free cores of an i7-4771. That should be plenty. But just to check, I am suspending the WCG/OPN work on the other six cores to see if it makes a difference. I will note back here if it does.
Joined: 1 Jul 20 · Posts: 34 · Credit: 26,118,410 · RAC: 0
My GTX 1660 Ti runs a single task with a load of around 40%. This is on an old i5-4590 (4 cores, no HT). I will also note that it consumes two full CPU cores, so I set my app_config.xml file to reserve 2 CPUs per GPU task. If I let BOINC use more than two cores for other tasks, the GPU utilization drops way down.

    <app_config>
      <app>
        <name>mldstest</name>
        <gpu_versions>
          <gpu_usage>1</gpu_usage>
          <cpu_usage>2</cpu_usage>
        </gpu_versions>
      </app>
      <app>
        <name>mlds-gpu</name>
        <gpu_versions>
          <gpu_usage>1</gpu_usage>
          <cpu_usage>2</cpu_usage>
        </gpu_versions>
      </app>
    </app_config>

Also, my three other dual-GPU Win10 machines continue to fail 100% of both CPU and GPU tasks.

Reno, NV · Team: SETI.USA
Joined: 1 Jul 20 · Posts: 34 · Credit: 26,118,410 · RAC: 0
> Also, my three other dual-GPU Win10 machines continue to fail 100% of both CPU and GPU tasks.

To be clear, those three machines have about a million credits combined, so at least CPU tasks used to work. But no longer.

Reno, NV · Team: SETI.USA