GPU support update 11/23
Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0
All, We've made a few changes to the Linux clients in mldstest, and we're getting better results, though with some trade-offs. It appears we're having much better luck with GPU support under Linux now! Here's a short changelog for 9.80 in mldstest:
Joined: 9 Jul 20 Posts: 142 Credit: 11,536,204 RAC: 3
The preliminary numbers already look very promising! I appreciate the detailed listing of the technical requirements and potential issues with the current client version. Overall, exciting news!
Joined: 1 Jul 20 Posts: 32 Credit: 22,436,564 RAC: 0
EUREKA! I have a GTX 980 Ti running on Ubuntu and everything looks great so far. Previously the WUs failed in the first 5 seconds, but they seem to run normally now. I will monitor to see if they validate. If so, I will add some other Linux GPUs. My fingers are crossed! EDIT: Completed and validated. WooHoo!
Joined: 12 Jul 20 Posts: 48 Credit: 73,492,193 RAC: 0
"However, it appears we're having much better luck with GPU support under linux now!"

Yes! My GTX 1650 Super (Ubuntu 20.04.1, 455 driver) is running the rand_automata in 15 to 16 minutes, at about 50 watts. That compares to 3 hours for a GTX 1650 Super on my Win7 machine (25 watts). From what I see, the Win10 machines are still much faster than Win7 for some reason, and probably comparable to Linux. The disk space is not an issue. (What is a MB?)
Joined: 6 Jul 20 Posts: 7 Credit: 2,082,893 RAC: 9
Running fine on my Ubuntu system, sporting a GeForce 1060 3GB. Several tasks validated. Very nice! Thank you! Keep up your great work! - - - - - - - - - - Greetings, Jens
Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0
Moved the Linux CUDA client to release. There will be some transition pains, as the 14K existing WUs on the mlds-gpu queue have disk limits set too low for the Linux client. I've updated them all in the DB, but I'm not 100% sure that will do the trick for already-created results. I've also updated our work dispatcher so that any newly generated WUs will have the correct limits set, but there may be a period when you get disk_limit_exceeded errors before the old WUs are all flushed out.
Joined: 6 Jul 20 Posts: 7 Credit: 2,082,893 RAC: 9
I'd like to add that I run driver version 450.xx, not 455.xx, on Ubuntu 20. So your requirements seem a little bit higher than needed. For CUDA: - - - - - - - - - - Greetings, Jens
Joined: 24 Jul 20 Posts: 30 Credit: 3,485,605 RAC: 0
"The downside is the size of the libraries we need to ship. CUDA is huge. Our binary itself is 8.6MB, but the libtorch_cuda library is 1.9GB, and several other libraries bring the total size of the app+libraries to 3GB. With AppImage, these libraries were compressed and stored on disk in a squashfs filesystem, bringing the on-disk requirement down to approximately 900MB-1.6GB (depending on the build). What's worse, these files need to be downloaded to the project directory and then copied to each running directory (not sure why BOINC doesn't do linking here; maybe we're missing a setting?), so they take up twice as much disk space when in use. More if you have more than one GPU."

I think people will usually have enough disk space, though they may need to configure BOINC to use that much. What worries me is the amount of data written. A quick calculation shows that your new application could easily write the full capacity of my "little" SSDs twice a day. That's not acceptable. Even if you could reduce that by a factor of 10 by making the tasks larger, I think I still wouldn't do it. Perhaps you could ask at Rosetta@Home; they had the same issue. Their application needs a database that used to be replicated for every task, but they found a way to use a single copy in the project directory. I don't know how that works, though.
Joined: 9 Jul 20 Posts: 142 Credit: 11,536,204 RAC: 3
Haven't given it much thought until now. For my ancient rig with only a years-old HDD, that's fine, but with an NVMe drive or SSD as the main storage BOINC is installed on and runs from, it is valid to think about potentially accelerated hardware depreciation.

Let's start by looking at high-end NVMe drives, which can increasingly be found in modern systems, e.g. Samsung's Evo/Evo Plus drives. They have a rated endurance of ~1,200 TB written (TBW). On a dual-GPU setup running 24/7, with 2 tasks concurrently on each GPU and an average runtime of 3,600 sec (as a stand-in for all GPU WU types), we would get to ~96 GPU WUs computed per day. With a lower estimate of average runtime you would easily get up to ~150 WU/day.

If you take 1.9 GB written per task * 100 WU = 190 GB/day:
1,200 TB TBW --> 1,200,000 GB / 190 GB = 6,315 days = 17.3 years

If you take a lower average runtime estimate of 2,500 sec for the sake of this thought experiment, you come up with 140 WU/day:
1,200,000 GB / 266 GB = 4,511 days = 12.3 years

With an initial investment of ~$100 for 500 GB of NVMe storage, the CUDA Linux client would equate to $0.0158 or $0.0222, respectively, in additional depreciation per day if running MLC 24/7. Sure, that doesn't measure the degrading performance, but it's an intuitive monetary measure of the depreciation of the hardware over its expected full lifetime. I guess most components will not make it to this number in most cases, as mechanical parts such as pumps, fans, etc. will eventually break before that under constant 24/7 load, and other components might be upgraded within 5-year intervals. This assumes, of course, that you only run MLC's GPU client, without any side project or other applications running alongside it. It would make sense to consider running the CUDA client if you can, since the degradation seems to become a real issue only well after the warranty period is over. Usually the mentioned drives come with a 4-5 year warranty.

So you could even run up to 7 years' worth of drive depreciation on CPU projects before you were expected to lose some cells on your NVMe drive. Much oversimplified, but just to illustrate that today's tech should keep up well with the demanding requirements of this client version. I hope I didn't screw up this thought experiment. My only intention is to spark a discussion!
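The wear arithmetic above can be reproduced in a couple of lines of awk; the inputs (1,200 TB TBW, 1.9 GB written per task) are this post's assumptions, not measured values, so substitute your own drive's rating:

```shell
# Days until rated TBW is reached, using the figures assumed above.
awk 'BEGIN {
    tbw_gb = 1200000          # ~1,200 TB rated endurance, in GB
    gb_per_task = 1.9         # assumed writes per GPU work unit
    printf "100 WU/day: %d days\n", tbw_gb / (gb_per_task * 100)
    printf "140 WU/day: %d days\n", tbw_gb / (gb_per_task * 140)
}'
```

This prints the 6,315-day and 4,511-day figures from the post.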
Joined: 12 Jul 20 Posts: 48 Credit: 73,492,193 RAC: 0
"What worries me is the amount of data written."

I am running a GTX 1060 under Ubuntu 20.04.1, and checked the writes with "iostat 3600 -m", which gives the writes in megabytes per hour (ignore the first reading; it is the writes since boot). I am getting about 9 GB/hour, or 200 GB/day, which is a little much for my Samsung 850 EVO; I like to keep it to less than 70 GB/day.

So I use a write cache. It is built into Linux; you just set the parameters. Since I have 16 GB main memory, and it is not much used, I can devote half of it (8 GB) to the cache. I set a timeout (the time before the cache is flushed) of one hour, so that an entire MLC work unit can be held in the cache. Very little will be written to the SSD. This works:

Reduce the use of swap:
sudo sysctl vm.swappiness=0

Set the write cache to 8 GB/8.5 GB (for 16 GB main memory):
sudo sysctl vm.dirty_background_bytes=8000000000
sudo sysctl vm.dirty_bytes=8500000000
sudo sysctl vm.dirty_writeback_centisecs=500 (check the cache every 5 seconds)
sudo sysctl vm.dirty_expire_centisecs=360000 (flush pages after 60 min)

To check the memory used for disk caching:
cat /proc/meminfo | head -n 5

This shows that the cache is less than 3 GB full. You could probably use a 4 GB cache; even less would still save the SSD.
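Settings applied with sysctl are lost at reboot; one way to make them persist is a sysctl drop-in file. A sketch (the file name is arbitrary, and the byte values assume the 16 GB machine described above):

```shell
# Generate a sysctl drop-in with the write-cache settings above.
# Values assume 16 GB RAM; scale vm.dirty_*_bytes for your machine.
cat > 90-boinc-writecache.conf <<'EOF'
vm.swappiness = 0
vm.dirty_background_bytes = 8000000000
vm.dirty_bytes = 8500000000
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 360000
EOF

# To install and apply without rebooting (needs root):
#   sudo cp 90-boinc-writecache.conf /etc/sysctl.d/
#   sudo sysctl --system
cat 90-boinc-writecache.conf
```

Note the trade-off: anything held in a volatile write cache is lost on a crash or power cut, which is acceptable here because a failed WU is simply re-sent.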
Joined: 1 Jul 20 Posts: 2 Credit: 10,203,385 RAC: 0
Hello Jim1348, great suggestion. Do you know how much faster the GPU task will complete when you write to RAM versus SSD? Thanks.
Joined: 12 Jul 20 Posts: 48 Credit: 73,492,193 RAC: 0
"Do you know how much faster the gpu task will complete when you write to RAM versus SSD? Thanks."

Yes. Not at all. It is just to save the SSD.
Joined: 9 Jul 20 Posts: 142 Credit: 11,536,204 RAC: 3
"So I use a write cache. It is built in to Linux, you just set the parameters."

Great to know! Thanks, Jim
Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0
Just a little technical background: there are several unfortunate design decisions in boinc, pytorch, and our own code that make sense individually but collide to make this an actual issue for our project.

The whole problem could be solved with symlinks. Ideally, you would download the exe+libraries once into the project directory, and when boinc launches the app, it would create symlinks from the run directory to the project directory. The loader knows about symlinks, and all the shared objects we ship are modified to look in the current runtime directory for their dependencies. That way, you keep only one copy of the files and just link to them each time. Easy peasy.

But symlinks don't work on Windows (or rather, they kind of do, sort of, and certainly didn't commonly work on Windows 20 years ago when BOINC was being developed). Rather than have some platforms use real symlinks and others not, boinc doesn't use posix symlinks at all. Instead, boinc uses a placeholder file with an xml tag that contains the path to the real file, and requires the boinc client itself to open and resolve that file. That works fine for single-exe programs that are launched by the boinc client, and for data files, because the individual client app knows about the scheme and resolves the filenames itself. It does not work when another program, like the loader, which isn't aware of boinc or its custom hand-crafted "links", tries to find the shared libraries associated with a program to load into memory. On disk, these shared libraries need to be stored uncompressed at runtime, though at least they can be compressed during network download.

Even if boinc did use symlinks, there's another issue with names. Any file served from the boinc server needs to have a unique name, yet we have multiple clients with libraries that share a name but are different files (libtorch.so.1 for the CPU client is different from libtorch.so.1 for the CUDA client, etc).
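For concreteness, the placeholder "link" files described above are tiny XML stubs that only boinc-aware code knows to follow; the dynamic loader just sees a file full of XML where a library should be. A sketch (the project path inside is a made-up example, and the tag name reflects the format the boinc client writes, to the best of my knowledge):

```shell
# A BOINC pseudo-symlink is just an XML stub like this:
cat > libtorch.so.1 <<'EOF'
<soft_link>../../projects/example_project/libtorch.so.1</soft_link>
EOF

# Resolving it requires parsing the tag -- something the boinc
# client does, but the dynamic loader does not:
sed -n 's|.*<soft_link>\(.*\)</soft_link>.*|\1|p' libtorch.so.1
```

Running this prints the target path stored inside the stub, which is exactly the step the loader cannot perform.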
So that means I need to add something like a version or type to each filename, and then set a parameter in the WU to tell the client to change the name back to the canonical one when copying to the runtime directory. It turns out that option only exists IF you also set the option to copy the file to the runtime directory. Meaning there's no way to tell boinc to create even its pseudo-symlink file with a different name if you don't also copy the file.

The above are all things that would need to change in the boinc client, and I have no control over that. In fact, I suspect the larger boinc project will say "well, don't do things this way; statically link your exe or use a vm image!". Of course, we can't statically link (see https://github.com/pytorch/pytorch/issues/21737) or use a virtualbox vm (no GPU support). So we're left with a lot of bad choices.

AppImage solves some of these problems by providing a single binary with an embedded squashfs image that contains all the libraries, with the libraries and binaries modified to use the copies in the embedded filesystem. There's only a single "binary", which boinc pseudo-symlinks to the main project dir without copying, and since the client is the one that launches it, it can resolve the link. The downsides are that the appimage tools are a bit unstable on the creation side, it's proved very difficult to debug, it requires the user to have "fuse" installed, and it creates a temporary mountpoint in /tmp, which means it's touching things outside the boinc directory, which understandably makes some people nervous. And most importantly, it just flat out didn't work for the cuda binary, likely because it couldn't resolve all the dependencies.

There are some other potential options, none of them good. We could move back to appimage for GPU, which failed us already, but maybe could be forced into submission with more work.
We could ditch pytorch and write custom, likely buggy, likely much slower versions of the app (a non-starter for me). We could push for boinc to allow symlinks with new names on linux (likely an uphill battle, and it would take time). We could re-write the app to use a boinc "wrapper", which would lose us some features, but does have an "exec_dir" option we might be able to use... if we could somehow solve the filename problem. Note the windows app always behaved this way, copying its shared libraries around, because there is no appimage equivalent.

If you ever wondered why the GPU app took so long, I hope you now begin to see the time and effort that went into it beyond a simple "turn on the flag and recompile".

As for SSD wearout, other than a cache (or just the regular page cache) there's another option. While I'm not a fan of BTRFS in general, it is a copy-on-write filesystem, which I believe means that since the files here are copied and read (not modified), it wouldn't actually create a whole new copy of the file on the SSD, just the loose equivalent of a symlink, hidden within the filesystem. So it may help to put your boinc dir on a btrfs (or other CoW) filesystem partition.

Sorry for the wall of text; I hope it provided some context.
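One way to experiment with the CoW idea above is `cp --reflink`, which on btrfs (and XFS with reflink support) clones a file by sharing its blocks rather than duplicating them; `--reflink=auto` falls back to a plain copy on other filesystems. A sketch with throwaway paths:

```shell
# Demonstrate a CoW-capable file copy (paths are throwaway).
# On btrfs/XFS, --reflink shares blocks instead of duplicating them;
# --reflink=auto degrades to a normal copy elsewhere.
tmpdir=$(mktemp -d)
dd if=/dev/zero of="$tmpdir/libtorch.bin" bs=1M count=8 status=none

cp --reflink=auto "$tmpdir/libtorch.bin" "$tmpdir/clone.bin"
cmp "$tmpdir/libtorch.bin" "$tmpdir/clone.bin" && echo "clone matches"

rm -rf "$tmpdir"
```

On a CoW filesystem the clone step writes almost nothing to the device, which is the property that would help with BOINC's copy-per-slot behavior.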
Joined: 12 Jul 20 Posts: 48 Credit: 73,492,193 RAC: 0
"If you ever wondered why the GPU app took so long, I hope now you begin to see the time and effort that went into it beyond a simple "turn on the flag and recompile"."

I have never seen a GPU app developed so fast. You have a second career doing it if you want to.
Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0
"I have never seen a GPU app developed so fast. You have a second career doing it if you want to."

To be fair, most GPU apps in boinc are custom. PyTorch supports GPUs by design, so the actual code changes are pretty much flipping a few flags in the config and recompiling. It's the packaging that's a pain.
Joined: 24 Jul 20 Posts: 30 Credit: 3,485,605 RAC: 0
First, thanks to all for your suggestions. That gives me something to think about. Maybe during the weekend ...

@bozz4science: My calculation is very different, mostly because you seem to be thinking of different SSDs. 1,200 TB TBW, that must be terabyte-size devices. My SSDs are usually 120 GB, which is more than sufficient for a dedicated cruncher. The downside is the much lower TBW rating. With your 100 tasks a day at 3 GB each (not 1.9) I calculate 300 GB of daily writes. At that rate the SSD I have in mind could reach EOL in 200 days. While the monetary value is not very high, I don't want to burn it like that. Moreover, replacing the SSD in the computer I'd possibly use is a major effort I really wish to avoid.

@Jim1348: A write cache is an interesting idea, but it's not obvious to me that it actually reduces writes rather than just delaying them. I'll need to find more information on how it works in detail. I'd thought of running BOINC from a tmpfs perhaps, but then I'd have to find a way to make the data persist across restarts.

@pianoman: I'm not familiar with BTRFS at all. A quick search didn't come up with CoW as a prominent feature. That definitely needs some more reading before I'd consider using it.
Joined: 12 Jul 20 Posts: 48 Credit: 73,492,193 RAC: 0
"@Jim1348: A write cache is an interesting idea but it's not obvious to me that it actually does reduce writes, not just delay them. I'll need to find more information on how it works in detail."

Most of my tests have been on Windows, since the caches there (PrimoCache provides the most info) show both how much is written by the OS and how much is written to the disk. The point with scientific programs is that they are iterative. That is, they read a location, do a calculation, and write the results back to the same location. With a cache latency of a couple of hours, I could typically see a reduction in writes to the SSD of 80% or more; with four hours, 90%. That depends on the project, of course. Here, the entire work unit runs in less than an hour, so if you have a big enough cache that it does not overflow, the writes will be almost zero. I don't have a good way to measure that in Linux, but the monitoring I noted showed the program occupying only 3 GB of the cache, so that will necessarily be the case.

Tmpfs would no doubt work; it is like a ramdisk, I think, but it would occupy more space than a cache, since you have to keep the entire program in memory. And you have to make it survive a reboot. I find a cache simpler, but if you can get tmpfs to work, let us know.
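Jim's point about iterative workloads can be illustrated with a toy model (the numbers are illustrative, not measured): if the app rewrites the same blocks every iteration, an uncached run sends every rewrite to the device, while a cache that is flushed once at the end writes each dirty block only once.

```shell
# Toy model: why a write cache REDUCES (not just delays) writes
# for workloads that rewrite the same blocks. Illustrative numbers.
awk 'BEGIN {
    iterations = 1000   # how many times the same blocks are rewritten
    blocks = 100        # distinct blocks the task touches

    # Uncached: every iteration of every block hits the device.
    print "uncached device writes:", iterations * blocks
    # Cached, flushed once: each dirty block is written exactly once.
    print "cached (single flush): ", blocks
}'
```

This is why Jim sees 80-90% write reduction on real projects: the cache coalesces repeated writes to the same locations before they ever reach the SSD.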
Joined: 9 Jul 20 Posts: 142 Credit: 11,536,204 RAC: 3
Yeah, you're completely right. I used the numbers stated on Samsung's website for 1 TB NVMe SSDs, which may be a whole different story than your standard SATA SSD. With your numbers in mind, I completely understand your situation and share your concerns. I agree that more than 2x of total capacity in daily writes (on a 120 GB SSD) is definitely too much, even at a $20 price point. And while the monetary daily depreciation would only increase to $0.10/day, still low compared to the operating or other components' costs, I feel you when you say the EOL could potentially be reached within not even a full year. Needless to say, the waste produced by rendering the device unusable through heavy, sustained writes is much worse and should be avoided.

By the way, I liked Jim's advice very much. Easy to implement, and it should do the trick of protecting the SSD against excessive rewriting of the same data.
Joined: 30 Nov 20 Posts: 14 Credit: 7,958,883 RAC: 16
My machine https://www.mlcathome.org/mlcathome/show_host_detail.php?hostid=5035 has a GTX 750 Ti with 2 GB memory. It is not getting any GPU tasks, so I am wondering if the requirements have changed?
©2023 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)