|
61)
Message boards :
News :
[TMIM Notes] June 8 2021
(Message 1213)
Posted 9 Jun 2021 by pianoman [MLC@Home Admin] Post: This Month in MLC@Home Notes for June 8 2021 A monthly summary of news and notes for MLC@Home

Summary

Updates have come slowly these past few months, since the presentation at the BOINC workshop and the release of our initial paper, as we're personally adjusting (fortunately!) to the beginnings of post-pandemic life. Work, family life, and everything else are changing for many of us, and we're still trying to figure out the new normal. Because of this, these updates will be monthly going forward, since they take quite a bit of time to put together and we've been failing to get them out weekly for a while now anyway. And here's hoping our volunteers all over the world are in an area where they too can start to move beyond the worst of the pandemic.

But that doesn't mean the project has been dormant! DS1/DS2/DS3 are all nearing completion, especially DS3, which is sitting at 97%. We've been talking about DS4 for months, and the code is ready for larger testing. Unfortunately, we rolled out a test client a few weeks ago that failed miserably because of an incompatibility between PyTorch and the native BOINC API. There's a way around this, but it requires more development and a change to how WUs are specified, and we've been working on it ever since. We should be ready any day now, but it's been more involved than we thought, so we're not prepared to give a firm date. We do know we need it soon, as DS3 WUs are running out. Among the other benefits of the new client: it's statically linked, which vastly simplifies deployment. The extra development time has also given us a chance to make the client more robust to NaNs, which should cut down on the number of validation errors on the system.

Another new issue: the data partition on the server is running out of space. DS3 alone is taking over 4 TB! Thanks to all of our volunteers! We've moved some things around to make a little space, so everything is still working for now. We received some new storage today and will need some downtime to get it installed. It shouldn't take more than a few minutes, so we'll just do it sometime within the next week.

So stay tuned; the next month is going to be interesting for MLC@Home as we move into DS4 and the next phase of this research.

Other News
|
|
62)
Questions and Answers :
Unix/Linux :
Linux for Windows Users
(Message 1205)
Posted 30 May 2021 by pianoman [MLC@Home Admin] Post: Note: I've temporarily disabled the ROCm client while I rewrite the new version of the client; the rewrite will change the WU format a little bit. I'll re-enable it over the next few weeks. |
|
63)
Questions and Answers :
Issue Discussion :
Validate errors
(Message 1204)
Posted 30 May 2021 by pianoman [MLC@Home Admin] Post: Yes. There's some code we've developed to detect NaNs and automatically restart the optimizer so it goes down a new path. It's queued up for the release after the current one we're trying to roll out, so not 9.90, but 9.91 (9.89 was what we rolled out to test two weeks ago and had to pull back due to unforeseen errors). I know you have no reason to believe me, considering how slowly things have been progressing recently due to work, school, and family demands over the past two months (that should start getting better next week), but I hope to have a solution to at least the majority of NaNs within a few weeks. |
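To make the idea concrete, here's a minimal sketch (PyTorch-style, not the project's actual client code) of that approach: watch the training loss for NaN and, when it appears, skip the step and rebuild the optimizer so the search heads down a new path. The model, optimizer choice, and learning rate are placeholders.

```python
import torch

def train_with_nan_restart(model, loss_fn, data_loader, epochs=10, lr=1e-3):
    """Train, restarting the optimizer whenever the loss turns into NaN."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in data_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            if torch.isnan(loss):
                # The search has wandered into a bad region: drop this step and
                # rebuild the optimizer so its internal state (momentum, etc.)
                # is reset and the search takes a different path.
                optimizer = torch.optim.Adam(model.parameters(), lr=lr)
                continue
            loss.backward()
            optimizer.step()
    return model
```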
|
64)
Questions and Answers :
Unix/Linux :
New client testing has started
(Message 1198)
Posted 17 May 2021 by pianoman [MLC@Home Admin] Post: We've released a new version of the Linux CPU client (with DS4 support) into the mldstest channel. This is the biggest change in how the client is built and run since the beginning of the project, so we expect a little bit of chaos. We've already run into one issue where the application runs fine standalone but segfaults while loading the dataset when run under the BOINC client. So expect some bumps on the test channel over the next few days while we iron out the issues. Here are the significant changes:
* DS4 (CNN) support
* No more appimage; instead we have a statically linked binary, so a single-file download
* Tested standalone on CentOS 7 (new), Ubuntu 21.04, 20.04, and 16.04, and Debian 10 and 11, on both AMD and Intel
* HDF5, OpenBLAS, and PyTorch are compiled directly into the binary
* Updated to PyTorch v1.8
And lots of other little fixes/changes to support DS4. This should set us up for many months to come, so please bear with us as we work out the kinks now. GPU, Windows, and ARM clients will follow once the basic Linux CPU client is working. |
|
65)
Message boards :
News :
[TWIM Notes] May 1 2021 posted
(Message 1195)
Posted 2 May 2021 by pianoman [MLC@Home Admin] Post: MLC@Home has posted the May 1 2021 edition of its weekly "This Week In MLC@Home" newsletter! A server hiccup, a note about possible issues on newer Linux distributions with aggressive systemd sandboxing, and hope for a new client rollout this week! Read the update and join the discussion here. |
|
66)
Message boards :
News :
[TWIM Notes] May 1 2021
(Message 1194)
Posted 2 May 2021 by pianoman [MLC@Home Admin] Post: This Week in MLC@Home Notes for May 1 2021 A weekly summary of news and notes for MLC@Home

Summary

An overdue update this week. First, we had a small server issue this morning, 5/1, and were down for about 10 hours until it was fixed. No data was lost, and we were able to restart with no further issues, although there may be some WUs that were marked invalid due to the unstable state of the system as it was going down; we're looking into that at the moment. It's the first bit of unscheduled downtime in a long time; fortunately we've been very stable since moving to the new server last year.

Second, thanks to an astute user, we've noticed a trend in newer Linux distributions that affects the MLC clients (as well as others like Einstein@Home and LHC). Some distributions are using systemd's sandboxing capabilities to keep the BOINC client and any project applications from interacting with the rest of the system for security reasons. Unfortunately, MLC's appimage-based clients use /tmp, which is now restricted under this new policy. We've identified this as an issue with Ubuntu 21.04 and Gentoo, and it may become an issue further down the line for other systemd-based distributions. For now, there's a workaround listed in our forums: https://www.mlcathome.org/mlcathome/forum_thread.php?id=198 . The next client update will drop appimage support and thus won't be affected by the issue going forward.

We've also spent some time working on the ROCm client, and have it working with Radeon VII as well as VEGA graphics cards. Unfortunately, the current client requires you to have rocm-3.9.0 installed on your system.

Speaking of the next client version, DS4 support is implemented and works, so we hope to roll out the new CPU client with some test WUs this coming week. Features include CNN (DS4) support, static linking (no more appimage!), and some more minor fixes.

Other News
|
|
67)
Questions and Answers :
Unix/Linux :
Fuse/read-only filesystem issue with newer distributions (and new client update)
(Message 1191)
Posted 28 Apr 2021 by pianoman [MLC@Home Admin] Post: Perfect, I just confirmed that adding -/tmp works on my machine as well. Time to write up instructions and post a news update. I'll sticky this post. Thank you! |
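For anyone landing here later: the workaround boils down to a systemd drop-in that re-allows the boinc-client service to write to /tmp. The snippet below is only a sketch assuming a stock systemd-packaged boinc-client; the stickied instructions are authoritative, and the exact directive and path in your unit file may differ.

```
# Hypothetical drop-in: /etc/systemd/system/boinc-client.service.d/override.conf
[Service]
# The leading "-" tells systemd not to fail if the path does not exist.
ReadWritePaths=-/tmp
```

After saving the drop-in, run `sudo systemctl daemon-reload && sudo systemctl restart boinc-client` to pick up the change.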
|
68)
Questions and Answers :
Unix/Linux :
clang-ocl: No such file or directory
(Message 1189)
Posted 28 Apr 2021 by pianoman [MLC@Home Admin] Post: Oh, you're preaching to the choir here. Every time I see anything in /opt I cringe. At least AMD is better than they were originally, when they really just wanted you to use their docker image. I'm not aware whether AMD is working with the Debian folks on proper packaging (I hope they are), but I've always used their repo. Are you working on a proper portage recipe for ROCm, BOINC, or both? If so, rock on; happy to help in any way I can. Do you have other projects using ROCm natively, or using OpenCL via ROCm? We're using PyTorch compiled with native ROCm/HIP support, not OpenCL; I wonder if that makes a difference. I'm wondering if the issue is that we're shipping our own ROCm libraries. If you have ROCm installed already and in your search path, then the libraries we ship are redundant, and possibly complicating the setup. Would you be willing to try running the client without our libraries? Here's the general process:
* Go into the mlc project dir in the boinc home directory
* Copy mlds_*, all the files that end in `-pt17rocm39`, and the dataset (*.hdf5) into a separate temp directory
* Remove the -pt17rocm39 suffix from the end of all the files that have it
* Create a `dataset.hdf5` symlink to the dataset file in the same dir (`ln -s Parity*.hdf5 dataset.hdf5`)
You should then be able to run the client from that directory via `./mlds_* -m 5` (it will run for 5 epochs and then exit). That should fail in the same way. From there, delete or rename the libraries in that directory that duplicate ones from your main ROCm installation and try running again. I'd love to hear how it fails in new and spectacular ways. I'm still trying to test the systemd issue. |
|
69)
Questions and Answers :
Unix/Linux :
Fuse/read-only filesystem issue with newer distributions (and new client update)
(Message 1187)
Posted 28 Apr 2021 by pianoman [MLC@Home Admin] Post: That... would certainly fit the observed behavior. Thanks for the pointer. I'm aware that systemd is capable of sandboxing, but I didn't think it would keep the boinc user from either writing to /tmp or mounting a fuse filesystem (which is what AppImage does: it creates a temporary mountpoint in /tmp, mounts the embedded squashfs image, then runs the binary from the mount point). But it would explain why it works fine when I go into the boinc directory and run it manually. It's always been a little ugly that we have to use appimage; it's a kludge around the fact that PyTorch can't (couldn't) be compiled as a static library. I can't wait to move away from that. I'll test this out and work on a workaround to post, and then get on with the next client release to put this all behind us. Note: currently, static compiling works for the next release for CPU only. I haven't tried CUDA, and ROCm with its hard-coded paths is going to be even worse. |
|
70)
Questions and Answers :
Unix/Linux :
Fuse/read-only filesystem issue with newer distributions (and new client update)
(Message 1184)
Posted 27 Apr 2021 by pianoman [MLC@Home Admin] Post: We've seen a few people report here in the past a weird fuse error that keeps the client from starting; the error from the task says something like "can't mount, read-only filesystem". After upgrading to Ubuntu 21.04 I'm now getting this issue myself, so I can at least debug it, but I still have no idea what's causing it. What's worse, running the program outside of boinc seems to work. The next release of the client will be statically linked and drop fuse entirely. The other main feature of the client, DS4 support, is almost ready (training the network works; I'm just tuning the runtimes and number of epochs to work on MNIST and Fashion-MNIST), so I'm going to hurry out a release of that ASAP. It's a *big* change from what we were doing before, but I think it'll be worth it. That said, also expect some bumps in the road and some more time testing than usual as we make sure the new client works as designed. |
|
71)
Questions and Answers :
Unix/Linux :
clang-ocl: No such file or directory
(Message 1182)
Posted 27 Apr 2021 by pianoman [MLC@Home Admin] Post: Thanks for trying this and the bug report. I share your frustration. I can reproduce this and I'm working on it. The official ROCm docs actually say this should be installed in /opt/rocm-X.Y.Z/bin, with a symlink /opt/rocm pointing to the latest /opt/rocm-X.Y.Z installed on the system. https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html#deploying-rocm . However, there are a ton of issues with this. First, the client as it stands now is compiled against rocm-3.9.x, and I had no idea it was looking in a hard-coded path to run clang-ocl. It worked on my machine because I happened to have that version installed; now it fails on my machine because I reinstalled ROCm with the latest version, 4.1.0, so /opt/rocm-3.9.0 doesn't exist. I had hoped that by shipping the ROCm libraries with the binary the user wouldn't even need to have ROCm installed; it would just use the local copy included with the app. Unfortunately, the libraries contain external calls out to hard-coded paths, with the version number in the path as well. This is also bad because BOINC is not ROCm-aware and has no way of telling me what version of ROCm is installed on the system (like it does with CUDA)... so if I compile against a version of ROCm, there's no way to know (ahead of time) whether the user has that version on their system. In reality, if ROCm can't find the specific version of clang-ocl in the path it expects, it should fall back to running whatever it can find in the search path... but apparently it doesn't, or at least not on the version I compiled the client against. Even static linking (what we're working on for the next release) won't fix this issue. I'm working on it, but it'll take a while. As a workaround, if you're willing, you can install rocm-3.9.0 on your system under /opt, and it should work on VEGA (and Radeon VII) as-is. We just had one user (see the chat on the Discord server) get it running on a Radeon VII by installing 3.9.0... but that is *far* from an ideal solution. |
|
72)
Questions and Answers :
Issue Discussion :
More teams needed in team list
(Message 1179)
Posted 26 Apr 2021 by pianoman [MLC@Home Admin] Post: I ran that script when I created the server, and it eventually crashed out for some reason after only importing part of the list. I made a note to check back again at some point, but I don't think I ever did. Come to think of it, it may have been due to some teams using emojis in their names and descriptions and my DB config complaining about them... but that was 10-ish months ago, so my memory is almost certainly fuzzy. I'll have to check whether that script is re-entrant so I can run it again without creating duplicate teams. |
|
73)
Questions and Answers :
Issue Discussion :
Cannot download NVIDIA GPU tasks
(Message 1168)
Posted 24 Apr 2021 by pianoman [MLC@Home Admin] Post: Hi Erich, What version of CUDA do you have installed on that system? We need CUDA 10.2 or higher installed. |
|
74)
Questions and Answers :
Windows :
GPU running idle most of the time
(Message 1164)
Posted 22 Apr 2021 by pianoman [MLC@Home Admin] Post: Hello and welcome to the project! First, it doesn't look like you're doing anything wrong. I should turn this into a FAQ. I'll cover the validation error in a second (and no, it is not ideal), but first, on GPU utilization:

The current batch and network sizes for the WUs in the GPU queue are rather small, especially for higher-end cards. This can lead to some "less than 100%" utilization on the GPU. It's not that the GPU isn't busy; it's that it spends a lot more time moving data into and out of the GPU than it does doing the matrix multiplication (which is what shows up as utilization). We're looking at ways to better utilize higher-end GPUs for DS4 (in development), which is a CNN as opposed to an RNN, so it should take better advantage of GPUs. Even with these inefficiencies, GPU training is still faster than CPU training (although CPU training is no slouch!). This seems to be especially an issue on Windows; Linux seems to do a better job overall of keeping the GPU busy, but not by much.

As for the validation error, this is a huge problem for us at the moment. ML training is, in general, a randomly-guided search for an optimal solution. Sometimes that search leads down paths that push the bounds of floating-point representation, which can lead to NaNs (invalid numbers). When this happens there's really no way to recover[1], and the result is invalid. Now, why not just give you credit anyway since you did the work? You shouldn't be penalized because you got unlucky... and we agree. The problem is we can't (yet) tell the difference between someone who did the computation and got an invalid result (in which case you deserve credit) and someone cheating by returning a random result to game the credit system. Even worse, if we detect the condition, return early, and then mark the result as valid, BOINC resource management gets screwed up: it thinks you completed the full work much faster than you did and starts mismanaging your resources, so you'll get results ending with TIMEOUT because it thinks you have a much faster machine than you do (BOINC accounting is... very powerful and very convoluted).

Overall, our validation error rate is under 0.5%, so it hasn't been too much of an issue. Additionally, certain network constructs are more susceptible to this behavior than others, and the work in the GPU queue at the moment is filled with some of those networks. As we're finishing up DS1/DS2, the remaining Parity networks are the hardest ones to search for; that's why they're still not complete. They also have a much higher rate of these wayward searches, up to 5%. That's too high and we haven't fixed it yet. We need to do something about that, probably by just granting credit for NaN networks and accepting that there's a chance people can cheat, at least until those networks are cleared out. The CPU queue doesn't have this problem to nearly the same degree.

This is a very long answer; the TL;DR is: you've done everything right, the random nature of ML training makes this hard, and we'll attempt to tweak the validation criteria in the next few days so that your result at least gets credit. Thanks for volunteering, and I hope this hasn't scared you off from the project!

[1] To make matters even more complicated, if the task is suspended/resumed while it's stuck in a NaN state, that re-kicks the optimizer and the WU can (sometimes) recover and start searching valid values again. We may be able to update the client to do this, but it is not implemented yet. |
|
75)
Message boards :
News :
[TWIM Notes] Apr 22 2021 posted
(Message 1161)
Posted 22 Apr 2021 by pianoman [MLC@Home Admin] Post: MLC@Home has posted the Apr 22 2021 edition of its weekly "This Week In MLC@Home" newsletter! A busy few weeks: Paper published! 2021 BOINC Workshop presentation! Read the update and join the discussion here. |
|
76)
Message boards :
News :
[TWIM Notes] Apr 22 2021
(Message 1160)
Posted 22 Apr 2021 by pianoman [MLC@Home Admin] Post: This Week in MLC@Home Notes for Apr 22 2021 A weekly summary of news and notes for MLC@Home

Summary

It's been a very busy few weeks for MLC@Home! On 4/14/2021, MLC presented at the 2021 BOINC Workshop. The slides are available here; the video of the presentation will be posted to youtube shortly. On 4/21/2021, we participated in day 2 of the workshop as a member of a panel on doing AI/ML using BOINC. Videos of that should be posted shortly as well. It was clear that there's a lot of interest in using BOINC for AI/ML, and that MLC is at the forefront of that interest.

More importantly, MLC@Home today released the first paper based on the MLDS dataset computed by our volunteers! MLDS: A Dataset for Weight-Space Analysis of Neural Networks. In this paper, we show meaningful clustering in weight space for networks that are trained on the same data. Like any good science, these preliminary findings raise just as many new questions as they answer.

All in all, as we continue to work on DS4, there are a lot of big things afoot for the future of MLC. Thanks again for the support you have shown MLC, and we hope to continue to earn your support as we move forward.

Other News
|
|
77)
Questions and Answers :
Issue Discussion :
How long before you show up in the stats?
(Message 1155)
Posted 18 Apr 2021 by pianoman [MLC@Home Admin] Post: If you haven't, please make sure you have "Do you consent to exporting your data to BOINC statistics aggregation Web sites?" checked under Project -> Preferences at the top of this page. |
|
78)
Message boards :
Cafe :
How does MLC verify results without running multiple tasks per work unit (redundancy)?
(Message 1151)
Posted 16 Apr 2021 by pianoman [MLC@Home Admin] Post: I'm working on adding a few more criteria, which I'll keep vague to deter potential cheating. But in general, that's why we don't do multiple results per WU: the results are stochastic and won't be identical at all. In fact, this stochastic nature is what we're trying to capture, so it helps. |
|
79)
Message boards :
Cafe :
How does MLC verify results without running multiple tasks per work unit (redundancy)?
(Message 1150)
Posted 16 Apr 2021 by pianoman [MLC@Home Admin] Post: Each network is compared against a test dataset that is held back from being sent to users. If the network performs well on that test set (and is a valid network, and is not identical to a previously submitted result), then it is considered valid. |
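As an illustration only, a server-side check along those lines might look like the sketch below (PyTorch-style). The accuracy cutoff, the hashing scheme for duplicate detection, and the function names are assumptions for the example, not the project's actual validator.

```python
import hashlib
import torch

def validate_result(model, test_loader, seen_hashes, min_accuracy=0.95):
    """Return True if the submitted network looks like a legitimate result."""
    # Reject exact duplicates of previously accepted networks.
    digest = hashlib.sha256(
        b"".join(p.detach().cpu().numpy().tobytes() for p in model.parameters())
    ).hexdigest()
    if digest in seen_hashes:
        return False

    # Reject networks whose weights are not finite (NaN/Inf).
    if not all(torch.isfinite(p).all() for p in model.parameters()):
        return False

    # Require reasonable performance on the held-back test set.
    correct, total = 0, 0
    model.eval()
    with torch.no_grad():
        for x, y in test_loader:
            pred = model(x).argmax(dim=1)
            correct += (pred == y).sum().item()
            total += y.numel()
    if total == 0 or correct / total < min_accuracy:
        return False

    seen_hashes.add(digest)
    return True
```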
|
80)
Message boards :
News :
[TWIM Notes] Apr 8 2021 posted
(Message 1145)
Posted 9 Apr 2021 by pianoman [MLC@Home Admin] Post: MLC@Home has posted the Apr 8 2021 edition of its weekly "This Week In MLC@Home" newsletter! Information on the 2021 BOINC workshop next week, where we'll present MLC and release our first paper! Read the update and join the discussion here. |