|
21)
Questions and Answers :
Issue Discussion :
Rogue batch ?
(Message 1375)
Posted 1 Oct 2021 by pianoman [MLC@Home Admin] Post: Looking at this now.. working on it. Not sure what's going on, looks like a connection issue to the database somewhere. Tracking. |
|
22)
Questions and Answers :
Issue Discussion :
Multiple MLDS.exe apps running
(Message 1371)
Posted 29 Sep 2021 by pianoman [MLC@Home Admin] Post: This is new behavior with the new v9.90+ client, right? Not with the old one? v9.90 moved to using a "wrapper"... where there's a BOINC-provided wrapper program that runs a generic, unmodified binary to process the code. I'd say the majority of BOINC projects run the wrapper for their clients, we were a bit of an odd duck because we modified our client to use the BOINC API directly. However, there were issues with this and pytorch (both wanted to use the SIGALRM posix signal), so we moved off of that for the latest client. It's certainly possible that the wrapper is, sometimes, on windows, not cleaning up all its child threads. But since we didn't write the wrapper (literally we're using the binary from https://boinc.berkeley.edu/trac/wiki/WrapperApp), I think this might be worth opening a bug report on the main BOINC github repo. :( |
|
23)
Questions and Answers :
Issue Discussion :
Multiple MLDS.exe apps running
(Message 1360)
Posted 13 Sep 2021 by pianoman [MLC@Home Admin] Post: Are they using up any CPU time? If they are, and the client says there's no WUs currently *running*, then there's an issue. When you say there's "no WUs in progress" .. do you mean none currently running, but there are some that are partially complete but waiting for their turn again? Or that there's no WUs even partially complete? There are two reasons I can think of why there may be stray executable threads sitting around. One is that the client (especially on windows) sometimes spawns threads which aren't actually used, so they're harmless, not using any compute power, and we just ignore them. The other is if the WU is in progress but the client suspends it for some reason (because it's time to run another project, for example), by default the client will keep the executable loaded in memory but paused.. so it again doesn't use any CPU resources, but does use up memory. This is done so that resuming is much easier. I think there's a BOINC client setting to change that default behavior and force suspended WUs to unload the exe from memory. IF the client says that no mlc WUs are running or in progress, AND there are mlds processes still running and consuming CPU time, then that's a bug that needs to be fixed. |
|
24)
Questions and Answers :
Windows :
Long run times on second GPU
(Message 1351)
Posted 28 Aug 2021 by pianoman [MLC@Home Admin] Post: First, thanks for supporting the project. Second, can you run "nvidia-smi" to verify that the client is actually using the second card (and not double-loading the 1st)? Here's a link on how to find this command on windows https://stackoverflow.com/questions/57100015/how-do-i-run-nvidia-smi-on-windows. Run it while boinc is running. I don't have a dual GPU machine to test. But assuming it's working as intended, then I can say that yes, the way the current GPU WUs are crafted, there is a lot of activity moving the dataset and model between CPU and GPU RAM, so it's not /too/ surprising that if the first slot is x16, and the second slot is x4, the WU might take 2-3x as long. This is why you'll also see people here with very fast graphics cards saying they get low GPU utilization. |
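One way to make the "is the second card actually busy?" check repeatable is to ask nvidia-smi for machine-readable output and look for cards sitting idle. This is only a rough sketch: the function names and the 5% idle threshold are made up for illustration, and it assumes every field comes back as a plain number (some cards report "[N/A]" for utilization, which this does not handle).

```python
import csv
import subprocess

# nvidia-smi's CSV query mode: one line per GPU, no header, no unit suffixes.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(text):
    """Parse lines like '0, Some GPU, 97, 2100' into dictionaries."""
    rows = []
    for fields in csv.reader(text.strip().splitlines(), skipinitialspace=True):
        if len(fields) != 4:
            continue  # skip anything that is not a 4-column GPU row
        index, name, util, mem = (f.strip() for f in fields)
        rows.append(
            {"index": int(index), "name": name,
             "util_pct": int(util), "mem_mib": int(mem)}
        )
    return rows


def idle_gpus(rows, threshold_pct=5):
    """Return the indices of GPUs whose utilization is below the threshold."""
    return [r["index"] for r in rows if r["util_pct"] < threshold_pct]


def query_gpus():
    """Run nvidia-smi (must be on PATH) and return the parsed rows."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_rows(result.stdout)
```

Calling `idle_gpus(query_gpus())` a few times while BOINC has a GPU WU assigned to each card should come back empty if both cards are in use. If the second card's index keeps showing up, the client is probably double-loading the first one.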
|
25)
Message boards :
News :
Updated CPU client 9.9x release and issues
(Message 1348)
Posted 28 Aug 2021 by pianoman [MLC@Home Admin] Post: This fix will not solve ALL outstanding issues, but it should help with: * Computation errors with no output in the logs that became prevalent in the last 24 hours (the 24h failure rate jumped from 1% to 80% over the past two days) * The memory limit has been set back to 800MB, as originally intended. It turned out that was not the issue. There are still known DLL issues on windows, at least one report of a crash involving a file already existing that shouldn't exist, and one crash on an odroid (arm) system. Please keep reporting these issues and we'll tackle them as we can. I want to re-assure anyone experiencing those that we're not ignoring you at all. Thanks for volunteering your compute time! |
|
26)
Message boards :
News :
Updated CPU client 9.9x release and issues
(Message 1347)
Posted 28 Aug 2021 by pianoman [MLC@Home Admin] Post: Earlier this week, we released the latest v9.90 CPU client after almost 3 weeks of testing. While it initially seemed to be working fine, a number of errors started accumulating over the last 24 hours. We've identified a server configuration issue and believe it is now fixed as of 6AM UTC today. The server was generating invalid WUs for the MLDS queue. We've cancelled all of the problematic WUs and are adding new ones to the main queue. The GPU clients and MLDSTEST queue remained unaffected. v9.90 is an important release for MLDS, as it contains support for CNNs and Dense feed forward network types needed for DS4. Highlights include: - Statically linked binary for Linux (no more AppImage) - DS4 support! (CNN and Dense networks) - Better NaN handling - Update to libTorch 1.9 - Wrapper instead of BOINC native API
|
|
27)
Questions and Answers :
Windows :
Exit status ... dll file (ucrtbase.DLL) not found - MLDS v9.90
(Message 1344)
Posted 27 Aug 2021 by pianoman [MLC@Home Admin] Post: Working on it. I find it interesting this one crashed after 2 hours of runtime. Thanks for posting these. |
|
28)
Questions and Answers :
Issue Discussion :
25% 'Error while computing' after upgrade to v9.90
(Message 1343)
Posted 27 Aug 2021 by pianoman [MLC@Home Admin] Post: Working on it. Two issues have come up suddenly.. one is drastically larger memory usage, the other is the issue you see below. I've been updating discord since it was brought to my attention 12 hours ago. The memory issue is related to OMP, but I thought the model-* issue was harmless and wouldn't be seen on volunteer machines. I'll be pushing out an update that fixes the latter, still not sure how to fix the former. |
|
29)
Questions and Answers :
Windows :
Exit status -1073741515 (0xC0000135) STATUS_DLL_NOT_FOUND
(Message 1338)
Posted 25 Aug 2021 by pianoman [MLC@Home Admin] Post: This is so frustrating. The error is intermittent even on the same host, and appears to be happening in the bowels of the pytorch library, so it's not (necessarily) anything we've done in the client. And you're sure this system runs reliably with other intense projects? I don't doubt it, but I can't rule out a cache or memory issue endemic to your system either. That said, there are other scattered reports of rare intermittent crashes on other windows systems, but it seems to be on >1% of WUs. Hard to know what the issue is. I really wanted to statically link the new client, but that just wasn't happening on windows. I wish I had a better answer than "I don't know", but that's all I have for you for now. Oddly the linux client seems to be a bit more unstable too.. maybe this is a pytorch v1.9 issue? Can you check the boinc client logs and see if the program has suspended/resumed just before crashing? We *did* change some code around that. You can simulate this by running a WU, then suspending it in the gui, and then resuming it. |
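As a side note on the two numbers in this thread's title: -1073741515 and 0xC0000135 are the same 32-bit value, read once as a signed integer and once as unsigned hex. A minimal sketch of the conversion (the helper name is made up for illustration):

```python
def exit_status_to_hex(status: int) -> str:
    """Reinterpret a signed 32-bit exit status as an unsigned hex code."""
    # Masking with 0xFFFFFFFF keeps the low 32 bits, which turns the
    # negative two's-complement value into its unsigned equivalent.
    return f"0x{status & 0xFFFFFFFF:08X}"


print(exit_status_to_hex(-1073741515))  # prints 0xC0000135
```

The hex form is usually the easier one to search for, since that is how Windows status codes such as STATUS_DLL_NOT_FOUND are listed.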
|
30)
Questions and Answers :
Issue Discussion :
Memory requirement for CPU WUs
(Message 1334)
Posted 25 Aug 2021 by pianoman [MLC@Home Admin] Post: Ugh. No, the memory requirements for CPU shouldn't have gone up, looks like I set the GPU limits by accident. Will fix within the hour, but expect the existing WUs to take a few days to flush out of the system. My mistake, my apologies. |
|
31)
Questions and Answers :
Issue Discussion :
Out of work for CPU
(Message 1328)
Posted 24 Aug 2021 by pianoman [MLC@Home Admin] Post: Before anyone complains, yes, we're out of work for the CPU queue at the moment, but there's good reason this time! We're updating to a new app tonight (what's been in testing) that is incompatible with the current WUs in-flight. So tonight we'll be aborting all the outstanding WUs on the CPU queue, and re-issuing them to be compatible with the new client. It should all be resolved by tomorrow, just be aware you may get some aborted WUs and that's unavoidable during this transition. Note the GPU clients aren't ready yet for the new WU type (they may or may not take long, the linux side especially changed the way it links the code, so it may take a bit to iron that out w/ cuda and rocm), but for now the GPUs will continue to crunch with the old WUs to finish up the work there. Thanks for your patience and understanding |
|
32)
Questions and Answers :
Windows :
New Windows CPU client in mldstest (9.90)
(Message 1319)
Posted 12 Aug 2021 by pianoman [MLC@Home Admin] Post: A ha, found one issue.. I missed packaging a DLL with the client, so if you don't happen to have msvcp140.dll installed on your system already the new client would fail. I updated the windows test client to 9.91 to fix this. So if you tried the client before and it failed, please try again. |
|
33)
Questions and Answers :
Windows :
New Windows CPU client in mldstest (9.90)
(Message 1318)
Posted 12 Aug 2021 by pianoman [MLC@Home Admin] Post: OK, I've noticed at least some issues with people running the *latest* versions of windows 10. I compiled the client on windows 10 2004, but I installed a VM with Windows 10 21H1, and it errors out. I'm currently debugging, and I'm not happy windows 10 runtimes don't seem to be compatible among windows versions. |
|
34)
Questions and Answers :
Issue Discussion :
WU Runtime for CPU WUs
(Message 1317)
Posted 11 Aug 2021 by pianoman [MLC@Home Admin] Post: Hyperthreading/SMT... maybe threads are fighting over shared FPU resources? might have just as high an impact as the cache issues you brought up. |
|
35)
Questions and Answers :
Issue Discussion :
WU Runtime for CPU WUs
(Message 1315)
Posted 11 Aug 2021 by pianoman [MLC@Home Admin] Post: How much RAM do you have? Each WU takes at least half a gig... more if they're GPU WUs. What you're describing almost sounds like some of the WUs were pushed out to swap. |
|
36)
Message boards :
News :
[TMIM Notes] Aug 6 2021
(Message 1314)
Posted 11 Aug 2021 by pianoman [MLC@Home Admin] Post: The developer is actually Delta, who does the BOINC Radio podcast. DS4 WUs will likely take a little less time than DS1/DS2 initially, but we'll be tweaking those (and balancing credit). Runs on my test machine show DS4 WUs using the CNN complete an epoch in 11 seconds, versus DS1's RNNs which take 22s/epoch. Then the question becomes scaling the number of epochs we need to run to get good results, which I think will be more than what we do for DS1, but probably not twice as many. It's a balancing act that the data will drive, so expect a little bit of flux at the beginning. Also, a "DS4 WU" might not be uniform. Those numbers are using a simple CNN on MNIST-like datasets (black-and-white images), but we'd also like to do CIFAR (24-bit color), which will be more complex and drive up runtimes (and credits). On the bright side, we released the new windows CPU client in mldstest last night, and at least one user is stating they're seeing a nice speedup with DS1 WUs. |
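To put rough numbers on the epoch trade-off described above: at 11 s/epoch versus 22 s/epoch, a DS4 WU only matches a DS1 WU's runtime once it runs twice as many epochs. The sketch below just does that arithmetic. The per-epoch timings are the ones quoted in the post; the epoch multipliers are hypothetical, since the actual DS4 epoch counts had not been settled.

```python
DS1_SEC_PER_EPOCH = 22.0  # RNN timing quoted in the post
DS4_SEC_PER_EPOCH = 11.0  # CNN timing quoted in the post


def relative_runtime(epoch_multiplier: float) -> float:
    """DS4 runtime as a fraction of DS1 runtime.

    epoch_multiplier is how many times more epochs a DS4 WU runs
    compared with a DS1 WU (a hypothetical knob, not a project setting).
    """
    return epoch_multiplier * DS4_SEC_PER_EPOCH / DS1_SEC_PER_EPOCH


for multiplier in (1.0, 1.5, 2.0):
    ratio = relative_runtime(multiplier)
    print(f"{multiplier:.1f}x epochs -> {ratio:.2f}x DS1 runtime")
```

So "more epochs than DS1, but probably not twice as many" lands somewhere between 0.5x and 1.0x of a DS1 WU's runtime on the same machine, before CIFAR-style datasets push it up.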
|
37)
Questions and Answers :
Windows :
New Windows CPU client in mldstest (9.90)
(Message 1313)
Posted 11 Aug 2021 by pianoman [MLC@Home Admin] Post: All, There's a new windows client available in the MLDS test work queue. This new client brings all the features of the new Linux CPU client (DS4 support, wrapper support, etc..) except it's still not statically linked, meaning we'll probably still have windows DLL mismatch issues. Again, if there's any windows developers out there who can take a crack at getting the client statically linked, the source is available on gitlab and I think we're mostly there. In the meantime, please test the new client in mldstest and let me know how it goes. So far (less than 12 hours in) we're seeing an 86% success rate, and most of the failures are hosts from one user that are failing for unknown reasons (likely DLL issues). We'll have to track those down. Thanks again for supporting the project, and happy crunching. |
|
38)
Questions and Answers :
Windows :
Exit status -1073741515 (0xC0000135) STATUS_DLL_NOT_FOUND
(Message 1312)
Posted 11 Aug 2021 by pianoman [MLC@Home Admin] Post: Thanks for this. When boinc runs, it creates a new directory and copies all the files that end with "-961" and the main exe to that new directory, and renames the files to not have the "-961" extension. Then it runs the exe from that new directory, with the DLLs available. You can simulate this manually if you like. I'll admit though that I don't understand what these new "api-ns-win-*" libraries are, I assume they're supposed to be part of core windows. None of this would be a problem if I could convince windows to statically link the client, but I have bad news on that front. I spent 2.5 weeks last month working on it, and couldn't make it work (or rather, I got windows to compile it, but it crashed when it tried to run). PyTorch is a big, complicated beast. That said, we do have an updated windows CPU client available in the "mldstest" work queue. If you'd like, you can try running that, although I'm not sure it'll be any different. So far, after approximately 400+ WUs returned, I'm seeing an 86% success rate.. which tells me there's still some missing dlls we're not shipping that are on most, but not all, systems. |
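For anyone who wants to simulate that copy-and-rename step by hand, here is a rough sketch of it. It is not what BOINC actually runs internally, just an imitation of the behavior described above; the function name, directory layout, and file names in the demo are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List

SUFFIX = "-961"  # version suffix mentioned in the post


def stage_run_dir(project_dir: Path, exe_name: str, run_dir: Path,
                  suffix: str = SUFFIX) -> List[str]:
    """Copy suffixed files and the main exe into run_dir, dropping the suffix.

    Returns the file names as they appear in run_dir.
    """
    run_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for src in sorted(project_dir.iterdir()):
        if not src.is_file():
            continue
        if src.name.endswith(suffix):
            target = src.name[: -len(suffix)]  # e.g. foo.dll-961 -> foo.dll
        elif src.name == exe_name:
            target = src.name                  # main exe keeps its name
        else:
            continue                           # everything else stays behind
        shutil.copy2(src, run_dir / target)
        staged.append(target)
    return staged


if __name__ == "__main__":
    # Demo with placeholder files; real DLL and exe names will differ.
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "project"
        project.mkdir()
        (project / "example.dll-961").write_text("placeholder")
        (project / "mlds.exe").write_text("placeholder")
        print(stage_run_dir(project, "mlds.exe", Path(tmp) / "slot"))
```

Running the exe from the staged directory and seeing which DLL Windows complains about is a quick way to tell whether a missing file is one the project should have shipped or one that was expected to come with the OS.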
|
39)
Message boards :
News :
[TMIM Notes] Aug 6 2021
(Message 1306)
Posted 6 Aug 2021 by pianoman [MLC@Home Admin] Post: This Month in MLC@Home Notes for Aug 6 2021 A monthly summary of news and notes for MLC@Home Summary Another month of good progress on MLC! First, this past month saw the completion of DS3! You have trained over 1,000,000 neural networks for DS3, which is a huge accomplishment. We're continuing to bundle and evaluate the dataset, so look for a complete public release shortly. We also spent some time on the backend preparing for DS4. We've updated the website to show DS4 progress, but haven't sent any DS4 WUs yet. Instead we spent the bulk of the month trying to get the new client to work under Windows, which hasn't been going well. We spent a good 2.5 weeks trying to get pytorch and the client to compile (and run) statically on windows. Even though it now compiles, the client crashes when running. So last week we switched back to linking dynamically, and want to get an updated windows client out this weekend. The Linux/CPU version of the new client appears to be performing fantastically, so thanks to everyone who ran WUs from the "mldstest" queue! DS4 WUs are incompatible with the older client, so we'll only release DS4 WUs as the new (v9.9x) client becomes available for each platform. This means CPUs first. GPUs will continue to work on finishing up DS1 and DS2. Speaking of DS1/DS2, we're approaching the end of DS1 with only a few more weeks to go, and when we complete those networks we'll switch to DS2 to finish those up as well. So, lots of movement this month behind the scenes, and great progress on the existing datasets. If *any* windows developers would like to help us out getting the new windows CPU client out the door, please contact us directly, we could use the help. Other News
|
|
40)
Questions and Answers :
Issue Discussion :
Out of work for CPU
(Message 1299)
Posted 29 Jul 2021 by pianoman [MLC@Home Admin] Post: Pumped out another 8K WUs last night, and they're all consumed again. I'm going to push out some more, and look into a way to automate this until DS4 is ready (or rather, until the windows CPU client that supports DS4 is ready). |
©2022 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)