|
1)
Message boards :
News :
MLC@Home shutting down for now, and thank you!
(Message 1532)
Posted 2 days ago by pianoman [MLC@Home Admin] Post:

MLC@Home is shutting down

After over two years, some bumpy moments, and tremendous support from our volunteers, I, as MLC admin, am making the decision to shut down MLC@Home as a BOINC project for the time being.

Why?

We've achieved the goals I set out to accomplish (and more!) with 4 complete datasets comprising dozens of terabytes of data to analyze. Now we need to focus on analyzing the results and writing papers. As a researcher, at some point you have to stop generating data and write; and my family, work, and school commitments have limited the amount of time I can spend generating new experiments. This should be evident as I've been less and less responsive to the community over the past 6 months, for which I apologize. While we can always want more from any endeavor, I think we've accomplished a lot for now, and I want to put the project on indefinite hiatus until something new comes along.

This is a time to celebrate all that our volunteers have achieved together! This community has been amazing between the forums and Discord. We're shutting down not because of any problem, but because we've achieved the goals we set out to accomplish. For that, I couldn't be more grateful.

The only bittersweet aspect of shutting the project down is that I had hoped to grow MLC@Home beyond MLDS, to become a platform for democratized machine learning research. I failed to gain traction with other researchers, and as such MLDS was the only project on MLC@Home. COVID is partly to blame[1], but there are a number of other factors, ranging from how research is funded in a hot field like ML to my own limited time commitments. If other researchers express an interest we can revive the project in the future, but for now I cannot justify running the project without a real path to meaningful new work. That wouldn't be fair to our volunteers.

What happens now?
First, as promised, the datasets will remain available (DS4 will require some thought and time to release, see below), and the main MLC@Home website (https://www.mlcathome.org) and Twitter feed will remain active so I can post updates on any papers and on how to access DS4 when it is available. For now, there are no changes to the BOINC server portions of the website. I'll need to read up on how to properly archive the forums, project pages, and stats so that they can remain available (read only) without becoming a magnet for spam and the (currently hourly...) hacking attempts (sigh...). I will also be winding down the Discord community over the next month or so.

For me personally, I will continue my research and work on publishing meaningful results. I'll also continue to support other BOINC projects (I've been contributing to BOINC since the SETI@Home classic days) and support the idea of volunteer computing. At some point, I'll write up my experience as a researcher starting a new project and running it from beginning to end, and I hope that will be a resource for other projects wanting to start out. It's generally been a positive experience, but there are some definite areas for improvement.

For you, I encourage you to continue to support other great BOINC projects with your computing time. The official list is here: https://boinc.berkeley.edu/projects.php.

DS1/2/3 are up for download now, what about DS4?

DS4 is large: over 12TB in size for just the Dense portion. So it's going to require even more time to copy, package, analyze, and upload. I intend to do this after my analysis and thesis are complete, which should be in the next 6 months. If you are a researcher and want access to the dataset sooner, please contact me directly and we can work something out.

The original idea for DS4 was to compute neural networks for each type of data using dense networks, LeCun-style CNNs, and AlexNet CNNs. 
It turns out LeCun networks are so small and easy to compute that I can compute 50,0000 of them locally on my own workstation in a day or two, so I didn't bother sending those out as BOINC workunits (also because the current client crashes when computing LeNet5 on some platforms, and it was faster to compute it locally than to track down the bug). Since it's debatable what scientific benefit having AlexNet (another CNN) brings over LeCun networks, I'll likely drop those from the dataset.

Thanks

Even if nothing else happens, MLC@Home has been a major success. We produced scientifically interesting and unique datasets, introduced a whole new type of science (machine learning) to the BOINC community, and showed that machine learning research can be conducted by a group of volunteers over the internet.

There are a few groups and individuals I'd like to specifically thank for making this project such a success. These include, but aren't limited to: the BOINC developers, especially Vitalii Koshura and the other developers on the BOINC Discord server, for helping me develop the project from the very beginning; Marcus (Delta on the BOINC Discord servers), for contributing directly to MLC@Home's server backend processing software, and who, along with JRingo, runs the BOINC Radio podcast that promoted and supported MLC@Home from the very beginning; and Mike from the PrimeGrid project, for providing some crucial early advice on running a new project. I'm sure I'm forgetting many others; just know that we, as a community, have many to thank for the success of this project.

I'd like to extend an extra thanks to the early volunteers on the project who helped make the forum a helpful and welcoming place. Thanks also to the CoRaL Labs and my advisor at UMBC for supporting the research and providing funding for the new server after we quickly outgrew our original 2015-era ThinkPad laptop. 
Finally, thanks to our 4200+ volunteers, who crunched over 12.5 million work units using more than 17000 hosts. I am truly humbled by your contributions and what we've achieved together. None of this would have been possible without you. Thank you for giving a small unknown researcher a chance, and I encourage you to seek out smaller projects in the future, as their success will help determine whether BOINC continues to grow and thrive.

I leave you with one last, satisfying website screenshot:

Thanks again to everyone,
pianoman

-- MLC@Home primary researcher and admin
Homepage: https://www.mlcathome.org/
Email: mlcathome2020@gmail.com
Twitter: @MLCHome2 |
|
2)
Message boards :
News :
DS3 Dataset of 1 million trained neural networks is available for download!
(Message 1510)
Posted 2 May 2022 by pianoman [MLC@Home Admin] Post: Hello volunteers!

Just a quick note that Dataset 3 is finally posted for download at our site: https://www.mlcathome.org/mlds.html! Dataset 3 was completed a few months ago, but due to its massive size (2.25TB in all), and because we prioritized our own analysis over packaging the results for download, it's taken us until now to make it available. As a reminder, DS3 contains over 1 million trained neural networks (10,000 each, modelling 100 different automata), with the goal of analyzing how networks of the same size and shape encode similar-but-not-exact training data. Expect an updated paper soon!

We've always held that if the public is doing work for this project, then the results of that work should be made available back to the public to further science. As of right now, all of DS1, DS2, and DS3 are available to the public under a CC-BY-SA 4.0 license. We will do the same with DS4 when it completes.

DS3 is released via torrents due to its size. A few volunteers have already downloaded and seeded these (very large) files, so hopefully new downloads should be a bit quicker than if we were serving them from our single server alone. The torrent files are listed on our website, and we're using the Academic Torrents tracker (see: https://academictorrents.com/browse.php?search=mlds).

Thanks again to all our volunteers! DS3 is quite an accomplishment!

-- The MLC@Home Admin(s)
Homepage: https://www.mlcathome.org/
Discord invite: https://discord.gg/BdE4PGpX2y
Twitter: @MLCHome2 |
|
3)
Message boards :
News :
MLC@Home inconsistent work generation for the next few months
(Message 1502)
Posted 17 Apr 2022 by pianoman [MLC@Home Admin] Post: TL;DR: MLC is entering an analysis phase, and new work will be bursty and inconsistent for at least the next few months. Please adjust your BOINC contributions accordingly!

Over the past several months we (MLC@Home admins) have turned our attention to the analysis of the results our volunteers have contributed. With the completion of DS1/2/3, and the partial results of DS4, we're really excited to polish up some papers and publish some results. (Along those lines, look for an announcement of the availability of all 5 tiers of DS3 datasets later today or tomorrow; we just need to set up a torrent for the 1.3TB DS3-10000 dataset.)

In addition, DS4 results are larger than we anticipated and also don't require as much computation time to complete. So when we release DS4 WUs, our volunteers churn through them in only a few days' time while also filling up the disk space on the server. This is a great problem to have, but it also forces us to be judicious about sending out work, to make sure we've archived enough old results off the server to handle the influx of new results.

The upshot of all this is that we don't have the resources to both do the analysis and prepare/maintain consistent, meaningful work units. So rather than just keep pushing out work that'll keep WUs flowing but has less scientific meaning (such as creating more DS3 networks just to create a bigger dataset), we'd rather just announce that MLC@Home WU availability will be inconsistent for at least the next several months. We expect to release batches of DS4 WUs every few weeks, but it won't be the constant work availability you're used to from the project over the past two years.

We realize this will cause us to lose some volunteers, but that's why we're trying to be upfront about this now, so that everyone can decide if and how to allocate their BOINC contributions accordingly. 
We hope that you'll consider leaving MLC in your projects list and helping us crunch WUs when we have them, but we understand if you choose not to. A few key things to note:
|
|
4)
Message boards :
News :
Spring 2022 MLC Project Update: DS2 Complete edition!
(Message 1492)
Posted 3 Apr 2022 by pianoman [MLC@Home Admin] Post: It's been a while since we've posted an update, but that doesn't mean the project has been idle! If you've been following on our Discord server you'll know we've continued to make progress, and thanks to our volunteers, today is a day of celebration! Here's a summary of the current project status: Summary
|
|
5)
Message boards :
News :
Maintenance / Downtime 3/27/22
(Message 1488)
Posted 27 Mar 2022 by pianoman [MLC@Home Admin] Post: ..and we're back! Expect a full news update soon, but I'm packaging up the datasets for release, and DS4 is out on the main CPU queue. |
|
6)
Message boards :
News :
Maintenance / Downtime 3/27/22
(Message 1487)
Posted 27 Mar 2022 by pianoman [MLC@Home Admin] Post: MLC's server will have a brief period of downtime today starting at approximately 3:30pm UTC to add more storage and prepare the main queue for DS4 workloads. The downtime shouldn't be more than an hour or two. Thanks again for all your support. -- MLC Admins |
|
7)
Questions and Answers :
Issue Discussion :
Restriction based on GLIBC version?
(Message 1455)
Posted 19 Dec 2021 by pianoman [MLC@Home Admin] Post: Actually, the new CPU client should work fine on CentOS 7, because it's statically linked. GPU clients are still an issue though since they are not. |
|
8)
Message boards :
News :
[TMIM Notes] July 1 2021 --- Celebrating 1 year of MLC@Home!
(Message 1444)
Posted 9 Dec 2021 by pianoman [MLC@Home Admin] Post: It's ticking me off quite a bit. Somehow these users/bots are getting credit and then posting.. that's... a lot of effort to post on a random small science-based website. Either that or they're hacking legit accounts and posting. But the credits are all pretty small. I wonder if there's a way to lock old threads automatically..... |
|
9)
Message boards :
News :
Testing updates to backend services again
(Message 1420)
Posted 14 Nov 2021 by pianoman [MLC@Home Admin] Post: New backend scripts are live now. We've checked the logs and the outgoing WUs and they appear to be configured correctly, avoiding the error from last time. However, please do let us know if you experience any more unusual errors, especially with malformed WUs leading to client errors! We'll be monitoring the forums, Discord, and server logs closely over the next few days. During the transition, we had one brief 10-second period of errors that led to a small handful (about 5?) of result validation failures. We caught those immediately and shut down the server to fix it, so hopefully no more made it through. We appear to be working smoothly now. Thanks again for supporting the project! |
|
10)
Message boards :
News :
Testing updates to backend services again
(Message 1419)
Posted 14 Nov 2021 by pianoman [MLC@Home Admin] Post: All, we're going to be testing some new backend server updates again this weekend. Last time this led to some instability, but we've learned quite a bit from that and have taken steps to make sure it doesn't happen again, with an easy and quick revert path if necessary. There may be some small interruptions, but nothing serious. There is also nothing you need to do on the client side; this is all on the backend. Wish us luck, and we'll be watching the results like a hawk for any new issues. |
|
11)
Message boards :
News :
Lab network maintenance Nov 9
(Message 1416)
Posted 9 Nov 2021 by pianoman [MLC@Home Admin] Post: Overnight Nov 8 to 9 we'll be performing some brief preventative network maintenance. The server and site will be inaccessible for an hour or two, but shouldn't last longer than that. No action is required by the end user, and we'll be back soon! |
|
12)
Questions and Answers :
Issue Discussion :
GPU tasks are freezing up
(Message 1412)
Posted 8 Nov 2021 by pianoman [MLC@Home Admin] Post: We're aware of some issues with Windows 11 performance when using NVIDIA GPUs. We don't have any Windows 11 systems to test against at the moment. Given the known issues with the Windows 11 scheduler and Ryzen (MS really screwed it up), and the extremely complex nature of NVIDIA CUDA and how it interacts with the host operating system, I'm inclined to wait until NVIDIA and AMD get some time to clean up MS's mess before we start looking at things we can do on our side, since there's really not much we do other than make calls for PyTorch to use the GPU instead of the CPU (this is not hand-tuned CUDA). I know it's not a satisfying answer, but given that we haven't even released an updated v0.90 GPU client for Windows yet (it's coming; I was trying to get the Linux version out first, but that's giving me more issues than planned), figuring out Windows 11 performance issues is low on the priority list. |
|
13)
Questions and Answers :
Issue Discussion :
Invalid tasks
(Message 1408)
Posted 26 Oct 2021 by pianoman [MLC@Home Admin] Post: Yes, there was an issue (my fault!) on the server side, now fixed (thanks to theOretical).

If you're curious, the issue was that as part of the validation process we compute the loss of the trained network against a set of "test" data. Due to the somewhat awkward process of BOINC validation, we need this loss number in multiple places. Originally we re-computed this number up to three times as part of the validation process, but this is a CPU-intensive process, so as the project grew I added some code to compute the value once and store it in an on-disk cache. Then when we need it later, we just read it from the cache instead of re-computing it: win-win.

We store this cache in a directory under `/tmp`. I was very careful to make sure that if we try to read the value from the cache and it isn't there (which can happen for many reasons), it isn't a fatal error; we just recompute it. And since the server is very reliable, the fact that we're still using `/tmp`, which gets wiped on each reboot, wasn't an issue. It's been working great for many months.

Well, yesterday, the server hiccupped, needing a reboot. The cache directory got deleted. And while failing to read an entry from the cache wasn't a fatal error, apparently I never tested what happened if adding a new entry to the cache failed. So the validation process started working, computed the value, and tried to add it to the cache, and since the cache directory wasn't there, it failed and threw an exception, which is interpreted by BOINC as a validation failure. This was the case for about 10-ish hours yesterday, and a huge miss on our part.

The first fix is to just create the cache directory that got deleted on the reboot; that gets everything going again. We can also move the cache to a more permanent directory (a cron job periodically cleans out old entries, so that's not an issue). The real issue is that failure to add an entry to the cache should never cause an exception. 
We're going to do all three of those things to make sure this doesn't happen again. And shame on me for failing to unit test the cache properly.

Furthermore, while I get alerts whenever catastrophic problems happen with the server (assuming it's not a hardware hang, like yesterday, where it can't send me an email), this failure was buried in the logs and didn't bubble up to that level, and I never would have noticed unless I'd been actively checking the logs. I'm looking for a more robust log-monitoring solution that will get me alerts more quickly when problems like this occur. |
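The cache behaviour described in the post above (a missing or unreadable entry is a cache miss, and a failed write is never fatal) can be illustrated with a minimal Python sketch. This is not the project's actual server code: the `LossCache` class, the `cached_loss` helper, the `.loss` file suffix, and the key format are all hypothetical names chosen for illustration, and keys are assumed to be safe file names.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, Optional


class LossCache:
    """Best-effort on-disk cache: any I/O failure degrades to a cache miss."""

    def __init__(self, cache_dir: Path) -> None:
        self.cache_dir = Path(cache_dir)

    def _path(self, key: str) -> Path:
        # Hypothetical layout: one small file per cached value.
        return self.cache_dir / f"{key}.loss"

    def get(self, key: str) -> Optional[float]:
        """Return the cached value, or None if it is missing or unreadable."""
        try:
            return float(self._path(key).read_text())
        except (OSError, ValueError):
            return None

    def put(self, key: str, value: float) -> bool:
        """Try to store a value; report failure instead of raising."""
        try:
            # Recreate the directory if it vanished (e.g. /tmp wiped on reboot).
            self.cache_dir.mkdir(parents=True, exist_ok=True)
            fd, tmp_name = tempfile.mkstemp(dir=self.cache_dir)
            with os.fdopen(fd, "w") as handle:
                handle.write(repr(value))
            # Rename into place so readers never see a half-written file.
            os.replace(tmp_name, self._path(key))
            return True
        except OSError:
            return False


def cached_loss(cache: LossCache, key: str, compute: Callable[[], float]) -> float:
    """Read the loss from the cache, recomputing it on any miss."""
    value = cache.get(key)
    if value is None:
        value = compute()
        cache.put(key, value)  # result ignored on purpose: caching is optional
    return value
```

With this shape, a unit test can delete the cache directory between two calls and check that the second call still returns the right value, which is the case the post says was never tested.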
|
14)
Questions and Answers :
Windows :
3 times longer runtimes under Win 11?
(Message 1400)
Posted 24 Oct 2021 by pianoman [MLC@Home Admin] Post: Thanks for reporting. We haven't tested on Windows 11, much less CUDA on Windows 11. We'll add it to the list, but I suspect this is a Windows 11/CUDA driver maturity issue. Have you seen similar slowdowns with other projects' WUs? |
|
15)
Message boards :
News :
[TMIM Notes] Oct 23 2021 posted
(Message 1398)
Posted 24 Oct 2021 by pianoman [MLC@Home Admin] Post: MLC@Home has posted the Oct 23 2021 edition of its monthly "This Month In MLC@Home" newsletter! A long overdue update including DS2 slowly working through its backlog, backend updates for maintainability that went a little awry, DS4 backend work, and DS3 analysis. Read the update and join the discussion here. |
|
16)
Message boards :
News :
[TMIM Notes] Oct 23 2021
(Message 1397)
Posted 24 Oct 2021 by pianoman [MLC@Home Admin] Post:

This Month in MLC@Home
Notes for Oct 23 2021
A monthly(-ish) summary of news and notes for MLC@Home

Summary

It's been a while since the last update! But there's been a lot going on, from DS2 slowly working through its backlog, to backend updates for maintainability that went a little awry, to DS4 backend work and DS3 analysis.

First, two weeks ago we had a mishap with the WU generation, and "continuation" WUs were sent with the wrong parameters, leading to computation failures. It took us a few days to fix and clear up, but no data was lost. We've been updating and modernizing our backend scripts to consolidate them and make them less fragile (this is a good thing for maintainability!), and one of our updates went awry. Thank you for your patience while we worked it out. We've had a pretty good track record until now, so I hope you'll continue to support us in the future despite this setback. We're looking for new ways to test these scripts further to avoid similar issues in the future.

The majority of the work over the past few months has been analyzing DS3 data. We've been updating the existing paper with the full DS3 analysis. It is disk, bandwidth, and memory intensive on our backend, and sadly isn't quite as easy to break up into WUs to distribute over BOINC. In fact, just tar/gzip-ing the entire DS3 dataset (2.6TB) takes over 24 hours, since it's over 4 million small files. We will be making all of DS3 available as a torrent soon. I've been posting updates on this on our Discord server if you're interested.

Since we've been focused on DS3 and modernizing our backend/management scripts, DS4 has suffered. I wish I could say that DS4 WUs are flowing, but they aren't yet. Everything is in place; we just need to start the tests.

Thanks again for your continued support, and know that while these updates have been coming slower, that doesn't mean work isn't happening behind the scenes!

Other News
|
|
17)
Message boards :
News :
Current WU issues, working on a fix
(Message 1390)
Posted 15 Oct 2021 by pianoman [MLC@Home Admin] Post: New, corrected WUs are starting to trickle out again on the CPU queue. We're being cautious and monitoring them closely before dumping another big batch, but it looks like we're back to normal. |
|
18)
Questions and Answers :
Issue Discussion :
195 (0x000000C3) EXIT_CHILD_FAILED error
(Message 1389)
Posted 15 Oct 2021 by pianoman [MLC@Home Admin] Post: FYI, this is the issue discussed in the news story.. THANK YOU for posting about it. It's an issue on our backend, not yours. We've been trying to consolidate/clean up a hodgepodge of about 20+ scripts that create, validate, and appraise WUs for all three queues and all 4 datasets into a small few. We've been running some updated, consolidated scripts on mldstest for a while and flipped the switch on the main queue, and somehow broke BOTH the mlds queue and the test queue, in that they were both sending out corrupt WUs. It looks like part of the corruption is not properly specifying where the hdf5 dataset file is located, so the client can't find it. We've managed to cancel all outstanding WUs on both queues, and are working on a fix. Please stay tuned over the next 24 hours while we clean up our mess and get back online. I can easily revert back to the old code, but it would be better to track down where the new code went wrong for the overall health and maintenance of the project, so we're trying that first. |
|
19)
Message boards :
News :
Current WU issues, working on a fix
(Message 1388)
Posted 14 Oct 2021 by pianoman [MLC@Home Admin] Post: A short note that we're aware of the issue with WUs coming from the CPU and TEST work queues (the GPU queue appears fine at the moment). This is due to a server-side issue related to some cleanup and upgrades I've been doing behind the scenes that appear to have gone haywire. Since everything initially seemed to be working, I didn't catch it immediately, which compounded the issues. This is unacceptable and I apologize. While this had been tested, this failure mode was unforeseen. You rely on us to keep things running smoothly, and I failed you. Over the next 24 hours we'll be sending out cancellations for the corrupted WUs, and may stop/start the service a few times while we try to clean things up. Please bear with us and thanks for your patience. I stress: no data was lost, and the nature of the failure is to fail fast on the client, so there are few to no wasted compute cycles. Thanks again, and we'll do better in the future. |
|
20)
Questions and Answers :
Issue Discussion :
Rogue batch ?
(Message 1376)
Posted 1 Oct 2021 by pianoman [MLC@Home Admin] Post: Aha. Out-of-memory error on MongoDB: it ran out of the 32G in the server. Working on a fix now. The problem was that the validator script connects to another database (a MongoDB instance; the main BOINC database is MySQL, also large and also running on the same server), and it interpreted a failed connection to that database as a failed validation. So, multiple things to fix. |
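One way to avoid the failure mode in the post above, where a database outage is reported as an invalid result, is to give the validator a third outcome for infrastructure problems. The Python sketch below is only an illustration under stated assumptions: `Outcome`, `BackendUnavailable`, `validate`, and the `fetch_loss` callback are hypothetical names, not part of BOINC or of the project's validator, and the loss threshold check stands in for whatever real validity test is used.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    VALID = "valid"
    INVALID = "invalid"
    RETRY = "retry"  # infrastructure problem: defer the result and try again later


class BackendUnavailable(Exception):
    """Raised by the (hypothetical) storage layer when the database is unreachable."""


def validate(result_id: str,
             fetch_loss: Callable[[str], float],
             max_loss: float) -> Outcome:
    """Classify a result without blaming the volunteer for server-side outages."""
    try:
        loss = fetch_loss(result_id)
    except (BackendUnavailable, ConnectionError):
        # The database being down says nothing about the result itself.
        return Outcome.RETRY
    return Outcome.VALID if loss <= max_loss else Outcome.INVALID
```

The design choice is that only evidence about the result itself can produce `INVALID`; anything that prevents the check from running maps to `RETRY`, so a short outage delays credit instead of denying it.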
©2022 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)