Posts by pianoman [MLC@Home Admin]

1) Message boards : News : DS3 Dataset of 1 million trained neural networks is available for download! (Message 1510)
Posted 18 days ago by pianoman [MLC@Home Admin]
Post:
Hello volunteers!

Just a quick note that Dataset 3 is finally posted for download at our site https://www.mlcathome.org/mlds.html! Dataset 3 was completed a few months ago, but due to its massive size (2.25TB in all) and our emphasis on our own analysis over packaging the results for download, it's taken us until now to make it available.

As a reminder, DS3 contains over 1 million trained neural networks (10,000/ea modelling 100 different automata), with a goal of analyzing how networks of the same size and shape encode similar-but-not-exact training data. Expect an updated paper soon!

We've always held that if the public is doing work for this project, then the results of that work should be made available back to the public to further science. As of right now, all of DS1, DS2, and DS3 are available to the public under a CC-BY-SA 4.0 license. We will do the same with DS4 when it completes.

DS3 is released via torrents due to its size. A few volunteers have already downloaded and seeded these (very large) files, so hopefully new downloads should be a bit quicker than if we served them from our single server. The torrent files are listed on our website, and we're using the Academic Torrents tracker (see: https://academictorrents.com/browse.php?search=mlds).

Thanks again to all our volunteers! DS3 is quite an accomplishment!

-- The MLC@Home Admin(s)
Homepage: https://www.mlcathome.org/
Discord invite: https://discord.gg/BdE4PGpX2y
Twitter: @MLCHome2
2) Message boards : News : MLC@Home inconsistent work generation for the next few months (Message 1502)
Posted 17 Apr 2022 by pianoman [MLC@Home Admin]
Post:
TL;DR: MLC is entering an analysis phase, and new work will be bursty and inconsistent for at least the next few months. Please adjust your BOINC contributions accordingly!

Over the past several months we (MLC@Home admins) have turned our attention to the analysis of the results our volunteers have contributed. With the completion of DS1/2/3, and the partial results of DS4, we're really excited to polish up some papers and publish some results. (Along those lines, look for an announcement of availability of all 5 tiers of DS3 datasets later today or tomorrow; we just need to set up a torrent for the 1.3TB DS3-10000 dataset.)

In addition, DS4 results are larger than we anticipated and also don't require as much computation time to complete. So when we release DS4 WUs, our volunteers churn through them in only a few days' time while also filling up the disk space on the server. This is a great problem to have, but it also forces us to be judicious about sending out work, to make sure we've archived enough old results off the server to handle the influx of new results.

The upshot of all this is that we don't have the resources to both do the analysis and prepare/maintain consistent meaningful work units. So rather than just keep pushing out work that'll keep WUs flowing but has less scientific meaning (such as creating more DS3 networks just to create a bigger dataset), we'd rather just announce that MLC@Home work will be inconsistent for at least the next several months. We expect to release batches of DS4 WUs every few weeks, but it won't be the constant work availability you're used to from the project over the past two years.

We realize this will cause us to lose some volunteers, but that's why we're trying to be upfront about this now so that everyone can decide if and how to allocate their BOINC contributions accordingly. We hope that you'll consider leaving MLC in your projects list and help us crunch WUs when we have them, but we understand if you choose not to.

A few key things to note:

  • Are you shutting down? No, not at this time. Beyond the stated goals for DS4 above, we have some ideas where we would like to go in the future. But the main admin needs to spend time finishing up their thesis, so those plans will be on hold until after that is complete. If those plans don't come to fruition, then we will be up front here and actively shut down the project. We promise we won't just abandon it with no notice!
  • What about all the work the volunteers have done? DS1/DS2 datasets are all available for download at https://www.mlcathome.org/mlds.html, and DS3 will be soon (via torrent). As DS4 completes we promise to make those available too at the same place.



We hope this announcement reassures you that we're trying to be good stewards of the trust and resources you provide us as BOINC volunteers. We're really excited by the science and humbled by your support since we started in July 2020, and we hope you understand as we move into the next phases of our work. As things change we'll make more announcements here and on our Discord.

-- The MLC@Home Admin(s)

3) Message boards : News : Spring 2022 MLC Project Update: DS2 Complete edition! (Message 1492)
Posted 3 Apr 2022 by pianoman [MLC@Home Admin]
Post:
It's been a while since we've posted an update, but that doesn't mean the project has been idle! If you've been following on our Discord server you'll know we've continued to make progress, and thanks to our volunteers, today is a day of celebration!

Here's a summary of the current project status:

Summary


  • DS2 Computation is complete! As of 1 Apr 2022, we finally crossed the 10,000 trained networks threshold for ParityModified, completing our computation for DS2. This has taken a long time, and the complete dataset should help researchers understand how neural networks encode data.
  • All DS1/DS2 tarballs are available for download from https://www.mlcathome.org/mlds.html. This is your work, and now it's free for you or anyone else to study and build upon!
  • DS3 tarballs still pending. Computation for DS3 completed last year, but we have not uploaded the full datasets to the website for download yet. We've been focused on analysis, and the sheer size of the dataset makes bundling a time-consuming, headache-prone task. We'll post here when they're available.
  • DS4 WUs are out! DS4 WUs are out for our CPU client, and progress has started there. DS4 is much more complicated to manage on the backend because it has multiple training sets that have different requirements, but we're pushing new WUs out as fast as we can.
  • We're pausing GPU WUs: It saddens us, but we have not been successful updating our GPU clients to support DS4 WUs. And as we shift our focus to analyzing the results we do have, we have less and less time to focus on client development beyond the CPU client. When the current GPU queue runs dry, we won't be sending out more GPU work until we have time to re-prioritize porting a GPU client again. Maintaining a GPU client has taken much more time and effort than anticipated, and unless we can get outside help it will remain a low priority for the time being. We truly appreciate our GPU volunteers, but at the moment we don't have any work to send, and encourage you to turn your hardware to support other worthwhile projects that can support your hardware!
  • We're exploring porting the CPU client to Rust. In addition, our reliance on PyTorch has become more of a hindrance to portability than an asset. While the neural network ecosystem in Rust is not nearly as robust, the ability of Rust to compile a static binary targeting a large number of architectures and operating systems is very appealing for portability. As such, we're looking to port our MLC CPU client to pure Rust, with an option to support GPUs from the same code base in the future. If you know Rust and are interested, please contact the MLC Admins.



Please note that there are still DS2 WUs in the work queue; we ask that you please continue to crunch them, as it's always better to have more samples as spares. However, we don't plan to queue up any more DS1/2/3 WUs, and all new WUs added will be DS4 or later. This applies to the GPU queue as well.

We're really excited for DS4 WUs going forward, and they should help demonstrate our theory that similar networks cluster in parameter space in feed-forward and CNN-based networks as well as in the RNNs used in DS1/2/3. Beyond DS4, we have some ideas but nothing concrete at the moment. We'll keep you updated as we move forward.

Thanks again to all our volunteers for supporting the project and helping science.

-- The MLC@Home Admin(s)

4) Message boards : News : Maintenance / Downtime 3/27/22 (Message 1488)
Posted 27 Mar 2022 by pianoman [MLC@Home Admin]
Post:
..and we're back! Expect a full news update soon, but I'm packaging up the datasets for release, and DS4 is out on the main CPU queue.
5) Message boards : News : Maintenance / Downtime 3/27/22 (Message 1487)
Posted 27 Mar 2022 by pianoman [MLC@Home Admin]
Post:
MLC's server will have a brief period of downtime today starting at approximately 3:30pm UTC to add more storage and prepare the main queue for DS4 workloads. The downtime shouldn't be more than an hour or two.

Thanks again for all your support.
-- MLC Admins
6) Questions and Answers : Issue Discussion : Restriction based on GLIBC version? (Message 1455)
Posted 19 Dec 2021 by pianoman [MLC@Home Admin]
Post:
Actually, the new CPU client should work fine on CentOS 7, because it's statically linked. GPU clients are still an issue though since they are not.
7) Message boards : News : [TMIM Notes] July 1 2021 --- Celebrating 1 year of MLC@Home! (Message 1444)
Posted 9 Dec 2021 by pianoman [MLC@Home Admin]
Post:
It's ticking me off quite a bit. Somehow these users/bots are getting credit and then posting.. that's... a lot of effort to post on a random small science-based website.

Either that or they're hacking legit accounts and posting. But the credits are all pretty small.

I wonder if there's a way to lock old threads automatically.....
8) Message boards : News : Testing updates to backend services again (Message 1420)
Posted 14 Nov 2021 by pianoman [MLC@Home Admin]
Post:
New backend scripts are live now. We've checked the logs and the outgoing WUs and they appear to be configured correctly, avoiding the error from last time. However, please do let us know if you experience any more unusual errors, especially with malformed WUs leading to client errors! We'll be monitoring the forums, discord, and server logs closely over the next few days.

During the transition, we had one brief 10-second period of errors that led to a small handful (about 5?) of result validation failures. We caught those immediately and shut down the server to fix the problem, so hopefully no more made it through. We appear to be working smoothly now.

Thanks again for supporting the project!
9) Message boards : News : Testing updates to backend services again (Message 1419)
Posted 14 Nov 2021 by pianoman [MLC@Home Admin]
Post:
All,

We're going to be testing some new backend server updates again this weekend. Last time this led to some instability, but we've learned quite a bit from that and have taken steps to make sure it doesn't happen again, with an easy and quick revert path if necessary. There may be some small interruptions, but nothing serious. There is also nothing you need to do on the client side; this is all on the backend.

Wish us luck, and we'll be watching the results like a hawk for any new issues.
10) Message boards : News : Lab network maintenance Nov 9 (Message 1416)
Posted 9 Nov 2021 by pianoman [MLC@Home Admin]
Post:
Overnight Nov 8 to 9 we'll be performing some brief preventative network maintenance. The server and site will be inaccessible for an hour or two, but shouldn't last longer than that. No action is required by the end user, and we'll be back soon!
11) Questions and Answers : Issue Discussion : GPU taks are freezing up (Message 1412)
Posted 8 Nov 2021 by pianoman [MLC@Home Admin]
Post:
We're aware of some issues with Windows 11 performance when using nvidia GPUs.

We don't have any Windows 11 systems to test against at the moment. Given the known issues with the Windows 11 scheduler and Ryzen (MS really screwed it up), and the extremely complex nature of NVIDIA CUDA and how it interacts with the host operating system, I'm inclined to wait until NVIDIA and AMD get some time to clean up MS's mess before we start looking at things we can do on our side. There's really not much we do other than make calls for PyTorch to use the GPU instead of the CPU (this is not hand-tuned CUDA).

I know it's not a satisfying answer, but given that we haven't even released an updated v0.90 GPU client for Windows yet (it's coming; I was trying to get the Linux version out first, but that's giving me more issues than planned), figuring out Windows 11 performance issues is low on the priority list.
12) Questions and Answers : Issue Discussion : Invalid tasks (Message 1408)
Posted 26 Oct 2021 by pianoman [MLC@Home Admin]
Post:
Yes, there was an issue (my fault!) on the server side, now fixed (thanks to theOretical).

If you're curious, the issue was that as part of the validation process we compute the loss of the trained network against a set of "test" data. Due to the somewhat awkward process of BOINC validation, we need this loss number in multiple places. Originally we re-computed this number up to three times as part of the validation process, but this is a CPU-intensive process, so as the project grew I added some code to compute the value once and store it in an on-disk cache. Then when we need it later, we just read it from the cache instead of re-computing it: win-win. We store this cache in a directory under `/tmp`.

I was very careful to make sure that if we try to read the value from the cache and it isn't there (can be for many reasons) that it isn't a fatal error, we just recompute it. And since the server is very reliable the fact we're still using `/tmp`, which gets wiped on each reboot, wasn't an issue. It's been working great for many months.

Well, yesterday, the server hiccupped, needing a reboot. The cache directory got deleted. And while reading an entry from cache wasn't a fatal error, apparently I never tested what happened if adding a new entry to the cache failed. So the validation process started working, computed the value, and tried to add it to the cache. Since the cache directory wasn't there, the write failed and threw an exception, which BOINC interprets as a validation failure. This was the case for about 10-ish hours yesterday, and a huge miss on our part.

The first fix is to just create the cache directory that got deleted on the reboot; that gets everything going again. We can also move the cache to a more permanent directory (a cron job periodically cleans out old entries, so that's not an issue). The real issue is that failure to add an entry to the cache should never cause an exception. We're going to do all three of those things to make sure this doesn't happen again. And shame on me for failing to unit test the cache properly.
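As an illustration only (this is not the project's actual validator code), here is a minimal Python sketch of a cache where both a read miss and a failed write are non-fatal. The names `LossCache` and `cached_loss` and the one-JSON-file-per-key layout are assumptions made up for the example.

```python
import json
import os
from pathlib import Path
from typing import Callable, Optional


class LossCache:
    """Best-effort on-disk cache of loss values (hypothetical layout:
    one small JSON file per key). A cache problem should only ever cost
    a recomputation; it should never raise into the validator."""

    def __init__(self, directory: str) -> None:
        self.directory = Path(directory)

    def _path(self, key: str) -> Path:
        return self.directory / f"{key}.json"

    def get(self, key: str) -> Optional[float]:
        try:
            with open(self._path(key)) as f:
                return float(json.load(f)["loss"])
        except (OSError, ValueError, KeyError, TypeError):
            # Missing directory, missing file, or corrupt entry: a miss.
            return None

    def put(self, key: str, loss: float) -> bool:
        try:
            # Recreate the directory if it vanished (e.g. wiped on reboot).
            self.directory.mkdir(parents=True, exist_ok=True)
            tmp = self._path(key).with_suffix(".tmp")
            with open(tmp, "w") as f:
                json.dump({"loss": loss}, f)
            os.replace(tmp, self._path(key))  # rename into place
            return True
        except OSError:
            # Could not write: report it, but do not raise.
            return False


def cached_loss(cache: LossCache, key: str,
                compute: Callable[[], float]) -> float:
    """Return the cached loss for key, computing it on a miss."""
    value = cache.get(key)
    if value is None:
        value = compute()
        cache.put(key, value)  # failure here is deliberately ignored
    return value
```

The point of the design is that `put` returns a status instead of raising, so the caller can log a failed write without the validation itself being marked as failed.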

Furthermore, while I get alerts whenever catastrophic problems happen with the server (assuming it's not a hardware hang, like yesterday, where it can't send me an email), this failure was buried in the logs and didn't bubble up to that level, and I never would have noticed unless I'd been actively checking the logs. I'm looking for a more robust log-monitoring solution that will get me alerts more quickly when problems like this occur.
13) Questions and Answers : Windows : 3 times longer runtimes under Win 11? (Message 1400)
Posted 24 Oct 2021 by pianoman [MLC@Home Admin]
Post:
Thanks for reporting. We haven't tested on Windows 11, much less cuda on windows 11.

We'll add it to the list, but I suspect this is a windows 11/cuda driver maturity issue.

Have you seen similar slowdowns with other projects' WUs?
14) Message boards : News : [TMIM Notes] Oct 23 2021 posted (Message 1398)
Posted 24 Oct 2021 by pianoman [MLC@Home Admin]
Post:
MLC@Home has posted the Oct 23 2021 edition of its monthly "This Month In MLC@Home" newsletter!
A long overdue update including DS2 slowly working through its backlog, backend updates for maintainability that went a little awry, DS4 backend work, and DS3 analysis.

Read the update and join the discussion here.
15) Message boards : News : [TMIM Notes] Oct 23 2021 (Message 1397)
Posted 24 Oct 2021 by pianoman [MLC@Home Admin]
Post:
This Month in MLC@Home
Notes for Oct 23 2021
A monthly(-ish) summary of news and notes for MLC@Home

Summary
It's been a while since the last update! But there's been a lot going on. From DS2 slowly working through its backlog, to backend updates for maintainability that went a little awry, DS4 backend work, and DS3 analysis.

First, two weeks ago we had a mishap with the WU generation, and "continuation" WUs were sent with the wrong parameters leading to computation failures. It took us a few days to fix and clear up, but no data was lost. We've been updating and modernizing our backend scripts to consolidate them and make them less fragile (this is a good thing for maintainability!), and one of our updates went awry. Thank you for your patience while we worked it out. We've had a pretty good track record until now, so I hope you'll continue to support us in the future despite this setback. We're looking for new ways to test these further to avoid similar issues in the future.

The majority of the work over the past few months has been analyzing DS3 data. We've been updating the existing paper with the full DS3 analysis. It is disk, bandwidth, and memory intensive on our backend, and sadly isn't quite as easy to break up into WUs to distribute over BOINC. In fact, just tar/gzip-ing the entire DS3 dataset (2.6TB) takes over 24 hours, since it's over 4 million small files. We will be making all of DS3 available as a torrent soon. I've been posting updates on this on our Discord server if you're interested.

Since we've been focused on DS3 and modernizing our backend/management scripts, DS4 has suffered. I wish I could say that DS4 WUs are flowing but they aren't yet. Everything is in place; we just need to start the tests.

Thanks again for your continued support, and know that while these updates have been coming slower, that doesn't mean work isn't happening behind the scenes!

Other News

  • We've also spent some time trying to port the new statically-linked client to CUDA and ROCM, neither of which have worked so far. The Windows CUDA client should be a standard recompile, but the Linux clients did not compile and link as planned and need some more work.
  • We're starting to see spam in the forums. To combat this, we've disabled posting in any thread except "Issue Discussion" unless you have at least 100 credits. If you see things that look like spam in the forums, please press the report button to report it as such and we'll take care of it as soon as we can.
  • The ARM64-specific client also isn't ready, because of a strange linker error with the size of the static binary. Honestly, we're not sure how to make it work. If you know about Linux linking with large relocations on ARM64, please get in contact with us. Until then, please run the ARMHF client (32-bit) on 64-bit ARM systems.
  • Many thanks to Delta for his tireless work on modernizing our backend. We already have new database access code for both the BOINC database and our MLDS-specific MongoDB database thanks to his work, and soon we'll be consolidating 21 different scripts into a small handful.
  • Reminder: the MLC client is open source, and has an issues list at gitlab. If you're a programmer or data scientist and want to help, feel free to look over the issues and submit a pull request.



Last month's TMIM Notes: Aug 6 2021

Thanks again to all our volunteers!

-- The MLC@Home Admin(s)

16) Message boards : News : Current WU issues, working on a fix (Message 1390)
Posted 15 Oct 2021 by pianoman [MLC@Home Admin]
Post:
New, corrected WUs are starting to trickle out again on the CPU queue. We're being cautious and monitoring them closely before dumping another big batch, but it looks like we're back to normal.
17) Questions and Answers : Issue Discussion : 195 (0x000000C3) EXIT_CHILD_FAILED error (Message 1389)
Posted 15 Oct 2021 by pianoman [MLC@Home Admin]
Post:
FYI, this is the issue discussed in the news story.. THANK YOU for posting about it.

It's an issue on our backend, not yours. We've been trying to consolidate/clean up a hodgepodge of about 20+ scripts that create, validate, and appraise WUs for all three queues and all 4 datasets into a small few. We've been running some updated, consolidated scripts on mldstest for a while and flipped the switch on the main queue, and somehow broke BOTH the mlds queue and the test queue, in that they were both sending out corrupt WUs. It looks like part of the corruption is not properly specifying where the hdf5 dataset file is located, so the client can't find it.

We've managed to cancel all outstanding WUs on both queues, and are working on a fix. Please stay tuned over the next 24 hours while we clean up our mess and get back online.

I can easily revert back to the old code, but it would be better to track down where the new code went wrong for the overall health and maintenance of the project, so we're trying that first.
18) Message boards : News : Current WU issues, working on a fix (Message 1388)
Posted 14 Oct 2021 by pianoman [MLC@Home Admin]
Post:
A short note that we're aware of the issue with WUs coming from the CPU and TEST work queues (the GPU queue appears fine at the moment). This is due to a server-side issue related to some cleanup and upgrades I've been doing behind the scenes that appear to have gone haywire. Since it initially seemed to be working, I didn't catch it immediately, which compounded the issues.

This is unacceptable and I apologize. While this had been tested, this failure mode was unforeseen. You rely on us to keep things running smoothly, and I failed you.

Over the next 24 hours we'll be sending out cancellations for the corrupted WUs, and may stop/start the service a few times while we try to clean things up. Please bear with us and thanks for your patience.

I stress: no data was lost, and the nature of the failure is to fail fast on the client, so there are few to no wasted compute cycles.

Thanks again, and we'll do better in the future.
19) Questions and Answers : Issue Discussion : Rogue batch ? (Message 1376)
Posted 1 Oct 2021 by pianoman [MLC@Home Admin]
Post:
Aha. Out-of-memory error on MongoDB: it ran out of the 32G in the server. Working on a fix now.

The problem was that the validator script connects to another database (a MongoDB; the main BOINC database is MySQL, also large and also running on the same server), and it interpreted a failed connection to the DB as a failed validation. So, multiple things to fix.
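To illustrate the distinction (a sketch under assumptions, not the actual validator script), the Python below keeps "could not reach the database" separate from "the result is invalid". The `Outcome` enum, `validate_result`, the `load_reference` callback, and the tolerance check are all hypothetical names and logic invented for the example. With a real database driver, that driver's own connection-error exception types would also need to be listed in the `except` clause.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    VALID = "valid"
    INVALID = "invalid"
    RETRY = "retry"  # infrastructure problem: no verdict on the result


def validate_result(load_reference: Callable[[], float],
                    reported_loss: float,
                    tolerance: float = 1e-3) -> Outcome:
    """Compare a reported loss to a reference value fetched from storage.

    load_reference stands in for the database lookup (hypothetical).
    If the lookup itself fails, the result is not judged at all.
    """
    try:
        reference = load_reference()
    except OSError:
        # ConnectionError and TimeoutError are OSError subclasses in
        # Python 3, so a failed or timed-out connection lands here.
        return Outcome.RETRY
    if abs(reference - reported_loss) <= tolerance:
        return Outcome.VALID
    return Outcome.INVALID
```

A caller would then retry or defer anything that comes back as `Outcome.RETRY` rather than reporting it as a validation failure.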
20) Questions and Answers : Issue Discussion : Rogue batch ? (Message 1375)
Posted 1 Oct 2021 by pianoman [MLC@Home Admin]
Post:
Looking at this now.. working on it. Not sure what's going on, looks like a connection issue to the database somewhere. Tracking.



©2022 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)