[TMIM Notes] June 8 2021
Author | Message |
---|---|
Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
This Month in MLC@Home Notes for June 8, 2021

A monthly summary of news and notes for MLC@Home

Summary

Updates have come slowly these past few months, since the presentation at the BOINC workshop and the release of our initial paper, as we personally adjust (fortunately!) to the beginnings of a post-pandemic life. Work, family life, and everything else is changing for many of us, and we're still trying to figure out the new normal. Because of this, going forward these updates will be monthly: they take quite a bit of time to put together, and we've been failing to get them out weekly for a while now anyway. Here's hoping all our volunteers around the world are in an area where they, too, can start to move beyond the worst of the pandemic.

But that doesn't mean the project has been dormant! DS1, DS2, and DS3 are all nearing completion, especially DS3, which is sitting at 97%.

We've been talking about DS4 for months, and the code is ready for larger testing. Unfortunately, the test client we rolled out a few weeks ago failed miserably because of an incompatibility between PyTorch and the native BOINC API. There's a way around this, but it requires more development and a change to how WUs are specified, and we've been working on it ever since. We should be ready any day now, but it's been more involved than we thought, so we're not prepared to give a date. We do know we need it soon, though, as DS3 WUs are running out.

One benefit of the new client is that it's statically linked, which vastly simplifies deployment. The extra development time has also given us a chance to make the client more robust to NaNs, which should cut down on the number of validation errors on the system.

Another new issue: the data partition on the server is running out of space. DS3 is taking over 4 TB! Thanks to all of our volunteers! We've moved some things around to free up a little space, so everything is still working for now.
We received some new storage today and will need some downtime to get it installed. It shouldn't take more than a few minutes, so we'll just do it sometime within the next week.

So stay tuned: the next month is going to be interesting for MLC@Home as we move into DS4 and the next phase of this research.

Other News
|
Joined: 11 Jul 20 Posts: 33 Credit: 1,266,237 RAC: 0 |
> We've been talking about DS4 for months, and the code is ready for larger testing. Unfortunately, we rolled out a test client a few weeks ago that failed miserably, because of an incompatibility between PyTorch and the native BOINC API. There's a way around this, but it requires more development, and a change to how WUs are specified, and we've been working on it ever since.

It seems that, after a long time, something is moving in this direction: Implement a BOINC app library |
Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
Yes and no. That's more for projects that use well-known applications. For example, folding projects tend to use a small handful of common executables (some closed source) but do different things with them, so instead of each project bundling the executable separately, it makes sense to have a common set of known executables that projects can just point to. The ML world is a little different (less mature?) at the moment. We have common libraries (PyTorch, TensorFlow, MXNet, XGBoost, etc.), and everyone wraps their own custom executables around these libraries. So I'm not sure how helpful it'll be for us.

The issue with the failed app is that the native BOINC API uses SIGALRM (without documenting that it does), and so does PyTorch (barely documenting that it does). The app ran fine when run standalone, but once it was launched by the BOINC client, it crashed when PyTorch overwrote the SIGALRM handler and BOINC tried to call it. The fix is to rewrite the app using the BOINC wrapper, which means undoing a lot of the work I put into using the native API in the first place (which annoys me), and it means that WUs created for the old app won't be compatible with the new app. It's just turned out to take longer than planned. |
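The clash described above comes from a general property of signal handlers: a process has exactly one handler per signal, so whichever library registers last silently wins. The minimal Python sketch below (Unix-only, since it relies on `SIGALRM` and `signal.setitimer`) illustrates that with two hypothetical stand-in handlers, `boinc_like_handler` and `torch_like_handler`; it does not reproduce the actual BOINC or PyTorch internals or the crash itself. It also shows handler chaining, a generic mitigation in which a newly installed handler saves the previous one and forwards to it. That is an illustration only, not the fix MLC@Home chose, which is the BOINC wrapper.

```python
import signal
import time

# Unix-only sketch: SIGALRM and signal.setitimer are unavailable on Windows.
calls = []

def boinc_like_handler(signum, frame):
    # Hypothetical stand-in for a runtime that relies on its own SIGALRM handler.
    calls.append("boinc")

def torch_like_handler(signum, frame):
    # Hypothetical stand-in for a second library that also installs a SIGALRM handler.
    calls.append("torch")

def fire_alarm_once(timeout=2.0):
    """Arm a one-shot 50 ms real-time timer and wait until SIGALRM has been handled."""
    before = len(calls)
    signal.setitimer(signal.ITIMER_REAL, 0.05)
    deadline = time.monotonic() + timeout
    while len(calls) == before and time.monotonic() < deadline:
        time.sleep(0.01)

def install_chained(signum, handler):
    """Install `handler`, but keep calling whichever Python handler was there before."""
    previous = signal.getsignal(signum)

    def chained(sig, frame):
        handler(sig, frame)
        if callable(previous):  # SIG_DFL / SIG_IGN / None are not callable
            previous(sig, frame)

    signal.signal(signum, chained)

# 1) Last writer wins: the second registration silently replaces the first.
signal.signal(signal.SIGALRM, boinc_like_handler)
signal.signal(signal.SIGALRM, torch_like_handler)
fire_alarm_once()
clobbered = list(calls)

# 2) Chaining: the newer handler runs, then forwards to the older one.
calls.clear()
signal.signal(signal.SIGALRM, boinc_like_handler)
install_chained(signal.SIGALRM, torch_like_handler)
fire_alarm_once()
chained_calls = list(calls)

signal.signal(signal.SIGALRM, signal.SIG_DFL)  # restore the default disposition
print(clobbered)      # ['torch']
print(chained_calls)  # ['torch', 'boinc']
```

Chaining only helps when every library cooperates, which is one reason the wrapper route is attractive: the BOINC wrapper launches the science app as a separate child process, and each process keeps its own signal dispositions, so the app's handlers cannot replace the wrapper's. In C or C++, the equivalent of saving the previous handler is the `oldact` argument to POSIX `sigaction`.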
Joined: 11 Jul 20 Posts: 33 Credit: 1,266,237 RAC: 0 |
> The issue with the failed app is that the native BOINC API uses SIGALRM (without documenting that it does), and so does PyTorch (barely documenting that it does). The app ran fine when run standalone, but once it was launched by the boinc client it crashed once pytorch overwrote the SIGALRM handler and boinc tried to call it. The fix is to re-write the app using the boinc wrapper, which means undoing a lot of the work I put in to use the native API in the first place (which annoys me), and means that WUs created for the old app won't be compatible with those from the new app. And it's just turned out to take a while longer than planned.

VERY interesting explanation!!! Thank you! |
Joined: 30 Aug 20 Posts: 25 Credit: 47,025,926 RAC: 0 |
Fantastic job! I am staying 'til teal! I am already looking forward to learning more about DS4 and the next project (personally I hope it targets reproducibility). |
©2023 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)