Joined: 30 Jun 20
This Month in MLC@Home
Notes for July 1 2021
A monthly summary of news and notes for MLC@Home
Happy first birthday to MLC@Home! This project went live on July 1, 2020, and caught on pretty quickly in the BOINC community. We've remained focused on our goal, which is breaking open the black box of neural networks to explain why they make the choices they do. This matters more and more as machine learning permeates our everyday life, from autonomous cars to banking decisions and medical diagnoses. We need research to understand how to keep bias out of these systems.
We are also the first, and to date only, public machine-learning-focused BOINC project. This means that while we could leverage the BOINC framework for job management, we had to build most of the ML client infrastructure from the ground up. This hasn't always been smooth, but we've accomplished so much in the past year regardless.
In the past year, we have:
Joined: 9 Jul 20
Thanks as always for the thorough updates! Most highly appreciated :)
Glad to finally call myself a long-term supporter of MLC@H and hope to add many years to that! (Another idea for badges: base them on years of participation, as in MW@H.) And what a year it has been!
First of all, congrats on building your first BOINC project and shedding a lot of sweat while developing and maintaining the application code. Congrats as well on your first paper generated by leveraging data from this project. And kudos to you for having sustained the very engagement and responsiveness with us volunteers that you promised from the very start. I am very impressed with the architecture that you have built here on MLC@H, which supports so many operating systems and even various apps that allow GPU computing!
I am extremely pleased with the overall progress, in particular ...
- Progress on the larger data sets
- Introduction of a beta app as main test channel
- Badges :)
- Development progress on a graphics application showing the training progress (Thx to tank busters)
- Academic publication using our results
- First results surrounding weight-space clustering of the trained networks that validates the initial theory/assumption
- Future pipeline of additional sub-projects + thoughts on BOINC-wide promotion
- Open-access of results and open-access collaboration with app development via GitLab
2. There is a long list of issues on GitLab that I would love to contribute to myself, but I am rather limited in helping with them at the moment due to my lack of programming skills, so I would be interested in a prioritization of that list. What is the most important aspect you are currently working on? Onboarding of DS4? App development? GPU app optimisation? Validation technique?
3. Do you plan to extend the data set sizes even further beyond the current dimension? Is this sensible?
4. How can we help to extend the reach of this project?
5. Discord vs. message boards: Should we use Discord rather than the message boards here for prolonged discussions?
6. What is the current backend server architecture you have and how long do you think it will be sufficient? (assuming a sudden growth of our volunteer base within the next 6 months) You could f.ex. list it on the server status page
7. Who will decide on future research project collaborations? Is there a board/committee that will vote, discuss and organize future collabs within your faculty/university? Will it only be you?
8. Assuming future projects launch that won't be mlds-based, will you roll out new applications that we can opt in to or out of?
9. Would you consider having research conducted on MLC@H that targets ML models on real-world data (f.ex. diagnostic ML models for cancer on health-related data, accuracy estimation of autopilots/traffic sign detection, etc.), or would you rather limit future research collabs to fundamental research into ML?
10. Have you already conceptualized an onboarding plan and T&As for potential future projects? (scientists might be unfamiliar with BOINC related peculiarities and requirements)
11. Do you plan on reaching out to other universities/faculties that conduct research in the field of ML/DL/AI proactively? Or do you plan on advertising MLC@H and wait for research teams to approach you?
Joined: 17 Jul 20
We already use Discord.
Joined: 30 Jun 20
Tons of great questions as usual. Thanks for sticking around!
I've been working very hard on wrapping up what I need for my thesis, and this is part of it, but I also need to do a lot of writing as well, so I've been splitting my free time a bit more lately. Note, I see this project continuing beyond completing my thesis, so don't worry, we're not going anywhere.
To answer your questions:
1. For DS3 I'm planning to do a torrent, or if researchers approach me directly, I can give them one-time access to download it directly from my home systems. DS1/DS2 will likely remain small enough to download directly.
2. At the moment the priority is the updated client and DS4, in the order I mentioned in the other forum post (CPU first, then GPU). As for other features, the highest priority is getting NaN recovery working. That requires someone to change the way the training class is coded in C++ to both swap out the current model and create a new optimizer object, which isn't possible from the main loop the way I coded it up originally. It's not rocket science, but would take some careful thinking. DS4 support is already in the new client, so all that is left there is to update the WU generation/validation/assimilation scripts to handle it, which shouldn't take long. That's not a complete list, but it's a start.
3. I don't think we need more examples of the same things we have. I think DS4 with CNNs will be a big help. For DS5, I'm thinking of maybe varying the shape and size parameters for DS1/2/3/4. Right now, we only vary the weights. This made the analysis easy, and showed the good results we already have. It would be nice if we could show the same clustering even if we vary the shape of the network (different numbers of hidden nodes, different numbers of layers, etc.). DS5 may do that, but I haven't decided yet.
4. I've decided I'm going to swing for the fences, and beef up the paper to submit to one of the big ML conferences, AAAI. It's a stretch and hyper-competitive, but even if it's only a poster, that should get some people's interest. The poster would be more about MLC@Home itself as a research platform. The paper mentions it too.
5. As for Discord, I find I'm personally on Discord a lot, and checking this forum a lot less. I've hooked the forum's RSS feed into Discord, but even then I seem to work better on Discord. Still, for discussions that need to exist long term, the forum is probably better. But for tech support and actually getting my attention faster, Discord is a better bet.
6. The server is fine computation-wise. Since the upgrade to a 6c/12t processor (thanks Ryzen!) it's barely breaking a sweat. Disk space is more of a concern, but I can mitigate that more easily by moving the DS3 archive off of the server and onto a torrent. Now, network bandwidth hasn't been a huge issue, but I am still running the system off my home network, as my university is just now allowing people back onto campus post pandemic. That means I can move the server onto their network. However, it also means if I need to access it, it'll be 30 minutes away by car instead of under my desk. So, I think we're good for now.
7. At the moment, it's only me. Once it grows beyond me, I'd like to set up a small governance committee, but for now all you have is my word on that. I will say that I will keep to the tenets laid out on our homepage, that the resulting data needs to be made publicly available as soon as possible.
8. Of course, any new applications will be separate BOINC application queues, so you can opt in/out as you wish.
9. It's difficult, but I'm open to anything. I've learned early on that if you want something you create to thrive, you have to be open to others' ideas on how to use it. In general, I feel ML on the BOINC platform is more suited to trying out a lot of small things in parallel, rather than trying to build one bigger network to target a single problem. That said, one can look at trying a whole bunch of parameters on a network at once to see which ones perform better, etc. Another problem with real-world data is privacy. First, dealing with medical data is potentially a whole can of worms in the US, as the data could be personally identifying, so hosting the data on something like MLC@Home could potentially open us up to some liability there. On the other end of the spectrum, there are so many businesses looking to make money on ML that they tend to jealously guard their data. So getting real-world data is often a problem. Keeping things at the fundamental ML level tends to avoid all those problems and I think has a broader impact on the field as a whole. But like I said, we'd absolutely listen to any honest researcher with an idea!
10. Heck, I'm still learning BOINC peculiarities, so while I've had some informal discussions with other researchers and mentioned some of the caveats, I don't have anything formal.
11. So the funny thing at the moment is that MLC@Home is completely unfunded. The new server was purchased with some grant money, but that's it. I'm able to work on it because I'm essentially self-funded and work another full-time job (I'm a part-time student). The main issue I've had talking with other researchers is that they need funding to continue as a grad student, and working on MLC won't get them any. I've been looking at potential funding opportunities and collaborations, but one particular one I had high hopes for didn't pan out. So at the moment it's advertise and hope to get noticed. Attending conferences like AAAI might help.
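To illustrate the NaN recovery idea from answer 2 (swap the current model for a known-good one and create a new optimizer object), here is a minimal toy sketch in Python. It is not the MLC@Home client code, which is written in C++. The class `MomentumSGD`, the functions `train_with_nan_recovery` and `quadratic_grad_with_fault`, the single-weight "model", and the checkpoint interval are all hypothetical names and choices made only for this example.

```python
import math


class MomentumSGD:
    """Toy optimizer with internal state (velocity), standing in for a real one."""

    def __init__(self, lr=0.1, momentum=0.5):
        self.lr = lr
        self.momentum = momentum
        self.velocity = 0.0

    def step(self, weight, grad):
        # A NaN gradient poisons the velocity, so the optimizer itself
        # becomes unusable after a bad step, not just the weight.
        self.velocity = self.momentum * self.velocity - self.lr * grad
        return weight + self.velocity


def train_with_nan_recovery(steps, grad_fn, make_optimizer, checkpoint_every=5):
    """Train a single weight; on a non-finite result, roll back and rebuild the optimizer."""
    weight = 0.0
    optimizer = make_optimizer()
    good_weight = weight  # last checkpointed finite weight
    recoveries = 0
    for step in range(steps):
        weight = optimizer.step(weight, grad_fn(step, weight))
        if not math.isfinite(weight):
            weight = good_weight          # swap in the last good model
            optimizer = make_optimizer()  # discard the poisoned optimizer state
            recoveries += 1
            continue
        if step % checkpoint_every == 0:
            good_weight = weight
    return weight, recoveries


def quadratic_grad_with_fault(step, weight):
    """Gradient of (weight - 3)^2, with one NaN injected at step 7 (assumed fault)."""
    if step == 7:
        return float("nan")
    return 2.0 * (weight - 3.0)


if __name__ == "__main__":
    weight, recoveries = train_with_nan_recovery(
        100, quadratic_grad_with_fault, MomentumSGD
    )
    print(f"final weight = {weight:.4f}, recoveries = {recoveries}")
```

The point of the sketch is that both pieces have to be replaced together: restoring the weights alone would leave a NaN in the optimizer's velocity, and the next step would go bad again.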
Joined: 9 Jul 20
Wow, that is a lot of great information and took a while to fully work through! But first of all thanks for giving a lot of thought to my questions and your overall outstanding responsiveness. I'll just go through your answers one by one.
1. Makes sense. Torrents will likely be the (only) way to go for the larger data sets.
2. Great priorities. DS4 support is great as this will obviously extend the future data sets and thus possible prospective research questions that can be addressed later on. Hope that you'll find some talented devs that can support you on this quest.
3. Same thoughts here. Varying more parameters than just weights (one at a time or various) would definitely be an interesting setting for future research. I guess that clustering would be apparent earlier on due to larger variation in the output (trained networks).
4. Awesome - Fingers crossed!
5. That was my impression too, and what motivated this question. I'll keep that in mind.
6. I saw that your personal machine computing for MLC got upgraded to a 5950X 16c/32t monster :) Guess the 3600 just got retired for that purpose? :) Evidently, this setup works great, but it might become unmanageable (heat, electricity cost, noise) if the project's volunteer base scales up soon!
7. Kudos to you. Running a BOINC project is one thing, but setting one up all by yourself and custom-building the client software is a whole other level.
9. Naive of me to totally forget about privacy issues here. And sure, most cutting-edge research with practical applications is conducted at private corporations, and their IP is proprietary and protected. My primary interest in MLC has always been in fundamental research with a much broader impact than training/assessment of specific ML models. So many interesting questions have yet to be researched before I would feel we truly "comprehend" AI and ML. Only that will make sure that these technologies are safe to use, benefit the many, and will not be malicious.
10. Fair. I am still navigating the easy, volunteer side of BOINC after more than a year :)
11. Would a donation system be sensible/helpful in that case? I would gladly donate if that helped the project thrive and got you the support you need to lessen the one-man-show workload you currently shoulder.
As always, I highly appreciate you taking the time to curate and care for the MLC community here as well as on other channels. All the best to you and to the project's plan for the next 6 months!