Posts by Sagittarius Lupus

1) Questions and Answers : Issue Discussion : 195 (0x000000C3) EXIT_CHILD_FAILED error (Message 1383)
Posted 14 Oct 2021 by Sagittarius Lupus
Post:
Thanks for disclosing this. I was worried for a moment that I had done something to destabilize my worker rig, so I'm glad it's not just me. I'm happy to run test units, but when I see the red line I go investigating... It saves so much time to find information like this from project developers. Keep on breaking stuff for science! :D
2) Questions and Answers : Unix/Linux : clang-ocl: No such file or directory (Message 1193)
Posted 30 Apr 2021 by Sagittarius Lupus
Post:
I had a symlink from /opt/rocm-3.9.0/bin/clang-ocl -> /usr/bin/clang-ocl that was doing its part, but my build of ROCm didn't have the USE flags for HIP turned on. I figured the exceptions I saw straight away were just my system missing entry points for your app, so I rebuilt ROCm with those flags enabled. It turns out the science ebuilds for ROCm 3.9.0 on Gentoo (all the various HIP and native ROCm libraries, including Tensile) are quite broken, but those for 4.0.0 are just barely serviceable, so that is what I actually have installed. The package dev-util/amd-rocm-meta[opencl,hip,science] from the ROCm package overlay I linked in my previous post provides everything necessary.

After I did that, I was able to run a task in standalone mode and inside the BOINC client. But when I checked which libraries the task was using, lsof showed that it was using the ones you bundled, except for a few standard system libraries. It might be that all I did was remove a conflict.
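
For anyone who wants to repeat that kind of check, here is a minimal sketch. It is not the lsof invocation used above; it reads /proc/<pid>/maps instead, which lists the files a Linux process has mapped, and sorts shared libraries into "bundled" and "system" by path prefix. The function names and the bundle directory passed in are hypothetical, chosen only for illustration.

```python
import re
from pathlib import Path

# Matches shared-object paths such as /usr/lib64/libm.so.6 or /opt/x/libfoo.so
_SO_PATTERN = re.compile(r"\.so(\.\d+)*$")


def mapped_libraries(maps_text):
    """Return sorted, de-duplicated shared-library paths found in /proc/<pid>/maps text."""
    libs = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if path.startswith("/") and _SO_PATTERN.search(path):
                libs.add(path)
    return sorted(libs)


def split_by_origin(libs, bundle_dir):
    """Split library paths into (bundled, system) by whether they live under bundle_dir."""
    prefix = bundle_dir.rstrip("/") + "/"
    bundled = [p for p in libs if p.startswith(prefix)]
    system = [p for p in libs if not p.startswith(prefix)]
    return bundled, system


def libraries_for_pid(pid, bundle_dir):
    """Read the live mapping of a running process (Linux only; needs read permission)."""
    maps_text = Path(f"/proc/{pid}/maps").read_text()
    return split_by_origin(mapped_libraries(maps_text), bundle_dir)
```

Calling libraries_for_pid with the PID of a running task and the directory holding the project's bundled libraries would give two lists to compare. Reading another user's process (such as one owned by the boinc user) typically requires running as that user or as root.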

At any rate, I completed a task and I've checked out a bunch more to crunch on; they look good.
3) Questions and Answers : Unix/Linux : clang-ocl: No such file or directory (Message 1192)
Posted 28 Apr 2021 by Sagittarius Lupus
Post:
I'm not working on the Portage bits themselves, no -- there are already Gentoo maintainers for the BOINC and various ROCm packages. They work pretty well.

Most of the BOINC GPU projects I'm contributing to (Amicable Numbers, Collatz Conjecture, SRBase, NumberFields, Minecraft@Home) are OpenCL applications on top of ROCm. This is the first project I've seen using native ROCm with HIP, so I didn't have support for that built into my system installation at all -- another potential wrench I need to sort out.

I will do as you suggest and try running the application standalone using your instructions. That'll help me retry without sending a bunch of failed tasks back to your server. I'll let you know how it goes.
4) Questions and Answers : Unix/Linux : Fuse/read-only filesystem issue with newer distributions (and new client update) (Message 1190)
Posted 28 Apr 2021 by Sagittarius Lupus
Post:
In case it's helpful, this is my override file at /etc/systemd/system/boinc-client.service.d/override.conf:

[Service]
PrivateTmp=false
ProtectControlGroups=false
ReadWritePaths=-/tmp


With this modification, I don't have any trouble with BOINC tasks writing to /tmp, modifying their own control groups, or interacting with FUSE filesystems (LHC@Home does this with CVMFS).
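
To sanity-check a drop-in like this before restarting the service, one option is to parse it and look at what the [Service] section actually sets. The sketch below uses Python's configparser, which handles simple drop-ins like the one above but is not a full systemd unit parser (it assumes one value per directive, for instance). The function name is hypothetical.

```python
import configparser


def read_service_overrides(text):
    """Return the [Service] directives of a systemd drop-in as a plain dict."""
    parser = configparser.ConfigParser(
        interpolation=None,   # treat "%" specifiers as literal text
        strict=False,         # tolerate repeated directives (the last one wins here)
        delimiters=("=",),    # systemd uses "=" only
    )
    parser.optionxform = str  # keep directive names case-sensitive
    parser.read_string(text)
    if not parser.has_section("Service"):
        return {}
    return dict(parser["Service"])


OVERRIDE_TEXT = """\
[Service]
PrivateTmp=false
ProtectControlGroups=false
ReadWritePaths=-/tmp
"""

overrides = read_service_overrides(OVERRIDE_TEXT)
for name, value in overrides.items():
    print(f"{name} = {value}")
```

This only confirms what the file says. Whether systemd applied it is a separate question, answered by reloading the daemon and inspecting the running unit.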
5) Questions and Answers : Unix/Linux : clang-ocl: No such file or directory (Message 1188)
Posted 28 Apr 2021 by Sagittarius Lupus
Post:
Thanks for responding, and for all that information. I'll consider joining the Discord.

Yeah, so, I'm with you on trying to make the binary available on the host at the path it expects, at least as a workaround. I tried that earlier today, but all of my ROCm packages are also at 4.1.0 now, and I use the system installation for other BOINC projects. It appears to have executed the right binary, but then bailed spectacularly in some new way I haven't yet had time to dig into. Possibly I'm missing some dependencies; possibly 4.1.0 just won't work, but I will have to do some wizardry to get 3.9.0 back onto my system for any testing.

Some editorialization: Software vendors can recommend installation in non-standard paths, but as you've discovered, ignoring the Linux FHS has less-than-helpful side effects like making binary dependencies impossible to find without explicit declarations. I wasn't aware the vendor (AMD) had rolled their own packaging for Ubuntu and CentOS (RIP), but I'm using Gentoo, where our package maintainers are strongly encouraged to adhere to FHS paths. Because the system is designed for stable, repeatable software builds, any given build- or run-time dependency has to be in a predictable location and the kind of messiness implied by installing stuff in /opt/ just isn't allowed. Alas, if only every upstream developer would get that memo.
6) Questions and Answers : Unix/Linux : Fuse/read-only filesystem issue with newer distributions (and new client update) (Message 1185)
Posted 27 Apr 2021 by Sagittarius Lupus
Post:
Hey, there. I can't be certain this is the same issue, but you mentioned a distribution upgrade, and that creates an opportunity for systemd to get involved with the BOINC client if you're running it as a service. I banged my head against the problem of tasks running in the client being unable to access various parts of the host filesystem that definitely were not read-only. In particular, they couldn't reach into various control groups that the boinc user should have had exclusive access to, and worse, if I ran the tasks outside of BOINC they had no such problem.

My investigation led to this thread over on the LHC@Home forums, which is mostly me talking to myself about the problem, where I eventually managed to solve it: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5121

TL;DR: if you're running BOINC as a service in systemd, the service unit installed by your distribution may enforce certain sandboxing features that mask access to parts of the host file system for processes running inside the service. In my case, I had to set a ProtectControlGroups=false override, but in your case, I suspect it may be sufficient to add the path to the FUSE filesystem you're trying to access to a ReadWritePaths override in your boinc-client.service unit file.
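
As a rough illustration of what such an override could contain, here is a minimal sketch that renders drop-in text for a list of writable paths. The function name is hypothetical, and the example path is a placeholder rather than the real location of anyone's FUSE mount. ProtectControlGroups and ReadWritePaths are real systemd directives, and ReadWritePaths may be given more than once.

```python
def render_override(read_write_paths, protect_control_groups=None):
    """Build the text of a [Service] drop-in that relaxes sandboxing for the given paths."""
    lines = ["[Service]"]
    if protect_control_groups is not None:
        value = "true" if protect_control_groups else "false"
        lines.append(f"ProtectControlGroups={value}")
    for path in read_write_paths:
        # A leading "-" tells systemd to ignore the entry if the path does not exist.
        bare = path[1:] if path.startswith("-") else path
        if not bare.startswith("/"):
            raise ValueError(f"ReadWritePaths entries must be absolute: {path!r}")
        lines.append(f"ReadWritePaths={path}")
    return "\n".join(lines) + "\n"


# Placeholder path: substitute the actual mount point of the FUSE filesystem.
print(render_override(["/path/to/fuse/mount"], protect_control_groups=False))
```

The rendered text would go in a drop-in file under the service's .d directory, followed by systemctl daemon-reload and a restart of the BOINC client service. Only loosen the directives a project actually needs, since each one removes a layer of sandboxing.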

Of course, this means that if any of your volunteers are running BOINC on Linux distros with systemd, they will in general have to make the same sandbox accommodations you do. This only tends to be relevant to particularly advanced BOINC projects that reach into unusual parts of the host operating system. You might also consider reporting the permissions conflict to your distribution as a bug, if they are willing to review the security implications of modifying their packaging of the BOINC client for Ubuntu as a whole.
7) Questions and Answers : Unix/Linux : clang-ocl: No such file or directory (Message 1181)
Posted 27 Apr 2021 by Sagittarius Lupus
Post:
I happened to receive one of the new amdrocm tasks, but it appears to be looking for the clang-ocl binary in the wrong place. It's installed on my system in the standard /usr/bin/clang-ocl location, where it is supposed to be... but the task is looking for it in /opt/rocm-3.9.0/bin/clang-ocl, where there is obviously no reason for it to exist.

See: https://www.mlcathome.org/mlcathome/result.php?resultid=5096417
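
A mismatch like this is easy to confirm on the host. The sketch below (function name hypothetical) checks whether a binary exists at the path an application expects and, separately, where a binary of the same name is found on the search path, using the standard library's shutil.which.

```python
import os
import shutil


def diagnose_binary(expected_path, search_path=None):
    """Report whether expected_path exists and where the same-named binary sits on the search path.

    search_path is a PATH-style string; None means use the current environment's PATH.
    """
    name = os.path.basename(expected_path)
    return {
        "expected": expected_path,
        "expected_exists": os.path.exists(expected_path),
        "found_on_path": shutil.which(name, path=search_path),
    }


if __name__ == "__main__":
    report = diagnose_binary("/opt/rocm-3.9.0/bin/clang-ocl")
    for key, value in report.items():
        print(f"{key}: {value}")
```

If expected_exists is False but found_on_path points somewhere like /usr/bin, the application is hard-coding a vendor install prefix rather than resolving the binary through PATH.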

I've been trying to get one of these to actually complete and pass validation on my Vega card, but I'm assigned these test tasks so rarely that I almost never get a chance to troubleshoot them.




©2022 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)