clang-ocl: No such file or directory

Sagittarius Lupus
Joined: 4 Apr 21
Posts: 7
Credit: 415,093
RAC: 5
Message 1181 - Posted: 27 Apr 2021, 1:09:35 UTC
Last modified: 27 Apr 2021, 1:17:16 UTC

I happened to receive one of the new amdrocm tasks, but it appears to be looking for the clang-ocl binary in the wrong place. It's installed on my system in the standard /usr/bin/clang-ocl location, where it is supposed to be... but the task is looking for it in /opt/rocm-3.9.0/bin/clang-ocl, where there is obviously no reason for it to exist.

See: https://www.mlcathome.org/mlcathome/result.php?resultid=5096417

I've been trying to get one of these to actually complete and pass validation on my Vega card, but I'm assigned these test tasks so rarely I almost never get a chance to troubleshoot them.
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 1182 - Posted: 27 Apr 2021, 1:46:36 UTC - in response to Message 1181.  

Thanks for trying this and for the bug report. I share your frustration. I can reproduce this and I'm working on it. The official ROCm docs actually say this should be installed in /opt/rocm-X.Y.Z/bin, with a symlink /opt/rocm pointing to the latest /opt/rocm-X.Y.Z installed on the system.

https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html#deploying-rocm

However, there are a ton of issues with this. First, the client as it stands now is compiled against rocm-3.9.x, and I had no idea it was looking in a hard-coded path to run clang-ocl. So it worked on my machine because I happened to have that installed. Now it fails on my machine because I re-installed it and installed the latest version 4.1.0, and /opt/rocm-3.9.0 doesn't exist.

I had hoped that by shipping the ROCm libraries with the binary, the user wouldn't even need to have ROCm installed; it would just use the local copy included with the app. Unfortunately, the libraries contain external calls out to hard-coded paths, including the version number in the path as well. This is also bad because BOINC is not ROCm-aware and has no way of telling me what version of ROCm is installed on the system (like it does with CUDA)... so if I compile against a version of ROCm, there's no way to know ahead of time if the user is using that version on their system. In reality, if ROCm can't find the specific version of clang-ocl in the path it expects, it should fall back to just running whatever it can find in $PATH... but apparently it doesn't. Or at least it doesn't on the version I compiled the client against.
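
The fallback described above (try the version-pinned directory first, then whatever is on `$PATH`) can be sketched in shell. This only illustrates the lookup order; `resolve_tool` is a hypothetical helper and not part of ROCm or the MLC@Home client.

```shell
#!/bin/sh
# resolve_tool TOOL PINNED_DIR
# Hypothetical helper: print the path of TOOL, preferring PINNED_DIR
# and falling back to a normal PATH search if it is not there.
resolve_tool() {
    tool=$1
    pinned_dir=$2
    if [ -x "$pinned_dir/$tool" ]; then
        # The version-pinned copy exists, so use it.
        printf '%s\n' "$pinned_dir/$tool"
    elif command -v "$tool" >/dev/null 2>&1; then
        # Otherwise accept whatever the PATH search finds.
        command -v "$tool"
    else
        echo "$tool: not found in $pinned_dir or PATH" >&2
        return 1
    fi
}

# Example call, using the paths from this thread:
#   resolve_tool clang-ocl /opt/rocm-3.9.0/bin
```

With this order, a system that only has /usr/bin/clang-ocl would still resolve the tool as long as /usr/bin is on `$PATH`.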

Even static linking (what we're working on for the next release) won't fix this issue. I'm working on it but it'll take a while.

As a workaround, if you're willing to, you can install rocm-3.9.0 on your system under /opt, and it should work on Vega (and Radeon VII) as-is. We just had one user get it running on a Radeon VII by installing 3.9.0 (see the chat on the Discord server)... but that is *far* from an ideal solution.
Sagittarius Lupus
Joined: 4 Apr 21
Posts: 7
Credit: 415,093
RAC: 5
Message 1188 - Posted: 28 Apr 2021, 2:09:32 UTC

Thanks for responding, and for all that information. I'll consider joining the Discord.

Yeah, so, I'm with you on trying to make the binary available on the host at the path it expects, at least as a workaround. I tried that earlier today, but all of my ROCm packages are also at 4.1.0 now, and I use the system installation for other BOINC projects. It appears to have executed the right binary, but then bailed spectacularly in some new way I haven't yet had time to dig into. Possibly I'm missing some dependencies; possibly 4.1.0 just won't work, but I will have to do some wizardry to get 3.9.0 back onto my system for any testing.

Some editorialization: Software vendors can recommend installation in non-standard paths, but as you've discovered, ignoring the Linux FHS has less-than-helpful side effects like making binary dependencies impossible to find without explicit declarations. I wasn't aware the vendor (AMD) had rolled their own packaging for Ubuntu and CentOS (RIP), but I'm using Gentoo, where our package maintainers are strongly encouraged to adhere to FHS paths. Because the system is designed for stable, repeatable software builds, any given build- or run-time dependency has to be in a predictable location and the kind of messiness implied by installing stuff in /opt/ just isn't allowed. Alas, if only every upstream developer would get that memo.
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 1189 - Posted: 28 Apr 2021, 2:27:47 UTC - in response to Message 1188.  
Last modified: 28 Apr 2021, 2:29:31 UTC

Oh, you're preaching to the choir here. Every time I see anything in /opt I cringe. At least AMD is better than they were originally, when they really just wanted you to use their Docker image. I don't know whether AMD is working with the Debian folks on proper packaging (I hope they are), but I've always used their repo.

Are you working on a proper Portage recipe for ROCm, BOINC, or both? If so, rock on, happy to help in any way I can.
Do you have other projects using ROCm natively, or using OpenCL via ROCm? We're using PyTorch compiled with native ROCm/HIP support, not OpenCL; I wonder if that makes a difference.

I'm wondering if the issue is that we're shipping our own rocm libraries. If you have rocm installed already and in your search path, then the libraries we ship are redundant, and possibly complicating the setup.

Would you be willing to try running the client without our libraries? Here's the general process:

* Go into the MLC project dir in the BOINC home directory.
* Copy `mlds_*`, all the files that end in `-pt17rocm39`, and the dataset (`*.hdf5`) into a separate temp directory.
* Remove the `-pt17rocm39` suffix from the end of all the files that have it.
* Create a `dataset.hdf5` symlink to the dataset file in the same dir (`ln -s Parity*.hdf5 dataset.hdf5`).

You should then be able to run the client from that directory via `./mlds_* -m 5` (will run for 5 epochs then exit). That should fail in the same way.
From there, delete or rename all the libraries from this directory that you already have installed for your main rocm installation and try running again. I'd love to hear how it fails in new and spectacular ways. I'm still trying to test the systemd issue.
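
The staging steps above could be scripted roughly as follows. This is a sketch only: `stage_standalone` is a hypothetical helper, the project directory location depends on your BOINC installation and is passed in as an argument, and the file patterns are taken from the steps above.

```shell
#!/bin/sh
# stage_standalone SRC_DIR DEST_DIR
# Hypothetical helper: copy the client, the suffixed libraries, and the
# dataset out of the project dir, strip the suffix, and link the dataset.
stage_standalone() {
    src=$1
    dest=$2
    suffix=-pt17rocm39
    mkdir -p "$dest" || return 1

    # Copy the client binary, the suffixed files, and the dataset.
    for f in "$src"/mlds_* "$src"/*"$suffix" "$src"/*.hdf5; do
        [ -e "$f" ] && cp -p "$f" "$dest"/
    done

    # Strip the -pt17rocm39 suffix from every copied file that has it.
    for f in "$dest"/*"$suffix"; do
        [ -e "$f" ] || continue
        mv "$f" "${f%"$suffix"}"
    done

    # Point dataset.hdf5 at the first Parity*.hdf5 file found.
    for f in "$dest"/Parity*.hdf5; do
        [ -e "$f" ] || continue
        ln -sf "$(basename "$f")" "$dest/dataset.hdf5"
        break
    done
}

# Example (both paths are placeholders):
#   stage_standalone /path/to/boinc/projects/mlc-project-dir /tmp/mlc-standalone
#   cd /tmp/mlc-standalone && ./mlds_* -m 5
```

Working on a copy keeps the BOINC project directory untouched, so deleting or renaming libraries during testing cannot break tasks the client is managing.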
Sagittarius Lupus
Joined: 4 Apr 21
Posts: 7
Credit: 415,093
RAC: 5
Message 1192 - Posted: 28 Apr 2021, 3:37:20 UTC

I'm not working on the Portage bits themselves, no -- there are already Gentoo maintainers for the BOINC and various ROCm packages. They work pretty well.

Most of the BOINC GPU projects I'm contributing to (Amicable Numbers, Collatz Conjecture, SRBase, NumberFields, Minecraft@Home) are OpenCL applications on top of ROCm. This is the first project I've actually seen using native ROCm with HIP, so I didn't actually have support for that built in to my system installation at all -- another potential wrench I need to sort out.

I will do as you suggest and try running the application standalone using your instructions. That'll help me retry without sending a bunch of failed tasks back to your server. I'll let you know how it goes.
Sagittarius Lupus
Joined: 4 Apr 21
Posts: 7
Credit: 415,093
RAC: 5
Message 1193 - Posted: 30 Apr 2021, 1:14:28 UTC

I had a symlink from /opt/rocm-3.9.0/bin/clang-ocl -> /usr/bin/clang-ocl that was doing its part, but my build of ROCm didn't have the USE flags for HIP turned on. I figured the exceptions I saw straight away were just my system missing entry points for your app, so I rebuilt ROCm with those flags enabled. It turns out the science ebuilds for ROCm 3.9.0 on Gentoo (all the various HIP and native ROCm libraries, including Tensile) are quite broken, but those for 4.0.0 are just barely serviceable, so that is what I actually have installed. The package dev-util/amd-rocm-meta[opencl,hip,science] from the ROCm package overlay I linked in my previous post provides all the necessaries.
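
A minimal sketch of the symlink workaround mentioned above, assuming the real binary lives in /usr/bin and the task expects it under /opt/rocm-3.9.0/bin. `link_tool` is a hypothetical helper, and writing under /opt normally requires root.

```shell
#!/bin/sh
# link_tool REAL_PATH EXPECTED_PATH
# Hypothetical helper: make EXPECTED_PATH a symlink to an existing
# executable at REAL_PATH, creating parent directories as needed.
link_tool() {
    real=$1
    expected=$2
    if [ ! -x "$real" ]; then
        echo "$real is not an executable file" >&2
        return 1
    fi
    mkdir -p "$(dirname "$expected")" && ln -sf "$real" "$expected"
}

# Example, using the paths from this thread (run as root):
#   link_tool /usr/bin/clang-ocl /opt/rocm-3.9.0/bin/clang-ocl
```

The link only makes the binary findable at the hard-coded path; it does not make a different ROCm version behave like 3.9.0.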

After I did that, I was able to run a task both in standalone mode and inside the BOINC client. But when I checked with lsof which libraries the task is using, I could see that it's using the ones you bundled, except for a few standard system libraries. It may be that all I did was remove a conflict.
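
One way to run that kind of check is to label each shared object a process has open as bundled or system. This is a sketch under assumptions: `classify_libs` is a hypothetical helper that reads file paths on stdin, and the directory passed to it stands in for wherever the bundled libraries live.

```shell
#!/bin/sh
# classify_libs BUNDLE_DIR
# Hypothetical helper: read file paths on stdin, keep shared objects
# (names containing ".so"), and label each one by where it lives.
classify_libs() {
    bundle_dir=$1
    grep -E '\.so($|\.)' | sort -u | while IFS= read -r path; do
        case $path in
            "$bundle_dir"/*) printf 'bundled %s\n' "$path" ;;
            *)               printf 'system  %s\n' "$path" ;;
        esac
    done
}

# Example (assumes lsof is installed and $pid is the running task;
# the last column of lsof's default output is the file name):
#   lsof -p "$pid" | awk '{print $NF}' | classify_libs /path/to/project/dir
```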

At any rate, I completed a task and I've checked out a bunch more to crunch on; they look good.

©2024 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)