Sorry for the delay
| Author | Message |
|---|---|
|
Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
Sorry for the delay, all. We wanted to have new clients out a week ago, then this past weekend, and it's now Wednesday and they're still not ready. Linux clients for amd64 and arm32/64 are all ready with the multithreading and zlib fixes in place and, more importantly for me, now build automatically on each commit, which will help catch cross-platform errors early and automates new releases (previous releases were hand-built each time on each arch, taking a lot of time). The problem is the Windows client, which is just taking a while to get ready, not because of the code, but because dealing with CMake, VS2019, MinGW, and random libraries is just not going well. Expect the new client any minute now once I get that sorted. And the extra time spent on this release should make future releases a lot smoother. Thanks for your patience. |
|
Joined: 12 Aug 20 Posts: 21 Credit: 53,001,945 RAC: 0 |
Hi! We're waiting (un)patiently! ;-) Two questions: * Would it be possible to release only the Linux client for a few days while you're working on the Windows client? * Will the new client take more RAM than the current one? I just recently quit crunching on the NFS@home project as their client took approx. 800 MB per core, and that's a little bit too much for several of my machines. (4 cores and 4 GB of RAM is very commonplace among slightly older generations of machines.) As far as I can see, the current client takes slightly less than 700 MB per core, and that's just acceptable. :-) Kindest regards, Gunnar |
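The memory concern above comes down to simple arithmetic. A rough sketch in Python, assuming about 1 GB is held back for the operating system and everything else (that reserve is an illustrative guess, not a figure from the thread; the function name is hypothetical):

```python
def max_concurrent_tasks(total_ram_mb, per_task_mb, reserved_mb=1024):
    """Return how many tasks fit in RAM after holding back a reserve.

    reserved_mb is an assumed allowance for the OS and other programs.
    """
    usable_mb = max(total_ram_mb - reserved_mb, 0)
    return usable_mb // per_task_mb


# A 4 GB machine, using the per-core figures quoted in the post above.
for per_task_mb in (700, 800):
    tasks = max_concurrent_tasks(4096, per_task_mb)
    print(f"{per_task_mb} MB per task: room for {tasks} concurrent tasks")
```

Under that assumption a 4-core, 4 GB machine can keep all four cores busy at roughly 700 MB per task but only three at 800 MB, which matches the "just acceptable" remark.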
|
Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
I got the Windows client compiling again at about 2am last night, but didn't want to turn on the new clients right before bed in case something went wrong. I'll release them in an hour or so once I'm home from my day job and have fed the fam dinner. To your specific questions: I could have released the Linux clients early, but I continued to make small tweaks even as I was working on the Windows one, so it's an overall win. As for memory usage, it's tough, as it's a tradeoff between memory usage and speed/power/disk thrashing. There is one good thing in the new client: it supports datasets in int8 instead of float32, which will help a bit with memory usage for new datasets, but not the old/current ones. So expect the new client release and announcement in the next few hours, then be on the lookout for new dataset3 WUs over the next few days. |
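To put a number on the int8 versus float32 point: each int8 value occupies one byte and each float32 value four, so the raw dataset shrinks by about 4x. A small sketch using only the Python standard library; the dataset size is made up, and how the project actually stores or quantizes its data is not stated in the thread:

```python
from array import array

n_values = 1_000_000  # hypothetical number of values in a dataset

# "f" is a C float (4 bytes on common platforms), "b" is a signed 8-bit integer.
as_float32 = array("f", [0.0]) * n_values
as_int8 = array("b", [0]) * n_values

float32_bytes = as_float32.itemsize * len(as_float32)
int8_bytes = as_int8.itemsize * len(as_int8)

print(f"float32: {float32_bytes / 1e6:.1f} MB")
print(f"int8:    {int8_bytes / 1e6:.1f} MB")
print(f"ratio:   {float32_bytes / int8_bytes:.0f}x")
```

This only covers the dataset itself; model weights and intermediate buffers are a separate cost.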
|
Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
Linux clients updated; the Windows client is having an issue registering with the server. |
|
Joined: 3 Jul 20 Posts: 13 Credit: 13,421,453 RAC: 0 |
Thanks for all the updates. |
|
Joined: 10 Aug 20 Posts: 13 Credit: 6,703,099 RAC: 3 |
On the Odroid-N2 the arm64 app runs very fast without problems. Runtime 3h20m. https://www.mlcathome.org/mlcathome/result.php?resultid=1090117 On the Odroid-HC1 the arm32 app crashes with a "dlopen(): error loading libfuse.so.2" error. After "sudo apt install libfuse2" I get this error: <core_client_version>7.16.6</core_client_version> <![CDATA[ <message> process exited with code 127 (0x7f, -129)</message> <stderr_txt> fuse: failed to exec fusermount: No such file or directory open dir error: No such file or directory </stderr_txt> https://www.mlcathome.org/mlcathome/result.php?resultid=1096862 |
|
Joined: 3 Jul 20 Posts: 13 Credit: 13,421,453 RAC: 0 |
I have 2 pretty much identical dual Opteron systems running Ubuntu. Both did fine with 9.20 but all errors with 9.50 https://www.mlcathome.org/mlcathome/results.php?hostid=2032&offset=0&show_names=0&state=6&appid= https://www.mlcathome.org/mlcathome/results.php?hostid=293&offset=20&show_names=0&state=6&appid= |
|
Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
It's not just the fuse library; you need the fuse userspace tools installed too. Those are in the `fuse` package. I should put a sticky post up about it. |
|
Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
"I have 2 pretty much identical dual Opteron systems running Ubuntu. Both did fine with 9.20 but all errors with 9.50" First, thanks for the bug report. That's not good, and I can't reproduce it. Error 4 is normally an illegal instruction, which, given that these are Opterons, suggests the way I'm compiling pytorch may enable something like AVX when it shouldn't. I'll try to fix it and push a 9.51 update later today. It should only require something like sse2, but use AVX/AVX2/AVX512 if it's available. In short, I'm working on it. |
|
Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
You're not the only one having these issues, so it's not just Opterons. Sigh. |
|
Joined: 3 Jul 20 Posts: 13 Credit: 13,421,453 RAC: 0 |
OK thanks for letting me know. Again thanks for keeping us updated! :-) |
|
Joined: 10 Aug 20 Posts: 13 Credit: 6,703,099 RAC: 3 |
Fuse is installed. lsmod shows fuse is loaded. |
|
Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
... and it's still not working? On my Debian/arm systems, once you sudo apt install fuse, there's a program /bin/fusermount. That's what the error is saying it can't find. Is that file present on your system? |
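For anyone hitting the same error, here is a quick way to check both pieces the app seems to need: the `fusermount` helper and the libfuse shared library. This is only a sketch using the Python standard library, and `check_fuse_prerequisites` is a made-up helper name. It looks at the PATH of whoever runs it, which may differ from what the BOINC client sees.

```python
import ctypes.util
import shutil


def check_fuse_prerequisites():
    """Report where (or whether) the FUSE helper binary and library are found.

    Each value is a path/name string, or None when nothing was found.
    """
    return {
        # Userspace helper that the error message says cannot be executed.
        "fusermount": shutil.which("fusermount"),
        # Shared library behind the earlier dlopen() failure (libfuse.so.2).
        "libfuse": ctypes.util.find_library("fuse"),
    }


if __name__ == "__main__":
    for name, location in check_fuse_prerequisites().items():
        print(f"{name}: {location if location else 'NOT FOUND'}")
```

If `fusermount` shows up as NOT FOUND even though the package manager says `fuse` is installed, the binary may come from a differently named package on that distribution.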
|
Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
As for the Opteron problem, the issue appears to be that libopenblas detects the target as having sse3 when it doesn't. I've been able to reproduce it using a local VM with the processor set to a Gen1 Opteron. It appears the (older version of) openblas I have bundled with the app has issues detecting a few platform capabilities. PyTorch has other options for BLAS, so I'm going to spin a 9.51 release for x86_64 only that uses either Eigen or MKL for BLAS instead of openblas. While newer versions of openblas may have fixed this issue, I also see they've deprecated older optimizations in later releases, so it probably wouldn't be worth it. The only downside is that it'll take longer to compile and probably add to the binary size. I'm working on it. |
|
Joined: 10 Aug 20 Posts: 13 Credit: 6,703,099 RAC: 3 |
It is not on the system. If I run "sudo apt install fuse" I get the message that the newest version is already installed. |
|
Joined: 12 Aug 20 Posts: 21 Credit: 53,001,945 RAC: 0 |
Hi! Most of my machines seem to be doing fine with the new version, and for example the HP Elitedesk 8300 USDT with the i5-3470s CPU seems to run ~15% more efficiently! :-) However, I have a problem with old Intel processors, family 6, model 15, ... (and earlier?): these all seem to produce the SIGILL (signal 4) error. I have four such old cans in my "Boinc-farm", and none of them can run any 9.50 tasks. I'm not really a hardware black-belter, so I don't know exactly which CPU instructions are illegal for those CPUs, but many of my older computers can run the project although they do not have the AVX extensions. A good example could be my two old faithful HPs with Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz [Family 6 Model 23 Stepping 10]. They certainly do not have any AVX, although they do feature the SSE instruction set, all the way up to SSE4.1. Could it be the lack of the SSE4 instruction set in the CPU that causes the problem? Those of my cans that caused trouble all feature an Intel mobile CPU, model T7500 @ 2.20GHz [Family 6 Model 15 Stepping 11], and they are known to lack the SSE4 set. Good luck with fixing the problem, and have a nice weekend!!! //Gunnar |
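One way to see which instruction sets a given x86 Linux host actually reports is to read the `flags` line in `/proc/cpuinfo`. The sketch below uses hypothetical helper names and only handles the x86 layout; ARM kernels use a `Features` line instead. Note that Linux lists SSE3 as `pni`.

```python
# Flags of interest, oldest to newest. Linux reports SSE3 under the name "pni".
INTERESTING_FLAGS = (
    "sse2", "pni", "ssse3", "sse4_1", "sse4_2", "avx", "avx2", "avx512f",
)


def parse_cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line found."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def report(flags):
    """Map each interesting flag name to True/False for this CPU."""
    return {name: name in flags for name in INTERESTING_FLAGS}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        text = ""  # not Linux, or /proc is unavailable
    for name, present in report(parse_cpu_flags(text)).items():
        print(f"{name:8s} {'yes' if present else 'no'}")
```

Comparing the output from a failing host against a working one shows what the failing CPU lacks, which helps narrow down which instruction set the crashing code path assumed.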
|
Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
What version of Debian? Or is it some other distro/variant like Ubuntu? When I go here: https://packages.debian.org/buster/fuse and click on "list of files" for both arm64 and armel, it lists /bin/fusermount as an installed file from that package. Same with Ubuntu 18.04 here: https://packages.ubuntu.com/bionic/fuse . So, I'm a bit at a loss on what to try next. |
|
Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
Hi! It's definitely an issue with an older version of OpenBLAS misdetecting what's available on some platforms, and it appears to be a bit random which ones are affected. I saw a Haswell-based system fail with SIGILL in the logs(!). I haven't run into it on my systems, but that's why I always expect errors when rolling out a new app version. Behind the scenes, we now compile libTorch from source on an Ubuntu 14.04 machine (to make sure the libc version is old enough). libTorch can use either Intel's MKL, the open source C++ Eigen library, or OpenBLAS for BLAS functions. v9.20 uses the pre-compiled libtorch binaries from pytorch.org, which (presumably) used Intel MKL. I prefer to use OpenBLAS, mainly because I don't trust Intel not to play fast and loose with optimizations to win benchmarks, plus they don't have an incentive to optimize for older systems (they want you to buy new ones instead). OpenBLAS also produces smaller binaries and works on other architectures. However, it appears in this case it's let me down. One option is to turn on Intel MKL just for amd64, which is what I'll probably do. Another option would be to compile a custom, newer version of libopenblas instead of using the one that comes stock with Ubuntu 14.04, and hope that solves the problem. But I think I'll save that for another time. |
|
Joined: 10 Aug 20 Posts: 13 Credit: 6,703,099 RAC: 3 |
The Odroid-HC1 (32-bit) has Ubuntu 20.04.1 LTS [5.4.58-211|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9)] installed. The HC1s are headless systems running minimal Ubuntu. I will try it with a new installation of Ubuntu MATE 20.04.1 LTS on an Odroid-XU4. |
|
Joined: 10 Aug 20 Posts: 13 Credit: 6,703,099 RAC: 3 |
I think it is a problem with Ubuntu Minimal 20.04.1 LTS. On the Odroid-XU4 with Ubuntu MATE 20.04.1 LTS the 32-bit app works without error. Tomorrow I will try it with a fresh installation of Ubuntu MATE 20.04.1 LTS on the Odroid-HC1. |
©2022 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)