Sorry for the delay

pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 456
Credit: 14,368,944
RAC: 939
Message 336 - Posted: 20 Aug 2020, 0:23:47 UTC

Sorry for the delay, all. We wanted to have new clients out a week ago, then this past weekend, and it's now Wednesday and they're still not ready. The Linux clients for amd64 and arm32/64 are all ready with the multithreading and zlib fixes in place and, more importantly for me, now build automatically on each commit, which will help catch cross-platform errors early and automates new releases (previous releases were hand-built on each arch every time, which took a lot of time).

The problem is the Windows client, which is just taking a while to get ready, not because of the code, but because dealing with CMake, VS2019, MinGW, and random libraries is just not going well. Expect the new client as soon as I get that sorted.

And the extra time spent on this release should make future releases a lot smoother. Thanks for your patience.

Gunnar Hjern
Joined: 12 Aug 20
Posts: 21
Credit: 46,402,862
RAC: 42,511
Message 337 - Posted: 20 Aug 2020, 14:51:00 UTC - in response to Message 336.  

Hi!

We're waiting (un)patiently! ;-)

Two questions:
* Would it be possible to release only the Linux client for a few days while you're working on the Windows client?
* Will the new client take more RAM than the current one? I recently quit crunching on the NFS@home project because their client took approx. 800 MB per core, and that's a little too much for several of my machines. (4 cores and 4 GB of RAM is very common among slightly older generations of machines.) As far as I can see, the current client takes slightly less than 700 MB per core, and that's just acceptable. :-)

Kindest regards,
Gunnar

pianoman [MLC@Home Admin]
Message 338 - Posted: 20 Aug 2020, 22:14:53 UTC - in response to Message 337.  

I got the Windows client compiling again at about 2am last night, but didn't want to turn on the new clients right before bed in case something went wrong. I'll release them in an hour or so once I'm home from my day job and have fed the fam dinner.

To your specific questions: I could have released the Linux clients early, but I continued to make small tweaks even as I was working on the Windows one, so it's an overall win. As for memory usage, it's tough, as it's a tradeoff between memory usage and speed/power/disk thrashing. There is one good thing in the new client: it supports datasets stored as int8 instead of float32, which will help with memory usage a bit for new datasets, but not for the old/current ones.

So expect the new client release and announcement in the next few hours, then be on the lookout for new dataset3 WUs over the next few days.

pianoman [MLC@Home Admin]
Message 339 - Posted: 21 Aug 2020, 0:38:39 UTC - in response to Message 338.  

Linux clients updated; the Windows client is having an issue registering with the server.

PoppaGeek
Joined: 3 Jul 20
Posts: 13
Credit: 10,301,199
RAC: 0
Message 340 - Posted: 21 Aug 2020, 0:46:30 UTC

Thanks for all the updates.

JagDoc
Joined: 10 Aug 20
Posts: 13
Credit: 6,159,690
RAC: 0
Message 345 - Posted: 21 Aug 2020, 7:37:51 UTC

On the Odroid-N2 the arm64 app runs very fast without problems.
Runtime 3h20m.
https://www.mlcathome.org/mlcathome/result.php?resultid=1090117

On the Odroid-HC1 the arm32 app crashes with "dlopen(): error loading libfuse.so.2" error.
After "sudo apt install libfuse2" I get this error:

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 127 (0x7f, -129)</message>
<stderr_txt>
fuse: failed to exec fusermount: No such file or directory
open dir error: No such file or directory

</stderr_txt>

https://www.mlcathome.org/mlcathome/result.php?resultid=1096862

PoppaGeek
Message 346 - Posted: 21 Aug 2020, 13:34:41 UTC


pianoman [MLC@Home Admin]
Message 347 - Posted: 21 Aug 2020, 15:26:26 UTC - in response to Message 345.  


After "sudo apt install libfuse2" i get this error:

7.16.6

process exited with code 127 (0x7f, -129)


fuse: failed to exec fusermount: No such file or directory
open dir error: No such file or directory




It's not just the fuse library, you need the fuse userspace tools installed too. Those are in the `fuse` package. I should put a sticky post up about it.
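
To check for this from a script, here is a minimal sketch in Python. It only looks for the `fusermount` helper named in the error message on the search path; the function name `missing_tools` is made up for illustration and is not part of the MLC@Home app or BOINC.

```python
import shutil


def missing_tools(tools=("fusermount",), path=None):
    """Return the names from `tools` that cannot be found on the search path.

    `path` is passed straight to shutil.which; None means "use $PATH".
    """
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        # On Debian/Ubuntu the thread suggests the `fuse` package provides it.
        print("not found on PATH:", ", ".join(missing))
    else:
        print("fusermount found on PATH")
```

Keep in mind that the BOINC client may run as a different user with a different PATH than your login shell, so a positive result here is a hint rather than proof.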

pianoman [MLC@Home Admin]
Message 348 - Posted: 21 Aug 2020, 15:35:05 UTC - in response to Message 346.  
Last modified: 21 Aug 2020, 15:44:14 UTC

I have 2 pretty much identical dual Opteron systems running Ubuntu. Both did fine with 9.20 but all errors with 9.50

https://www.mlcathome.org/mlcathome/results.php?hostid=2032&offset=0&show_names=0&state=6&appid=

https://www.mlcathome.org/mlcathome/results.php?hostid=293&offset=20&show_names=0&state=6&appid=


First, thanks for the bug report. That's not good, and I can't reproduce it. Error 4 is normally an illegal instruction, so, given that these are Opterons, maybe the way I'm compiling PyTorch enables something like AVX when it shouldn't. I'll try to fix it and push a 9.51 update later today. The app should only require something like SSE2, but use AVX/AVX2/AVX512 if available.
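
For anyone who wants to see what their own CPU reports, here is a rough sketch that reads the `flags` line from `/proc/cpuinfo` on x86 Linux. The function names are invented for this example, and the list of flags to check is just an assumption about what is interesting here. Note that the kernel lists SSE3 as `pni`, and that ARM boards use a `Features` line instead, so they will show "no" for everything.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def report(flags, wanted=("sse2", "pni", "ssse3", "sse4_1", "sse4_2",
                          "avx", "avx2", "avx512f")):
    """Map each wanted flag name to True/False depending on whether it is present."""
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            cpu_flags = parse_cpu_flags(handle.read())
    except OSError:
        cpu_flags = set()  # not Linux, or /proc is not readable
    for name, present in report(cpu_flags).items():
        print(f"{name:8s} {'yes' if present else 'no'}")
```

This only shows what the CPU advertises. It does not show which instructions a particular binary or library actually uses.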

In short, I'm working on it.

pianoman [MLC@Home Admin]
Message 349 - Posted: 21 Aug 2020, 15:48:43 UTC - in response to Message 348.  

You're not the only one having these issues, so it's not just Opterons. Sigh.

PoppaGeek
Message 350 - Posted: 21 Aug 2020, 15:59:29 UTC - in response to Message 349.  

OK thanks for letting me know. Again thanks for keeping us updated! :-)

JagDoc
Message 351 - Posted: 21 Aug 2020, 16:19:07 UTC - in response to Message 347.  


After "sudo apt install libfuse2" i get this error:

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 127 (0x7f, -129)</message>
<stderr_txt>
fuse: failed to exec fusermount: No such file or directory
open dir error: No such file or directory

</stderr_txt>


It's not just the fuse library, you need the fuse userspace tools installed too. Those are in the `fuse` package. I should put a sticky post up about it.

Fuse is installed.
lsmod shows fuse is loaded.

pianoman [MLC@Home Admin]
Message 352 - Posted: 21 Aug 2020, 17:11:19 UTC - in response to Message 351.  
Last modified: 21 Aug 2020, 17:11:36 UTC


After "sudo apt install libfuse2" i get this error:

7.16.6

process exited with code 127 (0x7f, -129)


fuse: failed to exec fusermount: No such file or directory
open dir error: No such file or directory




It's not just the fuse library, you need the fuse userspace tools installed too. Those are in the `fuse` package. I should put a sticky post up about it.

Fuse is installed.
lsmod shows fuse is loaded.


... and it's still not working? On my Debian/ARM systems, once you sudo apt install fuse, there's a program /bin/fusermount. That's what the error is saying it can't find. Is that file present on your system?

pianoman [MLC@Home Admin]
Message 353 - Posted: 21 Aug 2020, 17:19:40 UTC

As for the Opteron problem, the issue appears to be that libopenblas detects the target as having SSE3 when it doesn't. I've been able to reproduce it using a local VM with the processor set to a Gen1 Opteron. It appears the (older version of) OpenBLAS I have bundled with the app has trouble detecting a few platform capabilities. PyTorch has other options for BLAS, so I'm going to spin a 9.51 release for x86_64 only that uses either Eigen or MKL for BLAS instead of OpenBLAS. While newer versions of OpenBLAS may have fixed this issue, I also see they've deprecated older optimizations in later releases, so it probably wouldn't be worth it. The only downside is that it'll take longer to compile and probably add to the binary size. I'm working on it.

JagDoc
Message 355 - Posted: 21 Aug 2020, 17:32:15 UTC - in response to Message 352.  



... and its still not working? On my debian/arm systems, once you sudo apt install fuse, there's a program /bin/fusermount. That's what the error is saying it can't find. Is that file present on your system?

It is not on the system.
If I run "sudo apt install fuse" I get the message that the newest version is installed.

Gunnar Hjern
Message 356 - Posted: 21 Aug 2020, 17:42:04 UTC - in response to Message 348.  

Hi!

Most of my machines seem to be doing fine with the new version, and for example the HP Elitedesk 8300 USDT with the i5-3470S CPU seems to run ~15% more efficiently! :-)

However, I have a problem with old Intel processors, family 6, model 15, ... (and earlier?): These all seem to produce the SIGILL (signal 4) error.
I have four such old cans in my "Boinc-farm", and none of them can run any 9.50 tasks.
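
As a small aside for readers, signal numbers like the 4 above can be translated to names with Python's standard `signal` module. This is only a sketch; the helper name `describe_signal` is made up for illustration.

```python
import signal


def describe_signal(number):
    """Return the symbolic name for a signal number, e.g. 4 -> 'SIGILL'."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return f"unknown signal {number}"


if __name__ == "__main__":
    print(describe_signal(4))
```

SIGILL means the process tried to execute an instruction the CPU does not support, which fits the theory that some code path assumes a newer instruction set than these processors have.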

I'm not really a hardware black belt, so I don't know exactly which CPU instructions are illegal for those CPUs,
but many of my older computers can run the project although they do not have the AVX extensions.
A good example could be my two old faithful HPs with Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz [Family 6 Model 23 Stepping 10].
They certainly do not have any AVX, although they do feature the SSE instruction set, all the way up to SSE4.1.

Could it be the lack of the SSE4 instruction set in the CPU that causes the problem?
The cans that caused trouble all have an Intel mobile CPU, the T7500 @ 2.20GHz [Family 6 Model 15 Stepping 11], which is known to lack the SSE4 set.

Good luck with fixing the problem, and a nice weekend!!!

//Gunnar

pianoman [MLC@Home Admin]
Message 357 - Posted: 21 Aug 2020, 17:48:00 UTC - in response to Message 355.  



... and its still not working? On my debian/arm systems, once you sudo apt install fuse, there's a program /bin/fusermount. That's what the error is saying it can't find. Is that file present on your system?

It is not on the system.
If i run "sudo apt install fuse" i get the message newest version is installed.


Which version of Debian? Or some other distro/variant like Ubuntu? When I go here: https://packages.debian.org/buster/fuse and click on "list of files" for both arm64 and armel, it lists /bin/fusermount as an installed file from that package. Same with Ubuntu 18.04 here: https://packages.ubuntu.com/bionic/fuse .

So, I'm a bit at a loss as to what to try next.

pianoman [MLC@Home Admin]
Message 359 - Posted: 21 Aug 2020, 18:10:01 UTC - in response to Message 356.  
Last modified: 21 Aug 2020, 18:13:23 UTC

Hi!

Most of my machines seems to be doing fine with the new version, and for example the HP Elitedesk 8300 USDT with the CPU i5-3470s seems to run ~15% more efficient! :-)

However, I have a problem with old Intel processors, family 6, model 15, ... (and earlier?): These all seem to produce the SIGILL (signal 4) error.
I have four such old cans in my "Boinc-farm", and none of them can run any 9.50 tasks.

I'm not really a hardware black-belter, so I don't know exactly what CPU instruction are illegal for those CPUs,
but many of my older computers can run the project although they do not have the AVX extensions.
A good example could be my two old faithful HPs with Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz [Family 6 Model 23 Stepping 10].
They certainly do not have any AVX, although they do feature the SSE instruction set, all the way up to SSE4.1.

Could it be the lack of SSE4 instruction set in the CPU that causes the problem?
Those of my cans that caused troubles all featured some Intel mobile CPU of model CPU T7500 @ 2.20GHz [Family 6 Model 15 Stepping 11], and they are known to lack the SSE4 set.

Good luck with fixing the problem, and a nice weekend!!!

//Gunnar


It's definitely an issue with an older version of OpenBLAS misdetecting what's available on some platforms, and it appears to be a bit random which ones are affected. I saw a Haswell-based system fail with SIGILL in the logs(!). I haven't run into it on my systems, but that's why I always expect errors when rolling out a new app version.

Behind the scenes, we now compile libTorch from source on an Ubuntu 14.04 machine (to make sure the libc version is old enough). libTorch can use Intel's MKL, the open source C++ Eigen library, or OpenBLAS for BLAS functions. v9.20 uses the pre-compiled libTorch binaries from pytorch.org, which (presumably) used Intel MKL. I prefer to use OpenBLAS, mainly because I don't trust Intel not to play fast and loose with optimizations to win benchmarks, plus they don't have an incentive to optimize for older systems (they want you to buy new ones instead). OpenBLAS also produces smaller binaries and works on other architectures. However, it appears that in this case it's let me down.

One option is to turn on Intel MKL just for amd64, which is what I'll probably do. Another option would be to compile a custom, newer version of libopenblas instead of using the one that comes stock with Ubuntu 14.04, and hope that solves the problem. But I think I'll save that for another time.

JagDoc
Message 360 - Posted: 21 Aug 2020, 18:29:34 UTC - in response to Message 357.  



... and its still not working? On my debian/arm systems, once you sudo apt install fuse, there's a program /bin/fusermount. That's what the error is saying it can't find. Is that file present on your system?

It is not on the system.
If i run "sudo apt install fuse" i get the message newest version is installed.


What version of debian? or some other distro/variant like ubuntu? When I go here: https://packages.debian.org/buster/fuse and click on "list of files" for both arm64 and armel, it lists /bin/fusermount as an installed file from that package. Same with ubuntu 18.04 here: https://packages.ubuntu.com/bionic/fuse .

So, I'm a bit at a loss on what to try next.

The Odroid-HC1 (32-bit) has Ubuntu 20.04.1 LTS [5.4.58-211|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9)] installed.
The HC1s are headless systems running minimal Ubuntu.

I will try it with a new installation of Ubuntu MATE 20.04.1 LTS on an Odroid-XU4.

JagDoc
Message 361 - Posted: 21 Aug 2020, 20:14:53 UTC - in response to Message 360.  


What version of debian? or some other distro/variant like ubuntu? When I go here: https://packages.debian.org/buster/fuse and click on "list of files" for both arm64 and armel, it lists /bin/fusermount as an installed file from that package. Same with ubuntu 18.04 here: https://packages.ubuntu.com/bionic/fuse .

So, I'm a bit at a loss on what to try next.

On the Odroid-HC1 (32bit) is Ubuntu 20.04.1 LTS [5.4.58-211|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9)] installed.
The HC1 are headless system with minimal Ubuntu.

I will try it with a new installation of Ubuntu MATE 20.04.1 LTS on a Odroid-XU4.

I think it is a problem with Ubuntu Minimal 20.04.1 LTS.

On the Odroid-XU4 with Ubuntu MATE 20.04.1 LTS the 32-bit app works without error.
Tomorrow I will try it with a fresh installation of Ubuntu MATE 20.04.1 LTS on the Odroid-HC1.

©2022 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)