Workaround for "signal 4" problems with 9.50 linux client
| Author | Message |
|---|---|
|
Send message Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
For those having issues with WUs erroring out with "signal 4": the fix is taking longer than I thought. As a temporary workaround, you can add OPENBLAS_CORETYPE=GENERIC to the environment of the boinc user and then restart the BOINC client. This falls back to generic BLAS code instead of running the wrong assembly for your CPU type. This is especially an issue on older Opteron systems, where OpenBLAS assumes they all have SSE3 (only later Opterons do), but I've seen it on some Intel systems too. |
|
Send message Joined: 12 Aug 20 Posts: 21 Credit: 53,001,945 RAC: 0 |
Hi! The few times I have to (re)start the BOINC client, I normally use the $ sudo service boinc-client start command, and that doesn't seem to allow setting any environment variables, or does it? I had a look around on my computers but couldn't find any .bashrc or .bash_profile files in the home folder of boinc. Do you know what command to give or what file to edit in order to set that environment variable? //Gunnar |
|
Send message Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
If you're using a systemd-based system, you can look here: https://serverfault.com/questions/413397/how-to-set-environment-variable-in-systemd-service Note that some distros, like Ubuntu, kept the "service" compatibility commands from Upstart but are really systemd under the hood. For System V init systems, just edit /etc/init.d/boinc-client and set the environment variable there. Again, it shouldn't be long before I have a fix, but until then this will help. |
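For systemd-based systems, the approach from the linked answer can be sketched as a drop-in override file. This is only a minimal sketch: it assumes the unit is named `boinc-client.service` (suggested by the `service boinc-client` command above, but worth confirming with `systemctl status boinc-client`), and the function name `write_openblas_dropin` and the file name `openblas.conf` are made up for illustration.

```shell
#!/bin/sh
# Sketch: write a systemd drop-in that sets OPENBLAS_CORETYPE for the BOINC client.
# Assumption: the unit is called boinc-client.service; adjust if your distro differs.

# write_openblas_dropin DIR
#   Creates DIR if needed and writes openblas.conf (illustrative name) into it.
#   On a real system DIR would be /etc/systemd/system/boinc-client.service.d
write_openblas_dropin() {
    dropin_dir=$1
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/openblas.conf" <<'EOF'
[Service]
Environment="OPENBLAS_CORETYPE=GENERIC"
EOF
}

# Typical use, as root (left commented out so the sketch has no side effects):
#   write_openblas_dropin /etc/systemd/system/boinc-client.service.d
#   systemctl daemon-reload
#   systemctl restart boinc-client
```

After restarting, `systemctl show boinc-client -p Environment` should list the variable if the drop-in was picked up. On a System V init system, the rough equivalent would be an `export OPENBLAS_CORETYPE=GENERIC` line near the top of /etc/init.d/boinc-client.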
|
Send message Joined: 12 Aug 20 Posts: 21 Credit: 53,001,945 RAC: 0 |
Hi! Thanks for the advice! I tried both ways, but I'm afraid I had no luck at all! :-( The old cans I have that are affected by this bug aren't big or important ones, so I guess I'll wait for a newer version instead. I'll keep on crunching until then. Good luck with all your project work, and have a nice new week!!! //Gunnar |
|
Send message Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
Here's the list of CPUs I see in the database that have this issue:
| id | p_model |
|---|---|
| 293 | Six-Core AMD Opteron(tm) Processor 2431 [Family 16 Model 8 Stepping 0] |
| 345 | Intel(R) Core(TM)2 Duo CPU T5870 @ 2.00GHz [Family 6 Model 15 Stepping 13] |
| 349 | AMD Phenom(tm) II X6 1090T Processor [Family 16 Model 10 Stepping 0] |
| 477 | AMD Opteron(tm) Processor 6128 HE [Family 16 Model 9 Stepping 1] |
| 573 | AMD A8-3820 APU with Radeon(tm) HD Graphics [Family 18 Model 1 Stepping 0] |
| 1359 | AMD Phenom(tm) II X6 1090T Processor [Family 16 Model 10 Stepping 0] |
| 1473 | Intel(R) Core(TM)2 CPU 6400 @ 2.13GHz [Family 6 Model 15 Stepping 6] |
| 1588 | Intel(R) Pentium(R) 4 CPU 3.00GHz [Family 15 Model 4 Stepping 9] |
| 1646 | Six-Core AMD Opteron(tm) Processor 8425 HE [Family 16 Model 8 Stepping 0] |
| 1988 | Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz [Family 6 Model 15 Stepping 11] |
| 2032 | Six-Core AMD Opteron(tm) Processor 2431 [Family 16 Model 8 Stepping 0] |
| 2066 | Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz [Family 6 Model 15 Stepping 11] |
| 2169 | AMD Phenom(tm) II X6 1055T Processor [Family 16 Model 10 Stepping 0] |
| 2170 | AMD Phenom(tm) II X6 1055T Processor [Family 16 Model 10 Stepping 0] |
| 2171 | AMD Phenom(tm) II X6 1055T Processor [Family 16 Model 10 Stepping 0] |
| 2172 | AMD Phenom(tm) II X6 1055T Processor [Family 16 Model 10 Stepping 0] |
| 2173 | AMD Phenom(tm) II X6 1055T Processor [Family 16 Model 10 Stepping 0] |
| 2190 | Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz [Family 6 Model 15 Stepping 11] |
| 2214 | Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz [Family 6 Model 15 Stepping 10] |
| 2221 | AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ [Family 15 Model 107 Stepping 1] |
| 2230 | Intel(R) Core(TM)2 Duo CPU T5870 @ 2.00GHz [Family 6 Model 15 Stepping 13] |
| 2232 | Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz [Family 6 Model 15 Stepping 11] |
22 rows in set (5.057 sec)
For complicated reasons, OpenBLAS seems to have major issues properly detecting these older systems, even in the latest version (I've opened a bug on their project about it). I've tried lots of things to fix it in testing, but it's not straightforward. Expect an updated Linux client, hopefully today, to address this. |
|
Send message Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
I've released the v9.55 client for Linux, which should contain a workaround for this issue. Let me know if you continue to have any issues. |
|
Send message Joined: 12 Aug 20 Posts: 21 Credit: 53,001,945 RAC: 0 |
Hi! There are still problems for T-type CPUs. I tested the new version on three of my old cans with T5870 and T7500 CPUs, but they still got the SIGILL signal (Xubuntu 18.04), and the computer with the old Xubuntu 14.04 OS couldn't find GLIBC_2.23. (I also tried resetting the project.) Here are some of the errors, with links to tasks and computers:
Xubuntu 18.04:
host: https://www.mlcathome.org/mlcathome/show_host_detail.php?hostid=2230
task: https://www.mlcathome.org/mlcathome/result.php?resultid=1226047
<message>process got signal 4</message>
Xubuntu 18.04:
host: https://www.mlcathome.org/mlcathome/show_host_detail.php?hostid=1988
task: https://www.mlcathome.org/mlcathome/result.php?resultid=1225407
<message>process got signal 4</message>
Xubuntu 14.04:
host: https://www.mlcathome.org/mlcathome/show_host_detail.php?hostid=2066
task: https://www.mlcathome.org/mlcathome/result.php?resultid=1192553
<message> process exited with code 1 (0x1, -255) </message>
<stderr_txt>
../../projects/www.mlcathome.org_mlcathome/mlds_9.55_x86_64-pc-linux-gnu: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.23' not found (required by /tmp/.mount_mlds_9ptg2Tv/usr/bin/../lib/libtorch_cpu.so)
../../projects/www.mlcathome.org_mlcathome/mlds_9.55_x86_64-pc-linux-gnu: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.23' not found (required by /tmp/.mount_mlds_9ptg2Tv/usr/bin/../lib/libquadmath.so.0)
</stderr_txt>
Good luck with the future versions! //Gunnar |
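The `GLIBC_2.23' not found message means the bundled libraries were linked against a newer glibc than the host provides. A quick way to compare versions might look like the sketch below; it assumes a glibc-based system where `getconf GNU_LIBC_VERSION` works and GNU `sort -V` is available, and the function names are illustrative only.

```shell
#!/bin/sh
# Sketch: check whether the host glibc is at least the version a binary requires.
# Assumptions: GNU coreutils `sort -V` is available; function names are illustrative.

# version_ge A B -> succeeds if version A >= version B
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# host_glibc_version -> prints the version number only (empty on non-glibc systems)
host_glibc_version() {
    getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}'
}

# check_glibc REQUIRED -> reports whether the host satisfies REQUIRED
check_glibc() {
    have=$(host_glibc_version)
    if [ -n "$have" ] && version_ge "$have" "$1"; then
        echo "glibc $have is new enough (need $1)"
    else
        echo "glibc ${have:-unknown} is too old (need $1)"
        return 1
    fi
}
```

Running `check_glibc 2.23` on the affected host would then show whether its glibc is older than what the 9.55 binary asks for.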
|
Send message Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
https://github.com/xianyi/OpenBLAS/issues/2794 We need hardware that exhibits the problem to test on (my attempt to use a virtual machine is apparently showing a different issue). For example, my Core2 system runs fine, but it is Penryn-based, and all the ones I see in the database that have this error are Merom, a (minor) generation behind. We need someone to run some code under gdb and find the exact instruction it's failing on. We can walk you through it. Never mind. Apparently the issue isn't caused by OpenBLAS, but by Intel's internal MKL-DNN library (now oneDNN, part of "oneAPI") misdetecting the CPU and issuing SSE4.1 instructions on CPUs that don't support them. |
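To see whether a given x86 Linux machine falls into the affected group, one option is to look for the SSE4.1 flag in /proc/cpuinfo, where it is spelled `sse4_1` (SSE3 is listed as `pni`). The following is a minimal sketch; the function name `has_cpu_flag` and its optional file argument are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: test whether a CPU feature flag is listed in /proc/cpuinfo (x86 Linux only).
# The function name and the optional file argument are illustrative.

# has_cpu_flag FLAG [CPUINFO_FILE]
#   Succeeds if FLAG appears as a whole word on the first "flags" line.
has_cpu_flag() {
    flag=$1
    cpuinfo=${2:-/proc/cpuinfo}
    awk -v f="$flag" '
        /^flags/ { for (i = 1; i <= NF; i++) if ($i == f) found = 1; exit }
        END      { exit !found }
    ' "$cpuinfo" 2>/dev/null
}
```

For example, `has_cpu_flag sse4_1 && echo "SSE4.1 available" || echo "no SSE4.1"` would be expected to print the second message on the Merom-era Core 2 machines discussed here.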
|
Send message Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
All: Would someone experiencing this issue be willing to test the following binaries for me? These might fix the SIGILL errors on non-SSE4.1 systems. You need to be running on real hardware without SSE4.1 instructions for this to work. To test:
|
|
Send message Joined: 3 Jul 20 Posts: 13 Credit: 13,421,453 RAC: 0 |
Detected AMD Family 16 processor, switching OpenBLAS to generic
Re-exec()-ing to set number of threads correctly...
Machine Learning Dataset Generator v9.55 (Linux/x86_64) (libTorch: release/1.6)
[2020-08-29 23:32:03 main:399] : INFO : Set logging level to 1
[2020-08-29 23:32:03 main:407] : INFO : Running in BOINC Standalone mode
[2020-08-29 23:32:03 main:412] : INFO : Resolving all filenames
[2020-08-29 23:32:03 main:420] : INFO : Resolved: dataset.hdf5 => dataset.hdf5 (exists = 1)
[2020-08-29 23:32:03 main:420] : INFO : Resolved: model.cfg => model.cfg (exists = 0)
[2020-08-29 23:32:03 main:420] : INFO : Resolved: model-final.pt => model-final.pt (exists = 0)
[2020-08-29 23:32:03 main:420] : INFO : Resolved: model-input.pt => model-input.pt (exists = 0)
[2020-08-29 23:32:03 main:420] : INFO : Resolved: snapshot.pt => snapshot.pt (exists = 0)
[2020-08-29 23:32:03 main:434] : INFO : Dataset filename: dataset.hdf5
[2020-08-29 23:32:03 main:436] : INFO : Configuration:
[2020-08-29 23:32:03 main:437] : INFO : Validation Loss Threshold: 0.0001
[2020-08-29 23:32:03 main:438] : INFO : Max Epochs: 2
[2020-08-29 23:32:03 main:439] : INFO : Batch Size: 128
[2020-08-29 23:32:03 main:440] : INFO : Patience: 10
[2020-08-29 23:32:03 main:441] : INFO : Hidden Width: 12
[2020-08-29 23:32:03 main:442] : INFO : # Recurrent Layers: 4
[2020-08-29 23:32:03 main:443] : INFO : # Backend Layers: 4
[2020-08-29 23:32:03 main:445] : INFO : Preparing Dataset
[2020-08-29 23:32:03 load_hdf5_ds_into_tensor:28] : INFO : Loading Dataset /Xt from dataset.hdf5 into memory
[2020-08-29 23:32:04 load_hdf5_ds_into_tensor:28] : INFO : Loading Dataset /Yt from dataset.hdf5 into memory
[2020-08-29 23:32:04 load:103] : INFO : Successfully loaded dataset of 2048 examples into memory.
[2020-08-29 23:32:04 load_hdf5_ds_into_tensor:28] : INFO : Loading Dataset /Xv from dataset.hdf5 into memory
[2020-08-29 23:32:05 load_hdf5_ds_into_tensor:28] : INFO : Loading Dataset /Yv from dataset.hdf5 into memory
[2020-08-29 23:32:05 load:103] : INFO : Successfully loaded dataset of 512 examples into memory.
[2020-08-29 23:32:05 main:451] : INFO : Creating Model
[2020-08-29 23:32:05 main:456] : INFO : Preparing config file
[2020-08-29 23:32:05 main:468] : INFO : Creating new config file
[2020-08-29 23:32:05 main:499] : INFO : Loading DataLoader into Memory
[2020-08-29 23:32:05 main:502] : INFO : Starting Training
[2020-08-29 23:34:11 main:514] : INFO : Epoch 1 | loss: 0.0435973 | val_loss: 0.031742 | Time: 126584 ms
[2020-08-29 23:36:25 main:514] : INFO : Epoch 2 | loss: 0.0312199 | val_loss: 0.0305334 | Time: 133204 ms
[2020-08-29 23:36:25 main:533] : INFO : Saving trained model to model-final.pt, val_loss 0.0305334
[2020-08-29 23:36:25 main:538] : INFO : Saving end state to config to file
[2020-08-29 23:36:25 main:543] : INFO : Success, exiting.. |
|
Send message Joined: 12 Aug 20 Posts: 21 Credit: 53,001,945 RAC: 0 |
Hi! I tested this on three of my machines (T7500, T7300, and T5870) with (X)ubuntu 16 and 18, and found that the first application, "mlds-test-no-sse4.appimage", always executes well (see the example output below), but the second, "mlds-test-no-sse4-mkldnn.appimage", always crashes with: Illegal instruction (core dumped) Nice weekend, and good luck with the work!!! //Gunnar
Output from one of the computers:
gunnar@gunnar-hp6910p2:~/testMLC$ ./mlds-test-no-sse4.appimage -m 2
Detected Intel Family 6 processor, switching OpenBLAS to generic
Re-exec()-ing to set number of threads correctly...
Machine Learning Dataset Generator v9.55 (Linux/x86_64) (libTorch: release/1.6)
[2020-08-30 14:05:11 main:399] : INFO : Set logging level to 1
[2020-08-30 14:05:11 main:407] : INFO : Running in BOINC Standalone mode
[2020-08-30 14:05:11 main:412] : INFO : Resolving all filenames
[2020-08-30 14:05:11 main:420] : INFO : Resolved: dataset.hdf5 => dataset.hdf5 (exists = 1)
[2020-08-30 14:05:11 main:420] : INFO : Resolved: model.cfg => model.cfg (exists = 0)
[2020-08-30 14:05:11 main:420] : INFO : Resolved: model-final.pt => model-final.pt (exists = 0)
[2020-08-30 14:05:11 main:420] : INFO : Resolved: model-input.pt => model-input.pt (exists = 0)
[2020-08-30 14:05:11 main:420] : INFO : Resolved: snapshot.pt => snapshot.pt (exists = 0)
[2020-08-30 14:05:11 main:434] : INFO : Dataset filename: dataset.hdf5
[2020-08-30 14:05:11 main:436] : INFO : Configuration:
[2020-08-30 14:05:11 main:437] : INFO : Validation Loss Threshold: 0.0001
[2020-08-30 14:05:11 main:438] : INFO : Max Epochs: 2
[2020-08-30 14:05:11 main:439] : INFO : Batch Size: 128
[2020-08-30 14:05:11 main:440] : INFO : Patience: 10
[2020-08-30 14:05:11 main:441] : INFO : Hidden Width: 12
[2020-08-30 14:05:11 main:442] : INFO : # Recurrent Layers: 4
[2020-08-30 14:05:11 main:443] : INFO : # Backend Layers: 4
[2020-08-30 14:05:11 main:445] : INFO : Preparing Dataset
[2020-08-30 14:05:11 load_hdf5_ds_into_tensor:28] : INFO : Loading Dataset /Xt from dataset.hdf5 into memory
[2020-08-30 14:05:12 load_hdf5_ds_into_tensor:28] : INFO : Loading Dataset /Yt from dataset.hdf5 into memory
[2020-08-30 14:05:13 load:103] : INFO : Successfully loaded dataset of 2048 examples into memory.
[2020-08-30 14:05:13 load_hdf5_ds_into_tensor:28] : INFO : Loading Dataset /Xv from dataset.hdf5 into memory
[2020-08-30 14:05:13 load_hdf5_ds_into_tensor:28] : INFO : Loading Dataset /Yv from dataset.hdf5 into memory
[2020-08-30 14:05:13 load:103] : INFO : Successfully loaded dataset of 512 examples into memory.
[2020-08-30 14:05:13 main:451] : INFO : Creating Model
[2020-08-30 14:05:13 main:456] : INFO : Preparing config file
[2020-08-30 14:05:13 main:468] : INFO : Creating new config file
[2020-08-30 14:05:13 main:499] : INFO : Loading DataLoader into Memory
[2020-08-30 14:05:13 main:502] : INFO : Starting Training
[2020-08-30 14:07:01 main:514] : INFO : Epoch 1 | loss: 0.037587 | val_loss: 0.0315968 | Time: 107342 ms
[2020-08-30 14:08:45 main:514] : INFO : Epoch 2 | loss: 0.0312114 | val_loss: 0.030173 | Time: 104527 ms
[2020-08-30 14:08:45 main:533] : INFO : Saving trained model to model-final.pt, val_loss 0.030173
[2020-08-30 14:08:45 main:538] : INFO : Saving end state to config to file
[2020-08-30 14:08:45 main:543] : INFO : Success, exiting..
gunnar@gunnar-hp6910p2:~/testMLC$ rm -f model.cfg model-final.pt snapshot.pt
gunnar@gunnar-hp6910p2:~/testMLC$ ./mlds-test-no-sse4-mkldnn.appimage -m 2
Illegal instruction (core dumped)
gunnar@gunnar-hp6910p2:~/testMLC$ |
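The "Illegal instruction" crash here and the "process got signal 4" message reported by BOINC are the same event: signal 4 is SIGILL on Linux, and a shell reports a process killed by signal N as exit status 128 + N. A small helper to decode that status might look like this sketch (the function name `describe_status` is made up for illustration, and a program can in principle also exit with a code above 128 on its own).

```shell
#!/bin/sh
# Sketch: translate a shell exit status into a signal number, if there was one.
# Shells report "killed by signal N" as status 128 + N, so SIGILL (4) shows up as 132.

# describe_status STATUS -> prints a one-line description
describe_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    elif [ "$status" -eq 0 ]; then
        echo "exited normally"
    else
        echo "exited with code $status"
    fi
}
```

It could be used right after a test run, for example `./mlds-test-no-sse4-mkldnn.appimage -m 2; describe_status $?`.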
|
Send message Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
Thanks for the thorough tests; I'll spin a new release on Monday. Note that I haven't decided yet whether this will still require the newer libc (these test programs do, but..). I also apologize for besmirching the OpenBLAS project; it turns out they were doing everything right. The ultimate cause turned out to be the Intel MKL-DNN library, which is precisely the type of shenanigans I want to avoid by not relying on Intel-produced libraries. |
|
Send message Joined: 3 Jul 20 Posts: 13 Credit: 13,421,453 RAC: 0 |
Mine was run on a dual Opteron 2431 (hex-core) system running Ubuntu 16.04, apparently with success. |
|
Send message Joined: 30 Jun 20 Posts: 462 Credit: 21,406,548 RAC: 0 |
The newest x86_64 client (9.57) should also work on non-SSE4 systems again. |
|
Send message Joined: 12 Aug 20 Posts: 21 Credit: 53,001,945 RAC: 0 |
Hi! YES!!! Now it works on all my machines, even those with the old glibc 2.19! :-) I've put all my old cans back onto MLC again - let's do some crunching! Thank you, and have a really nice week! //Gunnar |
©2022 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)