Posts by entity

1) Questions and Answers : Issue Discussion : WU Runtime for CPU WUs (Message 1316)
Posted 11 Aug 2021 by entity
Post:
The machine has 256GB. What I have noticed is that as I reduce the number of concurrently running work units, they start behaving a little better. At 100 concurrent WUs, the runtimes gradually increase to over 4 hours per unit and 1 to 4 of them really act strangely, as described above. Dropping the concurrent work to 75 causes all the units to finish in under 2 hours with no anomalies. Increasing it to 85 brings the runtimes to right around 2 hours (maybe a little more for some), still with no anomalies. No swapping is happening at all. This looks more like a cache issue than a main-memory issue. Each socket has 4 memory channels and 8 DIMM slots, all populated equally.
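In case it helps anyone who wants to cap concurrent tasks the same way, here is a rough sketch of an app_config.xml that could go in the MLC@Home project folder under the BOINC data directory (the 75 is just the value that happened to work on this box, not anything official from the project):

<app_config>
    <!-- cap the total number of MLC@Home tasks running at once; 75 worked well here -->
    <project_max_concurrent>75</project_max_concurrent>
</app_config>

After saving it, have the client re-read config files (Options > Read config files in the Manager, or boinccmd --read_cc_config) for the limit to take effect.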
2) Questions and Answers : Issue Discussion : WU Runtime for CPU WUs (Message 1309)
Posted 9 Aug 2021 by entity
Post:
UPDATE: Rebooted the machine to clear any possible cache pollution issue and to sync any possible mismatch between firmware and OS kernel. After the reboot, almost all units have returned to the normal 2-hour execution time. Two are still acting strangely (not showing any CPU time in BoincTasks but progressing at 0.001% per second); they have been running for about 7 hours.

Anyone else running more than 64 WUs simultaneously and seeing them behave abnormally? This is app 9.61 running DS1 Parity Machine work. I guess with the new app around the corner, it might not be worth trying to diagnose this.
3) Questions and Answers : Issue Discussion : WU Runtime for CPU WUs (Message 1307)
Posted 8 Aug 2021 by entity
Post:
I currently have a few (about four or five) CPU WUs that have been running for at least 12 hours. Is this normal? Most units finish within about 2 hours. These WUs are progressing, will probably finish, and nothing looks abnormal about them. Other units on the same machine are finishing in the normal 2 hours or so.
4) Questions and Answers : Issue Discussion : Validation errors. (Message 213)
Posted 22 Jul 2020 by entity
Post:
I've been crunching for 3 days and I have 8 invalids. They are spread across all machines (Intel and AMD).
5) Questions and Answers : Issue Discussion : CentOS 8.2 Error (Message 196)
Posted 20 Jul 2020 by entity
Post:
That was it. I have about 10 WUs running now. Will see how they do before installing fuse on the remaining machines.
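For anyone else hitting the same fusermount error on CentOS 8, the fix here was simply installing the FUSE userspace package and restarting the client. Roughly (assuming a stock dnf-based setup, and that the service is named boinc-client as in the EPEL package):

# fuse provides /usr/bin/fusermount, which the task wrapper was failing to exec
sudo dnf install fuse
# restart the client so newly started tasks pick it up
sudo systemctl restart boinc-client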
6) Questions and Answers : Issue Discussion : CentOS 8.2 Error (Message 189)
Posted 20 Jul 2020 by entity
Post:
Just joined the project and downloaded one WU as a test. Unfortunately, it ended immediately with the following error:

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 127 (0x7f, -129)</message>
<stderr_txt>
fuse: failed to exec fusermount: No such file or directory
open dir error: No such file or directory
</stderr_txt>
]]>

Is this an error with the individual workunit or is this a general error with CentOS?
Additional Information:

Jul 20 09:18:25 alpha boinc[2087]: 20-Jul-2020 09:18:25 [MLC@Home] Starting task ParityMachine-1593738798-3305-20_1
Jul 20 09:18:25 alpha kernel: fuse: init (API version 7.31)
Jul 20 09:18:25 alpha kernel: mlds_0.920_x86_[683432]: segfault at 31f ip 00007fa7b4d89a95 sp 00007ffcd8e64da0 error 4 in ld-2.28.so[7fa7b4d74000+29000]
Jul 20 09:18:25 alpha kernel: Code: 00 00 48 8d 3d 26 c6 00 00 e8 17 51 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 53 48 89 fb 48 8d 3d 79 3e 21 00 ff 15 7b 44 21 00 <80> bb 1f 03 00 00 00 75 14 8b 83 18 03 00 00 85 c0 74 18 31 f6 48
Jul 20 09:18:25 alpha systemd[1]: Mounting FUSE Control File System...
Jul 20 09:18:25 alpha systemd[1]: Mounted FUSE Control File System.
Jul 20 09:18:25 alpha abrt-hook-ccpp[683438]: Process 683432 (mlds_0.920_x86_64-pc-linux-gnu) of user 985 killed by SIGSEGV - dumping core



