1)
Questions and Answers :
Issue Discussion :
WU Runtime for CPU WUs
(Message 1316)
Posted 11 Aug 2021 by entity Post: The machine has 256GB. What I have noticed is that as I reduce the number of concurrently running work units, they start behaving a little better. At 100 concurrent WUs, the runtimes gradually increase to over 4 hours per unit, and 1 to 4 of them act strangely as described above. Dropping the concurrent work to 75 causes the units to all finish under 2 hours with no strange anomalies. I increased the concurrent work to 85 and the runtimes increased to right around 2 hours (maybe a little more for some), still with no anomalies. No swapping is happening at all. This looks more like a cache issue than external memory. Each socket has 4 memory channels and 8 DIMM slots, all populated equally.
2)
Questions and Answers :
Issue Discussion :
WU Runtime for CPU WUs
(Message 1309)
Posted 9 Aug 2021 by entity Post: UPDATE: Rebooted the machine to clear any possible cache pollution issue and to sync any possible mismatch between firmware and OS kernel. After the reboot, almost all have returned to the normal 2 hour execution time. I have 2 that are acting strangely (not showing any CPU time in BoincTasks but progressing at .001% per second); they have been running about 7 hours. Is anyone else running more than 64 simultaneously and seeing WUs behaving abnormally? This is app 9.61 running DS1 Parity Machine work. I guess with the new app around the corner, it might not be worth trying to diagnose this.
3)
Questions and Answers :
Issue Discussion :
WU Runtime for CPU WUs
(Message 1307)
Posted 8 Aug 2021 by entity Post: I currently have a few (about four or five) CPU WUs that have been running for at least 12 hours. Is this normal? Most units finish within about 2 hours. These WUs are progressing and will probably finish, and nothing looks abnormal about them. Others on the same machine are finishing in about the normal 2 hours.
4)
Questions and Answers :
Issue Discussion :
Validation errors.
(Message 213)
Posted 22 Jul 2020 by entity Post: I've been crunching for 3 days and I have 8 invalids. They are spread across all machines (Intel and AMD).
5)
Questions and Answers :
Issue Discussion :
CentOS 8.2 Error
(Message 196)
Posted 20 Jul 2020 by entity Post: That was it. I have about 10 WUs running now. Will see how they do before installing fuse on the remaining machines.
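The fix referenced here was installing the fuse package, whose absence caused the "failed to exec fusermount" error in the post below. A minimal sketch of how that check-and-install might look on CentOS 8 (assuming the stock dnf package manager and root access; the exact package name is the standard CentOS one, not confirmed by the post):

```shell
#!/bin/sh
# Check whether fusermount (provided by the fuse package) is on the PATH.
# Its absence produces: "fuse: failed to exec fusermount: No such file or directory"
if ! command -v fusermount >/dev/null 2>&1; then
    echo "fusermount not found; installing fuse (assumed package name on CentOS 8)"
    sudo dnf install -y fuse
else
    echo "fusermount is present"
fi
```

After installing on one machine and confirming WUs run cleanly, the same check can be repeated on the remaining hosts, as the post describes.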
6)
Questions and Answers :
Issue Discussion :
CentOS 8.2 Error
(Message 189)
Posted 20 Jul 2020 by entity Post: Just joined the project and downloaded one WU as a test. Unfortunately, it ended immediately with the following error:

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 127 (0x7f, -129)</message>
<stderr_txt>
fuse: failed to exec fusermount: No such file or directory
open dir error: No such file or directory
</stderr_txt>
]]>

Is this an error with the individual workunit or is this a general error with CentOS?

Additional Information:
Jul 20 09:18:25 alpha boinc[2087]: 20-Jul-2020 09:18:25 [MLC@Home] Starting task ParityMachine-1593738798-3305-20_1
Jul 20 09:18:25 alpha kernel: fuse: init (API version 7.31)
Jul 20 09:18:25 alpha kernel: mlds_0.920_x86_[683432]: segfault at 31f ip 00007fa7b4d89a95 sp 00007ffcd8e64da0 error 4 in ld-2.28.so[7fa7b4d74000+29000]
Jul 20 09:18:25 alpha kernel: Code: 00 00 48 8d 3d 26 c6 00 00 e8 17 51 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 53 48 89 fb 48 8d 3d 79 3e 21 00 ff 15 7b 44 21 00 <80> bb 1f 03 00 00 00 75 14 8b 83 18 03 00 00 85 c0 74 18 31 f6 48
Jul 20 09:18:25 alpha systemd[1]: Mounting FUSE Control File System...
Jul 20 09:18:25 alpha systemd[1]: Mounted FUSE Control File System.
Jul 20 09:18:25 alpha abrt-hook-ccpp[683438]: Process 683432 (mlds_0.920_x86_64-pc-linux-gnu) of user 985 killed by SIGSEGV - dumping core
©2023 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)