Invalid tasks
**Joined: 27 Dec 20 · Posts: 1 · Credit: 920,882 · RAC: 2**

I'm getting invalid tasks. The host that re-runs the same task gets an invalid result as well.

**Joined: 3 Aug 21 · Posts: 3 · Credit: 109,720 · RAC: 0**

Thanks for letting us know. I've fixed some things on the server, and the issue should be resolved now. Please let us know if it continues.

**Joined: 30 Jun 20 · Posts: 462 · Credit: 21,406,548 · RAC: 0**

Yes, there was an issue (my fault!) on the server side, now fixed (thanks to theOretical).

If you're curious: as part of the validation process we compute the loss of the trained network against a set of "test" data. Because of the somewhat awkward way BOINC validation works, we need this loss number in multiple places. Originally we re-computed it up to three times during validation, but that is CPU intensive, so as the project grew I added some code to compute the value once and store it in an on-disk cache. When we need it later, we just read it from the cache instead of re-computing it. Win-win. We store this cache in a directory under `/tmp`. I was very careful to make sure that if we try to read a value from the cache and it isn't there (which can happen for many reasons), it isn't a fatal error; we just recompute it. And since the server is very reliable, the fact that we were still using `/tmp`, which gets wiped on each reboot, wasn't an issue. It's been working great for many months.

Well, yesterday the server hiccupped and needed a reboot, and the cache directory got deleted. While a failed read from the cache wasn't a fatal error, apparently I never tested what happens when adding a new entry to the cache fails. So the validation process started, computed the value, and tried to add it to the cache; since the cache directory wasn't there, the write failed and threw an exception, which BOINC interprets as a validation failure. This went on for about 10 hours yesterday, and it was a huge miss on our part.

The first fix is simply to recreate the cache directory that was deleted on reboot, which gets everything going again. We can also move the cache to a more permanent directory (a cron job periodically cleans out old entries, so that's not an issue). The real fix is that a failure to add an entry to the cache should never raise an exception. We're going to do all three of those things to make sure this doesn't happen again. And shame on me for failing to unit test the cache properly.

Furthermore, while I get alerts whenever catastrophic problems happen with the server (assuming it's not a hardware hang, like yesterday, where it can't send me an email), this failure was buried in the logs and didn't bubble up to that level; I never would have noticed it unless I'd been actively checking the logs. I'm looking for a more robust log-monitoring solution that will alert me more quickly when problems like this occur.

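For illustration, here is a minimal C++ sketch of the kind of failure-tolerant loss cache described above. The directory path, file format, and function names are hypothetical, not the project's actual validator code; the point is simply that both a cache miss and a failed cache write are treated as non-fatal, so validation falls back to recomputing the loss instead of throwing.

```cpp
// Hypothetical failure-tolerant on-disk cache for the validation loss.
// Neither a missing entry nor a failed write is allowed to abort validation.
#include <cstdio>
#include <optional>
#include <string>

// Assumed permanent location (the post suggests moving the cache out of /tmp).
static const std::string kCacheDir = "/var/cache/mlds-validator";

// Try to read a previously computed loss; a miss or I/O error just means
// the caller recomputes the value.
std::optional<double> read_cached_loss(const std::string& wu_name) {
    std::FILE* f = std::fopen((kCacheDir + "/" + wu_name).c_str(), "r");
    if (!f) return std::nullopt;                  // cache miss: not an error
    double loss = 0.0;
    const bool ok = (std::fscanf(f, "%lf", &loss) == 1);
    std::fclose(f);
    if (!ok) return std::nullopt;                 // corrupt entry: recompute
    return loss;
}

// Best-effort write: if the directory vanished (e.g. after a reboot) or the
// write fails for any other reason, log it and move on -- never throw.
void write_cached_loss(const std::string& wu_name, double loss) {
    std::FILE* f = std::fopen((kCacheDir + "/" + wu_name).c_str(), "w");
    if (!f) {
        std::fprintf(stderr, "loss cache write skipped for %s\n", wu_name.c_str());
        return;
    }
    std::fprintf(f, "%.17g\n", loss);
    std::fclose(f);
}
```

With a best-effort write like this, a missing cache directory only costs some extra CPU time for recomputation rather than a wave of wrongly invalidated results.
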
**Joined: 30 Jan 21 · Posts: 4 · Credit: 1,526,031 · RAC: 0**

My two Nvidia cards are producing only invalid results today. Yesterday they worked fine. Is there an issue with the work units, or is the problem somehow related to my hardware?

**Joined: 9 Dec 21 · Posts: 3 · Credit: 6,618,560 · RAC: 0**

I'm seeing the same problem, so I doubt it's your hardware. My last valid result was on 3 Feb 2022, 2:37:08 UTC; everything since then has been marked invalid even though the results look fine (not a NaN value in sight). For example, https://www.mlcathome.org/mlcathome/workunit.php?wuid=8143263, where both results appear valid but have been rejected.

**Joined: 30 Jan 21 · Posts: 4 · Credit: 1,526,031 · RAC: 0**

That's interesting. A few days ago I started supporting this project again and didn't get a single invalid. Today all 7 are invalid! I didn't change any settings, and the two GPUs are not overclocked. Strange...

**Joined: 21 Dec 21 · Posts: 3 · Credit: 5,512,604 · RAC: 5**

Hello! All of my WUs seem to be invalid as of today (ongoing).

**Joined: 11 Nov 20 · Posts: 1 · Credit: 5,883,831 · RAC: 2**

Same here: all 35 tasks that finished today were marked invalid.

**Joined: 6 Dec 20 · Posts: 4 · Credit: 39,909,570 · RAC: 22**

Me too! All results returned since around 03:00Z this morning have been marked invalid.

**Joined: 9 Dec 21 · Posts: 3 · Credit: 6,618,560 · RAC: 0**

Whatever the issue was, it seems to have disappeared/been fixed: https://www.mlcathome.org/mlcathome/workunit.php?wuid=8121884.

**Joined: 6 Dec 20 · Posts: 4 · Credit: 39,909,570 · RAC: 22**

Looks like it's back to normal for me too.

**Joined: 14 Feb 21 · Posts: 2 · Credit: 840,320 · RAC: 0**

Now there is another issue. The tasks do not even start any more.

```
<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)</message>
<stderr_txt>
DEBUG: Args: ../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200 -c --maxepoch 2048
nthreads: 1
gpudev: 0
Re-exec()-ing to set environment correctly
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: forward compatibility was attempted on non supported HW
Exception raised from current_device at /home/mlcbuild/git/pytorch-build/build-cuda/pytorch-prefix/src/pytorch/c10/cuda/CUDAFunctions.h:40 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f6f708ce99b in ./libc10.so)
frame #1: at::cuda::getCurrentDeviceProperties() + 0x167 (0x7f6ef2f9f3b7 in ./libtorch_cuda.so)
frame #2: <unknown function> + 0x88018 (0x559489c91018 in ../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200)
frame #3: __libc_start_main + 0xf3 (0x7f6ef21600b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x8675a (0x559489c8f75a in ../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200)

SIGABRT: abort called
Stack trace (12 frames):
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(+0x37df9c)[0x559489f86f9c]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f6f708593c0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f6ef217f18b]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f6ef215e859]
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x135)[0x55948a0387f5]
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(+0x398846)[0x559489fa1846]
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(+0x398891)[0x559489fa1891]
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(+0x3968c4)[0x559489f9f8c4]
./libtorch_cuda.so(_ZN2at4cuda26getCurrentDevicePropertiesEv+0x1bd)[0x7f6ef2f9f40d]
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(+0x88018)[0x559489c91018]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f6ef21600b3]
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(+0x8675a)[0x559489c8f75a]
Exiting...
</stderr_txt>
]]>
```

**Joined: 9 Dec 21 · Posts: 3 · Credit: 6,618,560 · RAC: 0**

I had a similar issue (it affected other projects as well). It seems to have coincided with an upgrade to the Nvidia drivers; you can check `/var/log/dpkg.log`. In my case, simply rebooting fixed the problem.

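As a side note, a tiny diagnostic like the sketch below (not part of the MLC@Home client, just illustrative C++ against the CUDA runtime API) can help confirm this situation: after a driver package upgrade, the user-space CUDA libraries can be newer than the kernel module that is still loaded, so device enumeration fails until the machine is rebooted.

```cpp
// Illustrative check, built with nvcc or linked against libcudart. If device
// enumeration fails right after a driver upgrade, the user-space libraries
// and the loaded kernel module are likely out of sync; a reboot (which loads
// the new kernel module) usually resolves it.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int driver = 0, runtime = 0, count = 0;
    cudaDriverGetVersion(&driver);    // API version supported by the installed driver
    cudaRuntimeGetVersion(&runtime);  // API version of the CUDA runtime in use
    std::printf("driver API: %d, runtime API: %d\n", driver, runtime);

    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::printf("device enumeration failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("%d CUDA device(s) visible\n", count);
    return 0;
}
```
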
©2022 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)