Invalid tasks

Questions and Answers : Issue Discussion : Invalid tasks
Meloentje

Joined: 27 Dec 20
Posts: 1
Credit: 920,882
RAC: 2
Message 1406 - Posted: 26 Oct 2021, 9:16:28 UTC

I get invalid tasks.
The host recalculating the tasks gets an invalid result as well.
the0retical

Joined: 3 Aug 21
Posts: 3
Credit: 109,720
RAC: 0
Message 1407 - Posted: 26 Oct 2021, 10:12:15 UTC - in response to Message 1406.  

Thanks for letting us know. I've made some changes on the server, and the issue should be resolved now.

Please let us know if it continues.
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 1408 - Posted: 26 Oct 2021, 13:50:17 UTC
Last modified: 26 Oct 2021, 13:51:45 UTC

Yes, there was an issue (my fault!) on the server side, now fixed (thanks to the0retical).

If you're curious, the issue was that as part of the validation process we compute the loss of the trained network against a set of "test" data. Due to the somewhat awkward process of BOINC validation, we need this loss number in multiple places. Originally we re-computed this number up to three times as part of the validation process, but this is a CPU-intensive computation, so as the project grew I added some code to compute the value once and store it in an on-disk cache. When we need it later, we just read it from the cache instead of recomputing it: a win-win. We store this cache in a directory under `/tmp`.

I was very careful to make sure that if we try to read a value from the cache and it isn't there (which can happen for many reasons), it isn't a fatal error; we just recompute it. And since the server is very reliable, the fact that we're still using `/tmp`, which gets wiped on each reboot, wasn't an issue. It had been working great for many months.
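For anyone curious, the compute-once, cache-on-disk pattern with a graceful read fallback looks roughly like this. This is a minimal sketch in Python; the names, paths, and pickle-based storage format are my own illustration, not MLC@Home's actual validator code:

```python
import os
import pickle
import tempfile

# Hypothetical cache location; the project keeps its cache under /tmp.
CACHE_DIR = os.path.join(tempfile.gettempdir(), "loss_cache")

def compute_loss(workunit_id):
    """Stand-in for the CPU-intensive loss computation."""
    return sum(ord(c) for c in workunit_id) / 1000.0

def cached_loss(workunit_id):
    """Return the loss for a workunit, recomputing on any cache miss."""
    path = os.path.join(CACHE_DIR, workunit_id + ".pkl")
    try:
        with open(path, "rb") as f:
            return pickle.load(f)          # cache hit
    except (OSError, pickle.PickleError):
        pass                               # miss or unreadable entry: not fatal
    value = compute_loss(workunit_id)      # recompute on a miss
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "wb") as f:            # note: this write can still fail
        pickle.dump(value, f)              # (permissions, full disk, ...)
    return value
```

The key design point is that the `except` around the read makes a missing entry cost only a recompute, never an error.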

Well, yesterday the server hiccupped, needing a reboot, and the cache directory got deleted. While reading a missing entry from the cache wasn't a fatal error, apparently I never tested what happened if adding a new entry to the cache failed. So the validation process started up, computed the value, and tried to add it to the cache; since the cache directory wasn't there, the write failed and threw an exception, which BOINC interprets as a validation failure. This was the case for about 10 hours yesterday, and a huge miss on our part.

The first fix is to just recreate the cache directory that got deleted on the reboot; that gets everything going again. We can also move the cache to a more permanent directory (a cron job periodically cleans out old entries, so growth isn't an issue). The real fix is that failure to add an entry to the cache should never cause an exception. We're going to do all three of those things to make sure this doesn't happen again. And shame on me for failing to unit test the cache properly.

Furthermore, while I get alerts whenever catastrophic problems happen with the server (assuming it's not a hardware hang, like yesterday, where it can't send me an email), this failure was buried in the logs and didn't bubble up to that level, and I never would have noticed unless I'd been actively checking the logs. I'm looking for a more robust log-monitoring solution that will get me alerts more quickly when problems like this occur.
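One lightweight way to surface errors like this is to scan the logs for known bad patterns and alert on matches. A toy sketch, assuming made-up log lines and patterns (a production setup would use a proper log-monitoring tool instead):

```python
import re

# Illustrative patterns worth alerting on; tune these for your own logs.
ALERT_PATTERNS = [
    re.compile(r"Traceback \(most recent call last\)"),
    re.compile(r"\b(unhandled )?exception\b", re.IGNORECASE),
    re.compile(r"validat\w* .*fail", re.IGNORECASE),
]

def lines_to_alert_on(log_lines):
    """Return the log lines matching any alert pattern."""
    return [line for line in log_lines
            if any(p.search(line) for p in ALERT_PATTERNS)]

# Made-up sample entries for illustration.
sample = [
    "2021-10-26 03:12:01 validator: wu 12345 checked, status ok",
    "2021-10-26 03:12:02 validator: unhandled exception writing cache entry",
]
```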
Drago75

Joined: 30 Jan 21
Posts: 4
Credit: 1,526,031
RAC: 0
Message 1459 - Posted: 3 Feb 2022, 13:53:10 UTC
Last modified: 3 Feb 2022, 13:54:45 UTC

My two Nvidia cards produce only invalid results today. Yesterday they worked fine. Is there an issue with the work units, or is the problem somehow related to my hardware?
Olivier Chassé St-Laurent

Joined: 9 Dec 21
Posts: 3
Credit: 6,618,560
RAC: 0
Message 1461 - Posted: 3 Feb 2022, 14:34:28 UTC - in response to Message 1459.  

I'm seeing the same problem, so I doubt it's your hardware.

My last valid result was on 3 Feb 2022, 2:37:08 UTC; nothing but invalid results since then, even though the results themselves look fine (not a NaN value in sight).

For example, https://www.mlcathome.org/mlcathome/workunit.php?wuid=8143263, where both results seem valid but have been rejected.
Drago75

Joined: 30 Jan 21
Posts: 4
Credit: 1,526,031
RAC: 0
Message 1462 - Posted: 3 Feb 2022, 14:49:37 UTC - in response to Message 1461.  

That's interesting. A few days ago I started supporting this project again, and I didn't get a single invalid. Today, all 7 are invalid! I didn't change any settings, and the two GPUs are not overclocked. Strange...
Number Cruncher

Joined: 21 Dec 21
Posts: 3
Credit: 5,512,604
RAC: 5
Message 1463 - Posted: 3 Feb 2022, 18:41:26 UTC

Hello! All of my WUs seem to be invalid as of today (ongoing).
Sirisori

Joined: 11 Nov 20
Posts: 1
Credit: 5,883,831
RAC: 2
Message 1464 - Posted: 3 Feb 2022, 21:51:11 UTC - in response to Message 1463.  
Last modified: 3 Feb 2022, 21:52:48 UTC

Same here: all the tasks (35) that finished today were marked invalid.
benhines

Joined: 6 Dec 20
Posts: 4
Credit: 39,909,570
RAC: 22
Message 1465 - Posted: 3 Feb 2022, 22:23:01 UTC
Last modified: 3 Feb 2022, 22:24:10 UTC

Me too! All results returned since around 03:00Z this morning have been marked invalid.
Olivier Chassé St-Laurent

Joined: 9 Dec 21
Posts: 3
Credit: 6,618,560
RAC: 0
Message 1466 - Posted: 4 Feb 2022, 14:49:48 UTC

Whatever the issue was, it seems to have disappeared/been fixed: https://www.mlcathome.org/mlcathome/workunit.php?wuid=8121884.
benhines

Joined: 6 Dec 20
Posts: 4
Credit: 39,909,570
RAC: 22
Message 1468 - Posted: 5 Feb 2022, 9:20:15 UTC

Looks like it's back to normal for me too.
Magiceye04

Joined: 14 Feb 21
Posts: 2
Credit: 840,320
RAC: 0
Message 1470 - Posted: 8 Feb 2022, 8:12:06 UTC

Now there is another issue: the tasks do not even start anymore.

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)</message>
<stderr_txt>
DEBUG: Args: ../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200 -c --maxepoch 2048
nthreads: 1 gpudev: 0
Re-exec()-ing to set environment correctly
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: forward compatibility was attempted on non supported HW
Exception raised from current_device at /home/mlcbuild/git/pytorch-build/build-cuda/pytorch-prefix/src/pytorch/c10/cuda/CUDAFunctions.h:40 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f6f708ce99b in ./libc10.so)
frame #1: at::cuda::getCurrentDeviceProperties() + 0x167 (0x7f6ef2f9f3b7 in ./libtorch_cuda.so)
frame #2: <unknown function> + 0x88018 (0x559489c91018 in ../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200)
frame #3: __libc_start_main + 0xf3 (0x7f6ef21600b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x8675a (0x559489c8f75a in ../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200)

SIGABRT: abort called
Stack trace (12 frames):
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(+0x37df9c)[0x559489f86f9c]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f6f708593c0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f6ef217f18b]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f6ef215e859]
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x135)[0x55948a0387f5]
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(+0x398846)[0x559489fa1846]
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(+0x398891)[0x559489fa1891]
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(+0x3968c4)[0x559489f9f8c4]
./libtorch_cuda.so(_ZN2at4cuda26getCurrentDevicePropertiesEv+0x1bd)[0x7f6ef2f9f40d]
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(+0x88018)[0x559489c91018]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f6ef21600b3]
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(+0x8675a)[0x559489c8f75a]

Exiting...

</stderr_txt>
]]>
Olivier Chassé St-Laurent

Joined: 9 Dec 21
Posts: 3
Credit: 6,618,560
RAC: 0
Message 1472 - Posted: 8 Feb 2022, 13:19:23 UTC - in response to Message 1470.  

I had a similar issue (affected other projects as well).

It seems to have coincided with an upgrade to the Nvidia drivers; you can check /var/log/dpkg.log. In my case, simply rebooting fixed the problem.
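If it helps anyone script that check, here's one way to pick driver upgrades out of the log. The sample lines below are made up for illustration (they follow the `date time action package:arch old-version new-version` format dpkg uses); on a real machine you'd read `/var/log/dpkg.log` itself:

```python
# Made-up sample entries in the format /var/log/dpkg.log uses.
sample_log = """\
2022-02-07 06:25:11 upgrade nvidia-driver-470:amd64 470.86-0ubuntu1 470.103.01-0ubuntu1
2022-02-07 06:25:30 upgrade libc6:amd64 2.31-0ubuntu9.2 2.31-0ubuntu9.3
"""

def nvidia_upgrades(dpkg_log_text):
    """Pick out dpkg 'upgrade' entries that touch NVIDIA packages."""
    return [line for line in dpkg_log_text.splitlines()
            if " upgrade " in line and "nvidia" in line.lower()]
```

If this returns anything dated just before the failures started, a reboot (so the kernel module matches the new userspace driver) is usually the fix.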


©2022 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)