Posts by Luigi R.

1) Questions and Answers : Issue Discussion : Setup to run 2 wu's on one GPU (Message 1019)
Posted 5 Jan 2021 by Luigi R.
Post:
I forgot that the MLC GPU app uses a full CPU thread.

Try this app_config.xml too.
<app_config>
	<app>
		<name>mlds-gpu</name>
		<gpu_versions>
			<gpu_usage>0.50</gpu_usage>
			<cpu_usage>1.00</cpu_usage>
		</gpu_versions>		
	</app>
</app_config>
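As a rough illustration of what these two settings imply, here is a minimal Python sketch. It assumes the usual BOINC reading of app_config.xml: gpu_usage is the fraction of a GPU each task claims, and cpu_usage is the number of CPU threads reserved per task. The function name tasks_and_cpu_budget is made up for this example.

```python
import math
import xml.etree.ElementTree as ET

# Same settings as the app_config.xml above.
APP_CONFIG = """
<app_config>
    <app>
        <name>mlds-gpu</name>
        <gpu_versions>
            <gpu_usage>0.50</gpu_usage>
            <cpu_usage>1.00</cpu_usage>
        </gpu_versions>
    </app>
</app_config>
"""


def tasks_and_cpu_budget(xml_text, app_name):
    """Return (tasks per GPU, CPU threads reserved by those tasks)."""
    root = ET.fromstring(xml_text)
    for app in root.findall("app"):
        if app.findtext("name") == app_name:
            gpu_usage = float(app.findtext("gpu_versions/gpu_usage"))
            cpu_usage = float(app.findtext("gpu_versions/cpu_usage"))
            # A task claiming 0.5 of a GPU leaves room for one more task.
            tasks_per_gpu = math.floor(1.0 / gpu_usage + 1e-9)
            return tasks_per_gpu, tasks_per_gpu * cpu_usage
    raise KeyError(app_name)


tasks, cpus = tasks_and_cpu_budget(APP_CONFIG, "mlds-gpu")
print(f"{tasks} tasks per GPU, {cpus:g} CPU threads reserved")
```

With cpu_usage at 1.00, two concurrent GPU tasks reserve two CPU threads; with 0.50 they would reserve only one between them.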
2) Questions and Answers : Issue Discussion : Setup to run 2 wu's on one GPU (Message 1016)
Posted 5 Jan 2021 by Luigi R.
Post:
https://boinc.berkeley.edu/trac/wiki/ClientAppConfig
<app_config>
	<app>
		<name>mlds-gpu</name>
		<gpu_versions>
			<gpu_usage>0.50</gpu_usage>
			<cpu_usage>0.50</cpu_usage>
		</gpu_versions>		
	</app>
</app_config>
3) Questions and Answers : Windows : 6 invalid wu's (Message 1014)
Posted 2 Jan 2021 by Luigi R.
Post:
As a follow-up here, I have noticed some evidence in the log that NaNs may be causing a small subset of WUs to be marked as invalid. Not explicitly, but because some code that translates the results to JSON causes the JSON parser to fail to parse the file (it expects a number, not NaNs), which in turn causes a failure in another part of the validation process. I'll work on fixing that over the next few days.
Can this line help?
terminate called after throwing an instance of 'nlohmann::detail::type_error'
  what():  [json.exception.type_error.302] type must be number, but is null
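One plausible reading of that error, not confirmed by the project, is that a NaN was written to the file as null and the reader then asked for a number. The Python sketch below only imitates that failure mode; the "loss" field name and both helper functions are hypothetical and are not the project's validator code.

```python
import json
import math


def dump_like_strict_json(record):
    """Serialize a dict, writing non-finite floats as null.

    Assumption: this mimics JSON writers that cannot represent NaN.
    """
    cleaned = {
        key: (None if isinstance(value, float) and not math.isfinite(value) else value)
        for key, value in record.items()
    }
    return json.dumps(cleaned)


def read_loss(text):
    """Read the hypothetical "loss" field and insist that it is a number."""
    value = json.loads(text)["loss"]
    if not isinstance(value, (int, float)):
        kind = "null" if value is None else type(value).__name__
        raise TypeError(f"type must be number, but is {kind}")
    return float(value)


good = dump_like_strict_json({"loss": 0.0368034})
bad = dump_like_strict_json({"loss": float("nan")})

print(read_loss(good))
try:
    read_loss(bad)
except TypeError as err:
    print("validation would fail:", err)
```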
4) Questions and Answers : Issue Discussion : Validate errors (Message 1013)
Posted 1 Jan 2021 by Luigi R.
Post:
Happy new year!

Is it a coincidence that my task turned into NaNs right after Epoch 2020?
[2020-12-31 15:42:49	                main:581]	:	INFO	:	Epoch 2020 | loss: 0.0368034 | val_loss: 0.0325741 | Time: 1645.75 ms
[2020-12-31 15:42:51	                main:581]	:	INFO	:	Epoch 2021 | loss: nan | val_loss: nan | Time: 1645.33 ms

I don't think so. :D


So, if I understood correctly, your magnificent explanation says that I don't destroy randomness when I suspend/resume a task.
My manual intervention does not introduce determinism because it's like accidentally changing the seed, and a linear combination of N different solutions belongs to the same space.

I'm going to decrease the interval in my script to 30 seconds to see if I can avoid even more validation errors.
5) Questions and Answers : Issue Discussion : Validate errors (Message 1008)
Posted 31 Dec 2020 by Luigi R.
Post:
@pianoman

This task was suspended by my script at 99.6% (some NaNs occurred), but I resumed it because it was close to 100%.
The task resumed from a checkpoint taken before the first NaN, and no NaNs appeared after that. So it completed successfully. How do you explain that?
6) Questions and Answers : Issue Discussion : Validate errors (Message 995)
Posted 29 Dec 2020 by Luigi R.
Post:
So maybe it's a good idea to wait for a clear statement from the program developers. It makes no sense to kill wu's that are useful and would validate.

I have found validate errors without NaNs, but I have not found a valid task with NaNs yet.
My three invalid results contain NaNs as well.
7) Questions and Answers : Issue Discussion : Validate errors (Message 994)
Posted 29 Dec 2020 by Luigi R.
Post:
Sure, you don't need my permission to post a script like that, go right ahead.
Well, I have to ask because it could encourage suspending/aborting tasks. Thanks anyway. :)

So... this script is made for users (like me) who run GPU tasks one at a time.
I tested it on various customized stderr files, but I have not run into NaN/faulty tasks yet.
Yesterday I got 3 consecutive validate errors, today nothing.

https://pastebin.com/efiEDfRh
8) Questions and Answers : Issue Discussion : Validate errors (Message 990)
Posted 29 Dec 2020 by Luigi R.
Post:
Do you mean that they are cancelled during runtime, so as not to waste CPU/GPU time? Sounds good ...
There should be a way to do that from your script, Boinc Manager is remote controllable.

Yeah, we could automatically abort them with boinccmd too.

Are you sure that a NaN automatically signals a failed wu?

Not at all, but, if you go to top hosts and check valid tasks vs invalid ones, you observe that.
9) Questions and Answers : Issue Discussion : Validate errors (Message 987)
Posted 29 Dec 2020 by Luigi R.
Post:
Ok pianoman, I wrote a bash script to catch these NaN WUs.
The expected behaviour is to suspend faulty tasks so that I can manually abort them after a human check.
If you have no objection, I will post a pastebin with my code.
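The suspend step could look roughly like the Python sketch below. This is not the bash script mentioned above. It relies on the boinccmd form "--task URL task_name {suspend | resume | abort}"; the project URL and the task name are placeholders, and the helper names are made up for illustration. By default it only prints the command instead of running it.

```python
import subprocess

ALLOWED_OPS = ("suspend", "resume", "abort")


def boinccmd_task_command(project_url, task_name, op):
    """Build the argument list for a boinccmd task operation."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operation: {op}")
    return ["boinccmd", "--task", project_url, task_name, op]


def run_task_op(project_url, task_name, op, dry_run=True):
    """Print the command (dry run) or execute it against the local client."""
    cmd = boinccmd_task_command(project_url, task_name, op)
    if dry_run:
        print(" ".join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd


# Placeholder values: substitute the project URL and task name from your client.
run_task_op("https://www.mlcathome.org/mlcathome/", "example_task_name", "suspend")
```

Suspending rather than aborting keeps the human check in the loop: a suspended task can still be resumed if it turns out to be fine.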
10) Questions and Answers : Issue Discussion : Validate errors (Message 978)
Posted 28 Dec 2020 by Luigi R.
Post:
I noticed that many invalid tasks have a stderr.txt containing lines like this one.

[2020-12-28 18:48:49	                main:581]	:	INFO	:	Epoch 2 | loss: nan | val_loss: nan | Time: 1735.95 ms


If we see "nan" (not a number) at some point, the task will not validate, so we could abort it well before it completes.

Example: https://www.mlcathome.org/mlcathome/result.php?resultid=3453877
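
A check for such lines can be sketched in a few lines of Python. This is only an illustration based on the log format shown above; the function name first_nan_epoch and the regular expression are assumptions, and other log layouts would need a different pattern.

```python
import math
import re

# Matches the "Epoch N | loss: X | val_loss: Y" part of a stderr.txt line.
EPOCH_RE = re.compile(
    r"Epoch\s+(\d+)\s*\|\s*loss:\s*(\S+)\s*\|\s*val_loss:\s*(\S+)"
)


def first_nan_epoch(lines):
    """Return the first epoch whose loss or val_loss is NaN, or None."""
    for line in lines:
        match = EPOCH_RE.search(line)
        if not match:
            continue
        try:
            values = (float(match.group(2)), float(match.group(3)))
        except ValueError:
            continue  # not a numeric field, skip the line
        if any(math.isnan(v) for v in values):
            return int(match.group(1))
    return None


sample = [
    "[2020-12-31 15:42:49\t main:581]\t:\tINFO\t:\tEpoch 2020 | loss: 0.0368034 | val_loss: 0.0325741 | Time: 1645.75 ms",
    "[2020-12-31 15:42:51\t main:581]\t:\tINFO\t:\tEpoch 2021 | loss: nan | val_loss: nan | Time: 1645.33 ms",
]
print(first_nan_epoch(sample))
```

The function accepts any iterable of lines, so an open stderr.txt file object can be passed in directly.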




©2023 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)