Posts by bozz4science

41) Message boards : News : [TWIM Notes] Jan 5 2021 -- A 6 Month Retrospective (Message 1033)
Posted 8 Jan 2021 by bozz4science
Post:
Happy new year to you as well!

Great read! I've enjoyed the journey so far as a volunteer of this project, most of all the technical discussions on the forums about the science behind, and the differences between, the various experiments. I'm currently off exploring TensorFlow/Keras for the first time myself, trying my hand at my very first "big" data science project doing NN regressions. Hyperparameter tuning is such a laborious job, …. I'm not coding whole applications, but rather just working with the Keras API for R. Still, it's nice to see that the learning output in the stderr files of the WUs looks oddly familiar! I'll happily bring the GPUs back in the meantime after testing.
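Just to give an idea of what I mean by laborious, here is a hedged sketch of the kind of brute-force tuning loop I'm playing with (in Python/tf.keras rather than the R API; the data, grid values and layer sizes are all made up for illustration):

```python
# Minimal sketch of a brute-force hyperparameter search for an NN regression
# with tf.keras. Dataset, grid values and layer sizes are placeholders.
import itertools
import numpy as np
import tensorflow as tf

# toy regression data (stand-in for a real dataset)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 8)).astype("float32")
y = (X @ rng.normal(size=(8, 1)) + 0.1 * rng.normal(size=(1000, 1))).astype("float32")

grid = {"units": [32, 64], "lr": [1e-3, 1e-4]}
best = (None, np.inf)

for units, lr in itertools.product(grid["units"], grid["lr"]):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(units, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(lr), loss="mse")
    hist = model.fit(X, y, validation_split=0.2, epochs=20, verbose=0)
    val_loss = min(hist.history["val_loss"])
    print(f"units={units}, lr={lr}: val_loss={val_loss:.4f}")
    if val_loss < best[1]:
        best = ((units, lr), val_loss)

print("best config:", best)
```

Even a tiny 2x2 grid like this means retraining the network four times, which is exactly why the tuning feels so slow.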

Overall, great progress! Good luck with your paper and looking forward to the new runs.
42) Message boards : News : Badges! (Message 1006)
Posted 31 Dec 2020 by bozz4science
Post:
Wow. Big jump from 1M to 500M.

More like a quantum leap. Maybe GPU's will warp space. :)

Perhaps a few more badges in between 1M and 500M. Maybe every 10M or 20M ?
It's definitely gonna be tough and take much longer if you plan on reaching the 500M / 1B badge with CPUs only. You'll need a huge fleet. Eventually, GPUs will get you there much faster. In the end it's long-term commitment to the project that will get you one of those badges. Maybe a 250M badge would do the trick to keep the waiting time between 1M and 500M a bit shorter. My guess is that I'll reach the 500M badge by the end of 2027, assuming that credit assignment won't change considerably, I'll continue to see a ~6-10% invalid rate on GPU WUs, won't upgrade my current setup and will crunch MLC exclusively (on GPU) :) But surely there will be a GPU upgrade somewhere in the next 7 yrs.

500M - 2M = 498M credits remaining
8,320 credits/hr (GPU only / 1660 Super)
8.32k x 24hrs ~ 200k credits/day --> 95% valid --> ~ 190k credits/day
498M / 190k/day = 2621 days or 7.18 yrs to reach 500M
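For anyone who wants to plug in their own numbers, the back-of-the-envelope math above boils down to something like this (a small Python sketch with my rough figures as defaults):

```python
# Rough badge ETA calculator based on the estimate above; all defaults are
# my own rough figures and can be swapped for your host's numbers.
def days_to_badge(target=500e6, current=2e6, credits_per_hr=8320, valid_rate=0.95):
    per_day = credits_per_hr * 24 * valid_rate   # ~190k credits/day
    remaining = target - current                 # ~498M credits
    days = remaining / per_day
    return days, days / 365

days, years = days_to_badge()
print(f"{days:.0f} days, or about {years:.1f} years, to 500M")  # ~2,600 days, ~7.2 yrs
```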
43) Questions and Answers : Issue Discussion : Great credit (Message 1001)
Posted 30 Dec 2020 by bozz4science
Post:
Have those WUs actually been reverted to the original base credit, or has the initial overestimated credit been assigned after all? I'm not much of a credit hunter myself, but credits don't seem to have any relevance for comparison among users if millions of them were assigned for a single faulty WU, while others who have spent many times more CPU and GPU hours were simply overtaken within a few hours due to some misconfiguration. It definitely skewed the RAC and leaderboard calculations. What do credits actually measure or represent if they no longer accurately gauge one's computing effort? Just my two cents... Thanks for your quick fix though!

For ds4 rollout I'd like to see beta-testing of the new app first.
44) Message boards : Science : INT 8 support?? (Message 1000)
Posted 30 Dec 2020 by bozz4science
Post:
Definitely interesting to think about. Wouldn't the multitude of precisions that the RTX 2000/3000 series Tensor cores have to offer, such as TF32, FP16, INT8 and INT4, potentially be of interest for future app implementations? I don't know how feasible it would be, or whether it is at all, but Tensor cores certainly seem to target the very use case that MLC is interested in at the moment.

The CUDA-X AI ecosystem seems to offer a vast library with many modules targeted at specific AI/ML use cases such as DL training and DL inference. Or are we already exploiting the performance potential of these precision types with cuDNN and the other relevant libraries of the CUDA-X ecosystem (cuBLAS, cuFFT)?

And somehow I can't figure out what version of cuDNN MLC deployed on my system, but as far as I can tell, it seems to work with version 7, while the latest cuDNN release is version 8, released on 11/25/20. From what I read there, it seems to offer further potential to improve performance. NVIDIA cuDNN 8
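If anyone else wants to check what their local PyTorch build reports, a quick sketch is below. Note this only shows the cuDNN/CUDA versions your own torch install was built against, not necessarily what the MLC client bundles:

```python
# Quick check of the cuDNN/CUDA versions a local PyTorch build reports.
# This reflects *your* torch install, which may differ from the libraries
# shipped with the MLC application itself.
import torch

print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())   # e.g. 7605 -> cuDNN 7.6.5
print("cuDNN enabled:", torch.backends.cudnn.enabled)
```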
45) Message boards : Science : Dataset 3, what is it and when is it coming? (Message 999)
Posted 30 Dec 2020 by bozz4science
Post:
Will you be "introducing" the ds4 experiment in a similar manner? Would highly appreciate a few details like you did here for ds3!
46) Message boards : News : [TWIM Notes] Dec 28 2020 (Message 998)
Posted 30 Dec 2020 by bozz4science
Post:
Especially excited about the prospective paper and the launch of the new experiment (ds4) in the near future. Congrats on gaining traction and growing the volunteer base.

I really appreciate that you're looking further into NaN-error handling, as this seems to be becoming more of an issue by the day.

Happy new year to all of you as well! May 2021 become a better year for everyone! Cheers!
47) Questions and Answers : Issue Discussion : Validate errors (Message 997)
Posted 30 Dec 2020 by bozz4science
Post:
Thanks for providing this resource, Luigi. Recently I have seen a bump in invalid results on my host as well, and even though I only have ~100 GPU tasks done, I am approaching nearly 10% invalids. That's a bit high IMO to continue computing those tasks without intervention.

Isn't there a way to catch those exceptions (NaN errors in a given epoch) and make this a conditional check, so that whenever it is triggered, before training a new epoch, the task is simply aborted and reassigned, or reinitialized? I don't see any point in letting those tasks compute to the very end. The NaN error is just carried through to the last epoch and results in a NaN error message there as well. I can't imagine anything valuable coming from such a task; surely it won't help the current research. Is there any way to trim down those invalid rates that could be built directly into the application? (Not circumvent them altogether, as the stochastic nature of ML/NN has been discussed here many times.)
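Conceptually I'm thinking of a guard roughly like the one below; this is a purely illustrative sketch of a generic PyTorch-style training loop, not the actual MLC client code (model, loader, etc. are placeholders):

```python
# Sketch of an early-abort guard for NaN losses in a generic PyTorch training
# loop. Illustrative only -- not the actual MLC client code.
import torch

def train(model, loader, optimizer, loss_fn, epochs):
    for epoch in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            if torch.isnan(loss) or torch.isinf(loss):
                # The loss has diverged -- no point carrying NaNs through all
                # remaining epochs; bail out so the task can be reissued
                # (or reinitialized with a different seed).
                raise RuntimeError(f"NaN/Inf loss at epoch {epoch}, aborting early")
            loss.backward()
            optimizer.step()
```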
48) Questions and Answers : Issue Discussion : wu's fail with err. message out of memory (Message 929)
Posted 9 Dec 2020 by bozz4science
Post:
I happened to see that error very often myself when my host went a bit rogue one night after the ds1+2 GPU WUs were deployed. My cc_config file specified running 2 WUs simultaneously on my 750 Ti, which triggered an error whenever my host didn't start 2 rand WUs at the same time, since only that combination stays well below the 2 GB VRAM. Since then I am staying at 1 task only, with a somewhat lower compute load. As the ds1+2 GPU WUs only load VRAM at about 52-54% on my 750 Ti, the VRAM capacity wasn't exceeded by much, but that was still enough to make already-started rand WUs crash immediately as soon as the new WUs were read into GPU memory.

By the way, I saw similar behaviour when overclocking the GPU's memory clock too aggressively: in the middle of network training, the tasks threw an error.
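For anyone who wants to check whether a second task would even fit before touching their config, something along these lines could work; a rough sketch using nvidia-smi's query interface, where the per-task VRAM figure is just my own estimate:

```python
# Rough sketch: ask nvidia-smi how much VRAM is free before deciding whether
# a second concurrent task is safe. The per-task estimate is my own guess.
import subprocess

EST_VRAM_PER_TASK_MB = 1100  # rough estimate for a ds1+2 GPU WU on a 750 Ti

out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total",
     "--format=csv,noheader,nounits"],
    text=True,
)
# first line = GPU 0; values are MiB
used_mb, total_mb = (int(v.strip()) for v in out.strip().splitlines()[0].split(","))
free_mb = total_mb - used_mb
print(f"GPU 0: {free_mb} MiB free of {total_mb} MiB")
if free_mb > EST_VRAM_PER_TASK_MB:
    print("Enough headroom for another task")
else:
    print("Stick to 1 task -- another one would likely overflow VRAM")
```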
49) Message boards : News : [TWIM Notes] Nov 30 2020 (Message 915)
Posted 4 Dec 2020 by bozz4science
Post:
Thanks for the update. Even though the snapshot data you include is at best indicative, the recent trend in the compute power development seems promising. Looks like the GPU rollout really starts to speed things up.

Current GigaFLOPS
10/12: 24544.93
10/19: 248734.27
10/27: 30532.46
11/2: 29333.56
11/9: 26267.99
11/17: 33798.72
11/23: 36126.42
11/30: 37671.35

This week's combined compute power is ~53% higher than only 1.5 months ago. While Mumps is still pulling away with a comfortable lead, others are picking up. It seems that Mumps will be the first to reach the 500M badge. Interesting, though, that this project's top daily-credit days all date back to the beginning of the project... https://www.boincstats.com/stats/190/project/detail/bestxdays
I suspected that GPUs would do the trick and quickly take the lead.
50) Questions and Answers : Issue Discussion : All my GPU applications have crushed. (Message 914)
Posted 4 Dec 2020 by bozz4science
Post:
To me it looks like something different is going on. As you can see, the datasets of the faulty tasks were read in successfully and training ran for 100+ epochs before this error code was thrown. I always got the same error message when the card was overclocked too far (mainly the memory clock) and/or too many tasks ran in parallel on the same card, resulting in VRAM overload scenarios that could also have caused the illegal memory address error. Try to tackle those if one of these points applies to you. Otherwise, others might be able to help as well.
51) Questions and Answers : Unix/Linux : GPU support update 11/23 (Message 902)
Posted 27 Nov 2020 by bozz4science
Post:
Yeah, you're completely right. I used the numbers stated on Samsung's website for 1 TB NVMe SSDs, though that might be a whole different story than your standard SATA SSD.

Having your numbers in mind, I completely understand your situation and share your concerns. I agree that more than 2x of total capacity in daily writes (120 GB SSD) is definitely too much, even at a $20 price point. And while the monetary daily depreciation would only increase to $0.10/day, which is still low compared to the operating or other components' costs, I feel you when you say the EOL could potentially be reached within not even a full year. Needless to say, the e-waste produced by rendering the device unusable through such heavy and sustained loads is much worse and should be avoided. By the way, I liked Jim's advice very much. Easy to implement, and it should do the trick of protecting the SSD against excessive rewriting of the same data.
52) Questions and Answers : Unix/Linux : GPU support update 11/23 (Message 896)
Posted 25 Nov 2020 by bozz4science
Post:
So I use a write cache. It is built in to Linux, you just set the parameters.
Great to know! Thanks Jim
53) Questions and Answers : Issue Discussion : Parity Modified GPU WUs often fail (Message 891)
Posted 25 Nov 2020 by bozz4science
Post:
My WU stats quickly deteriorated overnight.
For comparison:
- 4 errors with ~700 WUs
- 32 errors with ~830 WUs

To my surprise, I now saw many rand WUs fail as well. They failed, even though they had started training successfully, every time a DS 1/2 WU was selected to start running in tandem with an already running rand WU. It was just too much for the card to bear. Upon seeing this, I reverted back to running only 1 WU at a time. CUDA compute load for those WUs now reaches between 85-100% with an average of 92% on my card (65% for rand WUs), and I see no errors anymore, as VRAM is only loaded at 56-60% and kept well within its 2 GB capacity. So it wasn't exceeded by much before, but an overload of 5-10%, or just ~200 MB beyond the VRAM's capacity, made those WUs fail.

With average runtimes of 2,800 sec (2.65 sec/epoch), I do see a drop in throughput and credits though, due to lower overall utilisation: ~2,650 credits/h as opposed to ~4,000 credits/h with 2 rand GPU WUs running concurrently at all times.
54) Questions and Answers : Issue Discussion : GPU tasks runtime comparison (Message 890)
Posted 25 Nov 2020 by bozz4science
Post:
I recently realized that my particular 750 Ti, according to nvidia-smi, has a power limit of only 38.5 W instead of the reference card's 60 W TDP rating. So, in retrospect, I overestimated the GPU wattage per hour by ~35%. With 2 tasks running concurrently at ~60% average power load, that works out to just a little over 23 Wh per hour. That is quite decent, even in comparison to newer cards that run at a higher TDP but achieve lower runtimes per WU.
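In numbers, a trivial sanity check (Python):

```python
# Back-of-the-envelope energy estimate for my 750 Ti under MLC load.
power_limit_w = 38.5   # per nvidia-smi, not the 60 W reference TDP
avg_load = 0.60        # average power load with 2 tasks running concurrently
print(f"~{power_limit_w * avg_load:.1f} Wh per hour of crunching")  # ~23.1 Wh
```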
55) Questions and Answers : Unix/Linux : GPU support update 11/23 (Message 889)
Posted 25 Nov 2020 by bozz4science
Post:
I hadn't given it much thought until now. For my ancient rig with only a years-old HDD, that's fine, but with an NVMe drive or SSD as the main storage on which BOINC is installed and run, it is valid to think about potentially accelerated hardware wear. Let's start by looking at high-end NVMe drives that can increasingly be found in modern systems, e.g. Samsung's Evo/Evo Plus drives. They have a rated endurance of ~1,200 TB TBW. On a dual-GPU setup running 24/7, with 2 tasks concurrently on each GPU and an average runtime of 3,600 sec (meant to represent all GPU WU types), we would get to ~96 GPU WUs computed per day. With a lower runtime estimate you would easily get up to ~140 WU/day.

If you were to take the 1.9 GB written per task × 100 WU/day = 190 GB/day
1,200 TB TBW → 1,200,000 GB / 190 GB/day = 6,315 days ≈ 17.3 years

If you were to take a lower average runtime estimate of 2,500 sec for the sake of this thought experiment, you would come up with ~140 WU/day.
1,200,000 GB / 266 GB/day = 4,511 days ≈ 12.3 years

With an initial investment of ~$100 for 500 GB of NVMe storage, the CUDA Linux client would equate to $0.0158 or $0.0222, respectively, in additional depreciation per day if running MLC 24/7. Sure, that doesn't measure degrading performance, but it's an intuitive monetary measure of the depreciation of the hardware over its expected full lifetime. I guess most components will not make it to this number anyway, as mechanical parts such as pumps, fans, etc. will eventually break before that under constant 24/7 load, and other components might be upgraded within 5-year intervals.

This assumes, of course, that you only run MLC's GPU client, without any side project or other applications running alongside it. I guess it would make sense to consider running the CUDA client if you can, since the wear only seems to become a real issue well after the warranty period is over. Usually these drives come with a 4-5 yr warranty, so beyond that you could still accumulate roughly another 7 yrs' worth of drive wear before you'd be expected to lose some cells on your NVMe drive. Much oversimplified, but just to illustrate that today's tech should keep up well with the demanding requirements of this client version. I hope I didn't screw up this thought experiment. My only intention is to spark a discussion!
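If anyone wants to sanity-check or adapt this thought experiment, the whole thing fits in a few lines (Python; every input is one of the rough assumptions from above):

```python
# SSD-wear thought experiment from above; every input is a rough assumption.
TBW_GB = 1_200_000       # ~1,200 TB rated endurance (high-end 1 TB NVMe class)
GB_PER_TASK = 1.9        # approx. data written per GPU WU
DRIVE_COST_USD = 100     # ~500 GB NVMe drive

for wu_per_day in (100, 140):
    written_per_day = wu_per_day * GB_PER_TASK
    days = TBW_GB / written_per_day
    print(f"{wu_per_day} WU/day -> {written_per_day:.0f} GB/day, "
          f"{days:.0f} days (~{days / 365:.1f} yrs), "
          f"~${DRIVE_COST_USD / days:.4f}/day depreciation")
```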
56) Questions and Answers : Unix/Linux : GPU support update 11/23 (Message 877)
Posted 23 Nov 2020 by bozz4science
Post:
The preliminary numbers already look very promising! I appreciate the detailed listing of the technical requirements and potential issues with the current client version. Overall, exciting news!
57) Questions and Answers : Unix/Linux : GPU update (Message 875)
Posted 23 Nov 2020 by bozz4science
Post:
Looking forward to testing this on my system and verifying. Great accomplishment!
58) Message boards : News : Badges! (Message 871)
Posted 20 Nov 2020 by bozz4science
Post:
I just saw a 500k badge pop up. Anyway, it seems to be working for total credit just fine! So it's gonna be a very long way to 500M after I hit 1M soon.
59) Questions and Answers : Issue Discussion : Parity Modified GPU WUs often fail (Message 868)
Posted 20 Nov 2020 by bozz4science
Post:
First of all, thanks for your feedback. I did lower the memory clock slightly, and fewer tasks seem to fail. I also checked the task properties of currently running tasks, and it really does seem to be an issue with VRAM being close to 100% loaded. Surprisingly, I saw a couple of instances where 2 Parity/Parity Modified tasks actually did finish together overnight – that's when VRAM hit 98% load (2 GB).

The observed error rate has definitely improved overnight. The ratio was 14:2, with the 2 failing tasks erroring out almost immediately. What is interesting, though, is that at least on a low-bandwidth/small-VRAM card, runtimes suffer some more if 2 Parity/Parity Modified tasks are running concurrently.

Interestingly, I see large discrepancies in the per-epoch training time (2 WUs of the same type are always running concurrently, so the ratios between the runtimes should be fairly meaningful):
- rand automata: 8.9-9.0 sec/epoch or ø 1,730 sec per WU (192 epochs)
- parity: 4.6-4.7 sec/epoch or ø 4,820 sec per WU (1024 epochs)
- parity modified: 4.9-5.1 sec/epoch or ø 5,190 sec per WU (1024 epochs)

I am glad that my intuition about the different network types seems to hold up: the wider networks (data set 3/rand) do get through a WU much faster overall than the deeper networks (data set 1+2/parity + parity mod.). Depending on how reliable these short-term averages turn out to be, the per-epoch training time for ds 1+2 is actually ~45-50% lower than for the ds 3 WUs, but with 1,024 epochs instead of 192 the total runtime per WU ends up roughly 2.8x higher.
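A quick check of those ratios from the averages listed above (Python; the figures are just my short-term estimates):

```python
# Ratios derived from the averages listed above (2 WUs of the same type
# running concurrently in each case).
wus = {
    # name: (sec_per_epoch, total_sec, epochs)
    "rand automata":   (9.0, 1730, 192),
    "parity":          (4.7, 4820, 1024),
    "parity modified": (5.0, 5190, 1024),
}

base_epoch, base_total, _ = wus["rand automata"]
for name, (per_epoch, total, epochs) in wus.items():
    print(f"{name:16s} {per_epoch:.1f} s/epoch ({per_epoch / base_epoch:.2f}x rand), "
          f"{total} s total ({total / base_total:.2f}x rand)")
```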

It's gonna be interesting to see where the long-term average numbers end up at.
60) Questions and Answers : Issue Discussion : Parity Modified GPU WUs often fail (Message 862)
Posted 19 Nov 2020 by bozz4science
Post:
While the rand automata GPU tasks have reliably finished and validated for about a week (except for a few WUs I had to abort because the initial dataset loading, which is handled exclusively by the CPU, somehow got stuck and the WU never started), the story is different for the new dataset 1+2 GPU versions.

Now I see rather strange behaviour where the Parity Modified WUs start reliably but crash after a very short runtime of only 30-70 sec. For reference, the standard average runtime of a rand GPU WU on this GPU (2 WUs simultaneously) was ~1,730 sec. So this seems rather strange to me. Most of the WUs that failed on my host had also crashed on prior hosts. Unfortunately, I can't make a lot of sense of the error message. They all basically read like the one below, and the error "-529697949 (0xE06D7363) Unknown error code" doesn't help much in understanding this behaviour.

[2020-11-19 19:51:03	                load:106]	:	INFO	:	Successfully loaded dataset of 512 examples into memory.
[2020-11-19 19:51:03	                main:494]	:	INFO	:	Creating Model
[2020-11-19 19:51:03	                main:507]	:	INFO	:	Preparing config file
[2020-11-19 19:51:03	                main:519]	:	INFO	:	Creating new config file
[2020-11-19 19:51:04	                main:559]	:	INFO	:	Loading DataLoader into Memory
[2020-11-19 19:51:05	                main:562]	:	INFO	:	Starting Training


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x00007FFF3BC73B29

Engaging BOINC Windows Runtime Debugger...

See f.ex.
WU 1302119
WU 1302124
WU 1305925
WU 1306231
WU 1306653

As some of them did however get validated on other hosts, I wonder what it might be on my system that causes them to fail. I didn't change any settings in the meantime; the same projects (CPU tasks) are running in the background, the GPU is running at the same clock speeds, etc.
And some of the Parity Modified WUs have validated on my system so far. They took roughly 2.7x as long (4,800 sec) at 2.2x the credit. Given this, the fact that some validate while others error out immediately has me stumped. Also, I don't know whether others see this error as well or if it is just me... Any idea as to what might cause it?



©2022 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)