Posts by floyd

1) Questions and Answers : Unix/Linux : GPU support update 11/23 (Message 900)
Posted 26 Nov 2020 by floyd
Post:
First thanks to all for your suggestions. That gives me something to think about. Maybe during the weekend ...

@bozz4science: My calculation is very different, mostly because you seem to have different SSDs in mind. A rating of 1200 TBW must belong to terabyte-size devices. My SSDs are usually 120GB, which is more than sufficient for a dedicated cruncher. The downside is the much lower TBW rating. With your 100 tasks a day at 3GB each (not 1.9) I calculate 300GB of daily writes. At that rate the SSD I have in mind could reach EOL in 200 days. While the monetary value is not very high, I don't want to burn it like that. Moreover, replacing the SSD in the computer I'd possibly use is a major effort that I really wish to avoid.
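The arithmetic above can be sketched as follows. The 60 TB endurance figure is not stated in the post; it is only what 300GB a day over 200 days implies, so treat it as an assumption, and the function name is made up for illustration.

```python
def days_until_tbw(tbw_tb: float, tasks_per_day: int, gb_per_task: float) -> float:
    """Days until the rated write endurance (TBW, in TB) is used up."""
    daily_writes_gb = tasks_per_day * gb_per_task
    return tbw_tb * 1000 / daily_writes_gb  # 1 TB taken as 1000 GB


# 100 tasks a day at 3GB each, as in the post.
# 60 TB is the ASSUMED rating implied by "EOL in 200 days".
print(days_until_tbw(60, 100, 3.0))
# The same load on a drive rated for 1200 TB written.
print(days_until_tbw(1200, 100, 3.0))
```

Only the app and library copies are counted here; any other writes a task makes would shorten the estimate further.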

@Jim1348: A write cache is an interesting idea but it's not obvious to me that it actually does reduce writes, not just delay them. I'll need to find more information on how it works in detail.
I'd thought of running BOINC from a tmpfs perhaps, but then I'd have to find a way to make the data persist across restarts.

@pianoman: I'm not familiar with BTRFS at all. A quick search didn't come up with CoW as a prominent feature. That definitely needs some more reading before I'd consider using it.
2) Questions and Answers : Unix/Linux : GPU support update 11/23 (Message 888)
Posted 25 Nov 2020 by floyd
Post:
The downside is the size of the libraries we need to ship. CUDA is huge. Our binary itself is 8.6MB, but the libtorch_cuda library is 1.9GB, and several other libraries bring the total size of the app+library to 3GB. With AppImage, these libraries were compressed down and stored on disk in a squashfs filesystem, bringing the on-disk requirement down to approximately 900MB-1.6GB (depending on the build). What's worse, these files need to be downloaded to the project directory, and then copied to each running directory (not sure why BOINC doesn't do linking here, maybe we're missing a setting?), so they take up twice as much disk space when in use. More if you have more than one GPU.
I think people will usually have enough disk space, though they may need to configure BOINC to use that much. What worries me is the amount of data written. A quick calculation shows that your new application could easily write the full capacity of my "little" SSDs twice a day. That's not acceptable. Even if you could reduce that by a factor of 10 by making the tasks larger I think I still wouldn't do it.
Perhaps you could ask at Rosetta@Home, they had the same issue. Their application needs a database that used to be replicated for every task but they found a way to use a single copy in the project directory. I don't know how that works though.
3) Questions and Answers : Unix/Linux : create mount dir error: Read-only file system despite chmod -R 777 (Message 850)
Posted 17 Nov 2020 by floyd
Post:
/tmp would be the place to check. If the message is correct, you somehow got it mounted ro. Or the message could be misleading; in that case it could mean you can't write there for other reasons.
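One way to check whether /tmp is really mounted ro is to look at its options in /proc/mounts. A minimal sketch, assuming the usual "device mountpoint fstype options ..." line layout; the sample text and function names are made up for illustration.

```python
from typing import List, Optional


def mount_options(mounts_text: str, mount_point: str) -> Optional[List[str]]:
    """Return the options of the last entry for mount_point, or None if absent."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")  # later entries shadow earlier ones
    return options


def is_read_only(mounts_text: str, mount_point: str) -> bool:
    """True only if mount_point has its own entry and that entry carries 'ro'."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "ro" in options


# Hypothetical sample in /proc/mounts format, not taken from the thread.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
tmpfs /tmp tmpfs ro,nosuid,nodev 0 0
"""
print(is_read_only(SAMPLE, "/tmp"))
```

On a real system the text would come from reading /proc/mounts. If /tmp has no entry of its own, it is part of the parent filesystem and that entry is the one to inspect.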
4) Questions and Answers : Issue Discussion : All my GPU applications have crushed. (Message 790)
Posted 10 Nov 2020 by floyd
Post:
Have you checked whether you ticked the option to receive test applications?
Yes, everything is ticked. That computer actually got a v9.72 task on Sunday and the settings haven't changed since.
It's getting late, I'll check again tomorrow.
5) Questions and Answers : Issue Discussion : All my GPU applications have crushed. (Message 787)
Posted 10 Nov 2020 by floyd
Post:
Linux/CUDA

    • 64-bit CPU with SSE support
    • NVIDIA card w/ compute capability 3.5 or greater
    • nvidia binary driver version 440+
    • GLIBC 2.27+ (Ubuntu 18.04 equivalent)

I wonder why host 4162 doesn't get tasks. It should meet the requirements and there are tasks ready to send, yet nothing.
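The version requirements quoted above can be checked mechanically. This is only a sketch: the host values are placeholders, not the actual data of host 4162, and the CPU requirement (64-bit with SSE) is left out.

```python
# Minimum versions from the quoted Linux/CUDA requirements.
REQUIREMENTS = {
    "compute_capability": (3, 5),
    "driver": (440,),
    "glibc": (2, 27),
}


def parse_version(text: str) -> tuple:
    """Turn '2.27' into (2, 27) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def unmet_requirements(host: dict) -> list:
    """Names of the requirements the host does not meet."""
    return [
        name
        for name, minimum in REQUIREMENTS.items()
        if parse_version(host[name]) < minimum
    ]


# Hypothetical host: CC 5.0 card with an older 418-series driver.
example_host = {"compute_capability": "5.0", "driver": "418.0", "glibc": "2.28"}
print(unmet_requirements(example_host))
```

An empty list would mean the host passes these three checks, which still says nothing about why the scheduler sends or withholds work.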
6) Questions and Answers : Issue Discussion : All my GPU applications have crushed. (Message 773)
Posted 9 Nov 2020 by floyd
Post:
OK, ran into a few more issues but we've pushed out a new mlds test 9.75 cuda app a few minutes ago.
I'm also going to release a new batch of GPU testing WUs.
Looks like a few hosts quickly sucked up all available tasks. You may want to temporarily set up additional restrictions, like on the number of tasks per host, to get a broader distribution.
7) Questions and Answers : Issue Discussion : WU RAM? (Message 751)
Posted 1 Nov 2020 by floyd
Post:
I also happened to receive this error and had to unselect the 2 options "run test application" as well as the corresponding test application in my computing settings
I have all that selected and no problem getting work. Which makes me think the error only happens if the scheduler actually decides to send a test task and the outcome of the next try may be different.
8) Questions and Answers : Issue Discussion : WU RAM? (Message 749)
Posted 1 Nov 2020 by floyd
Post:
Machine Learning Dataset Generator (test) needs 390625.00 MB RAM but only 29252.34 MB is available for use.
It seems obvious to me that this is to prevent test tasks from being re-issued after they fail. At the time of writing this there are only 3 available and 11 in progress. Do you see the same requirement for regular tasks?
9) Questions and Answers : Issue Discussion : All my GPU applications have crushed. (Message 743)
Posted 31 Oct 2020 by floyd
Post:
As far as I can see the error is "process got signal 11" that might mean: "A signal 11 error, commonly known as a segmentation fault, means that the program accessed a memory location that was not assigned."
From my point of view it is unlikely that BOINC client might kill the process this way.
Not intentionally, but as far as I know BOINC client and science application communicate through shared memory. If the application tried to communicate when the memory had already been released for some reason I think a segfault could (or should?) happen. Maybe the client thought the application was already dead, intentionally or not. Just a thought. One could look at the BOINC log around the time when the crash happened, maybe with some additional log flags on. app_msg_* and heartbeat_debug look interesting.
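Those flags go into the log_flags section of the client's cc_config.xml. A minimal sketch that only builds the XML text, assuming app_msg_* stands for app_msg_send and app_msg_receive; the helper name is made up.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags) -> str:
    """Build a cc_config.xml body that switches the given log flags on."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


# Flags mentioned in the post; app_msg_* is expanded to send/receive (assumption).
print(build_cc_config(["app_msg_send", "app_msg_receive", "heartbeat_debug"]))
```

The output would have to be merged by hand with any cc_config.xml already in the BOINC data directory, and the client told to re-read its config files before the flags take effect.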

I could offer a 750 Ti for testing but the system has the 418.x driver installed and I would like to keep it that way. That means CUDA 10.1 and CC 5.0, but I'm not sure if that is worth a try as I got the impression that CUDA 10.2 is required.
10) Questions and Answers : Issue Discussion : Project Preferences for GPU (Message 718)
Posted 27 Oct 2020 by floyd
Post:
Dataman, you mentioned restarting BOINC or rebooting the computer, but all that does not update the client's settings. It needs to contact the project for that. Do a project update if you want to force it, or just wait for a routine contact during normal operation.
11) Questions and Answers : Issue Discussion : Project Preferences for GPU (Message 715)
Posted 27 Oct 2020 by floyd
Post:
For me it works as expected now. Looks like whatever was changed only affected display of that particular setting in the UI, not the actual value.
12) Questions and Answers : Issue Discussion : Project Preferences for GPU (Message 704)
Posted 26 Oct 2020 by floyd
Post:
It's possible that change to disable GPU in prefs by default messed something up
Seems so. For me the scheduler says
<no_cpu>0</no_cpu>
<no_cuda>0</no_cuda>
when I thought I had CPU on and NV off.
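To read such a fragment without squinting, something like the sketch below could be used. It assumes that 0 in a no_* element means the resource is not excluded, and it wraps the fragment in a made-up root element so it parses as XML.

```python
import xml.etree.ElementTree as ET


def enabled_resources(fragment: str) -> dict:
    """Map resource name to True if its no_<name> flag is 0 (not excluded)."""
    root = ET.fromstring("<prefs>" + fragment + "</prefs>")  # hypothetical wrapper
    return {
        element.tag[len("no_"):]: (element.text or "").strip() == "0"
        for element in root
        if element.tag.startswith("no_")
    }


# The two lines the scheduler reported in the post.
reply = "<no_cpu>0</no_cpu>\n<no_cuda>0</no_cuda>"
print(enabled_resources(reply))
```

With CPU on and NVIDIA off, the expectation under this reading would have been no_cuda set to 1.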
13) Questions and Answers : Issue Discussion : Rpi4 now erroring new tasks. (Message 666)
Posted 21 Oct 2020 by floyd
Post:
Started receiving tasks after a few days without and after a restart, and now getting a load of errors.
The old tasks were for architecture arm-unknown-linux-gnueabihf, the new ones are for aarch64-unknown-linux-gnu. Sorry I can't comment on the reason. Maybe some local configuration change. You could check what the BOINC startup messages say on architecture(s).
14) Message boards : Cafe : Motherboard / heatsink advice for AMD Ryzen chips (Message 617)
Posted 6 Oct 2020 by floyd
Post:
How much performance would be left on the table by choosing a 3600X over a 3700X?
Given that they are otherwise equal, you can compare them by clock rate and number of cores. That should give the 3700X a 26% advantage at a 35% higher price. But I did not suggest the 3600X, for other reasons, and if you want an 8-core Zen 2 there is no alternative to the 3700X for me.
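The comparison can be reproduced with the base clocks (3.8 GHz for the 6-core 3600X, 3.6 GHz for the 8-core 3700X). Using base rather than boost clocks is an assumption about how the 26% was derived; the 35% price difference is taken from the post as is.

```python
def throughput(cores: int, clock_ghz: float) -> float:
    """Crude output estimate: cores times clock rate, all else being equal."""
    return cores * clock_ghz


r5_3600x = throughput(6, 3.8)  # base clock, assumed basis of the comparison
r7_3700x = throughput(8, 3.6)

advantage = r7_3700x / r5_3600x  # relative output of the 3700X
price_ratio = 1.35               # "35% higher price", from the post
value_ratio = advantage / price_ratio

print(f"output advantage: {(advantage - 1) * 100:.0f}%")
print(f"output per unit of money vs 3600X: {value_ratio:.2f}")
```

A value ratio below 1 means slightly less output per unit of money for the 3700X on this crude measure.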

And is the 3700X in between the two other mentioned 3600X/3900X really less bang for the buck?
Yes, but not by much. The 3600 without X should give you the best bang for the buck. It's just not a really big bang and only you know how important the buck is.

Was indeed eyeing the 3700X and 3600X
I would not suggest the 3600X, nor the 3800X. They are a bit faster than the 3600 and 3700X respectively, but look what that does to the TDP! Power draw goes through the roof, as well as temperatures potentially, for not much gain. Someone will probably jump in and suggest some tweaks to counteract that. Personally I prefer not to experiment but run my processors at stock speed, with acceptable output and rock stable.

I was looking at larger Noctua and Alpenföhn coolers. Think those should be the best aftermarket air coolers that are currently available.
Noctuas get good reviews, enthusiastic sometimes, but I don't have personal experiences with them. They're too expensive for me. EKL Brocken 3 was my first choice cooler for the 3900X but it was clear it wouldn't fit in my case so I changed to a Scythe Mugen 5. Somewhat smaller but still a big thing and with equally good reviews. Turned out it didn't fit either so I had to use a big old case I still had around. Only one 120mm fan each at front and rear so it heats up quickly, negating the effect of the IMO capable cooler. Will have to change that some time.
Speaking of bad decisions, I got a quite expensive gaming mainboard for that system to make sure it could reliably provide power for a 105W TDP processor running full throttle 24/7, only to find out I can't do that because of the temperature. Waste of money. Next time I'd choose a good middle class mainboard, just don't buy cheap. And I'd definitely want heat sinks on the VRMs if not using the stock top blowing cooler.

Hope that in the end running long term with air cooling won't hurt the chip too much as opposed to water cooling.
Don't worry. Air cooling a 65W TDP processor shouldn't be a problem if you don't try to squeeze the last bit of power out of it. Stay away from the 3600X/3800X with high TDP and if you choose a 3900X or (expensive!) 3950X avoid my mistakes. A good case is a good start.
15) Message boards : Cafe : Motherboard / heatsink advice for AMD Ryzen chips (Message 589)
Posted 4 Oct 2020 by floyd
Post:
Did you already choose a processor? My personal choices for a dedicated cruncher are the Ryzen 3600 or 3900X. The 3600 is "only" a 6 core processor but it generates as much output as the older 8 core 1700X. If you want something bigger and can accept a worse price/power ratio go for the 3700X or 3950X. No Threadripper or Ryzen XT.

I guess I will try then probably first running BOINC with only the boxed cooler. If temps were to get uncomfortably high, I would also consider upgrading to a more powerful heatsink.
If it's okay for you to replace the cooler later try the boxed one first but expect the temperature to get uncomfortably high either way. If you improve cooling the processor only runs faster but not significantly cooler. To reduce temperature you need to force it to slow down, by limiting power or clock rate.

Water cooling is a luxury that I can't afford, won't fit in the case I am looking at currently
For air cooling of high power processors it is vital to keep the case as cool as possible which in my experience doesn't work well with small cases. And the most effective CPU coolers are also BIG. Keep that in mind when you choose your case.
16) Message boards : News : Badges! (Message 581)
Posted 4 Oct 2020 by floyd
Post:
I suggest you limit the flood of badges to one per sub-project, i.e. let a higher badge replace the previous one of the same type. Multiple sub-projects with several badges each will give a mess, more so if the badges look the same.
17) Questions and Answers : Issue Discussion : "No tasks sent" (Message 411)
Posted 25 Aug 2020 by floyd
Post:
The only unusual thing is that all tasks are re-sends with short or very short deadlines.
A small but possibly important correction: I also see some old WUs that apparently had never been sent.
18) Questions and Answers : Issue Discussion : "No tasks sent" (Message 408)
Posted 25 Aug 2020 by floyd
Post:
I don't have problems getting work. The only unusual thing is that all tasks are re-sends with short or very short deadlines. Maybe too few hosts are eligible for that.
19) Questions and Answers : Unix/Linux : OS/Distribution support question? (Message 384)
Posted 23 Aug 2020 by floyd
Post:
For the record, the current database breakdown for libc versions looks like:
glibc 2.17: 9
glibc 2.19: 2
glibc 2.24: 2
glibc 2.26: 2
glibc 2.27: 311
glibc 2.28: 87
glibc 2.29: 4
glibc 2.30: 15
glibc 2.31: 222
Does this mean the survey is over before it has started? This is probably better than any survey result and it clearly shows where you can make a cut without much loss. I'd suggest you check for hosts without glibc information though, it could mean they're running old BOINC clients and possibly old glibc.
Anyway, Debian 10, glibc 2.28, amd64 for me. The 9.50 application is running fine on my Ryzens and faster than 9.20.
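To see where a cut costs how much, the breakdown quoted above can be summed up per cutoff. A small sketch using exactly those counts; hosts without glibc information are not covered.

```python
# glibc version -> number of hosts, as quoted in the post.
GLIBC_HOSTS = {
    (2, 17): 9, (2, 19): 2, (2, 24): 2, (2, 26): 2, (2, 27): 311,
    (2, 28): 87, (2, 29): 4, (2, 30): 15, (2, 31): 222,
}


def hosts_at_or_above(minimum: tuple) -> int:
    """Number of hosts whose glibc is at least the given version."""
    return sum(count for version, count in GLIBC_HOSTS.items() if version >= minimum)


def share_kept(minimum: tuple) -> float:
    """Fraction of listed hosts that would still be supported."""
    return hosts_at_or_above(minimum) / sum(GLIBC_HOSTS.values())


for cutoff in sorted(GLIBC_HOSTS):
    label = ".".join(str(part) for part in cutoff)
    print(f"glibc >= {label}: {hosts_at_or_above(cutoff)} hosts, {share_kept(cutoff):.1%}")
```

Requiring glibc 2.27 would keep 639 of the 654 listed hosts, while moving the cut to 2.28 would drop another 311.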
20) Questions and Answers : Unix/Linux : Linux/armhf and Linux/arm64 support status thread (Message 383)
Posted 23 Aug 2020 by floyd
Post:
I'd already tried to run ldd on the app, as that's how I usually fix this sort of issue, but the way it's packaged means I can't see what it needs
Run it with "--appimage-extract" as the only parameter.
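A sketch of the whole round trip, assuming the app is a regular AppImage that unpacks into a squashfs-root directory under the current working directory. The path of the binary inside the image is not known from this thread, so it is a parameter here.

```python
import subprocess
from pathlib import Path


def extract_command(app_path: str) -> list:
    """Command line that makes an AppImage unpack itself."""
    return [str(Path(app_path).resolve()), "--appimage-extract"]


def extract_and_ldd(app_path: str, binary_relpath: str, workdir: str) -> str:
    """Unpack the AppImage into workdir and return ldd output for one binary.

    binary_relpath is hypothetical: the location of the real executable
    inside squashfs-root has to be looked up after extraction.
    """
    subprocess.run(extract_command(app_path), cwd=workdir, check=True)
    binary = Path(workdir) / "squashfs-root" / binary_relpath
    result = subprocess.run(
        ["ldd", str(binary)], capture_output=True, text=True, check=True
    )
    return result.stdout
```

The app file needs to be executable for this to work, and lines containing "not found" in the ldd output point at the missing libraries.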



©2021 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)