Just one windows machine with errors

zombie67 [MM]
Joined: 1 Jul 20
Posts: 34
Credit: 26,118,410
RAC: 0
Message 713 - Posted: 27 Oct 2020, 6:17:51 UTC
Last modified: 27 Oct 2020, 6:41:28 UTC

Any idea why this one machine is throwing nothing but errors, with test units, both CPU and GPU?

https://www.mlcathome.org/mlcathome/show_host_detail.php?hostid=666

Edit: To be clear, no errors on any other projects at all.
Reno, NV
Team: SETI.USA
swiftmallard
Joined: 23 Sep 20
Posts: 24
Credit: 15,318,198
RAC: 1,992
Message 717 - Posted: 27 Oct 2020, 12:42:38 UTC

The only reason I can see is because it's computer number 666.

Sorry, couldn't resist. I'll see myself out.
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 721 - Posted: 27 Oct 2020, 19:03:01 UTC

Ugh, the dreaded DLL hell problem that appears on some machines and not others. As you point out, this has nothing to do with GPUs vs. CPUs.

So, the error it's reporting means a dependent DLL is either missing or the wrong version. It's nearly impossible to debug because it's highly dependent on the host machine.

Would you be willing to download a zip file and run some tests for me? I can post a copy of the client not bundled through BOINC, along with a third-party program called Dependency Walker (http://dependencywalker.com/), and you can try running the client directly on that machine. Presumably it will fail, and then I'd need you to run Dependency Walker on the binary. Unfortunately, I'm not sure there's a way to script Dependency Walker's output.

Other projects get around this issue by statically linking their applications, which I would love to do. However, PyTorch (our underlying framework) can't be linked statically, so we're stuck dealing with issues like this. We get so much by using PyTorch (like built-in GPU support), but this is one of its drawbacks. Still others ship their client in a VirtualBox VM with all dependencies in place, but then you have the overhead of virtualization and no GPU support. So, for now, we ship the exe with a bunch of PyTorch DLLs, as well as the Visual Studio Runtime (vcruntime) DLLs.

Linux also has this problem, but it's much easier to debug there, and the error messages are much more helpful than "error 0xc0000139".
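
For anyone who wants a quick first check before reaching for Dependency Walker, here is a rough Python sketch of the same idea: try to load each bundled library and record which ones the OS loader rejects. This is not part of the test zip or its README; the function name `probe_libraries` and the folder you point it at are my own placeholders. Note that loading a library runs its initialization code, so only try this on files you already trust, and expect less detail than Dependency Walker gives.

```python
import ctypes
from pathlib import Path


def probe_libraries(folder, pattern="*.dll"):
    """Try to load every library in `folder` matching `pattern`.

    Returns a dict mapping file name -> loader error message for each
    library that could not be loaded. An empty dict means everything loaded.
    """
    failures = {}
    for lib in sorted(Path(folder).glob(pattern)):
        try:
            # Use the absolute path so the loader opens exactly this file.
            ctypes.CDLL(str(lib.resolve()))
        except OSError as exc:
            # The loader's message usually hints at a missing or
            # mismatched dependency.
            failures[lib.name] = str(exc)
    return failures
```

Calling `probe_libraries(r"C:\path\to\unzipped\client")` (a hypothetical path) and printing the result should at least narrow down which DLL the loader is unhappy about. On Linux the same function can be used with `pattern="*.so"`.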
Sergey Kovalchuk

Joined: 1 Jul 20
Posts: 31
Credit: 123,959
RAC: 0
Message 722 - Posted: 27 Oct 2020, 19:40:56 UTC - in response to Message 721.  

There is one more way to solve the library compatibility problem: use a preinstalled and configured PyTorch package on the host. You would then need a way to switch between the native package and the downloaded one.

The BURP project used a switch file: you had to create a file or folder named "native" in the project folder. However, Blender was still downloaded with the project and took up space.

On another project I proposed a different selector: on a configured host (one with the PyTorch package installed), create a pseudo coprocessor of the same name, and configure a version of the application (without bundled libraries) to use it through the plan class:

<cc_config>
  <options>
    <coproc>
      <type>PyTorch</type>
      <count>1</count>
      <non_gpu/>
    </coproc>
  </options>
</cc_config>


<app_version>
    <app_name>mlds</app_name>
    <avg_ncpus>1.00000</avg_ncpus>
    <plan_class>PyTorch</plan_class>
    <coproc>
        <type>PyTorch</type>
        <count>1.000000</count>
    </coproc>
</app_version>
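
A possible way to automate the host side of this is sketched below: check whether a native package is present, and only then generate the `<coproc>` fragment shown above. This is an illustration under assumptions, not anything the project ships. The helper names `has_native_package` and `build_cc_config` are made up, and treating the importable Python module `torch` as proof of a usable native install is itself an assumption. The output would have to be merged by hand into an existing cc_config.xml rather than overwriting it.

```python
import importlib.util
import xml.etree.ElementTree as ET


def has_native_package(name="torch"):
    """Return True if a top-level Python package `name` is importable."""
    return importlib.util.find_spec(name) is not None


def build_cc_config(coproc_type="PyTorch", count=1):
    """Build the pseudo-coprocessor fragment from the post as an XML string."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    coproc = ET.SubElement(options, "coproc")
    ET.SubElement(coproc, "type").text = coproc_type
    ET.SubElement(coproc, "count").text = str(count)
    # Empty element marking the coprocessor as not being a GPU.
    ET.SubElement(coproc, "non_gpu")
    return ET.tostring(root, encoding="unicode")


def config_for_host(package="torch"):
    """Return the config fragment only when the native package is present."""
    return build_cc_config() if has_native_package(package) else None
```

On a host without the package, `config_for_host()` returns `None`, so the client would never advertise the pseudo coprocessor and would not be matched to the library-free app version.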
zombie67 [MM]
Joined: 1 Jul 20
Posts: 34
Credit: 26,118,410
RAC: 0
Message 728 - Posted: 28 Oct 2020, 13:45:36 UTC - in response to Message 721.  

Would you be willing to download a zip file and run some tests for me?


Sure! Just send me the directions and where to get the files.
Reno, NV
Team: SETI.USA
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 739 - Posted: 31 Oct 2020, 3:13:03 UTC - in response to Message 728.  

Would you be willing to download a zip file and run some tests for me?


Sure! Just send me the directions and where to get the files.


Sigh, *now* I find the forum thread 30 seconds after I private messaged you. Well, might as well re-post instructions here in case others want to help debug.

If you're having an issue with the CUDA client not running because of what look like DLL issues, please download the test copy of the client from https://www.mlcathome.org/analysis/mlds-gpu-9.70-test-win64.zip. It includes the client, a README.md, a copy of Dependency Walker, and a sample dataset. The README.md explains how to test-run the client and, if it fails, how to run Dependency Walker to dump information about the libraries the exe is trying to load in your environment to a file, so we can analyze it on our end.

Thanks again for being willing to test!
zombie67 [MM]
Joined: 1 Jul 20
Posts: 34
Credit: 26,118,410
RAC: 0
Message 741 - Posted: 31 Oct 2020, 7:18:48 UTC

File sent. It's 90 MB. I hope that doesn't break things on your end.
Reno, NV
Team: SETI.USA
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 771 - Posted: 9 Nov 2020, 4:13:52 UTC

Haven't forgotten about this. I've taken some time off the past week and there's nothing obviously wrong from your log. I'll need to dig a little deeper but I've been focused on getting the linux/cuda app up and running. I'll pivot back to this soon.
STE\/E

Joined: 1 Jul 20
Posts: 2
Credit: 2,768,805
RAC: 0
Message 805 - Posted: 11 Nov 2020, 20:24:30 UTC - in response to Message 713.  

Any idea why this one machine is throwing nothing but errors, with test units, both CPU and GPU?

https://www.mlcathome.org/mlcathome/show_host_detail.php?hostid=666

Edit: To be clear, no errors on any other projects at all.


Same Problem here >>> https://www.mlcathome.org/mlcathome/results.php?hostid=23

This is the only box I can seem to get any tasks on, and they all error out after 20-30 seconds ...
Bill F
Joined: 2 Jul 20
Posts: 7
Credit: 2,052,848
RAC: 8
Message 813 - Posted: 12 Nov 2020, 4:35:00 UTC - in response to Message 713.  

Any idea why this one machine is throwing nothing but errors, with test units, both CPU and GPU?

https://www.mlcathome.org/mlcathome/show_host_detail.php?hostid=666

Edit: To be clear, no errors on any other projects at all.



OK, do you have any other machines this advanced? It has an i9 CPU with a lot of memory, and it is running VBox 6.1.0.
You have what looks like a fairly new GPU driver at 456. Is this your top-of-the-hill system?

I will ask: have you installed the VBox Extension Pack for 6.1.0? Why you would need it I cannot speculate. Since you are not getting failures on other projects, I suspect that your hardware and clocking are not the problem. In pushing ideas around, one more question: are you running any other VBox work on the system for other projects, and do those tasks have similar load requirements?

Bill F
Dallas
In October of 1969 I took an oath to support and defend the Constitution of the United States against all enemies, foreign and domestic;
There was no expiration date.


zombie67 [MM]
Joined: 1 Jul 20
Posts: 34
Credit: 26,118,410
RAC: 0
Message 815 - Posted: 12 Nov 2020, 6:03:38 UTC

The three machines that have nothing but errors are

- Threadripper 3970X, 2x GTX 1660 Ti (dual boot Win10 and Linux, tasks fail on both)
- Threadripper 3970X, 1x RTX 2080 Ti + 1x GTX 1080 Ti (dual boot Win10 and Linux, tasks fail on both)
- i9-9820X, 2x GTX 1660 Ti (Win10)

Latest Nvidia drivers for all three.

Not sure what the vbox questions are about? This project doesn't use vbox. These are native tasks.

The thing that makes me wonder what's going on is that not even the CPU tasks work. They used to work just fine. Putting aside the GPU problems, something else has changed and is causing problems.
Reno, NV
Team: SETI.USA
Sid

Joined: 22 Aug 20
Posts: 7
Credit: 18,867,632
RAC: 0
Message 817 - Posted: 12 Nov 2020, 14:06:13 UTC - in response to Message 815.  
Last modified: 12 Nov 2020, 14:08:11 UTC

One of my Windows boxes reports this for every task, GPU or not:

-1073741515 (0xC0000135) STATUS_DLL_NOT_FOUND

What might it be?
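
As a reading aid for codes like this one: BOINC shows the Windows exit status as a signed 32-bit number, and the hex form is the same value read as unsigned. The small Python sketch below does that conversion and looks up the two codes mentioned in this thread. The table is deliberately limited to those two entries, and the function name `decode_exit_code` is just a placeholder.

```python
# The two NTSTATUS values that appear in this thread.
KNOWN_STATUS = {
    0xC0000135: "STATUS_DLL_NOT_FOUND",         # a required DLL could not be located
    0xC0000139: "STATUS_ENTRYPOINT_NOT_FOUND",  # DLL found, but a needed function is missing
}


def decode_exit_code(code):
    """Convert a signed or unsigned 32-bit exit code to (hex string, name)."""
    # Masking with 0xFFFFFFFF reinterprets a negative value as unsigned 32-bit.
    unsigned = code & 0xFFFFFFFF
    return f"0x{unsigned:08X}", KNOWN_STATUS.get(unsigned, "unknown")


print(decode_exit_code(-1073741515))  # ('0xC0000135', 'STATUS_DLL_NOT_FOUND')
```

So this code means the loader could not find a DLL at all, whereas the 0xc0000139 mentioned earlier in the thread means a DLL was found but does not export a function the program expects, which usually points to a version mismatch.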


©2022 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)