Some errors

Questions and Answers : Windows : Some errors
[VENETO] boboviz

Joined: 11 Jul 20
Posts: 33
Credit: 1,266,237
RAC: 0
Message 364 - Posted: 22 Aug 2020, 16:44:30 UTC

1127395
1127422

<message>
(unknown error) - exit code -1073741819 (0xc0000005)</message>
<stderr_txt>


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00007FFC45B65088 read attempt to address 0x00000018


Other WUs now seem to run correctly.
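
The two numbers in the message above are the same value: the decimal exit code is the signed 32-bit reading of the hexadecimal status 0xc0000005. A minimal Python sketch of that conversion follows; the function name `exit_code_to_status` is made up for illustration and is not part of BOINC or the project app.

```python
def exit_code_to_status(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex status string."""
    # Masking with 0xFFFFFFFF maps negative values onto their
    # two's-complement unsigned 32-bit equivalent.
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


if __name__ == "__main__":
    # Exit code quoted in the post above.
    print(exit_code_to_status(-1073741819))
```

Searching for the hex form is usually more productive than searching for the negative decimal number.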
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 367 - Posted: 22 Aug 2020, 17:52:45 UTC - in response to Message 364.  

Thanks for the report.

It looks like a null pointer dereference, but it's in the c10 library (a PyTorch lib). It's weird that it only happens occasionally, though. Overall we don't get a lot of compute errors on this project, but Windows errors with similar stack traces make up the majority of the few we do get. I'll look into it more.
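
For anyone triaging their own results, the "null pointer dereference" reading comes from the faulting read address (0x00000018), which is a small offset from zero. The following is a rough Python sketch for sorting "Reason:" lines like the ones in this thread. The regular expression, the `classify_reason` name, and the 64 KB cutoff are all assumptions made for illustration (the cutoff reflects that the lowest 64 KB of a Windows process address space is normally unmapped); none of it comes from the project's code.

```python
import re

# Assumed line shape, modelled only on the "Reason:" lines quoted in this thread.
REASON_RE = re.compile(
    r"Reason: (?P<kind>.+?) \((?P<code>0x[0-9a-fA-F]+)\) at address "
    r"(?P<pc>0x[0-9a-fA-F]+)"
    r"(?: read attempt to address (?P<target>0x[0-9a-fA-F]+))?"
)

# Assumption: reads below 64 KB are treated as "null plus a small offset".
NEAR_NULL_LIMIT = 0x10000


def classify_reason(line: str):
    """Return a coarse label for one 'Reason:' line, or None if it does not match."""
    match = REASON_RE.search(line)
    if match is None:
        return None
    target = match.group("target")
    if target is None:
        # e.g. a C++ exception record, which carries no faulting read address
        return "no read address"
    if int(target, 16) < NEAR_NULL_LIMIT:
        return "near-null read"
    return "other read"


if __name__ == "__main__":
    line = ("Reason: Access Violation (0xc0000005) at address "
            "0x00007FFC45B65088 read attempt to address 0x00000018")
    print(classify_reason(line))
```

A "near-null read" label is only a hint that a null pointer was probably involved; it says nothing about which library is at fault.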

Note it may be fixed in the 9.50 Windows app that I need to roll out again, hopefully later today. That uses a new version of the PyTorch libs (v1.6.0 vs. v1.5.1).
hwt

Joined: 26 Aug 20
Posts: 1
Credit: 12,615
RAC: 0
Message 417 - Posted: 27 Aug 2020, 16:12:46 UTC - in response to Message 367.  

Hi, I'm hitting the same problem for all WUs on this PC:
https://www.mlcathome.org/mlcathome/result.php?resultid=1214706
https://www.mlcathome.org/mlcathome/result.php?resultid=1192679
https://www.mlcathome.org/mlcathome/result.php?resultid=1200493
https://www.mlcathome.org/mlcathome/result.php?resultid=1200481

(unknown error) - exit code -1073741819 (0xc0000005)</message>
<stderr_txt>
Unhandled Exception Detected...
- Unhandled Exception Record -
①Reason: Access Violation (0xc0000005) at address 0x00007FFA6B9C0D50 read attempt to address 0x36065860
②Reason: Access Violation (0xc0000005) at address 0x00007FFA5AE127A7 read attempt to address 0x5BE59378
③Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x00007FFA69528B8C
④Reason: Access Violation (0xc0000005) at address 0x00007FFA0CAF2D6A read attempt to address 0xFFFFFFFF


OS: WIN10 2004 20190.1000
APP: Machine Learning Dataset Generator v9.55 windows_x86_64
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 492 - Posted: 17 Sep 2020, 18:38:49 UTC - in response to Message 417.  

Windows issues are so hard to debug. Fortunately the Windows client error rate is now in line with the Linux clients, which is good (it was higher for a while).

This one at least came with a helpful stack trace that seems to show it running out of memory. MLDS uses about 700-750 MB per WU. If you have a high number of threads, I can see it hitting memory problems if other things are running on the box, even at 32 GB, especially since BOINC, I think, defaults to using only 40-50% of the memory on a system. I'll have to look at your specific WU and host later in the day when I'm home.
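
As a back-of-the-envelope check on the numbers above, here is a small Python sketch. It takes the 750 MB per WU and 32 GB figures from this post and treats the 40-50% BOINC memory share as an assumption (it is stated above only as a guess). The `max_concurrent_wus` helper is made up for illustration and ignores memory used by other BOINC projects or other programs.

```python
def max_concurrent_wus(total_ram_gb: float, boinc_fraction: float, mb_per_wu: float) -> int:
    """Estimate how many WUs fit inside BOINC's share of system memory."""
    # BOINC's memory budget in MB, given the fraction of RAM it may use.
    budget_mb = total_ram_gb * 1024 * boinc_fraction
    # Whole WUs only; any remainder is headroom.
    return int(budget_mb // mb_per_wu)


if __name__ == "__main__":
    for fraction in (0.4, 0.5):
        count = max_concurrent_wus(32, fraction, 750)
        print(f"BOINC share {fraction:.0%}: about {count} concurrent WUs")
```

On a host with more threads than that estimate, or with other memory-hungry tasks running, out-of-memory failures like the one in the trace would not be surprising.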

PyTorch is not very good at gracefully dealing with out-of-memory errors. And I'm fairly certain it's not an issue with our client misusing the interface.

The other Windows issues I see a lot boil down to a missing DLL or a missing entry point (which is, essentially, a missing DLL), and I don't know why that's the case.

©2022 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)