195 (0x000000C3) EXIT_CHILD_FAILED error

Hal Bregg

Joined: 27 Sep 20
Posts: 6
Credit: 1,025,663
RAC: 0
Message 1378 - Posted: 12 Oct 2021, 16:44:59 UTC

I trashed a lot of WUs today with the following error:

Machine Learning Dataset Generator v9.90 (Linux/x86_64) (libTorch: release/1.9)
HDF5-DIAG: Error detected in HDF5 (1.12.0) thread 0:
  #000: /home/gitlab-runner/builds/-ZexzTs7/0/mlcathome/mlds/extern/hdf5/src/H5F.c line 793 in H5Fopen(): unable to open file
    major: File accessibility
    minor: Unable to open file
  #001: /home/gitlab-runner/builds/-ZexzTs7/0/mlcathome/mlds/extern/hdf5/src/H5VLcallback.c line 3500 in H5VL_file_open(): open failed
    major: Virtual Object Layer
    minor: Can't open object
  #002: /home/gitlab-runner/builds/-ZexzTs7/0/mlcathome/mlds/extern/hdf5/src/H5VLcallback.c line 3465 in H5VL__file_open(): open failed
    major: Virtual Object Layer
    minor: Can't open object
  #003: /home/gitlab-runner/builds/-ZexzTs7/0/mlcathome/mlds/extern/hdf5/src/H5VLnative_file.c line 100 in H5VL__native_file_open(): unable to open file
    major: File accessibility
    minor: Unable to open file
  #004: /home/gitlab-runner/builds/-ZexzTs7/0/mlcathome/mlds/extern/hdf5/src/H5Fint.c line 1707 in H5F_open(): unable to read superblock
    major: File accessibility
    minor: Read failed
  #005: /home/gitlab-runner/builds/-ZexzTs7/0/mlcathome/mlds/extern/hdf5/src/H5Fsuper.c line 412 in H5F__super_read(): file signature not found
    major: File accessibility
    minor: Not an HDF5 file
terminate called after throwing an instance of 'H5::FileIException'
16:16:01 (360255): mlds exited; CPU time 0.307757
16:16:01 (360255): app exit status: 0x86
16:16:01 (360255): called boinc_finish(195)
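
If I'm reading the tail of the trace right, mlds hit an uncaught HDF5 exception while opening its dataset file, aborted, and whatever launched it (presumably the BOINC wrapper) then reported the task as failed via boinc_finish(195), i.e. EXIT_CHILD_FAILED. A minimal sketch of the kind of call that produces a trace like this (hypothetical filename; assuming the HDF5 C++ API that the H5::FileIException line comes from):

#include <H5Cpp.h>   // HDF5 C++ API; H5::FileIException is defined here

int main() {
    // If "dataset.hdf5" isn't a valid HDF5 file (e.g. empty or truncated),
    // H5Fopen() fails with "file signature not found", the C++ wrapper
    // throws H5::FileIException, and with nothing catching it
    // std::terminate() aborts the process.
    H5::H5File file("dataset.hdf5", H5F_ACC_RDONLY);   // hypothetical path
    return 0;
}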


What is causing the errors?
Hal Bregg

Joined: 27 Sep 20
Posts: 6
Credit: 1,025,663
RAC: 0
Message 1379 - Posted: 12 Oct 2021, 21:01:09 UTC - in response to Message 1378.  

Some more details about the above errors that I didn't add earlier:

Machine Learning Dataset Generator (test) v9.96 keeps crashing on Ubuntu 20.04.3 LTS [5.4.0-88-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9.2)].
the0retical

Joined: 3 Aug 21
Posts: 3
Credit: 109,720
RAC: 0
Message 1380 - Posted: 12 Oct 2021, 22:13:18 UTC - in response to Message 1379.  

Thanks for posting an error log. The test queue has some erroneous scripts in the backend at the moment: I'm currently volunteering for this project by refactoring the backend.

The latest batch that I sent out caused cascading errors, which is what usually happens when the scripts fail: the failed WU gets sent to another host, fails there too, and the cycle repeats.

For the next month or so you should expect these sorts of errors, as I will be testing feature branches on the test queue.

As for the code, we have agreed to open-source the backend once I have implemented the new validation script. You can follow me and my activity here: https://gitlab.com/delta1512
the0retical

Joined: 3 Aug 21
Posts: 3
Credit: 109,720
RAC: 0
Message 1381 - Posted: 12 Oct 2021, 22:28:36 UTC - in response to Message 1380.  

An update on this.

It appears that John accidentally set the cronjob/service to automatically generate work on all queues instead of just the production queue, so now all the work is being sent through the erroneous backend scripts.

The issue should subside later today when he disables the generation, but you might still see the cascading effects.
Sagittarius Lupus

Joined: 4 Apr 21
Posts: 7
Credit: 415,093
RAC: 5
Message 1383 - Posted: 14 Oct 2021, 1:15:50 UTC - in response to Message 1380.  

Thanks for disclosing this. I was worried for a moment that I had done something to destabilize my worker rig, but I'm glad it's not just me. I'm happy to run test units but when I see the red line I go investigating... it saves so much time to find information like this from project developers. Keep on breaking stuff for science! :D
Landjunge

Joined: 19 Sep 20
Posts: 1
Credit: 7,524,856
RAC: 2,920
Message 1385 - Posted: 14 Oct 2021, 9:27:31 UTC

I have the same errors as mentioned above on my Raspberry Pi 4s. At first I only got them with 9.96, so I disabled the beta WUs, but now 9.90 is erroring out too. Does this need fixing on the server side?
swiftmallard

Joined: 23 Sep 20
Posts: 24
Credit: 15,318,198
RAC: 1,992
Message 1386 - Posted: 14 Oct 2021, 10:51:56 UTC

I have had 125 of these errors overnight.
UBT - wbiz

Joined: 2 May 21
Posts: 9
Credit: 2,016,461
RAC: 2
Message 1387 - Posted: 14 Oct 2021, 15:40:55 UTC - in response to Message 1385.  

I have the same errors as mentioned above on my Raspberry Pi 4s. At first I only got them with 9.96, so I disabled the beta WUs, but now 9.90 is erroring out too. Does this need fixing on the server side?


Same on my Pi 4s: 675 errors over the last couple of days. I've suspended MLC for now, as it's messing up my schedules for other projects.
pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 1389 - Posted: 15 Oct 2021, 4:26:19 UTC - in response to Message 1387.  

FYI, this is the issue discussed in the news story. THANK YOU for posting about it.

It's an issue on our backend, not yours. We've been trying to consolidate and clean up a hodgepodge of 20+ scripts that create, validate, and appraise WUs for all three queues and all four datasets into a small few. We've been running some of the updated, consolidated scripts on mldstest for a while, and when we flipped the switch on the main queue we somehow broke BOTH the mlds queue and the test queue: both were sending out corrupt WUs. It looks like part of the corruption is that the WUs don't properly specify where the hdf5 dataset file is located, so the client can't find it.

We've managed to cancel all outstanding WUs on both queues, and are working on a fix. Please stay tuned over the next 24 hours while we clean up our mess and get back online.

I can easily revert to the old code, but for the overall health and maintenance of the project it would be better to track down where the new code went wrong, so we're trying that first.
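
Conceptually, the check that would have caught this before the WUs went out is small: make sure the dataset file a WU points at actually exists and starts with the 8-byte HDF5 signature that H5F__super_read() complained about above. A rough sketch of that idea (this is not our actual backend code; the naming and layout are made up):

#include <cstdio>
#include <cstring>

// Pre-flight check: does the file exist and start with the HDF5 signature?
// (Strictly, HDF5 also allows the superblock at 512-byte-multiple offsets,
// but offset 0 is the normal case for freshly generated datasets.)
static bool looks_like_hdf5(const char* path) {
    static const unsigned char sig[8] = {0x89, 'H', 'D', 'F', '\r', '\n', 0x1a, '\n'};
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;                          // wrong or missing path
    unsigned char buf[8] = {0};
    const bool ok = std::fread(buf, 1, sizeof buf, f) == sizeof buf
                    && std::memcmp(buf, sig, sizeof sig) == 0;
    std::fclose(f);
    return ok;
}

int main(int argc, char** argv) {
    // Usage: check <dataset.hdf5>; exits non-zero if the file looks corrupt.
    return (argc > 1 && looks_like_hdf5(argv[1])) ? 0 : 1;
}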


©2022 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)