Questions and Answers : Issue Discussion : 195 (0x000000C3) EXIT_CHILD_FAILED error
| Author | Message |
|---|---|
| Joined: 27 Sep 20 · Posts: 6 · Credit: 1,025,663 · RAC: 0 |
I trashed a lot of WUs today with the following error:
Machine Learning Dataset Generator v9.90 (Linux/x86_64) (libTorch: release/1.9)
HDF5-DIAG: Error detected in HDF5 (1.12.0) thread 0:
#000: /home/gitlab-runner/builds/-ZexzTs7/0/mlcathome/mlds/extern/hdf5/src/H5F.c line 793 in H5Fopen(): unable to open file
major: File accessibility
minor: Unable to open file
#001: /home/gitlab-runner/builds/-ZexzTs7/0/mlcathome/mlds/extern/hdf5/src/H5VLcallback.c line 3500 in H5VL_file_open(): open failed
major: Virtual Object Layer
minor: Can't open object
#002: /home/gitlab-runner/builds/-ZexzTs7/0/mlcathome/mlds/extern/hdf5/src/H5VLcallback.c line 3465 in H5VL__file_open(): open failed
major: Virtual Object Layer
minor: Can't open object
#003: /home/gitlab-runner/builds/-ZexzTs7/0/mlcathome/mlds/extern/hdf5/src/H5VLnative_file.c line 100 in H5VL__native_file_open(): unable to open file
major: File accessibility
minor: Unable to open file
#004: /home/gitlab-runner/builds/-ZexzTs7/0/mlcathome/mlds/extern/hdf5/src/H5Fint.c line 1707 in H5F_open(): unable to read superblock
major: File accessibility
minor: Read failed
#005: /home/gitlab-runner/builds/-ZexzTs7/0/mlcathome/mlds/extern/hdf5/src/H5Fsuper.c line 412 in H5F__super_read(): file signature not found
major: File accessibility
minor: Not an HDF5 file
terminate called after throwing an instance of 'H5::FileIException'
16:16:01 (360255): mlds exited; CPU time 0.307757
16:16:01 (360255): app exit status: 0x86
16:16:01 (360255): called boinc_finish(195)
What is causing the errors? |
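The key line in that trace is "file signature not found ... Not an HDF5 file": the downloaded dataset file does not begin with the 8-byte HDF5 superblock signature, so the app aborts on the uncaught H5::FileIException and BOINC reports exit code 195 (EXIT_CHILD_FAILED). A minimal sketch of how a suspect input file could be checked locally, assuming Python with h5py installed (the file name below is hypothetical):

```python
import h5py

# An HDF5 file starts with this 8-byte signature; "file signature not found"
# means these bytes were not where the library expected them.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

def starts_with_hdf5_signature(path):
    """Check only the common case: signature at byte offset 0."""
    with open(path, "rb") as f:
        return f.read(8) == HDF5_SIGNATURE

path = "dataset.hdf5"  # hypothetical name of the WU input file in the BOINC slot dir
print(starts_with_hdf5_signature(path))
# h5py.is_hdf5() performs the full check (the signature may also sit at
# offset 512, 1024, ... when the file has a user block).
print(h5py.is_hdf5(path))
```

If a freshly downloaded input fails this check, the file itself is corrupt or truncated rather than anything being wrong on the client side.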
| Joined: 27 Sep 20 · Posts: 6 · Credit: 1,025,663 · RAC: 0 |
Some more detail I didn't include about the errors above: Machine Learning Dataset Generator (test) v9.96 keeps crashing on Ubuntu 20.04.3 LTS [5.4.0-88-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9.2)]. |
| Joined: 3 Aug 21 · Posts: 3 · Credit: 109,720 · RAC: 0 |
Thanks for posting an error log. The test queue currently has some erroneous scripts in the backend: I'm volunteering on this project to refactor the backend, and the latest batch I sent out caused cascading errors, which is what usually happens when a workunit fails, gets resent to another host, and fails again. For the next month or so you should expect these sorts of errors while I test feature branches on the test queue. As for the code, once I have implemented the new validation script, we have agreed to open-source the backend. You can follow me and my activity here: https://gitlab.com/delta1512 |
| Joined: 3 Aug 21 · Posts: 3 · Credit: 109,720 · RAC: 0 |
An update on this: it appears that John accidentally set the cronjob/service to automatically generate work on all queues instead of just the production queue, so all new work is being sent through the erroneous backend scripts. The issue should subside later today when he disables that generation, but you may still see the cascading effects. |
| Joined: 4 Apr 21 · Posts: 7 · Credit: 415,093 · RAC: 5 |
Thanks for disclosing this. I was worried for a moment that I had done something to destabilize my worker rig, but I'm glad it's not just me. I'm happy to run test units, but when I see the red line I go investigating... it saves so much time to find information like this from the project developers. Keep on breaking stuff for science! :D |
| Joined: 19 Sep 20 · Posts: 1 · Credit: 7,524,856 · RAC: 2,920 |
I have the same errors as mentioned above on my Raspberry Pi 4s. At first I got them only with 9.96, so I disabled the beta WUs, but now the 9.90 units are erroring out too. Does this need fixing on the server side? |
| Joined: 23 Sep 20 · Posts: 24 · Credit: 15,318,198 · RAC: 1,992 |
I have had 125 of these errors overnight. |
| Joined: 2 May 21 · Posts: 9 · Credit: 2,016,461 · RAC: 2 |
> I have the same errors as mentioned above on my Raspberry Pi 4s. At first I got them only with 9.96, so I disabled the beta WUs, but now the 9.90 units are erroring out too. Does this need fixing on the server side?

Same on my Pi 4s: 675 errors over the last couple of days. I've suspended MLC for now, as it's messing up my schedules for other projects. |
| Joined: 30 Jun 20 · Posts: 462 · Credit: 21,406,548 · RAC: 0 |
FYI, this is the issue discussed in the news story. THANK YOU for posting about it. It's an issue on our backend, not yours. We've been trying to consolidate and clean up a hodgepodge of 20+ scripts that create, validate, and appraise WUs for all three queues and all four datasets into a small few. We've been running the updated, consolidated scripts on mldstest for a while, and when we flipped the switch on the main queue we somehow broke BOTH the mlds queue and the test queue, in that they were both sending out corrupt WUs. It looks like part of the corruption is that the WUs don't properly specify where the hdf5 dataset file is located, so the client can't find it. We've managed to cancel all outstanding WUs on both queues and are working on a fix. Please stay tuned over the next 24 hours while we clean up our mess and get back online. I could easily revert to the old code, but for the overall health and maintenance of the project it would be better to track down where the new code went wrong, so we're trying that first. |
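Since the corruption described above comes down to workunits that reference a missing or invalid HDF5 dataset file, one natural guard for the consolidated generation scripts is a pre-flight check on each dataset before the WU is created. A minimal sketch, assuming a Python-based generator; the function and variable names are hypothetical, as the MLC@Home backend scripts are not yet public:

```python
import os
import h5py

def validate_dataset_file(path):
    """Raise before work generation if the staged dataset is missing or not valid HDF5."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"dataset file not staged where the WU expects it: {path}")
    if not h5py.is_hdf5(path):
        raise ValueError(f"file signature not found (not an HDF5 file): {path}")
    # Opening the file also catches a truncated superblock or corrupt metadata.
    with h5py.File(path, "r") as f:
        if not list(f.keys()):
            raise ValueError(f"HDF5 file contains no groups or datasets: {path}")

# Hypothetical use inside a work-generator loop: only create the WU if this passes.
# validate_dataset_file(staged_path)
```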