Task 5096566

Name ParityModified-1607233792-4279-56_3
Workunit 3057528
Created 27 Apr 2021, 1:11:50 UTC
Sent 27 Apr 2021, 1:24:40 UTC
Report deadline 4 May 2021, 1:24:40 UTC
Received 27 Apr 2021, 21:37:33 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 193 (0x000000C1) EXIT_SIGNAL
Computer ID 11158
Run time 1 min 26 sec
CPU time
Validate state Invalid
Credit 0.00
Device peak FLOPS 13,837.92 GFLOPS
Application version Machine Learning Dataset Generator (test) v9.80 (amdrocm) x86_64-pc-linux-gnu
Peak working set size 1.61 GB
Peak swap size 8.18 GB
Peak disk usage 2.25 GB

Stderr output

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)</message>
<stderr_txt>
DEBUG: Args: ../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__rocm -c --maxepoch 128 
nthreads: 1 gpudev: 0
Re-exec()-ing to set environment correctly
14:35:03 (48648): start_timer_thread(): pthread_create(): 22
Machine Learning Dataset Generator v9.80 (Linux/x86_64) (libTorch: release/1.7 GPU: Vega 20 [Radeon VII])
[2021-04-27 14:35:03	                main:442]	:	INFO	:	Set logging level to 1
[2021-04-27 14:35:03	                main:448]	:	INFO	:	Running in BOINC Client mode
[2021-04-27 14:35:03	                main:451]	:	INFO	:	Resolving all filenames
[2021-04-27 14:35:03	                main:459]	:	INFO	:	Resolved: dataset.hdf5 => ../../projects/www.mlcathome.org_mlcathome/ParityModified-train-val-dataset.hdf5 (exists = 1)
[2021-04-27 14:35:03	                main:459]	:	INFO	:	Resolved: model.cfg => ../../projects/www.mlcathome.org_mlcathome/ParityModified-1607233792-4279-56_3_r2051608797_1 (exists = 0)
[2021-04-27 14:35:03	                main:459]	:	INFO	:	Resolved: model-final.pt => ../../projects/www.mlcathome.org_mlcathome/ParityModified-1607233792-4279-56_3_r2051608797_0 (exists = 0)
[2021-04-27 14:35:03	                main:459]	:	INFO	:	Resolved: model-input.pt => ../../projects/www.mlcathome.org_mlcathome/ParityModified-1607233792-4279-56 (exists = 1)
[2021-04-27 14:35:03	                main:459]	:	INFO	:	Resolved: snapshot.pt => snapshot.pt (exists = 0)
[2021-04-27 14:35:03	                main:479]	:	INFO	:	Dataset filename: ../../projects/www.mlcathome.org_mlcathome/ParityModified-train-val-dataset.hdf5
[2021-04-27 14:35:03	                main:481]	:	INFO	:	Configuration: 
[2021-04-27 14:35:03	                main:482]	:	INFO	:	    Model type: GRU
[2021-04-27 14:35:03	                main:483]	:	INFO	:	    Validation Loss Threshold: 0.0001
[2021-04-27 14:35:03	                main:484]	:	INFO	:	    Max Epochs: 128
[2021-04-27 14:35:03	                main:485]	:	INFO	:	    Batch Size: 128
[2021-04-27 14:35:03	                main:486]	:	INFO	:	    Learning Rate: 0.01
[2021-04-27 14:35:03	                main:487]	:	INFO	:	    Patience: 10
[2021-04-27 14:35:03	                main:488]	:	INFO	:	    Hidden Width: 12
[2021-04-27 14:35:03	                main:489]	:	INFO	:	    # Recurrent Layers: 4
[2021-04-27 14:35:03	                main:490]	:	INFO	:	    # Backend Layers: 4
[2021-04-27 14:35:03	                main:491]	:	INFO	:	    # Threads: 1
[2021-04-27 14:35:03	                main:493]	:	INFO	:	Preparing Dataset
[2021-04-27 14:35:03	load_hdf5_ds_into_tensor:28]	:	INFO	:	Loading Dataset /Xt from ../../projects/www.mlcathome.org_mlcathome/ParityModified-train-val-dataset.hdf5 into memory
[2021-04-27 14:35:04	load_hdf5_ds_into_tensor:28]	:	INFO	:	Loading Dataset /Yt from ../../projects/www.mlcathome.org_mlcathome/ParityModified-train-val-dataset.hdf5 into memory
[2021-04-27 14:35:06	                load:106]	:	INFO	:	Successfully loaded dataset of 2048 examples into memory.
[2021-04-27 14:35:06	load_hdf5_ds_into_tensor:28]	:	INFO	:	Loading Dataset /Xv from ../../projects/www.mlcathome.org_mlcathome/ParityModified-train-val-dataset.hdf5 into memory
[2021-04-27 14:35:06	load_hdf5_ds_into_tensor:28]	:	INFO	:	Loading Dataset /Yv from ../../projects/www.mlcathome.org_mlcathome/ParityModified-train-val-dataset.hdf5 into memory
[2021-04-27 14:35:06	                load:106]	:	INFO	:	Successfully loaded dataset of 512 examples into memory.
[2021-04-27 14:35:06	                main:501]	:	INFO	:	Creating Model
[2021-04-27 14:35:06	                main:514]	:	INFO	:	Preparing config file
[2021-04-27 14:35:06	                main:526]	:	INFO	:	Creating new config file
[2021-04-27 14:35:06	                main:545]	:	INFO	:	This is a continuation WU, loading previous network
[2021-04-27 14:35:07	                main:566]	:	INFO	:	Loading DataLoader into Memory
[2021-04-27 14:35:07	                main:569]	:	INFO	:	Starting Training
/src/external/hip-on-vdi/rocclr/hip_global.cpp:69: guarantee(false && "Cannot find Symbol")
SIGABRT: abort called
Stack trace (30 frames):
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__rocm(+0x37f44c)[0x565165b2a44c]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7fea1adf03c0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fe9e9be118b]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fe9e9bc0859]
./libamdhip64.so.3(+0x15f92f)[0x7fe9e8e8f92f]
./libamdhip64.so.3(+0x8b380)[0x7fe9e8dbb380]
./libamdhip64.so.3(+0x8bac7)[0x7fe9e8dbbac7]
./libamdhip64.so.3(+0x5df5d)[0x7fe9e8d8df5d]
./libamdhip64.so.3(+0x1055ec)[0x7fe9e8e355ec]
./libamdhip64.so.3(hipLaunchKernel+0x172)[0x7fe9e8e18242]
./libtorch_hip.so(+0x165ca89)[0x7fe9eb81aa89]
./libtorch_hip.so(_ZN2at6native17gpu_reduce_kernelIffLi4ENS0_14func_wrapper_tIfZNS0_11sum_functorIfffEclERNS_14TensorIteratorEEUlffE_EEdEEvS6_RKT2_T3_PNS0_18AccumulationBufferEl+0xb3c)[0x7fe9eb82d2cc]
./libtorch_hip.so(+0x16594a9)[0x7fe9eb8174a9]
./libtorch_cpu.so(_ZN2at6native7sum_outERNS_6TensorERKS1_N3c108ArrayRefIlEEbNS5_8optionalINS5_10ScalarTypeEEE+0x130)[0x7fea1520ec50]
./libtorch_cpu.so(_ZN2at6native3sumERKNS_6TensorEN3c108ArrayRefIlEEbNS4_8optionalINS4_10ScalarTypeEEE+0x5b)[0x7fea1520f1cb]
./libtorch_hip.so(+0x3fec54)[0x7fe9ea5bcc54]
./libtorch_hip.so(+0x440eef)[0x7fe9ea5feeef]
./libtorch_cpu.so(+0xfff171)[0x7fea15795171]
./libtorch_cpu.so(_ZN2at3sumERKNS_6TensorEN3c108ArrayRefIlEEbNS3_8optionalINS3_10ScalarTypeEEE+0x100)[0x7fea156a9a50]
./libtorch_cpu.so(+0x1c1355e)[0x7fea163a955e]
./libtorch_cpu.so(+0x62217f)[0x7fea14db817f]
./libtorch_cpu.so(+0xfff171)[0x7fea15795171]
./libtorch_cpu.so(_ZNK2at6Tensor3sumEN3c108ArrayRefIlEEbNS1_8optionalINS1_10ScalarTypeEEE+0x100)[0x7fea158e8dc0]
./libtorch_cpu.so(+0x1fa1158)[0x7fea16737158]
./libtorch_cpu.so(_ZN5torch8autograd6Engine17evaluate_functionERSt10shared_ptrINS0_9GraphTaskEEPNS0_4NodeERNS0_11InputBufferERKS2_INS0_10ReadyQueueEE+0x4fc)[0x7fea1673d59c]
./libtorch_cpu.so(_ZN5torch8autograd6Engine11thread_mainERKSt10shared_ptrINS0_9GraphTaskEE+0x4e9)[0x7fea1673f219]
./libtorch_cpu.so(_ZN5torch8autograd6Engine11thread_initEiRKSt10shared_ptrINS0_10ReadyQueueEEb+0x99)[0x7fea16736529]
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__rocm(+0x3e48cf)[0x565165b8f8cf]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7fea1ade4609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fe9e9cbd293]

Exiting...

</stderr_txt>
]]>


©2022 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)