Task 5130863

Name ParityModified-1607233793-4287-35_2
Workunit 3076243
Created 2 May 2021, 10:09:02 UTC
Sent 2 May 2021, 10:18:54 UTC
Report deadline 9 May 2021, 10:18:54 UTC
Received 3 May 2021, 8:42:25 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED
Computer ID 11158
Run time 10 hours 57 min 44 sec
CPU time
Validate state Invalid
Credit 0.00
Device peak FLOPS 13,837.92 GFLOPS
Application version Machine Learning Dataset Generator (test) v9.80 (amdrocm)
x86_64-pc-linux-gnu
Peak working set size 1.58 GB
Peak swap size 8.10 GB
Peak disk usage 2.25 GB

Stderr output

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
exceeded elapsed time limit 39418.03 (400000.00G/10.15G)</message>
<stderr_txt>
DEBUG: Args: ../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__rocm -c --maxepoch 128 
nthreads: 1 gpudev: 0
Re-exec()-ing to set environment correctly
14:44:10 (10214): start_timer_thread(): pthread_create(): 22
Machine Learning Dataset Generator v9.80 (Linux/x86_64) (libTorch: release/1.7 GPU: Vega 20 [Radeon VII])
[2021-05-02 14:44:10	                main:442]	:	INFO	:	Set logging level to 1
[2021-05-02 14:44:10	                main:448]	:	INFO	:	Running in BOINC Client mode
[2021-05-02 14:44:10	                main:451]	:	INFO	:	Resolving all filenames
[2021-05-02 14:44:10	                main:459]	:	INFO	:	Resolved: dataset.hdf5 => ../../projects/www.mlcathome.org_mlcathome/ParityModified-train-val-dataset.hdf5 (exists = 1)
[2021-05-02 14:44:10	                main:459]	:	INFO	:	Resolved: model.cfg => ../../projects/www.mlcathome.org_mlcathome/ParityModified-1607233793-4287-35_2_r2144645146_1 (exists = 0)
[2021-05-02 14:44:10	                main:459]	:	INFO	:	Resolved: model-final.pt => ../../projects/www.mlcathome.org_mlcathome/ParityModified-1607233793-4287-35_2_r2144645146_0 (exists = 0)
[2021-05-02 14:44:10	                main:459]	:	INFO	:	Resolved: model-input.pt => ../../projects/www.mlcathome.org_mlcathome/ParityModified-1607233793-4287-35 (exists = 1)
[2021-05-02 14:44:10	                main:459]	:	INFO	:	Resolved: snapshot.pt => snapshot.pt (exists = 0)
[2021-05-02 14:44:10	                main:479]	:	INFO	:	Dataset filename: ../../projects/www.mlcathome.org_mlcathome/ParityModified-train-val-dataset.hdf5
[2021-05-02 14:44:10	                main:481]	:	INFO	:	Configuration: 
[2021-05-02 14:44:10	                main:482]	:	INFO	:	    Model type: GRU
[2021-05-02 14:44:10	                main:483]	:	INFO	:	    Validation Loss Threshold: 0.0001
[2021-05-02 14:44:10	                main:484]	:	INFO	:	    Max Epochs: 128
[2021-05-02 14:44:10	                main:485]	:	INFO	:	    Batch Size: 128
[2021-05-02 14:44:10	                main:486]	:	INFO	:	    Learning Rate: 0.01
[2021-05-02 14:44:10	                main:487]	:	INFO	:	    Patience: 10
[2021-05-02 14:44:10	                main:488]	:	INFO	:	    Hidden Width: 12
[2021-05-02 14:44:10	                main:489]	:	INFO	:	    # Recurrent Layers: 4
[2021-05-02 14:44:10	                main:490]	:	INFO	:	    # Backend Layers: 4
[2021-05-02 14:44:10	                main:491]	:	INFO	:	    # Threads: 1
[2021-05-02 14:44:10	                main:493]	:	INFO	:	Preparing Dataset
[2021-05-02 14:44:10	load_hdf5_ds_into_tensor:28]	:	INFO	:	Loading Dataset /Xt from ../../projects/www.mlcathome.org_mlcathome/ParityModified-train-val-dataset.hdf5 into memory
[2021-05-02 14:44:10	load_hdf5_ds_into_tensor:28]	:	INFO	:	Loading Dataset /Yt from ../../projects/www.mlcathome.org_mlcathome/ParityModified-train-val-dataset.hdf5 into memory
[2021-05-02 14:44:11	                load:106]	:	INFO	:	Successfully loaded dataset of 2048 examples into memory.
[2021-05-02 14:44:11	load_hdf5_ds_into_tensor:28]	:	INFO	:	Loading Dataset /Xv from ../../projects/www.mlcathome.org_mlcathome/ParityModified-train-val-dataset.hdf5 into memory
[2021-05-02 14:44:11	load_hdf5_ds_into_tensor:28]	:	INFO	:	Loading Dataset /Yv from ../../projects/www.mlcathome.org_mlcathome/ParityModified-train-val-dataset.hdf5 into memory
[2021-05-02 14:44:11	                load:106]	:	INFO	:	Successfully loaded dataset of 512 examples into memory.
[2021-05-02 14:44:11	                main:501]	:	INFO	:	Creating Model
[2021-05-02 14:44:11	                main:514]	:	INFO	:	Preparing config file
[2021-05-02 14:44:11	                main:526]	:	INFO	:	Creating new config file
[2021-05-02 14:44:11	                main:545]	:	INFO	:	This is a continuation WU, loading previous network
[2021-05-02 14:44:11	                main:566]	:	INFO	:	Loading DataLoader into Memory
[2021-05-02 14:44:11	                main:569]	:	INFO	:	Starting Training

</stderr_txt>
]]>
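
The abort message above carries its own arithmetic: a limit in seconds followed by two figures in parentheses. The sketch below assumes, based on how the BOINC client usually formats this message, that the first figure is the workunit's floating-point operation bound (in units of 1e9 operations) and the second is the speed the client estimated for this application (in GFLOPS), so that their quotient gives the limit. The helper names `parse_time_limit` and `to_seconds` are hypothetical and exist only for this illustration.

```python
import re


def parse_time_limit(message: str):
    """Return (limit_seconds, bound_gflop, speed_gflops) from the abort message.

    Assumption: the parenthesised pair is (operation bound / estimated speed),
    both printed in units of 1e9.
    """
    pattern = r"exceeded elapsed time limit ([\d.]+) \(([\d.]+)G/([\d.]+)G\)"
    match = re.search(pattern, message)
    if match is None:
        raise ValueError("not an elapsed-time-limit message")
    limit, bound, speed = (float(group) for group in match.groups())
    return limit, bound, speed


def to_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Convert an h/m/s run time, as shown in the task header, to seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Values copied from the task report above.
message = "exceeded elapsed time limit 39418.03 (400000.00G/10.15G)"
limit, bound, speed = parse_time_limit(message)

# The printed speed is rounded to two decimals, so the quotient only
# approximately reproduces the printed limit.
recomputed = bound / speed

# "Run time 10 hours 57 min 44 sec" from the header.
run_time = to_seconds(10, 57, 44)

print(f"reported limit : {limit:.2f} s")
print(f"bound / speed  : {recomputed:.2f} s")
print(f"run time       : {run_time} s")
print(f"over the limit : {run_time > limit}")
```

Under these assumptions the reported run time is slightly longer than the limit, which is consistent with the exit status `EXIT_TIME_LIMIT_EXCEEDED`. The small gap between the printed limit and the recomputed quotient comes from the rounding of the speed figure in the message.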


©2022 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)