Task 12936807

Name ParityModified-1645998049-13744-3-0_0
Workunit 10070373
Created 11 Mar 2022, 17:01:21 UTC
Sent 11 Mar 2022, 18:39:52 UTC
Report deadline 19 Mar 2022, 18:39:52 UTC
Received 11 Mar 2022, 18:44:36 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 193 (0x000000C1) EXIT_SIGNAL
Computer ID 13797
Run time 4 sec
CPU time 3 sec
Validate state Invalid
Credit 0.00
Device peak FLOPS 5,488.32 GFLOPS
Application version Machine Learning Dataset Generator (GPU) v9.80 (cuda10200) x86_64-pc-linux-gnu
Peak disk usage 2.99 GB

Stderr output

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)</message>
<stderr_txt>
DEBUG: Args: ../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200 -c --maxepoch 2048 
nthreads: 1 gpudev: 0
Re-exec()-ing to set environment correctly
Machine Learning Dataset Generator v9.80 (Linux/x86_64) (libTorch: release/1.7 GPU: NVIDIA GeForce GTX 1660 Ti)
[2022-03-11 19:40:50	                main:442]	:	INFO	:	Set logging level to 1
[2022-03-11 19:40:50	                main:448]	:	INFO	:	Running in BOINC Client mode
[2022-03-11 19:40:50	                main:451]	:	INFO	:	Resolving all filenames
[2022-03-11 19:40:50	                main:459]	:	INFO	:	Resolved: dataset.hdf5 => dataset.hdf5 (exists = 1)
[2022-03-11 19:40:50	                main:459]	:	INFO	:	Resolved: model.cfg => model.cfg (exists = 0)
[2022-03-11 19:40:50	                main:459]	:	INFO	:	Resolved: model-final.pt => model-final.pt (exists = 0)
[2022-03-11 19:40:50	                main:459]	:	INFO	:	Resolved: model-input.pt => model-input.pt (exists = 1)
[2022-03-11 19:40:50	                main:459]	:	INFO	:	Resolved: snapshot.pt => snapshot.pt (exists = 0)
[2022-03-11 19:40:50	                main:479]	:	INFO	:	Dataset filename: dataset.hdf5
[2022-03-11 19:40:50	                main:481]	:	INFO	:	Configuration: 
[2022-03-11 19:40:50	                main:482]	:	INFO	:	    Model type: GRU
[2022-03-11 19:40:50	                main:483]	:	INFO	:	    Validation Loss Threshold: 0.0001
[2022-03-11 19:40:50	                main:484]	:	INFO	:	    Max Epochs: 2048
[2022-03-11 19:40:50	                main:485]	:	INFO	:	    Batch Size: 128
[2022-03-11 19:40:50	                main:486]	:	INFO	:	    Learning Rate: 0.01
[2022-03-11 19:40:50	                main:487]	:	INFO	:	    Patience: 10
[2022-03-11 19:40:50	                main:488]	:	INFO	:	    Hidden Width: 12
[2022-03-11 19:40:50	                main:489]	:	INFO	:	    # Recurrent Layers: 4
[2022-03-11 19:40:50	                main:490]	:	INFO	:	    # Backend Layers: 4
[2022-03-11 19:40:50	                main:491]	:	INFO	:	    # Threads: 1
[2022-03-11 19:40:50	                main:493]	:	INFO	:	Preparing Dataset
[2022-03-11 19:40:50	load_hdf5_ds_into_tensor:28]	:	INFO	:	Loading Dataset /Xt from dataset.hdf5 into memory
[2022-03-11 19:40:50	load_hdf5_ds_into_tensor:28]	:	INFO	:	Loading Dataset /Yt from dataset.hdf5 into memory
[2022-03-11 19:40:52	                load:106]	:	INFO	:	Successfully loaded dataset of 2048 examples into memory.
[2022-03-11 19:40:52	load_hdf5_ds_into_tensor:28]	:	INFO	:	Loading Dataset /Xv from dataset.hdf5 into memory
[2022-03-11 19:40:52	load_hdf5_ds_into_tensor:28]	:	INFO	:	Loading Dataset /Yv from dataset.hdf5 into memory
[2022-03-11 19:40:52	                load:106]	:	INFO	:	Successfully loaded dataset of 512 examples into memory.
[2022-03-11 19:40:52	                main:501]	:	INFO	:	Creating Model
[2022-03-11 19:40:52	                main:514]	:	INFO	:	Preparing config file
[2022-03-11 19:40:52	                main:526]	:	INFO	:	Creating new config file
[2022-03-11 19:40:52	                main:545]	:	INFO	:	This is a continuation WU, loading previous network
[2022-03-11 19:40:53	                main:566]	:	INFO	:	Loading DataLoader into Memory
[2022-03-11 19:40:53	                main:569]	:	INFO	:	Starting Training
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Exception raised from createCublasHandle at /home/mlcbuild/git/pytorch-build/build-cuda/pytorch-prefix/src/pytorch/aten/src/ATen/cuda/CublasHandlePool.cpp:8 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f2351a6499b in ./libc10.so)
frame #1: <unknown function> + 0x84963d (0x7f22d411863d in ./libtorch_cuda.so)
frame #2: at::cuda::getCurrentCUDABlasHandle() + 0xd86 (0x7f22d4119796 in ./libtorch_cuda.so)
frame #3: <unknown function> + 0x833a42 (0x7f22d4102a42 in ./libtorch_cuda.so)
frame #4: at::native::(anonymous namespace)::addmm_out_cuda_impl(at::Tensor&, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar, c10::Scalar) + 0xfca (0x7f22d536dbba in ./libtorch_cuda.so)
frame #5: at::native::mm_cuda(at::Tensor const&, at::Tensor const&) + 0xc5 (0x7f22d536fc65 in ./libtorch_cuda.so)
frame #6: <unknown function> + 0x866150 (0x7f22d4135150 in ./libtorch_cuda.so)
frame #7: <unknown function> + 0x5d3704 (0x7f234be6b704 in ./libtorch_cpu.so)
frame #8: at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, at::Tensor const&)> const&, at::Tensor const&, at::Tensor const&) const + 0xd0 (0x7f234c6dc100 in ./libtorch_cpu.so)
frame #9: at::mm(at::Tensor const&, at::Tensor const&) + 0x5b (0x7f234c62ab7b in ./libtorch_cpu.so)
frame #10: <unknown function> + 0x22c45d2 (0x7f234db5c5d2 in ./libtorch_cpu.so)
frame #11: <unknown function> + 0x5d3704 (0x7f234be6b704 in ./libtorch_cpu.so)
frame #12: at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, at::Tensor const&)> const&, at::Tensor const&, at::Tensor const&) const + 0xd0 (0x7f234c6dc100 in ./libtorch_cpu.so)
frame #13: at::Tensor::mm(at::Tensor const&) const + 0x5b (0x7f234c7dd3db in ./libtorch_cpu.so)
frame #14: <unknown function> + 0x89008d (0x7f234c12808d in ./libtorch_cpu.so)
frame #15: at::native::matmul(at::Tensor const&, at::Tensor const&) + 0x4a (0x7f234c128a0a in ./libtorch_cpu.so)
frame #16: <unknown function> + 0xea5480 (0x7f234c73d480 in ./libtorch_cpu.so)
frame #17: <unknown function> + 0x2211d1d (0x7f234daa9d1d in ./libtorch_cpu.so)
frame #18: <unknown function> + 0x5d3704 (0x7f234be6b704 in ./libtorch_cpu.so)
frame #19: at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, at::Tensor const&)> const&, at::Tensor const&, at::Tensor const&) const + 0xd0 (0x7f234c6dc100 in ./libtorch_cpu.so)
frame #20: at::Tensor::matmul(at::Tensor const&) const + 0x5b (0x7f234c7dd2fb in ./libtorch_cpu.so)
frame #21: torch::nn::LinearImpl::forward(at::Tensor const&) + 0xe1 (0x7f234e58aab1 in ./libtorch_cpu.so)
frame #22: <unknown function> + 0x93014 (0x55a3b3bca014 in ../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200)
frame #23: <unknown function> + 0x9a0e6 (0x55a3b3bd10e6 in ../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200)
frame #24: <unknown function> + 0x8ad10 (0x55a3b3bc1d10 in ../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200)
frame #25: __libc_start_main + 0xeb (0x7f22d330109b in /lib/x86_64-linux-gnu/libc.so.6)
frame #26: <unknown function> + 0x8675a (0x55a3b3bbd75a in ../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200)

SIGABRT: abort called
Stack trace (33 frames):
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(+0x37df9c)[0x55a3b3eb4f9c]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f23519fc730]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7f22d33147bb]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x121)[0x7f22d32ff535]
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x135)[0x55a3b3f667f5]
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(+0x398846)[0x55a3b3ecf846]
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(+0x398891)[0x55a3b3ecf891]
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(__cxa_rethrow+0x49)[0x55a3b3ecd919]
./libtorch_cuda.so(_ZN2at4cuda24getCurrentCUDABlasHandleEv+0x13d4)[0x7f22d4119de4]
./libtorch_cuda.so(+0x833a42)[0x7f22d4102a42]
./libtorch_cuda.so(_ZN2at6native84_GLOBAL__N__60_tmpxft_000025a8_00000000_12_LinearAlgebra_compute_75_cpp1_ii_5e5bd7fb19addmm_out_cuda_implERNS_6TensorERKS2_S5_S5_N3c106ScalarES7_+0xfca)[0x7f22d536dbba]
./libtorch_cuda.so(_ZN2at6native7mm_cudaERKNS_6TensorES3_+0xc5)[0x7f22d536fc65]
./libtorch_cuda.so(+0x866150)[0x7f22d4135150]
./libtorch_cpu.so(+0x5d3704)[0x7f234be6b704]
./libtorch_cpu.so(_ZNK3c1010Dispatcher4callIN2at6TensorEJRKS3_S5_EEET_RKNS_19TypedOperatorHandleIFS6_DpT0_EEES9_+0xd0)[0x7f234c6dc100]
./libtorch_cpu.so(_ZN2at2mmERKNS_6TensorES2_+0x5b)[0x7f234c62ab7b]
./libtorch_cpu.so(+0x22c45d2)[0x7f234db5c5d2]
./libtorch_cpu.so(+0x5d3704)[0x7f234be6b704]
./libtorch_cpu.so(_ZNK3c1010Dispatcher4callIN2at6TensorEJRKS3_S5_EEET_RKNS_19TypedOperatorHandleIFS6_DpT0_EEES9_+0xd0)[0x7f234c6dc100]
./libtorch_cpu.so(_ZNK2at6Tensor2mmERKS0_+0x5b)[0x7f234c7dd3db]
./libtorch_cpu.so(+0x89008d)[0x7f234c12808d]
./libtorch_cpu.so(_ZN2at6native6matmulERKNS_6TensorES3_+0x4a)[0x7f234c128a0a]
./libtorch_cpu.so(+0xea5480)[0x7f234c73d480]
./libtorch_cpu.so(+0x2211d1d)[0x7f234daa9d1d]
./libtorch_cpu.so(+0x5d3704)[0x7f234be6b704]
./libtorch_cpu.so(_ZNK3c1010Dispatcher4callIN2at6TensorEJRKS3_S5_EEET_RKNS_19TypedOperatorHandleIFS6_DpT0_EEES9_+0xd0)[0x7f234c6dc100]
./libtorch_cpu.so(_ZNK2at6Tensor6matmulERKS0_+0x5b)[0x7f234c7dd2fb]
./libtorch_cpu.so(_ZN5torch2nn10LinearImpl7forwardERKN2at6TensorE+0xe1)[0x7f234e58aab1]
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(+0x93014)[0x55a3b3bca014]
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(+0x9a0e6)[0x55a3b3bd10e6]
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(+0x8ad10)[0x55a3b3bc1d10]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7f22d330109b]
../../projects/www.mlcathome.org_mlcathome/mlds-gpu_9.80_x86_64-pc-linux-gnu__cuda10200(+0x8675a)[0x55a3b3bbd75a]

Exiting...

</stderr_txt>
]]>


©2022 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)