mnist_training user store segfault in mvout #56

Open
kimjungwow opened this issue Jan 10, 2022 · 0 comments

kimjungwow commented Jan 10, 2022

Describe the bug
Hi! 😀
I tried MNIST training in systolic_runner/mnist_training, but I got a user store segfault in gemmini_extended_mvout.

  • I executed spike --extension=gemmini pk mnist_train --model_name mnist_conv_w_batchnorm.onnx --train_data_dir mnist/mnist_data/ --num_train_steps 10 -x 1.
  • I downloaded the training data from ORT Training: Training mnist using provided sample? microsoft/onnxruntime#3706 (comment).
  • When I ran with the -x 0 option (CPU only), I could successfully train this model.
  • I generated the ONNX model with my_create_mnist_w_batchnorm.py, which differs slightly from create_mnist_w_batchnorm.py. (The ONNX model from the original script raised an error because the dimensions of the training data did not match the model's input.) My patch is below; a small sketch for sanity-checking the patched model's input shapes follows the error log.
--- create_mnist_w_batchnorm.py	2021-12-27 06:59:56.445299810 +0000
+++ my_create_mnist_w_batchnorm.py	2022-01-10 17:25:47.868769954 +0000
@@ -24,6 +24,8 @@
 
 batch_size = -1
 
+Inputshape = helper.make_tensor(name="Inputshape", data_type=onnx.TensorProto.INT64, dims=[4], vals=[batch_size, 1, 28, 28])
+
 W1_dims = [8, 1, 5, 5]
 W2_dims = [16, 8, 5, 5]
 W3_dims = [256, 10]
@@ -50,7 +52,8 @@
 
 node0 = helper.make_node('BatchNormalization', inputs=['T1', 's', 'bias', 'mean', 'var'], outputs=['T1_bn'])
 
-node1 = helper.make_node('Conv', inputs=['X', 'W1', 'B1'], outputs=['T1'], kernel_shape=[5,5], strides=[1,1], pads=[2,2,2,2])
+reshapenode = helper.make_node("Reshape", inputs=["X", "Inputshape"], outputs=["X1"])
+node1 = helper.make_node('Conv', inputs=['X1', 'W1', 'B1'], outputs=['T1'], kernel_shape=[5,5], strides=[1,1], pads=[2,2,2,2])
 node2 = helper.make_node('Relu', inputs=['T1_bn'], outputs=['T2'])
 node3 = helper.make_node('MaxPool', inputs=['T2'], outputs=['T3'], kernel_shape=[2,2], strides=[2,2])
 
@@ -63,13 +66,13 @@
 node8 = helper.make_node('Gemm', inputs=['T7', 'W3', 'B3'], outputs=['predictions'])
 
 graph = helper.make_graph(
-    [node1, node0, node2, node3, node4, node5, node6, node7, node8],
+    [reshapenode, node1, node0, node2, node3, node4, node5, node6, node7, node8],
     'mnist_conv',
     [ helper.make_tensor_value_info('s', TensorProto.FLOAT, ([8])),
      helper.make_tensor_value_info('bias', TensorProto.FLOAT, ([8])),
      helper.make_tensor_value_info('mean', TensorProto.FLOAT, ([8])),
     helper.make_tensor_value_info('var', TensorProto.FLOAT, ([8])),
-     helper.make_tensor_value_info('X', TensorProto.FLOAT, ([batch_size, 1, 28, 28])),
+     helper.make_tensor_value_info('X', TensorProto.FLOAT, ([batch_size, 784])),
      helper.make_tensor_value_info('W1', TensorProto.FLOAT, W1_dims),
      helper.make_tensor_value_info('W2', TensorProto.FLOAT, W2_dims),
      helper.make_tensor_value_info('W3', TensorProto.FLOAT, W3_dims),
@@ -77,9 +80,10 @@
      helper.make_tensor_value_info('B2', TensorProto.FLOAT, B2_dims),
      helper.make_tensor_value_info('B3', TensorProto.FLOAT, B3_dims),
      helper.make_tensor_value_info('shape', TensorProto.INT64, [2]),
+     helper.make_tensor_value_info('Inputshape', TensorProto.INT64, [4]),
     ],
     [helper.make_tensor_value_info('predictions', TensorProto.FLOAT, ([batch_size, 10]))],
-    [s, bias, mean, var, W1, W2, W3, B1, B2, B3, shape]
+    [s, bias, mean, var, W1, W2, W3, B1, B2, B3, shape, Inputshape]
 )
 original_model = helper.make_model(graph, producer_name='onnx-examples')
  • Below is the error message, including my own printf output added for debugging. I used FP32.
Gemmini extension configured with:
    dim = 16
bbl loader
Loaded runner program
Setting up logger
Setting up env
Setting up training params
Setting up data
Loading MNIST data from folder mnist/mnist_data/
Preparing data ...
Preparing data: done
#training set size = 60000 
#test set size = 10000 
Creating training runner
Initializing training runner
1970-01-01 00:00:12.989272862 [W:onnxruntime:, graph.cc:1074 Graph] Initializer s appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:12.989337471 [W:onnxruntime:, graph.cc:1074 Graph] Initializer bias appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:12.989387912 [W:onnxruntime:, graph.cc:1074 Graph] Initializer mean appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:12.989436124 [W:onnxruntime:, graph.cc:1074 Graph] Initializer var appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:12.989486858 [W:onnxruntime:, graph.cc:1074 Graph] Initializer W1 appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:12.989540649 [W:onnxruntime:, graph.cc:1074 Graph] Initializer W2 appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:12.989590741 [W:onnxruntime:, graph.cc:1074 Graph] Initializer W3 appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:12.989639219 [W:onnxruntime:, graph.cc:1074 Graph] Initializer B1 appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:12.989687503 [W:onnxruntime:, graph.cc:1074 Graph] Initializer B2 appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:12.989736087 [W:onnxruntime:, graph.cc:1074 Graph] Initializer B3 appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:12.989784748 [W:onnxruntime:, graph.cc:1074 Graph] Initializer shape appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:12.989836510 [W:onnxruntime:, graph.cc:1074 Graph] Initializer Xshape appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:13.730676103 [W:onnxruntime:, graph.cc:84 MergeShapeInfo] Error merging shape info for output. 'loss' source:{} target:{1}. Falling back to lenient merge.
Starting training
>>>> my_debug gemmini_extended_config_st act: 0, scale : 1.000000, stride_C : 784, sizeof_C : 4
>>>> my debug C_dram_addr : 0x263f5040, C_sp_addr : c0000000, rows : 8, cols : 8
z  0000000000000000 ra 0000000000da65ec sp 0000003fffffa830 gp 0000000001c2a4f8
tp 0000000001c9b500 t0 000000000000000a t1 0000000000002000 t2 0000000000000001
s0 0000003fffffaa00 s1 0000000000000002 a0 000000000000005c a1 0000000000000000
a2 000000000000005c a3 0000000000000000 a4 00080008c0000000 a5 00000000263f5040
a6 0000000000000064 a7 0000000000000040 s2 0000000000000310 s3 0000000000000019
s4 0000000000000000 s5 0000000000000002 s6 0000000000000001 s7 0000000020403280
s8 0000000000000001 s9 000000000000001c sA 000000000000001c sB 0000000000000005
t3 0000000000000000 t4 0000000000000008 t5 0000000000000000 t6 0000000000000020
pc 0000000000da6606 va/inst 3f800000263f5c80 sr 8000000200006020
User store segfault @ 0x3f800000263f5c80
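
For reference, below is a minimal sketch (not part of the repro itself) of one way to double-check the declared input shapes of the patched model. It assumes the onnx Python package is available and that mnist_conv_w_batchnorm.onnx is the file passed to --model_name above.

import onnx

# Load the model produced by my_create_mnist_w_batchnorm.py and print each
# graph input's declared shape. With the patch, 'X' should be declared as
# [-1, 784] (batch_size = -1 placeholder) and 'Inputshape' as a 4-element
# INT64 input feeding the Reshape node.
model = onnx.load("mnist_conv_w_batchnorm.onnx")
onnx.checker.check_model(model)
for inp in model.graph.input:
    dims = [d.dim_value if d.HasField("dim_value") else d.dim_param
            for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)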

System information

  • OS Platform and Distribution : Linux Ubuntu 18.04
  • ONNX Runtime installed from (source or binary):
  • ONNX Runtime version: 0c8c9b4
  • Python version: 3.6.9
  • GCC/Compiler version (if compiling from source): 9.2.0

To Reproduce

EDIT: I also checked that the pc (0000000000da6606) in the user store segfault message comes from gemmini_extended_mvout() in systolic_include.h:sp_tiled_matmul_os.
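
In case it is useful, here is a small sketch of how that pc can be mapped back to a function/line. It assumes the riscv64-unknown-linux-gnu binutils and that mnist_train is the unstripped ELF loaded by pk; the toolchain prefix and binary path may differ in your setup.

import subprocess

# Resolve the faulting pc from the spike register dump with addr2line.
# Since gemmini_extended_mvout is a macro, this should point at the
# enclosing function (sp_tiled_matmul_os in systolic_include.h).
pc = "0xda6606"
out = subprocess.check_output(
    ["riscv64-unknown-linux-gnu-addr2line", "-f", "-C", "-e", "mnist_train", pc],
    text=True)
print(out)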
