mnist_training user store segfault in mvout #56

Open
kimjungwow opened this issue Jan 10, 2022 · 0 comments

kimjungwow commented Jan 10, 2022

Describe the bug
Hi! 😀
I tried MNIST training in systolic_runner/mnist_training, but I got a user store segfault in gemmini_extended_mvout.

  • I executed spike --extension=gemmini pk mnist_train --model_name mnist_conv_w_batchnorm.onnx --train_data_dir mnist/mnist_data/ --num_train_steps 10 -x 1.
  • I downloaded the training data from ORT Training: Training mnist using provided sample? microsoft/onnxruntime#3706 (comment).
  • When I ran with the -x 0 option (CPU only), I could successfully train this model.
  • I generated the ONNX model with my_create_mnist_w_batchnorm.py, which differs slightly from create_mnist_w_batchnorm.py. (The ONNX model from the original script raised an error because the dimensions of the training data did not match the model's input.) My patch is below; a small sketch for sanity-checking the patched model's input shapes follows the error log.
--- create_mnist_w_batchnorm.py	2021-12-27 06:59:56.445299810 +0000
+++ my_create_mnist_w_batchnorm.py	2022-01-10 17:25:47.868769954 +0000
@@ -24,6 +24,8 @@
 
 batch_size = -1
 
+Inputshape = helper.make_tensor(name="Inputshape", data_type=onnx.TensorProto.INT64, dims=[4], vals=[batch_size, 1, 28, 28])
+
 W1_dims = [8, 1, 5, 5]
 W2_dims = [16, 8, 5, 5]
 W3_dims = [256, 10]
@@ -50,7 +52,8 @@
 
 node0 = helper.make_node('BatchNormalization', inputs=['T1', 's', 'bias', 'mean', 'var'], outputs=['T1_bn'])
 
-node1 = helper.make_node('Conv', inputs=['X', 'W1', 'B1'], outputs=['T1'], kernel_shape=[5,5], strides=[1,1], pads=[2,2,2,2])
+reshapenode = helper.make_node("Reshape", inputs=["X", "Inputshape"], outputs=["X1"])
+node1 = helper.make_node('Conv', inputs=['X1', 'W1', 'B1'], outputs=['T1'], kernel_shape=[5,5], strides=[1,1], pads=[2,2,2,2])
 node2 = helper.make_node('Relu', inputs=['T1_bn'], outputs=['T2'])
 node3 = helper.make_node('MaxPool', inputs=['T2'], outputs=['T3'], kernel_shape=[2,2], strides=[2,2])
 
@@ -63,13 +66,13 @@
 node8 = helper.make_node('Gemm', inputs=['T7', 'W3', 'B3'], outputs=['predictions'])
 
 graph = helper.make_graph(
-    [node1, node0, node2, node3, node4, node5, node6, node7, node8],
+    [reshapenode, node1, node0, node2, node3, node4, node5, node6, node7, node8],
     'mnist_conv',
     [ helper.make_tensor_value_info('s', TensorProto.FLOAT, ([8])),
      helper.make_tensor_value_info('bias', TensorProto.FLOAT, ([8])),
      helper.make_tensor_value_info('mean', TensorProto.FLOAT, ([8])),
     helper.make_tensor_value_info('var', TensorProto.FLOAT, ([8])),
-     helper.make_tensor_value_info('X', TensorProto.FLOAT, ([batch_size, 1, 28, 28])),
+     helper.make_tensor_value_info('X', TensorProto.FLOAT, ([batch_size, 784])),
      helper.make_tensor_value_info('W1', TensorProto.FLOAT, W1_dims),
      helper.make_tensor_value_info('W2', TensorProto.FLOAT, W2_dims),
      helper.make_tensor_value_info('W3', TensorProto.FLOAT, W3_dims),
@@ -77,9 +80,10 @@
      helper.make_tensor_value_info('B2', TensorProto.FLOAT, B2_dims),
      helper.make_tensor_value_info('B3', TensorProto.FLOAT, B3_dims),
      helper.make_tensor_value_info('shape', TensorProto.INT64, [2]),
+     helper.make_tensor_value_info('Inputshape', TensorProto.INT64, [4]),
     ],
     [helper.make_tensor_value_info('predictions', TensorProto.FLOAT, ([batch_size, 10]))],
-    [s, bias, mean, var, W1, W2, W3, B1, B2, B3, shape]
+    [s, bias, mean, var, W1, W2, W3, B1, B2, B3, shape, Inputshape]
 )
 original_model = helper.make_model(graph, producer_name='onnx-examples')
  • Below is the error message, including my own printf output added for debugging. I used FP32.
Gemmini extension configured with:
    dim = 16
bbl loader
Loaded runner program
Setting up logger
Setting up env
Setting up training params
Setting up data
Loading MNIST data from folder mnist/mnist_data/
Preparing data ...
Preparing data: done
#training set size = 60000 
#test set size = 10000 
Creating training runner
Initializing training runner
1970-01-01 00:00:12.989272862 [W:onnxruntime:, graph.cc:1074 Graph] Initializer s appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:12.989337471 [W:onnxruntime:, graph.cc:1074 Graph] Initializer bias appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:12.989387912 [W:onnxruntime:, graph.cc:1074 Graph] Initializer mean appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:12.989436124 [W:onnxruntime:, graph.cc:1074 Graph] Initializer var appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:12.989486858 [W:onnxruntime:, graph.cc:1074 Graph] Initializer W1 appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:12.989540649 [W:onnxruntime:, graph.cc:1074 Graph] Initializer W2 appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:12.989590741 [W:onnxruntime:, graph.cc:1074 Graph] Initializer W3 appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:12.989639219 [W:onnxruntime:, graph.cc:1074 Graph] Initializer B1 appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:12.989687503 [W:onnxruntime:, graph.cc:1074 Graph] Initializer B2 appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:12.989736087 [W:onnxruntime:, graph.cc:1074 Graph] Initializer B3 appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:12.989784748 [W:onnxruntime:, graph.cc:1074 Graph] Initializer shape appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:12.989836510 [W:onnxruntime:, graph.cc:1074 Graph] Initializer Xshape appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
1970-01-01 00:00:13.730676103 [W:onnxruntime:, graph.cc:84 MergeShapeInfo] Error merging shape info for output. 'loss' source:{} target:{1}. Falling back to lenient merge.
Starting training
>>>> my_debug gemmini_extended_config_st act: 0, scale : 1.000000, stride_C : 784, sizeof_C : 4
>>>> my debug C_dram_addr : 0x263f5040, C_sp_addr : c0000000, rows : 8, cols : 8
z  0000000000000000 ra 0000000000da65ec sp 0000003fffffa830 gp 0000000001c2a4f8
tp 0000000001c9b500 t0 000000000000000a t1 0000000000002000 t2 0000000000000001
s0 0000003fffffaa00 s1 0000000000000002 a0 000000000000005c a1 0000000000000000
a2 000000000000005c a3 0000000000000000 a4 00080008c0000000 a5 00000000263f5040
a6 0000000000000064 a7 0000000000000040 s2 0000000000000310 s3 0000000000000019
s4 0000000000000000 s5 0000000000000002 s6 0000000000000001 s7 0000000020403280
s8 0000000000000001 s9 000000000000001c sA 000000000000001c sB 0000000000000005
t3 0000000000000000 t4 0000000000000008 t5 0000000000000000 t6 0000000000000020
pc 0000000000da6606 va/inst 3f800000263f5c80 sr 8000000200006020
User store segfault @ 0x3f800000263f5c80
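
For reference, below is a minimal sketch (not part of the repro itself) of one way to double-check the declared input shapes of the patched model. It assumes the onnx Python package is available and that mnist_conv_w_batchnorm.onnx is the file passed to --model_name above.

import onnx

# Load the model produced by my_create_mnist_w_batchnorm.py and print each
# graph input's declared shape. With the patch, 'X' should be declared as
# [-1, 784] (batch_size = -1 placeholder) and 'Inputshape' as a 4-element
# INT64 input feeding the Reshape node.
model = onnx.load("mnist_conv_w_batchnorm.onnx")
onnx.checker.check_model(model)
for inp in model.graph.input:
    dims = [d.dim_value if d.HasField("dim_value") else d.dim_param
            for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)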

System information

  • OS Platform and Distribution : Linux Ubuntu 18.04
  • ONNX Runtime installed from (source or binary):
  • ONNX Runtime version: 0c8c9b4
  • Python version: 3.6.9
  • GCC/Compiler version (if compiling from source): 9.2.0

To Reproduce

EDIT: I also checked that the pc (0000000000da6606) in the user store segfault message comes from gemmini_extended_mvout() in systolic_include.h:sp_tiled_matmul_os.
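
In case it is useful, here is a small sketch of how that pc can be mapped back to a function/line. It assumes the riscv64-unknown-linux-gnu binutils and that mnist_train is the unstripped ELF loaded by pk; the toolchain prefix and binary path may differ in your setup.

import subprocess

# Resolve the faulting pc from the spike register dump with addr2line.
# Since gemmini_extended_mvout is a macro, this should point at the
# enclosing function (sp_tiled_matmul_os in systolic_include.h).
pc = "0xda6606"
out = subprocess.check_output(
    ["riscv64-unknown-linux-gnu-addr2line", "-f", "-C", "-e", "mnist_train", pc],
    text=True)
print(out)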
