I am trying to pretrain BERT from Google's pretrained checkpoint on a Colab TPU. Until yesterday everything was fine, but today I have been hitting this 'CrossShardOptimizer' error all day. I am wondering whether this is caused by a codebase change or a version migration.
```
/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/error_handling.py in raise_errors(self, timeout_sec)
    134         else:
    135           logging.warn('Reraising captured error')
--> 136           six.reraise(typ, value, traceback)
    137
    138     for k, (typ, value, traceback) in kept_errors:

/usr/local/lib/python3.6/dist-packages/six.py in reraise(tp, value, tb)
    701             if value.traceback is not tb:
    702                 raise value.with_traceback(tb)
--> 703             raise value
    704         finally:
    705             value = None

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in _model_fn(features, labels, mode, config, params)
   3157       if mode == model_fn_lib.ModeKeys.TRAIN:
   3158         compile_op, loss, host_call, scaffold_fn, training_hooks = (
-> 3159             _train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn))
   3160         if ctx.embedding_config:
   3161           g = ops.get_default_graph()

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in _train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn)
   3602       num_shards=ctx.num_replicas,
   3603       outputs_from_all_shards=False,
-> 3604       device_assignment=ctx.device_assignment)
   3605
   3606   loss = loss[0]

/tensorflow-1.15.2/python3.6/tensorflow_core/python/tpu/tpu.py in split_compile_and_shard(computation, inputs, num_shards, input_shard_axes, outputs_from_all_shards, output_shard_axes, infeed_queue, device_assignment, name)
   1275       infeed_queue=infeed_queue,
   1276       device_assignment=device_assignment,
-> 1277       name=name)
   1278
   1279   # There must be at least one shard since num_shards > 0.

/tensorflow-1.15.2/python3.6/tensorflow_core/python/tpu/training_loop.py in body_wrapper(inputs)
    119     else:
    120       dequeue_ops = []
--> 121     outputs = body((inputs + dequeue_ops))
    122
    123     # If the computation only returned one value, make it a tuple.

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in <lambda>(i, loss)
   3586     outputs = training_loop.while_loop(
   3587         lambda i, loss: i < iterations_per_loop_var,
-> 3588         lambda i, loss: [i + 1, single_tpu_train_step(i)],
   3589         inputs=[0, _INITIAL_LOSS])
   3590     return outputs[1:]
```
TensorFlow version: 1.15.2
Python: 3.6
bert-tensorflow: 1.0.3
Any insights and discussions are appreciated. Thanks.
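For context, the `CrossShardOptimizer` the traceback refers to is the TPU gradient-aggregation wrapper that BERT's training code applies inside its `model_fn`. Below is a minimal sketch of that wrapping pattern, not BERT's actual `optimization.py`: the function name `create_optimizer` is reused for familiarity, but plain SGD stands in for BERT's AdamWeightDecayOptimizer, and the demo runs the `use_tpu=False` branch only, since the TPU branch needs a real TPU context.

```python
import tensorflow.compat.v1 as tf  # same graph-mode API shape as the TF 1.15 in this issue

tf.disable_eager_execution()

def create_optimizer(loss, learning_rate, use_tpu):
    """Simplified stand-in for BERT's create_optimizer (SGD instead of AdamWeightDecay)."""
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    if use_tpu:
        # On a TPU, each core computes gradients on its own shard of the batch;
        # CrossShardOptimizer sums those gradients across cores before applying them.
        optimizer = tf.tpu.CrossShardOptimizer(optimizer)
    return optimizer.minimize(loss)

# CPU-only demonstration of the non-TPU branch:
w = tf.get_variable("w", initializer=4.0)
loss = tf.square(w)  # minimized at w == 0
train_op = create_optimizer(loss, learning_rate=0.1, use_tpu=False)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    before = sess.run(loss)   # 4.0 ** 2 == 16.0
    for _ in range(10):
        sess.run(train_op)
    after = sess.run(loss)    # a few SGD steps reduce the loss
```

If the error appeared without any code change on your side, the wrapper itself is unlikely to be the culprit; a mismatch between the `bert-tensorflow` package and the TF 1.15 TPU estimator stack seems more plausible.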