I am trying to pretrain BERT from Google's pretrained checkpoint on a Colab TPU. Until yesterday everything was fine, but today I have been hitting this 'CrossShardOptimizer' error all day. I am wondering whether this is caused by a codebase change or a version migration.
```
/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/error_handling.py in raise_errors(self, timeout_sec)
    134         else:
    135           logging.warn('Reraising captured error')
--> 136           six.reraise(typ, value, traceback)
    137
    138     for k, (typ, value, traceback) in kept_errors:

/usr/local/lib/python3.6/dist-packages/six.py in reraise(tp, value, tb)
    701             if value.traceback is not tb:
    702                 raise value.with_traceback(tb)
--> 703             raise value
    704         finally:
    705             value = None

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in _model_fn(features, labels, mode, config, params)
   3157       if mode == model_fn_lib.ModeKeys.TRAIN:
   3158         compile_op, loss, host_call, scaffold_fn, training_hooks = (
-> 3159             _train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn))
   3160         if ctx.embedding_config:
   3161           g = ops.get_default_graph()

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in _train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn)
   3602       num_shards=ctx.num_replicas,
   3603       outputs_from_all_shards=False,
-> 3604       device_assignment=ctx.device_assignment)
   3605
   3606   loss = loss[0]

/tensorflow-1.15.2/python3.6/tensorflow_core/python/tpu/tpu.py in split_compile_and_shard(computation, inputs, num_shards, input_shard_axes, outputs_from_all_shards, output_shard_axes, infeed_queue, device_assignment, name)
   1275       infeed_queue=infeed_queue,
   1276       device_assignment=device_assignment,
-> 1277       name=name)
   1278
   1279   # There must be at least one shard since num_shards > 0.

/tensorflow-1.15.2/python3.6/tensorflow_core/python/tpu/training_loop.py in body_wrapper(inputs)
    119     else:
    120       dequeue_ops = []
--> 121     outputs = body((inputs + dequeue_ops))
    122
    123     # If the computation only returned one value, make it a tuple.

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in <lambda>(i, loss)
   3586     outputs = training_loop.while_loop(
   3587         lambda i, loss: i < iterations_per_loop_var,
-> 3588         lambda i, loss: [i + 1, single_tpu_train_step(i)],
   3589         inputs=[0, _INITIAL_LOSS])
   3590     return outputs[1:]
```
TensorFlow version: 1.15.2
Python: 3.6
bert-tensorflow: 1.0.3
Any insights and discussions are appreciated. Thanks.
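For context, the `CrossShardOptimizer` the traceback refers to is the TPU gradient-aggregation wrapper that BERT's training code applies inside its `model_fn`. Below is a minimal sketch of that wrapping pattern, not BERT's actual `optimization.py`: the function name `create_optimizer` is reused for familiarity, but plain SGD stands in for BERT's AdamWeightDecayOptimizer, and the demo runs the `use_tpu=False` branch only, since the TPU branch needs a real TPU context.

```python
import tensorflow.compat.v1 as tf  # same graph-mode API shape as the TF 1.15 in this issue

tf.disable_eager_execution()

def create_optimizer(loss, learning_rate, use_tpu):
    """Simplified stand-in for BERT's create_optimizer (SGD instead of AdamWeightDecay)."""
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    if use_tpu:
        # On a TPU, each core computes gradients on its own shard of the batch;
        # CrossShardOptimizer sums those gradients across cores before applying them.
        optimizer = tf.tpu.CrossShardOptimizer(optimizer)
    return optimizer.minimize(loss)

# CPU-only demonstration of the non-TPU branch:
w = tf.get_variable("w", initializer=4.0)
loss = tf.square(w)  # minimized at w == 0
train_op = create_optimizer(loss, learning_rate=0.1, use_tpu=False)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    before = sess.run(loss)   # 4.0 ** 2 == 16.0
    for _ in range(10):
        sess.run(train_op)
    after = sess.run(loss)    # a few SGD steps reduce the loss
```

If the error appeared without any code change on your side, the wrapper itself is unlikely to be the culprit; a mismatch between the `bert-tensorflow` package and the TF 1.15 TPU estimator stack seems more plausible.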