
module 'tensorflow_estimator.python.estimator.api._v1.estimator.tpu' has no attribute 'CrossShardOptimizer' #1135

Closed
liuyibox opened this issue Aug 7, 2020 · 1 comment


liuyibox commented Aug 7, 2020

I am trying to pretrain a BERT model from Google's pretrained checkpoint on a Colab TPU. Until yesterday everything was fine, but today I have been hitting this 'CrossShardOptimizer' error all day. I am wondering whether this is caused by a code-base change or a version migration.

tf version: 1.15.2
python: 3.6
bert-tensorflow: 1.0.3

INFO:tensorflow:*** Input Files (MSL-128) ***
INFO:tensorflow: gs://vbert/input/vmware-docs-2020-reddit_non-wwm_msl-128_vocab-vmware-unused.tfrecord
INFO:tensorflow:*** Input Files (MSL-512) ***
INFO:tensorflow: gs://vbert/input/vmware-docs-2020-reddit_non-wwm_msl-512_vocab-vmware-unused.tfrecord
WARNING:tensorflow:Estimator's model_fn (<function model_fn_builder..model_fn at 0x7f5054197bf8>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Using config: {'_model_dir': 'gs://vbert/liuyi-vbert-docs-reddit/base/vocab-vmware-unused', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 10000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
job {
name: "worker"
tasks {
key: 0
value: "10.47.24.194:8470"
}
}
}
isolate_session_state: true
, '_keep_checkpoint_max': 10000, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f505413deb8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.47.24.194:8470', '_evaluation_master': 'grpc://10.47.24.194:8470', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=10000, num_shards=8, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': <tensorflow.python.distribute.cluster_resolver.tpu_cluster_resolver.TPUClusterResolver object at 0x7f505413dc50>}
INFO:tensorflow:_TPUContext: eval_on_tpu True
INFO:tensorflow:***** Running training *****
INFO:tensorflow: Batch size = 32
INFO:tensorflow:Querying Tensorflow master (grpc://10.47.24.194:8470) for TPU system metadata.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1, 18293633603678532293)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 16754746863277155707)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 12168993875110325416)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 5785133627713800739)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 531464872121750804)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 13610383926908237188)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 3588204162670013970)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 5523440629424163654)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184, 9311023021754933234)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 8589934592, 17907827073552055203)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 5163179840106115260)
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Entity <function input_fn_builder..input_fn.. at 0x7f50541971e0> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: module 'gast' has no attribute 'Str'
WARNING: Entity <function input_fn_builder..input_fn.. at 0x7f50541971e0> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: module 'gast' has no attribute 'Str'
INFO:tensorflow:Found small feature: next_sentence_labels [4, 1]
INFO:tensorflow:Found small feature: next_sentence_labels [4, 1]
INFO:tensorflow:Found small feature: next_sentence_labels [4, 1]
INFO:tensorflow:Found small feature: next_sentence_labels [4, 1]
INFO:tensorflow:Found small feature: next_sentence_labels [4, 1]
INFO:tensorflow:Found small feature: next_sentence_labels [4, 1]
INFO:tensorflow:Found small feature: next_sentence_labels [4, 1]
INFO:tensorflow:Found small feature: next_sentence_labels [4, 1]
INFO:tensorflow:*** Features ***
INFO:tensorflow: name = input_ids, shape = (4, 128)
INFO:tensorflow: name = input_mask, shape = (4, 128)
INFO:tensorflow: name = masked_lm_ids, shape = (4, 20)
INFO:tensorflow: name = masked_lm_positions, shape = (4, 20)
INFO:tensorflow: name = masked_lm_weights, shape = (4, 20)
INFO:tensorflow: name = next_sentence_labels, shape = (4, 1)
INFO:tensorflow: name = segment_ids, shape = (4, 128)
INFO:tensorflow:**** Trainable Variables ****
ERROR:tensorflow:Error recorded from training_loop: module 'tensorflow_estimator.python.estimator.api._v1.estimator.tpu' has no attribute 'CrossShardOptimizer'
INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error


AttributeError Traceback (most recent call last)

in ()
3 start_time = datetime.now()
4 FLAGS.training_start_time = start_time
----> 5 main()
6 print("Pretraining took", datetime.now() - start_time)

25 frames

in main()
93 max_predictions_per_seq=FLAGS.max_predictions_per_seq,
94 is_training=True)
---> 95 estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps, saving_listeners=[listener])
96
97 FLAGS.loop_times = loop_times

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
3033 finally:
3034 rendezvous.record_done('training_loop')
-> 3035 rendezvous.raise_errors()
3036
3037 def evaluate(self,

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/error_handling.py in raise_errors(self, timeout_sec)
134 else:
135 logging.warn('Reraising captured error')
--> 136 six.reraise(typ, value, traceback)
137
138 for k, (typ, value, traceback) in kept_errors:

/usr/local/lib/python3.6/dist-packages/six.py in reraise(tp, value, tb)
701 if value.__traceback__ is not tb:
702 raise value.with_traceback(tb)
--> 703 raise value
704 finally:
705 value = None

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
3028 steps=steps,
3029 max_steps=max_steps,
-> 3030 saving_listeners=saving_listeners)
3031 except Exception: # pylint: disable=broad-except
3032 rendezvous.record_error('training_loop', sys.exc_info())

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
368
369 saving_listeners = _check_listeners_type(saving_listeners)
--> 370 loss = self._train_model(input_fn, hooks, saving_listeners)
371 logging.info('Loss for final step: %s.', loss)
372 return self

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py in _train_model(self, input_fn, hooks, saving_listeners)
1159 return self._train_model_distributed(input_fn, hooks, saving_listeners)
1160 else:
-> 1161 return self._train_model_default(input_fn, hooks, saving_listeners)
1162
1163 def _train_model_default(self, input_fn, hooks, saving_listeners):

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py in _train_model_default(self, input_fn, hooks, saving_listeners)
1189 worker_hooks.extend(input_hooks)
1190 estimator_spec = self._call_model_fn(
-> 1191 features, labels, ModeKeys.TRAIN, self.config)
1192 global_step_tensor = training_util.get_global_step(g)
1193 return self._train_with_estimator_spec(estimator_spec, worker_hooks,

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in _call_model_fn(self, features, labels, mode, config)
2855 else:
2856 return super(TPUEstimator, self)._call_model_fn(features, labels, mode,
-> 2857 config)
2858 else:
2859 if mode == _INFERENCE_ON_TPU_MODE:

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py in _call_model_fn(self, features, labels, mode, config)
1147
1148 logging.info('Calling model_fn.')
-> 1149 model_fn_results = self._model_fn(features=features, **kwargs)
1150 logging.info('Done calling model_fn.')
1151

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in _model_fn(features, labels, mode, config, params)
3157 if mode == model_fn_lib.ModeKeys.TRAIN:
3158 compile_op, loss, host_call, scaffold_fn, training_hooks = (
-> 3159 _train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn))
3160 if ctx.embedding_config:
3161 g = ops.get_default_graph()

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in _train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn)
3602 num_shards=ctx.num_replicas,
3603 outputs_from_all_shards=False,
-> 3604 device_assignment=ctx.device_assignment)
3605
3606 loss = loss[0]

/tensorflow-1.15.2/python3.6/tensorflow_core/python/tpu/tpu.py in split_compile_and_shard(computation, inputs, num_shards, input_shard_axes, outputs_from_all_shards, output_shard_axes, infeed_queue, device_assignment, name)
1275 infeed_queue=infeed_queue,
1276 device_assignment=device_assignment,
-> 1277 name=name)
1278
1279 # There must be at least one shard since num_shards > 0.

/tensorflow-1.15.2/python3.6/tensorflow_core/python/tpu/tpu.py in split_compile_and_replicate(failed resolving arguments)
990 vscope.set_custom_getter(custom_getter)
991
--> 992 outputs = computation(*computation_inputs)
993
994 vscope.set_use_resource(saved_use_resource)

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in multi_tpu_train_steps_on_single_shard(replica_id)
3587 lambda i, loss: i < iterations_per_loop_var,
3588 lambda i, loss: [i + 1, single_tpu_train_step(i)],
-> 3589 inputs=[0, _INITIAL_LOSS])
3590 return outputs[1:]
3591

/tensorflow-1.15.2/python3.6/tensorflow_core/python/tpu/training_loop.py in while_loop(failed resolving arguments)
176 inputs = [array_ops.constant(0)]
177 return control_flow_ops.while_loop(
--> 178 condition_wrapper, body_wrapper, inputs, name="", parallel_iterations=1)
179
180

/tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/control_flow_ops.py in while_loop(cond, body, loop_vars, shape_invariants, parallel_iterations, back_prop, swap_memory, name, maximum_iterations, return_same_structure)
2751 ops.add_to_collection(ops.GraphKeys.WHILE_CONTEXT, loop_context)
2752 result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants,
-> 2753 return_same_structure)
2754 if maximum_iterations is not None:
2755 return result[1]

/tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/control_flow_ops.py in BuildLoop(self, pred, body, loop_vars, shape_invariants, return_same_structure)
2243 with ops.get_default_graph()._mutation_lock(): # pylint: disable=protected-access
2244 original_body_result, exit_vars = self._BuildLoop(
-> 2245 pred, body, original_loop_vars, loop_vars, shape_invariants)
2246 finally:
2247 self.Exit()

/tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/control_flow_ops.py in _BuildLoop(self, pred, body, original_loop_vars, loop_vars, shape_invariants)
2168 expand_composites=True)
2169 pre_summaries = ops.get_collection(ops.GraphKeys._SUMMARY_COLLECTION) # pylint: disable=protected-access
-> 2170 body_result = body(*packed_vars_for_body)
2171 post_summaries = ops.get_collection(ops.GraphKeys._SUMMARY_COLLECTION) # pylint: disable=protected-access
2172 if not nest.is_sequence_or_composite(body_result):

/tensorflow-1.15.2/python3.6/tensorflow_core/python/tpu/training_loop.py in body_wrapper(inputs)
119 else:
120 dequeue_ops = []
--> 121 outputs = body(*(inputs + dequeue_ops))
122
123 # If the computation only returned one value, make it a tuple.

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in <lambda>(i, loss)
3586 outputs = training_loop.while_loop(
3587 lambda i, loss: i < iterations_per_loop_var,
-> 3588 lambda i, loss: [i + 1, single_tpu_train_step(i)],
3589 inputs=[0, _INITIAL_LOSS])
3590 return outputs[1:]

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in train_step(step)
1713
1714 estimator_spec = self._verify_estimator_spec(
-> 1715 self._call_model_fn(features, labels))
1716 loss, train_op = estimator_spec.loss, estimator_spec.train_op
1717

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in _call_model_fn(self, features, labels, is_export_mode)
1992 _add_item_to_params(params, _CTX_KEY, user_context)
1993
-> 1994 estimator_spec = self._model_fn(features=features, **kwargs)
1995 if (running_on_cpu and
1996 isinstance(estimator_spec, model_fn_lib._TPUEstimatorSpec)): # pylint: disable=protected-access

in model_fn(features, labels, mode, params)
67 if mode == tf.estimator.ModeKeys.TRAIN:
68 train_op = optimization.create_optimizer(
---> 69 total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
70
71 output_spec = tf.contrib.tpu.TPUEstimatorSpec(

/usr/local/lib/python3.6/dist-packages/bert/optimization.py in create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu)
66
67 if use_tpu:
---> 68 optimizer = tf.estimator.tpu.CrossShardOptimizer(optimizer)
69
70 tvars = tf.trainable_variables()

/tensorflow-1.15.2/python3.6/tensorflow_core/python/util/module_wrapper.py in __getattr__(self, name)
191 def __getattr__(self, name):
192 try:
--> 193 attr = getattr(self._tfmw_wrapped_module, name)
194 except AttributeError:
195 if not self._tfmw_public_apis:

AttributeError: module 'tensorflow_estimator.python.estimator.api._v1.estimator.tpu' has no attribute 'CrossShardOptimizer'

Any insights and discussions are appreciated. Thanks.
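
For reference, the failing lookup can be reproduced outside the training loop. The sketch below is just a quick check from my Colab runtime (TF 1.15.2); it shows that the attribute bert-tensorflow 1.0.3 looks up on tf.estimator.tpu is missing, while the optimizer is still reachable under the core v1 TPU namespace:

```python
import tensorflow as tf  # 1.15.2 on the Colab TPU runtime

print(tf.__version__)

# This is the lookup bert-tensorflow 1.0.3 performs in optimization.create_optimizer;
# it raises the AttributeError shown above:
print(hasattr(tf.estimator.tpu, "CrossShardOptimizer"))   # False

# The optimizer itself still exists under the core v1 TPU namespace:
print(hasattr(tf.compat.v1.tpu, "CrossShardOptimizer"))   # True
```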

liuyibox (Author) commented

This was caused by a bert-tensorflow version update, and is resolved by running "pip install bert-tensorflow==1.0.1", as mentioned here.
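
In a Colab notebook, that amounts to something like the cell below (the pip show line is only a sanity check); restart the runtime afterwards so the already-imported bert modules are reloaded:

```python
# Pin bert-tensorflow back to 1.0.1, which predates the switch to
# tf.estimator.tpu.CrossShardOptimizer in optimization.py.
!pip install bert-tensorflow==1.0.1

# Sanity check the installed version, then restart the Colab runtime.
!pip show bert-tensorflow   # should report Version: 1.0.1
```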
