Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[spanner] : STATUS_UNAVAILABLE isn't retried in some cases in Result->rows() #5880

Closed
oprudkyi opened this issue Feb 16, 2023 · 15 comments · Fixed by #6149
Closed

[spanner] : STATUS_UNAVAILABLE isn't retried in some cases in Result->rows() #5880

oprudkyi opened this issue Feb 16, 2023 · 15 comments · Fixed by #6149
Assignees
Labels
api: spanner Issues related to the Spanner API. priority: p2 Moderately-important priority. Fix may not be included in next release.

Comments

@oprudkyi
Copy link

Environment details

  • OS:linux/gke/gke-1254-gke2100-cos-101-17162-40-34-v221207-c-cgpv1-pre
  • PHP version: 8.1.12
  • Package name and version: v1.54.2, v1.54.0

Steps to reproduce

it happens very rarely , once per day or week, on relatively loaded clusters and with simple random queries and I can't reproduce it

 PHP Notice: Google\Cloud\Core\Exception\ServiceException: {
    "message": "Broken pipe",
    "code": 14,
    "status": "UNAVAILABLE",
    "details": []
} in /var/www/vendor/google/cloud-core/src/GrpcRequestWrapper.php:257
Stack trace:
#0 /var/www/vendor/google/cloud-core/src/GrpcRequestWrapper.php(194): Google\Cloud\Core\GrpcRequestWrapper->convertToGoogleException(Object(Google\ApiCore\ApiException))
#1 [internal function]: Google\Cloud\Core\GrpcRequestWrapper->handleStream(Object(Google\ApiCore\ServerStream))
#2 /var/www/vendor/google/cloud-spanner/src/Result.php(228): Generator->next()
#3 [internal function]: Google\Cloud\Spanner\Result->rows()

according to https://cloud.google.com/spanner/docs/error-codes the call should be retried

UNAVAILABLE | The server is currently unavailable | Retry using exponential backoff. Note that it is not always safe to retry non-idempotent operations.

but according to source

if ($shouldRetry && $ex->getCode() === Grpc\STATUS_UNAVAILABLE) {
there are cases when call won't be retried

                $hasResumeToken = $this->isSetAndTrue($result, 'resumeToken');
                if ($hasResumeToken || count($bufferedResults) >= self::BUFFER_RESULT_LIMIT) {
                ...
                    $shouldRetry = $hasResumeToken;
                ...    
                $generator->next();
                $valid = $generator->valid();
            } catch (ServiceException $ex) {
                if ($shouldRetry && $ex->getCode() === Grpc\STATUS_UNAVAILABLE) {
                    // Attempt to resume using our last stored resume token. If we
                    // successfully resume, flush the buffer.
                    $generator = $backoff->execute($call, [$this->resumeToken]);
                    $bufferedResults = [];

                    continue;
                }

i.e. it retried only when there is set $resumeToken , but it never set for queries with small results

array (
  'metadata' => 
  array (
    'rowType' => 
    array (
      'fields' => 
      array (
        0 => 
        array (
          'name' => '',
          'type' => 
          array (
            'code' => 2,
            'typeAnnotation' => 0,
          ),
        ),
      ),
    ),
    'transaction' => 
    array (
      'id' => '',
    ),
  ),
  'values' => 
  array (
    0 => '1',
  ),
  'chunkedValue' => false,
  'resumeToken' => '',
)   

sample code I used

        $project = '';
        $instance = '';
        $database = '';
        $spanner = new SpannerClient([ 'projectId' => $project ]);
        $connection = $spanner->connect($instance, $database);
        $generator = $connection
            ->execute('SELECT 1')
            ->rows();

        iterator_to_array($generator);
@yash30201 yash30201 added the api: spanner Issues related to the Spanner API. label Mar 6, 2023
@yash30201 yash30201 added the priority: p2 Moderately-important priority. Fix may not be included in next release. label Mar 7, 2023
@vishwarajanand
Copy link
Contributor

@oprudkyi thanks for reporting this issue. There are reasons due to which all UNAVAILABLE errors are not recommended for retry such as non-idempotent calls.

Can you help me understand if this is a rarely occurring issue, can a retry on user code be helpful?

@vishwarajanand
Copy link
Contributor

Meanwhile, I am trying to understand what should be the behavior when there's no resume token.

@oprudkyi
Copy link
Author

Hi @vishwarajanand

There are reasons due to which all UNAVAILABLE errors are not recommended for retry such as non-idempotent calls.

as for me it looks like idempotent - it calls SELECT via https://cloud.google.com/spanner/docs/reference/rpc/google.spanner.v1#google.spanner.v1.Spanner.ExecuteSql

Meanwhile, I am trying to understand what should be the behavior when there's no resume token.

return original call ? i.e.

 $generator = $backoff->execute($call);

@vishwarajanand
Copy link
Contributor

@oprudkyi I couldn't get the reason(s) why a resume token isn't there for certain rare cases like you pointed.
But assuming that just happens androws(..) returns a generator, a retry without resume token will give unwanted duplicate results.
I recommend a user level retry strategy to handle such cases.

@oprudkyi
Copy link
Author

oprudkyi commented Mar 13, 2023

@vishwarajanand it's not rare, we are getting 50+ errors daily on prod.
and I don't understand why it can't be fixed there

  • SELECT returns 'UNAVAILABLE' error
  • SELECT is idempotent
  • sdk retries call (SELECT)

@oprudkyi
Copy link
Author

just for info, the next ugly workaround helped. The first retry passes without error

    /**
     * @inheritDoc
     * exponential back-off on unavailable errors
     * https://github.com/googleapis/google-cloud-php/issues/5880
     */
    public function select($query, $bindings = [], $useReadPdo = true): array
    {
        $maxAttempts = 8;
        $attempts = 0;
        do {
            $attempts++;
            try {
                return parent::select($query, $bindings, $useReadPdo);
            } catch (QueryException $e) {
                $errorMessage = $e->getMessage();
                if (
                    $attempts <= $maxAttempts
                    && strpos($errorMessage, "UNAVAILABLE") !== false
                    && strpos($errorMessage, "Broken pipe") !== false
                ) {
                    // Broken pipe error, use exponential back-off, from 0.2sec to 25 sec
                    usleep(100000 * pow(2, $attempts));
                } else {
                    throw $e;
                }
            }
        } while (true);
    }

where parent::select is from https://github.com/colopl/laravel-spanner/blob/fbd7a2cab4b0b5e4d11857d356b4518177aed4e3/src/Connection.php#L258

    /**
     * @inheritDoc
     */
    public function select($query, $bindings = [], $useReadPdo = true): array
    {
        return $this->run($query, $bindings, function ($query, $bindings) {
            if ($this->pretending()) {
                return [];
            }

            $generator = $this->getDatabaseContext()
                ->execute($query, ['parameters' => $this->prepareBindings($bindings)])
                ->rows();

            return iterator_to_array($generator);
        });
    }

@taka-oyama but I am still unsure if this this workaround needed in laravel-spanner

@taka-oyama
Copy link
Contributor

I just checked the logs for our service and saw similar errors at the rate of 1~2 an hour.

    "message": "Connection reset by peer",
    "code": 14,
    "status": "UNAVAILABLE",
    "details": []

Can we do something like

Within transaction -> Retry the whole transaction through below.

public function runTransaction(callable $operation, array $options = [])

Outside the transaction -> Retry individual queries through below.

public function rows($format = self::RETURN_ASSOCIATIVE)

@taka-oyama
Copy link
Contributor

Nodejs version seems to have fixed this back in 2020.
googleapis/nodejs-spanner#795

Dotnet has fixed a similar issue.
googleapis/google-cloud-dotnet#5977

@oprudkyi
Copy link
Author

Hi @taka-oyama
Thank you

Within transaction -> Retry the whole transaction through below.

For us there weren't any exceptions inside explicit transactions.
just within transaction-less selects.
I suspect transactions somehow are retried already

@taka-oyama
Copy link
Contributor

taka-oyama commented Apr 14, 2023

Looking at node, dotnet, and java, @olavloite seems to know a lot about how it works and has made PRs for many similar issues.

@olavloite, would it be possible to get some insight on how this is suppose to be handled?

@olavloite
Copy link
Collaborator

(Caution, not a PHP expert here, so I might be reading something wrong in this code)

UNAVAILABLE errors that happen for a SELECT statement can generally be retried safely. The logic should be:

  1. If the initial call to ExecuteStreamingSql fails with an UNAVAILABLE error, the initial call to ExecuteStreamingSql can be retried without a resume token.
  2. The method that reads the stream that is returned by ExecuteStreamingSql should buffer all rows it receives until it sees a resume token, or until its buffer is full. Normally the first will happen. It then emits the rows it has seen so far. This is also the logic that is implemented in the PHP client library.
  3. If the stream fails halfway, the ExecuteStreamingSql call should be retried with the last seen resume token. If there is no resume token and the stream has emitted rows, then retrying is not safe. This is a very rare case. If the stream fails halfway and there is no resume token, but the method has also not emitted any rows, then it's also safe to just retry the initial ExecuteStreamingSql call without a resume token.

From what I can see the only thing that needs to be modified here is this line:

$shouldRetry = false;

It is safe to retry the initial call as long as the method has not returned any rows to the caller.

@taka-oyama
Copy link
Contributor

Wow! Thank you so much for the quick response! 👏🥹

@taka-oyama
Copy link
Contributor

@vishwarajanand
Now that we understand how dotnet and js libraries handle this, would it be possible to get a fix for this?

@vishwarajanand
Copy link
Contributor

I raised a fix to retry when no results have yielded, with or without the resume token.

@taka-oyama
Copy link
Contributor

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: spanner Issues related to the Spanner API. priority: p2 Moderately-important priority. Fix may not be included in next release.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants