vreplication: gracefully handle large transactions #16317

Closed
derekperkins opened this issue Jul 2, 2024 · 2 comments
Labels
Component: VReplication Type: Enhancement Logical improvement (somewhere between a bug and feature)

Comments

@derekperkins
Member

derekperkins commented Jul 2, 2024

Overview of the Issue

We had a tablet start OOMing repeatedly on Friday, and it took a while to track down. We isolated it to a materialize stream (1 source tablet -> 16 target tablets) after running a large transaction that by itself updated 63M rows. The tablet spiked to 60 GB of RAM; it normally runs at a steady state of around 100 MB.

We were able to temporarily move the tablet to a larger server to get the transaction processed, then move it back to its original place in our cluster. What made this particularly hard to diagnose was that the OOMs didn't correlate with any specific log entries.

Obviously huge transactions aren't great and are a source of issues. I've just been wondering if there's a way to handle a scenario like this more gracefully, since it kept repeatedly flapping the db primary. This was clearly amplified by needing to process the same tx 16 times. There may not be a way to reduce the memory overhead for a single stream, but maybe if we could detect a certain threshold, we could then somehow process it serially? That sounds like a very difficult coordination problem; maybe the throttler could be leveraged somehow?
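
To make the threshold idea concrete, here is a rough sketch. It is not Vitess code and does not reflect how VReplication is implemented; `LargeTxGate`, `row_threshold`, and `simulate` are hypothetical names, and it assumes all streams run in one process where a shared lock is possible. Transactions below the threshold proceed concurrently, while anything at or above it is applied by one stream at a time.

```python
import threading
import time
from contextlib import contextmanager


class LargeTxGate:
    """Hypothetical gate: small transactions pass freely, large ones run one at a time."""

    def __init__(self, row_threshold: int):
        self.row_threshold = row_threshold
        self._large_tx_lock = threading.Lock()

    def is_large(self, row_count: int) -> bool:
        return row_count >= self.row_threshold

    @contextmanager
    def apply(self, row_count: int):
        # Only transactions at or above the threshold take the shared lock.
        if self.is_large(row_count):
            with self._large_tx_lock:
                yield True
        else:
            yield False


def simulate(num_streams: int, row_count: int, gate: LargeTxGate) -> int:
    """Return the peak number of streams applying the transaction at once."""
    active = 0
    peak = 0
    counter_lock = threading.Lock()

    def stream():
        nonlocal active, peak
        with gate.apply(row_count):
            with counter_lock:
                active += 1
                peak = max(peak, active)
            time.sleep(0.01)  # stand-in for the work of applying the transaction
            with counter_lock:
                active -= 1

    threads = [threading.Thread(target=stream) for _ in range(num_streams)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak


if __name__ == "__main__":
    gate = LargeTxGate(row_threshold=1_000_000)  # threshold value is an arbitrary assumption
    print("peak concurrent streams for a 63M-row tx:", simulate(16, 63_000_000, gate))
```

With 16 streams and a 63M-row transaction, the gate keeps the peak at one stream, trading throughput for a bounded memory footprint. The hard part raised above remains: the real streams are not necessarily in one process, so the coordination would have to happen somewhere shared.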

Feel free to close this issue if there isn't anything actionable.

cc @mattlord

Reproduction Steps

{
  "keywords": [
    {
      "name": "keywords__keywords__copy",
      "source": {
        "keyspace": "keywordsu",
        "shards": [
          "0"
        ]
      },
      "target": {
        "keyspace": "keywords",
        "shards": [
          "-10",
          "10-20",
          "20-30",
          "30-40",
          "40-50",
          "50-60",
          "60-70",
          "70-80",
          "80-90",
          "90-a0",
          "a0-b0",
          "b0-c0",
          "c0-d0",
          "d0-e0",
          "e0-f0",
          "f0-"
        ]
      },
      "shard_streams": {
        "-10/uscentral1-0126977400": {
          "streams": [
            {
              "id": 55,
              "shard": "-10",
              "tablet": {
                "cell": "uscentral1",
                "uid": 126977400
              },
              "binlog_source": {
                "keyspace": "keywordsu",
                "shard": "0",
                "filter": {
                  "rules": [
                    {
                      "match": "keywords__keywords__copy",
                      "filter": "select keyword_id, phrase_id, phrase, locale_id, device_code, created_at from keywords where in_keyrange(keyword_id, 'keywords.hash', '-10')"
                    }
                  ]
                }
              },
              "position": "0c4b8ef3-253a-11ec-98d9-3235e408d1e5:1-609569,14e91046-1cdc-11ed-a65f-d60bf275c515:1-1589772,28d7debe-6d30-11ec-9696-826bd9c444b9:1-32830237,4dda5b4e-0ea6-11ed-a83b-c2d1e4afc67b:1-111205389,684e23fe-0eb4-11ed-a604-b6763a422c36:1-3484661,a351f01d-fb7d-11ed-a660-bee2b573f3bf:1-39363505,a5ef9c7b-0eac-11ed-8bbe-46781259a843:1-75497684,a8de8790-6c23-11ed-af0a-62dad3b25097:1-71206353,bd907242-5ea4-11ee-b501-aa7ed7190bc9:1-2490747,cd9e16c9-2539-11ec-9499-2a7ac907fc94:1-282508005,e99beb6f-5ea1-11ee-9732-76800cdb5542:1-1358654362",
              "state": "Running",
              "db_name": "keywords",
              "transaction_timestamp": {
                "seconds": 1719946904
              },
              "time_updated": {
                "seconds": 1719950021
              },
              "tags": [
                ""
              ],
              "rows_copied": 12767817,
              "throttler_status": {
                "time_throttled": {}
              },
              "cells": [
                ""
              ]
            }
          ],
          "is_primary_serving": true
        }
      },
      "workflow_type": "Materialize",
      "workflow_sub_type": "None"
    },
...
  ]
}
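
For comparing setups, a small sketch that summarizes the fan-out of a workflow from output shaped like the above. The field names (`name`, `source`, `target`, `shards`) come from that JSON; the trimmed `SAMPLE` and the `fan_out` helper are illustrative, and the sketch assumes each top-level key maps to a list of workflows.

```python
import json

# Trimmed, illustrative sample with the same shape as the output above.
SAMPLE = """
{
  "keywords": [
    {
      "name": "keywords__keywords__copy",
      "source": {"keyspace": "keywordsu", "shards": ["0"]},
      "target": {"keyspace": "keywords", "shards": ["-10", "10-20"]},
      "workflow_type": "Materialize"
    }
  ]
}
"""


def fan_out(workflow_json: str) -> dict:
    """Map each workflow name to (source shard count, target shard count)."""
    doc = json.loads(workflow_json)
    result = {}
    for workflows in doc.values():
        for wf in workflows:
            result[wf["name"]] = (
                len(wf["source"]["shards"]),
                len(wf["target"]["shards"]),
            )
    return result


if __name__ == "__main__":
    for name, (sources, targets) in fan_out(SAMPLE).items():
        print(f"{name}: {sources} source shard(s) -> {targets} target shard(s)")
```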

Binary Version

v20.0.0

Operating System and Environment details

GKE v1.29

Log Fragments

No response

@derekperkins derekperkins added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: VReplication labels Jul 2, 2024
@mattlord
Contributor

mattlord commented Jul 2, 2024

Thanks, @derekperkins! We've encountered similar difficulties at PlanetScale, and it's something I'd like to explore in the coming weeks and months.

@derekperkins
Member Author

Closing, as it should be resolved by #16328

@github-project-automation github-project-automation bot moved this from In progress to Done in VReplication Sep 4, 2024
Projects
Status: Done

2 participants