test(plugin-server): Reduce concurrency of functional test suite. #18833

Closed
wants to merge 8 commits

Conversation

@tkaemming (Contributor) commented Nov 22, 2023

Ignore this and don't merge (at least for now) -- I'm trying to replicate, by forcing serial execution, the test failures I've seen locally under similar conditions, since they're the same tests that have been failing intermittently on #18764 and others.
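For reference, here's a minimal sketch of the kind of Jest configuration change involved in forcing serial execution. The option names (maxWorkers, maxConcurrency) are real Jest config options, but the exact values used on this branch are an assumption, not a quote of the diff:

// jest.config.functional.js -- sketch only, assuming this is where the
// concurrency knobs live; the branch's actual changes may differ
module.exports = {
    // ...existing functional test configuration...
    maxWorkers: 1, // a single worker process, so test files run serially
    maxConcurrency: 1, // cap concurrent test.concurrent() tests per file
}

Equivalently, passing --runInBand (or --maxWorkers=1) on the Jest CLI runs all tests serially in the current process.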

@tkaemming changed the title from "Reduce concurrency of plugin-server functional test suite." to "test(plugin-server): Reduce concurrency of functional test suite." Nov 22, 2023
@tkaemming (Contributor, Author) commented Nov 23, 2023

Finally, an explanation for the failures in the plugin method tests: "teardown is called on stateful plugin reload if they are updated" (source) is failing on a message that never arrives:

(env) ted@revuelto plugin-server % CLICKHOUSE_DATABASE='default' DATABASE_URL=postgres://posthog:posthog@localhost:5432/test_posthog pnpm functional_tests --verbose -t 'teardown is called on stateful plugin reload if they are updated'

> @posthog/[email protected] functional_tests /Users/ted/Repositories/PostHog/posthog/plugin-server
> jest --config ./jest.config.functional.js "--verbose" "-t" "teardown is called on stateful plugin reload if they are updated"

 FAIL  functional_tests/plugins.test.ts (14.636 s)
  ✕ plugin method tests: teardown is called on stateful plugin reload if they are updated (13109 ms)
  ○ skipped plugin method tests: event captured, processed, ingested
  ○ skipped plugin method tests: creates error on unhandled throw
  ○ skipped plugin method tests: creates error on unhandled rejection
  ○ skipped plugin method tests: creates error on unhandled promise errors
  ○ skipped plugin method tests: can update distinct_id via processEvent
  ○ skipped plugin method tests: can drop events via processEvent
  ○ skipped plugin method tests: correct $autocapture properties included in onEvent calls
  ○ skipped plugin jobs: can call runNow from onEvent
  ○ skipped plugin jobs: can call runNow from processEvent
  ○ skipped plugin jobs: runEveryMinute is executed
  ○ skipped plugins can use attachements
  ○ skipped plugins can use config variables
  ○ skipped plugin configs are still loaded if is_global = false and the team does not own the plugin
  ○ skipped liveness check endpoint works

  ● plugin method tests: teardown is called on stateful plugin reload if they are updated

    expect(received).toHaveLength(expected)

    Expected length: 1
    Received length: 0
    Received array:  []

      263 |     await waitForExpect(async () => {
      264 |         const logs = await fetchPluginConsoleLogEntries(pluginConfig.id)
    > 265 |         expect(logs.filter((log) => log.message.method === 'teardownPlugin')).toHaveLength(1)
          |                                                                               ^
      266 |     })
      267 | })
      268 |

      at toHaveLength (functional_tests/plugins.test.ts:265:79)
      at waitForExpect (functional_tests/expectations.ts:9:20)
      at functional_tests/plugins.test.ts:263:5

Test Suites: 1 failed, 9 skipped, 1 of 10 total
Tests:       1 failed, 68 skipped, 69 total
Snapshots:   0 total
Time:        14.818 s
Ran all test suites with tests matching "teardown is called on stateful plugin reload if they are updated".
Checking clickhouse-ingestion events_plugin_ingestion 0 38 38
Checking clickhouse-plugin-server-async-webhooks clickhouse_events_json 0 38 38
Checking clickhouse-plugin-server-async-onevent clickhouse_events_json 0 38 38
Checking partition_statistics events_plugin_ingestion 0 38 38
Checking group1 clickhouse_events_json 0 38 38
Checking group1 clickhouse_person 0 38 38
Checking group1 plugin_log_entries 0 1060 1060
Checking group1 clickhouse_app_metrics 0 37 37
Checking group1 clickhouse_person_distinct_id 0 38 38
 ELIFECYCLE  Command failed with exit code 1.
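For context, the waitForExpect at functional_tests/expectations.ts:9 in the stack trace above is a polling assertion helper: it retries the callback until it stops throwing or a timeout elapses. A minimal sketch of such a helper in TypeScript (the timeout and polling interval are assumed values, not the suite's actual defaults):

// Sketch of a polling assertion helper like the one the suite uses.
// Retries `fn` until it resolves without throwing, or until `timeoutMs`
// has elapsed, in which case the last assertion error is rethrown.
export async function waitForExpect<T>(
    fn: () => T | Promise<T>,
    timeoutMs = 10_000, // assumed default
    intervalMs = 500 // assumed default
): Promise<T> {
    const deadline = Date.now() + timeoutMs
    for (;;) {
        try {
            return await fn()
        } catch (error) {
            if (Date.now() > deadline) {
                throw error
            }
            await new Promise((resolve) => setTimeout(resolve, intervalMs))
        }
    }
}

So the failure above means the expected teardownPlugin log entry never became visible in ClickHouse within the timeout -- not that the assertion logic itself was wrong.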

I added a bit more detail to the Kafka producer:

diff --git a/plugin-server/src/kafka/producer.ts b/plugin-server/src/kafka/producer.ts
index 7029a26c79..66416f450b 100644
--- a/plugin-server/src/kafka/producer.ts
+++ b/plugin-server/src/kafka/producer.ts
@@ -80,7 +80,7 @@ export const produce = async ({
     headers?: MessageHeader[]
     waitForAck?: boolean
 }): Promise<number | null | undefined> => {
-    status.debug('📤', 'Producing message', { topic: topic })
+    status.debug('📤', 'Producing message', { topic: topic, value: value.toString() })
     const produceSpan = getSpan()?.startChild({ op: 'kafka_produce' })
     return await new Promise((resolve, reject) => {
         if (waitForAck) {

and noticed that even when the test was failing, the plugin-defined teardown message was still being produced:

[15:41:18.337] DEBUG (70645): [MAIN] 👉 LOG in plugin test plugin ID 39 (team ID 39 - organization ID 018bf968-c686-0000-36a5-c4aeab8feaae):
    method: "teardownPlugin"
[15:41:18.337] DEBUG (70645): [MAIN] 📤 Producing message
    topic: "plugin_log_entries"
    value: "{\"source\":\"CONSOLE\",\"type\":\"LOG\",\"id\":\"018bf968-d281-0000-0171-5313d671629b\",\"team_id\":39,\"plugin_id\":39,\"plugin_config_id\":39,\"timestamp\":\"2023-11-22 23:41:18.337\",\"message\":\"{\\"method\\":\\"teardownPlugin\\"}\",\"instance_id\":\"018bf967-b7a6-0000-37fd-d708628dd2c3\"}"

I thought this was strange, because on prior failing test runs (most of them) I hadn't seen the log message in the ClickHouse table at all when running:

ted@revuelto plugin-server % docker-compose -f ../docker-compose.dev.yml exec -it clickhouse clickhouse-client -q 'select * from plugin_log_entries order by timestamp asc'

This seemed suspicious: there were no errors in the ClickHouse server logs to indicate any problems with Kafka, so I started to suspect a row key conflict:

SHOW CREATE TABLE plugin_log_entries
FORMAT Vertical

Query id: aebb380d-9147-4c6b-8157-6cbe94304643

Row 1:
──────
statement: CREATE TABLE default.plugin_log_entries
(
    `id` UUID,
    `team_id` Int64,
    `plugin_id` Int64,
    `plugin_config_id` Int64,
    `timestamp` DateTime64(6, 'UTC'),
    `source` String,
    `type` String,
    `message` String,
    `instance_id` UUID,
    `_timestamp` DateTime,
    `_offset` UInt64
)
ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/noshard/posthog.plugin_log_entries', '{replica}-{shard}', _timestamp)
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (team_id, plugin_id, plugin_config_id, timestamp)
TTL toDate(timestamp) + toIntervalWeek(1)
SETTINGS index_granularity = 512

1 row in set. Elapsed: 0.003 sec. 

Sure enough, there was a gap in the Kafka offsets stored in the table around the plugin unload:

SELECT *
FROM plugin_log_entries
ORDER BY _offset ASC
FORMAT Markdown
| id | team_id | plugin_id | plugin_config_id | timestamp | source | type | message | instance_id | _timestamp | _offset |
|---|---|---|---|---|---|---|---|---|---|---|
| 018bf968-ce54-0000-f70a-c5e63e1451fc | 39 | 39 | 39 | 2023-11-22 23:41:17.268000 | CONSOLE | LOG | {"method":"processEvent"} | 018bf967-b7a6-0000-37fd-d708628dd2c3 | 2023-11-22 23:41:17 | 1056 |
| 018bf968-d280-0000-c05b-72a36ced2b91 | 39 | 39 | 39 | 2023-11-22 23:41:18.336000 | SYSTEM | DEBUG | Plugin loaded (instance ID 018bf967-b7a6-0000-37fd-d708628dd2c3). | 018bf967-b7a6-0000-37fd-d708628dd2c3 | 2023-11-22 23:41:18 | 1057 |
| 018bf968-d281-0001-7edb-2db2f404b499 | 39 | 39 | 39 | 2023-11-22 23:41:18.337000 | SYSTEM | DEBUG | Plugin unloaded (instance ID 018bf967-b7a6-0000-37fd-d708628dd2c3). | 018bf967-b7a6-0000-37fd-d708628dd2c3 | 2023-11-22 23:41:18 | 1059 |
| 018bf969-0954-0000-a947-8b1c95780a12 | 1 | 1 | 1 | 2023-11-22 23:41:32.372000 | SYSTEM | DEBUG | Plugin unloaded (instance ID 018bf967-b7a6-0000-37fd-d708628dd2c3). | 018bf967-b7a6-0000-37fd-d708628dd2c3 | 2023-11-22 23:41:32 | 1060 |
| 018bf969-0954-0001-871f-85e662937e16 | 2 | 2 | 2 | 2023-11-22 23:41:32.372000 | SYSTEM | DEBUG | Plugin unloaded (instance ID 018bf967-b7a6-0000-37fd-d708628dd2c3). | 018bf967-b7a6-0000-37fd-d708628dd2c3 | 2023-11-22 23:41:32 | 1061 |

And of course, here's our missing message at offset 1058 -- it was deduplicated against the message at offset 1059, which shares the same row key:

ted@revuelto plugin-server % docker-compose -f ../docker-compose.dev.yml exec -it kafka kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic plugin_log_entries --partition 0 --offset 1058 --max-messages 1 | jq
{
  "source": "CONSOLE",
  "type": "LOG",
  "id": "018bf968-d281-0000-0171-5313d671629b",
  "team_id": 39,
  "plugin_id": 39,
  "plugin_config_id": 39,
  "timestamp": "2023-11-22 23:41:18.337",
  "message": "{\"method\":\"teardownPlugin\"}",
  "instance_id": "018bf967-b7a6-0000-37fd-d708628dd2c3"
}
Processed a total of 1 messages
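
To spell out the mechanism: the table's ORDER BY key is (team_id, plugin_id, plugin_config_id, timestamp) and does not include id, so ReplacingMergeTree treats any two rows that agree on that tuple as versions of the same row and keeps only one of them at merge time. A sketch of the collision, using the values from the dumps above (the object shape is illustrative, not the producer's actual types):

// The two colliding rows, reduced to their row-key columns.
// Values copied from the Kafka message and table dump above.
const teardownConsoleLog = {
    id: '018bf968-d281-0000-0171-5313d671629b', // Kafka offset 1058
    team_id: 39,
    plugin_id: 39,
    plugin_config_id: 39,
    timestamp: '2023-11-22 23:41:18.337',
}
const pluginUnloadedLog = {
    id: '018bf968-d281-0001-7edb-2db2f404b499', // Kafka offset 1059
    team_id: 39,
    plugin_id: 39,
    plugin_config_id: 39,
    timestamp: '2023-11-22 23:41:18.337', // same timestamp, same row key
}
// Both rows have the identical (team_id, plugin_id, plugin_config_id,
// timestamp) tuple. ReplacingMergeTree keeps one version per key on merge,
// and `id` can't disambiguate because it isn't part of the ORDER BY key.
// Here the "Plugin unloaded" row survived and the teardownPlugin console
// log was silently dropped: produced to Kafka, never visible in ClickHouse.

This also explains why the failure is intermittent: the rows only collide when two log entries for the same plugin config happen to land on exactly the same timestamp, which is timing-dependent.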

@posthog-bot (Contributor) commented:

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in another week.

@posthog-bot removed the stale label Dec 6, 2023
@posthog-bot (Contributor) commented:

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in another week.

@tkaemming (Contributor, Author) commented:

Still some issues with the export functional tests, but it seems those can be addressed on their own; I don't really need this meta-PR for identifying the flakes any longer.

@tkaemming closed this Dec 13, 2023
@tkaemming deleted the debugging-plugin-server-functional-tests branch January 1, 2025