postgres/prometheus couch2pg errors: 'converting NULL to int64 is unsupported' and more #70

Closed
mrjones-plip opened this issue Jun 24, 2023 · 3 comments · Fixed by #79

@mrjones-plip

When running the couch2pg data setup, I'm getting errors and the data appears not to be imported.

Here's the output from docker logs cht-monitoring-postgres-exporter-1. The real cht instance name has been replaced with cht_postgres_name_here:

ts=2023-06-24T15:31:03.368Z caller=server.go:74 level=info msg="Established new database connection" fingerprint=172.17.0.1:5432
ts=2023-06-24T15:31:03.633Z caller=collector.go:190 level=error target=postgresql://172.17.0.1:5432/cht_postgres_name_here msg="collector failed" name=postmaster duration_seconds=0.265314435 err="sql: Scan error on column index 0, name \"pg_postmaster_start_time\": converting driver.Value type time.Time (\"2023-05-17 10:11:24.021895 +0000 UTC\") to a float64: invalid syntax"
ts=2023-06-24T15:31:04.343Z caller=postgres_exporter.go:622 level=info msg="Semantic version changed" server=172.17.0.1:5432 from=0.0.0 to=9.6.13
ts=2023-06-24T15:31:04.476Z caller=collector.go:190 level=error target=postgresql://172.17.0.1:5432/cht_postgres_name_here msg="collector failed" name=statio_user_tables duration_seconds=1.107925937 err="sql: Scan error on column index 5, name \"idx_blks_read\": converting NULL to int64 is unsupported"
ts=2023-06-24T15:31:04.741Z caller=collector.go:190 level=error target=postgresql://172.17.0.1:5432/cht_postgres_name_here msg="collector failed" name=process_idle duration_seconds=1.37263144 err="sql: Scan error on column index 3, name \"seconds\": unsupported Scan, storing driver.Value type []uint8 into type *[]int64"
ts=2023-06-24T15:31:05.276Z caller=collector.go:190 level=error target=postgresql://172.17.0.1:5432/cht_postgres_name_here msg="collector failed" name=stat_user_tables duration_seconds=1.9078570670000001 err="sql: Scan error on column index 5, name \"idx_scan\": converting NULL to int64 is unsupported"
ts=2023-06-24T15:31:05.541Z caller=collector.go:190 level=error target=postgresql://172.17.0.1:5432/cht_postgres_name_here msg="collector failed" name=replication duration_seconds=2.172937022 err="sql: Scan error on column index 0, name \"pg_postmaster_start_time\": converting driver.Value type time.Time (\"2023-05-17 10:11:24.021895 +0000 UTC\") to a float64: invalid syntax"
@mrjones-plip added the Type: Bug label on Jun 24, 2023
@mrjones-plip commented Jun 29, 2023

OK!!! After a deep dive into this I have a path forward, but no specific answer just yet.

tl;dr

We should figure out how to turn off ALL metrics aside from what is in the queries YAML. We tested both --disable-default-metrics and --disable-settings-metrics, and neither worked (see the sketch below for how they were passed).
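
For reference, here is a minimal sketch of how those flags get passed, assuming the exporter runs as a compose service; the service name, image, and queries path below are illustrative, not the exact cht-watchdog config:

```yaml
# Sketch only: passing the disable flags to postgres_exporter via compose.
# Service name, image tag, and the queries path are assumptions.
services:
  postgres-exporter:
    image: quay.io/prometheuscommunity/postgres-exporter:${POSTGRES_EXPORTER_VERSION}
    command:
      - '--extend.query-path=/config/queries.yaml'
      - '--disable-default-metrics'
      - '--disable-settings-metrics'
```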

discoveries & workarounds

The big takeaway comes down to two points:

  1. The errors above don't actually stop the postgres scrape from running! If you expose port 9090 for the prometheus service and check Targets (http://localhost:9090/targets), you actually see this context deadline exceeded error for the postgres target:

    Get "http://postgres-exporter:9187/probe?auth_module=postgress_server_here%3A5434%2Fdatabase_here&target=postgresql%3A%2F%2Fpostgress_server_here%3A5434%2Fdatabase_here": context deadline exceeded

  2. You can work around context deadline exceeded by updating the scrape config file to add scrape_timeout: 1m on line 3 (a sketch follows this list). The scrape will then succeed after you down/up the services.
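
For reference, a minimal sketch of what that scrape config looks like with the timeout in place; the job name, auth module, and target below are placeholders patterned on the probe URL above, not the literal cht-watchdog config:

```yaml
# Sketch only: prometheus scrape config with the timeout workaround.
# job_name, auth_module, and target values are placeholders.
scrape_configs:
  - job_name: 'postgres-exporter'
    scrape_timeout: 1m   # the workaround; the default 10s was being exceeded
    metrics_path: /probe
    params:
      auth_module: ['postgres_server_here:5432/database_here']
      target: ['postgresql://postgres_server_here:5432/database_here']
    static_configs:
      - targets: ['postgres-exporter:9187']
```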

scrape_timeout is just a workaround, though, because when you DO get it to work you find out that:

  • it pulls down 4k+ lines of metrics
  • that's over 350 KB of data
  • it takes tens of seconds to run (an average of about 13 s, n ≈ 10)

This is because the exporter fetches dozens of metrics for each database, and we have 20+ databases in the RDBMS; 4,000+ lines across 20+ databases works out to roughly 200 metric lines per database. Multiply that out and you get the huge stats above.

helpful points

  • Here is a copy of the output of /metrics on the exporter (Medic Google Drive).
  • If you want to test what the /metrics endpoint for postgres is going to return, but don't want to bump the scrape timeout as above, you can:
    1. enable postgres metrics, per the docs
    2. figure out the IP of your postgres exporter service (it changes on every down/up of Docker). It's likely called cht-watchdog-postgres-exporter-1: docker inspect $(docker ps -q ) --format='{{ printf "%-50s" .Name}} {{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}'
    3. run this curl command using the IP from the prior step, ensuring that both postgres_server and cht are updated to match your auth_modules: curl -v http://192.168.192.3:9187/probe\?auth_module\=postgres_server%3A5432%2Fcht\&target\=postgresql%3A%2F%2Fpostgres_server%3A5432%2Fcht > postgres.metrics.txt
  • If you want to get rid of the original converting NULL to int64 errors, you can update the compose file to use master per this issue, which presumably will make its way to latest soon. Do that by editing your .env file (copied from the example) to have POSTGRES_EXPORTER_VERSION=master and restarting.
  • Per the fix in the last postgres issue, you can make changes in the Docker file to pass more flags when the exporter is called.
  • It's possible that setting all the [no-]collector.* flags individually will work, instead of the blanket disable-*-metrics ones we tried above (a sketch follows this list).
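
For example, here is a sketch of what that could look like in the compose file. The collector names below are taken from the failing collectors in the logs above (postmaster, statio_user_tables, process_idle, stat_user_tables, replication); whether each one has a matching --[no-]collector.* flag depends on the exporter version, so check postgres_exporter --help before relying on these:

```yaml
# Sketch only: disabling individual collectors instead of the blanket
# disable-*-metrics flags. Collector names come from the error logs above;
# verify the exact flags against `postgres_exporter --help` for your version.
services:
  postgres-exporter:
    command:
      - '--no-collector.postmaster'
      - '--no-collector.statio_user_tables'
      - '--no-collector.process_idle'
      - '--no-collector.stat_user_tables'
      - '--no-collector.replication'
```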

@mrjones-plip

@m5r - it would be great if you could solve this next week while @jkuester and I are out!

cc @tatilepizs who can assist.

@medic-ci

🎉 This issue has been resolved in version 1.8.2 🎉

The release is available as a GitHub release.

Your semantic-release bot 📦🚀
