Skip to content
This repository has been archived by the owner on Nov 4, 2021. It is now read-only.

Add built-in GeoIP capabilities #247

Merged
merged 27 commits into from
Mar 26, 2021
Merged

Add built-in GeoIP capabilities #247

merged 27 commits into from
Mar 26, 2021

Conversation

Twixes
Copy link
Member

@Twixes Twixes commented Mar 12, 2021

Changes

This will allow us to efficiently run GeoIP for all users, by making it a built-in capability of the plugin server.

It works by downloading an MMDB binary blob from our new microservice (currently at https://posthog-mmdb.herokuapp.com - maybe we could put the DB in this open source repo, but I'm a bit wary of licensing, it may not be that broad for redistribution).

Then a plugin (like our new GeoIP built specifically for this: https://github.com/PostHog/posthog-plugin-geoip) can access this and efficiently get location data.

I tried putting all of this into the above new plugin, but considering the size of the database, complications in distributed download of the database, and complexity of dependencies, it's MUCH simpler to do this inside the plugin server.

Checklist

  • Updated Settings section in README.md, if settings are affected
  • Jest tests

@Twixes Twixes marked this pull request as draft March 12, 2021 15:56
@Twixes Twixes added the bump minor Bump minor version when this PR gets merged label Mar 12, 2021
@Twixes Twixes marked this pull request as ready for review March 19, 2021 15:43
@Twixes Twixes requested a review from mariusandra March 20, 2021 09:38
Copy link
Collaborator

@mariusandra mariusandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, it's definitely worth doing this downloading directly in the plugin server... and not inside a plugin. Some feedback though:

  1. I see it's a 60MB .mmdb file that we're downloading. Can't we download a lean .gz of a few megabytes instead? We can't imagine every user/server to be behind a fast connection.
  2. Storing 60MB in Redis is way too much. Heroku's Free Redis maxes out at 25MB as we know. Even the first paid plan at $15/mo is 50MB. You'll have to pay $30/mo to get a redis that will fit this database. I don't think that's really an option.
  3. To be in line with all the other libraries that plugins can use (posthog, google and fetch mainly for now), this geoip should not be passed in meta, but as a global.
  4. What if the previously downloaded (and 7 day cached) .mmdb file expires... and somehow it's not possible to download it during the next restart of the plugin server? For example in an airgapped system. We should not just delete the previously downloaded file from the cache, but keep using it and just refresh it when it's stale.
  5. If you have a plugin server running for months without restarts (could happen), it would be nice to periodically refresh the maxmind database as well (e.g. check every day the md5sum of the downloaded file with what's in the update server... and update if needed).
  6. I think we shouldn't wait for a 60MB file to be downloaded when booting the server, but do it async. If a plugin the accesses this database, we could then await until the database is loaded?
  7. Why the geoip2-node package and not the mmdb-lib pacakge the other plugin used?

@Twixes
Copy link
Member Author

Twixes commented Mar 23, 2021

  1. Good point on compression. I brotli'd the DB (https://github.com/PostHog/posthog-mmdb, long compression time but super quick decompress) and that got the file to 22 MB, which is however still more than a few.
  2. Redis is a good fit since this is pretty much ephemeral, but alternatively we could store this as a global plugin attachment (only this would require a DB migration to allow for non-particular-plugin entries).
  3. I strongly oppose magically injected globals, because an object passed in with meta is much more explicit and globals simply do not work with TypeScript without awkward workarounds.
  4. Good point, can implement something non-expiring.
  5. 👍 with periodic refresh.
  6. TBH I think waiting on plugin server launch makes this less prone to posing problems than causing event processing to await (30 s timeouts, right?). At least until we figure out a better solution to handling potentially long-running plugins.
  7. The MaxMind abstraction has slightly nicer types plus it's listed on the MaxMind website as officially supported, which gives me just a bit more confidence.

@mariusandra
Copy link
Collaborator

  1. I think plugin attachments are a better solution here.
  2. I understand your concern with magic globals and typing, but we shouldn't adopt a system where some libraries are shared one way and others another way. We should be consistent and changing this is surely outside the scope of this PR.
  3. OK!
  4. I think in the end I used that one because it was the only one I could effectively compress into one large ".js" file. Fine to switch.

New points:

  1. Could it make sense to develop this in a way that the existing maxmind plugin (that lets you upload a custom database) can drop its reliance on compiling in the mmdb-lib code and instead also use the exported geoip code? The only thing required is to export something that can be used to make a new reader with a custom .mmdb file...

@Twixes
Copy link
Member Author

Twixes commented Mar 23, 2021

As for 8. that would be very simple but I wouldn't be very comfortable with it, because the plugin server has to store the DB in memory (openBuffer). That's okay for one DB per instance, but would not scale on Cloud if multiple users wanted to use their own DBs. So at the moment I think we should just get the functionality in with only the DB provided by us, and then consider extending it later.

@Twixes
Copy link
Member Author

Twixes commented Mar 23, 2021

Re: globals vs meta, OK with switching to a global here for consistency currently, but I think a followup PR should make all the extension also available in meta, Zapier z-object style. Would you be OK with that?

@mariusandra
Copy link
Collaborator

Re 8, I don't mean this as a default option, but more that we could rebrand the existing "posthog maxmind plugin" to a "posthog custom maxmind database plugin" and have it share some code. Nothing will change in memory usage to people already using that.

@mariusandra
Copy link
Collaborator

Re globals, I think this is a discussion for #242

I'm not immediately opposed to this, but I'm not convinced either. It's another breaking change for unclear benefits (yes, yes, better typing) that must be managed... and since I'm still not completely convinced that module.exports = plugin as Plugin is the way to go, I don't think this adds that much.

I'd still like a plugin to just look like this:

function runEveryMinute () {
  posthog.capture('ping')
}

instead of:

const plugin: Plugin = {
  runEveryMinute (meta) {
    meta.extensions.posthog.capture('ping')
  }
}
module.exports = plugin

@Twixes Twixes requested a review from mariusandra March 24, 2021 10:51
@Twixes
Copy link
Member Author

Twixes commented Mar 24, 2021

I addressed all the feedback (aside from globals/meta until there are other PRs around that), added some cool mechanisms.

For instance periodic background update:

update

Graceful handling of airgapped instances (old MMDB is used or, if there's none, the server runs normally with just GeoIP disabled until download can be performed by the background update job):

airgap-handling

@Twixes
Copy link
Member Author

Twixes commented Mar 24, 2021

Also, https://github.com/PostHog/posthog-mmdb now automatically updates the DB from MaxMind servers every Monday, Wednesday, and Friday at 2:00 AM UTC thanks to GH Actions.

Copy link
Collaborator

@mariusandra mariusandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The only thing I'm a bit sketchy about is that prepareMmdb can call distributableFetchAndInsertFreshMmdb, which can in turn call prepareMmdb. It doesn't really feel like there will be a deadlock though, so approved, but feel free to doublecheck.

image

@mariusandra
Copy link
Collaborator

and of course tests and merge conflicts...

@Twixes Twixes merged commit 1b3d208 into master Mar 26, 2021
@Twixes Twixes deleted the geoip branch March 26, 2021 02:22
fuziontech pushed a commit to PostHog/posthog that referenced this pull request Oct 12, 2021
…r#247)

* Add @maxmind/geoip2-node as dependency

* Add Server.geoIp with downloaded and cached MMDB

* Inject GeoIP extension into plugin meta

* Upgrade plugin-scaffold

* Disable MMDB fetching in tests

* Update README.md

* Fix "plugin meta has what it should have"

* Fix imports

* Use mmdb.posthog.net

* Rework MMDB approach for plugin attachments and Brotli

* Run prettier

* Improve concurrency handling and restructure files

* Add periodic in-flight MMDB staleness check with no-interrupt update

* Polish up staleness check

* Fix schedule

* Gracefully handle airgapped systems

* Update plugins.test.ts

* Update README.md

* Roll back unrelated config updates

* Add tests

* Fix plugin attachments in tests
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bump minor Bump minor version when this PR gets merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants