
serve-expired does not adhere to secure-by-default principle #1175

Closed

pimlie opened this issue Nov 7, 2024 · 19 comments
@pimlie

pimlie commented Nov 7, 2024

Describe the bug

If you just set serve-expired: yes in your config, then unbound can serve stale DNS entries forever, because the default value for serve-expired-ttl is 0 (no limit). This is not a secure configuration.
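For illustration, a minimal sketch of the difference (the 86400 value is just an example, not an official recommendation):

server:
    serve-expired: yes
    # the default serve-expired-ttl: 0 means no upper bound, so stale records may be served indefinitely
    # an explicit bound, e.g. one day, limits how long expired data can be served:
    serve-expired-ttl: 86400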

To reproduce
n.a.

Expected behavior

unbound should also follow the suggestions from RFC 8767 and not just its recommendations, and therefore set a non-zero default value for serve-expired-ttl. The decision to serve expired data forever should be an explicit decision by the user, as they put themselves at risk by doing so. See below for why.

System:
n.a.

Additional information

1. Security issue

As background, and to explain why this should probably be considered a security issue: I was bitten by this last week while performing an online payment, when my browser stopped the request as I was about to be redirected to an insecure page/domain.
After investigating I noticed that the SSL certificate of the page I was on did not belong to the PSP that was handling the payment but to some other random third party. After contacting the PSP they mentioned that they had reason to believe this was caused by DNS caching. Although I didn't believe them at first, after diving further into the unbound config I'm quite sure they were right (sorry, PSP).
Looking at the unbound logs, I had last visited that PSP checkout page about a month ago, and I guess that when I first requested the PSP's checkout page last week, unbound served the cached DNS response from last month. But it seems that between last month and last week the PSP scaled down/changed their capacity, so the resources they used last month were no longer in use, and their cloud provider had allocated those resources to another third party.

TLDR: this is a security issue because, due to this default unbound configuration, I almost submitted personal / payment information to some random third party as unbound served me a month-old DNS record. Luckily my browser stopped me from doing that.

2. RFC

Also, I wonder whether unbound really adheres to RFC 8767. Unbound says about enabling serve-expired that it attempts to serve old responses ... without waiting for the actual resolution to finish. But section 7 of the RFC says that a good-faith effort has been recently made to refresh the stale data before it is delivered to any client.
I might misunderstand the RFC here, but it seems the RFC says an attempt to refresh has to be made, while unbound immediately sends stale data without trying to refresh? I guess unbound could better adhere to the RFC by also setting a non-zero value for serve-expired-client-timeout by default.
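To illustrate what I mean (1800 ms is just the value the RFC suggests, not necessarily the right default):

server:
    serve-expired: yes
    # wait up to 1.8 seconds for a fresh upstream answer before falling back to expired data
    serve-expired-client-timeout: 1800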

3. Mitigation

Instead of, or in addition to, setting serve-expired-ttl, maybe a setting like serve-expired-ttl-factor could be added. This might be outside the scope of RFC 8767, but the party who knows best how long DNS data should be cached is the owner of the DNS record set, who already sets a TTL themselves. Setting a fixed duration for serve-expired-ttl seems wrong because it doesn't take the intentions of the original DNS owner into account. A setting like serve-expired-ttl-factor: 10 could work next to serve-expired-ttl, where serve-expired-ttl is the maximum value and serve-expired-ttl-factor means that unbound only serves stale cache data for the configured factor times the TTL that the owner of the DNS record configured.

For example, setting a config like:

serve-expired-ttl: 86400
serve-expired-ttl-factor: 10

could mean that a DNS record with a configured TTL of 60 seconds would then only be cached for 10 * 60 = 600 seconds, while a DNS record with a TTL of 10,000 seconds would be cached for 86,400 seconds (and thus not 10 * 10,000 = 100,000).

@gthess
Member

gthess commented Nov 8, 2024

Hi Pim,
Thanks for bringing this up!

serve-expired itself is not a default setting. It was even there before the RFC and Unbound would reply straight from cache before trying to update the record. This was/is desired in certain environments where the ecosystem is somewhat controlled.

When implementing RFC 8767 we purposely left serve-expired-client-timeout at 0 to not break the default behavior of the serve-expired logic. Now that people have known about the new configuration options for quite some time, I believe it is the right time to use the recommended value of 1800 by default.

Serving expired answers is not ideal because the upstream explicitly communicates the TTL of the original record. Serving expired answers is mostly for controlled environments or big installations where the expired answers are used to push the qps up.

For regular resolving I would advise turning it off and instead using prefetch to try and keep fresh records in the cache.
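For example, something along these lines (a sketch, not an official recommended configuration):

server:
    serve-expired: no
    # refresh popular records while they are still valid, during the last 10% of their original TTL
    prefetch: yes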

@pimlie
Author

pimlie commented Nov 8, 2024

Hey Yorgos, yeah this was definitely my own mistake for just using 'some' docker container and trusting that container was using a sane / secure configuration without taking a good look at its configuration myself. So the only one to blame for that is me 😿
That said, it would probably also have helped if NLnet Labs published an official docker container with a recommended config. Would be awesome if y'all could consider that ;)

One other thought I had about serve-expired-client-timeout: it would be nice if unbound could track & report the average resolve time for upstream requests, because just setting serve-expired-client-timeout to some value feels quite arbitrary. Even the recommended value of 1.8s in the RFC seems somewhat arbitrary; for a properly functioning cache, 1.8s seems much more like an upper limit than a meaningful value for serve-expired-client-timeout, as in a normally functioning environment I would expect the average resolve time to be much lower than 1.8s.
So if unbound could report that upstream DNS requests resolve on average in M time with N sigma, then serve-expired-client-timeout could probably be tweaked to a meaningful value for each environment it runs in 🙏

@AlexanderBand
Member

Hey Yorgos, yeah this was definitely my own mistake for just using 'some' docker container and trusting that container was using a sane / secure configuration

On the Docker topic, I hear good stories about the one maintained by @madnuttah.
https://github.com/madnuttah/unbound-docker

@pimlie
Author

pimlie commented Nov 8, 2024

On the Docker topic, I hear good stories about the one maintained by @madnuttah. https://github.com/madnuttah/unbound-docker

A (semi-)official recommendation would be welcome too; I will look at that docker image. Thanks!

I'm currently using https://hub.docker.com/r/klutchell/unbound (see also the issue mentioned above) because his build/config seemed ok as well and that image has 10M+ pulls on Docker Hub as opposed to 50K+ for the image by @madnuttah. I remember looking at https://hub.docker.com/r/mvance/unbound too, but if I recall correctly klutchell's image is very small / distroless while mvance's isn't.

@gthess
Member

gthess commented Nov 8, 2024

One other thought I had about serve-expired-client-timeout: it would be nice if unbound could track & report the average resolve time for upstream requests, because just setting serve-expired-client-timeout to some value feels quite arbitrary. Even the recommended value of 1.8s in the RFC seems somewhat arbitrary; for a properly functioning cache, 1.8s seems much more like an upper limit than a meaningful value for serve-expired-client-timeout, as in a normally functioning environment I would expect the average resolve time to be much lower than 1.8s. So if unbound could report that upstream DNS requests resolve on average in M time with N sigma, then serve-expired-client-timeout could probably be tweaked to a meaningful value for each environment it runs in 🙏

There are such numbers! Look for recursion.time and histogram in the unbound-control manpage or online for the latest version.
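For example (exact counter names can differ slightly between versions, so treat this as a sketch):

unbound-control stats_noreset | grep 'recursion.time'

which should print counters such as total.recursion.time.avg and total.recursion.time.median for the running instance.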

But I think you are holding it wrong :)
The value of 1.8s was selected to be less than the amount of time a client will usually wait for a reply before retrying (2 seconds), not for the average recursion time. So by the time they are almost ready to give up, Unbound will serve them an expired entry instead. This is of course based on observed data, and indeed in controlled environments where you expect an upstream answer within X seconds (think datacenters), you can configure serve-expired-client-timeout to match your expectation, because anything slower means trouble somewhere.

@Dynamic5912

If you're going into the config to set serve-expired to yes, it doesn't take much to change serve-expired-ttl at the same time whilst you're in the config?

I don't see this as a "real" issue, as such.

@pimlie
Author

pimlie commented Nov 9, 2024

@AlexanderBand FWIW, madnuttah also sets serve-expired: yes without setting serve-expired-ttl in his example config, which madnuttah says he uses himself. The default config in the docker image does indeed seem to use serve-expired: no.

@gthess Thanks for the tip about unbound-control, will take a look :)

@Dynamic5912 That's not the point; you assume that people already know that they have to configure serve-expired-ttl because otherwise they put themselves at risk. If you read this thread you'll see that at least 2 docker maintainers and their thousands of users were more than likely not aware of that. Also, please read up on the secure-by-design/default principle published by CISA and partners :)

@Dynamic5912

If people are configuring Unbound they should have the knowledge and know-how to understand what they're configuring and enabling.

@madnuttah

FWIW, madnuttah also sets serve-expired: yes without setting serve-expired-ttl in his example config, which madnuttah says he uses himself. The default config in the docker image does indeed seem to use serve-expired: no.

Thank you for your heads-up! I'll fix that asap. As you've mentioned, the default config doesn't serve expired entries.

madnuttah added a commit to madnuttah/unbound-docker that referenced this issue Nov 9, 2024
Signed-off-by: ϺΛDИVTTΛH <[email protected]>
@Dynamic5912

Hi Pim,

Thanks for bringing this up!

serve-expired itself is not a default setting. It was even there before the RFC and Unbound would reply straight from cache before trying to update the record. This was/is desired in certain environments where the ecosystem is somewhat controlled.

When implementing RFC 8767 we purposely left serve-expired-client-timeout at 0 to not break the default behavior of the serve-expired logic. Now that people have known about the new configuration options for quite some time, I believe it is the right time to use the recommended value of 1800 by default.

Serving expired answers is not ideal because the upstream explicitly communicates the TTL of the original record. Serving expired answers is mostly for controlled environments or big installations where the expired answers are used to push the qps up.

For regular resolving I would advise turning it off and instead using prefetch to try and keep fresh records in the cache.

Wouldn't having the cache TTL as 0 and using Prefetch negate any issues though?

Prefetch would (should) update those records in the cache as they are technically expired?

@gthess
Member

gthess commented Nov 11, 2024

Prefetch would (should) update those records in the cache as they are technically expired?

I am not sure I follow, but prefetch and serve-expired trigger the same fetching logic. Prefetch does it when a currently non-expired reply is in the last 10% of its original TTL; serve-expired does it when the record is expired. Both try to update the cache. So if you only have prefetch enabled, an expired record will not be used and will instead be resolved normally.
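To make that concrete, a sketch of the two options side by side (values are illustrative only):

server:
    # refresh records that are still valid but in the last 10% of their original TTL
    prefetch: yes
    # additionally allow answering from expired records while a refresh of the record is triggered
    serve-expired: yes
    serve-expired-ttl: 86400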

@Dynamic5912

Is serve-expired-ttl still staying with a default value of "0" with this new change?

@gthess
Member

gthess commented Nov 26, 2024

Good question.
Any value you choose there already violates the lifetime of a record, so there is technically no difference between infinity and some value. But since the RFC suggests "1 to 3 days", maybe now is a good time to update that value, if only to be close to the RFC suggestion. I'll see if we miss anything expectation-wise if we change the default; otherwise the change will land in the same version as the serve-expired-client-timeout default change.

@Dynamic5912

Good question.

Any value you choose there already violates the lifetime of a record, so there is technically no difference between infinity and some value. But since the RFC suggests "1 to 3 days", maybe now is a good time to update that value, if only to be close to the RFC suggestion. I'll see if we miss anything expectation-wise if we change the default; otherwise the change will land in the same version as the serve-expired-client-timeout default change.

Most people set a time of 1 day if you look at various forums/posts/blogs about setting up Unbound.

So I think 1 day would be a good baseline default setting.

@pimlie
Author

pimlie commented Nov 26, 2024

So I think 1 day would be a good baseline default setting.

Agreed; secure-by-default also means staying on the safe side (of recommendations), which in this case is the lower value. That said, I still think that adding both an upper bound and taking the actual TTL value of the DNS record into account could be worthwhile (i.e. the serve-expired-ttl-factor setting suggested above).

@pimlie
Author

pimlie commented Nov 26, 2024

Also, as mentioned above, the serve-expired-ttl-factor setting could be a way to tune the system according to the preferences of the domain holder. In my case with the PSP, they use a TTL of ~60s because they use a fully automatically scaling cloud solution and only block previously used resources for a couple of hours before releasing them. So even setting serve-expired-ttl to 1 day could result in an outdated DNS record being served.

I just don't think that many people are really aware of RFC 8767 and its implications for scalable cloud solutions yet. But those same people are probably aware of general potential DNS issues, so they likely have already set a reasonable TTL (but again without realizing that RFC 8767 recommends caching for a static 1-3 days). This means that any scalable cloud-hosted application that wants to provide ultimate safety to its clients (which is basically a GDPR requirement) either needs to buy its own IP ranges or needs to block cloud resources for 3 days after scaling down (which seems unreasonable).

To be fair, this also seems like a shortcoming / oversight in RFC 8767. So maybe this should be discussed with the authors of RFC 8767? Wdyt @gthess & @AlexanderBand?

@gthess
Member

gthess commented Dec 3, 2024

I understand your concern, but to be fair, using serve-expired is altogether unreasonable. The upstream sends a TTL with the answer and clearly indicates when the record should not be used anymore. The upstream cannot account for serve-stale being used, because there lies madness. And if a resolver party points you to stale data, that is the resolver party's fault/optimization.

Both the static value and a dynamic one can lead to failures, as any assumptions on the original TTL are probably wrong; the only thing the TTL conveys is the expiry time. For your factor proposal I am seeing it from a different angle: I would use a factor of the original TTL, like 80% for example. That would mean a cached record can be served for 180% of the original TTL, the last 80% being expired. I can't find a good value to suggest for either your proposal or mine because nothing makes sense to me :)
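To make the two readings concrete (numbers are illustrative only): with an original TTL of 600 seconds, my 80% reading would allow the record to be served for 600 + 0.8 * 600 = 1080 seconds in total, the last 480 of which are expired, while your serve-expired-ttl-factor: 10 proposal would allow expired use for up to 10 * 600 = 6000 seconds, capped by serve-expired-ttl.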

The serve stale RFC made an effort to standardize the operational practice of serving expired records; these installations have the traffic to keep expired records fairly up-to-date.

I am not sure if you are involved with the IETF, but if not it is easy to start a discussion, the IETF is a pretty open space. You can contact the authors directly or create a new topic on the dnsop mailing list.

gthess added a commit that referenced this issue Dec 3, 2024
@Dynamic5912

Many TTLs have expirations of 1 hour or less; some are even a matter of minutes or seconds.

I highly doubt DNS records are changing that frequently and it just causes unnecessary traffic on connections.

Serve-expired is fine when used within the RFC guidelines, and an expiration of 1 day leaves little room for error/compromise compared to the indefinite "0" currently used by default.

@pimlie
Author

pimlie commented Dec 3, 2024

I understand your concern, but to be fair, using serve-expired is altogether unreasonable

Not sure I read this correctly, but also given what you said before, do you/NLnet (strongly?) recommend disabling serve-expired except in very specific scenarios, e.g. when you run into performance issues and/or flaky upstream behaviour? If so, maybe a clear notification/warning about this could be added to the docs.

Both the static value and a dynamic one can lead to failures, as any assumptions on the original TTL are probably wrong; the only thing the TTL conveys is the expiry time. For your factor proposal I am seeing it from a different angle: I would use a factor of the original TTL, like 80% for example. That would mean a cached record can be served for 180% of the original TTL, the last 80% being expired. I can't find a good value to suggest for either your proposal or mine because nothing makes sense to me :)

Although we might word things differently, it seems we've been almost in perfect sync about the issues. E.g. using 80% instead of a factor of 1.8 seems more of an implementation detail to me 😉 The main reason I was thinking in factors was that my starting point was a 1-minute TTL, and I think that caching a 1-minute TTL for 5 minutes (factor 5) should probably be ok (but not for a TTL of 1 day).

I am not sure if you are involved with the IETF, but if not it is easy to start a discussion, the IETF is a pretty open space. You can contact the authors directly or create a new topic on the dnsop mailing list.

No, I'm not involved with the IETF. So if you think some cache timer that takes the original TTL into account is a good idea, then I have no issue if you / NLnet take ownership, as y'all are probably much more into the nitty-gritty details that come with DNS RFCs than I am 🙏

jedisct1 added a commit to jedisct1/unbound that referenced this issue Dec 26, 2024
* nlnet/master: (26 commits)
  - For NLnetLabs#1175, update serve-expired tests.
  - Fix NLnetLabs#1175: serve-expired does not adhere to secure-by-default principle. The default value of serve-expired-client-timeout is set to 1800 as suggested by RFC8767.
  - Fix comparison to help static analyzer.
  Changelog entry for NLnetLabs#1169: - Merge NLnetLabs#1169 from Sergey Kacheev, fix: lock-free counters for auth_zone up/down queries.
  fix: lock-free counters for auth_zone up/down queries
  - Fix for NLnetLabs#1183: release nsec3 hashes per test file.
  - Fix NLnetLabs#1183: the data being used is released in method nsec3_hash_test_entry.
  - Complete fix for max-global-quota to 200.
  - More descriptive text for 'harden-algo-downgrade'.
  - Increase the default of max-global-quota to 200 from 128 after operational feedback. Still keeping the possible amplification factor (CAMP related issues) in the hundreds.
  Changelog entry for: - Fix SETEX check during Redis (re)initialization.
  - Fix SETEX check during Redis (re)initialization.
  - Fix to log redis timeout error string on failure.
  - Fix for the serve expired DNSSEC information fix, it would not allow current delegation information be updated in cache. The fix allows current delegation and validation recursion information to be updated, but as a consequence no longer has certain expired information around for later dnssec valid expired responses.
  Changelog note for NLnetLabs#1167 - Merge NLnetLabs#1167: Makefile.in: fix occasional parallel build failures around bison rule.
  Makefile.in: fix occasional parallel build failures around bison rule (NLnetLabs#1167)
  - Fix redis that during a reload it does not fail if the redis server does not connect or does not respond. It still logs the errors and if the server is up checks expiration features.
  - Fix redis that during a reload it does not fail if the redis server does not connect or does not respond. It still logs the errors and if the server is up checks expiration features.
  Changelog entry for NLnetLabs#1157: - Merge NLnetLabs#1157 from Liang Zhu, Fix heap corruption when calling ub_ctx_delete in Windows.
  Fix heap corruption when calling ub_ctx_delete in Windows (NLnetLabs#1157)
  ...
jedisct1 added a commit to jedisct1/unbound that referenced this issue Dec 26, 2024
* nlnet/master:
  - For NLnetLabs#1207: [FR] Support for RESINFO RRType 261 (RFC9606), add LDNS_RR_TYPE_RESINFO similar to LDNS_RR_TYPE_TXT.
  Changelog entry for NLnetLabs#1204: - Merge NLnetLabs#1204: ci: set persist-credentials: false for actions/checkout per zizmor suggestion.
  set persist-credentials: false per zizmor suggestion
  - Fix typo in log_servfail.tdir test.
  Changelog entry for NLnetLabs#1187: - Merge NLnetLabs#1187: Create the SSL_CTX for QUIC before chroot and privilege drop.
  Create the SSL_CTX for QUIC before chroot and privilege drop (NLnetLabs#1187)
  - Safeguard alias loop while looking in the cache for expired answers.
  - Merge NLnetLabs#1198: Fix log-servfail with serve expired and no useful cache contents.
  - For NLnetLabs#1175, the default value of serve-expired-ttl is set to 86400 (1 day) as suggested by RFC8767.
  Changelog entry for NLnetLabs#1189, NLnetLabs#1197: - Merge NLnetLabs#1189: Fix the dname_str method to cause conversion errors when the domain name length is 255. - Merge NLnetLabs#1197: dname_str() fixes.
  - For NLnetLabs#1193, introduce log-servfail.tdir and cleanup the log-servfail setting from other tests.
  - Fix NLnetLabs#1193: log-servfail fails to log host SERVFAIL responses in Unbound 1.19.2 on Ubuntu 24.04.1 LTS, by not considering cached failures when trying to reply with expired data.
  - For NLnetLabs#1189, homogenize the input buffer size for dname_str().
  - For NLnetLabs#1189, add unit tests for dname_str() and debug check the input buffer size.
  Fix the dname_str method to cause conversion errors when the domain name length is 255