serve-expired does not adhere to secure-by-default principle #1175

Open
pimlie opened this issue Nov 7, 2024 · 11 comments

pimlie commented Nov 7, 2024

Describe the bug

If you just set serve-expired: yes in your config, then unbound could serve stale DNS entries forever, as the default value for serve-expired-ttl is 0. This is not a secure configuration.
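
For illustration (values are just examples, not a recommendation; serve-expired-ttl is in seconds and its default of 0 means "no limit"):

serve-expired: yes
# without the line below, expired records can keep being served indefinitely
serve-expired-ttl: 86400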

To reproduce
n.a.

Expected behavior

unbound should also follow the suggestions from RFC 8767, not just its recommendations, and therefore set a non-zero default value for serve-expired-ttl. The decision to serve expired data forever should be an explicit decision by the user, as they put themselves at risk by doing so. See below for why.

System:
n.a.

Additional information

1. Security issue

As background, and to explain why this should probably be considered a security issue: I was bitten by this last week while performing an online payment, when my browser stopped the request as I was about to be redirected to an insecure page/domain.
After investigating I noticed that the SSL certificate of the page I was on did not belong to the PSP that was handling the payment but to some random third party. After contacting the PSP, they mentioned they had reason to believe this was caused by DNS caching. Although I didn't believe them at first, after diving into the unbound config more I'm quite sure they were right (sorry, PSP).
Looking at the unbound logs, I had last visited that PSP checkout page about a month ago, and I guess that when I first requested the PSP's checkout page last week, unbound served the cached DNS response from last month. It seems that between last month and last week the PSP scaled down/changed their capacity, so the resources they used last month were no longer in use, and their cloud provider had allocated those resources to another third party.

TL;DR: this is a security issue because, due to this default unbound configuration, I almost submitted personal/payment information to some random third party after unbound served me a month-old DNS record. Luckily my browser stopped me from doing that.

2. RFC

Also, I wonder whether unbound really adheres to RFC 8767. The unbound documentation says about enabling serve-expired that it "attempts to serve old responses ... without waiting for the actual resolution to finish", but section 7 of the RFC says that "a good-faith effort has been recently made to refresh the stale data before it is delivered to any client".
I might misunderstand the RFC here, but it seems the RFC says an attempt to refresh has to be made, while unbound immediately sends stale data without trying to refresh? I guess unbound could better adhere to the RFC by also setting a non-zero value for serve-expired-client-timeout by default.
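
For illustration, a sketch of what a non-zero value could look like (1800 ms is the RFC's suggested 1.8 seconds; serve-expired-client-timeout takes milliseconds):

serve-expired: yes
# wait up to 1.8s for the fresh upstream answer before falling back to stale data
serve-expired-client-timeout: 1800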

3. Mitigation

Instead of, or in addition to, setting serve-expired-ttl, maybe a setting like serve-expired-ttl-factor could be added. This might be outside the scope of RFC-8767, but the party who knows best how long DNS data should be cached is the owner of the DNS record set, who already sets a TTL themselves. Setting a fixed duration for serve-expired-ttl seems wrong, as it doesn't take the intentions of the original DNS owner into account. A setting like serve-expired-ttl-factor: 10 could be added to work next to serve-expired-ttl, where serve-expired-ttl is the maximum value and serve-expired-ttl-factor means that unbound only serves stale cache data for the configured factor times the TTL that the owner of the DNS record configured.

For example, setting a config like:

serve-expired-ttl: 86400
serve-expired-ttl-factor: 10

could mean that a DNS record with a configured TTL of 60 seconds would only be served stale for 10 * 60 = 600 seconds, while a DNS record with a TTL of 10,000 seconds would be served stale for at most 86,400 seconds (and thus not 10 * 10,000 = 100,000).
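
A minimal sketch of the intended clamping logic (serve-expired-ttl-factor and this helper are hypothetical and do not exist in unbound; this only restates the calculation above):

# Hypothetical sketch of the proposed serve-expired-ttl-factor behaviour.
def max_stale_seconds(record_ttl, serve_expired_ttl=86400, ttl_factor=10):
    # Serve stale data for at most factor * original TTL, capped by serve-expired-ttl.
    return min(record_ttl * ttl_factor, serve_expired_ttl)

print(max_stale_seconds(60))      # 600   (10 * 60, below the cap)
print(max_stale_seconds(10000))   # 86400 (capped, not 100000)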


gthess commented Nov 8, 2024

Hi Pim,
Thanks for bringing this up!

serve-expired itself is not a default setting. It was even there before the RFC and Unbound would reply straight from cache before trying to update the record. This was/is desired in certain environments where the ecosystem is somewhat controlled.

When implementing RFC 8767 we purposely left serve-expired-client-timeout at 0 so as not to break the default behavior of the serve-expired logic. Now that people have known about the new configuration options for quite some time, I believe it is the right time to use the recommended value of 1800 (ms) by default.

Serving expired answers is not ideal because the upstream explicitly communicates the TTL of the original record. Serving expired answers is mostly for controlled environments or big installations where the expired answers are used to push the qps up.

For regular resolving I would advise turning it off and instead using prefetch to try and keep fresh records in the cache.
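
For example, a minimal sketch of that advice (not an official recommendation, just the two relevant options):

# keep popular records fresh instead of serving them after expiry
serve-expired: no
prefetch: yes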


pimlie commented Nov 8, 2024

Hey Yorgos, yeah this was definitely my own mistake for just using 'some' Docker container and trusting that the container was using a sane/secure configuration without taking a good look at its configuration myself. So the only one to blame for that is me 😿
That said, it would probably also have helped if NLnet Labs published an official Docker container with a recommended config. Would be awesome if y'all could consider that ;)

One other thought I had about serve-expired-client-timeout: it would maybe be nice if unbound could track & report the average resolve time for upstream requests, because just setting serve-expired-client-timeout to some value feels quite arbitrary. Even the recommended value of 1.8s in the RFC seems somewhat arbitrary; for a properly functioning cache, 1.8s seems much more like an upper limit than a meaningful value for serve-expired-client-timeout, as in a normally functioning environment I would expect the average resolve time to be much lower than 1.8s.
So if unbound could report that upstream DNS requests on average resolve in M time with N sigma, then serve-expired-client-timeout could probably be tweaked to a meaningful value for each environment it runs in 🙏

@AlexanderBand

Hey Yorgos, yeah this was definitely my own mistake for just using 'some' docker container and trusting that container was using a sane / secure configuration

On the Docker topic, I hear good stories about the one maintained by @madnuttah.
https://github.com/madnuttah/unbound-docker


pimlie commented Nov 8, 2024

On the Docker topic, I hear good stories about the one maintained by @madnuttah. https://github.com/madnuttah/unbound-docker

A (semi-)official recommendation would be welcome too; I will take a look at that Docker image. Thanks!

I'm currently using https://hub.docker.com/r/klutchell/unbound (see also the issue mentioned above) because that build/config seemed OK as well, and that image has 10M+ pulls on Docker Hub as opposed to 50K+ for the image by @madnuttah. I remember looking at https://hub.docker.com/r/mvance/unbound too, but if I recall correctly, klutchell's image is very small/distroless while mvance's isn't.


gthess commented Nov 8, 2024

One other thought I had about serve-expired-client-timeout, it would maybe be nice if unbound could track & report the average resolve time for upstream requests. Cause just setting serve-expired-client-timeout to some value feels quite arbitrary. Even the recommended value of 1.8s in the RFC seems somewhat arbitrary, for a proper functioning cache that 1.8s seems much more like an upper limit then a meaningful value for serve-expired-client-timeout as in a normal functioning environment I would expect the average resolve time to be much lower then 1.8s. So if unbound could report like that upstream DNS requests would on average resolve in M time with N sigma then serve-expired-client-timeout could probably be tweaked to an meaningful value for each environment it runs in 🙏

There are such numbers! Look for recursion.time and histogram in the unbound-control manpage or online for the latest version.

But I think you are holding it wrong :)
The value of 1.8s was selected to be less than the amount of time a client will usually wait for a reply before retrying (2 seconds), not to match the average recursion time. So by the time the client is almost ready to give up, Unbound will serve an expired entry instead. This is of course based on observation, and indeed in controlled environments where you expect an upstream answer within X seconds (think datacenters), you can configure serve-expired-client-timeout to match your expectation, because anything slower means trouble somewhere.
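
For example (the command and metric names are from the unbound-control stats output; the numbers here are made up and the exact set of lines can differ per version and configuration):

unbound-control stats_noreset | grep recursion.time
# total.recursion.time.avg=0.042312
# total.recursion.time.median=0.028672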

@Dynamic5912

If you're going into the config to set serve-expired to yes, it doesn't take much to change serve-expired-ttl at the same time while you're in there?

I don't see this as a "real" issue, as such.


pimlie commented Nov 9, 2024

@AlexanderBand FWIW, madnuttah also sets serve-expired: yes without setting serve-expired-ttl in the example config which madnuttah says he uses himself. The default config in the Docker image does indeed seem to use serve-expired: no.

@gthess Thanks for the tip about unbound-control, will take a look :)

@Dynamic5912 That's not the point; you assume that people already know they have to configure serve-expired-ttl because otherwise they put themselves at risk. If you read this thread you'll see that at least two Docker maintainers and their thousands of users were more than likely not aware of that. Also, please read up on the secure-by-design/default principles published by CISA and partners :)

@Dynamic5912

If people are configuring Unbound they should have the knowledge and know-how to understand what they're configuring and enabling.

@madnuttah

FWIW, madnuttah also sets serve-expired: yes without setting serve-expired-ttl in it's example config which madnuttah says he uses himself. The default config in the docker image does indeed seem to use serve-expired: no.

Thank you for your heads-up! I'll fix that asap. As you've mentioned, the default config doesn't serve expired entries.

madnuttah added a commit to madnuttah/unbound-docker that referenced this issue Nov 9, 2024
Signed-off-by: ϺΛDИVTTΛH <[email protected]>
@Dynamic5912

Hi Pim,

Thanks for bringing this up!

serve-expired itself is not a default setting. It was even there before the RFC and Unbound would reply straight from cache before trying to update the record. This was/is desired in certain environments where the ecosystem is somewhat controlled.

When implementing RFC 8767 we purposely left the serve-expired-client-timeout to 0 to not break the default behavior of the serve-expired logic. Now that people know of the existence of the new configuration options for quite some time I believe it is the right time to use the recommended 1800 value by default.

Serving expired answers is not ideal because the upstream explicitly communicates the TTL of the original record. Serving expired answers is mostly for controlled environments or big installations where the expired answers are used to push the qps up.

For regular resolving I would advise to turn it off and instead use prefetch to try and keep fresh records in the cache.

Wouldn't having the cache TTL as 0 and using Prefetch negate any issues though?

Prefetch would (should) update those records in the cache as they are technically expired?


gthess commented Nov 11, 2024

Prefetch would (should) update those records in the cache as they are technically expired?

I am not sure I follow, but prefetch and serve-expired trigger the same fetching logic. Prefetch does it when a currently non-expired reply is in the last 10% of its original TTL (e.g. for a record with a 3600-second TTL, a query arriving with less than 360 seconds of TTL left triggers a prefetch); serve-expired does it when the record is already expired. Both try to update the cache. So if you only have prefetch enabled, an expired record will not be used and will instead be resolved normally.
