When we did our recent performance tests on one of our nginx clusters I noticed something odd: the CPU was choking at a request rate that was too low for a system like that. It’s a static proxy server running vanilla nginx, and the downstream servers were doing okay in terms of latency. CPU on this system shouldn’t choke before saturating those downstream systems, but it did. `perf` reports on the process showed most of the samples occupied by symbols related to `rewrite` and `ngx_http_regex_exec`. While we have a lot of `location` rules in this codebase, and many of them are regular-expression-style matchers, it seemed like way too much time was occupied by these routines. What’s worse is that this happened even when we ran the benchmarks with known wrong URLs or triggered the block/rate-limiting configurations, which should bypass most of the `location` matching anyway.

At one point I noticed nginx (depending on compilation flags) has support for a PCRE JIT engine configuration that promises improvement in regular expression matching. Turning this on did improve the situation quite a bit, but it wasn’t anything spectacular. The regex symbols still formed a large part of the `perf` samples for every cut of URL type. Debugging this further pointed towards a combination of the following issues as the cause of the problem:
- We have lots of `rewrite` rules within a `server` block, outside of `location` blocks.
- The bypass routines in cases of known errors and/or rate limiting used relative paths in the config (internal redirects of sorts).
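As an aside on the PCRE JIT setting mentioned above: in stock nginx this is the `pcre_jit` directive, set in the main (top-level) context. A minimal sketch, assuming the nginx binary is linked against a PCRE library that was built with JIT support:

```nginx
# nginx.conf, main context (outside the http {} block)
# Assumption: the PCRE library used by this nginx build has JIT enabled;
# otherwise turning this on cannot speed up regex matching.
pcre_jit on;
```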
Our config had a bit of this shape:
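A rough sketch of that shape, going by the description in the following paragraphs. The include file name, the error page paths, and the upstream name are hypothetical, not the actual config:

```nginx
server {
    listen 80;

    # Hypothetical file: an ordered list of server-level rewrite rules,
    # evaluated one by one for every request before location matching.
    include normalize_urls.conf;

    # Error pages referenced by URI path (hypothetical paths).
    # Each of these is served through an internal redirect.
    error_page 403 /errors/403.html;
    error_page 404 /errors/404.html;

    location / {
        proxy_pass http://backend;   # hypothetical upstream
    }
}
```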
A long time ago on the project a decision was made to add a file that contained a few URL matching rules that had to run in-order for every URL, before any `location` matching runs, sort of a pre-processor to “normalize” URLs.
This is okay as long as it doesn’t get abused. But slowly over time there were additions that should’ve simply been `location` blocks instead. For the uninitiated, `location` matching tends to be more efficient at matching URLs, as nginx builds out a tree of these at startup rather than going through them serially one by one, which is what our problem technically became. This set of rules started growing as it became a kitchen sink of sorts for every “wrong” URL; at the time of debugging, the number of such rules was in the hundreds. These rules get executed multiple times if there are internal redirects; `rewrite` rules themselves can be internal redirects if they don’t use one of the bypassing flags like `break`, `redirect`, `permanent`, which further exacerbates the problem. This was true in our case since the intent was to run these in order, which means `last` and `break` can’t be used by definition.
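For reference, a sketch of how the flags change what happens after a `rewrite` matches, following the behaviour described in the `ngx_http_rewrite_module` documentation. The paths are made up for illustration, and each line is an alternative rather than something to combine in one file:

```nginx
# No flag: the URI is changed and processing continues with the
# next rewrite directive in the same block.
rewrite ^/old/(.*)$ /new/$1;

# last: stop processing the current set of rewrite directives and
# start a search for a location matching the changed URI.
rewrite ^/old/(.*)$ /new/$1 last;

# break: stop processing the current set of rewrite directives.
rewrite ^/old/(.*)$ /new/$1 break;

# redirect / permanent: answer the client with a 302 / 301 instead
# of continuing to process the request.
rewrite ^/gone/(.*)$ /moved/$1 redirect;
rewrite ^/gone/(.*)$ /moved/$1 permanent;
```

Only the flag-less form carries on with the rest of an ordered list after a match, which is why a rule set that must run in order gets walked in full on every pass.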
Secondly, we used relative paths for all the `error_page` configurations, mostly as a carry-over from nginx configurations that are documented pretty much everywhere [1]. So when an error status is triggered, nginx will redo the matching from the beginning. In isolation this is not a problem, and I can understand why the default documentation snippets use this pattern. In our case these two problems in combination created a cascading effect: when testing out our rate limiting and error handling checks, which should’ve bypassed the relatively costly `location` matching, every rewrite rule got run twice, which made the performance pathologically worse!
So here’s a PSA of sorts:
- Try not to have `rewrite` rules outside of a `location` block.
- Prefer named routes for “jump”s or bypass internal redirects instead of normal URLs/paths. This would’ve avoided the second execution of the `rewrite` rules. Something like the below snippet:
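A minimal sketch of that idea, assuming the error page is a static file under the server's `root`; the `@forbidden` name and the file path are hypothetical:

```nginx
# Instead of: error_page 403 /403.html;
error_page 403 @forbidden;

# A named location is not used for regular request processing,
# only as a target for internal jumps such as error_page.
location @forbidden {
    # Serve the static page; fall back to a bare 403 if the file is missing.
    try_files /403.html =403;
}
```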
Example
Just to demonstrate this with an example, I’m going to use this configuration in nginx, which is deliberately close to what we had structurally:
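A hedged reconstruction of a config with that structure, inferred from the log output further down. The rewrite regexes and the `/main` target are taken from those logs; the replacement targets (`/known1/...`), document root, log path, and the exact `location` layout are assumptions:

```nginx
# Goes inside the http {} block.
server {
    listen 80;
    root /usr/share/nginx/html;                  # assumed document root

    rewrite_log on;                              # log each rewrite evaluation
    error_log /var/log/nginx/error.log notice;   # rewrite_log writes at notice level

    # Server-level ("naked") rewrites: evaluated in order for every request.
    rewrite /unknown1/(.*) /known1/$1;
    rewrite /unknown2/(.*) /known2/$1;
    rewrite /unknown3/(.*) /known3/$1;
    rewrite /unknown4/(.*) /known4/$1;

    # Error pages given as plain paths: each one is an internal redirect.
    error_page 403 /403.html;
    error_page 404 /404.html;

    location = / {
        rewrite ^/.*$ /main;
    }

    location /main {
        try_files /index.html =404;
    }

    location /notauthorized {
        return 403;
    }

    # Everything else, including /nonexistent and the error pages.
    location / {
        try_files $uri =404;
    }
}
```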
This sets up four main user-facing routes: `/`, `/main`, `/notauthorized`, `/nonexistent`. `/` redirects to `/main` internally, although the user won’t see any 3xx, while `/notauthorized` returns a 403 response. The latter too (`return`, as used in this case) is implemented as an internal redirect within nginx, so the routing and rule execution behaviour is going to be similar between the 404 case and the 403 case. For the uninitiated, `try_files` (as used here) checks the paths given to it within the `root` path, or else returns the status code mentioned at the end. `error_page` allows for configuring extra routes when nginx has to respond with a particular status code. Effectively, this too is an internal redirect before and after: when a `location` block has the `redirect` rule, and when the `redirect` rule itself has a path as the target location.
I’ll try the following four routes:
curl localhost/
curl localhost/main
curl localhost/notauthorized
curl localhost/nonexistent
The `/` and `/main` runs are just to demonstrate the extra `rewrite` between them. `rewrite_log on;` does what it says on the tin, and here’s a filtered snippet from the logs:
GET /
"/unknown1/(.*)" does not match "/"
"/unknown2/(.*)" does not match "/"
"/unknown3/(.*)" does not match "/"
"/unknown4/(.*)" does not match "/"
"^/.*$" matches "/"
rewritten data: "/main", args: ""
GET /main
"/unknown1/(.*)" does not match "/main"
"/unknown2/(.*)" does not match "/main"
"/unknown3/(.*)" does not match "/main"
"/unknown4/(.*)" does not match "/main"
GET /notauthorized
"/unknown1/(.*)" does not match "/notauthorized"
"/unknown2/(.*)" does not match "/notauthorized"
"/unknown3/(.*)" does not match "/notauthorized"
"/unknown4/(.*)" does not match "/notauthorized"
"/unknown1/(.*)" does not match "/403.html"
"/unknown2/(.*)" does not match "/403.html"
"/unknown3/(.*)" does not match "/403.html"
"/unknown4/(.*)" does not match "/403.html"
GET /nonexistent
"/unknown1/(.*)" does not match "/nonexistent"
"/unknown2/(.*)" does not match "/nonexistent"
"/unknown3/(.*)" does not match "/nonexistent"
"/unknown4/(.*)" does not match "/nonexistent"
"/unknown1/(.*)" does not match "/404.html"
"/unknown2/(.*)" does not match "/404.html"
"/unknown3/(.*)" does not match "/404.html"
"/unknown4/(.*)" does not match "/404.html"
Both the `/` route and `/main` work as expected: the naked `rewrite` rules run once, but in the other two cases they get executed twice. With the current config it’s a bit hard to demonstrate, but the pathological case happens even when those 403 and 404 cases occur naturally, for example with an undefined location.

Using named routes, this is the rewritten (no pun intended) config, followed by the filtered logs for the same four requests:
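A hedged sketch of what such a config might look like. The rewrite regexes come from the logs; the replacement targets, document root, log path, `location` layout, and the `@forbidden` / `@notfound` names are assumptions:

```nginx
# Goes inside the http {} block.
server {
    listen 80;
    root /usr/share/nginx/html;                  # assumed document root

    rewrite_log on;
    error_log /var/log/nginx/error.log notice;

    # Server-level rewrites, unchanged.
    rewrite /unknown1/(.*) /known1/$1;
    rewrite /unknown2/(.*) /known2/$1;
    rewrite /unknown3/(.*) /known3/$1;
    rewrite /unknown4/(.*) /known4/$1;

    # Error pages now jump to named locations, so the request does not
    # go back through the server-level rewrites.
    error_page 403 @forbidden;
    error_page 404 @notfound;

    location @forbidden {
        try_files /403.html =403;
    }

    location @notfound {
        try_files /404.html =404;
    }

    location = / {
        rewrite ^/.*$ /main;
    }

    location /main {
        try_files /index.html =404;
    }

    location /notauthorized {
        return 403;
    }

    location / {
        try_files $uri =404;
    }
}
```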
GET /
"/unknown1/(.*)" does not match "/"
"/unknown2/(.*)" does not match "/"
"/unknown3/(.*)" does not match "/"
"/unknown4/(.*)" does not match "/"
"^/.*$" matches "/"
rewritten data: "/main", args: ""
GET /main
"/unknown1/(.*)" does not match "/main"
"/unknown2/(.*)" does not match "/main"
"/unknown3/(.*)" does not match "/main"
"/unknown4/(.*)" does not match "/main"
GET /notauthorized
"/unknown1/(.*)" does not match "/notauthorized"
"/unknown2/(.*)" does not match "/notauthorized"
"/unknown3/(.*)" does not match "/notauthorized"
"/unknown4/(.*)" does not match "/notauthorized"
GET /nonexistent
"/unknown1/(.*)" does not match "/nonexistent"
"/unknown2/(.*)" does not match "/nonexistent"
"/unknown3/(.*)" does not match "/nonexistent"
"/unknown4/(.*)" does not match "/nonexistent"
As expected, only one set of `rewrite` rules runs. That said, the actual fix would be to refactor the rewrites into `location` blocks to improve the matching performance a little further.
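As a sketch of that refactor, assuming a rule only needs to map one URL prefix onto another and only has to match at the start of the URI (the `/unknown1/` and `/known1/` paths are illustrative), a server-level regex rewrite can move into a prefix `location`, so requests for other URLs never evaluate its regex:

```nginx
# Before: server-level, evaluated for every request.
# rewrite /unknown1/(.*) /known1/$1;

# After: only requests whose URI starts with /unknown1/ reach the rewrite.
# The ^~ modifier skips regex locations when this prefix is the best match.
location ^~ /unknown1/ {
    rewrite ^/unknown1/(.*)$ /known1/$1 last;
}
```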