[philiptellis] /bb|[^b]{2}/
Never stop Grokking


Tuesday, September 06, 2011

The Limits of Network Load Testing – Ephemeral Ports

I've been trying to run load tests against a node.js web app that I'm working on. I've hacked up my app to log (every 5 seconds) the number of requests per second that it handles and the amount of memory that it uses. I saw some strange numbers. I should also mention that I was using ab to throw load at the server. Perhaps not the best load tester around, but it served my purpose for this particular test.

My app started out well, handling about 1500 requests per second, but then all of a sudden it would stop serving requests. With repeated runs, the point where it would stop was always around 16K connections... in fact, it always hovered around 16384, mostly staying below, but sometimes going above it a bit (eg: 16400).

Ephemeral Ports

To those of you familiar with TCP networking, you'll recognise that value as the IANA specified number of Ephemeral Ports (49152 to 65535).

I started to log the state of all TCP connections every second using the following shell script (split onto multiple lines for readability):

while true; do
   echo -n "$( date +%s ) ";
   netstat | \
      awk '$4~/\.http-alt/ { conn[$6]++ }
           END {
              for(c in conn) {
                 printf("%s: %d\t", c, conn[c]);
              } 
              printf("\n");
           }
      ';
   sleep 1;
done
This is the output it returned:
1315341300 TIME_WAIT: 209	ESTABLISHED: 100	
1315341301 FIN_WAIT_1: 4	FIN_WAIT_2: 4	TIME_WAIT: 1892	ESTABLISHED: 92	
1315341302 FIN_WAIT_1: 1	FIN_WAIT_2: 2	TIME_WAIT: 3725	ESTABLISHED: 97	
1315341303 FIN_WAIT_2: 2	TIME_WAIT: 5426	ESTABLISHED: 97	
1315341304 FIN_WAIT_1: 1	FIN_WAIT_2: 5	TIME_WAIT: 7017	ESTABLISHED: 94	
1315341305 TIME_WAIT: 8722	ESTABLISHED: 100	
1315341306 FIN_WAIT_1: 2	TIME_WAIT: 10459	ESTABLISHED: 98	
1315341308 FIN_WAIT_1: 3	FIN_WAIT_2: 3	TIME_WAIT: 12246	ESTABLISHED: 94	
1315341309 FIN_WAIT_1: 7	TIME_WAIT: 14031	ESTABLISHED: 93	
1315341310 FIN_WAIT_1: 3	TIME_WAIT: 15937	ESTABLISHED: 97	
1315341311 TIME_WAIT: 16363	
1315341312 TIME_WAIT: 16363	
1315341314 TIME_WAIT: 16363	
1315341315 TIME_WAIT: 16363	
1315341316 TIME_WAIT: 16363	
1315341317 TIME_WAIT: 16363	
1315341318 TIME_WAIT: 16363	
1315341319 TIME_WAIT: 16363	
1315341321 TIME_WAIT: 16363	
1315341322 TIME_WAIT: 16363	
1315341323 TIME_WAIT: 16363	
1315341324 TIME_WAIT: 16363	
1315341325 TIME_WAIT: 16363	
1315341326 TIME_WAIT: 16363	
1315341328 TIME_WAIT: 16363	
1315341329 TIME_WAIT: 16363	
1315341330 TIME_WAIT: 16363	
1315341331 TIME_WAIT: 14321	
1315341332 TIME_WAIT: 12641	
1315341333 TIME_WAIT: 11024	
1315341334 TIME_WAIT: 9621	
1315341336 TIME_WAIT: 7516	
1315341337 TIME_WAIT: 5920	
1315341338 TIME_WAIT: 4227	
1315341339 TIME_WAIT: 2693	
1315341340 TIME_WAIT: 1108	
1315341341 TIME_WAIT: 23
There are a few things to note here.
  1. I ran ab with concurrency set to 100, so the number of connections in the ESTABLISHED, and FIN_WAIT_{1,2} states should be about 100 for the most part
  2. I was polling netstat once per second + the amount of time it took netstat to return, so you might notice missing seconds in between
  3. Requests were being handled at over 1000 per second, so many of the connections would have gone from the SYN_SENT state to TIME_WAIT in less than 1 second.

It's pretty clear here that everything just stops once the number of connections reaches 16363, and my app's logs confirm that. ab ends up timing out somewhere near the end of this report (which should probably be the subject of a different blog post).

Now if we look closely, we notice that it takes 30 seconds from the first result we see to when the number of connections in the TIME_WAIT state start dropping below 16363

TIME_WAIT

A few points should be noted about the TIME_WAIT state.
  1. Only the endpoint that sends the first FIN will end up in a TIME_WAIT state. I'll tell you later why this is important
  2. A connection remains in the TIME_WAIT state for twice the connection's MSL (See RFC793)

So we now have a whole load of TCP connections in the TIME_WAIT state and the client (the load tester) has run out of ephemeral ports to keep running the test. This is independent of the load tester used, and how it handles its sockets (poll, select, multiple threads, multiple processes, whatever). This is a limit imposed by the Operating System in order to comply with the TCP spec, and in general make sure the internet doesn't break.

Now there are various ways to get around this limit. The simplest is to just use multiple boxes to run your performance test all simultaneously hitting your server. The more boxes you use, the more likely it is that your server will break down before your client hits the port limit. But what if you don't have access to so many boxes (in a world without EC2 for example), or it's too expensive, or you're not utilising the other resources (CPU/RAM/Network Interface) on your client boxes and would like to get a little more out of them?

You could program your load tester to use SO_REUSEADDR, which would allow it to reuse sockets that are in the TIME_WAIT state, but that could backfire if there's still data on the network.

MSL

Now remember that TIME_WAIT is twice the MSL. RFC793 recommends setting the MSL to 2 minutes, which would leave TIME_WAIT at 4 minutes. Luckily no current implementation uses this value. Linux has its TIME_WAIT length hardcoded to 60 seconds (implying a 30 second MSL), while BSD has set to 15 seconds (sysctl net.inet.tcp.msl) and tunable at run time.

To verify, I reran the test after setting the MSL to 5 seconds:

sudo sysctl -w net.inet.tcp.msl=5000

And these were the results:

1315344661 TIME_WAIT: 1451	ESTABLISHED: 100	
1315344662 FIN_WAIT_2: 5	TIME_WAIT: 3272	ESTABLISHED: 95	
1315344664 FIN_WAIT_1: 1	TIME_WAIT: 5010	ESTABLISHED: 99	
1315344665 TIME_WAIT: 6574	ESTABLISHED: 100	
1315344666 FIN_WAIT_1: 11	FIN_WAIT_2: 2	TIME_WAIT: 7908	ESTABLISHED: 87	
1315344667 TIME_WAIT: 9689	ESTABLISHED: 100	
1315344668 TIME_WAIT: 11155	ESTABLISHED: 100	
1315344669 FIN_WAIT_1: 3	TIME_WAIT: 12522	ESTABLISHED: 97	
1315344671 FIN_WAIT_2: 2	TIME_WAIT: 13655	ESTABLISHED: 98	
1315344672 FIN_WAIT_1: 4	FIN_WAIT_2: 12	TIME_WAIT: 13847	ESTABLISHED: 84	
1315344673 FIN_WAIT_1: 2	TIME_WAIT: 13218	ESTABLISHED: 98	
1315344674 FIN_WAIT_1: 2	TIME_WAIT: 13723	ESTABLISHED: 97	
1315344675 FIN_WAIT_1: 2	TIME_WAIT: 14441	ESTABLISHED: 98	
1315344677 FIN_WAIT_1: 4	FIN_WAIT_2: 6	TIME_WAIT: 13946	ESTABLISHED: 90	
1315344678 FIN_WAIT_1: 3	FIN_WAIT_2: 15	TIME_WAIT: 14670	ESTABLISHED: 82	
1315344679 FIN_WAIT_1: 2	TIME_WAIT: 15164	ESTABLISHED: 98	
1315344680 FIN_WAIT_1: 2	TIME_WAIT: 15062	ESTABLISHED: 98	
1315344681 TIME_WAIT: 15822	ESTABLISHED: 100	
1315344683 FIN_WAIT_1: 4	TIME_WAIT: 15855	ESTABLISHED: 82	
1315344684 FIN_WAIT_1: 15	TIME_WAIT: 15506	ESTABLISHED: 84	
1315344685 FIN_WAIT_2: 9	TIME_WAIT: 15928	ESTABLISHED: 91	
1315344687 FIN_WAIT_1: 4	TIME_WAIT: 15356	ESTABLISHED: 96	
1315344688 FIN_WAIT_1: 2	FIN_WAIT_2: 8	TIME_WAIT: 15490	ESTABLISHED: 90	
1315344689 FIN_WAIT_1: 1	TIME_WAIT: 15449	ESTABLISHED: 99	
1315344690 FIN_WAIT_1: 3	TIME_WAIT: 15801	ESTABLISHED: 97	
1315344692 FIN_WAIT_1: 1	TIME_WAIT: 15882	ESTABLISHED: 70	
1315344693 FIN_WAIT_1: 3	TIME_WAIT: 16106	ESTABLISHED: 56	
1315344694 FIN_WAIT_1: 3	TIME_WAIT: 15637	ESTABLISHED: 2	
1315344695 TIME_WAIT: 14166	
1315344697 TIME_WAIT: 12588	
1315344698 TIME_WAIT: 10454	
1315344699 TIME_WAIT: 8917	
1315344700 TIME_WAIT: 7441	
1315344701 TIME_WAIT: 5735	
1315344702 TIME_WAIT: 4119	
1315344703 TIME_WAIT: 1815	
1315344704 TIME_WAIT: 88	

The test runs longer here because it actually gets to 50000 connections. We're doing about 1500 requests per second, so it would take just under 11 seconds to exhaust all 16384 ports, but since ports are reusable in 10 seconds, we would have reclaimed the first 3000 ports in this time and then just continue to use them.

Here's the catch though, you can only do this on BSD. Linux doesn't let you change the MSL. Also, DO NOT DO THIS IN PRODUCTION!

TIME_WAIT, FIN and HTTP keep-alive

Now I mentioned earlier that only the endpoint that sends the first FIN packet will enter the TIME_WAIT state. This is important because in production, you really do not want your server to end up with a lot of connections in the TIME_WAIT state. A malicious user could, for example, make a large number of HEAD requests to your server, and either wait for them to terminate, or specify Connection:close but leave the connection open for your server to close by sending the first FIN packet.

How do you deal with this?

6 comments :

Zak
September 07, 2011 4:16 PM

If I may ask, what the the file descriptor limit on this system?

Thx

Philip
September 07, 2011 4:28 PM

I'd bumped it to 5000 before I'd started testing, but that wouldn't have been an issue here, since fds are only allocated for open connections (SYN_SENT/SYN_RECD/ESTABLISHED/FIN_WAIT{1,2}/CLOSING) and at any point of time there are only 100 connections in any of those states.

Karthikeyan
September 08, 2011 9:58 PM

If you are setting up 100 parallel (v)users, why can't they reuse same ports so you would need only 100 ports? I use tools like roadrunner and like and never looked at this depth, so i might be missing something.

Philip
September 08, 2011 10:41 PM

A socket is a tuple of five elements: protocol, server IP, server port, client IP, client port. This has to be unique for the both ends to tell the difference between different peers (note that TCP is a peer-to-peer protocol and there are no clients or servers). Now since the server (HTTP server) is passive, and receives connections, it needs to listen on a fixed, well known port. It can accept multiple clients on this port, but only one at a time. Similarly, the (HTTP) client can only make one connection from a single port. In order to make 100 simultaneous connections, it needs to use 100 separate ports.

The immediate question that comes up then, is how does the server handle 100 parallel connections on the same port. The answer is that it doesn't. It handles one connection at a time, and the remaining 99 sit in a listen queue waiting for the server to be done. When done, it moves to the next connection in the queue. By default this listen queue is set to 128, so any more connections will be dropped by the TCP layer.

This is fine if the server handles a client very quickly, but if some connections take more than a few milliseconds to be served, the server will also use ephemeral ports, and will transfer the connection to an ephemeral port leaving its well known port open to accept new connections.

Anonymous
September 09, 2011 10:37 AM

Thanks Phil. Coincidentally I have two servers with very high TIME_WAIT state numbers (10K each).

a
January 20, 2014 4:27 AM

How can you find the process associated to an ephemeral port using lsof or netstat when they do not stay at TIME_WAIT state?

Post a Comment

...===...