Tuning Kamailio for high throughput and performance

Unrivalled SIP message processing throughput is one of Kamailio’s central claims to fame. When it comes to call setups per second (“CPS”) or SIP messages per second, there’s nothing faster than the OpenSER technology stack. Naturally, we pass that benefit on in the value proposition of CSRP, our Kamailio-based “Class 4” routing, rating and accounting platform.

It’s an important differentiator from many traditional softswitches, SBCs and B2BUAs, some of which are known to fall over with as little as 100 CPS of traffic, and many others of which top out at a few hundred CPS spread over the entire installation. It shapes the horizontal scalability and port density of those platforms, and, accordingly, the unit economics for actors in the business: per-port licensing costs, server dimensioning, and ultimately, gross margins in a world where PSTN termination costs are spiraling down rapidly and short-duration traffic (love it or hate it) plays a conspicuous role in the ITSP industry prospectus.

That’s why it’s worthwhile to take the time to understand how Kamailio does what it does, and what that means for you as an implementor (or a prospective CSRP customer? :-).

Kamailio concurrency architecture

Kamailio does not use threads as such. Instead, when it boots, it fork()s some child processes that are specialised into SIP packet receiver roles. These are bona fide independent processes, and although they may be colloquially referred to as “threads”, they’re not POSIX threads, and, critically, don’t use POSIX threads’ locking and synchronisation mechanisms. Kamailio child processes communicate amongst themselves (interprocess communication, or “IPC”) using System V shared memory. We’re going to call these “receiver processes” for the remainder of the article, since that’s what Kamailio itself calls them.

The number of receiver processes to spawn is governed by the children= core configuration directive. This value is multiplied by the number of listening interfaces and transports. For example, in the output below, I have my children set to 8, but because I am listening on two network interfaces (209.51.167.66 and 10.150.20.2), there are eight processes for each interface. If I enabled SIP over TCP as well as UDP, the number would be 32. But a more typical installation would simply have 8:

[root@allegro-1 ~]# kamctl ps
Process:: ID=0 PID=22937 Type=attendant
Process:: ID=1 PID=22938 Type=udp receiver child=0 sock=209.51.167.66:5060
Process:: ID=2 PID=22939 Type=udp receiver child=1 sock=209.51.167.66:5060
Process:: ID=3 PID=22940 Type=udp receiver child=2 sock=209.51.167.66:5060
Process:: ID=4 PID=22941 Type=udp receiver child=3 sock=209.51.167.66:5060
Process:: ID=5 PID=22942 Type=udp receiver child=4 sock=209.51.167.66:5060
Process:: ID=6 PID=22943 Type=udp receiver child=5 sock=209.51.167.66:5060
Process:: ID=7 PID=22944 Type=udp receiver child=6 sock=209.51.167.66:5060
Process:: ID=8 PID=22945 Type=udp receiver child=7 sock=209.51.167.66:5060
Process:: ID=9 PID=22946 Type=udp receiver child=0 sock=10.150.20.2:5060
Process:: ID=10 PID=22947 Type=udp receiver child=1 sock=10.150.20.2:5060
Process:: ID=11 PID=22948 Type=udp receiver child=2 sock=10.150.20.2:5060
Process:: ID=12 PID=22949 Type=udp receiver child=3 sock=10.150.20.2:5060
Process:: ID=13 PID=22950 Type=udp receiver child=4 sock=10.150.20.2:5060
Process:: ID=14 PID=22951 Type=udp receiver child=5 sock=10.150.20.2:5060
Process:: ID=15 PID=22952 Type=udp receiver child=6 sock=10.150.20.2:5060
Process:: ID=16 PID=22953 Type=udp receiver child=7 sock=10.150.20.2:5060
Process:: ID=17 PID=22954 Type=slow timer
Process:: ID=18 PID=22955 Type=timer
Process:: ID=19 PID=22956 Type=MI FIFO

(There are some other child processes besides receivers, but these are ancillary — they do not perform Kamailio’s core function of SIP message processing. More about other processes later.)
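
For reference, a process layout like the one above corresponds to core settings roughly along these lines (just a sketch; the listen addresses are the ones from this example):

children=8                      # receiver processes per listening UDP socket
listen=udp:209.51.167.66:5060   # first (public) interface
listen=udp:10.150.20.2:5060     # second (private) interface

Each udp listen socket gets its own pool of children receivers, which is why there are sixteen of them above rather than eight.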

You can think of these receiver processes as something like “traffic lanes” for SIP packets; as many “lanes” as there are, that’s how many SIP messages can be crammed onto the “highway” at the same time:

[Diagram: Kamailio SIP worker processes as parallel “traffic lanes” for incoming SIP messages]

This is more or less the standard static “thread pool” design. For low-latency, high-volume workloads, it’s probably the fastest available option. Because the size of the worker pool does not change, the overhead of starting and stopping threads constantly is avoided. What applies to static thread pool management in general also applies here.

Of course, synchronisation, the mutual exclusion locks (“mutexes”) which ensure that multiple threads do not access and modify the same data at the same time in conflicting ways, is the bane of multiprocess programming, whatever form the processes take. The parallelism benefit of multiple threads is undermined when they all spend a lot of time blocked, waiting for mutexes held by other threads to be released before their execution can continue. Think of a multi-lane road where every car is constantly changing lanes; there’s a lot of waiting, acknowledgment and coordination that has to happen, inevitably leading to a slow-down or jam. The ideal design is “shared-nothing”, where every car always stays in its own lane; that is, where every thread can operate more or less self-sufficiently without complicated sharing (and therefore, locking) with other threads.

The design of Kamailio is what you might call “share as little as possible”; while certain data structures and other constructs (AVPs/XAVPs, SIP transactions, dialog state, htable, etc.) are unavoidably global (otherwise they wouldn’t be very useful), residing in the shared memory IPC space accessed by all receiver processes, much of what every receiver process requires to operate on a SIP message is proprietary to that process. For instance, every child process receives its own connection handle to databases and key-value stores (e.g. MySQL, Redis), removing the need for common (and contended) connection pooling. In addition to the shared memory pool used by all processes, every child process gets a small “scratch area” of memory where ephemeral, short-term data (such as $var(…) config variables) as well as persistent process-proprietary data lives. (This is called “package memory” in Kamailio, and is set with the -M command line argument upon invocation, as opposed to -m, which sets the size of the shared memory pool.)
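
To make that concrete, both pools are sized at startup; an invocation along these lines (the sizes here are arbitrary, purely for illustration) gives 512 MB of shared memory for all processes collectively and 16 MB of package memory to each child process:

kamailio -f /etc/kamailio/kamailio.cfg -m 512 -M 16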

Of course, actual results will depend on which Kamailio features you utilise, and how much you utilise them. Nearly all useful applications of Kamailio involve transaction statefulness, so you can expect, at a minimum, for transactions to be shared. If, for example, your processing is database-driven, you can expect receiver processes to operate more independently than if your processing is heavily tied up in shared memory constructs like htable or pipelimit.

Furthermore, in contrast to the architecture found in many classically multithreaded programs with this “thread pool” design, there is no “distributor” thread that apportions incoming packets. Instead, every child process calls recvfrom() (or accept() or whatever) on the same socket address. The operating system kernel itself distributes the incoming packets to listening child processes in a semi-random fashion that, in statistically large quantities, is substantially similar to “round-robin”. It’s a simple, straightforward approach that leverages the kernel’s own packet queueing and eliminates the complexity of a supervisory process to marshal data.

How many children to have?

Naturally, discussions of performance and throughput all sooner or later turn to:

What “children” value should I use for best performance?

It’s a hotly debated topic, and probably one of the more common FAQs on the Kamailio users’ mailing list. The stock configuration ships with a value of 8, which leads many people to ask: why so low? At first glance, it might stand to reason that on a busy system, the more child processes, the better. However, that’s not accurate.

The answer is complicated because it depends on your hardware and, more importantly, on Kamailio’s workload.

Before we go forward, let’s define a term: “available hardware threads”. For our purposes, this is the number of CPU “appearances” in /proc/cpuinfo. This takes into account the “logical” cores created by hyper-threading.

For instance, I have a dual-core laptop with four logical “CPUs”:

$ nproc
4

sasha@saurus:~$ cat /proc/cpuinfo | grep 'core id' | sort -u
core id		: 0
core id		: 1

In this case, our number of available hardware threads is 4.
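
You can also count those /proc/cpuinfo “appearances” directly rather than relying on nproc; on the same machine:

sasha@saurus:~$ grep -c ^processor /proc/cpuinfo
4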

In principle, the number of child processes that can usefully execute in parallel is equal to the number of available hardware threads (in the /proc/cpuinfo sense). Given a purely static Kamailio configuration on an 8-HW thread system, 8 receiver processes will have 8 different “CPU” affinities and peg out the processors with as many packets as the hardware can usefully handle. Such a configuration can handle tens of thousands of messages per second, and the limits you will eventually run into are more likely to do with userspace I/O contention or NIC frames-per-interrupt or hardware buffer type issues than with Kamailio itself.

Once you increase the number of receiver processes beyond that, the surplus processes will be fighting over the same number of hardware threads, and you’ll be more harmed by the downside of that userspace scheduling contention and the limited amount of shared memory locking that does exist in Kamailio than you’ll benefit from the upside of more processes.

However, most useful applications of Kamailio don’t involve a hard-coded config file, but rather external I/O interactions with outside systems: databases, key-value stores, web services, embedded programs, and the like. Waiting for an outside I/O call, such as a SQL query to MySQL, to return is a synchronous (or blocking) operation; while the receiver process waits for the database to respond, it sits there doing nothing. It’s tied up and cannot process any more SIP messages. It’s safe to say that the end-to-end processing latency for any given SIP message is largely determined by the cumulative I/O wait involved in the processing. Such operations are referred to as I/O-bound operations. Most of what a typical Kamailio deployment does is somehow I/O-bound.

This is where you have to take a discount from the aforementioned idealised maximum throughput of Kamailio, and it’s usually a rather steep one. The question is rarely: “How many SIP messages per second can Kamailio handle?” The right question is: “How many SIP messages per second can Kamailio handle with your configuration script and external I/O dependencies?” It stands to reason that given a fixed-size receiver thread pool, one should aim to keep external I/O wait to a minimum.

Still, when a receiver process spends a lot of time waiting on external I/O, it’s just sleeping until notified by the kernel that new data has arrived on its socket descriptor or what have you. That sleeping creates an opening for additional processes to do useful work in that time. If you have a lot of external I/O wait, it’s safe to increase the number of receiver threads to values like 32 or 64. If most of what your worker processes do is wait on a morbidly obese Java servlet on another server, you can afford to have more of them waiting.
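
To put some rough, made-up numbers on it: suppose each SIP message spends about 20 ms blocked on external I/O and a negligible amount of CPU time. Then, very loosely:

1 receiver process    ≈ 1000 ms ÷ 20 ms ≈ 50 messages/second
8 receiver processes  ≈ 8 × 50          ≈ 400 messages/second
64 receiver processes ≈ 64 × 50         ≈ 3,200 messages/second

provided, of course, that whatever you’re waiting on can actually absorb that many concurrent requests.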

This is why a typical Linux system has hundreds of background processes running, even though there are only 2, 4 or 8 hardware threads available. Most of the time, those processes aren’t doing anything. They’re just sitting around waiting for external stimuli of some sort. If they were all pegging out the CPU, you’d have a problem.

How many receiver processes can there be? All other things being equal, the answer is “not too many”. They’re not designed to be run in the hundreds or thousands. Each child process is fairly heavyweight, carrying, at a minimum, an allocation of a few megabytes of package memory, and toting its own connection handle to services such as databases, RTP proxies, etc. Since Kamailio child processes do have to share quite a few things, there are shared memory mutexes over those data structures. I don’t have any numbers, but the fast mutex design is clearly not intended to support a very large number of processes. I suppose it’s a testament to CSRP’s relatively efficient call processing loop that, despite being very database-bound, we’ve found in our own testing signs of diminishing returns after increasing children much beyond the number of available hardware threads.

Academically speaking, the easiest way to know that you need more child processes is to monitor the kernel’s packet receive queue using netstat (or ss, since netstat is deprecated in RHEL >= 7, in keeping with general developments in systemd land):

[root@allegro-1 ~]# ss -4 -n -l | grep 5060
udp    UNCONN     0      0      10.150.20.2:5060                  *:*
udp    UNCONN     0      0      209.51.167.66:5060                  *:*

The third column is the Recv-Q column. Under normal conditions, its value should be 0, with perhaps the occasional brief spike here and there. If the receive queue size is continuously > 0, bursting stubbornly high, or, worst of all, increasing monotonically, this tells you that incoming SIP messages are not being consumed by the receiver processes fast enough. These receive queues can be tuned to some extent, but that ultimately won’t solve your problem. You just need more processes to suckle on the packet teat.
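
To watch for a stubbornly nonzero Recv-Q over time, rather than taking a one-off snapshot, something as simple as this will do (the sampling interval is arbitrary):

[root@allegro-1 ~]# watch -n1 'ss -4 -n -l | grep 5060'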

More fine-grained results can be obtained with sipp scenario testing. Run calls through your Kamailio proxy and ramp the call setup rate up until the UAC starts reporting retransmissions. This gives insight into a different dimension of the problem than the packet queue: is your proxy taking too long to respond? In both cases, however, the available options are either to decrease the I/O wait to free up the receiver processes to process more messages, or to add more receiver processes.
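
As a sketch of such a test, using sipp’s built-in UAC scenario (the target address, called number and rates here are purely illustrative):

sipp -sn uac 203.0.113.10:5060 -s 12125551212 -r 100 -l 2000 -d 10000

-r sets the call rate in calls per second, -l caps the number of simultaneous calls, and -d holds each call open for the given number of milliseconds; ramp -r up between runs and keep an eye on sipp’s retransmission statistics.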

However, once you go down the road of adding receiver processes, you need to ask yourself: are these processes doing a lot of waiting, or are they always busy? If your request/message processing in the configuration script has relatively little end-to-end I/O delay, all you’re going to do is overbook your CPU, driving up your load average and slowing down the system as a whole. In that case, you’ve simply run into the limits of your hardware or your software architecture. There’s no simple fix for that.

That’s why, when the question about the ideal number of receiver processes is asked, the answers given are often avoidant and noncommittal.

What about asynchronous processing?

Over the last few years, Kamailio has evolved a lot of asynchronous processing features. Daniel-Constantin Mierla has some useful examples and information from Kamailio World 2014.

The basic idea behind asynchronous processing, in Kamailio terms, is that, in addition to the core receiver processes, an additional pool of processes is spawned to which slow, blocking operations can be delegated. Transactions are suspended and enqueued to these outside processes, and they’ll get to them whenever they can get to them; that’s the “asynchronous” part. This keeps the main SIP receiver processes free to process further messages instead of being blocked on expensive I/O operations, as the heavy lifting is left up to the dedicated async task worker processes.

Asynchronous processing can be very useful in certain situations. If you know your request routing is going to be expensive, you can send back an immediate, stateless 100 Trying and push the processing tasks out to the async task workers.
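
A very rough sketch of what that can look like in a configuration script, using the async module’s task workers (the route name and worker count are made up, and the usual module loading and routing logic is omitted):

loadmodule "async.so"

# pool of asynchronous task worker processes (core parameter)
async_workers=4

request_route {
    if ($rm == "INVITE") {
        # answer immediately so the caller stops retransmitting...
        sl_send_reply("100", "Trying");

        # ...then hand the expensive routing work to a task worker;
        # the transaction is suspended here and resumed over there
        if (!async_task_route("HEAVY_ROUTING")) {
            sl_reply_error();
        }
        exit;
    }
}

route[HEAVY_ROUTING] {
    # expensive database lookups, HTTP requests, etc. happen here,
    # without tying up a SIP receiver process
    t_relay();
}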

However, a note of caution. Asynchronous processing of all kinds is often held to be a panacea, driven by popular Silicon Valley fashions and async-everything design patterns in the world of Node.js. There’s a lot of cargo cult thinking and exuberance around asynchronous processing.

As a guiding principle, remember that asynchrony is not magic, and it cannot achieve that which is otherwise thermodynamically impossible given the number of transistors on your chip. In many cases, asynchronous programming is almost a kind of syntactic sugar, a different semantic vantage point on the same operations which ultimately have to be executed in some way on the same system given the same resources. The fact that the responsibility for I/O multiplexing is pushed to an external, opaque actor doesn’t change that.

Asynchronous processing also imposes its own overheads: in the case of Kamailio, there’s a complexity to suspending a TM transaction and reanimating it in a different process that should be weighed. (I cannot say how much complexity, and, as with everything else in this article, have made no effort to measure it or describe it with the rigour of the scientific method. But it’s there.)

In the commonplace case of database-driven workloads, asynchronous processing does little more than push the pain point to a different place. To drive the point home, let’s take an example from our very own CSRP product:

In CSRP, we write CDR events to our PostgreSQL database in an asynchronous way, since these operations are quite expensive and can potentially set off a database trigger cascade for call rating, lengthening the transaction. We don’t really care if CDRs are written with a slight delay; it’s far more important that this accounting not block SIP receiver processes.

However, many CSRP customers choose to run their PostgreSQL database on the same host as the Kamailio proxy. If the database is busy writing CDRs and is pegging out the storage controller with write ops, it’s going to make everything less responsive, asynchronous or not, including the read-only queries required for call processing. Even if the database is situated on a different host, our call processing is highly database-dependent, so overwhelming the database has deleterious consequences regardless.

This can engender a nasty positive feedback loop:

  • Adding more asynchronous task workers won’t help; they’ll just further overwhelm storage with an additional firehose of CDR events.
  • The asynchronous task queue will stack up until calling SIP endpoints start CANCELing calls due to high post-dial delay (PDD).
  • Adding more receiver processes won’t help; if the system as a whole is experiencing high I/O wait, adding more workers to take on more SIP messages just means more queries and yet more load.

A fashionable design pattern can’t fix that; you just need more hardware, or a different approach (in terms of I/O, algorithms, storage demand, etc.) to call processing.

The point is: before shifting the load to another part of the system so as to get more traffic through the front door, consider the impact globally and holistically. Maybe you can get Kamailio to slurp up more packets, but that doesn’t mean you should. How well do your external inputs scale to the task?

Asynchronous tasks can be very handy for certain kinds of applications, most notably where some sort of activity needs to be time-delayed into the future (e.g. push notifications). We love our asynchronous CDR accounting, since it’s heavy, and yet there’s no need for that to be real-time or responsive by SIP standards. However, for maximising call throughput in an I/O-bound workload such as ours, in which storage and database demand is more or less a linear function of requests per second, it’s far less clear. Our own testing suggests that asynchronous processing yields marginal benefits at best, and that we might be better off keeping ourselves honest and putting our efforts into further lowering our processing latency in the normal, synchronous execution context.

Conclusion

  • There’s no straightforward, generic answer to the question of how to reap maximum throughput from Kamailio and/or how many receiver worker processes to use. It requires deep consideration of the nature of the workload and the execution environment, and, most likely, empirical testing — doubly so for bespoke and/or nonstandard applications.
  • A reasonable guideline for most generic and/or commonplace Kamailio workloads is to set the children equal to the number of available hardware threads. The prevalence of servers with quad-core processors + HyperThreading probably explains why the stock config ships with a setting of 8.
  • Asynchronous features are convenient and can, to an extent, be used to increase raw throughput, but rapidly encounter diminishing returns when the result is a drastic increase in base I/O load on either the local host or a dependency to which the workload is heavily I/O-bound.

Many thanks to my colleagues Fred Posner, Matt Jordan and Kevin Fleming for reading drafts of this article.