Please, Sir, I want some more [physical pages]
In 2012, we did significant work on parallelizing CodeSonar’s analysis, culminating in the CodeSonar 3.8 release. The parallel analysis achieves respectable performance gains.

The measurements behind those gains cover only the timing of the analysis phase of an end-to-end run, which doesn’t include some (shorter) preliminary phases. I’m not going to get into the details of why a super-linear speedup is possible at 2 cores, or why we see nearly a 10x speedup on an 8-core machine, but I assure you there are many plausible explanations.

The upshot is that by allowing CodeSonar to run more processes, we have significantly reduced the time required to run an analysis. In addition, the multi-process (message passing) architecture provides fault isolation: if one process crashes, the analysis can still carry on.

So we can ship this to users and they will love it, right?

Not so fast. It turns out determining how many processes to run is a tricky issue.

The simplest approach is to let the user decide. Anyone familiar with the “make” build system has likely run commands like “make -j4” to tell “make” that it should run up to four jobs concurrently. This is a flexible approach, but in practice, most users will forget to specify the parameter, or will never discover it in the first place. We need something easier.
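To make that concrete, here is a minimal sketch in Python of the “let the user decide” approach; the -j/--jobs flag and its fallback are illustrative, not CodeSonar’s actual command-line interface:

    import argparse
    import os

    parser = argparse.ArgumentParser(description="analysis driver (sketch)")
    parser.add_argument("-j", "--jobs", type=int, default=None,
                        help="maximum number of concurrent analysis processes")
    args = parser.parse_args()

    # Fall back to an automatic choice when the flag is omitted --
    # which, in practice, is almost always.
    jobs = args.jobs if args.jobs is not None else (os.cpu_count() or 1)
    print("running up to %d concurrent jobs" % jobs)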

We could run N processes, where N is the number of logical processors on the machine. This is not a bad idea, especially for jobs where each process has modest memory requirements, and it seems to be a common user expectation. However, in our case, each process might need hundreds of megabytes of RAM. If we start too many processes, the system might start thrashing, which could easily make performance worse than the original serial analysis.

To deal with the memory issue, we can measure how much physical memory the machine has. If the machine has 4GB of physical memory and each process requires 400MB of RAM, then there is enough memory to run 10 processes. Using this approach, we could start P processes, where:

P = min(processors, phys_mem / 400MB)
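As a sketch, on a POSIX system where these sysconf names are available (they are on Linux), the rule can be computed like this; the 400MB figure is the per-process estimate from above:

    import os

    PER_PROCESS_BYTES = 400 * 1024 * 1024  # rough per-process footprint

    processors = os.cpu_count() or 1
    phys_mem = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")

    # P = min(processors, phys_mem / 400MB), but never less than 1
    P = max(1, min(processors, phys_mem // PER_PROCESS_BYTES))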

This approach is safer: we will no longer start 8 processes on a machine with 8 cores but only 1GB of physical memory, causing the system to thrash. But there is still a pretty serious issue here: we aren’t the only job running on this computer.

There are other jobs (including the OS kernel) gobbling up available cycles and physical memory. As good citizens, we should not start so many processes that we end up stealing resources from other well-behaved jobs, which again ends in thrashing. What we should really be using is any spare capacity not otherwise in use.

Operating systems are generally better at sharing clock cycles than they are at sharing physical memory. It’s easy to share cycles between 100 processes on a 1-processor system: each process runs at almost 1% of realtime speed, assuming they are all CPU-bound. It’s hard to share 1MB of RAM between 100 processes that each want 1MB of RAM; performance will be significantly worse than 1% of realtime speed, assuming the usual preemptive scheduler.

So let’s focus on the worse problem — memory. Thrashing is really bad and needs to be avoided at all costs. We can ask the operating system how much free physical memory there is and use this adjusted definition of P:

P = min(processors, free_phys_mem / 400MB)
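The change to the earlier sketch is a single line: query available rather than total pages. On Linux, sysconf exposes this as SC_AVPHYS_PAGES; other platforms need their own probes:

    import os

    PER_PROCESS_BYTES = 400 * 1024 * 1024

    processors = os.cpu_count() or 1
    # Free (available) physical pages instead of total physical pages.
    free_phys_mem = os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")

    P = max(1, min(processors, free_phys_mem // PER_PROCESS_BYTES))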

We tried this, and the results weren’t bad; however, even on machines that ostensibly weren’t doing anything, the operating system was reporting little free memory. This wasn’t entirely unexpected — modern kernels are aggressive about file system caching — they may as well use the spare memory on a machine for something, and file system caching can make a big difference.

The next step was to assume all memory in use by the file system cache was up for grabs. This is not crazy in CodeSonar’s case, since it sports a userland I/O cache anyway, but it is antisocial with respect to other jobs on the system. So the new definition of P:

P = min(processors, (free_phys_mem + fs_cache_mem) / 400MB)
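On Linux, one way to get at both numbers is /proc/meminfo, whose MemFree and Cached fields are reported in kB; this is a Linux-only sketch, and other platforms need different probes:

    import os

    PER_PROCESS_BYTES = 400 * 1024 * 1024

    def meminfo_bytes(field):
        # /proc/meminfo lines look like "MemFree:  123456 kB".
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith(field + ":"):
                    return int(line.split()[1]) * 1024
        raise KeyError(field)

    processors = os.cpu_count() or 1
    budget = meminfo_bytes("MemFree") + meminfo_bytes("Cached")
    P = max(1, min(processors, budget // PER_PROCESS_BYTES))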

Most platforms can furnish this information with the proper coaxing, but Mac OS X’s “unified” memory manager threw a wrench into the works: it doesn’t distinguish between file system cache pages and other pages, instead classifying pages as “active” or “inactive”. So we assume 25% of the “inactive” pages are up for grabs:

P_macosx = min(processors, (free_phys_mem + inactive_mem/4) / 400MB)
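One way to probe those numbers is the third-party psutil library, whose virtual_memory() exposes the Mach “inactive” page count on macOS; a sketch of the heuristic, assuming psutil is installed:

    import psutil

    PER_PROCESS_BYTES = 400 * 1024 * 1024

    vm = psutil.virtual_memory()
    processors = psutil.cpu_count() or 1

    # Assume only 25% of "inactive" pages are really up for grabs.
    budget = vm.free + vm.inactive // 4
    P = max(1, min(processors, budget // PER_PROCESS_BYTES))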

That’s essentially how CodeSonar decides how many processes to use by default. Is there room for improvement? Absolutely. What happens if the user starts two CodeSonar analyses simultaneously? Each of them will decide to gobble up all available resources, leaving none for the other. What if the file system cache is important to system performance? Performance will suffer, because we will start hitting the disk. What if the workload on the machine changes over time? We won’t adjust the number of processes appropriately, leaving the system either under- or over-committed.

At present, our solution to all these problems is to recommend that the user manually specify the number of analysis processes if the automatic tuning isn’t cutting it. If the user wants to react dynamically to changing load conditions, they can arrange to manually start and stop analysis processes according to whatever regimen they choose. In practice, though, users rarely specify the process count, and I would bet none have ever bothered to react dynamically to changing conditions.

Where to from here?

So how would one create socially minded applications that consume available capacity without impinging on other applications? Furthermore, how can resources be split fairly between multiple such applications running concurrently? I believe the OS kernel is in the best position to facilitate solutions: it already mediates the resource needs of the processes on a machine, and it seems unlikely that a third-party solution would gain enough adoption from application developers to succeed.

The interface for the new functionality doesn’t need to be complicated. Perhaps the application could ask the kernel, “is there capacity for another process like this one?”

Conversely, it would be important to ask, “Would you advise ending one of the processes in this set of processes?”
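To make the shape of that interface concrete, here is a purely hypothetical sketch; neither query exists in any kernel today, and every name in it is invented for illustration:

    import time

    def kernel_has_capacity_for(template_pid):
        # Hypothetical query: "is there capacity for another
        # process like template_pid?" Stubbed for illustration.
        return False

    def kernel_advises_shedding(pids):
        # Hypothetical query: "would you advise ending one of
        # the processes in pids?" Stubbed for illustration.
        return False

    def manage_pool(workers, template_pid, rounds=3):
        # Poll the kernel and grow or shrink the pool. A real
        # analysis would stop workers at work-unit boundaries.
        for _ in range(rounds):
            if kernel_has_capacity_for(template_pid):
                workers.append("worker")   # stand-in for spawning
            elif workers and kernel_advises_shedding(workers):
                workers.pop()              # stand-in for retiring
            time.sleep(1)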

More expressive interfaces are certainly possible, but these two queries are largely sufficient for the current use case. One shortcoming is that there is no way to ask whether there is capacity for a process with resource requirements X, Y, Z when no similar process already exists. Another is that the OS isn’t given enough information to achieve even an approximation of fairness.

Related Technologies

There are some related technologies in existence today. The first is the POSIX “nice” function/command. A process that claims to be “nice” advises the operating system to yield resources to meaner processes when there is contention. This works well for time sharing, but it isn’t a good solution for memory contention, and in the context of parallel computing, the OS might slow down or pause a process that is in the middle of a unit of work. This can be undesirable if the algorithm has subsequent data dependencies on that unit, since the dependent units will be blocked.

The second related concept is that of a quota (see also: Windows Job object). This isn’t what we are looking for, since we want to use available capacity, not some fixed quota.

Load balancing is clearly related, but it is generally concerned with balancing loads across multiple systems, and existing load balancers are third-party solutions with significant complexity baggage and many competing implementations.

As more resource-intensive applications look to exploit modern computer hardware, this throttling problem will become more commonplace. I hope this post fosters thought and discussion on the topic. It is of particular interest at GrammaTech because a distributed version of CodeSonar is in the works, and we expect the resource contention issue to be exacerbated, since the analysis will more frequently be doing work on a variety of systems with competing responsibilities.
