Main stacks in Linux

What are the main stacks in Linux? What I mean is, for example when an interrupt occurs what stack will be used for it, and what is the difference between user process and kernel process stacks?

Asked By: Nik Novák


On Linux, when running in userspace, the userspace stack is used. If the process is running in kernelspace, it uses a kernel stack “owned” by the process. Interrupts are handled in-kernel.

Probably a better place to find such details about the private parts of your favorite kernel is to rummage around in sites catering to people programming it, e.g. kernelnewbies, or to search LWN’s “kernel pages” for Linux. You should be able to find similar places for the BSDs or Solaris, and even MacOS. Windows information might be harder to come by…

Such information isn’t discussed in typical operating system texts; you’d have to look for descriptions written for developers.

Answered By: vonbrand

This is highly platform-specific. Unless you bind to a certain platform (even the difference between x86-32 and x86-64 is fundamental), one can’t answer this. But if we limit it to x86, as your last comment suggests, I can offer some information.

There are two main styles of service request ("syscall") from user land to kernel land: interrupt-styled and sysenter-styled. (I invented these terms for this description.) Interrupt-styled requests are those that the processor handles in exactly the same manner as an external interrupt. In x86 protected mode, such a request is issued using int 0x80 (newer) or lcall 7,0 (the oldest variant, SysV-compatible) and implemented using so-called gates (call gate, interrupt gate, trap gate, task gate), configured as special descriptors. Only a task gate makes the processor perform a full hardware task switch, in which all "usual" registers, including the stack pointer, are stored to the old task’s TSS and loaded from the new task’s TSS (so this is a very long action); Linux does not use this for system calls. With an interrupt or trap gate, a transition from user to kernel privilege makes the processor load the kernel stack pointer (SS0:ESP0) from the current TSS and push the old SS, ESP, EFLAGS, CS and EIP onto that new stack; the remaining registers are saved by the kernel’s entry code. (FPU/SSE/etc. state is a separate issue, and its switching may be postponed; see the documentation for details.)
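To make the stack handling on an interrupt-styled entry concrete, here is a toy Python model of what the processor does when control passes through an interrupt or trap gate (not a task gate) to a ring-0 handler, as I understand it from the Intel manual. The class `Cpu`, the function `enter_gate` and all the numbers used with them are made up for illustration; error codes and flag changes are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Cpu:
    cpl: int       # current privilege level: 3 = user, 0 = kernel
    ss: int        # stack segment selector
    esp: int       # stack pointer
    eflags: int
    cs: int
    eip: int
    tss_ss0: int   # ring-0 stack segment recorded in the current TSS
    tss_esp0: int  # ring-0 stack pointer recorded in the current TSS
    memory: dict = field(default_factory=dict)  # address -> 32-bit word

    def push(self, value):
        self.esp -= 4
        self.memory[self.esp] = value


def enter_gate(cpu, handler_cs, handler_eip):
    """Model entry to a ring-0 handler through an interrupt or trap gate."""
    old_ss, old_esp = cpu.ss, cpu.esp
    if cpu.cpl != 0:
        # Privilege change: switch to the kernel stack named in the TSS,
        # then save the interrupted SS:ESP on that new stack.
        cpu.ss, cpu.esp = cpu.tss_ss0, cpu.tss_esp0
        cpu.push(old_ss)
        cpu.push(old_esp)
    # Saved in every case: EFLAGS, CS, EIP of the interrupted code.
    cpu.push(cpu.eflags)
    cpu.push(cpu.cs)
    cpu.push(cpu.eip)
    cpu.cpl = 0
    cpu.cs, cpu.eip = handler_cs, handler_eip
```

Entering from user land consumes five words of the kernel stack and leaves the user stack untouched; entering while already in ring 0 pushes only three words onto whatever stack is current.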

For handling such service requests, the kernel prepares a separate stack for each thread (a.k.a. LWP, lightweight process), because a thread can be switched out during any blockable function call. Such a stack is usually small (for example, 4 KB or 8 KB on x86-32, and 16 KB on recent x86-64 kernels).
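One consequence of small, size-aligned per-thread kernel stacks can be sketched in a few lines of Python. Older x86-32 kernels, as far as I remember, located the per-thread bookkeeping data by masking the low bits off the stack pointer. The names `thread_size` and `stack_base` and the sample address are mine, and the page size is assumed to be 4 KB.

```python
PAGE_SIZE = 4096  # assumption: 4 KB pages, as on x86


def thread_size(order):
    """Kernel stack size for a given allocation order (0 -> 4 KB, 1 -> 8 KB, ...)."""
    return PAGE_SIZE << order


def stack_base(sp, order):
    """Lowest address of the size-aligned kernel stack that contains sp."""
    return sp & ~(thread_size(order) - 1)
```

With an 8 KB stack (order 1), a stack pointer of 0xC1235F40 belongs to the stack starting at 0xC1234000.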

Since an x86 privilege transition through a gate always changes the stack pointer, there is no chance to reuse the userland stack for the kernel. Besides, such reuse should not be allowed at all (except for a small amount of the current thread’s data), because a user process page is insecure: another active thread can change or even unmap it. That’s why it is simply prohibited to run kernel code on the userland stack, so each thread has different stacks for its user land and kernel land; this remains true for modern, sysenter-styled processing. (And, as already noted above, each thread’s kernel stack is separate from that of every other thread.)

Sysenter-styled processing was designed much later and is implemented with the SYSENTER and SYSCALL processor instructions. They differ in that they were not designed with the old (too firm) restriction in mind that a system call shall preserve all registers. Instead, they were designed closer to a usual function call ABI, which allows a function to change some registers arbitrarily (in most ABIs, these are called "scratch" registers): only a few registers are changed, and the care of keeping old values is left to the handler routines. The SYSENTER/SYSEXIT instruction pair (available in both 32-bit and 64-bit modes on Intel processors) clobbers the old contents of RDX and RCX (in a weird manner: SYSEXIT takes the return address from RDX and the userland stack pointer from RCX, so they must be filled with proper values beforehand), and on entry the new RIP and RSP are loaded from the respective MSRs, so the stack is switched to the kernel land one immediately. In contrast, SYSCALL/SYSRET (on Intel processors, 64-bit mode only) use RCX and R11 for the return address and flags, and do not change the stack pointer by themselves. The kernel entry code therefore stashes the user stack pointer (Linux uses a per-CPU scratch slot for this) and switches to its own stack before saving anything, because 1) there is no guarantee that the userland stack is big enough to hold all needed values, and 2) for security reasons (see above). From this point on, we again have a per-thread kernel stack.
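The difference in stack handling between the two instructions can be shown with a small Python model of their register effects on entry. This is only a sketch: `Regs`, `Msrs`, `do_syscall` and `do_sysenter` are names I made up, `rip` is assumed to already point at the instruction following SYSCALL, and segment and privilege handling are omitted, as are the flag changes SYSENTER makes.

```python
from dataclasses import dataclass


@dataclass
class Regs:
    rip: int
    rsp: int
    rflags: int
    rcx: int = 0
    r11: int = 0


@dataclass
class Msrs:
    lstar: int         # IA32_LSTAR: 64-bit SYSCALL entry point
    fmask: int         # IA32_FMASK: RFLAGS bits cleared by SYSCALL
    sysenter_eip: int  # IA32_SYSENTER_EIP: SYSENTER entry point
    sysenter_esp: int  # IA32_SYSENTER_ESP: kernel stack pointer for SYSENTER


def do_syscall(regs, msrs):
    """SYSCALL: return address and flags go to RCX and R11; RSP is untouched."""
    regs.rcx = regs.rip
    regs.r11 = regs.rflags
    regs.rflags &= ~msrs.fmask
    regs.rip = msrs.lstar


def do_sysenter(regs, msrs):
    """SYSENTER: RIP and RSP both come from MSRs; nothing is saved for the return."""
    regs.rip = msrs.sysenter_eip
    regs.rsp = msrs.sysenter_esp
```

After `do_syscall` the model still holds the userland `rsp`, which is why the kernel entry code has to install a kernel stack itself; after `do_sysenter` the stack pointer is already the kernel one, but the return address and user stack pointer are lost unless userland kept them somewhere by convention.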

Besides userland threads, there are many kernel-only threads (you can see them in ps output as names inside square brackets). Each such thread has its own stack. They 1) implement periodic routines, started on some event or timeout, 2) perform transient actions or 3) handle actions requested from real interrupt handlers. (For case 3 they were named "bh" in old kernels, and "ksoftirqd" in newer ones.) A large part of these threads is bound to a single logical CPU. Since they have no user land, they have no user land stack.
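Apart from looking for square brackets in ps output, a kernel thread can be recognized from the flags field of /proc/<pid>/stat. The Python sketch below tests the PF_KTHREAD bit as defined in include/linux/sched.h; the bit values depend on the kernel version, so treat this as an assumption to verify. The function name `is_kernel_thread` is mine, and the sample lines in the usage note are shortened, invented examples rather than real output.

```python
PF_KTHREAD = 0x00200000  # "I am a kernel thread" in include/linux/sched.h


def is_kernel_thread(stat_line):
    """Check the PF_KTHREAD bit in one line read from /proc/<pid>/stat."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or ')', so split after the last ')'.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is the state (field 3), so the flags (field 9) are rest[6].
    flags = int(rest[6])
    return bool(flags & PF_KTHREAD)
```

For an invented line such as `2 (kthreadd) S 0 0 0 0 -1 2129984 0 0` the function returns True, while for `1234 (bash) S 1200 1234 1234 34816 1234 4194304 0 0` it returns False.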

External interrupt handlers are limited in Linux, AFAIK, to no more than one simultaneously executing handler per logical CPU; while such a handler runs, no other IO interrupts are allowed. (NMIs are a terrible exception with bug-prone handling.) They arrive through an interrupt gate (not a task-switching one) and get their own stack for each logical CPU, for the same reasons as described above.

As already noted, most of this is x86-specific. A mandatory stack pointer replacement done by the processor on every entry to the kernel is rare to see on other architectures. For example, ARM32 has banked stack pointers per processor mode rather than per task, and, as far as I know, if an external interrupt arrives while in kernel land, Linux there continues on the current thread’s kernel stack.

Some details in this answer may be obsolete, given how fast the kernel develops. Consider it only a general guide and verify it against the concrete version you will explore. For more on x86 interrupt handling and task switching, please refer to "Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1" (freely available on the Intel website).

Answered By: Netch