How can I find the implementations of Linux kernel system calls?
I am trying to understand how a function, say mkdir, works by looking at the kernel source. This is an attempt to understand the kernel internals and navigate between various functions. I know mkdir is defined in sys/stat.h. I found the prototype:
/* Create a new directory named PATH, with permission bits MODE. */
extern int mkdir (__const char *__path, __mode_t __mode)
__THROW __nonnull ((1));
Now I need to see in which C file this function is implemented. From the source directory, I tried
ack "int mkdir"
which displayed
security/inode.c
103:static int mkdir(struct inode *dir, struct dentry *dentry, int mode)
tools/perf/util/util.c
4:int mkdir_p(char *path, mode_t mode)
tools/perf/util/util.h
259:int mkdir_p(char *path, mode_t mode);
But none of them matches the definition in sys/stat.h.
Questions
- Which file has the mkdir implementation?
- With a function definition like the above, how can I find out which file has the implementation? Is there any pattern which the kernel follows in defining and implementing methods?
NOTE: I am using kernel 2.6.36-rc1.
None of the implementations you found matches the prototype in sys/stat.h. Maybe searching for an include statement with this header file would be more successful?
System calls aren’t handled like regular function calls. It takes special code to make the transition from user space to kernel space, basically a bit of inline assembly code injected into your program at the call site. The kernel side code that “catches” the system call is also low-level stuff you probably don’t need to understand deeply, at least at first.
In include/linux/syscalls.h under your kernel source directory, you find this:
asmlinkage long sys_mkdir(const char __user *pathname, int mode);
Then in /usr/include/asm*/unistd.h, you find this:
#define __NR_mkdir 83
__SYSCALL(__NR_mkdir, sys_mkdir)
This code is saying mkdir(2) is system call #83 (the number is architecture-specific; 83 is the x86-64 value). That is to say, system calls are called by number, not by address as with a normal function call within your own program or to a function in a library linked to your program. The inline assembly glue code I mentioned above uses this to make the transition from user to kernel space, taking your parameters along with it.
Another bit of evidence that things are a little weird here is that there isn’t always a strict parameter list for system calls: open(2), for instance, can take either 2 or 3 parameters. C has no function overloading, so the C library declares open() as a variadic function and reads the third argument (the mode) only when the flags call for it, as with O_CREAT. On the kernel side, sys_open() always takes three parameters.
To answer your first question, there is no single file where mkdir() exists. Linux supports many different file systems and each one has its own implementation of the “mkdir” operation. The abstraction layer that lets the kernel hide all that behind a single system call is called the VFS. So, you probably want to start digging in fs/namei.c, with vfs_mkdir(). The actual implementations of the low-level file system modifying code are elsewhere. For instance, the ext4 implementation is called ext4_mkdir(), defined in fs/ext4/namei.c.
As for your second question, yes there are patterns to all this, but not a single rule. What you actually need is a fairly broad understanding of how the kernel works in order to figure out where you should look for any particular system call. Not all system calls involve the VFS, so their kernel-side call chains don’t all start in fs/namei.c. mmap(2), for instance, starts in mm/mmap.c, because it’s part of the memory management (“mm”) subsystem of the kernel.
I recommend you get a copy of “Understanding the Linux Kernel” by Bovet and Cesati.
This probably doesn’t answer your question directly, but I’ve found strace to be really useful for seeing, in action, the underlying system calls that even the simplest shell commands make. For example:
strace -o trace.txt mkdir mynewdir
The system calls for the command mkdir mynewdir will be dumped to trace.txt for your viewing pleasure.
System calls are usually wrapped in the SYSCALL_DEFINEx() macro, which is why a simple grep doesn’t find them:
fs/namei.c:SYSCALL_DEFINE2(mkdir, const char __user *, pathname, int, mode)
The final function name after the macro is expanded ends up being sys_mkdir. The SYSCALL_DEFINEx() macro adds boilerplate things like tracing code that each syscall definition needs to have.
Note: the .h file doesn’t define the function. It’s declared in that .h file and defined (implemented) elsewhere. This allows the compiler to include information about the function’s signature (prototype) to allow type checking of arguments and match the return types to any calling contexts in your code.
In general .h (header) files in C are used to declare functions and define macros.
mkdir in particular is a system call. There may be a GNU libc wrapper around that system call (almost certainly is, in fact). The true kernel implementation of mkdir can be found by searching the kernel sources and the system calls in particular.
Note that there will also be an implementation of some sort of directory creation code for each filesystem. The VFS (virtual filesystem) layer provides a common API which the system call layer can call into. Every filesystem must register functions for the VFS layer to call into. This allows different filesystems to implement their own semantics for how directories are structured (for example if they are stored using some sort of hashing scheme to make searching for specific entries more efficient). I mention this because you’re likely to trip over these filesystem specific directory creation functions if you’re searching the Linux kernel source tree.
A good place to read the Linux kernel source is the Linux cross-reference (LXR)¹. Searches return typed matches (function prototypes, variable declarations, etc.) in addition to free text search results, so it’s handier than a mere grep (and faster too).
LXR doesn’t expand preprocessor definitions. System calls have their name mangled by the preprocessor all over the place. However, most (all?) system calls are defined with one of the SYSCALL_DEFINEx family of macros. Since mkdir takes two arguments, a search for SYSCALL_DEFINE2(mkdir leads to the definition of the mkdir syscall:
SYSCALL_DEFINE2(mkdir, const char __user *, pathname, int, mode)
{
return sys_mkdirat(AT_FDCWD, pathname, mode);
}
OK, sys_mkdirat means it’s the mkdirat syscall, so clicking on it only leads you to the declaration in include/linux/syscalls.h, but the definition is just above.
The main job of mkdirat is to call vfs_mkdir (VFS is the generic filesystem layer). Clicking on that shows two search results: the declaration in include/linux/fs.h, and the definition a few lines above. The main job of vfs_mkdir is to call the filesystem-specific implementation: dir->i_op->mkdir. To find how this is implemented, you need to turn to the implementation of the individual filesystem, and there’s no hard-and-fast rule; it could even be a module outside the kernel tree.
¹ LXR is an indexing program. There are several websites that provide an interface to LXR, with slightly different sets of known versions and slightly different web interfaces. They tend to come and go, so if the one you’re used to isn’t available, do a web search for “linux cross-reference” to find another.
Here are a couple really great blog posts describing various techniques for hunting down low-level kernel source code.