Why is the Linux kernel 15+ million lines of code?

What are the contents of this monolithic code base?

I understand processor architecture support, security, and virtualization, but I can’t imagine that being more than 600,000 lines or so.

What are the historic and current reasons drivers are included in the kernel code base?

Do those 15+ million lines include every single driver for every piece of hardware ever? If so, that raises the question: why are drivers embedded in the kernel rather than shipped as separate packages that are auto-detected and installed from hardware IDs?

Is the size of the code base an issue for storage-constrained or memory-constrained devices?

It seems it would bloat the kernel size for space-constrained ARM devices if all of that were embedded. Are a lot of lines culled by the preprocessor? Call me crazy, but I can’t imagine a machine needing that much logic to run what I understand to be the role of a kernel.

Is there evidence that the size will be an issue in 50+ years due to its seemingly ever-growing nature?

Including drivers means it will grow as hardware is made.

EDIT: For those thinking this is the nature of kernels: after some research I realized it isn’t always. A kernel is not required to be this large; Carnegie Mellon’s microkernel Mach was listed as an example, ‘usually under 10,000 lines of code’.

Asked By: Jonathan


According to cloc run against 3.13, Linux is about 12 million lines of code.

  • 7 million LOC in drivers/
  • 2 million LOC in arch/
  • only 139 thousand LOC in kernel/

lsmod | wc on my Debian laptop shows 158 modules loaded at runtime, so dynamically loading modules is a well-used way of supporting hardware.

The robust configuration system (e.g. make menuconfig) is used to select which code to compile (and, more to your point, which code not to compile). Embedded systems define their own .config file with just the hardware support they care about, whether that support is built into the kernel or provided as loadable modules.
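To make the selection concrete, here is a minimal sketch that counts how many options in a .config are built in (=y), built as modules (=m), or switched off. The function name config_summary and the sample file are made up for illustration; the "# CONFIG_FOO is not set" line format is what the kernel's configuration tools write for disabled options.

```shell
#!/bin/sh
# config_summary FILE: count built-in, modular and disabled options in a .config.
# (Hypothetical helper, not part of the kernel tree.)
config_summary() {
    awk '
        /^CONFIG_[A-Za-z0-9_]+=y$/            { ny++ }
        /^CONFIG_[A-Za-z0-9_]+=m$/            { nm++ }
        /^# CONFIG_[A-Za-z0-9_]+ is not set$/ { nn++ }
        END { printf "y=%d m=%d unset=%d\n", ny, nm, nn }
    ' "$1"
}

# Small made-up sample; a real distribution config has thousands of lines.
sample=$(mktemp)
cat > "$sample" <<'EOF'
CONFIG_64BIT=y
CONFIG_PRINTK=y
CONFIG_E1000E=m
CONFIG_HZ=250
# CONFIG_SOUND is not set
# CONFIG_WLAN is not set
EOF

config_summary "$sample"   # prints: y=2 m=1 unset=2
rm -f "$sample"
```

On a real system you could point it at /boot/config-$(uname -r), assuming your distribution installs the config there.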

Answered By: user4443

For anyone curious, here’s the linecount breakdown for the GitHub mirror:

=============================================
    Item           Lines             %
=============================================
  ./usr                 845        0.0042
  ./init              5,739        0.0283
  ./samples           8,758        0.0432
  ./ipc               8,926        0.0440
  ./virt             10,701        0.0527
  ./block            37,845        0.1865
  ./security         74,844        0.3688
  ./crypto           90,327        0.4451
  ./scripts          91,474        0.4507
  ./lib             109,466        0.5394
  ./mm              110,035        0.5422
  ./firmware        129,084        0.6361
  ./tools           232,123        1.1438
  ./kernel          246,369        1.2140
  ./Documentation   569,944        2.8085
  ./include         715,349        3.5250
  ./sound           886,892        4.3703
  ./net             899,167        4.4307
  ./fs            1,179,220        5.8107
  ./arch          3,398,176       16.7449
  ./drivers      11,488,536       56.6110
=============================================

drivers/ accounts for more than half of the line count.

Answered By: user3276552

Linux tinyconfig compiled sources line count
tinyconfig bubble graph svg (fiddle)

shell script to create the json from the kernel build, use with http://bl.ocks.org/mbostock/4063269


Edit: it turned out that unifdef has some limitations (-I is ignored and -include is unsupported; the latter is used to include the generated configuration header). At this point, using cat instead doesn’t change much:

274692 total # (was 274686)

script and procedure updated.


Besides drivers, arch, etc., there’s a lot of conditional code that is compiled or not depending on the chosen configuration: code that is not necessarily in dynamically loaded modules but is built into the core.

So I downloaded the linux-4.1.6 sources and picked tinyconfig. It doesn’t enable modules, and I honestly don’t know what it enables or what a user can do with it at runtime. Anyway, configure the kernel:

# tinyconfig      - Configure the tiniest possible kernel
make tinyconfig

Then build the kernel:

time make V=1 # (should be fast)
# 1049168 ./vmlinux (I'm using x86-32; on other architectures the size may differ)

The kernel build process leaves hidden files called *.cmd containing the command lines used, including those used to build the .o files. To process those files and extract each target and its dependencies, copy script.sh below and use it with find:

find -name "*.cmd" -exec sh script.sh "{}" \;

This creates, for each target .o, a copy of its dependencies named .o.c

.c code

find -name "*.o.c" | grep -v "/scripts/" | xargs wc -l | sort -n
...
   8285 ./kernel/sched/fair.o.c
   8381 ./kernel/sched/core.o.c
   9083 ./kernel/events/core.o.c
 274692 total

.h headers (sanitized)

make headers_install INSTALL_HDR_PATH=/tmp/test-hdr
find /tmp/test-hdr/ -name "*.h" | xargs wc -l
...
  1401 /tmp/test-hdr/include/linux/ethtool.h
  2195 /tmp/test-hdr/include/linux/videodev2.h
  4588 /tmp/test-hdr/include/linux/nl80211.h
112445 total

Answered By: Alex

The answers so far seem to be “yes there is lots of code” and nobody is tackling the question with the most logical answer: 15M+? SO WHAT? What does 15M lines of source code have to do with the price of fish? What makes this so unimaginable?

Linux clearly does lots. Lots more than anything else… But some of your points show you don’t respect what’s happening when it’s built and used.

  • Not everything is compiled. The Kernel build system allows you to quickly define configurations which select sets of source code. Some is experimental, some is old, some just isn’t needed for every system. Look at /boot/config-$(uname -r) (on Ubuntu) in make menuconfig and you’ll see just how much is excluded.

    And that’s a variable-target desktop distribution. The config for an embedded system would only pull in the things it needs.

  • Not everything is built-in. In my configuration, most of the Kernel features are built as modules:

    grep -c '=m' /boot/config-`uname -r`  # 4078
    grep -c '=y' /boot/config-`uname -r`  # 1944
    

    To be clear, these could all be built-in… Just as they could be printed out and made into a giant paper sandwich. It just wouldn’t make sense unless you were doing a custom build for a discrete hardware job (in which case, you’d have limited the number of these items down already).

  • Modules are dynamically loaded. Even when a system has thousands of modules available to it, the system will allow you to load just the things you need. Compare the outputs of:

    find /lib/modules/$(uname -r)/ -iname '*.ko' | wc -l  # 4291
    lsmod | wc -l                                         # 99
    

    Almost nothing is loaded.

  • Microkernels aren’t the same thing. Just 10 seconds looking at the lead image on the Wikipedia page you linked would show that they are designed in a completely different way.

    Linux drivers are internalised (mostly as dynamically loaded modules), not userspace, and the filesystems are similarly internal. Why is that worse than using external drivers? Why is micro better for general purpose computing?
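As a rough illustration of "almost nothing is loaded", the figures quoted above can be turned into a percentage. This is only a sketch: percent_loaded is a made-up helper, and it uses 98 rather than 99 because lsmod prints one header line before the module list.

```shell
#!/bin/sh
# percent_loaded LOADED AVAILABLE: print the share of available modules that is loaded.
# (Hypothetical helper for illustration.)
percent_loaded() {
    awk -v l="$1" -v a="$2" 'BEGIN {
        if (a > 0) printf "%.1f%%\n", 100 * l / a
        else       print "n/a"
    }'
}

# Numbers from the commands above: 99 lsmod lines minus the header, 4291 .ko files.
percent_loaded 98 4291   # prints: 2.3%

# On a live system (assumption: modules may be compressed, hence the '*.ko*' pattern):
#   loaded=$(lsmod | tail -n +2 | wc -l)
#   available=$(find "/lib/modules/$(uname -r)" -name '*.ko*' | wc -l)
#   percent_loaded "$loaded" "$available"
```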


The comments again highlight you’re not getting it. If you want to deploy Linux on discrete hardware (e.g. aerospace, a TiVo, a tablet, etc.) you configure it to build only the drivers you need. You can do the same on your desktop with make localmodconfig. You end up with a tiny for-purpose Kernel build with zero flexibility.

For distributions like Ubuntu, a single 40MB Kernel package is acceptable. No, scrub that, it’s actually preferable to the massive archiving and download scenario that keeping 4000+ floating modules as packages would be. It uses less disk space, is easier to package at compile time, is easier to store, and is better for their users (who have a system that just works).

The future doesn’t seem to be an issue either. The rate of CPU speed, disk density/pricing and bandwidth improvements seems much faster than the growth of the Kernel. A 200MB Kernel package in 10 years wouldn’t be the end of the world.

It’s also not a one way street. Code does get kicked out if it isn’t maintained.

Answered By: Oli

The tradeoffs of monolithic kernels were debated between Tanenbaum and Torvalds in public from the very beginning. If you don’t need to cross into userspace for everything, then the interface to the kernel can be simpler. If the kernel is monolithic, then it can be more optimized (and more messy!) internally.

We have had modules as a compromise for quite a while. And it is continuing with things like DPDK (moving more networking functionality out of the kernel). The more cores get added, the more important it is to avoid locking; so more things will move into userspace and the kernel will shrink.

Note that monolithic kernels are not the only solution. On some architectures, the kernel/userspace boundary isn’t more expensive than any other function call, making microkernels attractive.

Answered By: Rob

Drivers are maintained in-kernel so when a kernel change requires a global search-and-replace (or search-and-hand-modify) for all users of a function, it gets done by the person making the change. Having your driver updated by people making API changes is a very nice advantage, instead of having to do it yourself when it doesn’t compile on a more recent kernel.

The alternative (which is what happens for drivers maintained out-of-tree) is that the patch has to be re-synced by its maintainers to keep up with any changes.

A quick search turned up a debate over in-tree vs. out-of-tree driver development.

The way Linux is maintained is mostly by keeping everything in the mainline repo. Building of small stripped-down kernels is supported by config options to control #ifdefs. So you can absolutely build tiny stripped-down kernels which compile only a tiny part of the code in the whole repo.
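The link between config options and #ifdefs can be sketched as follows. During a build, the kernel generates a header (include/generated/autoconf.h) in which an option set to =y becomes a CONFIG_FOO macro and an option set to =m becomes CONFIG_FOO_MODULE; options that are not set define nothing, so code guarded by #ifdef CONFIG_FOO drops out. The function below is a simplified stand-in written for illustration, not the kernel's actual generator, and it ignores numeric and string options.

```shell
#!/bin/sh
# config_to_defines: read .config-style lines on stdin and print the
# preprocessor defines that =y and =m options would roughly correspond to.
# (Illustrative sketch; numeric/string options are skipped.)
config_to_defines() {
    sed -n \
        -e 's/^\(CONFIG_[A-Za-z0-9_]*\)=y$/#define \1 1/p' \
        -e 's/^\(CONFIG_[A-Za-z0-9_]*\)=m$/#define \1_MODULE 1/p'
}

printf '%s\n' \
    'CONFIG_PRINTK=y' \
    'CONFIG_E1000E=m' \
    '# CONFIG_SOUND is not set' | config_to_defines
# prints:
#   #define CONFIG_PRINTK 1
#   #define CONFIG_E1000E_MODULE 1
```

Nothing is emitted for CONFIG_SOUND, which is why any source guarded by that symbol would be left out of the build entirely.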

The extensive use of Linux in embedded systems has led to better support for leaving stuff out than Linux had years earlier when the kernel source tree was smaller. A super-minimal 4.0 kernel is probably smaller than a super-minimal 2.4.0 kernel.

Answered By: Peter Cordes