What if 'kill -9' does not work?

I have a process I can’t kill with kill -9 <pid>. What’s the problem in such a case, especially since I am the owner of that process? I thought nothing could evade that kill option.

Asked By: tshepang


It sounds like you might have a zombie process. This is harmless: the only resource a zombie process consumes is an entry in the process table. It will go away when the parent process dies or reacts to the death of its child.

You can see if the process is a zombie by using top or the following command:

ps aux | awk '$8=="Z" {print $2}'
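Since a zombie only goes away once its parent reaps it, it can help to print the parent PID next to each zombie. A minimal sketch, assuming a ps that accepts -o stat (procps on Linux and the BSD ps do); the function names filter_zombies and list_zombies are made up for this example:

```shell
# filter_zombies: read "PID PPID STAT COMMAND" lines on stdin and report
# only the entries whose state starts with Z (zombie).
filter_zombies() {
    awk '$3 ~ /^Z/ { print "zombie", $1, "parent", $2, "command", $4 }'
}

# list_zombies: feed the live process table through the filter.
# The empty "=" headers suppress the ps header line.
list_zombies() {
    ps -e -o pid= -o ppid= -o stat= -o comm= | filter_zombies
}
```

Run list_zombies to see the live table. Only the first word of the command name is printed, which is enough to identify the process in most cases.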
Answered By: Josh

Sometimes a process exists and cannot be killed because it is:

  • a zombie, i.e. a process whose parent has not read its exit status. Such a process does not consume any resources except a process table entry. In top it is marked Z.
  • in erroneous uninterruptible sleep. This should not happen, but with a combination of buggy kernel code and/or buggy hardware it sometimes does. The only remedy is to reboot or wait. In top it is marked D.
Answered By: Maciej Piechotka

kill -9 (SIGKILL) always works, provided you have the permission to kill the process. Basically either the process must be started by you and not be setuid or setgid, or you must be root. There is one exception: even root cannot send a fatal signal to PID 1 (the init process).

However kill -9 is not guaranteed to work immediately. All signals, including SIGKILL, are delivered asynchronously: the kernel may take its time to deliver them. Usually, delivering a signal takes at most a few microseconds, just the time it takes for the target to get a time slice. However, if the target has blocked the signal, the signal will be queued until the target unblocks it.

Normally, processes cannot block SIGKILL. But kernel code can, and processes execute kernel code when they call system calls. Kernel code blocks all signals when interrupting the system call would result in a badly formed data structure somewhere in the kernel, or more generally in some kernel invariant being violated. So if (due to a bug or misdesign) a system call blocks indefinitely, there may effectively be no way to kill the process. (But the process will be killed if it ever completes the system call.)

A process blocked in a system call is in uninterruptible sleep. The ps or top command will (on most unices) show it in state D (originally for “disk”, I think).
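To list the processes currently in state D together with the kernel function they are sleeping in, something like the following may help. It is only a sketch: it assumes a ps that supports the stat and wchan output columns, the function names are invented for this example, and on some systems wchan shows just "-".

```shell
# filter_uninterruptible: read "PID STAT WCHAN COMMAND" lines on stdin and
# keep those whose state starts with D (uninterruptible sleep).
filter_uninterruptible() {
    awk '$2 ~ /^D/'
}

# list_uninterruptible: apply the filter to the live process table.
list_uninterruptible() {
    ps -e -o pid= -o stat= -o wchan= -o comm= | filter_uninterruptible
}
```

A process that shows up here briefly is normal (ordinary disk I/O does this); one that stays in the list for minutes is the kind described above.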

A classical case of long uninterruptible sleep is processes accessing files over NFS when the server is not responding; modern implementations tend not to impose uninterruptible sleep (e.g. under Linux, since kernel 2.6.25, SIGKILL does interrupt processes blocked on an NFS access).

If a process remains in uninterruptible sleep for a long time, you can get information about what it’s doing by attaching a debugger to it, by running a diagnostic tool such as strace or dtrace (or similar tools, depending on your unix flavor), or with other diagnostic mechanisms such as /proc/PID/syscall under Linux. See Can't kill wget process with `kill -9` for more discussion of how to investigate a process in uninterruptible sleep.
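As a rough illustration of the /proc route on Linux, the sketch below prints the state letter and the system call number for one PID. The function name describe_blocked and its optional second argument (an alternative root directory, there only so the function can be tried on copied files) are assumptions of this example, not an existing tool. Reading /proc/PID/syscall may require the same privileges as ptrace.

```shell
# describe_blocked PID [PROC_ROOT]
# Print the process state and the system call it is blocked in, if any.
describe_blocked() {
    pid=$1
    proc_root=${2:-/proc}
    dir="$proc_root/$pid"

    # The "State:" line of status starts with a one-letter state (R, S, D, Z, T, ...).
    state=$(awk '/^State:/ { print $2 }' "$dir/status" 2>/dev/null)
    echo "state: ${state:-unknown}"

    # The first field of syscall is the system call number, "running" if the
    # process is not blocked, or -1 if it is blocked outside a system call.
    if [ -r "$dir/syscall" ] && read -r nr _ 2>/dev/null < "$dir/syscall"; then
        echo "syscall: $nr"
    else
        echo "syscall: unavailable"
    fi
}
```

Call it as describe_blocked <PID>. The number can be looked up in the system call table for your architecture.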

You may sometimes see entries marked Z (or H under Linux, I don’t know what the distinction is) in the ps or top output. These are technically not processes, they are zombie processes, which are nothing more than an entry in the process table, kept around so that the parent process can be notified of the death of its child. They will go away when the parent process pays attention (or dies).

Kill actually means send a signal. There are multiple signals you can send; kill -9 is a special one.

When a signal is sent, the application deals with it; if it does not, the kernel does. So you can trap a signal in your application.

But I said kill -9 was special. It is special in that the application doesn’t get it: it goes straight to the kernel, which then truly kills the application at the first possible opportunity. In other words, it kills it dead.

kill -15 sends the signal SIGTERM, which stands for “signal terminate”; in other words, it tells the application to quit. This is the friendly way to tell an application it is time to shut down. But if the application is not responding, kill -9 will kill it.
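The difference between the two signals can be tried out with a throwaway child process. This is only a small demonstration, assuming a POSIX sh; the one-second sleeps are arbitrary pauses to let the child start and the signal arrive.

```shell
# Start a child shell that ignores SIGTERM (signal 15).
sh -c 'trap "" TERM; while :; do sleep 1; done' &
child=$!
sleep 1                      # let the child install its trap

# SIGTERM reaches the process, which has chosen to ignore it.
kill -15 "$child"
sleep 1
if kill -0 "$child" 2>/dev/null; then
    term_result=survived
else
    term_result=died
fi

# SIGKILL cannot be caught, blocked, or ignored by the process.
kill -9 "$child"
kill_status=0
wait "$child" 2>/dev/null || kill_status=$?

echo "after SIGTERM: $term_result"
# bash and dash report 128 + signal number for a child killed by a signal
echo "wait status after SIGKILL: $kill_status"
```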

If kill -9 doesn’t work, it probably means your kernel is out of whack and a reboot is in order. I can’t recall that ever happening.

Answered By: DeveloperChris

If @Maciej’s and @Gilles’s answers don’t solve your problem, and you don’t recognize the process (and asking what it is with your distro doesn’t turn up answers), check for rootkits and any other signs that you’ve been owned. A rootkit is more than capable of preventing you from killing the process. In fact, many are capable of preventing you from seeing them. But if they forget to modify one small program, they might be spotted (e.g. they modified top, but not htop). Most likely this is not the case, but better safe than sorry.

Answered By: xenoterracide

Check your /var/log/kern.log and /var/log/dmesg (or equivalents) for any clues. In my experience this has happened to me only when an NFS mount’s network connection has suddenly dropped or a device driver crashed. Could happen if a hard drive crashes as well, I believe.

You can use lsof to see what device files the process has open.

Answered By: LawrenceC

The init process is immune to SIGKILL.

This is also true for kernel threads, i.e. “processes” with a PPID equal to 0.

Answered By: jlliagre

There are cases where even if you send a kill -9 to a process, that pid will stop, but the process restarts automatically (for instance, if you try it with gnome-panel, it will restart): could that be the case here?

Answered By: dag729

I made a little script that helped me a lot; take a look!

You can use it to kill any process with a given name in its path (pay attention to this!!)
Or you can kill any process of a given user using the “-u username” parameter.

#!/bin/bash
# killbyname: SIGKILL every process of a user (-u username), or every
# process whose command line contains the given string.

if [ "$1" = "-u" ] ; then
        # Ask ps for that user's processes directly instead of grepping for the UID
        processes=$(ps -u "$2" -o pid= -o args= | grep -Ev "ps -u|killbyname|grep" | awk '{ print $1 }')
        echo "############# Killing all processes of user: $2 ############################"
else
        echo "############# Killing processes by name: $1 ############################"
        processes=$(ps aux | grep -- "$1" | grep -Ev "killbyname|grep" | awk '{ print $2 }')
fi


for process in $processes ; do
        # "command" stores the entire command line of the process that will be killed;
        # it may be useful to show it, but in some cases it is counter-productive
        #command=$(ps -p "$process" -o args=)
        echo "Killing process: $process"
        echo ""
        kill -9 "$process"
done
Answered By: user36035

As others have mentioned, a process in uninterruptible sleep cannot be killed immediately (or, in some cases, at all). It’s worth noting that another process state, TASK_KILLABLE, was added to solve this problem in certain scenarios, particularly the common case where the process is waiting on NFS. See http://lwn.net/Articles/288056/

Unfortunately I don’t believe this is used anywhere in the kernel but NFS.

Answered By: user36054

First, check if it’s a zombie process (which is very possible):

ps -Al

You will see something like:

0 Z  1000 24589     1  0  80   0 -     0 exit   ?        00:00:00 soffice.bin <defunct>

(Note the “Z” on the left)

If the 5th column (the PPID) is not 1, then its parent is a process other than init.
Try killing that parent process ID.

If its PPID = 1, DON’T KILL IT!! Think about which other devices or processes may be related to it.

For example, if you were using a mounted device or samba, try to unmount it. That may release the Zombie process.

NOTE: If ps -Al (or top) shows a “D” instead of “Z”, it could be related to remote mount (like NFS). In my experience, rebooting is the only way to go there, but you may check the other answers which cover that case in more detail.

Answered By: lepe

from here originally:

check if strace shows anything

strace -p <PID>

try attaching to the process with gdb

gdb <path to binary> <PID>

if the process was interacting with a device that you can unmount, remove the kernel module for, or physically disconnect/unplug… then try that.

Answered By: nmz787

I had kind of this issue. This was a program that I had launched with strace and interrupted with Ctrl+C. It ended up in a T (traced or stopped) state. I don’t know how it happened exactly, but it was not killable with SIGKILL.

Long story short, I succeeded in killing it with gdb:

gdb -p <PID>
> kill
Kill the program being debugged? (y or n) y
> quit
Answered By: Christophe Drevet

Based on a clue from Gilles’ answer: I had a process marked “Z” in top (<defunct> in ps) that was using system resources; it even had a port open that was LISTEN’ing, and you could connect to that port. This was after executing a kill -9 on it. Its parent was “1” (i.e. init), so theoretically it should just have been reaped and disappeared. But it wasn’t; it was sticking around, though not running, and “not dying”.

So in my case it was zombie but still consuming resources…FWIW.

And it was not killable by any number of kill -9‘s

And its parent was init but it wasn’t being reaped (cleaned up). I.e. init had a zombie child.

And a reboot was not necessary to fix the problem, though a reboot would have “worked around” it and made the shutdown faster. It just would not have been graceful, and a graceful shutdown was still possible.

And it was a LISTEN port owned by a zombie process (and a few other ports too, like CLOSE_WAIT status connected localhost to localhost). And it still even accepted connections. Even as a zombie. I guess it hadn’t gotten around to cleaning up the ports yet, so incoming connections were still added to the TCP listening port’s backlog, though they had no chance of being accepted.

Many of the above are stated as “impossible” on various places in the interwebs.

Turns out that I had an internal thread within it that was executing a “system call” (ioctl in this instance) that was taking a few hours to return (this was expected behavior). Apparently the system cannot kill the process “all the way” until it returns from the ioctl call, guess it enters kernel land. After a few hours it returned, things cleared up and the sockets were all automatically closed, etc. as expected. That’s some languishing time on death row! The kernel was patiently waiting to kill it.
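On Linux, one way to spot this kind of situation is to look at the state of each thread rather than only the process as a whole, since the main thread can already be a zombie while another thread is still stuck in the kernel. A sketch, assuming the Linux /proc/PID/task layout; the function name list_thread_states and its optional root-directory argument are made up for this example:

```shell
# list_thread_states PID [PROC_ROOT]
# Print "TID STATE" for every thread of the given process.
list_thread_states() {
    pid=$1
    proc_root=${2:-/proc}
    for task in "$proc_root/$pid"/task/*; do
        [ -r "$task/status" ] || continue
        tid=${task##*/}
        # The "State:" line starts with a one-letter state such as R, S, D or Z.
        state=$(awk '/^State:/ { print $2 }' "$task/status")
        echo "$tid $state"
    done
}
```

If any thread shows D while the process is listed as a zombie, the kill is most likely waiting for that thread’s system call to return.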

So to answer the OP, sometimes you have to wait. A long time. Then the kill will finally take.

Also check dmesg to see if there was a kernel panic (i.e. kernel bug).

Answered By: rogerdpack

I run into this more frequently with badly behaved FUSE filesystems. Those processes cannot be killed, and suspend will not work anymore because those processes also cannot be frozen: Freezing of tasks failed after 20 seconds (2 tasks refusing to freeze, wq_busy=0):.

Sometimes the process hangs because of a faulty file system. In that case, try forced umount:

sudo umount -f /path-to-problematic-mount-like-sshfs

I already tried to lazily unmount it and had lost my mount point.
The solution that worked for me was to use the FUSE control file system to abort the problematic connections. Quote from the link:

waiting
The number of requests which are waiting to be transferred to userspace or being processed by the filesystem daemon. If there is no filesystem activity and ‘waiting’ is non-zero, then the filesystem is hung or deadlocked.

abort
Writing anything into this file will abort the filesystem connection. This means that all waiting requests will be aborted an error returned for all aborted and new requests.

Finding the correct connection is cumbersome. In my case it was easy because all other FUSE file systems had no activity and one of the FUSE connections had 2 waiting requests no matter how often I polled. I aborted that connection and after that the process exited as desired.

for fuseConnection in /sys/fs/fuse/connections/*/; do
    waiting="$( cat -- "$fuseConnection/waiting" 2> /dev/null )"
    if [ -n "$waiting" ] && [ "$waiting" != 0 ]; then 
       echo "$fuseConnection has waiting requests."
    fi
done

If you are sure that you got the correct connection, then you can abort it by writing anything to the special abort file. Note that this might interrupt file transfers and such when done on the wrong connection. I’m not yet aware of a better method to find out which connection belongs to which mount.

echo 1 > /sys/fs/fuse/connections/1234567890/abort
Answered By: mxmlnkn