
waitpid() function returns ERROR (-1), why?

I’m writing a Linux shell-like program in C.

Among other things, I’m implementing two built-in commands: jobs and history. In jobs, I print the list of commands currently running in the background. In history, I print every command run so far, specifying for each one whether it’s RUNNING or DONE.

To implement the two, my idea was to have a list of commands, mapping the command name to their PID. Once the jobs/history command is called, I run through them, check which ones are running or done, and print accordingly.

I read online that the function: waitpid(pid, &status, WNOHANG), can detect from “PID” whether a process is still running or done, without stopping the process. It works well, except for this:

While a program is alive, the call reports that correctly. Once the program is done, the first call reports DONE, but every later call with the same PID returns -1 (ERROR).

For example, it would look like this: (the & symbolizes background command)

$ sleep 3 &
$ jobs
sleep ALIVE 
$ jobs  (within the 3 seconds)
sleep ALIVE
$ jobs (after 3 seconds)
sleep DONE
$ jobs 
sleep ERROR
$ jobs 
sleep ERROR
....

Also, this is not influenced by other commands I run before or after; the behavior described above seems to be independent of them.

I read online about various reasons why waitpid might return -1, but I wasn’t able to identify the reason in my case. I also tried to find out how to tell which kind of waitpid error it is, but again without success.

My questions are:

  1. Why do you think this behavior is happening?
  2. Do you have a solution? (Ideally, it would keep returning DONE.)
  3. If you have a better idea of how to implement the jobs/history commands, it is welcome.

One solution for this problem is that as soon as I get “DONE”, I mark the command as DONE and don’t call waitpid on it again before printing it. This would solve the issue, but I would remain in the dark as to WHY this is happening.

Answer

You should familiarize yourself with how child processes are handled in Unix environments. In particular, read about zombie processes.

When a process dies, it enters a ‘zombie’ state, so that its PID is still reserved and uniquely identifies the now-dead process. A successful wait on a zombie process frees its process-table entry and its PID. Consequently, subsequent calls to wait on the same PID fail because there is no longer a child process with that PID: waitpid() returns -1 and sets errno to ECHILD. (If a new process is later allocated the same PID, waiting on it would be a logical error.)
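To see this for yourself, and to find out which error waitpid() is reporting, you can inspect errno after the failing call. The following is only a minimal sketch: the helper name second_wait_errno is made up for illustration, and it assumes the program has not set SIGCHLD to be ignored.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo helper (not part of any library).
 * Forks a child that exits at once, reaps it, then waits on the same PID
 * a second time. Returns the errno of the second call, 0 if the second
 * call unexpectedly did not fail, or -1 if the setup itself failed. */
int second_wait_errno(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(0);                      /* child: terminate immediately */

    int status;
    /* First wait (blocking): reaps the zombie and releases its PID. */
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    /* Second wait: there is no such child any more, so this fails. */
    errno = 0;
    pid_t r = waitpid(pid, &status, WNOHANG);
    if (r == -1) {
        int saved = errno;             /* save before calling anything else */
        printf("second waitpid failed: %s\n", strerror(saved));
        return saved;
    }
    return 0;
}
```

Calling second_wait_errno() from main should give back ECHILD, which is the "no child processes" error behind the repeated -1 in the question.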

You should restructure your program so that if a wait is successful and reports that a process is DONE, you record that information in your own data structure and never call wait on that PID again.
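One possible shape for that bookkeeping is sketched below. The names job, job_state and job_poll are hypothetical, and treating ECHILD as "finished" is an assumption you may want to handle differently (for example by logging it).

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical job record: one entry per background command. */
typedef enum { JOB_RUNNING, JOB_DONE } job_state;

typedef struct {
    pid_t     pid;
    job_state state;
} job;

/* Refresh a job's state. waitpid() is called only while the job is still
 * marked RUNNING; once it is DONE the cached state is returned instead. */
job_state job_poll(job *j)
{
    if (j->state == JOB_DONE)
        return JOB_DONE;               /* already reaped: never wait again */

    int status;
    pid_t r = waitpid(j->pid, &status, WNOHANG);

    if (r == j->pid && (WIFEXITED(status) || WIFSIGNALED(status))) {
        j->state = JOB_DONE;           /* child terminated and is now reaped */
    } else if (r == -1 && errno == ECHILD) {
        j->state = JOB_DONE;           /* assumption: nothing left to reap */
    }
    /* r == 0 means the child exists and has not changed state yet. */
    return j->state;
}
```

With this, jobs and history can both walk the list and call job_poll on each entry: a finished command keeps printing DONE however many times it is listed, because the second and later lookups never reach waitpid().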

For comparison, once a process is done, the Bourne shell reports it one last time and then removes it from the list of jobs:

$ sleep 10 &
$ jobs
[1] + Running                 sleep 10
$ jobs
[1] + Running                 sleep 10
$ jobs
[1]   Done                    sleep 10
$ jobs
$
User contributions licensed under: CC BY-SA