I’m writing a stress test suite for distributed file systems over NFS.
In some cases, when one process deletes a file while another process attempts to read from it, I get a “Stale file handle” error (errno 116).
Is that kind of error expected and acceptable in such a race condition?
The test works as follows:
- Start x client machines
- Each client machine runs y processes
- Each process can perform any of the file operations stat/read/delete/open
- The file ops are the standard Python calls – os.stat/read/os.remove/open
- All files are empty (0 bytes of data)
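The per-process workload above can be sketched roughly as follows. This is a minimal, hypothetical reconstruction, not the actual suite; `run_worker` and its op names are illustrative, and ESTALE is treated here as an acceptable outcome of the delete/read race:

```python
import errno
import os
import random

def run_worker(path, iterations=1000):
    """One stress-test process: repeatedly performs a random file op
    on a shared path. Sketch only; names are illustrative."""
    ops = ("stat", "read", "delete", "open")
    for _ in range(iterations):
        op = random.choice(ops)
        try:
            if op == "stat":
                os.stat(path)
            elif op == "read":
                with open(path, "rb") as f:
                    f.read()
            elif op == "delete":
                os.remove(path)
            elif op == "open":
                open(path, "a").close()  # (re)create the 0-byte file
        except FileNotFoundError:
            pass  # another process deleted the file first: expected
        except OSError as e:
            if e.errno == errno.ESTALE:  # errno 116 on Linux
                pass  # stale NFS handle: expected in this race
            else:
                raise
```

With many such workers on many clients, a delete on one client can race a read on another, which is exactly the window in which the server returns ESTALE.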
The file exists, as a successful stat operation shows:
controller_debug.log.2:2016-10-26 15:02:30,156;INFO – [LG-E27A-LNX:0xa]: finished 640522b4d94c453ea545cb86568320ca, result: success | stat | /JUyw481MfvsBHOm1KQu7sHRB6ffAXKjwIATlsXmOgWh8XKQaIrPbxLgAo7sucdAM/o6V266xE8bTaUGzk8YDMfDAJp0YIfbT4fIK1oZ2R20tRX3xFCvjISj7WuMEwEV41 | data: {} | 2016/10/26 15:02:30.156
Process 0x1 on client CLIENT-A completed a successful delete:
controller_debug.log.2:2016-10-26 15:02:30,164;INFO – [CLIENT-A:0x1]: finished 5f5dfe6a06de495f851745a78857eec1, result: success | delete | /JUyw481MfvsBHOm1KQu7sHRB6ffAXKjwIATlsXmOgWh8XKQaIrPbxLgAo7sucdAM/o6V266xE8bTaUGzk8YDMfDAJp0YIfbT4fIK1oZ2R20tRX3xFCvjISj7WuMEwEV41 | data: {} | 2016/10/26 15:02:30.161
3 milliseconds later, process 0xb on client CLIENT-B failed a “read” op due to “Stale file handle”:
controller_debug.log.2:2016-10-26 15:02:30,164;INFO – [CLIENT-B:0xb]: finished e84e2064ead042099310af1bd44821c0, result: failed | read | /mnt/DIRSPLIT-node0.b27-1/JUyw481MfvsBHOm1KQu7sHRB6ffAXKjwIATlsXmOgWh8XKQaIrPbxLgAo7sucdAM/o6V266xE8bTaUGzk8YDMfDAJp0YIfbT4fIK1oZ2R20tRX3xFCvjISj7WuMEwEV41 | [errno:116] | Stale file handle | 142 | data: {} | 2016/10/26 15:02:30.160 controller_debug.log.2:2016-10-26 15:02:30,164;ERROR – Operation read FAILED UNEXPECTEDLY on File JUyw481MfvsBHOm1KQu7sHRB6ffAXKjwIATlsXmOgWh8XKQaIrPbxLgAo7sucdAM/o6V266xE8bTaUGzk8YDMfDAJp0YIfbT4fIK1oZ2R20tRX3xFCvjISj7WuMEwEV41 due to Stale file handle
Thanks
Answer
This is totally expected. The NFS specification is clear about the use of file handles after an object (be it a file or a directory) has been deleted. Section 4 addresses this directly. For example:
The persistent filehandle will become stale or invalid when the file system object is removed. When the server is presented with a persistent filehandle that refers to a deleted object, it MUST return an error of NFS4ERR_STALE.
This is such a common problem that it even has its own entry in section A.10 of the NFS FAQ, which says one common cause of ESTALE errors is that:
The file handle refers to a deleted file. After a file is deleted on the server, clients don’t find out until they try to access the file with a file handle they had cached from a previous LOOKUP. Using rsync or mv to replace a file while it is in use on another client is a common scenario that results in an ESTALE error.
The expected resolution is that your client application must close and reopen the file to see what has happened. Or, as the FAQ puts it:
… to recover from an ESTALE error, an application must close the file or directory where the error occurred, and reopen it so the NFS client can resolve the pathname again and retrieve the new file handle.
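In Python terms, that recovery looks something like the sketch below. `read_with_retry` is a hypothetical helper, not part of your suite or of any library; the point is that a fresh `open()` forces the NFS client to resolve the pathname again and fetch a new file handle:

```python
import errno

def read_with_retry(path, retries=3):
    """Reopen and retry on ESTALE, as the NFS FAQ recommends.
    Illustrative sketch; `retries` bounds the attempts so a file
    that keeps disappearing doesn't loop forever."""
    for attempt in range(retries):
        try:
            # A fresh open() makes the NFS client re-resolve the
            # pathname and obtain a new file handle from the server.
            with open(path, "rb") as f:
                return f.read()
        except OSError as e:
            if e.errno != errno.ESTALE or attempt == retries - 1:
                raise
```

If the file was genuinely deleted (rather than replaced), the reopen will instead raise FileNotFoundError, which your test can count as a legitimate outcome of the race rather than a failure.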