I am interested in intercepting all system calls related to the file system and having my own code run instead, for example calls to creat, write, close, lseek, getcwd, etc. My goal is to create a function like execve that captures all file I/O from the spawned program in an in-memory filesystem managed by the calling process. This way the calling program can then inspect the output with no filesystem overhead.
My use case is working with large numerical simulation programs that do not have an API or library. These programs only communicate through input and output files. If these files are large, the I/O alone can take a large portion of the runtime. On some computers, with superuser permissions, it is possible to set up a file system that lives in RAM (tmpfs on Linux, for example), but without superuser permissions, or on a machine that is not configured for it, this isn’t possible.
I understand that with LD_PRELOAD it would be possible to have custom code called instead of the functions in libc. However, this only works for dynamically linked programs, and it doesn’t answer the question of how the IPC should be performed between the calling program (which I want to host the in-memory file system) and the callee. The question for this approach is how best to perform the IPC. Should I use pipes, Unix domain sockets, or some shared memory?
I have also looked at ptrace as a way to intercept system calls. This seems like it might work, but I have two questions about this approach. First, how does one prevent the actual system call from occurring (as opposed to just modifying the arguments to the system call, as I have seen in some examples)? Second, does ptrace allow for high-performance reading of the memory space of the callee?
Answer
Using LD_PRELOAD, you can have your intercepting code run in the callee’s memory space. Using a library constructor function (__attribute__((constructor))), you can have code of your choosing run when the library first starts up, e.g. mmaping your virtual filesystem and initializing it.
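As a rough illustration of the constructor idea, here is a minimal sketch in C. The names vfs_init, vfs_arena, vfs_arena_size and the 64 MiB VFS_SIZE are made up for this example, and the arena is just an anonymous private mapping with no filesystem layout on top of it yet.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical arena size for the in-memory filesystem. */
#define VFS_SIZE (64u * 1024u * 1024u)

static void *vfs_base = NULL;

/* Runs automatically when the shared object is loaded,
 * i.e. before the target program's main(). */
__attribute__((constructor))
static void vfs_init(void)
{
    void *p = mmap(NULL, VFS_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    /* On failure leave vfs_base NULL so callers can fall back to real I/O. */
    vfs_base = (p == MAP_FAILED) ? NULL : p;
}

/* Accessors the intercepting functions would use (hypothetical helpers). */
void *vfs_arena(void)
{
    return vfs_base;
}

size_t vfs_arena_size(void)
{
    return vfs_base ? VFS_SIZE : 0;
}
```

Built with something like gcc -shared -fPIC -o libvfs.so vfs.c and started as LD_PRELOAD=./libvfs.so ./simulation (both file names are placeholders), the arena should already exist by the time the program makes its first file call.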
Then, when you intercept the calls with your preloaded library, the library’s functions are running in the target process, with direct access to the constructed filesystem, so no IPC is needed.
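To make the interception step concrete, here is a hedged sketch of a preloaded creat. Everything named vfs_* or VFS_* is hypothetical: paths under an assumed /vfs/ prefix get a fake descriptor from a small in-memory table, and everything else is passed straight to the kernel. The pass-through uses a raw openat syscall (which is what creat amounts to); looking up the original with dlsym(RTLD_NEXT, "creat") is the other common way to do it.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define VFS_PREFIX    "/vfs/"  /* hypothetical: paths under this stay in memory */
#define VFS_FD_BASE   1000     /* hypothetical: fake descriptors start here */
#define VFS_MAX_FILES 64

static char vfs_names[VFS_MAX_FILES][256];
static int  vfs_count = 0;

/* Same signature as libc's creat(). Because the library is preloaded,
 * the dynamic linker resolves the target program's calls to this one. */
int creat(const char *path, mode_t mode)
{
    if (strncmp(path, VFS_PREFIX, strlen(VFS_PREFIX)) == 0) {
        if (vfs_count >= VFS_MAX_FILES) {
            errno = ENFILE;
            return -1;
        }
        if (strlen(path) >= sizeof vfs_names[0]) {
            errno = ENAMETOOLONG;
            return -1;
        }
        strcpy(vfs_names[vfs_count], path);
        return VFS_FD_BASE + vfs_count++;
    }
    /* Not ours: hand the request to the kernel unchanged. */
    return (int)syscall(SYS_openat, AT_FDCWD, path,
                        O_CREAT | O_WRONLY | O_TRUNC, mode);
}

/* Number of files currently held in memory (hypothetical helper). */
int vfs_file_count(void)
{
    return vfs_count;
}
```

A real library would have to cover the rest of the family in the same way (open, write, read, lseek, close, plus the 64-bit variants such as creat64 that many binaries actually call), and the table above is not thread-safe.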
If the calling process must manage the filesystem, you’ll incur overhead communicating with it. I’d recommend mapping the important parts of the filesystem into the child process (perhaps as a shared memory region), and using a listener in the child to watch for filesystem changes from the parent (with suitable locking around your filesystem operations). You can do the change notification with a simple pipe, since only the notifications travel over it and the bulk data stays in shared memory, so its bandwidth requirements are low.
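The shared-region-plus-pipe arrangement can be sketched as follows. This is only an illustration under simplifying assumptions: shared_fs_demo and REGION_SIZE are invented names, the notification goes from child to parent (the other direction works the same way), and the child is a plain fork rather than an exec'd program. An anonymous mapping does not survive execve, so a preloaded library in a real child would need to re-map a named object, for instance one created with shm_open.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define REGION_SIZE 4096  /* hypothetical size of the shared region */

/* Returns 0 if the parent saw the child's data, -1 on any failure.
 * The text found in the shared region is copied into out. */
int shared_fs_demo(char *out, size_t out_len)
{
    /* Bulk data lives here; both processes see the same pages. */
    char *region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;

    int notify[2];  /* low-bandwidth change notifications */
    if (pipe(notify) != 0) {
        munmap(region, REGION_SIZE);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(notify[0]);
        close(notify[1]);
        munmap(region, REGION_SIZE);
        return -1;
    }
    if (pid == 0) {
        /* Child: write the "file" contents, then signal the change. */
        close(notify[0]);
        strcpy(region, "result: 42\n");
        char token = 'w';
        ssize_t n = write(notify[1], &token, 1);
        _exit(n == 1 ? 0 : 1);
    }

    /* Parent: block until notified, then read directly from memory. */
    close(notify[1]);
    char token;
    int rc = -1;
    if (out_len > 0 && read(notify[0], &token, 1) == 1) {
        strncpy(out, region, out_len - 1);
        out[out_len - 1] = '\0';
        rc = 0;
    }
    close(notify[0]);
    waitpid(pid, NULL, 0);
    munmap(region, REGION_SIZE);
    return rc;
}
```

Here the pipe doubles as the synchronization point, because the parent only touches the region after the child has finished writing. With writers on both sides you would need a real lock, for example a process-shared mutex placed inside the region.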
Also check out Plash, a paravirtualizing system which sandboxes filesystem access by providing a modified Glibc.