In this post, we're going to clone the parent process of our container into a child process.
Before that can happen, we'll lay the groundwork by preparing some inter-process communication channels that let us interact with the child process we're going to create.

When it comes to inter-process communication (IPC for short) on a single host, Unix domain sockets, or "same host socket communication", are the usual solution. They differ from the "network socket communication" kind of sockets that are used to perform network operations with remote hosts. You can find a really nice article about sockets and IPC here. In practice, such a socket behaves like a file (Unix philosophy: everything is a file) that we read from or write to in order to transfer information from one process to another.

For our tool, we don't need any fancy IPC, but we want to be able to transfer a simple boolean to and from our child process. Creating a pair of sockets allows us to give one to the child process and one to the parent process. This way we can transfer raw binary data from one process to the other the same way we write binary data into a file stored in a filesystem.

Let's create a new file, src/ipc.rs, containing everything related to IPC. We create a generate_socketpair function in which we call the socketpair function, the standard Unix way of creating socket pairs, but called from Rust. As we use a new Errcode::SocketError variant, let's add it to src/errors.rs.

When creating the configuration, let's generate our socketpair and add it to the ContainerOpts data so the child process can access it easily. Let's also modify the ContainerOpts::new function so it returns the sockets along with the constructed ContainerOpts struct, as the parent container needs access to them. In our container implementation, let's add a field to the Container struct to access the sockets more easily. As sockets require some cleaning before exit, let's close them in the clean_exit function.

To make the sockets easier to use, let's create two wrappers. We only want to transfer booleans, so let's create a send_boolean and a recv_boolean function in src/ipc.rs. They simply call the send and recv functions from the nix crate and handle the data type conversion. We won't use the wrappers for now, but they'll come in handy later.

The code for this step is available on github litchipi/crabcan branch “step7”.
To group everything related to the cloning and management of the child process, let's create a new module child in a file src/child.rs, after declaring the module in src/main.rs. We can also create new types of errors to deal with anything going wrong during the generation of the child process or during the preparation inside the container, and add them to src/errors.rs.

For now, we create a dummy child function in src/child.rs that simply echoes the arguments it will execute. The child process outputs something to stdout and returns 0 as a signal that nothing went wrong. We also pass it the configuration, in which we'll be able to bundle everything we want our child process to know about.

Then we create the function that clones the parent process and calls the child, still in src/child.rs. Let's split this code to understand it properly: it allocates a buffer that will hold the stack of the child, sets the namespace flags, calls the clone syscall, and returns the PID of the child.

If you don't know what Linux namespaces are, I recommend reading the Wikipedia article about them for a quick and reasonably complete introduction. In one line, a namespace is an isolation mechanism provided by the Linux kernel that allows a process in this namespace to see a different version of a resource than the global system. In practice, a process in its own PID namespace can, for example, use any PID number, including the init one (PID = 1). Check out the linux manual for namespaces for more details.

Back to our child cloning preparation: each flag creates a new namespace of the given type for the child process. If a flag is not set, the child usually stays in the corresponding namespace of the parent process. The complete code sets six flags: CLONE_NEWNS, CLONE_NEWCGROUP, CLONE_NEWPID, CLONE_NEWIPC, CLONE_NEWNET and CLONE_NEWUTS. So while creating our child, we separate its world from the one of the system, allowing it to modify whatever it wants (at least for the namespaces used) without harming the configuration of our system.

Now that we have our clean generate_child_process function, we can call it in the create function of our container, in src/container.rs, and store the resulting pid in the struct fields.

Now that our container contains everything needed to generate a new clean child process, we will update the main function to wait for the child to finish. This way, the container generates the child process using the arguments we give to it, then holds and waits for the child to end before quitting.
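The "create a child, then hold until it ends" pattern described above can be sketched with the standard library alone. This is only an illustration under stated assumptions: it relies on std::process::Command instead of the clone and waitpid calls used in the project, it creates no namespaces, the helper name run_and_wait is hypothetical, and it assumes an sh binary is available on the PATH.

```rust
use std::io;
use std::process::Command;

/// Spawn a child process and block until it terminates.
/// Returns the exit code, or None if the child was killed by a signal.
/// (Hypothetical helper, not part of crabcan.)
fn run_and_wait(program: &str, args: &[&str]) -> io::Result<Option<i32>> {
    // The parent creates the child...
    let mut child = Command::new(program).args(args).spawn()?;
    // ...then holds here until the child has finished.
    let status = child.wait()?;
    Ok(status.code())
}

fn main() -> io::Result<()> {
    // Assumption: `sh` exists on this system.
    let code = run_and_wait("sh", &["-c", "exit 0"])?;
    println!("child finished with exit code {:?}", code);
    Ok(())
}
```

The parent only learns the outcome once wait returns, which is the same blocking behaviour the container relies on before cleaning up and quitting.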
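The function wait_child is defined in src/container.rs. It uses the waitpid syscall which, according to the manual, suspends the execution of the calling process until the child specified by the pid argument has changed state; by default it waits only for terminated children. As we wait for the termination, we just pass None as options, and return an Errcode::ContainerError error if the syscall didn't finish successfully.

Maybe you have been wondering since the beginning why we need sudo to run our tests. In the first 7 steps it wasn't necessary, but here, as we create new namespaces for our child process, the CAP_SYS_ADMIN capability is needed.

The output we can get from testing this step is shown in the Testing section.

The code for this step is available on github litchipi/crabcan branch “step8”.

Inter-process communication (IPC) with sockets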
Introduction to IPC
Create a socketpair
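Let's create a new file src/ipc.rs containing everything related to IPC:
use crate::errors::Errcode;
use std::os::unix::io::RawFd;
use nix::sys::socket::{socketpair, AddressFamily, SockType, SockFlag, send, MsgFlags, recv};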
We create a generate_socketpair function in which we call the socketpair function, which is the standard Unix way of creating socket pairs, but called from Rust. Its arguments are:
- AddressFamily::Unix: we are creating a Unix domain socket (see all AddressFamily variants for details).
- SockType::SeqPacket: the socket uses sequenced, connection-based packet semantics, with datagrams of a fixed maximum length (see all SockType variants for details).
- None: the socket will use the default protocol associated with the socket type.
- SockFlag::SOCK_CLOEXEC: the socket will be automatically closed after any successful syscall of the exec family (see Linux manual for exec syscalls).
As we use a new Errcode::SocketError variant, let's add it to src/errors.rs now:
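As a rough idea of what such a variant can look like, here is a minimal self-contained sketch. It does not reproduce the project's src/errors.rs: the u8 payloads, the Display messages and the generate_pair helper are assumptions made for illustration, and the standard library's UnixDatagram::pair stands in for the nix socketpair call.

```rust
use std::fmt;
use std::os::unix::net::UnixDatagram;

/// Reduced, hypothetical version of an error enum with a socket variant.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum Errcode {
    ContainerError(u8),
    SocketError(u8), // assumption: a small numeric code identifies the failing step
}

impl fmt::Display for Errcode {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Errcode::ContainerError(code) => write!(f, "container error (code {})", code),
            Errcode::SocketError(code) => write!(f, "socket error (code {})", code),
        }
    }
}

/// Create a connected pair of Unix domain sockets, mapping any I/O
/// failure to the SocketError variant.
fn generate_pair() -> Result<(UnixDatagram, UnixDatagram), Errcode> {
    UnixDatagram::pair().map_err(|_| Errcode::SocketError(0))
}

fn main() {
    match generate_pair() {
        Ok(_) => println!("socket pair created"),
        Err(e) => println!("failed: {}", e),
    }
}
```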
Adding the sockets to the container config
When creating the configuration, let's generate our socketpair and add it to the ContainerOpts data so the child process can access it easily. In the src/config.rs file:
use crate::ipc::generate_socketpair;
// ...
use std::os::unix::io::RawFd;
Also let's modify the ContainerOpts::new function so it returns the sockets along with the constructed ContainerOpts struct, as the parent container needs access to them.
Adding to the container, setting up the cleaning
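In our container implementation, let's add a field in the Container struct to be able to access the sockets more easily. In the file src/container.rs:
use nix::unistd::close;
use std::os::unix::io::RawFd;
// ...
As sockets require some cleaning before exit, let's close them in the clean_exit function.
Creating wrappers for IPC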
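To make the sockets easier to use, let's create two wrappers. We only want to transfer booleans, so let's create a send_boolean and a recv_boolean function in src/ipc.rs: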
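To give an idea of the shape of such wrappers, here is a minimal sketch that uses only the standard library. It is not the project's code: it works on std's UnixDatagram rather than on raw file descriptors with the nix send and recv functions, and the one-byte encoding (1 for true, 0 for false) is an assumption.

```rust
use std::io;
use std::os::unix::net::UnixDatagram;

/// Send a boolean as a single byte (1 = true, 0 = false).
fn send_boolean(sock: &UnixDatagram, value: bool) -> io::Result<()> {
    let data = [value as u8];
    sock.send(&data)?;
    Ok(())
}

/// Receive a single byte and interpret it as a boolean.
fn recv_boolean(sock: &UnixDatagram) -> io::Result<bool> {
    let mut data = [0u8; 1];
    let received = sock.recv(&mut data)?;
    if received != 1 {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "expected exactly one byte",
        ));
    }
    Ok(data[0] != 0)
}

fn main() -> io::Result<()> {
    // A connected pair: in a container, one end would go to the parent
    // and the other to the child process.
    let (parent_end, child_end) = UnixDatagram::pair()?;
    send_boolean(&parent_end, true)?;
    println!("received: {}", recv_boolean(&child_end)?);
    Ok(())
}
```

Both ends live in the same process here only to keep the example runnable on its own; the pair is bidirectional, so either end can send or receive.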
The wrappers in src/ipc.rs simply interact with the send and recv functions from the nix crate, handling data type conversion and so on. There's not much to say about them, but it's still interesting to see how we can interact, from Rust, with functions that have a low-level C backend.
Patch for this step
The raw patch to apply on the previous step can be found here
Cloning a process
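To group everything related to the cloning and management of the child process, let's create a new module child in a file src/child.rs. First of all, define the modules in src/main.rs:
...
We can also create new types of errors to deal with anything going wrong in our child process generation or during the preparation inside the container, and add them to src/errors.rs:
Creating a child process
For now, we create a dummy child function simply echoing the arguments it will execute. We create the function in src/child.rs:
Then we create the function cloning the parent process and calling the child, still in src/child.rs:
use crate::errors::Errcode;
use crate::config::ContainerOpts;
use nix::unistd::Pid;
use nix::sched::clone;
use nix::sys::signal::Signal;
use nix::sched::CloneFlags;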
const STACK_SIZE: usize = 1024 * 1024;
- We allocate a buffer of STACK_SIZE bytes, which we define as 1 MiB (1024 * 1024 bytes). This buffer will hold the stack of the child process. Note that this is different from the original C clone function (as detailed in the nix::sched::clone documentation).
- We call the clone syscall, redirecting to our child function, with our config struct as an argument, the temporary stack for the process, the flags we set, along with the instruction to send the parent process a SIGCHLD signal when the child exits.
- We get back a process ID, or PID in short, a number uniquely identifying our process for the Linux kernel. We return this pid as we will store it in our container struct.
A word about namespaces
In practice, a process in its own PID namespace can, for example, use any PID number, including the init one (PID = 1).
Setting the flags
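let mut flags = CloneFlags::empty();
flags.insert(CloneFlags::CLONE_NEWNS);
flags.insert(CloneFlags::CLONE_NEWCGROUP);
flags.insert(CloneFlags::CLONE_NEWPID);
flags.insert(CloneFlags::CLONE_NEWIPC);
flags.insert(CloneFlags::CLONE_NEWNET);
flags.insert(CloneFlags::CLONE_NEWUTS);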
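Under the hood, each of these flags is a single bit, and inserting a flag amounts to a bitwise OR. The sketch below illustrates this with the standard library only; the numeric values are the ones from the Linux header linux/sched.h, while the table NAMESPACE_FLAGS and the helpers combined_flags and is_set are hypothetical names used for illustration, not part of nix or of the project.

```rust
/// Namespace-related clone flags and their bit values (from <linux/sched.h>).
const NAMESPACE_FLAGS: [(&str, u64); 6] = [
    ("CLONE_NEWNS", 0x0002_0000),
    ("CLONE_NEWCGROUP", 0x0200_0000),
    ("CLONE_NEWPID", 0x2000_0000),
    ("CLONE_NEWIPC", 0x0800_0000),
    ("CLONE_NEWNET", 0x4000_0000),
    ("CLONE_NEWUTS", 0x0400_0000),
];

/// Start from an empty set and insert every flag, one bitwise OR at a time.
fn combined_flags() -> u64 {
    NAMESPACE_FLAGS.iter().fold(0, |acc, &(_, bit)| acc | bit)
}

/// Check whether the flag with the given name is present in the set.
fn is_set(flags: u64, name: &str) -> bool {
    NAMESPACE_FLAGS
        .iter()
        .any(|&(n, bit)| n == name && flags & bit != 0)
}

fn main() {
    let flags = combined_flags();
    println!("combined flags: {:#010x}", flags); // prints 0x6e020000
    println!("new PID namespace requested: {}", is_set(flags, "CLONE_NEWPID"));
}
```

A flag that is never inserted leaves its bit at zero, which is why the child then stays in the parent's namespace of that type.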
- CLONE_NEWNS will start the cloned child in a new mount namespace, initialized with a copy of the namespace from the parent process. Check the mount-namespaces manual for more information.
- CLONE_NEWCGROUP will start the cloned child in a new cgroup namespace. Cgroups are explained a bit later in the tutorial, as we will use them to restrict the resources our child process can use. Check the cgroup-namespaces manual for more information.
- CLONE_NEWPID will start the cloned child in a new pid namespace. This basically means that our child process will see itself with a PID = X, while outside of its namespace the Linux kernel identifies it with another one. Check the pid-namespaces manual for more information.
- CLONE_NEWIPC will start the cloned child in a new ipc namespace. Processes inside this namespace can interact with each other, whereas processes outside cannot reach them through normal IPC methods. Check the ipc-namespaces manual for more information.
- CLONE_NEWNET will start the cloned child in a new network namespace. It will not share the interfaces and network configuration of other namespaces. Check the network-namespaces manual for more information.
- CLONE_NEWUTS will start the cloned child in a new uts namespace. UTS stands for UNIX Timesharing System; I cannot explain the choice of name, but this namespace allows the contained process to set its own hostname and NIS domain name. Check the uts-namespaces manual for more information.
Generate the child from the container
Now that we have our clean generate_child_process function, we can call it in the create function of our container, and store the resulting pid in the struct fields. In src/container.rs, add:
use crate::child::generate_child_process;
use nix::unistd::Pid;
use nix::sys::wait::waitpid;
Waiting for the child to finish
Now that our container contains everything needed to generate a new clean child process, we update the main function to wait for the child to finish. In src/container.rs:
The function wait_child is defined in src/container.rs like so:
This function uses the syscall waitpid. From the manual:
The waitpid() system call suspends execution of the calling process until a child specified by pid argument has changed state. By default, waitpid() waits only for terminated children, but this behavior is modifiable via the options argument, as described below.
As we wait for the termination, we will just pass None as options, and return an Errcode::ContainerError error if the syscall didn't finish successfully.
Testing
Maybe you have been wondering since the beginning why we need sudo to run our tests. In the first 7 steps it wasn't necessary, but here, as we create new namespaces for our child process, the CAP_SYS_ADMIN capability is needed (see the manual for capabilities or this article from LWN).
Here's the output we can get from testing this step:
[2021-11-12T08:52:17Z INFO crabcan] Args { debug: true, command: "/bin/bash", uid: 0, mount_dir: "./mountdir/" }
[2021-11-12T08:52:17Z DEBUG crabcan::container] Linux release: 5.11.0-38-generic
[2021-11-12T08:52:17Z DEBUG crabcan::container] Container sockets: (3, 4)
[2021-11-12T08:52:17Z DEBUG crabcan::container] Creation finished
[2021-11-12T08:52:17Z DEBUG crabcan::container] Container child PID: Some(Pid(134400))
[2021-11-12T08:52:17Z DEBUG crabcan::container] Waiting for child (pid 134400) to finish
[2021-11-12T08:52:17Z INFO crabcan::child] Starting container with command /bin/bash and args ["/bin/bash"]
[2021-11-12T08:52:17Z DEBUG crabcan::container] Finished, cleaning & exit
[2021-11-12T08:52:17Z DEBUG crabcan::container] Cleaning container
[2021-11-12T08:52:17Z DEBUG crabcan::errors] Exit without any error, returning 0
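Regarding the CAP_SYS_ADMIN requirement mentioned in this section, a process can check its own effective capabilities before trying to create namespaces. The sketch below is not part of the project: it assumes a Linux /proc filesystem, parses the CapEff line of /proc/self/status, and tests bit 21, which is the number assigned to CAP_SYS_ADMIN in capabilities(7). The helper names effective_caps and has_cap are hypothetical.

```rust
use std::fs;

/// Bit index of CAP_SYS_ADMIN in the capability sets.
const CAP_SYS_ADMIN: u32 = 21;

/// Extract the effective capability mask from the text of /proc/<pid>/status.
fn effective_caps(status: &str) -> Option<u64> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("CapEff:"))
        .and_then(|hex| u64::from_str_radix(hex.trim(), 16).ok())
}

/// Test whether a given capability bit is set in the mask.
fn has_cap(mask: u64, cap: u32) -> bool {
    mask & (1u64 << cap) != 0
}

fn main() {
    match fs::read_to_string("/proc/self/status") {
        Ok(status) => match effective_caps(&status) {
            Some(mask) if has_cap(mask, CAP_SYS_ADMIN) => {
                println!("CAP_SYS_ADMIN is available")
            }
            Some(_) => println!("CAP_SYS_ADMIN is missing, try running with sudo"),
            None => println!("no CapEff line found in /proc/self/status"),
        },
        Err(e) => println!("could not read /proc/self/status: {}", e),
    }
}
```

Run as a regular user, the check will typically report the capability as missing, which matches the need for sudo described above.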
Patch for this step
The raw patch to apply on the previous step can be found here