February 10, 2020

eBPF and the sockmap API

Recently I read an older post by the Cloudflare engineering team on using sockmap for TCP splicing (here). Intrigued by the potential for some serious entertainment and curious about the complex tooling, I decided to wet my beak in this new area.

I am a firm believer in learning by doing, and in this case doubly so given the lack of documentation around this particular eBPF interface. But what should be the ‘doing’ vehicle for the learning? Well, what about something useless like connecting two TCP clients to each other via an intermediary? Sort of like a really simple proxy/relay/TURN server? Perfect!

All of the code in this post is available as a complete program on GitHub here.

First attempt

Compiling and loading a BPF program is a multi-step process that roughly goes:

  1. Write an eBPF program in C.
  2. Compile said program using clang into an eBPF machine code ELF object file.
  3. For any shared resources in the program, such as maps, first create the map using the appropriate system call and then do the ‘relocation’ of the resource in the object file.
  4. Ask the kernel to load your object file and attach it to some eBPF interface.

An eBPF program like this one

#include <linux/bpf.h>
#include <linux/types.h>

#include <bpf/bpf_helpers.h>

#define SEC(NAME) __attribute__((section(NAME), used))

struct bpf_map_def SEC("maps") sock_map =
  {
   .type = BPF_MAP_TYPE_SOCKMAP,
   .key_size = sizeof(int),
   .value_size = sizeof(int),
   .max_entries = 2,
  };

struct bpf_map_def SEC("maps") ip_map =
  {
   .type = BPF_MAP_TYPE_HASH,
   .key_size = sizeof(__u64),
   .value_size = sizeof(int),
   .max_entries = 64,
  };

SEC("sk_skb/stream_parser")
int turn_parser(struct __sk_buff *skb)
{
	return skb->len;
}

SEC("sk_skb/stream_verdict")
int turn_verdict(struct __sk_buff *skb) {
  __u64 ip = skb->remote_ip4;
  __u32 port = skb->remote_port;
  __u64 key = (ip << 32) | port;

  int *idx = bpf_map_lookup_elem(&ip_map, &key);
  if (!idx) {
    return SK_DROP;
  }

  return bpf_sk_redirect_map(skb, &sock_map, *idx, 0);
}

has two maps that both need to be created by the ‘user space’ loader application, and then the maps' file descriptors need to be ‘mapped’ into the appropriate locations in the eBPF object file using a magical eBPF instruction. The kernel will then resolve each file descriptor to an actual memory address when the eBPF program is loaded.

If this seems a bit complicated at first glance it's because it is a bit complicated! So being a fairly lazy person I decided to see what tool is out there that can do the above for me without resorting to writing my own loader and I quickly discovered BCC - the BPF Compiler Collection.

BCC is a really neat project that allows you to compile your BPF programs and patch them with the relocations all from your favourite programming language - as long as that programming language is either C or Python. So pretty much everyone is covered, no?

It's also some pretty cool technology. Essentially, from what I can tell, it's a clang frontend that compiles your C code to an eBPF program and automatically does the relocations for you as part of the compilation process. Given the allure of automatically generating eBPF programs including all the relocations, BCC seemed ideal - until I started using it and discovered that sockmap and its associated calls are NOT supported by the clang frontend.

No matter. Not afraid to roll up my sleeves, grease my elbows and patch this beautiful beast of a tool to mine own purposes, I dove head first into the rabbit hole. And boy is it a magnificent rabbit hole. But it's also a really deep, mostly undocumented C++ and clang API rabbit hole and I was here to learn about eBPF. Instead I decided to look around for an alternate solution. (Also I wanted to finish this blog post).

The first attempt yielded some partial patches (promise, will try to wrap these up) to the BCC toolchain to support sockmap, some additional APIs and a deep respect for the current BCC team.

Second attempt

The second attempt was more straightforward. As part of digging into BCC I discovered a library called libbpf, which is a standalone build of the Linux kernel tree's bpf library. A quick review of the header files seemed to show that this tool could certainly help with most of the nitty-gritty ELF parsing and eBPF specifics, and even provide convenient wrappers around the BPF system call.

Seemed almost too good to be true. Given the above eBPF program, how do I actually compile and load it? For the compiling piece I'd have to resort to a crude command line:

clang -Wall -Wextra -O2 -emit-llvm -c ebpf-kern.c -S -o - | llc -march=bpf -filetype=obj -o ebpf-kern.o

Long gone is the magic of BCC and I would of course have to do my own… dun-dun-duh RELOCATIONS! But before we jump into that I'd like to present the masterpiece below - the main entry point. As is fashionable these days there are no comments. The complete main.cc is on GitHub.

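#include <arpa/inet.h>
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#include <bpf/bpf.h>
#include <bpf/libbpf.h>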

I built libbpf from source (on Ubuntu Eoan) for this - I figured it'd be easier to have a version that I could ‘mess with’ if needed.

#include "server.h"
#include "client.h"
#include "bpf-loader.h"

static int add_ip(struct bpf_map *ip_map, struct bpf_map *sock_map, const client &from, int idx, const client &to) {
  uint64_t key = (static_cast<uint64_t>(htonl(from.ip())) << 32) | htonl(from.port());

  if (bpf_map_update_elem(bpf_map__fd(ip_map), &key, &idx, BPF_ANY) != 0) {
    fprintf(stderr, "%d: %s\n", errno, strerror(errno));
    return -1;
  }

  int fd = to.fd();
  if (bpf_map_update_elem(bpf_map__fd(sock_map), &idx, &fd, BPF_ANY) != 0) {
    fprintf(stderr, "%d: %s\n", errno, strerror(errno));
    return -1;
  }

  return 0;
}

As you may have noted in the eBPF program, there are two maps used. The first one is the sockmap that contains all the client sockets; the second one is a hash map that maps each client's remote IP and remote port to the sockmap index it's stored at.

This allows the sockmap verdict program to route each sk_buff to the correct socket by constructing the hash key from the sk_buff remote IP and remote port and using that to get the correct sockmap index from the hash map. Of course this only supports IPv4. You might notice that the key uses network byte order throughout; this is because the sk_buff fields are in network byte order. The sk_buff also stores the port as a 32-bit value, hence the htonl.
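To make the key layout a bit more concrete, here is a small sketch of the packing that add_ip does, plus the reverse operation. The make_key and split_key names are made up for illustration and are not part of the repo; the only assumption is that the client hands us the IP and port in host byte order.

```cpp
#include <arpa/inet.h>
#include <cstdint>

// Hypothetical helper: pack a host-order IPv4 address and port the way the
// verdict program sees them. Both halves end up in network byte order and
// the port is widened to 32 bits before the byte swap.
inline uint64_t make_key(uint32_t ip_host, uint16_t port_host) {
  uint64_t ip = htonl(ip_host);
  uint64_t port = htonl(port_host);
  return (ip << 32) | port;
}

// Hypothetical helper: reverse of make_key, handy when debugging what
// actually ended up in ip_map.
inline void split_key(uint64_t key, uint32_t *ip_host, uint16_t *port_host) {
  *ip_host = ntohl(static_cast<uint32_t>(key >> 32));
  *port_host =
      static_cast<uint16_t>(ntohl(static_cast<uint32_t>(key & 0xffffffffu)));
}
```

Calling make_key(0x7f000001, 8080) gives the key for a client connecting from 127.0.0.1 with source port 8080, and split_key turns it back into the same pair.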

int main(int argc, char *argv[]) {
  bpf_loader b;

Most of the loading magic is contained within the bpf_loader class. It exposes a fairly clean interface but is actually a horrible hack underneath.

  auto r3 = b.load(argv[1]);

  if (r3.is_err()) {
    fprintf(stderr, "%d: %s\n", r3.error().err(), r3.error().msg().c_str());
    return -1;
  }

  auto r4 = b.map("sock_map");
  if (r4.is_err()) {
    fprintf(stderr, "%d: %s\n", r4.error().err(), r4.error().msg().c_str());
    return -1;
  }

  struct bpf_map *sock_map = r4.value();

  r4 = b.map("ip_map");
  if (r4.is_err()) {
    fprintf(stderr, "%d: %s\n", r4.error().err(), r4.error().msg().c_str());
    return -1;
  }

  struct bpf_map *ip_map = r4.value();

At this point we've loaded our eBPF object file and retrieved the two maps (“sock_map” and “ip_map”), so we know the eBPF program is good.

  server s("localhost", 8080);
  auto r = s.init();

  if (r.is_err()) {
    fprintf(stderr, "%d: %s\n", r.error().err(), r.error().msg().c_str());
    return -1;
  }

  auto r2 = s.accept();
  if (r2.is_err()) {
    fprintf(stderr, "%d: %s\n", r2.error().err(), r2.error().msg().c_str());
    return -1;
  }

  client c1(r2.value().first, r2.value().second);
  fprintf(stdout, "%s connected on %d (%d)\n", c1.hostname().c_str(), c1.port(), c1.fd());

  r2 = s.accept();
  if (r2.is_err()) {
    fprintf(stderr, "%d: %s\n", r2.error().err(), r2.error().msg().c_str());
    return -1;
  }

  client c2(r2.value().first, r2.value().second);
  fprintf(stdout, "%s connected on %d (%d)\n", c2.hostname().c_str(), c2.port(), c2.fd());

We've created a listening socket and accepted two client connections, so let's connect them!


  add_ip(ip_map, sock_map, c1, 0, c2);
  add_ip(ip_map, sock_map, c2, 1, c1);

We call the add_ip function to set up the kernel side by adding the client sockets to the ip_map and sock_map. The program has already been loaded at this point though; until now we've just been dropping all the incoming data.

  struct pollfd fds[2] =
    {
     { .fd = c1.fd(), .events = POLLRDHUP },
     { .fd = c2.fd(), .events = POLLRDHUP },
    };

  poll(fds, sizeof(fds) / sizeof(fds[0]), -1);

  fprintf(stdout, "we are done!\n");
  // wait for the magic!
  return 0;
}

Then we just wait for one of the clients to close their connection, which exits the program and cleans up the eBPF program and the eBPF maps.
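If you want to see the POLLRDHUP wait in isolation, the sketch below does the same thing for a single file descriptor. The peer_closed name is made up for illustration; POLLRDHUP is Linux-specific and relies on _GNU_SOURCE, which g++ defines by default.

```cpp
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: returns true once the peer of fd has closed (or
// half-closed) its end, false on timeout or error. We only ask for
// POLLRDHUP, but the kernel may also report POLLHUP, so accept either.
inline bool peer_closed(int fd, int timeout_ms) {
  struct pollfd p = {};
  p.fd = fd;
  p.events = POLLRDHUP;
  int n = poll(&p, 1, timeout_ms);
  return n == 1 && (p.revents & (POLLRDHUP | POLLHUP)) != 0;
}
```

An easy way to try it without a network is a socketpair(AF_UNIX, SOCK_STREAM, 0, sv): peer_closed(sv[0], 0) is false while both ends are open and becomes true after close(sv[1]).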

The eBPF loader

All the nitty-gritty of loading and massaging the eBPF object file is contained within the bpf_loader. The code is pretty straightforward except for the part that updates the maps in the eBPF file with the corresponding file descriptors.

libbpf provides an API where you can register a callback to modify the eBPF program before it's loaded into the kernel. To use it, we simply register our callback function, say which program it should run for (the verdict program) and how many instances of the program we want it to prepare (just the one in this case).

bpf_program__set_prep(programs_["sk_skb/stream_verdict"], 1, sock_map_inserter);

Simple eh? The actual callback implementation is a bit more complicated since we need to ‘patch up’ the correct eBPF instructions with the file descriptors for the maps.

static int sock_map_inserter(struct bpf_program *prog, int n,
                        struct bpf_insn *insns, int insns_cnt,
                        struct bpf_prog_prep_result *res) {
  // sizeof(*insns) is the size of one instruction; sizeof(insns) would be the size of a pointer
  struct bpf_insn *ni = static_cast<struct bpf_insn *>(calloc(insns_cnt, sizeof(*insns)));
  memcpy(ni, insns, sizeof(*insns) * insns_cnt);

  for (int i = 0; i < insns_cnt; ++i) {
    if (ni[i].code == (BPF_LD | BPF_IMM | BPF_DW)) {

What we are looking for here is a load instruction, specifically a 64-bit load in IMM mode. Of course this is fragile: we make the assumption that we would only do this load in case of a map/resource call. But we don't know what this load is for at this point or how to handle it. The specifics of the magical load instructions are described in a comment in linux/bpf.h.

/* When BPF ldimm64's insn[0].src_reg != 0 then this can have
 * two extensions:
 *
 * insn[0].src_reg:  BPF_PSEUDO_MAP_FD   BPF_PSEUDO_MAP_VALUE
 * insn[0].imm:      map fd              map fd
 * insn[1].imm:      0                   offset into value
 * insn[0].off:      0                   0
 * insn[1].off:      0                   0
 * ldimm64 rewrite:  address of map      address of map[0]+offset
 * verifier type:    CONST_PTR_TO_MAP    PTR_TO_MAP_VALUE
 */
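To make the left-hand column of that comment concrete, here is a rough sketch of what a map-fd ldimm64 looks like as two struct bpf_insn slots. The is_ldimm64 and emit_ld_map_fd names are made up for illustration; the struct and constants come straight from linux/bpf.h.

```cpp
#include <linux/bpf.h>
#include <cstdint>

// Hypothetical helper: true if insn is the first slot of a 64-bit
// immediate load (ldimm64).
inline bool is_ldimm64(const struct bpf_insn &insn) {
  return insn.code == (BPF_LD | BPF_IMM | BPF_DW);
}

// Hypothetical helper: fill the two instruction slots of an ldimm64 that
// refers to a map by file descriptor (the BPF_PSEUDO_MAP_FD column).
inline void emit_ld_map_fd(struct bpf_insn out[2], uint8_t dst_reg,
                           int map_fd) {
  out[0] = {};
  out[0].code = BPF_LD | BPF_IMM | BPF_DW;
  out[0].dst_reg = dst_reg;
  out[0].src_reg = BPF_PSEUDO_MAP_FD;  // tells the kernel imm is a map fd
  out[0].imm = map_fd;
  out[1] = {};  // second slot: code 0, imm would hold the upper 32 bits
}
```

The second slot is why a 64-bit immediate load takes up two entries in the instruction array, even though it is logically one instruction.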

If we were using BCC, the frontend would take care of this for us but… see attempt no. 1. Instead we get to loop through all the following instructions to find the call site and then figure out what call this load instruction is for.

static struct bpf_insn * find_call(struct bpf_insn *ni, int count) {
  for (int i = 0; i < count; ++i) {
    // match the exact call opcode; masking with 0xf0 would also match e.g. BPF_NEG
    if (ni[i].code == (BPF_JMP | BPF_CALL)) {
      return &ni[i];
    }
  }

  return nullptr;
}

Very simply, a call is a special eBPF jump encoding with a code for the function in the imm field. We simply loop through the remaining instructions to see if we can find one. And if not we'll crash :-)

    
      struct bpf_insn *call_ni = find_call(&ni[i], insns_cnt - i);
      int fd = 0;

      if (call_ni->imm == BPF_FUNC_map_lookup_elem) {
        fd = fd_maps["ip_map"];
      } else if (call_ni->imm == BPF_FUNC_sk_redirect_map) {
        fd = fd_maps["sock_map"];
      } else {
        continue;
      }

Here we look for the two types of function calls we have maps for, either look something up in the hash map (BPF_FUNC_map_lookup_elem) or redirect via the sockmap (BPF_FUNC_sk_redirect_map). The function call codes themselves are built using a macro in the linux/bpf.h header file.

For everything else, we skip it.

      ni[i].src_reg = BPF_PSEUDO_MAP_FD;
      ni[i].imm = fd;
    }
  }
  

Then we do the actual patching, that is setting the src_reg to be BPF_PSEUDO_MAP_FD and the imm value to the correct map file descriptor, just as described in the comment.


  res->new_insn_ptr = ni;
  res->new_insn_cnt = insns_cnt;
  res->pfd = nullptr;
  return 0;
}

And finally we return the patched eBPF program to libbpf to be loaded by the kernel.
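If you'd like to poke at the patching logic without libbpf or a kernel in the loop, the sketch below runs the same idea over a plain std::vector of instructions. It is not the code from the repo: find_call_index and patch_map_fds are made-up names, the map file descriptors are passed in as arguments instead of coming from fd_maps, and a missing call stops the loop instead of crashing.

```cpp
#include <linux/bpf.h>
#include <cstddef>
#include <vector>

// Hypothetical helper: index of the first helper call at or after `from`,
// or -1 if there is none.
inline int find_call_index(const std::vector<bpf_insn> &insns, size_t from) {
  for (size_t i = from; i < insns.size(); ++i) {
    if (insns[i].code == (BPF_JMP | BPF_CALL)) {
      return static_cast<int>(i);
    }
  }
  return -1;
}

// Hypothetical helper: returns a patched copy in which every ldimm64 whose
// next helper call is map_lookup_elem or sk_redirect_map carries the
// matching map fd. Like the original, it assumes each such load feeds the
// very next call.
inline std::vector<bpf_insn> patch_map_fds(std::vector<bpf_insn> insns,
                                           int ip_map_fd, int sock_map_fd) {
  for (size_t i = 0; i < insns.size(); ++i) {
    if (insns[i].code != (BPF_LD | BPF_IMM | BPF_DW)) continue;

    int call = find_call_index(insns, i);
    if (call < 0) break;  // no call follows, nothing left to patch

    if (insns[call].imm == BPF_FUNC_map_lookup_elem) {
      insns[i].imm = ip_map_fd;
    } else if (insns[call].imm == BPF_FUNC_sk_redirect_map) {
      insns[i].imm = sock_map_fd;
    } else {
      continue;  // a load for some other helper, leave it alone
    }
    insns[i].src_reg = BPF_PSEUDO_MAP_FD;
    ++i;  // skip the second slot of the ldimm64
  }
  return insns;
}
```

Feeding it a hand-built array with two ldimm64 instructions, one followed by a map_lookup_elem call and one by an sk_redirect_map call, is enough to check that each load picks up the right file descriptor.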

Running it

First clone the repo with git clone https://github.com/dbolcsfoldi/ebpf-spice-cookie.git and build with ./build.sh.

Running it is pretty straightforward: as root (or with CAP_NET_ADMIN set) do ./ebpf-user ebpf-kern.o and then connect to the server using everyone's favourite cat, the net cat.

nc -4 127.0.0.1 8080 and again nc -4 127.0.0.1 8080 and type away! Data traffic forwarded in the kernel using the magic of eBPF.

Done!

Copyright © 2019 - David Bolcsfoldi
