wingolog

notes from the fosdem 2018 networking devroom

2018-02-05T17:22:41Z

Greetings, internet!

I am on my way back from FOSDEM and thought I would share with yall some impressions from talks in the Networking devroom. I didn't get to go to all that many talks -- FOSDEM's hallway track is the hottest of them all -- but I did hit a select few. Thanks to Dave Neary at Red Hat for organizing the room.

Ray Kinsella -- Intel -- The path to data-plane micro-services

The day started with a drum-beating talk that was very light on technical information.

Essentially Ray was arguing for an evolution of network function virtualization -- that instead of running VNFs on bare metal as was done in the days of yore, that people started to run them in virtual machines, and now they run them in containers -- what's next? Ray is saying that "cloud-native VNFs" are the next step.

Cloud-native VNFs to move from "greedy" VNFs that take charge of the cores that are available to them, to some kind of resource sharing. "Maybe users value flexibility over performance", says Ray. It's the Care Bears approach to networking: (resource) sharing is caring.

In practice he proposed two ways that VNFs can map to cores and cards.

One was in-process sharing, which if I understood him properly was actually as nodes running within a VPP process. Basically in this case VPP or DPDK is the scheduler and multiplexes two or more network functions in one process.

The other was letting Linux schedule separate processes. In networking, we don't usually do it this way: we run network functions on dedicated cores on which nothing else runs. Ray was suggesting that perhaps network functions could be more like "normal" Linux services. Ray doesn't know if Linux scheduling will work in practice. Also it might mean allowing DPDK to work with 4K pages instead of the 2M hugepages it currently requires. This obviously has the potential for more latency hazards and would need some tighter engineering, and ultimately would have fewer guarantees than the "greedy" approach.

Interesting side things I noticed:

All the diagrams show Kubernetes managing CPU node allocation and interface assignment. I guess in marketing diagrams, Kubernetes has completely replaced OpenStack.
One slide showed guest VNFs differentiated between "virtual network functions" and "socket-based applications", the latter ones being the legacy services that use kernel APIs. It's a useful terminology difference.
The talk identifies user-space networking with DPDK (only!).

Finally, I note that Conway's law is obviously reflected in the performance overheads: because there are organizational isolations between dev teams, vendors, and users, there are big technical barriers between them too. The least-overhead forms of resource sharing are also those with the highest technical consistency and integration (nodes in a single VPP instance).

Magnus Karlsson -- Intel -- AF_XDP

This was a talk about getting good throughput from the NIC to userspace, but by using some kernel facilities. The idea is to get the kernel to set up the NIC and virtualize the transmit and receive ring buffers, but to let the NIC's DMA'd packets go directly to userspace.

The performance goal is 40Gbps for thousand-byte packets, or 25 Gbps for traffic with only the smallest packets (64 bytes). The fast path does "zero copy" on the packets if the hardware has the capability to steer the subset of traffic associated with the AF_XDP socket to that particular process.

The AF_XDP project builds on XDP, a newish thing where a little kind of bytecode can run on the kernel or possibly on the NIC. One of the bytecode commands (REDIRECT) causes packets to be forwarded to user-space instead of handled by the kernel's otherwise heavyweight networking stack. AF_XDP is the bridge between XDP on the kernel side and an interface to user-space using sockets (as opposed to e.g. AF_INET). The performance goal was to be within 10% or so of DPDK's raw user-space-only performance.

The benefits of AF_XDP over the current situation would be that you have just one device driver, in the kernel, rather than having to have one driver in the kernel (which you have to have anyway) and one in user-space (for speed). Also, with the kernel involved, there is a possibility for better isolation between different processes or containers, when compared with raw PCI access from user-space..

AF_XDP is what was previously known as AF_PACKET v4, and its numbers are looking somewhat OK. Though it's not upstream yet, it might be interesting to get a Snabb driver here.

I would note that kernel-userspace cooperation is a bit of a theme these days. There are other points of potential cooperation or common domain sharing, storage being an obvious one. However I heard more than once this weekend the kind of "I don't know, that area of the kernel has a different culture" sort of concern as that highlighted by Daniel Vetter in his recent LCA talk.

François-Frédéric Ozog -- Linaro -- Userland Network I/O

This talk is hard to summarize. Like the previous one, it's again about getting packets to userspace with some support from the kernel, but the speaker went really deep and I'm not quite sure what in the talk is new and what is known.

François-Frédéric is working on a new set of abstractions for relating the kernel and user-space. He works on OpenDataPlane (ODP), which is kinda like DPDK in some ways. ARM seems to be a big target for his work; that x86-64 is also a target goes without saying.

His problem statement was, how should we enable fast userland network I/O, without duplicating drivers?

François-Frédéric was a bit negative on AF_XDP because (he says) it is so focused on packets that it neglects other kinds of devices with similar needs, such as crypto accelerators. Apparently the challenge here is accelerating a single large IPsec tunnel -- because the cryptographic operations are serialized, you need good single-core performance, and making use of hardware accelerators seems necessary right now for even a single 10Gbps stream. (If you had many tunnels, you could parallelize, but that's not the case here.)

He was also a bit skeptical about standardizing on the "packet array I/O model" which AF_XDP and most NICS use. What he means here is that most current NICs move packets to and from main memory with the help of a "descriptor array" ring buffer that holds pointers to packets. A transmit array stores packets ready to transmit; a receive array stores maximum-sized packet buffers ready to be filled by the NIC. The packet data itself is somewhere else in memory; the descriptor only points to it. When a new packet is received, the NIC fills the corresponding packet buffer and then updates the "descriptor array" to point to the newly available packet. This requires at least two memory writes from the NIC to memory: at least one to write the packet data (one per 64 bytes of packet data), and one to update the DMA descriptor with the packet length and possible other metadata.

Although these writes go directly to cache, there's a limit to the number of DMA operations that can happen per second, and with 100Gbps cards, we can't afford to make one such transaction per packet.

François-Frédéric promoted an alternative I/O model for high-throughput use cases: the "tape I/O model", where packets are just written back-to-back in a uniform array of memory. Every so often a block of memory containing some number of packets is made available to user-space. This has the advantage of packing in more packets per memory block, as there's no wasted space between packets. This increases cache density and decreases DMA transaction count for transferring packet data, as we can use each 64-byte DMA write to its fullest. Additionally there's no side table of descriptors to update, saving a DMA write there.

Apparently the only cards currently capable of 100 Gbps traffic, the Chelsio and Netcope cards, use the "tape I/O model".

Incidentally, the DMA transfer limit isn't the only constraint. Something I hadn't fully appreciated before was memory write bandwidth. Before, I had thought that because the NIC would transfer in packet data directly to cache, that this wouldn't necessarily cause any write traffic to RAM. Apparently that's not the case. Later over drinks (thanks to Red Hat's networking group for organizing), François-Frédéric asserted that the DMA transfers would eventually use up DDR4 bandwidth as well.

A NIC-to-RAM DMA transaction will write one cache line (usually 64 bytes) to the socket's last-level cache. This write will evict whatever was there before. As far as I can tell, there are three cases of interest here. The best case is where the evicted cache line is from a previous DMA transfer to the same address. In that case it's modified in the cache and not yet flushed to main memory, and we can just update the cache instead of flushing to RAM. (Do I misunderstand the way caches work here? Do let me know.)

However if the evicted cache line is from some other address, we might have to flush to RAM if the cache line is dirty. That causes a memory write traffic. But if the cache line is clean, that means it was probably loaded as part of a memory read operation, and then that means we're evicting part of the network function's working set, which will later cause memory read traffic as the data gets loaded in again, and write traffic to flush out the DMA'd packet data cache line.

François-Frédéric simplified the whole thing to equate packet bandwidth with memory write bandwidth, that yes, the packet goes directly to cache but it is also written to RAM. I can't convince myself that that's the case for all packets, but I need to look more into this.

Of course the cache pressure and the memory traffic is worse if the packet data is less compact in memory; and worse still if there is any need to copy data. Ultimately, processing small packets at 100Gbps is still a huge challenge for user-space networking, and it's no wonder that there are only a couple devices on the market that can do it reliably, not that I've seen either of them operate first-hand :)

Talking with Snabb's Luke Gorrie later on, he thought that it could be that we can still stretch the packet array I/O model for a while, given that PCIe gen4 is coming soon, which will increase the DMA transaction rate. So that's a possibility to keep in mind.

At the same time, apparently there are some "coherent interconnects" coming too which will allow the NIC's memory to be mapped into the "normal" address space available to the CPU. In this model, instead of having the NIC transfer packets to the CPU, the NIC's memory will be directly addressable from the CPU, as if it were part of RAM. The latency to pull data in from the NIC to cache is expected to be slightly longer than a RAM access; for comparison, RAM access takes about 70 nanoseconds.

For a user-space networking workload, coherent interconnects don't change much. You still need to get the packet data into cache. True, you do avoid the writeback to main memory, as the packet is already in addressable memory before it's in cache. But, if it's possible to keep the packet on the NIC -- like maybe you are able to add some kind of inline classifier on the NIC that could directly shunt a packet towards an on-board IPSec accelerator -- in that case you could avoid a lot of memory transfer. That appears to be the driving factor for coherent interconnects.

At some point in François-Frédéric's talk, my brain just died. I didn't quite understand all the complexities that he was taking into account. Later, after he kindly took the time to dispell some more of my ignorance, I understand more of it, though not yet all :) The concrete "deliverable" of the talk was a model for kernel modules and user-space drivers that uses the paradigms he was promoting. It's a work in progress from Linaro's networking group, with some support from NIC vendors and CPU manufacturers.

Luke Gorrie and Asumu Takikawa -- SnabbCo and Igalia -- How to write your own NIC driver, and why

This talk had the most magnificent beginning: a sort of "repent now ye sinners" sermon from Luke Gorrie, a seasoned veteran of software networking. Luke started by describing the path of righteousness leading to "driver heaven", a world in which all vendors have publically accessible datasheets which parsimoniously describe what you need to get packets flowing. In this blessed land it's easy to write drivers, and for that reason there are many of them. Developers choose a driver based on their needs, or they write one themselves if their needs are quite specific.

But there is another path, says Luke, that of "driver hell": a world of wickedness and proprietary datasheets, where even when you buy the hardware, you can't program it unless you're buying a hundred thousand units, and even then you are smitten with the cursed non-disclosure agreements. In this inferno, only a vendor is practically empowered to write drivers, but their poor driver developers are only incentivized to get the driver out the door deployed on all nine architectural circles of driver hell. So they include some kind of circle-of-hell abstraction layer, resulting in a hundred thousand lines of code like a tangled frozen beard. We all saw the abyss and repented.

Luke described the process that led to Mellanox releasing the specification for its ConnectX line of cards, something that was warmly appreciated by the entire audience, users and driver developers included. Wonderful stuff.

My Igalia colleague Asumu Takikawa took the last half of the presentation, showing some code for the driver for the Intel i210, i350, and 82599 cards. For more on that, I recommend his recent blog post on user-space driver development. It was truly a ray of sunshine in dark, dark Brussels.

Ole Trøan -- Cisco -- Fast dataplanes with VPP

This talk was a delightful introduction to VPP, but without all of the marketing; the sort of talk that makes FOSDEM worthwhile. Usually at more commercial, vendory events, you can't really get close to the technical people unless you have a vendor relationship: they are surrounded by a phalanx of salesfolk. But in FOSDEM it is clear that we are all comrades out on the open source networking front.

The speaker expressed great personal pleasure on having being able to work on open source software; his relief was palpable. A nice moment.

He also had some kind words about Snabb, too, saying at one point that "of course you can do it on snabb as well -- Snabb and VPP are quite similar in their approach to life". He trolled the horrible complexity diagrams of many "NFV" stacks whose components reflect the org charts that produce them more than the needs of the network functions in question (service chaining anyone?).

He did get to drop some numbers as well, which I found interesting. One is that recently they have been working on carrier-grade NAT, aiming for 6 terabits per second. Those are pretty big boxes and I hope they are getting paid appropriately for that :) For context he said that for a 4-unit server, these days you can build one that does a little less than a terabit per second. I assume that's with ten dual-port 40Gbps cards, and I would guess to power that you'd need around 40 cores or so, split between two sockets.

Finally, he finished with a long example on lightweight 4-over-6. Incidentally this is the same network function my group at Igalia has been building in Snabb over the last couple years, so it was interesting to see the comparison. I enjoyed his commentary that although all of these technologies (carrier-grade NAT, MAP, lightweight 4-over-6) have the ostensible goal of keeping IPv4 running, in reality "we're day by day making IPv4 work worse", mainly by breaking the assumption that just because you get traffic from port P on IP M, doesn't mean you can send traffic to M from another port or another protocol and have it reach the target.

All of these technologies also have problems with IPv4 fragmentation. Getting it right is possible but expensive. Instead, Ole mentions that he and a cross-vendor cabal of dataplane people have a "dark RFC" in the works to deprecate IPv4 fragmentation entirely :)

OK that's it. If I get around to writing up the couple of interesting Java talks I went to (I know right?) I'll let yall know. Happy hacking!

an early look at p4 for software networking

2017-06-26T14:00:59Z

Happy midsummer, hackfriends!

As you know at work we have been trying to find ways to apply compilers technology to the networking space. We will compile high-level configurations into low-level network processing graphs, search algorithms into lookup routines optimized for the target data structures, packet filters into code ready to be further trace-compiled, or hash functions into parallel AVX2 code.

On one side, we try to provide fast implementations of existing "languages"; on the other side we can't help but try out new co-designed domain-specific languages that can be expressive and run fast. As an example, with pfmatch we extended pflang, the tcpdump language, with a more match-action kind of semantics. It worked fine but the embedding between pfmatch and the host language could have been more smooth; in the end the abstractions it offers don't really apply to what we have needed to build. For a long time we have been wondering if indeed there is a better domain-specific programming language to apply to the networking domain.

P4 claims to be this language, and I think it's worth a look. P4's goal is be able to define switches and other networking equipment in software, with the specific goal that they would like to be able for P4 programs to be synthesized to ASICs, or installed in the FPGA of a "Smart NIC", or compiled to CPUs. It's a wide target domain and the silicon-bakery side of things definitely constrains what is possible. Indeed P4 explicitly disclaims any ambition to be a general-purpose programming language. Still, I think they manage to achieve an admirable balance between declarative programming and transparent low-level compilability.

The best, most current intro to P4 out there is probably Vladimir Gurevich's slides from last month's P4 "developer day" in California. I think it does a good job linking the language's syntax and semantics with how they are intended to be applied to the target domain. For a more PL-friendly and abstract introduction, the P4₁₆ specification is a true delight.

Like I said, at work we build software switches and other network functions, and our target is commodity hardware. We write most of our work in Snabb, a powerful network toolkit built on LuaJIT, though we are branching out now to VPP/fd.io as well, just to broaden the offering a bit. Generally we try to build solutions that don't have any dependencies other than a commodity Xeon server and a commodity NIC like Intel's 82599. So how could P4 help us in what we're doing?

My first thought in this regard was that if there is a library of P4 building blocks out there, that it would be really convenient to be able to incorporate a functional block written in P4 within the graph of a Snabb program. For example, if we have an IPFIX collector written in Snabb (and we do!), it would be cool to stick that in the middle of a P4 traffic conditioner.

(Immediately I run into the problem that I am straining my mind to think of a network function that we wouldn't rather just write in Snabb -- something valuable enough that we wouldn't want to "own" it and instead we would import someone else's black box into our data-plane. Maybe this interesting in-network key-value cache counts? But I digress, let's assume that something exists here.)

One question is, why bother doing P4 in software? I can understand that if you have 1Tbps ports that you definitely need custom silicon pushing your packets around. You would like to be able to program that silicon, so P4 looks to be a compelling step forward. But if your needs are satisfied with 40Gbps ports and you have chosen a software networking solution for its low cost, low lock-in, high flexibility, and sufficient performance -- well does P4 buy you something?

Right now it would seem that the answer is "no". A Cisco group wrote a custom P4 compiler to VPP, which is architecturally pretty much the same as Snabb, and they had to do some work to get the performance within a couple percent of the hand-coded application. The only win I can see is if people start building up libraries of reusable P4 components that can be linked together -- but the language itself currently doesn't support any more global composition primitive than #include (yes, it uses CPP :).

Additionally, at least as things are now, it doesn't seem that there's a library of reusable, open source P4 components out there to take advantage of. If this changes, I'll have to have another look. And of course it's worth keeping an eye on what kinds of cool things people are building :)

Thanks to Luke Gorrie for conversations leading to this blog post. All opinions and errors mine, of course!

encyclopedia snabb and the case of the foreign drivers

2017-02-24T17:37:00Z

Peoples of the blogosphere, welcome back to the solipsism! Happy 2017 and all that. Today's missive is about Snabb (formerly Snabb Switch), a high-speed networking project we've been working on at work for some years now.

What's Snabb all about you say? Good question and I have a nice answer for you in video and third-party textual form! This year I managed to make it to linux.conf.au in lovely Tasmania. Tasmania is amazing, with wild wombats and pademelons and devils and wallabies and all kinds of things, and they let me talk about Snabb.

Click to download video

You can check that video on the youtube if the link above doesn't work; slides here.

Jonathan Corbet from LWN wrote up the talk in an article here, which besides being flattering is a real windfall as I don't have to write it up myself :)

In that talk I mentioned that Snabb uses its own drivers. We were recently approached by a customer with a simple and honest question: does this really make sense? Is it really a win? Why wouldn't we just use the work that the NIC vendors have already put into their drivers for the Data Plane Development Kit (DPDK)? After all, part of the attraction of a switch to open source is that you will be able to take advantage of the work that others have produced.

Our answer is that while it is indeed possible to use drivers from DPDK, there are costs and benefits on both sides and we think that when we weigh it all up, it makes both technical and economic sense for Snabb to have its own driver implementations. It might sound counterintuitive on the face of things, so I wrote this long article to discuss some perhaps under-appreciated points about the tradeoff.

Technically speaking there are generally two ways you can imagine incorporating DPDK drivers into Snabb:

Bundle a snapshot of the DPDK into Snabb itself.
Somehow make it so that Snabb could (perhaps optionally) compile against a built DPDK SDK.

As part of a software-producing organization that ships solutions based on Snabb, I need to be able to ship a "known thing" to customers. When we ship the lwAFTR, we ship it in source and in binary form. For both of those deliverables, we need to know exactly what code we are shipping. We achieve that by having a minimal set of dependencies in Snabb -- only LuaJIT and three Lua libraries (DynASM, ljsyscall, and pflua) -- and we include those dependencies directly in the source tree. This requirement of ours rules out (2), so the option under consideration is only (1): importing the DPDK (or some part of it) directly into Snabb.

So let's start by looking at Snabb and the DPDK from the top down, comparing some metrics, seeing how we could make this combination.

	snabb	dpdk
Code lines	61K	583K
Contributors (all-time)	60	370
Contributors (since Jan 2016)	32	240
Non-merge commits (since Jan 2016)	1.4K	3.2K

These numbers aren't directly comparable, of course; in Snabb our unit of code change is the merge rather than the commit, and in Snabb we include a number of production-ready applications like the lwAFTR and the NFV, but they are fine enough numbers to start with. What seems clear is that the DPDK project is significantly larger than Snabb, so adding it to Snabb would fundamentally change the nature of the Snabb project.

So depending on the DPDK makes it so that suddenly Snabb jumps from being a project that compiles in a minute to being a much more heavy-weight thing. That could be OK if the benefits were high enough and if there weren't other costs, but there are indeed other costs to including the DPDK:

Data-plane control. Right now when I ship a product, I can be responsible for the whole data plane: everything that happens on the CPU when packets are being processed. This includes the driver, naturally; it's part of Snabb and if I need to change it or if I need to understand it in some deep way, I can do that. But if I switch to third-party drivers, this is now out of my domain; there's a wall between me and something that running on my CPU. And if there is a performance problem, I now have someone to blame that's not myself! From the customer perspective this is terrible, as you want the responsibility for software to rest in one entity.
Impedance-matching development costs. Snabb is written in Lua; the DPDK is written in C. I will have to build a bridge, and keep it up to date as both Snabb and the DPDK evolve. This impedance-matching layer is also another source of bugs; either we make a local impedance matcher in C or we bind everything using LuaJIT's FFI. In the former case, it's a lot of duplicate code, and in the latter we lose compile-time type checking, which is a no-go given that the DPDK can and does change API and ABI.
Communication costs. The DPDK development list had 3K messages in January. Keeping up with DPDK development would become necessary, as the DPDK is now in your dataplane, but it costs significant amounts of time.
Costs relating to mismatched goals. Snabb tries to win development and run-time speed by searching for simple solutions. The DPDK tries to be a showcase for NIC features from vendors, placing less of a priority on simplicity. This is a very real cost in the form of the way network packets are represented in the DPDK, with support for such features as scatter/gather and indirect buffers. In Snabb we were able to do away with this complexity by having simple linear buffers, and our speed did not suffer; adding the DPDK again would either force us to marshal and unmarshal these buffers into and out of the DPDK's format, or otherwise to reintroduce this particular complexity into Snabb.
Abstraction costs. A network function written against the DPDK typically uses at least three abstraction layers: the "EAL" environment abstraction layer, the "PMD" poll-mode driver layer, and often an internal hardware abstraction layer from the network card vendor. (And some of those abstraction layers are actually external dependencies of the DPDK, as with Mellanox's ConnectX-4 drivers!) Any discrepancy between the goals and/or implementation of these layers and the goals of a Snabb network function is a cost in developer time and in run-time. Note that those low-level HAL facilities aren't considered acceptable in upstream Linux kernels, for all of these reasons!
Stay-on-the-train costs. The DPDK is big and sometimes its abstractions change. As a minor player just riding the DPDK train, we would have to invest a continuous amount of effort into just staying aboard.
Fork costs. The Snabb project has a number of contributors but is really run by Luke Gorrie. Because Snabb is so small and understandable, if Luke decided to stop working on Snabb or take it in a radically different direction, I would feel comfortable continuing to maintain (a fork of) Snabb for as long as is necessary. If the DPDK changed goals for whatever reason, I don't think I would want to continue to maintain a stale fork.
Overkill costs. Drivers written against the DPDK have many considerations that simply aren't relevant in a Snabb world: kernel drivers (KNI), special NIC features that we don't use in Snabb (RDMA, offload), non-x86 architectures with different barrier semantics, threads, complicated buffer layouts (chained and indirect), interaction with specific kernel modules (uio-pci-generic / igb-uio / ...), and so on. We don't need all of that, but we would have to bring it along for the ride, and any changes we might want to make would have to take these use cases into account so that other users won't get mad.

So there are lots of costs if we were to try to hop on the DPDK train. But what about the benefits? The goal of relying on the DPDK would be that we "automatically" get drivers, and ultimately that a network function would be driver-agnostic. But this is not necessarily the case. Each driver has its own set of quirks and tuning parameters; in order for a software development team to be able to support a new platform, the team would need to validate the platform, discover the right tuning parameters, and modify the software to configure the platform for good performance. Sadly this is not a trivial amount of work.

Furthermore, using a different vendor's driver isn't always easy. Consider Mellanox's DPDK ConnectX-4 / ConnectX-5 support: the "Quick Start" guide has you first install MLNX_OFED in order to build the DPDK drivers. What is this thing exactly? You go to download the tarball and it's 55 megabytes. What's in it? 30 other tarballs! If you build it somehow from source instead of using the vendor binaries, then what do you get? All that code, running as root, with kernel modules, and implementing systemd/sysvinit services!!! And this is just step one!!!! Worse yet, this enormous amount of code powering a DPDK driver is mostly driver-specific; what we hear from colleagues whose organizations decided to bet on the DPDK is that you don't get to amortize much knowledge or validation when you switch between an Intel and a Mellanox card.

In the end when we ship a solution, it's going to be tested against a specific NIC or set of NICs. Each NIC will add to the validation effort. So if we were to rely on the DPDK's drivers, we would have payed all the costs but we wouldn't save very much in the end.

There is another way. Instead of relying on so much third-party code that it is impossible for any one person to grasp the entirety of a network function, much less be responsible for it, we can build systems small enough to understand. In Snabb we just read the data sheet and write a driver. (Of course we also benefit by looking at DPDK and other open source drivers as well to see how they structure things.) By only including what is needed, Snabb drivers are typically only a thousand or two thousand lines of Lua. With a driver of that size, it's possible for even a small ISV or in-house developer to "own" the entire data plane of whatever network function you need.

Of course Snabb drivers have costs too. What are they? Are customers going to be stuck forever paying for drivers for every new card that comes out? It's a very good question and one that I know is in the minds of many.

Obviously I don't have the whole answer, as my role in this market is a software developer, not an end user. But having talked with other people in the Snabb community, I see it like this: Snabb is still in relatively early days. What we need are about three good drivers. One of them should be for a standard workhorse commodity 10Gbps NIC, which we have in the Intel 82599 driver. That chipset has been out for a while so we probably need to update it to the current commodities being sold. Additionally we need a couple cards that are going to compete in the 100Gbps space. We have the Mellanox ConnectX-4 and presumably ConnectX-5 drivers on the way, but there's room for another one. We've found that it's hard to actually get good performance out of 100Gbps cards, so this is a space in which NIC vendors can differentiate their offerings.

We budget somewhere between 3 and 9 months of developer time to create a completely new Snabb driver. Of course it usually takes less time to develop Snabb support for a NIC that is only incrementally different from others in the same family that already have drivers.

We see this driver development work to be similar to the work needed to validate a new NIC for a network function, with the additional advantage that it gives us up-front knowledge instead of the best-effort testing later in the game that we would get with the DPDK. When you add all the additional costs of riding the DPDK train, we expect that the cost of Snabb-native drivers competes favorably against the cost of relying on third-party DPDK drivers.

In the beginning it's natural that early adopters of Snabb make investments in this base set of Snabb network drivers, as they would to validate a network function on a new platform. However over time as Snabb applications start to be deployed over more ports in the field, network vendors will also see that it's in their interests to have solid Snabb drivers, just as they now see with the Linux kernel and with the DPDK, and given that the investment is relatively low compared to their already existing efforts in Linux and the DPDK, it is quite feasible that we will see the NIC vendors of the world start to value Snabb for the performance that it can squeeze out of their cards.

So in summary, in Snabb we are convinced that writing minimal drivers that are adapted to our needs is an overall win compared to relying on third-party code. It lets us ship solutions that we can feel responsible for: both for their operational characteristics as well as their maintainability over time. Still, we are happy to learn and share with our colleagues all across the open source high-performance networking space, from the DPDK to VPP and beyond.

talks i would like to give in 2016

2016-01-21T11:59:18Z

Every year I feel like I'm trailing things in a way: I hear of an amazing conference with fab speakers, but only after the call for submissions had closed. Or I see an event with exactly the attendees I'd like to schmooze with, but I hadn't planned for it, and hey, maybe I could have even spoke there.

But it's a new year, so let's try some new things. Here's a few talks I would love to give this year.

building languages on luajit

Over the last year or two my colleagues and I have had good experiences compiling in, on, and under LuaJIT, and putting those results into production in high-speed routers. LuaJIT has some really interesting properties as a language substrate: it has a tracing JIT that can punch through abstractions, it has pretty great performance, and it has a couple of amazing escape hatches that let you reach down to the hardware in the form of the FFI and the DynASM assembly generator. There are some challenges too. I can tell you about them :)

try guile for your next project!

This would be a talk describing Guile, what it's like making programs with it, and the kind of performance you can expect out of it. If you're a practicing programmer who likes shipping small programs that work well, are fun to write, and run with pretty good performance, I think Guile can be a great option.

I don't get to do many Guile talks because hey, it's 20 years old, so we don't get the novelty effect. Still, I judge a programming language based on what you can do with it, and recent advances in the Guile implementation have expanded its scope significantly, allowing it to handle many problem sizes that it couldn't before. This talk will be a bit about the language, a bit about the implementation, and a bit about applications or problem domains.

compiling with persistent data structures

As part of Guile's recent compiler improvements, we switched to a somewhat novel intermediate language. It's continuation-passing-style, but based on persistent data structures. Programming with it is interesting and somewhat different than other intermediate languages, and so this would be a talk describing the language and what it's like to work in it. Definitely a talk for compiler people, by a compiler person :)

a high-performance networking with luajit talk

As I mentioned above, my colleagues and I at work have been building really interesting things based on LuaJIT. In particular, using the Snabb Switch networking toolkit has let us build an implementation of a "lightweight address family translation router" -- the internet-facing component of an IPv4-as-a-service architecture, built on an IPv6-only network. Our implementation flies.

It sounds a bit specialized, and it is, but this talk could go two ways.

One version of this talk could be for software people that aren't necessarily networking specialists, describing the domain and how with Snabb Switch, LuaJIT, compilers, and commodity x86 components, we are able to get results that compete well with offerings from traditional networking vendors. Building specialized routers and other network functions in software is an incredible opportunity for compiler folks.

The other version would be more for networking people. We'd explain the domain less and focus more on architecture and results, and look more ahead to challenges of 100Gb/s ports.

let me know!

I'll probably submit some of these to a few conferences, but if you run an event and would like me to come over and give one of these talks, I would be flattered :) Maybe that set of people is empty, but hey, it's worth a shot. Probably contact via the twitters has the most likelihood of response.

There are some things you need to make sure are covered before reaching out, of course. It probably doesn't need repeating in 2016, but make sure that you have a proper code of conduct, and that that you'll be able to put in the time to train your event staff to create that safe space that your attendees need. Getting a diverse speaker line-up is important to me too; conferences full of white dudes like me are not only boring but also serve to perpetuate an industry full of white dudes. If you're reaching out, reach out to women and people of color too, and let me know that you're working on it. This old JSConf EU post has some ideas too. Godspeed, and happy planning!

ffs ssl

2014-10-17T14:33:30Z

I just set up ~~SSL~~TLS on my web site. Everything can be had via https://wingolog.org/, and things appear to work. However the process of transitioning even a simple web site to SSL is so clownshoes bad that it's amazing anyone ever does it. So here's an incomplete list of things that can go wrong when you set up TLS on a web site.

You search "how to set up https" on the Googs and click the first link. It takes you here which tells you how to use StartSSL, which generates the key in your browser. Whoops, your private key is now known to another server on this internet! Why do people even recommend this? It's the worst of the worst of Javascript crypto.

OK so you decide to pay for a certificate, assuming that will be better, and because who knows what's going on with StartSSL. You've heard of RapidSSL so you go to rapidssl.com. WTF their price is 49 dollars for a stupid certificate? Your domain name was only 10 dollars, and domain name resolution is an actual ongoing service, unlike certificate issuance that just happens one time. You can't believe it so you click through to the prices to see, and you get this:

Whatttttttttt

OK so I'm using Epiphany on Debian and I think that uses the system root CA list which is different from what Chrome or Firefox do but Jesus this is shaking my faith in the internet if I can't connect to an SSL certificate provider over SSL.

You remember hearing something on Twitter about cheaper certs, and oh ho ho, it's rapidsslonline.com, not just RapidSSL. WTF. OK. It turns out Geotrust and RapidSSL and Verisign are all owned by Symantec anyway. So you go and you pay. Paying is the first thing you have to do on rapidsslonline, before anything else happens. Welp, cross your fingers and take out your credit card, cause SSLanta Clause is coming to town.

Recall, distantly, that SSL has private keys and public keys. To create an SSL certificate you have to generate a key on your local machine, which is your private key. That key shouldn't leave your control -- that's why the DigitalOcean page is so bogus. The certification authority (CA) then needs to receive your public key and then return it signed. You don't know how to do this, because who does? So you Google and copy and paste command line snippets from a website. Whoops!

Hey neat it didn't delete your home directory, cool. Let's assume that your local machine isn't rooted and that your server isn't rooted and that your hosting provider isn't rooted, because that would invalidate everything. Oh what so the NSA and the five eyes have an ongoing program to root servers? Um, well, water under the bridge I guess. Let's make a key. You google "generate ssl key" and this is the first result.

# openssl genrsa -des3 -out foo.key 1024

Whoops, you just made a 1024-bit key! I don't know if those are even accepted by CAs any more. Happily if you leave off the 1024, it defaults to 2048 bits, which I guess is good.

Also you just made a key with a password on it (that's the -des3 part). This is eminently pointless. In order to use your key, your web server will need the decrypted key, which means it will need the password to the key. Adding a password does nothing for you. If you lost your private key but you did have it password-protected, you're still toast: the available encryption cyphers are meant to be fast, not hard to break. Any serious attacker will crack it directly. And if they have access to your private key in the first place, encrypted or not, you're probably toast already.

OK. So let's say you make your key, and make what's called the "~~CRT~~CSR", to ask for the cert.

# openssl req -new -key foo.key -out foo.csr

Now you're presented with a bunch of pointless-looking questions like your country code and your "organization". Seems pointless, right? Well now I have to live with this confidence-inspiring dialog, because I left off the organization:

Don't mess up, kids! But wait there's more. You send in your CSR, finally figure out how to receive mail for hostmaster@yourdomain.org because that's what "verification" means (not, god forbid, control of the actual web site), and you get back a certificate. Now the fun starts!

How are you actually going to serve SSL? The truly paranoid use an out-of-process SSL terminator. Seems legit except if you do that you lose any kind of indication about what IP is connecting to your HTTP server. You can use a more HTTP-oriented terminator like bud but then you have to mess with X-Forwarded-For headers and you only get them on the first request of a connection. You could just enable mod_ssl on your Apache, but that code is terrifying, and do you really want to be running Apache anyway?

In my case I ended up switching over to nginx, which has a startlingly underspecified configuration language, but for which the Debian defaults are actually not bad. So you uncomment that part of the configuration, cross your fingers, Google a bit to remind yourself how systemd works, and restart the web server. Haich Tee Tee Pee Ess ahoy! But did you remember to disable the NULL authentication method? How can you test it? What about the NULL encryption method? These are actual things that are configured into OpenSSL, and specified by standards. (What is the use of a secure communications standard that does not provide any guarantee worth speaking of?) So you google, copy and paste some inscrutable incantation into your config, turn them off. Great, now you are a dilettante tweaking your encryption parameters, I hope you feel like a fool because I sure do.

Except things are still broken if you allow RC4! So you better make sure you disable RC4, which incidentally is exactly the opposite of the advice that people were giving out three years ago.

OK, so you took your certificate that you got from the CA and your private key and mashed them into place and it seems the web browser works. Thing is though, the key that signs your certificate is possibly not in the actual root set of signing keys that browsers use to verify the key validity. If you put just your key on the web site without the "intermediate CA", then things probably work but browsers will make an additional request to get the intermediate CA's key, slowing down everything. So you have to concatenate the text files with your key and the one with the intermediate CA's key. They look the same, just a bunch of numbers, but don't get them in the wrong order because apparently the internet says that won't work!

But don't put in too many keys either! In this image we have a cert for jsbin.com with one intermediate CA:

And here is the same but with an a different root that signed the GeoTrust Global CA certificate. Apparently there was a time in which the GeoTrust cert hadn't been added to all of the root sets yet, and it might not hurt to include them all:

Thing is, the first one shows up "green" in Chrome (yay), but the second one shows problems ("outdated security settings" etc etc etc). Why? Because the link from Equifax to Geotrust uses a SHA-1 signature, and apparently that's not a good idea any more. Good times? (Poor Remy last night was doing some basic science on the internet to bring you these results.)

Or is Chrome denying you the green because it was RapidSSL that signed your certificate with SHA-1 and not SHA-256? It won't tell you! So you Google and apply snakeoil and beg your CA to reissue your cert, hopefully they don't charge for that, and eventually all is well. Chrome gives you the green.

Or does it? Probably not, if you're switching from a web site that is also available over HTTP. Probably you have some images or CSS or Javascript that's being loaded over HTTP. You fix your web site to have scheme-relative URLs (like //wingolog.org/ instead of http://wingolog.org/), and make sure that your software can deal with it all (I had to patch Guile :P). Update all the old blog posts! Edit all the HTMLs! And finally, green! You're golden!

Or not! Because if you left on SSLv3 support you're still broken! Also, TLSv1.0, which is actually greater than SSLv3 for no good reason, also has problems; and then TLS1.1 also has problems, so you better stick with just TLSv1.2. Except, except, older Android phones don't support TLSv1.2, and neither does the Googlebot, so you don't get the rankings boost you were going for in the first place. So you upgrade your phone because that's a thing you want to do with your evenings, and send snarky tweets into the ether about scumbag google wanting to promote HTTPS but not supporting the latest TLS version.

So finally, finally, you have a web site that offers HTTPS and HTTP access. You're good right? Except no! (Catching on to the pattern?) Because what happens is that people just type in web addresses to their URL bars like "google.com" and leave off the HTTP, because why type those stupid things. So you arrange for http://www.wobsite.com to redirect https://www.wobsite.com for users that have visited the HTTPS site. Except no! Because any network attacker can simply strip the redirection from the HTTP site.

The "solution" for this is called HTTP Strict Transport Security, or HSTS. Once a visitor visits your HTTPS site, the server sends a response that tells the browser never to fetch HTTP from this site. Except that doesn't work the first time you go to a web site! So if you're Google, you friggin add your name to a static list in the browser. EXCEPT EVEN THEN watch out for the Delorean.

And what if instead they go to wobsite.com instead of the www.wobsite.com that you configured? Well, better enable HSTS for the whole site, but to do anything useful with such a web request you'll need a wildcard certificate to handle the multiple URLs, and those run like 150 bucks a year, for a one-bit change. Or, just get more single-domain certs and tack them onto your cert, using the precision tool cat, but don't do too many, because if you do you will overflow the initial congestion window of the TCP connection and you'll have to wait for an ACK on your certificate before you can actually exchange keys. Don't know what that means? Better look it up and be an expert, or your wobsite's going to be slow!

If your security goals are more modest, as they probably are, then you could get burned the other way: you could enable HSTS, something could go wrong with your site (an expired certificate perhaps), and then people couldn't access your site at all, even if they have no security needs, because HTTP is turned off.

Now you start to add secure features to your web app, safe with the idea you have SSL. But better not forget to mark your cookies as secure, otherwise they could be leaked in the clear, and better not forget that your website might also be served over HTTP. And better check up on when your cert expires, and better have a plan for embedded browsers that don't have useful feedback to the user about certificate status, and what about your CA's audit trail, and better stay on top of the new developments in security! Did you read it? Did you read it? Did you read it?

It's a wonder anything works. Indeed I wonder if anything does.