Efficient P4+FPGA-based Forwarding for SCION, a Path-Aware Internet Architecture

# Efficient P4+FPGA-based Forwarding for SCION, a Path-Aware Internet Architecture

Kamila Součková  
2020/04/22

---
excluded: true
class: contrast, expanded, middle, narrow

1. **Intro**  
2. **Challenges**  
3. **Lessons learned**

---
excluded: true
class: center, middle, contrast, no-slide-number
count: false
# Intro
Why? How?

---

# .white[Why?]

<div style="margin: -0.6em 0 8em 0;">
.centerblock.width60[![SCION](/media/badge-white.png)]
</div>

---

???

The SCION future Internet architecture is a completely new approach to solving some of thecurrent Internet’s problems. Its viability has already been demonstrated -- we have a global research network and multiple commercial deployments.

---

TODO explain path in header

---

To go beyond research, it must work in HW:  
throughput, cost, energy efficiency

⇒ make sure that it can be done; improve protocol design

???

But to work at Internet scale, it is necessary to make it perform at high bandwidth, reduce costs, andevaluate its suitability for eventual implementation in hardware. So that's what we did.

---

# How?

methods for high-speed packet forwarding:

.centerblock.width80[![Methods for high-speed packet forwarding](/media/high-speed-packet-forwarding.svg)]

???

We have a software implementation, but that one can do less than 1Gbps. It's in userspace, so there's lots of opportunities for improving while staying in software, but... In order to be taken seriously e.g. by ISPs, we need to go up at least 3 orders of magnitude.

Tricks like eBPF or DPDK can get us about one order of magnitude, possibly two, but beyond that a CPU won't really work.

---

# How?

methods for high-speed packet forwarding:

.centerblock.width80[![Methods for high-speed packet forwarding](/media/high-speed-packet-forwarding-SCION.svg)]

---

# How?

methods for high-speed packet forwarding:

.centerblock.width80[![Methods for high-speed packet forwarding](/media/high-speed-packet-forwarding-FPGA.svg)]

---

# P4 + NetFPGA

.floatleft.width50[![netfpga](/media/netfpga.png)]

.floatright.width40[![P4+NetFPGA structure](/media/p4-netfpga-structure.png)]

.floatleft.width40.margin0[
NetFPGA SUME:  
4x 10GbE  
Virtex 7 FPGA  
**P4** (PoC)
]

---
exclude: true

# P4 + NetFPGA

"normal" FPGA programming: _hardware description language (HDL)_: logic and connections

.floatright.width40[![P4+NetFPGA structure](/media/p4-netfpga-structure.png)]

P4:
* **high-level** and modular ⇒ less effort to write, maintain, port
* **looks like software** ⇒ less need for FPGA expertise
* **extensible**: supports `extern`s

but note: `p4c-sdnet` compiler is a PoC

???

P4 should be target-independent, but some degree of "porting" is still needed because in practice different targets have different limitations and may need different tradeoffs

---
layout: true

.floatright.width50[![FPGA diagram](/media/fpga-diagram-abc.png)]
# Hardware vs Software

- software: sequence of arbitrary instructions
- hardware: **fixed** circuit; data moves  
  ⇒ "think in space, not in time"

---

pipelining:

.centerblock.width60[![pipelining 3 stages](/media/pipeline-3.svg)]

---
count: false

pipelining:

.centerblock.width60[![pipelining 4 stages](/media/pipeline-4.svg)]

---
layout: false
class: expanded1

.floatright.width50[![FPGA diagram](/media/fpga-diagram-abc.png)]
# Hardware vs Software

the hardest thing about hardware design:

.floatleft.width40[![pipeline timing broken](/media/pipeline-timingfucked.svg)]

signal propagation speed  
is finite

- _timing constraint violation_ ⇒ the circuit won't work
- delays accumulate

---
class: center, middle, contrast, no-slide-number
count: false

# Making a SCION router

_tricks, challenges, lessons learned_

---

# Reduce, Reuse, Recycle

Expose network interfaces to the host,  
pass packets through to SW when they can't be handled

.centerblock.width60[![control plane with DMA-based interfaces](/media/scion-controlplane.svg)]

⇒ use .hl[unmodified] SW router for unhandled packets

.floatright.moral[Not everything belongs in HW]

???
⇒ error handling Solved(TM)
⇒ iteratively move functionality into HW

---

# Adding AES

TODO rephrase:
SCION requires a cryptographic check of path validity

.floatright.width30[![P4+NetFPGA structure](/media/p4-netfpga-structure-extern.png)]

```c
@Xilinx_MaxLatency(10)
extern void my_aes128(
    in bit<128> K,
    in bit<128> data,
    out bit<128> result
);
```
* declaration generates an interface in Verilog, you provide implementation
  * must be pipelined
  * can only be called once (declare multiple if needed)

.floatright.moral[Mix and match: embed native in P4]

???

and beware of compiler bugs

---

# Implementing the parser

SCION header:

.floatleft.margin0[
![header stack](/media/header-stack-my.svg)
]

.floatright.margin0.width60[
path aware ⇒ route in header;  
pointer to "my" field

.floatright.margin0.width60.red[
NetFPGA does not support header stacks or `varbit`
]

???
my experience with NetFPGA, this would be different with a different compiler
spent way too much time trying to find workarounds and encountering bugs in the workarounds
(including an off-by-one in the compiler :D)

---
.floatright.width30[![P4+NetFPGA structure](/media/p4-netfpga-structure-mod.png)]

# Implementing the parser

Fix: instead of saving all fields, use `packet_mod` Xilinx extension:

```c
parser ExModDeparser(packet_mod p, in headers_t h) {
    state start {
*       p.update(h.ethernet);
        transition select(h.ethernet.ethertype) {
            ETHERTYPE_IPV4: deparse_ipv4;
            ETHERTYPE_IPV6: deparse_ipv6;
        }
    }
…
}
```

.floatright.moral[Mix and match: P4 is embedded in a larger design]

???

Xilinx extension, **not** standard P4 — but standard P4 can use `varbit`

---
# Implementing the parser

Idea: **Skip over unused fields, read my field**

A loop wouldn't pipeline well  
(need constant time ⇒ unrolls into a very deep circuit)

Therefore…

.floatleft.width30[![separate sub-parsers](/media/subparsers-1.png)]

.floatright.moral[Parallel is often better than serial.]

.floatright.width30.margin0.red[But: For length up to e.g. 64, requires a **lot** of FPGA area]

???

Select the correct sub-parser at runtime, but the jump size is fixed at compile time.

---
# Implementing the parser

Idea: **Skip over unused fields, read my field .hl[in two stages]:**

.floatleft.width30[![separate sub-parsers](/media/subparsers-2.png)]
.floatright.width40[2x latency for .height1em[![sqrt(area)](/media/sqrt-area.png)]]
--

.floatright.width40.moral[Depth vs width tradeoffs are worth thinking about.]

???

TODO 0..7, not 1..8

TODO moral to the story:
2-stage thing is a deeper circuit, but has better timing results because FPGA is less full and thus the place&route can do a better job

---
layout: false
class: tall

# Area and timing constraints

.centerblock.width60[![Design of the router](/media/scion-breakdown.svg)]

inherent complexity + workarounds for bugs ⇒ NetFPGA almost full (!) ⇒ at 200 MHz timing violated

???
fixing timing complicated by P4 ⇒ HDL translation

TODO would be cute to add a "bug" to the picture in places

---

# Meeting timing

* find the _critical path_
* think about data dependencies
  * small, simple modules
  * "tight" interfaces: correct use of `in`/`out`, not `inout`; no extraneous parameters
  * minimise use of non-local data
  * wider is (usually) better than deeper, e.g.:

<div style="clear: both;"></div>
<div style="position: relative; margin: -1em auto; width: 95%;">
.floatleft[
```sh
if [long computation 1]
then [long computation 2]
```
]
.floatright[
```sh
[long computation 1];
[long computation 2];
[combine the results]    
```
]
<div style="position: absolute; top: 1.4em; left: 52.5%;">⇒</div>
</div>
<div style="clear: both; margin-bottom: -1em;"></div>

.floatright.width60.moral[Keep in mind: signal moves through the circuit and has to arrive in time]

???

if you are using 1 field of a struct, don't pass in the whole struct
restructuring my code gave me a 20% improvement on the critical path, in addition to better code

---
class: fullslide

---
class: center, middle, contrast, no-slide-number
count: false
# Lessons learned

---
class: expanded1

# High-speed packet processing

* P4:
  * suitable even for complex protocols
  * useful abstraction: high-level, yet runs fast on hardware
  * pipelining and parallelism handled by compiler
  * but: different targets have different tradeoffs and limitations
* P4-NetFPGA: is a PoC
  * looking forward to more targets & toolchains in the P4 ecosystem
* P4 on FPGAs:
  some low-level FPGA-specific knowledge required, but high flexibility

???
pipelining and parallelism handled automatically
P4 is clearly designed to compile to HW / pipeline and parallelise well, yet high-level enough to be useful (you don't have to do it yourself)

plus new compiler

---
# Network protocol design

* avoid variable length fields (really)
* should be parseable without interpretation
  * e.g. value of a tag implicitly defining length of a field is bad
* explicit lengths, not "continue" flags

---
class: expanded1
exclude: true

# If I had a time machine...

- I 
- I **would** use P4:
  - a surprisingly good abstraction
  - line rate achieved
- I would get paid for it :-)

???
- I would not aim at "production" before I have a prototype that guarantees that the basic functionality can actually work, because by the time I got to the end of the thing I was so burned out and full of hate that I didn't even polish the PoC :D
  - if I could know it in advance, I would not put my degree and pride in the hands of a horribly broken toolchain (but I'm not sure I could have known)
- I would still stick with P4 / HLS (esp. if I could have a better toolchain), because my network protocol is complex and P4 is a surprisingly useful abstraction that saved me a lot of effort and nerves

---

TODO template slide

.floatright.width15[![](https://chart.apis.google.com/chart?cht=qr&chs=500x500&choe=UTF-8&chld=H&chl=https://kamila.is//talking/p4-roundtable/)]