class: center, middle, contrast, no-slide-number count: false # Efficient P4+FPGA-based Forwarding for SCION, a Path-Aware Internet Architecture Kamila Součková 2020/04/22 --- excluded: true class: contrast, expanded, middle, narrow 1. **Intro** 2. **Challenges** 3. **Lessons learned** --- excluded: true class: center, middle, contrast, no-slide-number count: false # Intro Why? How? --- layout: true class: bg-top background-image: url(/media/banner-2000px-cut.jpg) # .white[Why?]
.centerblock.width60[]
--- .center[https://scion-architecture.net · https://scionlab.org] ??? The SCION future Internet architecture is a completely new approach to solving some of thecurrent Internet’s problems. Its viability has already been demonstrated -- we have a global research network and multiple commercial deployments. --- TODO explain path in header --- To go beyond research, it must work in HW: throughput, cost, energy efficiency ⇒ make sure that it can be done; improve protocol design ??? But to work at Internet scale, it is necessary to make it perform at high bandwidth, reduce costs, andevaluate its suitability for eventual implementation in hardware. So that's what we did. --- layout: false # How? methods for high-speed packet forwarding: .centerblock.width80[] ??? We have a software implementation, but that one can do less than 1Gbps. It's in userspace, so there's lots of opportunities for improving while staying in software, but... In order to be taken seriously e.g. by ISPs, we need to go up at least 3 orders of magnitude. Tricks like eBPF or DPDK can get us about one order of magnitude, possibly two, but beyond that a CPU won't really work. --- count: false # How? methods for high-speed packet forwarding: .centerblock.width80[] --- count: false # How? methods for high-speed packet forwarding: .centerblock.width80[] --- # P4 + NetFPGA .floatleft.width50[] .floatright.width40[] .floatleft.width40.margin0[ NetFPGA SUME: 4x 10GbE Virtex 7 FPGA **P4** (PoC) ] --- exclude: true # P4 + NetFPGA "normal" FPGA programming: _hardware description language (HDL)_: logic and connections .floatright.width40[] P4: * **high-level** and modular ⇒ less effort to write, maintain, port * **looks like software** ⇒ less need for FPGA expertise * **extensible**: supports `extern`s but note: `p4c-sdnet` compiler is a PoC ??? P4 should be target-independent, but some degree of "porting" is still needed because in practice different targets have different limitations and may need different tradeoffs --- layout: true .floatright.width50[] # Hardware vs Software - software: sequence of arbitrary instructions - hardware: **fixed** circuit; data moves ⇒ "think in space, not in time" --- pipelining: .centerblock.width60[] --- count: false pipelining: .centerblock.width60[] --- layout: false class: expanded1 .floatright.width50[] # Hardware vs Software the hardest thing about hardware design: .floatleft.width40[] signal propagation speed is finite
- _timing constraint violation_ ⇒ the circuit won't work - delays accumulate --- class: center, middle, contrast, no-slide-number count: false # Making a SCION router _tricks, challenges, lessons learned_ --- # Reduce, Reuse, Recycle Expose network interfaces to the host, pass packets through to SW when they can't be handled .centerblock.width60[] ⇒ use .hl[unmodified] SW router for unhandled packets -- .floatright.moral[Not everything belongs in HW] ??? ⇒ error handling Solved(TM) ⇒ iteratively move functionality into HW --- # Adding AES TODO rephrase: SCION requires a cryptographic check of path validity .floatright.width30[] ```c @Xilinx_MaxLatency(10) extern void my_aes128( in bit<128> K, in bit<128> data, out bit<128> result ); ``` * declaration generates an interface in Verilog, you provide implementation * must be pipelined * can only be called once (declare multiple if needed) -- .floatright.moral[Mix and match: embed native in P4] ??? and beware of compiler bugs --- # Implementing the parser SCION header: .floatleft.margin0[  ] .floatright.margin0.width60[ path aware ⇒ route in header; pointer to "my" field .light[(need to save the rest so we can emit in deparser)] ] -- .floatright.margin0.width60.red[ NetFPGA does not support header stacks or `varbit` ] ??? my experience with NetFPGA, this would be different with a different compiler spent way too much time trying to find workarounds and encountering bugs in the workarounds (including an off-by-one in the compiler :D) --- .floatright.width30[] # Implementing the parser Fix: instead of saving all fields, use `packet_mod` Xilinx extension:
```c parser ExModDeparser(packet_mod p, in headers_t h) { state start { * p.update(h.ethernet); transition select(h.ethernet.ethertype) { ETHERTYPE_IPV4: deparse_ipv4; ETHERTYPE_IPV6: deparse_ipv6; } } … } ``` -- .floatright.moral[Mix and match: P4 is embedded in a larger design] ??? Xilinx extension, **not** standard P4 — but standard P4 can use `varbit` --- # Implementing the parser Idea: **Skip over unused fields, read my field** -- A loop wouldn't pipeline well (need constant time ⇒ unrolls into a very deep circuit) -- Therefore… .floatleft.width30[] -- .floatright.moral[Parallel is often better than serial.] -- .floatright.width30.margin0.red[But: For length up to e.g. 64, requires a **lot** of FPGA area] ??? Select the correct sub-parser at runtime, but the jump size is fixed at compile time. --- # Implementing the parser Idea: **Skip over unused fields, read my field .hl[in two stages]:** .floatleft.width30[] .floatright.width40[2x latency for .height1em[]] -- .floatright.width40.moral[Depth vs width tradeoffs are worth thinking about.] ??? TODO 0..7, not 1..8 TODO moral to the story: 2-stage thing is a deeper circuit, but has better timing results because FPGA is less full and thus the place&route can do a better job --- layout: false class: tall # Area and timing constraints .centerblock.width60[] inherent complexity + workarounds for bugs ⇒ NetFPGA almost full (!) ⇒ at 200 MHz timing violated ??? fixing timing complicated by P4 ⇒ HDL translation TODO would be cute to add a "bug" to the picture in places --- class: expanded1 # Meeting timing * find the _critical path_ * think about data dependencies * small, simple modules * "tight" interfaces: correct use of `in`/`out`, not `inout`; no extraneous parameters * minimise use of non-local data * wider is (usually) better than deeper, e.g.:
.floatleft[ ```sh if [long computation 1] then [long computation 2] ``` ] .floatright[ ```sh [long computation 1]; [long computation 2]; [combine the results] ``` ]
⇒
-- .floatright.width60.moral[Keep in mind: signal moves through the circuit and has to arrive in time] ??? if you are using 1 field of a struct, don't pass in the whole struct restructuring my code gave me a 20% improvement on the critical path, in addition to better code --- class: fullslide
--- class: center, middle, contrast, no-slide-number count: false # Lessons learned --- class: expanded1 # High-speed packet processing * P4: * suitable even for complex protocols * useful abstraction: high-level, yet runs fast on hardware * pipelining and parallelism handled by compiler * but: different targets have different tradeoffs and limitations * P4-NetFPGA: is a PoC * looking forward to more targets & toolchains in the P4 ecosystem * P4 on FPGAs: some low-level FPGA-specific knowledge required, but high flexibility ??? pipelining and parallelism handled automatically P4 is clearly designed to compile to HW / pipeline and parallelise well, yet high-level enough to be useful (you don't have to do it yourself) plus new compiler --- # Network protocol design * avoid variable length fields (really) * should be parseable without interpretation * e.g. value of a tag implicitly defining length of a field is bad * explicit lengths, not "continue" flags --- class: expanded1 exclude: true # If I had a time machine... - I - I **would** use P4: - a surprisingly good abstraction - line rate achieved - I would get paid for it :-) ??? - I would not aim at "production" before I have a prototype that guarantees that the basic functionality can actually work, because by the time I got to the end of the thing I was so burned out and full of hate that I didn't even polish the PoC :D - if I could know it in advance, I would not put my degree and pride in the hands of a horribly broken toolchain (but I'm not sure I could have known) - I would still stick with P4 / HLS (esp. if I could have a better toolchain), because my network protocol is complex and P4 is a surprisingly useful abstraction that saved me a lot of effort and nerves --- TODO template slide .floatright.width15[]