class: center, middle, contrast, no-slide-number count: false # FPGA-based line-rate packet forwarding for the SCION future Internet architecture Kamila Součková Master's thesis defence ETH Zurich, 2019/09/10 --- class: contrast, expanded, middle, narrow 1. **Intro** Why? How? 2. **Design & Implementation** What? 3. **Results** Throughput and packet loss 4. **What have we learned?** Implications for packet forwarding & SCION --- class: center, middle, contrast, no-slide-number count: false # Intro Why? How? --- layout: true class: bg-top background-image: url(../img/banner-2000px-cut.jpg) # .white[Why?]
.centerblock.width60[![SCION](../img/badge-white.png)]
--- - already deployed in practice - SCIONLab, Anapaya - in order to succeed, SCION must work at line rate - throughput, cost, energy efficiency --- - currently all software-based - actual high-speed implementation may show opportunities for design improvement --- layout: false # What? methods for high-speed packet forwarding: .centerblock.width80[![Methods for high-speed packet forwarding](../img/high-speed-packet-forwarding.svg)] --- count: false # What? methods for high-speed packet forwarding: .centerblock.width80[![Methods for high-speed packet forwarding](../img/high-speed-packet-forwarding-SCION.svg)] --- count: false # What? methods for high-speed packet forwarding: .centerblock.width80[![Methods for high-speed packet forwarding](../img/high-speed-packet-forwarding-FPGA.svg)] -- .floatleft.width50[![netfpga](../img/netfpga.png)] .floatleft.width40.margin0[ NetFPGA SUME: 4x 10GbE Virtex 7 FPGA P4 ] --- # P4 FPGA design is very time-consuming, requires expertise, and changes cost a lot of effort ⇒ P4: language for programming packet forwarding planes * **high-level** ⇒ less effort to write and maintain * **target-independent** ⇒ same code runs on many platforms * **looks like software** ⇒ less need for domain expertise ??? target independent -- in theory :D --- layout: true .floatright.width50[![FPGA diagram](../img/fpga-diagram.png)] # How do FPGAs work? - software: sequence of arbitrary instructions - FPGA: fixed **circuit**; data moves through - pipelining: --- .centerblock.width60[![pipelining 3 stages](../img/pipeline-3.svg)] --- count: false .centerblock.width60[![pipelining 4 stages](../img/pipeline-4.svg)] --- layout: false class: expanded1 .floatright.width50[![FPGA diagram](../img/fpga-diagram.png)] # How do FPGAs work? recurring problem in FPGA design: .floatleft.width40[![pipeline timing broken](../img/pipeline-timingfucked.svg)] signal propagation speed is finite
- if the circuit is too complicated, the signal propagation delay will be too large ⇒ _timing constraint violation_ ⇒ the circuit won't work - delays accumulate --- class: expanded # Contributions of this project 1. PoC router that **forwards SCION at line rate** 2. testing **feasibility of SCION in hardware** + enabling going faster in the future 3. **suggestions for the SCION protocol** for better suitability for HW --- class: center, middle, contrast, no-slide-number count: false # Design & Implementation What? --- class: expanded # Forwarding in SCION .floatright.width60[![forwarding path packet](../img/path.png)] * PCFS: path in packet header * every border router processes only its own *hop field* * checks to ensure path authorisation * AES-based HF MAC --- class: tall # Design of the router .centerblock.width90[![Parts of the project](../img/scion-breakdown.svg)] ??? * modular * better while developing: may take a bit more time, but we can reuse our work and it simplifies debugging * swapping parts makes sense (e.g. AES or encaps) * because you can make other things with it * integrated with existing SCION infra * control plane * monitoring / metrics ---------------------- going to talk about 3 interesting aspects: - control plane - path authorisation check - parsing of SCION packet TODO may be worth reordering? --- class: expanded1, bg-top-right background-image: url(../img/bugs.png) # Challenges among others: * project building workflow: complicated, time-consuming, not automated, buggy * documentation: almost nonexistent and sometimes wrong * P4-NetFPGA scripts full of bugs * P4-NetFPGA compiler full of bugs * expensive workarounds needed * which unearth more bugs… ??? also: * it's a lot of work! --- ## Control plane: "Reduce, Reuse, Recycle" pretend to be a NIC: expose network interfaces to the host, pass packets through when they can't be handled .centerblock.width70[![control plane with DMA-based interfaces](../img/scion-controlplane.svg)] ⇒ use .hl[unmodified] SW router for unhandled packets ??? ⇒ error handling Solved(TM) ⇒ iteratively move functionality into HW --- # Hop field validation 1. check ingress interface 2. check timestamp and expiration 3. check MAC - chained ⇒ checks whole path - AES-CMAC with AS-specific key: subkeys + **1 AES per block** .centerblock.width90[![HF MACs chaining](../img/macs.svg)] ??? timestamp --- # Adding AES - AES-CMAC needed for HF MAC check - P4 allows calling external modules through `extern` functions: ```c extern void aes128(in bit<128> key, in bit<128> data, out bit<128> result); ``` - to add an extern, copy existing and poke at it :-/ -- .doc.bug[the module must be pipelined (was undocumented)] -- .bug[externs corrupt state ⇒ the first table after calling an extern won't work] ??? [BEFORE CLICK]: so we did that, we copied a checksum extern and stuffed an AES module into it instead, tried it, and it worked! But only for the first packet. Then we needed to reboot the host and then it would again correctly calculate AES for one packet, and then it would get stuck and need a reboot again. This does not scale very well. --- layout: true # Implementing the parser SCION header: .floatleft.margin0[ ![header stack](../img/header-stack.svg) ] --- ??? 
my experience with NetFPGA, this would be different with a different compiler --- count: false .floatright.margin0[ First idea: Use header stacks: ```c struct ScionHeader_t { … HopField_h[32] seg1_hfs; … } ``` ] --- count: false .floatright.margin0[ First idea: Use header stacks: ```c struct ScionHeader_t { … HopField_h[32] seg1_hfs; … } ``` .red[NetFPGA compiler does not support header stacks] ] --- layout: false # Implementing the parser Fix: We don't need to parse the whole path, we just need to save it so we can emit it later ⇒ use a `varbit` field .floatleft.margin0.width30[ ![header stack](../img/header-stack-varbit.svg) ] -- .floatright.margin0.red[ NetFPGA compiler does not support `varbit` ] --- # Implementing the parser Fix: Don't even save the path: use `packet_mod`: ```c parser ExModDeparser(packet_mod p, in headers_t h) { state start { * p.update(h.ethernet); transition select(h.ethernet.ethertype) { ETHERTYPE_IPV4: deparse_ipv4; ETHERTYPE_IPV6: deparse_ipv6; } } … } ``` Xilinx extension, **not** standard P4 — but standard P4 can use `varbit` --- # Implementing the parser Fix: Don't even save the path: use `packet_mod`: 1. modify NetFPGA design to use the `packet_mod`-enabled `XilinxStreamSwitch` architecture .light[\* experimental :-)] 2. skip over the path with `packet.advance(size)` -- .red[NetFPGA requires `packet.advance(size)` to be a compile-time constant] --- layout: true # Implementing the parser .light[Fix: Don't even save the path: use `packet_mod`:] --- 1. .light[modify NetFPGA design to use the `packet_mod`-enabled `XilinxStreamSwitch` architecture] .light[\* experimental :-)] 2. **loop** until current to skip over hop fields with `packet.advance(HF_SIZE)` -- .red.bug[loop in parser generates invalid intermediate code (off-by-one in the compiler?)] -- .center[Therefore…] --- 1. .light[modify NetFPGA design to use the `packet_mod`-enabled `XilinxStreamSwitch` architecture] .light[\* experimental :-)] 2. **Create a lot of separate sub-parsers: one for skipping 1 HF, one for skipping 2, etc.** Select the correct sub-parser at runtime, but the jump size is fixed at compile time. --- layout: false # Implementing the parser Create a lot of separate sub-parsers: up to .hl[max length]. Problem: To support reasonably long paths (~64 HFs), we need a lot of sub-parsers. ⇒ Requires a lot of FPGA area + RAM to build it. --- layout: true # Implementing the parser Create a lot of separate sub-parsers: up to .hl[max length]. Problem: To support reasonably long paths (~64 HFs), we need a lot of sub-parsers. ⇒ Requires a lot of FPGA area + RAM to build it. Fix: Two stages of max length 8: --- count: false .width100[![sqrt1](../img/sqrt1.svg)] --- count: false .width100[![sqrt2](../img/sqrt2.svg)] --- count: false .width100[![sqrt3](../img/sqrt3.svg)] --- count: false .width100[![sqrt4](../img/sqrt4.svg)] -- .bottomtick[] --- layout: false class: expanded1 exclude: true # Portability * P4 should be target-independent, but… * different externs * different compiler limitations (ahem) * pieces of the solution: * preprocessor `#define`s decouple platform from features; code depends only on features * `#ifdef TARGET_SUPPORTS_PACKET_MOD` rather than `#ifdef NETFPGA` * our repo structure minimises work needed to add new platform --- layout: false class: expanded1 # Area and timing constraints * workarounds for all the bugs very expensive: NetFPGA almost full (!) 
* at 200 MHz, timing violated * originally > 6 ns, .hl[optimised down to 5.1 ns (20%)], but we need 5.0 ns * further attempts just move problem around * complicated by P4 ⇒ HDL translation: not much control over the resulting design * 160 MHz would suffice, but changing clock is non-trivial -- ⇒ test partial designs with some features disabled --- class: center, middle, contrast, no-slide-number count: false # Results Throughput and packet loss --- .floatright.width60[![spirent testcenter](../img/a-random-spirent.png)] # Method Spirent TestCenter traffic generator (replays pre-generated packets)
full design does not meet timing ⇒ tested: 1. everything but updating IP overlay 2. everything but AES (results identical) all measurements repeatable ??? traffic generator does not know about SCION ⇒ replays pre-generated packets 1. everything but updating IP overlay: parsing, HF validation, forwarding 2. everything but AES: parsing, timestamp check, routing, IP overlay update --- # Theory .center[ our design is pipelined, processes 256 bits every cycle, runs at 200 MHz 200·10<sup>6</sup> cycles/s · 256 bits .hl.large[≈ 51 Gbps] ⇒ should saturate 40 Gbps link ] --- class: center, middle .centerblock.width20[![cross fingers icon](../img/cross-fingers.svg)] # .large[DEMO TIME] --- class: center # Throughput: 1500B frames .width100[![throughput graph](../img/throughput-1500B.jpg)] 9.93 Gbps per port ⇒ **39.72 Gbps** --- class: center # Throughput: 115B frames .width100[![throughput graph](../img/throughput-115B.jpg)] 9.93 Gbps per port ⇒ **39.72 Gbps** --- # Packet loss (over 45s) .floatleft[
<table>
  <tr><th>Load</th><th>Loss</th></tr>
  <tr><td>8 Gbps</td><td>0</td></tr>
  <tr><td>16 Gbps</td><td>0</td></tr>
  <tr><td>24 Gbps</td><td>0</td></tr>
  <tr><td>32 Gbps</td><td>0</td></tr>
  <tr><td>40 Gbps</td><td>0</td></tr>
</table>

1500B frames
] -- .floatleft[
<table>
  <tr><th>Load</th><th>Loss</th></tr>
  <tr><td>8 Gbps</td><td>0</td></tr>
  <tr><td>16 Gbps</td><td>0</td></tr>
  <tr><td>24 Gbps</td><td>0</td></tr>
  <tr><td>32 Gbps</td><td>0</td></tr>
  <tr><td>40 Gbps</td><td>0</td></tr>
</table>

115B frames
] -- processing faster than line rate, very even traffic distribution ⇒ no buffering :-) --- # Analysis .centerblock.width50[![throughput graph](../img/throughput-115B-nokey.jpg)] - line rate, 0% packet loss achieved - timing "almost there" ⇒ without workarounds for the bad compiler, the full design would work - throughput would not change .center.hl[**⇒ SCION can be forwarded at 40 Gbps**] --- class: center, middle, contrast, no-slide-number count: false # What have we learned? Implications for packet forwarding & SCION --- class: expanded1 # High-speed packet processing * P4 suitable even for complex protocols * line rate by default! * but: targets have notable limitations * not as target-independent as we had hoped * P4-NetFPGA not suitable for complex things :-/ * barely a PoC; very scarce documentation * FPGA-based implementations: not that many limitations, but timing can be a problem * looking forward to more mature targets & toolchains :-) ??? plus new compiler --- class: expanded1 # Writing P4 that meets timing * find critical path (how depends on toolchain) * careful with data dependencies: delays accumulate * avoid `inout` parameters in critical spots * full FPGA is bad * CAM tables are somewhat expensive ⇒ may help to combine multiple lookups into one table * rewrite code like this:
.floatleft[ ```sh if [long computation 1] then [long computation 2] ``` ] .floatright[ ```sh [long computation 1]; [long computation 2]; [combine the results] ``` ]
⇒
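---
count: false
# Writing P4 that meets timing

a minimal P4-flavoured sketch of the rewrite on the previous slide (the helpers, types and `meta.drop` flag are hypothetical, **not** the thesis code):

```c
// Hedged sketch with hypothetical helpers and metadata; illustration only.
control Forward(inout headers_t hdr, inout metadata_t meta) {
    apply {
        // Before (slow): update_overlay() waits for the MAC check, so both
        // long propagation delays sit on the same path and accumulate:
        //   if (hf_mac_ok(hdr, meta)) { update_overlay(hdr); }

        // After: both long computations run independently; only their cheap
        // results meet at the end of the stage.
        bool mac_ok = hf_mac_ok(hdr, meta);  // long computation 1
        update_overlay(hdr);                 // long computation 2 (speculative)
        if (!mac_ok) {
            meta.drop = true;                // combine: discard speculative work
        }
    }
}
```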
--- # Suggestions for SCION protocol *
avoid state on routers, require few table lookups
✓ * define maximum sizes: path length, HF size * avoid variable lengths * alignment is good ✓ * HF "continue" flag should be removed * make things explicit * include host address length in header * consider re-thinking MAC chaining * further research into the tradeoffs is needed .small[collaboration with many people in NetSec, Anapaya and SIDN .hl[redesign in progress :-)]] --- # Conclusion .centerblock.width70[![throughput graph](../img/throughput-115B-nokey.jpg)] - SCION can be forwarded at high speeds - upcoming header redesign will incorporate our suggestions --- class: center, middle, contrast, no-slide-number count: false # Say hi! Email: **scion@kamila.is** Twitter: **@anotherkamila** Matrix: **@kamila:unchat.cat** .width20[![](https://chart.apis.google.com/chart?cht=qr&chs=500x500&choe=UTF-8&chld=H&chl=https://kamila.is//talking/scion/thesis/)] --- class: center, middle, contrast, no-slide-number count: false # Appendix --- class: no-slide-number count: false class: expanded1 # Future work * completing features / upgrading for newer SCION * redesign ⇒ changes (simplifications?) * researching resource usage * esp. power consumption * other targets / higher throughput / other applications .floatleft.margin0.width48[![network monitoring system](../img/scion-nms.svg)] .floatright.margin0.width48[![end host with smartNIC](../img/scion-smartnic.svg)]
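---
class: no-slide-number
count: false
# Appendix: two-stage path skip (sketch)

a minimal, hedged P4 sketch of the two-stage hop-field skip from the parser slides: the names (`SkipPath`, `hf_count`) and the 64-bit hop-field size are assumptions, **not** the thesis code

```c
#include <core.p4>

const bit<32> HF_BITS = 64;   // assumed hop-field size (8 bytes)

// Skips over the path without parsing it. Every advance() argument is a
// compile-time constant; which skip state runs is selected at runtime.
parser SkipPath(packet_in pkt, in bit<8> hf_count) {
    // Stage 1: skip whole groups of 8 hop fields.
    state start {
        transition select(hf_count >> 3) {
            0: stage2;
            1: skip_8;
            2: skip_16;
            // … one state per group, up to the maximum path length …
            default: reject;
        }
    }
    state skip_8  { pkt.advance(8  * HF_BITS); transition stage2; }
    state skip_16 { pkt.advance(16 * HF_BITS); transition stage2; }

    // Stage 2: skip the remaining 0–7 hop fields the same way.
    state stage2 {
        transition select(hf_count & 7) {
            0: accept;
            1: skip_1;
            2: skip_2;
            // … up to skip_7 …
            default: reject;
        }
    }
    state skip_1 { pkt.advance(1 * HF_BITS); transition accept; }
    state skip_2 { pkt.advance(2 * HF_BITS); transition accept; }
}
```

~16 fixed-advance states instead of ~64, while still supporting long paths: the saving behind "two stages of max length 8"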