class: center, middle, contrast, no-slide-number count: false # FPGA-based line-rate packet forwarding for the SCION future Internet architecture Kamila Součková Master's thesis defence ETH Zurich, 2019/09/10 --- class: contrast, expanded, middle, narrow 1. **Intro** Why? How? 2. **Design & Implementation** What? 3. **Results** Throughput and packet loss 4. **What have we learned?** Implications for packet forwarding & SCION --- class: center, middle, contrast, no-slide-number count: false # Intro Why? How? --- layout: true class: bg-top background-image: url(../img/banner-2000px-cut.jpg) # .white[Why?]
.centerblock.width60[![SCION](../img/badge-white.png)]
--- - already deployed in practice - SCIONLab, Anapaya - in order to succeed, SCION must work at line rate - throughput, cost, energy efficiency --- - currently all software-based - actual high-speed implementation may show opportunities for design improvement --- layout: false # What? methods for high-speed packet forwarding: .centerblock.width80[![Methods for high-speed packet forwarding](../img/high-speed-packet-forwarding.svg)] --- count: false # What? methods for high-speed packet forwarding: .centerblock.width80[![Methods for high-speed packet forwarding](../img/high-speed-packet-forwarding-SCION.svg)] --- count: false # What? methods for high-speed packet forwarding: .centerblock.width80[![Methods for high-speed packet forwarding](../img/high-speed-packet-forwarding-FPGA.svg)] -- .floatleft.width50[![netfpga](../img/netfpga.png)] .floatleft.width40.margin0[ NetFPGA SUME: 4x 10GbE Virtex 7 FPGA P4 ] --- # P4 FPGA design is very time-consuming, requires expertise, and changes cost a lot of effort ⇒ P4: language for programming packet forwarding planes * **high-level** ⇒ less effort to write and maintain * **target-independent** ⇒ same code runs on many platforms * **looks like software** ⇒ less need for domain expertise ??? target independent -- in theory :D --- layout: true .floatright.width50[![FPGA diagram](../img/fpga-diagram.png)] # How do FPGAs work? - software: sequence of arbitrary instructions - FPGA: fixed **circuit**; data moves through - pipelining: --- .centerblock.width60[![pipelining 3 stages](../img/pipeline-3.svg)] --- count: false .centerblock.width60[![pipelining 4 stages](../img/pipeline-4.svg)] --- layout: false class: expanded1 .floatright.width50[![FPGA diagram](../img/fpga-diagram.png)] # How do FPGAs work? recurring problem in FPGA design: .floatleft.width40[![pipeline timing broken](../img/pipeline-timingfucked.svg)] signal propagation speed is finite
- if the circuit is too complicated, the signal propagation delay will be too large ⇒ _timing constraint violation_ ⇒ the circuit won't work - delays accumulate --- class: expanded # Contributions of this project 1. PoC router that **forwards SCION at line rate** 2. testing **feasibility of SCION in hardware** + enabling going faster in the future 3. **suggestions for the SCION protocol** for better suitability for HW --- class: center, middle, contrast, no-slide-number count: false # Design & Implementation What? --- class: expanded # Forwarding in SCION .floatright.width60[![forwarding path packet](../img/path.png)] * PCFS: path in packet header * every border router processes only its own *hop field* * checks to ensure path authorisation * AES-based HF MAC --- class: tall # Design of the router .centerblock.width90[![Parts of the project](../img/scion-breakdown.svg)] ??? * modular * better while developing: may take a bit more time, but we can reuse our work and it simplifies debugging * swapping parts makes sense (e.g. AES or encaps) * because you can make other things with it * integrated with existing SCION infra * control plane * monitoring / metrics ---------------------- going to talk about 3 interesting aspects: - control plane - path authorisation check - parsing of SCION packet TODO may be worth reordering? --- class: expanded1, bg-top-right background-image: url(../img/bugs.png) # Challenges among others: * project building workflow: complicated, time-consuming, not automated, buggy * documentation: almost nonexistent and sometimes wrong * P4-NetFPGA scripts full of bugs * P4-NetFPGA compiler full of bugs * expensive workarounds needed * which unearth more bugs… ??? also: * it's a lot of work! --- ## Control plane: "Reduce, Reuse, Recycle" pretend to be a NIC: expose network interfaces to the host, pass packets through when they can't be handled .centerblock.width70[![control plane with DMA-based interfaces](../img/scion-controlplane.svg)] ⇒ use .hl[unmodified] SW router for unhandled packets ??? ⇒ error handling Solved(TM) ⇒ iteratively move functionality into HW --- # Hop field validation 1. check ingress interface 2. check timestamp and expiration 3. check MAC - chained ⇒ checks whole path - AES-CMAC with AS-specific key: subkeys + **1 AES per block** .centerblock.width90[![HF MACs chaining](../img/macs.svg)] ??? timestamp --- # Adding AES - AES-CMAC needed for HF MAC check - P4 allows calling external modules through `extern` functions: ```c extern void aes128(in bit<128> key, in bit<128> data, out bit<128> result); ``` - to add an extern, copy existing and poke at it :-/ -- .doc.bug[the module must be pipelined (was undocumented)] -- .bug[externs corrupt state ⇒ the first table after calling an extern won't work] ??? [BEFORE CLICK]: so we did that, we copied a checksum extern and stuffed an AES module into it instead, tried it, and it worked! But only for the first packet. Then we needed to reboot the host and then it would again correctly calculate AES for one packet, and then it would get stuck and need a reboot again. This does not scale very well. --- layout: true # Implementing the parser SCION header: .floatleft.margin0[ ![header stack](../img/header-stack.svg) ] --- ??? 
my experience with NetFPGA, this would be different with a different compiler --- count: false .floatright.margin0[ First idea: Use header stacks: ```c struct ScionHeader_t { … HopField_h[32] seg1_hfs; … } ``` ] --- count: false .floatright.margin0[ First idea: Use header stacks: ```c struct ScionHeader_t { … HopField_h[32] seg1_hfs; … } ``` .red[NetFPGA compiler does not support header stacks] ] --- layout: false # Implementing the parser Fix: We don't need to parse the whole path, we just need to save it so we can emit it later ⇒ use a `varbit` field .floatleft.margin0.width30[ ![header stack](../img/header-stack-varbit.svg) ] -- .floatright.margin0.red[ NetFPGA compiler does not support `varbit` ] --- # Implementing the parser Fix: Don't even save the path: use `packet_mod`: ```c parser ExModDeparser(packet_mod p, in headers_t h) { state start { * p.update(h.ethernet); transition select(h.ethernet.ethertype) { ETHERTYPE_IPV4: deparse_ipv4; ETHERTYPE_IPV6: deparse_ipv6; } } … } ``` Xilinx extension, **not** standard P4 — but standard P4 can use `varbit` --- # Implementing the parser Fix: Don't even save the path: use `packet_mod`: 1. modify NetFPGA design to use the `packet_mod`-enabled `XilinxStreamSwitch` architecture .light[\* experimental :-)] 2. skip over the path with `packet.advance(size)` -- .red[NetFPGA requires `packet.advance(size)` to be a compile-time constant] --- layout: true # Implementing the parser .light[Fix: Don't even save the path: use `packet_mod`:] --- 1. .light[modify NetFPGA design to use the `packet_mod`-enabled `XilinxStreamSwitch` architecture] .light[\* experimental :-)] 2. **loop** until current to skip over hop fields with `packet.advance(HF_SIZE)` -- .red.bug[loop in parser generates invalid intermediate code (off-by-one in the compiler?)] -- .center[Therefore…] --- 1. .light[modify NetFPGA design to use the `packet_mod`-enabled `XilinxStreamSwitch` architecture] .light[\* experimental :-)] 2. **Create a lot of separate sub-parsers: one for skipping 1 HF, one for skipping 2, etc.** Select the correct sub-parser at runtime, but the jump size is fixed at compile time. --- layout: false # Implementing the parser Create a lot of separate sub-parsers: up to .hl[max length]. Problem: To support reasonably long paths (~64 HFs), we need a lot of sub-parsers. ⇒ Requires a lot of FPGA area + RAM to build it. --- layout: true # Implementing the parser Create a lot of separate sub-parsers: up to .hl[max length]. Problem: To support reasonably long paths (~64 HFs), we need a lot of sub-parsers. ⇒ Requires a lot of FPGA area + RAM to build it. Fix: Two stages of max length 8: --- count: false .width100[![sqrt1](../img/sqrt1.svg)] --- count: false .width100[![sqrt2](../img/sqrt2.svg)] --- count: false .width100[![sqrt3](../img/sqrt3.svg)] --- count: false .width100[![sqrt4](../img/sqrt4.svg)] -- .bottomtick[] --- layout: false class: expanded1 exclude: true # Portability * P4 should be target-independent, but… * different externs * different compiler limitations (ahem) * pieces of the solution: * preprocessor `#define`s decouple platform from features; code depends only on features * `#ifdef TARGET_SUPPORTS_PACKET_MOD` rather than `#ifdef NETFPGA` * our repo structure minimises work needed to add new platform --- layout: false class: expanded1 # Area and timing constraints * workarounds for all the bugs very expensive: NetFPGA almost full (!) 
* at 200 MHz, timing violated * originally > 6 ns, .hl[optimised down to 5.1 ns (20%)], but we need 5.0 ns * further attempts just move problem around * complicated by P4 ⇒ HDL translation: not much control over the resulting design * 160 MHz would suffice, but changing clock is non-trivial -- ⇒ test partial designs with some features disabled --- class: center, middle, contrast, no-slide-number count: false # Results Throughput and packet loss --- .floatright.width60[![spirent testcenter](../img/a-random-spirent.png)] # Method Spirent TestCenter traffic generator (replays pre-generated packets)
full design does not meet timing ⇒ tested: 1. everything but updating IP overlay 2. everything but AES (results identical) all measurements repeatable ??? traffic generator does not know about SCION ⇒ replays pre-generated packets 1. everything but updating IP overlay: parsing, HF validation, forwarding 2. everything but AES: parsing, timestamp check, routing, IP overlay update --- # Theory .center[ our design is pipelined, processes 256 bits every cycle, runs at 200 MHz 200·10<sup>6</sup> cycles/s · 256 bits .hl.large[≈ 51 Gbps] ⇒ should saturate 40 Gbps link ] --- class: center, middle .centerblock.width20[![cross fingers icon](../img/cross-fingers.svg)] # .large[DEMO TIME] --- class: center # Throughput: 1500B frames .width100[![throughput graph](../img/throughput-1500B.jpg)] 9.93 Gbps per port ⇒ **39.72 Gbps** --- class: center # Throughput: 115B frames .width100[![throughput graph](../img/throughput-115B.jpg)] 9.93 Gbps per port ⇒ **39.72 Gbps** --- # Packet loss (over 45s) .floatleft[
<table>
  <tr><th>Load</th><th>Loss</th></tr>
  <tr><td>8 Gbps</td><td>0</td></tr>
  <tr><td>16 Gbps</td><td>0</td></tr>
  <tr><td>24 Gbps</td><td>0</td></tr>
  <tr><td>32 Gbps</td><td>0</td></tr>
  <tr><td>40 Gbps</td><td>0</td></tr>
</table>

1500B frames
] -- .floatleft[
<table>
  <tr><th>Load</th><th>Loss</th></tr>
  <tr><td>8 Gbps</td><td>0</td></tr>
  <tr><td>16 Gbps</td><td>0</td></tr>
  <tr><td>24 Gbps</td><td>0</td></tr>
  <tr><td>32 Gbps</td><td>0</td></tr>
  <tr><td>40 Gbps</td><td>0</td></tr>
</table>

115B frames
] -- processing faster than line rate, very even traffic distribution ⇒ no buffering :-) --- # Analysis .centerblock.width50[![throughput graph](../img/throughput-115B-nokey.jpg)] - line rate, 0% packet loss achieved - timing "almost there" ⇒ without workarounds for the bad compiler, the full design would work - throughput would not change .center.hl[**⇒ SCION can be forwarded at 40 Gbps**] --- class: center, middle, contrast, no-slide-number count: false # What have we learned? Implications for packet forwarding & SCION --- class: expanded1 # High-speed packet processing * P4 suitable even for complex protocols * line rate by default! * but: targets have notable limitations * not as target-independent as we had hoped * P4-NetFPGA not suitable for complex things :-/ * barely a PoC; very scarce documentation * FPGA-based implementations: not that many limitations, but timing can be a problem * looking forward to more mature targets & toolchains :-) ??? plus new compiler --- class: expanded1 # Writing P4 that meets timing * find critical path (how depends on toolchain) * careful with data dependencies: delays accumulate * avoid `inout` parameters in critical spots * full FPGA is bad * CAM tables are somewhat expensive ⇒ may help to combine multiple lookups into one table * rewrite code like this:
.floatleft[ ```sh if [long computation 1] then [long computation 2] ``` ] .floatright[ ```sh [long computation 1]; [long computation 2]; [combine the results] ``` ]
⇒
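---
count: false
# Writing P4 that meets timing

a minimal P4-flavoured sketch of the rewrite on the previous slide (the helpers, types and `meta.drop` flag are hypothetical, **not** the thesis code):

```c
// Hedged sketch with hypothetical helpers and metadata; illustration only.
control Forward(inout headers_t hdr, inout metadata_t meta) {
    apply {
        // Before (slow): update_overlay() waits for the MAC check, so both
        // long propagation delays sit on the same path and accumulate:
        //   if (hf_mac_ok(hdr, meta)) { update_overlay(hdr); }

        // After: both long computations run independently; only their cheap
        // results meet at the end of the stage.
        bool mac_ok = hf_mac_ok(hdr, meta);  // long computation 1
        update_overlay(hdr);                 // long computation 2 (speculative)
        if (!mac_ok) {
            meta.drop = true;                // combine: discard speculative work
        }
    }
}
```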
--- # Suggestions for SCION protocol *
avoid state on routers, require few table lookups
✓ * define maximum sizes: path length, HF size * avoid variable lengths * alignment is good ✓ * HF "continue" flag should be removed * make things explicit * include host address length in header * consider re-thinking MAC chaining * further research into the tradeoffs is needed .small[collaboration with many people in NetSec, Anapaya and SIDN .hl[redesign in progress :-)]] --- # Conclusion .centerblock.width70[![throughput graph](../img/throughput-115B-nokey.jpg)] - SCION can be forwarded at high speeds - upcoming header redesign will incorporate our suggestions --- class: center, middle, contrast, no-slide-number count: false # Say hi! Email: **scion@kamila.is** Twitter: **@anotherkamila** Matrix: **@kamila:unchat.cat** .width20[![](https://chart.apis.google.com/chart?cht=qr&chs=500x500&choe=UTF-8&chld=H&chl=https://kamila.is//talking/scion/thesis/)] --- class: center, middle, contrast, no-slide-number count: false # Appendix --- class: no-slide-number count: false class: expanded1 # Future work * completing features / upgrading for newer SCION * redesign ⇒ changes (simplifications?) * researching resource usage * esp. power consumption * other targets / higher throughput / other applications .floatleft.margin0.width48[![network monitoring system](../img/scion-nms.svg)] .floatright.margin0.width48[![end host with smartNIC](../img/scion-smartnic.svg)]
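---
class: no-slide-number
count: false
# Appendix: two-stage path skip (sketch)

a minimal, hedged P4 sketch of the two-stage hop-field skip from the parser slides: the names (`SkipPath`, `hf_count`) and the 64-bit hop-field size are assumptions, **not** the thesis code

```c
#include <core.p4>

const bit<32> HF_BITS = 64;   // assumed hop-field size (8 bytes)

// Skips over the path without parsing it. Every advance() argument is a
// compile-time constant; which skip state runs is selected at runtime.
parser SkipPath(packet_in pkt, in bit<8> hf_count) {
    // Stage 1: skip whole groups of 8 hop fields.
    state start {
        transition select(hf_count >> 3) {
            0: stage2;
            1: skip_8;
            2: skip_16;
            // … one state per group, up to the maximum path length …
            default: reject;
        }
    }
    state skip_8  { pkt.advance(8  * HF_BITS); transition stage2; }
    state skip_16 { pkt.advance(16 * HF_BITS); transition stage2; }

    // Stage 2: skip the remaining 0–7 hop fields the same way.
    state stage2 {
        transition select(hf_count & 7) {
            0: accept;
            1: skip_1;
            2: skip_2;
            // … up to skip_7 …
            default: reject;
        }
    }
    state skip_1 { pkt.advance(1 * HF_BITS); transition accept; }
    state skip_2 { pkt.advance(2 * HF_BITS); transition accept; }
}
```

~16 fixed-advance states instead of ~64, while still supporting long paths: the saving behind "two stages of max length 8"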