Programmable and High-Performance Network Function for the Cloud Data Center Network

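Author: 贾成君 (Jia Chengjun)
  • Student ID
    2018******
  • Degree
    Doctoral
  • Email
    che******com
  • Defense date
    2023.05.15
  • Advisor
    李军 (Li Jun)
  • Discipline
    Control Science and Engineering
  • Pages
    151
  • Confidentiality level
    Public
  • Department
    025 Department of Automation
  • Chinese keywords
    数据中心,可编程网络,网络功能,智能网卡,分布式系统
  • English keywords
    Data Center, Programmable Network, Network Function, SmartNIC, Distributed System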

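Abstract

Cloud data centers are deploying more and more programmable devices such as programmable NICs and programmable switches. Cloud data center networks must deliver high throughput, low latency, configurability, and high concurrency; how to fully exploit the heterogeneous computing capabilities of programmable networks to provide network processing functions to upper-layer users is an active research topic in computer networking. Organized around packets, flows, and clusters, this thesis studies four typical network functions: access control, traffic measurement, congestion control, and distributed communication.

(1) Cloud data center gateways need access control with high throughput and high rule capacity. This thesis designs a hardware acceleration scheme for this function, FACL, which can deploy multiple decision-tree algorithms and accommodate rule sets whose tree depth is not fixed. FACL uses a hybrid Pipeline and RTC architecture and supports merging the results of multiple decision trees. Evaluation on a Xilinx U50 SmartNIC shows that, at a capacity of 100K rules, FACL delivers 250 Mpps of throughput while consuming less than 20% of the hardware resources, cutting hardware cost by 60% compared with existing software solutions.

(2) To address the shortage of on-chip memory on switches, this thesis proposes a traffic measurement algorithm with low storage overhead, ElasticCounter, which achieves very high space utilization on highly skewed traffic. ElasticCounter uses cuckoo hashing to raise counter utilization and lets the counters in a hash bucket adjust their bit lengths to adapt to the actual flow distribution. On three tasks (frequency measurement, heavy-hitter detection, and heavy-change detection), ElasticCounter's error rate is 1 to 2 orders of magnitude lower than that of existing Sketch algorithms with the same memory. It delivers 192 Mpps of throughput using 3% of FPGA resources, 18% higher than Elastic Sketch.

(3) In the highly concurrent flow scenarios of cloud data center networks, existing congestion control algorithms cannot keep switch queues stable because end-host rate adjustment is not coordinated. This thesis designs a new congestion control scheme with end-host and network coordination, C-AQM, which is readily deployable. In C-AQM, end hosts propose and the network approves or rejects. Compared with network-decision algorithms such as XCP, this design greatly reduces computational complexity while guaranteeing bandwidth fairness among flows at the bottleneck link. ns3 simulations and Tofino testbed experiments show that C-AQM keeps queue jitter within 20 KB and reduces queuing delay by 50% to 80% compared with algorithms such as DCQCN and HPCC.

(4) To make it easier to develop distributed systems with programmable network devices, this thesis designs a new software library for distributed communication tasks, O-RMA, which effectively reduces the completion time of typical tasks. Any host in the cluster can invoke pre-registered O-RMA communication tasks in an RPC-like manner. Each task is composed of extended RDMA primitives, including local/remote calls and asynchronous memory reads and writes. Forty lines of O-RMA primitives suffice to describe tasks such as multi-replica log append, atomic object read/write, and distributed locking. Compared with one-sided RDMA programs, O-RMA reduces the completion time of these tasks by 30% to 63% and raises throughput by 66% to 225%.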

More and more programmable NICs and programmable switches are being deployed in cloud data centers. Much research focuses on how to make full use of the heterogeneous computing capabilities of programmable networks to provide network function services and to meet the requirements of high throughput, low latency, configurability, and high concurrency in cloud data center networks. This thesis studies four typical network functions, namely access control, traffic measurement, congestion control, and distributed network communication, which represent solutions for packets, flows, and clusters. To meet the high-throughput and high-rule-capacity requirements of access control at the cloud data center gateway, the thesis designs a new hardware acceleration solution, FACL. FACL adopts a hybrid Pipeline and RTC architecture, which can deploy a variety of decision tree algorithms and handles the challenges caused by the uncertain depth of decision trees. Evaluation on the Xilinx U50 FPGA NIC shows that FACL provides 250 Mpps of throughput and a capacity of 100K rules with less than 20% resource consumption, reducing hardware cost by 60% compared with current software solutions. To deal with insufficient on-chip storage on switches, the thesis proposes a traffic measurement algorithm with ultra-low storage overhead, ElasticCounter. ElasticCounter adopts cuckoo hashing to improve counter utilization and allows the counters in a hash bucket to adjust their bit lengths to fit the actual distribution of network traffic, providing extremely high space utilization on highly skewed traffic. In three tasks (frequency measurement, heavy-hitter detection, and heavy-change detection), the error rate of ElasticCounter is 1 to 2 orders of magnitude lower than that of other Sketch algorithms with the same storage. It consumes 3% of FPGA resources to provide a throughput of 192 Mpps, which is 18% higher than Elastic Sketch.
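As a rough illustration of the cuckoo-hashing idea mentioned for ElasticCounter, the Python sketch below keeps per-flow packet counters in a table where each flow has two candidate buckets, and a full bucket evicts a resident entry to that entry's alternate bucket. It is not the algorithm from the thesis: the class name, parameters, and hash function are assumptions made for illustration, and the adaptive counter bit lengths are not modeled.

```python
import hashlib


class CuckooCounterTable:
    """Toy per-flow counter table using two-choice cuckoo hashing.

    Hypothetical illustration only; all names and defaults are assumptions.
    """

    def __init__(self, num_buckets=64, slots_per_bucket=4, max_kicks=32):
        self.num_buckets = num_buckets
        self.slots_per_bucket = slots_per_bucket
        self.max_kicks = max_kicks
        # Each bucket maps flow_id -> packet count and holds at most
        # slots_per_bucket entries.
        self.buckets = [dict() for _ in range(num_buckets)]
        self.dropped = 0  # packets whose counter was lost after failed kicks

    def _index(self, flow_id, seed):
        data = f"{seed}:{flow_id}".encode()
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.num_buckets

    def _candidates(self, flow_id):
        return self._index(flow_id, 0), self._index(flow_id, 1)

    def add(self, flow_id, packets=1):
        """Count packets for a flow; return False if a counter was dropped."""
        i1, i2 = self._candidates(flow_id)
        # Case 1: the flow already has a counter in one of its buckets.
        for i in (i1, i2):
            if flow_id in self.buckets[i]:
                self.buckets[i][flow_id] += packets
                return True
        # Case 2: one of the candidate buckets has a free slot.
        for i in (i1, i2):
            if len(self.buckets[i]) < self.slots_per_bucket:
                self.buckets[i][flow_id] = packets
                return True
        # Case 3: both buckets are full, so evict residents cuckoo-style.
        key, count, idx = flow_id, packets, i1
        for _ in range(self.max_kicks):
            victim = next(iter(self.buckets[idx]))
            victim_count = self.buckets[idx].pop(victim)
            self.buckets[idx][key] = count
            key, count = victim, victim_count
            a, b = self._candidates(key)
            idx = b if idx == a else a  # the victim's alternate bucket
            if len(self.buckets[idx]) < self.slots_per_bucket:
                self.buckets[idx][key] = count
                return True
        self.dropped += count  # give up on the last evicted counter
        return False

    def query(self, flow_id):
        """Return the stored count for a flow, or 0 if it has no counter."""
        for i in self._candidates(flow_id):
            if flow_id in self.buckets[i]:
                return self.buckets[i][flow_id]
        return 0


if __name__ == "__main__":
    table = CuckooCounterTable()
    table.add("10.0.0.1->10.0.0.2", 3)
    table.add("10.0.0.1->10.0.0.2", 2)
    print(table.query("10.0.0.1->10.0.0.2"))
```

Giving every flow two candidate buckets is what lets such a table run at a high load factor before any counter has to be dropped, which is the property the abstract attributes to cuckoo hashing.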
Because of the high concurrency of flows in cloud data center networks, current congestion control algorithms cannot stably control the switch queue, since end-host rate adjustments are not coordinated. With the help of programmable networks, the thesis designs a congestion control scheme with end-host and network coordination, C-AQM. C-AQM employs a design of end-host-side proposal and switch-side approval/rejection. Compared with XCP, the computational complexity of C-AQM is significantly reduced, while bandwidth fairness among network flows at bottleneck links is guaranteed. Simulations on ns3 and experiments on a Tofino testbed show that C-AQM keeps the switch queue jitter within 20 KB and reduces the queuing delay by 50% to 80% compared with DCQCN, HPCC, and others. To reduce the difficulty of developing distributed applications using programmable networks, the thesis designs a new software library, O-RMA, for distributed communication tasks. O-RMA allows any host in the cluster to call pre-registered tasks in an RPC-like manner. The tasks are composed of a series of primitives extended from RDMA semantics, such as local/remote calls, asynchronous memory read and write operations, and control commands. O-RMA describes the tasks of appending logs, reading/writing atomic objects, and locking/unlocking distributed locks within 40 lines of primitives. Compared with programs using one-sided RDMA, O-RMA reduces the completion time of these tasks by 30% to 63% and improves throughput by 66% to 225%.
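The proposal/approval split described for C-AQM above can be pictured with a toy model in which end hosts propose new sending rates and the switch approves a proposal only if the aggregate rate stays within the link capacity. This is a minimal sketch under assumed semantics, not the C-AQM algorithm itself: the function name, the first-come approval order, and the capacity check are all hypothetical, and fairness is not modeled.

```python
def approve_proposals(current_rates, proposals, link_capacity):
    """Return the per-flow rates after one round of propose/approve.

    current_rates: dict flow -> current sending rate
    proposals:     dict flow -> rate the end host would like to use
    link_capacity: total rate the bottleneck link can carry

    Hypothetical rule: a decrease is always approved; an increase is
    approved only if the aggregate rate stays within link_capacity.
    """
    approved = dict(current_rates)  # copy, so the input is not mutated
    total = sum(approved.values())
    for flow, proposed in proposals.items():
        delta = proposed - approved.get(flow, 0.0)
        if delta <= 0 or total + delta <= link_capacity:
            approved[flow] = proposed
            total += delta
        # Otherwise the proposal is rejected and the flow keeps its rate.
    return approved


if __name__ == "__main__":
    rates = {"a": 40, "b": 40}
    wanted = {"a": 50, "b": 60}
    print(approve_proposals(rates, wanted, link_capacity=100))
```

In this toy version the switch only performs one addition and one comparison per proposal, which hints at why an approve/reject design can be cheaper than having the network compute each flow's rate.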