1. Overview of RoE

Two commonly recognized RDMA (Remote DMA) technologies are InfiniBand and iWARP (Internet Wide Area RDMA Protocol).

InfiniBand has been adopted with great success in a wide range of applications, starting with HPC, whereas iWARP over Ethernet has been adopted in very few applications because of implementation and deployment challenges. Recent enhancements to the Ethernet data link layer under the IEEE Data Center Bridging (DCB) umbrella create significant opportunities for the widespread use of RDMA technology in mainstream data center applications. The proposed DCB standards include IEEE 802.1Qbb Priority-based Flow Control (PFC), 802.1Qau Congestion Notification, 802.1Qaz Enhanced Transmission Selection (ETS), and DCB Capability Exchange (DCBX). The lossless delivery that PFC gives DCB Ethernet is similar to that of the InfiniBand data link layer. Building an RDMA service over PFC-based DCB Ethernet by reusing InfiniBand's native RDMA transport is therefore a natural choice. The InfiniBand Trade Association (IBTA) recently released a specification called RDMA over Converged Ethernet (RoCE, pronounced “Rocky”) that applies InfiniBand's native RDMA transport services over Ethernet. ConnectX-2 EN with RoE (RDMA over Ethernet) implements the RoCE standard to deliver InfiniBand-like ultra-low latency and high scalability over Ethernet.

2. Mechanism of ConnectX-2 RoE

ConnectX-2 EN with RoE combines InfiniBand's native RDMA transport with Ethernet according to the IBTA RoCE standard. As shown in the figure below, the InfiniBand-based layer 2 is replaced by Ethernet's layer 2, and the InfiniBand transport runs over PFC-based lossless Ethernet data links.

Figure: LLE (Low Latency Ethernet) format and protocol stack

Software interface

ConnectX-2 EN with RoE complies with the OpenFabrics Alliance (OFA) OFED verbs definition and interoperates with OFA software stacks, as InfiniBand and iWARP do. It uses the feature-rich and proven InfiniBand Verbs interface available in the OFA stack; RoCE and ConnectX-2 EN with RoE are supported starting with OFED v1.5.1.

Network layer

ConnectX-2 EN with RoE relies on the network layer based on the Global Route Header (GRH) defined by InfiniBand, and uses the functionality of this GRH-based network layer when needed. The GRH carries a GID (Global Identifier), which has the same format as an IPv6 address and can also represent IPv4 addresses.
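As a minimal sketch of how an IP or MAC address can populate a GID, the Python below builds two GID forms that are, to our understanding, common in RoCE deployments: an IPv4-mapped address (::ffff:a.b.c.d) and a link-local fe80::/64 address whose interface ID is the modified EUI-64 of the port's MAC address. The white paper does not describe either derivation; both are assumptions, and the function names are hypothetical.

```python
import ipaddress


def gid_from_ipv4(ipv4: str) -> ipaddress.IPv6Address:
    """Return the IPv4-mapped GID ::ffff:a.b.c.d (hypothetical helper)."""
    v4 = ipaddress.IPv4Address(ipv4)
    # 80 zero bits, 16 one bits, then the 32-bit IPv4 address
    return ipaddress.IPv6Address(bytes(10) + b"\xff\xff" + v4.packed)


def gid_from_mac(mac: str) -> ipaddress.IPv6Address:
    """Return a link-local GID (fe80::/64) built from a MAC address (hypothetical helper)."""
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError(f"expected a 48-bit MAC address, got {mac!r}")
    # Modified EUI-64: flip the universal/local bit and insert ff:fe in the middle
    interface_id = bytes([octets[0] ^ 0x02]) + octets[1:3] + b"\xff\xfe" + octets[3:]
    return ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + interface_id)
```

For example, `gid_from_mac("00:02:c9:01:02:03")` yields fe80::202:c9ff:fe01:203, and `gid_from_ipv4("192.168.1.10")` yields the address whose `ipv4_mapped` attribute is 192.168.1.10.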

Data link layer

At the data link layer, ConnectX-2 EN with RoE requires at least 802.1Qbb Priority-based Flow Control (PFC) or 802.3x Pause to provide standard Layer 2 Ethernet service with lossless packet delivery. Support for 802.1Qau Congestion Notification is desirable, but it is not strictly required when the network connecting servers to each other, or servers to storage, is not oversubscribed and congestion is unlikely. L2 addressing uses destination and source MAC addresses, and QoS is carried in the priority field of the 802.1Q header, consistent with 802.1Qaz (ETS) and other Ethernet features. Finally, an IEEE-assigned Ethertype identifies a packet as RoCE.
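The PFC requirement can be made concrete by looking at the frame a receiver sends to pause a single priority. The sketch below assumes the 802.1Qbb layout as we understand it: MAC Control Ethertype 0x8808, opcode 0x0101, an 8-bit priority-enable vector, and eight 16-bit pause times measured in quanta of 512 bit times. `build_pfc_frame` is a hypothetical helper, and the 4-byte FCS is left to the hardware.

```python
import struct

PFC_DEST_MAC = bytes.fromhex("0180c2000001")  # reserved MAC Control multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101                           # 802.3x PAUSE uses opcode 0x0001 instead
MIN_FRAME_LEN_NO_FCS = 60                     # 64-byte minimum frame minus 4-byte FCS


def build_pfc_frame(src_mac: bytes, pause_quanta: dict[int, int]) -> bytes:
    """Build a PFC frame that pauses each listed priority for the given number of quanta."""
    if len(src_mac) != 6:
        raise ValueError("source MAC must be 6 bytes")
    enable_vector = 0
    times = [0] * 8
    for priority, quanta in pause_quanta.items():
        if not 0 <= priority <= 7:
            raise ValueError(f"priority {priority} out of range 0-7")
        if not 0 <= quanta <= 0xFFFF:
            raise ValueError(f"pause time {quanta} does not fit in 16 bits")
        enable_vector |= 1 << priority  # bit n makes the time field for priority n valid
        times[priority] = quanta        # a time of 0 tells the sender to resume that priority
    body = struct.pack("!HHH8H", MAC_CONTROL_ETHERTYPE, PFC_OPCODE, enable_vector, *times)
    # Pad to the Ethernet minimum length; the NIC appends the FCS
    return (PFC_DEST_MAC + src_mac + body).ljust(MIN_FRAME_LEN_NO_FCS, b"\x00")
```

Pausing only the RDMA priority (say 3) with `build_pfc_frame(src, {3: 0xFFFF})` leaves the other seven priorities flowing, which is what distinguishes PFC from the link-wide 802.3x Pause.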
The following table summarizes how Ethernet's Layer 2 header fields map to the functions provided by InfiniBand's Layer 2 header fields, allowing the InfiniBand transport layer to operate seamlessly over the Ethernet data link layer.

Function | InfiniBand L2 header field | Ethernet L2 header field
--- | --- | ---
Addressing | SLID and DLID | SMAC and DMAC
Queue priority | Service Level (SL) | 802.1Q header priority
Partitioning (VLAN) | Partition Key (P_Key) | 802.1Q header VLAN ID
Congestion notification | FECN and BECN as defined by IBTA | 802.1Qau QCN
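To illustrate the table, the following sketch assembles the Ethernet header, with an 802.1Q tag, for a RoCE frame. The MAC addresses take the place of the LIDs, the 3-bit 802.1Q priority (PCP) takes the place of the SL, and the VLAN ID takes the place of the P_Key. It assumes 0x8915 as the IEEE-assigned RoCE Ethertype. How SL and P_Key values are translated into PCP and VLAN ID is left to the caller, and the helper name is hypothetical.

```python
import struct

VLAN_TPID = 0x8100
ROCE_ETHERTYPE = 0x8915  # assumed IEEE-assigned Ethertype for RoCE


def build_roce_l2_header(dmac: bytes, smac: bytes, pcp: int, vlan_id: int) -> bytes:
    """Ethernet + 802.1Q header for a RoCE frame (hypothetical helper).

    dmac, smac -> stand in for DLID, SLID (addressing)
    pcp        -> stands in for the SL    (queue priority)
    vlan_id    -> stands in for the P_Key (partitioning)
    """
    if len(dmac) != 6 or len(smac) != 6:
        raise ValueError("MAC addresses must be 6 bytes")
    if not 0 <= pcp <= 7:
        raise ValueError("802.1Q priority is a 3-bit field")
    if not 0 <= vlan_id <= 0xFFE:
        raise ValueError("VLAN ID must be 0-4094 (0xFFF is reserved)")
    # PCP in the top 3 bits, DEI left at 0, VLAN ID in the low 12 bits
    tci = (pcp << 13) | vlan_id
    return dmac + smac + struct.pack("!HHH", VLAN_TPID, tci, ROCE_ETHERTYPE)
```

The GRH and InfiniBand transport headers follow these 18 bytes unchanged, which is why the InfiniBand transport layer needs no modification.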

Converged traffic

A RoCE packet is identified by the Ethertype in its L2 header. Because packet types are distinguished at a low level of the stack, different kinds of Ethernet traffic, including RDMA traffic, can coexist on a single physical Ethernet link. ConnectX-2 EN with RoE then uses the destination queue pair number (DQPN) in the transport header to separate RDMA traffic into individual queue pairs.
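A minimal sketch of this two-step demultiplexing: dispatch first on the Ethertype, then read the destination QP number from the transport header. It assumes a RoCE frame layout of an Ethernet header (optionally 802.1Q-tagged), a 40-byte GRH, then the 12-byte InfiniBand Base Transport Header (BTH) with its 24-bit DestQP field in bytes 5 to 7. The function names, and the dictionary standing in for per-QP receive queues, are hypothetical; the adapter does this in hardware.

```python
import struct
from typing import Optional

VLAN_TPID = 0x8100
ROCE_ETHERTYPE = 0x8915  # assumed RoCE Ethertype
GRH_LEN = 40             # InfiniBand Global Route Header
BTH_LEN = 12             # InfiniBand Base Transport Header


def classify_frame(frame: bytes) -> Optional[int]:
    """Return the destination QP number of a RoCE frame, or None for any other traffic."""
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    offset = 14
    if ethertype == VLAN_TPID:  # skip the 4-byte 802.1Q tag
        (ethertype,) = struct.unpack_from("!H", frame, 16)
        offset = 18
    if ethertype != ROCE_ETHERTYPE:
        return None  # e.g. IPv4 or ARP: not RDMA
    bth = frame[offset + GRH_LEN : offset + GRH_LEN + BTH_LEN]
    if len(bth) < BTH_LEN:
        raise ValueError("truncated RoCE frame")
    return int.from_bytes(bth[5:8], "big")  # 24-bit DestQP field


def dispatch(frames: list[bytes], qp_queues: dict[int, list[bytes]]) -> list[bytes]:
    """Queue RoCE frames per destination QP; return the rest for the regular network stack."""
    other = []
    for frame in frames:
        qpn = classify_frame(frame)
        if qpn is None:
            other.append(frame)
        else:
            qp_queues.setdefault(qpn, []).append(frame)
    return other
```

Both the tagged and untagged forms of a RoCE frame land on the same QP queue, while ordinary Ethernet traffic passes through untouched on the same wire.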

Management

ConnectX-2 EN with RoE does not require an InfiniBand Subnet Manager (SM). It operates with standard Ethernet network management techniques for L2 addressing, L2 topology discovery, and switch Filtering Database (FDB) configuration (for example, spanning tree or address learning). QoS for RoCE is managed with the 802.1Qaz (ETS) Ethernet management method, and congestion management uses Ethernet's 802.1Qau congestion notification feature. PFC priorities can be configured and negotiated with PFC-enabled switches either statically, using VLANs (binding RDMA traffic to VLANs on the host and assigning those VLANs a higher PFC priority on the switch), or dynamically between NICs and switches using the DCB Exchange protocol. ConnectX-2 EN with RoE supports both PFC configuration modes. Finally, performance monitoring and baseboard and device management can be performed using standard SNMP/RMON MIBs.
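The static PFC mode (RDMA traffic bound to a VLAN priority that is made lossless) can be sketched as a small configuration model. Everything here is hypothetical: the `DcbConfig` fields are illustrative and do not correspond to any particular switch or driver tool. The sketch maps the RDMA priority to its own traffic class, gives that class an ETS bandwidth share, enables PFC only on that priority, and checks that the ETS shares total 100%.

```python
from dataclasses import dataclass, field


@dataclass
class DcbConfig:
    """Hypothetical DCB settings shared by host and switch (field names are illustrative)."""
    prio_to_tc: list[int]        # index: 802.1p priority 0-7, value: traffic class
    tc_bandwidth: dict[int, int]  # ETS: traffic class -> percent of link bandwidth
    pfc_priorities: set[int] = field(default_factory=set)  # priorities that must be lossless

    def pfc_enable_bitmap(self) -> int:
        """8-bit vector with bit n set when PFC is enabled for priority n."""
        return sum(1 << p for p in self.pfc_priorities)

    def validate(self) -> None:
        if len(self.prio_to_tc) != 8:
            raise ValueError("every one of the 8 priorities needs a traffic class")
        if any(not 0 <= p <= 7 for p in self.pfc_priorities):
            raise ValueError("PFC priorities must be in 0-7")
        missing = set(self.prio_to_tc) - set(self.tc_bandwidth)
        if missing:
            raise ValueError(f"no ETS bandwidth for traffic classes {sorted(missing)}")
        if sum(self.tc_bandwidth.values()) != 100:
            raise ValueError("ETS bandwidth shares must total 100%")


def static_rdma_config(rdma_priority: int, rdma_share: int) -> DcbConfig:
    """Static mode: RDMA traffic tagged with rdma_priority gets its own lossless traffic class."""
    if not 0 <= rdma_priority <= 7 or not 0 < rdma_share < 100:
        raise ValueError("priority must be 0-7 and the RDMA share 1-99%")
    prio_to_tc = [0] * 8            # all other traffic shares traffic class 0
    prio_to_tc[rdma_priority] = 1   # RDMA traffic gets traffic class 1
    cfg = DcbConfig(prio_to_tc, {0: 100 - rdma_share, 1: rdma_share}, {rdma_priority})
    cfg.validate()
    return cfg
```

In the dynamic mode, the same three pieces of state (priority-to-class map, ETS shares, PFC enable bits) would be exchanged between NIC and switch via DCBX instead of being set by hand on both ends.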
The following table summarizes how the network management functions expected by the InfiniBand transport layer, and by the applications that use it, can be implemented seamlessly with standard Ethernet management techniques, eliminating the need for an InfiniBand subnet manager. Data center IT administrators can manage ConnectX-2 EN with RoE with familiar Ethernet-based tools, just like any other Ethernet technology, and can easily deploy it in their data centers.

Management function required by the IB transport layer and its applications | Implementation in an InfiniBand subnet | Implementation with standard Ethernet management techniques
--- | --- | ---
L2 addressing | L2 addressing by the Subnet Manager | Fixed L2 address or other Ethernet mechanisms
L2 topology discovery and switch FDB configuration | Topology discovery with directed-route subnet management packets (SMPs) by the Subnet Manager; path computation and distribution by the Subnet Manager | Spanning tree and learning; also IETF Transparent Interconnection of Lots of Links (TRILL) and other Ethernet techniques
QoS | QoS Manager extending the Subnet Manager | Standard Ethernet QoS management methods; local API to access fabric policy settings
Congestion management | IB congestion management | 802.1Qau congestion management techniques
Performance management | IB Performance Manager | SNMP/RMON MIBs
Device/baseboard management | IB Baseboard Manager | SNMP/RMON MIBs

Based on the IBTA RoCE specification, the ConnectX-2 EN with RoE adapter is being released today by Mellanox Technologies and has been demonstrated to achieve end-to-end application-level latency as low as 1.3 us (microseconds). Mellanox and other leading companies are working together to grow an ecosystem of independent software vendor (ISV) applications that take full advantage of RoCE-based adapters and ConnectX-2. Target applications include financial services, business intelligence, data warehousing, cloud computing, and Web 2.0.

3. Advantages of ConnectX-2 EN with RoE

As described above, ConnectX-2 EN with RoE offers many benefits that will help bring RDMA technology into mainstream data center applications.

• RoE-enabled ConnectX-2 EN takes advantage of advances in Ethernet (DCB) to enable a low-cost, efficient implementation of RDMA over Ethernet.

• In ConnectX-2 EN, RDMA traffic is classified at the data link layer, which is faster and requires less CPU overhead.

• ConnectX-2 EN with RoE achieves application-to-application latency as low as 1.3 us (microseconds), 10x lower than other industry-standard implementations over Ethernet. Financial services applications such as capital markets data processing and trading show latency reductions of more than 60%.

• RoE-enabled ConnectX-2 EN supports the full set of RDMA and low-latency features: reliable connection service, datagram service, RDMA and send/receive semantics, atomic operations, user-level multicast, user-level I/O access, kernel bypass, and zero copy.

• The OFA verbs used by ConnectX-2 EN with RoE are based on InfiniBand and have been proven at scale in multiple ISV applications in both the HPC and EDC spaces.

• Network management for RoE-enabled ConnectX-2 EN is similar to that of other Ethernet and DCB-based networks, so IT administrators do not need to learn new techniques.
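The atomic operations listed above follow InfiniBand semantics: Compare & Swap and Fetch & Add act on a 64-bit value in the remote buffer and return the value it held before the operation. The sketch below only simulates that responder-side behavior on a local bytearray. `MemoryRegion` is a hypothetical stand-in, no RDMA takes place, and the little-endian byte order is an arbitrary choice.

```python
import struct

MASK64 = (1 << 64) - 1


class MemoryRegion:
    """Local stand-in for a registered remote buffer (hypothetical; no RDMA involved)."""

    def __init__(self, size: int):
        self.buf = bytearray(size)

    def _check(self, offset: int) -> None:
        # Atomic targets are 64-bit values at 8-byte-aligned offsets
        if offset % 8 or not 0 <= offset <= len(self.buf) - 8:
            raise ValueError("atomic target must be an in-bounds, 8-byte-aligned offset")

    def _read(self, offset: int) -> int:
        return struct.unpack_from("<Q", self.buf, offset)[0]

    def _write(self, offset: int, value: int) -> None:
        struct.pack_into("<Q", self.buf, offset, value & MASK64)

    def compare_and_swap(self, offset: int, compare: int, swap: int) -> int:
        """Store swap only if the current value equals compare; return the original value."""
        self._check(offset)
        original = self._read(offset)
        if original == compare:
            self._write(offset, swap)
        return original

    def fetch_and_add(self, offset: int, add: int) -> int:
        """Add to the current value modulo 2**64; return the original value."""
        self._check(offset)
        original = self._read(offset)
        self._write(offset, original + add)
        return original
```

Returning the original value is what lets a requester tell whether its Compare & Swap succeeded, for example when acquiring a lock held in remote memory.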

The above content is a translation of Mellanox's white paper (WP_ConnectX-2_EN_with_ROE.pdf). The original can be found at www.mellanox.com > Products > Ethernet cards > ConnectX-2 EN.