Source: Network Allies Blog

How to Achieve Proper PCIe Packet Accelerator Optimization

More network traffic, more packets, closer inspection of each one - it all adds up to more computation than network systems designers are typically able to squeeze out of even the most multi of multi-core general-purpose processors.

The obvious solution: harness a special-purpose accelerator card. Ready to go right off the shelf, these accelerators use specialized silicon to speed up all sorts of packet-related tasks, such as pattern matching, cryptography, and compression. But even with a zippy multi-lane PCIe interface connecting them to the CPU, accelerator cards won't deliver their maximum punch without special attention. How they exchange data with the host is important, and so is how those two processors divide their labors, or workload.

Managing Communications

There are three ways of speeding communications across a PCIe bus:

- opting for direct memory access (DMA) transfers instead of programmed I/O (PIO);
- using Writes instead of Reads;
- and making efficient use of interrupt requests (IRQs).

Here's why: While DMA is best known as an efficient technique for moving data - i.e., it uses relatively few computing cycles - it can also maximize the usage of a PCIe bus. That's because DMA engines, as they're called, operate on 32 bytes of data at once - and sometimes much more - compared to the 8-byte maximum of a 64-bit CPU's load/store instruction. When large volumes of data need transferring, this difference translates into much lower transaction overhead for DMA compared to PIO: at least 4 times less, and possibly 8 or 16 times.

As for Writing vs. Reading data, consider this: When one device on a PCIe bus Writes data to another device, all intermediate bridge devices along the communications path are permitted to buffer the data being written.
And that enables multiple Writes to take place concurrently through any given bridge. A Read, in contrast, requires that all those intermediate bridge devices interlock, thus creating a clear and continuous communications path from target to requestor. But during this period of interlock, no other Reads can take place, which limits the bus' effective bandwidth.

Finally, designers need to manage the significant overhead that processor interrupts always incur - each interrupt, after all, triggers two unavoidable context switches. The main strategy here is to reduce the actual frequency of interrupts, which can be done by bundling, or "coalescing," those that occur around the same moment and by changing how interrupts get used.

Divvying up the Workload

Accelerators often have time to spare, and the clever designer will use that capacity to offload certain pre- and post-processing tasks, such as checksum calculation, from the heavily taxed CPU.

Likewise, formatting data to match the specific internal needs of the accelerator and CPU can greatly improve their sharing of data. With many internal fields, Ethernet packets look quite different from the strictly bounded 64-bit data that Pentium processors like to consume, but with a small amount of juggling, that gap can be bridged and a system's combined performance greatly improved.

Accelerator cards are a boon for all designers, but take our advice: Use them wisely.
