This is the benchmark tool of the iip TCP/IP stack.
WARNING: Several commands described in this README need root permission (sudo). Please conduct the following procedure only when you understand what you are doing. The authors will not bear any responsibility if the implementations provided by the authors cause any problems.
Please first download the source code of this repository.
git clone https://github.com/yasukata/bench-iip.git
Afterward, please enter the bench-iip directory.
cd bench-iip
Then, please download the source code of iip and the I/O backend.
git clone https://github.com/yasukata/iip.git
git clone https://github.com/yasukata/iip-dpdk.git
Please type the following command to build the application.
IOSUB_DIR=./iip-dpdk make
The command above will download the source code of DPDK and compile it with the files in this repository, ./iip and ./iip-dpdk. The DPDK source code will be downloaded to ./iip-dpdk/dpdk/dpdk-VERSION.tar.xz, and after the compilation, it will be installed in ./iip-dpdk/dpdk/install.
So, the DPDK installation itself does not require root permission (though we do need root permission to run this benchmark tool using DPDK).
For details, please refer to https://github.com/yasukata/iip-dpdk/blob/master/build.mk.
After the compilation finishes, we supposedly see an application file named a.out.
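As a quick sanity check (not part of the original procedure), the following lists the generated binary and the installed DPDK libraries.
ls -l ./a.out
ls ./iip-dpdk/dpdk/install/lib/x86_64-linux-gnu | head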
Before starting to run the benchmark tool, we need to setup huge pages for DPDK.
The following command configures 2GB of huge pages with 2MB granularity.
NOTE: If your system already has huge pages, you do not need to execute the following command.
sudo ./iip-dpdk/dpdk/dpdk-23.07/usertools/dpdk-hugepages.py -p 2M -r 2G
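To confirm that the huge pages are actually reserved, you can inspect /proc/meminfo (an optional check; the exact counters depend on your kernel configuration).
grep -i huge /proc/meminfo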
For quick testing, please use the following command to launch this benchmark tool as the server; this program works as a client when the -s argument specifies the remote server address, and otherwise it works as a server.
sudo LD_LIBRARY_PATH=./iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -l 0 --proc-type=primary --file-prefix=pmd1 --vdev=net_tap,iface=tap001 --no-pci -- -a 0,10.100.0.20 -- -p 10000 -m "`echo -e 'HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: keep-alive\r\n\r\nAA'`"
Now, the benchmark tool has started.
Please open another console (terminal) and type the following commands to associate the tap device created by DPDK with a Linux bridge.
sudo brctl addbr br000
sudo ifconfig br000 10.100.0.10 up
sudo brctl addif br000 tap001
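If brctl (bridge-utils) is not available on your system, an equivalent setup with iproute2 should look roughly like the following; this is a sketch and not part of the original instructions.
sudo ip link add name br000 type bridge
sudo ip addr add 10.100.0.10/24 dev br000
sudo ip link set br000 up
sudo ip link set tap001 master br000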
Then, let's first try the ping command to communicate with the benchmark tool server; supposedly, we will get replies.
ping 10.100.0.20
Afterward, let's try telnet.
telnet 10.100.0.20 10000
Please type GET in the telnet console; then, we will get the following output (this is the message specified by the command above).
HTTP/1.1 200 OK
Content-Length: 2
Connection: keep-alive

AA
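Because the configured message is a complete HTTP response, an ordinary HTTP client may also work for this quick test; assuming the server replies to the received request just as in the telnet test, the following should print the two-byte body AA.
curl -v http://10.100.0.20:10000/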
You can exit from telnet by pressing Ctrl and the ] key together and then entering q, as follows.
Escape character is '^]'.
^]
telnet> q
Connection closed.
The options of this benchmark tool consist of three sections divided by --: 1) the first is for DPDK (passed to rte_eal_init), 2) the second is for the application-specific DPDK settings, and 3) the last is for the benchmark tool itself. In the example above, the DPDK options are as follows (an annotated command follows the option lists below).
- -l 0 specifies the CPU core to be used; in this case, CPU core 0.
- --proc-type=primary specifies the proc type (maybe not necessary).
- --file-prefix=pmd1 is for the namespace issue (maybe not necessary).
- --vdev=net_tap,iface=tap001 associates a newly created tap device named tap001 with DPDK.
- --no-pci requests DPDK not to look up PCI NIC devices.
The options for the second section (application-specific DPDK settings) are:
- -a: specifies the IP address for a DPDK port; the format is PORT_NUM,IP_ADDR (e.g., 0,10.100.0.10 configures 10.100.0.10 for port 0).
- -e: specifies the max timeout value (in milliseconds) passed to epoll_wait; when 0 is specified, the interrupt-based mode is not activated and epoll_wait will not be called (the default value is 0).
The options for the third section (the benchmark tool) are:
- -c: concurrency (for the client mode).
- -d: io depth (for the client mode).
- -g: mode (1: ping-pong, 2: burst); in the burst mode, a TCP client sends data when it receives a TCP ack, and a TCP server ignores incoming data.
- -l: payload length (if -m is not specified).
- -m: a string to be sent as the payload.
- -n: protocol number, either 6 (TCP) or 17 (UDP) (for the client mode); the default is TCP.
- -p: the server port (to listen on for the server mode, to connect to for the client mode).
- -r: targeted throughput rate (requests/sec) for each thread (for the client mode).
- -s: the server IP address to connect to (for the client mode).
- -t: duration of the experiment in seconds (0 means infinite).
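To make the three sections easier to see, here is a sketch of a server invocation for the same tap-device setup as above; the third-section values -l 64 (payload length) and -g 1 (ping-pong mode) are illustrative and not taken from the original example.
# section 1: DPDK options (passed to rte_eal_init)
#   -l 0 --proc-type=primary --file-prefix=pmd1 --vdev=net_tap,iface=tap001 --no-pci
# section 2: application-specific DPDK settings (after the first --)
#   -a 0,10.100.0.20
# section 3: benchmark tool options (after the second --)
#   -p 10000 -l 64 -g 1
sudo LD_LIBRARY_PATH=./iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out \
  -l 0 --proc-type=primary --file-prefix=pmd1 --vdev=net_tap,iface=tap001 --no-pci \
  -- -a 0,10.100.0.20 \
  -- -p 10000 -l 64 -g 1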
The command above uses the tap device, a virtual network interface, primarily for quick testing.
If you have an extra physical NIC that is not in use, you can also test this benchmark tool with it.
NOTE: It is not recommended to use your primary physical NIC for testing this benchmark tool because the tool fully occupies the physical NIC and you will lose the connections to other hosts previously established over it.
Before starting the physical NIC setup, please remove br000, which was made in the previous experiment using the tap device.
sudo ifconfig br000 down
sudo brctl delbr br000
To use DPDK, we need to associate a physical NIC with a driver named vfio-pci.
NOTE: you do not need to go through this step if you use Mellanox NICs; please note that here we list the commands just to show how to use NICs from other vendors.
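If the vfio-pci driver is not loaded on your system yet, it supposedly needs to be loaded first; this is an assumption about your environment, and the module ships with the stock Linux kernel.
sudo modprobe vfio-pci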
First, please type the following command.
./iip-dpdk/dpdk/dpdk-23.07/usertools/dpdk-devbind.py -s
Supposedly, we will see this kind of output.
Network devices using kernel driver
===================================
0000:17:00.0 'MT28800 Family [ConnectX-5 Ex] 1019' if=enp23s0f0np0 drv=mlx5_core unused=
0000:17:00.1 'MT28800 Family [ConnectX-5 Ex] 1019' if=enp23s0f1np1 drv=mlx5_core unused=
In this example, we bind a NIC identified by the PCI id 0000:17:00.0 by the following command.
sudo ./iip-dpdk/dpdk/dpdk-23.07/usertools/dpdk-devbind.py -b vfio-pci 0000:17:00.0
The following executes the benchmark tool as the server and binds it with the NIC 0000:17:00.0.
sudo LD_LIBRARY_PATH=./iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -l 0 --proc-type=primary --file-prefix=pmd1 --allow 17:00.0 -- -a 0,10.100.0.20 -- -p 10000 -m "`echo -e 'HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: keep-alive\r\n\r\nAA'`"
The following runs the benchmark tool as the client using the NIC 0000:17:00.0.
sudo LD_LIBRARY_PATH=./iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -l 0 --proc-type=primary --file-prefix=pmd1 --allow 17:00.0 -- -a 0,10.100.0.10 -- -s 10.100.0.20 -p 10000 -m "GET " -c 1
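After the experiment, the NIC can supposedly be returned to its original kernel driver with the same script; mlx5_core below is taken from the example output above, and the right driver name depends on your NIC.
sudo ./iip-dpdk/dpdk/dpdk-23.07/usertools/dpdk-devbind.py -u 0000:17:00.0
sudo ./iip-dpdk/dpdk/dpdk-23.07/usertools/dpdk-devbind.py -b mlx5_core 0000:17:00.0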
Here, we show some rough numbers obtained from this benchmark tool.
The two machines have the same configuration:
- CPU: two 16-core Intel(R) Xeon(R) Gold 6326 CPUs @ 2.90GHz (32 cores in total)
- NIC: Mellanox ConnectX-5 100 Gbps NIC (the NICs of the two machines are directly connected via a cable)
- OS: Linux 6.2
- iip: e423db4bee7c75d028a5f5ae0cb3a4a249caa940
- iip-dpdk: b493a944c13135c38766003606e14d51ca61fc71
- bench-iip: a93f859d2ca35a93b8891cba14f9d8d2eacea17f
- client (iip and DPDK)
cnt=0; while [ $cnt -le 32 ]; do sudo LD_LIBRARY_PATH=./iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -n 2 -l 0-31 --proc-type=primary --file-prefix=pmd1 --allow 17:00.0 -- -a 0,10.100.0.10 -- -s 10.100.0.20 -p $((10000+$cnt)) -g 1 -l 1 -t 5 -c $(($cnt == 0 ? 1 : $cnt)) 2>&1 | tee -a ./result.txt; cnt=$(($cnt+2)); done
- server (iip and DPDK)
cnt=0; while [ $cnt -le 32 ]; do sudo LD_LIBRARY_PATH=./iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -n 2 -l 0-$(($cnt == 0 ? 0 : $(($cnt-1)))) --proc-type=primary --file-prefix=pmd1 --allow 17:00.0 -- -a 0,10.100.0.20 -- -p $((10000+$cnt)) -g 1 -l 1; cnt=$(($cnt+2)); done
please click here to show the changes made for disabling zero-copy mode
iip-dpdk/main.c
--- a/main.c
+++ b/main.c
@@ -684,6 +684,7 @@ static int __iosub_main(int argc, char *const *argv)
printf("ok (max lro pkt size %u)\n", nic_conf[portid].rxmode.max_lro_pkt_size);
} else printf("no\n");
}
+#if 0
{
printf("TX multi-seg: ");
if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_MULTI_SEGS) {
@@ -691,6 +692,7 @@ static int __iosub_main(int argc, char *const *argv)
printf("ok\n");
} else printf("no\n");
}
+#endif
{
printf("TX IPv4 checksum: ");
if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_IPV4_CKSUM) {
please click here to show the changes made for disabling the checksum offload feature of the NIC
iip-dpdk/main.c
--- a/main.c
+++ b/main.c
@@ -669,6 +669,7 @@ static int __iosub_main(int argc, char *const *argv)
printf("ok (nic feature %lx udp-rss-all %lx)\n", dev_info.flow_type_rss_offloads, RTE_ETH_RSS_TCP);
} else printf("no\n"); /* TODO: software-based RSS */
}
+#if 0
{
printf("RX checksum: ");
if (dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_CHECKSUM) {
@@ -676,6 +677,7 @@ static int __iosub_main(int argc, char *const *argv)
printf("ok\n");
} else printf("no\n");
}
+#endif
{
printf("RX LRO: ");
if (dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_TCP_LRO) {
@@ -691,6 +693,7 @@ static int __iosub_main(int argc, char *const *argv)
printf("ok\n");
} else printf("no\n");
}
+#if 0
{
printf("TX IPv4 checksum: ");
if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_IPV4_CKSUM) {
@@ -705,6 +708,7 @@ static int __iosub_main(int argc, char *const *argv)
printf("ok\n");
} else printf("no\n");
}
+#endif
{
printf("TX TCP TSO: ");
if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_TCP_TSO) {
- server (Linux)
please click here to show the code of the program
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <assert.h>
#include <signal.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <pthread.h>
#include <numa.h>
#define MAX_CORE (256)
static unsigned int should_stop = 0;
static unsigned int mode = 0;
static unsigned short payload_len = 1;
static char payload_buf[0xffff] = { 0 };
struct monitor_data {
unsigned short idx;
struct {
unsigned long rx_cnt;
unsigned long tx_cnt;
unsigned long rx_bytes;
unsigned long tx_bytes;
} counter[2];
};
static struct monitor_data *monitor[MAX_CORE] = { 0 };
static void sig_h(int s __attribute__((unused)))
{
should_stop = 1;
signal(SIGINT, SIG_DFL);
}
static void *server_thread(void *data)
{
printf("core %lu : server fd %lu\n", (((unsigned long) data) & 0xffffffff), (((unsigned long) data) >> 32));
{
cpu_set_t c;
CPU_ZERO(&c);
CPU_SET((((unsigned long) data) & 0xffffffff), &c);
pthread_setaffinity_np(pthread_self(), sizeof(c), &c);
}
{
struct monitor_data *mon;
assert((mon = numa_alloc_local(sizeof(struct monitor_data))) != NULL);
memset(mon, 0, sizeof(struct monitor_data));
monitor[(((unsigned long) data) & 0xffffffff)] = mon;
{
int epfd;
assert((epfd = epoll_create1(EPOLL_CLOEXEC)) != -1);
{
struct epoll_event ev = {
.events = EPOLLIN,
.data.fd = (((unsigned long) data) >> 32),
};
assert(!epoll_ctl(epfd, EPOLL_CTL_ADD, ev.data.fd, &ev));
}
while (!should_stop) {
struct epoll_event ev[64];
int nfd = epoll_wait(epfd, ev, 64, 100);
{
int i;
for (i = 0; i < nfd; i++) {
if ((unsigned long) ev[i].data.fd == (((unsigned long) data) >> 32)) {
while (1) {
struct sockaddr_in sin;
socklen_t addrlen = sizeof(sin); /* value-result argument: must be initialized before accept() */
{
struct epoll_event _ev = {
.events = EPOLLIN,
.data.fd = accept(ev[i].data.fd, (struct sockaddr *) &sin, &addrlen),
};
if (_ev.data.fd == -1) {
assert(errno == EAGAIN);
break;
}
assert(!epoll_ctl(epfd, EPOLL_CTL_ADD, _ev.data.fd, &_ev));
}
}
} else {
char buf[0x10000];
ssize_t rx = read(ev[i].data.fd, buf, sizeof(buf));
if (rx <= 0)
close(ev[i].data.fd);
else {
mon->counter[mon->idx].rx_bytes += rx;
mon->counter[mon->idx].rx_cnt++;
if (mode == 1 /* ping-pong */) {
assert(write(ev[i].data.fd, payload_buf, payload_len) == payload_len);
mon->counter[mon->idx].tx_bytes += payload_len;
mon->counter[mon->idx].tx_cnt++;
}
}
}
}
}
}
close(epfd);
}
numa_free(mon, sizeof(struct monitor_data));
}
pthread_exit(NULL);
}
static void *remote_stop_thread(void *data)
{
int *ready = (int *) data, fd;
assert((fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)) != -1);
{
int v = 1;
assert(!setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &v, sizeof(v)));
}
{
int v = 1;
assert(!setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &v, sizeof(v)));
}
{
int v = 1;
assert(!ioctl(fd, FIONBIO, &v));
}
{
struct sockaddr_in sin = {
.sin_family = AF_INET,
.sin_addr.s_addr = htonl(INADDR_ANY),
.sin_port = htons(50000 /* remote shutdown */),
};
assert(!bind(fd, (struct sockaddr *) &sin, sizeof(sin)));
}
assert(!listen(fd, SOMAXCONN));
asm volatile ("" ::: "memory");
*ready = 1;
while (!should_stop) {
fd_set fds;
FD_ZERO(&fds);
FD_SET(fd, &fds);
{
struct timeval tv = { .tv_sec = 1, };
if (0 < select(fd + 1, &fds, NULL, NULL, &tv)) {
struct sockaddr_in sin;
socklen_t addrlen = sizeof(sin); /* value-result argument: must be initialized before accept() */
{
int newfd = accept(fd, (struct sockaddr *) &sin, &addrlen);
if (0 < newfd)
close(newfd);
printf("close requested\n");
}
sig_h(0);
}
}
}
close(fd);
pthread_exit(NULL);
}
int main(int argc, char *const *argv)
{
unsigned short port = 0, num_cores = 0, core_list[MAX_CORE] = { 0 };
pthread_t remote_stop_th;
{
int ready = 0;
assert(!pthread_create(&remote_stop_th, NULL, remote_stop_thread, &ready));
while (!ready) usleep(10000);
}
{
int ch;
while ((ch = getopt(argc, argv, "c:g:l:p:")) != -1) {
switch (ch) {
case 'c':
{
ssize_t num_comma = 0, num_hyphen = 0;
{
size_t i;
for (i = 0; i < strlen(optarg); i++) {
switch (optarg[i]) {
case ',':
num_comma++;
break;
case '-':
num_hyphen++;
break;
}
}
}
if (num_hyphen) {
assert(num_hyphen == 1);
assert(!num_comma);
{
char *m;
assert((m = strdup(optarg)) != NULL);
{
size_t i;
for (i = 0; i < strlen(optarg); i++) {
if (m[i] == '-') {
m[i] = '\0';
break;
}
}
assert(i != strlen(optarg) - 1 && i != strlen(optarg));
{
uint16_t from = atoi(&m[0]), to = atoi(&m[i + 1]);
assert(from <= to);
{
uint16_t j, k;
for (j = 0, k = from; k <= to; j++, k++)
core_list[j] = k;
num_cores = j;
}
}
}
free(m);
}
} else if (num_comma) {
assert(num_comma + 1 < MAX_CORE);
{
char *m;
assert((m = strdup(optarg)) != NULL);
{
size_t i, j, k;
for (i = 0, j = 0, k = 0; i < strlen(optarg) + 1; i++) {
if (i == strlen(optarg) || m[i] == ',') {
m[i] = '\0';
if (j != i)
core_list[k++] = atoi(&m[j]);
j = i + 1;
}
if (i == strlen(optarg))
break;
}
assert(k);
num_cores = k;
}
free(m);
}
} else {
core_list[0] = atoi(optarg);
num_cores = 1;
}
}
break;
case 'g':
mode = atoi(optarg);
break;
case 'l':
payload_len = atoi(optarg);
break;
case 'p':
port = atoi(optarg);
break;
default:
assert(0);
break;
}
}
}
if (!num_cores) {
printf("please specify cores : -c\n");
exit(0);
}
if (!port) {
printf("please specify port number : -p\n");
exit(0);
}
printf("start server with %u cores: ", num_cores);
{
uint16_t i;
for (i = 0; i < num_cores; i++)
printf("%u ", core_list[i]);
}
printf("\n");
printf("listen on port %u\n", port);
printf("payload len %u\n", payload_len);
fflush(stdout);
switch (mode) {
case 1:
case 2:
break;
default:
printf("please specify a mode 1 ping-pong 2 burst : -g\n");
exit(0);
break;
}
{
int fd;
assert((fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)) != -1);
{
int v = 1;
assert(!setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &v, sizeof(v)));
}
{
int v = 1;
assert(!setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &v, sizeof(v)));
}
{
int v = 1;
assert(!setsockopt(fd, SOL_TCP, TCP_NODELAY, &v, sizeof(v)));
}
{
int v = 1;
assert(!ioctl(fd, FIONBIO, &v));
}
{
struct sockaddr_in sin = {
.sin_family = AF_INET,
.sin_addr.s_addr = htonl(INADDR_ANY),
.sin_port = htons(port),
};
assert(!bind(fd, (struct sockaddr *) &sin, sizeof(sin)));
}
assert(!listen(fd, SOMAXCONN));
signal(SIGINT, sig_h);
{
pthread_t *th;
assert((th = calloc(num_cores, sizeof(pthread_t))) != NULL);
{
unsigned short i;
for (i = 0; i < num_cores; i++)
assert(!pthread_create(&th[i], NULL, server_thread, (void *)(((unsigned long) fd << 32) | ((unsigned long) core_list[i] & 0xffffffff))));
}
while (!should_stop) {
{
unsigned short i;
for (i = 0; i < num_cores; i++) {
if (monitor[i]) {
if (monitor[i]->idx)
monitor[i]->idx = 0;
else
monitor[i]->idx = 1;
}
}
}
{
unsigned long rx_bytes = 0, rx_cnt = 0, tx_bytes = 0, tx_cnt = 0;
{
unsigned short i;
for (i = 0; i < num_cores; i++) {
if (monitor[i]) {
unsigned short idx = (monitor[i]->idx ? 0 : 1);
if (monitor[i]->counter[idx].rx_cnt || monitor[i]->counter[idx].tx_cnt) {
printf("[%u] payload: rx %lu Mbps (%lu read), tx %lu Mbps (%lu write)\n",
i,
monitor[i]->counter[idx].rx_bytes / 125000UL,
monitor[i]->counter[idx].rx_cnt,
monitor[i]->counter[idx].tx_bytes / 125000UL,
monitor[i]->counter[idx].tx_cnt
); fflush(stdout);
rx_bytes += monitor[i]->counter[idx].rx_bytes;
tx_bytes += monitor[i]->counter[idx].tx_bytes;
rx_cnt += monitor[i]->counter[idx].rx_cnt;
tx_cnt += monitor[i]->counter[idx].tx_cnt;
}
memset(&monitor[i]->counter[idx], 0, sizeof(monitor[i]->counter[idx]));
}
}
}
printf("paylaod total: rx %lu Mbps (%lu pps), tx %lu Mbps (%lu pps)\n",
rx_bytes / 125000UL,
rx_cnt,
tx_bytes / 125000UL,
tx_cnt
); fflush(stdout);
}
sleep(1);
}
{
unsigned short i;
for (i = 0; i < num_cores; i++)
assert(!pthread_join(th[i], NULL));
}
free(th);
}
close(fd);
}
pthread_join(remote_stop_th, NULL);
printf("done.\n");
return 0;
}
The following compiles the program above (program_above.c) and generates an executable file app.
gcc -Werror -Wextra -Wall -O3 program_above.c -lpthread -lnuma -o app
Then, the following executes the compiled program app.
ulimit -n unlimited; cnt=0; while [ $cnt -le 32 ]; do ./app -p $((10000+$cnt)) -c 0-$(($cnt == 0 ? 0 : $(($cnt-1)))) -g 1 -l 1; cnt=$(($cnt+2)); done
- results:
- client (iip and DPDK)
cnt=0; while [ $cnt -le 32 ]; do sudo LD_LIBRARY_PATH=./iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -n 2 -l 0-31 --proc-type=primary --file-prefix=pmd1 --allow 17:00.0 -- -a 0,10.100.0.10 -- -s 10.100.0.20 -p 10000 -g 1 -t 5 -c $(($cnt == 0 ? 1 : $cnt)) -d 1 -l 1 2>&1 | tee -a ./result.txt; cnt=$(($cnt+2)); done
- server (iip and DPDK)
cnt=0; while [ $cnt -le 32 ]; do sudo LD_LIBRARY_PATH=./iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -n 2 -l 0-31 --proc-type=primary --file-prefix=pmd1 --allow 17:00.0 -- -a 0,10.100.0.20 -- -p 10000 -g 1 -l 1; cnt=$(($cnt+2)); done
- server (Linux)
ulimit -n unlimited; cnt=0; while [ $cnt -le 32 ]; do ./app -p 10000 -c 0-31 -g 1 -l 1; cnt=$(($cnt+2)); done
- results:
- receiver
cnt=0; while [ $cnt -le 20 ]; do sudo LD_LIBRARY_PATH=./iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -n 1 -l 0 --proc-type=primary --file-prefix=pmd1 --allow 17:00.0 -- -a 0,10.100.0.20 -- -p 10000 -g 2; cnt=$(($cnt+1)); done
- sender
cnt=0; while [ $cnt -le 20 ]; do sudo LD_LIBRARY_PATH=./iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -n 1 -l 0 --proc-type=primary --file-prefix=pmd1 --allow 17:00.0 -- -a 0,10.100.0.10 -- -s 10.100.0.20 -p 10000 -g 2 -t 5 -c 1 -d 3 -l $((63488+63488*$cnt*32)) 1 2>&1 | tee -a ./result.txt; cnt=$(($cnt+1)); done
- note
changes made to disable zero-copy transmission on the sender side
--- a/main.c
+++ b/main.c
@@ -684,6 +684,7 @@ static int __iosub_main(int argc, char *const *argv)
printf("ok (max lro pkt size %u)\n", nic_conf[portid].rxmode.max_lro_pkt_size);
} else printf("no\n");
}
+#if 0
{
printf("TX multi-seg: ");
if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_MULTI_SEGS) {
@@ -691,6 +692,7 @@ static int __iosub_main(int argc, char *const *argv)
printf("ok\n");
} else printf("no\n");
}
+#endif
{
printf("TX IPv4 checksum: ");
if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_IPV4_CKSUM) {
changes made to disable TSO on the sender side
--- a/main.c
+++ b/main.c
@@ -39,7 +39,7 @@
#include <rte_bus_pci.h>
#include <rte_thash.h>
-#define NUM_RX_DESC (128)
+#define NUM_RX_DESC (2048)
#define NUM_TX_DESC NUM_RX_DESC
#define NUM_NETSTACK_PB (8192)
#define NUM_NETSTACK_TCP_CONN (512)
@@ -705,6 +705,7 @@ static int __iosub_main(int argc, char *const *argv)
printf("ok\n");
} else printf("no\n");
}
+#if 0
{
printf("TX TCP TSO: ");
if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_TCP_TSO) {
@@ -712,6 +713,7 @@ static int __iosub_main(int argc, char *const *argv)
printf("ok\n");
} else printf("no\n");
}
+#endif
{
printf("TX UDP checksum: ");
if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_UDP_CKSUM) {
changes made to disable TSO and checksum offload on the sender side
--- a/main.c
+++ b/main.c
@@ -39,7 +39,7 @@
#include <rte_bus_pci.h>
#include <rte_thash.h>
-#define NUM_RX_DESC (128)
+#define NUM_RX_DESC (2048)
#define NUM_TX_DESC NUM_RX_DESC
#define NUM_NETSTACK_PB (8192)
#define NUM_NETSTACK_TCP_CONN (512)
@@ -669,6 +669,7 @@ static int __iosub_main(int argc, char *const *argv)
printf("ok (nic feature %lx udp-rss-all %lx)\n", dev_info.flow_type_rss_offloads, RTE_ETH_RSS_TCP);
} else printf("no\n"); /* TODO: software-based RSS */
}
+#if 0
{
printf("RX checksum: ");
if (dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_CHECKSUM) {
@@ -676,6 +677,7 @@ static int __iosub_main(int argc, char *const *argv)
printf("ok\n");
} else printf("no\n");
}
+#endif
{
printf("RX LRO: ");
if (dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_TCP_LRO) {
@@ -691,6 +693,7 @@ static int __iosub_main(int argc, char *const *argv)
printf("ok\n");
} else printf("no\n");
}
+#if 0
{
printf("TX IPv4 checksum: ");
if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_IPV4_CKSUM) {
@@ -712,6 +715,7 @@ static int __iosub_main(int argc, char *const *argv)
printf("ok\n");
} else printf("no\n");
}
+#endif
{
printf("TX UDP checksum: ");
if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_UDP_CKSUM) {
changes made to disable LRO on the receiver side
--- a/main.c
+++ b/main.c
@@ -676,6 +676,7 @@ static int __iosub_main(int argc, char *const *argv)
printf("ok\n");
} else printf("no\n");
}
+#if 0
{
printf("RX LRO: ");
if (dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_TCP_LRO) {
@@ -684,6 +685,7 @@ static int __iosub_main(int argc, char *const *argv)
printf("ok (max lro pkt size %u)\n", nic_conf[portid].rxmode.max_lro_pkt_size);
} else printf("no\n");
}
+#endif
{
printf("TX multi-seg: ");
if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_MULTI_SEGS) {
changes made to disable checksum offload on the receiver side
--- a/main.c
+++ b/main.c
@@ -669,6 +669,7 @@ static int __iosub_main(int argc, char *const *argv)
printf("ok (nic feature %lx udp-rss-all %lx)\n", dev_info.flow_type_rss_offloads, RTE_ETH_RSS_TCP);
} else printf("no\n"); /* TODO: software-based RSS */
}
+#if 0
{
printf("RX checksum: ");
if (dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_CHECKSUM) {
@@ -676,6 +677,7 @@ static int __iosub_main(int argc, char *const *argv)
printf("ok\n");
} else printf("no\n");
}
+#endif
{
printf("RX LRO: ");
if (dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_TCP_LRO) {
@@ -691,6 +693,7 @@ static int __iosub_main(int argc, char *const *argv)
printf("ok\n");
} else printf("no\n");
}
+#if 0
{
printf("TX IPv4 checksum: ");
if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_IPV4_CKSUM) {
@@ -705,6 +708,7 @@ static int __iosub_main(int argc, char *const *argv)
printf("ok\n");
} else printf("no\n");
}
+#endif
{
printf("TX TCP TSO: ");
if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_TCP_TSO) {
- results:
Potentially, there are three CPU core assignment models which we call split, merge, and unified, respectively.
- The split model runs the networking logic and the application logic on two different threads, and dedicates a CPU core to each of the threads.
- The merge model runs the networking logic and the application logic on two different threads similarly to the first model, but executes the two threads on the same CPU core.
- The unified model executes the networking and application logic on the same thread.
The following program instantiates sub threads, besides the threads launched by an I/O subsystem such as DPDK, to execute the application logic implemented in bench-iip for testing the split and merge models.
Please save the following program as bench-iip/sub/main.c.
please click here to show the program
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#define __app_exit __o__app_exit
#define __app_should_stop __o__app_should_stop
#define __app_loop __o__app_loop
#define __app_thread_init __o__app_thread_init
#define __app_init __o__app_init
#pragma push_macro("IOSUB_MAIN_C")
#undef IOSUB_MAIN_C
#define IOSUB_MAIN_C pthread.h
static int __iosub_main(int argc, char *const *argv);
#define IIP_MAIN_C "./iip_main.c"
#include "../main.c"
#undef IOSUB_MAIN_C
#pragma pop_macro("IOSUB_MAIN_C")
#undef __app_init
#undef __app_thread_init
#undef __app_loop
#undef __app_should_stop
#undef __app_exit
#undef iip_ops_pkt_alloc
#undef iip_ops_pkt_free
#undef iip_ops_pkt_get_data
#undef iip_ops_pkt_get_len
#undef iip_ops_pkt_set_len
#undef iip_ops_pkt_increment_head
#undef iip_ops_pkt_decrement_tail
#undef iip_ops_pkt_clone
#undef iip_ops_pkt_scatter_gather_chain_append
#undef iip_ops_pkt_scatter_gather_chain_get_next
#undef iip_ops_arp_reply
#undef iip_ops_icmp_reply
#undef iip_ops_tcp_accept
#undef iip_ops_tcp_accepted
#undef iip_ops_tcp_connected
#undef iip_ops_tcp_payload
#undef iip_ops_tcp_acked
#undef iip_ops_tcp_closed
#undef iip_ops_udp_payload
#include <stdatomic.h>
#include <sys/poll.h>
#include <pthread.h>
#define SUB_MAX_CORE (128)
#define NUM_OP_SLOT (512)
enum {
OP_ARP_REPLY = 1,
OP_ICMP_REPLY,
OP_TCP_ACCEPT,
OP_TCP_ACCEPTED,
OP_TCP_CONNECTED,
OP_TCP_PAYLOAD,
OP_TCP_ACKED,
OP_TCP_CLOSED,
OP_UDP_PAYLOAD,
DO_ARP_REQUEST,
DO_TCP_SEND,
DO_TCP_CLOSE,
#if 0
DO_TCP_RXBUF_CONSUMED,
#endif
DO_TCP_CONNECT,
DO_UDP_SEND,
};
#define SUB_READY (1U << 1)
#define SUB_SHOULD_STOP (1U << 2)
struct sub_data {
void *workspace;
void *opaque_array[5];
volatile uint16_t flags;
uint16_t th_id;
uint16_t core_id;
pthread_t th;
uint8_t mac[IIP_CONF_L2ADDR_LEN_MAX];
uint32_t ip4_be;
uint32_t op_batch_cnt;
uint32_t wait_time_ms;
int pipe_fd[2];
struct {
volatile uint16_t head;
uint16_t cur;
volatile uint16_t tail;
uint16_t tail_cache;
struct {
uint64_t op;
uint64_t arg[9];
uint8_t mac[2][IIP_CONF_L2ADDR_LEN_MAX];
} slot[NUM_OP_SLOT];
} opq[2];
};
struct sub_app_global_data {
uint16_t num_cores;
uint16_t num_io_threads; /* XXX: this is not great but needed to avoid adding an interface to io subsystems */
void *app_global_opaque;
struct sub_data sd[SUB_MAX_CORE];
};
static uint16_t iip_udp_send(void *_mem,
uint8_t local_mac[], uint32_t local_ip4_be, uint16_t local_port_be,
uint8_t peer_mac[], uint32_t peer_ip4_be, uint16_t peer_port_be,
void *pkt, void *opaque)
{
void **opaque_array = (void **) opaque;
struct sub_data *sd = (struct sub_data *) opaque_array[4];
uint16_t c = sd->opq[1].cur;
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[1].tail_cache) {
sd->opq[1].tail_cache = atomic_load_explicit(&sd->opq[1].tail, memory_order_acquire);
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[1].tail_cache) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[1].head, sd->opq[1].cur, memory_order_release);
while ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == atomic_load_explicit(&sd->opq[1].tail, memory_order_acquire)) { IIP_OPS_DEBUG_PRINTF("%u waiting %u %u\n", __LINE__, c, sd->opq[1].tail); }
}
}
sd->opq[1].slot[c].op = DO_UDP_SEND;
sd->opq[1].slot[c].arg[0] = (uint64_t) _mem;
sd->opq[0].slot[c].arg[1] = 0;
__iip_memcpy(sd->opq[0].slot[c].mac[0], local_mac, IIP_CONF_L2ADDR_LEN_MAX /* FIXME */);
sd->opq[1].slot[c].arg[2] = (uint64_t) local_ip4_be;
sd->opq[1].slot[c].arg[3] = (uint64_t) local_port_be;
sd->opq[0].slot[c].arg[4] = 0;
__iip_memcpy(sd->opq[0].slot[c].mac[1], peer_mac, IIP_CONF_L2ADDR_LEN_MAX /* FIXME */);
sd->opq[1].slot[c].arg[5] = (uint64_t) peer_ip4_be;
sd->opq[1].slot[c].arg[6] = (uint64_t) peer_port_be;
sd->opq[1].slot[c].arg[7] = (uint64_t) pkt;
sd->opq[1].slot[c].arg[8] = 0; /* opaque */
c = (c == NUM_OP_SLOT - 1 ? 0 : c + 1);
if (sd->op_batch_cnt <= (uint32_t)(sd->opq[1].head <= sd->opq[1].cur ? sd->opq[1].cur - sd->opq[1].head : NUM_OP_SLOT + sd->opq[1].head - sd->opq[1].cur)) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[1].head, sd->opq[1].cur, memory_order_release);
}
return 0; /* XXX: assuming always 0 is returned */
}
static uint16_t iip_tcp_connect(void *_mem,
uint8_t local_mac[], uint32_t local_ip4_be, uint16_t local_port_be,
uint8_t peer_mac[], uint32_t peer_ip4_be, uint16_t peer_port_be,
void *opaque)
{
void **opaque_array = (void **) opaque;
struct sub_data *sd = (struct sub_data *) opaque_array[4];
uint16_t c = sd->opq[1].cur;
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[1].tail_cache) {
sd->opq[1].tail_cache = atomic_load_explicit(&sd->opq[1].tail, memory_order_acquire);
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[1].tail_cache) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[1].head, sd->opq[1].cur, memory_order_release);
while ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == atomic_load_explicit(&sd->opq[1].tail, memory_order_acquire)) { IIP_OPS_DEBUG_PRINTF("%u waiting %u %u\n", __LINE__, c, sd->opq[1].tail); }
}
}
sd->opq[1].slot[c].op = DO_TCP_CONNECT;
sd->opq[1].slot[c].arg[0] = (uint64_t) _mem;
sd->opq[0].slot[c].arg[1] = 0;
__iip_memcpy(sd->opq[0].slot[c].mac[0], local_mac, IIP_CONF_L2ADDR_LEN_MAX /* FIXME */);
sd->opq[1].slot[c].arg[2] = (uint64_t) local_ip4_be;
sd->opq[1].slot[c].arg[3] = (uint64_t) local_port_be;
sd->opq[0].slot[c].arg[4] = 0;
__iip_memcpy(sd->opq[0].slot[c].mac[1], peer_mac, IIP_CONF_L2ADDR_LEN_MAX /* FIXME */);
sd->opq[1].slot[c].arg[5] = (uint64_t) peer_ip4_be;
sd->opq[1].slot[c].arg[6] = (uint64_t) peer_port_be;
sd->opq[1].slot[c].arg[7] = 0; /* opaque */
c = (c == NUM_OP_SLOT - 1 ? 0 : c + 1);
if (sd->op_batch_cnt <= (uint32_t)(sd->opq[1].head <= sd->opq[1].cur ? sd->opq[1].cur - sd->opq[1].head : NUM_OP_SLOT + sd->opq[1].head - sd->opq[1].cur)) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[1].head, sd->opq[1].cur, memory_order_release);
}
return 0; /* XXX: assuming always 0 is returned */
}
static void iip_tcp_rxbuf_consumed(void *_mem, void *_handle, uint16_t cnt, void *opaque)
{
#if 0
void **opaque_array = (void **) opaque;
struct sub_data *sd = (struct sub_data *) opaque_array[4];
uint16_t c = sd->opq[1].cur;
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[1].tail_cache) {
sd->opq[1].tail_cache = atomic_load_explicit(&sd->opq[1].tail, memory_order_acquire);
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[1].tail_cache) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[1].head, sd->opq[1].cur, memory_order_release);
while ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == atomic_load_explicit(&sd->opq[1].tail, memory_order_acquire)) { IIP_OPS_DEBUG_PRINTF("%u waiting %u %u\n", __LINE__, c, sd->opq[1].tail); }
}
}
sd->opq[1].slot[c].op = DO_TCP_RXBUF_CONSUMED;
sd->opq[1].slot[c].arg[0] = (uint64_t) _mem;
sd->opq[1].slot[c].arg[1] = (uint64_t) _handle;
sd->opq[1].slot[c].arg[2] = (uint64_t) cnt;
sd->opq[1].slot[c].arg[3] = 0; /* opaque */
c = (c == NUM_OP_SLOT - 1 ? 0 : c + 1);
if (sd->op_batch_cnt <= (uint32_t)(sd->opq[1].head <= sd->opq[1].cur ? sd->opq[1].cur - sd->opq[1].head : NUM_OP_SLOT + sd->opq[1].head - sd->opq[1].cur)) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[1].head, sd->opq[1].cur, memory_order_release);
}
#endif
{
(void) _mem;
(void) _handle;
(void) cnt;
(void) opaque;
}
}
static uint16_t iip_tcp_close(void *_mem, void *_handle, void *opaque)
{
void **opaque_array = (void **) opaque;
struct sub_data *sd = (struct sub_data *) opaque_array[4];
uint16_t c = sd->opq[1].cur;
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[1].tail_cache) {
sd->opq[1].tail_cache = atomic_load_explicit(&sd->opq[1].tail, memory_order_acquire);
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[1].tail_cache) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[1].head, sd->opq[1].cur, memory_order_release);
while ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == atomic_load_explicit(&sd->opq[1].tail, memory_order_acquire)) { IIP_OPS_DEBUG_PRINTF("%u waiting %u %u\n", __LINE__, c, sd->opq[1].tail); }
}
}
sd->opq[1].slot[c].op = DO_TCP_CLOSE;
sd->opq[1].slot[c].arg[0] = (uint64_t) _mem;
sd->opq[1].slot[c].arg[1] = (uint64_t) _handle;
sd->opq[1].slot[c].arg[2] = 0; /* opaque */
sd->opq[1].cur = (c == NUM_OP_SLOT - 1 ? 0 : c + 1);
if (sd->op_batch_cnt <= (uint32_t)(sd->opq[1].head <= sd->opq[1].cur ? sd->opq[1].cur - sd->opq[1].head : NUM_OP_SLOT + sd->opq[1].head - sd->opq[1].cur)) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[1].head, sd->opq[1].cur, memory_order_release);
}
return 0; /* XXX: assuming always 0 is returned */
}
static uint16_t iip_tcp_send(void *_mem, void *_handle, void *pkt, uint16_t tcp_flags, void *opaque)
{
void **opaque_array = (void **) opaque;
struct sub_data *sd = (struct sub_data *) opaque_array[4];
uint16_t c = sd->opq[1].cur;
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[1].tail_cache) {
sd->opq[1].tail_cache = atomic_load_explicit(&sd->opq[1].tail, memory_order_acquire);
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[1].tail_cache) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[1].head, sd->opq[1].cur, memory_order_release);
while ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == atomic_load_explicit(&sd->opq[1].tail, memory_order_acquire)) { IIP_OPS_DEBUG_PRINTF("%u waiting %u %u\n", __LINE__, c, sd->opq[1].tail); }
}
}
sd->opq[1].slot[c].op = DO_TCP_SEND;
sd->opq[1].slot[c].arg[0] = (uint64_t) _mem;
sd->opq[1].slot[c].arg[1] = (uint64_t) _handle;
sd->opq[1].slot[c].arg[2] = (uint64_t) pkt;
sd->opq[1].slot[c].arg[3] = (uint64_t) tcp_flags;
sd->opq[1].slot[c].arg[4] = 0; /* opaque */
sd->opq[1].cur = (c == NUM_OP_SLOT - 1 ? 0 : c + 1);
if (sd->op_batch_cnt <= (uint32_t)(sd->opq[1].head <= sd->opq[1].cur ? sd->opq[1].cur - sd->opq[1].head : NUM_OP_SLOT + sd->opq[1].head - sd->opq[1].cur)) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[1].head, sd->opq[1].cur, memory_order_release);
}
return 0; /* XXX: assuming always 0 is returned */
}
static void iip_arp_request(void *_mem,
uint8_t local_mac[],
uint32_t local_ip4_be,
uint32_t target_ip4_be,
void *opaque)
{
void **opaque_array = (void **) opaque;
struct sub_data *sd = (struct sub_data *) opaque_array[4];
uint16_t c = sd->opq[1].cur;
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[1].tail_cache) {
sd->opq[1].tail_cache = atomic_load_explicit(&sd->opq[1].tail, memory_order_acquire);
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[1].tail_cache) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[1].head, sd->opq[1].cur, memory_order_release);
while ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == atomic_load_explicit(&sd->opq[1].tail, memory_order_acquire)) { IIP_OPS_DEBUG_PRINTF("%u waiting %u %u\n", __LINE__, c, sd->opq[1].tail); }
}
}
sd->opq[1].slot[c].op = DO_ARP_REQUEST;
sd->opq[1].slot[c].arg[0] = (uint64_t) _mem;
sd->opq[0].slot[c].arg[1] = 0;
__iip_memcpy(sd->opq[0].slot[c].mac[0], local_mac, IIP_CONF_L2ADDR_LEN_MAX /* FIXME */);
sd->opq[1].slot[c].arg[2] = (uint64_t) local_ip4_be;
sd->opq[1].slot[c].arg[3] = (uint64_t) target_ip4_be;
sd->opq[1].slot[c].arg[4] = 0; /* opaque */
sd->opq[1].cur = (c == NUM_OP_SLOT - 1 ? 0 : c + 1);
if (sd->op_batch_cnt <= (uint32_t)(sd->opq[1].head <= sd->opq[1].cur ? sd->opq[1].cur - sd->opq[1].head : NUM_OP_SLOT + sd->opq[1].head - sd->opq[1].cur)) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[1].head, sd->opq[1].cur, memory_order_release);
}
}
static void *app_sub_thread(void *data)
{
struct sub_data *sd = (struct sub_data *) data;
while (!(sd->flags & SUB_READY)) { }
IIP_OPS_DEBUG_PRINTF("start sub thread %u on core %u\n", sd->th_id, sd->core_id);
{
sd->opaque_array[1] = (void *) ((struct sub_app_global_data *) sd->opaque_array[3])->app_global_opaque;
sd->opaque_array[2] = __o__app_thread_init(NULL, sd->th_id, sd->opaque_array);
}
if (sd->wait_time_ms) {
struct sched_param sp;
sp.sched_priority = 1;
__iip_assert(!pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp));
}
while (!(sd->flags & SUB_SHOULD_STOP)) {
uint32_t next_us = sd->wait_time_ms * 1000;
{
uint16_t h = atomic_load_explicit(&sd->opq[0].head, memory_order_acquire), t = sd->opq[0].tail;
while (h != t) {
switch (sd->opq[0].slot[t].op) {
case OP_ARP_REPLY:
__o_iip_ops_arp_reply((void *) sd->opq[0].slot[t].arg[0], (void *) sd->opq[0].slot[t].arg[1], (void *) sd->opq[0].slot[t].arg[2]);
iip_ops_pkt_free((void *) sd->opq[0].slot[t].arg[1], sd->opaque_array);
break;
case OP_ICMP_REPLY:
__o_iip_ops_icmp_reply((void *) sd->opq[0].slot[t].arg[0], (void *) sd->opq[0].slot[t].arg[1], (void *) sd->opq[0].slot[t].arg[2]);
iip_ops_pkt_free((void *) sd->opq[0].slot[t].arg[1], sd->opaque_array);
break;
case OP_TCP_ACCEPT:
sd->opq[0].slot[t].op = (uint64_t) __o_iip_ops_tcp_accept((void *) sd->opq[0].slot[t].arg[0], (void *) sd->opq[0].slot[t].arg[1], (void *) sd->opq[0].slot[t].arg[2]);
iip_ops_pkt_free((void *) sd->opq[0].slot[t].arg[1], sd->opaque_array);
break;
case OP_TCP_ACCEPTED:
sd->opq[0].slot[t].op = (uint64_t) __o_iip_ops_tcp_accepted((void *) sd->opq[0].slot[t].arg[0], (void *) sd->opq[0].slot[t].arg[1], (void *) sd->opq[0].slot[t].arg[2], (void *) sd->opq[0].slot[t].arg[3]);
iip_ops_pkt_free((void *) sd->opq[0].slot[t].arg[2], sd->opaque_array);
break;
case OP_TCP_CONNECTED:
sd->opq[0].slot[t].op = (uint64_t) __o_iip_ops_tcp_connected((void *) sd->opq[0].slot[t].arg[0], (void *) sd->opq[0].slot[t].arg[1], (void *) sd->opq[0].slot[t].arg[2], (void *) sd->opq[0].slot[t].arg[3]);
iip_ops_pkt_free((void *) sd->opq[0].slot[t].arg[2], sd->opaque_array);
break;
case OP_TCP_PAYLOAD:
__o_iip_ops_tcp_payload((void *) sd->opq[0].slot[t].arg[0], (void *) sd->opq[0].slot[t].arg[1], (void *) sd->opq[0].slot[t].arg[2], (void *) sd->opq[0].slot[t].arg[3], sd->opq[0].slot[t].arg[4], sd->opq[0].slot[t].arg[5], (void *) sd->opq[0].slot[t].arg[6]);
iip_ops_pkt_free((void *) sd->opq[0].slot[t].arg[2], sd->opaque_array);
break;
case OP_TCP_ACKED:
__o_iip_ops_tcp_acked((void *) sd->opq[0].slot[t].arg[0], (void *) sd->opq[0].slot[t].arg[1], (void *) sd->opq[0].slot[t].arg[2], (void *) sd->opq[0].slot[t].arg[3], (void *) sd->opq[0].slot[t].arg[4]);
iip_ops_pkt_free((void *) sd->opq[0].slot[t].arg[2], sd->opaque_array);
break;
case OP_TCP_CLOSED:
__o_iip_ops_tcp_closed((void *) sd->opq[0].slot[t].arg[0], (void *) sd->opq[0].slot[t].mac[0], sd->opq[0].slot[t].arg[2], sd->opq[0].slot[t].arg[3], (void *) sd->opq[0].slot[t].mac[1], sd->opq[0].slot[t].arg[5], sd->opq[0].slot[t].arg[6], (void *) sd->opq[0].slot[t].arg[7], (void *) sd->opq[0].slot[t].arg[8]);
break;
case OP_UDP_PAYLOAD:
__o_iip_ops_udp_payload((void *) sd->opq[0].slot[t].arg[0], (void *) (void *) sd->opq[0].slot[t].arg[1], (void *) sd->opq[0].slot[t].arg[2]);
iip_ops_pkt_free((void *) sd->opq[0].slot[t].arg[1], sd->opaque_array);
break;
default:
assert(0);
break;
}
t = (t == NUM_OP_SLOT - 1 ? 0 : t + 1);
}
if (t != sd->opq[0].tail) {
__asm__ volatile("" ::: "memory");
atomic_store_explicit(&sd->opq[0].tail, t, memory_order_release);
}
if (sd->opq[1].head != sd->opq[1].cur) {
__asm__ volatile("" ::: "memory");
atomic_store_explicit(&sd->opq[1].head, sd->opq[1].cur, memory_order_release);
}
}
{
uint32_t _next_us;
__o__app_loop(sd->workspace, sd->mac, sd->ip4_be, &_next_us, sd->opaque_array);
if (_next_us < next_us)
next_us = _next_us;
}
if (next_us) {
struct pollfd pollfd = {
.fd = sd->pipe_fd[0],
.events = POLLIN,
};
assert(poll(&pollfd, 1, (next_us / 1000)) != -1);
if (pollfd.revents & POLLIN) {
char b;
__iip_assert(read(sd->pipe_fd[0], &b, 1) == 1);
}
}
}
pthread_exit(NULL);
}
static void iip_ops_arp_reply(void *_mem, void *m, void *opaque)
{
void **opaque_array = (void **) opaque;
struct sub_app_global_data *sa = opaque_array[1];
uint16_t core_id = (uint16_t)(uintptr_t) opaque_array[2];
struct sub_data *sd = &sa->sd[((PB_IP4(iip_ops_pkt_get_data(m, opaque))->src_be + PB_TCP(iip_ops_pkt_get_data(m, opaque))->src_be + PB_TCP(iip_ops_pkt_get_data(m, opaque))->dst_be) % (sa->num_cores / sa->num_io_threads)) * (sa->num_cores / sa->num_io_threads) + core_id];
uint16_t c = sd->opq[0].cur;
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[0].tail_cache) {
sd->opq[0].tail_cache = atomic_load_explicit(&sd->opq[0].tail, memory_order_acquire);
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[0].tail_cache) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[0].head, sd->opq[0].cur, memory_order_release);
while ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == atomic_load_explicit(&sd->opq[0].tail, memory_order_acquire)) { IIP_OPS_DEBUG_PRINTF("%u waiting %u %u\n", __LINE__, c, sd->opq[0].tail); }
}
}
sd->opq[0].slot[c].op = OP_ARP_REPLY;
sd->opq[0].slot[c].arg[0] = (uint64_t) _mem;
sd->opq[0].slot[c].arg[1] = (uint64_t) iip_ops_pkt_clone(m, opaque);
__iip_assert(sd->opq[0].slot[c].arg[1]);
sd->opq[0].slot[c].arg[2] = (uint64_t) sd->opaque_array;
sd->opq[0].cur = (c == NUM_OP_SLOT - 1 ? 0 : c + 1);
if (sd->op_batch_cnt <= (uint32_t)(sd->opq[0].head <= sd->opq[0].cur ? sd->opq[0].cur - sd->opq[0].head : NUM_OP_SLOT + sd->opq[0].head - sd->opq[0].cur)) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[0].head, sd->opq[0].cur, memory_order_release);
if (sd->wait_time_ms) {
char b = 0;
__iip_assert(write(sd->pipe_fd[1], &b, 1) == 1);
}
}
}
static void iip_ops_icmp_reply(void *_mem, void *m, void *opaque)
{
void **opaque_array = (void **) opaque;
struct sub_app_global_data *sa = opaque_array[1];
uint16_t core_id = (uint16_t)(uintptr_t) opaque_array[2];
struct sub_data *sd = &sa->sd[((PB_IP4(iip_ops_pkt_get_data(m, opaque))->src_be + PB_TCP(iip_ops_pkt_get_data(m, opaque))->src_be + PB_TCP(iip_ops_pkt_get_data(m, opaque))->dst_be) % (sa->num_cores / sa->num_io_threads)) * (sa->num_cores / sa->num_io_threads) + core_id];
uint16_t c = sd->opq[0].cur;
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[0].tail_cache) {
sd->opq[0].tail_cache = atomic_load_explicit(&sd->opq[0].tail, memory_order_acquire);
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[0].tail_cache) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[0].head, sd->opq[0].cur, memory_order_release);
while ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == atomic_load_explicit(&sd->opq[0].tail, memory_order_acquire)) { IIP_OPS_DEBUG_PRINTF("%u waiting %u %u\n", __LINE__, c, sd->opq[0].tail); }
}
}
sd->opq[0].slot[c].op = OP_ICMP_REPLY;
sd->opq[0].slot[c].arg[0] = (uint64_t) _mem;
sd->opq[0].slot[c].arg[1] = (uint64_t) iip_ops_pkt_clone(m, opaque);
__iip_assert(sd->opq[0].slot[c].arg[1]);
sd->opq[0].slot[c].arg[2] = (uint64_t) sd->opaque_array;
sd->opq[0].cur = (c == NUM_OP_SLOT - 1 ? 0 : c + 1);
if (sd->op_batch_cnt <= (uint32_t)(sd->opq[0].head <= sd->opq[0].cur ? sd->opq[0].cur - sd->opq[0].head : NUM_OP_SLOT + sd->opq[0].head - sd->opq[0].cur)) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[0].head, sd->opq[0].cur, memory_order_release);
if (sd->wait_time_ms) {
char b = 0;
__iip_assert(write(sd->pipe_fd[1], &b, 1) == 1);
}
}
}
static uint8_t iip_ops_tcp_accept(void *mem, void *m, void *opaque)
{
void **opaque_array = (void **) opaque;
struct sub_app_global_data *sa = opaque_array[1];
uint16_t core_id = (uint16_t)(uintptr_t) opaque_array[2];
struct sub_data *sd = &sa->sd[((PB_IP4(iip_ops_pkt_get_data(m, opaque))->src_be + PB_TCP(iip_ops_pkt_get_data(m, opaque))->src_be + PB_TCP(iip_ops_pkt_get_data(m, opaque))->dst_be) % (sa->num_cores / sa->num_io_threads)) * (sa->num_cores / sa->num_io_threads) + core_id];
uint16_t c = sd->opq[0].cur;
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[0].tail_cache) {
sd->opq[0].tail_cache = atomic_load_explicit(&sd->opq[0].tail, memory_order_acquire);
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[0].tail_cache) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[0].head, sd->opq[0].cur, memory_order_release);
while ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == atomic_load_explicit(&sd->opq[0].tail, memory_order_acquire)) { IIP_OPS_DEBUG_PRINTF("%u waiting %u %u\n", __LINE__, c, sd->opq[0].tail); }
}
}
sd->opq[0].slot[c].op = OP_TCP_ACCEPT;
sd->opq[0].slot[c].arg[0] = (uint64_t) mem;
sd->opq[0].slot[c].arg[1] = (uint64_t) iip_ops_pkt_clone(m, opaque);
__iip_assert(sd->opq[0].slot[c].arg[1]);
sd->opq[0].slot[c].arg[2] = (uint64_t) sd->opaque_array;
sd->opq[0].cur = (c == NUM_OP_SLOT - 1 ? 0 : c + 1);
__asm__ volatile("" ::: "memory");
atomic_store_explicit(&sd->opq[0].head, sd->opq[0].cur, memory_order_release);
if (sd->wait_time_ms) {
char b = 0;
__iip_assert(write(sd->pipe_fd[1], &b, 1) == 1);
}
while ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) != atomic_load_explicit(&sd->opq[0].tail, memory_order_acquire)) { } /* wait until app sets result */
return (uint8_t) sd->opq[0].slot[c].op;
}
static void *iip_ops_tcp_accepted(void *mem, void *handle, void *m, void *opaque)
{
void **opaque_array = (void **) opaque;
struct sub_app_global_data *sa = opaque_array[1];
uint16_t core_id = (uint16_t)(uintptr_t) opaque_array[2];
struct sub_data *sd = &sa->sd[((PB_IP4(iip_ops_pkt_get_data(m, opaque))->src_be + PB_TCP(iip_ops_pkt_get_data(m, opaque))->src_be + PB_TCP(iip_ops_pkt_get_data(m, opaque))->dst_be) % (sa->num_cores / sa->num_io_threads)) * (sa->num_cores / sa->num_io_threads) + core_id];
uint16_t c = sd->opq[0].cur;
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[0].tail_cache) {
sd->opq[0].tail_cache = atomic_load_explicit(&sd->opq[0].tail, memory_order_acquire);
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[0].tail_cache) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[0].head, sd->opq[0].cur, memory_order_release);
while ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == atomic_load_explicit(&sd->opq[0].tail, memory_order_acquire)) { IIP_OPS_DEBUG_PRINTF("%u waiting %u %u\n", __LINE__, c, sd->opq[0].tail); }
}
}
sd->opq[0].slot[c].op = OP_TCP_ACCEPTED;
sd->opq[0].slot[c].arg[0] = (uint64_t) mem;
sd->opq[0].slot[c].arg[1] = (uint64_t) handle;
sd->opq[0].slot[c].arg[2] = (uint64_t) iip_ops_pkt_clone(m, opaque);
__iip_assert(sd->opq[0].slot[c].arg[2]);
sd->opq[0].slot[c].arg[3] = (uint64_t) sd->opaque_array;
sd->opq[0].cur = (c == NUM_OP_SLOT - 1 ? 0 : c + 1);
__asm__ volatile("" ::: "memory");
atomic_store_explicit(&sd->opq[0].head, sd->opq[0].cur, memory_order_release);
if (sd->wait_time_ms) {
char b = 0;
__iip_assert(write(sd->pipe_fd[1], &b, 1) == 1);
}
while ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) != atomic_load_explicit(&sd->opq[0].tail, memory_order_acquire)) { } /* wait until app sets result */
return (void *) sd->opq[0].slot[c].op;
}
static void *iip_ops_tcp_connected(void *mem, void *handle, void *m, void *opaque)
{
void **opaque_array = (void **) opaque;
struct sub_app_global_data *sa = opaque_array[1];
uint16_t core_id = (uint16_t)(uintptr_t) opaque_array[2];
struct sub_data *sd = &sa->sd[((PB_IP4(iip_ops_pkt_get_data(m, opaque))->src_be + PB_TCP(iip_ops_pkt_get_data(m, opaque))->src_be + PB_TCP(iip_ops_pkt_get_data(m, opaque))->dst_be) % (sa->num_cores / sa->num_io_threads)) * (sa->num_cores / sa->num_io_threads) + core_id];
uint16_t c = sd->opq[0].cur;
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[0].tail_cache) {
sd->opq[0].tail_cache = atomic_load_explicit(&sd->opq[0].tail, memory_order_acquire);
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[0].tail_cache) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[0].head, sd->opq[0].cur, memory_order_release);
while ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == atomic_load_explicit(&sd->opq[0].tail, memory_order_acquire)) { IIP_OPS_DEBUG_PRINTF("%u waiting %u %u\n", __LINE__, c, sd->opq[0].tail); }
}
}
sd->opq[0].slot[c].op = OP_TCP_CONNECTED;
sd->opq[0].slot[c].arg[0] = (uint64_t) mem;
sd->opq[0].slot[c].arg[1] = (uint64_t) handle;
sd->opq[0].slot[c].arg[2] = (uint64_t) iip_ops_pkt_clone(m, opaque);
__iip_assert(sd->opq[0].slot[c].arg[2]);
sd->opq[0].slot[c].arg[3] = (uint64_t) sd->opaque_array;
sd->opq[0].cur = (c == NUM_OP_SLOT - 1 ? 0 : c + 1);
__asm__ volatile("" ::: "memory");
atomic_store_explicit(&sd->opq[0].head, sd->opq[0].cur, memory_order_release);
if (sd->wait_time_ms) {
char b = 0;
__iip_assert(write(sd->pipe_fd[1], &b, 1) == 1);
}
while ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) != atomic_load_explicit(&sd->opq[0].tail, memory_order_acquire)) { } /* wait until app sets result */
return (void *) sd->opq[0].slot[c].op;
}
static void iip_ops_tcp_payload(void *mem, void *handle, void *m,
void *tcp_opaque, uint16_t head_off, uint16_t tail_off,
void *opaque)
{
void **opaque_array = (void **) opaque;
struct sub_app_global_data *sa = opaque_array[1];
uint16_t core_id = (uint16_t)(uintptr_t) opaque_array[2];
struct sub_data *sd = &sa->sd[((PB_IP4(iip_ops_pkt_get_data(m, opaque))->src_be + PB_TCP(iip_ops_pkt_get_data(m, opaque))->src_be + PB_TCP(iip_ops_pkt_get_data(m, opaque))->dst_be) % (sa->num_cores / sa->num_io_threads)) * (sa->num_cores / sa->num_io_threads) + core_id];
uint16_t c = sd->opq[0].cur;
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[0].tail_cache) {
sd->opq[0].tail_cache = atomic_load_explicit(&sd->opq[0].tail, memory_order_acquire);
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[0].tail_cache) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[0].head, sd->opq[0].cur, memory_order_release);
while ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == atomic_load_explicit(&sd->opq[0].tail, memory_order_acquire)) { IIP_OPS_DEBUG_PRINTF("%u waiting %u %u\n", __LINE__, c, sd->opq[0].tail); }
}
}
sd->opq[0].slot[c].op = OP_TCP_PAYLOAD;
sd->opq[0].slot[c].arg[0] = (uint64_t) mem;
sd->opq[0].slot[c].arg[1] = (uint64_t) handle;
sd->opq[0].slot[c].arg[2] = (uint64_t) iip_ops_pkt_clone(m, opaque);
__iip_assert(sd->opq[0].slot[c].arg[2]);
sd->opq[0].slot[c].arg[3] = (uint64_t) tcp_opaque;
sd->opq[0].slot[c].arg[4] = (uint64_t) head_off;
sd->opq[0].slot[c].arg[5] = (uint64_t) tail_off;
sd->opq[0].slot[c].arg[6] = (uint64_t) sd->opaque_array;
sd->opq[0].cur = (c == NUM_OP_SLOT - 1 ? 0 : c + 1);
if (sd->op_batch_cnt <= (uint32_t)(sd->opq[0].head <= sd->opq[0].cur ? sd->opq[0].cur - sd->opq[0].head : NUM_OP_SLOT + sd->opq[0].head - sd->opq[0].cur)) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[0].head, sd->opq[0].cur, memory_order_release);
if (sd->wait_time_ms) {
char b = 0;
__iip_assert(write(sd->pipe_fd[1], &b, 1) == 1);
}
}
__o_iip_tcp_rxbuf_consumed(mem, handle, 1, opaque);
}
static void iip_ops_tcp_acked(void *mem,
void *handle,
void *m,
void *tcp_opaque,
void *opaque)
{
void **opaque_array = (void **) opaque;
struct sub_app_global_data *sa = opaque_array[1];
uint16_t core_id = (uint16_t)(uintptr_t) opaque_array[2];
struct sub_data *sd = &sa->sd[((PB_IP4(iip_ops_pkt_get_data(m, opaque))->src_be + PB_TCP(iip_ops_pkt_get_data(m, opaque))->src_be + PB_TCP(iip_ops_pkt_get_data(m, opaque))->dst_be) % (sa->num_cores / sa->num_io_threads)) * (sa->num_cores / sa->num_io_threads) + core_id];
uint16_t c = sd->opq[0].cur;
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[0].tail_cache) {
sd->opq[0].tail_cache = atomic_load_explicit(&sd->opq[0].tail, memory_order_acquire);
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[0].tail_cache) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[0].head, sd->opq[0].cur, memory_order_release);
while ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == atomic_load_explicit(&sd->opq[0].tail, memory_order_acquire)) { IIP_OPS_DEBUG_PRINTF("%u waiting %u %u\n", __LINE__, c, sd->opq[0].tail); }
}
}
sd->opq[0].slot[c].op = OP_TCP_ACKED;
sd->opq[0].slot[c].arg[0] = (uint64_t) mem;
sd->opq[0].slot[c].arg[1] = (uint64_t) handle;
sd->opq[0].slot[c].arg[2] = (uint64_t) iip_ops_pkt_clone(m, opaque);
__iip_assert(sd->opq[0].slot[c].arg[2]);
sd->opq[0].slot[c].arg[3] = (uint64_t) tcp_opaque;
sd->opq[0].slot[c].arg[4] = (uint64_t) sd->opaque_array;
sd->opq[0].cur = (c == NUM_OP_SLOT - 1 ? 0 : c + 1);
if (sd->op_batch_cnt <= (uint32_t)(sd->opq[0].head <= sd->opq[0].cur ? sd->opq[0].cur - sd->opq[0].head : NUM_OP_SLOT + sd->opq[0].head - sd->opq[0].cur)) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[0].head, sd->opq[0].cur, memory_order_release);
if (sd->wait_time_ms) {
char b = 0;
__iip_assert(write(sd->pipe_fd[1], &b, 1) == 1);
}
}
}
static void iip_ops_tcp_closed(void *handle,
uint8_t local_mac[], uint32_t local_ip4_be, uint16_t local_port_be,
uint8_t peer_mac[], uint32_t peer_ip4_be, uint16_t peer_port_be,
void *tcp_opaque, void *opaque)
{
void **opaque_array = (void **) opaque;
struct sub_app_global_data *sa = opaque_array[1];
uint16_t core_id = (uint16_t)(uintptr_t) opaque_array[2];
struct sub_data *sd = &sa->sd[((peer_ip4_be + peer_port_be + local_port_be) % (sa->num_cores / sa->num_io_threads)) * (sa->num_cores / sa->num_io_threads) + core_id];
uint16_t c = sd->opq[0].cur;
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[0].tail_cache) {
sd->opq[0].tail_cache = atomic_load_explicit(&sd->opq[0].tail, memory_order_acquire);
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[0].tail_cache) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[0].head, sd->opq[0].cur, memory_order_release);
while ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == atomic_load_explicit(&sd->opq[0].tail, memory_order_acquire)) { IIP_OPS_DEBUG_PRINTF("%u waiting %u %u\n", __LINE__, c, sd->opq[0].tail); }
}
}
sd->opq[0].slot[c].op = OP_TCP_CLOSED;
sd->opq[0].slot[c].arg[0] = (uint64_t) handle;
sd->opq[0].slot[c].arg[1] = 0;
__iip_memcpy(sd->opq[0].slot[c].mac[0], local_mac, IIP_CONF_L2ADDR_LEN_MAX /* FIXME */);
sd->opq[0].slot[c].arg[2] = local_ip4_be;
sd->opq[0].slot[c].arg[3] = local_port_be;
sd->opq[0].slot[c].arg[4] = 0;
__iip_memcpy(sd->opq[0].slot[c].mac[1], peer_mac, IIP_CONF_L2ADDR_LEN_MAX /* FIXME */);
sd->opq[0].slot[c].arg[5] = peer_ip4_be;
sd->opq[0].slot[c].arg[6] = peer_port_be;
sd->opq[0].slot[c].arg[7] = (uint64_t) tcp_opaque;
sd->opq[0].slot[c].arg[8] = (uint64_t) sd->opaque_array;
sd->opq[0].cur = (c == NUM_OP_SLOT - 1 ? 0 : c + 1);
if (sd->op_batch_cnt <= (uint32_t)(sd->opq[0].head <= sd->opq[0].cur ? sd->opq[0].cur - sd->opq[0].head : NUM_OP_SLOT + sd->opq[0].head - sd->opq[0].cur)) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[0].head, sd->opq[0].cur, memory_order_release);
if (sd->wait_time_ms) {
char b = 0;
__iip_assert(write(sd->pipe_fd[1], &b, 1) == 1);
}
}
}
static void iip_ops_udp_payload(void *mem, void *m, void *opaque)
{
void **opaque_array = (void **) opaque;
struct sub_app_global_data *sa = opaque_array[1];
uint16_t core_id = (uint16_t)(uintptr_t) opaque_array[2];
struct sub_data *sd = &sa->sd[((PB_IP4(iip_ops_pkt_get_data(m, opaque))->src_be + PB_TCP(iip_ops_pkt_get_data(m, opaque))->src_be + PB_TCP(iip_ops_pkt_get_data(m, opaque))->dst_be) % (sa->num_cores / sa->num_io_threads)) * (sa->num_cores / sa->num_io_threads) + core_id];
uint16_t c = sd->opq[0].cur;
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[0].tail_cache) {
sd->opq[0].tail_cache = atomic_load_explicit(&sd->opq[0].tail, memory_order_acquire);
if ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == sd->opq[0].tail_cache) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[0].head, sd->opq[0].cur, memory_order_release);
while ((c == NUM_OP_SLOT - 1 ? 0 : c + 1) == atomic_load_explicit(&sd->opq[0].tail, memory_order_acquire)) { IIP_OPS_DEBUG_PRINTF("%u waiting %u %u\n", __LINE__, c, sd->opq[0].tail); }
}
}
sd->opq[0].slot[c].op = OP_UDP_PAYLOAD;
sd->opq[0].slot[c].arg[0] = (uint64_t) mem;
sd->opq[0].slot[c].arg[1] = (uint64_t) iip_ops_pkt_clone(m, opaque);
__iip_assert(sd->opq[0].slot[c].arg[1]);
sd->opq[0].slot[c].arg[2] = (uint64_t) sd->opaque_array;
sd->opq[0].cur = (c == NUM_OP_SLOT - 1 ? 0 : c + 1);
if (sd->op_batch_cnt <= (uint32_t)(sd->opq[0].head <= sd->opq[0].cur ? sd->opq[0].cur - sd->opq[0].head : NUM_OP_SLOT + sd->opq[0].head - sd->opq[0].cur)) {
__asm__ volatile ("" ::: "memory");
atomic_store_explicit(&sd->opq[0].head, sd->opq[0].cur, memory_order_release);
if (sd->wait_time_ms) {
char b = 0;
__iip_assert(write(sd->pipe_fd[1], &b, 1) == 1);
}
}
}
static void __app_loop(void *mem, uint8_t mac[], uint32_t ip4_be, uint32_t *next_us, void *opaque)
{
void **opaque_array = (void **) opaque;
struct sub_app_global_data *sa = opaque_array[1];
uint16_t core_id = (uint16_t)(uintptr_t) opaque_array[2];
if (!(sa->sd[core_id].flags & SUB_READY)) {
uint16_t i;
for (i = 0; i < sa->num_cores; i++) {
if (core_id == i % sa->num_io_threads) {
__iip_memcpy(sa->sd[i].mac, mac, IIP_CONF_L2ADDR_LEN_MAX /* FIXME */);
sa->sd[i].ip4_be = ip4_be;
}
}
__asm__ volatile("" ::: "memory");
for (i = 0; i < sa->num_cores; i++) {
if (core_id == i % sa->num_io_threads)
sa->sd[i].flags |= SUB_READY;
}
}
{
uint16_t i;
for (i = 0; i < sa->num_cores; i++) {
if (core_id == i % sa->num_io_threads) {
uint16_t h = atomic_load_explicit(&sa->sd[i].opq[1].head, memory_order_acquire), t = sa->sd[i].opq[1].tail;
while (h != t) {
switch (sa->sd[i].opq[1].slot[t].op) {
case DO_ARP_REQUEST:
__o_iip_arp_request((void *) sa->sd[i].opq[1].slot[t].arg[0], (void *) sa->sd[i].opq[1].slot[t].mac[0], sa->sd[i].opq[1].slot[t].arg[2], sa->sd[i].opq[1].slot[t].arg[3], opaque);
break;
case DO_TCP_SEND:
sa->sd[i].opq[1].slot[t].op = __o_iip_tcp_send((void *) sa->sd[i].opq[1].slot[t].arg[0], (void *) sa->sd[i].opq[1].slot[t].arg[1], (void *) sa->sd[i].opq[1].slot[t].arg[2], sa->sd[i].opq[1].slot[t].arg[3], opaque);
break;
case DO_TCP_CLOSE:
sa->sd[i].opq[1].slot[t].op = __o_iip_tcp_close((void *) sa->sd[i].opq[1].slot[t].arg[0], (void *) sa->sd[i].opq[1].slot[t].arg[1], opaque);
break;
#if 0
case DO_TCP_RXBUF_CONSUMED:
__o_iip_tcp_rxbuf_consumed((void *) sa->sd[i].opq[1].slot[t].arg[0], (void *) sa->sd[i].opq[1].slot[t].arg[1], sa->sd[i].opq[1].slot[t].arg[2], opaque);
break;
#endif
case DO_TCP_CONNECT:
sa->sd[i].opq[1].slot[t].op = __o_iip_tcp_connect((void *) sa->sd[i].opq[1].slot[t].arg[0], (void *) sa->sd[i].opq[1].slot[t].mac[0], sa->sd[i].opq[1].slot[t].arg[2], sa->sd[i].opq[1].slot[t].arg[3], (void *) sa->sd[i].opq[1].slot[t].mac[1], sa->sd[i].opq[1].slot[t].arg[5], sa->sd[i].opq[1].slot[t].arg[6], opaque);
break;
case DO_UDP_SEND:
sa->sd[i].opq[1].slot[t].op = __o_iip_udp_send((void *) sa->sd[i].opq[1].slot[t].arg[0], (void *) sa->sd[i].opq[1].slot[t].mac[0], sa->sd[i].opq[1].slot[t].arg[2], sa->sd[i].opq[1].slot[t].arg[3], (void *) sa->sd[i].opq[1].slot[t].mac[1], sa->sd[i].opq[1].slot[t].arg[5], sa->sd[i].opq[1].slot[t].arg[6], (void *) sa->sd[i].opq[1].slot[t].arg[7], opaque);
break;
default:
assert(0);
break;
}
t = (t == NUM_OP_SLOT - 1 ? 0 : t + 1);
}
if (t != sa->sd[i].opq[1].tail) {
__asm__ volatile("" ::: "memory");
atomic_store_explicit(&sa->sd[i].opq[1].tail, t, memory_order_release);
}
if (sa->sd[i].opq[0].head != sa->sd[i].opq[0].cur) {
__asm__ volatile("" ::: "memory");
atomic_store_explicit(&sa->sd[i].opq[0].head, sa->sd[i].opq[0].cur, memory_order_release);
if (sa->sd[i].wait_time_ms) {
char b = 0;
__iip_assert(write(sa->sd[i].pipe_fd[1], &b, 1) == 1);
}
}
}
}
}
*next_us = 100;
{ /* unused */
(void) mem;
}
}
static void __app_exit(void *app_global_opaque)
{
if (app_global_opaque) {
struct sub_app_global_data *sa = (struct sub_app_global_data *) app_global_opaque;
{
uint16_t i;
for (i = 0; i < sa->num_cores; i++)
sa->sd[i].flags |= SUB_SHOULD_STOP;
}
{
uint16_t i;
for (i = 0; i < sa->num_cores; i++)
__iip_assert(!pthread_join(sa->sd[i].th, NULL));
}
if (sa->app_global_opaque)
__o__app_exit(sa->app_global_opaque);
mem_free(app_global_opaque, sizeof(struct sub_app_global_data));
}
}
static uint8_t __app_should_stop(void *opaque)
{
void **opaque_array = (void **) opaque;
struct sub_app_global_data *sa = (struct sub_app_global_data *) opaque_array[1];
if (sa->sd[0].opaque_array[2])
return __o__app_should_stop(sa->sd[0].opaque_array);
else
return 0;
}
static void *__app_thread_init(void *workspace, uint16_t core_id, void *opaque)
{
void **opaque_array = (void **) opaque;
struct sub_app_global_data *sa = (struct sub_app_global_data *) opaque_array[1];
{
uint16_t i;
for (i = 0; i < sa->num_cores; i++) {
if (core_id == i % sa->num_io_threads) {
sa->sd[i].workspace = workspace;
sa->sd[i].opaque_array[0] = opaque_array[0]; /* FIXME: this assumes that io subsystems pay attention to thread-safety while they do not */
}
}
}
return (void *)((uintptr_t) core_id);
}
static void *__app_init(int argc, char *const *argv)
{
struct sub_app_global_data *sa = (struct sub_app_global_data *) mem_alloc_local(sizeof(struct sub_app_global_data));
memset(sa, 0, sizeof(struct sub_app_global_data));
sa->app_global_opaque = __o__app_init(argc, argv);
{ /* parse arguments */
int ch;
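/* -b: number of queued operations to accumulate before publishing to the sub thread, -c: CPU core(s) for the sub thread(s) (a single id, a comma-separated list, or a range like 0-3), -e: sub thread wait time in milliseconds (enables the pipe-based wakeup), -n: number of I/O (DPDK) threads */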
while ((ch = getopt(argc, argv, "b:c:e:n:")) != -1) {
switch (ch) {
case 'b':
sa->sd[0].op_batch_cnt = atoi(optarg);
break;
case 'c':
{
ssize_t num_comma = 0, num_hyphen = 0;
{
size_t i;
for (i = 0; i < strlen(optarg); i++) {
switch (optarg[i]) {
case ',':
num_comma++;
break;
case '-':
num_hyphen++;
break;
}
}
}
if (num_hyphen) {
assert(num_hyphen == 1);
assert(!num_comma);
{
char *m;
assert((m = strdup(optarg)) != NULL);
{
size_t i;
for (i = 0; i < strlen(optarg); i++) {
if (m[i] == '-') {
m[i] = '\0';
break;
}
}
assert(i != strlen(optarg) - 1 && i != strlen(optarg));
{
uint16_t from = atoi(&m[0]), to = atoi(&m[i + 1]);
assert(from <= to);
{
uint16_t j, k;
for (j = 0, k = from; k <= to; j++, k++)
sa->sd[j].core_id = k;
sa->num_cores = j;
}
}
}
free(m);
}
} else if (num_comma) {
assert(num_comma + 1 < SUB_MAX_CORE);
{
char *m;
assert((m = strdup(optarg)) != NULL);
{
size_t i, j, k;
for (i = 0, j = 0, k = 0; i < strlen(optarg) + 1; i++) {
if (i == strlen(optarg) || m[i] == ',') {
m[i] = '\0';
if (j != i)
sa->sd[k++].core_id = atoi(&m[j]);
j = i + 1;
}
if (i == strlen(optarg))
break;
}
assert(k);
sa->num_cores = k;
}
free(m);
}
} else {
sa->sd[0].core_id = atoi(optarg);
sa->num_cores = 1;
}
}
break;
case 'e':
sa->sd[0].wait_time_ms = atoi(optarg);
break;
case 'n':
sa->num_io_threads = atoi(optarg);
break;
default:
assert(0);
break;
}
}
}
__iip_assert(sa->num_cores);
__iip_assert(sa->num_io_threads);
__iip_assert(sa->num_io_threads <= sa->num_cores);
__iip_assert(sa->sd[0].op_batch_cnt);
{
uint16_t i;
for (i = 0; i < sa->num_cores; i++) {
sa->sd[i].th_id = i;
sa->sd[i].opaque_array[3] = (void *) sa;
sa->sd[i].opaque_array[4] = (void *) &sa->sd[i];
sa->sd[i].op_batch_cnt = sa->sd[0].op_batch_cnt;
sa->sd[i].wait_time_ms = sa->sd[0].wait_time_ms;
if (sa->sd[i].wait_time_ms)
__iip_assert(!pipe(sa->sd[i].pipe_fd));
__iip_assert(!pthread_create(&sa->sd[i].th, NULL, app_sub_thread, &sa->sd[i]));
{
cpu_set_t cs;
CPU_ZERO(&cs);
CPU_SET(sa->sd[i].core_id, &cs);
__iip_assert(!pthread_setaffinity_np(sa->sd[i].th, sizeof(cs), &cs));
}
}
}
return sa;
}
#define M2S(s) _M2S(s)
#define _M2S(s) #s
#include M2S(IOSUB_MAIN_C)
#undef _M2S
#undef M2S
Please save the following program as bench-iip/sub/iip_main.c.
please click here to show the program
#define iip_udp_send __o_iip_udp_send
#define iip_tcp_connect __o_iip_tcp_connect
#define iip_tcp_rxbuf_consumed __o_iip_tcp_rxbuf_consumed
#define iip_tcp_close __o_iip_tcp_close
#define iip_tcp_send __o_iip_tcp_send
#define iip_arp_request __o_iip_arp_request
#include "../iip/main.c"
#undef iip_udp_send
#undef iip_tcp_connect
#undef iip_tcp_rxbuf_consumed
#undef iip_tcp_close
#undef iip_tcp_send
#undef iip_arp_request
static uint16_t iip_udp_send(void *_mem,
uint8_t local_mac[], uint32_t local_ip4_be, uint16_t local_port_be,
uint8_t peer_mac[], uint32_t peer_ip4_be, uint16_t peer_port_be,
void *pkt, void *opaque);
static uint16_t iip_tcp_connect(void *_mem,
uint8_t local_mac[], uint32_t local_ip4_be, uint16_t local_port_be,
uint8_t peer_mac[], uint32_t peer_ip4_be, uint16_t peer_port_be,
void *opaque);
static void iip_tcp_rxbuf_consumed(void *_mem, void *_handle, uint16_t cnt, void *opaque);
static uint16_t iip_tcp_close(void *_mem, void *_handle, void *opaque);
static uint16_t iip_tcp_send(void *_mem, void *_handle, void *pkt, uint16_t tcp_flags, void *opaque);
static void iip_arp_request(void *_mem,
uint8_t local_mac[],
uint32_t local_ip4_be,
uint32_t target_ip4_be,
void *opaque);
#define iip_ops_arp_reply __o_iip_ops_arp_reply
#define iip_ops_icmp_reply __o_iip_ops_icmp_reply
#define iip_ops_tcp_accept __o_iip_ops_tcp_accept
#define iip_ops_tcp_accepted __o_iip_ops_tcp_accepted
#define iip_ops_tcp_connected __o_iip_ops_tcp_connected
#define iip_ops_tcp_payload __o_iip_ops_tcp_payload
#define iip_ops_tcp_acked __o_iip_ops_tcp_acked
#define iip_ops_tcp_closed __o_iip_ops_tcp_closed
#define iip_ops_udp_payload __o_iip_ops_udp_payload
In bench-iip/sub, please type the following command to generate a file bench-iip/sub/a.out.
IOSUB_DIR=../iip-dpdk make -f ../Makefile
The generated file can be executed by the following commands.
sudo LD_LIBRARY_PATH=../iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -n 1 -l 0 --proc-type=primary --file-prefix=pmd1 --allow 17:00.0 -- -a 0,10.100.0.20 -- -p 10000 -g 1 -l 1 -v 1 -- -c 1 -n 1
The specification in the last section, -c 1 -n 1, means that the sub thread uses CPU core 1 (-c 1) to execute the application logic and that there is 1 I/O (DPDK) thread (-n 1); the number of I/O (DPDK) threads is specified by -l 0 in the first section.
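As a hedged example reusing the NIC PCI address, IP address, and the -b 1 batch setting that appear in the benchmark commands below, the following invocation would presumably run two sub threads on CPU cores 1 and 2 while keeping one I/O (DPDK) thread on core 0 (the -c option also accepts a range or a comma-separated list).
sudo LD_LIBRARY_PATH=../iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -n 1 -l 0 --proc-type=primary --file-prefix=pmd1 --allow 17:00.0 -- -a 0,10.100.0.20 -- -p 10000 -g 1 -l 1 -- -b 1 -c 1-2 -n 1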
Note: for this thread separation program, and particularly for request-response workloads, the following change in iip/main.c (e423db4bee7c75d028a5f5ae0cb3a4a249caa940) omits the code that immediately transmits an ack packet for received data and leads to better performance; this change usually does not improve performance for bulk transfer workloads.
please click here to show the change
--- a/main.c
+++ b/main.c
@@ -3242,10 +3242,12 @@ static uint16_t iip_run(void *_mem, uint8_t mac[], uint32_t ip4_be, void *pkt[],
_next_us = _next_us_tmp;
}
}
+#if 0
if (!conn->head[3][0]) {
if ((__iip_ntohl(conn->ack_seq_be) != conn->ack_seq_sent)) /* we got payload, but ack is not pushed by the app */
__iip_tcp_push(s, conn, NULL, 0, 1, 0, 0, 0, NULL, opaque);
}
+#endif
if (conn->do_ack_cnt) { /* push ack telling rx misses */
struct pb *queue[2] = { 0 };
if (conn->sack_ok && conn->head[4][1]) {
- The command for the benchmark client.
cnt=0; while [ $cnt -le 32 ]; do sudo LD_LIBRARY_PATH=./iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -n 2 -l 0-$(($cnt == 0 ? 0 : $(($cnt-1)))) --proc-type=primary --file-prefix=pmd1 --allow 17:00.0 -- -a 0,10.100.0.10 -- -s 10.100.0.20 -p 10000 -g 1 -t 5 -c 1 -d 1 -l 1 2>&1 | tee -a ./result.txt; cnt=$(($cnt+2)); done
- The command for the benchmark server with the split model.
cnt=0; while [ $cnt -le 32 ]; do sudo LD_LIBRARY_PATH=./iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./sub/a.out -n 1 -l 0 --proc-type=primary --file-prefix=pmd1 --allow 17:00.0 -- -a 0,10.100.0.20 -- -p 10000 -g 1 -l 1 -- -b 1 -c 2 -n 1; cnt=$(($cnt+2)); done
- The command for the benchmark server with the merge model.
cnt=0; while [ $cnt -le 32 ]; do sudo LD_LIBRARY_PATH=./iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./sub/a.out -n 2 -l 0-1 --proc-type=primary --file-prefix=pmd1 --allow 17:00.0 -- -a 0,10.100.0.20 -- -p 10000 -g 1 -l 1 -- -b 32 -c 0-1 -n 2 -e 100; cnt=$(($cnt+2)); done
- The command for the benchmark server with the unified model.
cnt=0; while [ $cnt -le 32 ]; do sudo LD_LIBRARY_PATH=./iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -n 2 -l 0-1 --proc-type=primary --file-prefix=pmd1 --allow 17:00.0 -- -a 0,10.100.0.20 -- -p 10000 -g 1 -l 1; cnt=$(($cnt+2)); done
- results:
We show rough performance numbers of other TCP/IP stacks.
CAUTION: Please note that it is impossible to conduct a fair comparison among TCP/IP stacks having different implementations, properties, and features. What we show here is an apples-to-oranges comparison, in other words, the results shown here do not indicate the superiority of the TCP/IP stack implementations.
We run benchmarks with the following TCP/IP stack implementations.
- Linux
- lwIP: paper, web page
- Seastar: web page, GitHub
- F-Stack: web page, GitHub
- TAS: paper, GitHub
- Caladan: paper, GitHub
We run a simple TCP ping-pong workload that exchanges a 1-byte TCP message; for the benchmark server programs, we use example programs in the publicly available repositories of the TCP/IP stack implementations, and we use the bench-iip program as the client.
We use the same machines described above; one machine is for the server, and the other machine runs the client that is the bench-iip program.
The server and client programs communicate over the 100 Gbps Mellanox NICs.
client (iip)
- bench-iip: 4e98b9af786299ec44f79b2bf67c046a301075bd
- iip: 0da3100f108f786d923e41acd84b6614082a72be
- iip-dpdk: 9fd10fd9410dbc43ab487784f4cb72300199354b
- Linux kernel 6.2 (Ubuntu 22.04)
sudo LD_LIBRARY_PATH=./iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -n 2 -l 0-31 --proc-type=primary --file-prefix=pmd1 --allow 17:00.0 -- -a 0,10.100.0.10 -- -s 10.100.0.20 -p 10000 -g 1 -t 0 -c 1 -d 1 -l 1
Linux setup
- Linux kernel 6.2 (Ubuntu 22.04)
- We use the same server implementation shown above.
command to launch the benchmark server
./app -c 0 -g 1 -l 1 -p 10000
lwIP setup
- https://github.com/yasukata/tinyhttpd-lwip-dpdk
- a3e1ea3d3917554573024483fb159b73e8bc3aa5
- Linux kernel 6.2 (Ubuntu 22.04)
please click here to see changes made for this test
We change the program to always reply "A".
--- a/main.c
+++ b/main.c
@@ -132,13 +132,11 @@ static err_t tcp_recv_handler(void *arg, struct tcp_pcb *tpcb,
if (!arg) { /* server mode */
char buf[4] = { 0 };
pbuf_copy_partial(p, buf, 3, 0);
- if (!strncmp(buf, "GET", 3)) {
io_stat[0]++;
io_stat[2] += httpdatalen;
assert(tcp_sndbuf(tpcb) >= httpdatalen);
assert(tcp_write(tpcb, httpbuf, httpdatalen, TCP_WRITE_FLAG_COPY) == ERR_OK);
assert(tcp_output(tpcb) == ERR_OK);
- }
} else { /* client mode */
struct http_response *r = (struct http_response *) arg;
assert(p->tot_len < (sizeof(r->buf) - r->cur));
@@ -385,7 +383,7 @@ int main(int argc, char *const *argv)
assert((content = (char *) malloc(content_len + 1)) != NULL);
memset(content, 'A', content_len);
content[content_len] = '\0';
- httpdatalen = snprintf(httpbuf, buflen, "HTTP/1.1 200 OK\r\nContent-Length: %lu\r\nConnection: keep-alive\r\n\r\n%s", content_len, content);
+ httpdatalen = snprintf(httpbuf, buflen, "A");
free(content);
printf("http data length: %lu bytes\n", httpdatalen);
}
command to launch the benchmark server
sudo LD_LIBRARY_PATH=./dpdk/install/lib/x86_64-linux-gnu ./app -l 0 --proc-type=primary --file-prefix=pmd1 --allow=0000:17:00.0 -- -a 10.100.0.20 -g 10.100.0.10 -m 255.255.255.0 -l 1 -p 10000
Seastar setup
- https://github.com/scylladb/seastar.git
- 10b7d604d1f5037a733879d8d171d4405faebbe9
- Linux kernel 6.2 (Ubuntu 22.04)
please click here to see changes made for this test
A build-relevant file in the seastar directory is changed to include the mlx5 driver.
--- a/cmake/Finddpdk.cmake
+++ b/cmake/Finddpdk.cmake
@@ -25,6 +25,7 @@ find_path (dpdk_INCLUDE_DIR
PATH_SUFFIXES dpdk)
find_library (dpdk_PMD_VMXNET3_UIO_LIBRARY rte_pmd_vmxnet3_uio)
+find_library (dpdk_PMD_MLX5_LIBRARY rte_pmd_mlx5)
find_library (dpdk_PMD_I40E_LIBRARY rte_pmd_i40e)
find_library (dpdk_PMD_IXGBE_LIBRARY rte_pmd_ixgbe)
find_library (dpdk_PMD_E1000_LIBRARY rte_pmd_e1000)
@@ -58,6 +59,7 @@ include (FindPackageHandleStandardArgs)
set (dpdk_REQUIRED
dpdk_INCLUDE_DIR
dpdk_PMD_VMXNET3_UIO_LIBRARY
+ dpdk_PMD_MLX5_LIBRARY
dpdk_PMD_I40E_LIBRARY
dpdk_PMD_IXGBE_LIBRARY
dpdk_PMD_E1000_LIBRARY
@@ -113,6 +115,7 @@ if (dpdk_FOUND AND NOT (TARGET dpdk::dpdk))
${dpdk_PMD_ENA_LIBRARY}
${dpdk_PMD_ENIC_LIBRARY}
${dpdk_PMD_QEDE_LIBRARY}
+ ${dpdk_PMD_MLX5_LIBRARY}
${dpdk_PMD_I40E_LIBRARY}
${dpdk_PMD_IXGBE_LIBRARY}
${dpdk_PMD_NFP_LIBRARY}
@@ -146,6 +149,17 @@ if (dpdk_FOUND AND NOT (TARGET dpdk::dpdk))
IMPORTED_LOCATION ${dpdk_PMD_VMXNET3_UIO_LIBRARY}
INTERFACE_INCLUDE_DIRECTORIES ${dpdk_INCLUDE_DIR})
+ #
+ # pmd_mlx5
+ #
+
+ add_library (dpdk::pmd_mlx5 UNKNOWN IMPORTED)
+
+ set_target_properties (dpdk::pmd_mlx5
+ PROPERTIES
+ IMPORTED_LOCATION ${dpdk_PMD_MLX5_LIBRARY}
+ INTERFACE_INCLUDE_DIRECTORIES ${dpdk_INCLUDE_DIR})
+
#
# pmd_i40e
#
@@ -468,6 +482,7 @@ if (dpdk_FOUND AND NOT (TARGET dpdk::dpdk))
dpdk::pmd_ena
dpdk::pmd_enic
dpdk::pmd_qede
+ dpdk::pmd_mlx5
dpdk::pmd_i40e
dpdk::pmd_ixgbe
dpdk::pmd_nfp
A build config file in the seastar/dpdk directory is also changed.
--- a/config/common_base
+++ b/config/common_base
@@ -343,7 +343,7 @@ CONFIG_RTE_LIBRTE_MLX4_DEBUG=n
# Compile burst-oriented Mellanox ConnectX-4, ConnectX-5,
# ConnectX-6 & Bluefield (MLX5) PMD
#
-CONFIG_RTE_LIBRTE_MLX5_PMD=n
+CONFIG_RTE_LIBRTE_MLX5_PMD=y
CONFIG_RTE_LIBRTE_MLX5_DEBUG=n
We modify the memcached application to use it as a simple TCP ping-pong server; after the change, the server always returns "A" to a TCP message without running the memcached-specific event handler.
--- a/apps/memcached/memcache.cc
+++ b/apps/memcached/memcache.cc
@@ -1042,6 +1042,13 @@ class ascii_protocol {
}
future<> handle(input_stream<char>& in, output_stream<char>& out) {
+ return in.read().then([this, &out] (temporary_buffer<char> buf) -> future<> {
+ if (!buf.empty())
+ return out.write("A");
+ else
+ return make_ready_future<>();
+ });
+
_parser.init();
return in.consume(_parser).then([this, &out] () -> future<> {
switch (_parser._state) {
The configure command used.
./configure.py --mode=release --enable-dpdk --without-tests --without-demos
In the file build/release/build.ninja, -libverbs -lmlx5 -lmnl has to be added to LINK_LIBRARIES as follows.
#############################################
# Link the executable apps/memcached/memcached
build apps/memcached/memcached: CXX_EXECUTABLE_LINKER__app_memcached_RelWithDebInfo apps/memcached/CMakeFiles/app_memcached.dir/memcache.cc.o | libseastar.a /usr/lib/x86_64-linux-gnu/libboost_program_options.so /usr/lib/x86_64-linux-gnu/libboost_thread.so /usr/lib/x86_64-linux-gnu/libboost_chrono.so /usr/lib/x86_64-linux-gnu/libboost_date_time.so /usr/lib/x86_64-linux-gnu/libboost_atomic.so /usr/lib/x86_64-linux-gnu/libcares.so /usr/lib/x86_64-linux-gnu/libcryptopp.so /usr/lib/x86_64-linux-gnu/libfmt.so.8.1.1 /usr/lib/x86_64-linux-gnu/liblz4.so /usr/lib/x86_64-linux-gnu/libgnutls.so /usr/lib/x86_64-linux-gnu/libsctp.so /usr/lib/x86_64-linux-gnu/libyaml-cpp.so _cooking/installed/lib/librte_cfgfile.a _cooking/installed/lib/librte_cmdline.a _cooking/installed/lib/librte_ethdev.a _cooking/installed/lib/librte_hash.a _cooking/installed/lib/librte_mbuf.a _cooking/installed/lib/librte_eal.a _cooking/installed/lib/librte_kvargs.a _cooking/installed/lib/librte_mempool.a _cooking/installed/lib/librte_mempool_ring.a _cooking/installed/lib/librte_pmd_bnxt.a _cooking/installed/lib/librte_pmd_cxgbe.a _cooking/installed/lib/librte_pmd_e1000.a _cooking/installed/lib/librte_pmd_ena.a _cooking/installed/lib/librte_pmd_enic.a _cooking/installed/lib/librte_pmd_qede.a _cooking/installed/lib/librte_pmd_mlx5.a _cooking/installed/lib/librte_pmd_i40e.a _cooking/installed/lib/librte_pmd_ixgbe.a _cooking/installed/lib/librte_pmd_nfp.a _cooking/installed/lib/librte_pmd_ring.a _cooking/installed/lib/librte_pmd_vmxnet3_uio.a _cooking/installed/lib/librte_ring.a _cooking/installed/lib/librte_net.a _cooking/installed/lib/librte_timer.a _cooking/installed/lib/librte_pci.a _cooking/installed/lib/librte_bus_pci.a _cooking/installed/lib/librte_bus_vdev.a _cooking/installed/lib/librte_pmd_fm10k.a _cooking/installed/lib/librte_pmd_sfc_efx.a /usr/lib/x86_64-linux-gnu/libhwloc.so /usr/lib/x86_64-linux-gnu/liburing.so /usr/lib/x86_64-linux-gnu/libnuma.so || apps/memcached/app_memcached_ascii libseastar.a
FLAGS = -O2 -g -DNDEBUG
#LINK_LIBRARIES = libseastar.a /usr/lib/x86_64-linux-gnu/libboost_program_options.so /usr/lib/x86_64-linux-gnu/libboost_thread.so /usr/lib/x86_64-linux-gnu/libboost_chrono.so /usr/lib/x86_64-linux-gnu/libboost_date_time.so /usr/lib/x86_64-linux-gnu/libboost_atomic.so /usr/lib/x86_64-linux-gnu/libcares.so /usr/lib/x86_64-linux-gnu/libcryptopp.so /usr/lib/x86_64-linux-gnu/libfmt.so.8.1.1 -Wl,--as-needed /usr/lib/x86_64-linux-gnu/liblz4.so -ldl /usr/lib/x86_64-linux-gnu/libgnutls.so -latomic /usr/lib/x86_64-linux-gnu/libsctp.so /usr/lib/x86_64-linux-gnu/libyaml-cpp.so _cooking/installed/lib/librte_cfgfile.a _cooking/installed/lib/librte_cmdline.a _cooking/installed/lib/librte_ethdev.a _cooking/installed/lib/librte_hash.a _cooking/installed/lib/librte_mbuf.a _cooking/installed/lib/librte_eal.a _cooking/installed/lib/librte_kvargs.a _cooking/installed/lib/librte_mempool.a _cooking/installed/lib/librte_mempool_ring.a _cooking/installed/lib/librte_pmd_bnxt.a _cooking/installed/lib/librte_pmd_cxgbe.a _cooking/installed/lib/librte_pmd_e1000.a _cooking/installed/lib/librte_pmd_ena.a _cooking/installed/lib/librte_pmd_enic.a _cooking/installed/lib/librte_pmd_qede.a _cooking/installed/lib/librte_pmd_mlx5.a _cooking/installed/lib/librte_pmd_i40e.a _cooking/installed/lib/librte_pmd_ixgbe.a _cooking/installed/lib/librte_pmd_nfp.a _cooking/installed/lib/librte_pmd_ring.a _cooking/installed/lib/librte_pmd_vmxnet3_uio.a _cooking/installed/lib/librte_ring.a _cooking/installed/lib/librte_net.a _cooking/installed/lib/librte_timer.a _cooking/installed/lib/librte_pci.a _cooking/installed/lib/librte_bus_pci.a _cooking/installed/lib/librte_bus_vdev.a _cooking/installed/lib/librte_pmd_fm10k.a _cooking/installed/lib/librte_pmd_sfc_efx.a /usr/lib/x86_64-linux-gnu/libhwloc.so /usr/lib/x86_64-linux-gnu/liburing.so /usr/lib/x86_64-linux-gnu/libnuma.so
LINK_LIBRARIES = libseastar.a /usr/lib/x86_64-linux-gnu/libboost_program_options.so /usr/lib/x86_64-linux-gnu/libboost_thread.so /usr/lib/x86_64-linux-gnu/libboost_chrono.so /usr/lib/x86_64-linux-gnu/libboost_date_time.so /usr/lib/x86_64-linux-gnu/libboost_atomic.so /usr/lib/x86_64-linux-gnu/libcares.so /usr/lib/x86_64-linux-gnu/libcryptopp.so /usr/lib/x86_64-linux-gnu/libfmt.so.8.1.1 -Wl,--as-needed /usr/lib/x86_64-linux-gnu/liblz4.so -ldl -libverbs -lmlx5 -lmnl /usr/lib/x86_64-linux-gnu/libgnutls.so -latomic /usr/lib/x86_64-linux-gnu/libsctp.so /usr/lib/x86_64-linux-gnu/libyaml-cpp.so _cooking/installed/lib/librte_cfgfile.a _cooking/installed/lib/librte_cmdline.a _cooking/installed/lib/librte_ethdev.a _cooking/installed/lib/librte_hash.a _cooking/installed/lib/librte_mbuf.a _cooking/installed/lib/librte_eal.a _cooking/installed/lib/librte_kvargs.a _cooking/installed/lib/librte_mempool.a _cooking/installed/lib/librte_mempool_ring.a _cooking/installed/lib/librte_pmd_bnxt.a _cooking/installed/lib/librte_pmd_cxgbe.a _cooking/installed/lib/librte_pmd_e1000.a _cooking/installed/lib/librte_pmd_ena.a _cooking/installed/lib/librte_pmd_enic.a _cooking/installed/lib/librte_pmd_qede.a _cooking/installed/lib/librte_pmd_mlx5.a _cooking/installed/lib/librte_pmd_i40e.a _cooking/installed/lib/librte_pmd_ixgbe.a _cooking/installed/lib/librte_pmd_nfp.a _cooking/installed/lib/librte_pmd_ring.a _cooking/installed/lib/librte_pmd_vmxnet3_uio.a _cooking/installed/lib/librte_ring.a _cooking/installed/lib/librte_net.a _cooking/installed/lib/librte_timer.a _cooking/installed/lib/librte_pci.a _cooking/installed/lib/librte_bus_pci.a _cooking/installed/lib/librte_bus_vdev.a _cooking/installed/lib/librte_pmd_fm10k.a _cooking/installed/lib/librte_pmd_sfc_efx.a /usr/lib/x86_64-linux-gnu/libhwloc.so /usr/lib/x86_64-linux-gnu/liburing.so /usr/lib/x86_64-linux-gnu/libnuma.so
OBJECT_DIR = apps/memcached/CMakeFiles/app_memcached.dir
POST_BUILD = :
PRE_LINK = :
TARGET_FILE = apps/memcached/memcached
TARGET_PDB = memcached.dbg
command to launch the benchmark server
build/release/apps/memcached/memcached --network-stack native --dpdk-pmd --dhcp 0 --host-ipv4-addr 10.100.0.20 --netmask-ipv4-addr 255.255.255.0 --collectd 0 --smp 1 --port 10000
F-Stack setup
- https://github.com/F-Stack/f-stack.git
- 81b0219b097156693e6061ce215dc79687ef7f92
- Linux kernel 6.2 (Ubuntu 22.04)
please click here to see changes made for this test
The following is the change made to the configuration file.
--- a/config.ini
+++ b/config.ini
@@ -33,13 +33,14 @@ idle_sleep=0
# if set 0, means send pkts immediately.
# if set >100, will dealy 100 us.
# unit: microseconds
-pkt_tx_delay=100
+pkt_tx_delay=0
# use symmetric Receive-side Scaling(RSS) key, default: disabled.
symmetric_rss=0
# PCI device enable list.
# And driver options
+allow=17:00.0
#allow=02:00.0
# for multiple PCI devices
#allow=02:00.0,03:00.0
@@ -85,10 +86,10 @@ savepath=.
# Port config section
# Correspond to dpdk.port_list's index: port0, port1...
[port0]
-addr=192.168.1.2
+addr=10.100.0.20
netmask=255.255.255.0
-broadcast=192.168.1.255
-gateway=192.168.1.1
+broadcast=10.100.0.255
+gateway=10.100.0.10
# set interface name, Optional parameter.
#if_name=eno7
We changed an HTTP-server-like example to always return "A" to incoming TCP messages.
--- a/example/main.c
+++ b/example/main.c
@@ -26,37 +26,7 @@ int sockfd;
int sockfd6;
#endif
-char html[] =
-"HTTP/1.1 200 OK\r\n"
-"Server: F-Stack\r\n"
-"Date: Sat, 25 Feb 2017 09:26:33 GMT\r\n"
-"Content-Type: text/html\r\n"
-"Content-Length: 438\r\n"
-"Last-Modified: Tue, 21 Feb 2017 09:44:03 GMT\r\n"
-"Connection: keep-alive\r\n"
-"Accept-Ranges: bytes\r\n"
-"\r\n"
-"<!DOCTYPE html>\r\n"
-"<html>\r\n"
-"<head>\r\n"
-"<title>Welcome to F-Stack!</title>\r\n"
-"<style>\r\n"
-" body { \r\n"
-" width: 35em;\r\n"
-" margin: 0 auto; \r\n"
-" font-family: Tahoma, Verdana, Arial, sans-serif;\r\n"
-" }\r\n"
-"</style>\r\n"
-"</head>\r\n"
-"<body>\r\n"
-"<h1>Welcome to F-Stack!</h1>\r\n"
-"\r\n"
-"<p>For online documentation and support please refer to\r\n"
-"<a href=\"http://F-Stack.org/\">F-Stack.org</a>.<br/>\r\n"
-"\r\n"
-"<p><em>Thank you for using F-Stack.</em></p>\r\n"
-"</body>\r\n"
-"</html>";
+char html[] = "A";
int loop(void *arg)
{
@@ -143,7 +113,7 @@ int main(int argc, char * argv[])
struct sockaddr_in my_addr;
bzero(&my_addr, sizeof(my_addr));
my_addr.sin_family = AF_INET;
- my_addr.sin_port = htons(80);
+ my_addr.sin_port = htons(10000);
my_addr.sin_addr.s_addr = htonl(INADDR_ANY);
command to launch the benchmark server
./example/helloworld
TAS setup
- https://github.com/tcp-acceleration-service/tas.git
- d3926baf6ad65211dc724206a8420715eb5ab645
- Linux kernel 6.2 (Ubuntu 22.04)
please click here to see changes made for this test
We remove -Werror and adjust several path settings so that the compilation passes.
--- a/Makefile
+++ b/Makefile
@@ -5,15 +5,15 @@
CPPFLAGS += -Iinclude/
CPPFLAGS += $(EXTRA_CPPFLAGS)
-CFLAGS += -std=gnu99 -O3 -g -Wall -Werror -march=native -fno-omit-frame-pointer
+CFLAGS += -std=gnu99 -O3 -g -Wall -march=native -fno-omit-frame-pointer
CFLAGS += $(EXTRA_CFLAGS)
CFLAGS_SHARED += $(CFLAGS) -fPIC
LDFLAGS += -pthread -g
LDFLAGS += $(EXTRA_LDFLAGS)
-LDLIBS += -lm -lpthread -lrt -ldl
+LDLIBS += -lm -lpthread -lrt -ldl -lrte_kvargs
LDLIBS += $(EXTRA_LDLIBS)
-PREFIX ?= /usr/local
+PREFIX ?= $(HOME)/dpdk-inst
SBINDIR ?= $(PREFIX)/sbin
LIBDIR ?= $(PREFIX)/lib
INCDIR ?= $(PREFIX)/include
@@ -23,13 +23,13 @@ INCDIR ?= $(PREFIX)/include
# DPDK configuration
# Prefix for dpdk
-RTE_SDK ?= /usr/
+RTE_SDK ?= $(HOME)/dpdk-inst
# mpdts to compile
-DPDK_PMDS ?= ixgbe i40e tap virtio
+DPDK_PMDS ?= ixgbe i40e tap virtio mlx5
DPDK_CPPFLAGS += -I$(RTE_SDK)/include -I$(RTE_SDK)/include/dpdk \
- -I$(RTE_SDK)/include/x86_64-linux-gnu/dpdk/
-DPDK_LDFLAGS+= -L$(RTE_SDK)/lib/
+ -I$(RTE_SDK)/include/x86_64-linux-gnu/dpdk/ -I$(RTE_SDK)/include/
+DPDK_LDFLAGS+= -L$(RTE_SDK)/lib/ -L/root/dpdk-inst/lib/x86_64-linux-gnu
DPDK_LDLIBS+= \
-Wl,--whole-archive \
$(addprefix -lrte_pmd_,$(DPDK_PMDS)) \
We replace pthread_yield with sched_yield, as the compiler suggested.
--- a/lib/sockets/interpose.c
+++ b/lib/sockets/interpose.c
@@ -779,7 +779,7 @@ static inline void ensure_init(void)
init_done = 1;
} else {
while (init_done == 0) {
- pthread_yield();
+ sched_yield();
}
MEM_BARRIER();
}
--- a/lib/sockets/libc.c
+++ b/lib/sockets/libc.c
@@ -150,7 +150,7 @@ static inline void ensure_init(void)
init_done = 1;
} else {
while (init_done == 0) {
- pthread_yield();
+ sched_yield();
}
MEM_BARRIER();
}
command to launch the service process of TAS
LD_LIBRARY_PATH=$HOME/dpdk-inst/lib/x86_64-linux-gnu ./tas/tas --ip-addr=10.100.0.20/24 --fp-cores-max=1
command to launch the benchmark server
./tests/bench_ll_echo 10000 1 64 128
Caladan setup
- https://github.com/shenango/caladan.git
- 1ab795053531dacf6bde366471a4439ae72313c4
- Linux kernel 5.15 (Ubuntu 22.04)
please click here to see changes made for this test
--- a/apps/synthetic/src/main.rs
+++ b/apps/synthetic/src/main.rs
@@ -1,4 +1,4 @@
-#![feature(integer_atomics)]
+//#![feature(integer_atomics)]
#![feature(nll)]
#![feature(test)]
#[macro_use]
We changed the build config file to include the mlx5 NIC driver.
The directpath optimization CONFIG_DIRECTPATH
is not activated.
--- a/build/config
+++ b/build/config
@@ -1,7 +1,7 @@
# build configuration options (set to y for "yes", n for "no")
# Enable Mellanox ConnectX-4,5 NIC Support
-CONFIG_MLX5=n
+CONFIG_MLX5=y
# Enable Mellanox ConnectX-3 NIC Support
CONFIG_MLX4=n
# Enable SPDK NVMe support
The configuration file is edited as follows.
--- a/server.config
+++ b/server.config
@@ -1,7 +1,6 @@
-# an example runtime config file
-host_addr 192.168.1.3
+host_addr 10.100.0.20
host_netmask 255.255.255.0
-host_gateway 192.168.1.1
-runtime_kthreads 4
-runtime_guaranteed_kthreads 4
+host_gateway 10.100.0.10
+runtime_kthreads 1
+runtime_guaranteed_kthreads 1
runtime_priority lc
We change the port number that the server listens on.
--- a/tests/netperf.c
+++ b/tests/netperf.c
@@ -14,7 +14,7 @@
#include <runtime/sync.h>
#include <runtime/tcp.h>
-#define NETPERF_PORT 8000
+#define NETPERF_PORT 10000
/* experiment parameters */
static struct netaddr raddr;
command to launch the process for Caladan's IOKernel
./iokerneld simple noht
command to launch the benchmark server
./tests/netperf ./server.config SERVER 1 10.100.0.20 0 1 1
iip setup
- bench-iip: 4e98b9af786299ec44f79b2bf67c046a301075bd
- iip: 0da3100f108f786d923e41acd84b6614082a72be
- iip-dpdk: 9fd10fd9410dbc43ab487784f4cb72300199354b
- Linux kernel 6.2 (Ubuntu 22.04)
command to launch the benchmark server
sudo LD_LIBRARY_PATH=./iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -n 1 -l 0 --proc-type=primary --file-prefix=pmd1 --allow 17:00.0 -- -a 0,10.100.0.20 -- -p 10000 -l 1
result
system | throughput (requests/sec) | 99th percentile latency (us) |
---|---|---|
Linux | 255872 | 160.381 |
lwIP | 2330425 | 14.188 |
Seastar | 1135152 | 30.286 |
F-Stack | 1368221 | 23.884 |
TAS | 1628830 | 26.794 |
Caladan | 2427353 | 17.263 |
iip | 2894734 | 15.314 |
note
For this benchmark, TAS uses three CPU cores and Caladan uses two CPU cores; TAS needs two extra CPU cores for its service process, and Caladan requires a dedicated CPU core for its scheduler. The other cases use one CPU core.
client (iip)
sudo LD_LIBRARY_PATH=./iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -n 2 -l 0-31 --proc-type=primary --file-prefix=pmd1 --allow 17:00.0 -- -a 0,10.100.0.10 -- -s 10.100.0.20 -p 10000 -g 1 -t 0 -c 8 -d 1 -l 1
Linux setup
./app -c 0-7 -g 1 -l 1 -p 10000
Seastar setup
build/release/apps/memcached/memcached --network-stack native --dpdk-pmd --dhcp 0 --host-ipv4-addr 10.100.0.20 --netmask-ipv4-addr 255.255.255.0 --collectd 0 --smp 8 --port 10000
Caladan setup
The file server.config is changed so that runtime_kthreads will be 8.
please click here to see the configuration used for this test
host_addr 10.100.0.20
host_netmask 255.255.255.0
host_gateway 10.100.0.10
runtime_kthreads 8
runtime_guaranteed_kthreads 8
runtime_priority lc
command to launch the benchmark server
./tests/netperf ./server.config SERVER 0 10.100.0.20 0 1 1
iip setup
sudo LD_LIBRARY_PATH=./iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -n 2 -l 0-7 --proc-type=primary --file-prefix=pmd1 --allow 17:00.0 -- -a 0,10.100.0.20 -- -p 10000 -l 1
result
system | throughput (requests/sec) | 99th percentile latency (us) |
---|---|---|
Linux | 1891960 | 304.695 |
Seastar | 8837323 | 39.769 |
Caladan | 9997752 | 50.033 |
iip | 22040945 | 16.538 |
note
Caladan uses 9 CPU cores for this benchmark (Caladan requires a dedicated CPU core for its scheduler), and the other cases use 8 CPU cores.
client (iip)
sudo LD_LIBRARY_PATH=./iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -n 2 -l 0-31 --proc-type=primary --file-prefix=pmd1 --allow 17:00.0 -- -a 0,10.100.0.10 -- -s 10.100.0.20 -p 10000 -g 1 -t 0 -c 32 -d 1 -l 1
Linux setup
./app -c 0-31 -g 1 -l 1 -p 10000
Seastar setup
build/release/apps/memcached/memcached --network-stack native --dpdk-pmd --dhcp 0 --host-ipv4-addr 10.100.0.20 --netmask-ipv4-addr 255.255.255.0 --collectd 0 --smp 32 --port 10000
iip setup
sudo LD_LIBRARY_PATH=./iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -n 2 -l 0-31 --proc-type=primary --file-prefix=pmd1 --allow 17:00.0 -- -a 0,10.100.0.20 -- -p 10000 -l 1
result
system | throughput (requests/sec) | 99th percentile latency (us) |
---|---|---|
Linux | 4809453 | 528.557 |
Seastar | 29381169 | 53.341 |
iip | 71007462 | 21.247 |
note
In our environment, 14 was the maximum number we could specify for runtime_kthreads.
please click here to see the configuration used for this test
host_addr 10.100.0.20
host_netmask 255.255.255.0
host_gateway 10.100.0.10
runtime_kthreads 14
runtime_guaranteed_kthreads 14
runtime_priority lc
The client is launched by the following command.
sudo LD_LIBRARY_PATH=./iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -n 2 -l 0-31 --proc-type=primary --file-prefix=pmd1 --allow 17:00.0 -- -a 0,10.100.0.10 -- -s 10.100.0.20 -p 10000 -g 1 -t 0 -c 14 -d 1 -l 1
Caladan's throughput with this 14 runtime_kthreads
setup was 9708941 requests/sec and its 99th percentile latency was 72.881 us.
We can have cache-relevant statistics by, in another console/terminal, executing the following command during the benchmark execution.
sudo pqos -m all:0-31 2>&1 | tee -a pqos-output.txt
The following commands extract, from the entire pqos output, the result for the second that is two seconds before the benchmark execution completes:
for 32 cores
ta=(`cat result.txt|grep "sec has passed"|awk '{ print $2 }'`); for i in ${ta[@]}; do tac pqos-output.txt|grep -v NOTE|grep -v CAT|grep -v CORE|awk -v timestr="$i" 'BEGIN{ pcnt = 0; } { num = match($0, timestr); if (0 < num) { pcnt = 1; }; if (0 < pcnt && pcnt < 67) { if (34 < pcnt) { print $n; }; pcnt += 1; }; }'; done
for CPU core 0
ta=(`cat result.txt|grep "sec has passed"|awk '{ print $2 }'`); for i in ${ta[@]}; do tac pqos-output.txt|grep -v NOTE|grep -v CAT|grep -v CORE|awk -v timestr="$i" 'BEGIN{ pcnt = 0; } { num = match($0, timestr); if (0 < num) { pcnt = 1; }; if (0 < pcnt && pcnt < 67) { if (34 < pcnt) { print $n; }; pcnt += 1; }; }'|sort|awk '{ if (NR == 1) { print $n; exit } }'; done
get average of 32 cores
numcore=1; ta=(`cat result.txt|grep "sec has passed"|awk '{ print $2 }'`); for i in ${ta[@]}; do tac pqos-output.txt|grep -v NOTE|grep -v CAT|grep -v CORE|awk -v timestr="$i" -v numcore=$numcore 'BEGIN{ pcnt = 0; ipc = 0; missk = 0; util = 0; } { num = match($0, timestr); if (0 < num) { pcnt = 1; }; if (0 < pcnt && pcnt < 67) { if (34 + (32 - numcore) < pcnt) { ipc += $2; missk += $3; util += $4; }; pcnt += 1; }; } END{ print ipc / numcore ", " missk /numcore ", " util / numcore }'; numcore=$(($numcore+1)); done
The following program is a simple packet generator program mainly made for performance measurement of the I/O subsystems used by bench-iip.
WARNING: this packet generator program transmits packets at a high rate and may cause trouble to systems sharing the network with this packet generator program, therefore, please try this packet generator program only if you understand the consequences of your actions.
To compile the packet generator program, please first enter the top directory of this repository.
cd bench-iip
Then, please make a directory named pkt-gen.
mkdir pkt-gen
Please enter the pkt-gen
directory.
cd pkt-gen
Please save the following program as a file named main.c.
please click here to show main.c
#define __app_thread_init __o__app_thread_init
#define __app_init __o__app_init
#pragma push_macro("IOSUB_MAIN_C")
#undef IOSUB_MAIN_C
#define IOSUB_MAIN_C pthread.h
static int __iosub_main(int argc, char *const *argv);
#define IIP_MAIN_C "./iip_main.c"
#include "../main.c"
#undef IOSUB_MAIN_C
#pragma pop_macro("IOSUB_MAIN_C")
#undef __app_thread_init
#undef __app_init
struct __pkt_gen_addr {
uint8_t mac[6];
uint32_t ip4_be;
uint16_t l4_port_be;
};
#ifndef __PKT_GEN_PKT_CNT
#define __PKT_GEN_PKT_CNT (128)
#endif
#if MAX_PAYLOAD_PKT_CNT < __PKT_GEN_PKT_CNT
#error "MAX_PAYLOAD_PKT_CNT < __PKT_GEN_PKT_CNT"
#endif
static struct __pkt_gen_addr __pkt_gen_src_addrs[__PKT_GEN_PKT_CNT] = { 0 };
static struct __pkt_gen_addr __pkt_gen_dst_addrs[__PKT_GEN_PKT_CNT] = { 0 };
static uint16_t __pkt_gen_src_addr_cnt = 0;
static uint16_t __pkt_gen_dst_addr_cnt = 0;
static uint16_t __pkt_gen_payload_len = 1;
static uint16_t __pkt_gen_batch_size = 0;
static void *__app_thread_init(void *workspace, uint16_t core_id, void *opaque)
{
void **opaque_array = (void **) opaque;
struct app_data *ad = (struct app_data *) opaque_array[1];
struct thread_data *td;
assert((td = (struct thread_data *) mem_alloc_local(sizeof(struct thread_data))) != NULL);
memset(td, 0, sizeof(struct thread_data));
td->core_id = core_id;
{
uint16_t i;
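/* pre-build one UDP packet (headers, payload, and checksums) for each configured src/dst address pair; the transmit path later clones these packets instead of crafting them per send */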
for (i = 0; i < (__pkt_gen_src_addr_cnt < __pkt_gen_dst_addr_cnt ? __pkt_gen_dst_addr_cnt : __pkt_gen_src_addr_cnt); i++) {
void *out_pkt = iip_ops_pkt_alloc(opaque);
assert(out_pkt != NULL);
iip_ops_l2_hdr_craft(out_pkt,
__pkt_gen_src_addrs[i % __pkt_gen_src_addr_cnt].mac,
__pkt_gen_dst_addrs[i % __pkt_gen_dst_addr_cnt].mac,
__iip_htons(0x0800),
opaque);
{
struct iip_ip4_hdr *ip4h = PB_IP4(iip_ops_pkt_get_data(out_pkt, opaque));
ip4h->vl = (4 /* ver ipv4 */ << 4) | (sizeof(struct iip_ip4_hdr) / 4 /* len in octet */);
ip4h->len_be = __iip_htons((ip4h->vl & 0x0f) * 4 + sizeof(struct iip_udp_hdr) + __pkt_gen_payload_len);
ip4h->tos = 0;
ip4h->id_be = 0; /* no ip4 fragment */
ip4h->off_be = 0; /* no ip4 fragment */
ip4h->ttl = IIP_CONF_IP4_TTL;
ip4h->proto = 17; /* udp */
ip4h->src_be = __pkt_gen_src_addrs[i % __pkt_gen_src_addr_cnt].ip4_be;
ip4h->dst_be = __pkt_gen_dst_addrs[i % __pkt_gen_dst_addr_cnt].ip4_be;
ip4h->csum_be = 0;
{
uint8_t *_b[1]; _b[0] = (uint8_t *) ip4h;
{
uint16_t _l[1]; _l[0] = (uint16_t) ((ip4h->vl & 0x0f) * 4);
ip4h->csum_be = __iip_htons(__iip_netcsum16(_b, _l, 1, 0));
}
}
__iip_memset(&((uint8_t *) iip_ops_pkt_get_data(out_pkt, opaque))[iip_ops_l2_hdr_len(out_pkt, opaque) + (ip4h->vl & 0x0f) * 4 + sizeof(struct iip_udp_hdr)], 'A', __pkt_gen_payload_len);
{
struct iip_udp_hdr *udph = PB_UDP(iip_ops_pkt_get_data(out_pkt, opaque));
udph->src_be = __pkt_gen_src_addrs[i % __pkt_gen_src_addr_cnt].l4_port_be;
udph->dst_be = __pkt_gen_dst_addrs[i % __pkt_gen_dst_addr_cnt].l4_port_be;
udph->len_be = __iip_htons(sizeof(struct iip_udp_hdr) + __pkt_gen_payload_len);
udph->csum_be = 0;
if (!iip_ops_nic_feature_offload_udp_tx_checksum(opaque)) { /* udp csum */
struct iip_l4_ip4_pseudo_hdr _pseudo;
_pseudo.ip4_src_be = __pkt_gen_src_addrs[i % __pkt_gen_src_addr_cnt].ip4_be;
_pseudo.ip4_dst_be = __pkt_gen_dst_addrs[i % __pkt_gen_dst_addr_cnt].ip4_be;
_pseudo.pad = 0;
_pseudo.proto = 17;
_pseudo.len_be = __iip_htons(sizeof(struct iip_udp_hdr) + __pkt_gen_payload_len);
{
uint8_t *_b[3]; _b[0] = (uint8_t *) &_pseudo; _b[1] = (uint8_t *) udph; _b[2] = (uint8_t *) &((uint8_t *) iip_ops_pkt_get_data(out_pkt, opaque))[iip_ops_l2_hdr_len(out_pkt, opaque) + (ip4h->vl & 0x0f) * 4 + sizeof(struct iip_udp_hdr)];
{
uint16_t _l[3]; _l[0] = sizeof(_pseudo); _l[1] = sizeof(struct iip_udp_hdr); _l[2] = __pkt_gen_payload_len;
udph->csum_be = __iip_htons(__iip_netcsum16(_b, _l, 3, 0));
}
}
}
iip_ops_pkt_set_len(out_pkt, iip_ops_l2_hdr_len(out_pkt, opaque) + (ip4h->vl & 0x0f) * 4 + __iip_ntohs(udph->len_be), opaque);
}
}
printf("pkt[%hu] src %02x:%02x:%02x:%02x:%02x:%02x %u.%u.%u.%u %hu dst %02x:%02x:%02x:%02x:%02x:%02x %u.%u.%u.%u %hu (%u bytes)\n",
i,
__pkt_gen_src_addrs[i % __pkt_gen_src_addr_cnt].mac[0],
__pkt_gen_src_addrs[i % __pkt_gen_src_addr_cnt].mac[1],
__pkt_gen_src_addrs[i % __pkt_gen_src_addr_cnt].mac[2],
__pkt_gen_src_addrs[i % __pkt_gen_src_addr_cnt].mac[3],
__pkt_gen_src_addrs[i % __pkt_gen_src_addr_cnt].mac[4],
__pkt_gen_src_addrs[i % __pkt_gen_src_addr_cnt].mac[5],
(__pkt_gen_src_addrs[i % __pkt_gen_src_addr_cnt].ip4_be >> 0) & 0xff,
(__pkt_gen_src_addrs[i % __pkt_gen_src_addr_cnt].ip4_be >> 8) & 0xff,
(__pkt_gen_src_addrs[i % __pkt_gen_src_addr_cnt].ip4_be >> 16) & 0xff,
(__pkt_gen_src_addrs[i % __pkt_gen_src_addr_cnt].ip4_be >> 24) & 0xff,
ntohs(__pkt_gen_src_addrs[i % __pkt_gen_src_addr_cnt].l4_port_be),
__pkt_gen_dst_addrs[i % __pkt_gen_dst_addr_cnt].mac[0],
__pkt_gen_dst_addrs[i % __pkt_gen_dst_addr_cnt].mac[1],
__pkt_gen_dst_addrs[i % __pkt_gen_dst_addr_cnt].mac[2],
__pkt_gen_dst_addrs[i % __pkt_gen_dst_addr_cnt].mac[3],
__pkt_gen_dst_addrs[i % __pkt_gen_dst_addr_cnt].mac[4],
__pkt_gen_dst_addrs[i % __pkt_gen_dst_addr_cnt].mac[5],
(__pkt_gen_dst_addrs[i % __pkt_gen_dst_addr_cnt].ip4_be >> 0) & 0xff,
(__pkt_gen_dst_addrs[i % __pkt_gen_dst_addr_cnt].ip4_be >> 8) & 0xff,
(__pkt_gen_dst_addrs[i % __pkt_gen_dst_addr_cnt].ip4_be >> 16) & 0xff,
(__pkt_gen_dst_addrs[i % __pkt_gen_dst_addr_cnt].ip4_be >> 24) & 0xff,
ntohs(__pkt_gen_dst_addrs[i % __pkt_gen_dst_addr_cnt].l4_port_be),
iip_ops_pkt_get_len(out_pkt, opaque));
td->payload.pkt[i] = out_pkt;
}
if (i) {
unsigned int j;
for (j = i; j < __PKT_GEN_PKT_CNT; j++) {
void *out_pkt = iip_ops_pkt_alloc(opaque);
assert(out_pkt);
memcpy(iip_ops_pkt_get_data(out_pkt, opaque), iip_ops_pkt_get_data(td->payload.pkt[j % i], opaque), iip_ops_pkt_get_len(td->payload.pkt[j % i], opaque));
iip_ops_pkt_set_len(out_pkt, iip_ops_pkt_get_len(td->payload.pkt[j % i], opaque), opaque);
td->payload.pkt[j] = out_pkt;
}
}
}
ad->tds[td->core_id] = td;
return td;
{ /* unused */
(void) workspace;
(void) __o__app_thread_init;
}
}
static void *__app_init(int argc, char *const *argv)
{
struct app_data *ad = (struct app_data *) mem_alloc_local(sizeof(struct app_data));
assert(ad);
memset(ad, 0, sizeof(struct app_data));
{ /* parse arguments */
int ch, cnt = 0;
while ((ch = getopt(argc, argv, "b:d:l:s:")) != -1) {
cnt += 2;
switch (ch) {
case 'b':
sscanf(optarg, "%hu", &__pkt_gen_batch_size);
break;
case 'd':
case 's':
if (ch == 's')
assert(__pkt_gen_src_addr_cnt < __PKT_GEN_PKT_CNT);
else
assert(__pkt_gen_dst_addr_cnt < __PKT_GEN_PKT_CNT);
{ /* format: mac,ip,port (e.g., ab:cd:ef:01:23:45,192.168.0.1,10000) */
char tmpbuf[64] = { 0 };
size_t l = strlen(optarg);
assert(l < (sizeof(tmpbuf) - 1));
memcpy(tmpbuf, optarg, l);
{
size_t i, j = 0, c = 0;
for (i = 0; i < l; i++) {
if (tmpbuf[i] == ',' || i == l - 1) {
if (tmpbuf[i] == ',')
tmpbuf[i] = '\0';
switch (c) {
case 0:
if (ch == 's') {
sscanf(tmpbuf, "%hhx:%hhx:%hhx:%hhx:%hhx:%hhx",
&__pkt_gen_src_addrs[__pkt_gen_src_addr_cnt].mac[0],
&__pkt_gen_src_addrs[__pkt_gen_src_addr_cnt].mac[1],
&__pkt_gen_src_addrs[__pkt_gen_src_addr_cnt].mac[2],
&__pkt_gen_src_addrs[__pkt_gen_src_addr_cnt].mac[3],
&__pkt_gen_src_addrs[__pkt_gen_src_addr_cnt].mac[4],
&__pkt_gen_src_addrs[__pkt_gen_src_addr_cnt].mac[5]);
} else {
sscanf(tmpbuf, "%hhx:%hhx:%hhx:%hhx:%hhx:%hhx",
&__pkt_gen_dst_addrs[__pkt_gen_dst_addr_cnt].mac[0],
&__pkt_gen_dst_addrs[__pkt_gen_dst_addr_cnt].mac[1],
&__pkt_gen_dst_addrs[__pkt_gen_dst_addr_cnt].mac[2],
&__pkt_gen_dst_addrs[__pkt_gen_dst_addr_cnt].mac[3],
&__pkt_gen_dst_addrs[__pkt_gen_dst_addr_cnt].mac[4],
&__pkt_gen_dst_addrs[__pkt_gen_dst_addr_cnt].mac[5]);
}
break;
case 1:
if (ch == 's')
assert(inet_pton(AF_INET, &tmpbuf[j], &__pkt_gen_src_addrs[__pkt_gen_src_addr_cnt].ip4_be) == 1);
else
assert(inet_pton(AF_INET, &tmpbuf[j], &__pkt_gen_dst_addrs[__pkt_gen_dst_addr_cnt].ip4_be) == 1);
break;
case 2:
{
uint16_t port;
sscanf(&tmpbuf[j], "%hu", &port);
if (ch == 's')
__pkt_gen_src_addrs[__pkt_gen_src_addr_cnt].l4_port_be = htons(port);
else
__pkt_gen_dst_addrs[__pkt_gen_dst_addr_cnt].l4_port_be = htons(port);
}
break;
}
j = i + 1;
c++;
}
}
assert(c == 3);
if (ch == 's')
__pkt_gen_src_addr_cnt++;
else
__pkt_gen_dst_addr_cnt++;
}
}
break;
case 'l':
{
int err = sscanf(optarg, "%hu", &__pkt_gen_payload_len);
assert(err == 1);
}
break;
default:
break;
}
}
}
signal(SIGINT, sig_h);
if (__pkt_gen_batch_size) {
printf("sender mode\n");
printf("batch size %u\n", __pkt_gen_batch_size);
printf("payload len %u\n", __pkt_gen_payload_len);
assert(__pkt_gen_src_addr_cnt);
assert(__pkt_gen_dst_addr_cnt);
} else
printf("receiver mode\n");
return (void *) ad;
{ /* unused */
(void) __o__app_init;
}
}
#define M2S(s) _M2S(s)
#define _M2S(s) #s
#include M2S(IOSUB_MAIN_C)
#undef _M2S
#undef M2S
static uint16_t iip_run(void *_mem, uint8_t mac[], uint32_t ip4_be, void *pkt[], uint16_t cnt, uint32_t *next_us, void *opaque)
{
{ /* rx */
void **opaque_array = (void **) opaque;
struct thread_data *td = (struct thread_data *) opaque_array[2];
{
uint16_t i;
for (i = 0; i < cnt; i++) {
(void) pkt;
td->monitor.counter[td->monitor.idx].rx_bytes += iip_ops_pkt_get_len(pkt[i], opaque);
td->monitor.counter[td->monitor.idx].rx_pkt++;
iip_ops_pkt_free(pkt[i], opaque);
}
}
}
{ /* tx */
void **opaque_array = (void **) opaque;
struct thread_data *td = (struct thread_data *) opaque_array[2];
{
uint16_t i;
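/* each invocation clones __pkt_gen_batch_size of the pre-built packets (cycling through them), updates the tx counters, and pushes the clones to the NIC */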
for (i = 0; i < __pkt_gen_batch_size; i++) {
void *_pkt = iip_ops_pkt_clone(td->payload.pkt[td->payload.cnt], opaque);
assert(_pkt != NULL);
if (++td->payload.cnt == __PKT_GEN_PKT_CNT)
td->payload.cnt = 0;
td->monitor.counter[td->monitor.idx].tx_bytes += iip_ops_pkt_get_len(_pkt, opaque);
td->monitor.counter[td->monitor.idx].tx_pkt++;
iip_ops_l2_push(_pkt, opaque);
}
if (__pkt_gen_batch_size)
*next_us = 0;
}
}
{ /* unused */
(void) _mem;
(void) mac;
(void) ip4_be;
(void) next_us;
(void) __o__iip_run;
}
return cnt;
}
Please save the following program as a file named iip_main.c.
please click here to show iip_main.c
#define iip_run __o__iip_run
#include "../iip/main.c"
#undef iip_run
static uint16_t iip_run(void *_mem, uint8_t mac[], uint32_t ip4_be, void *pkt[], uint16_t cnt, uint32_t *next_us, void *opaque);
Supposedly, the following command generates an executable file named a.out.
IOSUB_DIR=../iip-dpdk make -f ../Makefile
The following command launches the packet generator program with the receiver mode.
sudo LD_LIBRARY_PATH=../iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -n 1 -l 0 --proc-type=primary --file-prefix=pmd1 --allow 31:00.0 -- -a 0,10.100.0.20
The following command launches the packet generator program with the sender mode; this sender mode transmits preallocated UDP packets as quickly as possible.
sudo LD_LIBRARY_PATH=../iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -n 1 -l 0 --proc-type=primary --file-prefix=pmd1 --allow 31:00.0 -- -a 0,10.100.0.20 -- -s 00:00:00:00:00:00,10.100.0.20,1 -d ff:ff:ff:ff:ff:ff,255.255.255.255,1 -b 16 -l 22
This packet generator program accepts the following command-line options for the sender mode (an example using multiple destinations is shown after the list).
- -b : batch size for packet transmission (if a value bigger than 0 is specified, the program works as a sender; otherwise, it works as a receiver)
- -l : UDP payload length
- -d : specify the destination of generated packets in the format MAC,IP,UDP-port
- -s : specify the source of generated packets in the format MAC,IP,UDP-port
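Since the parser shown above accepts multiple -s and -d specifications and cycles over them when crafting packets, a command along the following lines would presumably alternate the destination UDP port between 1 and 2; this is only a sketch derived from the sender example above, not a command taken from the original instructions.
sudo LD_LIBRARY_PATH=../iip-dpdk/dpdk/install/lib/x86_64-linux-gnu ./a.out -n 1 -l 0 --proc-type=primary --file-prefix=pmd1 --allow 31:00.0 -- -a 0,10.100.0.20 -- -s 00:00:00:00:00:00,10.100.0.20,1 -d ff:ff:ff:ff:ff:ff,255.255.255.255,1 -d ff:ff:ff:ff:ff:ff,255.255.255.255,2 -b 16 -l 22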
The following executes make clean.
IOSUB_DIR=../iip-dpdk make -f ../Makefile clean
This bench-iip program should work with an AF_XDP-based backend.
Please first download the files of bench-iip and iip by the following commands; these are the same as the ones described in the build section, therefore, if you already have them, you do not need to execute these commands.
git clone https://github.com/yasukata/bench-iip.git
cd bench-iip
git clone https://github.com/yasukata/iip.git
From here, the procedure is specific to the AF_XDP-based backend and not described in the build section.
Please download the code of the AF_XDP-based backend by the following command; please note that we assume we are in the bench-iip directory entered by the cd bench-iip command above.
git clone https://github.com/yasukata/iip-af_xdp.git
If you already have a compiled binary for the DPDK-based backend, please clean the directory by the following command.
IOSUB_DIR=./iip-dpdk make clean
Then, the following command will generate the bench-iip application named a.out whose packet I/O is performed by AF_XDP.
IOSUB_DIR=./iip-af_xdp make
Please turn on a network interface to be used for the AF_XDP-based backend and assign an IP address to it; the following example command turns on a physical NIC named enp23s0f0np0
and assigns 10.100.0.20/24
to it.
sudo ifconfig enp23s0f0np0 10.100.0.20 netmask 255.255.255.0 up
To run the program with the behavior equivalent to the one shown in the run section, please type the following command.
sudo ethtool -L enp23s0f0np0 combined 1; sudo ./a.out -l 0 -i enp23s0f0np0 -- -p 10000 -m "```echo -e 'HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: keep-alive\r\n\r\nAA'```"
The launched program will use 10.100.0.20/24
as its IP address and listen on TCP port 10000 for serving the specified HTTP message, and supposedly, you will see the same behavior if you try the ping
and telnet
tests shown in the run section from another machine reachable to the enp23s0f0np0
interface.
The arguments for a.out
are divided into two sections by --
and the first part is passed to the AF_XDP-based backend and the second part is processed by the bench-iip program.
The first part of the arguments is:
- -l : specification of the CPU cores to be used; its syntax is the same as the -l option for the DPDK-based backend
- -i : specification of the network interface to be used
The second part of the arguments is the same as the one shown in the previous section.
One important point in the command above is to use ethtool
to configure the number of NIC queues to be the same as the number of CPU cores used by the bench-iip program that is specified through the -l
option; this is necessary because this benchmark program uses one CPU core to monitor one NIC queue. Therefore, if you use, for example, two CPU cores by specifying -l 0-1
, please sudo ethtool -L enp23s0f0np0 combined 2
beforehand.
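As a reference sketch (the interface name is the one used in this example and may differ in your environment), the current number of combined queues can be checked with ethtool -l before changing it.
ethtool -l enp23s0f0np0
sudo ethtool -L enp23s0f0np0 combined 2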
By default, when the AF_XDP setting is applied for a network interface, incoming packets bypass the kernel-space network stack and directly go to the AF_XDP socket.
This means that, by default, a user-space program leveraging AF_XDP and the kernel-space network stack cannot share the same network interface.
We can avoid this limitation by installing our own eBPF program that steers specific packets, for example, packets destined for a particular TCP port, to an AF_XDP socket.
To apply this setting, please first make a directory; this time, we name it af_xdp-bpf.
mkdir af_xdp-bpf
cd af_xdp-bpf
Then, please save the following as a file named main.bpf.c; this forwards packets destined for TCP port 10000 to AF_XDP sockets while the other packets are passed to the kernel-space network stack, and to change the TCP port number from 10000 to another one, please edit bpf_htons(10000).
please click here to show the program
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
struct {
__uint(type, BPF_MAP_TYPE_XSKMAP);
__uint(max_entries, 64);
__uint(key_size, sizeof(int));
__uint(value_size, sizeof(int));
} xsks_map SEC(".maps");
SEC("prog") int xsk_prog(struct xdp_md *ctx)
{
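/* the fixed offsets below assume an untagged Ethernet frame (14-byte header) and an IPv4 header without options (20 bytes): offset 12 = EtherType, 23 = IPv4 protocol, 36 = TCP destination port */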
/* length check */
if ((long) ctx->data + 38 < (long) ctx->data_end) {
/* ether type is ip4 */
if (*((__be16 *)((long) ctx->data + 12)) == bpf_htons(0x0800)) {
/* tcp */
if (*((__u8 *)((long) ctx->data + 23)) == 6) {
/* port */
if (*((__be16 *)((long) ctx->data + 36)) == bpf_htons(10000)) {
return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
}
}
}
}
return XDP_PASS;
}
Please save the following as a file named Makefile.
please click here to show the program
PROGS = a.out
CC = clang
LD = llc
CFLAGS = -g -O3 -emit-llvm
LDFLAGS = -march=bpf -filetype=obj
C_SRCS = main.bpf.c
OBJS = $(C_SRCS:.c=.o)
CLEANFILES = $(PROGS) *.o
.PHONY: all
all: $(PROGS)
$(PROGS): $(OBJS)
$(LD) -o $@ $^ $(LDFLAGS)
clean:
-@rm -rf $(CLEANFILES)
Afterward, please type the following to generate a BPF program binary named a.out.
make
The following command installs the BPF program, a.out
, to the network interface identified as enp49s0f0
; please change enp49s0f0
according to the environment.
sudo ip link set dev enp49s0f0 xdp obj a.out
Then, please go back to the bench-iip
directory.
cd ../
Please apply the following changes to iip-af_xdp/main.c.
The points are:
- include bpf/bpf.h
- configure XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD for libbpf_flags
- associate the AF_XDP sockets created by iip-af_xdp/main.c with the BPF map named xsks_map which is instantiated by main.bpf.c shown above
please click here to show the changes
diff --git a/main.c b/main.c
--- a/main.c
+++ b/main.c
@@ -35,6 +35,7 @@
#ifdef XSK_HEADER_LIBXDP
#include <xdp/xsk.h>
#endif
+#include <bpf/bpf.h>
#define __IOSUB_MAX_CORE (256)
@@ -742,7 +743,7 @@ static void *__thread_fn(void *__data)
struct xsk_socket_config cfg = {
.rx_size = NUM_RX_DESC,
.tx_size = NUM_TX_DESC,
- .libbpf_flags = 0,
+ .libbpf_flags = XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD,
.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE,
.bind_flags = XDP_USE_NEED_WAKEUP | XDP_ZEROCOPY /* | XDP_USE_SG */,
};
@@ -763,6 +764,40 @@ static void *__thread_fn(void *__data)
}
}
+ {
+ unsigned int id = 0, done = 0;
+ while (1) {
+ {
+ int err = bpf_map_get_next_id(id, &id);
+ assert(!err);
+ }
+ {
+ int fd = bpf_map_get_fd_by_id(id);
+ assert(fd >= 0);
+ {
+ struct bpf_map_info info = { 0 };
+ uint32_t len = sizeof(info);
+ {
+ int err = bpf_obj_get_info_by_fd(fd, &info, &len);
+ assert(!err);
+ }
+ if (strlen("xsks_map") == strlen(info.name) &&
+ !strncmp("xsks_map", info.name, strlen("xsks_map"))) {
+ int key = ti->id, val = xsk_socket__fd(xsk);
+ {
+ int err = bpf_map_update_elem(fd, &key, &val, 0);
+ assert(!err);
+ done = 1;
+ break;
+ }
+ } else
+ close(fd);
+ }
+ }
+ }
+ assert(done);
+ }
+
setup_core_id++;
io_opaque[ti->id].af_xdp.xsk = xsk;
Afterward, please recompile the bench-iip
program.
IOSUB_DIR=./iip-af_xdp make clean
IOSUB_DIR=./iip-af_xdp make
After this setting, supposedly, this bench-iip program can share a network interface with the kernel-space network stack; only packets for TCP port 10000 are handled by bench-iip.
Note that the installed BPF program can be uninstalled by the following command.
sudo ip link set dev enp49s0f0 xdp off
Please first download the files of bench-iip and iip by the following commands; these are the same as the ones described in the build section, therefore, if you already have them, you do not need to execute these commands.
git clone https://github.com/yasukata/bench-iip.git
cd bench-iip
git clone https://github.com/yasukata/iip.git
From here, the procedure is specific to the netmap-based backend and not described in the build section.
Please download the code of the netmap-based backend by the following command; please note that we assume we are in the bench-iip directory entered by the cd bench-iip command above.
git clone https://github.com/yasukata/iip-netmap.git
This part is specific to Linux environments; to use netmap on Linux, we need to build and install its source code. This step is not necessary on FreeBSD.
Please enter the iip-netmap directory.
cd iip-netmap
Please download the netmap source code by the following command.
git clone https://github.com/luigirizzo/netmap.git
Then, please enter the downloaded netmap directory.
cd netmap
To build netmap on Linux, we should have the Linux kernel headers; you could install them using the following command.
sudo apt install linux-headers-`uname -r`
Please type the following command to execute the configure script of netmap. Please note that here we specify --drivers= to skip building the NIC device drivers having netmap-specific patches; if you plan to use physical NICs, you need to specify, for --drivers=, the names of the drivers for those NICs shown in driver_avail of netmap/LINUX/configure, or, if you remove the --drivers option, the netmap build tool will try to build all possible NIC drivers.
./configure --drivers=
Afterward, please type make to generate a kernel module named netmap.ko.
make
Then, the following command will install the netmap kernel module.
sudo insmod netmap.ko
Please get back to the bench-iip directory.
cd ../../
WARNING: If you plan to use netmap with physical NICs, you need to replace the currently loaded device drivers for the physical NICs with the ones having netmap-specific changes. This part is a little bit complicated and, in the worst case, you may lose network reachability to the machine where you try to install netmap because of the NIC device driver module deletion. For details, please refer to the instruction provided from the official netmap repository at https://github.com/luigirizzo/netmap/blob/master/LINUX/README.md#how-to-load-netmap-in-your-system , and please try it at your own risk.
If you already have a compiled binary for the DPDK-based backend, please clean the directory by the following command.
IOSUB_DIR=./iip-dpdk make clean
Then, the following command will generate the bench-iip application named a.out whose packet I/O is performed by netmap.
IOSUB_DIR=./iip-netmap make
The example here uses netmap-specific virtual ports rather than physical NICs for simplicity.
The following command runs, on CPU core 0, a server program which has a virtual port named if20 associated with a virtual switch named vale0; the virtual port's MAC address is aa:bb:cc:dd:ee:ff, its IP address is 10.100.0.20, and the server listens on TCP port 10000 and serves an HTTP content.
sudo ./a.out -a aa:bb:cc:dd:ee:ff,10.100.0.20 -l 0 -i vale0:if20 -- -p 10000 -m "```echo -e 'HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: keep-alive\r\n\r\nAA'```"
Please open another terminal/console on the same machine where you executed the command above, and please type the following command; it runs, on CPU core 1, a client program which has a virtual port named if10 associated with the virtual switch vale0; the virtual port's MAC address is 11:22:33:44:55:66, its IP address is 10.100.0.10, and the client tries to connect to TCP port 10000 of 10.100.0.20 and sends GET to fetch the HTTP data through 1 TCP connection.
sudo ./a.out -a 11:22:33:44:55:66,10.100.0.10 -l 1 -i vale0:if10 -- -s 10.100.0.20 -p 10000 -m "GET " -c 1
The arguments for a.out are divided into two sections by --; the first part is passed to the netmap-based backend and the second part is processed by the bench-iip program.
The first part of the arguments is:
-a : specifies the MAC and IP addresses (for example, -a aa:bb:cc:dd:ee:ff,10.100.0.20 sets the MAC address to aa:bb:cc:dd:ee:ff and the IP address to 10.100.0.20)
-l : specifies the CPU cores to be used; its syntax is the same as the -l option for the DPDK-based backend (see the example command after this list)
-i : specifies the network interface to be used
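For example, the following hypothetical invocation, which assumes that the netmap-based backend accepts the same core-list syntax as the DPDK-based backend, runs the server part on CPU cores 0 and 1; only the -l value differs from the server command shown above.
sudo ./a.out -a aa:bb:cc:dd:ee:ff,10.100.0.20 -l 0,1 -i vale0:if20 -- -p 10000 -m "```echo -e 'HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: keep-alive\r\n\r\nAA'```"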
The second part of the arguments are the same as the one shown in the previous section.
The following commands are for checking the dependencies introduced by main.c
in this repository.
mkdir ./iip-iostub
The content of iip-iostub/main.c is as follows.
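/* a stub I/O backend: every callback below is a no-op so that main.c can be compiled without a real packet I/O subsystem; the blocks marked unused only reference the parameters to silence unused-parameter warnings */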
static uint16_t helper_ip4_get_connection_affinity(uint16_t protocol, uint32_t local_ip4_be, uint16_t local_port_be, uint32_t peer_ip4_be, uint16_t peer_port_be, void *opaque)
{
return 0;
{ /* unused */
(void) protocol;
(void) local_ip4_be;
(void) local_port_be;
(void) peer_ip4_be;
(void) peer_port_be;
(void) opaque;
}
}
static uint16_t iip_ops_l2_hdr_len(void *pkt, void *opaque)
{
return 0;
{ /* unused */
(void) pkt;
(void) opaque;
}
}
static uint8_t *iip_ops_l2_hdr_src_ptr(void *pkt, void *opaque)
{
return NULL;
{ /* unused */
(void) pkt;
(void) opaque;
}
}
static uint8_t *iip_ops_l2_hdr_dst_ptr(void *pkt, void *opaque)
{
return NULL;
{ /* unused */
(void) pkt;
(void) opaque;
}
}
static uint8_t iip_ops_l2_skip(void *pkt, void *opaque)
{
return 0;
{ /* unused */
(void) pkt;
(void) opaque;
}
}
static uint16_t iip_ops_l2_ethertype_be(void *pkt, void *opaque)
{
return 0;
{ /* unused */
(void) pkt;
(void) opaque;
}
}
static uint16_t iip_ops_l2_addr_len(void *opaque)
{
return 0;
{ /* unused */
(void) opaque;
}
}
static void iip_ops_l2_broadcast_addr(uint8_t bc_mac[], void *opaque)
{
{ /* unused */
(void) bc_mac;
(void) opaque;
}
}
static void iip_ops_l2_hdr_craft(void *pkt, uint8_t src[], uint8_t dst[], uint16_t ethertype_be, void *opaque)
{
{ /* unused */
(void) pkt;
(void) src;
(void) dst;
(void) ethertype_be;
(void) opaque;
}
}
static uint8_t iip_ops_arp_lhw(void *opaque)
{
return 0;
{ /* unused */
(void) opaque;
}
}
static uint8_t iip_ops_arp_lproto(void *opaque)
{
return 0;
{ /* unused */
(void) opaque;
}
}
static void *iip_ops_pkt_alloc(void *opaque)
{
return NULL;
{ /* unused */
(void) opaque;
}
}
static void iip_ops_pkt_free(void *pkt, void *opaque)
{
{ /* unused */
(void) pkt;
(void) opaque;
}
}
static void *iip_ops_pkt_get_data(void *pkt, void *opaque)
{
return NULL;
{ /* unused */
(void) pkt;
(void) opaque;
}
}
static uint16_t iip_ops_pkt_get_len(void *pkt, void *opaque)
{
return 0;
{ /* unused */
(void) pkt;
(void) opaque;
}
}
static void iip_ops_pkt_set_len(void *pkt, uint16_t len, void *opaque)
{
{ /* unused */
(void) pkt;
(void) len;
(void) opaque;
}
}
static void iip_ops_pkt_increment_head(void *pkt, uint16_t len, void *opaque)
{
{ /* unused */
(void) pkt;
(void) len;
(void) opaque;
}
}
static void iip_ops_pkt_decrement_tail(void *pkt, uint16_t len, void *opaque)
{
{ /* unused */
(void) pkt;
(void) len;
(void) opaque;
}
}
static void *iip_ops_pkt_clone(void *pkt, void *opaque)
{
return NULL;
{ /* unused */
(void) pkt;
(void) opaque;
}
}
static void iip_ops_pkt_scatter_gather_chain_append(void *pkt_head, void *pkt_tail, void *opaque)
{
{ /* unused */
(void) pkt_head;
(void) pkt_tail;
(void) opaque;
}
}
static void *iip_ops_pkt_scatter_gather_chain_get_next(void *pkt_head, void *opaque)
{
return NULL;
{ /* unused */
(void) pkt_head;
(void) opaque;
}
}
static void iip_ops_l2_flush(void *opaque)
{
{ /* unused */
(void) opaque;
}
}
static void iip_ops_l2_push(void *_m, void *opaque)
{
{ /* unused */
(void) _m;
(void) opaque;
}
}
static uint8_t iip_ops_nic_feature_offload_tx_scatter_gather(void *opaque)
{
return 0;
{ /* unused */
(void) opaque;
}
}
static uint8_t iip_ops_nic_feature_offload_ip4_rx_checksum(void *opaque)
{
return 0;
{ /* unused */
(void) opaque;
}
}
static uint8_t iip_ops_nic_feature_offload_ip4_tx_checksum(void *opaque)
{
return 0;
{ /* unused */
(void) opaque;
}
}
static uint8_t iip_ops_nic_offload_ip4_rx_checksum(void *m, void *opaque)
{
return 0;
{ /* unused */
(void) m;
(void) opaque;
}
}
static uint8_t iip_ops_nic_offload_tcp_rx_checksum(void *m, void *opaque)
{
return 0;
{ /* unused */
(void) m;
(void) opaque;
}
}
static uint8_t iip_ops_nic_offload_udp_rx_checksum(void *m, void *opaque)
{
return 0;
{ /* unused */
(void) m;
(void) opaque;
}
}
static void iip_ops_nic_offload_ip4_tx_checksum_mark(void *m, void *opaque)
{
{ /* unused */
(void) m;
(void) opaque;
}
}
static uint8_t iip_ops_nic_feature_offload_tcp_rx_checksum(void *opaque)
{
return 0;
{ /* unused */
(void) opaque;
}
}
static uint8_t iip_ops_nic_feature_offload_tcp_tx_checksum(void *opaque)
{
return 0;
{ /* unused */
(void) opaque;
}
}
static uint8_t iip_ops_nic_feature_offload_tcp_tx_tso(void *opaque)
{
return 0;
{ /* unused */
(void) opaque;
}
}
static void iip_ops_nic_offload_tcp_tx_checksum_mark(void *m, void *opaque)
{
{ /* unused */
(void) m;
(void) opaque;
}
}
static void iip_ops_nic_offload_tcp_tx_tso_mark(void *m, void *opaque)
{
{ /* unused */
(void) m;
(void) opaque;
}
}
static uint8_t iip_ops_nic_feature_offload_udp_rx_checksum(void *opaque)
{
return 0;
{ /* unused */
(void) opaque;
}
}
static uint8_t iip_ops_nic_feature_offload_udp_tx_checksum(void *opaque)
{
return 0;
{ /* unused */
(void) opaque;
}
}
static uint8_t iip_ops_nic_feature_offload_udp_tx_tso(void *opaque)
{
return 0;
{ /* unused */
(void) opaque;
}
}
static void iip_ops_nic_offload_udp_tx_checksum_mark(void *m, void *opaque)
{
{ /* unused */
(void) m;
(void) opaque;
}
}
static void iip_ops_nic_offload_udp_tx_tso_mark(void *m, void *opaque)
{
{ /* unused */
(void) m;
(void) opaque;
}
}
static int __iosub_main(int argc, char *const *argv)
{
return 0;
{ /* unused */
(void) argc;
(void) argv;
}
{ /* unused */
(void) __app_init;
(void) __app_thread_init;
(void) __app_loop;
(void) __app_should_stop;
(void) __app_exit;
}
{ /* unused */
(void) iip_run;
(void) iip_udp_send;
(void) iip_tcp_connect;
(void) iip_tcp_rxbuf_consumed;
(void) iip_tcp_close;
(void) iip_tcp_send;
(void) iip_arp_request;
(void) iip_add_tcp_conn;
(void) iip_add_pb;
(void) iip_tcp_conn_size;
(void) iip_pb_size;
(void) iip_workspace_size;
}
}
The content of iip-iostub/build.mk is as follows.
CFLAGS += -pedantic
OSNAME = $(shell uname -s)
ifeq ($(OSNAME),Linux)
CFLAGS += -D_POSIX_C_SOURCE=200112L -std=c17
else ifeq ($(OSNAME),FreeBSD)
CFLAGS += -std=c17
endif
Supposedly, we will have a.out
by the following command.
IOSUB_DIR=./iip-iostub make
Note that the a.out generated with IOSUB_DIR=./iip-iostub does not work; it is just for the compilation test.
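As an optional follow-up that is not part of the original procedure, one way to see which shared libraries this dependency-check build links against is the standard ldd tool.
ldd ./a.out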