<?xml version="1.0" encoding="US-ASCII"?>
<!-- Convert to HTML and Text with xml2rfc: http://xml2rfc.ietf.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY RFC5533 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.5533.xml">
<!ENTITY RFC5062 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.5062.xml">
<!ENTITY RFC5061 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.5061.xml">
<!ENTITY RFC4960 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.4960.xml">
<!ENTITY RFC4987 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.4987.xml">
<!ENTITY RFC6234 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.6234.xml">
<!ENTITY RFC4086 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.4086.xml">
<!ENTITY RFC5681 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.5681.xml">
<!ENTITY RFC2119 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC2992 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.2992.xml">
<!ENTITY RFC2979 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.2979.xml">
<!ENTITY RFC2104 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.2104.xml">
<!ENTITY RFC2018 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.2018.xml">
<!ENTITY RFC1918 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.1918.xml">
<!ENTITY RFC0793 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.0793.xml">
<!ENTITY RFC7323 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.7323.xml">
<!ENTITY RFC1122 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.1122.xml">
<!ENTITY RFC3135 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.3135.xml">
<!ENTITY RFC3022 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.3022.xml">
<!ENTITY RFC6181 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.6181.xml">
<!ENTITY RFC6182 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.6182.xml">
<!ENTITY RFC6356 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.6356.xml">
<!ENTITY RFC6555 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.6555.xml">
<!ENTITY RFC8126 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.8126.xml">
<!ENTITY RFC6897 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.6897.xml">
<!ENTITY RFC6528 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.6528.xml">
<!ENTITY RFC5961 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.5961.xml">
<!ENTITY RFC7413 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.7413.xml">
<!ENTITY RFC7430 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.7430.xml">
<!ENTITY RFC8174 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml">
<!ENTITY RFC8041 SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.8041.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<?rfc strict="no" ?>
<?rfc toc="yes"?>
<?rfc tocdepth="4"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes" ?>
<?rfc compact="yes" ?>
<?rfc subcompact="no" ?>
<?rfc rfcedstyle="yes"?>
<rfc category="std" docName="draft-ietf-mptcp-rfc6824bis-18" ipr="trust200902" obsoletes="6824">
<front>
<title abbrev="Multipath TCP">TCP Extensions for Multipath Operation with Multiple Addresses</title>
<author fullname="Alan Ford" initials="A." surname="Ford">
<organization>Pexip</organization>
<address>
<!-- <postal>
<street>Beech Court</street>
<city>Hurst</city>
<region>Berkshire</region>
<code>RG10 0RQ</code>
<country>UK</country>
</postal> -->
<email>[email protected]</email>
</address>
</author>
<author fullname="Costin Raiciu" initials="C." surname="Raiciu">
<organization abbrev="U. Politehnica of Bucharest">University Politehnica of Bucharest</organization>
<address>
<postal>
<street>Splaiul Independentei 313</street>
<city>Bucharest</city>
<country>Romania</country>
</postal>
<email>[email protected]</email>
</address>
</author>
<author fullname="Mark Handley" initials="M." surname="Handley">
<organization abbrev="U. College London">University College London</organization>
<address>
<postal>
<street>Gower Street</street>
<city>London</city>
<code>WC1E 6BT</code>
<country>UK</country>
</postal>
<email>[email protected]</email>
</address>
</author>
<author fullname="Olivier Bonaventure" initials="O." surname="Bonaventure">
<organization abbrev="U. catholique de Louvain">Universit&#233; catholique de Louvain</organization>
<address>
<postal>
<street>Pl. Ste Barbe, 2</street>
<code>1348</code>
<city>Louvain-la-Neuve</city>
<country>Belgium</country>
</postal>
<email>[email protected]</email>
</address>
</author>
<author fullname="Christoph Paasch" initials="C." surname="Paasch">
<organization abbrev="Apple, Inc.">Apple, Inc.</organization>
<address>
<postal>
<street></street>
<city>Cupertino</city>
<country>US</country>
</postal>
<email>[email protected]</email>
</address>
</author>
<date year="2019" />
<area>General</area>
<workgroup>Internet Engineering Task Force</workgroup>
<keyword>tcp extensions multipath multihomed subflow</keyword>
<abstract>
<t>TCP/IP communication is currently restricted to a single path per connection, yet multiple paths often exist between peers. The simultaneous use of these multiple paths for a TCP/IP session would improve resource usage within the network and, thus, improve user experience through higher throughput and improved resilience to network failure.</t>
<t>Multipath TCP provides the ability to simultaneously use multiple paths between peers. This document presents a set of extensions to traditional TCP to support multipath operation. The protocol offers the same type of service to applications as TCP (i.e., reliable bytestream), and it provides the components necessary to establish and use multiple TCP flows across potentially disjoint paths.</t>
<t>This document specifies v1 of Multipath TCP, obsoleting v0 as specified in RFC6824, through clarifications and modifications primarily driven by deployment experience.</t>
</abstract>
</front>
<middle>
<section title="Introduction" anchor="sec_intro">
<t>Multipath TCP (MPTCP) is a set of extensions to regular TCP <xref target="RFC0793"/> to provide a Multipath TCP <xref target="RFC6182"/> service, which enables a transport connection to operate across multiple paths
simultaneously. This document presents the protocol changes required to add multipath capability to TCP; specifically, those for signaling and setting up multiple paths ("subflows"), managing these subflows, reassembly of data, and termination of sessions.
This is not the only information required to create a Multipath TCP implementation, however. This document is complemented by three others:
<list style="symbols">
<t>Architecture <xref target="RFC6182"/>, which explains the motivations behind Multipath TCP, discusses the high-level design decisions on which this design is based, and describes a functional separation through which an extensible MPTCP implementation can be developed.</t>
<t>Congestion control <xref target="RFC6356"/> presents a safe congestion control algorithm for coupling the behavior of the multiple paths in order to "do no harm" to other network users.</t>
<t>Application considerations <xref target="RFC6897"/> discusses what impact MPTCP will have on applications, what applications will want to do with MPTCP, and as a consequence of these factors, what API extensions an MPTCP implementation should present.</t>
</list>
This document is an update to, and obsoletes, the v0 specification of Multipath TCP (RFC6824). This document specifies MPTCP v1, which is not backward compatible with MPTCP v0. This document additionally defines version negotiation procedures for implementations that support both versions.
</t>
<section title="Design Assumptions" anchor="sec_assum">
<t>In order to limit the potentially huge design space, the mptcp working group imposed two key constraints on the Multipath TCP design presented in this document:
<list style="symbols">
<t>It must be backwards-compatible with current, regular TCP, to increase its chances of deployment.</t>
<t>It can be assumed that one or both hosts are multihomed and multiaddressed.</t>
</list>
</t>
<t>To simplify the design, we assume that the presence of multiple addresses at a host is sufficient to indicate the existence of multiple paths. These paths need not be entirely disjoint: they may share one or many routers between them. Even in such a situation, making use of multiple paths is beneficial, improving resource utilization and resilience to a subset of node failures. The congestion control algorithms defined in <xref target="RFC6356"/> ensure this does not act detrimentally. Furthermore, there may be some scenarios where different TCP ports on a single host can provide disjoint paths (such as through certain Equal-Cost Multipath (ECMP) implementations <xref target="RFC2992"/>), and so the MPTCP design also supports the use of ports in path identifiers.</t>
<t>There are three aspects to the backwards-compatibility listed above (discussed in more detail in <xref target="RFC6182"/>):
<list style="hanging">
<t hangText="External Constraints:"> The protocol must function through the vast majority of existing
middleboxes such as NATs, firewalls, and proxies, and as such must resemble existing TCP as far as possible on the
wire. Furthermore, the protocol must not assume the segments it sends on the wire arrive unmodified at the destination:
they may be split or coalesced; TCP options may be removed or duplicated. </t>
<t hangText="Application Constraints:"> The protocol must be usable with no change to existing applications that use the common TCP API (although it is reasonable that not all features would be available to such legacy applications). Furthermore, the protocol must provide the same service model as regular TCP to the application.</t>
<t hangText="Fallback:"> The protocol should be able to fall back to standard TCP with no interference from the user, to be able to communicate with legacy hosts.</t>
</list>
</t>
<t>The complementary application considerations document <xref target="RFC6897"/> discusses the necessary features of an API to provide backwards-compatibility, as well as API extensions to convey the behavior of MPTCP at a level of control and information equivalent to that available with regular, single-path TCP.</t>
<t>Further discussion of the design constraints and associated design decisions is given in the MPTCP Architecture document <xref target="RFC6182"/> and in <xref target="howhard"/>.</t>
</section>
<section title="Multipath TCP in the Networking Stack" anchor="sec_layers">
<t>MPTCP operates at the transport layer and aims to be transparent to both higher and lower
layers. It is a set of additional features on top of standard TCP; <xref target="fig_arch" /> illustrates
this layering. MPTCP is designed to be usable by legacy applications with no changes; detailed discussion
of its interactions with applications is given in <xref target="RFC6897"/>.</t>
<figure align="center" anchor="fig_arch" title="Comparison of Standard TCP and MPTCP Protocol Stacks">
<artwork align="left"><![CDATA[
                                   +-------------------------------+
                                   |           Application         |
      +---------------+            +-------------------------------+
      |  Application  |            |             MPTCP             |
      +---------------+            + - - - - - - - + - - - - - - - +
      |      TCP      |            | Subflow (TCP) | Subflow (TCP) |
      +---------------+            +-------------------------------+
      |      IP       |            |       IP      |      IP       |
      +---------------+            +-------------------------------+
]]></artwork>
</figure>
</section>
<section title="Terminology">
<t>This document makes use of a number of terms that are either MPTCP-specific or have defined meaning in the context of MPTCP, as follows:
<list style="hanging">
<t hangText="Path:"> A sequence of links between a sender and a receiver, defined in this context by a 4-tuple of source and destination address/port pairs.</t>
<t hangText="Subflow:"> A flow of TCP segments operating over an individual path, which forms part of a larger MPTCP connection. A subflow is started and terminated similarly to a regular TCP connection.</t>
<t hangText="(MPTCP) Connection:"> A set of one or more subflows, over which an application can communicate between two hosts. There is a one-to-one mapping between a connection and an application socket.</t>
<t hangText="Data-level:"> The payload data is nominally transferred over a connection, which in turn is transported over subflows. Thus, the term "data-level" is synonymous with "connection-level", in contrast to "subflow-level", which refers to properties of an individual subflow.</t>
<t hangText="Token:"> A locally unique identifier given to a multipath connection by a host. May also be referred to as a "Connection ID".</t>
<t hangText="Host:"> An end host operating an MPTCP implementation, and either initiating or accepting an MPTCP connection.</t>
</list>
In addition to these terms, note that MPTCP's interpretation of, and effect on, regular single-path TCP semantics are discussed in <xref target="sec_semantics"/>.</t>
</section>
<section title="MPTCP Concept" anchor="sec_operation">
<t>This section provides a high-level summary of normal
operation of MPTCP, and is illustrated by the scenario shown in
<xref target="fig_scenario"/>. A detailed description of operation is given in <xref target="sec_protocol"/>.
<list style="symbols">
<t>To a non-MPTCP-aware application, MPTCP will behave the same as normal TCP. Extended APIs could provide
additional control to MPTCP-aware applications <xref target="RFC6897"/>.
An application begins by opening a TCP socket in the normal way.
MPTCP signaling and operation are handled by the MPTCP implementation.
</t>
<t>An MPTCP connection begins similarly to a regular TCP connection. This is
illustrated in <xref target="fig_scenario"/> where an MPTCP connection is established between
addresses A1 and B1 on Hosts A and B, respectively.</t>
<t>If extra paths are available, additional TCP sessions (termed MPTCP "subflows")
are created on these paths, and are combined with the existing session, which continues
to appear as a single connection to the applications at both ends. The creation of the
additional TCP session is illustrated between Address A2 on Host A and Address B1 on
Host B.</t>
<t>MPTCP identifies multiple paths by the presence of multiple addresses
at hosts. Combinations of these multiple addresses equate to the additional paths.
In the example, other potential paths that could be set up are A1&lt;-&gt;B2 and A2&lt;-&gt;B2.
Although this additional session is shown as being initiated from A2, it could equally have
been initiated from B1 or B2.</t>
<t>The discovery and setup of additional subflows
will be achieved through a path management method; this document describes a mechanism
by which a host can initiate new subflows by using its own additional addresses, or by
signaling its available addresses to the other host.</t>
<t>MPTCP adds connection-level sequence numbers to allow the reassembly of
segments arriving on multiple subflows with differing network delays. </t>
<t>Subflows are terminated as regular TCP connections, with a four-way FIN
handshake. The MPTCP connection is terminated by a connection-level FIN.</t>
</list>
</t>
<?rfc needLines='17'?>
<figure align="center" anchor="fig_scenario" title="Example MPTCP Usage Scenario">
<artwork align="left"><![CDATA[
               Host A                               Host B
      ------------------------             ------------------------
      Address A1    Address A2             Address B1    Address B2
      ----------    ----------             ----------    ----------
          |             |                      |             |
          |     (initial connection setup)     |             |
          |----------------------------------->|             |
          |<-----------------------------------|             |
          |             |                      |             |
          |            (additional subflow setup)            |
          |             |--------------------->|             |
          |             |<---------------------|             |
          |             |                      |             |
          |             |                      |             |
]]></artwork>
</figure>
</section>
<section title="Requirements Language">
<t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED",
"MAY", and "OPTIONAL" in this document are to be interpreted as
described in BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/>
when, and only when, they appear in all capitals, as shown here.</t>
</section>
</section>
<section title="Operation Overview" anchor="sec_overview">
<t>This section summarizes common MPTCP operation with reference to the protocol messages involved. It is a high-level overview of the key functions; the full specification follows in <xref target="sec_protocol"/>. Extensibility and negotiated features are not discussed here. Considerable reference is made to symbolic names of MPTCP options throughout this section -- these are subtypes of the IANA-assigned MPTCP option (see <xref target="IANA"/>), and their formats are defined in the detailed protocol specification that follows in <xref target="sec_protocol"/>.</t>
<t>A Multipath TCP connection provides a bidirectional bytestream between two hosts communicating like normal TCP and, thus, does not require any change to the applications. However, Multipath TCP enables the hosts to use different paths with different IP addresses to exchange packets belonging to the MPTCP connection. A Multipath TCP connection appears like a normal TCP connection to an application. However, to the network layer, each MPTCP subflow looks like a regular TCP flow whose segments carry a new TCP option type. Multipath TCP manages the creation, removal, and utilization of these subflows to send data. The number of subflows that are managed within a Multipath TCP connection is not fixed and it can fluctuate during the lifetime of the Multipath TCP connection.</t>
<t>All MPTCP operations are signaled with a TCP option -- a single numerical type for MPTCP, with "sub-types" for each MPTCP message. What follows is a summary of the purpose and rationale of these messages.</t>
<section title="Initiating an MPTCP Connection">
<t>This is the same signaling as for initiating a normal TCP connection, but the SYN, SYN/ACK, and initial ACK (and data) packets also carry the MP_CAPABLE option. This option has a variable length and serves multiple purposes. Firstly, it verifies whether the remote host supports Multipath TCP; secondly, this option allows the hosts to exchange some information to authenticate the establishment of additional subflows. Further details are given in <xref target="sec_init"/>.</t>
<figure><artwork align="left"><![CDATA[
          Host A                                  Host B
          ------                                  ------
          MP_CAPABLE                ->
          [flags]
                                    <-            MP_CAPABLE
                                                  [B's key, flags]
          ACK + MP_CAPABLE (+ data) ->
          [A's key, B's key, flags, (data-level details)]
]]></artwork></figure>
<t>The ACK + MP_CAPABLE may be retransmitted if the sender does not know whether it has been received. The following diagrams show all possible exchanges for the initial subflow setup to ensure this reliability.</t>
<figure><artwork align="left"><![CDATA[
          Host A (with data to send immediately)  Host B
          ------                                  ------
          MP_CAPABLE                ->
          [flags]
                                    <-            MP_CAPABLE
                                                  [B's key, flags]
          ACK + MP_CAPABLE + data   ->
          [A's key, B's key, flags, data-level details]

          Host A (with data to send later)        Host B
          ------                                  ------
          MP_CAPABLE                ->
          [flags]
                                    <-            MP_CAPABLE
                                                  [B's key, flags]
          ACK + MP_CAPABLE          ->
          [A's key, B's key, flags]
          ACK + MP_CAPABLE + data   ->
          [A's key, B's key, flags, data-level details]

          Host A                                  Host B (sending first)
          ------                                  ------
          MP_CAPABLE                ->
          [flags]
                                    <-            MP_CAPABLE
                                                  [B's key, flags]
          ACK + MP_CAPABLE          ->
          [A's key, B's key, flags]
                                    <-            ACK + DSS + data
                                                  [data-level details]
]]></artwork></figure>
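<t>As an illustration only, the exchange above can be sketched in Python. The 64-bit key length, the dictionary representation of each option, the placeholder flag value, and the function names are assumptions made for this sketch; the normative formats are given in <xref target="sec_init"/>.</t>
<figure><artwork align="left"><![CDATA[
```python
import secrets

KEY_LEN = 8  # assumed key length in bytes (64 bits) for this sketch


def new_key() -> bytes:
    """Return a fresh random key for one MPTCP connection."""
    return secrets.token_bytes(KEY_LEN)


def mp_capable_handshake(key_a: bytes, key_b: bytes, data: bytes = b""):
    """Return the MP_CAPABLE contents of the SYN, SYN/ACK, and ACK.

    Flags are shown as a placeholder value of 0; their meaning is
    outside the scope of this sketch.
    """
    # Host A -> Host B: flags only, no key yet.
    syn = {"tcp": "SYN", "flags": 0}
    # Host B -> Host A: B's key and flags.
    syn_ack = {"tcp": "SYN/ACK", "flags": 0, "key_b": key_b}
    # Host A -> Host B: both keys, so B can confirm the exchange.
    ack = {"tcp": "ACK", "flags": 0, "key_a": key_a, "key_b": key_b}
    if data:
        # Data-level details are present only when data is carried.
        ack["data_len"] = len(data)
    return [syn, syn_ack, ack]
```
]]></artwork></figure>
<t>Keeping the key out of the SYN in this sketch mirrors the first diagram, in which Host A sends only flags until Host B has replied.</t>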
</section>
<section title="Associating a New Subflow with an Existing MPTCP Connection">
<t>The exchange of keys in the MP_CAPABLE handshake provides material that can be used to authenticate the endpoints when new subflows are set up.
Additional subflows begin in the same way as initiating a normal TCP connection, but the SYN, SYN/ACK, and ACK packets also carry the MP_JOIN option. </t>
<t>Host A initiates a new subflow between one of its addresses and one of Host B's addresses. The token -- generated from the key -- is used to identify which MPTCP connection it is joining, and the HMAC is used for authentication. The Hash-based Message Authentication Code (HMAC) uses the keys exchanged in the MP_CAPABLE handshake, and the random numbers (nonces) exchanged in these MP_JOIN options. MP_JOIN also contains flags and an Address ID that can be used to refer to the source address without the sender needing to know if it has been changed by a NAT. Further details are in <xref target="sec_join"/>.</t>
<figure><artwork align="left"><![CDATA[
          Host A                                  Host B
          ------                                  ------
          MP_JOIN                   ->
          [B's token, A's nonce,
           A's Address ID, flags]
                                    <-            MP_JOIN
                                                  [B's HMAC, B's nonce,
                                                   B's Address ID, flags]
          ACK + MP_JOIN             ->
          [A's HMAC]
                                    <-            ACK
]]></artwork></figure>
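<t>The token and HMAC computations described above might be sketched as follows. This is a hedged illustration, not the normative algorithm: the token derivation (the first 32 bits of a SHA-256 hash of the key), the ordering of keys and nonces, and all function names are assumptions of this sketch; the exact inputs and truncation lengths are defined in <xref target="sec_join"/>.</t>
<figure><artwork align="left"><![CDATA[
```python
import hashlib
import hmac


def token_from_key(key: bytes) -> bytes:
    """Assumed derivation: the first 32 bits of SHA-256 over the key."""
    return hashlib.sha256(key).digest()[:4]


def join_hmac(local_key: bytes, remote_key: bytes,
              local_nonce: bytes, remote_nonce: bytes) -> bytes:
    """HMAC-SHA256 keyed with both keys over both nonces.

    The sender's own key and nonce come first (an assumption here), so
    the two hosts produce different values from the same four inputs.
    """
    return hmac.new(local_key + remote_key,
                    local_nonce + remote_nonce,
                    hashlib.sha256).digest()


def verify_join_hmac(received: bytes, peer_key: bytes, own_key: bytes,
                     peer_nonce: bytes, own_nonce: bytes) -> bool:
    """Recompute the peer's HMAC and compare in constant time."""
    if not received:
        return False
    expected = join_hmac(peer_key, own_key, peer_nonce, own_nonce)
    # The option may carry a truncated HMAC; compare what was received.
    return hmac.compare_digest(received, expected[:len(received)])
```
]]></artwork></figure>
<t>Because the nonces differ for every MP_JOIN exchange, an HMAC captured on one subflow cannot simply be replayed to join another.</t>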
</section>
<section title="Informing the Other Host about Another Potential Address">
<t>The set of IP addresses associated with a multihomed host may change during the lifetime of an MPTCP connection. MPTCP supports the addition and removal of addresses on a host both implicitly and explicitly. If Host A has established a subflow starting at address/port pair IP#-A1 and wants to open a second subflow starting at address/port pair IP#-A2, it simply initiates the establishment of the subflow as explained above. The remote host will then be implicitly informed about the new address.</t>
<t>In some circumstances, a host may want to advertise to the remote host the availability of an address without establishing a new subflow, for example, when a NAT prevents setup in one direction. In the example below, Host A informs Host B about its alternative IP address/port pair (IP#-A2). Host B may later send an MP_JOIN to this new address. The ADD_ADDR option contains an HMAC to authenticate the address as having been sent from the originator of the connection. The receiver of this option echoes it back to the sender to indicate successful receipt. Further details are in <xref target="sec_add_address"/>.</t>
<figure><artwork align="left"><![CDATA[
          Host A                                  Host B
          ------                                  ------
          ADD_ADDR                  ->
          [Echo-flag=0,
           IP#-A2,
           IP#-A2's Address ID,
           HMAC of IP#-A2]
                                    <-            ADD_ADDR
                                                  [Echo-flag=1,
                                                   IP#-A2,
                                                   IP#-A2's Address ID,
                                                   HMAC of IP#-A2]
]]></artwork></figure>
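<t>The echo behavior can be sketched as below. The AddAddr class, its field names, and the rule that an echo is never echoed again are assumptions of this sketch; the fields are copied as drawn in the diagram above, and the detailed option format in <xref target="sec_add_address"/> takes precedence.</t>
<figure><artwork align="left"><![CDATA[
```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AddAddr:
    echo: bool        # Echo flag: False when announcing, True in the echo
    address: str      # advertised IP address, as text
    address_id: int   # identifier that later signals refer to
    hmac: bytes       # authenticates the announcement; computed elsewhere


def make_echo(received: AddAddr) -> AddAddr:
    """Build the echo: the same fields with the Echo flag set."""
    if received.echo:
        # Assumption of this sketch: an echo is not echoed again.
        raise ValueError("an echo is not echoed again")
    return replace(received, echo=True)


def is_echo_of(sent: AddAddr, received: AddAddr) -> bool:
    """True if `received` confirms receipt of the announcement `sent`."""
    return received.echo and replace(received, echo=False) == sent
```
]]></artwork></figure>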
<t>There is a corresponding signal for address removal, making use of the Address ID that is signaled in the add address handshake. Further details are in <xref target="sec_remove_addr"/>.</t>
<figure><artwork align="left"><![CDATA[
          Host A                                  Host B
          ------                                  ------
          REMOVE_ADDR               ->
          [IP#-A2's Address ID]
]]></artwork></figure>
</section>
<section title="Data Transfer Using MPTCP">
<t>To ensure reliable, in-order delivery of data over subflows that may appear and disappear at any time, MPTCP uses a 64-bit data sequence number (DSN) to number all data sent over the MPTCP connection. Each subflow has its own 32-bit sequence number space, utilizing the regular TCP sequence number header, and an MPTCP option maps the subflow sequence space to the data sequence space. In this way, data can be retransmitted on different subflows (mapped to the same DSN) in the event of failure.</t>
<t>The Data Sequence Signal (DSS) carries the Data Sequence Mapping. The Data Sequence Mapping consists of the subflow sequence number, data sequence number, and length for which this mapping is valid. This option can also carry a connection-level acknowledgment (the "Data ACK") for the received DSN.</t>
<t>With MPTCP, all subflows share the same receive buffer and advertise the same receive window. There are two levels of acknowledgment in MPTCP. Regular TCP acknowledgments are used on each subflow to acknowledge the reception of the segments sent over the subflow independently of their DSN. In addition, there are connection-level acknowledgments for the data sequence space. These acknowledgments track the advancement of the bytestream and slide the receiving window.</t>
<t>Further details are in <xref target="sec_generalop"/>.</t>
<figure><artwork align="left"><![CDATA[
          Host A                                  Host B
          ------                                  ------
          DSS                       ->
          [Data Sequence Mapping]
          [Data ACK]
          [Checksum]
]]></artwork></figure>
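<t>A minimal sketch of the two ideas above, the Data Sequence Mapping and the connection-level Data ACK, is given below. The class and function names are hypothetical, and the sketch ignores 32-bit subflow sequence wraparound and receive-window limits for brevity.</t>
<figure><artwork align="left"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    dsn: int      # first data sequence number covered (64-bit space)
    ssn: int      # first subflow sequence number covered
    length: int   # number of bytes for which the mapping is valid


def to_dsn(m: Mapping, ssn: int) -> int:
    """Translate a subflow sequence number into the data sequence space."""
    offset = ssn - m.ssn
    if not 0 <= offset < m.length:
        raise ValueError("sequence number outside this mapping")
    return (m.dsn + offset) % 2**64


def data_ack(next_expected: int, received: dict) -> int:
    """Advance the cumulative Data ACK over contiguous bytes.

    `received` maps a starting DSN to the number of bytes held at that
    DSN, regardless of which subflow delivered them.
    """
    while next_expected in received:
        next_expected += received[next_expected]
    return next_expected
```
]]></artwork></figure>
<t>In this sketch, data that arrives on one subflow ahead of a gap is held until the missing bytes arrive, possibly on a different subflow, and only then does the Data ACK advance.</t>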
</section>
<section title="Requesting a Change in a Path's Priority">
<t>Hosts can indicate at initial subflow setup whether they wish the subflow to be used as a regular or backup path -- a backup path only being used if there are no regular paths available. During a connection, Host A can request a change in the priority of a subflow through the MP_PRIO signal to Host B. Further details are in <xref target="sec_policy"/>.</t>
<figure><artwork align="left"><![CDATA[
          Host A                                  Host B
          ------                                  ------
          MP_PRIO                   ->
]]></artwork></figure>
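<t>The regular/backup distinction can be illustrated with a small selection routine. The Subflow class, the "usable" flag, and the function names are assumptions of this sketch; how an implementation decides that a path is unavailable is a local matter not shown here.</t>
<figure><artwork align="left"><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class Subflow:
    name: str
    backup: bool = False   # priority set at setup or later via MP_PRIO
    usable: bool = True    # False once the path is considered failed


def on_mp_prio(subflow: Subflow, backup: bool) -> None:
    """Apply a peer's request to change the priority of a subflow."""
    subflow.backup = backup


def eligible_subflows(subflows: list) -> list:
    """Regular subflows if any are usable; otherwise usable backups."""
    usable = [s for s in subflows if s.usable]
    regular = [s for s in usable if not s.backup]
    return regular if regular else usable
```
]]></artwork></figure>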
</section>
<section title="Closing an MPTCP Connection">
<t>When a host wants to close an existing subflow, but not the whole connection, it can initiate a regular TCP FIN/ACK exchange.</t>
<t>When Host A wants to inform Host B that it has no more data to send, it signals this "Data FIN" as part of the Data Sequence Signal (see above). It has the same semantics and behavior as a regular TCP FIN, but at the connection level. Once all the data on the MPTCP connection has been successfully received, then this message is acknowledged at the connection level with a Data ACK. Further details are in <xref target="sec_close"/>.</t>
<figure><artwork align="left"><![CDATA[
          Host A                                  Host B
          ------                                  ------
          DSS                       ->
          [Data FIN]
                                    <-            DSS
                                                  [Data ACK]
]]></artwork></figure>
<t>There is an additional method of connection closure, referred to as "Fast Close", which is analogous to closing a single-path TCP connection with a RST signal. The MP_FASTCLOSE signal is used to indicate to the peer that the connection will be abruptly closed and no further data will be accepted. It can be sent on an ACK (which ensures reliability of the signal) or on a RST (which does not). Both examples are shown in the following diagrams. Further details are in <xref target="sec_fastclose"/>.</t>
<figure><artwork align="left"><![CDATA[
          Host A                                  Host B
          ------                                  ------
          ACK + MP_FASTCLOSE          ->
          [B's key]
          [RST on all other subflows] ->
                                      <-          [RST on all subflows]

          Host A                                  Host B
          ------                                  ------
          RST + MP_FASTCLOSE          ->
          [B's key] [on all subflows]
                                      <-          [RST on all subflows]
]]></artwork></figure>
</section>
<section title="Notable Features">
<t>It is worth highlighting that MPTCP's signaling has been designed with several key requirements in mind:
<list style="symbols">
<t>To cope with NATs on the path, addresses are referred to by Address IDs, in case the IP packet's source
address gets changed by a NAT. Setting up a new TCP flow is not possible if the receiver of the SYN is behind a NAT;
to allow subflows to be created when either end is behind a NAT, MPTCP uses the ADD_ADDR message. </t>
<t>MPTCP falls back to ordinary TCP if MPTCP operation is not possible, for example, if one host is not MPTCP capable or if a middlebox alters the payload. This is discussed in <xref target="sec_fallback"/>.</t>
<t>To address the threats identified in <xref target="RFC6181"/>, the following steps are taken: keys are sent in the clear in the MP_CAPABLE messages; MP_JOIN messages are secured with HMAC-SHA256 (<xref target="RFC2104"/>, <xref target="RFC6234"/>) using those keys; and standard TCP validity checks are made on the other messages (ensuring sequence numbers are in-window <xref target="RFC5961"/>). Residual threats to MPTCP v0 were identified in <xref target="RFC7430"/>, and those affecting the protocol (i.e., modification to ADD_ADDR) have been incorporated in this document. Further discussion of security can be found in <xref target="sec_security"/>.</t>
</list></t>
</section>
</section>
<section title="MPTCP Protocol" anchor="sec_protocol">
<t>This section describes the operation of the MPTCP protocol, and is subdivided into sections for each key part of the protocol operation.</t>
<t>All MPTCP operations are signaled using optional TCP header fields. A single TCP option number ("Kind") has been assigned by IANA for MPTCP (see <xref target="IANA"/>), and individual messages are distinguished by a "subtype", the values of which are stored in an IANA registry (and are also listed in <xref target="IANA"/>). As with all TCP options, the Length field is specified in bytes, and includes the 2 bytes of Kind and Length.</t>
<t>Throughout this document, when reference is made to an MPTCP option by symbolic name, such as "MP_CAPABLE", this refers to a TCP option with the single MPTCP option type, and with the subtype value of the symbolic name as defined in <xref target="IANA"/>. This subtype is a 4-bit field -- the first 4 bits of the option payload, as shown in <xref target="fig_option"/>. The MPTCP messages are defined in the following sections.</t>
<?rfc needLines='8'?>
<figure align="center" anchor="fig_option" title="MPTCP Option Format">
<artwork align="left"><![CDATA[
                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+---------------+-------+-----------------------+
|     Kind      |    Length     |Subtype|                       |
+---------------+---------------+-------+                       |
|                     Subtype-specific data                     |
|                       (variable length)                       |
+---------------------------------------------------------------+
]]></artwork>
</figure>
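<t>As a non-normative illustration, the common header shown in <xref target="fig_option"/> could be parsed along the lines of the following Python sketch. The function name and the constant MPTCP_KIND are hypothetical, and the Kind value of 30 is an assumption that should be checked against <xref target="IANA"/>.</t>
<figure><artwork align="left"><![CDATA[
```python
MPTCP_KIND = 30  # assumption: IANA-assigned TCP option Kind for MPTCP


def parse_mptcp_option(option: bytes):
    """Split one TCP option into (subtype, payload).

    Returns None if the bytes are not a well-formed MPTCP option.
    """
    if len(option) < 3:
        return None
    kind, length = option[0], option[1]
    # Length counts the Kind and Length octets themselves.
    if kind != MPTCP_KIND or length != len(option):
        return None
    # The subtype is the upper 4 bits of the first payload octet.
    subtype = option[2] >> 4
    # The lower 4 bits of that octet are subtype-specific, so the
    # payload is returned starting from that same octet.
    return subtype, option[2:]
```
]]></artwork></figure>
<t>The payload is returned from the octet that holds the subtype because the low-order 4 bits of that octet are already subtype-specific data.</t>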
<t>Those MPTCP options associated with subflow initiation are used on packets with the SYN flag set. Additionally, there is one MPTCP option for signaling metadata to ensure segmented data can be recombined for delivery to the application.</t>
<t>The remaining options, however, are signals that do not need to be on a specific packet, such as those for signaling additional addresses. Whilst an implementation may desire to send MPTCP options as soon as possible, it may not be possible to combine all desired options (both those for MPTCP and for regular TCP, such as SACK (selective acknowledgment) <xref target="RFC2018"/>) on a single packet. Therefore, an implementation may choose to send duplicate ACKs containing the additional signaling information. This changes the semantics of a duplicate ACK; these are usually only sent as a signal of a lost segment <xref target="RFC5681"/> in regular TCP. Therefore, an MPTCP implementation receiving a duplicate ACK that contains an MPTCP option MUST NOT treat it as a signal of congestion. Additionally, an MPTCP implementation SHOULD NOT send more than two duplicate ACKs in a row for the purposes of sending MPTCP options alone, in order to ensure no middleboxes misinterpret this as a sign of congestion.</t>
<t>Furthermore, standard TCP validity checks (such as ensuring the sequence number and acknowledgment number are within window) MUST be undertaken before processing any MPTCP signals, as described in <xref target="RFC5961"/>, and initial subflow sequence numbers SHOULD be generated according to the recommendations in <xref target="RFC6528"/>.</t>
<section title="Connection Initiation" anchor="sec_init">
<t>Connection initiation begins with a SYN, SYN/ACK, ACK exchange
on a single path. Each packet
contains the Multipath Capable (MP_CAPABLE) MPTCP option
(<xref target="tcpm_capable"/>). This option declares that its
sender is capable of performing Multipath TCP and wishes to do
so on this particular connection.</t>
<t>The MP_CAPABLE exchange in this specification (v1) is different from
that specified in v0. If a host supports multiple versions
of MPTCP, the sender of the MP_CAPABLE option SHOULD signal the
highest version number it supports. In return, in its MP_CAPABLE option,
the receiver will signal the version number it wishes to use, which MUST
be equal to or lower than the version number indicated in the initial
MP_CAPABLE.
There is, however, a caveat with respect to version negotiation with
old listeners that only support v0. A listener that supports v0 expects
the MP_CAPABLE option in the SYN-segment to include the initiator's key. If
the initiator has already upgraded to v1, it will not include the key in the
SYN-segment. The listener will therefore ignore the MP_CAPABLE of this SYN-segment
and reply with a SYN/ACK that does not include an MP_CAPABLE. The initiator MAY
choose to immediately fall back to TCP or MAY choose to attempt a connection
using MPTCP v0 (if the initiator supports v0), in order to discover whether the
listener supports the earlier version of MPTCP. In general, an MPTCP v0 connection
is likely to be preferred to a TCP one. In a particular deployment scenario, however,
it may be known that the listener is unlikely to support MPTCP v0, and so the
initiator may prefer not to attempt a v0 connection. An initiator MAY cache
information for a peer about what version of MPTCP it supports, if any, and use
this information for future connection attempts.</t>
<t>The MP_CAPABLE option is variable-length, with different fields
included depending on which packet the option is used on. The full
MP_CAPABLE option is shown in <xref target="tcpm_capable"/>.</t>
<?rfc needLines='10'?>
<figure align="center" anchor="tcpm_capable" title="Multipath Capable (MP_CAPABLE) Option">
<artwork align="left"><![CDATA[
                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+---------------+-------+-------+---------------+
|     Kind      |    Length     |Subtype|Version|A|B|C|D|E|F|G|H|
+---------------+---------------+-------+-------+---------------+
|                 Option Sender's Key (64 bits)                 |
|                    (if option Length > 4)                     |
|                                                               |
+---------------------------------------------------------------+
|                Option Receiver's Key (64 bits)                |
|                    (if option Length > 12)                    |
|                                                               |
+-------------------------------+-------------------------------+
|  Data-Level Length (16 bits)  |  Checksum (16 bits, optional) |
+-------------------------------+-------------------------------+
]]></artwork>
</figure>
<t>The MP_CAPABLE option is carried on the SYN, SYN/ACK, and ACK packets that start the first subflow of an MPTCP connection, as well as the first packet that carries data, if the initiator wishes to send first. The data carried by each option is as follows, where A = initiator and B = listener.
<list style="symbols">
<t>SYN (A->B): only the first four octets (Length = 4).</t>
<t>SYN/ACK (B->A): B's Key for this connection (Length = 12).</t>
<t>ACK (no data) (A->B): A's Key followed by B's Key (Length = 20).</t>
<t>ACK (with first data) (A->B): A's Key followed by B's Key followed by Data-Level Length, and optional Checksum (Length = 22 or 24).</t>
</list>
The contents of the option are determined by the SYN and ACK flags of the packet, along with the option's Length field. For the diagram shown in <xref target="tcpm_capable"/>, "sender" and "receiver" refer to the sender or receiver of the TCP packet (which can be either host).</t>
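<t>The mapping from TCP flags and option Length to the fields present can be sketched, non-normatively, as follows. The function and field names are hypothetical; only the flag and Length combinations are taken from the list above.</t>
<figure><artwork align="left"><![CDATA[
```python
def mp_capable_fields(syn: bool, ack: bool, length: int):
    """Return the fields that follow the 4-octet MP_CAPABLE header.

    "sender" and "receiver" are relative to the TCP packet carrying
    the option. Returns None for a combination that is not listed.
    """
    if syn and not ack:
        # SYN: only the first four octets.
        return [] if length == 4 else None
    if syn and ack:
        # SYN/ACK: the listener's key.
        return ["sender_key"] if length == 12 else None
    if ack:
        layouts = {
            20: ["sender_key", "receiver_key"],
            22: ["sender_key", "receiver_key", "data_level_length"],
            24: ["sender_key", "receiver_key", "data_level_length",
                 "checksum"],
        }
        return layouts.get(length)
    return None
```
]]></artwork></figure>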
<t>The initial SYN, containing just the MP_CAPABLE header, is used
to define the version of MPTCP being requested, as well as exchanging
flags to negotiate connection features, described later.</t>
<t>This option is used to declare the 64-bit keys that the end hosts have generated for this MPTCP connection. These keys are used to authenticate the addition of future subflows to this connection. This is the only time the key will be sent in clear on the wire (unless "fast close", <xref target="sec_fastclose"/>, is used); all future subflows will identify the connection using a 32-bit "token". This token is a cryptographic hash of this key. The algorithm for this process is dependent on the authentication algorithm selected; the method of selection is defined later in this section.</t>
<t>Upon reception of the initial SYN-segment, a stateful server generates a random key and replies with a SYN/ACK. The key's method of generation is implementation specific. The key MUST be hard to guess, and it MUST be unique for the sending host across all its current MPTCP connections. Recommendations for generating random numbers for use in keys are given in <xref target="RFC4086"/>. Connections will be indexed at each host by the token (a one-way hash of the key). Therefore, an implementation will require a mapping from each token to the corresponding connection, and in turn to the keys for the connection.</t>
<t>There is a risk that two different keys will hash to the same token. The risk of hash collisions is usually small, unless the host is handling many tens of thousands of connections. Therefore, an implementation SHOULD check its list of connection tokens to ensure there is no collision before sending its key, and if there is, then it should generate a new key. This would, however, be costly for a server with thousands of connections. Nevertheless, the subflow handshake mechanism (<xref target="sec_join"/>) will ensure that new subflows only join the correct connection, through the cryptographic handshake, the checking of the connection tokens in both directions, and the verification that sequence numbers are in-window. So, in the worst case, if there were a token collision, the new subflow would not succeed, but the MPTCP connection would continue to provide a regular TCP service.</t>
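<t>A minimal, non-normative sketch of the collision check is given below. It assumes the SHA-256-based token derivation specified later in this section, draws keys from the operating system's random source, and gives up after a small number of attempts; the function names and the retry limit are hypothetical.</t>
<figure><artwork align="left"><![CDATA[
```python
import hashlib
import secrets


def token_of(key: bytes) -> int:
    """Most significant 32 bits of the SHA-256 hash of the key."""
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")


def new_connection_key(tokens_in_use, max_attempts: int = 8) -> bytes:
    """Generate a 64-bit key whose token is not already in use."""
    for _ in range(max_attempts):
        key = secrets.token_bytes(8)
        if token_of(key) not in tokens_in_use:
            return key
    # The retry limit is an illustrative policy, not a requirement.
    raise RuntimeError("no collision-free key found")
```
]]></artwork></figure>
<t>A set keyed by token gives a constant-time membership test, which keeps the check inexpensive even when many connections are open.</t>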
<t>Since key generation is implementation-specific, there is no requirement that they be simply random numbers. An implementation is free to exchange cryptographic material out-of-band and generate these keys from this, in order to provide additional mechanisms by which to verify the identity of the communicating entities. For example, an implementation could choose to link its MPTCP keys to those used in higher-layer TLS or SSH connections.</t>
<t>If the server behaves in a
stateless manner, it has to generate its own key in a verifiable
fashion. This verifiable way of generating the key can be done by
using a hash of the 4-tuple, sequence number and a local secret
(similar to what is done for the TCP-sequence number <xref target="RFC4987"/>).
It will thus be able to verify whether it is indeed the originator of
the key echoed back in the later MP_CAPABLE option.
As for a stateful server, the tokens SHOULD be checked for uniqueness. However,
if uniqueness is not met, and there is no way to generate an alternative verifiable
key, then the connection MUST fall back to using regular TCP by not sending an
MP_CAPABLE in the SYN/ACK.</t>
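<t>One possible way to derive a verifiable key is sketched below, purely as an illustration. The use of HMAC-SHA256 as the keyed hash, the byte encoding of the 4-tuple, and all function names are assumptions of this sketch; this document only requires that the key be derived from the 4-tuple, the sequence number, and a local secret.</t>
<figure><artwork align="left"><![CDATA[
```python
import hashlib
import hmac
import ipaddress
import struct


def stateless_key(secret: bytes, src_ip: str, src_port: int,
                  dst_ip: str, dst_port: int, seq: int) -> bytes:
    """Derive a 64-bit key from the 4-tuple, sequence number, secret."""
    material = (
        ipaddress.ip_address(src_ip).packed
        + struct.pack("!H", src_port)
        + ipaddress.ip_address(dst_ip).packed
        + struct.pack("!H", dst_port)
        + struct.pack("!I", seq)
    )
    # Truncate the 256-bit output to the 64 bits carried in MP_CAPABLE.
    return hmac.new(secret, material, hashlib.sha256).digest()[:8]


def key_is_ours(echoed_key: bytes, secret: bytes, src_ip: str,
                src_port: int, dst_ip: str, dst_port: int,
                seq: int) -> bool:
    """Check a key echoed on the third ACK by recomputing it."""
    expected = stateless_key(secret, src_ip, src_port,
                             dst_ip, dst_port, seq)
    return hmac.compare_digest(echoed_key, expected)
```
]]></artwork></figure>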
<t>The ACK carries both A's key and B's key. This is the first time that A's key is seen on the wire, although it is expected that A will have generated a key locally before the initial SYN. The echoing of B's key allows B to operate statelessly, as described above. Therefore, A's key must be delivered reliably to B, and in order to do this, the transmission of this packet must be made reliable.</t>
<t>If B has data to send first, then the reliable delivery of the ACK+MP_CAPABLE can be inferred by the receipt of this data with an MPTCP Data Sequence Signal (DSS) option (<xref target="sec_generalop"/>). If, however, A wishes to send data first, it has two options to ensure the reliable delivery of the ACK+MP_CAPABLE. If it immediately has data to send, then the third ACK (with data) would also contain an MP_CAPABLE option with additional data parameters (the Data-Level Length and optional Checksum as shown in <xref target="tcpm_capable"/>). If A does not immediately have data to send, it MUST include the MP_CAPABLE on the third ACK, but without the additional data parameters. When A does have data to send, it must repeat the sending of the MP_CAPABLE option from the third ACK, with additional data parameters. This MP_CAPABLE option is in place of the DSS, and simply specifies the data-level length of the payload, and the checksum (if the use of checksums is negotiated). This is the minimal data required to establish an MPTCP connection: it allows validation of the payload, and given it is the first data, the Initial Data Sequence Number (IDSN) is also known (as it is generated from the key, as described below). Conveying the keys on the first data packet allows the TCP reliability mechanisms to ensure the packet is successfully delivered. The receiver will acknowledge this data at the connection level with a Data ACK, as if a DSS option had been received.</t>
<t>There could be situations where both A and B attempt to transmit initial data at the same time. For example, if A did not initially have data to send, but then needed to transmit data before it had received anything from B, it would use an MP_CAPABLE option with data parameters (since it would not know if the MP_CAPABLE on the ACK was received). In such a situation, B may also have transmitted data with a DSS option, but it had not yet been received at A. Therefore, B has received data with an MP_CAPABLE mapping after it has sent data with a DSS option. To ensure these situations can be handled, it follows that the data parameters in an MP_CAPABLE are semantically equivalent to those in a DSS option and can be used interchangeably. Similar situations could occur when the MP_CAPABLE with data is lost and retransmitted. Furthermore, in the case of TCP Segmentation Offloading, the MP_CAPABLE with data parameters may be duplicated across multiple packets, and implementations must also be able to cope with duplicate MP_CAPABLE mappings as well as duplicate DSS mappings.</t>
<t>Additionally, the MP_CAPABLE exchange allows the safe passage of MPTCP options on SYN packets to be determined. If any of these options are dropped, MPTCP will gracefully fall back to regular single-path TCP, as documented in <xref target="sec_fallback"/>. If at any point in the handshake either party thinks the MPTCP negotiation is compromised, for example by a middlebox corrupting the TCP options, or unexpected ACK numbers being present, the host MUST stop using MPTCP and no longer include MPTCP options in future TCP packets. The other host will then also fall back to regular TCP using the fall back mechanism. Note that new subflows MUST NOT be established (using the process documented in <xref target="sec_join"/>) until a Data Sequence Signal (DSS) option has been successfully received across the path (as documented in <xref target="sec_generalop"/>).</t>
<t>Like all MPTCP options, the MP_CAPABLE option starts with the Kind and Length fields, which specify the TCP option kind and its length. These are followed by the MP_CAPABLE-specific fields. The first 4 bits of the first octet of these fields (<xref target="tcpm_capable"/>) define the MPTCP option subtype (see <xref target="IANA"/>; for MP_CAPABLE, this is 0x0), and the remaining 4 bits of this octet specify the MPTCP version in use (for this specification, this is 1).</t>
<t>The second octet is reserved for flags, allocated as follows:
<list style="hanging">
<t hangText="A:"> The leftmost bit, labeled "A", SHOULD be set to 1 to indicate "Checksum Required", unless the system administrator has decided that checksums are not required (for example, if the environment is controlled and no middleboxes exist that might adjust the payload).</t>
<t hangText="B:"> The second bit, labeled "B", is an extensibility flag, and MUST be set to 0 for current implementations. This will be used for an extensibility mechanism in a future specification, and the impact of this flag will be defined at a later date. It is expected, but not mandated, that this flag would be used as part of an alternative security mechanism that does not require a full version upgrade of the protocol, but does require redefining some elements of the handshake. If receiving a message with the 'B' flag set to 1, and this is not understood, then the MP_CAPABLE in this SYN MUST be silently ignored, which triggers a fallback to regular TCP; the sender is expected to retry with a format compatible with this legacy specification. Note that the length of the MP_CAPABLE option, and the meanings of bits "D" through "H", may be altered by setting B=1.</t>
<t hangText="C:"> The third bit, labeled "C", is set to "1" to indicate that the sender of this option will not accept additional MPTCP subflows to the source address and port, and therefore the receiver MUST NOT try to open any additional subflows towards this address and port. This is an efficiency improvement for situations where the sender knows a restriction is in place, for example if the sender is behind a strict NAT, or operating behind a legacy Layer 4 load balancer.</t>
<t hangText="D through H:"> The remaining bits, labeled "D" through "H", are used for crypto algorithm negotiation. In this specification only the rightmost bit, labeled "H", is assigned. Bit "H" indicates the use of HMAC-SHA256 (as defined in <xref target="sec_join"/>). An implementation that only supports this method MUST set bit "H" to 1, and bits "D" through "G" to 0.</t>
</list>
A crypto algorithm MUST be specified. If flag bits D through H are all 0, the MP_CAPABLE option MUST be treated as invalid and ignored (that is, it must be treated as a regular TCP handshake).</t>
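<t>The flag rules above can be sketched, non-normatively, as bit tests on the flags octet. The constant and function names are hypothetical; the bit positions follow <xref target="tcpm_capable"/>, with "A" as the leftmost bit and "H" as the rightmost.</t>
<figure><artwork align="left"><![CDATA[
```python
FLAG_A_CHECKSUM_REQUIRED = 0x80  # leftmost bit of the flags octet
FLAG_B_EXTENSIBILITY = 0x40
FLAG_C_NO_NEW_SUBFLOWS = 0x20
CRYPTO_BITS = 0x1F               # bits "D" through "H"
FLAG_H_HMAC_SHA256 = 0x01        # rightmost bit


def mp_capable_flags_acceptable(flags: int) -> bool:
    """Return False when the MP_CAPABLE option must be ignored."""
    if flags & FLAG_B_EXTENSIBILITY:
        # B=1 is not understood by an implementation of this
        # document alone, so the option is silently ignored.
        return False
    if (flags & CRYPTO_BITS) == 0:
        # No crypto algorithm specified: the option is invalid.
        return False
    return True


def may_open_subflows_to_peer(peer_flags: int) -> bool:
    """C=1: no further subflows to the peer's source address/port."""
    return not (peer_flags & FLAG_C_NO_NEW_SUBFLOWS)
```
]]></artwork></figure>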
<t>The selection of the authentication algorithm also impacts the algorithm used to generate the token and the Initial Data Sequence Number (IDSN). In this specification, with only the SHA-256 algorithm (bit "H") specified and selected, the token MUST be a truncated (most significant 32 bits) SHA-256 hash (<xref target="RFC6234"/>) of the key. A different, 64-bit truncation (the least significant 64 bits) of the SHA-256 hash of the key MUST be used as the IDSN. Note that the key MUST be hashed in network byte order. Also note that the "least significant" bits MUST be the rightmost bits of the SHA-256 digest, as per <xref target="RFC6234"/>. Future specifications of the use of the crypto bits may choose to specify different algorithms for token and IDSN generation.</t>
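<t>For illustration only, the token and IDSN derivation for the SHA-256 case could be written as follows. The function name is hypothetical, and the key is assumed to be supplied as 8 octets already in network byte order.</t>
<figure><artwork align="left"><![CDATA[
```python
import hashlib


def token_and_idsn(key: bytes):
    """Derive (token, IDSN) from a 64-bit key using SHA-256."""
    if len(key) != 8:
        raise ValueError("key must be exactly 8 octets")
    digest = hashlib.sha256(key).digest()
    # Token: most significant (leftmost) 32 bits of the digest.
    token = int.from_bytes(digest[:4], "big")
    # IDSN: least significant (rightmost) 64 bits of the digest.
    idsn = int.from_bytes(digest[-8:], "big")
    return token, idsn
```
]]></artwork></figure>
<t>Because both values come from a single digest, an implementation can compute them once per key and cache them alongside the connection state.</t>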
<t>Both the crypto and checksum bits negotiate capabilities in similar ways. For the Checksum Required bit (labeled "A"), if either host requires the use of checksums, checksums MUST be used. In other words, the only way for checksums not to be used is if both hosts in their SYNs set A=0. This decision is confirmed by the setting of the "A" bit in the third packet (the ACK) of the handshake. For example, if the initiator sets A=0 in the SYN, but the responder sets A=1 in the SYN/ACK, checksums MUST be used in both directions, and the initiator will set A=1 in the ACK. The decision whether to use checksums will be stored by an implementation in a per-connection binary state variable. If A=1 is received by a host that does not want to use checksums, it MUST fall back to regular TCP by ignoring the MP_CAPABLE option as if it was invalid.</t>
<t>For crypto negotiation, the responder has the choice. The initiator creates a proposal setting a bit for each algorithm it supports to 1 (in this version of the specification, there is only one proposal, so bit "H" will always be set to 1). The responder responds with only 1 bit set -- this is the chosen algorithm. The rationale for this behavior is that the responder will typically be a server with potentially many thousands of connections, so it may wish to choose an algorithm with minimal computational complexity, depending on the load. If a responder does not support (or does not want to support) any of the initiator's proposals, it MUST respond without an MP_CAPABLE option, thus forcing a fallback to regular TCP.</t>
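<t>The checksum and crypto negotiation rules can be sketched together as follows. This is a non-normative illustration: the names are hypothetical, and picking the lowest common bit is an arbitrary policy chosen for the sketch, since the responder is free to choose any algorithm from the proposal.</t>
<figure><artwork align="left"><![CDATA[
```python
FLAG_A_CHECKSUM_REQUIRED = 0x80
CRYPTO_BITS = 0x1F          # bits "D" through "H"
HMAC_SHA256 = 0x01          # bit "H"


def checksums_in_use(initiator_flags: int, responder_flags: int) -> bool:
    """Checksums are used unless both hosts set A=0."""
    either = initiator_flags | responder_flags
    return bool(either & FLAG_A_CHECKSUM_REQUIRED)


def choose_algorithm(proposal_flags: int, supported: int = HMAC_SHA256):
    """Responder side: return the single chosen crypto bit.

    Returns None when no proposed algorithm is supported, in which
    case the responder replies without MP_CAPABLE.
    """
    common = proposal_flags & supported & CRYPTO_BITS
    if common == 0:
        return None
    # Isolate the lowest set bit so exactly one bit is returned.
    return common & -common
```
]]></artwork></figure>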
<t>The MP_CAPABLE option is only used in the first subflow of a connection, in order to identify the connection; all following subflows will use the "Join" option (see <xref target="sec_join"/>) to join the existing connection.</t>
<t>If a SYN contains an MP_CAPABLE option but the
SYN/ACK does not, it is assumed that the sender of the SYN/ACK is not
multipath capable; thus, the MPTCP session MUST operate as
a regular, single-path TCP. If a SYN does not contain an
MP_CAPABLE option, the SYN/ACK MUST NOT contain one
in response. If the third packet (the ACK) does not contain
the MP_CAPABLE option, then the session MUST fall back to
operating as a regular, single-path TCP. This is to maintain
compatibility with middleboxes on the path that drop some
or all TCP options. Note that an implementation MAY choose
to attempt sending MPTCP options more than one time before
making this decision to operate as regular TCP (see
<xref target="heuristics"/>).</t>
<t>If the SYN packets are unacknowledged, it is up to local
policy to decide how to respond. It is expected that a sender
will eventually fall back to single-path TCP (i.e., without the
MP_CAPABLE option) in order to work around middleboxes that
may drop packets with unknown options; however, the number of
multipath-capable attempts that are made first will be up to
local policy.
It is possible that MPTCP and non-MPTCP SYNs could get reordered
in the network. Therefore, the final state is inferred from the
presence or absence of the MP_CAPABLE option in the third packet
of the TCP handshake. If this option is not present, the
connection SHOULD fall back to regular TCP, as documented in
<xref target="sec_fallback"/>.</t>
<t>The initial data sequence number on an MPTCP connection
is generated from the key. The algorithm for IDSN generation is
also determined from the negotiated authentication algorithm.
In this specification, with only the SHA-256 algorithm specified and
selected, the IDSN of a host MUST be the least significant 64 bits of the
SHA-256 hash of its key, i.e., IDSN-A = Hash(Key-A) and IDSN-B = Hash(Key-B).
This deterministic generation of the IDSN allows a receiver to ensure
that there are no gaps in sequence space at the start of the connection.
The SYN with MP_CAPABLE occupies the first octet of data sequence space,
although this does not need to be acknowledged at the connection level
until the first data is sent (see <xref target="sec_generalop"/>).</t>
</section>
<section title="Starting a New Subflow" anchor="sec_join">
<t>Once an MPTCP connection has begun with the MP_CAPABLE
exchange, further subflows can be added to the connection.
Hosts have knowledge of their own address(es), and can
become aware of the other host's addresses through
signaling exchanges as described in
<xref target="sec_pm"/>. Using this knowledge, a host
can initiate a new subflow over a currently unused pair of
addresses. It is permitted for either host in a connection
to initiate the creation of a new subflow, but it is expected
that this will normally be the original connection initiator
(see <xref target="heuristics"/> for heuristics).</t>
<t>A new subflow is started as a normal TCP SYN/ACK
exchange. The Join Connection (MP_JOIN) MPTCP option
is used to identify the connection to be joined by the new subflow.
It uses keying material that was exchanged in the initial MP_CAPABLE
handshake (<xref target="sec_init"/>), and that handshake also
negotiates the crypto algorithm in use for the MP_JOIN handshake.</t>
<t>This section specifies the behavior of MP_JOIN using the HMAC-SHA256
algorithm. An MP_JOIN option is present in the SYN, SYN/ACK,
and ACK of the three-way handshake, although in each case with a
different format.</t>
<t>In the first MP_JOIN on the SYN packet, illustrated in
<xref target="tcpm_join"/>, the initiator sends a token, random
number, and address ID.</t>
<t>The token is used to identify the MPTCP connection and is a
cryptographic hash of the receiver's key, as exchanged
in the initial MP_CAPABLE handshake (<xref target="sec_init"/>).
In this specification, the tokens presented in this
option are generated by the SHA-256 <xref target="RFC6234"/>
algorithm, truncated to the most significant 32 bits. The token
included in the MP_JOIN option is the token that the receiver
of the packet uses to identify this connection; i.e., Host A
will send Token-B (which is generated from Key-B). Note that the
hash generation algorithm can be overridden by the choice of
cryptographic handshake algorithm, as defined in <xref target="sec_init"/>.</t>
<t>The MP_JOIN SYN sends not only the token (which is static for a
connection) but also random numbers (nonces) that are used to prevent
replay attacks on the authentication method. Recommendations for the
generation of random numbers for this purpose are given in <xref target="RFC4086"/>.</t>
<t>The MP_JOIN option includes an "Address ID". This is an identifier
generated by the sender of the option, used to identify the source address
of this packet, even if the IP header has been changed in transit by a middlebox.
The numeric value of this field is generated by the sender and must map uniquely
to a source IP address for the sending host.
The Address ID allows address removal (<xref target="sec_remove_addr"/>)
without needing to know what the source address at the
receiver is, thus allowing address removal through NATs.
The Address ID also allows correlation between new subflow setup attempts
and address signaling (<xref target="sec_add_address"/>),
to prevent setting up duplicate subflows on the same path, if an MP_JOIN
and ADD_ADDR are sent at the same time.</t>
<t>The Address IDs of the subflow used in the initial SYN
exchange of the first subflow in the connection are implicit,
and have the value zero. A host MUST store the mappings between
Address IDs and addresses both for itself and the remote host.
An implementation will also need to know which local and remote
Address IDs are associated with which established subflows, for
when addresses are removed from a local or remote host.</t>
<t>The MP_JOIN option on packets with the SYN flag set also includes 4 bits of flags, 3 of which are currently reserved and MUST be set to zero by the sender. The final bit, labeled "B", indicates whether the sender of this option wishes this subflow to be used as a backup path (B=1) in the event of failure of other paths, or whether it wants it to be used as part of the connection immediately. By setting B=1, the sender of the option is requesting the other host to only send data on this subflow if there are no available subflows where B=0. Subflow policy is discussed in more detail in <xref target="sec_policy"/>.</t>
<?rfc needLines='10'?>
<figure align="center" anchor="tcpm_join" title="Join Connection (MP_JOIN) Option (for Initial SYN)">
<artwork align="left"><![CDATA[
                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+---------------+-------+-----+-+---------------+
|     Kind      |  Length = 12  |Subtype|(rsv)|B|   Address ID  |
+---------------+---------------+-------+-----+-+---------------+
|                  Receiver's Token (32 bits)                   |
+---------------------------------------------------------------+
|               Sender's Random Number (32 bits)                |
+---------------------------------------------------------------+
]]></artwork>
</figure>
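<t>As a non-normative illustration, the option in <xref target="tcpm_join"/> could be serialized as shown below. The function name is hypothetical, and the Kind value (30) and MP_JOIN subtype value (0x1) are assumptions that should be checked against <xref target="IANA"/>.</t>
<figure><artwork align="left"><![CDATA[
```python
import struct

MPTCP_KIND = 30        # assumption: IANA-assigned Kind for MPTCP
MP_JOIN_SUBTYPE = 0x1  # assumption: subtype value for MP_JOIN


def build_mp_join_syn(token: int, nonce: int, address_id: int,
                      backup: bool = False) -> bytes:
    """Serialize the 12-octet MP_JOIN option carried on a SYN."""
    # Upper 4 bits: subtype. Next 3 bits: reserved (zero).
    # Lowest bit: the "B" (backup) flag.
    subtype_and_flags = (MP_JOIN_SUBTYPE << 4) | (1 if backup else 0)
    return struct.pack(
        "!BBBBII",
        MPTCP_KIND,
        12,                 # Length, including Kind and Length
        subtype_and_flags,
        address_id,
        token,              # receiver's token (32 bits)
        nonce,              # sender's random number (32 bits)
    )
```
]]></artwork></figure>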
<t>When receiving a SYN with an MP_JOIN option that contains
a valid token for an existing MPTCP connection, the recipient
SHOULD respond with a SYN/ACK also containing an MP_JOIN
option containing a random number and a truncated (leftmost 64
bits) Hash-based Message Authentication Code (HMAC). This
version of the option is shown in <xref target="tcpm_join2"/>.
If the token is unknown, or the host wants to refuse subflow
establishment (for example, due to a limit on the number of
subflows it will permit), the receiver will send back a reset
(RST) signal, analogous to an unknown port in TCP, containing an
MP_TCPRST option (<xref target="sec_reset"/>) with an "MPTCP
specific error" reason code. Although calculating an HMAC
requires cryptographic operations, it is believed that the
32-bit token in the MP_JOIN SYN gives sufficient protection against blind state
exhaustion attacks; therefore, there is no need to provide
mechanisms to allow a responder to operate statelessly at the
MP_JOIN stage.</t>
<t>An HMAC is sent by both hosts -- by the initiator (Host A)
in the third packet (the ACK) and by the responder (Host B) in
the second packet (the SYN/ACK). Doing the HMAC exchange at this
stage allows both hosts to have first exchanged random data (in the
first two SYN packets) that is used as the "message". This
specification defines that HMAC as defined in <xref target="RFC2104"/>
is used, along with the SHA-256 hash algorithm <xref target="RFC6234"/>,
and that the output is truncated to the leftmost 160 bits (20 octets).
Due to option space limitations, the HMAC included in
the SYN/ACK is truncated to the leftmost 64 bits, but this is
acceptable since random numbers are used; thus, an attacker
only has one chance to correctly guess the HMAC that matches the random
number previously sent by the peer (if the HMAC is
incorrect, the TCP connection is closed, so a new MP_JOIN negotiation
with a new random number is required).</t>
<t>The initiator's authentication information is sent in its
first ACK (the third packet of the handshake), as shown in
<xref target="tcpm_join3"/>. This data needs to be sent reliably,
since it is the only time this HMAC is sent;
therefore, receipt of this packet MUST trigger a regular TCP ACK
in response, and the packet MUST be retransmitted if this
ACK is not received. In other words, sending the ACK/MP_JOIN
packet places the subflow in the PRE_ESTABLISHED state, and it
moves to the ESTABLISHED state only on receipt of an ACK from
the receiver. It is not permitted to send data while in the
PRE_ESTABLISHED state. The reserved bits in this option MUST be set
to zero by the sender.</t>
<t>The key for the HMAC algorithm, in the case of the message transmitted by Host A, will be Key-A followed by Key-B, and in the case of Host B, Key-B followed by Key-A. These are the keys that were exchanged in the original MP_CAPABLE handshake. The "message" for the HMAC algorithm in each case is the concatenation of the random numbers of the two hosts (denoted by R): for Host A, R-A followed by R-B; and for Host B, R-B followed by R-A.</t>
<?rfc needLines='10'?>
<figure align="center" anchor="tcpm_join2" title="Join Connection (MP_JOIN) Option (for Responding SYN/ACK)">
<artwork align="left"><![CDATA[
                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+---------------+-------+-----+-+---------------+
|     Kind      |  Length = 16  |Subtype|(rsv)|B|   Address ID  |
+---------------+---------------+-------+-----+-+---------------+
|                                                               |
|               Sender's Truncated HMAC (64 bits)               |
|                                                               |
+---------------------------------------------------------------+
|               Sender's Random Number (32 bits)                |
+---------------------------------------------------------------+
]]></artwork>
</figure>
<?rfc needLines='12'?>
<figure align="center" anchor="tcpm_join3" title="Join Connection (MP_JOIN) Option (for Third ACK)">
<artwork align="left"><![CDATA[
                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+---------------+-------+-----------------------+
|     Kind      |  Length = 24  |Subtype|      (reserved)       |
+---------------+---------------+-------+-----------------------+
|                                                               |
|                                                               |
|              Sender's Truncated HMAC (160 bits)               |
|                                                               |
|                                                               |
+---------------------------------------------------------------+
]]></artwork>
</figure>
<t>These various MPTCP options fit together to enable authenticated subflow setup as illustrated in <xref target="fig_tokens"/>.</t>
<?rfc needLines='24'?>
<figure align="center" anchor="fig_tokens" title="Example Use of MPTCP Authentication">
<artwork align="left"><![CDATA[
          Host A                                 Host B
 ------------------------                      ----------
 Address A1    Address A2                      Address B1
 ----------    ----------                      ----------
     |             |                                |
     |             |  SYN + MP_CAPABLE              |
     |--------------------------------------------->|
     |<---------------------------------------------|
     |          SYN/ACK + MP_CAPABLE(Key-B)         |
     |             |                                |
     |        ACK + MP_CAPABLE(Key-A, Key-B)        |
     |--------------------------------------------->|
     |             |                                |
     |             |   SYN + MP_JOIN(Token-B, R-A)  |
     |             |------------------------------->|
     |             |<-------------------------------|
     |             | SYN/ACK + MP_JOIN(HMAC-B, R-B) |
     |             |                                |
     |             |     ACK + MP_JOIN(HMAC-A)      |
     |             |------------------------------->|
     |             |<-------------------------------|
     |             |              ACK               |

HMAC-A = HMAC(Key=(Key-A+Key-B), Msg=(R-A+R-B))
HMAC-B = HMAC(Key=(Key-B+Key-A), Msg=(R-B+R-A))
]]></artwork>
</figure>
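<t>The two HMAC computations in <xref target="fig_tokens"/> can be sketched, non-normatively, as follows. The function names are hypothetical; keys are assumed to be 8-octet byte strings and random numbers 32-bit integers serialized in network byte order.</t>
<figure><artwork align="left"><![CDATA[
```python
import hashlib
import hmac
import struct


def join_hmac(local_key: bytes, remote_key: bytes,
              local_nonce: int, remote_nonce: int) -> bytes:
    """Full HMAC-SHA256 as computed by the host sending it."""
    key = local_key + remote_key
    msg = struct.pack("!II", local_nonce, remote_nonce)
    return hmac.new(key, msg, hashlib.sha256).digest()


def synack_hmac(key_a: bytes, key_b: bytes, r_a: int, r_b: int) -> bytes:
    """HMAC-B, truncated to the leftmost 64 bits for the SYN/ACK."""
    return join_hmac(key_b, key_a, r_b, r_a)[:8]


def ack_hmac(key_a: bytes, key_b: bytes, r_a: int, r_b: int) -> bytes:
    """HMAC-A, truncated to the leftmost 160 bits for the third ACK."""
    return join_hmac(key_a, key_b, r_a, r_b)[:20]
```
]]></artwork></figure>
<t>A receiver would recompute the expected value with the same inputs and compare it to the received field, preferably with a constant-time comparison such as hmac.compare_digest.</t>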
<t>If the token received at Host B is unknown or local policy
prohibits the acceptance of the new subflow, the recipient MUST
respond with a TCP RST for the subflow. If appropriate, an MP_TCPRST
option with an "Administratively prohibited" reason code
(<xref target="sec_reset"/>) should be included.</t>
<t>If the token is accepted at Host B, but the HMAC returned to
Host A does not match the one expected, Host A MUST close the
subflow with a TCP RST. In this, and all following cases of sending
a RST in this section, the sender SHOULD send an MP_TCPRST option
(<xref target="sec_reset"/>) on this RST packet with the reason
code for an "MPTCP specific error".</t>
<t>If Host B does not receive the expected HMAC, or the MP_JOIN
option is missing from the ACK, it MUST close the subflow with a
TCP RST.</t>
<t>If the HMACs are verified as correct, then both hosts have
verified each other as being the same peers as existed at
the start of the connection, and they have agreed on the
connection of which this subflow will become a part.</t>
<t>If the SYN/ACK as received at Host A does not have an MP_JOIN
option, Host A MUST close the subflow with a TCP RST.</t>
<t>This covers all cases of the loss of an MP_JOIN. In more detail,
if MP_JOIN is stripped from the SYN on the path from A to
B, and Host B does not have a listener on the relevant
port, it will respond with a RST in the normal way. If in
response to a SYN with an MP_JOIN option, a SYN/ACK is
received without the MP_JOIN option (either since it was
stripped on the return path, or it was stripped on the
outgoing path but Host B responded as if
it were a new regular TCP session), then the subflow is
unusable and Host A MUST close it with a RST.</t>
<t>Note that additional subflows can be created
between any pair of ports (but see <xref target="heuristics"/> for
heuristics); no explicit application-level accept calls or
bind calls are required to open additional subflows. To
associate a new subflow with an existing connection, the token
supplied in the subflow's SYN exchange is used for
demultiplexing. This then binds the 5-tuple of the TCP
subflow to the local token of the connection. A consequence is
that it is possible to allow any port pairs to be used for a
connection. </t>
<t>Demultiplexing subflow SYNs MUST be done using the token;
this is unlike traditional TCP, where the destination port is
used for demultiplexing SYN packets. Once a subflow is set up,
demultiplexing packets is done using the 5-tuple, as in
traditional TCP. The 5-tuples will be mapped to the local
connection identifier (token). Note that Host A will know its
local token for the subflow even though it is not sent on the
wire -- only the responder's token is sent.</t>
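<t>The two-stage demultiplexing described above can be sketched as follows. This is an illustrative Python model only, not a definitive implementation: the Demux class, its method names, and the use of plain strings as connection identifiers are all hypothetical. For brevity the 5-tuple is bound as soon as the token matches, whereas a real implementation would also have to complete the HMAC exchange before treating the subflow as established.</t>
<figure align="center">
<artwork align="left"><![CDATA[
```python
from typing import Dict, NamedTuple, Optional


class FiveTuple(NamedTuple):
    proto: str
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


class Demux:
    """Hypothetical lookup tables for subflow demultiplexing."""

    def __init__(self) -> None:
        self.by_token: Dict[int, str] = {}      # local token -> connection
        self.by_tuple: Dict[FiveTuple, str] = {}  # 5-tuple -> connection

    def register(self, local_token: int, conn: str) -> None:
        self.by_token[local_token] = conn

    def on_join_syn(self, token: int, ft: FiveTuple) -> Optional[str]:
        """SYN + MP_JOIN: look up by token, not by destination port."""
        conn = self.by_token.get(token)
        if conn is None:
            return None  # unknown token: caller responds with a RST
        self.by_tuple[ft] = conn  # bind the 5-tuple to the connection
        return conn

    def on_segment(self, ft: FiveTuple) -> Optional[str]:
        """Established subflow: look up by 5-tuple, as in regular TCP."""
        return self.by_tuple.get(ft)
```
]]></artwork>
</figure>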
</section>
<section title="General MPTCP Operation" anchor="sec_generalop">
<t>This section discusses operation of MPTCP for data transfer. At a high level, an MPTCP implementation will take one input data stream from an application, and split it into one or more subflows, with sufficient control information to allow it to be reassembled and delivered reliably and in order to the recipient application. The following subsections define this behavior in detail.</t>
<t>The data sequence mapping and the Data ACK are signaled in the Data Sequence Signal (DSS) option (<xref target="tcpm_dsn"/>). Either or both can be signaled in one DSS, depending on the flags set. The data sequence mapping defines how the sequence space on the subflow maps to the connection level, and the Data ACK acknowledges receipt of data at the connection level. These functions are described in more detail in the following two subsections.</t>
<?rfc needLines='18'?>
<figure align="center" anchor="tcpm_dsn" title="Data Sequence Signal (DSS) Option">
<artwork align="left"><![CDATA[
                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+---------------+-------+----------------------+
|     Kind      |    Length     |Subtype| (reserved) |F|m|M|a|A|
+---------------+---------------+-------+----------------------+
|         Data ACK (4 or 8 octets, depending on flags)         |
+--------------------------------------------------------------+
|   Data sequence number (4 or 8 octets, depending on flags)   |
+--------------------------------------------------------------+
|              Subflow Sequence Number (4 octets)              |
+-------------------------------+------------------------------+
|  Data-Level Length (2 octets) |      Checksum (2 octets)     |
+-------------------------------+------------------------------+
]]></artwork>
</figure>
<t>The flags, when set, define the contents of this option, as follows:
<list style="symbols">
<t>A = Data ACK present</t>
<t>a = Data ACK is 8 octets (if not set, Data ACK is 4 octets)</t>
<t>M = Data Sequence Number (DSN), Subflow Sequence Number (SSN), Data-Level Length, and Checksum (if negotiated) present</t>
<t>m = Data sequence number is 8 octets (if not set, DSN is 4 octets)</t>
</list>
The flags 'a' and 'm' only have meaning if the corresponding 'A' or 'M' flags are set; otherwise, they will be ignored. The maximum length of this option, with all flags set, is 28 octets.</t>
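<t>To illustrate how the flags determine the layout and length of the option, the following Python sketch computes the expected option length and decodes the variable-width fields. It is a non-normative example: the names DSS, expected_length, and parse_dss are hypothetical, the flag bit positions are read off <xref target="tcpm_dsn"/> with 'A' as the least significant bit, and the Kind and Subtype values are deliberately not checked.</t>
<figure align="center">
<artwork align="left"><![CDATA[
```python
import struct
from typing import NamedTuple, Optional

FLAG_A = 0x01  # Data ACK present
FLAG_a = 0x02  # Data ACK is 8 octets
FLAG_M = 0x04  # DSN, SSN, Data-Level Length (and Checksum) present
FLAG_m = 0x08  # DSN is 8 octets
FLAG_F = 0x10  # Data FIN


class DSS(NamedTuple):
    data_fin: bool
    data_ack: Optional[int]
    dsn: Optional[int]
    ssn: Optional[int]
    length: Optional[int]
    checksum: Optional[int]


def expected_length(flags: int, with_checksum: bool) -> int:
    """Option length in octets implied by the flags."""
    n = 4  # Kind, Length, Subtype/reserved, flags
    if flags & FLAG_A:
        n += 8 if flags & FLAG_a else 4
    if flags & FLAG_M:
        n += (8 if flags & FLAG_m else 4) + 4 + 2
        if with_checksum:
            n += 2
    return n


def parse_dss(opt: bytes, with_checksum: bool) -> DSS:
    """Decode a DSS option; raise ValueError on a length mismatch."""
    _kind, length, _subtype, flag_byte = struct.unpack_from("!BBBB", opt, 0)
    flags = flag_byte & 0x1F
    if length != expected_length(flags, with_checksum) or len(opt) < length:
        raise ValueError("DSS length does not match flags/negotiation")
    off = 4
    data_ack = dsn = ssn = dll = csum = None
    if flags & FLAG_A:
        fmt = "!Q" if flags & FLAG_a else "!I"
        (data_ack,) = struct.unpack_from(fmt, opt, off)
        off += struct.calcsize(fmt)
    if flags & FLAG_M:
        fmt = "!Q" if flags & FLAG_m else "!I"
        (dsn,) = struct.unpack_from(fmt, opt, off)
        off += struct.calcsize(fmt)
        ssn, dll = struct.unpack_from("!IH", opt, off)
        off += 6
        if with_checksum:
            (csum,) = struct.unpack_from("!H", opt, off)
    return DSS(bool(flags & FLAG_F), data_ack, dsn, ssn, dll, csum)
```
]]></artwork>
</figure>
<t>With all flags set and checksums in use, expected_length() yields 28 octets, matching the maximum stated above.</t>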
<t>The 'F' flag indicates "Data FIN". If present, this means that this mapping covers the final data from the sender. This is the connection-level equivalent to the FIN flag in single-path TCP. A connection is not closed unless there has been a Data FIN exchange, an MP_FASTCLOSE (<xref target="sec_fastclose"/>) message, or an implementation-specific, connection-level send timeout. The purpose of the Data FIN and the interactions between this flag, the subflow-level FIN flag, and the data sequence mapping are described in <xref target="sec_close"/>.
The remaining reserved bits MUST be set to zero by an implementation of this specification.</t>
<t>Note that the checksum is only present in this option if the use of MPTCP checksumming has been negotiated at the MP_CAPABLE handshake (see <xref target="sec_init"/>). The presence of the checksum can be inferred from the length of the option. If a checksum is present, but its use had not been negotiated in the MP_CAPABLE handshake, the receiver MUST close the subflow with a RST as it is not behaving as negotiated. If a checksum is not present when its use has been negotiated, the receiver MUST close the subflow with a RST as it is considered broken. In both cases, this RST SHOULD be accompanied by an MP_TCPRST option (<xref target="sec_reset"/>) with the reason code for an "MPTCP specific error".</t>
<section title="Data Sequence Mapping" anchor="sec_dsn">
<t>The data stream as a whole can be reassembled through the use of the data sequence mapping components of the DSS option (<xref target="tcpm_dsn"/>), which define the
mapping from the subflow sequence number to the data sequence number. This is used by the receiver to ensure in-order delivery to the application layer. Meanwhile, the subflow-level sequence numbers (i.e., the regular sequence numbers in the TCP header) have subflow-only relevance. It is expected (but not mandated) that SACK <xref target='RFC2018'/> is used at the subflow level to improve efficiency.</t>
<t>The data sequence mapping specifies a mapping from subflow sequence space to data sequence space. This is expressed in terms of starting sequence numbers for the subflow and the data level, and a length of bytes for which this mapping is valid.
This explicit mapping for a range of data was chosen rather than per-packet signaling to assist with compatibility with situations where TCP/IP segmentation or coalescing is undertaken separately from the stack that is generating the data flow (e.g., through the use of TCP segmentation offloading on network interface cards, or by middleboxes such as performance enhancing proxies). It also allows a single mapping to cover many packets, which may be useful in bulk transfer situations.</t>
<t>A mapping is fixed, in that the subflow sequence number is bound to the data sequence number after the mapping has been processed. A sender MUST NOT change this mapping
after it has been declared; however, the same data sequence number can be mapped to by different subflows for retransmission purposes (see <xref target="sec_retransmit"/>). This would also permit the same data to be sent simultaneously on multiple subflows for resilience or efficiency purposes, especially in the case of lossy links. Although the detailed specification of such operation is outside the scope of this document, an implementation SHOULD treat the first data that is received at a subflow for the data sequence space as that which should be delivered to the application, and any later data for that sequence space SHOULD be ignored.</t>
<t>The data sequence number is specified as an absolute value, whereas the subflow sequence numbering is relative (the SYN at the start of the subflow has relative subflow sequence number 0). This allows middleboxes, such as firewalls that undertake Initial Sequence Number (ISN) randomization, to change the initial sequence number of a subflow without invalidating the mapping.</t>
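<t>As a non-normative illustration, the translation from a subflow sequence number to a data sequence number under a single mapping can be sketched in Python as follows. The Mapping class and the relative_ssn helper are hypothetical names introduced for this example; the modular arithmetic reflects the 32-bit subflow and 64-bit data sequence spaces.</t>
<figure align="center">
<artwork align="left"><![CDATA[
```python
from typing import NamedTuple, Optional

MOD32 = 1 << 32
MOD64 = 1 << 64


def relative_ssn(tcp_seq: int, subflow_isn: int) -> int:
    """Relative subflow sequence number (the SYN itself is 0)."""
    return (tcp_seq - subflow_isn) % MOD32


class Mapping(NamedTuple):
    dsn: int     # absolute data sequence number of the first byte
    ssn: int     # relative subflow sequence number of the first byte
    length: int  # number of bytes covered by the mapping

    def to_dsn(self, rel_seq: int) -> Optional[int]:
        """Data sequence number for rel_seq, or None if not covered."""
        offset = (rel_seq - self.ssn) % MOD32
        if offset >= self.length:
            return None
        return (self.dsn + offset) % MOD64
```
]]></artwork>
</figure>
<t>Because only the offset from the subflow's initial sequence number enters the calculation, a middlebox that rewrites the ISN consistently in both directions leaves the result unchanged.</t>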
<t>The data sequence mapping also contains a checksum of the data that this mapping covers, if use of checksums has been negotiated at the MP_CAPABLE exchange. Checksums are used to detect if the payload has been adjusted in any way by a non-MPTCP-aware middlebox. If this checksum fails, it will trigger a failure of the subflow, or a fallback to regular TCP, as documented in <xref target="sec_fallback"/>, since MPTCP can no longer reliably know the subflow sequence space at the receiver to build data sequence mappings. Without checksumming enabled, corrupt data may be delivered to the application if a middlebox alters segment boundaries, alters content, or does not deliver all segments covered by a data sequence mapping. It is therefore RECOMMENDED to use checksumming unless it is known the network path contains no such devices.</t>
<t>The checksum algorithm used is the standard TCP checksum <xref target="RFC0793"/>, operating over the data covered by this mapping, along with a pseudo-header as shown in <xref target="fig_pseudo"/>.</t>
<?rfc needLines='18'?>
<figure align="center" anchor="fig_pseudo" title="Pseudo-Header for DSS Checksum">
<artwork align="left"><![CDATA[
                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+--------------------------------------------------------------+
|                                                              |
|               Data Sequence Number (8 octets)                |
|                                                              |
+--------------------------------------------------------------+
|              Subflow Sequence Number (4 octets)              |
+-------------------------------+------------------------------+
|  Data-Level Length (2 octets) |       Zeros (2 octets)       |
+-------------------------------+------------------------------+
]]></artwork>
</figure>
<t>Note that the data sequence number used in the pseudo-header is always the 64-bit value, irrespective of what length is used in the DSS option itself. The standard TCP checksum algorithm has been chosen since it will be calculated anyway for the TCP subflow, and if calculated first over the data before adding the pseudo-headers, it only needs to be calculated once. Furthermore, since the TCP checksum is additive, the checksum for a data sequence mapping can be constructed by simply adding together the checksums for the data of each constituent TCP segment, and adding the checksum for the DSS pseudo-header.</t>
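<t>The following Python sketch illustrates the checksum over the pseudo-header of <xref target="fig_pseudo"/> and the covered data, together with the additive property noted above. It is a minimal, non-normative example: the function names ones_sum, fold, dss_checksum, and checksum_ok are hypothetical, and the payload is assumed to fit within the 2-octet Data-Level Length.</t>
<figure align="center">
<artwork align="left"><![CDATA[
```python
import struct


def fold(acc: int) -> int:
    """Fold carries back into 16 bits (end-around carry)."""
    while acc >> 16:
        acc = (acc & 0xFFFF) + (acc >> 16)
    return acc


def ones_sum(data: bytes) -> int:
    """16-bit ones' complement sum; odd-length data is zero-padded."""
    if len(data) % 2:
        data += b"\x00"
    return fold(sum(word for (word,) in struct.iter_unpack("!H", data)))


def pseudo_header(dsn64: int, ssn: int, length: int) -> bytes:
    """8-octet DSN, 4-octet SSN, 2-octet length, 2 octets of zeros."""
    return struct.pack("!QIHH", dsn64, ssn, length, 0)


def dss_checksum(dsn64: int, ssn: int, payload: bytes) -> int:
    total = fold(ones_sum(pseudo_header(dsn64, ssn, len(payload)))
                 + ones_sum(payload))
    return ~total & 0xFFFF


def checksum_ok(dsn64: int, ssn: int, payload: bytes, csum: int) -> bool:
    total = fold(ones_sum(pseudo_header(dsn64, ssn, len(payload)))
                 + ones_sum(payload) + csum)
    return total == 0xFFFF


# Additivity: per-segment sums combine into the sum over the mapping,
# provided every segment except the last has an even length.
seg1, seg2 = b"abcd", b"efghi"
combined = fold(ones_sum(seg1) + ones_sum(seg2))
```
]]></artwork>
</figure>
<t>Here combined equals ones_sum(seg1 + seg2), which is why a sender that already has per-segment sums only needs to add the pseudo-header contribution.</t>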
<t>Note that checksumming relies on the TCP subflow containing contiguous data; therefore, a TCP subflow MUST NOT use the Urgent Pointer to interrupt an existing mapping. Further note, however, that if Urgent data is received on a subflow, it SHOULD be mapped to the data sequence space and delivered to the application analogous to Urgent data in regular TCP.</t>
<t>To avoid possible deadlock scenarios, subflow-level
processing should be undertaken separately from that at
connection level. Therefore, even if a mapping does not exist
from the subflow space to the data-level space, the data
SHOULD still be ACKed at the subflow (if it is in-window).
This data cannot, however, be acknowledged at the data level
(<xref target="sec_dataack"/>) because its data sequence
numbers are unknown. Implementations MAY hold onto such
unmapped data for a short while in the expectation that a
mapping will arrive shortly. Such unmapped data cannot be
counted as being within the connection-level receive window, because
that window is relative to the data sequence numbers; if the receiver
runs out of memory to hold this data, it will have to be discarded.
If a mapping for that subflow-level sequence space does not
arrive within a receive window of data, that subflow SHOULD be
treated as broken, closed with a RST, and any unmapped data
silently discarded.</t>
<t>Data sequence numbers are always 64-bit quantities, and
MUST be maintained as such in implementations. If a
connection is progressing at a slow rate, so protection
against wrapped sequence numbers is not required,
then an implementation MAY include just the lower 32
bits of the data sequence number in the data sequence mapping and/or
Data ACK as an optimization, and an implementation can make this choice
independently for each packet. An implementation MUST be able to receive
and process both 64-bit and 32-bit sequence number values, but it is not
required that an implementation be able to send both.</t>
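<t>One possible way for a receiver to recover the full 64-bit value from a 32-bit field is sketched below in Python. This is an assumption of the example rather than a procedure taken from the text above: it picks the 64-bit value whose lower 32 bits match the received field and that lies closest to a locally maintained 64-bit reference (for instance, the highest data sequence number seen so far). The names truncate_dsn and expand_dsn are hypothetical.</t>
<figure align="center">
<artwork align="left"><![CDATA[
```python
MOD32 = 1 << 32
MOD64 = 1 << 64


def truncate_dsn(dsn64: int) -> int:
    """Sender side: keep only the lower 32 bits."""
    return dsn64 & 0xFFFFFFFF


def expand_dsn(low32: int, reference64: int) -> int:
    """Receiver side: nearest 64-bit value to reference64 whose
    lower 32 bits equal low32 (assumed heuristic)."""
    candidate = reference64 - (reference64 % MOD32) + low32
    half = MOD32 // 2
    if candidate - reference64 > half:
        candidate -= MOD32   # field refers to the previous 2^32 block
    elif reference64 - candidate > half:
        candidate += MOD32   # field has wrapped into the next block
    return candidate % MOD64
```
]]></artwork>
</figure>
<t>The heuristic is only safe while fewer than 2^31 bytes separate the reference from the received value, which is consistent with restricting the 32-bit form to connections progressing at a slow rate.</t>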