summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)AuthorFilesLines
2013-06-29ip_tunnel: fix kernel panic with icmp_dest_unreachEric Dumazet2-2/+2
[ Upstream commit a622260254ee481747cceaaa8609985b29a31565 ] Daniel Petre reported crashes in icmp_dst_unreach() with following call graph: Daniel found a similar problem mentioned in http://lkml.indiana.edu/hypermail/linux/kernel/1007.0/00961.html And indeed this is the root cause : skb->cb[] contains data fooling IP stack. We must clear IPCB in ip_tunnel_xmit() sooner in case dst_link_failure() is called. Or else skb->cb[] might contain garbage from GSO segmentation layer. A similar fix was tested on linux-3.9, but gre code was refactored in linux-3.10. I'll send patches for stable kernels as well. Many thanks to Daniel for providing reports, patches and testing ! Reported-by: Daniel Petre <daniel.petre@rcs-rds.ro> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-06-29tcp: xps: fix reordering issuesEric Dumazet1-4/+6
[ Upstream commit 547669d483e5783d722772af1483fa474da7caf9 ] commit 3853b5841c01a ("xps: Improvements in TX queue selection") introduced ooo_okay flag, but the condition to set it is slightly wrong. In our traces, we have seen ACK packets being received out of order, and RST packets sent in response. We should test if we have any packets still in host queue. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-06-29tcp: fix tcp_md5_hash_skb_data()Eric Dumazet1-2/+5
[ Upstream commit 54d27fcb338bd9c42d1dfc5a39e18f6f9d373c2e ] TCP md5 communications fail [1] for some devices, because sg/crypto code assume page offsets are below PAGE_SIZE. This was discovered using mlx4 driver [2], but I suspect loopback might trigger the same bug now we use order-3 pages in tcp_sendmsg() [1] Failure is giving following messages. huh, entered softirq 3 NET_RX ffffffff806ad230 preempt_count 00000100, exited with 00000101? [2] mlx4 driver uses order-2 pages to allocate RX frags Reported-by: Matt Schnall <mischnal@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Bernhard Beck <bbeck@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-05-13net: drop dst before queueing fragmentsEric Dumazet1-4/+11
[ Upstream commit 97599dc792b45b1669c3cdb9a4b365aad0232f65 ] Commit 4a94445c9a5c (net: Use ip_route_input_noref() in input path) added a bug in IP defragmentation handling, as non refcounted dst could escape an RCU protected section. Commit 64f3b9e203bd068 (net: ip_expire() must revalidate route) fixed the case of timeouts, but not the general problem. Tom Parkin noticed crashes in UDP stack and provided a patch, but further analysis permitted us to pinpoint the root cause. Before queueing a packet into a frag list, we must drop its dst, as this dst has limited lifetime (RCU protected) When/if a packet is finally reassembled, we use the dst of the very last skb, still protected by RCU and valid, as the dst of the reassembled packet. Use same logic in IPv6, as there is no need to hold dst references. Reported-by: Tom Parkin <tparkin@katalix.com> Tested-by: Tom Parkin <tparkin@katalix.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-05-13tcp: call tcp_replace_ts_recent() from tcp_ack()Eric Dumazet1-33/+32
[ Upstream commit 12fb3dd9dc3c64ba7d64cec977cca9b5fb7b1d4e ] commit bd090dfc634d (tcp: tcp_replace_ts_recent() should not be called from tcp_validate_incoming()) introduced a TS ecr bug in slow path processing. 1 A > B P. 1:10001(10000) ack 1 <nop,nop,TS val 1001 ecr 200> 2 B < A . 1:1(0) ack 1 win 257 <sack 9001:10001,TS val 300 ecr 1001> 3 A > B . 1:1001(1000) ack 1 win 227 <nop,nop,TS val 1002 ecr 200> 4 A > B . 1001:2001(1000) ack 1 win 227 <nop,nop,TS val 1002 ecr 200> (ecr 200 should be ecr 300 in packets 3 & 4) Problem is tcp_ack() can trigger send of new packets (retransmits), reflecting the prior TSval, instead of the TSval contained in the currently processed incoming packet. Fix this by calling tcp_replace_ts_recent() from tcp_ack() after the checks, but before the actions. Reported-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-05-13esp4: fix error return code in esp_output()Wei Yongjun1-3/+3
[ Upstream commit 06848c10f720cbc20e3b784c0df24930b7304b93 ] Fix to return a negative error code from the error handling case instead of 0, as returned elsewhere in this function. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-05-13tcp: incoming connections might use wrong route under synfloodDmitry Popov1-2/+2
[ Upstream commit d66954a066158781ccf9c13c91d0316970fe57b6 ] There is a bug in cookie_v4_check (net/ipv4/syncookies.c): flowi4_init_output(&fl4, 0, sk->sk_mark, RT_CONN_FLAGS(sk), RT_SCOPE_UNIVERSE, IPPROTO_TCP, inet_sk_flowi_flags(sk), (opt && opt->srr) ? opt->faddr : ireq->rmt_addr, ireq->loc_addr, th->source, th->dest); Here we do not respect sk->sk_bound_dev_if, therefore wrong dst_entry may be taken. This dst_entry is used by new socket (get_cookie_sock -> tcp_v4_syn_recv_sock), so its packets may take the wrong path. Signed-off-by: Dmitry Popov <dp@highloadlab.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-04-10tcp: undo spurious timeout after SACK renegingYuchung Cheng1-5/+2
[ Upstream commit 7ebe183c6d444ef5587d803b64a1f4734b18c564 ] On SACK reneging the sender immediately retransmits and forces a timeout but disables Eifel (undo). If the (buggy) receiver does not drop any packet this can trigger a false slow-start retransmit storm driven by the ACKs of the original packets. This can be detected with undo and TCP timestamps. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-04-10tcp: preserve ACK clocking in TSOEric Dumazet1-2/+5
[ Upstream commit f4541d60a449afd40448b06496dcd510f505928e ] A long standing problem with TSO is the fact that tcp_tso_should_defer() rearms the deferred timer, while it should not. Current code leads to following bad bursty behavior : 20:11:24.484333 IP A > B: . 297161:316921(19760) ack 1 win 119 20:11:24.484337 IP B > A: . ack 263721 win 1117 20:11:24.485086 IP B > A: . ack 265241 win 1117 20:11:24.485925 IP B > A: . ack 266761 win 1117 20:11:24.486759 IP B > A: . ack 268281 win 1117 20:11:24.487594 IP B > A: . ack 269801 win 1117 20:11:24.488430 IP B > A: . ack 271321 win 1117 20:11:24.489267 IP B > A: . ack 272841 win 1117 20:11:24.490104 IP B > A: . ack 274361 win 1117 20:11:24.490939 IP B > A: . ack 275881 win 1117 20:11:24.491775 IP B > A: . ack 277401 win 1117 20:11:24.491784 IP A > B: . 316921:332881(15960) ack 1 win 119 20:11:24.492620 IP B > A: . ack 278921 win 1117 20:11:24.493448 IP B > A: . ack 280441 win 1117 20:11:24.494286 IP B > A: . ack 281961 win 1117 20:11:24.495122 IP B > A: . ack 283481 win 1117 20:11:24.495958 IP B > A: . ack 285001 win 1117 20:11:24.496791 IP B > A: . ack 286521 win 1117 20:11:24.497628 IP B > A: . ack 288041 win 1117 20:11:24.498459 IP B > A: . ack 289561 win 1117 20:11:24.499296 IP B > A: . ack 291081 win 1117 20:11:24.500133 IP B > A: . ack 292601 win 1117 20:11:24.500970 IP B > A: . ack 294121 win 1117 20:11:24.501388 IP B > A: . ack 295641 win 1117 20:11:24.501398 IP A > B: . 332881:351881(19000) ack 1 win 119 While the expected behavior is more like : 20:19:49.259620 IP A > B: . 197601:202161(4560) ack 1 win 119 20:19:49.260446 IP B > A: . ack 154281 win 1212 20:19:49.261282 IP B > A: . ack 155801 win 1212 20:19:49.262125 IP B > A: . ack 157321 win 1212 20:19:49.262136 IP A > B: . 202161:206721(4560) ack 1 win 119 20:19:49.262958 IP B > A: . ack 158841 win 1212 20:19:49.263795 IP B > A: . ack 160361 win 1212 20:19:49.264628 IP B > A: . ack 161881 win 1212 20:19:49.264637 IP A > B: . 206721:211281(4560) ack 1 win 119 20:19:49.265465 IP B > A: . ack 163401 win 1212 20:19:49.265886 IP B > A: . ack 164921 win 1212 20:19:49.266722 IP B > A: . ack 166441 win 1212 20:19:49.266732 IP A > B: . 211281:215841(4560) ack 1 win 119 20:19:49.267559 IP B > A: . ack 167961 win 1212 20:19:49.268394 IP B > A: . ack 169481 win 1212 20:19:49.269232 IP B > A: . ack 171001 win 1212 20:19:49.269241 IP A > B: . 215841:221161(5320) ack 1 win 119 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Van Jacobson <vanj@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-03-27inet: limit length of fragment queue hash table bucket listsHannes Frederic Sowa2-7/+25
[ Upstream commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055 ] This patch introduces a constant limit of the fragment queue hash table bucket list lengths. Currently the limit 128 is choosen somewhat arbitrary and just ensures that we can fill up the fragment cache with empty packets up to the default ip_frag_high_thresh limits. It should just protect from list iteration eating considerable amounts of cpu. If we reach the maximum length in one hash bucket a warning is printed. This is implemented on the caller side of inet_frag_find to distinguish between the different users of inet_fragment.c. I dropped the out of memory warning in the ipv4 fragment lookup path, because we already get a warning by the slab allocator. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jesper Dangaard Brouer <jbrouer@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-03-27tcp: fix skb_availroom()Eric Dumazet2-2/+1
[ Upstream commit 16fad69cfe4adbbfa813de516757b87bcae36d93 ] Chrome OS team reported a crash on a Pixel ChromeBook in TCP stack : https://code.google.com/p/chromium/issues/detail?id=182056 commit a21d45726acac (tcp: avoid order-1 allocations on wifi and tx path) did a poor choice adding an 'avail_size' field to skb, while what we really needed was a 'reserved_tailroom' one. It would have avoided commit 22b4a4f22da (tcp: fix retransmit of partially acked frames) and this commit. Crash occurs because skb_split() is not aware of the 'avail_size' management (and should not be aware) Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Mukesh Agrawal <quiche@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-03-27net/ipv4: Ensure that location of timestamp option is storedDavid Ward1-4/+1
[ Upstream commit 4660c7f498c07c43173142ea95145e9dac5a6d14 ] This is needed in order to detect if the timestamp option appears more than once in a packet, to remove the option if the packet is fragmented, etc. My previous change neglected to store the option location when the router addresses were prespecified and Pointer > Length. But now the option location is also stored when Flag is an unrecognized value, to ensure these option handling behaviors are still performed. Signed-off-by: David Ward <david.ward@ll.mit.edu> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-03-27tcp: fix double-counted receiver RTT when leaving receiver fast pathNeal Cardwell1-3/+3
[ Upstream commit aab2b4bf224ef8358d262f95b568b8ad0cecf0a0 ] We should not update ts_recent and call tcp_rcv_rtt_measure_ts() both before and after going to step5. That wastes CPU and double-counts the receiver-side RTT sample. Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-03-06ipv6: use a stronger hash for tcpEric Dumazet1-2/+7
[ Upstream commit 08dcdbf6a7b9d14c2302c5bd0c5390ddf122f664 ] It looks like its possible to open thousands of TCP IPv6 sessions on a server, all landing in a single slot of TCP hash table. Incoming packets have to lookup sockets in a very long list. We should hash all bits from foreign IPv6 addresses, using a salt and hash mix, not a simple XOR. inet6_ehashfn() can also separately use the ports, instead of xoring them. Reported-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-03-06ipv4: fix a bug in ping_err().Li Wei1-2/+2
[ Upstream commit b531ed61a2a2a77eeb2f7c88b49aa5ec7d9880d8 ] We should get 'type' and 'code' from the outer ICMP header. Signed-off-by: Li Wei <lw@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-02-20tcp: fix for zero packets_in_flight was too broadIlpo Järvinen1-2/+6
[ Upstream commit 6731d2095bd4aef18027c72ef845ab1087c3ba63 ] There are transients during normal FRTO procedure during which the packets_in_flight can go to zero between write_queue state updates and firing the resulting segments out. As FRTO processing occurs during that window the check must be more precise to not match "spuriously" :-). More specificly, e.g., when packets_in_flight is zero but FLAG_DATA_ACKED is true the problematic branch that set cwnd into zero would not be taken and new segments might be sent out later. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Tested-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-02-20tcp: frto should not set snd_cwnd to 0Eric Dumazet1-1/+2
[ Upstream commit 2e5f421211ff76c17130b4597bc06df4eeead24f ] Commit 9dc274151a548 (tcp: fix ABC in tcp_slow_start()) uncovered a bug in FRTO code : tcp_process_frto() is setting snd_cwnd to 0 if the number of in flight packets is 0. As Neal pointed out, if no packet is in flight we lost our chance to disambiguate whether a loss timeout was spurious. We should assume it was a proper loss. Reported-by: Pasi Kärkkäinen <pasik@iki.fi> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-02-20net: prevent setting ttl=0 via IP_TTLCong Wang1-1/+1
[ Upstream commit c9be4a5c49cf51cc70a993f004c5bb30067a65ce ] A regression is introduced by the following commit: commit 4d52cfbef6266092d535237ba5a4b981458ab171 Author: Eric Dumazet <eric.dumazet@gmail.com> Date: Tue Jun 2 00:42:16 2009 -0700 net: ipv4/ip_sockglue.c cleanups Pure cleanups but it is not a pure cleanup... - if (val != -1 && (val < 1 || val>255)) + if (val != -1 && (val < 0 || val > 255)) Since there is no reason provided to allow ttl=0, change it back. Reported-by: nitin padalia <padalia.nitin@gmail.com> Cc: nitin padalia <padalia.nitin@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-01-16tcp: RFC 5961 5.2 Blind Data Injection Attack MitigationEric Dumazet1-18/+25
[ Upstream commit 354e4aa391ed50a4d827ff6fc11e0667d0859b25 ] RFC 5961 5.2 [Blind Data Injection Attack].[Mitigation] All TCP stacks MAY implement the following mitigation. TCP stacks that implement this mitigation MUST add an additional input check to any incoming segment. The ACK value is considered acceptable only if it is in the range of ((SND.UNA - MAX.SND.WND) <= SEG.ACK <= SND.NXT). All incoming segments whose ACK value doesn't satisfy the above condition MUST be discarded and an ACK sent back. Move tcp_send_challenge_ack() before tcp_ack() to avoid a forward declaration. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Jerry Chu <hkchu@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-01-16tcp: tcp_replace_ts_recent() should not be called from tcp_validate_incoming()Eric Dumazet1-5/+10
[ Upstream commit bd090dfc634ddd711a5fbd0cadc6e0ab4977bcaf ] We added support for RFC 5961 in latest kernels but TCP fails to perform exhaustive check of ACK sequence. We can update our view of peer tsval from a frame that is later discarded by tcp_ack() This makes timestamps enabled sessions vulnerable to injection of a high tsval : peers start an ACK storm, since the victim sends a dupack each time it receives an ACK from the other peer. As tcp_validate_incoming() is called before tcp_ack(), we should not peform tcp_replace_ts_recent() from it, and let callers do it at the right time. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Cc: Romain Francoise <romain@orebokech.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-01-16tcp: refine SYN handling in tcp_validate_incomingEric Dumazet1-1/+5
[ Upstream commit e371589917011efe6ff8c7dfb4e9e81934ac5855 ] Followup of commit 0c24604b68fc (tcp: implement RFC 5961 4.2) As reported by Vijay Subramanian, we should send a challenge ACK instead of a dup ack if a SYN flag is set on a packet received out of window. This permits the ratelimiting to work as intended, and to increase correct SNMP counters. Suggested-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Vijay Subramanian <subramanian.vijay@gmail.com> Cc: Kiran Kumar Kella <kkiran@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-01-16tcp: implement RFC 5961 4.2Eric Dumazet2-18/+16
[ Upstream commit 0c24604b68fc7810d429d6c3657b6f148270e528 ] Implement the RFC 5691 mitigation against Blind Reset attack using SYN bit. Section 4.2 of RFC 5961 advises to send a Challenge ACK and drop incoming packet, instead of resetting the session. Add a new SNMP counter to count number of challenge acks sent in response to SYN packets. (netstat -s | grep TCPSYNChallenge) Remove obsolete TCPAbortOnSyn, since we no longer abort a TCP session because of a SYN flag. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kiran Kumar Kella <kkiran@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-01-16tcp: implement RFC 5961 3.2Eric Dumazet3-1/+38
[ Upstream commit 282f23c6ee343126156dd41218b22ece96d747e3 ] Implement the RFC 5691 mitigation against Blind Reset attack using RST bit. Idea is to validate incoming RST sequence, to match RCV.NXT value, instead of previouly accepted window : (RCV.NXT <= SEG.SEQ < RCV.NXT+RCV.WND) If sequence is in window but not an exact match, send a "challenge ACK", so that the other part can resend an RST with the appropriate sequence. Add a new sysctl, tcp_challenge_ack_limit, to limit number of challenge ACK sent per second. Add a new SNMP counter to count number of challenge acks sent. (netstat -s | grep TCPChallengeACK) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kiran Kumar Kella <kkiran@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-01-16inet: Fix kmemleak in tcp_v4/6_syn_recv_sock and dccp_v4/6_request_recv_sockChristoph Paasch2-3/+18
[ Upstream commit e337e24d6624e74a558aa69071e112a65f7b5758 ] If in either of the above functions inet_csk_route_child_sock() or __inet_inherit_port() fails, the newsk will not be freed: unreferenced object 0xffff88022e8a92c0 (size 1592): comm "softirq", pid 0, jiffies 4294946244 (age 726.160s) hex dump (first 32 bytes): 0a 01 01 01 0a 01 01 02 00 00 00 00 a7 cc 16 00 ................ 02 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff8153d190>] kmemleak_alloc+0x21/0x3e [<ffffffff810ab3e7>] kmem_cache_alloc+0xb5/0xc5 [<ffffffff8149b65b>] sk_prot_alloc.isra.53+0x2b/0xcd [<ffffffff8149b784>] sk_clone_lock+0x16/0x21e [<ffffffff814d711a>] inet_csk_clone_lock+0x10/0x7b [<ffffffff814ebbc3>] tcp_create_openreq_child+0x21/0x481 [<ffffffff814e8fa5>] tcp_v4_syn_recv_sock+0x3a/0x23b [<ffffffff814ec5ba>] tcp_check_req+0x29f/0x416 [<ffffffff814e8e10>] tcp_v4_do_rcv+0x161/0x2bc [<ffffffff814eb917>] tcp_v4_rcv+0x6c9/0x701 [<ffffffff814cea9f>] ip_local_deliver_finish+0x70/0xc4 [<ffffffff814cec20>] ip_local_deliver+0x4e/0x7f [<ffffffff814ce9f8>] ip_rcv_finish+0x1fc/0x233 [<ffffffff814cee68>] ip_rcv+0x217/0x267 [<ffffffff814a7bbe>] __netif_receive_skb+0x49e/0x553 [<ffffffff814a7cc3>] netif_receive_skb+0x50/0x82 This happens, because sk_clone_lock initializes sk_refcnt to 2, and thus a single sock_put() is not enough to free the memory. Additionally, things like xfrm, memcg, cookie_values,... may have been initialized. We have to free them properly. This is fixed by forcing a call to tcp_done(), ending up in inet_csk_destroy_sock, doing the final sock_put(). tcp_done() is necessary, because it ends up doing all the cleanup on xfrm, memcg, cookie_values, xfrm,... Before calling tcp_done, we have to set the socket to SOCK_DEAD, to force it entering inet_csk_destroy_sock. To avoid the warning in inet_csk_destroy_sock, inet_num has to be set to 0. As inet_csk_destroy_sock does a dec on orphan_count, we first have to increase it. Calling tcp_done() allows us to remove the calls to tcp_clear_xmit_timer() and tcp_cleanup_congestion_control(). A similar approach is taken for dccp by calling dccp_done(). This is in the kernel since 093d282321 (tproxy: fix hash locking issue when using port redirection in __inet_inherit_port()), thus since version >= 2.6.37. Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-01-03ipv4: ip_check_defrag must not modify skb before unsharingJohannes Berg1-10/+9
[ Upstream commit 1bf3751ec90cc3174e01f0d701e8449ce163d113 ] ip_check_defrag() might be called from af_packet within the RX path where shared SKBs are used, so it must not modify the input SKB before it has unshared it for defragmentation. Use skb_copy_bits() to get the IP header and only pull in everything later. The same is true for the other caller in macvlan as it is called from dev->rx_handler which can also get a shared SKB. Reported-by: Eric Leblond <eric@regit.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-12-06ipv4: avoid undefined behavior in do_ip_setsockopt()Xi Wang1-12/+21
[ Upstream commit 0c9f79be295c99ac7e4b569ca493d75fdcc19e4e ] (1<<optname) is undefined behavior in C with a negative optname or optname larger than 31. In those cases the result of the shift is not necessarily zero (e.g., on x86). This patch simplifies the code with a switch statement on optname. It also allows the compiler to generate better code (e.g., using a 64-bit mask). Signed-off-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-12-06netfilter: nf_nat: don't check for port change on ICMP tuplesUlrich Weber1-2/+4
commit 38fe36a248ec3228f8e6507955d7ceb0432d2000 upstream. ICMP tuples have id in src and type/code in dst. So comparing src.u.all with dst.u.all will always fail here and ip_xfrm_me_harder() is called for every ICMP packet, even if there was no NAT. Signed-off-by: Ulrich Weber <ulrich.weber@sophos.com> [Pablo Neira Ayuso: Backported to.3.0] Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-11-16net: fix divide by zero in tcp algorithm illinoisJesper Dangaard Brouer1-3/+5
[ Upstream commit 8f363b77ee4fbf7c3bbcf5ec2c5ca482d396d664 ] Reading TCP stats when using TCP Illinois congestion control algorithm can cause a divide by zero kernel oops. The division by zero occur in tcp_illinois_info() at: do_div(t, ca->cnt_rtt); where ca->cnt_rtt can become zero (when rtt_reset is called) Steps to Reproduce: 1. Register tcp_illinois: # sysctl -w net.ipv4.tcp_congestion_control=illinois 2. Monitor internal TCP information via command "ss -i" # watch -d ss -i 3. Establish new TCP conn to machine Either it fails at the initial conn, or else it needs to wait for a loss or a reset. This is only related to reading stats. The function avg_delay() also performs the same divide, but is guarded with a (ca->cnt_rtt > 0) at its calling point in update_params(). Thus, simply fix tcp_illinois_info(). Function tcp_illinois_info() / get_info() is called without socket lock. Thus, eliminate any race condition on ca->cnt_rtt by using a local stack variable. Simply reuse info.tcpv_rttcnt, as its already set to ca->cnt_rtt. Function avg_delay() is not affected by this race condition, as its called with the socket lock. Cc: Petr Matousek <pmatouse@redhat.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-11-16tcp: fix FIONREAD/SIOCINQEric Dumazet1-5/+3
[ Upstream commit a3374c42aa5f7237e87ff3b0622018636b0c847e ] tcp_ioctl() tries to take into account if tcp socket received a FIN to report correct number bytes in receive queue. But its flaky because if the application ate the last skb, we return 1 instead of 0. Correct way to detect that FIN was received is to test SOCK_DONE. Reported-by: Elliot Hughes <enh@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-10-31tcp: resets are misroutedAlexey Kuznetsov1-3/+4
[ Upstream commit 4c67525849e0b7f4bd4fab2487ec9e43ea52ef29 ] After commit e2446eaa ("tcp_v4_send_reset: binding oif to iif in no sock case").. tcp resets are always lost, when routing is asymmetric. Yes, backing out that patch will result in misrouting of resets for dead connections which used interface binding when were alive, but we actually cannot do anything here. What's died that's died and correct handling normal unbound connections is obviously a priority. Comment to comment: > This has few benefits: > 1. tcp_v6_send_reset already did that. It was done to route resets for IPv6 link local addresses. It was a mistake to do so for global addresses. The patch fixes this as well. Actually, the problem appears to be even more serious than guaranteed loss of resets. As reported by Sergey Soloviev <sol@eqv.ru>, those misrouted resets create a lot of arp traffic and huge amount of unresolved arp entires putting down to knees NAT firewalls which use asymmetric routing. Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-10-17netfilter: nf_nat_sip: fix via header translation with multiple parametersPatrick McHardy1-2/+3
commit f22eb25cf5b1157b29ef88c793b71972efc47143 upstream. Via-headers are parsed beginning at the first character after the Via-address. When the address is translated first and its length decreases, the offset to start parsing at is incorrect and header parameters might be missed. Update the offset after translating the Via-address to fix this. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-10-17netfilter: nf_nat_sip: fix incorrect handling of EBUSY for RTCP expectationPablo Neira Ayuso1-1/+4
commit 3f509c689a07a4aa989b426893d8491a7ffcc410 upstream. We're hitting bug while trying to reinsert an already existing expectation: kernel BUG at kernel/timer.c:895! invalid opcode: 0000 [#1] SMP [...] Call Trace: <IRQ> [<ffffffffa0069563>] nf_ct_expect_related_report+0x4a0/0x57a [nf_conntrack] [<ffffffff812d423a>] ? in4_pton+0x72/0x131 [<ffffffffa00ca69e>] ip_nat_sdp_media+0xeb/0x185 [nf_nat_sip] [<ffffffffa00b5b9b>] set_expected_rtp_rtcp+0x32d/0x39b [nf_conntrack_sip] [<ffffffffa00b5f15>] process_sdp+0x30c/0x3ec [nf_conntrack_sip] [<ffffffff8103f1eb>] ? irq_exit+0x9a/0x9c [<ffffffffa00ca738>] ? ip_nat_sdp_media+0x185/0x185 [nf_nat_sip] We have to remove the RTP expectation if the RTCP expectation hits EBUSY since we keep trying with other ports until we succeed. Reported-by: Rafal Fitt <rafalf@aplusc.com.pl> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-10-17netfilter: nf_ct_ipv4: packets with wrong ihl are invalidJozsef Kadlecsik1-0/+8
commit 07153c6ec074257ade76a461429b567cff2b3a1e upstream. It was reported that the Linux kernel sometimes logs: klogd: [2629147.402413] kernel BUG at net / netfilter / nf_conntrack_proto_tcp.c: 447! klogd: [1072212.887368] kernel BUG at net / netfilter / nf_conntrack_proto_tcp.c: 392 ipv4_get_l4proto() in nf_conntrack_l3proto_ipv4.c and tcp_error() in nf_conntrack_proto_tcp.c should catch malformed packets, so the errors at the indicated lines - TCP options parsing - should not happen. However, tcp_error() relies on the "dataoff" offset to the TCP header, calculated by ipv4_get_l4proto(). But ipv4_get_l4proto() does not check bogus ihl values in IPv4 packets, which then can slip through tcp_error() and get caught at the TCP options parsing routines. The patch fixes ipv4_get_l4proto() by invalidating packets with bogus ihl value. The patch closes netfilter bugzilla id 771. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-10-10ipv4: raw: fix icmp_filter()Eric Dumazet1-6/+8
[ Upstream commit ab43ed8b7490cb387782423ecf74aeee7237e591 ] icmp_filter() should not modify its input, or else its caller would need to recompute ip_hdr() if skb->head is reallocated. Use skb_header_pointer() instead of pskb_may_pull() and change the prototype to make clear both sk and skb are const. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-10-10tcp: flush DMA queue before sk_wait_data if rcv_wnd is zeroMichal Kubeček1-2/+8
[ Upstream commit 15c041759bfcd9ab0a4e43f1c16e2644977d0467 ] If recv() syscall is called for a TCP socket so that - IOAT DMA is used - MSG_WAITALL flag is used - requested length is bigger than sk_rcvbuf - enough data has already arrived to bring rcv_wnd to zero then when tcp_recvmsg() gets to calling sk_wait_data(), receive window can be still zero while sk_async_wait_queue exhausts enough space to keep it zero. As this queue isn't cleaned until the tcp_service_net_dma() call, sk_wait_data() cannot receive any data and blocks forever. If zero receive window and non-empty sk_async_wait_queue is detected before calling sk_wait_data(), process the queue first. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-09-19net: ipv4: ipmr_expire_timer causes crash when removing net namespaceFrancesco Ruggeri1-2/+12
[ Upstream commit acbb219d5f53821b2d0080d047800410c0420ea1 ] When tearing down a net namespace, ipv4 mr_table structures are freed without first deactivating their timers. This can result in a crash in run_timer_softirq. This patch mimics the corresponding behaviour in ipv6. Locking and synchronization seem to be adequate. We are about to kfree mrt, so existing code should already make sure that no other references to mrt are pending or can be created by incoming traffic. The functions invoked here do not cause new references to mrt or other race conditions to be created. Invoking del_timer_sync guarantees that ipmr_expire_timer is inactive. Both ipmr_expire_process (whose completion we may have to wait in del_timer_sync) and mroute_clean_tables internally use mfc_unres_lock or other synchronizations when needed, and they both only modify mrt. Tested in Linux 3.4.8. Signed-off-by: Francesco Ruggeri <fruggeri@aristanetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-09-19tcp: Apply device TSO segment limit earlierBen Hutchings3-11/+17
[ Upstream commit 1485348d2424e1131ea42efc033cbd9366462b01 ] Cache the device gso_max_segs in sock::sk_gso_max_segs and use it to limit the size of TSO skbs. This avoids the need to fall back to software GSO for local TCP senders. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-19tcp: perform DMA to userspace only if there is a task waiting for itJiri Kosina1-1/+3
[ Upstream commit 59ea33a68a9083ac98515e4861c00e71efdc49a1 ] Back in 2006, commit 1a2449a87b ("[I/OAT]: TCP recv offload to I/OAT") added support for receive offloading to IOAT dma engine if available. The code in tcp_rcv_established() tries to perform early DMA copy if applicable. It however does so without checking whether the userspace task is actually expecting the data in the buffer. This is not a problem under normal circumstances, but there is a corner case where this doesn't work -- and that's when MSG_TRUNC flag to recvmsg() is used. If the IOAT dma engine is not used, the code properly checks whether there is a valid ucopy.task and the socket is owned by userspace, but misses the check in the dmaengine case. This problem can be observed in real trivially -- for example 'tbench' is a good reproducer, as it makes a heavy use of MSG_TRUNC. On systems utilizing IOAT, you will soon find tbench waiting indefinitely in sk_wait_data(), as they have been already early-copied in tcp_rcv_established() using dma engine. This patch introduces the same check we are performing in the simple iovec copy case to the IOAT case as well. It fixes the indefinite recvmsg(MSG_TRUNC) hangs. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-19tcp: Add TCP_USER_TIMEOUT negative value checkHangbin Liu1-1/+4
[ Upstream commit 42493570100b91ef663c4c6f0c0fdab238f9d3c2 ] TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int. But patch "tcp: Add TCP_USER_TIMEOUT socket option"(dca43c75) didn't check the negative values. If a user assign -1 to it, the socket will set successfully and wait for 4294967295 miliseconds. This patch add a negative value check to avoid this issue. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-19cipso: don't follow a NULL pointer when setsockopt() is calledPaul Moore1-2/+4
[ Upstream commit 89d7ae34cdda4195809a5a987f697a517a2a3177 ] As reported by Alan Cox, and verified by Lin Ming, when a user attempts to add a CIPSO option to a socket using the CIPSO_V4_TAG_LOCAL tag the kernel dies a terrible death when it attempts to follow a NULL pointer (the skb argument to cipso_v4_validate() is NULL when called via the setsockopt() syscall). This patch fixes this by first checking to ensure that the skb is non-NULL before using it to find the incoming network interface. In the unlikely case where the skb is NULL and the user attempts to add a CIPSO option with the _TAG_LOCAL tag we return an error as this is not something we want to allow. A simple reproducer, kindly supplied by Lin Ming, although you must have the CIPSO DOI #3 configure on the system first or you will be caught early in cipso_v4_validate(): #include <sys/types.h> #include <sys/socket.h> #include <linux/ip.h> #include <linux/in.h> #include <string.h> struct local_tag { char type; char length; char info[4]; }; struct cipso { char type; char length; char doi[4]; struct local_tag local; }; int main(int argc, char **argv) { int sockfd; struct cipso cipso = { .type = IPOPT_CIPSO, .length = sizeof(struct cipso), .local = { .type = 128, .length = sizeof(struct local_tag), }, }; memset(cipso.doi, 0, 4); cipso.doi[3] = 3; sockfd = socket(AF_INET, SOCK_DGRAM, 0); #define SOL_IP 0 setsockopt(sockfd, SOL_IP, IP_OPTIONS, &cipso, sizeof(struct cipso)); return 0; } CC: Lin Ming <mlin@ss.pku.edu.cn> Reported-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25tcp: drop SYN+FIN messagesEric Dumazet1-0/+2
commit fdf5af0daf8019cec2396cdef8fb042d80fe71fa upstream. Denys Fedoryshchenko reported that SYN+FIN attacks were bringing his linux machines to their limits. Dont call conn_request() if the TCP flags includes SYN flag Reported-by: Denys Fedoryshchenko <denys@visp.net.lb> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-06-10xfrm: take net hdr len into account for esp payload size calculationBenjamin Poirier1-15/+9
[ Upstream commit 91657eafb64b4cb53ec3a2fbc4afc3497f735788 ] Corrects the function that determines the esp payload size. The calculations done in esp{4,6}_get_mtu() lead to overlength frames in transport mode for certain mtu values and suboptimal frames for others. According to what is done, mainly in esp{,6}_output() and tcp_mtu_to_mss(), net_header_len must be taken into account before doing the alignment calculation. Signed-off-by: Benjamin Poirier <bpoirier@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-06-10ipv4: fix the rcu race between free_fib_info and ip_route_output_slowYanmin Zhang1-6/+6
[ Upstream commit e49cc0da7283088c5e03d475ffe2fdcb24a6d5b1 ] We hit a kernel OOPS. <3>[23898.789643] BUG: sleeping function called from invalid context at /data/buildbot/workdir/ics/hardware/intel/linux-2.6/arch/x86/mm/fault.c:1103 <3>[23898.862215] in_atomic(): 0, irqs_disabled(): 0, pid: 10526, name: Thread-6683 <4>[23898.967805] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me to suspend... <4>[23899.258526] Pid: 10526, comm: Thread-6683 Tainted: G W 3.0.8-137685-ge7742f9 #1 <4>[23899.357404] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me to suspend... <4>[23899.904225] Call Trace: <4>[23899.989209] [<c1227f50>] ? pgtable_bad+0x130/0x130 <4>[23900.000416] [<c1238c2a>] __might_sleep+0x10a/0x110 <4>[23900.007357] [<c1228021>] do_page_fault+0xd1/0x3c0 <4>[23900.013764] [<c18e9ba9>] ? restore_all+0xf/0xf <4>[23900.024024] [<c17c007b>] ? napi_complete+0x8b/0x690 <4>[23900.029297] [<c1227f50>] ? pgtable_bad+0x130/0x130 <4>[23900.123739] [<c1227f50>] ? pgtable_bad+0x130/0x130 <4>[23900.128955] [<c18ea0c3>] error_code+0x5f/0x64 <4>[23900.133466] [<c1227f50>] ? pgtable_bad+0x130/0x130 <4>[23900.138450] [<c17f6298>] ? __ip_route_output_key+0x698/0x7c0 <4>[23900.144312] [<c17f5f8d>] ? __ip_route_output_key+0x38d/0x7c0 <4>[23900.150730] [<c17f63df>] ip_route_output_flow+0x1f/0x60 <4>[23900.156261] [<c181de58>] ip4_datagram_connect+0x188/0x2b0 <4>[23900.161960] [<c18e981f>] ? _raw_spin_unlock_bh+0x1f/0x30 <4>[23900.167834] [<c18298d6>] inet_dgram_connect+0x36/0x80 <4>[23900.173224] [<c14f9e88>] ? _copy_from_user+0x48/0x140 <4>[23900.178817] [<c17ab9da>] sys_connect+0x9a/0xd0 <4>[23900.183538] [<c132e93c>] ? alloc_file+0xdc/0x240 <4>[23900.189111] [<c123925d>] ? sub_preempt_count+0x3d/0x50 Function free_fib_info resets nexthop_nh->nh_dev to NULL before releasing fi. Other cpu might be accessing fi. Fixing it by delaying the releasing. With the patch, we ran MTBF testing on Android mobile for 12 hours and didn't trigger the issue. Thank Eric for very detailed review/checking the issue. Signed-off-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> Signed-off-by: Kun Jiang <kunx.jiang@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-06-10ipv4: Do not use dead fib_info entries.David S. Miller1-0/+2
[ Upstream commit dccd9ecc374462e5d6a5b8f8110415a86c2213d8 ] Due to RCU lookups and RCU based release, fib_info objects can be found during lookup which have fi->fib_dead set. We must ignore these entries, otherwise we risk dereferencing the parts of the entry which are being torn down. Reported-by: Yevgen Pronenko <yevgen.pronenko@sonymobile.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-21tcp: do_tcp_sendpages() must try to push data out on oom conditionsWilly Tarreau1-2/+1
commit bad115cfe5b509043b684d3a007ab54b80090aa1 upstream. Since recent changes on TCP splicing (starting with commits 2f533844 "tcp: allow splice() to build full TSO packets" and 35f9c09f "tcp: tcp_sendpages() should call tcp_push() once"), I started seeing massive stalls when forwarding traffic between two sockets using splice() when pipe buffers were larger than socket buffers. Latest changes (net: netdev_alloc_skb() use build_skb()) made the problem even more apparent. The reason seems to be that if do_tcp_sendpages() fails on out of memory condition without being able to send at least one byte, tcp_push() is not called and the buffers cannot be flushed. After applying the attached patch, I cannot reproduce the stalls at all and the data rate it perfectly stable and steady under any condition which previously caused the problem to be permanent. The issue seems to have been there since before the kernel migrated to git, which makes me think that the stalls I occasionally experienced with tux during stress-tests years ago were probably related to the same issue. This issue was first encountered on 3.0.31 and 3.2.17, so please backport to -stable. Signed-off-by: Willy Tarreau <w@1wt.eu> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-21tcp: change tcp_adv_win_scale and tcp_rmem[2]Eric Dumazet2-5/+6
[ Upstream commit b49960a05e32121d29316cfdf653894b88ac9190 ] tcp_adv_win_scale default value is 2, meaning we expect a good citizen skb to have skb->len / skb->truesize ratio of 75% (3/4) In 2.6 kernels we (mis)accounted for typical MSS=1460 frame : 1536 + 64 + 256 = 1856 'estimated truesize', and 1856 * 3/4 = 1392. So these skbs were considered as not bloated. With recent truesize fixes, a typical MSS=1460 frame truesize is now the more precise : 2048 + 256 = 2304. But 2304 * 3/4 = 1728. So these skb are not good citizen anymore, because 1460 < 1728 (GRO can escape this problem because it build skbs with a too low truesize.) This also means tcp advertises a too optimistic window for a given allocated rcvspace : When receiving frames, sk_rmem_alloc can hit sk_rcvbuf limit and we call tcp_prune_queue()/tcp_collapse() too often, especially when application is slow to drain its receive queue or in case of losses (netperf is fast, scp is slow). This is a major latency source. We should adjust the len/truesize ratio to 50% instead of 75% This patch : 1) changes tcp_adv_win_scale default to 1 instead of 2 2) increase tcp_rmem[2] limit from 4MB to 6MB to take into account better truesize tracking and to allow autotuning tcp receive window to reach same value than before. Note that same amount of kernel memory is consumed compared to 2.6 kernels. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-21tcp: fix infinite cwnd in tcp_complete_cwr()Yuchung Cheng1-3/+6
[ Upstream commit 1cebce36d660c83bd1353e41f3e66abd4686f215 ] When the cwnd reduction is done, ssthresh may be infinite if TCP enters CWR via ECN or F-RTO. If cwnd is not undone, i.e., undo_marker is set, tcp_complete_cwr() falsely set cwnd to the infinite ssthresh value. The correct operation is to keep cwnd intact because it has been updated in ECN or F-RTO. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-11tcp: fix tcp_grow_window() for large incoming framesEric Dumazet1-0/+1
[ Upstream commit 4d846f02392a710f9604892ac3329e628e60a230 ] tcp_grow_window() has to grow rcv_ssthresh up to window_clamp, allowing sender to increase its window. tcp_grow_window() still assumes a tcp frame is under MSS, but its no longer true with LRO/GRO. This patch fixes one of the performance issue we noticed with GRO on. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-11tcp: avoid order-1 allocations on wifi and tx pathEric Dumazet2-5/+6
[ This combines upstream commit a21d45726acacc963d8baddf74607d9b74e2b723 and the follow-on bug fix commit 22b4a4f22da4b39c6f7f679fd35f3d35c91bf851 ] Marc Merlin reported many order-1 allocations failures in TX path on its wireless setup, that dont make any sense with MTU=1500 network, and non SG capable hardware. After investigation, it turns out TCP uses sk_stream_alloc_skb() and used as a convention skb_tailroom(skb) to know how many bytes of data payload could be put in this skb (for non SG capable devices) Note : these skb used kmalloc-4096 (MTU=1500 + MAX_HEADER + sizeof(struct skb_shared_info) being above 2048) Later, mac80211 layer need to add some bytes at the tail of skb (IEEE80211_ENCRYPT_TAILROOM = 18 bytes) and since no more tailroom is available has to call pskb_expand_head() and request order-1 allocations. This patch changes sk_stream_alloc_skb() so that only sk->sk_prot->max_header bytes of headroom are reserved, and use a new skb field, avail_size to hold the data payload limit. This way, order-0 allocations done by TCP stack can leave more than 2 KB of tailroom and no more allocation is performed in mac80211 layer (or any layer needing some tailroom) avail_size is unioned with mark/dropcount, since mark will be set later in IP stack for output packets. Therefore, skb size is unchanged. Reported-by: Marc MERLIN <marc@merlins.org> Tested-by: Marc MERLIN <marc@merlins.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> [bwh: Correct commit hash for follow-on bug fix] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-11tcp: fix tcp_trim_head()Eric Dumazet1-5/+8
[ Upstream commit 4fa48bf3c75069d636fc8830743c929a062e80dc ] commit f07d960df3 (tcp: avoid frag allocation for small frames) breaked assumption in tcp stack that skb is either linear (skb->data_len == 0), or fully fragged (skb->data_len == skb->len) tcp_trim_head() made this assumption, we must fix it. Thanks to Vijay for providing a very detailed explanation. Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>