commit cbbdd64c412590666c1817c0754a38b2975ea075 Author: Jiri Slaby Date: Tue Sep 15 16:19:31 2015 +0200 Linux 3.12.48 commit a7775d15b11a277f8af0dc4df69ae420b266e3dd Author: Pablo Neira Ayuso Date: Fri Sep 11 14:26:08 2015 +0200 netfilter: nf_conntrack: don't release a conntrack with non-zero refcnt [ Upstream commit e53376bef2cd97d3e3f61fdc677fb8da7d03d0da ] With this patch, the conntrack refcount is initially set to zero and it is bumped once it is added to any of the list, so we fulfill Eric's golden rule which is that all released objects always have a refcount that equals zero. Andrey Vagin reports that nf_conntrack_free can't be called for a conntrack with non-zero ref-counter, because it can race with nf_conntrack_find_get(). A conntrack slab is created with SLAB_DESTROY_BY_RCU. Non-zero ref-counter says that this conntrack is used. So when we release a conntrack with non-zero counter, we break this assumption. CPU1 CPU2 ____nf_conntrack_find() nf_ct_put() destroy_conntrack() ... init_conntrack __nf_conntrack_alloc (set use = 1) atomic_inc_not_zero(&ct->use) (use = 2) if (!l4proto->new(ct, skb, dataoff, timeouts)) nf_conntrack_free(ct); (use = 2 !!!) ... __nf_conntrack_alloc (set use = 1) if (!nf_ct_key_equal(h, tuple, zone)) nf_ct_put(ct); (use = 0) destroy_conntrack() /* continue to work with CT */ After applying the path "[PATCH] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get" another bug was triggered in destroy_conntrack(): <4>[67096.759334] ------------[ cut here ]------------ <2>[67096.759353] kernel BUG at net/netfilter/nf_conntrack_core.c:211! ... <4>[67096.759837] Pid: 498649, comm: atdd veid: 666 Tainted: G C --------------- 2.6.32-042stab084.18 #1 042stab084_18 /DQ45CB <4>[67096.759932] RIP: 0010:[] [] destroy_conntrack+0x15c/0x190 [nf_conntrack] <4>[67096.760255] Call Trace: <4>[67096.760255] [] nf_conntrack_destroy+0x17/0x30 <4>[67096.760255] [] nf_conntrack_find_get+0x85/0x130 [nf_conntrack] <4>[67096.760255] [] nf_conntrack_in+0x352/0xb60 [nf_conntrack] <4>[67096.760255] [] ipv4_conntrack_local+0x51/0x60 [nf_conntrack_ipv4] <4>[67096.760255] [] nf_iterate+0x69/0xb0 <4>[67096.760255] [] ? dst_output+0x0/0x20 <4>[67096.760255] [] nf_hook_slow+0x74/0x110 <4>[67096.760255] [] ? dst_output+0x0/0x20 <4>[67096.760255] [] raw_sendmsg+0x775/0x910 <4>[67096.760255] [] ? flush_tlb_others_ipi+0x128/0x130 <4>[67096.760255] [] ? apic_timer_interrupt+0xe/0x20 <4>[67096.760255] [] ? apic_timer_interrupt+0xe/0x20 <4>[67096.760255] [] inet_sendmsg+0x4a/0xb0 <4>[67096.760255] [] ? sock_sendmsg+0x13/0x140 <4>[67096.760255] [] sock_sendmsg+0x117/0x140 <4>[67096.760255] [] ? native_smp_send_reschedule+0x49/0x60 <4>[67096.760255] [] ? _spin_unlock_bh+0x1b/0x20 <4>[67096.760255] [] ? autoremove_wake_function+0x0/0x40 <4>[67096.760255] [] ? do_ip_setsockopt+0x90/0xd80 <4>[67096.760255] [] ? apic_timer_interrupt+0xe/0x20 <4>[67096.760255] [] ? apic_timer_interrupt+0xe/0x20 <4>[67096.760255] [] sys_sendto+0x139/0x190 <4>[67096.760255] [] ? audit_syscall_entry+0x1d7/0x200 <4>[67096.760255] [] ? __audit_syscall_exit+0x265/0x290 <4>[67096.760255] [] compat_sys_socketcall+0x13f/0x210 <4>[67096.760255] [] ia32_sysret+0x0/0x5 I have reused the original title for the RFC patch that Andrey posted and most of the original patch description. Cc: Eric Dumazet Cc: Andrew Vagin Cc: Florian Westphal Reported-by: Andrew Vagin Signed-off-by: Pablo Neira Ayuso Reviewed-by: Eric Dumazet Acked-by: Andrew Vagin Signed-off-by: Jiri Slaby commit 336323b7bd2cd60417a56703666d4b05c85e148a Author: Andrey Vagin Date: Fri Sep 11 14:26:07 2015 +0200 netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get [ Upstream commit c6825c0976fa7893692e0e43b09740b419b23c09 ] Lets look at destroy_conntrack: hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); ... nf_conntrack_free(ct) kmem_cache_free(net->ct.nf_conntrack_cachep, ct); net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU. The hash is protected by rcu, so readers look up conntracks without locks. A conntrack is removed from the hash, but in this moment a few readers still can use the conntrack. Then this conntrack is released and another thread creates conntrack with the same address and the equal tuple. After this a reader starts to validate the conntrack: * It's not dying, because a new conntrack was created * nf_ct_tuple_equal() returns true. But this conntrack is not initialized yet, so it can not be used by two threads concurrently. In this case BUG_ON may be triggered from nf_nat_setup_info(). Florian Westphal suggested to check the confirm bit too. I think it's right. task 1 task 2 task 3 nf_conntrack_find_get ____nf_conntrack_find destroy_conntrack hlist_nulls_del_rcu nf_conntrack_free kmem_cache_free __nf_conntrack_alloc kmem_cache_alloc memset(&ct->tuplehash[IP_CT_DIR_MAX], if (nf_ct_is_dying(ct)) if (!nf_ct_tuple_equal() I'm not sure, that I have ever seen this race condition in a real life. Currently we are investigating a bug, which is reproduced on a few nodes. In our case one conntrack is initialized from a few tasks concurrently, we don't have any other explanation for this. <2>[46267.083061] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:322! ... <4>[46267.083951] RIP: 0010:[] [] nf_nat_setup_info+0x564/0x590 [nf_nat] ... <4>[46267.085549] Call Trace: <4>[46267.085622] [] alloc_null_binding+0x5b/0xa0 [iptable_nat] <4>[46267.085697] [] nf_nat_rule_find+0x5c/0x80 [iptable_nat] <4>[46267.085770] [] nf_nat_fn+0x111/0x260 [iptable_nat] <4>[46267.085843] [] nf_nat_out+0x48/0xd0 [iptable_nat] <4>[46267.085919] [] nf_iterate+0x69/0xb0 <4>[46267.085991] [] ? ip_finish_output+0x0/0x2f0 <4>[46267.086063] [] nf_hook_slow+0x74/0x110 <4>[46267.086133] [] ? ip_finish_output+0x0/0x2f0 <4>[46267.086207] [] ? dst_output+0x0/0x20 <4>[46267.086277] [] ip_output+0xa4/0xc0 <4>[46267.086346] [] raw_sendmsg+0x8b4/0x910 <4>[46267.086419] [] inet_sendmsg+0x4a/0xb0 <4>[46267.086491] [] ? sock_update_classid+0x3a/0x50 <4>[46267.086562] [] sock_sendmsg+0x117/0x140 <4>[46267.086638] [] ? _spin_unlock_bh+0x1b/0x20 <4>[46267.086712] [] ? autoremove_wake_function+0x0/0x40 <4>[46267.086785] [] ? do_ip_setsockopt+0x90/0xd80 <4>[46267.086858] [] ? call_function_interrupt+0xe/0x20 <4>[46267.086936] [] ? ub_slab_ptr+0x20/0x90 <4>[46267.087006] [] ? ub_slab_ptr+0x20/0x90 <4>[46267.087081] [] ? kmem_cache_alloc+0xd8/0x1e0 <4>[46267.087151] [] sys_sendto+0x139/0x190 <4>[46267.087229] [] ? sock_setsockopt+0x16d/0x6f0 <4>[46267.087303] [] ? audit_syscall_entry+0x1d7/0x200 <4>[46267.087378] [] ? __audit_syscall_exit+0x265/0x290 <4>[46267.087454] [] ? compat_sys_setsockopt+0x75/0x210 <4>[46267.087531] [] compat_sys_socketcall+0x13f/0x210 <4>[46267.087607] [] ia32_sysret+0x0/0x5 <4>[46267.087676] Code: 91 20 e2 01 75 29 48 89 de 4c 89 f7 e8 56 fa ff ff 85 c0 0f 84 68 fc ff ff 0f b6 4d c6 41 8b 45 00 e9 4d fb ff ff e8 7c 19 e9 e0 <0f> 0b eb fe f6 05 17 91 20 e2 80 74 ce 80 3d 5f 2e 00 00 00 74 <1>[46267.088023] RIP [] nf_nat_setup_info+0x564/0x590 Cc: Eric Dumazet Cc: Florian Westphal Cc: Pablo Neira Ayuso Cc: Patrick McHardy Cc: Jozsef Kadlecsik Cc: "David S. Miller" Cc: Cyrill Gorcunov Signed-off-by: Andrey Vagin Acked-by: Eric Dumazet Signed-off-by: Pablo Neira Ayuso Signed-off-by: Jiri Slaby commit 09c59fc80e03795406c648c28dba4aa1a365bc0e Author: Benjamin LaHaise Date: Sun Aug 24 13:14:05 2014 -0400 aio: fix reqs_available handling commit d856f32a86b2b015ab180ab7a55e455ed8d3ccc5 upstream. As reported by Dan Aloni, commit f8567a3845ac ("aio: fix aio request leak when events are reaped by userspace") introduces a regression when user code attempts to perform io_submit() with more events than are available in the ring buffer. Reverting that commit would reintroduce a regression when user space event reaping is used. Fixing this bug is a bit more involved than the previous attempts to fix this regression. Since we do not have a single point at which we can count events as being reaped by user space and io_getevents(), we have to track event completion by looking at the number of events left in the event ring. So long as there are as many events in the ring buffer as there have been completion events generate, we cannot call put_reqs_available(). The code to check for this is now placed in refill_reqs_available(). A test program from Dan and modified by me for verifying this bug is available at http://www.kvack.org/~bcrl/20140824-aio_bug.c . Reported-by: Dan Aloni Signed-off-by: Benjamin LaHaise Acked-by: Dan Aloni Cc: Kent Overstreet Cc: Mateusz Guzik Cc: Petr Matousek Signed-off-by: Linus Torvalds Signed-off-by: Jiri Slaby commit ca6292e4e78f9448c62509f35ace5bd1f092b330 Author: Heinz Mauelshagen Date: Fri Feb 28 12:02:56 2014 -0500 dm cache mq: fix memory allocation failure for large cache devices commit 14f398ca2f26a2ed6236aec54395e0fa06ec8a82 upstream. The memory allocated for the multiqueue policy's hash table doesn't need to be physically contiguous. Use vzalloc() instead of kzalloc(). Fedora has been carrying this fix since 10/10/2013. Failure seen during creation of a 10TB cached device with a 2048 sector block size and 411GB cache size: dmsetup: page allocation failure: order:9, mode:0x10c0d0 CPU: 11 PID: 29235 Comm: dmsetup Not tainted 3.10.4 #3 Hardware name: Supermicro X8DTL/X8DTL, BIOS 2.1a 12/30/2011 000000000010c0d0 ffff880090941898 ffffffff81387ab4 ffff880090941928 ffffffff810bb26f 0000000000000009 000000000010c0d0 ffff880090941928 ffffffff81385dbc ffffffff815f3840 ffffffff00000000 000002000010c0d0 Call Trace: [] dump_stack+0x19/0x1b [] warn_alloc_failed+0x110/0x124 [] ? __alloc_pages_direct_compact+0x17c/0x18e [] __alloc_pages_nodemask+0x6c7/0x75e [] __get_free_pages+0x12/0x3f [] kmalloc_order_trace+0x29/0x88 [] __kmalloc+0x36/0x11b [] ? mq_create+0x1dc/0x2cf [dm_cache_mq] [] mq_create+0x2af/0x2cf [dm_cache_mq] [] dm_cache_policy_create+0xa7/0xd2 [dm_cache] [] ? cache_ctr+0x245/0xa13 [dm_cache] [] cache_ctr+0x353/0xa13 [dm_cache] [] dm_table_add_target+0x227/0x2ce [dm_mod] [] table_load+0x286/0x2ac [dm_mod] [] ? dev_wait+0x8a/0x8a [dm_mod] [] ctl_ioctl+0x39a/0x3c2 [dm_mod] [] dm_ctl_ioctl+0xe/0x12 [dm_mod] [] vfs_ioctl+0x21/0x34 [] do_vfs_ioctl+0x3b1/0x3f4 [] ? ____fput+0x9/0xb [] ? task_work_run+0x7e/0x92 [] SyS_ioctl+0x52/0x82 [] system_call_fastpath+0x16/0x1b Signed-off-by: Heinz Mauelshagen Signed-off-by: Mike Snitzer Signed-off-by: Jiri Slaby commit 52c312106a2268b7c2a93464373d5ca60d3566fd Author: Akinobu Mita Date: Mon Nov 18 22:11:42 2013 +0900 bio: fix argument of __bio_add_page() for max_sectors > 0xffff commit 34f2fd8dfe6185b0eaaf7d661281713a6170b077 upstream. The data type of max_sectors and max_hw_sectors in queue settings are unsigned int. But these values are passed to __bio_add_page() as an argument whose data type is unsigned short. In the worst case such as max_sectors is 0x10000, bio_add_page() can't add a page and IOs can't proceed. Cc: Jens Axboe Cc: Alexander Viro Signed-off-by: Akinobu Mita Signed-off-by: Jens Axboe Signed-off-by: Jiri Slaby commit 62d80e38ed89ca1e7dc9424c7c4438d46be924d9 Author: James Smart Date: Fri May 22 10:42:39 2015 -0400 lpfc: Fix scsi prep dma buf error. commit 5116fbf136ea21b8678a85eee5c03508736ada9f upstream. Didn't check for less-than-or-equal zero. Means we may later call scsi_dma_unmap() even though we don't have valid mappings. Signed-off-by: Dick Kennedy Signed-off-by: James Smart Reviewed-by: Hannes Reinecke Signed-off-by: James Bottomley Signed-off-by: Jiri Slaby commit a67172a013953664b1dad03c648200c70b90506c Author: Shirish Pargaonkar Date: Sat Oct 12 10:06:03 2013 -0500 cifs: Send a logoff request before removing a smb session commit 7f48558e6489d032b1584b0cc9ac4bb11072c034 upstream. Send a smb session logoff request before removing smb session off of the list. On a signed smb session, remvoing a session off of the list before sending a logoff request results in server returning an error for lack of smb signature. Never seen an error during smb logoff, so as per MS-SMB2 3.2.5.1, not sure how an error during logoff should be retried. So for now, if a server returns an error to a logoff request, log the error and remove the session off of the list. Signed-off-by: Shirish Pargaonkar Reviewed-by: Jeff Layton Signed-off-by: Steve French Signed-off-by: Jiri Slaby commit 6f767a3106227c491d3b27c43887b92c0f9287f8 Author: David Milburn Date: Thu May 23 16:23:45 2013 -0500 mtip32xx: dynamically allocate buffer in debugfs functions commit c8afd0dcbd14e2352258f2e2d359b36d0edd459f upstream. Dynamically allocate buf to prevent warnings: drivers/block/mtip32xx/mtip32xx.c: In function ‘mtip_hw_read_device_status’: drivers/block/mtip32xx/mtip32xx.c:2823: warning: the frame size of 1056 bytes is larger than 1024 bytes drivers/block/mtip32xx/mtip32xx.c: In function ‘mtip_hw_read_registers’: drivers/block/mtip32xx/mtip32xx.c:2894: warning: the frame size of 1056 bytes is larger than 1024 bytes drivers/block/mtip32xx/mtip32xx.c: In function ‘mtip_hw_read_flags’: drivers/block/mtip32xx/mtip32xx.c:2917: warning: the frame size of 1056 bytes is larger than 1024 bytes Signed-off-by: David Milburn Acked-by: Asai Thambi S P Signed-off-by: Jens Axboe Signed-off-by: Jiri Slaby commit e19d13637d27df69e2e6ce031d7776f32705ebcd Author: Dan Carpenter Date: Sat Aug 1 15:33:26 2015 +0300 rds: fix an integer overflow test in rds_info_getsockopt() [ Upstream commit 468b732b6f76b138c0926eadf38ac88467dcd271 ] "len" is a signed integer. We check that len is not negative, so it goes from zero to INT_MAX. PAGE_SIZE is unsigned long so the comparison is type promoted to unsigned long. ULONG_MAX - 4095 is a higher than INT_MAX so the condition can never be true. I don't know if this is harmful but it seems safe to limit "len" to INT_MAX - 4095. Fixes: a8c879a7ee98 ('RDS: Info and stats') Signed-off-by: Dan Carpenter Signed-off-by: David S. Miller Signed-off-by: Jiri Slaby commit 80bcc280c1f44715eb79a748d3d61575ea969335 Author: Jack Morgenstein Date: Wed Jul 22 16:53:47 2015 +0300 net/mlx4_core: Fix wrong index in propagating port change event to VFs [ Upstream commit 1c1bf34951e8d17941bf708d1901c47e81b15d55 ] The port-change event processing in procedure mlx4_eq_int() uses "slave" as the vf_oper array index. Since the value of "slave" is the PF function index, the result is that the PF link state is used for deciding to propagate the event for all the VFs. The VF link state should be used, so the VF function index should be used here. Fixes: 948e306d7d64 ('net/mlx4: Add VF link state support') Signed-off-by: Jack Morgenstein Signed-off-by: Matan Barak Signed-off-by: Or Gerlitz Signed-off-by: David S. Miller Signed-off-by: Jiri Slaby commit 6056cd7880deb1567c7b875dd0b5cb6b5e684ef7 Author: Florian Westphal Date: Tue Jul 21 16:33:50 2015 +0200 netlink: don't hold mutex in rcu callback when releasing mmapd ring [ Upstream commit 0470eb99b4721586ccac954faac3fa4472da0845 ] Kirill A. Shutemov says: This simple test-case trigers few locking asserts in kernel: int main(int argc, char **argv) { unsigned int block_size = 16 * 4096; struct nl_mmap_req req = { .nm_block_size = block_size, .nm_block_nr = 64, .nm_frame_size = 16384, .nm_frame_nr = 64 * block_size / 16384, }; unsigned int ring_size; int fd; fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC); if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0) exit(1); if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0) exit(1); ring_size = req.nm_block_nr * req.nm_block_size; mmap(NULL, 2 * ring_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); return 0; } +++ exited with 0 +++ BUG: sleeping function called from invalid context at /home/kas/git/public/linux-mm/kernel/locking/mutex.c:616 in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init 3 locks held by init/1: #0: (reboot_mutex){+.+...}, at: [] SyS_reboot+0xa9/0x220 #1: ((reboot_notifier_list).rwsem){.+.+..}, at: [] __blocking_notifier_call_chain+0x39/0x70 #2: (rcu_callback){......}, at: [] rcu_do_batch.isra.49+0x160/0x10c0 Preemption disabled at:[] __delay+0xf/0x20 CPU: 1 PID: 1 Comm: init Not tainted 4.1.0-00009-gbddf4c4818e0 #253 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Debian-1.8.2-1 04/01/2014 ffff88017b3d8000 ffff88027bc03c38 ffffffff81929ceb 0000000000000102 0000000000000000 ffff88027bc03c68 ffffffff81085a9d 0000000000000002 ffffffff81ca2a20 0000000000000268 0000000000000000 ffff88027bc03c98 Call Trace: [] dump_stack+0x4f/0x7b [] ___might_sleep+0x16d/0x270 [] __might_sleep+0x4d/0x90 [] mutex_lock_nested+0x2f/0x430 [] ? _raw_spin_unlock_irqrestore+0x5d/0x80 [] ? __this_cpu_preempt_check+0x13/0x20 [] netlink_set_ring+0x1ed/0x350 [] ? netlink_undo_bind+0x70/0x70 [] netlink_sock_destruct+0x80/0x150 [] __sk_free+0x1d/0x160 [] sk_free+0x19/0x20 [..] Cong Wang says: We can't hold mutex lock in a rcu callback, [..] Thomas Graf says: The socket should be dead at this point. It might be simpler to add a netlink_release_ring() function which doesn't require locking at all. Reported-by: "Kirill A. Shutemov" Diagnosed-by: Cong Wang Suggested-by: Thomas Graf Signed-off-by: Florian Westphal Signed-off-by: David S. Miller Signed-off-by: Jiri Slaby commit 76c4d538cebac5e63b08777903d53e4376e294da Author: Edward Hyunkoo Jee Date: Tue Jul 21 09:43:59 2015 +0200 inet: frags: fix defragmented packet's IP header for af_packet [ Upstream commit 0848f6428ba3a2e42db124d41ac6f548655735bf ] When ip_frag_queue() computes positions, it assumes that the passed sk_buff does not contain L2 headers. However, when PACKET_FANOUT_FLAG_DEFRAG is used, IP reassembly functions can be called on outgoing packets that contain L2 headers. Also, IPv4 checksum is not corrected after reassembly. Fixes: 7736d33f4262 ("packet: Add pre-defragmentation support for ipv4 fanouts.") Signed-off-by: Edward Hyunkoo Jee Signed-off-by: Eric Dumazet Cc: Willem de Bruijn Cc: Jerry Chu Signed-off-by: David S. Miller Signed-off-by: Jiri Slaby commit 0068a2e07e38b0dd17c3b30efb3af5089c49cef1 Author: dingtianhong Date: Thu Jul 16 16:30:02 2015 +0800 bonding: correct the MAC address for "follow" fail_over_mac policy [ Upstream commit a951bc1e6ba58f11df5ed5ddc41311e10f5fd20b ] The "follow" fail_over_mac policy is useful for multiport devices that either become confused or incur a performance penalty when multiple ports are programmed with the same MAC address, but the same MAC address still may happened by this steps for this policy: 1) echo +eth0 > /sys/class/net/bond0/bonding/slaves bond0 has the same mac address with eth0, it is MAC1. 2) echo +eth1 > /sys/class/net/bond0/bonding/slaves eth1 is backup, eth1 has MAC2. 3) ifconfig eth0 down eth1 became active slave, bond will swap MAC for eth0 and eth1, so eth1 has MAC1, and eth0 has MAC2. 4) ifconfig eth1 down there is no active slave, and eth1 still has MAC1, eth2 has MAC2. 5) ifconfig eth0 up the eth0 became active slave again, the bond set eth0 to MAC1. Something wrong here, then if you set eth1 up, the eth0 and eth1 will have the same MAC address, it will break this policy for ACTIVE_BACKUP mode. This patch will fix this problem by finding the old active slave and swap them MAC address before change active slave. Signed-off-by: Ding Tianhong Tested-by: Nikolay Aleksandrov Signed-off-by: David S. Miller Signed-off-by: Jiri Slaby commit 7832561194c5dca63afaca38b562cd122e4b22bf Author: Nikolay Aleksandrov Date: Wed Jul 15 21:52:51 2015 +0200 bonding: fix destruction of bond with devices different from arphrd_ether [ Upstream commit 06f6d1094aa0992432b1e2a0920b0ee86ccd83bf ] When the bonding is being unloaded and the netdevice notifier is unregistered it executes NETDEV_UNREGISTER for each device which should remove the bond's proc entry but if the device enslaved is not of ARPHRD_ETHER type and is in front of the bonding, it may execute bond_release_and_destroy() first which would release the last slave and destroy the bond device leaving the proc entry and thus we will get the following error (with dynamic debug on for bond_netdev_event to see the events order): [ 908.963051] eql: event: 9 [ 908.963052] eql: IFF_SLAVE [ 908.963054] eql: event: 2 [ 908.963056] eql: IFF_SLAVE [ 908.963058] eql: event: 6 [ 908.963059] eql: IFF_SLAVE [ 908.963110] bond0: Releasing active interface eql [ 908.976168] bond0: Destroying bond bond0 [ 908.976266] bond0 (unregistering): Released all slaves [ 908.984097] ------------[ cut here ]------------ [ 908.984107] WARNING: CPU: 0 PID: 1787 at fs/proc/generic.c:575 remove_proc_entry+0x112/0x160() [ 908.984110] remove_proc_entry: removing non-empty directory 'net/bonding', leaking at least 'bond0' [ 908.984111] Modules linked in: bonding(-) eql(O) 9p nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ppdev qxl drm_kms_helper snd_hda_codec_generic aesni_intel ttm aes_x86_64 glue_helper pcspkr lrw gf128mul ablk_helper cryptd snd_hda_intel virtio_console snd_hda_codec psmouse serio_raw snd_hwdep snd_hda_core 9pnet_virtio 9pnet evdev joydev drm virtio_balloon snd_pcm snd_timer snd soundcore i2c_piix4 i2c_core pvpanic acpi_cpufreq parport_pc parport processor thermal_sys button autofs4 ext4 crc16 mbcache jbd2 hid_generic usbhid hid sg sr_mod cdrom ata_generic virtio_blk virtio_net floppy ata_piix e1000 libata ehci_pci virtio_pci scsi_mod uhci_hcd ehci_hcd virtio_ring virtio usbcore usb_common [last unloaded: bonding] [ 908.984168] CPU: 0 PID: 1787 Comm: rmmod Tainted: G W O 4.2.0-rc2+ #8 [ 908.984170] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 908.984172] 0000000000000000 ffffffff81732d41 ffffffff81525b34 ffff8800358dfda8 [ 908.984175] ffffffff8106c521 ffff88003595af78 ffff88003595af40 ffff88003e3a4280 [ 908.984178] ffffffffa058d040 0000000000000000 ffffffff8106c59a ffffffff8172ebd0 [ 908.984181] Call Trace: [ 908.984188] [] ? dump_stack+0x40/0x50 [ 908.984193] [] ? warn_slowpath_common+0x81/0xb0 [ 908.984196] [] ? warn_slowpath_fmt+0x4a/0x50 [ 908.984199] [] ? remove_proc_entry+0x112/0x160 [ 908.984205] [] ? bond_destroy_proc_dir+0x26/0x30 [bonding] [ 908.984208] [] ? bond_net_exit+0x8e/0xa0 [bonding] [ 908.984217] [] ? ops_exit_list.isra.4+0x37/0x70 [ 908.984225] [] ? unregister_pernet_operations+0x8d/0xd0 [ 908.984228] [] ? unregister_pernet_subsys+0x1d/0x30 [ 908.984232] [] ? bonding_exit+0x23/0xdba [bonding] [ 908.984236] [] ? SyS_delete_module+0x18a/0x250 [ 908.984241] [] ? task_work_run+0x89/0xc0 [ 908.984244] [] ? entry_SYSCALL_64_fastpath+0x16/0x75 [ 908.984247] ---[ end trace 7c006ed4abbef24b ]--- Thus remove the proc entry manually if bond_release_and_destroy() is used. Because of the checks in bond_remove_proc_entry() it's not a problem for a bond device to change namespaces (the bug fixed by the Fixes commit) but since commit f9399814927ad ("bonding: Don't allow bond devices to change network namespaces.") that can't happen anyway. Reported-by: Carol Soto Signed-off-by: Nikolay Aleksandrov Fixes: a64d49c3dd50 ("bonding: Manage /proc/net/bonding/ entries from the netdev events") Tested-by: Carol L Soto Signed-off-by: David S. Miller Signed-off-by: Jiri Slaby commit b29321b96de6167daf84306903cedc5b214fe5b7 Author: Eric Dumazet Date: Tue Jul 14 08:10:22 2015 +0200 ipv6: lock socket in ip6_datagram_connect() [ Upstream commit 03645a11a570d52e70631838cb786eb4253eb463 ] ip6_datagram_connect() is doing a lot of socket changes without socket being locked. This looks wrong, at least for udp_lib_rehash() which could corrupt lists because of concurrent udp_sk(sk)->udp_portaddr_hash accesses. Signed-off-by: Eric Dumazet Acked-by: Herbert Xu Signed-off-by: David S. Miller Signed-off-by: Jiri Slaby commit 4a187391f4c7d54b9c07eed08de48ffd6e0a3f20 Author: Tilman Schmidt Date: Tue Jul 14 00:37:13 2015 +0200 isdn/gigaset: reset tty->receive_room when attaching ser_gigaset [ Upstream commit fd98e9419d8d622a4de91f76b306af6aa627aa9c ] Commit 79901317ce80 ("n_tty: Don't flush buffer when closing ldisc"), first merged in kernel release 3.10, caused the following regression in the Gigaset M101 driver: Before that commit, when closing the N_TTY line discipline in preparation to switching to N_GIGASET_M101, receive_room would be reset to a non-zero value by the call to n_tty_flush_buffer() in n_tty's close method. With the removal of that call, receive_room might be left at zero, blocking data reception on the serial line. The present patch fixes that regression by setting receive_room to an appropriate value in the ldisc open method. Fixes: 79901317ce80 ("n_tty: Don't flush buffer when closing ldisc") Signed-off-by: Tilman Schmidt Signed-off-by: David S. Miller Signed-off-by: Jiri Slaby commit eefc4b8868b173154789fd1ed5f624040c4f0bcc Author: Nikolay Aleksandrov Date: Mon Jul 13 06:36:19 2015 -0700 bridge: mdb: fix double add notification [ Upstream commit 5ebc784625ea68a9570d1f70557e7932988cd1b4 ] Since the mdb add/del code was introduced there have been 2 br_mdb_notify calls when doing br_mdb_add() resulting in 2 notifications on each add. Example: Command: bridge mdb add dev br0 port eth1 grp 239.0.0.1 permanent Before patch: root@debian:~# bridge monitor all [MDB]dev br0 port eth1 grp 239.0.0.1 permanent [MDB]dev br0 port eth1 grp 239.0.0.1 permanent After patch: root@debian:~# bridge monitor all [MDB]dev br0 port eth1 grp 239.0.0.1 permanent Signed-off-by: Nikolay Aleksandrov Fixes: cfd567543590 ("bridge: add support of adding and deleting mdb entries") Signed-off-by: David S. Miller Signed-off-by: Jiri Slaby commit 9b87cdeed7c5e837b70a829fb4abbdda8170d6b6 Author: Herbert Xu Date: Tue Aug 4 15:42:47 2015 +0800 net: Fix skb_set_peeked use-after-free bug [ Upstream commit a0a2a6602496a45ae838a96db8b8173794b5d398 ] The commit 738ac1ebb96d02e0d23bc320302a6ea94c612dec ("net: Clone skb before setting peeked flag") introduced a use-after-free bug in skb_recv_datagram. This is because skb_set_peeked may create a new skb and free the existing one. As it stands the caller will continue to use the old freed skb. This patch fixes it by making skb_set_peeked return the new skb (or the old one if unchanged). Fixes: 738ac1ebb96d ("net: Clone skb before setting peeked flag") Reported-by: Brenden Blanco Signed-off-by: Herbert Xu Tested-by: Brenden Blanco Reviewed-by: Konstantin Khlebnikov Signed-off-by: David S. Miller Signed-off-by: Jiri Slaby commit 75ebad1ba5abced03ebd2e2a596d6bd6581e4c26 Author: Herbert Xu Date: Mon Jul 13 20:01:42 2015 +0800 net: Fix skb csum races when peeking [ Upstream commit 89c22d8c3b278212eef6a8cc66b570bc840a6f5a ] When we calculate the checksum on the recv path, we store the result in the skb as an optimisation in case we need the checksum again down the line. This is in fact bogus for the MSG_PEEK case as this is done without any locking. So multiple threads can peek and then store the result to the same skb, potentially resulting in bogus skb states. This patch fixes this by only storing the result if the skb is not shared. This preserves the optimisations for the few cases where it can be done safely due to locking or other reasons, e.g., SIOCINQ. Signed-off-by: Herbert Xu Acked-by: Eric Dumazet Signed-off-by: David S. Miller Signed-off-by: Jiri Slaby commit 339b0b16eddd5ac63419f0176ef95cc3c917b0f9 Author: Herbert Xu Date: Mon Jul 13 16:04:13 2015 +0800 net: Clone skb before setting peeked flag [ Upstream commit 738ac1ebb96d02e0d23bc320302a6ea94c612dec ] Shared skbs must not be modified and this is crucial for broadcast and/or multicast paths where we use it as an optimisation to avoid unnecessary cloning. The function skb_recv_datagram breaks this rule by setting peeked without cloning the skb first. This causes funky races which leads to double-free. This patch fixes this by cloning the skb and replacing the skb in the list when setting skb->peeked. Fixes: a59322be07c9 ("[UDP]: Only increment counter on first peek/recv") Reported-by: Konstantin Khlebnikov Signed-off-by: Herbert Xu Signed-off-by: David S. Miller Signed-off-by: Jiri Slaby commit d5f8ff935843779207a4100b38b6a43b0bf5cd07 Author: Julian Anastasov Date: Thu Jul 9 09:59:10 2015 +0300 net: call rcu_read_lock early in process_backlog [ Upstream commit 2c17d27c36dcce2b6bf689f41a46b9e909877c21 ] Incoming packet should be either in backlog queue or in RCU read-side section. Otherwise, the final sequence of flush_backlog() and synchronize_net() may miss packets that can run without device reference: CPU 1 CPU 2 skb->dev: no reference process_backlog:__skb_dequeue process_backlog:local_irq_enable on_each_cpu for flush_backlog => IPI(hardirq): flush_backlog - packet not found in backlog CPU delayed ... synchronize_net - no ongoing RCU read-side sections netdev_run_todo, rcu_barrier: no ongoing callbacks __netif_receive_skb_core:rcu_read_lock - too late free dev process packet for freed dev Fixes: 6e583ce5242f ("net: eliminate refcounting in backlog queue") Cc: Eric W. Biederman Cc: Stephen Hemminger Signed-off-by: Julian Anastasov Signed-off-by: David S. Miller Signed-off-by: Jiri Slaby commit f6533fed1bfa842e391ee50f9a9c5e963c71e579 Author: Julian Anastasov Date: Thu Jul 9 09:59:09 2015 +0300 net: do not process device backlog during unregistration [ Upstream commit e9e4dd3267d0c5234c5c0f47440456b10875dec9 ] commit 381c759d9916 ("ipv4: Avoid crashing in ip_error") fixes a problem where processed packet comes from device with destroyed inetdev (dev->ip_ptr). This is not expected because inetdev_destroy is called in NETDEV_UNREGISTER phase and packets should not be processed after dev_close_many() and synchronize_net(). Above fix is still required because inetdev_destroy can be called for other reasons. But it shows the real problem: backlog can keep packets for long time and they do not hold reference to device. Such packets are then delivered to upper levels at the same time when device is unregistered. Calling flush_backlog after NETDEV_UNREGISTER_FINAL still accounts all packets from backlog but before that some packets continue to be delivered to upper levels long after the synchronize_net call which is supposed to wait the last ones. Also, as Eric pointed out, processed packets, mostly from other devices, can continue to add new packets to backlog. Fix the problem by moving flush_backlog early, after the device driver is stopped and before the synchronize_net() call. Then use netif_running check to make sure we do not add more packets to backlog. We have to do it in enqueue_to_backlog context when the local IRQ is disabled. As result, after the flush_backlog and synchronize_net sequence all packets should be accounted. Thanks to Eric W. Biederman for the test script and his valuable feedback! Reported-by: Vittorio Gambaletta Fixes: 6e583ce5242f ("net: eliminate refcounting in backlog queue") Cc: Eric W. Biederman Cc: Stephen Hemminger Signed-off-by: Julian Anastasov Signed-off-by: David S. Miller Signed-off-by: Jiri Slaby commit 6f137d152fcf65346b399050835addf72ec04642 Author: Oleg Nesterov Date: Wed Jul 8 21:42:11 2015 +0200 net: pktgen: fix race between pktgen_thread_worker() and kthread_stop() [ Upstream commit fecdf8be2d91e04b0a9a4f79ff06499a36f5d14f ] pktgen_thread_worker() is obviously racy, kthread_stop() can come between the kthread_should_stop() check and set_current_state(). Signed-off-by: Oleg Nesterov Reported-by: Jan Stancek Reported-by: Marcelo Leitner Signed-off-by: David S. Miller Signed-off-by: Jiri Slaby commit 1085192164c63e293bbfa799a1e64ae4d7dae4a4 Author: Nikolay Aleksandrov Date: Tue Jul 7 15:55:56 2015 +0200 bridge: mdb: zero out the local br_ip variable before use [ Upstream commit f1158b74e54f2e2462ba5e2f45a118246d9d5b43 ] Since commit b0e9a30dd669 ("bridge: Add vlan id to multicast groups") there's a check in br_ip_equal() for a matching vlan id, but the mdb functions were not modified to use (or at least zero it) so when an entry was added it would have a garbage vlan id (from the local br_ip variable in __br_mdb_add/del) and this would prevent it from being matched and also deleted. So zero out the whole local ip var to protect ourselves from future changes and also to fix the current bug, since there's no vlan id support in the mdb uapi - use always vlan id 0. Example before patch: root@debian:~# bridge mdb add dev br0 port eth1 grp 239.0.0.1 permanent root@debian:~# bridge mdb dev br0 port eth1 grp 239.0.0.1 permanent root@debian:~# bridge mdb del dev br0 port eth1 grp 239.0.0.1 permanent RTNETLINK answers: Invalid argument After patch: root@debian:~# bridge mdb add dev br0 port eth1 grp 239.0.0.1 permanent root@debian:~# bridge mdb dev br0 port eth1 grp 239.0.0.1 permanent root@debian:~# bridge mdb del dev br0 port eth1 grp 239.0.0.1 permanent root@debian:~# bridge mdb Signed-off-by: Nikolay Aleksandrov Fixes: b0e9a30dd669 ("bridge: Add vlan id to multicast groups") Signed-off-by: David S. Miller Signed-off-by: Jiri Slaby commit f5aeec43af3251629bd50bd0075712325086f721 Author: Stephen Smalley Date: Tue Jul 7 09:43:45 2015 -0400 net/tipc: initialize security state for new connection socket [ Upstream commit fdd75ea8df370f206a8163786e7470c1277a5064 ] Calling connect() with an AF_TIPC socket would trigger a series of error messages from SELinux along the lines of: SELinux: Invalid class 0 type=AVC msg=audit(1434126658.487:34500): avc: denied { } for pid=292 comm="kworker/u16:5" scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass= permissive=0 This was due to a failure to initialize the security state of the new connection sock by the tipc code, leaving it with junk in the security class field and an unlabeled secid. Add a call to security_sk_clone() to inherit the security state from the parent socket. Reported-by: Tim Shearer Signed-off-by: Stephen Smalley Acked-by: Paul Moore Acked-by: Ying Xue Signed-off-by: David S. Miller Signed-off-by: Jiri Slaby commit 18213234d22ff613c6e5aa4e42ff1cd04598d855 Author: Timo Teräs Date: Tue Jul 7 08:34:13 2015 +0300 ip_tunnel: fix ipv4 pmtu check to honor inner ip header df [ Upstream commit fc24f2b2094366da8786f59f2606307e934cea17 ] Frag needed should be sent only if the inner header asked to not fragment. Currently fragmentation is broken if the tunnel has df set, but df was not asked in the original packet. The tunnel's df needs to be still checked to update internally the pmtu cache. Commit 23a3647bc4f93bac broke it, and this commit fixes the ipv4 df check back to the way it was. Fixes: 23a3647bc4f93bac ("ip_tunnels: Use skb-len to PMTU check.") Cc: Pravin B Shelar Signed-off-by: Timo Teräs Acked-by: Pravin B Shelar Signed-off-by: David S. Miller Signed-off-by: Jiri Slaby commit 9e7281c3e6677719d7db775f3df2a1031c57f3e7 Author: Daniel Borkmann Date: Tue Jul 7 00:07:52 2015 +0200 rtnetlink: verify IFLA_VF_INFO attributes before passing them to driver [ Upstream commit 4f7d2cdfdde71ffe962399b7020c674050329423 ] Jason Gunthorpe reported that since commit c02db8c6290b ("rtnetlink: make SR-IOV VF interface symmetric"), we don't verify IFLA_VF_INFO attributes anymore with respect to their policy, that is, ifla_vfinfo_policy[]. Before, they were part of ifla_policy[], but they have been nested since placed under IFLA_VFINFO_LIST, that contains the attribute IFLA_VF_INFO, which is another nested attribute for the actual VF attributes such as IFLA_VF_MAC, IFLA_VF_VLAN, etc. Despite the policy being split out from ifla_policy[] in this commit, it's never applied anywhere. nla_for_each_nested() only does basic nla_ok() testing for struct nlattr, but it doesn't know about the data context and their requirements. Fix, on top of Jason's initial work, does 1) parsing of the attributes with the right policy, and 2) using the resulting parsed attribute table from 1) instead of the nla_for_each_nested() loop (just like we used to do when still part of ifla_policy[]). Reference: http://thread.gmane.org/gmane.linux.network/368913 Fixes: c02db8c6290b ("rtnetlink: make SR-IOV VF interface symmetric") Reported-by: Jason Gunthorpe Cc: Chris Wright Cc: Sucheta Chakraborty Cc: Greg Rose Cc: Jeff Kirsher Cc: Rony Efraim Cc: Vlad Zolotarov Cc: Nicolas Dichtel Cc: Thomas Graf Signed-off-by: Jason Gunthorpe Signed-off-by: Daniel Borkmann Acked-by: Vlad Zolotarov Signed-off-by: David S. Miller Signed-off-by: Jiri Slaby commit 161a5358f11f2958c1699724d2661f907c948d73 Author: Eric Dumazet Date: Mon Jul 6 17:13:26 2015 +0200 net: graceful exit from netif_alloc_netdev_queues() [ Upstream commit d339727c2b1a10f25e6636670ab6e1841170e328 ] User space can crash kernel with ip link add ifb10 numtxqueues 100000 type ifb We must replace a BUG_ON() by proper test and return -EINVAL for crazy values. Fixes: 60877a32bce00 ("net: allow large number of tx queues") Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller Signed-off-by: Jiri Slaby commit e27e696c238ec5d7c2d445135be808de6047ced3 Author: Angga Date: Fri Jul 3 14:40:52 2015 +1200 ipv6: Make MLD packets to only be processed locally [ Upstream commit 4c938d22c88a9ddccc8c55a85e0430e9c62b1ac5 ] Before commit daad151263cf ("ipv6: Make ipv6_is_mld() inline and use it from ip6_mc_input().") MLD packets were only processed locally. After the change, a copy of MLD packet goes through ip6_mr_input, causing MRT6MSG_NOCACHE message to be generated to user space. Make MLD packet only processed locally. Fixes: daad151263cf ("ipv6: Make ipv6_is_mld() inline and use it from ip6_mc_input().") Signed-off-by: Hermin Anggawijaya Signed-off-by: David S. Miller Signed-off-by: Jiri Slaby commit 76c74a6e163886101a99e9caed3937b552f5250a Author: Dave Airlie Date: Thu Aug 20 10:13:55 2015 +1000 drm/radeon: fix hotplug race at startup commit 7f98ca454ad373fc1b76be804fa7138ff68c1d27 upstream. We apparantly get a hotplug irq before we've initialised modesetting, [drm] Loading R100 Microcode BUG: unable to handle kernel NULL pointer dereference at (null) IP: [] __mutex_lock_slowpath+0x23/0x91 *pde = 00000000 Oops: 0002 [#1] Modules linked in: radeon(+) drm_kms_helper ttm drm i2c_algo_bit backlight pcspkr psmouse evdev sr_mod input_leds led_class cdrom sg parport_pc parport floppy intel_agp intel_gtt lpc_ich acpi_cpufreq processor button mfd_core agpgart uhci_hcd ehci_hcd rng_core snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm usbcore usb_common i2c_i801 i2c_core snd_timer snd soundcore thermal_sys CPU: 0 PID: 15 Comm: kworker/0:1 Not tainted 4.2.0-rc7-00015-gbf67402 #111 Hardware name: MicroLink /D850MV , BIOS MV85010A.86A.0067.P24.0304081124 04/08/2003 Workqueue: events radeon_hotplug_work_func [radeon] task: f6ca5900 ti: f6d3e000 task.ti: f6d3e000 EIP: 0060:[] EFLAGS: 00010282 CPU: 0 EIP is at __mutex_lock_slowpath+0x23/0x91 EAX: 00000000 EBX: f5e900fc ECX: 00000000 EDX: fffffffe ESI: f6ca5900 EDI: f5e90100 EBP: f5e90000 ESP: f6d3ff0c DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 CR0: 8005003b CR2: 00000000 CR3: 36f61000 CR4: 000006d0 Stack: f5e90100 00000000 c103c4c1 f6d2a5a0 f5e900fc f6df394c c125f162 f8b0faca f6d2a5a0 c138ca00 f6df394c f7395600 c1034741 00d40000 00000000 f6d2a5a0 c138ca00 f6d2a5b8 c138ca10 c1034b58 00000001 f6d40000 f6ca5900 f6d0c940 Call Trace: [] ? dequeue_task_fair+0xa4/0xb7 [] ? mutex_lock+0x9/0xa [] ? radeon_hotplug_work_func+0x17/0x57 [radeon] [] ? process_one_work+0xfc/0x194 [] ? worker_thread+0x18d/0x218 [] ? rescuer_thread+0x1d5/0x1d5 [] ? kthread+0x7b/0x80 [] ? ret_from_kernel_thread+0x20/0x30 [] ? init_completion+0x18/0x18 Code: 42 08 e8 8e a6 dd ff c3 57 56 53 83 ec 0c 8b 35 48 f7 37 c1 8b 10 4a 74 1a 89 c3 8d 78 04 8b 40 08 89 63 Reported-and-Tested-by: Meelis Roos Signed-off-by: Dave Airlie Signed-off-by: Jiri Slaby commit 350624fa32cb152bfec51f236b0b62b8d480a05a Author: Mika Westerberg Date: Tue Jun 9 12:17:07 2015 +0300 mfd: lpc_ich: Assign subdevice ids automatically commit 1abf25a25b86dcfe28d243a5af71bd1c9d6de1ef upstream. Using -1 as platform device id means that the platform driver core will not assign any id to the device (the device name will not have id at all). This results problems on systems that have multiple PCHs (Platform Controller HUBs) because all of them also include their own copy of LPC device. All the subsequent device creations will fail because there already exists platform device with the same name. Fix this by passing PLATFORM_DEVID_AUTO as platform device id. This makes the platform device core to allocate new ids automatically. Signed-off-by: Mika Westerberg Signed-off-by: Lee Jones Signed-off-by: Jiri Slaby