15 messages in com.xensource.lists.xen-devel
Re: [Xen-devel] xen-netfront crash when detaching network while some network activity

From                              Sent On                  Attachments
Marek Marczykowski-Górecki        May 22, 2015 4:49 am
David Vrabel                      May 22, 2015 9:25 am
Marek Marczykowski-Górecki        May 22, 2015 9:42 am
David Vrabel                      May 22, 2015 9:58 am
Marek Marczykowski-Górecki        May 22, 2015 10:13 am
David Vrabel                      May 26, 2015 3:56 am
Marek Marczykowski-Górecki        May 26, 2015 3:03 pm
Marek Marczykowski-Górecki        Oct 21, 2015 11:57 am
Marek Marczykowski-Górecki        Nov 16, 2015 6:45 pm
David Vrabel                      Nov 17, 2015 3:59 am
Konrad Rzeszutek Wilk             Dec 1, 2015 2:00 pm
Marek Marczykowski-Górecki        Dec 1, 2015 2:32 pm
Konrad Rzeszutek Wilk             Jan 20, 2016 1:59 pm
Joao Martins                      Jan 21, 2016 4:30 am
Marek Marczykowski-Górecki        Jan 22, 2016 11:23 am    .log, .conf, .txt, 3 more
Subject: Re: [Xen-devel] xen-netfront crash when detaching network while some network activity
From: Marek Marczykowski-Górecki (marm@invisiblethingslab.com)
Date: Oct 21, 2015 11:57:34 am
List: com.xensource.lists.xen-devel

On Wed, May 27, 2015 at 12:03:12AM +0200, Marek Marczykowski-Górecki wrote:

On Tue, May 26, 2015 at 11:56:00AM +0100, David Vrabel wrote:

On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:

Hi all,

I'm experiencing a xen-netfront crash when doing xl network-detach while some network activity is going on at the same time. It happens only when the domU has more than one vcpu. Not sure if this matters, but the backend is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel 3.9.4 and on 4.1-rc1 as well.

Steps to reproduce:
1. Start the domU with some network interface
2. Run 'ping -f some-IP' there
3. Call 'xl network-detach NAME 0'

There's a use-after-free in xennet_remove(). Does this patch fix it?

Unfortunately not. Note that the crash is in xennet_disconnect_backend, which is called before xennet_destroy_queues in xennet_remove. I've tried adding napi_disable and even netif_napi_del just after napi_synchronize in xennet_disconnect_backend (which would probably cause a crash when trying to clean up the same thing again later), but it doesn't help - the crash is the same (still in gnttab_end_foreign_access called from xennet_disconnect_backend).
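For reference, the experiment above looked roughly like this in xennet_disconnect_backend (placement reconstructed from memory, not an exact patch):

		napi_synchronize(&queue->napi);
		/* experiment: also stop and remove NAPI here; did not help,
		 * and would probably break the later cleanup done in
		 * xennet_destroy_queues */
		napi_disable(&queue->napi);
		netif_napi_del(&queue->napi);

		xennet_release_tx_bufs(queue);
		xennet_release_rx_bufs(queue);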

Finally I've found some more time to debug this... All tests redone on v4.3-rc6 frontend and 3.18.17 backend.

Looking at xennet_tx_buf_gc(), I have the impression that the shared page (queue->grant_tx_page[id]) is/should be freed by some means other than (indirectly) calling free_page via gnttab_end_foreign_access. Maybe the bug is that the page _is_ actually freed somewhere else already? At least changing gnttab_end_foreign_access to gnttab_end_foreign_access_ref makes the crash go away.
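To make that free_page path explicit: as far as I can tell from drivers/xen/grant-table.c, gnttab_end_foreign_access with a non-zero page argument behaves roughly like this (paraphrased sketch, not the verbatim code):

void gnttab_end_foreign_access(grant_ref_t ref, int readonly,
			       unsigned long page)
{
	if (gnttab_end_foreign_access_ref(ref, readonly)) {
		/* grant revoked: release the ref and free the page now */
		if (page != 0)
			free_page(page);
	} else {
		/* backend still has the grant mapped: put the ref and the
		 * page on a deferred list and free them later */
	}
}

gnttab_end_foreign_access_ref, by contrast, only revokes the grant and never touches the page.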

Relevant xennet_tx_buf_gc fragment:

	gnttab_end_foreign_access_ref(
		queue->grant_tx_ref[id], GNTMAP_readonly);
	gnttab_release_grant_reference(
		&queue->gref_tx_head, queue->grant_tx_ref[id]);
	queue->grant_tx_ref[id] = GRANT_INVALID_REF;
	queue->grant_tx_page[id] = NULL;
	add_id_to_freelist(&queue->tx_skb_freelist, queue->tx_skbs, id);
	dev_kfree_skb_irq(skb);

And a similar fragment from xennet_release_tx_bufs:

	get_page(queue->grant_tx_page[i]);
	gnttab_end_foreign_access(queue->grant_tx_ref[i],
				  GNTMAP_readonly,
				  (unsigned long)page_address(queue->grant_tx_page[i]));
	queue->grant_tx_page[i] = NULL;
	queue->grant_tx_ref[i] = GRANT_INVALID_REF;
	add_id_to_freelist(&queue->tx_skb_freelist, queue->tx_skbs, i);
	dev_kfree_skb_irq(skb);

Note that both have dev_kfree_skb_irq, but the former uses gnttab_end_foreign_access_ref, while the latter uses gnttab_end_foreign_access. Also note that the crash is in gnttab_end_foreign_access, so before dev_kfree_skb_irq. If this were a double free, I'd expect the crash in the latter.

This change was introduced by cefe007 "xen-netfront: fix resource leak in netfront". I'm not sure whether changing gnttab_end_foreign_access back to gnttab_end_foreign_access_ref wouldn't (re)introduce some memory leak.
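For completeness, the change I tested in xennet_release_tx_bufs is roughly the following (my local experiment as best I can reconstruct it, not a proposed fix - if the backend still has the grant mapped, this presumably leaks the page, which is what cefe007 was about):

	/* experiment: end the grant but do not hand the page to
	 * gnttab_end_foreign_access for freeing (the get_page/free_page
	 * pair is dropped along with it) */
	gnttab_end_foreign_access_ref(queue->grant_tx_ref[i],
				      GNTMAP_readonly);
	queue->grant_tx_page[i] = NULL;
	queue->grant_tx_ref[i] = GRANT_INVALID_REF;
	add_id_to_freelist(&queue->tx_skb_freelist, queue->tx_skbs, i);
	dev_kfree_skb_irq(skb);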

Let me paste the error message again:

[   73.718636] page:ffffea000043b1c0 count:0 mapcount:0 mapping:          (null) index:0x0
[   73.718661] flags: 0x3ffc0000008000(tail)
[   73.718684] page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
[   73.718725] ------------[ cut here ]------------
[   73.718743] kernel BUG at include/linux/mm.h:338!

Also it all looks quite strange - there is a get_page() call just before gnttab_end_foreign_access, but page->_count is still 0. Maybe it has something to do with how get_page() works on "tail" pages (whatever that means)?

static inline void get_page(struct page *page)
{
	if (unlikely(PageTail(page)))
		if (likely(__get_page_tail(page)))
			return;
	/*
	 * Getting a normal page or the head of a compound page
	 * requires to already have an elevated page->_count.
	 */
	VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
	atomic_inc(&page->_count);
}

which (I think) ends up in:

static inline void __get_page_tail_foll(struct page *page,
					bool get_page_head)
{
	/*
	 * If we're getting a tail page, the elevated page->_count is
	 * required only in the head page and we will elevate the head
	 * page->_count and tail page->_mapcount.
	 *
	 * We elevate page_tail->_mapcount for tail pages to force
	 * page_tail->_count to be zero at all times to avoid getting
	 * false positives from get_page_unless_zero() with
	 * speculative page access (like in
	 * page_cache_get_speculative()) on tail pages.
	 */
	VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
	if (get_page_head)
		atomic_inc(&page->first_page->_count);
	get_huge_page_tail(page);
}

So the use counter is incremented in page->first_page->_count, not page->_count. But according to the comment it should also influence page->_mapcount, while the error message says it does not (mapcount:0).

Any ideas?