| From | Sent On | Attachments |
|---|---|---|
| Marek Marczykowski-Górecki | May 22, 2015 4:49 am | |
| David Vrabel | May 22, 2015 9:25 am | |
| Marek Marczykowski-Górecki | May 22, 2015 9:42 am | |
| David Vrabel | May 22, 2015 9:58 am | |
| Marek Marczykowski-Górecki | May 22, 2015 10:13 am | |
| David Vrabel | May 26, 2015 3:56 am | |
| Marek Marczykowski-Górecki | May 26, 2015 3:03 pm | |
| Marek Marczykowski-Górecki | Oct 21, 2015 11:57 am | |
| Marek Marczykowski-Górecki | Nov 16, 2015 6:45 pm | |
| David Vrabel | Nov 17, 2015 3:59 am | |
| Konrad Rzeszutek Wilk | Dec 1, 2015 2:00 pm | |
| Marek Marczykowski-Górecki | Dec 1, 2015 2:32 pm | |
| Konrad Rzeszutek Wilk | Jan 20, 2016 1:59 pm | |
| Joao Martins | Jan 21, 2016 4:30 am | |
| Marek Marczykowski-Górecki | Jan 22, 2016 11:23 am | .log, .conf, .txt, 3 more |
| Subject: | Re: [Xen-devel] xen-netfront crash when detaching network while some network activity |
|---|---|
| From: | Marek Marczykowski-Górecki (marm...@invisiblethingslab.com) |
| Date: | Oct 21, 2015 11:57:34 am |
| List: | com.xensource.lists.xen-devel |
On Wed, May 27, 2015 at 12:03:12AM +0200, Marek Marczykowski-Górecki wrote:
On Tue, May 26, 2015 at 11:56:00AM +0100, David Vrabel wrote:
On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
Hi all,
I'm experiencing a xen-netfront crash when doing xl network-detach while some network activity is going on at the same time. It happens only when the domU has more than one vcpu. Not sure if this matters, but the backend is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel 3.9.4 and on 4.1-rc1 as well.
Steps to reproduce:
1. Start the domU with some network interface
2. Call there 'ping -f some-IP'
3. Call 'xl network-detach NAME 0'
There's a use-after-free in xennet_remove(). Does this patch fix it?
Unfortunately not. Note that the crash is in xennet_disconnect_backend, which is called before xennet_destroy_queues in xennet_remove. I've tried adding napi_disable and even netif_napi_del just after napi_synchronize in xennet_disconnect_backend (which would probably cause a crash when the same cleanup is attempted again later), but it doesn't help - the crash is the same (still in gnttab_end_foreign_access called from xennet_disconnect_backend).
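For reference, a rough sketch of the teardown order in xennet_remove() as I read it (paraphrased from drivers/net/xen-netfront.c, not a verbatim copy); the point is that everything xennet_disconnect_backend() touches is handled well before xennet_destroy_queues() runs:

```c
/* Rough sketch of the teardown order in xennet_remove() (paraphrased
 * from drivers/net/xen-netfront.c of that era, not a verbatim copy). */
static int xennet_remove(struct xenbus_device *dev)
{
	struct netfront_info *info = dev_get_drvdata(&dev->dev);

	xennet_disconnect_backend(info);   /* <- crash happens in here   */

	unregister_netdev(info->netdev);

	xennet_destroy_queues(info);       /* <- queue/NAPI cleanup later */
	free_netdev(info->netdev);

	return 0;
}
```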
Finally I've found some more time to debug this... All tests redone on v4.3-rc6 frontend and 3.18.17 backend.
Looking at xennet_tx_buf_gc(), I have the impression that the shared page (queue->grant_tx_page[id]) is/should be freed by some other means than (indirectly) calling free_page via gnttab_end_foreign_access. Maybe the bug is that the page _is_ actually freed somewhere else already? At least changing gnttab_end_foreign_access to gnttab_end_foreign_access_ref makes the crash go away.
Relevant xennet_tx_buf_gc fragment:

```c
gnttab_end_foreign_access_ref(
	queue->grant_tx_ref[id], GNTMAP_readonly);
gnttab_release_grant_reference(
	&queue->gref_tx_head, queue->grant_tx_ref[id]);
queue->grant_tx_ref[id] = GRANT_INVALID_REF;
queue->grant_tx_page[id] = NULL;
add_id_to_freelist(&queue->tx_skb_freelist, queue->tx_skbs, id);
dev_kfree_skb_irq(skb);
```
And a similar fragment from xennet_release_tx_bufs:

```c
get_page(queue->grant_tx_page[i]);
gnttab_end_foreign_access(queue->grant_tx_ref[i],
			  GNTMAP_readonly,
			  (unsigned long)page_address(queue->grant_tx_page[i]));
queue->grant_tx_page[i] = NULL;
queue->grant_tx_ref[i] = GRANT_INVALID_REF;
add_id_to_freelist(&queue->tx_skb_freelist, queue->tx_skbs, i);
dev_kfree_skb_irq(skb);
```
Note that both have dev_kfree_skb_irq, but the former uses gnttab_end_foreign_access_ref, while the latter uses gnttab_end_foreign_access. Also note that the crash is in gnttab_end_foreign_access, so before dev_kfree_skb_irq. If this were a double free, I'd expect the crash in the latter.
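For context, gnttab_end_foreign_access() ends up freeing the page itself once the grant is revoked, while gnttab_end_foreign_access_ref() only revokes the grant. Roughly (paraphrased from drivers/xen/grant-table.c, not a verbatim copy):

```c
/* Paraphrase of gnttab_end_foreign_access() from drivers/xen/grant-table.c:
 * revoke the grant and, if a page was passed in, free it too; if the
 * backend still has the frame mapped, defer the whole thing instead. */
void gnttab_end_foreign_access(grant_ref_t ref, int readonly,
			       unsigned long page)
{
	if (gnttab_end_foreign_access_ref(ref, readonly)) {
		put_free_entry(ref);
		if (page != 0)
			free_page(page);
	} else
		gnttab_add_deferred(ref, readonly,
				    page ? virt_to_page(page) : NULL);
}
```

So with the _ref variant the page is never handed back to the allocator by the grant-table code, which is presumably where the leak concern below comes from.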
This change was introduced by cefe007 ("xen-netfront: fix resource leak in netfront"). I'm not sure whether changing gnttab_end_foreign_access back to gnttab_end_foreign_access_ref would (re)introduce a memory leak.
Let me paste the error message again:
```
[ 73.718636] page:ffffea000043b1c0 count:0 mapcount:0 mapping: (null) index:0x0
[ 73.718661] flags: 0x3ffc0000008000(tail)
[ 73.718684] page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
[ 73.718725] ------------[ cut here ]------------
[ 73.718743] kernel BUG at include/linux/mm.h:338!
```
Also it all looks quite strange - there is a get_page() call just before gnttab_end_foreign_access, but page->_count is still 0. Maybe it has something to do with how get_page() works on "tail" pages (whatever that means)?
```c
static inline void get_page(struct page *page)
{
	if (unlikely(PageTail(page)))
		if (likely(__get_page_tail(page)))
			return;
	/*
	 * Getting a normal page or the head of a compound page
	 * requires to already have an elevated page->_count.
	 */
	VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
	atomic_inc(&page->_count);
}
```
which (I think) ends up in:
```c
static inline void __get_page_tail_foll(struct page *page,
					bool get_page_head)
{
	/*
	 * If we're getting a tail page, the elevated page->_count is
	 * required only in the head page and we will elevate the head
	 * page->_count and tail page->_mapcount.
	 *
	 * We elevate page_tail->_mapcount for tail pages to force
	 * page_tail->_count to be zero at all times to avoid getting
	 * false positives from get_page_unless_zero() with
	 * speculative page access (like in
	 * page_cache_get_speculative()) on tail pages.
	 */
	VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
	if (get_page_head)
		atomic_inc(&page->first_page->_count);
	get_huge_page_tail(page);
}
```
So the use counter is incremented in page->first_page->_count, not page->_count. According to the comment it should also bump the tail page's _mapcount, but the error message above shows mapcount:0, so apparently it does not.
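One way to check that hypothesis (purely hypothetical instrumentation, not something that was actually run) would be to dump the counters of both the tail page and its compound head right before the get_page() in xennet_release_tx_bufs:

```c
/* Hypothetical debug instrumentation (not actually run): print the
 * refcount of the tail page and of its compound head before get_page(),
 * to see whether the head page has already been freed. */
struct page *pg = queue->grant_tx_page[i];
struct page *head = compound_head(pg);

pr_info("xen-netfront: tail %p _count=%d _mapcount=%d, head %p _count=%d\n",
	pg, atomic_read(&pg->_count), page_mapcount(pg),
	head, atomic_read(&head->_count));

get_page(pg);
```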
Any ideas?
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
_______________________________________________
Xen-devel mailing list
Xen-...@lists.xen.org
http://lists.xen.org/xen-devel





