The current open/lock state recovery unfortunately does not handle errors
such as NFS4ERR_CONN_NOT_BOUND_TO_SESSION correctly. Instead of looping,
just proceeds as if the state manager is finished recovering.
This patch ensures that we loop back, handle higher priority errors
and complete the open/lock state recovery.
If a NFSv4.x server returns NFS4ERR_STALE_CLIENTID in response to a
CREATE_SESSION or SETCLIENTID_CONFIRM in order to tell us that it rebooted
a second time, then the client will currently take this to mean that it must
declare all locks to be stale, and hence ineligible for reboot recovery.
RFC3530 and RFC5661 both suggest that the client should instead rely on the
server to respond to inelegible open share, lock and delegation reclaim
requests with NFS4ERR_NO_GRACE in this situation.
If the chosen baud rate is large enough (e.g. 3.5 megabaud), the
calculated n values in serial_omap_is_baud_mode16() may become 0. This
causes a division by zero when calculating the difference between
calculated and desired baud rates. To prevent this, cap the n13 and n16
values on 1.
Division by zero in kernel.
[<c00132e0>] (unwind_backtrace) from [<c00112ec>] (show_stack+0x10/0x14)
[<c00112ec>] (show_stack) from [<c01ed7bc>] (Ldiv0+0x8/0x10)
[<c01ed7bc>] (Ldiv0) from [<c023805c>] (serial_omap_baud_is_mode16+0x4c/0x68)
[<c023805c>] (serial_omap_baud_is_mode16) from [<c02396b4>] (serial_omap_set_termios+0x90/0x8d8)
[<c02396b4>] (serial_omap_set_termios) from [<c0230a0c>] (uart_change_speed+0xa4/0xa8)
[<c0230a0c>] (uart_change_speed) from [<c0231798>] (uart_set_termios+0xa0/0x1fc)
[<c0231798>] (uart_set_termios) from [<c022bb44>] (tty_set_termios+0x248/0x2c0)
[<c022bb44>] (tty_set_termios) from [<c022c17c>] (set_termios+0x248/0x29c)
[<c022c17c>] (set_termios) from [<c022c3e4>] (tty_mode_ioctl+0x1c8/0x4e8)
[<c022c3e4>] (tty_mode_ioctl) from [<c0227e70>] (tty_ioctl+0xa94/0xb18)
[<c0227e70>] (tty_ioctl) from [<c00cf45c>] (do_vfs_ioctl+0x4a0/0x560)
[<c00cf45c>] (do_vfs_ioctl) from [<c00cf568>] (SyS_ioctl+0x4c/0x74)
[<c00cf568>] (SyS_ioctl) from [<c000e480>] (ret_fast_syscall+0x0/0x30)
Signed-off-by: Frans Klaver <frans.klaver@xsens.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
This fix ensures that we never meet an integer overflow while adding
255 while parsing a variable length encoding. It works differently from
commit 206a81c ("lzo: properly check for overruns") because instead of
ensuring that we don't overrun the input, which is tricky to guarantee
due to many assumptions in the code, it simply checks that the cumulated
number of 255 read cannot overflow by bounding this number.
The MAX_255_COUNT is the maximum number of times we can add 255 to a base
count without overflowing an integer. The multiply will overflow when
multiplying 255 by more than MAXINT/255. The sum will overflow earlier
depending on the base count. Since the base count is taken from a u8
and a few bits, it is safe to assume that it will always be lower than
or equal to 2*255, thus we can always prevent any overflow by accepting
two less 255 steps.
This patch also reduces the CPU overhead and actually increases performance
by 1.1% compared to the initial code, while the previous fix costs 3.1%
(measured on x86_64).
The fix needs to be backported to all currently supported stable kernels.
Reported-by: Willem Pinckaers <willem@lekkertech.net> Cc: "Don A. Bailey" <donb@securitymouse.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Add a complete description of the LZO format as processed by the
decompressor. I have not found a public specification of this format
hence this analysis, which will be used to better understand the code.
Cc: Willem Pinckaers <willem@lekkertech.net> Cc: "Don A. Bailey" <donb@securitymouse.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
hwreg_present() and hwreg_write() temporarily change the VBR register to
another vector table. This table contains a valid bus error handler
only, all other entries point to arbitrary addresses.
If an interrupt comes in while the temporary table is active, the
processor will start executing at such an arbitrary address, and the
kernel will crash.
While most callers run early, before interrupts are enabled, or
explicitly disable interrupts, Finn Thain pointed out that macsonic has
one callsite that doesn't, causing intermittent boot crashes.
There's another unsafe callsite in hilkbd.
Fix this for good by disabling and restoring interrupts inside
hwreg_present() and hwreg_write().
Explicitly disabling interrupts can be removed from the callsites later.
function 'strncpy' will fill whole buffer 'id.name' of fixed size (32)
with string value and will not leave place for NULL-terminator.
Possible buffer boundaries violation in following string operations.
Replace strncpy with strlcpy.
Signed-off-by: Alexander Usyskin <alexander.usyskin@intel.com> Signed-off-by: Tomas Winkler <tomas.winkler@intel.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Eliminate the call to BUG_ON() by waiting for the host to respond. We are
trying to reclaim the ownership of memory that was given to the host and so
we will have to wait until the host responds.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Eliminate calls to BUG_ON() by properly handling errors. In cases where
rollback is possible, we will return the appropriate error to have the
calling code decide how to rollback state. In the case where we are
transferring ownership of the guest physical pages to the host,
we will wait for the host to respond.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Posting messages to the host can fail because of transient resource
related failures. Correctly deal with these failures and increase the
number of attempts to post the message before giving up.
In this version of the patch, I have normalized the error code to
Linux error code.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
When using a virtual SCSI disk in a VMWare VM if blkdev_issue_zeroout is used
data can be improperly zeroed out using the mptfusion driver. This patch
disables write_same for this driver and the vmware subsystem_vendor which
ensures that manual zeroing out is used instead.
BugLink: http://bugs.launchpad.net/bugs/1371591 Reported-by: Bruce Lucas <bruce.lucas@mongodb.com> Tested-by: Chris J Arges <chris.j.arges@canonical.com> Signed-off-by: Chris J Arges <chris.j.arges@canonical.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Dan Carpenter found a issue where be2iscsi would copy the ip
from userspace to the driver buffer before checking the len
of the data being copied:
http://marc.info/?l=linux-scsi&m=140982651504251&w=2
This patch just has us only copy what we the driver buffer
can support.
Tested-by: John Soni Jose <sony.john-n@emulex.com> Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Since we cannot make sure the 'val_count' will always be none zero
here, and then if it equals to zero, the kmemdup() will return
ZERO_SIZE_PTR, which equals to ((void *)16).
So this patch fix this with just doing the zero check before calling
kmemdup().
Signed-off-by: Xiubo Li <Li.Xiubo@freescale.com> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
If LOG_DEVICE is defined and map->dev is NULL it will lead to NULL
pointer dereference. This patch fixes this issue by adding check for
dev->NULL in all such places in regmap.c
Signed-off-by: Pankaj Dubey <pankaj.dubey@samsung.com> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
If 'map->dev' is NULL and there will lead dev_name() to be NULL pointer
dereference. So before dev_name(), we need to have check of the map->dev
pionter.
We also should make sure that the 'name' pointer shouldn't be NULL for
debugfs_create_dir(). So here using one default "dummy" debugfs name when
the 'name' pointer and 'map->dev' are both NULL.
Signed-off-by: Xiubo Li <Li.Xiubo@freescale.com> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
In case of 8 bit mode and DMA usage we end up with every second byte written as
0. We have to respect bits_per_word settings what this patch actually does.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Quark x1000 advertises PGE via the standard CPUID method
PGE bits exist in Quark X1000's PTEs. In order to flush
an individual PTE it is necessary to reload CR3 irrespective
of the PTE.PGE bit.
See Quark Core_DevMan_001.pdf section 6.4.11
This bug was fixed in Galileo kernels, unfixed vanilla kernels are expected to
crash and burn on this platform.
vcpu ioctls can hang the calling thread if issued while a vcpu is running.
However, invalid ioctls can happen when userspace tries to probe the kind
of file descriptors (e.g. isatty() calls ioctl(TCGETS)); in that case,
we know the ioctl is going to be rejected as invalid anyway and we can
fail before trying to take the vcpu mutex.
This patch does not change functionality, it just makes invalid ioctls
fail faster.
Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
vcpu exits and memslot mutations can run concurrently as long as the
vcpu does not aquire the slots mutex. Thus it is theoretically possible
for memslots to change underneath a vcpu that is handling an exit.
If we increment the memslot generation number again after
synchronize_srcu_expedited(), vcpus can safely cache memslot generation
without maintaining a single rcu_dereference through an entire vm exit.
And much of the x86/kvm code does not maintain a single rcu_dereference
of the current memslots during each exit.
We can prevent the following case:
vcpu (CPU 0) | thread (CPU 1)
--------------------------------------------+--------------------------
1 vm exit |
2 srcu_read_unlock(&kvm->srcu) |
3 decide to cache something based on |
old memslots |
4 | change memslots
| (increments generation)
5 | synchronize_srcu(&kvm->srcu);
6 retrieve generation # from new memslots |
7 tag cache with new memslot generation |
8 srcu_read_unlock(&kvm->srcu) |
... |
<action based on cache occurs even |
though the caching decision was based |
on the old memslots> |
... |
<action *continues* to occur until next |
memslot generation change, which may |
be never> |
|
By incrementing the generation after synchronizing with kvm->srcu readers,
we ensure that the generation retrieved in (6) will become invalid soon
after (8).
Keeping the existing increment is not strictly necessary, but we
do keep it and just move it for consistency from update_memslots to
install_new_memslots. It invalidates old cached MMIOs immediately,
instead of having to wait for the end of synchronize_srcu_expedited,
which makes the code more clearly correct in case CPU 1 is preempted
right after synchronize_srcu() returns.
To avoid halving the generation space in SPTEs, always presume that the
low bit of the generation is zero when reconstructing a generation number
out of an SPTE. This effectively disables MMIO caching in SPTEs during
the call to synchronize_srcu_expedited. Using the low bit this way is
somewhat like a seqcount---where the protected thing is a cache, and
instead of retrying we can simply punt if we observe the low bit to be 1.
Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The following events can lead to an incorrect KVM_EXIT_MMIO bubbling
up to userspace:
(1) Guest accesses gpa X without a memory slot. The gfn is cached in
struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets
the SPTE write-execute-noread so that future accesses cause
EPT_MISCONFIGs.
(2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION
covering the page just accessed.
(3) Guest attempts to read or write to gpa X again. On Intel, this
generates an EPT_MISCONFIG. The memory slot generation number that
was incremented in (2) would normally take care of this but we fast
path mmio faults through quickly_check_mmio_pf(), which only checks
the per-vcpu mmio cache. Since we hit the cache, KVM passes a
KVM_EXIT_MMIO up to userspace.
This patch fixes the issue by using the memslot generation number
to validate the mmio cache.
Signed-off-by: David Matlack <dmatlack@google.com>
[xiaoguangrong: adjust the code to make it simpler for stable-tree fix.] Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Tested-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
This patch adds the PCI id for Intel Quark ILB.
It will be used for GPIO and Multifunction device driver.
Signed-off-by: Josef Ahmad <josef.ahmad@intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Chang Rebecca Swee Fun <rebecca.swee.fun.chang@intel.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
We check whether transid is already committed via last_trans_committed and
then search through trans_list for pending transactions. If
last_trans_committed is updated by btrfs_commit_transaction after we check
it (there is no locking), we will fail to find the committed transaction
and return EINVAL to the caller. This has been observed occasionally by
ceph-osd (which uses this ioctl heavily).
Fix by rechecking whether the provided transid <= last_trans_committed
after the search fails, and if so return 0.
Signed-off-by: Sage Weil <sage@redhat.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Marc Merlin sent me a broken fs image months ago where it would blow up in the
upper->checked BUG_ON() in build_backref_tree. This is because we had a
scenario like this
block a -- level 4 (not shared)
|
block b -- level 3 (reloc block, shared)
|
block c -- level 2 (not shared)
|
block d -- level 1 (shared)
|
block e -- level 0 (shared)
We go to build a backref tree for block e, we notice block d is shared and add
it to the list of blocks to lookup it's backrefs for. Now when we loop around
we will check edges for the block, so we will see we looked up block c last
time. So we lookup block d and then see that the block that points to it is
block c and we can just skip that edge since we've already been up this path.
The problem is because we clear need_check when we see block d (as it is shared)
we never add block b as needing to be checked. And because block c is in our
path already we bail out before we walk up to block b and add it to the backref
check list.
To fix this we need to reset need_check if we trip over a block that doesn't
need to be checked. This will make sure that any subsequent blocks in the path
as we're walking up afterwards are added to the list to be processed. With this
patch I can now mount Marc's fs image and it'll complete the balance without
panicing. Thanks,
Reported-by: Marc MERLIN <marc@merlins.org> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
test, because it means it couldn't build the backref tree properly. This is
annoying to users and frankly a recoverable error, nothing in this function is
actually fatal since it is just an in-memory building of the backrefs for a
given bytenr. So go through and change all the BUG_ON()'s to ASSERT()'s, and
fix the BUG_ON(!upper->checked) thing to just return an error.
This patch also fixes the error handling so it tears down the work we've done
properly. This code was horribly broken since we always just panic'ed instead
of actually erroring out, so it needed to be completely re-worked. With this
patch my broken image no longer panics when I mount it. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
When doing log replay we may have to update inodes, which traditionally goes
through our delayed inode stuff. This will try to move space over from the
trans handle, but we don't reserve space in our trans handle on replay since we
don't know how much we will need, so instead we try to flush. But because we
have a trans handle open we won't flush anything, so if we are out of reserve
space we will simply return ENOSPC. Since we know that if an operation made it
into the log then we definitely had space before the box bought the farm then we
don't need to worry about doing this space reservation. Use the
fs_info->log_root_recovering flag to skip the delayed inode stuff and update the
item directly. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The transaction thread may want to do more work, namely it pokes the
cleaner ktread that will start processing uncleaned subvols.
This can be triggered by user via the 'btrfs fi sync' command, otherwise
there was a delay up to 30 seconds before the cleaner started to clean
old snapshots.
Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
swapper_low_pmd_dir and swapper_pud_dir are actually completely
useless and unnecessary.
We just need swapper_pg_dir[]. Naturally the other page table chunks
will be allocated on an as-needed basis. Since the kernel actually
accesses these tables in the PAGE_OFFSET view, there is not even a TLB
locality advantage of placing them in the kernel image.
Use the hard coded vmlinux.ld.S slot for swapper_pg_dir which is
naturally page aligned.
Increase MAX_BANKS to 1024 in order to handle heavily fragmented
virtual guests.
Even with this MAX_BANKS increase, the kernel is 20K+ smaller.
Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Bob Picco <bob.picco@oracle.com>
This patch attempts to do a few things. The highlights are: 1) enable
SPARSE_IRQ unconditionally, 2) kills off !SPARSE_IRQ code 3) allocates
ivector_table at boot time and 4) default to cookie only VIRQ mechanism
for supported firmware. The first firmware with cookie only support for
me appears on T5. You can optionally force the HV firmware to not cookie
only mode which is the sysino support.
The sysino is a deprecated HV mechanism according to the most recent
SPARC Virtual Machine Specification. HV_GRP_INTR is what controls the
cookie/sysino firmware versioning.
The history of this interface is:
1) Major version 1.0 only supported sysino based interrupt interfaces.
2) Major version 2.0 added cookie based VIRQs, however due to the fact
that OSs were using the VIRQs without negoatiating major version
2.0 (Linux and Solaris are both guilty), the VIRQs calls were
allowed even with major version 1.0
To complicate things even further, the VIRQ interfaces were only
actually hooked up in the hypervisor for LDC interrupt sources.
VIRQ calls on other device types would result in HV_EINVAL errors.
So effectively, major version 2.0 is unusable.
3) Major version 3.0 was created to signal use of VIRQs and the fact
that the hypervisor has these calls hooked up for all interrupt
sources, not just those for LDC devices.
A new boot option is provided should cookie only HV support have issues.
hvirq - this is the version for HV_GRP_INTR. This is related to HV API
versioning. The code attempts major=3 first by default. The option can
be used to override this default.
I've tested with SPARSE_IRQ on T5-8, M7-4 and T4-X and Jalap?no.
Signed-off-by: Bob Picco <bob.picco@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
In order to accomodate embedded per-cpu allocation with large numbers
of cpus and numa nodes, we have to use as much virtual address space
as possible for the vmalloc region. Otherwise we can get things like:
PERCPU: max_distance=0x380001c10000 too large for vmalloc space 0xff00000000
So, once we select a value for PAGE_OFFSET, derive the size of the
vmalloc region based upon that.
Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Bob Picco <bob.picco@oracle.com>
For sparse memory configurations, the vmemmap array behaves terribly
and it takes up an inordinate amount of space in the BSS section of
the kernel image unconditionally.
Just build huge PMDs and look them up just like we do for TLB misses
in the vmalloc area.
Kernel BSS shrinks by about 2MB.
Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Bob Picco <bob.picco@oracle.com>
If max_phys_bits needs to be > 43 (f.e. for T4 chips), things like
DEBUG_PAGEALLOC stop working because the 3-level page tables only
can cover up to 43 bits.
Another problem is that when we increased MAX_PHYS_ADDRESS_BITS up to
47, several statically allocated tables became enormous.
Compounding this is that we will need to support up to 49 bits of
physical addressing for M7 chips.
The two tables in question are sparc64_valid_addr_bitmap and
kpte_linear_bitmap.
The first holds a bitmap, with 1 bit for each 4MB chunk of physical
memory, indicating whether that chunk actually exists in the machine
and is valid.
The second table is a set of 2-bit values which tell how large of a
mapping (4MB, 256MB, 2GB, 16GB, respectively) we can use at each 256MB
chunk of ram in the system.
These tables are huge and take up an enormous amount of the BSS
section of the sparc64 kernel image. Specifically, the
sparc64_valid_addr_bitmap is 4MB, and the kpte_linear_bitmap is 128K.
So let's solve the space wastage and the DEBUG_PAGEALLOC problem
at the same time, by using the kernel page tables (as designed) to
manage this information.
We have to keep using large mappings when DEBUG_PAGEALLOC is disabled,
and we do this by encoding huge PMDs and PUDs.
On a T4-2 with 256GB of ram the kernel page table takes up 16K with
DEBUG_PAGEALLOC disabled and 256MB with it enabled. Furthermore, this
memory is dynamically allocated at run time rather than coded
statically into the kernel image.
Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Bob Picco <bob.picco@oracle.com>
The T5 (niagara5) has different PCR related HV fast trap values and a new
HV API Group. This patch utilizes these and shares when possible with niagara4.
We use the same sparc_pmu niagara4_pmu. Should there be new effort to
obtain the MCU perf statistics then this would have to be changed.
Cc: sparclinux@vger.kernel.org Signed-off-by: Bob Picco <bob.picco@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
On sparc64 "present" and "valid" are seperate PTE bits, this allows us to
naturally distinguish between the user explicitly asking for PROT_NONE
with mprotect() and other situations.
However we weren't handling this properly in the huge PMD paths.
First of all, the page table walker in the TSB miss path only checks
for _PAGE_PMD_HUGE. So the generic pmdp_invalidate() would clear
_PAGE_PRESENT but the TLB miss paths would still load it into the TLB
as a valid huge PMD.
Fix this by clearing the valid bit in pmdp_invalidate(), and also
checking the valid bit in USER_PGTABLE_CHECK_PMD_HUGE using "brgez"
since _PAGE_VALID is bit 63 in both the sun4u and sun4v pte layouts.
Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This commit broke the behavior of __copy_from_user_inatomic when
it is only partially successful. Instead of returning the number
of bytes not copied, it now returns 1. This translates to the
wrong value being returned by iov_iter_copy_from_user_atomic.
xfstests generic/246 and LTP writev01 both fail on btrfs and nfs
because of this.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: sparclinux@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
The SIMBA APB Bridges lacks the 'ranges' of-property describing the
PCI I/O and memory areas located beneath the bridge. Faking this
information has been performed by reading range registers in the
APB bridge, and calculating the corresponding areas.
In commit 01f94c4a6ced476ce69b895426fc29bfc48c69bd
("Fix sabre pci controllers with new probing scheme.") a bug was
introduced into this calculation, causing the PCI memory areas
to be calculated incorrectly: The shift size was set to be
identical for I/O and MEM ranges, which is incorrect.
Now that we have 64-bits for PMDs we can stop using special encodings
for the huge PMD values, and just put real PTEs in there.
We allocate a _PAGE_PMD_HUGE bit to distinguish between plain PMDs and
huge ones. It is the same for both 4U and 4V PTE layouts.
We also use _PAGE_SPECIAL to indicate the splitting state, since a
huge PMD cannot also be special.
All of the PMD --> PTE translation code disappears, and most of the
huge PMD bit modifications and tests just degenerate into the PTE
operations. In particular USER_PGTABLE_CHECK_PMD_HUGE becomes
trivial.
As a side effect, normal PMDs don't shift the physical address around.
This also speeds up the page table walks in the TLB miss paths since
they don't have to do the shifts any more.
Another non-trivial aspect is that pte_modify() has to be changed
to preserve the _PAGE_PMD_HUGE bits as well as the page size field
of the pte.
Signed-off-by: David S. Miller <davem@davemloft.net>
To make the page tables compact, we were using 32-bit PGDs and PMDs.
We only had to support <= 43 bits of physical addresses so this was
quite feasible.
In order to support larger physical addresses we have to move to
64-bit PGDs and PMDs.
Most of the changes are straight-forward:
1) {pgd,pmd}_t --> unsigned long
2) Anything that tries to use plain "unsigned int" types with pgd/pmd
values needs to be adjusted. In particular things like "0U" become
"0UL".
3) {PGDIR,PMD}_BITS decrease by one.
4) In the assembler page table walkers, use "ldxa" instead of "lduwa"
and adjust the low bit masks to clear out the low 3 bits instead of
just the low 2 bits during pgd/pmd address formation.
Also, use PTRS_PER_PGD and PTRS_PER_PMD in the sizing of the
swapper_{pg_dir,low_pmd_dir} arrays.
This patch does not try to take advantage of having 64-bits in the
PMDs to simplify the hugepage code, that will come in a subsequent
change.
Signed-off-by: David S. Miller <davem@davemloft.net>
The impetus for this is that we would like to move to 64-bit PMDs and
PGDs, but that would result in only supporting a 42-bit address space
with the current page table layout. It'd be nice to support at least
43-bits.
The reason we'd end up with only 42-bits after making PMDs and PGDs
64-bit is that we only use half-page sized PTE tables in order to make
PMDs line up to 4MB, the hardware huge page size we use.
So what we do here is we make huge pages 8MB, and fabricate them using
4MB hw TLB entries.
Facilitate this by providing a "REAL_HPAGE_SHIFT" which is used in
places that really need to operate on hardware 4MB pages.
Use full pages (512 entries) for PTE tables, and adjust PMD_SHIFT,
PGD_SHIFT, and the build time CPP test as needed. Use a CPP test to
make sure REAL_HPAGE_SHIFT and the _PAGE_SZHUGE_* we use match up.
This makes the pgtable cache completely unused, so remove the code
managing it and the state used in mm_context_t. Now we have less
spinlocks taken in the page table allocation path.
The technique we use to fabricate the 8MB pages is to transfer bit 22
from the missing virtual address into the PTEs physical address field.
That takes care of the transparent huge pages case.
For hugetlb, we fill things in at the PTE level and that code already
puts the sub huge page physical bits into the PTEs, based upon the
offset, so there is nothing special we need to do. It all just works
out.
So, a small amount of complexity in the THP case, but this code is
about to get much simpler when we move the 64-bit PMDs as we can move
away from the fancy 32-bit huge PMD encoding and just put a real PTE
value in there.
With bug fixes and help from Bob Picco.
Signed-off-by: David S. Miller <davem@davemloft.net>
Choose PAGE_OFFSET dynamically based upon cpu type.
Original UltraSPARC-I (spitfire) chips only supported a 44-bit
virtual address space.
Newer chips (T4 and later) support 52-bit virtual addresses
and up to 47-bits of physical memory space.
Therefore we have to adjust PAGE_SIZE dynamically based upon
the capabilities of the chip.
Note that this change alone does not allow us to support > 43-bit
physical memory, to do that we need to re-arrange our page table
support. The current encodings of the pmd_t and pgd_t pointers
restricts us to "32 + 11" == 43 bits.
This change can waste quite a bit of memory for the various tables.
In particular, a future change should work to size and allocate
kern_linear_bitmap[] and sparc64_valid_addr_bitmap[] dynamically.
This isn't easy as we really cannot take a TLB miss when accessing
kern_linear_bitmap[]. We'd have to lock it into the TLB or similar.
Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Bob Picco <bob.picco@oracle.com>
It is not sufficient to only implement get_user_pages_fast(), you
must also implement the atomic version __get_user_pages_fast()
otherwise you end up using the weak symbol fallback implementation
which simply returns zero.
This is dangerous, because it causes the futex code to loop forever
if transparent hugepages are supported (see get_futex_key()).
Signed-off-by: David S. Miller <davem@davemloft.net>
Meelis Roos reported that kernels built with gcc-4.9 do not boot, we
eventually narrowed this down to only impacting machines using
UltraSPARC-III and derivitive cpus.
The crash happens right when the first user process is spawned:
[ 54.451346] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004
[ 54.451346]
[ 54.571516] CPU: 1 PID: 1 Comm: init Not tainted 3.16.0-rc2-00211-gd7933ab #96
[ 54.666431] Call Trace:
[ 54.698453] [0000000000762f8c] panic+0xb0/0x224
[ 54.759071] [000000000045cf68] do_exit+0x948/0x960
[ 54.823123] [000000000042cbc0] fault_in_user_windows+0xe0/0x100
[ 54.902036] [0000000000404ad0] __handle_user_windows+0x0/0x10
[ 54.978662] Press Stop-A (L1-A) to return to the boot prom
[ 55.050713] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004
Further investigation showed that compiling only per_cpu_patch() with
an older compiler fixes the boot.
Detailed analysis showed that the function is not being miscompiled by
gcc-4.9, but it is using a different register allocation ordering.
With the gcc-4.9 compiled function, something during the code patching
causes some of the %i* input registers to get corrupted. Perhaps
we have a TLB miss path into the firmware that is deep enough to
cause a register window spill and subsequent restore when we get
back from the TLB miss trap.
Let's plug this up by doing two things:
1) Stop using the firmware stack for client interface calls into
the firmware. Just use the kernel's stack.
2) As soon as we can, call into a new function "start_early_boot()"
to put a one-register-window buffer between the firmware's
deepest stack frame and the top-most initial kernel one.
Reported-by: Meelis Roos <mroos@linux.ee> Tested-by: Meelis Roos <mroos@linux.ee> Signed-off-by: David S. Miller <davem@davemloft.net>
This is the longest boot string that silo supports.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Cc: Bob Picco <bob.picco@oracle.com> Cc: David S. Miller <davem@davemloft.net> Cc: sparclinux@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
This breaks the stack end corruption detection facility.
What that facility does it write a magic value to "end_of_stack()"
and checking to see if it gets overwritten.
"end_of_stack()" is "task_thread_info(p) + 1", which for sparc64 is
the beginning of the FPU register save area.
So once the user uses the FPU, the magic value is overwritten and the
debug checks trigger.
Fix this by making the size explicit.
Due to the size we use for the fpsaved[], gsr[], and xfsr[] arrays we
are limited to 7 levels of FPU state saves. So each FPU register set
is 256 bytes, allocate 256 * 7 for the fpregs area.
Reported-by: Meelis Roos <mroos@linux.ee> Signed-off-by: David S. Miller <davem@davemloft.net>
The AES loops in arch/sparc/crypto/aes_glue.c use a scheme where the
key material is preloaded into the FPU registers, and then we loop
over and over doing the crypt operation, reusing those pre-cooked key
registers.
There are intervening blkcipher*() calls between the crypt operation
calls. And those might perform memcpy() and thus also try to use the
FPU.
The sparc64 kernel FPU usage mechanism is designed to allow such
recursive uses, but with a catch.
There has to be a trap between the two FPU using threads of control.
The mechanism works by, when the FPU is already in use by the kernel,
allocating a slot for FPU saving at trap time. Then if, within the
trap handler, we try to use the FPU registers, the pre-trap FPU
register state is saved into the slot. Then at trap return time we
notice this and restore the pre-trap FPU state.
Over the long term there are various more involved ways we can make
this work, but for a quick fix let's take advantage of the fact that
the situation where this happens is very limited.
All sparc64 chips that support the crypto instructiosn also are using
the Niagara4 memcpy routine, and that routine only uses the FPU for
large copies where we can't get the source aligned properly to a
multiple of 8 bytes.
We look to see if the FPU is already in use in this context, and if so
we use the non-large copy path which only uses integer registers.
Furthermore, we also limit this special logic to when we are doing
kernel copy, rather than a user copy.
Signed-off-by: David S. Miller <davem@davemloft.net>
Inconsistently, the raw_* IRQ routines do not interact with and update
the irqflags tracing and lockdep state, whereas the raw_* spinlock
interfaces do.
This causes problems in p1275_cmd_direct() because we disable hardirqs
by hand using raw_local_irq_restore() and then do a raw_spin_lock()
which triggers a lockdep trace because the CPU's hw IRQ state doesn't
match IRQ tracing's internal software copy of that state.
The CPU's irqs are disabled, yet current->hardirqs_enabled is true.
====================
reboot: Restarting system
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1 at kernel/locking/lockdep.c:3536 check_flags+0x7c/0x240()
DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)
Modules linked in: openpromfs
CPU: 0 PID: 1 Comm: systemd-shutdow Tainted: G W 3.17.0-dirty #145
Call Trace:
[000000000045919c] warn_slowpath_common+0x5c/0xa0
[0000000000459210] warn_slowpath_fmt+0x30/0x40
[000000000048f41c] check_flags+0x7c/0x240
[0000000000493280] lock_acquire+0x20/0x1c0
[0000000000832b70] _raw_spin_lock+0x30/0x60
[000000000068f2fc] p1275_cmd_direct+0x1c/0x60
[000000000068ed28] prom_reboot+0x28/0x40
[000000000043610c] machine_restart+0x4c/0x80
[000000000047d2d4] kernel_restart+0x54/0x80
[000000000047d618] SyS_reboot+0x138/0x200
[00000000004060b4] linux_sparc_syscall32+0x34/0x60
---[ end trace 5c439fe81c05a100 ]---
possible reason: unannotated irqs-off.
irq event stamp: 2010267
hardirqs last enabled at (2010267): [<000000000049a358>] vprintk_emit+0x4b8/0x580
hardirqs last disabled at (2010266): [<0000000000499f08>] vprintk_emit+0x68/0x580
softirqs last enabled at (2010046): [<000000000045d278>] __do_softirq+0x378/0x4a0
softirqs last disabled at (2010039): [<000000000042bf08>] do_softirq_own_stack+0x28/0x40
Resetting ...
====================
Use local_* variables of the hw IRQ interfaces so that IRQ tracing sees
all of our changes.
Reported-by: Meelis Roos <mroos@linux.ee> Tested-by: Meelis Roos <mroos@linux.ee> Signed-off-by: David S. Miller <davem@davemloft.net>
When we have to split up a flush request into multiple pieces
(in order to avoid the firmware range) we don't specify the
arguments in the right order for the second piece.
Fix the order, or else we get hangs as the code tries to
flush "a lot" of entries and we get lockups like this:
This makes memset follow the standard (instead of returning 0 on success). This
is needed when certain versions of gcc optimizes around memset calls and assume
that the address argument is preserved in %o0.
Signed-off-by: Andreas Larsson <andreas@gaisler.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Dwight Engen <dwight.engen@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
We have seen an issue with guest boot into LDOM that causes early boot failures
because of no matching rules for node identitity of the memory. I analyzed this
on my T4 and concluded there might not be a solution. I saw the issue in
mainline too when booting into the control/primary domain - with guests
configured. Note, this could be a firmware bug on some older machines.
I'll provide a full explanation of the issues below. Should we not find a
matching BEST latency group for a real address (RA) then we will assume node 0.
On the T4-2 here with the information provided I can't see an alternative.
Technically the LDOM shown below should match the MBLOCK to the
favorable latency group. However other factors must be considered too. Were
the memory controllers configured "fine" grained interleave or "coarse"
grain interleaved - T4. Also should a "group" MD node be considered a NUMA
node?
There has to be at least one Machine Description (MD) "group" and hence one
NUMA node. The group can have one or more latency groups (lg) - more than one
memory controller. The current code chooses the smallest latency as the most
favorable per group. The latency and lg information is in MLGROUP below.
MBLOCK is the base and size of the RAs for the machine as fetched from OBP
/memory "available" property. My machine has one MBLOCK but more would be
possible - with holes?
For a T4-2 the following information has been gathered:
with LDOM guest
MEMBLOCK configuration:
memory size = 0x27f870000
memory.cnt = 0x3
memory[0x0] [0x00000020400000-0x0000029fc67fff], 0x27f868000 bytes
memory[0x1] [0x0000029fd8a000-0x0000029fd8bfff], 0x2000 bytes
memory[0x2] [0x0000029fd92000-0x0000029fd97fff], 0x6000 bytes
reserved.cnt = 0x2
reserved[0x0] [0x00000020800000-0x000000216c15c0], 0xec15c1 bytes
reserved[0x1] [0x00000024800000-0x0000002c180c1e], 0x7980c1f bytes
MBLOCK[0]: base[20000000] size[280000000] offset[0]
(note: "base" and "size" reported in "MBLOCK" encompass the "memory[X]" values)
(note: (RA + offset) & mask = val is the formula to detect a match for the
memory controller. should there be no match for find_node node, a return
value of -1 resulted for the node - BAD)
There is one group. It has these forward links
MLGROUP[1]: node[545] latency[1f7e8] match[200000000] mask[200000000]
MLGROUP[2]: node[54d] latency[2de60] match[0] mask[200000000]
NUMA NODE[0]: node[545] mask[200000000] val[200000000] (latency[1f7e8])
(note: "val" is the best lg's (smallest latency) "match")
there are two groups
group node[16d5]
MLGROUP[0]: node[1765] latency[1f7e8] match[0] mask[200000000]
MLGROUP[3]: node[177d] latency[2de60] match[200000000] mask[200000000]
NUMA NODE[0]: node[1765] mask[200000000] val[0] (latency[1f7e8])
group node[171d]
MLGROUP[2]: node[1775] latency[2de60] match[0] mask[200000000]
MLGROUP[1]: node[176d] latency[1f7e8] match[200000000] mask[200000000]
NUMA NODE[1]: node[176d] mask[200000000] val[200000000] (latency[1f7e8])
(note: for this two "group" bare metal machine, 1/2 memory is in group one's
lg and 1/2 memory is in group two's lg).
Cc: sparclinux@vger.kernel.org Signed-off-by: Bob Picco <bob.picco@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Every path that ends up at do_sparc64_fault() must install a valid
FAULT_CODE_* bitmask in the per-thread fault code byte.
Two paths leading to the label winfix_trampoline (which expects the
FAULT_CODE_* mask in register %g4) were not doing so:
1) For pre-hypervisor TLB protection violation traps, if we took
the 'winfix_trampoline' path we wouldn't have %g4 initialized
with the FAULT_CODE_* value yet. Resulting in using the
TLB_TAG_ACCESS register address value instead.
2) In the TSB miss path, when we notice that we are going to use a
hugepage mapping, but we haven't allocated the hugepage TSB yet, we
still have to take the window fixup case into consideration and
in that particular path we leave %g4 not setup properly.
Errors on this sort were largely invisible previously, but after
commit 4ccb9272892c33ef1c19a783cfa87103b30c2784 ("sparc64: sun4v TLB
error power off events") we now have a fault_code mask bit
(FAULT_CODE_BAD_RA) that triggers due to this bug.
FAULT_CODE_BAD_RA triggers because this bit is set in TLB_TAG_ACCESS
(see #1 above) and thus we get seemingly random bus errors triggered
for user processes.
Fixes: 4ccb9272892c ("sparc64: sun4v TLB error power off events") Reported-by: Meelis Roos <mroos@linux.ee> Signed-off-by: David S. Miller <davem@davemloft.net>
We've witnessed a few TLB events causing the machine to power off because
of prom_halt. In one case it was some nfs related area during rmmod. Another
was an mmapper of /dev/mem. A more recent one is an ITLB issue with
a bad pagesize which could be a hardware bug. Bugs happen but we should
attempt to not power off the machine and/or hang it when possible.
This is a DTLB error from an mmapper of /dev/mem:
[root@sparcie ~]# SUN4V-DTLB: Error at TPC[fffff80100903e6c], tl 1
SUN4V-DTLB: TPC<0xfffff80100903e6c>
SUN4V-DTLB: O7[fffff801081979d0]
SUN4V-DTLB: O7<0xfffff801081979d0>
SUN4V-DTLB: vaddr[fffff80100000000] ctx[1250] pte[98000000000f0610] error[2]
.
This is recent mainline for ITLB:
[ 3708.179864] SUN4V-ITLB: TPC<0xfffffc010071cefc>
[ 3708.188866] SUN4V-ITLB: O7[fffffc010071cee8]
[ 3708.197377] SUN4V-ITLB: O7<0xfffffc010071cee8>
[ 3708.206539] SUN4V-ITLB: vaddr[e0003] ctx[1a3c] pte[2900000dcc800eeb] error[4]
.
Normally sun4v_itlb_error_report() and sun4v_dtlb_error_report() would call
prom_halt() and drop us to OF command prompt "ok". This isn't the case for
LDOMs and the machine powers off.
For the HV reported error of HV_ENORADDR for HV HV_MMU_MAP_ADDR_TRAP we cause
a SIGBUS error by qualifying it within do_sparc64_fault() for fault code mask
of FAULT_CODE_BAD_RA. This is done when trap level (%tl) is less or equal
one("1"). Otherwise, for %tl > 1, we proceed eventually to die_if_kernel().
The logic of this patch was partially inspired by David Miller's feedback.
Power off of large sparc64 machines is painful. Plus die_if_kernel provides
more context. A reset sequence isn't a brief period on large sparc64 but
better than power-off/power-on sequence.
Cc: sparclinux@vger.kernel.org Signed-off-by: Bob Picco <bob.picco@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
dma_zalloc_coherent() calls dma_alloc_coherent(__GFP_ZERO)
but the sparc32 implementations sbus_alloc_coherent() and
pci32_alloc_coherent() doesn't take the gfp flags into
account.
Tested on the SPARC32/LEON GRETH Ethernet driver which fails
due to dma_alloc_coherent(__GFP_ZERO) returns non zeroed
pages.
Signed-off-by: Daniel Hellstrom <daniel@gaisler.com> Signed-off-by: David S. Miller <davem@davemloft.net>
da08143b8520 ("vlan: more careful checksum features handling")
which introduced more careful feature intersection in vlan code,
taking into account that HW_CSUM should be considered superset
of IP_CSUM/IPV6_CSUM. The same is needed in netif_skb_features()
in order to avoid offloading mismatch warning when vlan is
created on top of a bond consisting of slaves supporting IP/IPv6
checksumming but not vlan Tx offloading.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Dentry that had been through (or into) __dentry_kill() might be seen
by shrink_dentry_list(); that's normal, it'll be taken off the shrink
list and freed if __dentry_kill() has already finished. The problem
is, its ->d_parent might be pointing to already freed dentry, so
lock_parent() needs to be careful.
We need to check that dentry hasn't already gone into __dentry_kill()
*and* grab rcu_read_lock() before dropping ->d_lock - the latter makes
sure that whatever we see in ->d_parent after dropping ->d_lock it
won't be freed until we drop rcu_read_lock().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
lock_parent() very much on purpose does nested locking of dentries, and
is careful to maintain the right order (lock parent first). But because
it didn't annotate the nested locking order, lockdep thought it might be
a deadlock on d_lock, and complained.
Add the proper annotation for the inner locking of the child dentry to
make lockdep happy.
Introduced by commit 046b961b45f9 ("shrink_dentry_list(): take parent's
->d_lock earlier").
We have the same problem with ->d_lock order in the inner loop, where
we are dropping references to ancestors. Same solution, basically -
instead of using dentry_kill() we use lock_parent() (introduced in the
previous commit) to get that lock in a safe way, recheck ->d_count
(in case if lock_parent() has ended up dropping and retaking ->d_lock
and somebody managed to grab a reference during that window), trylock
the inode->i_lock and use __dentry_kill() to do the rest.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The cause of livelocks there is that we are taking ->d_lock on
dentry and its parent in the wrong order, forcing us to use
trylock on the parent's one. d_walk() takes them in the right
order, and unfortunately it's not hard to create a situation
when shrink_dentry_list() can't make progress since trylock
keeps failing, and shrink_dcache_parent() or check_submounts_and_drop()
keeps calling d_walk() disrupting the very shrink_dentry_list() it's
waiting for.
Solution is straightforward - if that trylock fails, let's unlock
the dentry itself and take locks in the right order. We need to
stabilize ->d_parent without holding ->d_lock, but that's doable
using RCU. And we'd better do that in the very beginning of the
loop in shrink_dentry_list(), since the checks on refcount, etc.
would need to be redone anyway.
That deals with a half of the problem - killing dentries on the
shrink list itself. Another one (dropping their parents) is
in the next commit.
locking parent is interesting - it would be easy to do rcu_read_lock(),
lock whatever we think is a parent, lock dentry itself and check
if the parent is still the right one. Except that we need to check
that *before* locking the dentry, or we are risking taking ->d_lock
out of order. Fortunately, once the D1 is locked, we can check if
D2->d_parent is equal to D1 without the need to lock D2; D2->d_parent
can start or stop pointing to D1 only under D1->d_lock, so taking
D1->d_lock is enough. In other words, the right solution is
rcu_read_lock/lock what looks like parent right now/check if it's
still our parent/rcu_read_unlock/lock the child.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Result will be massaged to saner shape in the next commits. It is
ugly, no questions - the point of that one is to be a provably
equivalent transformation (and it might be worth splitting a bit
more).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
It can happen only when dentry_kill() is called with unlock_on_failure
equal to 0 - other callers had dentry pinned until the moment they've
got ->d_lock and DCACHE_DENTRY_KILLED is set only after lockref_mark_dead().
IOW, only one of three call sites of dentry_kill() might end up reaching
that code. Just move it there.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Start with shrink_dcache_parent(), then scan what remains.
First of all, BUG() is very much an overkill here; we are holding
->s_umount, and hitting BUG() means that a lot of interesting stuff
will be hanging after that point (sync(2), for example). Moreover,
in cases when there had been more than one leak, we'll be better
off reporting all of them. And more than just the last component
of pathname - %pd is there for just such uses...
That was the last user of dentry_lru_del(), so kill it off...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
If we find something already on a shrink list, just increment
data->found and do nothing else. Loops in shrink_dcache_parent() and
check_submounts_and_drop() will do the right thing - everything we
did put into our list will be evicted and if there had been nothing,
but data->found got non-zero, well, we have somebody else shrinking
those guys; just try again.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
If the victim in on the shrink list, don't remove it from there.
If shrink_dentry_list() manages to remove it from the list before
we are done - fine, we'll just free it as usual. If not - mark
it with new flag (DCACHE_MAY_FREE) and leave it there.
Eventually, shrink_dentry_list() will get to it, remove the sucker
from shrink list and call dentry_kill(dentry, 0). Which is where
we'll deal with freeing.
Since now dentry_kill(dentry, 0) may happen after or during
dentry_kill(dentry, 1), we need to recognize that (by seeing
DCACHE_DENTRY_KILLED already set), unlock everything
and either free the sucker (in case DCACHE_MAY_FREE has been
set) or leave it for ongoing dentry_kill(dentry, 1) to deal with.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>