libgo/go/runtime/mgcscavenge.go

   1 // Copyright 2019 The Go Authors. All rights reserved.
   2 // Use of this source code is governed by a BSD-style
   3 // license that can be found in the LICENSE file.
   4
   5 // Scavenging free pages.
   6 //
   7 // This file implements scavenging (the release of physical pages backing mapped
   8 // memory) of free and unused pages in the heap as a way to deal with page-level
   9 // fragmentation and reduce the RSS of Go applications.
  10 //
  11 // Scavenging in Go happens on two fronts: there's the background
  12 // (asynchronous) scavenger and the heap-growth (synchronous) scavenger.
  13 //
  14 // The former happens on a goroutine much like the background sweeper which is
  15 // soft-capped at using scavengePercent of the mutator's time, based on
  16 // order-of-magnitude estimates of the costs of scavenging. The background
  17 // scavenger's primary goal is to bring the estimated heap RSS of the
  18 // application down to a goal.
  19 //
  20 // That goal is defined as:
  21 //   (retainExtraPercent+100) / 100 * (next_gc / last_next_gc) * last_heap_inuse
  22 //
  23 // Essentially, we wish to have the application's RSS track the heap goal, but
  24 // the heap goal is defined in terms of bytes of objects, rather than pages like
  25 // RSS. As a result, we need to account for fragmentation internal to
  26 // spans. next_gc / last_next_gc defines the ratio between the current heap goal
  27 // and the last heap goal, which tells us by how much the heap is growing and
  28 // shrinking. We estimate what the heap will grow to in terms of pages by taking
  29 // this ratio and multiplying it by heap_inuse at the end of the last GC, which
  30 // allows us to account for this additional fragmentation. Note that this
  31 // procedure makes the assumption that the degree of fragmentation won't change
  32 // dramatically over the next GC cycle. Overestimating the amount of
  33 // fragmentation simply results in higher memory use, which will be accounted
  34 // for by the next pacing update. Underestimating the fragmentation however
  35 // could lead to performance degradation. Handling this case is not within the
  36 // scope of the scavenger. Situations where the amount of fragmentation balloons
  37 // over the course of a single GC cycle should be considered pathologies,
  38 // flagged as bugs, and fixed appropriately.
  39 //
  40 // An additional factor of retainExtraPercent is added as a buffer to help ensure
  41 // that there's more unscavenged memory to allocate out of, since each allocation
  42 // out of scavenged memory incurs a potentially expensive page fault.
  43 //
  44 // The goal is updated after each GC and the scavenger's pacing parameters
  45 // (which live in mheap_) are updated to match. The pacing parameters work much
  46 // like the background sweeping parameters. The parameters define a line whose
  47 // horizontal axis is time and vertical axis is estimated heap RSS, and the
  48 // scavenger attempts to stay below that line at all times.
  49 //
  50 // The synchronous heap-growth scavenging happens whenever the heap grows in
  51 // size, for some definition of heap-growth. The intuition behind this is that
  52 // the application had to grow the heap because existing fragments were
  53 // not sufficiently large to satisfy a page-level memory allocation, so we
  54 // scavenge those fragments eagerly to offset the growth in RSS that results.
  55
  56 package runtime
  57
  58 import (
  59         "runtime/internal/atomic"
  60         "runtime/internal/sys"
  61         "unsafe"
  62 )
  63
  64 const (
  65         // The background scavenger is paced according to these parameters.
  66         //
  67         // scavengePercent represents the portion of mutator time we're willing
  68         // to spend on scavenging in percent.
  69         scavengePercent = 1 // 1%
  70
  71         // retainExtraPercent represents the amount of memory over the heap goal
  72         // that the scavenger should keep as a buffer space for the allocator.
  73         //
  74         // The purpose of maintaining this overhead is to have a greater pool of
  75         // unscavenged memory available for allocation (since using scavenged memory
  76         // incurs an additional cost), to account for heap fragmentation and
  77         // the ever-changing layout of the heap.
  78         retainExtraPercent = 10
  79
  80         // maxPagesPerPhysPage is the maximum number of supported runtime pages per
  81         // physical page, based on maxPhysPageSize.
  82         maxPagesPerPhysPage = maxPhysPageSize / pageSize
  83 )
  84
  85 // heapRetained returns an estimate of the current heap RSS.
  86 func heapRetained() uint64 {
  87         return atomic.Load64(&memstats.heap_sys) - atomic.Load64(&memstats.heap_released)
  88 }
  89
  90 // gcPaceScavenger updates the scavenger's pacing, particularly
  91 // its rate and RSS goal.
  92 //
  93 // The RSS goal is based on the current heap goal with a small overhead
  94 // to accommodate non-determinism in the allocator.
  95 //
  96 // The pacing is based on scavengePageRate, which applies to both regular and
  97 // huge pages. See that constant for more information.
  98 //
  99 // mheap_.lock must be held or the world must be stopped.
 100 func gcPaceScavenger() {
 101         // If we're called before the first GC completed, disable scavenging.
 102         // We never scavenge before the 2nd GC cycle anyway (we don't have enough
 103         // information about the heap yet) so this is fine, and avoids a fault
 104         // or garbage data later.
 105         if memstats.last_next_gc == 0 {
 106                 mheap_.scavengeGoal = ^uint64(0)
 107                 return
 108         }
 109         // Compute our scavenging goal.
 110         goalRatio := float64(memstats.next_gc) / float64(memstats.last_next_gc)
 111         retainedGoal := uint64(float64(memstats.last_heap_inuse) * goalRatio)
 112         // Add retainExtraPercent overhead to retainedGoal. This calculation
 113         // looks strange but the purpose is to arrive at an integer division
 114         // (e.g. if retainExtraPercent = 12.5, then we get a divisor of 8)
 115         // that also avoids the overflow from a multiplication.
 116         retainedGoal += retainedGoal / (1.0 / (retainExtraPercent / 100.0))
 117         // Align it to a physical page boundary to make the following calculations
 118         // a bit more exact.
 119         retainedGoal = (retainedGoal + uint64(physPageSize) - 1) &^ (uint64(physPageSize) - 1)
 120
 121         // Represents where we are now in the heap's contribution to RSS in bytes.
 122         //
 123         // Guaranteed to always be a multiple of physPageSize on systems where
 124         // physPageSize <= pageSize since we map heap_sys at a rate larger than
 125         // any physPageSize and release memory in multiples of the physPageSize.
 126         //
 127         // However, certain functions recategorize heap_sys as other stats (e.g.
 128         // stack_sys) and this happens in multiples of pageSize, so on systems
 129         // where physPageSize > pageSize the calculations below will not be exact.
 130         // Generally this is OK since we'll be off by at most one regular
 131         // physical page.
 132         retainedNow := heapRetained()
 133
 134         // If we're already below our goal, or within one page of our goal, then disable
 135         // the background scavenger. We disable the background scavenger if there's
 136         // less than one physical page of work to do because it's not worth it.
 137         if retainedNow <= retainedGoal || retainedNow-retainedGoal < uint64(physPageSize) {
 138                 mheap_.scavengeGoal = ^uint64(0)
 139                 return
 140         }
 141         mheap_.scavengeGoal = retainedGoal
 142         mheap_.pages.resetScavengeAddr()
 143 }
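
// Illustration (not part of package runtime): a minimal standalone sketch of
// the goal arithmetic in gcPaceScavenger above. The function name retainedGoal
// and its parameters are hypothetical stand-ins for the memstats fields and
// physPageSize; the sample inputs are made up for the example.

```go
// Standalone sketch; not part of package runtime.
package main

import "fmt"

const retainExtraPercent = 10 // mirrors the constant in the listing

// retainedGoal reproduces the goal computation: scale the last in-use heap
// by the heap-goal ratio, add the retainExtraPercent buffer, and round up
// to a physical page boundary. physPageSize must be a power of 2.
func retainedGoal(nextGC, lastNextGC, lastHeapInuse, physPageSize uint64) uint64 {
	goalRatio := float64(nextGC) / float64(lastNextGC)
	goal := uint64(float64(lastHeapInuse) * goalRatio)
	// The constant expression evaluates exactly to 10, so this is an
	// integer division that adds 10% without a multiplication.
	goal += goal / (1.0 / (retainExtraPercent / 100.0))
	return (goal + physPageSize - 1) &^ (physPageSize - 1)
}

func main() {
	// Heap goal doubled from 4 MiB to 8 MiB; 3 MiB was in use after the last GC.
	fmt.Println(retainedGoal(8<<20, 4<<20, 3<<20, 4096)) // prints 6922240
}
```

// With these inputs the 6 MiB estimate grows by one tenth (integer division)
// and is then rounded up to the next 4 KiB boundary.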
 144
 145 // Sleep/wait state of the background scavenger.
 146 var scavenge struct {
 147         lock   mutex
 148         g      *g
 149         parked bool
 150         timer  *timer
 151 }
 152
 153 // wakeScavenger unparks the scavenger if necessary. It must be called
 154 // after any pacing update.
 155 //
 156 // mheap_.lock and scavenge.lock must not be held.
 157 func wakeScavenger() {
 158         lock(&scavenge.lock)
 159         if scavenge.parked {
 160                 // Try to stop the timer but we don't really care if we succeed.
 161                 // It's possible that either a timer was never started, or that
 162                 // we're racing with it.
 163                 // In the case that we're racing with it, there's a low chance that
 164                 // we experience a spurious wake-up of the scavenger, but that's
 165                 // totally safe.
 166                 stopTimer(scavenge.timer)
 167
 168                 // Unpark the goroutine and tell it that there may have been a pacing
 169                 // change. Note that we skip the scheduler's runnext slot because we
 170                 // want to avoid having the scavenger interfere with the fair
 171                 // scheduling of user goroutines. In effect, this schedules the
 172                 // scavenger at a "lower priority" but that's OK because it'll
 173                 // catch up on the work it missed when it does get scheduled.
 174                 scavenge.parked = false
 175                 systemstack(func() {
 176                         ready(scavenge.g, 0, false)
 177                 })
 178         }
 179         unlock(&scavenge.lock)
 180 }
 181
 182 // scavengeSleep attempts to put the scavenger to sleep for ns.
 183 //
 184 // Note that this function should only be called by the scavenger.
 185 //
 186 // The scavenger may be woken up earlier by a pacing change, and it may not go
 187 // to sleep at all if there's a pending pacing change.
 188 //
 189 // Returns the amount of time actually slept.
 190 func scavengeSleep(ns int64) int64 {
 191         lock(&scavenge.lock)
 192
 193         // Set the timer.
 194         //
 195         // This must happen here instead of inside gopark
 196         // because we can't close over any variables without
 197         // failing escape analysis.
 198         start := nanotime()
 199         resetTimer(scavenge.timer, start+ns)
 200
 201         // Mark ourself as asleep and go to sleep.
 202         scavenge.parked = true
 203         goparkunlock(&scavenge.lock, waitReasonSleep, traceEvGoSleep, 2)
 204
 205         // Return how long we actually slept for.
 206         return nanotime() - start
 207 }
 208
 209 // Background scavenger.
 210 //
 211 // The background scavenger maintains the RSS of the application below
 212 // the line described by the proportional scavenging statistics in
 213 // the mheap struct.
 214 func bgscavenge(c chan int) {
 215         setSystemGoroutine()
 216
 217         scavenge.g = getg()
 218
 219         lock(&scavenge.lock)
 220         scavenge.parked = true
 221
 222         scavenge.timer = new(timer)
 223         scavenge.timer.f = func(_ interface{}, _ uintptr) {
 224                 wakeScavenger()
 225         }
 226
 227         c <- 1
 228         goparkunlock(&scavenge.lock, waitReasonGCScavengeWait, traceEvGoBlock, 1)
 229
 230         // Exponentially-weighted moving average of the fraction of time this
 231         // goroutine spends scavenging (that is, percent of a single CPU).
 232         // It represents a measure of scheduling overheads which might extend
 233         // the sleep or the critical time beyond what's expected. Assume no
 234         // overhead to begin with.
 235         //
 236         // TODO(mknyszek): Consider making this based on total CPU time of the
 237         // application (i.e. scavengePercent * GOMAXPROCS). This isn't really
 238         // feasible now because the scavenger acquires the heap lock over the
 239         // scavenging operation, which means scavenging effectively blocks
 240         // allocators and isn't scalable. However, given a scalable allocator,
 241         // it makes sense to also make the scavenger scale with it; if you're
 242         // allocating more frequently, then presumably you're also generating
 243         // more work for the scavenger.
 244         const idealFraction = scavengePercent / 100.0
 245         scavengeEWMA := float64(idealFraction)
 246
 247         for {
 248                 released := uintptr(0)
 249
 250                 // Time in scavenging critical section.
 251                 crit := int64(0)
 252
 253                 // Run on the system stack since we grab the heap lock,
 254                 // and a stack growth with the heap lock means a deadlock.
 255                 systemstack(func() {
 256                         lock(&mheap_.lock)
 257
 258                         // If background scavenging is disabled or if there's no work to do just park.
 259                         retained, goal := heapRetained(), mheap_.scavengeGoal
 260                         if retained <= goal {
 261                                 unlock(&mheap_.lock)
 262                                 return
 263                         }
 264                         unlock(&mheap_.lock)
 265
 266                         // Scavenge one page, and measure the amount of time spent scavenging.
 267                         start := nanotime()
 268                         released = mheap_.pages.scavengeOne(physPageSize, false)
 269                         crit = nanotime() - start
 270                 })
 271
 272                 if debug.gctrace > 0 {
 273                         if released > 0 {
 274                                 print("scvg: ", released>>10, " KB released\n")
 275                         }
 276                         print("scvg: inuse: ", memstats.heap_inuse>>20, ", idle: ", memstats.heap_idle>>20, ", sys: ", memstats.heap_sys>>20, ", released: ", memstats.heap_released>>20, ", consumed: ", (memstats.heap_sys-memstats.heap_released)>>20, " (MB)\n")
 277                 }
 278
 279                 if released == 0 {
 280                         lock(&scavenge.lock)
 281                         scavenge.parked = true
 282                         goparkunlock(&scavenge.lock, waitReasonGCScavengeWait, traceEvGoBlock, 1)
 283                         continue
 284                 }
 285
 286                 // If we spent more than 10 ms (for example, if the OS scheduled us away, or someone
 287                 // put their machine to sleep) in the critical section, bound the time we use to
 288                 // calculate at 10 ms to avoid letting the sleep time get arbitrarily high.
 289                 const maxCrit = 10e6
 290                 if crit > maxCrit {
 291                         crit = maxCrit
 292                 }
 293
 294                 // Compute the amount of time to sleep, assuming we want to use at most
 295                 // scavengePercent of CPU time. Take into account scheduling overheads
 296                 // that may extend the length of our sleep by multiplying by how far
 297                 // off we are from the ideal ratio. For example, if we're sleeping too
 298                 // much, then scavengeEWMA < idealFraction, so we'll adjust the sleep time
 299                 // down.
 300                 adjust := scavengeEWMA / idealFraction
 301                 sleepTime := int64(adjust * float64(crit) / (scavengePercent / 100.0))
 302
 303                 // Go to sleep.
 304                 slept := scavengeSleep(sleepTime)
 305
 306                 // Compute the new ratio.
 307                 fraction := float64(crit) / float64(crit+slept)
 308
 309                 // Set a lower bound on the fraction.
 310                 // Due to OS-related anomalies we may "sleep" for an inordinate amount
 311                 // of time. Let's avoid letting the ratio get out of hand by bounding
 312                 // the sleep time we use in our EWMA.
 313                 const minFraction = 1.0 / 1000.0
 314                 if fraction < minFraction {
 315                         fraction = minFraction
 316                 }
 317
 318                 // Update scavengeEWMA by merging in the new crit/slept ratio.
 319                 const alpha = 0.5
 320                 scavengeEWMA = alpha*fraction + (1-alpha)*scavengeEWMA
 321         }
 322 }
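
// Illustration (not part of package runtime): a minimal standalone sketch of
// the sleep-time and EWMA arithmetic used by the loop in bgscavenge above.
// The helper names sleepFor and updateEWMA are hypothetical, and the timings
// in main are invented. The lower bound is written with floating-point
// constants so that it evaluates to 0.001.

```go
// Standalone sketch; not part of package runtime.
package main

import "fmt"

const (
	scavengePercent = 1
	idealFraction   = scavengePercent / 100.0
	minFraction     = 1.0 / 1000.0
	alpha           = 0.5
)

// sleepFor returns how long to sleep (in ns) after a critical section of
// crit ns, scaled by how far the moving average is from the ideal fraction.
func sleepFor(ewma float64, crit int64) int64 {
	adjust := ewma / idealFraction
	return int64(adjust * float64(crit) / idealFraction)
}

// updateEWMA merges the observed fraction of time spent scavenging into the
// moving average, bounding the fraction from below.
func updateEWMA(ewma float64, crit, slept int64) float64 {
	fraction := float64(crit) / float64(crit+slept)
	if fraction < minFraction {
		fraction = minFraction
	}
	return alpha*fraction + (1-alpha)*ewma
}

func main() {
	ewma := float64(idealFraction)
	crit := int64(1000)
	first := sleepFor(ewma, crit)
	// Pretend the OS let us sleep twice as long as requested.
	ewma = updateEWMA(ewma, crit, 2*first)
	fmt.Println(first, ewma, sleepFor(ewma, crit))
}
```

// After oversleeping, the average drops below idealFraction, so the next
// requested sleep is shorter than the first one.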
 323
 324 // scavenge scavenges nbytes worth of free pages, starting with the
 325 // highest address first. Successive calls continue from where it left
 326 // off until the heap is exhausted. Call resetScavengeAddr to bring it
 327 // back to the top of the heap.
 328 //
 329 // Returns the amount of memory scavenged in bytes.
 330 //
 331 // If locked == false, s.mheapLock must not be locked. If locked == true,
 332 // s.mheapLock must be locked.
 333 //
 334 // Must run on the system stack because scavengeOne must run on the
 335 // system stack.
 336 //
 337 //go:systemstack
 338 func (s *pageAlloc) scavenge(nbytes uintptr, locked bool) uintptr {
 339         released := uintptr(0)
 340         for released < nbytes {
 341                 r := s.scavengeOne(nbytes-released, locked)
 342                 if r == 0 {
 343                         // Nothing left to scavenge! Give up.
 344                         break
 345                 }
 346                 released += r
 347         }
 348         return released
 349 }
 350
 351 // resetScavengeAddr sets the scavenge start address to the top of the heap's
 352 // address space. This should be called each time the scavenger's pacing
 353 // changes.
 354 //
 355 // s.mheapLock must be held.
 356 func (s *pageAlloc) resetScavengeAddr() {
 357         s.scavAddr = chunkBase(s.end) - 1
 358 }
 359
 360 // scavengeOne starts from s.scavAddr and walks down the heap until it finds
 361 // a contiguous run of pages to scavenge. It will try to scavenge at most
 362 // max bytes at once, but may scavenge more to avoid breaking huge pages. Once
 363 // it scavenges some memory it returns how much it scavenged and updates s.scavAddr
 364 // appropriately. s.scavAddr must be reset manually and externally.
 365 //
 366 // Should it exhaust the heap, it will return 0 and set s.scavAddr to minScavAddr.
 367 //
 368 // If locked == false, s.mheapLock must not be locked.
 369 // If locked == true, s.mheapLock must be locked.
 370 //
 371 // Must be run on the system stack because it either acquires the heap lock
 372 // or executes with the heap lock acquired.
 373 //
 374 //go:systemstack
 375 func (s *pageAlloc) scavengeOne(max uintptr, locked bool) uintptr {
 376         // Calculate the maximum number of pages to scavenge.
 377         //
 378         // This should be alignUp(max, pageSize) / pageSize but max can and will
 379         // be ^uintptr(0), so we need to be very careful not to overflow here.
 380         // Rather than use alignUp, calculate the number of pages rounded down
 381         // first, then add back one if necessary.
 382         maxPages := max / pageSize
 383         if max%pageSize != 0 {
 384                 maxPages++
 385         }
 386
 387         // Calculate the minimum number of pages we can scavenge.
 388         //
 389         // Because we can only scavenge whole physical pages, we must
 390         // ensure that we scavenge at least minPages each time, aligned
 391         // to minPages*pageSize.
 392         minPages := physPageSize / pageSize
 393         if minPages < 1 {
 394                 minPages = 1
 395         }
 396
 397         // Helpers for locking and unlocking only if locked == false.
 398         lockHeap := func() {
 399                 if !locked {
 400                         lock(s.mheapLock)
 401                 }
 402         }
 403         unlockHeap := func() {
 404                 if !locked {
 405                         unlock(s.mheapLock)
 406                 }
 407         }
 408
 409         lockHeap()
 410         ci := chunkIndex(s.scavAddr)
 411         if ci < s.start {
 412                 unlockHeap()
 413                 return 0
 414         }
 415
 416         // Check the chunk containing the scav addr, starting at the addr
 417         // and see if there are any free and unscavenged pages.
 418         if s.summary[len(s.summary)-1][ci].max() >= uint(minPages) {
 419                 // We only bother looking for a candidate if there are at least
 420                 // minPages free pages at all. It's important that we only
 421                 // continue if the summary says we can because that's how
 422                 // we can tell if parts of the address space are unused.
 423                 // See the comment on s.chunks in mpagealloc.go.
 424                 base, npages := s.chunkOf(ci).findScavengeCandidate(chunkPageIndex(s.scavAddr), minPages, maxPages)
 425
 426                 // If we found something, scavenge it and return!
 427                 if npages != 0 {
 428                         s.scavengeRangeLocked(ci, base, npages)
 429                         unlockHeap()
 430                         return uintptr(npages) * pageSize
 431                 }
 432         }
 433
 434         // getInUseRange returns the highest range in the
 435         // intersection of [0, addr] and s.inUse.
 436         //
 437         // s.mheapLock must be held.
 438         getInUseRange := func(addr uintptr) addrRange {
 439                 top := s.inUse.findSucc(addr)
 440                 if top == 0 {
 441                         return addrRange{}
 442                 }
 443                 r := s.inUse.ranges[top-1]
 444                 // addr is inclusive, so treat it as such when
 445                 // updating the limit, which is exclusive.
 446                 if r.limit > addr+1 {
 447                         r.limit = addr + 1
 448                 }
 449                 return r
 450         }
 451
 452         // Slow path: iterate optimistically over the in-use address space
 453         // looking for any free and unscavenged page. If we think we see something,
 454         // lock and verify it!
 455         //
 456         // We iterate over the address space by taking ranges from inUse.
 457 newRange:
 458         for {
 459                 r := getInUseRange(s.scavAddr)
 460                 if r.size() == 0 {
 461                         break
 462                 }
 463                 unlockHeap()
 464
 465                 // Iterate over all of the chunks described by r.
 466                 // Note that r.limit is the exclusive upper bound, but what
 467                 // we want is the top chunk instead, inclusive, so subtract 1.
 468                 bot, top := chunkIndex(r.base), chunkIndex(r.limit-1)
 469                 for i := top; i >= bot; i-- {
 470                         // If this chunk is totally in-use or has no unscavenged pages, don't bother
 471                         // doing a more sophisticated check.
 472                         //
 473                         // Note we're accessing the summary and the chunks without a lock, but
 474                         // that's fine. We're being optimistic anyway.
 475
 476                         // Check quickly if there are enough free pages at all.
 477                         if s.summary[len(s.summary)-1][i].max() < uint(minPages) {
 478                                 continue
 479                         }
 480
 481                         // Run over the chunk looking harder for a candidate. Again, we could
 482                         // race with a lot of different pieces of code, but we're just being
 483                         // optimistic. Make sure we load the l2 pointer atomically though, to
 484                         // avoid races with heap growth. It may or may not be possible to also
 485                         // see a nil pointer in this case if we do race with heap growth, but
 486                         // just defensively ignore the nils. This operation is optimistic anyway.
 487                         l2 := (*[1 << pallocChunksL2Bits]pallocData)(atomic.Loadp(unsafe.Pointer(&s.chunks[i.l1()])))
 488                         if l2 == nil || !l2[i.l2()].hasScavengeCandidate(minPages) {
 489                                 continue
 490                         }
 491
 492                         // We found a candidate, so let's lock and verify it.
 493                         lockHeap()
 494
 495                         // Find, verify, and scavenge if we can.
 496                         chunk := s.chunkOf(i)
 497                         base, npages := chunk.findScavengeCandidate(pallocChunkPages-1, minPages, maxPages)
 498                         if npages > 0 {
 499                                 // We found memory to scavenge! Mark the bits and report that up.
 500                                 // scavengeRangeLocked will update scavAddr for us, also.
 501                                 s.scavengeRangeLocked(i, base, npages)
 502                                 unlockHeap()
 503                                 return uintptr(npages) * pageSize
 504                         }
 505
 506                         // We were fooled, so let's take this opportunity to move scavAddr
 507                         // all the way down past the chunk we just searched, so that future
 508                         // calls skip it. Then, go get a new range.
 509                         s.scavAddr = chunkBase(i-1) + pallocChunkPages*pageSize - 1
 510                         continue newRange
 511                 }
 512                 lockHeap()
 513
 514                 // Move the scavenger down the heap, past everything we just searched.
 515                 // Since we don't check if scavAddr moved while we let go of the heap lock,
 516                 // it's possible that it moved down and we're moving it up here. This
 517                 // raciness could result in us searching parts of the heap unnecessarily.
 518                 // TODO(mknyszek): Remove this racy behavior through explicit address
 519                 // space reservations, which are difficult to do with just scavAddr.
 520                 s.scavAddr = r.base - 1
 521         }
 522         // We reached the end of the in-use address space and couldn't find anything,
 523         // so signal that there's nothing left to scavenge.
 524         s.scavAddr = minScavAddr
 525         unlockHeap()
 526
 527         return 0
 528 }
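
// Illustration (not part of package runtime): a minimal standalone sketch of
// why scavengeOne above computes maxPages by dividing first and then adding
// one, rather than using the usual align-up idiom. The names pagesFor and
// naivePagesFor are hypothetical, the page size of 8192 is an assumption for
// the example, and uint64 stands in for uintptr.

```go
// Standalone sketch; not part of package runtime.
package main

import "fmt"

const pageSize = 8192 // assumed page size for this sketch

// pagesFor returns max/pageSize rounded up, without overflowing even when
// max is the largest representable value.
func pagesFor(max uint64) uint64 {
	n := max / pageSize
	if max%pageSize != 0 {
		n++
	}
	return n
}

// naivePagesFor uses the align-up idiom, which wraps around when max is
// within pageSize-1 of the largest representable value.
func naivePagesFor(max uint64) uint64 {
	return (max + pageSize - 1) / pageSize
}

func main() {
	huge := ^uint64(0)
	fmt.Println(pagesFor(huge), naivePagesFor(huge))
}
```

// For the all-ones input the careful version yields 1<<51 pages, while the
// naive version wraps around and yields 0.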
 529
 530 // scavengeRangeLocked scavenges the given region of memory.
 531 //
 532 // s.mheapLock must be held.
 533 func (s *pageAlloc) scavengeRangeLocked(ci chunkIdx, base, npages uint) {
 534         s.chunkOf(ci).scavenged.setRange(base, npages)
 535
 536         // Compute the full address for the start of the range.
 537         addr := chunkBase(ci) + uintptr(base)*pageSize
 538
 539         // Update the scav pointer.
 540         s.scavAddr = addr - 1
 541
 542         // Only perform the actual scavenging if we're not in a test.
 543         // It's dangerous to do so otherwise.
 544         if s.test {
 545                 return
 546         }
 547         sysUnused(unsafe.Pointer(addr), uintptr(npages)*pageSize)
 548
 549         // Update global accounting only when not in test, otherwise
 550         // the runtime's accounting will be wrong.
 551         mSysStatInc(&memstats.heap_released, uintptr(npages)*pageSize)
 552 }
 553
 554 // fillAligned returns x but with all zeroes in m-aligned
 555 // groups of m bits set to 1 if any bit in the group is non-zero.
 556 //
 557 // For example, fillAligned(0x0100a3, 8) == 0xff00ff.
 558 //
 559 // Note that if m == 1, this is a no-op.
 560 //
 561 // m must be a power of 2 <= maxPagesPerPhysPage.
 562 func fillAligned(x uint64, m uint) uint64 {
 563         apply := func(x uint64, c uint64) uint64 {
 564                 // The technique used here is derived from
 565                 // https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
 566                 // and extended for more than just bytes (like nibbles
 567                 // and uint16s) by using an appropriate constant.
 568                 //
 569                 // To summarize the technique, quoting from that page:
 570                 // "[It] works by first zeroing the high bits of the [8]
 571                 // bytes in the word. Subsequently, it adds a number that
 572                 // will result in an overflow to the high bit of a byte if
 573                 // any of the low bits were initially set. Next the high
 574                 // bits of the original word are ORed with these values;
 575                 // thus, the high bit of a byte is set iff any bit in the
 576                 // byte was set. Finally, we determine if any of these high
 577                 // bits are zero by ORing with ones everywhere except the
 578                 // high bits and inverting the result."
 579                 return ^((((x & c) + c) | x) | c)
 580         }
 581         // Transform x to contain a 1 bit at the top of each m-aligned
 582         // group of m zero bits.
 583         switch m {
 584         case 1:
 585                 return x
 586         case 2:
 587                 x = apply(x, 0x5555555555555555)
 588         case 4:
 589                 x = apply(x, 0x7777777777777777)
 590         case 8:
 591                 x = apply(x, 0x7f7f7f7f7f7f7f7f)
 592         case 16:
 593                 x = apply(x, 0x7fff7fff7fff7fff)
 594         case 32:
 595                 x = apply(x, 0x7fffffff7fffffff)
 596         case 64: // == maxPagesPerPhysPage
 597                 x = apply(x, 0x7fffffffffffffff)
 598         default:
 599                 throw("bad m value")
 600         }
 601         // Now, the top bit of each m-aligned group in x is set if
 602         // that group was all zero in the original x.
 603
 604         // From each group of m bits subtract 1.
 605         // Because we know only the top bits of each
 606         // m-aligned group are set, we know this will
 607         // set each group to have all the bits set except
 608         // the top bit, so just OR with the original
 609         // result to set all the bits.
 610         return ^((x - (x >> (m - 1))) | x)
 611 }
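
// Illustration (not part of package runtime): a minimal standalone copy of
// fillAligned above that can be run outside the runtime to check the example
// from its doc comment. The only assumed change is that panic replaces the
// runtime-internal throw.

```go
// Standalone sketch; not part of package runtime.
package main

import "fmt"

// fillAligned returns x with every m-aligned group of m bits set to all
// ones if any bit in that group is non-zero.
func fillAligned(x uint64, m uint) uint64 {
	// apply sets the top bit of each group that is entirely zero, where c
	// has every bit set except the top bit of each group.
	apply := func(x uint64, c uint64) uint64 {
		return ^((((x & c) + c) | x) | c)
	}
	switch m {
	case 1:
		return x
	case 2:
		x = apply(x, 0x5555555555555555)
	case 4:
		x = apply(x, 0x7777777777777777)
	case 8:
		x = apply(x, 0x7f7f7f7f7f7f7f7f)
	case 16:
		x = apply(x, 0x7fff7fff7fff7fff)
	case 32:
		x = apply(x, 0x7fffffff7fffffff)
	case 64:
		x = apply(x, 0x7fffffffffffffff)
	default:
		panic("bad m value")
	}
	// Subtracting the shifted top bits fills the low bits of each all-zero
	// group; ORing with x adds the top bit, and the final inversion turns
	// "group was zero" into "group was non-zero".
	return ^((x - (x >> (m - 1))) | x)
}

func main() {
	fmt.Printf("%#x\n", fillAligned(0x0100a3, 8)) // prints 0xff00ff
	fmt.Printf("%#x\n", fillAligned(0x10, 4))     // prints 0xf0
}
```

// Byte 1 of 0x0100a3 is zero, so it stays zero, while bytes 0 and 2 each
// contain a set bit and are filled with ones.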
 612
 613 // hasScavengeCandidate returns true if there are any min-page-aligned groups of
 614 // min pages of free-and-unscavenged memory in the region represented by this
 615 // pallocData.
 616 //
 617 // min must be a non-zero power of 2 <= maxPagesPerPhysPage.
 618 func (m *pallocData) hasScavengeCandidate(min uintptr) bool {
 619         if min&(min-1) != 0 || min == 0 {
 620                 print("runtime: min = ", min, "\n")
 621                 throw("min must be a non-zero power of 2")
 622         } else if min > maxPagesPerPhysPage {
 623                 print("runtime: min = ", min, "\n")
 624                 throw("min too large")
 625         }
 626
 627         // The goal of this search is to see if the chunk contains any free and unscavenged memory.
 628         for i := len(m.scavenged) - 1; i >= 0; i-- {
 629                 // 1s are scavenged OR non-free => 0s are unscavenged AND free
 630                 //
 631                 // TODO(mknyszek): Consider splitting up fillAligned into two
 632                 // functions, since here we technically could get by with just
 633                 // the first half of its computation. It'll save a few instructions
 634                 // but adds some additional code complexity.
 635                 x := fillAligned(m.scavenged[i]|m.pallocBits[i], uint(min))
 636
 637                 // Quickly skip over chunks of non-free or scavenged pages.
 638                 if x != ^uint64(0) {
 639                         return true
 640                 }
 641         }
 642         return false
 643 }
 644
 645 // findScavengeCandidate returns a start index and a size for this pallocData
 646 // segment which represents a contiguous region of free and unscavenged memory.
 647 //
 648 // searchIdx indicates the page index within this chunk to start the search, but
 649 // note that findScavengeCandidate searches backwards through the pallocData. As a
 650 // result, it will return the highest scavenge candidate in address order.
 651 //
 652 // min indicates a hard minimum size and alignment for runs of pages. That is,
 653 // findScavengeCandidate will not return a region smaller than min pages in size,
 654 // or that is min pages or greater in size but not aligned to min. min must be
 655 // a non-zero power of 2 <= maxPagesPerPhysPage.
 656 //
 657 // max is a hint for how big of a region is desired. If max >= pallocChunkPages, then
 658 // findScavengeCandidate effectively returns entire free and unscavenged regions.
 659 // If max < pallocChunkPages, it may truncate the returned region such that size is
 660 // max. However, findScavengeCandidate may still return a larger region if, for
 661 // example, it chooses to preserve huge pages, or if max is not aligned to min (it
 662 // will round up). That is, even if max is small, the returned size is not guaranteed
 663 // to be equal to max. max is allowed to be less than min, in which case it is as if
 664 // max == min.
 665 func (m *pallocData) findScavengeCandidate(searchIdx uint, min, max uintptr) (uint, uint) {
 666         if min&(min-1) != 0 || min == 0 {
 667                 print("runtime: min = ", min, "\n")
 668                 throw("min must be a non-zero power of 2")
 669         } else if min > maxPagesPerPhysPage {
 670                 print("runtime: min = ", min, "\n")
 671                 throw("min too large")
 672         }
 673         // max may not be min-aligned, so we might accidentally truncate to
 674         // a max value which causes us to return a non-min-aligned value.
 675         // To prevent this, align max up to a multiple of min (which is always
 676         // a power of 2). This also prevents max from ever being less than
 677         // min, unless it's zero, so handle that explicitly.
 678         if max == 0 {
 679                 max = min
 680         } else {
 681                 max = alignUp(max, min)
 682         }
 683
 684         i := int(searchIdx / 64)
 685         // Start by quickly skipping over blocks of non-free or scavenged pages.
 686         for ; i >= 0; i-- {
 687                 // 1s are scavenged OR non-free => 0s are unscavenged AND free
 688                 x := fillAligned(m.scavenged[i]|m.pallocBits[i], uint(min))
 689                 if x != ^uint64(0) {
 690                         break
 691                 }
 692         }
 693         if i < 0 {
 694                 // Failed to find any free/unscavenged pages.
 695                 return 0, 0
 696         }
 697         // We have something in the 64-bit chunk at i, but it could
 698         // extend further. Loop until we find the extent of it.
 699
 700         // 1s are scavenged OR non-free => 0s are unscavenged AND free
 701         x := fillAligned(m.scavenged[i]|m.pallocBits[i], uint(min))
 702         z1 := uint(sys.LeadingZeros64(^x))
 703         run, end := uint(0), uint(i)*64+(64-z1)
 704         if x<<z1 != 0 {
 705                 // After shifting out z1 bits, we still have 1s,
 706                 // so the run ends inside this word.
 707                 run = uint(sys.LeadingZeros64(x << z1))
 708         } else {
 709                 // After shifting out z1 bits, we have no more 1s.
 710                 // This means the run extends to the bottom of the
 711                 // word so it may extend into further words.
 712                 run = 64 - z1
 713                 for j := i - 1; j >= 0; j-- {
 714                         x := fillAligned(m.scavenged[j]|m.pallocBits[j], uint(min))
 715                         run += uint(sys.LeadingZeros64(x))
 716                         if x != 0 {
 717                                 // The run stopped in this word.
 718                                 break
 719                         }
 720                 }
 721         }
 722
 723         // Split the run we found if it's larger than max but hold on to
 724         // our original length, since we may need it later.
 725         size := run
 726         if size > uint(max) {
 727                 size = uint(max)
 728         }
 729         start := end - size
 730
 731         // Each huge page is guaranteed to fit in a single palloc chunk.
 732         //
 733         // TODO(mknyszek): Support larger huge page sizes.
 734         // TODO(mknyszek): Consider taking pages-per-huge-page as a parameter
 735         // so we can write tests for this.
 736         if physHugePageSize > pageSize && physHugePageSize > physPageSize {
 737                 // We have huge pages, so let's ensure we don't break one by scavenging
 738                 // over a huge page boundary. If the range [start, start+size) overlaps with
 739                 // a free-and-unscavenged huge page, we want to grow the region we scavenge
 740                 // to include that huge page.
 741
 742                 // Compute the huge page boundary above our candidate.
 743                 pagesPerHugePage := uintptr(physHugePageSize / pageSize)
 744                 hugePageAbove := uint(alignUp(uintptr(start), pagesPerHugePage))
 745
 746                 // If that boundary is within our current candidate, then we may be breaking
 747                 // a huge page.
 748                 if hugePageAbove <= end {
 749                         // Compute the huge page boundary below our candidate.
 750                         hugePageBelow := uint(alignDown(uintptr(start), pagesPerHugePage))
 751
 752                         if hugePageBelow >= end-run {
 753                                 // We're in danger of breaking apart a huge page: [start, start+size)
 754                                 // crosses a huge page boundary, and the huge page boundary below start
 755                                 // still lies within the full run we found. Include the entire huge page
 756                                 // in the candidate by rounding start down to that boundary.
 757                                 size = size + (start - hugePageBelow)
 758                                 start = hugePageBelow
 759                         }
 760                 }
 761         }
 762         return start, size
 763 }