git.ipfire.org Git - thirdparty/ipxe.git/log

[cloud] Display build architecture in AWS EC2

On some newer (7th and 8th generation) instance types, the 32-bit
build of iPXE cannot access PCI configuration space since the ECAM is
placed outside of the 32-bit address space. The visible symptom is
that iPXE fails to detect any network devices.

The public AMIs are all now built as 64-bit binaries, but there is
nothing that prevents the building and importing of a 32-bit AMI.
There are still potentially valid use cases for 32-bit AMIs (e.g. if
planning to use the AMI only for older instance types), and so we
cannot sensibly prevent this error at build time.

Display the build architecture as part of the AWS EC2 embedded script,
to at least allow for easy identification of this particular failure
mode at run time.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[cloud] Remove AWS public image access block only if not already unblocked

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[cloud] Remove AWS public image access block automatically if needed

Making images public is blocked by default in new AWS regions. Remove
this block automatically whenever creating a public image.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[ena] Limit receive queue size to work around hardware bugs

Commit a801244 ("[ena] Increase receive ring size to 128 entries")
increased the receive ring size to 128 entries (while leaving the fill
level at 16), since using a smaller receive ring caused unexplained
failures on some instance types.

The original hardware bug that resulted in that commit seems to have
been fixed: experiments suggest that the original failure (observed on
a c6i.large instance in eu-west-2) will no longer reproduce when using
a receive ring containing only 16 entries (as was the case prior to
that commit).

Newer generations of the ENA hardware (observed on an m8i.large
instance in eu-south-2) seem to have a new and exciting hardware bug:
these instance types appear to use a hash of the received packet
header to determine which portion of the (out-of-order) receive ring
to use. If that portion of the ring happens to be empty (e.g. because
only 32 entries of the 128-entry ring are filled at any one time),
then the packet will be silently dropped.

Work around this new hardware bug by reducing the receive ring size
down to the current fill level of 32 entries. This appears to work on
all current instance types (but has not been exhaustively tested).

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[ena] Increase transmit queue size to match receive fill level

Avoid running out of transmit descriptors when sending TCP ACKs by
increasing the transmit queue size to match the increased received
fill level.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[ena] Add memory barrier after writing to on-device memory

Ensure that writes to on-device memory have taken place before writing
to the doorbell register.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[ena] Increase receive fill level

Experiments suggest that at least some instance types (observed with
c6i.large in eu-west-2) experience high packet drop rates with only 16
receive buffers allocated. Increase the fill level to 32 buffers.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[ena] Add support for low latency transmit queues

Newer generations of the ENA hardware require the use of low latency
transmit queues, where the submission queues and the initial portion
of the transmitted packet are written to on-device memory via BAR2
instead of being read from host memory.

Detect support for low latency queues and set the placement policy
appropriately. We attempt the use of low latency queues only if the
device reports that it supports inline headers, 128-byte entries, and
two descriptors prior to the inlined header, on the basis that we
don't care about using low latency queues on older versions of the
hardware since those versions will support normal host memory
submission queues anyway.

We reuse the redundant memory allocated for the submission queue as
the bounce buffer for constructing the descriptors and inlined packet
data, since this avoids needing a separate allocation just for the
bounce buffer.

We construct a metadata submission queue entry prior to the actual
submission queue entry, since experimentation suggests that newer
generations of the hardware require this to be present even though it
conveys no information beyond its own existence.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[ena] Record supported device features

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[ena] Cancel uncompleted transmit buffers on close

Avoid spurious assertion failures by ensuring that references to
uncompleted transmit buffers are not retained after the device has
been closed.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[ena] Map the on-device memory, if present

Newer generations of the ENA hardware require the use of low latency
transmit queues, where the submission queues and the initial portion
of the transmitted packet are written to on-device memory via BAR2
instead of being read from host memory.

Prepare for this by mapping the on-device memory BAR. As with the
register BAR, we may need to steal a base address from the upstream
PCI bridge since the BIOS on some instance types (observed with an
m8i.metal-48xl instance in eu-south-2) will fail to assign an address
to the device.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[ena] Add descriptive messages for any admin queue command failures

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[pci] Record prefetchable memory window for PCI bridges

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[ena] Use pci_bar_set() to place device within bridge memory window

Use pci_bar_set() when we need to set a device base address (on
instance types such as c6i.metal where the BIOS fails to do so), so
that 64-bit BARs will be handled automatically.

This particular issue has so far been observed only on 6th generation
instances. These use 32-bit BARs, and so the lack of support for
handling 64-bit BARs has not caused any observable issue.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[pci] Handle sizing of 64-bit BARs

Provide pci_bar_set() to handle setting the base address for a
potentially 64-bit BAR, and rewrite pci_bar_size() to correctly handle
sizing of 64-bit BARs.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[tls] Disable renegotiation unless extended master secret is used

RFC 7627 states that renegotiation becomes no longer secure under
various circumstances when the non-extended master secret is used.
The description of the precise set of circumstances is spread across
various points within the document and is not entirely clear.

Avoid a superset of the circumstances in which renegotiation
apparently becomes insecure by refusing renegotiation completely
unless the extended master secret is used.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[tls] Refuse to resume sessions with mismatched master secret methods

RFC 7627 section 5.3 states that the client must abort the handshake
if the server attempts to resume a session where the master secret
calculation method stored in the session does not match the method
used for the connection being resumed.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[tls] Add support for the Extended Master Secret

RFC 7627 defines the Extended Master Secret (EMS) as an alternative
calculation that uses the digest of all handshake messages rather than
just the client and server random bytes.

Add support for negotiating the Extended Master Secret extension and
performing the relevant calculation of the master secret.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[tls] Generate master secret only after sending Client Key Exchange

The calculation for the extended master secret as defined in RFC 7627
relies upon the digest of all handshake messages up to and including
the Client Key Exchange.

Facilitate this calculation by generating the master secret only after
sending the Client Key Exchange message.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[gve] Rearm interrupts unconditionally on every poll

Experimentation suggests that rearming the interrupt once per observed
completion is not sufficient: we still see occasional delays during
which the hardware fails to write out completions.

As described in commit d2e1e59 ("[gve] Use dummy interrupt to trigger
completion writeback in DQO mode"), there is no documentation around
the precise semantics of the interrupt rearming mechanism, and so
experimentation is the only available guide. Switch to rearming both
TX and RX interrupts unconditionally on every poll, since this
produces better experimental results.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[gve] Use raw DMA addresses in descriptors in DQO-QPL mode

The DQO-QPL operating mode uses registered queue page lists but still
requires the raw DMA address (rather than the linear offset within the
QPL) to be provided in transmit and receive descriptors.

Set the queue page list base device address appropriately.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[gve] Report only packet completions for the transmit ring

The hardware reports descriptor and packet completions separately for
the transmit ring. We currently ignore descriptor completions (since
we cannot free up the transmit buffers in the queue page list and
advance the consumer counter until the packet has also completed).

Now that transmit completions are written out immediately (instead of
being delayed until 128 bytes of completions are available), there is
no value in retaining the descriptor completions.

Omit descriptor completions entirely, and reduce the transmit fill
level back down to its original value.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[gve] Use dummy interrupt to trigger completion writeback in DQO mode

When operating in the DQO operating mode, the device will defer
writing transmit and receive completions until an entire internal
cacheline (128 bytes) is full, or until an associated interrupt is
asserted.  Since each receive descriptor is 32 bytes, this will cause
received packets to be effectively delayed until up to three further
packets have arrived.  When network traffic volumes are very low (such
as during DHCP, DNS lookups, or TCP handshakes), this typically
induces delays of up to 30 seconds and results in a very poor user
experience.

Work around this hardware problem in the same way as for the Intel
40GbE and 100GbE NICs: by enabling dummy MSI-X interrupts to trick the
hardware into believing that it needs to write out completions to host
memory.

There is no documentation around the interrupt rearming mechanism.
The value written to the interrupt doorbell does not include a
consumer counter value, and so must be relying on some undocumented
ordering constraints.  Comments in the Linux driver source suggest
that the authors believe that the device will automatically and
atomically mask an MSI-X interrupt at the point of asserting it, that
any further interrupts arriving before the doorbell is written will be
recorded in the pending bit array, and that writing the doorbell will
therefore immediately assert a new interrupt if needed.

In the absence of any documentation, choose to rearm the interrupt
once per observed completion.  This is overkill, but is less impactful
than the alternative of rearming the interrupt unconditionally on
every poll.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[gve] Add missing memory barriers

Ensure that remainder of completion records are read only after
verifying the generation bit (or sequence number).

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[intelxl] Use default dummy MSI-X target address

Use the default dummy MSI-X target address that is now allocated and
configured automatically by pci_msix_enable().

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[pci] Map all MSI-X interrupts to a dummy target address by default

Interrupts as such are not used in iPXE, which operates in polling
mode.  However, some network cards (such as the Intel 40GbE and 100GbE
NICs) will defer writing out completions until the point of asserting
an MSI-X interrupt.

From the point of view of the PCI device, asserting an MSI-X interrupt
is just a 32-bit DMA write of an opaque value to an opaque target
address.  The PCI device has no know to know whether or not the target
address corresponds to a real APIC.

We can therefore trick the PCI device into believing that it is
asserting an MSI-X interrupt, by configuring it to write an opaque
32-bit value to a dummy target address in host memory.  This is
sufficient to trigger the associated write of the completions to host
memory.

Allocate a dummy target address when enabling MSI-X on a PCI device,
and map all interrupts to this target address by default.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[gve] Select preferred operating mode

Select a preferred operating mode from those advertised as supported
by the device, falling back to the oldest known mode (GQI-QPL) if
no modes are advertised.

Since there are devices in existence that support only QPL addressing,
and since we want to minimise code size, we choose to always use a
single fixed ring buffer even when using raw DMA addressing. Having
paid this penalty, we therefore choose to prefer QPL over RDA since
this allows the (virtual) hardware to minimise the number of page
table manipulations required. We similarly prefer GQI over DQO since
this minimises the amount of work we have to do: in particular, the RX
descriptor ring contents can remain untouched for the lifetime of the
device and refills require only a doorbell write.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[gve] Add support for out-of-order queues

Add support for the "DQO" out-of-order transmit and receive queue
formats.  These are almost entirely different in format and usage (and
even endianness) from the original "GQI" in-order transmit and receive
queues, and arguably should belong to a completely different device
with a different PCI ID.  However, Google chose to essentially crowbar
two unrelated device models into the same virtual hardware, and so we
must handle both of these device models within the same driver.

Most of the new code exists solely to handle the differences in
descriptor sizes and formats.  Out-of-order completions are handled
via a buffer ID ring (as with other devices supporting out-of-order
completions, such as the Xen, Hyper-V, and Amazon virtual NICs).  A
slight twist is that on the transmit datapath (but not the receive
datapath) the Google NIC provides only one completion per packet
instead of one completion per descriptor, and so we must record the
list of chained buffer IDs in a separate array at the time of
transmission.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[gve] Cancel pending transmissions when closing device

We cancel any pending transmissions when (re)starting the device since
any transmissions that were initiated before the admin queue reset
will not complete.

The network device core will also cancel any pending transmissions
after the device is closed. If the device is closed with some
transmissions still pending and is then reopened, this will therefore
result in a stale I/O buffer being passed to netdev_tx_complete_err()
when the device is restarted.

This error has not been observed in practice since transmissions
generally complete almost immediately and it is therefore unlikely
that the device will ever be closed with transmissions still pending.
With out-of-order queues, the device seems to delay transmit
completions (with no upper time limit) until a complete batch is
available to be written out as a block of 128 bytes. It is therefore
very likely that the device will be closed with transmissions still
pending.

Fix by ensuring that we have dropped all references to transmit I/O
buffers before returning from gve_close().

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[bnxt] Handle link related async events

Handle async events related to link speed change, link speed config
change, and port phy config changes.

Signed-off-by: Joseph Wong <joseph.wong@broadcom.com>

[gve] Allow for descriptor and completion lengths to vary by mode

The descriptors and completions in the DQO operating mode are not the
same sizes as the equivalent structures in the GQI operating mode.
Allow the queue stride size to vary by operating mode (and therefore
to be known only after reading the device descriptor and selecting the
operating mode).

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[gve] Rename GQI-specific data structures and constants

Rename data structures and constants that are specific to the GQI
operating mode, to allow for a cleaner separation from other operating
modes.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[gve] Allow for out-of-order buffer consumption

We currently assume that the buffer index is equal to the descriptor
ring index, which is correct only for in-order queues.

Out-of-order queues will include a buffer tag value that is copied
from the descriptor to the completion. Redefine the data buffers as
being indexed by this tag value (rather than by the descriptor ring
index), and add a circular ring buffer to allow for tags to be reused
in whatever order they are released by the hardware.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[gve] Add support for raw DMA addressing

Raw DMA addressing allows the transmit and receive descriptors to
provide the DMA address of the data buffer directly, without requiring
the use of a pre-registered queue page list. It is modelled in the
device as a magic "raw DMA" queue page list (with QPL ID 0xffffffff)
covering the whole of the DMA address space.

When using raw DMA addressing, the transmit and receive datapaths
could use the normal pattern of mapping I/O buffers directly, and
avoid copying packet data into and out of the fixed queue page list
ring buffer. However, since we must retain support for queue page
list addressing (which requires this additional copying), we choose to
minimise code size by continuing to use the fixed ring buffer even
when using raw DMA addressing.

Add support for using raw DMA addressing by setting the queue page
list base device address appropriately, omitting the commands to
register and unregister the queue page lists, and specifying the raw
DMA QPL ID when creating the TX and RX queues.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[gve] Add concept of a queue page list base device address

Allow for the existence of a queue page list where the base device
address is non-zero, as will be the case for the raw DMA addressing
(RDA) operating mode.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[gve] Set descriptor and completion ring sizes when creating queues

The "create TX queue" and "create RX queue" commands have fields for
the descriptor and completion ring sizes, which are currently left
unpopulated since they are not required for the original GQI-QPL
operating mode.

Populate these fields, and allow for the possibility that a transmit
completion ring exists (which will be the case when using the DQO
operating mode).

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[gve] Add concept of operating mode

The GVE family supports two incompatible descriptor queue formats:

  * GQI: in-order descriptor queues
  * DQO: out-of-order descriptor queues

and two addressing modes:

  * QPL: pre-registered queue page list addressing
  * RDA: raw DMA addressing

All four combinations (GQI-QPL, GQI-RDA, DQO-QPL, and DQO-RDA) are
theoretically supported by the Linux driver, which is essentially the
only public reference provided by Google.  The original versions of
the GVE NIC supported only GQI-QPL mode, and so the iPXE driver is
written to target this mode, on the assumption that it would continue
to be supported by all models of the GVE NIC.

This assumption turns out to be incorrect: Google does not deem it
necessary to retain backwards compatibility.  Some newer machine types
(such as a4-highgpu-8g) support only the DQO-RDA operating mode.

Add a definition of operating mode, and pass this as an explicit
parameter to the "configure device resources" admin queue command.  We
choose a representation that subtracts one from the value passed in
this command, since this happens to allow us to decompose the mode
into two independent bits (one representing the use of DQO descriptor
format, one representing the use of QPL addressing).

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[gve] Remove separate concept of "packet descriptor"

The Linux driver occasionally uses the terminology "packet descriptor"
to refer to the portion of the descriptor excluding the buffer
address. This is not a helpful separation, and merely adds
complexity.

Simplify the code by removing this artifical separation.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[gve] Parse option list returned in device descriptor

Provide space for the device to return its list of supported options.
Parse the option list and record the existence of each option in a
support bitmask.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[bnxt] Add error recovery support

Add support to advertise adapter error recovery support to the
firmware. Implement error recovery operations if adapter fault is
detected. Refactor memory allocation to better align with probe and
open functions.

Signed-off-by: Joseph Wong <joseph.wong@broadcom.com>

[efi] Use current boot option as a fallback for obtaining the boot URI

Some systems (observed with a Lenovo X1) fail to populate the loaded
image device path with a Uri() component when performing a UEFI HTTP
boot, instead creating a broken loaded image device path that
represents a DHCP+TFTP boot that has not actually taken place.

If no URI is found within the loaded image device path, then fall back
to looking for a URI within the current boot option.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[efi] Add ability to extract device path from an EFI load option

An EFI boot option (stored in a BootXXXX variable) comprises an
EFI_LOAD_OPTION structure, which includes some undefined number of EFI
device paths. (The structure is extremely messy and awkward to parse
in C, but that's par for the course with EFI.)

Add a function to extract the first device path from an EFI load
option, along with wrapper functions to read and extract the first
device path from an EFI boot variable.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[libc] Add wcsnlen()

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[efi] Drag in MNP driver whenever SNP driver is present

The chainloaded-device-only "snponly" driver already drags in support
for driving SNP, NII, and MNP devices, on the basis that the user
generally doesn't care which UEFI API is used and just wants to boot
from the same network device that was used to load iPXE.

The multi-device "snp" driver already drags in support for driving SNP
and NII devices, but does not drag in support for MNP devices.

There is essentially zero code size overhead to dragging in support
for MNP devices, since this support is always present in any iPXE
application build anyway (as part of the code to download
"autoexec.ipxe" prior to installing our own drivers).

Minimise surprise by dragging in support for MNP devices whenever
using the "snp" driver, following the same reasoning used for the
"snponly" driver.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[bnxt] Update CQ doorbell type

Update completion queue doorbell to a non-arming type, since polling
is used.

Signed-off-by: Joseph Wong <joseph.wong@broadcom.com>

[dwgpio] Use fdt_reg() to get GPIO port numbers

DesignWare GPIO port numbers are represented as unsized single-entry
regions. Use fdt_reg() to obtain the GPIO port number, rather than
requiring access to a region cell size specification stored in the
port group structure.

This allows the field name "regs" in the port group structure to be
repurposed to hold the I/O register base address, which then matches
the common usage in other drivers.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[fdt] Provide fdt_reg() for unsized single-entry regions

Many region types (e.g. I2C bus addresses) can only ever contain a
single region with no size cells specified. Provide fdt_reg() to
reduce boilerplate in this common use case.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[cmdline] Show commands in alphabetical order

Commands were originally ordered by functional group (e.g. keeping the
image management commands together), with arrays used to impose a
functionally meaningful order within the group.

As the number of commands and functional groups has expanded over the
years, this has become essentially useless as an organising principle.
Switch to sorting commands alphabetically (using the linker table
mechanism).

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[digest] Treat inability to acquire an image as a fatal error

The "md5sum" and "sha1sum" commands were originally intended solely as
debugging utilities, and would return success (with a warning message)
even if the specified images did not exist.

To minimise surprise and to be consistent with other commands, treat
the inability to acquire an image as a fatal error.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[digest] Add "--set" option to store digest value in a setting

Allow the result of a digest calculation to be stored in a named
setting.  This allows for digest verification in scripts using e.g.:

  set expected:hexraw cb05def203386f2b33685d177d9f04e3e3d70dd4
  sha1sum --set actual 1mb
  iseq ${expected} ${actual} || goto checksum_bad

Note that digest verification alone cannot be used to set the trusted
execution status of an image.  The only way to mark an image as
trusted is to use the "imgverify" command.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[github] Extend sponsorship link

Add Christian Nilsson <nikize@gmail.com> as a project sponsorship
recipient, to reflect the enormous amount of time invested in
responding to issues and pull requests.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[digest] Add commands for all enabled digest algorithms

Add "sha256sum", "sha512sum", and similar commands. Include these new
commands only when DIGEST_CMD is enabled in config/general.h and the
corresponding algorithm is enabled in config/crypto.h.

Leave "mdsum" and "sha1sum" included whenever only DIGEST_CMD is
enabled, to avoid potentially breaking backwards compatibility with
builds that disabled MD5 or SHA-1 as a TLS or X.509 digest algorithm,
but would still have expected those commands to be present.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[dwgpio] Add driver for the DesignWare GPIO controller

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[gpio] Add a framework for GPIO controllers

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[fdt] Use phandle as device location

Consumption of phandles will be in the form of locating a functional
device (e.g. a GPIO device, or an I2C device, or a reset controller)
by phandle, rather than locating the device tree node to which the
phandle refers.

Repurpose fdt_phandle() to obtain the phandle value (instead of
searching by phandle), and record this value as the bus location
within the generic device structure.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[dwmac] Show core version in debug messages

Read and display the core version immediately after mapping the MMIO
registers, to provide a basic sanity check that the registers have
been correctly mapped and the core is not held in reset.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[bnxt] Remove unnecessary test_if macro

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[bnxt] Remove unnecessary I/O macros

Remove unnecessary driver specific macros. Use standard
pci_read_config_xxxx, pci_write_config_xxx, writel/q calls.

Signed-off-by: Joseph Wong <joseph.wong@broadcom.com>

[serial] Explicitly initialise serial console UART to NULL

When debugging is enabled for the device tree or memory map parsing
code, the active serial console UART variable will be accessed during
early initialisation, before the .bss section has been zeroed.

Place this variable in the .data section (by providing an explicit
initialiser), so that reading this variable is well defined even
during early initialisation.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[riscv] Place explicitly zero-initialised variables in the .data section

Variables in the .bss section cannot be relied upon to have zero
values during early initialisation, before we have relocated ourselves
to somewhere suitable in RAM and zeroed the .bss section.

Place any explicitly zero-initialised variables in the .data section
rather than in .bss, so that we can rely on their values even during
this early initialisation stage.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[riscv] Allow for poisoning .bss section before early initialisation

On startup, we may be running from read-only memory, and therefore
cannot zero the .bss section (or write to the .data section) until we
have parsed the system memory map and relocated ourselves to somewhere
suitable in RAM.  The code that runs during this early initialisation
stage must be carefully written to avoid writing to the .data section
and to avoid reading from or writing to the .bss section.

Detecting code that erroneously writes to the .data or .bss sections
is relatively easy since running from read-only memory (e.g. via
QEMU's -pflash option) will immediately reveal the bug.  Detecting
code that erroneously reads from the .bss section is harder, since in
a freshly powered-on machine (or in a virtual machine) there is a high
probability that the contents of the memory will be zero even before
we explicitly zero out the section.

Add the ability to fill the .bss section with an invalid non-zero
value to expose bugs in early initialisation code that erroneously
relies upon variables in .bss before the section has been zeroed.  We
use the value 0xeb55eb55eb55eb55 ("EBSS") since this is immediately
recognisable as a value in a crash dump, and will trigger a page fault
if dereferenced since the address is in a non-canonical form.

Poisoning the .bss can be done only when the image is known to already
reside in writable memory.  It will overwrite the relocation records,
and so can be done only on a system where relocation is known to be
unnecessary (e.g. because paging is supported).  We therefore do not
enable this behaviour by default, but leave it as a configurable
option via the config/fault.h header.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[undi] Assume that legacy interrupts are broken for any PCIe device

PCI Express devices do not have physical INTx output signals, and on
modern motherboards there is unlikely to be any interrupt controller
with physical interrupt input signals.  There are multiple levels of
abstraction involved in emulating the legacy INTx interrupt mechanism:
the PCIe device sends Assert_INTx and Deassert_INTx messages, PCIe
bridges and switches must collate these virtual wires, and the root
complex must map the virtual wires into messages that can be
understood by the host's emulated 8259 PIC.

This complex chain of emulations is rarely tested on modern hardware,
since operating systems will invariably use MSI-X for PCI devices and
the I/O APIC for non-PCI devices such as the real-time clock.  Since
the legacy interrupt emulation mechanism is rarely tested, it is
frequently unreliable.  We have encountered many issues over the years
in which legacy interrupts are simply not raised as expected, even
when inspection shows that the device believes it is asserting an
interrupt and the controller believes that the interrupt is enabled.

We already maintain a list of devices that are known to fail to
generate legacy interrupts correctly.  This list is based on the PCI
vendor and device IDs, which is not necessarily a fair test since the
root cause may be a board-level misconfiguration rather than a
device-level fault.

Assume that any PCI Express device has a high chance of not being able
to raise legacy interrupts reliably.  This is a relatively intrusive
change since it will affect essentially all modern network devices,
but should hopefully fix all future issues with non-functional legacy
interrupts, without needing to constantly grow the list of known
broken devices.

If some PCI Express devices are found to fail when operated in polling
mode, then this change will need to be revisited.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[pxeprefix] Display PCI vendor and device ID in PXE startup banner

In the case of a misbehaving PXE stack, it is often useful to know the
PCI vendor and device IDs (e.g. for adding the device to the list of
devices with known broken support for generating interrupts).

The PCI vendor and device ID is already available to the prefix code,
and so can trivially be printed out. Add this information to the PXE
prefix startup banner.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[fdt] Add ability to locate node by phandle

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[dwusb] Add driver for DesignWare USB3 host controller

Add a basic driver for the DesignWare USB3 host controller as found in
the Lichee Pi 4A.

This driver covers only the DesignWare host controller hardware.  On
the Lichee Pi 4A, this is sufficient to get the single USB root hub
port (exposed internally via the SODIMM connector) up and running.

The driver does not yet handle the various GPIOs that control power
and signal routing for the Lichee Pi 4A's onboard VL817 USB hub and
the four physical USB-A ports.  This therefore leaves the USB hub and
the USB-A ports unpowered, and the USB2 root hub port routed to the
physical USB-C port.  Devices plugged in to the USB-A ports will not
be powered up, and a device plugged in to the USB-C port will
enumerate as a USB2 device.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[xhci] Allow for non-PCI xHCI host controllers

Allow for the existence of xHCI host controllers where the underlying
hardware is not a PCI device.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[xhci] Use root hub port number to determine slot type

We currently use the downstream hub's port number to determine the
xHCI slot type for a newly connected USB device. The downstream hub
port number is irrelevant to the xHCI controller's supported protocols
table: the relevant value is the number of the root hub port through
which the device is attached.

Fix by using the root hub port number instead of the immediate parent
hub's port number.

This bug has not previously been detected since the slot type for the
first N root hub ports will invariably be zero to indicate that these
are USB ports. For any xHCI controller with a sufficiently large
number of root hub ports, the code would therefore end up happening to
calculate the correct slot type value despite using an incorrect port
number.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[efi] Check only the non-extended WaitForKey event

The WaitForKeyEx event in EFI_SIMPLE_TEXT_INPUT_EX_PROTOCOL is
redundant: by definition it has to signal under exactly the same
conditions as the WaitForKey event in EFI_SIMPLE_TEXT_INPUT_PROTOCOL
and cannot provide any "extended" information since EFI events do not
convey any information beyond their own occurrence.

UEFI keyboard drivers such as Ps2KeyboardDxe and UsbKbDxe invariably
use a single notification function to implement both events.  The
console multiplexer driver ConSplitterDxe uses a single notification
function for both events, which ends up checking only the WaitForKey
event on the underlying console devices.  (Since all console input is
routed through the console multiplexer, this means that in practice
nothing will ever check the underlying devices' WaitForKeyEx events.)

UEFI console consumers such as the UEFI shell tend to use only the
EFI_SIMPLE_TEXT_INPUT_PROTOCOL instance provided as ConIn in the EFI
system table.  With the exception of the UEFI text editor (the "edit"
command in the UEFI shell), almost nothing bothers to open the
EFI_SIMPLE_TEXT_INPUT_EX_PROTOCOL instance on the same handle.

The Lenovo ThinkPad T14s Gen 5 has a very peculiar firmware bug.
Enabling the "UEFI Wi-Fi Network Boot" feature in the BIOS setup will
cause the completely unrelated WaitForKeyEx event pointer to be
overwritten with a pointer to a FAT_DIRENT structure representing the
"BOOT" directory in the EFI system partition.  This happens with 100%
repeatability.  It is not necessary to attempt to boot from Wi-Fi: it
is only necessary to have the feature enabled.  The root cause is
unknown, but is presumably an uninitialised pointer or similar
memory-related bug in Lenovo's UEFI Wi-Fi driver.

Work around this Lenovo firmware bug by checking only the WaitForKey
event, ignoring the WaitForKeyEx event even if we will subsequently
use ReadKeyStrokeEx() to read the keypress.  Since almost all other
UEFI console consumers use only WaitForKey, this ensures that we will
be using code paths that the firmware vendor is likely to have tested
at least once.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[efi] Allow compiler to perform type checks on EFI_EVENT

As with EFI_HANDLE, the EFI headers define EFI_EVENT as a void
pointer, rendering EFI_EVENT compatible with a pointer to itself and
hence guaranteeing that pointer type bugs will be introduced.

Redefine EFI_EVENT as a pointer to an anonymous structure (as we
already do for EFI_HANDLE) to allow the compiler to perform type
checking as expected.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[init] Show initialisation function names in debug messages

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[efi] Assume that vendor wireless drivers are unusable via SNP

The UEFI model for wireless network boot cannot sensibly be described
without cursing.  Commit 758a504 ("[efi] Inhibit calls to Shutdown()
for wireless SNP devices") attempts to work around some of the known
issues.

Experimentation shows that on at least some platforms (observed with a
Lenovo ThinkPad T14s Gen 5) the vendor SNP driver is broken to the
point of being unusable in anything other than the single use case
envisioned by the firwmare authors.  Doing almost anything directly
via the SNP protocol interface has a greater than 50% chance of
locking up the system.

Assume, in the absence of any evidence to the contrary so far, that
vendor SNP drivers for wireless network devices are so badly written
as to be unusable.  Refuse to even attempt to interact with these
drivers via the SNP or NII protocol interfaces.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[efi] Drop to external TPL for calls to ConnectController()

There is nothing in the current versions of the UEFI specification
that limits the TPL at which we may call ConnectController() or
DisconnectController(). However, at least some platforms (observed
with a Lenovo ThinkPad T14s Gen 5) will occasionally and unpredictably
lock up before returning from ConnectController() if called at a TPL
higher than TPL_APPLICATION.

Work around whatever defect is present on these systems by dropping to
the current external TPL for all calls to ConnectController() or
DisconnectController().

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[efi] Provide efi_tpl_name() for transcribing TPLs in debug messages

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[riscv] Ensure coherent DMA allocations do not cross cacheline boundaries

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[riscv] Support the standard Svpbmt extension for page-based memory types

Set the appropriate Svpbmt type bits within page table entries if the
extension is supported. Tested only in QEMU so far, due to the lack
of availability of real hardware supporting Svpbmt.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[riscv] Create coherent DMA mapping of 32-bit address space on demand

Reuse the code that creates I/O device page mappings to create the
coherent DMA mapping of the 32-bit address space on demand, instead of
constructing this mapping as part of the initial page table.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[riscv] Use 1GB pages for I/O device mappings

All 64-bit paging schemes support at least 1GB "gigapages". Use these
to map I/O devices instead of 2MB "megapages". This reduces the
number of consumed page table entries, increases the visual similarity
of I/O remapped addresses to the underlying physical addresses, and
opens up the possibility of reusing the code to create the coherent
DMA map of the 32-bit address space.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[dwmac] Add driver for DesignWare Ethernet MAC

Add a basic driver for the DesignWare Ethernet MAC network interface
as found in the Lichee Pi 4A.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[riscv] Invalidate data cache on completed RX DMA buffers

The data cache must be invalidated twice for RX DMA buffers: once
before passing ownership to the DMA device (in case the cache happens
to contain dirty data that will be written back at an undefined future
point), and once after receiving ownership from the DMA device (in
case the CPU happens to have speculatively accessed data in the buffer
while it was owned by the hardware).

Only the used portion of the buffer needs to be invalidated after
completion, since we do not care about data within the unused portion.

Update the DMA API to include the used length as an additional
parameter to dma_unmap(), and add the necessary second cache
invalidation pass to the RISC-V DMA API implementation.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[riscv] Add optimised TCP/IP checksumming

Add a RISC-V assembly language implementation of TCP/IP checksumming,
which is around 50x faster than the generic algorithm. The main loop
checksums aligned xlen-bit words, using almost entirely compressible
instructions and accumulating carries in a separate register to allow
folding to be deferred until after all loops have completed.

Experimentation on a C910 CPU suggests that this achieves around four
bytes per clock cycle, which is comparable to the x86 implementation.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[riscv] Provide a DMA API implementation for RISC-V bare-metal systems

Provide an implementation of dma_map() that performs cache clean or
invalidation as required, and an implementation of dma_alloc() that
returns virtual addresses within the coherent mapping of the 32-bit
physical address space.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[dma] Use virtual addresses for dma_map()

Cache management operations must generally be performed on virtual
addresses rather than physical addresses.

Change the address parameter in dma_map() to be a virtual address, and
make dma() the API-level primitive instead of dma_phys().

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[build] Handle isohybrid with xorrisofs

Generating an isohybrid image with `xorrisofs` is supposed to happen
with option `-isohybrid-gpt-basdat`, not command `isohybrid`.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[riscv] Support explicit cache management operations on I/O buffers

On platforms where DMA devices are not in the same coherency domain as
the CPU cache, it is necessary to be able to explicitly clean the
cache (i.e. force data to be written back to main memory) and
invalidate the cache (i.e. discard any cached data and force a
subsequent read from main memory).

Add support for cache management via the standard Zicbom extension or
the T-Head cache management operations extension, with the supported
extension detected on first use.

Support cache management operations only on I/O buffers, since these
are guaranteed to not share cachelines with other data.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[riscv] Add support for detecting T-Head vendor extensions

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[iobuf] Ensure I/O buffer data sits within unshared cachelines

On platforms where DMA devices are not in the same coherency domain as
the CPU cache, we must ensure that DMA I/O buffers do not share
cachelines with other data.

Align the start and end of I/O buffers to IOB_ZLEN, which is larger
than any cacheline size we expect to encounter.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[uaccess] Allow for coherent DMA mapping of the 32-bit address space

On platforms where DMA devices are not in the same coherency domain as
the CPU cache, it is necessary to create page table entries where the
translations are marked as uncacheable.

We choose to place iPXE within the low 4GB of memory (since 32-bit DMA
devices are still reasonably common even on systems with 64-bit CPUs).
We therefore need to cover only the low 4GB of memory with these page
table entries.

Update virt_to_phys() to allow for the existence of such a mapping,
assuming that iPXE itself will always reside within the top 4GB of the
64-bit virtual address space (and therefore that the DMA mapping must
lie somewhere below this in the negative virtual address space).

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[riscv] Create coherent DMA mapping for low 4GB of address space

Use PTEs 256-259 to create a mapping of the 32-bit physical address
space with attributes suitable for coherent DMA mappings.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[riscv] Construct invariant portions of page table outside the loop

The page table entries for the identity map vary according to the
paging level in use, and so must be constructed within the loop used
to detect the maximum supported paging level. Other page table
entries are invariant between paging levels, and so may be constructed
just once before entering the loop.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[bnxt] Update supported devices array

Add support for new device IDs. Remove device IDs which were never
in use.

Signed-off-by: Joseph Wong <joseph.wong@broadcom.com>

[bnxt] Update device descriptions

Use human readable strings for dev_description in PCI_ROM array.

Signed-off-by: Joseph Wong <joseph.wong@broadcom.com>

[bnxt] Remove VLAN stripping logic

Remove logic that programs the hardware to strip out VLAN from RX
packets. Do not drop packets due to VLAN mismatch and allow the upper
layer to decide whether to discard the packets.

Signed-off-by: Joseph Wong <joseph.wong@broadcom.com>

[github] Add sponsorship link

iPXE is released under the GNU GPL and is 100% open source software.
There are no "premium editions", no in-app advertisements, and no
hidden costs. The fully public version published to GitHub is and
always will be the definitive and only version of iPXE.

Many large features in iPXE have been commercially funded within this
open source model, with features being published upstream as soon as
they are complete and made available for the whole world to use, not
restricted for use only by the customer funding that particular piece
of development work.

There has not to date been any funding model for smaller pieces of
work, such as occasional code review or guaranteed attention to bug
reports. The overhead of establishing a commercial relationship is
usually too high to be worthwhile for very small units of work.

The GitHub sponsorship mechanism provides a framework for efficiently
handling small commercial requests (or individual tokens of thanks).
Add a FUNDING.yml file to provide a convenient way for anyone who
wants to support the ongoing open source development of iPXE to do so.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[bnxt] Increase Tx descriptors

Increase TX and CMP descriptor counts.

Signed-off-by: Joseph Wong <joseph.wong@broadcom.com>

[build] Disable use of common symbols

We no longer have any requirement for common symbols. Disable common
symbols via the -fno-common compiler option, and simplify the test for
support of -fdata-sections (which can return a false negative when
common symbols are enabled).

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[build] Allow for the existence of small-data sections

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[legacy] Allocate legacy driver .bss-like segments at probe time

Some legacy drivers use large static allocations for transmit and
receive buffers. To avoid bloating the .bss segment, we currently
implement these as a single common symbol named "_shared_bss" (which
is permissible since only one legacy driver may be active at any one
time).

Switch to dynamic allocation of these .bss-like segments, to avoid the
requirement for using common symbols.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[legacy] Rename the global legacy NIC to "legacy_nic"

We currently have contexts in which the local variable "nic" is a
pointer to the global variable also called "nic". This complicates
the creation of macros.

Rename the global variable to "legacy_nic" to reduce pollution of the
global namespace and to allow for the creation of macros referring to
fields within this global variable.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[legacy] Allocate extra padding in receive buffers

Allow for legacy drivers that include VLAN tags or CRCs within their
received packets.

Signed-off-by: Michael Brown <mcb30@ipxe.org>

[pxe] Use a weak symbol for isapnp_read_port

Use a weak symbol for isapnp_read_port used in pxe_preboot.c, rather
than relying on a common symbol.

Signed-off-by: Michael Brown <mcb30@ipxe.org>