..
  Copyright 1988-2022 Free Software Foundation, Inc.
  This is part of the GCC manual.
  For copying conditions, see the copyright.rst file.

.. _lto-overview:

Design Overview
***************

Link time optimization is implemented as a GCC front end for a
bytecode representation of GIMPLE that is emitted in special sections
of ``.o`` files. Currently, LTO support is enabled on most
ELF-based systems, as well as Darwin, Cygwin and MinGW systems.

By default, object files generated with LTO support contain only GIMPLE
bytecode. Such objects are called 'slim', and they require that
tools like ``ar`` and ``nm`` understand symbol tables of LTO
sections. For most targets these tools have been extended to use the
plugin infrastructure, so GCC can support 'slim' objects consisting
of the intermediate code alone.
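
As an illustrative sketch (the file names here are hypothetical, and
the exact section names vary between releases and targets), a slim
compilation might look like this; ``gcc-ar`` and ``gcc-nm`` are the
plugin-aware wrappers installed alongside GCC:

.. code-block:: shell

  # Compile to 'slim' objects that contain only GIMPLE bytecode.
  gcc -O2 -flto -c a.c b.c

  # Plain nm/ar may see no useful symbols in slim objects; the gcc-nm
  # and gcc-ar wrappers pass the LTO plugin to the binutils tools.
  gcc-nm a.o
  gcc-ar rcs libab.a a.o b.o

  # On ELF targets the bytecode is stored in dedicated sections,
  # conventionally named with a .gnu.lto_ prefix.
  readelf -S a.o | grep lto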

GIMPLE bytecode can also be saved alongside the final object code if
the :option:`-ffat-lto-objects` option is passed, or if no plugin support
is detected for ``ar`` and ``nm`` when GCC is configured. This makes
object files generated with LTO support larger than regular object
files. This 'fat' object format makes it possible to ship one set of
objects that can be used both for development and for producing
optimized builds. A perhaps surprising side effect of this feature
is that any mistake in the toolchain leads to the LTO information not
being used (e.g. an older ``libtool`` calling ``ld`` directly).
This is both an advantage, as the system is more robust, and a
disadvantage, as the user is not informed that the optimization has
been disabled.
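
A minimal sketch of how fat objects might be used (file names are
hypothetical, and the exact fallback behavior depends on the linker
and plugin configuration):

.. code-block:: shell

  # 'Fat' objects carry final machine code in addition to the GIMPLE
  # bytecode, at the cost of larger files.
  gcc -O2 -flto -ffat-lto-objects -c a.c b.c

  # A regular symbol table is present, so plain binutils work on these
  # objects even without the LTO plugin.
  nm a.o
  ar rcs libab.a a.o b.o

  # A link step that ignores the LTO sections still produces a working
  # executable from the regular object code; this is the silent
  # fallback described above.
  gcc -O2 -fno-lto -o app a.o b.o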

At the highest level, LTO splits the compiler in two. The first half
(the 'writer') produces a streaming representation of all the
internal data structures needed to optimize and generate code. This
includes declarations, types, the callgraph and the GIMPLE representation
of function bodies.

When :option:`-flto` is given during compilation of a source file, the
pass manager executes all the passes in ``all_lto_gen_passes``.
Currently, this phase is composed of two IPA passes:

* ``pass_ipa_lto_gimple_out``
  This pass executes the function ``lto_output`` in
  :samp:`lto-streamer-out.cc`, which traverses the call graph encoding
  every reachable declaration, type and function. This generates a
  memory representation of all the file sections described below.

* ``pass_ipa_lto_finish_out``
  This pass executes the function ``produce_asm_for_decls`` in
  :samp:`lto-streamer-out.cc`, which takes the memory image built in the
  previous pass and encodes it in the corresponding ELF file sections.

The second half of LTO support is the 'reader'. This is implemented
as the GCC front end :samp:`lto1` in :samp:`lto/lto.cc`. When
:samp:`collect2` detects a link set of ``.o`` / ``.a`` files with
LTO information and the :option:`-flto` option is enabled, it invokes
:samp:`lto1`, which reads the set of files and aggregates them into a
single translation unit for optimization. The main entry point for
the reader is ``lto_main`` in :samp:`lto/lto.cc`.
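
The reader's invocation can be observed by asking the driver to print
the programs it runs; the exact output varies by target and
configuration, but a link step such as the following (object names
hypothetical) typically shows ``collect2``, ``lto-wrapper`` and
``lto1`` being executed:

.. code-block:: shell

  # -v makes the driver print each subprogram it invokes at link time.
  gcc -O2 -flto -v -o app a.o b.o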

LTO modes of operation
^^^^^^^^^^^^^^^^^^^^^^

One of the main goals of the GCC link-time infrastructure was to allow
effective compilation of large programs. For this reason GCC implements two
link-time compilation modes.

* *LTO mode*, in which the whole program is read into the
  compiler at link-time and optimized in a similar way as if it
  were a single source-level compilation unit.

* *WHOPR or partitioned mode*, designed to utilize multiple
  CPUs and/or a distributed compilation environment to quickly link
  large applications. WHOPR stands for WHOle Program optimizeR (not to
  be confused with the semantics of :option:`-fwhole-program`). It
  partitions the aggregated callgraph from many different ``.o``
  files and distributes the compilation of the sub-graphs to different
  CPUs.

Note that distributed compilation is not implemented yet, but since
the parallelism is facilitated by generating a ``Makefile``, it
would be easy to implement.
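
The degree of link-time parallelism is controlled from the command
line; for example:

.. code-block:: shell

  # Run up to four parallel LTRANS jobs at link time.
  gcc -O2 -flto=4 -o app a.o b.o

  # Alternatively, take job slots from the jobserver of a parent
  # GNU make invocation.
  gcc -O2 -flto=jobserver -o app a.o b.o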

WHOPR splits LTO into three main stages:

* Local generation (LGEN)
  This stage executes in parallel. Every file in the program is compiled
  into the intermediate language and packaged together with the local
  call-graph and summary information. This stage is the same for both
  the LTO and WHOPR compilation modes.

* Whole Program Analysis (WPA)
  WPA is performed sequentially. The global call-graph is generated, and
  a global analysis procedure makes transformation decisions. The global
  call-graph is partitioned to facilitate parallel optimization during
  phase 3. The results of the WPA stage are stored into new object files
  which contain the partitions of the program expressed in the intermediate
  language and the optimization decisions.

* Local transformations (LTRANS)
  This stage executes in parallel. All the decisions made during phase 2
  are implemented locally in each partitioned object file, and the final
  object code is generated. Optimizations which cannot be decided
  efficiently during phase 2 may be performed on the local
  call-graph partitions.
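
In terms of user-visible commands, the three stages might map onto an
invocation like the following sketch (file names hypothetical):

.. code-block:: shell

  # LGEN: each file is compiled to GIMPLE bytecode independently, so
  # this step parallelizes naturally (e.g. under make -j).
  gcc -O2 -flto -c a.c
  gcc -O2 -flto -c b.c

  # WPA and LTRANS both happen behind this single link command: the
  # sequential whole-program analysis runs first, then the
  # per-partition LTRANS compilations (in parallel with -flto=n, as
  # shown above).
  gcc -O2 -flto=4 -o app a.o b.o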

WHOPR can be seen as an extension of the usual LTO mode of
compilation. In LTO, WPA and LTRANS are executed within a single
execution of the compiler, after the whole program has been read into
memory.
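
Which of the two modes is used can be influenced with
:option:`-flto-partition` (the exact defaults may differ between
releases; in current releases partitioned operation is the default):

.. code-block:: shell

  # Partitioned (WHOPR) operation, selecting the partitioning
  # algorithm explicitly.
  gcc -O2 -flto -flto-partition=balanced -o app a.o b.o

  # Disable partitioning, which roughly corresponds to the monolithic
  # LTO mode described above.
  gcc -O2 -flto -flto-partition=none -o app a.o b.o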

When compiling in WHOPR mode, the callgraph is partitioned during
the WPA stage. The whole program is split into a given number of
partitions of roughly the same size. The compiler tries to
minimize the number of references which cross partition boundaries.
The main advantage of WHOPR is that it allows the parallel execution of
LTRANS stages, which are the most time-consuming part of the
compilation process. Additionally, it avoids the need to load the
whole program into memory.
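
The number and size of partitions can be tuned from the command line;
the parameter names below are the ones current releases use, but their
defaults and exact semantics may change between versions:

.. code-block:: shell

  # Aim for roughly 64 partitions of at least ~10000 estimated
  # instructions each, and run up to eight LTRANS jobs in parallel.
  gcc -O2 -flto=8 --param lto-partitions=64 \
      --param lto-min-partition=10000 -o app a.o b.o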