Commit | Line | Data |
---|---|---|
e66d8631 MCC |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | ====================================== | |
6e95d0a0 | 4 | EROFS - Enhanced Read-Only File System |
e66d8631 MCC |
5 | ====================================== |
6 | ||
fdb05364 GX |
7 | Overview |
8 | ======== | |
9 | ||
6e95d0a0 GX |
10 | EROFS filesystem stands for Enhanced Read-Only File System. It aims to form a |
11 | generic read-only filesystem solution for various read-only use cases instead | |
12 | of just focusing on storage space saving without considering any side effects | |
13 | of runtime performance. | |
fdb05364 | 14 | |
6e95d0a0 GX |
15 | It is designed to be flexible, extensible and friendly to user payloads. |
16 | Beyond that, it remains a simple, random-access friendly, high-performance | |
17 | filesystem that avoids unneeded I/O amplification and memory-resident | |
18 | overhead compared to similar approaches. | |
19 | ||
20 | It is implemented to be a better choice for the following scenarios: | |
e66d8631 | 21 | |
fdb05364 GX |
22 | - read-only storage media or |
23 | ||
24 | - part of a fully trusted read-only solution, which means it needs to be | |
25 | immutable and bit-for-bit identical to the official golden image for | |
6e95d0a0 | 26 | their releases due to security or other considerations and |
fdb05364 | 27 | |
dfeab2e9 GX |
28 | - a need to minimize extra storage space with guaranteed end-to-end performance |
29 | by using compact layout, transparent file compression and direct access, | |
30 | especially for those embedded devices with limited memory and high-density | |
6e95d0a0 | 31 | hosts with numerous containers. |
fdb05364 | 32 | |
2109901d | 33 | Here are the main features of EROFS: |
e66d8631 | 34 | |
fdb05364 GX |
35 | - Little endian on-disk design; |
36 | ||
2109901d GX |
37 | - Block-based distribution and file-based distribution over fscache are |
38 | supported; | |
39 | ||
40 | - Support multiple devices to refer to external blobs, which can be used | |
41 | for container images; | |
42 | ||
d3c4bdcc JX |
43 | - 32-bit block addresses for each device, therefore 16TiB address space at |
44 | most with 4KiB block size for now; | |
fdb05364 | 45 | |
6e95d0a0 | 46 | - Two inode layouts for different requirements: |
e66d8631 | 47 | |
6e95d0a0 | 48 | ===================== ============ ======================================
ffafde47 | 49 |                       compact (v1) extended (v2)
6e95d0a0 | 50 | ===================== ============ ======================================
e66d8631 | 51 | Inode metadata size   32 bytes     64 bytes
6e95d0a0 | 52 | Max file size         4 GiB        16 EiB (also limited by max. vol size)
e66d8631 | 53 | Max uids/gids         65536        4294967296
a1108dcd | 54 | Per-inode timestamp   no           yes (64 + 32-bit timestamp)
e66d8631 | 55 | Max hardlinks         65536        4294967296
6e95d0a0 GX |
56 | Metadata reserved     8 bytes      18 bytes |
57 | ===================== ============ ====================================== | |
58 | ||
2109901d | 59 | - Support extended attributes as an option; |
fdb05364 | 60 | |
3048102d JX |
61 | - Support a bloom filter that speeds up negative extended attribute lookups; |
62 | ||
2109901d | 63 | - Support POSIX.1e ACLs by using extended attributes; |
516c115c | 64 | |
46f2e044 | 65 | - Support transparent data compression as an option: |
3048102d JX |
66 | LZ4, MicroLZMA and DEFLATE algorithms can be used on a per-file basis. In |
67 | addition, in-place decompression is supported to avoid bouncing compressed | |
68 | buffers and unnecessary page cache thrashing. | |
6e95d0a0 | 69 | |
2109901d GX |
70 | - Support chunk-based data deduplication and rolling-hash compressed data |
71 | deduplication; | |
72 | ||
73 | - Support tail-packing inline data as an alternative to byte-addressed | |
74 | unaligned metadata or smaller block sizes; | |
75 | ||
76 | - Support merging tail-end data into a special inode as fragments; | |
77 | ||
e6687b89 JX |
78 | - Support large folios for uncompressed files; |
79 | ||
6e95d0a0 GX |
80 | - Support direct I/O on uncompressed files to avoid double caching for loop |
81 | devices; | |
dfeab2e9 | 82 | |
6e95d0a0 GX |
83 | - Support FSDAX on uncompressed images for secure containers and ramdisks in |
84 | order to avoid unnecessary page cache usage; | |
85 | ||
6e95d0a0 | 86 | - Support file-based on-demand loading with the Fscache infrastructure. |
fdb05364 GX |
87 | |
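The size limits quoted in the feature list follow from simple powers of two. The sketch below only rechecks that arithmetic. The function names are illustrative, and the 32-bit size field for compact inodes is an inference from the 4 GiB figure in the table, not something the table states.

```python
KIB, GIB, TIB = 2**10, 2**30, 2**40


def max_device_bytes(block_size: int = 4 * KIB, addr_bits: int = 32) -> int:
    """Bytes addressable on one device with addr_bits-wide block addresses."""
    return (1 << addr_bits) * block_size


def max_compact_file_bytes(size_bits: int = 32) -> int:
    """Largest file size for a compact inode, assuming a size_bits-wide size field."""
    return 1 << size_bits


if __name__ == "__main__":
    print(max_device_bytes() // TIB, "TiB per device")
    print(max_compact_file_bytes() // GIB, "GiB per compact-inode file")
```

With the defaults, the first function gives the 16 TiB per-device limit and the second gives the 4 GiB compact-inode limit.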
88 | The following git tree provides the file system user-space tools under | |
6e95d0a0 GX |
89 | development, such as a formatting tool (mkfs.erofs), an on-disk consistency & |
90 | compatibility checking tool (fsck.erofs), and a debugging tool (dump.erofs): | |
e66d8631 MCC |
91 | |
92 | - git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git | |
fdb05364 GX |
93 | |
94 | Bug reports and patches are welcome. Please send them to the following | |
95 | linux-erofs mailing list: | |
e66d8631 MCC |
96 | |
97 | - linux-erofs mailing list <linux-erofs@lists.ozlabs.org> | |
fdb05364 | 98 | |
fdb05364 GX |
99 | Mount options |
100 | ============= | |
101 | ||
e66d8631 | 102 | =================== ========================================================= |
fdb05364 GX |
103 | (no)user_xattr Setup Extended User Attributes. Note: xattr is enabled |
104 | by default if CONFIG_EROFS_FS_XATTR is selected. | |
105 | (no)acl Setup POSIX Access Control List. Note: acl is enabled | |
106 | by default if CONFIG_EROFS_FS_POSIX_ACL is selected. | |
4279f3f9 | 107 | cache_strategy=%s Select a strategy for cached decompression from now on: |
e66d8631 MCC |
108 | |
109 | ========== ============================================= | |
110 | disabled In-place I/O decompression only; | |
111 | readahead Cache the last incomplete compressed physical | |
4279f3f9 GX |
112 | cluster for further reading. It still does |
113 | in-place I/O decompression for the remaining | |
114 | compressed physical clusters; | |
e66d8631 | 115 | Cache both ends of incomplete compressed
4279f3f9 GX |
116 | physical clusters for further reading. |
117 | It still does in-place I/O decompression | |
118 | for the remaining compressed physical clusters. | |
e66d8631 | 119 | ========== ============================================= |
06252e9c GX |
120 | dax={always,never} Use direct access (no page cache). See |
121 | Documentation/filesystems/dax.rst. | |
122 | dax A legacy option which is an alias for ``dax=always``. | |
dfeab2e9 | 123 | device=%s Specify a path to an extra device to be used together. |
6e95d0a0 | 124 | fsid=%s Specify a filesystem image ID for Fscache back-end. |
b22c7b97 JX |
125 | domain_id=%s Specify a domain ID in fscache mode so that different images |
126 | with the same blobs under a given domain ID can share storage. | |
e66d8631 | 127 | =================== ========================================================= |
fdb05364 | 128 | |
168e9a76 HJ |
129 | Sysfs Entries |
130 | ============= | |
131 | ||
132 | Information about mounted erofs file systems can be found in /sys/fs/erofs. | |
133 | Each mounted filesystem will have a directory in /sys/fs/erofs based on its | |
134 | device name (e.g., /sys/fs/erofs/sda). | |
135 | (see also Documentation/ABI/testing/sysfs-fs-erofs) | |
136 | ||
fdb05364 GX |
137 | On-disk details |
138 | =============== | |
139 | ||
140 | Summary | |
141 | ------- | |
142 | Unlike other read-only file systems, an EROFS volume is designed | |
e66d8631 | 143 | to be as simple as possible:: |
fdb05364 GX |
144 | |
145 | |-> aligned with the block size | |
146 | ____________________________________________________________ | |
147 | | |SB| | ... | Metadata | ... | Data | Metadata | ... | Data | | |
148 | |_|__|_|_____|__________|_____|______|__________|_____|______| | |
149 | 0 +1K | |
150 | ||
151 | All data areas should be aligned with the block size, but metadata areas | |
152 | need not be. All metadata can currently be observed in two different spaces (views): | |
e66d8631 | 153 | |
fdb05364 | 154 | 1. Inode metadata space |
e66d8631 | 155 | |
fdb05364 | 156 | Each valid inode should be aligned with an inode slot, which is a fixed |
ffafde47 | 157 | value (32 bytes) and designed to be kept in line with compact inode size. |
fdb05364 GX |
158 | |
159 | Each inode can be directly found with the following formula: | |
160 | inode offset = meta_blkaddr * block_size + 32 * nid | |
161 | ||
e66d8631 MCC |
162 | :: |
163 | ||
1b55767d GX |
164 | |-> aligned with 8B |
165 | |-> followed closely | |
166 | + meta_blkaddr blocks |-> another slot | |
167 | _____________________________________________________________________ | |
168 | | ... | inode | xattrs | extents | data inline | ... | inode ... | |
169 | |________|_______|(optional)|(optional)|__(optional)_|_____|__________ | |
170 | |-> aligned with the inode slot size | |
171 | . . | |
172 | . . | |
173 | . . | |
174 | . . | |
175 | . . | |
176 | . . | |
177 | .____________________________________________________|-> aligned with 4B | |
178 | | xattr_ibody_header | shared xattrs | inline xattrs | | |
179 | |____________________|_______________|_______________| | |
180 | |-> 12 bytes <-|->x * 4 bytes<-| . | |
181 | . . . | |
182 | . . . | |
183 | . . . | |
184 | ._______________________________.______________________. | |
185 | | id | id | id | id | ... | id | ent | ... | ent| ... | | |
186 | |____|____|____|____|______|____|_____|_____|____|_____| | |
187 | |-> aligned with 4B | |
188 | |-> aligned with 4B | |
fdb05364 GX |
189 | |
190 | An inode can be 32 or 64 bytes; the two sizes are distinguished by a | |
e66d8631 | 191 | field common to all inode versions -- i_format::
fdb05364 GX |
192 | |
193 | __________________ __________________ | |
ffafde47 | 194 | | i_format | | i_format | |
fdb05364 GX |
195 | |__________________| |__________________| |
196 | | ... | | ... | | |
197 | | | | | | |
198 | |__________________| 32 bytes | | | |
199 | | | | |
200 | |__________________| 64 bytes | |
201 | ||
202 | Xattrs, extents and inline data follow the corresponding inode with | |
ffafde47 | 203 | proper alignment, and each is optional depending on the data mapping.
2a9dc7a8 | 204 | Currently, 5 data layouts are supported in total:
fdb05364 | 205 | |
e66d8631 | 206 | == ==================================================================== |
ffafde47 GX |
207 | 0 flat file data without data inline (no extent); |
208 | 1 fixed-sized output data compression (with non-compacted indexes); | |
209 | 2 flat file data with tail packing data inline (no extent); | |
2a9dc7a8 GX |
210 | 3 fixed-sized output data compression (with compacted indexes, v5.3+); |
211 | 4 chunk-based file (v5.15+). | |
e66d8631 | 212 | == ==================================================================== |
fdb05364 GX |
213 | |
214 | The size of the optional xattrs is indicated by i_xattr_count in inode | |
215 | header. Large xattrs or xattrs shared by many different files can be | |
216 | stored in shared xattrs metadata rather than inlined right after inode. | |
217 | ||
218 | 2. Shared xattrs metadata space | |
e66d8631 | 219 | |
fdb05364 GX |
220 | The shared xattrs space is similar to the inode space above: it starts at |
221 | the block indicated by xattr_blkaddr, and entries are laid out one after | |
222 | another with proper alignment. | |
223 | ||
224 | Each shared xattr can also be directly found by the following formula: | |
225 | xattr offset = xattr_blkaddr * block_size + 4 * xattr_id | |
226 | ||
1b55767d | 227 | :: |
e66d8631 | 228 | |
1b55767d GX |
229 | |-> aligned by 4 bytes |
230 | + xattr_blkaddr blocks |-> aligned with 4 bytes | |
231 | _________________________________________________________________________ | |
232 | | ... | xattr_entry | xattr data | ... | xattr_entry | xattr data ... | |
233 | |________|_____________|_____________|_____|______________|_______________ | |
fdb05364 GX |
234 | |
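The two lookup formulas above (for inodes and for shared xattrs) translate directly into code. This is a minimal sketch: the function names and the sample numbers are illustrative, and only the formulas themselves come from the text.

```python
INODE_SLOT_SIZE = 32   # bytes per inode slot, as stated above
XATTR_ID_UNIT = 4      # shared xattr ids count in 4-byte units


def inode_offset(meta_blkaddr: int, block_size: int, nid: int) -> int:
    """Byte offset of the inode numbered nid."""
    return meta_blkaddr * block_size + INODE_SLOT_SIZE * nid


def shared_xattr_offset(xattr_blkaddr: int, block_size: int, xattr_id: int) -> int:
    """Byte offset of the shared xattr entry numbered xattr_id."""
    return xattr_blkaddr * block_size + XATTR_ID_UNIT * xattr_id


if __name__ == "__main__":
    # Hypothetical superblock values, chosen only for the example.
    print(inode_offset(meta_blkaddr=1, block_size=4096, nid=36))
    print(shared_xattr_offset(xattr_blkaddr=2, block_size=4096, xattr_id=5))
```

Because both formulas are plain multiplications and additions, locating an inode or a shared xattr needs no lookup table.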
235 | Directories | |
236 | ----------- | |
237 | All directories are now organized in a compact on-disk format. Note that | |
238 | each directory block is divided into index and name areas in order to support | |
239 | random file lookup, and all directory entries are _strictly_ recorded in | |
240 | alphabetical order in order to support an improved prefix binary search | |
241 | algorithm (see the related source code for details). | |
242 | ||
e66d8631 MCC |
243 | :: |
244 | ||
1b55767d GX |
245 | ___________________________ |
246 | / | | |
247 | / ______________|________________ | |
248 | / / | nameoff1 | nameoffN-1 | |
249 | ____________.______________._______________v________________v__________ | |
250 | | dirent | dirent | ... | dirent | filename | filename | ... | filename | | |
251 | |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____| | |
252 | \ ^ | |
253 | \ | * could have | |
254 | \ | trailing '\0' | |
255 | \________________________| nameoff0 | |
256 | Directory block | |
fdb05364 GX |
257 | |
258 | Note that apart from the offset of the first filename, nameoff0 also indicates | |
259 | the total number of directory entries in this block, since there is no need to | |
260 | introduce another on-disk field. | |
261 | ||
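The directory block layout can be illustrated with a small parser. This is a sketch under stated assumptions, not the kernel implementation: it assumes a 12-byte little-endian dirent (64-bit nid, 16-bit nameoff, 8-bit file type, 8-bit reserved, as in erofs_fs.h), byte-wise name ordering, and a plain binary search rather than the kernel's prefix-optimised one. ``build_dir_block`` is a hypothetical helper that exists only to produce test data.

```python
import bisect
import struct

# Assumed dirent layout: nid, nameoff, file_type, reserved (12 bytes).
DIRENT = struct.Struct("<QHBB")


def build_dir_block(items, block_size=4096):
    """Hypothetical helper: pack (name, nid) pairs into one directory block."""
    items = sorted(items)
    first_name_off = len(items) * DIRENT.size
    index, names = b"", b""
    for name, nid in items:
        index += DIRENT.pack(nid, first_name_off + len(names), 0, 0)
        names += name
    return (index + names).ljust(block_size, b"\0")


def parse_dir_block(block):
    """Return [(name, nid), ...]; the entry count is derived from nameoff0."""
    nameoff0 = DIRENT.unpack_from(block, 0)[1]
    count = nameoff0 // DIRENT.size
    dirents = [DIRENT.unpack_from(block, i * DIRENT.size) for i in range(count)]
    result = []
    for i, (nid, nameoff, _ftype, _rsv) in enumerate(dirents):
        end = dirents[i + 1][1] if i + 1 < count else len(block)
        # Only the last name may carry trailing '\0' padding.
        result.append((block[nameoff:end].split(b"\0", 1)[0], nid))
    return result


def lookup(block, target):
    """Binary-search the sorted names; return the nid or None."""
    entries = parse_dir_block(block)
    names = [name for name, _ in entries]
    i = bisect.bisect_left(names, target)
    return entries[i][1] if i < len(names) and names[i] == target else None
```

Note how ``parse_dir_block`` needs no separate entry count: dividing nameoff0 by the dirent size is enough.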
6e95d0a0 GX |
262 | Chunk-based files |
263 | ----------------- | |
2a9dc7a8 GX |
264 | In order to support chunk-based data deduplication, a new inode data layout has |
265 | been supported since Linux v5.15: files are split into equal-sized data chunks, | |
266 | with the ``extents`` area of the inode metadata indicating how to get the chunk | |
267 | data. This can be stored simply as a 4-byte block address array or in the 8-byte | |
268 | chunk index form (see struct erofs_inode_chunk_index in erofs_fs.h for more | |
269 | details.) | |
270 | ||
271 | Note that chunk-based files are all uncompressed for now. | |
272 | ||
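For the simpler 4-byte block address array form, mapping a file offset to a disk location might look like the sketch below. It assumes that each array entry holds the first block of its chunk and that a chunk occupies consecutive blocks; neither detail is spelled out above. Holes and the 8-byte chunk index form are not handled, and all names are illustrative.

```python
import struct


def load_blkaddr_array(raw: bytes):
    """Decode a little-endian array of 4-byte block addresses."""
    return list(struct.unpack("<%dI" % (len(raw) // 4), raw))


def locate(blkaddrs, file_offset, chunk_size, block_size=4096):
    """Return (block address, offset within that block) for file_offset.

    Assumes blkaddrs[n] is the first block of chunk n and that the
    chunk's blocks are contiguous on disk.
    """
    chunk_no, within_chunk = divmod(file_offset, chunk_size)
    blk_in_chunk, within_block = divmod(within_chunk, block_size)
    return blkaddrs[chunk_no] + blk_in_chunk, within_block


if __name__ == "__main__":
    # Hypothetical file with two 16 KiB chunks starting at blocks 100 and 500.
    addrs = load_blkaddr_array(struct.pack("<2I", 100, 500))
    print(locate(addrs, 16384 + 5000, chunk_size=16384))
```

Since two files can list the same block address for identical chunks, the array form is what makes chunk-level deduplication possible.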
3048102d JX |
273 | Long extended attribute name prefixes |
274 | ------------------------------------- | |
275 | There are use cases where extended attributes with different values can have | |
276 | only a few common prefixes (such as overlayfs xattrs). The predefined prefixes | |
277 | work inefficiently in both image size and runtime performance in such cases. | |
278 | ||
279 | The long xattr name prefixes feature is introduced to address this issue. The | |
280 | overall idea is that, apart from the existing predefined prefixes, the xattr | |
281 | entry could also refer to user-specified long xattr name prefixes, e.g. | |
282 | "trusted.overlay.". | |
283 | ||
284 | When referring to a long xattr name prefix, the highest bit (bit 7) of | |
285 | erofs_xattr_entry.e_name_index is set, while the lower bits (bit 0-6) as a whole | |
286 | represent the index of the referred long name prefix among all long name | |
287 | prefixes. Therefore, only the trailing part of the name apart from the long | |
288 | xattr name prefix is stored in erofs_xattr_entry.e_name, which could be empty if | |
289 | the full xattr name exactly matches its long xattr name prefix. | |
290 | ||
291 | All long xattr prefixes are stored one by one in the packed inode as long as | |
292 | the packed inode is valid, or in the meta inode otherwise. The | |
293 | xattr_prefix_count (of the on-disk superblock) indicates the total number of | |
294 | long xattr name prefixes, while (xattr_prefix_start * 4) indicates the start | |
295 | offset of long name prefixes in the packed/meta inode. Note that long extended | |
296 | attribute name prefixes are disabled if xattr_prefix_count is 0. | |
297 | ||
298 | Each long name prefix is stored in the format: ALIGN({__le16 len, data}, 4), | |
299 | where len represents the total size of the data part. The data part is actually | |
300 | represented by 'struct erofs_xattr_long_prefix', where base_index represents the | |
301 | index of the predefined xattr name prefix, e.g. EROFS_XATTR_INDEX_TRUSTED for | |
302 | "trusted.overlay." long name prefix, while the infix string keeps the string | |
303 | after stripping the short prefix, e.g. "overlay." for the example above. | |
304 | ||
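The encoding described in this section can be sketched as follows. The code assumes the buffer passed in already starts at byte offset xattr_prefix_start * 4 of the packed/meta inode, and that the data part is one base_index byte followed by the infix. The ``predefined`` table and the numeric index 4 used for "trusted." are hypothetical example values.

```python
import struct

LONG_PREFIX_FLAG = 0x80  # bit 7 of e_name_index
LONG_PREFIX_MASK = 0x7F  # bits 0-6: index among the long name prefixes


def decode_name_index(e_name_index):
    """Return (refers_to_long_prefix, index)."""
    return bool(e_name_index & LONG_PREFIX_FLAG), e_name_index & LONG_PREFIX_MASK


def parse_long_prefixes(buf, count):
    """Parse count records of ALIGN({__le16 len, data}, 4) into (base_index, infix)."""
    prefixes, pos = [], 0
    for _ in range(count):
        (length,) = struct.unpack_from("<H", buf, pos)
        data = buf[pos + 2 : pos + 2 + length]
        prefixes.append((data[0], data[1:]))
        pos += (2 + length + 3) & ~3  # round the record size up to 4 bytes
    return prefixes


def full_name(e_name_index, e_name, prefixes, predefined):
    """Rebuild the complete xattr name from an entry's index and stored tail."""
    is_long, index = decode_name_index(e_name_index)
    if is_long:
        base_index, infix = prefixes[index]
        return predefined[base_index] + infix + e_name
    return predefined[index] + e_name
```

With a "trusted.overlay." long prefix at index 0, an entry with e_name_index 0x80 and e_name "opaque" resolves to "trusted.overlay.opaque", while only "opaque" is stored in the entry.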
46f2e044 GX |
305 | Data compression |
306 | ---------------- | |
2109901d | 307 | EROFS implements fixed-sized output compression which generates fixed-sized |
46f2e044 GX |
308 | compressed data blocks from variable-sized input in contrast to other existing |
309 | fixed-sized input solutions. Fixed-sized output compression can achieve | |
310 | relatively higher compression ratios, since today's popular data compression | |
311 | algorithms are mostly LZ77-based and the fixed-sized output approach benefits | |
312 | from the historical dictionary (a.k.a. the sliding window). | |
313 | ||
314 | In detail, the original (uncompressed) data is split into several variable-sized | |
315 | extents and, meanwhile, compressed into physical clusters (pclusters). | |
316 | In order to record each variable-sized extent, logical clusters (lclusters) are | |
317 | introduced as the basic unit of compression indexes to indicate whether a new | |
318 | extent is generated within the range (HEAD) or not (NONHEAD). Lclusters are | |
319 | currently fixed to the block size, as illustrated below:: | |
e66d8631 | 320 | |
1b55767d GX |
321 | |<- variable-sized extent ->|<- VLE ->| |
322 | clusterofs clusterofs clusterofs | |
323 | | | | | |
324 | _________v_________________________________v_______________________v________ | |
325 | ... | . | | . | | . ... | |
326 | ____|____._________|______________|________.___ _|______________|__.________ | |
327 | |-> lcluster <-|-> lcluster <-|-> lcluster <-|-> lcluster <-| | |
46f2e044 GX |
328 | (HEAD) (NONHEAD) (HEAD) (NONHEAD) . |
329 | . CBLKCNT . . | |
330 | . . . | |
331 | . . . | |
332 | _______._____________________________.______________._________________ | |
1b55767d GX |
333 | ... | | | | ... |
334 | _______|______________|______________|______________|_________________ | |
46f2e044 GX |
335 | |-> big pcluster <-|-> pcluster <-| |
336 | ||
337 | A physical cluster can be seen as a container of physical compressed blocks | |
338 | which contain compressed data. Previously, only lcluster-sized (4KB) pclusters | |
339 | were supported. With the big pcluster feature (available since Linux v5.13), | |
340 | a pcluster can be a multiple of the lcluster size. | |
341 | ||
342 | For each HEAD lcluster, clusterofs is recorded to indicate where a new extent | |
343 | starts and blkaddr is used to seek the compressed data. For each NONHEAD | |
344 | lcluster, delta0 and delta1 are available instead of blkaddr to indicate the | |
345 | distance to its HEAD lcluster and the next HEAD lcluster. A PLAIN lcluster is | |
346 | also a HEAD lcluster except that its data is uncompressed. See the comments | |
347 | around "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details. | |
348 | ||
349 | If big pcluster is enabled, the pcluster size in lclusters needs to be recorded | |
350 | as well. The delta0 field of the first NONHEAD lcluster stores the compressed | |
351 | block count with a special flag; such an lcluster is called a CBLKCNT NONHEAD | |
352 | lcluster. Its actual delta0 is implicitly always 1, as illustrated below:: | |
353 | ||
354 | __________________________________________________________ | |
355 | | HEAD | NONHEAD | NONHEAD | ... | NONHEAD | HEAD | HEAD | | |
356 | |__:___|_(CBLKCNT)_|_________|_____|_________|__:___|____:_| | |
357 | |<----- a big pcluster (with CBLKCNT) ------>|<-- -->| | |
358 | a lcluster-sized pcluster (without CBLKCNT) ^ | |
359 | ||
360 | If another HEAD follows a HEAD lcluster, there is no room to record CBLKCNT, | |
361 | but the size of such a pcluster is then known to be 1 lcluster. | |
2109901d GX |
362 | |
363 | Since Linux v6.1, each pcluster can be used for multiple variable-sized extents, | |
364 | therefore it can be used for compressed data deduplication. |
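The HEAD/NONHEAD/CBLKCNT rules above can be modelled with a toy index. The dictionaries below are a hypothetical in-memory representation, not the on-disk encoding (see erofs_fs.h for that). A NONHEAD entry carries either ``delta0`` or, for the CBLKCNT case, a ``cblkcnt`` value whose delta0 is implicitly 1.

```python
def head_of(lclusters, i):
    """Index of the HEAD (or PLAIN) lcluster that lcluster i belongs to."""
    entry = lclusters[i]
    if entry["type"] in ("HEAD", "PLAIN"):
        return i
    if "cblkcnt" in entry:
        return i - 1  # CBLKCNT NONHEAD: delta0 is implicitly 1
    return i - entry["delta0"]


def pcluster_size(lclusters, head):
    """Pcluster size in lclusters for the extent starting at lcluster head."""
    nxt = head + 1
    if nxt < len(lclusters) and "cblkcnt" in lclusters[nxt]:
        return lclusters[nxt]["cblkcnt"]
    return 1  # no CBLKCNT recorded: an lcluster-sized pcluster


if __name__ == "__main__":
    index = [
        {"type": "HEAD"},
        {"type": "NONHEAD", "cblkcnt": 2},
        {"type": "NONHEAD", "delta0": 2},
        {"type": "HEAD"},
        {"type": "HEAD"},
    ]
    print(head_of(index, 2), pcluster_size(index, 0), pcluster_size(index, 3))
```

In the sample index, lcluster 2 resolves to the HEAD at index 0, whose pcluster spans 2 lclusters. The HEAD at index 3 is followed directly by another HEAD, so its pcluster is a single lcluster.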