[RFC] Moving RELRO segment

Hey all,

TL;DR: Move the RELRO segment to be immediately after the read-only segment, so that the dynamic linker has the option to merge the two virtual memory areas at run time.

This is an RFC for moving the RELRO segment. Currently, lld orders ELF sections in the following order: R, RX, RWX, RW, and RW contains RELRO. At run time, after RELRO is write-protected, we’d have VMAs in the order: R, RX, RWX, R (RELRO), RW. I’d like to propose that we move RELRO to be immediately after the read-only sections, so that the order of VMAs becomes: R, R (RELRO), RX, RWX, RW, and the dynamic linker would have the option to merge the two read-only VMAs to reduce bookkeeping costs.

While I only tested this proposal on an ARM64 Android platform, the same optimization should be applicable to other platforms as well. My test showed an overall ~1MB decrease in kernel slab memory usage on vm_area_struct, with about 150 processes running. For this to work, I had to modify the dynamic linker:

  1. The dynamic linker needs to make the read-only VMA briefly writable in order for it to have the same VM flags as the RELRO VMA so that the two can be merged. Specifically, VM_ACCOUNT is set when a VMA is made writable.
  2. The cross-DSO CFI implementation in the Android dynamic linker currently assumes __cfi_check is at a lower address than all CFI targets, so the CFI check fails when RELRO is moved below the text section. After I added support for CFI targets below __cfi_check, I no longer see CFI failures.

One drawback of this change is that the number of LOAD segments increases by one for DSOs whose RW LOAD segment contains anything other than RELRO.

This would be a somewhat tedious change (especially the part about having to update all the unit tests), but the benefit is pretty good, especially considering the kernel slab memory is not swappable/evictable. Please let me know your thoughts!

Thanks,
Vic

Hi Vic,

I’m in favor of this proposal. Saving that amount of kernel memory by changing the memory layout seems like a win. I believe there are programs in the wild that assume some specific segment order, and moving the RELRO segment might break some of them, but it looks like it’s worth the risk.

Hello Vic,

I don't have a lot to add myself. I think that the majority of the input
needs to come from the OS stakeholders. My main concern is that if it
requires work on every platform to take advantage of it or to avoid
regressions, then perhaps it is worth adding as an option rather than
changing the default.

Some questions:
- Does this need work in every OS for correctness of programs? For
example you mention that cross-DSO CFI implementation in Android
needed to be updated, could that also be the case on other platforms?
- Does this need work in every OS to take advantage of it? For example
would this need a ld.so change on Linux?

The last time we updated the position of RELRO was in
https://reviews.llvm.org/D56828; it will be worth going through the
arguments there to see if there is anything that triggers any
thoughts.

Peter

Hello Vic,

To make sure I understand the proposal correctly, do you propose:

Old: R RX RW(RELRO) RW
New: R(R+RELRO) RX RW; R includes the traditional R part and the RELRO part
Runtime (before relocation resolving): RW RX RW
Runtime (after relocation resolving): R RX RW

How to lay out the segments if --no-rosegment is specified?

One option is to keep the old layout if --no-rosegment is specified, the other is:

Old: RX RW(RELRO) RW
New: RX(R+RELRO+RX) RW; RX includes the traditional R part, the RELRO part, and the RX part
Runtime (before relocation resolving): RW RX RW; ifunc can’t run if RX is not kept
Runtime (after relocation resolving): RX RW; some people may be concerned with writable stuff (the relocated part) being made executable

Another problem is that in the default -z relro -z lazy (-z now not specified) layout, .got and .got.plt will be separated by potentially huge code sections (e.g. .text). I’m still thinking about what problems this layout change may bring.

Hello Vic,

I don’t have a lot to add myself. I think that the majority of the input
needs to come from the OS stakeholders. My main concern is that if it
requires work on every platform to take advantage of it or to avoid
regressions, then perhaps it is worth adding as an option rather than
changing the default.

Some questions:

  • Does this need work in every OS for correctness of programs? For
    example you mention that cross-DSO CFI implementation in Android
    needed to be updated, could that also be the case on other platforms?

Indeed this could be a problem for other platforms. I’m not familiar with which CFI implementations are commonly in use, but from what I can tell, the implementation mentioned in the Clang CFI design doc has this problem as well, so I wouldn’t be surprised to see it in other implementations: https://clang.llvm.org/docs/ControlFlowIntegrityDesign.html#cfi-shadow. Either those implementations need to be fixed or we need to add an option for where RELRO is placed (which brings more maintenance cost).

  • Does this need work in every OS to take advantage of it? For example
    would this need a ld.so change on Linux?

I can’t say for sure for other platforms, but for Linux, I think it depends on how we implement this. If we keep the RO and RELRO segments separate, ld.so needs to be updated for the VM_ACCOUNT issue I mentioned in order to take advantage of this. However, we could consider merging the RO segment into the RELRO segment when they are adjacent (i.e. make what’s RO now a part of RELRO), so that we have one less LOAD and existing dynamic linkers can take advantage of this without change (well, except for the CFI issue).

The last time we updated the position of RELRO was in
https://reviews.llvm.org/D56828; it will be worth going through the
arguments there to see if there is anything that triggers any
thoughts.

Thanks for the pointer! I’ll go through it.

Vic

Hello Vic,

To make sure I understand the proposal correctly, do you propose:

Old: R RX RW(RELRO) RW
New: R(R+RELRO) RX RW; R includes the traditional R part and the RELRO part
Runtime (before relocation resolving): RW RX RW
Runtime (after relocation resolving): R RX RW

I actually see two ways of implementing this, and yes what you mentioned here is one of them:

  1. Move RELRO to before RX, and merge it with R segment. This is what you said above.
  2. Move RELRO to before RX, but keep it as a separate segment. This is what I implemented in my test.
    As I mentioned in my reply to Peter, option 1 would allow existing implementations to take advantage of this without any change. While I think this optimization is well worth it, if we go with option 1, dynamic linkers won’t have the option of keeping RO separate if they want to for whatever reason (e.g. less VM commit, finer granularity in VM maps, not wanting RO to be writable even for a short while). So there’s a trade-off to be made here (or an option to be added, even though we all want to avoid that if we can).

How to lay out the segments if --no-rosegment is specified?

One option is to keep the old layout if --no-rosegment is specified, the other is:

Old: RX RW(RELRO) RW
New: RX(R+RELRO+RX) RW; RX includes the traditional R part, the RELRO part, and the RX part
Runtime (before relocation resolving): RW RX RW; ifunc can’t run if RX is not kept
Runtime (after relocation resolving): RX RW; some people may be concerned with writable stuff (the relocated part) being made executable

Indeed, I think the weakening of the security aspect may be a problem if we are to merge RELRO into RX. Keeping the old layout would be preferable IMHO.

Another problem is that in the default -z relro -z lazy (-z now not specified) layout, .got and .got.plt will be separated by potentially huge code sections (e.g. .text). I’m still thinking about what problems this layout change may bring.

Not sure if this is the same issue as what you mentioned here, but I also see a comment in lld/ELF/Writer.cpp about how .rodata and .eh_frame should be as close to .text as possible due to fear of relocation overflow. If we go with option 2 above, the distance would have to be made larger. With option 1, we may still have some leeway in how to order sections within the merged RELRO segment.

Vic

Old: R RX RW(RELRO) RW
New: R(R+RELRO) RX RW; R includes the traditional R part and the
RELRO part
Runtime (before relocation resolving): RW RX RW
Runtime (after relocation resolving): R RX RW

I actually see two ways of implementing this, and yes what you mentioned
here is one of them:

  1. Move RELRO to before RX, and merge it with R segment. This is what you
    said above.
  2. Move RELRO to before RX, but keep it as a separate segment. This is
    what I implemented in my test.
    As I mentioned in my reply to Peter, option 1 would allow existing
    implementations to take advantage of this without any change. While I think
    this optimization is well worth it, if we go with option 1, the dynamic
    linkers won’t have a choice to keep RO separate if they want to for
    whatever reason (e.g. less VM commit, finer granularity in VM maps, not
    wanting to have RO as writable even if for a short while.) So there’s a
    trade-off to be made here (or an option to be added, even though we all
    want to avoid that if we can.)

Then you probably meant:

Old: R RX RW(RELRO) RW
New: R | RW(RELRO) RX RW
Runtime (before relocation resolving): R RW RX RW
Runtime (after relocation resolving): R R RX RW ; the two R cannot be merged

The "|" means a maxpagesize alignment. I am not sure whether you are going to add it,
because I still do not understand where the saving comes from.

If the alignment is added, the R and RW maps can get contiguous
(non-overlapping) p_offset ranges. However, the RW map is private dirty;
it cannot be merged with adjacent maps, so I am not clear how this can save kernel memory.

If the alignment is not added, the two maps will get overlapping p_offset ranges.

My test showed an overall ~1MB decrease in kernel slab memory usage on
vm_area_struct, with about 150 processes running. For this to work, I had
to modify the dynamic linker:

Can you elaborate on how this decreases the kernel slab memory usage on
vm_area_struct? References to source code are very welcome :) This is
contrary to my intuition, because the second R is private dirty. The number of
VMAs does not decrease.

  1. The dynamic linker needs to make the read-only VMA briefly writable in
    order for it to have the same VM flags with the RELRO VMA so that they can
    be merged. Specifically VM_ACCOUNT is set when a VMA is made writable.

Same question. I hope you can give a bit more details.

How to lay out the segments if --no-rosegment is specified?
Runtime (after relocation resolving): RX RW; some people may be
concerned with writable stuff (the relocated part) being made executable
Indeed, I think the weakening of the security aspect may be a problem if we are
to merge RELRO into RX. Keeping the old layout would be
preferable IMHO.

This means the new layout conflicts with --no-rosegment.
In Driver.cpp, there should be a “… cannot be used together” error.

Another problem is that in the default -z relro -z lazy (-z now not
specified) layout, .got and .got.plt will be separated by potentially huge
code sections (e.g. .text). I’m still thinking what problems this layout
change may bring.

Not sure if this is the same issue as what you mentioned here, but I also
see a comment in lld/ELF/Writer.cpp about how .rodata and .eh_frame should
be as close to .text as possible due to fear of relocation overflow. If we
go with option 2 above, the distance would have to be made larger. With
option 1, we may still have some leeway in how to order sections within the
merged RELRO segment.

For huge executables (>2G or 3G), it may cause relocation overflows
between .text and .rodata if other large sections like .dynsym and .dynstr are
placed in between.

I do not worry too much about overflows potentially caused by moving
PT_GNU_RELRO around. PT_GNU_RELRO is usually less than 10% of the size of the
RX PT_LOAD.

This would be a somewhat tedious change (especially the part about having
to update all the unit tests), but the benefit is pretty good, especially
considering the kernel slab memory is not swappable/evictable. Please let
me know your thoughts!

Definitely! I have prototyped this and found that ~260 tests will need their addresses updated…

I am not convinced by this change. With current hardware, to make any mapping more efficient, you need both the virtual to physical translation and the permissions to be the same.

Anything that is writeable at any point will be a CoW mapping that, when written, will be replaced by a different page. Anything that is not ever writeable will be the same physical pages. This means that the old order is (S for shared, P for private):

S S P P

The new order is:

S P S P P

This means that the translation for the shared part is *definitely* not contiguous. Modern architectures currently (though not necessarily indefinitely) conflate protection and translation and so both versions require the same number of page table and TLB entries.

This, however, is true only when you think about single-level translation. When you consider nested paging in a VM, things get more complex because the translation is a two-stage lookup and the protection is based on the intersection of the permissions at each level.

The hypervisor will typically try to use superpages for the second-level translation and so both of the shared pages have a high probability of hitting in the same PTE for the second-level translation. The same is true for the RW and RELRO segments, because they will be allocated at the same time and any OS that does transparent superpage promotion (I think Linux does now? FreeBSD has for almost a decade) will therefore try to allocate contiguous physical memory for the mappings if possible.

I would expect your scheme to translate to more memory traffic from page-table walks in any virtualised environment, and I don't see (given that you have increased address space fragmentation) where you are seeing a saving. With RELRO as part of RW, the kernel is free to split and recombine adjacent VM objects; with the new layout it is not able to combine adjacent objects, because they are backed by different storage.

David

Old: R RX RW(RELRO) RW
New: R(R+RELRO) RX RW; R includes the traditional R part and the
RELRO part
Runtime (before relocation resolving): RW RX RW
Runtime (after relocation resolving): R RX RW

I actually see two ways of implementing this, and yes what you mentioned
here is one of them:

  1. Move RELRO to before RX, and merge it with R segment. This is what you
    said above.
  2. Move RELRO to before RX, but keep it as a separate segment. This is
    what I implemented in my test.
    As I mentioned in my reply to Peter, option 1 would allow existing
    implementations to take advantage of this without any change. While I think
    this optimization is well worth it, if we go with option 1, the dynamic
    linkers won’t have a choice to keep RO separate if they want to for
    whatever reason (e.g. less VM commit, finer granularity in VM maps, not
    wanting to have RO as writable even if for a short while.) So there’s a
    trade-off to be made here (or an option to be added, even though we all
    want to avoid that if we can.)

Then you probably meant:

Old: R RX RW(RELRO) RW
New: R | RW(RELRO) RX RW
Runtime (before relocation resolving): R RW RX RW
Runtime (after relocation resolving): R R RX RW ; the two R cannot be merged

means a maxpagesize alignment. I am not sure whether you are going to add it
because I still do not understand where the saving comes from.

If the alignment is added, the R and RW maps can get contiguous
(non-overlapping) p_offset ranges. However, the RW map is private dirty,
it cannot be merged with adjacent maps so I am not clear how it can save kernel memory.

My understanding (and my test results bear this out) is that two VMAs can be merged even when one of them contains dirty pages. As far as I can tell from reading vma_merge() in mm/mmap.c in the Linux kernel, there’s nothing preventing merging consecutively mmapped regions in that case. That said, we may not care about this case too much if we decide that this change should be put behind a flag, because in that case I think we can just go with option 1.

If the alignment is not added, the two maps will get overlapping p_offset ranges.

My test showed an overall ~1MB decrease in kernel slab memory usage on
vm_area_struct, with about 150 processes running. For this to work, I had
to modify the dynamic linker:

Can you elaborate on how this decreases the kernel slab memory usage on
vm_area_struct? References to source code are very welcome :) This is
contrary to my intuition, because the second R is private dirty. The number of
VMAs does not decrease.

In mm/mprotect.c, merging is done in mprotect_fixup(), which calls vma_merge() to do the actual work. In the same function you can also see that the VM_ACCOUNT flag is set for a writable VMA, which is why I had to modify the dynamic linker to make the R segment temporarily writable for it to be mergeable with RELRO (they need to have the same flags to be merged). Again, IMO all these somewhat indirect manipulations of VMAs were because I was hoping to give the dynamic linker the option of choosing whether to take advantage of this or not. If for any reason we put this behind a build-time flag, there’s no reason to jump through these hoops instead of just going with option 1.

  1. The dynamic linker needs to make the read-only VMA briefly writable in
    order for it to have the same VM flags with the RELRO VMA so that they can
    be merged. Specifically VM_ACCOUNT is set when a VMA is made writable.

Same question. I hope you can give a bit more details.

How to lay out the segments if --no-rosegment is specified?
Runtime (after relocation resolving): RX RW; some people may be
concerned with writable stuff (the relocated part) being made executable
Indeed, I think the weakening of the security aspect may be a problem if we are
to merge RELRO into RX. Keeping the old layout would be
preferable IMHO.

This means the new layout conflicts with --no-rosegment.
In Driver.cpp, there should be a “… cannot be used together” error.

Another problem is that in the default -z relro -z lazy (-z now not
specified) layout, .got and .got.plt will be separated by potentially huge
code sections (e.g. .text). I’m still thinking what problems this layout
change may bring.

Not sure if this is the same issue as what you mentioned here, but I also
see a comment in lld/ELF/Writer.cpp about how .rodata and .eh_frame should
be as close to .text as possible due to fear of relocation overflow. If we
go with option 2 above, the distance would have to be made larger. With
option 1, we may still have some leeway in how to order sections within the
merged RELRO segment.

For huge executables (>2G or 3G), it may cause relocation overflows
between .text and .rodata if other large sections like .dynsym and .dynstr are
placed in between.

I do not worry too much about overflows potentially caused by moving
PT_GNU_RELRO around. PT_GNU_RELRO is usually less than 10% of the size of the
RX PT_LOAD.

That’s good to know!


> > This is an RFC for moving RELRO segment. Currently, lld orders ELF
> > sections in the following order: R, RX, RWX, RW, and RW contains RELRO.
> > At run time, after RELRO is write-protected, we'd have VMAs in the order
> > of: R, RX, RWX, R (RELRO), RW. I'd like to propose that we move RELRO to
> > be immediately after the read-only sections, so that the order of VMAs
> > become: R, R (RELRO), RX, RWX, RW, and the dynamic linker would have the
> > option to merge the two read-only VMAs to reduce bookkeeping costs.
> 
> I am not convinced by this change.  With current hardware, to make any 
> mapping more efficient, you need both the virtual to physical 
> translation and the permissions to be the same.
> 
> Anything that is writeable at any point will be a CoW mapping that, when 
> written, will be replaced by a different page.  Anything that is not 
> ever writeable will be the same physical pages.  This means that the old 
> order is (S for shared, P for private):
> 
> S S P P
> 
> The new order is:
> 
> S P S P P
> 
> This means that the translation for the shared part is *definitely* not 
> contiguous.  Modern architectures currently (though not necessarily 
> indefinitely) conflate protection and translation and so both versions 
> require the same number of page table and TLB entries.
> 
> This; however, is true only when you think about single-level 
> translation.  When you consider nested paging in a VM, things get more 
> complex because the translation is a two-stage lookup and the protection 
> is based on the intersection of the permissions at each level.
> 
> The hypervisor will typically try to use superpages for the second-level 
> translation and so both of the shared pages have a high probability of 
> hitting in the same PTE for the second-level translation.  The same is 
> true for the RW and RELRO segments, because they will be allocated at 
> the same time and any OS that does transparent superpage promotion (I 
> think Linux does now?  FreeBSD has for almost a decade) will therefore 
> try to allocate contiguous physical memory for the mappings if possible.
> 
> I would expect your scheme to translate to more memory traffic from 
> page-table walks in any virtualised environment and I don't see (given 
> that you have increased address space fragmentation) where you are 
> seeing a saving.  With RELRO as part of RW, the kernel is free to split 
> and recombine adjacent VM objects, with the new layout it is not able to 
> combine adjacent objects because they are backed by different storage.
> 
> David
Indeed I did not think about this case. Thanks for pointing this out! I agree that with superpages this can result in worse performance and memory usage. Perhaps we can consider putting this change behind a build time flag? As much as I'd like to avoid adding flags, it seems to me from this thread that there are some real world cases that benefit from this change and some that suffer.

Vic

Indeed I did not think about this case. Thanks for pointing this out! I agree that with superpages this can result in worse performance and memory usage. Perhaps we can consider putting this change behind a build time flag? As much as I'd like to avoid adding flags, it seems to me from this thread that there are some real-world cases that benefit from this change and some that suffer.

If “build time” in the above sentence means a build-time configuration of the linker (i.e. changing a linker default setting when lld is configured and built), we don’t have that kind of configuration in lld at all, and that is (I believe) considered a good thing. As long as two lld binaries are of the same version, they behave exactly the same no matter how or on what OS they were built. So, if we need to make this configurable, we should add it as a linker flag.

If the proposed new layout works better than the current one in a non-virtualized environment and behaves poorly in a virtualized environment, that is a tricky situation. We usually run the exact same OS and applications in both environments, so we have to choose one. Perhaps we should first verify that the performance degradation on a VM is not hypothetical but real?


Indeed I did not think about this case. Thanks for pointing this out! I agree that with superpages this can result in worse performance and memory usage. Perhaps we can consider putting this change behind a build time flag? As much as I'd like to avoid adding flags, it seems to me from this thread that there are some real-world cases that benefit from this change and some that suffer.

If “build time” in the above sentence means a build-time configuration of the linker (i.e. changing a linker default setting when lld is configured and built), we don’t have that kind of configuration in lld at all, and that is (I believe) considered a good thing. As long as two lld binaries are of the same version, they behave exactly the same no matter how or on what OS they were built. So, if we need to make this configurable, we should add it as a linker flag.

Yes, a linker flag is what I meant. (i.e. same lld binary)

If the proposed new layout works better than the current one in a non-virtualized environment and behaves poorly in a virtualized environment, that is a tricky situation. We usually run the exact same OS and applications in both environments, so we have to choose one. Perhaps we should first verify that the performance degradation on a VM is not hypothetical but real?

Agreed. I’ll see if I can figure out how to verify this. If anyone has any pointers on how I can get a VM to use huge pages, that’d be most welcome.

At the same time, I’d also like to point out that there are cases where a shared library is only used in a non-virtualized environment and never in a virtualized one. For example, Android has different build targets for real devices vs. virtual devices, and if we end up with the linker flag, one can enable it on real devices and keep the current behavior on virtual devices if their performance is a concern.