[RFC] Embedded bitcode and related upstream (Part II)

Hi everyone,

I am still in the process of upstreaming some improvements to the embed bitcode option. If you want more background, you can read the previous RFC (http://lists.llvm.org/pipermail/llvm-dev/2016-February/094851.html). This is part II of the discussion.

Current Status:
A basic version of the -fembed-bitcode option is upstreamed and functioning.
You can use the -fembed-bitcode={off, all, bitcode, marker} option to control what gets embedded in the final object file output:
off: default, nothing gets embedded.
all: optimized bitcode and command line options get embedded in the object file.
bitcode: only optimized bitcode is embedded.
marker: only a marker is put in the object file.

What needs to be improved:

  1. Whitelist for command line options that can be used with bitcode:
    The current trunk implementation embeds all the cc1 command line options (including header include paths, warning flags and other front-end options) in the command line section. That is a lot of redundant information. Most of these options are useless for re-creating the object file from the embedded optimized bitcode, and they can also leak information about the source code. One solution is to keep a list of all the options that can affect code generation but are not encoded in the bitcode. I have internally prototyped this by disallowing these options explicitly and allowing only the remainder of the options to be embedded (http://reviews.llvm.org/D17394). A better solution might be to encode that information in “Options.td” as a specific group.

  2. Assembly input handling:
    This is a workaround to allow source code written in assembly to work with the “-fembed-bitcode” option. When compiling assembly source code with “-fembed-bitcode”, clang-as creates an empty section “__LLVM, __asm” in the object file. That is just a way to distinguish object files compiled from assembly source from those compiled from higher-level source code without the “-fembed-bitcode” option. The linker can use this section to diagnose whether “-fembed-bitcode” is used consistently on all the object files participating in the link.

  3. Bitcode symbol hiding:
    There were some concerns about leaking source code information when using the bitcode feature. One approach to avoid the leak is to add a pass which renames all the globals and metadata strings. The pass also keeps a reverse map in case the original names need to be recovered. The final bitcode should contain no more symbols or debug info than a stripped binary. To make sure the modified bitcode can still be linked correctly, the renaming needs to be consistent across all bitcode participating in the link, and everything that is external to the linkage unit needs to be preserved. This means the pass can only be run during linking and requires some LTO API.

  4. Debug info strip to line-tables pass:
    As the name suggests, this pass strips the full debug info down to line tables only. This is also one of the steps we took to prevent the leak of source code information in bitcode.

Please let me know what you think about the pieces above, or if you have any concerns about the methodology. I will put up patches for review soon.

Thanks

Steven

Hi Steven,

Great to see the commentary and updates here. I’ve got a few questions about some of this work. It might be nice to see some separate RFCs for a couple of things, but we’ll probably figure that out after you send out patches :)

What needs to be improved:

  1. Whitelist for command line options that can be used with bitcode:
    [...]

This is really interesting. I’m not a particularly security-minded person, so I don’t have a lot of commentary there. An explicit whitelist sounds a bit painful to maintain; explicitly having a group in Options.td sounds pretty nice. You’ll need to add the options to multiple groups, but it still seems pretty nice.

  2. Assembly input handling:
    [...]

I’m surprised you want a separate and empty section and not a header flag, as those are easier to keep around and won’t take up a precious Mach-O section. There are probably other options here as well, and other concerns that someone shipping bitcode might have, but I’m sure those are being talked about. It doesn’t have too much effect on the community though.

  3. Bitcode symbol hiding:
    [...]

How are you planning to ensure the safety of the reverse map? It seems that requiring linking is a bit icky, but it might work. Are you mostly worried about function names that could be stripped out? What LTO API are you envisioning here?

  4. Debug info strip to line-tables pass:
    [...]

I’m very curious about what’s going on here. Could you elaborate? :)

Thanks a ton for the update - glad to see this being worked on!

-eric

Thanks for the feedback! Replies inline.

  1. Whitelist for command line options that can be used with bitcode:
    [...]

I have already implemented the new approach in http://reviews.llvm.org/D21230. It creates a new group for all the cc1 options that can affect codegen but do not have a corresponding attribute in the bitcode. While writing this patch, I thought it would also be a good idea to extend the group to driver flags, so the clang driver can issue warnings when these flags are used with LTO, because they are likely to be dropped in the process. That is the next thing I will do if someone reviews my patch and agrees that it is the right thing to do.
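
To make the idea of a codegen-only option group concrete, here is a minimal Python sketch of the filtering step. It is not the code in D21230: the `CODEGEN_ONLY_OPTIONS` table, the option names in it, and the `filter_cc1_args` helper are all hypothetical, and deciding which real cc1 options belong in the group is exactly what the patch has to do.

```python
# Hypothetical stand-in for an Options.td group: options that affect code
# generation but have no corresponding attribute in the bitcode.
# Maps option name -> number of separate arguments the option consumes.
CODEGEN_ONLY_OPTIONS = {
    "-example-codegen-flag": 0,   # illustrative name, not a real cc1 option
    "-example-codegen-value": 1,  # illustrative name, takes one argument
}


def filter_cc1_args(args):
    """Return only the arguments that belong to the codegen-only group.

    Everything else (include paths, warning flags, and their values) is
    dropped, so it never reaches the embedded command line section.
    A real implementation would use the driver's option table instead of
    matching strings, which also avoids mistaking a value for an option.
    """
    kept = []
    i = 0
    while i < len(args):
        arity = CODEGEN_ONLY_OPTIONS.get(args[i])
        if arity is None:
            i += 1  # not in the group: drop it
            continue
        kept.extend(args[i:i + 1 + arity])
        i += 1 + arity
    return kept


cc1_args = ["-I", "/src/include", "-Wall",
            "-example-codegen-flag", "-example-codegen-value", "7"]
print(filter_cc1_args(cc1_args))
```

The same table could in principle drive the driver warning mentioned above: any option found in the group while LTO is enabled would be a candidate for a diagnostic.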

  2. Assembly input handling:
    [...]

I suppose you mean the alternative is to burn a Mach-O command for that. Well, that is a limited resource and we don’t have much left. Plus, using an empty section makes this accessible to other binary formats, not only Mach-O files. I also had an interesting thought about handling the assembly, which is to wrap it in module assembly in a bitcode file. I am not sure it would preserve all the semantics of the original assembly, and it would mean I need to somehow teach the assembler about bitcode (which might make this not very attractive). Yes, you might be right that this doesn’t affect the community. If no one else is interested in a solution for the problem we have, then this might not be suitable for contributing. I am happy to keep it downstream.
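
The consistency check that the marker section enables could look roughly like the following. This is a Python sketch rather than linker code: the `classify` and `find_inconsistent_objects` helpers are hypothetical, each object file is modeled simply as a set of section names, and `__LLVM,__bitcode` is assumed to be the section that holds the embedded bitcode.

```python
BITCODE_SECTION = "__LLVM,__bitcode"  # assumed name of the embedded bitcode section
ASM_MARKER_SECTION = "__LLVM,__asm"   # empty marker section emitted for assembly input


def classify(sections):
    """Classify one object file by the sections it contains."""
    if BITCODE_SECTION in sections:
        return "bitcode"
    if ASM_MARKER_SECTION in sections:
        return "assembly"
    return "missing"


def find_inconsistent_objects(objects):
    """Return the object files that were built without -fembed-bitcode.

    `objects` maps an object file name to the set of its section names.
    An object with neither the bitcode section nor the assembly marker
    cannot be told apart from one compiled without the option.
    """
    return sorted(name for name, sections in objects.items()
                  if classify(sections) == "missing")


objects = {
    "a.o": {"__TEXT,__text", "__LLVM,__bitcode"},  # compiled from C with the option
    "b.o": {"__TEXT,__text", "__LLVM,__asm"},      # compiled from assembly with the option
    "c.o": {"__TEXT,__text"},                      # option was forgotten
}
print(find_inconsistent_objects(objects))
```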

  3. Bitcode symbol hiding:
    [...]

The reverse map is emitted as a separate file from the output binary/bitcode. It should not be shipped together with the binary output, just like a dSYM bundle.
The reason it needs to be done after linking is a limitation of the symbol hiding technique: it requires that the symbols be resolved. Consider the following case:

a.o:
  T export_symbol
  T global_symbol
  t local_symbol
b.o:
  U global_symbol

To make sure the bitcode can still be linked and produce the same output after the symbol hiding pass, the pass needs to rename them as follows:

a.o:
  T export_symbol → export_symbol (preserve)
  T global_symbol → hidden_symbol_1 (rename, but it needs to have the same name as the one in b.o)
  t local_symbol → hidden_symbol_2 (rename, but we don’t care what it becomes)
b.o:
  U global_symbol → hidden_symbol_1

The pass needs to know which symbols to keep, and it needs a global renaming table so that the names after renaming are consistent across all the modules.
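
The renaming rules in this example can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the actual pass: the `hide_symbols` function and its nm-style input format are hypothetical, and symbols that are undefined in every module are assumed to be external to the linkage unit and therefore preserved.

```python
def hide_symbols(modules, exported):
    """Rename symbols consistently across all modules of one linkage unit.

    `modules` maps a module name to a list of (kind, symbol) pairs in
    nm style: 'T' is a defined global, 't' a defined local, 'U' undefined.
    `exported` is the set of symbols that must keep their names.
    Returns the renamed modules and the reverse map (hidden -> original).
    """
    # Symbols defined somewhere in the linkage unit; anything else that is
    # referenced must be external and has to keep its name.
    defined = {sym for syms in modules.values()
               for kind, sym in syms if kind == "T"}
    global_names = {}  # global renaming table shared by all modules
    reverse_map = {}   # emitted separately, never shipped with the output

    def fresh(original):
        hidden = f"hidden_symbol_{len(reverse_map) + 1}"
        reverse_map[hidden] = original
        return hidden

    renamed = {}
    for name, syms in modules.items():
        out = []
        for kind, sym in syms:
            if kind == "t":
                new = fresh(sym)  # local: any unique name will do
            elif sym in exported or sym not in defined:
                new = sym         # exported or external: preserve
            else:
                if sym not in global_names:
                    global_names[sym] = fresh(sym)
                new = global_names[sym]  # same name in every module
            out.append((kind, new))
        renamed[name] = out
    return renamed, reverse_map


modules = {
    "a.o": [("T", "export_symbol"), ("T", "global_symbol"), ("t", "local_symbol")],
    "b.o": [("U", "global_symbol")],
}
renamed, reverse_map = hide_symbols(modules, exported={"export_symbol"})
print(renamed)
print(reverse_map)
```

The sketch needs to see every module before it can decide whether an undefined symbol is resolved inside the linkage unit, which is the reason the pass has to run at link time.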

  4. Debug info strip to line-tables pass:
    [...]

Cc Adrian
He would know more about it. I only know that it can reconstruct -gline-tables-only debug info from full debug info. We use it as a part of the bitcode pipeline because we don’t want the bitcode file to be exceedingly large, but I can see this pass being useful in other circumstances.

Steven

  1. Whitelist for command line options that can be used with bitcode:
    [...]

Honestly I think it’s better if we implement almost everything in the bitcode that doesn’t involve a pass manager configuration - and then make that more explicit as far as library calls. Same with TargetMachine configuration etc. What do you think?

  2. Assembly input handling:
    [...]

You could use a section for ELF. I’m not sure what the right thing to do for PE would be, but it’ll either be a section or a command-like thing. Wrapping the asm in module assembly might work, though you’ll need to make some (hopefully all programmatic) textual changes to keep it working. You might be able to teach clang how to do that.

  3. Bitcode symbol hiding:
    [...]

The reverse map is emitted as a separate file from the output binary/bitcode.

That makes some sense.

[...]

Right. That’s where I thought you were going. Would this instead be better implemented as a tool/library that could “anonymize” a bitcode file? I realize it wouldn’t “work” for shipping to the store if you ran it on bitcode before it’s been linked, but it might be a good way for people to submit bitcode files to us as well. Just throwing ideas out here :)

  4. Debug info strip to line-tables pass:
    [...]

:)

-eric

3. Bitcode symbol hiding:

[...]

4. Debug info strip to line-tables pass:

[...]

IIRC, removing the type information and flattening the scope tree also helps to prevent leaking information about the source file, as part of (3) above.

Yes, I would agree with that as well.

-eric

  4. Debug info strip to line-tables pass:
    [...]

It’s a pass that aims at downgrading full debug info to -gline-tables-only debug info. It performs a deep copy of the debug metadata graph, removing all types and variables while leaving all locations, subprograms, and inline information intact. It then makes the IR point to the newly created stripped subprograms, CU, and locations.
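
As a rough illustration of that copy-and-repoint scheme, here is a toy model in Python. It does not use any LLVM API: the metadata graph is modeled as a dictionary of hypothetical node ids, the node kinds and the `strip_to_line_tables` helper are made up for the example, and real debug metadata has many more node kinds and edge types than this.

```python
DROPPED_KINDS = {"type", "variable"}  # removed when downgrading to line tables


def strip_to_line_tables(nodes, instr_locs):
    """Copy the metadata reachable from instruction locations, minus types and variables.

    `nodes` maps a node id to (kind, [operand ids]).
    `instr_locs` maps an instruction name to the id of its location node.
    Returns the stripped graph and the instructions re-pointed at the copies.
    """
    copies = {}    # old id -> new id, also guards against cycles
    stripped = {}  # the newly created graph

    def copy(node_id):
        if node_id in copies:
            return copies[node_id]
        kind, operands = nodes[node_id]
        new_id = f"{node_id}.stripped"
        copies[node_id] = new_id  # register before recursing
        new_operands = []
        stripped[new_id] = (kind, new_operands)
        for op in operands:
            if nodes[op][0] not in DROPPED_KINDS:
                new_operands.append(copy(op))
        return new_id

    # Make the IR point to the stripped locations.
    new_locs = {instr: copy(loc) for instr, loc in instr_locs.items()}
    return stripped, new_locs


nodes = {
    "cu": ("compile_unit", []),
    "int": ("type", []),
    "f": ("subprogram", ["cu", "int"]),
    "x": ("variable", ["f", "int"]),
    "loc": ("location", ["f"]),
}
stripped, new_locs = strip_to_line_tables(nodes, {"%1": "loc"})
print(stripped)
print(new_locs)
```

In this model the variable `x` disappears because nothing that survives refers to it, and the subprogram `f` loses its edge to the type node while keeping its compile unit.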

– adrian

Thanks for the feedback! Replies inline.

Hi Steven,

Great to see the commentary and updates here. I’ve got a few questions about some of this work. It might be nice to see some separate RFCs for a couple of things, but we’ll figure that out after you send out patches probably :slight_smile:

What needs to be improved:

  1. Whitelist for command line options that can be used with bitcode:
    Current trunk implementation embeds all the cc1 command line options (that includes header include paths, warning flags and other front-end options) in the command line section. That is lot of redundant information. To re-create the object file from the embedded optimized bitcode, most of these options are useless. On the other hand, they can leak information of the source code. One solution will be keeping a list of all the options that can affect code generation but not encoded in the bitcode. I have internally prototyped with disallowing these options explicitly and allowed only the reminder of the options to be embedded (http://reviews.llvm.org/D17394). A better solution might be encoding that information in “Options.td” as specific group.

This is really interesting. I’m not a particularly security minded person so I don’t have a lot of commentary there. An explicit whitelist sounds a bit painful to keep maintained, explicitly having a group in Options.td sounds pretty nice. You’ll need to add them to multiple groups, but it seems pretty nice.

I have already implemented the new approach in http://reviews.llvm.org/D21230. It creates a new group for all the cc1 options that can affect codegen but not having a corresponding attribute in the bitcode. When I wrote up this patch, I think it is also a good idea to extend the group to driver flags so clang driver can issue warnings when using these flags with LTO because they are likely to be dropped in the process. That is my next thing to do if someone reviews my patch and agrees that is right thing to do.

  1. Assembly input handling:
    This is a workaround to allow source code written in assembly to work with “-fembed-bitcode” options. When compiling assembly source code with “-fembed-bitcode”, clang-as creates an empty section “__LLVM, __asm” in the object file. That is just a way to distinguish object files compiled from assembly source from those compiled from higher level source code but forgot to use “-fembed-bitcode” options. Linker can use this section to diagnose if “-fembed-bitcode” is consistently used on all the object files participated in the linking.

I’m surprised you want a separate and empty section and not a header flag as those are easier to keep around and won’t take up a precious mach-o section. There are probably other options here as well. There are probably other options or concerns that someone shipping bitcode might have here as well, but I’m sure those are being talked about - doesn’t have too much affect on the community though.

I suppose you mean the alternative is to burn a macho command for that. Well, that is a limited resource and we don’t have much left. Plus, using empty section will make this accessible to other binary format, not only macho files. I also have an interesting thought about handle the assembly, that is to wrap it in module assembly in a bitcode file. I am not sure it would preserve the all semantics of the original assembly and that would mean I need to somehow teach the assembler about bitcode (which might make this not very attractive). Yes, you might be right this doesn’t affect the community, if no one else is interesting in a solution for the problem we have, then this might not be suitable for contributing. I am happy to keep it downstream.

  1. Bitcode symbol hiding:
    There was some concerns for leaking source code information when using bitcode feature. One approach to avoid the leak is to add a pass which renames all the globals and metadata strings. The also keeps a reverse map in case the original name needs to be recovered. The final bitcode should contain no more symbols or debug info than a stripped binary. To make sure modified bitcode can still be linked correctly, the renaming need to be consistent across all bitcode participated in the linking and everything that is external of the linkage unit need to be preserved. This means the pass can only be run during the linking and requires some LTO api.

How are you planning to ensure the safety of the reverse map? Seems that requiring linking is a bit icky, but might work. Are you mostly worried about function names that could be stripped out? What LTO api are you envisioning here?

The reverse map is emitted as a separate file from the output binary/bitcode. It should not be shipped together with the binary output, just like a dSYM bundle.
The reason it needs to be done after linking is a limitation of the symbol hiding technique. It requires the symbols to be resolved. Think about the following case:
a.o:
T export_symbol
T global_symbol
t local_symbol
b.o:
U global_symbol

To make sure the bitcode after the symbol hiding pass can still link and produce the same output, the pass needs to rename them:

a.o:
T export_symbol → export_symbol (preserve)
T global_symbol → hidden_symbol_1 (rename, but needs to have the same name as the one in b.o)
t local_symbol → hidden_symbol_2 (rename, but don’t care what it becomes)
b.o:
U global_symbol → hidden_symbol_1

The pass needs to know which symbols to keep, plus a global renaming table so the names after renaming are consistent across all the modules.
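
To make the a.o/b.o example concrete, here is a minimal sketch of a renaming step that shares one table across modules. It is not the actual pass: `build_rename_table` and its inputs are hypothetical, symbols are modelled as nm-style `(kind, name)` pairs, and the preserve set is assumed to come from the linker.

```python
def build_rename_table(modules, preserve):
    """Rename symbols consistently across modules.

    modules:  dict of module name -> list of (kind, symbol), nm-style
              ("T" global definition, "t" local definition, "U" undefined).
    preserve: symbols that must keep their names (e.g. exported ones).
    Returns (renamed_modules, reverse_map).
    """
    # Globals defined somewhere in the linkage unit; an undefined reference
    # to anything else is external and has to keep its name.
    defined_global = {
        sym for syms in modules.values() for kind, sym in syms if kind == "T"
    }
    table = {}    # original global name -> hidden name, shared by all modules
    reverse = {}  # hidden name -> original name (kept out of the shipped output)
    counter = 0
    renamed = {}
    for mod in sorted(modules):
        out = []
        for kind, sym in modules[mod]:
            external = kind == "U" and sym not in defined_global
            if sym in preserve or external:
                out.append((kind, sym))
                continue
            if kind == "t":
                # Local symbol: any fresh name will do.
                counter += 1
                new = f"hidden_symbol_{counter}"
            else:
                # Global symbol: every module must agree on the new name.
                if sym not in table:
                    counter += 1
                    table[sym] = f"hidden_symbol_{counter}"
                new = table[sym]
            reverse[new] = sym
            out.append((kind, new))
        renamed[mod] = out
    return renamed, reverse
```

Run on the two modules above with `export_symbol` preserved, this gives `global_symbol` the same hidden name in a.o and b.o, which is the property that requires the symbols to be resolved first.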

  4. Debug info strip to line-tables pass:
    As the name suggests, this pass strips the full debug info down to line tables only. This is also one of the steps we took to prevent leaking source code information in bitcode.

I’m very curious about what’s going on here. Could you elaborate? :slight_smile:

Cc Adrian
He would know more about it. I only know that it can reconstruct -gline-tables-only debug info from full debug info. We use it as part of the bitcode pipeline because we don’t want the bitcode file to be exceedingly large, but I can see this pass being useful in other circumstances.

It’s a pass that aims at downgrading full debug info to -gline-tables-only debug info. It performs a deep copy of the debug metadata graph, removing all types and variables while leaving all locations, subprograms, and inline information intact. It then makes the IR point to the newly created stripped subprograms, CU, and locations.
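
As a rough illustration of the deep copy described above, the toy model below copies a metadata graph while dropping everything except compile units, subprograms, and locations. It is a sketch only: the `Node` class and the kind names are invented for this example and do not correspond to the real debug metadata classes.

```python
from dataclasses import dataclass, field

# Kinds that survive the downgrade in this toy model (names are assumptions).
KEPT_KINDS = {"compile_unit", "subprogram", "location"}


@dataclass
class Node:
    """Hypothetical metadata node: a kind, a label, and operand references."""
    kind: str
    name: str = ""
    operands: list = field(default_factory=list)


def strip_to_line_tables(node, memo=None):
    """Deep-copy the graph reachable from node, omitting types and variables."""
    if memo is None:
        memo = {}
    if node.kind not in KEPT_KINDS:
        return None
    if id(node) in memo:
        return memo[id(node)]
    copy = Node(node.kind, node.name)
    memo[id(node)] = copy  # register before recursing so cycles terminate
    for op in node.operands:
        stripped = strip_to_line_tables(op, memo)
        if stripped is not None:
            copy.operands.append(stripped)
    return copy


def reachable_kinds(node, seen=None):
    """Collect the set of node kinds reachable from node."""
    seen = set() if seen is None else seen
    if id(node) in seen:
        return set()
    seen.add(id(node))
    kinds = {node.kind}
    for op in node.operands:
        kinds |= reachable_kinds(op, seen)
    return kinds
```

The original graph is left untouched; the caller would then repoint the IR at the stripped copies, as the description above says.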

Sounds like what I’d expect.

How are you planning on hooking all of these up?

-eric

Thanks for the feedback! Replies inline.

Hi Steven,

Great to see the commentary and updates here. I’ve got a few questions about some of this work. It might be nice to see some separate RFCs for a couple of things, but we’ll figure that out after you send out patches probably :slight_smile:

What needs to be improved:

  1. Whitelist for command line options that can be used with bitcode:
    Current trunk implementation embeds all the cc1 command line options (that includes header include paths, warning flags and other front-end options) in the command line section. That is a lot of redundant information. To re-create the object file from the embedded optimized bitcode, most of these options are useless. On the other hand, they can leak information about the source code. One solution would be keeping a list of all the options that can affect code generation but are not encoded in the bitcode. I have internally prototyped disallowing these options explicitly and allowing only the remainder of the options to be embedded (http://reviews.llvm.org/D17394). A better solution might be encoding that information in “Options.td” as a specific group.

This is really interesting. I’m not a particularly security-minded person so I don’t have a lot of commentary there. An explicit whitelist sounds a bit painful to keep maintained; explicitly having a group in Options.td sounds pretty nice. You’ll need to add them to multiple groups, but it seems pretty nice.

I have already implemented the new approach in http://reviews.llvm.org/D21230. It creates a new group for all the cc1 options that can affect codegen but do not have a corresponding attribute in the bitcode. While writing this patch, I thought it would also be a good idea to extend the group to driver flags so the clang driver can issue warnings when these flags are used with LTO, because they are likely to be dropped in the process. That is my next thing to do if someone reviews my patch and agrees that is the right thing to do.
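
The effect of such a group on the embedded command line can be sketched as a simple filter. This is only an illustration under assumptions: the option names in `EMBED_GROUP` are placeholders rather than real cc1 options, and `filter_cc1_args` is a hypothetical helper, not code from the patch.

```python
# Placeholder for the Options.td group: option name -> number of separate
# value arguments that follow it. Both entries are hypothetical.
EMBED_GROUP = {
    "-example-codegen-flag": 0,
    "-example-codegen-value": 1,
}


def filter_cc1_args(args, group=EMBED_GROUP):
    """Keep only options in the group (with their values); drop the rest.

    Include paths, warning flags and other front-end options are dropped
    simply because they are not members of the group.
    """
    kept = []
    i = 0
    while i < len(args):
        arity = group.get(args[i])
        if arity is None:
            i += 1  # not in the group: skip this argument
            continue
        kept.extend(args[i:i + 1 + arity])
        i += 1 + arity
    return kept
```

For a command line such as `-I /secret/include -Wall -example-codegen-value soft`, only the last option and its value would end up in the command line section.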

Honestly I think it’s better if we implement almost everything in the bitcode that doesn’t involve a pass manager configuration - and then make that more explicit as far as library calls. Same with TargetMachine configuration etc. What do you think?

I believe that is always the ultimate goal, but it is not a short-term task, and every time I go through the list of TargetMachine configuration options, there are new flags added.
The easiest solution might be to add all of them to the module flags, but we probably need to go through them one by one to see if that is appropriate and will not conflict with function multiversioning. We also need tests to make sure the flags are properly encoded in the bitcode.

  2. Assembly input handling:
    This is a workaround to allow source code written in assembly to work with the “-fembed-bitcode” option. When compiling assembly source code with “-fembed-bitcode”, clang-as creates an empty section “__LLVM, __asm” in the object file. That is just a way to distinguish object files compiled from assembly source from those compiled from higher-level source code where “-fembed-bitcode” was forgotten. The linker can use this section to diagnose whether “-fembed-bitcode” is used consistently on all the object files participating in the link.

I’m surprised you want a separate and empty section and not a header flag, as those are easier to keep around and won’t take up a precious mach-o section. There are probably other options or concerns that someone shipping bitcode might have here as well, but I’m sure those are being talked about - it doesn’t have too much effect on the community though.

I suppose you mean the alternative is to burn a macho command for that. Well, that is a limited resource and we don’t have much left. Plus, using an empty section will make this accessible to other binary formats, not only macho files. I also have an interesting thought about handling the assembly, which is to wrap it in module assembly in a bitcode file. I am not sure it would preserve all the semantics of the original assembly, and that would mean I need to somehow teach the assembler about bitcode (which might make this not very attractive). Yes, you might be right that this doesn’t affect the community; if no one else is interested in a solution for the problem we have, then this might not be suitable for contributing. I am happy to keep it downstream.

You could use a section for ELF. I’m not sure what the right thing to do for PE would be, but it’ll either be a section or a command-like thing. Wrapping the asm in module assembly might work, though you’ll need to make some (hopefully all programmatic) textual changes to keep it working. You might be able to teach clang how to do that.

I will look into this.

  3. Bitcode symbol hiding:
    There were some concerns about leaking source code information when using the bitcode feature. One approach to avoid the leak is to add a pass which renames all the globals and metadata strings. The pass also keeps a reverse map in case the original name needs to be recovered. The final bitcode should contain no more symbols or debug info than a stripped binary. To make sure the modified bitcode can still be linked correctly, the renaming needs to be consistent across all bitcode participating in the link, and everything that is external to the linkage unit needs to be preserved. This means the pass can only be run during linking and requires some LTO API.

How are you planning to ensure the safety of the reverse map? Seems that requiring linking is a bit icky, but might work. Are you mostly worried about function names that could be stripped out? What LTO api are you envisioning here?

The reverse map is emitted as a separate file from the output binary/bitcode.

That makes some sense.

It should not be shipped together with the binary output, just like a dSYM bundle.
The reason it needs to be done after linking is a limitation of the symbol hiding technique. It requires the symbols to be resolved. Think about the following case:
a.o:
T export_symbol
T global_symbol
t local_symbol
b.o:
U global_symbol

To make sure the bitcode after the symbol hiding pass can still link and produce the same output, the pass needs to rename them:

a.o:
T export_symbol → export_symbol (preserve)
T global_symbol → hidden_symbol_1 (rename, but needs to have the same name as the one in b.o)
t local_symbol → hidden_symbol_2 (rename, but don’t care what it becomes)
b.o:
U global_symbol → hidden_symbol_1

The pass needs to know which symbols to keep, plus a global renaming table so the names after renaming are consistent across all the modules.

Right. That’s where I thought you were going. Would this instead be better implemented as a tool/library that could “anonymize” a bitcode file? I realize it wouldn’t “work” for shipping to the store if you ran it on bitcode before it’s been linked, but it might be a good way for people to submit bitcode files to us as well. Just throwing ideas out here :slight_smile:

Agree. It is a good tool to have for submitting bitcode bug reports, and we have MetaRenamer to achieve part of that goal. The hard part is to rename without changing the semantics of the code. Even if the renamer is running on ‘llvm-linked’ bitcode, you still need extra information to do it perfectly (info like compiler_rt symbols, system library interfaces). Some linker support will help to fix that problem. For the libLTO solution I proposed, it will be similar to the thinlto interface. It requires the equivalent of:
thinlto_create_codegen
thinlto_codegen_dispose
thinlto_codegen_add_module
thinlto_codegen_process
thinlto_codegen_add_must_preserve_symbol
And 2 more: one to look up the new name and one to write the mapping out to disk.

  4. Debug info strip to line-tables pass:
    As the name suggests, this pass strips the full debug info down to line tables only. This is also one of the steps we took to prevent leaking source code information in bitcode.

I’m very curious about what’s going on here. Could you elaborate? :slight_smile:

Cc Adrian
He would know more about it. I only know that it can reconstruct -gline-tables-only debug info from full debug info. We use it as part of the bitcode pipeline because we don’t want the bitcode file to be exceedingly large, but I can see this pass being useful in other circumstances.

Adrian, do you want me to post the patch for you?

Steven

Hi,

I hope I'm not breaking any mailing list etiquette by replying to this
mail, but if I am then please accept my apologies.

Hi everyone

I am still in the process of upstreaming some improvements to the embed
bitcode option. If you want more background, you can read the previous RFC
(http://lists.llvm.org/pipermail/llvm-dev/2016-February/094851.html). This
is part II of the discussion.

Current Status:
A basic version of -fembed-bitcode option is upstreamed and functioning.
You can use -fembed-bitcode={off, all, bitcode, marker} option to control
what gets embedded in the final object file output:
off: default, nothing gets embedded.
all: optimized bitcode and command line options gets embedded in the object
file.
bitcode: only optimized bitcode is embedded
marker: only put a marker in the object file

What needs to be improved:
1. Whitelist for command line options that can be used with bitcode:
Current trunk implementation embeds all the cc1 command line options (that
includes header include paths, warning flags and other front-end options) in
the command line section. That is a lot of redundant information. To re-create
the object file from the embedded optimized bitcode, most of these options
are useless. On the other hand, they can leak information about the source
code. One solution would be keeping a list of all the options that can affect
code generation but are not encoded in the bitcode. I have internally
prototyped disallowing these options explicitly and allowing only the
remainder of the options to be embedded (http://reviews.llvm.org/D17394). A
better solution might be encoding that information in "Options.td" as a
specific group.

2. Assembly input handling:
This is a workaround to allow source code written in assembly to work with
the "-fembed-bitcode" option. When compiling assembly source code with
"-fembed-bitcode", clang-as creates an empty section "__LLVM, __asm" in the
object file. That is just a way to distinguish object files compiled from
assembly source from those compiled from higher-level source code where
"-fembed-bitcode" was forgotten. The linker can use this section to diagnose
whether "-fembed-bitcode" is used consistently on all the object files
participating in the link.

3. Bitcode symbol hiding:
There were some concerns about leaking source code information when using
the bitcode feature. One approach to avoid the leak is to add a pass which
renames all the globals and metadata strings. The pass also keeps a reverse
map in case the original name needs to be recovered. The final bitcode should
contain no more symbols or debug info than a stripped binary. To make sure
the modified bitcode can still be linked correctly, the renaming needs to be
consistent across all bitcode participating in the link, and everything
that is external to the linkage unit needs to be preserved. This means the
pass can only be run during linking and requires some LTO API.

Regarding the symbol map, are you planning to upstream a pass that
restores the symbols? I have been trying to do this myself in order to
reverse the "BCSymbolMap". However this turned out to be less
straightforward than I'd hoped. Any info on this would be greatly
appreciated!

4. Debug info strip to line-tables pass:
As the name suggests, this pass strips the full debug info down to
line tables only. This is also one of the steps we took to prevent leaking
source code information in bitcode.

Please let me know what you think about the pieces above or if you have
any concerns about the methodology. I will put up patches for review soon.

Thanks

Steven

Cheers,
Jonas

We have tools to restore symbols in the dSYM bundle (check the dsymutil -symbol-map option in the Apple toolchain).
I don’t think we have a pass to restore the symbols in the bitcode now, but that should be very straightforward and I am happy to implement one as part of item 3.
Of course, that will only happen if the community thinks this feature is beneficial to them. In the meantime, if you need assistance, please file a radar with Apple at https://bugreport.apple.com.

Steven

Hi Steven,

Hi everyone

I am still in the process of upstreaming some improvements to the embed
bitcode option. If you want more background, you can read the previous RFC (
http://lists.llvm.org/pipermail/llvm-dev/2016-February/094851.html).
This is part II of the discussion.

Current Status:
A basic version of -fembed-bitcode option is upstreamed and functioning.
You can use -fembed-bitcode={off, all, bitcode, marker} option to control
what gets embedded in the final object file output:
off: default, nothing gets embedded.
all: optimized bitcode and command line options gets embedded in the
object file.
bitcode: only optimized bitcode is embedded
marker: only put a marker in the object file

What needs to be improved:
1. Whitelist for command line options that can be used with bitcode:
Current trunk implementation embeds all the cc1 command line options
(that includes header include paths, warning flags and other front-end
options) in the command line section. That is a lot of redundant information.
To re-create the object file from the embedded optimized bitcode, most of
these options are useless. On the other hand, they can leak information about
the source code. One solution would be keeping a list of all the options
that can affect code generation but are not encoded in the bitcode. I have
internally prototyped disallowing these options explicitly and allowing
only the remainder of the options to be embedded
(http://reviews.llvm.org/D17394). A better solution might be encoding
that information in "Options.td" as a specific group.

2. Assembly input handling:
This is a workaround to allow source code written in assembly to work
with the "-fembed-bitcode" option. When compiling assembly source code with
"-fembed-bitcode", clang-as creates an empty section "__LLVM, __asm" in the
object file. That is just a way to distinguish object files compiled from
assembly source from those compiled from higher-level source code where
"-fembed-bitcode" was forgotten. The linker can use this section to
diagnose whether "-fembed-bitcode" is used consistently on all the object
files participating in the link.

It looks like the clang shipping in Xcode has this behavior, but open-source
clang still doesn't. Can you contribute it? It's very useful to us if
open-source clang has the same features as the clang shipping in Xcode.
(That last sentence is true in general, not just for this specific feature.)

Just FYI, Steven is away on vacation for a month. I think he should be back
in January.