As mentioned on a couple of the embedded LLVM calls, my changes supporting MC/DC are presently in Phabricator (quoted above):
Since the Developers’ Meeting last November, I’ve been hearing from more folks who are interested in seeing this functionality upstream but, unfortunately, don’t have the LLVM expertise to contribute meaningfully to the reviews. I could really use some help in getting things reviewed.
So far, @ellishg has been able to look at some of the back-end work and provide some good feedback. @smithp35 provided some good suggestions for the preliminary review I added, which I incorporated into the clang-specific review linked here.
Of course, I don’t want to trivialize the fact that everybody is busy, and many of you have upstreaming work of your own. I appreciate the feedback you have and whatever time y’all are able to contribute to this effort! I’m also on Discord if you want to chat about MC/DC.
Multilib implementation code reviews by Michael Platings
Other code reviews
LLD key embedded features by Peter Smith
Two major areas:
Observability/discoverability - more understandable output, better usability.
Disjoint memory regions: multiple memory banks with different properties => possible linker script extension to distribute code over multiple free spaces in different regions.
RW data compression - copy RW data from ROM to RAM and decompress, can save ROM => could add to LLD or have a separate utility. It is important that the compression and decompression algorithms match! Maintaining multiple algorithms may add to overheads.
Memory-mapped variables - placing a section at a particular address, e.g. to access IO ports directly.
In practice, many issues come down to linker script issues (differences in behaviour between BFD and LLD), so being able to debug linker scripts easily helps a lot.
Disjoint memory - distributing by hand is very tedious, indeed.
Compression is helpful.
LTO support with embedded constraints of placement is another interesting area - there was a presentation by TI recently.
Another GSoC project idea is for machine readable format, e.g. JSON, for debug output (also link map, that is different between linkers now, thus tedious to parse) so that people can create their own visualizers/analyzers. Would be nice to convince the GNU community to implement the same format as well.
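To illustrate why a machine-readable link map would help, here is a minimal sketch of the kind of analyzer people could write on top of it. The JSON schema (`{"sections": [{"name": ..., "size": ...}]}`) and the sample data are assumptions made up for this example; no linker emits this exact format today.

```python
import json


def section_sizes(link_map_json):
    """Return {section name: size in bytes} from a JSON link map.

    The schema is hypothetical: {"sections": [{"name": ..., "size": ...}]}.
    """
    data = json.loads(link_map_json)
    return {s["name"]: int(s["size"]) for s in data["sections"]}


def size_diff(old, new):
    """Per-section size change (new - old); a missing section counts as 0."""
    diff = {}
    for name in sorted(set(old) | set(new)):
        delta = new.get(name, 0) - old.get(name, 0)
        if delta != 0:
            diff[name] = delta
    return diff


# Made-up link maps for the same program linked by two different linkers.
BFD_MAP = '{"sections": [{"name": ".text", "size": 4000}, {"name": ".data", "size": 256}, {"name": ".bss", "size": 64}]}'
LLD_MAP = '{"sections": [{"name": ".text", "size": 4096}, {"name": ".data", "size": 256}]}'

diff = size_diff(section_sizes(BFD_MAP), section_sizes(LLD_MAP))
for name, delta in diff.items():
    print(f"{name}: {delta:+d} bytes")
```

With a shared format, the same script would work regardless of which linker produced the map, which is the main argument for agreeing on one format across LLVM and GNU.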
Demo by Peter Smith how the features mentioned above work in armlink (Arm proprietary linker).
armlink has 3 compression algorithms, one very basic run-length for 0’s which is already very helpful.
armlink supports placement attributes from C code, i.e. saves on manually editing linker script files (called scatter files for armlink).
armlink can show useful debug info like call graph/stack depth required, also breakdown of code/data sizes including the libraries to analyse code size issues.
armlink can trace symbols to show why a particular one was included.
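As a rough illustration of the "very basic run-length for 0’s" idea, the sketch below encodes only runs of zero bytes and leaves everything else as literals. The byte format (a `0x00` marker followed by a run length of 1 to 255) is an assumption chosen for this example and is not armlink's actual format. It also shows why the compressor and the startup decompressor must agree exactly on the encoding.

```python
def compress_zero_rle(data):
    """Encode runs of zero bytes as (0x00, run_length); copy other bytes as-is."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0:
            run = 1
            # Extend the run, capped at 255 so the length fits in one byte.
            while i + run < len(data) and data[i + run] == 0 and run < 255:
                run += 1
            out += bytes((0, run))
            i += run
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


def decompress_zero_rle(blob):
    """Inverse of compress_zero_rle; this is what startup code would do in RAM."""
    out = bytearray()
    i = 0
    while i < len(blob):
        if blob[i] == 0:
            out += bytes(blob[i + 1])  # bytes(n) yields n zero bytes
            i += 2
        else:
            out.append(blob[i])
            i += 1
    return bytes(out)


# Typical RW data: a few initialised values and a large zeroed buffer.
rw_data = b"\x01\x02" + bytes(300) + b"\x03"
packed = compress_zero_rle(rw_data)
print(len(rw_data), "->", len(packed), "bytes")
```

Because a literal zero is never emitted on its own, the decoder can treat every `0x00` as the start of a run without any escape mechanism.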
Multilib implementation code reviews by Michael Platings.
MC/DC implementation code review by Alan Phipps.
FatLTO by Petr Hosek.
Multilibs code review
Patches in review, few rounds of discussions happened and comments addressed.
One patch landed, 6 more to finish.
How to speed up or accept the current version with the intent to improve/address any issues?
Feedback from Petr:
The team reviewed the RFC in detail, the response will be posted on Discourse in coming days.
Suggestion: There are changes to internal API and adding new file formats (which are UVB - user visible behavior), so for internal changes it should be OK to land, UVB may need a bit more discussion.
Michael: Could/should we be more aggressive: accept a format now as an experimental feature, so warn that it may and likely will change in the future? May commit now, but review/refine before LLVM17 release to have it as stable as possible by the next release.
Peter: It would be nice to be able to give it a try with real projects and see if it works, rather than keep overthinking.
Agreed: Petr posts the response on Discourse, then if after the Discourse discussion there are no blockers, we commit the current format and try to refine it for LLVM17.
MC/DC code review
Petr: Someone on the team is reviewing the patches, it goes a bit slower than wanted, but in progress, not forgotten.
Petr: FatLTO is progressing, there is an RFC and patches will be available soon. Approach aligned with LTO maintainers.
The idea of FatLTO is for object files to contain information for both normal and LTO linking (i.e. binary and IR code).
TI presented a revised version of LTO for embedded/linker scripts recently, their solution is similar to/compatible with FatLTO.
Todd explained the details of the TI solution from the presentation - the two teams will talk to each other to further align the approach and implementation.
Peter: FOSDEM embedded developers were asking about a way to embed a section, e.g. a checksum, into the output image at the link time.
Petr: why is build-id not enough? Looks like something very custom/special.
Suggested that it would make sense to start a topic on Discourse to explain the use case, then consider possible solutions.
Peter: Use of TLS (thread local storage) in embedded projects. Picolibc uses TLS and initializes it in the linker script. The linker script and the library need to agree on the calculations of relevant addresses. LLD and GNU LD disagree on this - Peter is looking to create a reduced reproducer.
Is anyone using TLS in embedded apps? Vince: No, but had similar issues.
Is this going to change with C11 used more in embedded? Something to look out for in the future.
Peter will post an issue with the reproducer upstream.
Peter: A request raised for the LLVM Embedded Toolchains for Arm Issue #197
One option is to create a trivial runtime that would dump the counters somewhere as suggested in the issue discussion thread.
Wider question is how to add bare-metal support to the compiler_rt?
The PR Pull Request #204 suggests an implementation based on reusing compiler_rt pieces, which goes in the right direction, but only provides a very narrow Arm semihosting-specific implementation. How to generalise?
Can we provide an interface inside compiler_rt that can be used to tailor actual implementation of storing the data, suitable for bare-metal use cases as well?
Petr: The idea makes sense, the profile runtime is not in the best shape now, it would be great to refactor it and rewrite in C++. Would be good to have a header-only minimal implementation to allow easy reuse between actual implementations.
The team is very much interested in the implementation, but there was a lack of time to progress.
LLVM Embedded Toolchain for Arm is an example to learn from - see the top level cmake file. Newlib was supported in LLVM 13 and LLVM 14 builds, can be found in the source packages in the releases section.
Where to get binary libraries for RISC-V? Depends on the toolchains/vendor.
Peter requested a roundtable for embedded toolchains, however there was no confirmation yet.
Multilibs code review (Peter)
Current discussion ([RFC] Multilib) is about options-to-libraries matching logic: so far agreed to use the normalised command line option for the architecture, we need to figure out a sensible way to match against it - regex or anything else.
Agreed the general preference to unblock and land the important patches, then get back to option printing and other possible improvements.
Note: Ordering of architecture options issue was also highlighted in the RISC-V call earlier today, so the issue is real and needs to be addressed in the design.
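The matching question can be made concrete with a small sketch. Everything here is an assumption for illustration: the normalisation rule, the regular expressions, and the library directories are invented and do not reflect the format under review. The sketch normalises the architecture option so that extension order does not matter, then lets the last matching pattern win.

```python
import re


def normalise_march(march):
    """Toy normalisation: lower-case and sort '+ext' suffixes alphabetically."""
    base, *exts = march.lower().split("+")
    return "+".join([base] + sorted(exts))


# Hypothetical table: (regex on the normalised -march value, library directory).
# More specific entries come later and override earlier matches.
VARIANTS = [
    (re.compile(r"armv6-m"), "thumb/v6-m/nofp"),
    (re.compile(r"armv7e-m(\+\w+)*"), "thumb/v7e-m/nofp"),
    (re.compile(r"armv7e-m(\+\w+)*\+fp(\+\w+)*"), "thumb/v7e-m/fp"),
]


def select_library(march):
    """Return the directory of the last variant whose regex fully matches."""
    norm = normalise_march(march)
    chosen = None
    for pattern, directory in VARIANTS:
        if pattern.fullmatch(norm):
            chosen = directory
    return chosen


print(select_library("armv7e-m+fp+dsp"))
print(select_library("armv7e-m+dsp+fp"))
```

Normalising before matching is what makes the two spellings above select the same library, which is the ordering concern raised in the note.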
MC/DC coverage (Alan)
Comments were provided for all 3 patches, many thanks to those who contributed, updates are in progress - patches should be updated in coming weeks.
Petr: Google team provided all the useful information links in previous meeting minutes.
Next steps: Need a patch to start a more practical discussion.
Note: We need to keep the ABI stable, we may use a script to generate the list of public symbols, then check differences between versions. Petr suggested uploading the script for review/consideration, then it can be added to compiler_rt, if useful.
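A minimal sketch of what such a check could look like, assuming the symbol list comes from `nm`-style text output (address, type letter, name). The symbol names are made up, and treating every upper-case type letter other than `U` as "public" is a simplification; this is not the script mentioned above.

```python
def public_symbols(nm_output):
    """Collect defined, externally visible symbols from nm-style text."""
    symbols = set()
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) != 3:  # undefined symbols have no address column
            continue
        _address, kind, name = parts
        if kind.isupper() and kind != "U":  # lower-case letters are local symbols
            symbols.add(name)
    return symbols


def abi_diff(old, new):
    """Report symbols that disappeared or appeared between two versions."""
    return {"removed": sorted(old - new), "added": sorted(new - old)}


# Made-up dumps of the runtime before and after a refactoring.
OLD_NM = """\
0000000000000000 T prof_write_file
0000000000000040 T prof_reset_counters
0000000000000080 t helper_local
                 U memcpy
"""
NEW_NM = """\
0000000000000000 T prof_write_file
0000000000000040 T prof_dump_to_buffer
                 U memcpy
"""

report = abi_diff(public_symbols(OLD_NM), public_symbols(NEW_NM))
print(report)
```

A non-empty "removed" list is the case that would break existing users, so a CI check could fail on that alone.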
Other toolchains have different approaches (syntax and semantics) to resolve this issue that is typical in embedded, because devices may have many types and many regions of memory, e.g. flash, static, dynamic memory, etc.
What is the best way to implement such in LLD?
Re-implementing LD logic in LLD might be a reasonable option. Would make compatibility between GNU and LLVM easier for projects that use both.
Can be promoted to some linker script file syntax instead of the command line option later.
The “fill till overflow, then switch to the next memory region” strategy seems to work best in practice (distributing evenly across memory regions makes local code from source scattered all over the memory which may have performance pitfalls).
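The "fill till overflow" strategy can be sketched as below. The region names, sizes, and section list are invented for the example, and the model ignores alignment and per-region attributes; it is an illustration of the strategy, not a proposal for the LLD implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    name: str
    size: int
    used: int = 0
    sections: list = field(default_factory=list)


def place_sections(sections, regions):
    """Place (name, size) sections in link order, region by region.

    Once a section does not fit, move to the next region and never go back,
    so that neighbouring code stays together.
    """
    current = 0
    for name, size in sections:
        while current < len(regions) and regions[current].used + size > regions[current].size:
            current += 1
        if current == len(regions):
            raise ValueError(f"no memory region has room for {name}")
        regions[current].used += size
        regions[current].sections.append(name)
    return regions


regions = place_sections(
    [("a.o(.text)", 60), ("b.o(.text)", 50), ("c.o(.text)", 30), ("d.o(.text)", 100)],
    [Region("FLASH1", 100), Region("FLASH2", 200)],
)
for r in regions:
    print(r.name, r.used, r.sections)
```

Note that `c.o(.text)` would still fit in the 40 bytes left in FLASH1, but it is placed after `b.o(.text)` in FLASH2. That wastes some space in exchange for keeping the link order intact.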
Scott: (CircuitPython for Adafruit) needs:
Explicit marking for target region, e.g. what to put into flash or not including the whole call tree.
Access properties for memory region, e.g. place hot code into TCM memory.
Can we do much of it in the compiler, instead of linker? E.g. allocation to sections with specific properties. Alternatively, can be a standalone binary rewriting tool like bolt.
LLD has ordering based on profiling data feature, contributed for games optimization?
There is a symbol ordering file to control order, used by PGO already - would be best to reuse such existing features, if possible.
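For reference, LLD reads a symbol ordering file (one symbol name per line) via `--symbol-ordering-file`. The sketch below generates such a file from per-function call counts, hottest first. The counts and function names are made up, and how they would be obtained (instrumentation, sampling, trace) is outside the scope of the sketch.

```python
def symbol_order(call_counts):
    """Order symbols hottest first; drop symbols that were never called."""
    hot = [(name, count) for name, count in call_counts.items() if count > 0]
    # Sort by descending count, then by name so the output is deterministic.
    hot.sort(key=lambda item: (-item[1], item[0]))
    return [name for name, _count in hot]


def ordering_file_text(call_counts):
    """Render the order as the text of a symbol ordering file."""
    return "".join(name + "\n" for name in symbol_order(call_counts))


# Made-up profile: how often each function was called.
counts = {"main": 1, "uart_isr": 900, "memcpy": 900, "board_init": 0}
text = ordering_file_text(counts)
print(text, end="")
```

The resulting text could be written to a file and passed to the link step, for example `ld.lld --symbol-ordering-file=order.txt ...`, where the file name is arbitrary.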
Why should LLD avoid complexity in implementation?
Maintenance, especially the mix of different features not intended to work together originally.
Impact on performance of LLD - the more logic, the slower it is.
Need to check with the LLD maintainer if there are any objections to the LD feature to be reimplemented in LLD?
Daniel is happy to progress based on the discussion.
Multilibs - Michael updated as per the latest comments, thanks to Petr for the review and feedback to keep it moving.
MC/DC - update from Alan: thanks for the useful comments, patches to be updated soon.
Profiling runtime - no patches yet.
Petr: the team looked into this, refactoring is needed: the idea is to move the implementation to C++ incrementally. The team would like to start doing that, but need to make sure not to break the ABI. There is a patch in Phabricator that uses LLVM readelf with JSON output to extract all the API information and to do diff with the refactored one, so that it is possible to catch incompatibilities. There are some limitations, though: JSON output is currently only supported for the ELF file format.
In libc++ there are similar scripts to capture the ABI details. They are based on readelf and nm and work well for dynamic libraries, but not for static libraries. A discussion has started to generalize this libc++ infrastructure for other runtimes.
So there are a lot of tools, but each of them has limitations. So we need to decide on priorities/strategy. E.g. focus on ELF file format for now, then add the rest later; or first improve readelf to extend the support to other formats, then continue with the refactoring.
Note that the profile data format can change, there is a version embedded into the format itself. But this is not the issue, the discussion is specifically about ABI compatibility of the runtime itself.
May be a good idea to ask in Discourse who is using profiling with what OSes/formats. Darwin format is probably for Apple to check, COFF format may be for the Chrome team.
Suppressing section type mismatch
An alternative solution landed yesterday!
This LLVM Embedded Toolchains sync was advertised in the EuroLLVM as an extended roundtable - people were invited to continue the discussion in these sync ups. Specific topics of interest follow.
It might be a good idea to set up a real-time communication channel, e.g. a Discord - Volodymyr will try to do so.
Code and RFC reviews: It was highlighted that all patch/RFC comments are useful, even just to say that the idea sounds good, is useful, etc - helps to support the progress and builds confidence.
How to advertise LLVM Embedded Toolchain more? Options considered: LLVM blog or a company blog? Invite people to comment on issues/needs/features they want to see for embedded use cases.
Another idea is to have talks at the LLVM DevMeeting this fall. The Google team wants to present on porting a big project from GCC to LLVM. Issues the project ran into and ideas for improvement will be part of the presentation.
Similarly, Ties is working on a blog post about using LLVM Embedded Toolchain to target the Game Boy Advance game console. He wants to submit a talk for the LLVM Dev Meeting as well.
Scott highlighted that the CircuitPython team is working on migration from GCC to LLVM and invited everyone to contribute - this is an open-source project, see Contributing - Pull Requests
Everyone agreed that the code size is definitely an issue, especially on smaller cores!
Petr suggested a possible future topic for discussion: analysis of optimization passes and how they contribute to the code size. It has been observed that the Attributor pass with LTO gives a size reduction of about 10-12%, but it is not enabled by default. Proposal may be to enable it for -Oz? Enabling the Attributor pass may increase the compilation time; however, compile time for embedded code (which is comparatively small) is not that big an issue, so it may be a good trade-off.
Related topic: Unified LTO discussion: the proposal to unify the ThinLTO and FullLTO. FullLTO is useful in embedded (again, smaller overall code size) vs ThinLTO for big apps like Chrome.
Quantum: Who has experience of using GCC LTO? Scott: it is used in CircuitPython from the very beginning - need to build it without LTO to see what is the impact.
Overlays in the linker. Arm Compiler has automatic overlays. Embecosm is attempting to standardise on ComRV (link in the trip report). It is driven by the RISC-V community, but if it is interesting to a wider community, then we can collaborate.
Ties: LLD does not seem to support all the syntax from GNU LD, so using overlays was difficult.
Petr: Our project uses overlays that are reimplemented manually (not LLD one). LLVM and GCC do different things here, thus it is difficult to use their implementation. LLD implementation is not on par with GNU LD, e.g. cross refs checks that are controlled in the linker script for GNU LD (LLD does not even parse the relevant keywords).
Are GCC overlays usable (as the approach/design) or can we do better? Something more advanced would create a split between LLVM and GNU, thus we need to seek consensus with the GNU community.
ComRV may be one option to discuss - needs a deeper evaluation.
LLVM libc in embedded: There is some good news: it was tried in some projects and worked.
There was a migration project to replace gcc, newlib, libgcc, libstdc++ with LLVM compiler, LLVM libc, compiler_rt, and LLVM libc++.
Google team is working on a report to present in the LLVM Dev Meeting.
Some key issues: code size, e.g. printf is not configurable yet; memcpy size - improved, etc.
Now LLVM libc covers the needs of this particular project which is not that much: ~25 functions. The expectation is that for many embedded projects it is already usable - many projects use only a few functions.
Problem is that many embedded projects grow in complexity/size now and go closer to RTOS and using a lot of maths library, e.g. for DSP, thus become more demanding.
Single precision maths is complete in LLVM libc and is even better than glibc; double precision is in progress, but does not seem to be used a lot in embedded.
Petr suggested a possible future topic: malloc - LLVM libc uses scudo algorithm from compiler_rt as the default implementation. It is a good choice for desktop, but too big for embedded. Do we need a minimal malloc implementation for embedded? Exploring options and papers, etc.
Automotive community needs may be special here: they need deterministic memory management - would be good to make heap memory management pluggable so that people can replace depending on their use case.
I am glad to let you know that we have got a dedicated LLVM Discord channel for more interactive discussions in between the sync ups, please see #embedded-toolchains under the Communities & Working Groups section or follow the direct link: Discord
Petr and the team are thinking about a tutorial on coverage for embedded.
Another request was about build systems: may be able to cover building multilibs and runtimes.
Petr: supportive of the workshop, used to have similar sessions in the past, however to make it the most efficient it would be great to invite relevant maintainers, e.g. for LLD topics, so that the proposals can be discussed in person to save on online RFCs and comments. So it is even more important to choose the topics upfront.
Action: Peter to submit the proposal for the workshop.
The default Scudo allocator is large, thus there are many requests to add other options.
Want to start with something very simple and small (to fit embedded use cases), then enhance.
Plan: start with a simple implementation and put it up for review.
Peter: Option to override on the binary level would be great to allow people to substitute. Siva: sure, part of the original design to have it replaceable.
multilibs: just landed! Big thank you to Michael for driving and everyone who contributed to reviews!
MC/DC: good feedback received - progressing.
Scott: Large code size with LTO, see details in Discord:
Peter: Arm Compiler has -Omin option that modifies the LTO pipeline to avoid cross module inlining (that makes the code bigger). Can be a topic for discussion to suggest LTO pipeline for code size optimization.
Scott plans to also check per-section size difference - having map files in JSON format would be really good to enable comparison between clang and GCC.
Changes were required to make CircuitPython building with clang: Scott can share the experience/details with anyone interested.
Petr: We also experimented with LTO, there are places where LTO helps, but in others it makes it worse - not a straightforward experience. Now Fat LTO is used, Thin LTO does not help code size most of the time. Unified LTO discussion started that should be able to address (or enable addressing) these issues and design a pipeline useful for embedded code.
Petr: There are 10-15 issues related to LLD and embedded in the Fuchsia issue tracker, but most people are not aware of the tracker. The team will upstream these issues in the LLVM project. Would be nice to label them to make it easier to find. Agreed to add the “embedded” label. Peter will check that we raise defects that we are aware of in Arm Compiler as well.
Memory region function attributes and how they’d impact inlining and output section.
Assembly inline with the source similar to opt-viewer, but be able to have gcc assembly alongside clang generated assembly.
Using Arm trace data as an input to PGO. That’d give high quality performance data without needing any instrumentation.
Pre-LLVM DevMeeting workshop (Peter)
NOTE: LLVM sync on 12th Oct will overlap with the LLVMDev meeting, so we will skip it.
Proposal submitted - did not hear back yet. Number of people requested ~25. There was a list of possible topics suggested - we will need to review and confirm topics and agree who can drive each of the topics.
News and next steps to be posted on Discourse when the workshop is confirmed.
Update from Alan Phipps on MC/DC: code reviews have been accepted, thanks for the help!
Michael P: libc++ with picolibc testing: code review accepted, expected to land soon, buildkite CI will test the picolibc (embedded) configuration of libc++ running in QEMU on Armv7-M.
Automatic attribute propagation through the call graph is useful if there are libraries whose source code cannot be changed.
Somewhat similar to overlay logic to copy or not functions for different overlays.
PGO from traces (Scott)
PGO: trace capability of higher end CPUs - can it be used as input to PGO (without code instrumentation)? Branch instructions are most interesting to recreate the flow. Should be possible in principle. Arm Streamline is a trace based tool, armcc (Arm Compiler 5) was able to read its output, but not armclang (Arm Compiler 6).
There are a lot of trace formats out there so it could be tricky to parse all of them.
Compiler teams use a lot of models for testing, however for people working with peripherals there are fewer options.
There is no current plan to backport to CMSIS5. However, the clang enablement is a minor change and CMSIS6 is mostly compatible with CMSIS5 (it is a better split and arrangement of the same components), so it should be straightforward to migrate.
In short, multiple C++ includes are blocking C includes.
Arm team is working on it and will suggest a patch or RFC to discuss possible solutions.
LLD improvements (Prabhu)
Auto packing in linker scripts: tested different approaches, GNU LD seems to do what is needed, so it may be useful to add the same to LLD. A review may be posted in the next few weeks.
Response: Makes sense, compatibility is more important, than a slightly better, but different, feature.
LTO and linker scripts: The team looked at TI and Qualcomm proposals, but it turned out that the solutions are optimised for different use cases, thus the key question is: What are the different requirements that we want to prioritise and design for? Will discuss in the DevMeeting workshop.
Peter: some input from EuroLLVM discussions:
Try to minimise code changes before using LTO - users want a magic solution: just change the switches to make it work.
Support for segregation of different memory types/regions.
Code size optimization options (Petr)
A question from the last RISC-V LLVM working group sync up meeting: Zephyr team is trying clang, but noticed a big difference between -Os and -Oz compared to GCC.
-Os in GCC is close to -Oz in clang: Does this match the experience of other teams?
Should we rename the options to match GCC behaviour to avoid the confusion?
Peter agreed that clang -Os is not focused on code size, but is rather a smaller, yet still fast, -O2. -Oz can have a performance impact that makes some people unhappy, it is “code size at all cost.”
Response: No major concerns with renaming.
Another idea was that it might be useful to have an optimization level to target embedded as there are some optimisation passes that are off by default, but are known to be beneficial for embedded use cases.
“Is it a bug if LLD doesn’t implement GNU LD specific behavior?”
There are things that are difficult to express in a GNU linker script. For example, aligning and (sometimes) padding an output section to the next power of two based on the size of the contents of the output section. If a downstream LLVM-based toolchain supports a more effective means of expressing something in a linker script that is not (yet) supported in GNU, should LLVM lld adopt it? If not, why not?
Do we need to drive incorporation of new linker script mechanisms into GNU ld before we consider upstreaming them to LLVM?
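To make the power-of-two example concrete, the arithmetic a linker would need to perform is sketched below. The addresses and sizes are invented; one common motivation is hardware such as a memory protection unit that requires regions to be power-of-two sized and aligned to their size, but that use case is an assumption and not stated in the discussion.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (1 for n <= 1)."""
    if n <= 1:
        return 1
    return 1 << (n - 1).bit_length()


def padded_section(start, size):
    """Align a section to, and pad it up to, the next power of two of its size.

    Returns (aligned start address, padded size, padding bytes added).
    """
    span = next_power_of_two(size)
    aligned_start = (start + span - 1) & ~(span - 1)
    return aligned_start, span, span - size


start, span, padding = padded_section(0x20000100, 1000)
print(hex(start), span, padding)
```

The difficulty in a linker script is that the alignment depends on the final size of the section contents, which is only known after the contents have been laid out.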
Petr: There are some subsets in libc++, but they are not very fine grained. May be useful to add more to be more flexible for embedded. Freestanding: not really/strictly supported in libc++ yet either, but would be useful.
Petr: libc++ is not supported on top of libc yet, it would be good to set up testing for the subset that already works, similar to picolibc.