AArch64 Round table

I would like to organise an AArch64 round table and wanted to check whether people are interested and would like to propose any topics. For example, we could discuss anything related to AArch64 code generation, optimisations, enablement, things people would like to work on and share, blockers, etc.

Please reply or leave a message if you’re interested, as that will help me judge whether a round table is worth organising. I missed the deadline to get the round table included in the online agenda, but Tanya wrote:

At the event you can write your round table title on the agenda sign outside the room. It will be visible to attendees who walk by the sign, but it won’t be on the online agenda.

Tagging some folks that might be attending and interested:
@kbeyls, @smithp35, @fhahn, @sscalpone, @ramana.radhakrishnan

Thanks for organizing this, @sjoerdmeijer. I would definitely attend such a round table.

Assuming it doesn’t clash with the PAuth and Embedded Toolchains, I would also attend.

Thanks @sjoerdmeijer for kick-starting this.

Perhaps it is worth starting to crowdsource some topics. @kbeyls and @smithp35, do you have any that complement the PAuth ABI and Embedded Toolchain round tables?

Yeah, I think it would be a good idea to crowdsource some topics.
I’m happy to contribute to any topic related to better support for AArch64 in LLVM technology.

A non-exhaustive list of things that are on my mind:

  • deployment of pointer authentication and other security features in the AArch64 architecture.
  • support for AArch64 in BOLT.
  • full GlobalISel support for AArch64.

I’m sure I’m forgetting a lot more topics that I’m actually interested in :wink:

Thanks for proposing this!

I will be around and can discuss SVE and SME enablement in MLIR, if that’s of interest to anyone :slight_smile:

-Andrzej

I work on Meta’s mobile apps, and I’d be interested in attending and discussing issues around binary size. In particular, I was wondering if there have been any discussions around a smaller unwind info format for ELF AArch64 (similar to exidx/extab on armv7, compact unwind on Darwin, and pdata/xdata on Windows). We find that unwind information is a large contributor to binary size for us, and it also impedes other size optimizations like outlining, because the added unwind information for outlined functions adds a lot of size overhead.

Thanks for all the replies! It looks like there’s enough interest, so let’s go ahead with this.
I will register the round table at the conference and try to avoid the PAuth and Embedded Toolchain round tables if different time slots are available.

I would like to bring some performance-related topics to the table, perhaps something related to auto-vectorisation and cost-modelling, but I will see if I can make that more concrete.

Did you end up deciding a time for this?

I’d like to attend once the time and the venue are decided.

There are 4 round tables on Wednesday and ~7 on Thursday.

Let’s go for 11:00am tomorrow (Thursday).

The schedule for tomorrow is not up yet, but I will add it in the morning.

Sorry for the reschedule, but on request we have moved this to 16:15hrs so that more folks can attend.

Can someone please share MoM/notes from this round table?

Thanks
Madhur.

Notes from my memory. I expect that we’ll need more people’s notes to get a full picture:

  • Code Size including exception table size

Some mobile applications have a large code base optimised with -Oz, with many outlined functions. Parts of the code base use exceptions, so unwind tables are required, and the number of outlined functions leads to large .eh_frame sizes. Is there any scope for a compact unwind table format, given that asynchronous exceptions are not required? No one has plans to implement such a format, as it is a considerable amount of work to produce and then maintain over time. Could the exception tables be compressed? Yes, but at a large up-front cost when the first exception is thrown, and with open questions such as whether there is enough memory to decompress them.

Most at the table are working on performance optimisations for AArch64 rather than code size. Outside of outlining, code-size optimisations are largely a long tail of small optimisations that accumulate over time.
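
As a rough illustration of why unwind information interacts badly with outlining, here is a back-of-the-envelope model in C. Everything in it is an assumption made for illustration: the names `unwind_cost` and `outlining_net_saving` are made up, the model ignores details such as saving the link register, and the unwind byte counts are placeholders rather than measurements. The only fixed input is that an AArch64 instruction is 4 bytes, so each use of an outlined function costs one call and the function itself costs the sequence plus a return.

```c
#include <stddef.h>

/* Assumed unwind cost per function, in bytes. Placeholder values only. */
struct unwind_cost {
    size_t fde_bytes;       /* one FDE in .eh_frame, padding included */
    size_t hdr_entry_bytes; /* one lookup-table entry in .eh_frame_hdr */
};

enum { INSN_BYTES = 4 }; /* AArch64 instructions are 4 bytes each */

/* Bytes saved by outlining a sequence of seq_insns instructions that occurs
   `occurrences` times. A negative result means the binary got bigger. */
long outlining_net_saving(size_t seq_insns, size_t occurrences,
                          struct unwind_cost cost)
{
    long before = (long)(seq_insns * occurrences * INSN_BYTES);
    long calls  = (long)(occurrences * INSN_BYTES);      /* one BL per use */
    long body   = (long)((seq_insns + 1) * INSN_BYTES);  /* sequence + RET */
    long unwind = (long)(cost.fde_bytes + cost.hdr_entry_bytes);
    return before - (calls + body + unwind);
}
```

For a three-instruction sequence used three times, `outlining_net_saving(3, 3, (struct unwind_cost){0, 0})` reports 8 bytes saved, while an assumed 24-byte FDE plus an 8-byte header entry turns the same outlining decision into a 24-byte loss.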

  • Bolt on AArch64

Several people have tried it. It gave very good results on some programs but not much on others. Some programs, such as various language VMs, don’t work at all, for example if they encode pointers as offsets, which breaks some assumptions that BOLT makes.
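
One well-known way for an interpreter to end up with code addresses stored as offsets is the GNU C "labels as values" extension, which both GCC and Clang accept. The sketch below only illustrates that encoding: the opcodes and the `run` function are hypothetical, and these notes don’t record which VMs or which assumptions were involved.

```c
#include <stddef.h>

/* Hypothetical three-opcode bytecode, for illustration only. */
enum { OP_INC = 0, OP_DOUBLE = 1, OP_HALT = 2 };

/* Runs a bytecode program and returns the accumulator. */
long run(const unsigned char *code)
{
    /* Each entry is the distance from op_inc to a handler label, not an
       absolute address, so the table needs no relocations. */
    static const int dispatch[] = {
        &&op_inc    - &&op_inc,
        &&op_double - &&op_inc,
        &&op_halt   - &&op_inc,
    };
    long acc = 0;
    size_t pc = 0;

/* Jump to the handler of the next opcode via base label + stored offset. */
#define NEXT() goto *(&&op_inc + dispatch[code[pc++]])
    NEXT();

op_inc:
    acc += 1;
    NEXT();
op_double:
    acc *= 2;
    NEXT();
op_halt:
    return acc;
#undef NEXT
}
```

The `dispatch` table holds plain integer differences between labels, so nothing in the binary marks them as code addresses. It seems plausible that a tool which moves code after linking would struggle to keep such a table consistent.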

The biggest limitation for BOLT was seen to be the weakness of sample-based profiling on AArch64. BOLT can also use instrumentation-based profiles, which should improve the situation.

Due to its nature it is likely to be an expert level tool requiring some understanding of the program and the hardware to get the best out of it.

There are regular BOLT office hours. Please try it out, ask questions, report bugs, and submit patches to improve it.

  • SME

Will Clang generate code for SME other than via intrinsics? Not via clang, but it is possible via MLIR.

  • Performance

Only a small amount of time was spent on this. There was general agreement that a regular AArch64 call would be useful to coordinate on benchmarks, different optimisations, testing, etc. It just needs someone to organise it!

Thanks @smithp35, I think that’s a good summary. I don’t have much to add.

In general agreement that a regular AArch64 call would be useful to coordinate on benchmarks, different optimisations, testing etc. Just needs someone to organise!

I agree that it would be really useful to have this, and I would like to volunteer to organise it. I will set something up soon.

Thank you for volunteering - I think that it will be super helpful :pray:t2: !

Related to this, last week I chatted with a few people about SME in the context of MLIR/Clang. Do you think that it would be OK to direct them to this call for any updates and/or questions that they may have?

-Andrzej

Sure, why not. We can add SME to the agenda and see how it goes before thinking about a separate SME sync. :slight_smile:

FWIW Clang has a matrix type (fixed dimensions only at the moment, but may be extended to allow variable dimensions). I think SME code could be generated for those.

A bit of advertising on your behalf :slight_smile: :pray:t2:

Yes, but even then the user would be responsible for e.g.

  1. virtual tile allocation, and
  2. enabling/disabling the streaming mode and the ZA storage array.

Basically, there’s relatively little “hand-holding” that end-users can expect from the backend ATM. But given how specialised SME is, perhaps that’s OK?

On a related note, SME brings two things:

  • ZA array storage (for accelerating e.g. outer-products),
  • Streaming SVE (in addition to host CPU SVE).

The latter only requires Step 2 from above. So it’s not like Clang users won’t be able to leverage their lovely “matrix extension” :wink:

I’ve not looked at SME in detail, but couldn’t the compiler take care of that?