[patch] char8_t support (plus dlang UTF8/16/32)

Dear LLDB developers:

I have added support for C++20 char8_t, as well as support for dlang's char/wchar/dchar types. As I am not a professional developer, and the submission-review-merge process for LLVM projects seems somewhat byzantine, I wanted to offer this up on the list in the hopes that others find it useful and someone will be able to integrate it.

kind regards
James

Using an example program that defines each of the Unicode character types as both a single character and a string, the char8_t variables that previously failed to display now print like their char16_t and char32_t counterparts.
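For reference, a test program along these lines would give the set of variables listed in the output. It is a reconstruction from the variable names, types and array sizes in the debugger output, not the original source. The function name unicode_locals, the initial values, and the pre-C++20 fallback alias are assumptions made for illustration.

```cpp
#include <cstddef>

#if !defined(__cpp_char8_t)
// Assumption: fallback so the sketch still compiles before C++20.
// The patch itself is about the real C++20 char8_t (build with -std=c++20).
using char8_t = char;
#endif

// Hypothetical function holding one local of each Unicode character type,
// as a single character, an array, and a pointer to that array.
std::size_t unicode_locals() {
  char8_t c8 = 0;
  char16_t c16 = 0;
  char32_t c32 = 0x7fff;  // U+7FFF, matching the value in the output

  char8_t str8[] = u8"Hello UTF8";   // 10 characters + NUL = 11 elements
  char8_t *str8ptr = str8;
  char16_t str16[] = u"Hello UTF16"; // 11 characters + NUL = 12 elements
  char16_t *str16ptr = str16;
  char32_t str32[] = U"Hello UTF32"; // 11 characters + NUL = 12 elements
  char32_t *str32ptr = str32;

  // Reference every local so none is reported as unused.
  (void)c8; (void)c16; (void)c32;
  (void)str8ptr; (void)str16ptr; (void)str32ptr;

  // Total element count across the three arrays: 11 + 12 + 12.
  return sizeof(str8) / sizeof(str8[0]) +
         sizeof(str16) / sizeof(str16[0]) +
         sizeof(str32) / sizeof(str32[0]);
}
```

To reproduce the listing, call unicode_locals() from main, build with debug info, set a breakpoint on the return statement, and run "frame v".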

BEFORE:
(lldb) frame v
error: need to add support for DW_TAG_base_type 'char8_t' encoded with DW_ATE = 0x10, bit_size = 8
(void) c8 = <Unable to determine byte size.>

(char16_t) c16 = U+0000 u'\0'
(char32_t) c32 = U+0x00007fff U'翿'
(void [11]) str8 = ([0] = <Unable to determine byte size.>, [1] = <Unable to determine byte size.>, [2] = <Unable to determine byte size.>, [3] = <Unable to determine byte size.>, [4] = <Unable to determine byte size.>, [5] = <Unable to determine byte size.>, [6] = <Unable to determine byte size.>, [7] = <Unable to determine byte size.>, [8] = <Unable to determine byte size.>, [9] = <Unable to determine byte size.>, [10] = <Unable to determine byte size.>)
(void *) str8ptr = 0x00007fffffffe3d9
(char16_t [12]) str16 = u"Hello UTF16"
(char16_t *) str16ptr = 0x00007fffffffe3b0 u"Hello UTF16"
(char32_t [12]) str32 = U"Hello UTF32"
(char32_t *) str32ptr = 0x00007fffffffe370 U"Hello UTF32"

AFTER:
(lldb) frame v
(char8_t) c8 = 0x00 u8'\0'
(char16_t) c16 = U+0000 u'\0'
(char32_t) c32 = U+0x00007fff U'翿'
(char8_t [11]) str8 = u8"Hello UTF8"
(char8_t *) str8ptr = 0x00007fffffffe3c9 u8"Hello UTF8"
(char16_t [12]) str16 = u"Hello UTF16"
(char16_t *) str16ptr = 0x00007fffffffe3a0 u"Hello UTF16"
(char32_t [12]) str32 = U"Hello UTF32"
(char32_t *) str32ptr = 0x00007fffffffe360 U"Hello UTF32"

diff --git a/include/lldb/lldb-enumerations.h b/include/lldb/lldb-enumerations.h
index f9830c04b..e7189dc9d 100644
--- a/include/lldb/lldb-enumerations.h
+++ b/include/lldb/lldb-enumerations.h
@@ -167,6 +167,7 @@ enum Format {
   eFormatOctal,
   eFormatOSType, // OS character codes encoded into an integer 'PICT' 'text'
                  // etc...
+  eFormatUnicode8,
   eFormatUnicode16,
   eFormatUnicode32,
   eFormatUnsigned,
diff --git a/source/Plugins/Language/CPlusPlus/CPlusPlusLanguage.cpp b/source/Plugins/Language/CPlusPlus/CPlusPlusLanguage.cpp
index 0b3c31816..15e0a82bd 100644
--- a/source/Plugins/Language/CPlusPlus/CPlusPlusLanguage.cpp
+++ b/source/Plugins/Language/CPlusPlus/CPlusPlusLanguage.cpp
@@ -853,6 +853,14 @@ static void LoadSystemFormatters(lldb::TypeCategoryImplSP cpp_category_sp) {

   // FIXME because of a bug in the FormattersContainer we need to add a summary
   // for both X* and const X* (<rdar://problem/12717717>)
+  AddCXXSummary(
+      cpp_category_sp, lldb_private::formatters::Char8StringSummaryProvider,
+      "char8_t * summary provider", ConstString("char8_t *"), string_flags);
+  AddCXXSummary(cpp_category_sp,
+                lldb_private::formatters::Char8StringSummaryProvider,
+                "char8_t [] summary provider",
+                ConstString("char8_t \\[[0-9]+\\]"), string_array_flags, true);
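
The second AddCXXSummary call registers its type name as a regular expression, so one summary provider covers char8_t arrays of any length. Below is a minimal sketch of which type names that pattern accepts. It uses std::regex as a stand-in for LLDB's own regex handling and assumes a whole-string match; matches_char8_array is a hypothetical helper, not an LLDB function.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: checks a type name against the same pattern text
// that the patch passes to ConstString for the array summary.
bool matches_char8_array(const std::string &type_name) {
  // "\\[" and "\\]" match literal brackets; "[0-9]+" requires a length.
  static const std::regex pattern("char8_t \\[[0-9]+\\]");
  return std::regex_match(type_name, pattern);
}
```

Under these assumptions "char8_t [11]" is accepted, while "char8_t *" and "char8_t []" are not, which is why the pointer case gets its own exact-name registration.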

Hi James,

Thanks for working on this. I've opened a code review for your patch:
https://reviews.llvm.org/D66447

I had to make some modifications to get it to compile, and I added a test.

Cheers,
Jonas