Editline Rewrite : issues surrounding wide character handling on different platforms

There is a significant Editline rewrite that adds a number of improvements. It has been well tested on OS X, but not yet upstreamed. I have spent some time reviewing the proposed patch and working through issues to get it running on Linux. The patch and accompanying discussion are here: http://reviews.llvm.org/D5835

The main issues that came up are related to handling wide characters and differences between platforms.

Internally, lldb uses std::string, which is an array of 8-bit chars that can hold either 7-bit ASCII or UTF-8 encoded text. Libedit uses either char or wchar_t, which is a 32-bit type on Linux.

<codecvt> : the patch uses the C++11 class std::codecvt_utf8, a facet implementation that converts between UTF-8 and wchar_t. It is part of the C++11 standard, but not yet supported in gcc. I can use #ifdef to temporarily provide equivalent functionality in that case while we wait for gcc to catch up.
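As a rough illustration of what such a stand-in could look like, here is a minimal hand-written UTF-8 to wchar_t decoder. It is only a sketch, not code from the patch: the name Utf8ToWide is hypothetical, it assumes a 32-bit wchar_t as on Linux, and it does not reject overlong encodings or surrogate code points.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical fallback helper (not from the patch): decode UTF-8 bytes into
// wide characters. Four-byte sequences only fit if wchar_t is 32 bits wide.
std::wstring Utf8ToWide(const std::string &utf8) {
  std::wstring out;
  std::size_t i = 0;
  while (i < utf8.size()) {
    const unsigned char lead = static_cast<unsigned char>(utf8[i]);
    unsigned long cp = 0; // code point being assembled
    std::size_t len = 0;  // total bytes in this sequence
    if (lead < 0x80) {
      cp = lead;
      len = 1;
    } else if ((lead & 0xE0) == 0xC0) {
      cp = lead & 0x1F;
      len = 2;
    } else if ((lead & 0xF0) == 0xE0) {
      cp = lead & 0x0F;
      len = 3;
    } else if ((lead & 0xF8) == 0xF0) {
      cp = lead & 0x07;
      len = 4;
    } else {
      throw std::invalid_argument("invalid UTF-8 lead byte");
    }
    if (i + len > utf8.size())
      throw std::invalid_argument("truncated UTF-8 sequence");
    // Each continuation byte contributes its low six bits.
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char cont = static_cast<unsigned char>(utf8[i + k]);
      if ((cont & 0xC0) != 0x80)
        throw std::invalid_argument("invalid UTF-8 continuation byte");
      cp = (cp << 6) | (cont & 0x3F);
    }
    out.push_back(static_cast<wchar_t>(cp));
    i += len;
  }
  return out;
}

// Small self-check; call it from main() or a unit test.
void CheckUtf8ToWide() {
  assert(Utf8ToWide("abc") == L"abc");
  // U+20AC (euro sign) is the three bytes E2 82 AC in UTF-8.
  assert(Utf8ToWide("\xE2\x82\xAC") ==
         std::wstring(1, static_cast<wchar_t>(0x20AC)));
}
```

A real fallback would presumably sit behind the same #ifdef that selects std::codecvt_utf8 where it is available, so callers would not need to know which path was compiled in.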

libedit : libedit is a prerequisite that a new linux/lldb user installs ( sudo apt-get install libedit-dev ). A few years ago, libedit added versions of its functions that work on wchar_t. Unfortunately, this option is not built by default, and not present in the Ubuntu distribution. To get around this, I see a few options:

- take libedit source files (or subset) and add to the lldb project. We could either build a .so file, or just statically link the .cpp files.

- rework the Editline rewrite, so it either uses standard 8 bit chars, or wchar_t/utf8 depending on the platform. This would be conditionally built depending on the platform.

- modify Ubuntu, so 'sudo apt-get install libedit-dev' installs a version that has been built with wide character support enabled.

- introduce custom step for new linux lldb users, where they download libedit source and build and install a wchar version

The last 2 options don't seem that great.
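To make the second option a little more concrete, a conditional build could be sketched roughly as follows. The macro EDITLINE_SKETCH_USE_WCHAR and the names EditLineChar, EditLineString and MakePrompt are all made up for illustration; a real build would have to detect whether the installed libedit has the wide character entry points.

```cpp
#include <cassert>
#include <string>

// Hypothetical build switch: a build system would set this to 1 only on
// platforms whose libedit was built with wide character support.
#ifndef EDITLINE_SKETCH_USE_WCHAR
#define EDITLINE_SKETCH_USE_WCHAR 0
#endif

#if EDITLINE_SKETCH_USE_WCHAR
typedef wchar_t EditLineChar;
#define EDITLINE_SKETCH_LITERAL(s) L##s
#else
typedef char EditLineChar;
#define EDITLINE_SKETCH_LITERAL(s) s
#endif

// String type matching whichever character type was selected above.
typedef std::basic_string<EditLineChar> EditLineString;

// Code written against EditLineString compiles unchanged in either mode.
EditLineString MakePrompt() {
  return EditLineString(EDITLINE_SKETCH_LITERAL("(lldb) "));
}

// Small self-check; call it from main() or a unit test.
void CheckPrompt() {
  assert(MakePrompt().size() == 7);
}
```

The idea is that only the thin layer that talks to libedit would see EditLineChar, so the rest of the code would not care which mode was built.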

I expect there will be problems on Windows, which I think uses UTF-16 encoding. The file EditLineWin.cpp contains prototypes for most of the structures and functions needed, but they look stubbed out.

Any thoughts?

Shawn.

I haven’t had time to look into the libedit stuff on Windows, but my understanding is that lldb on Windows has never been using libedit to begin with. If that’s the case, then we can probably continue not using it as that will eliminate the risk of any breaks, and python’s builtin console control handler on Windows is not that bad. It might be worth trying to port this implementation to Windows at some point, but if we’re not using it currently (and I don’t think we are), I don’t see any reason to block this change because of Windows, as long as the change keeps it disabled for Windows.

On the Linux front, the times I have had to work on Linux (testing patches etc.), when you go into the embedded interpreter by typing “script”, everything is totally broken, and you can’t use anything that is not a simple ASCII character. No arrows, no control codes, nothing. Do you know if this patch fixes that on Linux (or at least, assuming the issues you pointed out are addressed, will it fix that)?

<codecvt> : the patch uses a c++ 11 class std::codecvt_utf8, this is a facet implementation that will do utf8 to wchar conversion. It is part of c++ 11 standard, but not yet supported in gcc.

Should we drop support for building with gcc on Linux?

  • take libedit source files (or subset) and add to the lldb project. We could either build a .so file, or just statically link the .cpp files.

Is it a problem to drop these “Berkeley-style license” files into the project?

  • rework the Editline rewrite, so it either uses standard 8 bit chars, or wchar_t/utf8 depending on the platform. This would be conditionally built depending on the platform.

This would be my favorite option if possible. wchar_t never really took root in Linux AFAIK.

  • introduce custom step for new linux lldb users, where they download libedit source and build and install a wchar version

Not as good as options above but we can work with this.

Also probably the best option for Windows, although it’s worth pointing out that at least for now, most other stuff in LLDB doesn’t really use wide character strings either, so char would be the path of least resistance for Windows right now.

Personally I’m a fan of just always using UTF-8 in std::strings for everything. If your OS expects things in a different format (as Windows sometimes does), you can do the conversion in your OS abstraction layer, or before you call whatever API.
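For what it’s worth, a boundary conversion of that kind could be sketched like this. The function name Utf8ToUtf16 is hypothetical and nothing here is existing LLDB code; it just assumes callers keep UTF-8 in std::string and convert only when handing text to an API that wants UTF-16. It does not reject overlong or otherwise malformed-but-decodable input.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical boundary helper: convert UTF-8 held in a std::string into
// UTF-16 code units just before calling an API that expects them.
std::u16string Utf8ToUtf16(const std::string &utf8) {
  std::u16string out;
  std::size_t i = 0;
  while (i < utf8.size()) {
    const unsigned char lead = static_cast<unsigned char>(utf8[i]);
    // The lead byte gives the sequence length and the first payload bits.
    std::size_t len = 0;
    char32_t cp = 0;
    if (lead < 0x80) {
      len = 1;
      cp = lead;
    } else if ((lead & 0xE0) == 0xC0) {
      len = 2;
      cp = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) {
      len = 3;
      cp = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) {
      len = 4;
      cp = lead & 0x07;
    } else {
      throw std::invalid_argument("invalid UTF-8 lead byte");
    }
    if (i + len > utf8.size())
      throw std::invalid_argument("truncated UTF-8 sequence");
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char cont = static_cast<unsigned char>(utf8[i + k]);
      if ((cont & 0xC0) != 0x80)
        throw std::invalid_argument("invalid UTF-8 continuation byte");
      cp = (cp << 6) | (cont & 0x3F);
    }
    i += len;
    if (cp < 0x10000) {
      // Basic Multilingual Plane: one UTF-16 code unit.
      out.push_back(static_cast<char16_t>(cp));
    } else if (cp <= 0x10FFFF) {
      // Supplementary planes: split into a surrogate pair.
      cp -= 0x10000;
      out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
      out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    } else {
      throw std::invalid_argument("code point out of Unicode range");
    }
  }
  return out;
}

// Small self-check; call it from main() or a unit test.
void CheckUtf8ToUtf16() {
  assert(Utf8ToUtf16("hi") == u"hi");
  // U+1F600 needs a surrogate pair in UTF-16.
  const std::u16string pair = Utf8ToUtf16("\xF0\x9F\x98\x80");
  assert(pair.size() == 2 && pair[0] == 0xD83D && pair[1] == 0xDE00);
}
```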

Vince - I don't think dropping gcc on Linux is a great option, as other platforms (FreeBSD) use gcc. The issue is also more related to libc++ vs libstdc++.

Zachary - I believe the patch is meant to address problems like that, but I have only ever used the lldb command line in a simplistic way. I will test something more complex and verify it works the same on Linux as on OS X.

wchar_t is used internally in libedit, then translated to UTF-8 at the interface to lldb. I agree, a UTF-8 std::string seems the cleanest way to go.
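The translation in that direction (wide characters coming out of libedit, UTF-8 going into lldb) can be sketched as below. This is only an illustration under the assumption that each wchar_t holds one full code point, as with a 32-bit wchar_t on Linux; WideToUtf8 is a made-up name, not a function from the patch.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode wide characters as UTF-8 bytes. Assumes each
// wchar_t holds one complete Unicode code point (32-bit wchar_t).
std::string WideToUtf8(const std::wstring &wide) {
  std::string out;
  for (wchar_t wc : wide) {
    const unsigned long cp = static_cast<unsigned long>(wc);
    if (cp < 0x80) {
      // ASCII passes through as a single byte.
      out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
      out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
      out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0x10FFFF) {
      out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
      throw std::range_error("code point out of Unicode range");
    }
  }
  return out;
}

// Small self-check; call it from main() or a unit test.
void CheckWideToUtf8() {
  assert(WideToUtf8(L"abc") == "abc");
  // U+20AC (euro sign) encodes as the three bytes E2 82 AC.
  assert(WideToUtf8(std::wstring(1, static_cast<wchar_t>(0x20AC))) ==
         "\xE2\x82\xAC");
}
```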

Shawn.

On FreeBSD our standard system compiler and C++ standard library is
clang and libc++. There would be little impact to me, or other FreeBSD
users, if LLDB did not build with GCC/libstdc++.

That said, being able to build with another compiler and standard
library can prove quite useful while trying to track down issues. I
believe it would be a huge issue for various Linux distros. I don't
think it's feasible to actually pursue dropping GCC/libstdc++ support.

Since gcc is the default compiler on all the linux distros I've recently
come across, "dropping support for building using gcc on linux" seems
like a bad idea.

Matt

With that said, building LLDB requires building LLVM and clang as prerequisites.
So there definitely is a clang available if LLDB is being built.
I wonder if one could set things up to bootstrap a clang with gcc, and then rebuild the entire LLVM, clang, lldb toolset with that built clang.

Yes, of course that's possible. But I'd argue that just makes things
more complex, and less attractive for newcomers to the project.

Incidentally, why does lldb require clang to be built as a prerequisite?
Given that I can use it to debug a linux binary built with gcc, why do I
need to build another compiler along the way?

Matt

Because LLDB re-uses parts of Clang ... grep for "clang::" in the codebase
and you'll see a lot of it.

I personally wish it weren't quite this way and that the layering were
different ... but that ship may have sailed a long time ago.

- Bruce

It doesn’t require building llvm, but not clang as far as I’m aware. I can build LLDB on Windows using only MSVC. I do need to have clang for the test suite, however.

As for compiler support, I think the pros of having LLDB support exactly the same set of compilers that LLVM supports outweigh any cons. Sadly, that means we should probably try to get the test suite working on the same set of compilers as well, although that’s quite a lofty goal.

Sorry, I meant to say it doesn’t require building clang, of course it requires building llvm. But then I realized you meant libclang, and not clang.exe, so yes you’re essentially right.

Yes, of course that's possible. But I'd argue that just makes things
more complex, and less attractive for newcomers to the project.

I very much agree, we don't want to make it any more difficult for
newcomers to get started with LLDB. A while back I found it difficult
enough to get LLDB built in a Linux VM because the standard GCC that
came with the distribution predated C++11 support. Clang and LLVM
require a modern-enough toolchain that I don't see a reason we should
be even more restrictive in LLDB.

Because LLDB re-uses parts of Clang ... grep for "clang::" in the codebase and you'll see a lot of it.

I personally wish it weren't quite this way and that the layering were different ... but that ship may have sailed a long time ago.

I'm interested in why you say this?

Note that the ClangASTType class and the ways it gets used don't expose all that many details of the underlying compiler & type support. It wouldn't be a huge project to make a more generic CompilerType class, and clean up this layering so you could plug in your own type representation. And the interface to Expression parsing is pretty language agnostic, so it would also not be that hard to plug in another compiler front end. Some of this work needed to be done for Swift, though that isn't available in the current lldb sources.

The point of reusing parts of clang is that we shouldn't have to write another compiler front end to parse expressions in the debugger when we've got a perfectly good one in clang. Similarly we shouldn't have to invent our own strategy for representing types in the languages we support when presumably the clang folks have already done a good job at that as well.

There are some challenges with getting a compiler to be fuzzier in the way that the debugger requires, but clang also needed to take on some of this challenge to support parsing for source code completion and some of the other similar tasks it does. And though this has been work, it's been lots less work than writing a really accurate front-end/type representation.

And for instance, if we ever get good support for modules in C++, we could presumably then use that to do things like instantiate template classes for types that weren't instantiated in the current program, and other cool'o things like that which would be hard to do with some hand-built C++ parser, a la gdb.

Jim

The point of reusing parts of clang is that we shouldn’t have to write another compiler front end to parse expressions in the debugger when we’ve got a perfectly good one in clang.

This is one of the best things about LLDB. I’ve heard from someone on the gcc team that gdb is moving this way but haven’t confirmed.

Vince

GDB's solution for such cases currently is XMethods:
https://sourceware.org/gdb/current/onlinedocs/gdb/Xmethods-In-Python.html

There are a bunch of XMethods available for container classes in
libstdc++: https://gcc.gnu.org/svn/gcc/trunk/libstdc++-v3/python/libstdcxx/v6/xmethods.py

I hadn't seen that. That looks pretty useful for things like reintroducing the "size" method for std::vectors that hold the size in an easily accessible field, or something else like that. It involves re-implementing the guts of the library you are patching up in Python using debugger API's so it isn't entirely straight-forward. Note we already do a similar sort of thing in the data formatters to produce nice summaries and "synthetic children" for std & Foundation types that can avoid running code. I think gdb also has some formatter infrastructure like this, IIRC, but I haven't used gdb much for a few years now.

Anyway, you'd be hard pressed to use something like these Xmethods to do more science fiction'y things like "I wish that I had std::vector<T> around for some experiment I want to do in the debugger, but the program never used std::vector<T>, please make it for me". Not sure how often you'd really use that, but the point is that if we had the C++ modules around, we would get it pretty much for free.

Jim

I hadn’t seen that. That looks pretty useful for things like reintroducing the “size” method for std::vectors that hold the size in an easily accessible field, or something else like that. It involves re-implementing the guts of the library you are patching up in Python using debugger API’s so it isn’t entirely straight-forward. Note we already do a similar sort of thing in the data formatters to produce nice summaries and “synthetic children” for std & Foundation types that can avoid running code. I think gdb also has some formatter infrastructure like this, IIRC, but I haven’t used gdb much for a few years now.

It’s called “pretty printers” (but they do both summaries & synthetic children): https://sourceware.org/gdb/current/onlinedocs/gdb/Pretty-Printing-API.html#Pretty-Printing-API

Anyway, you’d be hard pressed to use something like these Xmethods to do more science fiction’y things like “I wish that I had std::vector<T> around for some experiment I want to do in the debugger, but the program never used std::vector<T>, please make it for me”. Not sure how often you’d really use that, but the point is that if we had the C++ modules around, we would get it pretty much for free.

Jim


lldb-dev mailing list
lldb-dev@cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/lldb-dev

Thanks,
- Enrico
:envelope_with_arrow: egranata@.com :phone: 27683

With the Editline rewrite I made the explicit decision to insulate the rest of LLDB from wide characters and strings by encoding everything as UTF8. I agree that reverting to char-only input is a perfectly reasonable solution for platforms that don’t yet include wchar-aware libedit implementations.

Kate Stone k8stone@apple.com
 Xcode Runtime Analysis Tools

If you’re storing UTF8 anyway, why not just use regular character strings? Doesn’t it defeat the purpose of using UTF8 if you’re combining it with a character type that isn’t 1 byte?