Does LLVM support Unicode?

Hi everyone!

Does LLVM support Unicode?

Is it possible to adapt LLVM to support Unicode? What problems could come up when trying this?

's

geovanisouza92@gmail.com wrote:

Hi everyone!

Does LLVM support Unicode?

Is it possible to adapt LLVM to support Unicode? What problems could come
up when trying this?

Value names can be any sequence of bytes, even including nulls. LLVM itself does not have any particular support for Unicode, or for any encoding except ASCII.

Nick
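To illustrate Nick's point with a hypothetical sketch (not from the original thread): LLVM's textual IR lets you quote an identifier and use `\xx` hex escapes, so a name containing arbitrary bytes, such as the UTF-8 encoding of "café", can be written like this:

```llvm
; "café" in UTF-8 is the byte sequence 63 61 66 C3 A9.
; Quoting the name and escaping the non-ASCII bytes keeps the
; .ll file pure ASCII while the name itself stays raw bytes.
@"caf\C3\A9" = global i32 0
```

LLVM compares such names byte-for-byte; it attaches no meaning to the encoding.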

Thanks for the reply, Nick!

Is this feature on the project's roadmap?

's

2011/8/27 Nick Lewycky <nicholas@mxc.ca>

geovanisouza92@gmail.com wrote:

Thanks for the reply, Nick!

Is this feature on the project's roadmap?

What feature? LLVM is not a text editor; it is not obvious what LLVM has to do with Unicode. Please explain.

Nick

Presumably he means identifier names.

Well, LLVM is a compiler infrastructure, right? So shouldn't it support Unicode for building programming languages on top of it?

Or does the front-end of my programming language have to analyze the source code and convert it to LLVM IR?
If that is the right way, sorry for my confusion…

's

2011/8/28 FlyLanguage <flylanguage@gmail.com>

Or does the front-end of my programming language have to analyze the source
code and convert it to LLVM IR?

Yes

Right, FlyLanguage. Thanks again for the reply. :o)

2011/8/28 FlyLanguage <flylanguage@gmail.com>

Well, do you have any idea how I can correctly implement Unicode in C/C++?

Thanks.

2011/8/28 FlyLanguage <flylanguage@gmail.com>

Well, do you have any idea how I can correctly implement Unicode in C/C++?

Thanks!

2011/8/28 FlyLanguage <flylanguage@gmail.com>

What do you mean by "implement in C/C++"?

If you mean adding libraries to C/C++ that correctly deal with Unicode: that's not something you do with a compiler infrastructure. And it's probably duplicate work, since Unicode libraries already exist.

If you mean making the C/C++ compiler understand Unicode string literals: either that's in the language standard and already implemented by conformant compilers, or it's not in the standard, in which case implementing it would deviate from the standard and so would not be "correctly" implemented. (I'm not a C/C++ guy, so I don't know whether it's actually in the standard, but if it isn't, I suspect compilers already have extensions for it.)

Hm.

Maybe we're talking at the wrong level here, so:
What's the problem/need that you wish to address?

Regards,
Jo

Hi, Jo!

I’m trying to create a new programming language, and I want it to have Unicode support (support for correctly reading and manipulating the source code and string literals).

In addition, my programming language supports “string interpolation”, and these interpolations contain tiny snippets of code, like expressions or variable names.

So, I need to read each character, separating out the interpolations.

However, if you have another suggestion, I would be grateful to hear it.

's

2011/8/28 Joachim Durchholz <jo@durchholz.org>

LLVM cannot solve these problems for you, it's only a compiler
framework. Asking if LLVM can help you support Unicode in your
language is like asking if x86 machine code can help you support
Unicode. You can use both to generate code that handles Unicode
correctly, but it's up to you to generate that code.

Reid

As Reid said, this probably isn't the right list to ask questions about the runtime system.
Still, it's marginally relevant, and I happen to have done a bit with Unicode lately, so here goes:

In that case, you have a multitude of design and implementation choices. You won't be able to properly explore these until you have done some more reading.

I'd suggest reading the Unicode standard, available for free at http://unicode.org. You'll have to read the material there more than once, I fear; at least I had to before I was able to roughly determine which parts of the standard were relevant for what I wanted to do.
For starters, you'll want to know about the various encodings (UTF-8 and UTF-16 are the most relevant ones), and about surrogate pairs. With that in mind, you can start thinking about writing (or using) a library.

For practical usage, I have been sticking with the ICU library.
(Be warned that you still need to know a good deal about Unicode before you can properly determine what options of ICU actually do what you want.)

Hope this helps, and good luck!
Regards,
Jo

Okay, this helps a lot!

Now I see more clearly what I can do with LLVM.

Thanks very much, Jo and Reid!

Soon I will post about my project, if you want to know more… :wink:

2011/8/28 Joachim Durchholz <jo@durchholz.org>

geovanisouza92@gmail.com wrote:

I'm trying to create a new programming language, and I want it to have
Unicode support (support for correctly reading and manipulating the source
code and string literals).

LLVM IR itself only supports one string type, which is an array of
i8 (8-bit integers). In your compiler you can use UTF-8, and any
UTF-8 string literal can be stored in an i8 array in the LLVM IR.

For example, the LLVM backend for the DDC compiler [0] does this:

   @str = internal constant [4 x i8] c"bar\00", align 8

HTH,
Erik

[0] http://disciple.ouroborus.net/

I think a very related question is "Does LLVM support UTF-8?" The answer has two parts:
1. As strings (arrays of bytes) - yes
2. As identifiers - no

The fix to the second part depends partly on the object file formats. But to at least accept UTF-8 identifiers, the following patch helps. (I know that it does not discriminate between valid and invalid UTF-8.)

--- lib/AsmParser/LLLexer.cpp (revision 138730)
+++ lib/AsmParser/LLLexer.cpp (working copy)
@@ -348,10 +348,10 @@
 bool LLLexer::ReadVarName() {
   const char *NameStart = CurPtr;
   if (isalpha(CurPtr[0]) || CurPtr[0] == '-' || CurPtr[0] == '$' ||
-      CurPtr[0] == '.' || CurPtr[0] == '_') {
+      CurPtr[0] == '.' || CurPtr[0] == '_' || (CurPtr[0] & 0x80) != 0) {
     ++CurPtr;
     while (isalnum(CurPtr[0]) || CurPtr[0] == '-' || CurPtr[0] == '$' ||
-           CurPtr[0] == '.' || CurPtr[0] == '_')
+           CurPtr[0] == '.' || CurPtr[0] == '_' || (CurPtr[0] & 0x80) != 0)
       ++CurPtr;

     StrVal.assign(NameStart, CurPtr);
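With a patch along these lines applied, a hypothetical module such as the following would lex; without it, the same name has to be written in quoted, escaped form:

```llvm
; The identifier is the two UTF-8 bytes CF 80 ("pi").
; An unpatched lexer accepts it only as the quoted name @"\CF\80".
@π = global i32 0
```

Note the patch treats any byte with the high bit set as an identifier character, so it admits malformed UTF-8 as well, exactly the caveat Bagel mentions.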

Thanks, Bagel and Erik!

Your replies help me so much!

2011/8/28 Erik de Castro Lopo <mle+cl@mega-nerd.com>