Does LLVM support Unicode?

Hi everyone!

Does LLVM support Unicode?

Is it possible to adapt LLVM to support Unicode? What problems could come up when trying this?

's

geovanisouza92@gmail.com wrote:

Hi everyone!

Does LLVM support Unicode?

Is it possible to adapt LLVM to support Unicode? What problems could come
up when trying this?

Value names can be any sequence of bytes, even including nulls. LLVM itself does not have any particular support for Unicode, or for any encoding except ASCII.

Nick
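To illustrate Nick's point with a hypothetical sketch (not from the original thread): LLVM's textual IR lets you quote an identifier and use `\xx` hex escapes, so a name containing arbitrary bytes, such as the UTF-8 encoding of "café", can be written like this:

```llvm
; "café" in UTF-8 is the byte sequence 63 61 66 C3 A9.
; Quoting the name and escaping the non-ASCII bytes keeps the
; .ll file pure ASCII while the name itself stays raw bytes.
@"caf\C3\A9" = global i32 0
```

LLVM compares such names byte-for-byte; it attaches no meaning to the encoding.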

Thanks for the reply, Nick!

Is this feature on the project's roadmap?

's

2011/8/27 Nick Lewycky <nicholas@mxc.ca>

geovanisouza92@gmail.com wrote:

Thanks for the reply, Nick!

Is this feature on the project's roadmap?

What feature? LLVM is not a text editor; it is not obvious what LLVM has to do with Unicode. Please explain.

Nick

Presumably he means identifier names.

Well, LLVM is a compiler infrastructure, right? So shouldn't it support Unicode for building programming languages on top of it?

Or does the front-end of my programming language have to analyze the source code and convert it to LLVM IR?
If that is the right way, sorry for my confusion…

's

2011/8/28 FlyLanguage <flylanguage@gmail.com>

Or does the front-end of my programming language have to analyze the source
code and convert it to LLVM IR?

Yes

Right, FlyLanguage. Thanks again for the reply. :o)

2011/8/28 FlyLanguage <flylanguage@gmail.com>

Well, do you have any idea how I can correctly implement Unicode in C/C++?

Thanks.

2011/8/28 FlyLanguage <flylanguage@gmail.com>

Well, do you have any idea how I can correctly implement Unicode in C/C++?

Thanks!

2011/8/28 FlyLanguage <flylanguage@gmail.com>

What do you mean by "implement in C/C++"?

If you mean adding libraries to C/C++ that correctly deal with Unicode: that's not something you do with a compiler infrastructure. And it's probably duplicate work, since Unicode libraries already exist.

If you mean making the C/C++ compiler understand Unicode string literals: either that's in the language standard and already implemented by conformant compilers, or it's not in the standard, in which case implementing it would deviate from the standard and so would not be "correctly" implemented. (I'm not a C/C++ guy, so I don't know whether it's actually in the standard, but if it isn't, I suspect compilers already have extensions for it.)

Hm.

Maybe we're talking at the wrong level here, so:
What's the problem/need that you wish to address?

Regards,
Jo

Hi, Jo!

I’m trying to create a new programming language, and I want it to have Unicode support (support for correctly reading and manipulating the source code and string literals).

In addition, my programming language supports “string interpolation”, and these interpolations contain tiny snippets of code, like expressions or variable names.

So, I need to read each character, separating out the interpolations.

However, if you have another suggestion, I would be grateful to hear it.

's

2011/8/28 Joachim Durchholz <jo@durchholz.org>

LLVM cannot solve these problems for you, it's only a compiler
framework. Asking if LLVM can help you support Unicode in your
language is like asking if x86 machine code can help you support
Unicode. You can use both to generate code that handles Unicode
correctly, but it's up to you to generate that code.

Reid

As Reid said, this probably isn't the right list to ask questions about the runtime system.
Still, it's marginally relevant, and I happen to have done a bit with Unicode lately, so here goes:

In that case, you have a multitude of design and implementation choices. You won't be able to properly explore these until you have done some more reading.

I'd suggest reading the Unicode standard, available for free at http://unicode.org. You'll have to read the material there more than once, I fear; at least I had to before I was able to roughly determine which parts of the standard were relevant for what I wanted to do.
For starters, you'll want to know about the various encodings (UTF-8 and UTF-16 are the most relevant ones), and about surrogate pairs. With that in mind, you can start thinking about writing (or using) a library.

For practical usage, I have been sticking with the ICU library.
(Be warned that you still need to know a good deal about Unicode before you can properly determine what options of ICU actually do what you want.)

Hope this helps, and good luck!
Regards,
Jo

Okay, this helps a lot!

Now I see more clearly what I can do with LLVM.

Thanks very much, Jo and Reid!

Soon I will post about my project, if you want to know more… :wink:

2011/8/28 Joachim Durchholz <jo@durchholz.org>

geovanisouza92@gmail.com wrote:

I'm trying to create a new programming language, and I want it to have
Unicode support (support for correctly reading and manipulating the source
code and string literals).

LLVM IR itself only supports one string type, which is an array of
i8 (8-bit integers). In your compiler you can use UTF-8, and any
UTF-8 string literal can be stored in an i8 array in the LLVM IR.

For example, the LLVM backend for the DDC compiler [0] does this:

   @str = internal constant [4 x i8] c"bar\00", align 8

HTH,
Erik

[0] http://disciple.ouroborus.net/

I think a very related question is "Does LLVM support UTF-8?" The answer has two parts:
1. As strings (arrays of bytes) - yes
2. As identifiers - no

The fix to the second part depends partly on the object file formats. But to at least accept UTF-8 identifiers, the following patch helps. (I know that it does not discriminate between valid and invalid UTF-8.)

--- lib/AsmParser/LLLexer.cpp (revision 138730)
+++ lib/AsmParser/LLLexer.cpp (working copy)
@@ -348,10 +348,10 @@
 bool LLLexer::ReadVarName() {
   const char *NameStart = CurPtr;
   if (isalpha(CurPtr[0]) || CurPtr[0] == '-' || CurPtr[0] == '$' ||
-      CurPtr[0] == '.' || CurPtr[0] == '_') {
+      CurPtr[0] == '.' || CurPtr[0] == '_' || (CurPtr[0] & 0x80) != 0) {
     ++CurPtr;
     while (isalnum(CurPtr[0]) || CurPtr[0] == '-' || CurPtr[0] == '$' ||
-           CurPtr[0] == '.' || CurPtr[0] == '_')
+           CurPtr[0] == '.' || CurPtr[0] == '_' || (CurPtr[0] & 0x80) != 0)
       ++CurPtr;

     StrVal.assign(NameStart, CurPtr);
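With a patch along these lines applied, a hypothetical module such as the following would lex; without it, the same name has to be written in quoted, escaped form:

```llvm
; The identifier is the two UTF-8 bytes CF 80 ("pi").
; An unpatched lexer accepts it only as the quoted name @"\CF\80".
@π = global i32 0
```

Note the patch treats any byte with the high bit set as an identifier character, so it admits malformed UTF-8 as well, exactly the caveat Bagel mentions.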

Thanks, Bagel and Erik!

Your replies help me so much!

2011/8/28 Erik de Castro Lopo <mle+cl@mega-nerd.com>