Linking Linux kernel with LLD

Hi Dmitry,

thanks for sharing. A few comments/questions below:

> Here is the list of modifications I had to do in order to link the kernel (I have used llvmlinux with clang and mainline with gcc; the results are similar):
>
> 1. LLD patches:
> - D28094 (Implemented support for R_386_PC8/R_386_8 relocations)

Do you remember where it was used?

> 5. In arch/x86/kernel/vmlinux.lds.S commented out the "CONSTRUCTORS", because LLD doesn't support it.

It is https://reviews.llvm.org/D28951. CONSTRUCTORS can simply be removed; they do nothing for ELF.

> 6. In arch/x86/boot/setup.ld replaced 5*512 with the precalculated value 2560, because LLD doesn't seem to support math inside ASSERT in linker scripts.

It is actually not related to ASSERT. LLD does not support "symbol = 5*6", but accepts "symbol = 5 * 6" currently.
Not sure what an easy fix would be here.

> Finally the kernel was built, and it obviously didn't run (it only printed "No setup signature found..." but this is some result as well). Probably the result could be better if the --emit-relocs option didn't fail and CONSTRUCTORS were supported. I really don't know what to do about the assertion that I have commented out.

I updated the patch for --emit-relocs; now it does not fail: https://reviews.llvm.org/D28612
It looks to be an important feature for self-relocation, so it is not surprising it did not run without it :)

George.

> > 6. In arch/x86/boot/setup.ld replaced 5*512 with the precalculated value 2560, because LLD doesn't seem to support math inside ASSERT in linker scripts.
>
> It is actually not related to ASSERT. LLD does not support "symbol = 5*6", but accepts "symbol = 5 * 6" currently.
> Not sure what an easy fix would be here.

I'm not sure if it is easy, but I think it's clear that the linker script lexer needs to be improved. I think that is the source of the problems with `*(.apicdrivers);` as well. This is not the first bug related to lexing that we have run into (e.g. lexing `.=` as a single token is the cause of https://llvm.org/bugs/show_bug.cgi?id=31128).

-- Sean Silva

> > - D28094 (Implemented support for R_386_PC8/R_386_8 relocations)
>
> Do you remember where it was used?

I can undo the patch (but I can access the build machine only on Monday) and see what breaks.

> CONSTRUCTORS can simply be removed; they do nothing for ELF.

Okay, this is what I did (I thought it would break things, but it is okay). I will apply the patch.

> LLD does not support "symbol = 5*6", but accepts "symbol = 5 * 6" currently. Not sure what an easy fix would be here.

Just ignore it for now, it's not really a big deal.

> I updated the patch for --emit-relocs; now it does not fail: https://reviews.llvm.org/D28612

Thank you, I will apply the updated patch and hope that it will boot.

Regards,
Dmitry

> > 6. In arch/x86/boot/setup.ld replaced 5*512 with the precalculated value 2560, because LLD doesn't seem to support math inside ASSERT in linker scripts.
>
> It is actually not related to ASSERT. LLD does not support "symbol = 5*6", but accepts "symbol = 5 * 6" currently.
> Not sure what an easy fix would be here.

I'm not sure if it is easy, but I think it's clear that the linker script lexer needs to be improved. I think that is the source of the problems with `*(.apicdrivers);` as well.

Actually, quickly staring at the code, `*(.apicdrivers);` seems like it will be lexed correctly.

-- Sean Silva

> > - D28094 (Implemented support for R_386_PC8/R_386_8 relocations)
>
> Do you remember where it was used?

setup.elf:
      ld.lld -m elf_i386 -T arch/x86/boot/setup.ld arch/x86/boot/a20.o arch/x86/boot/bioscall.o arch/x86/boot/cmdline.o arch/x86/boot/copy.o arch/x86/boot/cpu.o arch/x86/boot/cpuflags.o arch/x86/boot/cpucheck.o arch/x86/boot/early_serial_console.o arch/x86/boot/edd.o arch/x86/boot/header.o arch/x86/boot/main.o arch/x86/boot/mca.o arch/x86/boot/memory.o arch/x86/boot/pm.o arch/x86/boot/pmjump.o arch/x86/boot/printf.o arch/x86/boot/regs.o arch/x86/boot/string.o arch/x86/boot/tty.o arch/x86/boot/video.o arch/x86/boot/video-mode.o arch/x86/boot/version.o arch/x86/boot/video-vga.o arch/x86/boot/video-vesa.o arch/x86/boot/video-bios.o -o arch/x86/boot/setup.elf
    ld.lld: error: do not know how to handle relocation 'R_386_PC8' (23)
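
For reference, these two relocations are simple to apply: per the i386 psABI, R_386_8 stores the low byte of S + A and R_386_PC8 stores the low byte of S + A - P, with a range check. A rough sketch (illustrative helper, not LLD's actual code):

    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>

    // Hypothetical helper (not LLD's implementation): apply an 8-bit
    // i386 relocation at Loc. S = symbol value, A = addend, P = address
    // of the place being relocated:
    //   R_386_8    ->  S + A
    //   R_386_PC8  ->  S + A - P
    static void applyRel8(uint8_t *Loc, uint64_t S, int64_t A, uint64_t P,
                          bool IsPcRel) {
      int64_t V = (int64_t)(S + A) - (IsPcRel ? (int64_t)P : 0);
      // Accept both signed and unsigned single-byte values.
      if (V < INT8_MIN || V > (int64_t)UINT8_MAX) {
        fprintf(stderr, "relocation out of range\n");
        exit(1);
      }
      *Loc = (uint8_t)V;
    }

    int main() {
      uint8_t Byte = 0;
      applyRel8(&Byte, /*S=*/0x1000, /*A=*/-2, /*P=*/0x0FF0, /*IsPcRel=*/true);
      printf("0x%02x\n", (unsigned)Byte); // 0x1000 - 2 - 0x0FF0 = 0x0e
    }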

> I updated the patch for --emit-relocs; now it does not fail.

Thanks, applied it, doesn't fail.

I still didn't do anything about the "Setup too big!" problem, just commented out the assert. I tried booting the resulting bzImage and vmlinux with qemu. The bzImage only rebooted over and over, but the vmlinux did show an adorable picture (attached).

Regards,
Dmitry

> I'm not sure if it is easy, but I think it's clear that the linker script lexer needs to be improved. I think that is the source of the problems with `*(.apicdrivers);` as well. This is not the first bug related to lexing that we have run into (e.g. lexing `.=` as a single token is the cause of https://llvm.org/bugs/show_bug.cgi?id=31128).
>
> -- Sean Silva

PR31128 seems to be not an issue: both gold and bfd reject '.='.
So it seems the only known issue we have is with math expressions like "x = 5*4";
I am going to look again at how to fix that.

George.

Our tokenizer recognizes

    [A-Za-z0-9_.$/\\~=+*?\-:!<>]+

as a token. gold uses more complex rules to tokenize. I don't think we need rules that complex, but there seems to be room to improve our tokenizer. In particular, I believe we can parse Linux's linker script by changing the tokenizer rules as follows:

    [A-Za-z_.$/\\~=+*?\-:!<>][A-Za-z0-9_.$/\\~=+*?\-:!<>]*

or

    [0-9]+
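
For concreteness, here is a toy version (illustrative, not LLD's actual code) of a tokenizer driven by a single character class like the first rule above; because '*' and '=' are in the class, "5*6" comes out as one token while "5 * 6" comes out as three:

    #include <cctype>
    #include <cstring>
    #include <iostream>
    #include <string>
    #include <vector>

    // Toy one-rule tokenizer: any run of characters from the class
    // below forms a single token; everything else (whitespace, ';')
    // separates tokens.
    static bool inTokenClass(char C) {
      return isalnum((unsigned char)C) || strchr("_.$/\\~=+*?-:!<>", C);
    }

    static std::vector<std::string> tokenize(const std::string &S) {
      std::vector<std::string> Tokens;
      for (size_t I = 0; I < S.size();) {
        if (!inTokenClass(S[I])) { ++I; continue; }
        size_t Begin = I;
        while (I < S.size() && inTokenClass(S[I])) ++I;
        Tokens.push_back(S.substr(Begin, I - Begin));
      }
      return Tokens;
    }

    int main() {
      for (const std::string &T : tokenize("symbol = 5*6;"))
        std::cout << T << '\n'; // "symbol", "=", "5*6" -- 5*6 is one token
      for (const std::string &T : tokenize("symbol = 5 * 6;"))
        std::cout << T << '\n'; // "symbol", "=", "5", "*", "6"
    }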

> > > - D28094 (Implemented support for R_386_PC8/R_386_8 relocations)
> >
> > Do you remember where it was used?
>
> setup.elf:
>       ld.lld -m elf_i386 -T arch/x86/boot/setup.ld arch/x86/boot/a20.o arch/x86/boot/bioscall.o arch/x86/boot/cmdline.o arch/x86/boot/copy.o arch/x86/boot/cpu.o arch/x86/boot/cpuflags.o arch/x86/boot/cpucheck.o arch/x86/boot/early_serial_console.o arch/x86/boot/edd.o arch/x86/boot/header.o arch/x86/boot/main.o arch/x86/boot/mca.o arch/x86/boot/memory.o arch/x86/boot/pm.o arch/x86/boot/pmjump.o arch/x86/boot/printf.o arch/x86/boot/regs.o arch/x86/boot/string.o arch/x86/boot/tty.o arch/x86/boot/video.o arch/x86/boot/video-mode.o arch/x86/boot/version.o arch/x86/boot/video-vga.o arch/x86/boot/video-vesa.o arch/x86/boot/video-bios.o -o arch/x86/boot/setup.elf
>     ld.lld: error: do not know how to handle relocation 'R_386_PC8' (23)
>
> > I updated the patch for --emit-relocs; now it does not fail.
>
> Thanks, applied it, doesn't fail.
>
> I still didn't do anything about the "Setup too big!" problem, just commented out the assert. I tried booting the resulting bzImage and vmlinux with qemu. The bzImage only rebooted over and over, but the vmlinux did show an adorable picture (attached).

That's beautiful! It looks like some kernel or loader text got copied into the VGA text buffer. The "e with two dots above" characters look like Code Page 437 (https://en.wikipedia.org/wiki/Code_page_437) for 0x89, which is a common MOV opcode in x86 machine code. "Capital H" and "capital Phi" also look like common x86 opcodes.

I remember when we were getting the FreeBSD kernel to link correctly, one of the bugs was that the kernel load address was too low (because of our MAXPAGESIZE value), which meant that the location the kernel asked to be copied into actually overlapped the bootloader's text, which ended about as well as you would expect. We didn't get a pretty picture, though :)

-- Sean Silva

> Our tokenizer recognizes
>
>     [A-Za-z0-9_.$/\\~=+*?\-:!<>]+
>
> as a token. gold uses more complex rules to tokenize. I don't think we need rules that complex, but there seems to be room to improve our tokenizer. In particular, I believe we can parse Linux's linker script by changing the tokenizer rules as follows:
>
>     [A-Za-z_.$/\\~=+*?\-:!<>][A-Za-z0-9_.$/\\~=+*?\-:!<>]*
>
> or
>
>     [0-9]+

After more investigation, it seems this will not work so simply.
Here are examples where it breaks:
. = 0x1000; (gives tokens "0", "x1000")
. = A*10; (gives "A*10")
. = 10k; (gives "10", "k")
. = 10*5; (gives "10", "*5")

"[0-9]+" could be "[0-9][kmhKMHx0-9]*",
but for "10*5" that anyway gives the tokens "10" and "*5".
And I do not think we can involve any handling of operators, as it is hard to assume context at the tokenizing step:
we do not know whether we are parsing a file name or a math expression.

Maybe it is worth trying to handle this at a higher level, during evaluation of expressions?

Well, maybe, we should just change the Linux kernel instead of tweaking our tokenizer too hard.

> Well, maybe, we should just change the Linux kernel instead of tweaking our tokenizer too hard.

I agree; for now I am inclined to do that and watch for other scripts.

George.

> Well, maybe, we should just change the Linux kernel instead of tweaking our tokenizer too hard.

This is silly. Writing a simple and maintainable lexer is not hard (look e.g. at https://reviews.llvm.org/D10817). There are some complicated context-sensitive cases in linker scripts that break our approach of tokenizing up front (so we might want to hold off on those), but we aren't going to die from implementing enough to lex basic arithmetic expressions independent of whitespace.

We will be laughed at. ("You seriously couldn't even be bothered to
implement a real lexer?")

-- Sean Silva

> > Well, maybe, we should just change the Linux kernel instead of tweaking our tokenizer too hard.
>
> This is silly. Writing a simple and maintainable lexer is not hard (look e.g. at https://reviews.llvm.org/D10817). There are some complicated context-sensitive cases in linker scripts that break our approach of tokenizing up front (so we might want to hold off on those), but we aren't going to die from implementing enough to lex basic arithmetic expressions independent of whitespace.

Hmm... the crux of not being able to lex arithmetic expressions seems to be the lack of context sensitivity. E.g. consider `foo*bar`: it could be a multiplication, or it could be a glob pattern.

Looking at the code more closely, adding context sensitivity wouldn't be
that hard. In fact, our ScriptParserBase class is actually a lexer (look at
the interface; it is a lexer's interface). It shouldn't be hard to change
from an up-front tokenization to a more normal lexer approach of scanning
the text for each call that wants the next token. Roughly speaking, just
take the body of the for loop inside ScriptParserBase::tokenize and add a
helper which does that on the fly and is called by consume/next/etc.
Instead of an index into a token vector, just keep a `const char *` pointer
that we advance.

Once that is done, we can easily add a `nextArithmeticToken` or something
like that which just lexes with different rules.
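
A rough sketch of that shape (ScriptLexer, next, and nextArithmeticToken are placeholder names here, not LLD's actual interface): keep a cursor into the script text, scan one token per call, and expose a second scanning rule for arithmetic contexts.

    #include <cctype>
    #include <cstring>
    #include <iostream>
    #include <string>

    class ScriptLexer {
      const char *Cur, *End;

      void skipSpace() {
        while (Cur != End && isspace((unsigned char)*Cur)) ++Cur;
      }

    public:
      // Note: S must outlive the lexer; we only keep pointers into it.
      explicit ScriptLexer(const std::string &S)
          : Cur(S.data()), End(S.data() + S.size()) {}

      // Default rule: one glob/path/symbol-like run is one token.
      std::string next() {
        skipSpace();
        const char *Begin = Cur;
        while (Cur != End && (isalnum((unsigned char)*Cur) ||
                              strchr("_.$/\\~=+*?-:!<>", *Cur)))
          ++Cur;
        if (Cur == Begin && Cur != End) ++Cur; // lone punctuation like ';'
        return std::string(Begin, Cur);
      }

      // Arithmetic rule: a number (0x10, 10k, ...) or a single operator
      // character, so "5*512" lexes as "5", "*", "512" without spaces.
      std::string nextArithmeticToken() {
        skipSpace();
        const char *Begin = Cur;
        if (Cur != End && isdigit((unsigned char)*Cur))
          while (Cur != End && isalnum((unsigned char)*Cur)) ++Cur;
        else if (Cur != End)
          ++Cur;
        return std::string(Begin, Cur);
      }
    };

    int main() {
      std::string Script = "_end = 5*512;";
      ScriptLexer Lex(Script);
      std::cout << Lex.next() << '\n';                // "_end"
      std::cout << Lex.next() << '\n';                // "="
      // The parser, knowing it is inside an expression, switches rules:
      std::cout << Lex.nextArithmeticToken() << '\n'; // "5"
      std::cout << Lex.nextArithmeticToken() << '\n'; // "*"
      std::cout << Lex.nextArithmeticToken() << '\n'; // "512"
    }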

Implementing a linker is much harder than implementing a lexer. If we give our users the impression that implementing a compatible lexer is hard for us, what impression will we give them about the linker's implementation quality? If we can afford 100 lines of self-contained code to implement a concurrent hash table, we can afford 100 self-contained lines to implement a context-sensitive lexer. This is end-user-visible functionality; we should be careful about skimping on it in the name of simplicity.

-- Sean Silva

> > Our tokenizer recognizes
> >
> >     [A-Za-z0-9_.$/\\~=+*?\-:!<>]+
> >
> > as a token. gold uses more complex rules to tokenize. I don't think we need rules that complex, but there seems to be room to improve our tokenizer. In particular, I believe we can parse Linux's linker script by changing the tokenizer rules as follows:
> >
> >     [A-Za-z_.$/\\~=+*?\-:!<>][A-Za-z0-9_.$/\\~=+*?\-:!<>]*
> >
> > or
> >
> >     [0-9]+

> After more investigation, it seems this will not work so simply.
> Here are examples where it breaks:
> . = 0x1000; (gives tokens "0", "x1000")
> . = A*10; (gives "A*10")
> . = 10k; (gives "10", "k")
> . = 10*5; (gives "10", "*5")
>
> "[0-9]+" could be "[0-9][kmhKMHx0-9]*",
> but for "10*5" that anyway gives the tokens "10" and "*5".
> And I do not think we can involve any handling of operators, as it is hard to assume context at the tokenizing step:
> we do not know whether we are parsing a file name or a math expression.
>
> Maybe it is worth trying to handle this at a higher level, during evaluation of expressions?

The lexical format of linker scripts requires a context-sensitive lexer.

Look at how gold does it. IIRC there are three cases, something like: one for file-name-like things, one for numbers and such, and a last category for numbers and such where numbers can also include things like `10k` (I think; I would need to look at the code to remember for sure). It's done in a very elegant way in gold (passing a "can continue" callback that says which characters can continue the token). Which token regex to use depends on the grammar production (hence context-sensitive). If you look at the other message I sent in this thread just now, ScriptParserBase is essentially a lexer interface and can be pretty easily converted to a more standard on-the-fly character-scanning implementation of a lexer. Once that is done, adding a new method to scan a different kind of token for certain parts of the parser is straightforward.
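
Something like gold's scheme fits in a few lines (a sketch; the predicates are made-up examples, not gold's actual categories):

    #include <cctype>
    #include <cstring>
    #include <iostream>
    #include <string>

    // One generic scanner: the grammar production picks a "can
    // continue" predicate, which decides where the current token ends.
    static std::string scanToken(const char *&Cur, const char *End,
                                 bool (*CanContinue)(char)) {
      while (Cur != End && isspace((unsigned char)*Cur)) ++Cur;
      const char *Begin = Cur;
      while (Cur != End && CanContinue(*Cur)) ++Cur;
      if (Cur == Begin && Cur != End) ++Cur; // lone operator/punctuation
      return std::string(Begin, Cur);
    }

    // Context: file-name-like things keep glob characters.
    static bool fileNameChar(char C) {
      return isalnum((unsigned char)C) || strchr("_.$/\\~*?-", C);
    }

    // Context: expressions; a number may continue as 0x... or 10k.
    static bool numberChar(char C) { return isalnum((unsigned char)C); }

    int main() {
      const char *P = "5*512";
      const char *E = P + strlen(P);
      std::cout << scanToken(P, E, numberChar) << '\n';   // "5"
      std::cout << scanToken(P, E, numberChar) << '\n';   // "*"
      std::cout << scanToken(P, E, numberChar) << '\n';   // "512"
      const char *Q = "foo*bar";
      const char *F = Q + strlen(Q);
      std::cout << scanToken(Q, F, fileNameChar) << '\n'; // "foo*bar", a glob
    }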

-- Sean Silva

> > The lexical format of linker scripts requires a context-sensitive lexer.
> >
> > Look at how gold does it. IIRC there are three cases, something like: one for file-name-like things, one for numbers and such, and a last category for numbers and such where numbers can also include things like `10k` (I think; I would need to look at the code to remember for sure). It's done in a very elegant way in gold (passing a "can continue" callback that says which characters can continue the token). Which token regex to use depends on the grammar production (hence context-sensitive). If you look at the other message I sent in this thread just now, ScriptParserBase is essentially a lexer interface and can be pretty easily converted to a more standard on-the-fly character-scanning implementation of a lexer. Once that is done, adding a new method to scan a different kind of token for certain parts of the parser is straightforward.
> >
> > -- Sean Silva

> I think that approach should work and should not be hard to implement. Though when I think about that feature from an "end user POV", I wonder how many users it can have. AFAIK we have only one script found in the wild that suffers from the absence of whitespace in math expressions. It looks like 99.9% of scripts are free of that issue. And writing "5*6" instead of "5 * 6" does not look nice style-wise.
> Adding more code to LLD requires additional support for it in the end.
>
> I am not going to say we should or should not do that; that is just my concern. Moreover, I would probably try to do it anyway (just in case, to extend flexibility), though I can't say I see a real need for it at the moment, based on the above.

Most of the features in the linker are for a single user at the time they are implemented, but we know that we want that single user to work, so it doesn't matter. If the programs are buggy (not following the ELF spec or whatever), then it may make sense to push for a fix upstream. But asking a user to change their program just because we can't be bothered to implement something simple (and clearly "correct") does not reflect well on the LLD project.

-- Sean Silva

> Hmm... the crux of not being able to lex arithmetic expressions seems to be the lack of context sensitivity. E.g. consider `foo*bar`: it could be a multiplication, or it could be a glob pattern.
>
> Looking at the code more closely, adding context sensitivity wouldn't be that hard. In fact, our ScriptParserBase class is actually a lexer (look at the interface; it is a lexer's interface). It shouldn't be hard to change from an up-front tokenization to a more normal lexer approach of scanning the text for each call that wants the next token. Roughly speaking, just take the body of the for loop inside ScriptParserBase::tokenize and add a helper which does that on the fly and is called by consume/next/etc. Instead of an index into a token vector, just keep a `const char *` pointer that we advance.
>
> Once that is done, we can easily add a `nextArithmeticToken` or something like that which just lexes with different rules.

I like that idea. I first thought of always having '*' as a token, but then space has to be a token too, which is an incredible pain.

I then thought of having a "setLexMode" method, but the lex mode can always be implicit from where we are in the parser. The parser should always know whether it should call next or nextArithmetic, as sketched below.

And I agree we should probably implement this. Even if it is not common, it looks pretty silly not to be able to handle 2*5.
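
A tiny sketch of that, building on the hypothetical ScriptLexer from earlier in the thread (again, illustrative names rather than the actual LLD parser):

    // No lex-mode flag is stored anywhere; each production simply
    // calls the scanning rule that fits its own context.
    struct ScriptParser {
      ScriptLexer Lex;
      // Note: S must outlive the parser (the lexer keeps pointers into it).
      explicit ScriptParser(const std::string &S) : Lex(S) {}

      // Inside SECTIONS, patterns like *(.apicdrivers) want the
      // glob-friendly default rule.
      std::string readInputSectionPattern() { return Lex.next(); }

      // On the right-hand side of "sym = expr;", the arithmetic rule
      // applies, so "5*512" and "5 * 512" lex identically.
      std::string readExprToken() { return Lex.nextArithmeticToken(); }
    };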

Cheers,
Rafael

Sean,

So, as you noticed, the linker script tokenization rules are not trivial: they are context-sensitive. The current lexer is extremely simple and almost always works well. Improving "almost always" to "perfect" is not a high priority because we have many more high-priority things, but I'm fine if someone improves it. If you are interested, please take it. Or maybe I'll take a look at it. It shouldn't be hard; it's probably just half a day of work.

As far as I know, the grammar is LL(1), so it needs only one push-back buffer; a possible shape is sketched below. Handling the INCLUDE directive can be a bit tricky, though.
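
A sketch of such a single-token push-back (hypothetical names; Scan stands for whatever produces the next raw token, e.g. an on-the-fly lexer over the script text):

    #include <functional>
    #include <optional>
    #include <string>
    #include <utility>

    class Lookahead {
      std::function<std::string()> Scan; // produces the next raw token
      std::optional<std::string> Buf;    // at most one token of lookahead

    public:
      explicit Lookahead(std::function<std::string()> S)
          : Scan(std::move(S)) {}

      // Consume and return the next token.
      std::string next() {
        if (Buf) {
          std::string T = std::move(*Buf);
          Buf.reset();
          return T;
        }
        return Scan();
      }

      // Look at the next token without consuming it.
      const std::string &peek() {
        if (!Buf)
          Buf = Scan();
        return *Buf;
      }
    };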

Maybe we should rename ScriptParserBase to ScriptLexer.

> Sean,
>
> So, as you noticed, the linker script tokenization rules are not trivial: they are context-sensitive. The current lexer is extremely simple and almost always works well. Improving "almost always" to "perfect" is not a high priority because we have many more high-priority things, but I'm fine if someone improves it. If you are interested, please take it. Or maybe I'll take a look at it. It shouldn't be hard; it's probably just half a day of work.

Yeah. To be clear, I wasn't saying that this was high priority. Since I'm complaining so much about it, maybe I should take a look this weekend :)

> As far as I know, the grammar is LL(1), so it needs only one push-back buffer. Handling the INCLUDE directive can be a bit tricky, though.
>
> Maybe we should rename ScriptParserBase to ScriptLexer.

That sounds like a good idea.

-- Sean Silva