Regular Expressions

David_A_Greene · June 9, 2009, 7:00pm

That's potentially a lot of work. I started looking at it and it would
involve hiding the OpenBSD library under some other lib/System facade.
Designing that facade takes time and we probably won't get it right
the first or second time.

Another option is to just provide the OpenBSD regex code as an optional
library that gets compiled and linked for MSVC systems. I'm not at all
familiar with MSVC build procedures so I don't know how involved that is.

The advantage to the second approach is that we keep the POSIX interfaces
in the rest of the LLVM code, we don't have to design another set of
interfaces and we just use the OpenBSD library when we have to.

Thoughts?

-Dave

Sandeep_Patel · June 9, 2009, 7:12pm

<TRE — The free and portable approximate regex matching library.; looks somewhat promising as a
POSIX-compatible regex library with a suitable license.

deep

Dan_Gohman3 · June 9, 2009, 7:34pm

Can you describe what problem you're trying to solve here? Does it
really need Regular Expressions?

Dan

David_A_Greene · June 9, 2009, 7:37pm

r with MSVC build procedures so I don't know how involved that is.

>
> The advantage to the second approach is that we keep the POSIX interfaces
> in the rest of the LLVM code, we don't have to design another set of
> interfaces and we just use the OpenBSD library when we have to.

<TRE — The free and portable approximate regex matching library.; looks somewhat promising as a
POSIX-compatible regex library with a suitable license.

We've talked about http://www.openbsd.org/cgi-bin/cvsweb/src/lib/libc/regex/
as well. Is there any reason to prefer one over the other?

I'm testing an integration of the OpenBSD code right now. I can't test a
Windows build, though. So how should we go about doing that? I have the
autoconf and (I think) cmake work done, but I have no way of fully testing
it. Should I just check it in and let Windows users give it a try?

-Dave

David_A_Greene · June 9, 2009, 7:39pm

Yes. I want TableGen to be able to infer lots of stuff programmatically.
This helps tremendously when specifying things like, oh, AVX.

We could invent our own pattern matching syntax, but why?

-Dave

nlopes · June 9, 2009, 8:02pm

I guess the standard regex library is PCRE (http://pcre.org). It has its own api, but it also supports the POSIX interface. The regex syntax is Perl compatible.
We bundle this library in the PHP core and it works pretty well. It compiles everywhere and the maintenance burden is close to zero.

Nuno

David_A_Greene · June 9, 2009, 8:18pm

That's a big piece of software. The OpenBSD version is much smaller. I have
the OpenBSD version integrated so I'll intend to use that unless someone tells
me we can't for some reason.

Testing under Linux went fine, but that's to be expected.

The main issue with pcre is that it's meant to be built as a standalone
library. While I agree with that architecture, it doesn't fit well into
LLVM's libSystem notion. I don't know if we want to get into the business
of configuring and building external libraries as part of the LLVM build
process. At least I know *I* don't want to do it. Others should feel
free.

Really, we should be using TR1 but not all compilers are there yet.

BTW, is there a process for dropping compiler support? Eventually most
compilers will have TR1 except for very old versions. It's not feasible
to support all of those crufty compilers forever.

-Dave

Chris_Lattner · June 11, 2009, 5:28pm

Can you describe what problem you're trying to solve here? Does it
really need Regular Expressions?

Yes. I want TableGen to be able to infer lots of stuff programmatically.
This helps tremendously when specifying things like, oh, AVX.

I don't see how this relates to regex's, and really don't want to suck in an external regex library. Can you give an example of how this would help?

BTW, is there a process for dropping compiler support? Eventually most
compilers will have TR1 except for very old versions. It's not feasible
to support all of those crufty compilers forever.

There is no formal process. What do you need out of TR1? Is it possible to make it conditional? Many old compilers are already broken anyway, but all it takes is one "popular" one to make it a no-go.

-Chris

SamuraiCrow · June 11, 2009, 6:06pm

Hello everybody,

I don't quite understand how the proposed regex library works but I know that PEG parser generators use a super-set of regex functionality to define their parsers. There's also a nice one on Google code called YARDparser that uses templates based on PEGs to generate efficient recursive-decent parsers.

Furthermore, my partner and I am working on an interpreter for PEG syntax based parsers and LLVM Assembly code for all of the code-generation stages so that we can debug the code and the parser all in one go. It's part of the Mattathias BASIC project on SourceForge.net and don't let the LGPL license scare you. We can change the license for the parser generator. It's only the extensible BASIC compiler and interpreter we are building with it that is required to be LGPL due to the runtime libraries.

Does the prospect of a PEG based syntax interest the team for the TableGen? They are wonderfully simple to maintain since they don't allow ambiguity of any type. Their downfall is that since they use a greedy algorithm they sometimes degrade performance to O(n) where a nongreedy algorithm would result in O(1). Since the YARDparser is template-based, it should be possible to slip a stringmap into the keyword lookup to bring the performance back up to a more resonable specification.

Let me know what you think.

--Sam Crow

David_A_Greene · June 11, 2009, 9:01pm

>> Can you describe what problem you're trying to solve here? Does it
>> really need Regular Expressions?
>
> Yes. I want TableGen to be able to infer lots of stuff
> programmatically.
> This helps tremendously when specifying things like, oh, AVX.

I don't see how this relates to regex's, and really don't want to suck
in an external regex library. Can you give an example of how this
would help?

Sure:

[Top-level specification]
defm CVTSI2SS : sse1_avx_fp_cvt_scalar_xs_scalar64_xs_node_sintrinsic_rm<
0x2D, "cvtsi2ss", sint_to_fp, "cvtsi2ss", "f32", "i32">;

[Meanwhile, down in the guts...]

class fp_cvt_scalar_VXSnrr<
// Parent: avx_fp_cvt_scalar_xs_node_rm_DEF_V#NAME#_128rr
  bits<8> opc,
  string OpcodeStr,
  SDNode OpNode,
  string DstType,
  string SrcType,
  int CustomPatterns = 0,
  list<dag> patterns = ,
  string asm = ""

: fp_unary_vxs_n_rm_rr<

opc,
!cast<RegisterClass>(!patsubst("^f([0-9]+)","FR$1",!patsubst("^i([0-9]+)","GR$1",DstType))),
[...]

Basically, the code keys off type strings to deduce register classes and
other such things. This makes specifying things like converts much easier
because one doesn't need to pass a bunch of additional parameters around to
specify register classes, etc. when it is entirely obvious what they should be
given other parameters.

I'm trying to reduce the number of classes/parameters to deal with so I
can eventually get rid of the Perl script that's generating most of this
stuff right now. This kind of pattern matching and substitution really
helps simplify things to a point where I can imagine manually maintaining
a smaller amount of information.

The BSD regex library is very tiny, almost ideal for what we want to do, which
is fill a functionality hole on Windows only.

-Dave

David_A_Greene · June 11, 2009, 9:05pm

Right now we don't need anything as stong as a full parser for string pattern
matching and substitution in TableGen. Anything we add should be small. The
BSD regex library is about the smallest I've seen. Nine source files totaling
3,300 lines.

-Dave

Chris_Lattner · June 13, 2009, 5:25pm

Yes. I want TableGen to be able to infer lots of stuff
programmatically.
This helps tremendously when specifying things like, oh, AVX.

I don't see how this relates to regex's, and really don't want to suck
in an external regex library. Can you give an example of how this
would help?

Sure:

[Top-level specification]
defm CVTSI2SS : sse1_avx_fp_cvt_scalar_xs_scalar64_xs_node_sintrinsic_rm<
0x2D, "cvtsi2ss", sint_to_fp, "cvtsi2ss", "f32", "i32">;

[Meanwhile, down in the guts...]

class fp_cvt_scalar_VXSnrr<
// Parent: avx_fp_cvt_scalar_xs_node_rm_DEF_V#NAME#_128rr
bits<8> opc,
string OpcodeStr,
SDNode OpNode,
string DstType,
string SrcType,
int CustomPatterns = 0,
list<dag> patterns = ,
string asm = ""

: fp_unary_vxs_n_rm_rr<

opc,
!cast<RegisterClass>(!patsubst("^f([0-9]+)","FR$1",!patsubst("^i([0-9]+)","GR$1",DstType))),
[...]

Very clever.

Basically, the code keys off type strings to deduce register classes and
other such things. This makes specifying things like converts much easier
because one doesn't need to pass a bunch of additional parameters around to
specify register classes, etc. when it is entirely obvious what they should be
given other parameters.

I'm trying to reduce the number of classes/parameters to deal with so I
can eventually get rid of the Perl script that's generating most of this
stuff right now. This kind of pattern matching and substitution really
helps simplify things to a point where I can imagine manually maintaining
a smaller amount of information.

Right, I definitely agree with your goal of reducing redundancy in the .td files!

However, I don't see any reason to base this off of strings. Instead of passing down "f32" as a string, why not do something like this pseudo code:

class X86ValueType {
RegisterClass RegClass;
...
}

def X86_f32 : X86ValueType {
let RegClass = FR32;
... };
def X86_i32 : X86ValueType { ... };

Then change fp_cvt_scalar_VXSnrr to be something like this:

class fp_cvt_scalar_VXSnrr<
// Parent: avx_fp_cvt_scalar_xs_node_rm_DEF_V#NAME#_128rr
bits<8> opc,
string OpcodeStr,
SDNode OpNode,
X86ValueType DstType,
X86ValueType SrcType,
int CustomPatterns = 0,
list<dag> patterns = ,
string asm = ""

: fp_unary_vxs_n_rm_rr<

opc, DstType.RegClass,

This lets you encode whatever you want as properties of the dependent class, makes everything "type safe", and eliminates string munging. Would something like this work?

-Chris

David_A_Greene · June 15, 2009, 6:33pm

Chris Lattner wrote:

However, I don't see any reason to base this off of strings. Instead of passing down "f32" as a string, why not do something like this pseudo code:

class X86ValueType {
   RegisterClass RegClass;
   ...
}

def X86_f32 : X86ValueType {
   let RegClass = FR32;
   ... };
def X86_i32 : X86ValueType { ... };

Then change fp_cvt_scalar_VXSnrr to be something like this:

class fp_cvt_scalar_VXSnrr<
// Parent: avx_fp_cvt_scalar_xs_node_rm_DEF_V#NAME#_128rr
bits<8> opc,
string OpcodeStr,
SDNode OpNode,
X86ValueType DstType,
X86ValueType SrcType,
int CustomPatterns = 0,
list<dag> patterns = ,
string asm = ""

: fp_unary_vxs_n_rm_rr<

   opc, DstType.RegClass,

This lets you encode whatever you want as properties of the dependent class, makes everything "type safe", and eliminates string munging. Would something like this work?

Yes, that will work for this case and is probably a better solution than
regular expressions.

There are other cases where I want to key off opcode strings and
there it's not practical to wrap each opcode in its own class.

For example:

!subst(SRCTYPE,
!cast<ValueType>(!patsubst(".*ps$","v4f32",!patsubst(".*pd$","v2f64",OpcodeStr))),

To reduce redundancy, developers must be able to write generic patterns
like this:

[(set DSTREGCLASS:$dst, // rr, rrr
(xor (INTSRCTYPE (bitconvert (SRCTYPE SRCREGCLASS:$src1))),
(INTSRCTYPE (bitconvert (SRCTYPE SRCREGCLASS:$src2)))))],

The substitution then fills in the appropriate types, etc. based
on which variant (32-bit, 64-bit, AVX, etc.) is being produced.

I suppose you could argue that additional parameters specifying
the source and dest types could be passed, but why bother when
it is already encoded in the mnemonic? That would just be
adding error-prone redundancy.

In fact while writing this stuff this pattern matching and
substitution already caught a few typing errors for us.

-Dave

Chris_Lattner · June 15, 2009, 7:35pm

Why not synthesize the opcode string from the information passed down?

-Chris

David_A_Greene · June 17, 2009, 12:35am

> I suppose you could argue that additional parameters specifying
> the source and dest types could be passed, but why bother when
> it is already encoded in the mnemonic? That would just be
> adding error-prone redundancy.

Why not synthesize the opcode string from the information passed down?

That's actually how I started things out initially. The problem is that it
leads to a less intuitive specification. I'll see if I can illustrate.

Say we have a wrapper class like this:

class X86ValueType {
    ValueType VT;
    RegisterClass RegClass;
    string suffix;
}

Now we instantiate some concrete types:

class X86_f32 : X86ValueType {
  let VT = f32;
  let RegClass = FR32;
  let suffix = "ss";
}

class X86_v4f32 : X86ValueType {
  let VT = v4f32;
  let RegClass = VR128;
  let suffix = "ps";
}

class X86_v8f32 : X86ValueType {
  let VT = v8f32;
  let RegClass = VR256;
  let suffix = "ps";
}

Ok, you get the picture. Now let's look at how we would write instruction
patterns:

defm BLENDPS : sse41_avx_fp_binary_vector_osta_vintrinsic_rmi_rrmi<0x0C,
i32i8imm, "blend", "blend", "f32">;
defm BLENDPD : sse41_avx_fp_binary_vector_osta_vintrinsic_rmi_rrmi<0x0D,
i32i8imm, "blend", "blend", "f64">;

We have to send types as strings because we need to be able to munge them
for SSE vs. AVX. Passing multiple types (v4f32, v8f32, etc.) is not
practical, it just obfuscates the top-level specification.

Now we need some guts:

multiclass sse41_avx_fp_binary_vector_osta_vintrinsic_rmi_rrmi<
  bits<8> opc,
  PatFrag ImmClass,
  string OpcodeString,
  string Intrinsic,
  string BaseType,
  list<list<dag>> ipatterns = ,
  string asm = ""

{

  def rr_Int : ...
  def rm_Int : ...
  def V#NAME#_128rrr_Int : fp_binary_vector_irrr<
    opc,
    // I'm not even sure this field reference will work
    !strconcat(OpcodeStr,
               !cast<X86ValueType>(!strconcat("X86v??",
                                              BaseType)).suffix,
    !strconcat(Intrinsic,
               !strconcat("_",
                          !cast<X86ValueType>(!strconcat("X86v??",
                                                         BaseType)).suffix,
    cast<X86ValueType>(!strconcat("X86v??", BaseType)).VT,
    cast<X86ValueType>(!strconcat("X86v??", BaseType)).RegClass, // Src
    cast<X86ValueType>(!strconcat("X86v??", BaseType)).RegClass, // Dst
    [and probably some other stuff],
    ipatterns,
    asm
  > {
    let Prefix = TA;
    let HasOpSize = 1;
    let HasVEX = 1;
  }
  def V#NAME#_128rrm_Int : ...
  def V#NAME#_256rrr_Int : ... {
    let Prefix = TA;
    let HasOpSize = 1;
    let HasVEX = 1;
    let HasVEX_L = 1;
  }
  def V#NAME#_256rrm_Int : ...
}

Ok, that's the first level and right here we have a problem. How do we figure
out the vector length ("??" in the strings)? We could pass it at the top
level:

defm BLENDPS : sse41_avx_fp_binary_vector_osta_vintrinsic_rmi_rrmi<0x0C,
i32i8imm, "blend", "blend", "f32", 4>;

Now 4 is not the right vector length for VEX.L-encoded AVX, so we'll have to
munge it:

multiclass sse41_avx_fp_binary_vector_osta_vintrinsic_rmi_rrmi<
  bits<8> opc,
  PatFrag ImmClass,
  string OpcodeString,
  string Intrinsic,
  string BaseType,
  int BaseVL,
  list<list<dag>> ipatterns = ,
  string asm = ""

{

  [...]
  def V#NAME#_256rrr_Int : fp_binary_vector_irrr<
    opc,
    // Not even sure this field reference will work
    !strconcat(OpcodeStr,
               !cast<X86ValueType>(!strconcat(
                   // No support for cast to string yet, but not hard to add
                   // Need integer multiply support as well
                                   !strconcat("X86v", cast<string>(2*BaseVL)),
                                              BaseType)).suffix,
    [...]
  > {
    let Prefix = TA;
    let HasOpSize = 1;
    let HasVEX = 1;
    let HasVEX_L = 1;
  }
  [...]
}

All right, we can make it work, but is it really worth the pain? I'm really
concerned about making the top-level instruction specification .td as readable
as possible, since that's what people will be primarily editing.

So which is more intuitive and less error-prone?

defm BLENDPS : sse41_avx_fp_binary_vector_osta_vintrinsic_rmi_rrmi<0x0C,
i32i8imm, "blend", "blend", "f32", 4>;

or

defm BLENDPS : sse41_avx_fp_binary_vector_osta_vintrinsic_rmi_rrmi<0x0C,
i32i8imm, "blendps", "blendps">;

(We can probably drop the Intrinsic string specification since in most cases
we can generate it from the opcode string).

Before answering, consider that the developer must understand how top-level
parameters like type strings and vector lengths will be munged by the guts
classes. She needs to know this to know what values to pass for types and
vector lengths.

The key problem is the differing type requirements for AVX vs. SSE. There is
always some munging of some parameter required at some point. Those
parameters have to be strings since that's the only type in TableGen
flexible enough to have munging performed on it. We could add lots of
operators to munge other types but that would be needlessly complex and
expensive to implement.

Opcode strings are naturally strings. They contain all of the typing
information we need. I think it is better to pass fewer arguments and
let TableGen do the hard work of figuring out the type requirements.

If you buy that argument, we have to have some way to match string patterns.
Regular expressions is a natural solution.

-Dave

David_A_Greene · June 17, 2009, 12:49am

Here's another option:

defm BLENDPS : sse41_avx_fp_binary_vector_osta_vintrinsic_rmi_rrmi<0x0C,
i32i8imm, "blendps", "blendps", v4f32, v8f32>;

This is somewhere between the first and second options. It's not as
convenient as the second but is more intuitive than the first. Still,
looking at some random individual instruction, it wouldn't be immediately
clear to me what those multiple types mean. I might think they're source
and destination types, for example.

-Dave

Aaron_Gray1 · June 17, 2009, 1:17am

Named parameters might be a possible solution. Again maybe it might be too verbose.

Aaron

Chris_Lattner · June 17, 2009, 6:06am

I think I understand what you're saying with "munging strings is easier". However, I still don't understand why you can't pass down some 'my_f32' instead of '"f32"' and have the defm pull out the right fields from my_f32. The AVX type would be v8f32, the SSE type would be v4f32, etc.

More generally, I don't see how strings can be better in any circumstance: in any case where you pass down a string, you can pass down a def that has fields relating to how you would otherwise munge the string, no?

-Chris

David_A_Greene · June 18, 2009, 4:42pm

Ok, I've been refactoring things here a bit and I think what you propose
will work. I'll run with it and see if I get myself into any trouble.

Thanks for pushing on this. I think things'll come out better because
of it.

-Dave

Chris_Lattner · June 18, 2009, 5:32pm

Thank you Dave, I really appreciate your work on this!

-Chris

Topic		Replies	Views
RFC: AVX Pattern Specification [LONG] LLVM Dev List Archives	12	175	May 7, 2009
Regular Expression lib support LLVM Dev List Archives	38	426	August 30, 2009
Multi-Instruction Patterns LLVM Dev List Archives	7	270	September 25, 2008
TableGen and Greenspun LLVM Dev List Archives	52	554	October 11, 2011
Porting LLVM backend is no fun yet LLVM Dev List Archives	11	209	April 14, 2009

Regular Expressions

Related topics