[BUG] Incorrect ASCII escape characters on Mac

Hi,

The bug surfaced when one of our tests failed with an answer mismatch
on Mac after upgrading to Xcode 6.2. The internal IR after all our
transforms, just before it hits the final LLVM lowering, is identical
on Mac and Linux:

  var t6 : int8_T{col}[10] = {34, -48, 18, -12, 33, 0, 21, -7, -20, -31};

The LLVM lowering itself doesn't try to do anything smart. When we
call dumpModule, we see the difference:

   @2 = internal global [24 x i8] c"\10\02\03\0D\05\0B\0A\08\09\07\06\0C\04\0E\0F\01\01\02\03\04\06\02\00\05"
   @3 = internal global [10 x i8] c"\22\00\12\00!\00\15\00\00\00"
   @4 = internal global [10 x i8] c"\00\19\00+\00\00#\03\00\11"
  -@5 = internal global [10 x i8] c"\22\D0\12\F4!\00\15\F9\EC\E1"
  -@6 = internal global [10 x i8] c"\D0\19\FB+\FD\F8#\03\E2\11"
  +@5 = internal global [10 x i8] c"\22Ð\12ô!\00\15ùìá"
  +@6 = internal global [10 x i8] c"Ð\19û+ýø#\03â\11"

The diff is between Linux and Mac; the added lines are from Mac.
Both of the @5 character sequences represent:

  34 208 18 244 33 0 21 249 236 225

in decimal, which is our original array viewed as unsigned bytes
(checked with https://r12a.github.io/apps/conversion/).
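
A minimal standalone sketch of that signed-to-unsigned mapping, for
reference (nothing LLVM-specific, just plain C++):

    #include <cstdint>
    #include <cstdio>

    int main() {
        // The signed initializers from t6 above...
        int8_t t6[10] = {34, -48, 18, -12, 33, 0, 21, -7, -20, -31};
        // ...printed as the unsigned bytes the IR escapes should encode:
        // 34 208 18 244 33 0 21 249 236 225
        for (int8_t v : t6)
            std::printf("%d ", static_cast<uint8_t>(v));
        std::printf("\n");
        return 0;
    }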

Let's try feeding the Mac version into llc, then:

  llc: maci.ll:1:32: error: constant expression type mismatch
  @0 = internal global [10 x i8] c"\22Ð\12ô!\00\15ùìá"

The Linux string is fine, of course.

So my conclusion is that some Unicode normalization code is broken.
Then again, if they really represent the same code points, why should
the program fail at all?

Thanks.

Ram

Not in this century, they don't.

That Ð, for example, is U+00D0 LATIN CAPITAL LETTER ETH, which in any
21st century system should be represented by the UTF-8 bytes 195,144.

Your string "\22Ð\12ô!\00\15ùìá" is much more likely be:
34 195 144 18 195 180 33 0 21 195 185 195 172 195 161

Your "Linux" version is encoding the bytes directly and not making
assumptions about character sets.
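
If it helps, here is a minimal sketch of that expansion (the helper
name is made up; it just treats each byte as a code point in
U+0000..U+00FF and emits UTF-8, which appears to be what the printer
on your Mac is doing):

    #include <cstdint>
    #include <cstdio>

    static void emitUTF8(uint8_t b) {
        if (b < 0x80)
            std::printf("%d ", b);                  // ASCII: one byte
        else
            std::printf("%d %d ", 0xC0 | (b >> 6),  // e.g. 0xD0 -> 195 144
                                  0x80 | (b & 0x3F));
    }

    int main() {
        uint8_t mac[] = {0x22, 0xD0, 0x12, 0xF4, 0x21,
                         0x00, 0x15, 0xF9, 0xEC, 0xE1};
        for (uint8_t b : mac)
            emitUTF8(b);  // prints 15 numbers in total, not 10
        std::printf("\n");
        return 0;
    }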

Thanks. Whose fault is it exactly, though? I can reproduce the issue
with this construction code:

    IRBuilder<> builder(getGlobalContext());
    auto M = new llvm::Module("main", getGlobalContext());
    std::vector<Type *> ATyAr;
    auto i8Ty = builder.getInt8Ty();
    auto i8ArTy = ArrayType::get(builder.getInt8Ty(), 10);
    auto pi8ArTy = PointerType::get(i8ArTy, 0);
    auto FTy = FunctionType::get(i8ArTy, ATyAr, false);
    auto Fcn = cast<Function>(M->getOrInsertFunction("testFcn", FTy));
    auto BB = BasicBlock::Create(builder.getContext(), "entry", Fcn);
    builder.SetInsertPoint(BB);
    std::vector<int> rawData = {34, -48, 18, -12, 33, 0, 21, -7, -20, -31};
    std::vector<Constant *> coercedData;
    for (auto el : rawData) {
        // The third argument is a bool (IsSigned), not a radix; mark the
        // values as signed so the negative entries are handled correctly.
        coercedData.push_back(ConstantInt::get(i8Ty, el, /* IsSigned = */ true));
    }
    auto V = ConstantArray::get(i8ArTy, ArrayRef<Constant *>(coercedData));
    builder.CreateRet(V);

    std::string scratchspace;
    raw_string_ostream scratch(scratchspace);
    auto aaw = new AssemblyAnnotationWriter;
    M->print(scratch, aaw);

It prints fine on Linux, but produces those weird characters on Mac.

Is it just the pretty-printer that's broken?
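
One way I can think of to check (assuming ConstantArray::get folds the
i8 elements into a ConstantDataArray, which I believe it does for
simple element types) is to read the values back numerically instead
of trusting the printer:

    // Sketch only: V is the array constant built above.
    if (auto *CDA = dyn_cast<ConstantDataArray>(V)) {
        for (unsigned i = 0, e = CDA->getNumElements(); i != e; ++i)
            errs() << (int)(int8_t)CDA->getElementAsInteger(i) << " ";
        errs() << "\n";
    }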

Thanks.

Ram

*sigh*

So it turns out that it's just a bug in dump(), which doesn't bother
me enough to chase further. Sure enough, the i8 array itself is fine:

    IRBuilder<> Builder(getGlobalContext());
    auto M = new llvm::Module("main", getGlobalContext());
    std::vector<Type *> ATyAr;
    auto FTy = FunctionType::get(Builder.getVoidTy(), ATyAr, false);
    auto ExecutionHandle = cast<Function>(M->getOrInsertFunction("main", FTy));
    auto BB = BasicBlock::Create(Builder.getContext(), "entry",
ExecutionHandle);
    Builder.SetInsertPoint(BB);
    std::vector<int> rawData = {34, -48, 18, -12, 33, 0, 21, -7, -20, -31};
    std::vector<Constant *> coercedData;
    auto i8Ty = Builder.getInt8Ty();
    for (auto el : rawData) {
        coercedData.push_back(ConstantInt::get(i8Ty, el, /* IsSigned = */ true));
    }

    // Declare this as a plain Type * so the one-element ArrayRef below
    // refers to the variable itself rather than to a converted temporary.
    Type *i8PtrTy = Builder.getInt8PtrTy();
    ArrayRef<Type *> ArgTys(i8PtrTy);
    FunctionType *PrintfTy =
        FunctionType::get(Builder.getInt32Ty(), ArgTys, /* IsVarArgs = */ true);
    auto PrintfHandle =
        dyn_cast<Function>(M->getOrInsertFunction("printf", PrintfTy));
    PrintfHandle->setCallingConv(CallingConv::C);
    auto FormatStringPtr = Builder.CreateGlobalStringPtr("%d ");

    auto i8ArTy = ArrayType::get(i8Ty, 10);
    auto StrConstant = ConstantArray::get(i8ArTy, ArrayRef<Constant *>(coercedData));
    auto GV = new GlobalVariable(*M, StrConstant->getType(), true,
                                 GlobalValue::PrivateLinkage, StrConstant);
    auto ConstantZero = ConstantInt::get(Type::getInt32Ty(Builder.getContext()), 0);
    for (auto i = 0; i < 10; i++) {
        std::vector<Value *> ThisElIdx = {
            ConstantZero,
            ConstantInt::get(Type::getInt32Ty(Builder.getContext()), i)};
        auto ElPtr = Builder.CreateGEP(GV, ArrayRef<Value *>(ThisElIdx));
        auto LoadedV = Builder.CreateLoad(ElPtr);
        // Promote the i8 to i32 before the varargs call (mirroring C's default
        // argument promotions) so that printf's %d reads a well-defined value.
        auto Promoted = Builder.CreateSExt(LoadedV, Builder.getInt32Ty());
        Builder.CreateCall2(PrintfHandle, FormatStringPtr, Promoted);
    }
    Builder.CreateRet(nullptr);

    LLVMInitializeNativeTarget();
    LLVMInitializeNativeAsmPrinter();

    auto EE = EngineBuilder(M).create();
    assert(EE && "Error creating MCJIT with EngineBuilder");
    typedef int (*MainFTy)();
    union {
        uint64_t raw;
        MainFTy usable;
    } functionPointer;
    functionPointer.raw = (uint64_t)EE->getPointerToFunction(ExecutionHandle);
    assert(functionPointer.usable && "no main function found");
    testing::internal::CaptureStdout();
    functionPointer.usable();
    auto Captured = testing::internal::GetCapturedStdout();
    // check Captured
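
The check itself would just compare against the signed values in their
original order; assuming the "%d " format string and the promotion to
i32 above, something like:

    EXPECT_EQ("34 -48 18 -12 33 0 21 -7 -20 -31 ", Captured);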