RFC: Adding support for the z/OS platform to LLVM and clang

Other good references:
- The 'chtag' utility
  https://www.ibm.com/support/knowledgecenter/SSLTBW_2.3.0/com.ibm.zos.v2r3.bpxa500/chtag.htm
- File tagging overview
  https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.cbcpx01/cbc1p273.htm

Kai, would use of auto conversion require that users set the _BPXK_AUTOCVT, _BPXK_CCSIDS, and/or _BPXK_PCCSID environment variables? Or do you envision having the clang driver set them before invocation of the compiler? If the latter, that would imply that users (and tests) are responsible for setting them for direct 'clang -cc1' invocations.

Here is another possible direction to consider that would provide a more portable facility. Clang has interfaces for overriding file contents with a memory buffer; see the overrideFileContents() overloads in SourceManager. It should be straightforward, when loading a file, to determine whether a conversion is needed (e.g., by considering file tags, environment variables, command line options, etc.) and, if so, to transcode the file contents and register the resulting buffer as an override. This would be useful for implementing -finput-charset and would benefit deployments in Microsoft environments that have source files in ISO-8859 encodings.
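
To make the transcode-then-override idea concrete, here is a minimal standalone sketch of the transcoding half for the ISO-8859-1 case. The names latin1ToUtf8, SourceEncoding, and prepareSourceBuffer are hypothetical and exist only for illustration; how the encoding is actually determined (file tags, environment variables, options) is left open, and the registration with SourceManager is only described in a comment so that the sketch compiles without Clang.

```cpp
#include <string>
#include <string_view>

// Transcode ISO-8859-1 bytes to UTF-8. Every ISO-8859-1 byte maps to the
// Unicode code point with the same numeric value, so bytes below 0x80 are
// copied and all others become a two-byte UTF-8 sequence.
std::string latin1ToUtf8(std::string_view In) {
  std::string Out;
  Out.reserve(In.size());
  for (unsigned char C : In) {
    if (C < 0x80) {
      Out.push_back(static_cast<char>(C));
    } else {
      Out.push_back(static_cast<char>(0xC0 | (C >> 6)));
      Out.push_back(static_cast<char>(0x80 | (C & 0x3F)));
    }
  }
  return Out;
}

// Hypothetical result of the "is a conversion needed?" decision.
enum class SourceEncoding { UTF8, Latin1 };

// Returns the bytes the rest of the frontend should see. In Clang, the
// converted result would be wrapped in a memory buffer and registered via
// one of the SourceManager overrideFileContents() overloads; that step is
// omitted here (assumption: it happens right after the file is loaded).
std::string prepareSourceBuffer(std::string_view Raw, SourceEncoding Enc) {
  if (Enc == SourceEncoding::Latin1)
    return latin1ToUtf8(Raw);
  return std::string(Raw);
}
```

An EBCDIC code page would need a table-driven converter instead of the arithmetic above, but the override step would be the same.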

Tom.

> > > 2) Add patches to Clang to allow EBCDIC and ASCII (ISO-8859-1) encoded
> > > input source files. This would be done at the file open time to allow
> > > the rest of Clang to operate as if the source was UTF-8 and so require
> > > no changes downstream. Feedback on this plan is welcome from the Clang
> > > community.
> >
> > Would it be correct to assume that this EBCDIC -> UTF-8 mapping would
> > be as prescribed by UTF-EBCDIC / IBM CDRA, notably for the control
> > characters that do not map exactly?
> > Notably, if the execution encoding is EBCDIC, is '0x06' equivalent to
> > '0086', etc?
> >
> > The question "Is Unicode sufficient to represent all characters
> > present in the input source without using the Private Use Area?" is
> > one that is relevant to both Clang and the C/C++ standard. (I do hope
> > that it is the case!)
>
> The current goal is to make only minimal changes to the frontend to enable
> reading of EBCDIC encoded files. For this, we use the auto-conversion
> service of z/OS UNIX System Services
> (https://www.ibm.com/support/knowledgecenter/SSLTBW_2.4.0/com.ibm.zos.v2r4.bpxb200/xpascii.htm),
> together with file tagging and setting the CCSID for the program and for
> opened files. The auto-conversion service supports round-trip conversion
> between EBCDIC and Enhanced ASCII. With it, bootstrapping with EBCDIC
> source files is possible.
> Of course, more complete UTF-8 support is a valid implementation
> alternative.

> Other good references:
> - The 'chtag' utility
>   https://www.ibm.com/support/knowledgecenter/SSLTBW_2.3.0/com.ibm.zos.v2r3.bpxa500/chtag.htm
> - File tagging overview
>   https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.cbcpx01/cbc1p273.htm
>
> Kai, would use of auto conversion require that users set the
> _BPXK_AUTOCVT, _BPXK_CCSIDS, and/or _BPXK_PCCSID environment
> variables? Or do you envision having the clang driver set them
> before invocation of the compiler? If the latter, that would imply
> that users (and tests) are responsible for setting them for direct
> 'clang -cc1' invocations.

Hi Tom,
the current approach is to enable auto conversion only if _BPXK_AUTOCVT is
set to ON. If the variable is not set, then all input files are treated as
EBCDIC. The rationale is that we do not want to outsmart the user.
So there is no problem with direct `clang -cc1` invocations. It's a good
hint that we need to describe this setup somewhere.
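
As a rough illustration of that policy (not the actual patch), the decision could be sketched as below. InputMode, selectInputMode, and selectInputModeFromEnvironment are hypothetical names. The variable name is the _BPXK_AUTOCVT spelling from the question. Treating any value other than an exact "ON" (for example OFF or ALL) the same as "not set" is an assumption made for this sketch.

```cpp
#include <cstdlib>
#include <string_view>

// How input files are handled under the policy described above.
enum class InputMode { AutoConvert, AssumeEBCDIC };

// Pure decision function; AutoCvtValue is nullptr when the variable is
// not set. Assumption: only the exact value "ON" enables auto conversion.
InputMode selectInputMode(const char *AutoCvtValue) {
  if (AutoCvtValue && std::string_view(AutoCvtValue) == "ON")
    return InputMode::AutoConvert;
  return InputMode::AssumeEBCDIC;
}

// Reads the setting from the process environment, so the result is the
// same for driver-launched and direct `clang -cc1` invocations.
InputMode selectInputModeFromEnvironment() {
  return selectInputMode(std::getenv("_BPXK_AUTOCVT"));
}
```

Keeping the decision separate from the getenv() call makes it easy to test without touching the environment.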

> Here is another possible direction to consider that would provide a
> more portable facility. Clang has interfaces for overriding file
> contents with a memory buffer; see the overrideFileContents()
> overloads in SourceManager. It should be straightforward, when
> loading a file, to determine whether a conversion is needed (e.g.,
> by considering file tags, environment variables, command line
> options, etc.) and, if so, to transcode the file contents and
> register the resulting buffer as an override. This would be
> useful for implementing -finput-charset and would benefit
> deployments in Microsoft environments that have source files in
> ISO-8859 encodings.

That's a good hint. I'll definitely have a look at it, as it sounds like
it could solve some problems/complexity. A separate solution would then
still be required for LLVM.

> Tom.

Best regards,
Kai Nacke
IT Architect

IBM Deutschland GmbH
Vorsitzender des Aufsichtsrats: Sebastian Krause
Geschäftsführung: Gregor Pillen (Vorsitzender), Agnes Heftberger, Norbert
Janzen, Markus Koerner, Christian Noll, Nicole Reimer
Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart,
HRB 14562 / WEEE-Reg.-Nr. DE 99369940

From: Kai Peter Nacke <kai.nacke@de.ibm.com>
Sent: Tuesday, June 16, 2020 11:17 AM
To: Tom Honermann <thonerma@synopsys.com>
Cc: Corentin <corentin.jabot@gmail.com>; llvm-dev@lists.llvm.org
Subject: RE: [llvm-dev] RFC: Adding support for the z/OS platform to LLVM and
clang

Hi Tom,
the current approach is to enable auto conversion only if _BPXK_AUTOCVT is set
to ON. If the variable is not set, then all input files are treated as EBCDIC. The
rationale is that we do not want to outsmart the user.
So there is no problem with direct `clang -cc1` invocations. It's a good hint that
we need to describe this setup somewhere.

That seems reasonable. How would you handle _BPXK_AUTOCVT being set to ALL?

(
For anyone following along, the difference between ON and ALL is described at https://www.ibm.com/support/knowledgecenter/SSLTBW_2.3.0/com.ibm.zos.v2r3.cbcpx01/setenv.htm#setenv:

When _BPXK_AUTOCVT is ON, automatic conversion can only take place between IBM-1047 and ISO8859-1 code sets. Other CCSID pairs are not supported for automatic text conversion. To request automatic conversion for any CCSID pairs that Unicode service supports, set _BPXK_AUTOCVT to ALL.

)

Tom.

From: Tom Honermann <Thomas.Honermann@synopsys.com>
To: Kai Peter Nacke <kai.nacke@de.ibm.com>
Cc: Corentin <corentin.jabot@gmail.com>, "llvm-dev@lists.llvm.org"
<llvm-dev@lists.llvm.org>
Date: 16.06.2020 19:09
Subject: [EXTERNAL] RE: [llvm-dev] RFC: Adding support for the z/OS
platform to LLVM and clang

That seems reasonable. How would you handle _BPXK_AUTOCVT being set to ALL?

(
For anyone following along, the difference between ON and ALL is described at
https://www.ibm.com/support/knowledgecenter/SSLTBW_2.3.0/com.ibm.zos.v2r3.cbcpx01/setenv.htm#setenv:

> When _BPXK_AUTOCVT is ON, automatic conversion can only take place
> between IBM-1047 and ISO8859-1 code sets. Other CCSID pairs are not
> supported for automatic text conversion. To request automatic
> conversion for any CCSID pairs that Unicode service supports, set
> _BPXK_AUTOCVT to ALL.
)

Tom.

That's a bit more complicated. For reading files, I can imagine the
following approach:
- the application is still using the ASCII execution mode (to link against
  the ASCII version of the library)
- on each file handle, the program CCSID is set to UTF-8 (1208)
- auto-conversion on the file is turned on if
  - _BPXK_AUTOCVT is set to ALL
  - the file is untagged (assuming EBCDIC 1047) or the file tag is not 1208
Writing text files would need a default encoding. Using UTF-8 (1208) would
make sense.
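
The read-side conditions above could be sketched roughly as follows. This is only an illustration of the decision logic: ConversionPlan and planRead are hypothetical names, the two listed conditions are assumed to be required together, and the actual z/OS calls that set the program CCSID and enable conversion on a file descriptor are deliberately left out.

```cpp
#include <optional>
#include <string_view>

constexpr int CCSID_UTF8 = 1208;    // program CCSID proposed above
constexpr int CCSID_IBM1047 = 1047; // assumed encoding of untagged files

// What should happen when a file is opened for reading.
struct ConversionPlan {
  bool Enabled;  // turn auto-conversion on for this file handle?
  int FromCcsid; // encoding the file is assumed to be in
  int ToCcsid;   // encoding the program wants to see
};

// AutoCvt is the value of the auto-conversion environment variable;
// FileTag is std::nullopt for an untagged file.
ConversionPlan planRead(std::string_view AutoCvt, std::optional<int> FileTag) {
  // Untagged files are assumed to be EBCDIC 1047.
  int Source = FileTag.value_or(CCSID_IBM1047);
  // Convert only when ALL is requested and the file is not already UTF-8.
  bool Enabled = AutoCvt == "ALL" && Source != CCSID_UTF8;
  return {Enabled, Source, CCSID_UTF8};
}
```

For example, planRead("ALL", std::nullopt) asks for a 1047 to 1208 conversion, while planRead("ALL", 1208) leaves the file untouched.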

This is really a "rough" first thought. I gave it a quick try, and it
failed. Most likely I overlooked something.

Best regards,
Kai Nacke
IT Architect
