Dear LLVM and OpenMP Members,
The purpose of this communication is to bring to your attention the availability of an open-source, multi-core GP-GPU-Compute engine, a companion RISC CPU and a RISC Coarse-Grained Scheduler (CGS), all three of which execute the same SYMPL ISA (see press release below).
LLVM still needs to be targeted to support this ISA, as do a cycle-accurate instruction-set simulator and debugger. If anyone is interested in initiating a re-targeting project, let me know, as I am sure we can work out a horse-trade of some sort.
SYMPL GP-GPU-Compute Engine and SYMPL 32-bit RISC CPU repository:
Yours very truly,
For Immediate Release
Open-Source, IEEE754-2008 Compliant, GP-GPU-Compute Engine gets 32-Bit RISC CPU and Coarse-Grained Scheduler that Execute Same Instruction-Set
Austin, TX–Designed for massively parallel, FPGA-accelerated, 32-bit single-precision floating-point applications, the SYMPL ISA open-source RTL library now includes not only the multi-core, interleaved multi-threading GP-GPU-Compute engine but also a 32-bit RISC CPU and a 32-bit Coarse-Grained Scheduler (CGS) that execute the same instructions as the GP-GPU, making the CPU, GP-GPU and CGS combination the world’s first and only RISC CPU, GP-GPU and CGS to feature a homogeneous instruction-set architecture.
Presently available for free download at the SYMPL GP-GPU-Compute Engine repository at GitHub, the library includes synthesizable Verilog RTL source code for the SYMPL CPU, GP-GPU and CGS, supporting configurations comprising the SYMPL CPU, one to sixteen GP-GPUs and one to four CGSs. The design is configured at its top level: just follow the instructions located at the bottom of the “read-me” file at the SYMPL GP-GPU-Compute Engine repository.
Also at the repository are example test cases for the single-, dual-, quad-, eight- and sixteen-shader (64-thread) implementations, all of which can be simulated on the free version of the Xilinx Vivado FPGA development environment, which can be downloaded at Xilinx.com. One test case employs a combination of the SYMPL RISC CPU, one or more SYMPL GP-GPUs and one or more SYMPL CGSs to perform a 3D transformation of a 3D model in .stl file format and write the transformed object back to the Vivado simulator working directory, again in .stl file format, so that the results of the 3D transform can be viewed using any online .stl file viewer, including the one built into GitHub. Specifically, the example 3D transform rotates, scales and translates the object on all three axes according to the amounts specified in the parameter list in the CPU’s program memory. Below is a .gif showing the “before” and “after” .stl-formatted 3D object as viewed using GitHub’s .stl viewer with the “Surface Angle” display mode selected.
The other example test case employs UC Berkeley’s RISC-V (VSCALE) CPU RTL model as the CPU in lieu of the SYMPL RISC CPU, while still performing the same 3D transform on the same 3D model and writing the result to the working directory. Its purpose is to compare the performance of the UC Berkeley RISC-V CPU with that of the SYMPL RISC CPU, both clocking at 100 MHz.
In both cases, the role of the CPU is merely to push the 3D transform parameters and the triangles comprising the 3D object into the SYMPL GP-GPU-Compute engine’s 64k-word data-pool, issue a command to the GP-GPU’s dedicated Coarse-Grained Scheduler indicating that data is available and that the 3D transform should be performed according to the parameters pushed into the data-pool, and then wait for the CGS to bring its “Done” signal back high, indicating that the CGS has completed the task. Upon sensing that the CGS is done, the CPU pulls the results out of the GP-GPU data-pool and writes them back to their original location in its own memory space, followed by a write of a semaphore value, which signals the test-bench that processing has completed. The test-bench then writes the results to the working directory in binary .stl file format. The transformed results can then be viewed using any online 3D .stl file viewer.
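The CPU-side handshake described above can be sketched as a small behavioral model. This is only an illustration in Python, not code from the SYMPL repository: the names `FakeCgs`, `cpu_offload` and `SEMAPHORE_DONE`, the semaphore value, and the scale-by-two "transform" are all hypothetical stand-ins.

```python
SEMAPHORE_DONE = 1  # hypothetical value; the text does not give the real one


class FakeCgs:
    """Hypothetical stand-in for the CGS together with its shaders."""

    def __init__(self, data_pool, transform):
        self.data_pool = data_pool
        self.transform = transform
        self.done = True      # "Done" is high while the CGS is idle
        self._pending = 0

    def command(self, count):
        """CPU signals that `count` items are waiting in the data-pool."""
        self.done = False     # "Done" drops while work is in progress
        self._pending = count

    def tick(self):
        """Advance the model: finish the pending work and raise "Done"."""
        for i in range(self._pending):
            self.data_pool[i] = self.transform(self.data_pool[i])
        self._pending = 0
        self.done = True


def cpu_offload(cpu_memory, data_pool, cgs):
    """Follow the steps the text assigns to the CPU."""
    count = len(cpu_memory)
    data_pool[:count] = cpu_memory      # 1. push data into the data-pool
    cgs.command(count)                  # 2. issue the command to the CGS
    while not cgs.done:                 # 3. wait for "Done" to go back high
        cgs.tick()
    cpu_memory[:] = data_pool[:count]   # 4. pull results back to their original location
    return SEMAPHORE_DONE               # 5. semaphore that signals the test-bench


pool = [0.0] * 8
cgs = FakeCgs(pool, transform=lambda v: 2.0 * v)
vertices = [1.0, 2.0, 3.0]
semaphore = cpu_offload(vertices, pool, cgs)
```

In this model the CPU never touches the data while "Done" is low, which is the only synchronization the text describes between the CPU and the CGS.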
Also in both cases, the role of the CGS is simply to distribute the workload as evenly as possible among the available GP-GPU shader threads. There are four interleaving threads per shader. Thus, for a configuration comprising four shaders, the maximum number of threads available to perform the work is sixteen. For sixteen shaders the maximum number of available threads is sixty-four, and so on. Every four shaders (sixteen threads) have one CGS dedicated to them. Most of the time, when a given CGS is not actually pushing parameters and data into a given shader’s parameter-data buffer, it is polling its command register waiting for a command from the CPU. When it does receive a command, it determines how many of its GP-GPU threads are available. It does this by simply counting how many of its shaders’ “Done” lines are asserted active high. Then it’s just a simple calculation: the number of triangles is divided by the number of available threads, and the quotient is how many triangles (plus a portion of any remainder from the divide operation) get pushed into a given thread’s parameter-data buffer.
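The divide-and-distribute step can be sketched in a few lines of Python. The function names are hypothetical, and the text does not say exactly how the remainder is apportioned; this sketch assumes the first few threads each take one extra triangle.

```python
def count_available(done_lines):
    """Count threads whose "Done" line is asserted (True means high)."""
    return sum(1 for line in done_lines if line)


def split_triangles(num_triangles, available_threads):
    """Return how many triangles each available thread receives."""
    if available_threads <= 0:
        raise ValueError("no threads available")
    base, remainder = divmod(num_triangles, available_threads)
    # Assumption: the first `remainder` threads take one extra triangle each.
    return [base + 1 if i < remainder else base
            for i in range(available_threads)]


done_lines = [True] * 16           # one CGS, four shaders, all sixteen threads idle
threads = count_available(done_lines)
shares = split_triangles(323, threads)
# 323 = 16 * 20 + 3, so three threads get 21 triangles and thirteen get 20.
```

With this policy no thread ever holds more than one triangle more than any other, which is one way of reading "as evenly as possible".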
So how does the UC Berkeley RISC-V (RV32I) RTL model compare to the SYMPL RISC CPU in terms of the time required to push the same number of triangles when both are clocking at 100 MHz? The RISC-V requires roughly 80 usec and the SYMPL RISC CPU requires roughly 20 usec, about a fourfold difference. At first glance this seems hard to believe, but it is true. If anyone would like to see for themselves, both test cases are presently available for download at the SYMPL GP-GPU repository at GitHub; just read the “read-me” file at GitHub for instructions on how to set up the simulation. Also at the repository are the original assembly language source files and assembled object files, so you can review the code yourself to make sure everyone is playing fair.
So, why the big difference? The answer is simple: the SYMPL ISA is based on an enhanced Harvard, dual-operand “mover” architecture and not on the outdated “load-store” model we were all taught in college. The classic load-store RISC model requires that each operand of a computation first be loaded into a register in the CPU’s internal register file before the computation can be carried out, and that the result also be stored in the register file before it can be written out to memory, with each step requiring at least one clock.
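To make the clock count concrete, here is a rough tally in Python for computing `c = a + b` when `a`, `b` and `c` all live in memory. It assumes exactly one clock per step, the minimum the text allows, and ignores loop overhead and stalls. The step descriptions and function name are illustrative and not taken from any particular ISA.

```python
# Steps a load-store machine needs for c = a + b with all three values in memory.
LOAD_STORE_STEPS = [
    "load a into a register",
    "load b into a register",
    "add the two registers",
    "store the result register to memory",
]

CLOCKS_PER_STEP = 1  # assumption: the text's minimum of one clock per step


def load_store_clocks(num_results):
    """Lower bound on clocks to produce `num_results` memory-to-memory results."""
    return num_results * len(LOAD_STORE_STEPS) * CLOCKS_PER_STEP


clocks_for_1000 = load_store_clocks(1000)  # 4000
```

Under these assumptions the load-store machine cannot do better than four clocks per memory-to-memory result.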
The SYMPL ISA is very different in a number of important respects. Firstly, it has no register file per se: everything, including the program counter, is memory-mapped into the same data space as data memory, such that the status register, program counter, indirect pointers (AR0-AR3), stack pointer, etc., are treated essentially the same way as memory. Thus, one way to look at the SYMPL RISC model is that the entire memory space “is” the register file, and the data is already “loaded” into it and available for computation as an operand. No cycles need be wasted loading a register file before a computation can be carried out. This suits modern FPGAs in particular, which now have literally megabytes of closely-coupled on-chip memory, much of which is never used. Consequently, in the newer FPGAs designed for massively parallel applications, there is no need for a register file per se to load and hold operands before a computation can be carried out.
Secondly, unlike the classical load-store model that loads a single operand at a time into its register file (at least one clock per operand), the SYMPL ISA “mover” architecture reads two operands simultaneously, performs the computation, and writes a result from the preceding computation back to memory, all in one clock cycle. To enable this capability, the SYMPL ISA requires tri-ported SRAM having two independent read-side address/data ports and one independent write-side port. With today’s larger FPGAs, such as Xilinx Kintex 7, UltraScale and UltraScale+ FPGAs and Altera’s Stratix-V and Arria 10 FPGAs, this is not a problem because these devices have megabytes of SRAM, both distributed in the fabric and in block form, which is far more than most applications can reasonably use. Building a tri-port memory is easy. Just sandwich two block SRAMs together and connect their write sides together, so that every write lands in both blocks while each block serves one of the two read ports. The RTL in the SYMPL ISA library shows how to do this.
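The "sandwich" construction can be modeled behaviorally in Python. This is a sketch of the general two-copies technique, not a transcription of the RTL in the library: the class and method names are hypothetical, and the read-before-write ordering within a clock is an assumption.

```python
class TriPortRam:
    """Two identical RAM blocks with shared write port and separate read ports."""

    def __init__(self, depth):
        self.block_a = [0] * depth   # serves read port A
        self.block_b = [0] * depth   # serves read port B

    def clock(self, rd_addr_a, rd_addr_b, wr_addr=None, wr_data=0):
        """One clock: two independent reads plus an optional write."""
        # Assumption: reads see the contents from before this clock's write.
        data_a = self.block_a[rd_addr_a]
        data_b = self.block_b[rd_addr_b]
        if wr_addr is not None:
            # The write sides are tied together, so both copies stay identical.
            self.block_a[wr_addr] = wr_data
            self.block_b[wr_addr] = wr_data
        return data_a, data_b


ram = TriPortRam(depth=16)
ram.clock(0, 0, wr_addr=3, wr_data=5)
ram.clock(0, 0, wr_addr=4, wr_data=7)
# Read both operands and write back their sum in a single clock.
a, b = ram.clock(3, 4, wr_addr=5, wr_data=12)
```

The cost of the technique is that the memory is stored twice, which is the trade the text argues is affordable on FPGAs with plentiful block SRAM.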
Thirdly, the SYMPL ISA has features absent from the RISC-V ISA that enable it to continuously read two operands, perform a computation on them and write a result out every clock cycle, without using unrolled loops (which can consume lots of program memory) and without leaving gaping holes in the instruction pipeline or the individual floating-point operator pipelines. Chief among these features are four auxiliary registers that function as indirect pointers and have auto-post-modification capability, meaning that each pointer can be configured to automatically post-increment, post-decrement or remain unchanged after each clock in which it is used. When used in combination with the SYMPL ISA RPT (“repeat”) instruction, the SYMPL RISC CPU can not only move data around faster than a DMA channel, but can also perform the same computation on large blocks of data using just two instructions (RPT n followed by the desired instruction, such as MOV, ADD, MUL, etc.), yielding a result every clock cycle.
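The combination of RPT and auto-post-modifying pointers can be illustrated with a behavioral model in Python. This is not SYMPL assembly and does not model pipeline timing. The names `AutoPointer` and `repeat` are hypothetical, and two details are assumptions: that the pointers step by one location, and that `RPT n` executes the following instruction exactly n times.

```python
import operator


class AutoPointer:
    """Indirect pointer with auto-post-modification (step is +1, -1 or 0)."""

    def __init__(self, address, step=0):
        if step not in (-1, 0, 1):
            raise ValueError("step must be -1, 0 or +1")
        self.address = address
        self.step = step

    def use(self):
        """Return the current address, then apply the post-modification."""
        current = self.address
        self.address += self.step
        return current


def repeat(memory, count, op, dest, src_a, src_b):
    """Model RPT `count` followed by a single dual-operand instruction."""
    for _ in range(count):
        a = memory[src_a.use()]
        b = memory[src_b.use()]
        memory[dest.use()] = op(a, b)


memory = [1, 2, 3, 4, 10, 20, 30, 40, 0, 0, 0, 0]
repeat(memory, 4, operator.add,
       dest=AutoPointer(8, +1),
       src_a=AutoPointer(0, +1),
       src_b=AutoPointer(4, +1))
# memory[8:12] is now [11, 22, 33, 44]
```

Leaving one source pointer's step at 0 turns the same two-instruction pattern into "combine every element with one constant", with no change to the loop itself.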
Just like the SYMPL GP-GPU, the SYMPL RISC CPU has a complete repertoire of IEEE754-2008 compliant, 32-bit, memory-mapped, single-precision floating-point operators, including FADD, FSUB, FMUL, FDIV, FMA, DOT, SQRT, LOG, EXP, ITOF and FTOI. The floating-point operators presently employed in both the SYMPL CPU and SYMPL GP-GPU were generated using FloPoCo’s floating-point generator. Because FloPoCo-generated floating-point operators are not, by themselves, IEEE754-2008 compliant, additional logic was added to bring them into conformance: “on-the-fly” directed rounding, quiet NaN production with diagnostic payload for invalid-operation exceptions, and capture registers with encoded diagnostics for divide-by-zero and overflow exceptions, to name a few. Since FloPoCo-generated operators flush subnormals to zero, the operator logic was slightly modified to disable the flush, allowing results to underflow gradually, pursuant to the IEEE754-2008 specification.
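The difference between flushing subnormals and gradual underflow can be seen in software. The Python sketch below illustrates the IEEE 754 binary32 behavior only and says nothing about how the SYMPL operators implement it; the helper names are hypothetical.

```python
import math
import struct

MIN_NORMAL_F32 = 2.0 ** -126   # smallest positive normal binary32 value


def to_f32(x):
    """Round a Python float to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def f32_bits(x):
    """Return the 32-bit pattern of x encoded as binary32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def flush_to_zero(x):
    """Emulate flush-to-zero: any nonzero subnormal becomes a signed zero."""
    if x != 0.0 and abs(x) < MIN_NORMAL_F32:
        return math.copysign(0.0, x)
    return x


gradual = to_f32(MIN_NORMAL_F32 / 2)   # 2**-127 survives as a subnormal
flushed = flush_to_zero(gradual)       # the same value under flush-to-zero
```

Here `gradual` keeps the value 2**-127 (exponent field zero, nonzero fraction), while `flushed` is 0.0, so a flushing operator silently loses a result that the standard requires to be representable.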
To help prevent stalls while floating-point operations are underway, each operator has associated with it sixteen randomly addressable result buffers, each thirty-five bits wide: thirty-two bits for the result plus three extra bits. These three extra bits are encoded to reflect which, if any, floating-point exception occurred during computation, and can be used to programmatically trigger alternate, delayed exception handling the instant the result is read from its result buffer. The results of a given operation are automatically binned out to the memory-mapped result buffer corresponding to the same memory-mapped address the input operands were originally written to. The floating-point operator pipelines (which vary from two to eleven clocks deep) are decoupled from the processor’s main instruction pipeline, so the CPU and/or GP-GPU can fill a given operator’s pipe in rapid succession. By the time the CPU or GP-GPU has written the last of the operands, the first result is already available for reading from its respective result buffer, with the results of the operands written after it following in turn.
Like the SYMPL GP-GPU floating-point operators, the SYMPL RISC CPU floating-point operators can accept (and the SYMPL RISC CPU has the ability to deliver) two new floating-point operands every clock cycle, especially when an RPT instruction is employed in combination with the dual-operand MOV instruction used to simultaneously write the two operands to the operator’s inputs. As a result, the sixteen-shader version of the SYMPL CPU and GP-GPU combination can execute roughly 2.1 billion floating-point operations per second when implemented in a Kintex 7 device clocking in the vicinity of 125 MHz. To put this into perspective, clocking at 100 MHz, the SYMPL single-shader GP-GPU can perform a 323-triangle 3D transformation on all three axes, including rotate, scale and translate, in roughly 225 usec. In comparison, the sixteen-shader version can have results ready within just 8 usec after the last input triangle is pushed into the last GP-GPU’s data-pool for processing.
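The press release does not say how the 2.1 billion figure was derived. One reading that reproduces it, offered here only as an assumption, is one floating-point result per clock from each of the sixteen shaders plus the CPU. The short Python calculation below works through that reading and also converts the single-shader timing into clocks per triangle; all variable names are illustrative.

```python
# Assumed reading of the peak figure: one result per clock per core.
SHADERS = 16
CPUS = 1
CLOCK_HZ = 125_000_000            # "in the vicinity of 125 MHz"
peak_flops = (SHADERS + CPUS) * CLOCK_HZ   # 2_125_000_000

# Single-shader timing from the text: 323 triangles in roughly 225 usec at 100 MHz.
CLOCKS_PER_USEC = 100             # 100 MHz
SINGLE_SHADER_USEC = 225
TRIANGLES = 323
single_shader_clocks = SINGLE_SHADER_USEC * CLOCKS_PER_USEC   # 22_500
clocks_per_triangle = single_shader_clocks / TRIANGLES        # a little under 70
```

The 8 usec figure for sixteen shaders is measured from the moment the last triangle is pushed, so it is not directly comparable with the 225 usec total and is left out of the calculation.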
Finally, also now included in the SYMPL ISA RTL library is the new SYMPL Intermediate Language (IL), which can be used in lieu of, or in addition to, SYMPL assembly language for writing SYMPL threads and programs. SYMPL-IL is very similar to a primitive form of the BASIC language. For example, instead of using the assembly language mnemonic for testing a bit, you can now use a literal “IF” GOTO statement. Another example is the “FOR…NEXT” loop. The IL makes the resulting code much easier to read and understand than straight assembly, yet yields identical object code produced by the same assembler. The SYMPL ISA instruction table for both the SYMPL assembler and SYMPL-IL is included with the library at the GitHub repository at the following link:
About SYMPL: The Why
The SYMPL GP-GPU-Compute project began in 2014 to address the lack of an open-source GP-GPU accelerator, so that anyone who wants to experiment with their own home-brew or college-brew CPU can easily put it on steroids just to see what it can do. It is hoped that academia and industry will see the merits of the open-source, FPGA-accelerated GP-GPU-Compute concept and collaborate to port LLVM and/or GCC to support the SYMPL ISA. In addition, SYMPL still needs a cycle-accurate instruction-set simulator and debugger that can work seamlessly with the Eclipse Integrated Development Environment. Hopefully, someday soon, SYMPL will be running Android and/or iOS applications in an FPGA system designed by a bunch of college students or a guy in his garage.