> From: cfe-dev <cfe-dev-bounces@lists.llvm.org> On Behalf Of Joerg

> Sonnenberger via cfe-dev

> Sent: Monday, October 28, 2019 6:54 PM

> To: cfe-dev@lists.llvm.org

> Subject: Re: [cfe-dev] Loop optimised into div (?)

>

> > >

> > > > > I found this when compiling a function that will convert a binary

> number to a sequence of decimal digits expressed in ascii. Every digit is

> determined by counting the number of subtractions required for power-or-

> ten factors . For that algorithm, only up to 9 subtractions are required

> at every step (or factor). Replacing that by divisions is a lot more

> expensive for the affected targets.

> > > >

> > > > Aha. I see, yes.

> > > >

> > > > I think the divisions are not as expensive as you might be assuming.

> > > > The key is to do the same as for the subtact version and divide by

> > > > 10000, then when the remainder is less than 10000 divide by 1000,

> and

> > > > so on. Don't repeatedly divide by 10 as in the recursive algorithm -

> -

> > > > that *will* be slow.

> > >

> > > It is also important to note that on any somewhat sane architecture,

> no

> > > division will actually be created here. Now if your architecture has

> > > neither a division nor a multiplication in hardware... Even then,

> > > division by 10 needs ~10 adds and shift in total.

> >

> > How do you do that?

> >

> > The only ways I know of, division by 10 needs up to wordSize-3

> > iterations of a loop, with each loop containing two

> > compare-and-branches, a subtract and an add (or OR) controlled by one

> > of the compares, and two shifts. That's eight instructions on most

> > ISAs.

>

> Division by ten is essentially multiplication by 0xcccd for 16bit ops

> and shifting the result by 15 or so. That's multiplying by 0xc (shift

> and add), 0x1111 = 0x101 * 0x11 (two adds and two shifts).

> and adding the original value. Roughly at least. Double the number if

> there is no good way to express 32bit numbers.

>

> Joerg

The book _Hacker's Delight_ has a whole chapter on integer division by

a constant, with proofs, turning into mostly multiplies and shifts.

Here are functions based on multiplying by 0xcccd and probably the

best Hacker's delight algorithm:

uint16_t div10(uint16_t n){

uint32_t p = n * (uint32_t)0xcccd;

return p>>19;

}

uint16_t divu10(uint16_t n) {

uint16_t q, r;

q = (n >> 1) + (n >> 2);

q = q + (q >> 4);

q = q + (q >> 8);

//q = q + (q >> 16);

q = q >> 3;

r = n - (((q << 2) + q) << 1);

return q + (r > 9);

}

I've verified they both work for all values of n by compiling for

x86_64 and risc-v. I can compile for msp430, but I don't have runtime

libraries or an emulator for it.

The MSP430 code generated for the above functions:

00000018 <div10>:

18: 0b 12 push r11

1a: 0a 12 push r10

1c: 09 12 push r9

1e: 08 12 push r8

20: 08 4f mov r15, r8

22: 09 43 clr r9

24: 0c 48 mov r8, r12

26: 0d 49 mov r9, r13

28: 0c 5c rla r12

2a: 0d 6d rlc r13

2c: 0c 58 add r8, r12

2e: 0d 69 addc r9, r13

30: 0c 5c rla r12

32: 0d 6d rlc r13

34: 0c 5c rla r12

36: 0d 6d rlc r13

38: 0e 4c mov r12, r14

3a: 0f 4d mov r13, r15

3c: 0e 5e rla r14

3e: 0f 6f rlc r15

40: 0e 5e rla r14

42: 0f 6f rlc r15

44: 0e 5e rla r14

46: 0f 6f rlc r15

48: 0e 5e rla r14

4a: 0f 6f rlc r15

4c: 0e 5c add r12, r14

4e: 0f 6d addc r13, r15

50: 0a 4e mov r14, r10

52: 0b 4f mov r15, r11

54: 4b ee xor.b r14, r11

56: 0b ee xor r14, r11

58: 8b 10 swpb r11

5a: 4a 4a mov.b r10, r10

5c: 8a 10 swpb r10

5e: 0e 5a add r10, r14

60: 0f 6b addc r11, r15

62: 0e 58 add r8, r14

64: 0f 69 addc r9, r15

66: 0d 4f mov r15, r13

68: 0e 4d mov r13, r14

6a: 0f 43 clr r15

6c: 12 c3 clrc

6e: 0f 10 rrc r15

70: 0e 10 rrc r14

72: 12 c3 clrc

74: 0f 10 rrc r15

76: 0e 10 rrc r14

78: 12 c3 clrc

7a: 0f 10 rrc r15

7c: 0e 10 rrc r14

7e: 0f 4e mov r14, r15

80: 38 41 pop r8

82: 39 41 pop r9

84: 3a 41 pop r10

86: 3b 41 pop r11

88: 30 41 ret

0000008a <divu10>:

8a: 0e 4f mov r15, r14

8c: 12 c3 clrc

8e: 0e 10 rrc r14

90: 12 c3 clrc

92: 0e 10 rrc r14

94: 0d 4f mov r15, r13

96: 12 c3 clrc

98: 0d 10 rrc r13

9a: 0d 5e add r14, r13

9c: 0e 4d mov r13, r14

9e: 12 c3 clrc

a0: 0e 10 rrc r14

a2: 12 c3 clrc

a4: 0e 10 rrc r14

a6: 12 c3 clrc

a8: 0e 10 rrc r14

aa: 12 c3 clrc

ac: 0e 10 rrc r14

ae: 0d 5e add r14, r13

b0: 0e 4d mov r13, r14

b2: 8e 10 swpb r14

b4: 4e 4e mov.b r14, r14

b6: 0e 5d add r13, r14

b8: 12 c3 clrc

ba: 0e 10 rrc r14

bc: 12 c3 clrc

be: 0e 10 rrc r14

c0: 12 c3 clrc

c2: 0e 10 rrc r14

c4: 0d 4e mov r14, r13

c6: 0d 5d rla r13

c8: 0d 5d rla r13

ca: 0d 5e add r14, r13