MmapAllocator

Hi folks,

I've been doing work on memory reduction in Unladen Swallow, and
during testing, LiveRanges seemed to be consuming one of the largest
chunks of memory. I wrote a replacement allocator for use by
BumpPtrAllocator which uses mmap()/munmap() in place of
malloc()/free(). It has worked flawlessly in testing, and reduces
memory usage quite nicely in Unladen Swallow.

The code is available for review here. I'd appreciate feedback if
there's an interest in integrating this into LLVM trunk:
http://codereview.uplinklabs.net/1905049

Here are the results of our memory utilization tests. The 'Mem max'
numbers aren't particularly revealing though, so take a look at the
graphs. I think the spambayes benchmark was one of the most
interesting.

### 2to3 ###
Mem max: 39008.000 -> 38904.000: 1.0027x smaller
Usage over time: http://tinyurl.com/3axczjc

### bzr_startup ###
Mem max: 11996.000 -> 11984.000: 1.0010x smaller
Usage over time: http://tinyurl.com/2ucnbhb

### call_method ###
Mem max: 11632.000 -> 11544.000: 1.0076x smaller
Usage over time: http://tinyurl.com/22r6y9r

### call_method_slots ###
Mem max: 10908.000 -> 10820.000: 1.0081x smaller
Usage over time: http://tinyurl.com/3yqb2r4

### call_method_unknown ###
Mem max: 11216.000 -> 11152.000: 1.0057x smaller
Usage over time: http://tinyurl.com/3ypudfj

### call_simple ###
Mem max: 10692.000 -> 10540.000: 1.0144x smaller
Usage over time: http://tinyurl.com/2a62cbv

### django ###
Mem max: 21600.000 -> 20672.000: 1.0449x smaller
Usage over time: http://tinyurl.com/35rclpd

### float ###
Mem max: 15904.000 -> 15852.000: 1.0033x smaller
Usage over time: http://tinyurl.com/2vokmep

### hg_startup ###
Mem max: 7000.000 -> 7012.000: 1.0017x larger
Usage over time: http://tinyurl.com/3x4wneu

### iterative_count ###
Mem max: 9992.000 -> 9908.000: 1.0085x smaller
Usage over time: http://tinyurl.com/24dy7ql

### nbody ###
Mem max: 13552.000 -> 13240.000: 1.0236x smaller
Usage over time: http://tinyurl.com/23dstyu

### normal_startup ###
Mem max: 5380.000 -> 5396.000: 1.0030x larger
Usage over time: http://tinyurl.com/2fh7cmv

### nqueens ###
Mem max: 12832.000 -> 12756.000: 1.0060x smaller
Usage over time: http://tinyurl.com/29whema

### pickle ###
Mem max: 6856.000 -> 6844.000: 1.0018x smaller
Usage over time: http://tinyurl.com/3az5v6y

### pickle_dict ###
Mem max: 6848.000 -> 6836.000: 1.0018x smaller
Usage over time: http://tinyurl.com/2bkjdoh

### pickle_list ###
Mem max: 6836.000 -> 6824.000: 1.0018x smaller
Usage over time: http://tinyurl.com/23llzct

### regex_compile ###
Mem max: 39176.000 -> 38536.000: 1.0166x smaller
Usage over time: http://tinyurl.com/33wylgu

### regex_effbot ###
Mem max: 12340.000 -> 12084.000: 1.0212x smaller
Usage over time: http://tinyurl.com/37u84z4

### regex_v8 ###
Mem max: 33596.000 -> 33828.000: 1.0069x larger
Usage over time: http://tinyurl.com/397hyfm

### richards ###
Mem max: 12760.000 -> 12680.000: 1.0063x smaller
Usage over time: http://tinyurl.com/25n3wkl

### rietveld ###
Mem max: 29008.000 -> 28636.000: 1.0130x smaller
Usage over time: http://tinyurl.com/25uu4x3

### slowpickle ###
Mem max: 14096.000 -> 13804.000: 1.0212x smaller
Usage over time: http://tinyurl.com/2wzxmu2

### slowspitfire ###
Mem max: 94292.000 -> 93992.000: 1.0032x smaller
Usage over time: http://tinyurl.com/2wo4lrs

### slowunpickle ###
Mem max: 11620.000 -> 11516.000: 1.0090x smaller
Usage over time: http://tinyurl.com/26pw2cr

### spambayes ###
Mem max: 35896.000 -> 35860.000: 1.0010x smaller
Usage over time: http://tinyurl.com/2dhtkeb

### startup_nosite ###
Mem max: 4876.000 -> 4868.000: 1.0016x smaller
Usage over time: http://tinyurl.com/3agcts8

### threaded_count ###
Mem max: 10048.000 -> 9972.000: 1.0076x smaller
Usage over time: http://tinyurl.com/322hltw

### unpack_sequence ###
Mem max: 11252.000 -> 11244.000: 1.0007x smaller
Usage over time: http://tinyurl.com/34lsbqd

### unpickle ###
Mem max: 6872.000 -> 6860.000: 1.0017x smaller
Usage over time: http://tinyurl.com/2aeqeua

### unpickle_list ###
Mem max: 6860.000 -> 6844.000: 1.0023x smaller
Usage over time: http://tinyurl.com/36q766k

And to gauge the performance impact, I also ran the speed tests. It
seems using mmap()/munmap() has very little performance impact in
either direction, so that's good:

### 2to3 ###
35.590589 -> 35.824554: 1.0066x slower

### bzr_startup ###
Min: 0.157976 -> 0.155976: 1.0128x faster
Avg: 0.167575 -> 0.168924: 1.0081x slower
Not significant
Stddev: 0.00334 -> 0.00716: 2.1463x larger
Timeline: http://tinyurl.com/39thymp

### call_method ###
Min: 0.878663 -> 0.884666: 1.0068x slower
Avg: 0.887148 -> 0.888667: 1.0017x slower
Not significant
Stddev: 0.02062 -> 0.02074: 1.0058x larger
Timeline: http://tinyurl.com/2fm39l2

### call_method_slots ###
Min: 0.872706 -> 0.867387: 1.0061x faster
Avg: 0.877261 -> 0.872754: 1.0052x faster
Significant (t=2.510615)
Stddev: 0.01523 -> 0.01586: 1.0410x larger
Timeline: http://tinyurl.com/3x84s9m

### call_method_unknown ###
Min: 1.031445 -> 1.028433: 1.0029x faster
Avg: 1.039063 -> 1.034296: 1.0046x faster
Not significant
Stddev: 0.03708 -> 0.03702: 1.0016x smaller
Timeline: http://tinyurl.com/395gevs

### call_simple ###
Min: 0.594110 -> 0.589934: 1.0071x faster
Avg: 0.606276 -> 0.594366: 1.0200x faster
Significant (t=5.874137)
Stddev: 0.01760 -> 0.01752: 1.0049x smaller
Timeline: http://tinyurl.com/2a2zv56

### django ###
Min: 0.997650 -> 0.993266: 1.0044x faster
Avg: 0.999423 -> 0.995495: 1.0039x faster
Significant (t=18.075408)
Stddev: 0.00093 -> 0.00122: 1.3050x larger
Timeline: http://tinyurl.com/28oa6wo

### float ###
Min: 0.102826 -> 0.102910: 1.0008x slower
Avg: 0.110088 -> 0.110280: 1.0017x slower
Not significant
Stddev: 0.02758 -> 0.02762: 1.0015x larger
Timeline: http://tinyurl.com/2w6ol8d

### hg_startup ###
Min: 0.045993 -> 0.044993: 1.0222x faster
Avg: 0.053388 -> 0.053510: 1.0023x slower
Not significant
Stddev: 0.00250 -> 0.00258: 1.0322x larger
Timeline: http://tinyurl.com/2ec392w

### iterative_count ###
Min: 0.157216 -> 0.156526: 1.0044x faster
Avg: 0.166971 -> 0.166897: 1.0004x faster
Not significant
Stddev: 0.06835 -> 0.07249: 1.0604x larger
Timeline: http://tinyurl.com/2g9agwl

### nbody ###
Min: 0.443087 -> 0.464941: 1.0493x slower
Avg: 0.456435 -> 0.475809: 1.0424x slower
Not significant
Stddev: 0.05609 -> 0.05523: 1.0156x smaller
Timeline: http://tinyurl.com/2wd6z8r

### normal_startup ###
Min: 0.438015 -> 0.437763: 1.0006x faster
Avg: 0.438425 -> 0.438810: 1.0009x slower
Not significant
Stddev: 0.00024 -> 0.00274: 11.6231x larger
Timeline: http://tinyurl.com/34nunk3

### nqueens ###
Min: 0.693033 -> 0.698259: 1.0075x slower
Avg: 0.698948 -> 0.704770: 1.0083x slower
Not significant
Stddev: 0.02644 -> 0.02590: 1.0208x smaller
Timeline: http://tinyurl.com/39ydyjs

### pickle ###
Min: 1.654750 -> 1.669246: 1.0088x slower
Avg: 1.660298 -> 1.673813: 1.0081x slower
Significant (t=-17.007317)
Stddev: 0.00391 -> 0.00403: 1.0298x larger
Timeline: http://tinyurl.com/36zz8yk

### pickle_dict ###
Min: 1.859310 -> 1.862217: 1.0016x slower
Avg: 1.864953 -> 1.863408: 1.0008x faster
Significant (t=2.269051)
Stddev: 0.00300 -> 0.00377: 1.2590x larger
Timeline: http://tinyurl.com/32kz4l6

### pickle_list ###
Min: 1.059003 -> 1.045209: 1.0132x faster
Avg: 1.065780 -> 1.048728: 1.0163x faster
Significant (t=21.791102)
Stddev: 0.00413 -> 0.00368: 1.1223x smaller
Timeline: http://tinyurl.com/27rpxol

### regex_compile ###
Min: 0.828427 -> 0.832179: 1.0045x slower
Avg: 0.890980 -> 0.894830: 1.0043x slower
Not significant
Stddev: 0.26185 -> 0.26180: 1.0002x smaller
Timeline: http://tinyurl.com/38c3z8m

### regex_effbot ###
Min: 0.162540 -> 0.162873: 1.0020x slower
Avg: 0.167092 -> 0.167389: 1.0018x slower
Not significant
Stddev: 0.02830 -> 0.02830: 1.0001x smaller
Timeline: http://tinyurl.com/33r7s5y

### regex_v8 ###
Min: 0.164368 -> 0.163174: 1.0073x faster
Avg: 0.417027 -> 0.416113: 1.0022x faster
Not significant
Stddev: 0.86580 -> 0.86190: 1.0045x smaller
Timeline: http://tinyurl.com/3yabd5v

### richards ###
Min: 0.352872 -> 0.353289: 1.0012x slower
Avg: 0.354989 -> 0.355434: 1.0013x slower
Not significant
Stddev: 0.00543 -> 0.00549: 1.0108x larger
Timeline: http://tinyurl.com/36weagp

### rietveld ###
Min: 0.693800 -> 0.695649: 1.0027x slower
Avg: 0.984689 -> 0.984075: 1.0006x faster
Not significant
Stddev: 0.36204 -> 0.36497: 1.0081x larger
Timeline: http://tinyurl.com/2wfw2z5

### slowpickle ###
Min: 0.772514 -> 0.757389: 1.0200x faster
Avg: 0.821586 -> 0.805534: 1.0199x faster
Not significant
Stddev: 0.17998 -> 0.18494: 1.0275x larger
Timeline: http://tinyurl.com/37rb8d5

### slowspitfire ###
Min: 1.022256 -> 1.023268: 1.0010x slower
Avg: 1.023244 -> 1.024326: 1.0011x slower
Significant (t=-6.305389)
Stddev: 0.00022 -> 0.00119: 5.3064x larger
Timeline: http://tinyurl.com/29h96r4

### slowunpickle ###
Min: 0.384167 -> 0.380310: 1.0101x faster
Avg: 0.410666 -> 0.409001: 1.0041x faster
Not significant
Stddev: 0.08844 -> 0.09114: 1.0305x larger
Timeline: http://tinyurl.com/286o5wt

### spambayes ###
Min: 0.417142 -> 0.398649: 1.0464x faster
Avg: 0.598665 -> 0.574443: 1.0422x faster
Not significant
Stddev: 0.58158 -> 0.57470: 1.0120x smaller
Timeline: http://tinyurl.com/23ozclq

### startup_nosite ###
Min: 0.333111 -> 0.332463: 1.0019x faster
Avg: 0.338470 -> 0.335088: 1.0101x faster
Significant (t=3.325272)
Stddev: 0.00886 -> 0.00500: 1.7726x smaller
Timeline: http://tinyurl.com/23ob2vy

### threaded_count ###
Min: 0.181111 -> 0.182459: 1.0074x slower
Avg: 0.214138 -> 0.211476: 1.0126x faster
Not significant
Stddev: 0.14987 -> 0.14531: 1.0314x smaller
Timeline: http://tinyurl.com/2dkupks

### unpack_sequence ###
Min: 0.000217 -> 0.000218: 1.0033x slower
Avg: 0.000223 -> 0.000222: 1.0048x faster
Significant (t=11.155116)
Stddev: 0.00002 -> 0.00001: 3.0782x smaller
Timeline: http://tinyurl.com/26tzfco

### unpickle ###
Min: 1.187517 -> 1.173118: 1.0123x faster
Avg: 1.212250 -> 1.178640: 1.0285x faster
Significant (t=9.457625)
Stddev: 0.02461 -> 0.00508: 4.8419x smaller
Timeline: http://tinyurl.com/2djz64c

### unpickle_list ###
Min: 1.106733 -> 1.086683: 1.0185x faster
Avg: 1.147736 -> 1.129444: 1.0162x faster
Significant (t=4.530626)
Stddev: 0.01237 -> 0.02573: 2.0800x larger
Timeline: http://tinyurl.com/2fctpe9

Any thoughts?

- Steven

Hi Steven-

Nice, but will this not break Windows? From an initial glance over your patch, it seems to assume the existence of mmap() in some form or other.

Alistair

Correct, it does at the moment. I noted this on the code review.

- Steven

I've been doing work on memory reduction in Unladen Swallow, and
during testing, LiveRanges seemed to be consuming one of the largest
chunks of memory.

That's interesting. How did you measure this? I'd love to see your data.

Note that the LiveRange struct is allocated by a plain std::vector, and your patch doesn't change that. I assume you are talking about the VNInfo structs?

I wrote a replacement allocator for use by
BumpPtrAllocator which uses mmap()/munmap() in place of
malloc()/free().

It's a bit more complicated than that. Modern mallocs use a whole bag of tricks to avoid lock contention in multiprocessor systems, and they know which allocation sizes the kernel likes and which system calls to use.

By calling mmap directly, you are throwing all that system specific knowledge away.

It's great that you provide measurements, but it's not clear what you are measuring. Does 'mem max' include the overhead of asking the kernel for tiny 4K allocations, if any? Also, what is your operating system and architecture? That could make a big difference.

Have you looked at the effect of twiddling the default 4K slab size in BumpPtrAllocator? I suspect you could get more dramatic results that way.

Thanks for working on this!

/jakob

I've been doing work on memory reduction in Unladen Swallow, and
during testing, LiveRanges seemed to be consuming one of the largest
chunks of memory.

That's interesting. How did you measure this? I'd love to see your data.

Note that the LiveRange struct is allocated by a plain std::vector, and your patch doesn't change that. I assume you are talking about the VNInfo structs?

Steven has been using Instruments, and sending us screenshots. Does
anyone else know a better way of exporting that data?

I thought I dug into the register allocation code, and found the
VNInfo::Allocator typedef. I assumed that was getting the traffic we
saw in Instruments, but I don't have the data to back that up.

I wrote a replacement allocator for use by
BumpPtrAllocator which uses mmap()/munmap() in place of
malloc()/free().

It's a bit more complicated than that. Modern mallocs use a whole bag of tricks to avoid lock contention in multiprocessor systems, and they know which allocation sizes the kernel likes and which system calls to use.

By calling mmap directly, you are throwing all that system specific knowledge away.

So the goal of this particular modification was to find ways to return
large, one-time allocations that happen during compilation back to the
OS. For unladen-swallow, we have a long-lived Python process where we
JIT code every so often. We happen to generate an ungodly amount of
code, which we're trying to reduce. However, this means that LLVM
allocates a lot of memory for us, and it grows our heap by several MB
over what it would normally be. The breakdown was roughly 8 MB gets
allocated for this one compilation in the spam_bayes benchmark, with 2
MB coming from register allocation and 2 MB from SDNodes.

We are looking at using mmap/munmap to avoid permanently growing the heap.

This patch switches all allocators over to mmap, so you can see a lot
of "stitches" in the graphs, where an allocator is created and thrown
away quickly. Those allocations are probably better served by malloc.

It's great that you provide measurements, but it's not clear what you are measuring. Does 'mem max' include the overhead of asking the kernel for tiny 4K allocations, if any? Also, what is your operating system and architecture? That could make a big difference.

The memory size measurements in this data are all taken using
/proc/smaps data in Linux to find the number of dirty pages.

Have you looked at the effect of twiddling the default 4K slab size in BumpPtrAllocator? I suspect you could get more dramatic results that way.

If we did that, one thing that might happen is malloc might start
forwarding to mmap, but I think you have to allocate ~128K at a time
to hit that threshold.

Reid

I thought I dug into the register allocation code, and found the
VNInfo::Allocator typedef. I assumed that was getting the traffic we
saw in Instruments, but I don't have the data to back that up.

Are you using llvm from trunk? VNInfo is a lot smaller now than it was in 2.7. I would guess about a third of the liveness memory usage goes through the VNInfo BumpPtrAllocator.

[...]

By calling mmap directly, you are throwing all that system specific knowledge away.

So the goal of this particular modification was to find ways to return
large, one-time allocations that happen during compilation back to the
OS. For unladen-swallow, we have a long-lived Python process where we
JIT code every so often. We happen to generate an ungodly amount of
code, which we're trying to reduce. However, this means that LLVM
allocates a lot of memory for us, and it grows our heap by several MB
over what it would normally be. The breakdown was roughly 8 MB gets
allocated for this one compilation in the spam_bayes benchmark, with 2
MB coming from register allocation and 2 MB from SDNodes.

We are looking at using mmap/munmap to avoid permanently growing the heap.

Don't try to outsmart malloc; you are going to lose :wink:

This all depends on specific kernel implementation details, but you risk badly fragmenting your address space, and chances are the kernel is not going to handle that well. You are using the kernel as a memory manager, but the kernel wants to be used as a dumb slab allocator for malloc.

I assume that LLVM is properly freeing memory after jitting? Otherwise, that should be looked at.

So why isn't your malloc returning the memory to the OS?

Is it because malloc thinks you might be needing that memory soon anyway? Is it correct?

Does your malloc know that you are running with very little memory, and the system badly needs those 8MB? Maybe your malloc needs to be tuned for a small device?

Is LLVM leaving a fragmented heap behind? If so, why? That would be worth looking into.

/jakob

With mmap() it is always possible to fully release the memory once you
are done using it.
With malloc() no, it takes just 1 allocation at the end of the heap to
keep all the rest allocated. That wouldn't be a problem if libc would
use mmap() as the low-level allocator for malloc but it doesn't.
It uses sbrk() mostly for small (<128k) allocations, and even with
mmaps it caches them for a while.

I think that is because mmap() is slow in multithreaded apps, since it
needs to take a process level lock, which also contends with the lock
taken by pagefaults from other existing mmaps (in fact that lock is held
during disk I/O!).

Best regards,
--Edwin

With mmap() it is always possible to fully release the memory once you
are done using it.

Sure. Is that the goal, though? Why isn't malloc doing it already?

With malloc() no, it takes just 1 allocation at the end of the heap to
keep all the rest allocated. That wouldn't be a problem if libc would
use mmap() as the low-level allocator for malloc but it doesn't.
It uses sbrk() mostly for small (<128k) allocations, and even with
mmaps it caches them for a while.

Recommended reading: http://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf

I think that is because mmap() is slow in multithreaded apps, since it
needs to take a process level lock, which also contends with the lock
taken by pagefaults from other existing mmaps (in fact that lock is held
during disk I/O!).

Sounds awesome, let's do that :wink:

You are also leaving a bunch of 4K holes in your address space. On 32-bit systems, address space is a scarce resource.

/jakob

> With mmap() it is always possible to fully release the memory once
> you are done using it.

Sure. Is that the goal, though?

If the goal is to reduce fragmentation, possibly. You
don't know if you have fragmentation or not, the JITed app may fragment
memory for example.

Why isn't malloc doing it already?

Because it can't. sbrk() can only increase/decrease memory usage at the
end (like a stack); you can't release something in the middle.
That's one of the reasons why we wrote a pool-based memory allocator for
ClamAV.

> With malloc() no, it takes just 1 allocation at the end of the heap
> to keep all the rest allocated. That wouldn't be a problem if libc
> would use mmap() as the low-level allocator for malloc but it
> doesn't. It uses sbrk() mostly for small (<128k) allocations, and
> even with mmaps it caches them for a while.

Recommended reading:
http://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf

If jemalloc provides the same or better memory usage than
MMapAllocator, I think it'd be better to have a JEMallocAllocator
instead.
I think jemalloc is fairly portable (Firefox uses it), isn't it?

> I think that is because mmap() is slow in multithreaded apps, since
> it needs to take a process level lock, which also contends with the
> lock taken by pagefaults from other existing mmaps (in fact that
> lock is held during disk I/O!).

Sounds awesome, let's do that :wink:

Multithreaded performance should probably be benchmarked on a real app
with MMapAllocator, and with the MallocAllocator.

You are also leaving a bunch of 4K holes in your address space. On
32-bit systems, address space is a scarce resource.

Doesn't BumpPtrAllocator use a larger chunk size?

Best regards,
--Edwin

> With mmap() it is always possible to fully release the memory once
> you are done using it.

Sure. Is that the goal, though?

If the goal is to reduce fragmentation, possibly. You
don't know if you have fragmentation or not, the JITed app may fragment
memory for example.

Yes, the goal is to fully release the memory back to the OS.

Why isn't malloc doing it already?

Because it can't. sbrk() can only increase/decrease memory usage at the
end (like a stack); you can't release something in the middle.
That's one of the reasons why we wrote a pool-based memory allocator for
ClamAV.

Another thing malloc could do is to use madvise with MADV_DONTNEED to
free the pages in the middle of the heap, but malloc can't read your
mind, so it doesn't know that you aren't about to reallocate that
region of the heap.

> With malloc() no, it takes just 1 allocation at the end of the heap
> to keep all the rest allocated. That wouldn't be a problem if libc
> would use mmap() as the low-level allocator for malloc but it
> doesn't. It uses sbrk() mostly for small (<128k) allocations, and
> even with mmaps it caches them for a while.

Recommended reading:
http://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf

If jemalloc provides the same or better memory usage than
MMapAllocator, I think it'd be better to have a JEMallocAllocator
instead.
I think jemalloc is fairly portable (Firefox uses it), isn't it?

Reading the abstract, jemalloc seems like it has nothing to do with
keeping the total heap usage low and everything to do with performance
in a multithreaded app.

> I think that is because mmap() is slow in multithreaded apps, since
> it needs to take a process level lock, which also contends with the
> lock taken by pagefaults from other existing mmaps (in fact that
> lock is held during disk I/O!).

Sounds awesome, let's do that :wink:

Multithreaded performance should probably be benchmarked on a real app
with MMapAllocator, and with the MallocAllocator.

I predict that mmap will be slower than malloc, for obvious reasons.
The only way in which mmap could be better is that it reduces your
steady state heap usage.

You are also leaving a bunch of 4K holes in your address space. On
32-bit systems, address space is a scarce resource.

Doesn't BumpPtrAllocator use a larger chunk size?

Nope, it defaults to 4K. IMO that should be bumped up (pun wasn't
intended, but then I left it in...). Especially if we want to use
mmap as the allocator, increasing the slab size will reduce the number
of expensive system calls that grab the process lock.

Reid

So just to try to sum up, the goal of this kind of change is to reduce
long-term heap usage in applications that JIT code once at startup,
and then enter a steady state of doing work. Think of server
applications written in any JITed language, or the finance
applications I keep seeing pop up on this list.

I would say that LLVM sees the *most* use as a static, ahead-of-time
compiler, where code generation occurs continuously throughout the
process's life until it exits. In that kind of application, I would
presume that malloc is preferable to mmap, for all of the reasons that
Jakob described, ie avoiding address space fragmentation, avoiding
heavy-weight system calls, and (in the future, maybe?) multi-threaded
performance.

What I'm asking is whether it's worth adding code to LLVM to support
reducing memory usage for applications with the first use case.

Reid

>
>>
>>
>> > With mmap() it is always possible to fully release the memory
>> > once you are done using it.
>>
>> Sure. Is that the goal, though?
>
> If the goal is to reduce fragmentation, possibly. You
> don't know if you have fragmentation or not, the JITed app may
> fragment memory for example.

Yes, the goal is to fully release the memory back to the OS.

>> Why isn't malloc doing it already?
>
> Because it can't. sbrk() can only increase/decrease memory usage at
> the end (like a stack), you can't release something in the middle.
> That's one of the reasons why we wrote a pool-based memory allocator
> for ClamAV.

Another thing malloc could do is to use madvise with MADV_DONTNEED to
free the pages in the middle of the heap, but malloc can't read your
mind, so it doesn't know that you aren't about to reallocate that
region of the heap.

>>
>> > With malloc() no, it takes just 1 allocation at the end of the
>> > heap to keep all the rest allocated. That wouldn't be a problem
>> > if libc would use mmap() as the low-level allocator for malloc
>> > but it doesn't. It uses sbrk() mostly for small (<128k)
>> > allocations, and even with mmaps it caches them for a while.
>>
>> Recommended reading:
>> http://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf
>
> If jemalloc provides the same or better memory usage than
> MMapAllocator, I think it'd be better to have a JEMallocAllocator
> instead.
> I think jemalloc is fairly portable (Firefox uses it), isn't it?

Reading the abstract, jemalloc seems like it has nothing to do with
keeping the total heap usage low and everything to do with performance
in a multithreaded app.

"In late 2007, the Mozilla Project was hard at work improving Firefox's
memory usage for the 3.0 release, and jemalloc was used to solve
fragmentation problems for Firefox on Microsoft Windows platforms. You
can read here about the fruits of that labor."

>> > I think that is because mmap() is slow in multithreaded apps,
>> > since it needs to take a process level lock, which also contends
>> > with the lock taken by pagefaults from other existing mmaps (in
>> > fact that lock is held during disk I/O!).
>>
>> Sounds awesome, let's do that :wink:
>
> Multithreaded performance should probably be benchmarked on a real
> app with MMapAllocator, and with the MallocAllocator.

I predict that mmap will be slower than malloc, for obvious reasons.
The only way in which mmap could be better is that it reduces your
steady state heap usage.

Did you try jemalloc though? AFAIK it can act as a drop-in replacement
for malloc().

>> You are also leaving a bunch of 4K holes in your address space. On
>> 32-bit systems, address space is a scarce resource.
>
> Doesn't BumpPtrAllocator use a larger chunk size?

Nope, it defaults to 4K. IMO that should be bumped up (pun wasn't
intended, but then I left it in...). Especially if we want to use
mmap as the allocator, increasing the slab size will reduce the number
of expensive system calls that grab the process lock.

Agreed.

Best regards,
--Edwin

I’ve been doing work on memory reduction in Unladen Swallow, and
during testing, LiveRanges seemed to be consuming one of the largest
chunks of memory.

That’s interesting. How did you measure this? I’d love to see your data.

Note that the LiveRange struct is allocated by a plain std::vector, and your patch doesn’t change that. I assume you are talking about the VNInfo structs?

Steven has been using Instruments, and sending us screenshots. Does
anyone else know a better way of exporting that data?

So, just so you’re aware, direct calls to mmap are not intercepted and reported by Instruments. So using mmap instead of malloc will make your reported numbers go down, but that doesn’t necessarily mean you have better performance.

This is a problem for people doing performance measurements on Mac OS X and iOS, because exotic memory allocation schemes seem to be becoming more common (I hope not because they dodge reporting!). In particular, many image buffers are allocated directly from mmap and vm_allocate, within CoreGraphics and elsewhere.

-Ken
Cocoa Frameworks

>
>> I've been doing work on memory reduction in Unladen Swallow, and
>> during testing, LiveRanges seemed to be consuming one of the largest
>> chunks of memory.
>
> That's interesting. How did you measure this? I'd love to see your data.
>
> Note that the LiveRange struct is allocated by a plain std::vector, and
> your patch doesn't change that. I assume you are talking about the VNInfo
> structs?

Steven has been using Instruments, and sending us screenshots. Does
anyone else know a better way of exporting that data?

So, just so you're aware, direct calls to mmap are not intercepted and
reported by Instruments. So using mmap instead of malloc will make your
_reported_ numbers go down, but that doesn't necessarily mean you have
better performance.

I am aware. We used Instruments mostly to drill down and find the
places that were doing the allocation. The graphs generated by
perf.py use dirty pages.

This is a problem for people doing performance measurements on Mac OS X and
iOS, because exotic memory allocation schemes seem to be becoming more
common (I hope not because they dodge reporting!). In particular, many image
buffers are allocated directly from mmap and vm_allocate, within
CoreGraphics and elsewhere.

Yeah, it is kind of annoying that by doing this, we make it harder to
use Instruments to find problems. :-/

Reid

If this is for Mac OS X, you can use malloc zones instead. They also provide a way to deallocate all memory at once, and they probably work with Instruments too.

– Jean-Daniel