Mobile web apps are slow -- and GC is to blame

The blog post Mobile Web Apps are slow caught my eye. It makes several bold claims: JavaScript performance is plateauing; the real problem is the memory consumption of web apps; GC performance degrades unless roughly ten times more memory is available than is actually live; JS memory allocation is too hard to control; and GCed languages are therefore unsuited for mobile apps. Unfortunately, these claims are discussed only in the context of JS.

Does FP suffer, or would it suffer, from the same problems on systems with limited memory? I would be curious to hear from programmers who use Lua or OCaml for iOS applications about the problems mentioned in the blog post and about GC overhead in FP.

Addendum: The author of the blog post has since elaborated it into a much more detailed article, Why mobile web apps are slow. Meanwhile, Andreas Rossberg, who works on Google's V8 implementation of JavaScript, explained to me that the SunSpider benchmark is about the worst possible way to judge JS performance. For example, SunSpider gives a lot of weight to date operations that are hardly typical for web apps; sure enough, JS implementations are now really good at them, just to look good when benchmarked with SunSpider.

We could do better with GC

Perhaps the GC engines in browsers just aren't that advanced. This takes effort, and there are only a few great GC people in the world.

Or we could avoid GC more: reference counting, for example, has very good characteristics for short-lived and loosely shared objects. Yes, it can't deal with cycles, but those could be special-cased. Apple and Microsoft are getting a lot of mileage out of it in spite of its immediate overhead. We do have to re-architect our languages to be more friendly to reference counting, though, which Apple and Microsoft have done with Cocoa Objective-C and the various languages under WinRT.
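For illustration only, here is a toy reference-counting cell in JavaScript (nothing below is a real API; it is just a sketch of the trade-off described above): reclamation is prompt and local, but a cycle keeps both counts above zero and would have to be special-cased, e.g. by a cycle detector or weak back-references.

```javascript
// Toy reference counting, for illustration only (not a real API).
class RcCell {
  constructor(payload) {
    this.payload = payload;
    this.count = 1;            // the creator holds the first reference
    this.refs = [];            // outgoing references this cell owns
  }
  addRef(cell) { cell.count++; this.refs.push(cell); }
  release() {
    if (--this.count === 0) {
      console.log(`freed ${this.payload} immediately`); // prompt, no collector pause
      this.refs.forEach(r => r.release());              // may cascade ("domino effect")
      this.refs = [];
    }
  }
}

const a = new RcCell('a');
const b = new RcCell('b');
a.addRef(b);     // a -> b
b.addRef(a);     // b -> a: a cycle
a.release();
b.release();     // both counts are still 1, so the cycle is never freed
```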

Stack allocation is another optimization we could do better on.

Finally, RAM is cheap; let's just put more into our devices. I think device manufacturers could be making the wrong tradeoff here, but then again RAM also consumes power, so what do I know.

2 hrs

Apple's analysis is that doubling the RAM on the iPhone 5 would have cut two hours off the battery life.

Chip Stacking

In a naive implementation that would be true. Current chip-stacking techniques, however, could improve significantly on current battery life; certainly enough to offset doubling the on-board memory.

But the other question to ask is: why not implement a concurrent collector similar to C4, which largely eliminates the issue? Can the C4 concurrency approach be implemented correctly and safely on a processor with a weakly consistent memory model, or is that an open research problem?

I think that concurrent

I think that concurrent collectors have higher CPU usage as a tradeoff for their lower latency: when increased CPU usage means lower battery life, I'm not sure that this is a good solution for mobile phones.

Of course, if the CPU had a hardware-assisted GC (like Azul's Vega), then maybe it would be possible.

Maybe

Mozilla has been putting an incremental collector into its JavaScript engine, and I wouldn't be surprised if there's talk of going further with a concurrent one. That sort of thing is really important for game devs, and JS is slowly taking the reins from Flash on that front on the PC; I fully expect WebGL and related functionality to enable the same transition on the phone.

If you mean to go the further step of a parallel collector (e.g., multicore marking), the additional overheads might not be prohibitive. I'd expect the power costs are mostly related to memory access rather than compute, and the race-to-halt benefits of latency improvements may absorb any additional CPU overheads.

Some background

Just to put that into perspective, V8 has had an incremental collector for a long time, and marking is already parallel. (Meanwhile, Firefox still ships with a non-moving collector.)

Don't assume that JS collectors are no good. I would suspect that there are rather few GCed languages out there that have seen a comparable amount of engineering resources and expert knowledge going into GC.

As for the article, take it with a kilo of salt. I don't think that the author really knows what he is talking about. Two observations (which I already made elsewhere):

1. Any analysis of JavaScript performance that is based on the SunSpider benchmark has immediately disqualified itself. Compared to SunSpider, even the Great Language Shootout is a highly scientific benchmark. And because all vendors know that SunSpider is a joke, most of them no longer waste resources on trying to make their numbers look better.

2. The real hard performance wall for JavaScript is not GC as such, but the language's "dynamic" semantics. You pay a lot for useless corner cases and the total lack of invariants. More importantly, many heuristic optimisations require under-the-hood allocations all over the place, which the programmer can neither predict nor avoid.
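For a rough illustration of such hidden allocations, consider the ordinary-looking JavaScript below; exactly which of these operations allocate, and when, depends on the engine and its optimisation tiers, which is precisely the problem.

```javascript
// Seemingly allocation-free code that may allocate under the hood.
// Whether each allocation actually happens is engine-dependent.
function sum(...xs) {             // a rest array may be materialised per call
  let total = 0;
  for (const x of xs) {           // for-of uses the iterator protocol, which may
    total += x;                   //   allocate an iterator and result objects
  }
  return total;                   // if total leaves the small-integer range, some
}                                 //   engines box it as a heap-allocated number

const add = (a) => (b) => a + b;  // each call of add allocates a fresh closure
console.log(sum(1, 2, 3), add(1)(2)); // 6 3
```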

Anybody interested in a qualified evaluation of the cost of garbage collection should look at more well-behaved languages, where fewer random variables dominate the picture. As Christian hinted, FPLs would probably be a more suitable source of insight.

Neat

I saw that V8 landed the incremental GC last year, but is there a description or anything about the parallelization? That's news to me. I saw that Firefox had been exploiting parallelism across compiler components, but I didn't know how Chrome was exploiting parallelism within any individual component. Do you know any other examples of this?

Agreed about JSVM performance, and the uninformed nature of the performance blog posts. For example, for small mobile caches, my benchmarks a few years back suggested that optimizing *browser* data representations would be a higher-impact feat.

Parallelism

is there a description or anything about parallelization?

Nothing specific yet, I'm afraid (other than the sources), but a posting to the Chromium blog has been planned for a while.

I didn't know how Chrome was exploiting parallelism within any individual component. Do you know any other examples of this?

I don't know about Chrome in general, but V8 also does concurrent (background) compilation now, at least for Crankshaft. If I got the versions right, that will ship with Chrome 29.

Thanks

Ah, thanks. For parallelism within a component, I meant things like parallelism within the parser, not running the parser concurrently with the main thread. The algorithmic properties become interesting this way, such as less data movement across stages.

I see

But just out of interest, did you pick the parser example on purpose? Because I'm not aware of any approach to parallelising parsing, at least not for the LR(1)-ish grammars normally used for PLs, that doesn't require search.

Totally doable

Our group does this sort of thing; it's fun :-) Parallelism within lexing, parsing, selectors, cascade, layout, fonts, tessellation, and a couple of others.

For the case of lexing and parsing as MIMD, we found speculation to be a good fit for data-parallel partitioning of the input. Speculation is especially clean in the case of a lexer (I think we described the basic idea for speculative FSMs as part of our HotPar '09 paper). The intuition is that even if you start an FSM in the wrong state at some arbitrary point in the string, after a warm-up period it will usually hit the right state and be right thereafter. (You can tack on tricks like a cheap scan for a good starting point, such as "function".) The worst case is that the speculation fails and you fall back to about 2x the sequential cost, but that's unlikely and mispeculations can be localized. Seth ended up doing another version for PEGs that built off these ideas, and some of the work ended up as part of Qualcomm's Zoomm browser (PPoPP 2013).
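A deliberately tiny sketch of the chunked-speculation protocol, with a two-state scanner standing in for a real lexer (this illustrates the idea, not the HotPar or Zoomm implementation): every chunk is scanned under the guess that it starts in the default state, and a cheap sequential pass re-runs only the chunks whose guess turned out to be wrong.

```javascript
// Speculative, chunked FSM execution (illustrative sketch only).
// Toy DFA: state 0 = normal text, state 1 = inside a double-quoted string.
function step(state, ch) {
  if (state === 0) return ch === '"' ? 1 : 0;
  /* state 1 */    return ch === '"' ? 0 : 1;
}

// Run the DFA over a chunk from a given start state; return the end state.
function runChunk(chunk, startState) {
  let s = startState;
  for (const ch of chunk) s = step(s, ch);
  return s;
}

function speculativeScan(input, numChunks) {
  const size = Math.ceil(input.length / numChunks);
  const chunks = [];
  for (let i = 0; i < input.length; i += size) chunks.push(input.slice(i, i + size));

  // Phase 1 (parallel in a real system, e.g. one worker per chunk): guess that
  // every chunk starts in the default state 0 and compute its end state.
  const guessedStart = chunks.map(() => 0);
  const endState = chunks.map((c, i) => runChunk(c, guessedStart[i]));

  // Phase 2 (sequential and cheap): verify the guesses. If chunk i-1 actually
  // ends in a state other than what chunk i assumed, re-run chunk i.
  for (let i = 1; i < chunks.length; i++) {
    const actualStart = endState[i - 1];
    if (actualStart !== guessedStart[i]) {
      guessedStart[i] = actualStart;
      endState[i] = runChunk(chunks[i], actualStart);   // misspeculation: redo
    }
  }
  return endState[endState.length - 1];
}

// Same answer as one sequential pass over the whole input:
// speculativeScan('a "b c" d', 3) === runChunk('a "b c" d', 0)
```

The worst case is that every chunk has to be re-run once, i.e. roughly twice the sequential work, which matches the bound mentioned above.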

Trickier yet is to do SIMD on each core. I think it's doable for lexing (e.g., see work being done by Todd Mytkowicz), but parsing is much less clear.

another real hard performance wall seen by end users to date

in mobile web stuff, and even some desktop web stuff, is being interrupted by collections. especially lame in games, but really not acceptable to anybody who cares about ui/ux, i'd posit. when you write 'native', with or without gc (which i personally don't enjoy that much; i like gc better, assuming i can get away with using it), you don't get so much of the 'oh, it works well on x but the same code sucks on y' problem. (well, that's overstating it, because android is so varied and has such a long tail of old versions that it is actually still hard to do even native extremely well, but it is less hard to control than if you did it all as html5, i think.)

Best-funded implementations

I think the amount of money invested in JavaScript engine performance across Google, Apple, Mozilla, Microsoft probably outpaces the amount for any other language/programming environment. It's possible they just haven't gotten to GC yet (because there were other issues to fix first), but I wouldn't bet against them getting the best GC people.

So what are the best GC people working on?

If browser vendors spend a lot of resources on JavaScript implementations (and there are indications that they indeed do), why would they not try to get the best people to work on GC? Especially if there is some consensus that GC performance is critical on mobile systems.

What makes you think the best GC people work on something else? And what would that be? Java VMs? Haskell?

Most functional programming languages

Most functional programming languages use tracing collectors, but you can always use linear types or regions to do the job. See Linear Lisp and Jhc. My current language project has strict evaluation and no mutation, so I’m planning to just use reference counting for heap-allocated values—and that’s mainly just an optimisation to avoid copying.

From my limited experience with OCaml, I’d say GC in that language seems to present fewer issues than in, say, Haskell. Laziness and purity can be somewhat problematic when it comes to memory behaviour, but OCaml’s strictness and impure style are advantages when memory is at a premium.

related links

See also today's Hacker News discussion about yesterday's post by Dan Bricklin in reply to Drew Crawford's related article Why mobile web apps are slow. (I own an iPhone, but I only look at it to check the time; my phone is just a pager so my sons can call me once a day. No, it wasn't a useful purchase, since reception is worse than on my old phone. I'm not part of the smartphone market, and so have no comment that's not conjecture.)

HN discussion

The discussion on HN mainly points out that the browser environment provides highly optimized routines for the presentation aspects. What remains for the app itself to do is often not that complex, and in the end it can still be fast if you don't need to crunch data.

This argument is a bit like saying that shell scripts have access to optimized tools and otherwise provide glue code. Indeed, this was the way of app development on Unix early on, but hardly anymore today. The discussion doesn't address the main point of the article, namely that the GC requirements of JS are the real problem. Someone argues that DOM manipulation, which provides access to the presentation layer, is slow and poses a problem too, and needs to be benchmarked.

Two different styles of data

I think the fundamental issue isn't GC so much as GC for large data. What probably makes sense is for the language to have GCed data structures for most data, and non-GC structures with high-efficiency routines for large objects. Variables that keep track of time, or a list of contacts, can be garbage collected using any sane strategy, while video or photographs need to be handled much more efficiently.

I don't see any reason a high-level language with low-level subroutines wouldn't work to fix this problem. Ultimately the issue with JavaScript is not GC but that there is no way for JavaScript to load low-level subroutines onto the handset. I think Christian's comment above about shell scripts wrapping C routines is exactly the kind of ideal you would want. The program would then consist of the high- and low-level parts together, at least on iOS, where third-party libraries aren't allowed (excluding a few inconsequential exceptions).

I think big objects in most

I think big objects on most platforms, like photos and videos, are already managed manually with explicit dispose-like methods... basically like file descriptors (which, in fact, they probably are).

exactly

JavaScript doesn't have that because the browser API doesn't support it, but even it could. To my mind that's all that is needed: just some way for the programmer to say "this needs to be handled manually" while everything else is handled automatically.
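As a sketch of what such an escape hatch might look like (the class and method names here are hypothetical, not an existing browser API):

```javascript
// Hypothetical sketch only: a dispose-style escape hatch for big objects.
class ManagedImage {
  constructor(width, height) {
    // Imagine this buffer living outside the normal GCed JS heap.
    this.pixels = new ArrayBuffer(width * height * 4);
    this.disposed = false;
  }
  dispose() {
    // The programmer decides when the big allocation goes away;
    // small bookkeeping objects are still left to the GC.
    this.pixels = null;
    this.disposed = true;
  }
}

const img = new ManagedImage(4096, 4096); // ~64 MB, too big to leave to GC timing
// ... draw it, upload it to a texture, etc. ...
img.dispose();                            // released deterministically, not "eventually"
```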

Getting there

Typed arrays start to do this. It seems like the static side of JavaScript (e.g., value types) is slowly growing up around it. As an example application, asm.js uses them, and thereby skips a lot of GC even if the JSVM doesn't know about the special format.
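As a rough sketch of that idiom (the buffer size and the bump allocator below are illustrative choices, not part of any standard API):

```javascript
// asm.js-style idiom: one big typed array used as a manually managed heap,
// with a trivial bump allocator on top (illustrative sketch).
const heap = new ArrayBuffer(16 * 1024 * 1024); // one GC object, allocated once
const f64 = new Float64Array(heap);

let top = 0;                                    // bump-pointer "allocator"
function alloc(count) {
  const ptr = top;
  top += count;
  if (top > f64.length) throw new Error('out of (manual) memory');
  return ptr;                                   // "pointers" are just indices
}

// Inner loops read and write through indices; no per-element JS objects are
// created, so the GC has nothing new to trace or collect here.
function sumVector(ptr, count) {
  let s = 0;
  for (let i = 0; i < count; i++) s += f64[ptr + i];
  return s;
}

const v = alloc(1000);
for (let i = 0; i < 1000; i++) f64[v + i] = i;
console.log(sumVector(v, 1000));                // 499500
```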

Interestingly, exposing manual memory management puts the VM at risk from security attacks. Memory locations become predictable, etc. -- the threats might be closed, but at an unclear cost.

RM not VM

Interestingly, exposing manual memory management puts the VM at risk from security attacks. Memory locations become predictable, etc. -- the threats might be closed, but at an unclear cost.

Very good point. I'd counter by saying that once you start allowing the language to interact with and directly control hardware, it ain't a Virtual Machine anymore but a Real Machine. Certainly Objective-C doesn't prevent most attacks, and since JavaScript (or whatever high-level language) would only need to be "not worse" than Objective-C in terms of security, that's a rather low bar.

Tension

Agreed. The distinction seems to be leading to some tension with the cloud computing (NodeJS) camp. On a server, developers want access to everything and trust the cloud provider with all virtualization. Browser developers are still more invested in performance, and they can't follow such a non-existent security model. (Imagine if every website had the same permissions as an app!)

More detail please?

>Interestingly, exposing manual memory management puts the VM at risk from security attacks. Memory locations become predictable, etc. -- the threats might be closed, but at an unclear cost.

I fail to see the link; can you explain?
Allocating manually on the heap can be randomised.

Do people read papers beyond the abstract?

When I read Drew Crawford's take on the cited Hertz & Berger paper, I immediately knew that the whole rest of the article's treatment of GC was misinformed and wrong.

He posts the result of a single benchmark, reads the results completely wrong, and then translates that into a bunch of false analogies to "native memory management in Objective-C".

I guess he did not read the part about what those GC algorithms were running against: two oracles that know precisely, based on a past execution, where to put the calls to free, either based on last use or on reachability.

Two things that don't exist in real-world C/Objective-C applications, unless you believe in God and that he/she publishes apps in the App Store.

Now for the result that matters, which is in Table 4: the best collector, given 3x more memory, performs 1.28x slower.

This means GCs are incredibly inefficient and use quite a lot more memory, right?

Well, what about against real-world memory management techniques? Like, for example, the naive reference counting that Objective-C offers.

There are no benchmarks against it, but it's easy to envision how fast a system can be when it requires a very expensive write barrier even on writes to local variables.

Another point is that reference counting, even ignoring cycles, retains more memory than an idealized liveness-based system. It will retain, at the very minimum, the difference between liveness and reachability in those benchmarks, which is around 20% more.

That is 20% more objects, not necessarily 20% more memory. How much extra memory that means, given that every object also requires storage for its reference count, is an open question that would require knowing the object populations of those benchmarks.

Source?

Not challenging your number, but I'd be interested to know the source for the 20% figure.

Though I'm not sure why what you say should be true. Can you explain the intuition?

The intuition is simple and

The intuition is simple and the number is straightforward from the paper.

RC follows the same pattern as traditional automatic memory management techniques, which is to reclaim only the space of unreachable objects.

The research paper in question evaluates the memory consumption of two oracles: one that calls free after the last use of an object, and another that calls free when the object becomes unreachable.

The difference in memory usage between those two oracles yields an approximation of the space overhead of RC for the benchmarks in question.

The 20% number was derived from the memory usage figures in Table 4. It is, of course, a very conservative lower bound, as any practical implementation of RC will also need space for reference counts and other bookkeeping information.

Lua and GC

The OP mentioned Lua....

Lua uses an incremental mark-and-sweep GC. It also gives you some control over the GC via an API (you can pause it, step it, etc.). This has made Lua a candidate programming language (which I'm currently evaluating) for my embedded work.

With careful design and analysis, I should be able to predict (and control) how GC will impact critical sections of code.

(I have to admit here that my target platform is not a smartphone/tablet but an embedded (no OS) ARM Cortex. Similar GC issues apply.)

Memory is the new bottleneck, period

Memory folks have been saying this for years, and most researchers have been ignoring them, or at least haven't moved the topic into mainstream discussion, because most of the major problems are with hardware.

Assumption of unlimited resources

I find it interesting that we spent the pre-smartphone era developing apps that assume unlimited resources (memory, speed, energy), and now we are having difficulty "porting" our programming techniques down to more constrained devices (smartphones, etc.).

We have reached the point where our programming languages assume "unlimited" memory. Paging (to/from disk) is undesirable from a performance perspective on a desktop or server, but your app will usually continue to run (and if you are lucky, other apps can be paged out and you don't feel much of the lag).

No

Memory problems go far beyond phones and the stupid dinky JavaScript apps I write professionally.

I can achieve a 50% speed/energy change just by changing the allocation strategy for a particular algorithm. It has nothing to do with my architecture in this case. It has more to do with the fact that fitting allocators to specific data-structure-plus-algorithm pairs is an NP-complete problem, with a need for good heuristics and expert systems.

"We" don't necessarily mean "us"

Oh I agree. My consideration/concern for performance and bloat go beyond target architecture too.

But my application is at the mercy of other folks' applications (in a multitasking environment -- phone/server/etc.) that may not have been written with the same care.

Phones (and other small devices) are less forgiving than big (memory paged) desktops and servers.

Video and games

Also, don't forget the assumption of a reliable network. While we've long engineered around slow networks (in terms of either bandwidth or latency), we haven't until recently had to engineer around unreliable ones. We now have networks that can shut off unexpectedly for minutes or hours with no warning, or that just keep fading in and out all day long. That's a brand-new problem.

Just to add to your analysis, I think there has, more importantly, been a shift in the types of applications. These performance problems aren't showing up in form-filling applications or simple reference applications. Video was always a huge resource hog, with its own special techniques that everyone knew had to be kept separate, and on mobile many applications need a camera/video interface. Games have always been performance-sensitive, and games are 67% of tablet application sales and 39% of smartphone sales (source: Flurry). On the desktop they were never this big a player.

When programming is thought of in terms of scientific research and math, you have one paradigm; for enterprise accounting, another; for end-user office productivity, a totally different kind of paradigm; and for system administration, yet another.
When it is designed around video and gaming...

Hume

I think there's a lot of unexplored territory in language design that would support automatic but still predictable memory management techniques.

Hume is one interesting approach. Kevin Hammond, Edwin Brady and Steffen Jost have more recent papers that tackle different parts of the problem.

I like reference counting because of its promptness.

Reference counting has one advantage over nearly every other form of garbage collection, in that the majority of collections take place immediately, as soon as the object is no longer needed. Cyclic garbage is relatively rare, and the programmer usually knows which objects are not guaranteed to be acyclic.

Collecting the object immediately allows more predictable behavior in terms of when finalizers, destructors, etc. are called, and IMO if the programmer can predict the semantics of something, that's usually a plus in the development of stable software.

Reference counting has an advantage over naïve implementations of most garbage collection algorithms in that it never "stops the world" at unpredictable times and for unpredictable intervals. You can run a concurrent or cooperatively-multitasked tracing collector or similar, but the "standard" tracing collector does have garbage collection pauses and takes time proportional to the amount of live data. The good news is that these pauses are so short these days that they are nearly imperceptible to humans, but they still exist.

Reference counting

What you're saying is that RC is synchronous, whereas other GC strategies are asynchronous. But that really only matters if you want finalization -- which more often than not is a Real Bad Idea anyway (even with RC it's not quite as predictable as some like to assume).

As for predictable timing, releasing one reference-counted object can occasionally cause an expensive domino effect. I'm told that this happens in practice, and can be pretty bad. If you want bounded pause times, an incremental collector is the more reliable alternative.

Reference Counting with Multiple Threads

The domino effect from reference counting is a problem I have encountered in practice. There are also related issues of determining where collection occurs, i.e., in which thread; some threads are more time-critical than others. There may be issues of priority inversion, starvation, etc., where the one thread that falls behind is left to do ALL the GC work because it's the last to touch each object. This situation easily arises with parallel observer patterns, or with parallel use of persistent data structures.

When I had this problem with C++ smart pointers, I eventually just shoved the delete job off to a dedicated thread via a queue of reference-counted pointers. It was a hack, but it worked. Essentially, it gave me predictable, concurrent collection.
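A rough JavaScript analogue of the same hack (the original was C++ smart pointers plus a dedicated deletion thread): assuming objects expose a dispose() method as discussed earlier in the thread, time-critical code only enqueues them, and the queue is drained with the real requestIdleCallback API so that any release cascade stays off the hot path.

```javascript
// Deferred disposal queue (illustrative sketch; assumes obj.dispose() exists).
const disposalQueue = [];

function scheduleDispose(obj) {
  if (disposalQueue.length === 0) requestIdleCallback(drainDisposals);
  disposalQueue.push(obj);          // O(1) on the time-critical path
}

function drainDisposals(deadline) {
  // Release objects only while the browser says idle time remains.
  while (disposalQueue.length > 0 && deadline.timeRemaining() > 1) {
    disposalQueue.pop().dispose();  // a long cascade happens here, not in the hot path
  }
  if (disposalQueue.length > 0) requestIdleCallback(drainDisposals);
}
```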

Multi-threading also makes reference counting vastly more expensive, since atomic integer manipulations are needed. It becomes more difficult to justify RC based on predictability or cost once we move beyond a single thread.

Multi-threading also makes

Multi-threading also makes reference counting vastly more expensive, since atomic integer manipulations are needed. It becomes more difficult to justify RC based on predictability or cost once we move beyond a single thread.

Just to clarify, RC loses merely some predictability in finalization if you want to keep it cost-competitive with tracing. You can retain the predictability if you're willing to use atomic operations everywhere, but this is exceedingly expensive. Predictability isn't too bad with the above linked on-the-fly collector though; certainly better than tracing collection.

Leaky abstraction

It's a bit disingenuous to claim that reference counting is pauseless. On contemporary machines, I've seen pauses from a badly timed page fault that exceed even severe GC pauses (on the order of several ms). If setting a pointer to null actually results in dereferencing the pointer, updating an atomic refcount, and potentially calling free(), well, none of those are actually constant-time operations, and I see several places for a possible page fault.

There's a time and place for reference counting, of course, but I really don't believe it's the right default, and it's surely no silver bullet.

GC collection times

exceed even severe GC pauses (on the order of several ms).

Not trying to be funny, but your idea of severe is orders of magnitude off for a mark-compact collector with a 100+ MB heap. I can sometimes wait for over a minute on a modern 64-bit laptop. The thing about GC is that it's hard to generalise.

Fair point

That's a fair point, and I should have clarified that my idea of "severe" is biased by the kind of systems I tend to work on: relatively well-tuned server-side JVM programs that certainly try to avoid ever having to do a full GC (and succeed at that). Over a minute sounds really unexpectedly terrible for 100MB, but I believe it's quite possible. But yes, in my world GC pauses in the tens of milliseconds occur and are considered quite severe.

Anyway, the carelessly chosen example doesn't undermine my main point, which is that many seemingly constant-time operations can be quite expensive on a modern system. Reference counting certainly falls into this category, and page faults are an overlooked culprit. (Add something like Linux's "transparent huge pages", in which any page fault may trigger a kernel decision to rearrange and compact your heap, and you can be in for a very nasty surprise here.)

research is not dead yet?

Ulterior reference counting, Unified theory, Faster RC (1), and Faster RC (2). But where are the shipping implementations? :-)

(Does anybody know of work that explores e.g. "back pointers" in a hybrid rc model, to deal with cycles?)

e.g.

an RC approach that successfully deals with cycles.

seems like there are ideas and even academic implementations...