Writing an Android NES Emulator — Performance optimizations

Published in ProAndroidDev · Sep 24, 2018

I’ve recently taken on the challenge of porting a Go implementation of a NES emulator to Android. Thanks to the awesome work done by Michael Fogleman, understanding and navigating the Go code was fairly easy even for me, a complete Go newbie. Let me also make it clear that porting an existing implementation is far easier than writing one from scratch, which would be a huge undertaking, and that makes it approachable for emulation newbies like myself.

I’ve always been fascinated by gaming console emulators: the fact that they can imitate physical hardware in software and make games run and play exactly like the original simply boggles my mind. Unfortunately, emulation is still kind of a legal gray area today, since it often goes hand in hand with piracy and copyright infringement. However, as far as I understand, emulators themselves are currently legal in the US and, if you own the original game cartridge, having the ROM for that game is generally considered OK.

Disclaimer: My goal with this project is 100% educational and I in no way intend to incentivize illegal activities like piracy or copyright infringement.

With that out of the way, let’s talk about some of the challenges that involve writing a NES emulator and having it run on an Android phone just like the original NES console.

“The NES CPU core is based on the 6502 processor and runs at approximately 1.79 MHz”. In simple terms, this means it runs about 1.79 million CPU cycles per second. That may sound slow compared to modern processors, but it was actually quite good for its time. I had already taken a stab at implementing a 6502 CPU emulator on Android in the past, so that part was already somewhat covered.
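To put that clock speed in perspective, here is a rough back-of-the-envelope sketch of the time budget it implies. The constant and function names are made up for illustration, and the 60 frames per second figure is an assumption on my part (the quote only gives the approximate CPU clock).

```kotlin
// Approximate NES CPU clock from the quote above.
const val CPU_FREQUENCY_HZ = 1_790_000L

// Assumption: the display refreshes about 60 times per second.
const val FRAMES_PER_SECOND = 60L

// Average wall-clock time available to emulate one CPU cycle, in nanoseconds.
fun nanosPerCycle(): Double = 1_000_000_000.0 / CPU_FREQUENCY_HZ

// Whole CPU cycles that have to be emulated for every rendered frame.
fun cyclesPerFrame(): Long = CPU_FREQUENCY_HZ / FRAMES_PER_SECOND
```

Under these assumptions the emulator has roughly 559 nanoseconds per CPU cycle, and close to 30 thousand cycles to get through for every frame, which is why the rest of this post obsesses over small overheads.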

Measuring

Getting used to the Profiler, especially the CPU profiler, that ships with Android Studio is an absolute must. You’ll be doing a lot of recordings, so being able to read and analyze the results is critical.

I found the flame and call charts to be the best for quickly identifying bottlenecks. The flame chart aggregates identical sequences of callers into a single block, which makes hot spots easy to see.

Fig. 2: Flame chart showing a long time spent in the APU step function

In Fig. 2 above you can see that most of the Console#step time is spent in PPU#step. Another big chunk is in APU#step, which also has a pretty deep stack. More on that below.

The call chart instead shows you a timeline of method calls without aggregation. That surfaces things like really deep call stacks and native function calls.

Fig. 3: Call chart exposing a native function call in the Console thread 👎

You can also see how much CPU time is spent on each thread, which is also handy for understanding if your thread execution is being constantly interrupted:

Fig. 4: Console and GL threads competing for CPU time

From the images above, can you spot any bottlenecks or pitfalls? At this point it should be fairly clear: The APU#step call is triggering a JNI function call to android_media_AudioTrack_writeArray, which in turn makes some system calls. That brings us to our first performance offender.

Native function calls

There is a set of measures we need to take in order to have the Console thread run at the maximum possible speed, so that we can hit our target clock of 1.79 MHz. The first one is:

1. Eliminate all native (JNI) function calls from your emulator console thread 🔥

There is a non-trivial amount of overhead involved with calling native functions. I couldn’t find an accurate number on how big the overhead is, but it seems that marshalling resources across the JNI layer is one of the main offenders.
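One possible way to keep the audio write off the console thread is to have APU#step push samples into a preallocated buffer and let a separate audio thread drain it and make the actual native write. The sketch below is only an illustration of that idea, not the ktnes code: the SampleRingBuffer class, its method names and the use of Float samples are all assumptions.

```kotlin
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical single-producer/single-consumer ring buffer (illustrative only).
class SampleRingBuffer(capacity: Int) {
    // One slot always stays empty so that "full" and "empty" can be told apart.
    private val buffer = FloatArray(capacity + 1)
    private val head = AtomicInteger(0) // next slot to read, advanced only by the consumer
    private val tail = AtomicInteger(0) // next slot to write, advanced only by the producer

    // Called from the console thread. Allocates nothing and makes no native audio call.
    // Returns false (the sample is dropped) when the buffer is full.
    fun offer(sample: Float): Boolean {
        val t = tail.get()
        val next = if (t + 1 == buffer.size) 0 else t + 1
        if (next == head.get()) return false
        buffer[t] = sample
        tail.set(next) // volatile write publishes the sample to the consumer
        return true
    }

    // Called from the audio thread. Copies up to out.size samples into out and
    // returns how many were copied. The caller would then hand them to the audio API.
    fun drainTo(out: FloatArray): Int {
        var h = head.get()
        val t = tail.get()
        var n = 0
        while (h != t && n < out.size) {
            out[n++] = buffer[h]
            h = if (h + 1 == buffer.size) 0 else h + 1
        }
        head.set(h) // frees the drained slots for the producer
        return n
    }
}
```

With something like this, the console thread only ever calls offer(), while the audio thread periodically calls drainTo() into its own reusable array and performs the native write there. Dropping samples when the buffer is full is a deliberate simplification; blocking the console thread instead would defeat the purpose.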

Memory allocation

This one is big. Achieving zero memory allocation is really crucial for your emulator performance. Remember, you’re trying to run a highly optimized loop almost 2 million times per second. Allocated memory also has to be cleaned up eventually by the garbage collector, incurring an additional slowdown. Make sure nothing is instantiated from your console thread. With Java this is a bit easier, since you can just search for usages of the new keyword everywhere. With Kotlin this is a bit harder, since that keyword doesn’t exist. Also keep in mind that just because you don’t directly instantiate any objects, it doesn’t mean that nothing is being allocated. Things like Kotlin ranges and collection iteration are not always allocation free and may incur a hidden penalty:

// Causes Iterator allocation
for (i in collection) {
  doSomethingWith(i)
}

When in doubt, double check with the profiler call graph!

2. Prevent any and all memory allocation from your emulator console thread 🚒
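As a small illustration of the iterator allocation mentioned above, here is a sketch comparing two ways of summing values. The function names are made up for this example. According to the Kotlin documentation, a for loop over a range expression or an array is compiled to an index-based loop that does not create an iterator object, whereas iterating a List goes through its Iterator.

```kotlin
// Iterating a List goes through its Iterator, and a List<Int> stores boxed
// Integer objects, so each element is also unboxed on the way out.
fun sumWithIterator(values: List<Int>): Int {
    var total = 0
    for (v in values) total += v
    return total
}

// Iterating a primitive IntArray by index: no Iterator and no boxing.
fun sumIndexed(values: IntArray): Int {
    var total = 0
    for (i in 0 until values.size) total += values[i]
    return total
}
```

Both functions return the same result; the difference only shows up in the profiler. For state that is touched on every step, preallocated primitive arrays (IntArray, ByteArray, FloatArray) with indexed loops are the safer default.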

Call stack depth

This one is less impactful, especially when compared with items 1 and 2 above. However, once you have those two covered, this is the next thing to look for and it should give you a measurable improvement. Inline as many methods as possible. And I’m not talking about using Kotlin’s inline keyword on every single method. Trust me, I tried it and it didn’t have the intended effect 😅 I’m talking about identifying the main top level components and keeping their APIs to a minimum. In this case, I kept mainly the classes Console, CPU, PPU and APU. Each of them has a method called step() that is called on every iteration of the emulator. Pretty much all the work is performed inside each of those methods and, for the most part, they don’t make any other method calls. It all boils down to the meat: conditional statements (if, else), for loops, bitwise operations (and, or, etc.) and assignments. That’s it.

3. Keep your call stack depth to a minimum
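To make the shape of that flat structure concrete, here is a heavily simplified sketch of how the four classes could fit together. The class and method names come from the description above, but the bodies are stubs I made up: the real step() methods contain the actual emulation logic, and the fixed 2 cycles per instruction is purely a placeholder. The 3 PPU steps per CPU cycle reflect the NTSC NES clock ratio.

```kotlin
// Stub components: each step() would hold all of its logic inline
// (ifs, loops, bitwise operations, assignments) instead of calling helpers.
class CPU {
    var cycles = 0L
    // Executes one instruction and returns the number of cycles it took.
    fun step(): Int {
        cycles += 2 // placeholder: pretend every instruction takes 2 cycles
        return 2
    }
}

class PPU {
    var dots = 0L
    fun step() { dots++ }
}

class APU {
    var ticks = 0L
    fun step() { ticks++ }
}

class Console(val cpu: CPU, val ppu: PPU, val apu: APU) {
    // One emulator iteration. The call stack never gets deeper than
    // Console#step -> CPU#step / PPU#step / APU#step.
    fun step(): Int {
        val cpuCycles = cpu.step()
        var i = 0
        val ppuSteps = cpuCycles * 3 // the PPU runs 3 times per CPU cycle
        while (i < ppuSteps) { ppu.step(); i++ }
        i = 0
        while (i < cpuCycles) { apu.step(); i++ }
        return cpuCycles
    }
}
```

The console thread would then simply call Console#step in a tight loop, with no allocation and no further indirection between the loop and the component logic.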

Other considerations

Keep in mind that these are not the only factors that can contribute to poor emulator performance, but they are definitely big ones. Another thing to watch out for, for example, is the number of threads that are competing for CPU time. If you have 10 other threads running in your app, all competing in parallel for the same CPU resources, I can almost guarantee that your emulator performance will be severely degraded.

Also, even though I absolutely love Kotlin, it’s a bit annoying to use it (and the JVM) for the type of programming that is typical for emulators, which involves lots of bit fiddling, for two main reasons:

Kotlin doesn’t support the same set of standard bitwise operators that you see in other languages like C and Java: & | ^ >> << &= |=, etc. Instead it uses named infix functions (and, or, xor, shl, shr), so it’s a lot more verbose to write a = a and b than a &= b. I’m not sure what the reason was behind that decision, but it doesn’t sound ideal to me.
The JVM doesn’t provide built-in support for unsigned primitive types. Even the byte type is signed (values -128..127), which doesn’t make any sense to me. This is very painful and can lead to subtle/silent overflow errors. It also means we have to treat all types as Int even though we know a lot of them are really bytes (0..255 only).
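Here is a short sketch of what those two pain points look like in practice. The helper names are made up for illustration; the point is the masking with 0xFF and the named infix functions that stand in for the usual operators.

```kotlin
// Reads a signed JVM byte as an unsigned value in 0..255.
fun unsigned(b: Byte): Int = b.toInt() and 0xFF

// Sets or clears the bits selected by mask in an 8-bit register stored in an Int.
// In Java the clearing branch would be: register & ~mask & 0xFF
fun setFlag(register: Int, mask: Int, on: Boolean): Int =
    if (on) register or mask else register and mask.inv() and 0xFF

// 8-bit addition that wraps around like the real hardware register would.
fun add8(a: Int, b: Int): Int = (a + b) and 0xFF
```

For example, unsigned(0xFF.toByte()) gives 255 rather than -1, and add8(0xFF, 1) wraps around to 0. Forgetting one of those masks is exactly the kind of silent overflow bug mentioned above.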

Final remarks

Working on ktnes has been a fun side project and, at the same time, a very challenging and rewarding one. There’s more to it that I plan to share in the future (tip: Kotlin multiplatform 😁). As you may have noticed, it’s still very much a work in progress and far from done. If you’re interested in helping out, please do not hesitate to get in touch on GitHub/Twitter and send pull requests!

Thanks for reading!