vshade's blurblog

Apple's Cyclone Microarchitecture Detailed by Anand Lal Shimpi
Tuesday April 1st, 2014 at 1:58 AM

AnandTech

The most challenging part of last year's iPhone 5s review was piecing together details about Apple's A7 without any internal Apple assistance. I had less than a week to turn the review around and limited access to tools (much less time to develop them on my own) to figure out what Apple had done to double CPU performance without scaling frequency. The end result was an (incorrect) assumption that Apple had simply evolved its first ARMv7 architecture (codename: Swift). Based on the limited information I had at the time I assumed Apple simply addressed some low hanging fruit (e.g. memory access latency) in building Cyclone, its first 64-bit ARMv8 core. By the time the iPad Air review rolled around, I had more knowledge of what was underneath the hood:

As far as I can tell, peak issue width of Cyclone is 6 instructions. That’s at least 2x the width of Swift and Krait, and at best more than 3x the width depending on instruction mix. Limitations on co-issuing FP and integer math have also been lifted as you can run up to four integer adds and two FP adds in parallel. You can also perform up to two loads or stores per clock.

With Swift, I had the luxury of Apple committing LLVM changes that not only gave me the code name but also confirmed the size of the machine (3-wide OoO core, 2 ALUs, 1 load/store unit). With Cyclone however, Apple held off on any public commits. Figuring out the codename and its architecture required a lot of digging.

Last week, the same reader who pointed me at the Swift details let me know that Apple revealed Cyclone microarchitectural details in LLVM commits made a few days ago (thanks again R!). Although I empirically verified many of Cyclone's features in advance of the iPad Air review last year, today we have some more concrete information on what Apple's first 64-bit ARMv8 architecture looks like.

Note that everything below is based on Apple's LLVM commits (and confirmed by my own testing where possible).

Apple Custom CPU Core Comparison
	Apple A6	Apple A7
CPU Codename	Swift	Cyclone
ARM ISA	ARMv7-A (32-bit)	ARMv8-A (32/64-bit)
Issue Width	3 micro-ops	6 micro-ops
Reorder Buffer Size	45 micro-ops	192 micro-ops
Branch Mispredict Penalty	14 cycles	16 cycles (14 - 19)
Integer ALUs	2	4
Load/Store Units	1	2
Load Latency	3 cycles	4 cycles
Branch Units	1	2
Indirect Branch Units	0	1
FP/NEON ALUs	?	3
L1 Cache	32KB I$ + 32KB D$	64KB I$ + 64KB D$
L2 Cache	1MB	1MB
L3 Cache	-	4MB

As I mentioned in the iPad Air review, Cyclone is a wide machine. It can decode, issue, execute and retire up to 6 instructions/micro-ops per clock. I verified this during my iPad Air review by executing four integer adds and two FP adds in parallel. The same test on Swift actually yields fewer than 3 concurrent operations, likely because of an inability to issue to all integer and FP pipes in parallel. Similar limits exist with Krait.

I also noted an increase in overall machine size in my initial tinkering with Cyclone. Apple's LLVM commits indicate a massive 192 entry reorder buffer (coincidentally the same size as Haswell's ROB). Mispredict penalty goes up slightly compared to Swift, but Apple does present a range of values (14 - 19 cycles). This also happens to be the same range as Sandy Bridge and later Intel Core architectures (including Haswell). Given how much larger Cyclone is, a doubling of L1 cache sizes makes a lot of sense.

On the execution side Cyclone doubles the number of integer ALUs, load/store units and branch units. Cyclone also adds a unit for indirect branches and at least one more FP pipe. Cyclone can sustain three FP operations in parallel (including 3 FP/NEON adds). The third FP/NEON pipe is used for div and sqrt operations; the machine can only execute two FP/NEON muls in parallel.
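The per-cycle limits described above can be sketched as a toy model. This is only an illustration under stated assumptions: the port counts come from the table and text above, but the function name `peak_issue`, the class labels, and the idea of treating each cycle as a simple capped sum are hypothetical simplifications. The model ignores the per-port buffers, dependencies, the indirect branch unit, and div/sqrt.

```python
# Toy model of how many ready micro-ops Cyclone could issue in one cycle.
# Port counts are taken from the article; the model itself is an assumption
# and ignores dependencies, per-port buffers, indirect branches and div/sqrt.

ISSUE_WIDTH = 6  # micro-ops per clock

# Maximum micro-ops of each class that can start in a single cycle.
PER_CLASS_LIMIT = {
    "int_alu": 4,
    "load_store": 2,
    "branch": 2,
    "fp_add": 3,
    "fp_mul": 2,
}

FP_CLASSES = ("fp_add", "fp_mul")
FP_PIPES = 3  # FP adds and muls share the three FP/NEON pipes


def peak_issue(ready):
    """Return how many of the ready micro-ops could issue in one cycle.

    `ready` maps a class name (a key of PER_CLASS_LIMIT) to the number of
    independent micro-ops of that class waiting to issue.
    """
    unknown = set(ready) - set(PER_CLASS_LIMIT)
    if unknown:
        raise ValueError(f"unknown micro-op classes: {sorted(unknown)}")

    # Cap each class by the number of units that can execute it.
    issued = {cls: min(n, PER_CLASS_LIMIT[cls]) for cls, n in ready.items()}

    # FP adds and muls compete for the same three pipes.
    fp = min(sum(issued.get(cls, 0) for cls in FP_CLASSES), FP_PIPES)
    non_fp = sum(n for cls, n in issued.items() if cls not in FP_CLASSES)

    # The front end cannot issue more than ISSUE_WIDTH micro-ops per clock.
    return min(ISSUE_WIDTH, non_fp + fp)
```

Under this model, the mix used in the iPad Air test, `peak_issue({"int_alu": 4, "fp_add": 2})`, returns 6, while a stream of only integer adds is capped at 4 by the number of ALUs.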

I also found references to buffer sizes for each unit, which I'm assuming are the number of micro-ops that feed each unit. I don't believe Cyclone has a unified scheduler ahead of all of its execution units and instead has statically partitioned buffers in front of each port. I've put all of this information into the crude diagram below:

Unfortunately I don't have enough data on Swift to really produce a decent comparison image. With six decoders and nine ports to execution units, Cyclone is big. As I mentioned before, it's bigger than anything else that goes in a phone. Apple didn't build a Krait/Silvermont competitor, it built something much closer to Intel's big cores. At the launch of the iPhone 5s, Apple referred to the A7 as being "desktop class" - it turns out that wasn't an exaggeration.

Cyclone is a bold move by Apple, but not one that is without its challenges. I still find that there are almost no applications on iOS that really take advantage of the CPU power underneath the hood. More than anything Apple needs first party software that really demonstrates what's possible. The challenge is that at full tilt a pair of Cyclone cores can consume quite a bit of power. So for now, Cyclone's performance is really used to exploit race to sleep and get the device into a low power state as quickly as possible. The other problem I see is that although Cyclone is incredibly forward looking, it launched in devices with only 1GB of RAM. It's very likely that you'll run into memory limits before you hit CPU performance limits if you plan on keeping your device for a long time.

It wasn't until I wrote this piece that Apple's codenames started to make sense. Swift was quick, but Cyclone really does stir everything up. The earlier than expected introduction of a consumer 64-bit ARMv8 SoC caught pretty much everyone off guard (e.g. Qualcomm's shift to vanilla ARM cores for more of its product stack).

The real question is where does Apple go from here? By now we know to expect an "A8" branded Apple SoC in the iPhone 6 and iPad Air successors later this year. There's little benefit in going substantially wider than Cyclone, but there's still a ton of room to improve performance. One obvious example would be through frequency scaling. Cyclone is clocked very conservatively (1.3GHz in the 5s/iPad mini with Retina Display and 1.4GHz in the iPad Air). Assuming Apple moves to a 20nm process later this year, it should be possible to gain some performance by increasing clock speed without a power penalty. I suspect Apple has more tricks up its sleeve than that, however. Swift and Cyclone were two tocks in a row by Intel's definition; a third in 3 years would be unusual but not impossible (Intel sort of committed to doing the same with Saltwell/Silvermont/Airmont in 2012 - 2014).

Looking at Cyclone makes one thing very clear: the rest of the players in the ultra mobile CPU space didn't aim high enough. I wonder what happens next round.


vshade

4108 days ago


São Paulo, Brazil

Seashell
Wednesday July 10th, 2013 at 10:41 AM

This is roughly equivalent to 'number of times I've picked up a seashell at the ocean' / 'number of times I've picked up a seashell', which in my case is pretty close to 1, and gets much closer if we're considering only times I didn't put it to my ear.
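As a sketch of why that ratio works, Bayes' theorem can be written out with counts. The symbols here are my own notation, not the comic's: O stands for "I'm near the ocean", S for "I picked up a seashell", N for the total number of moments considered, and N with a subscript for the number of moments in which that event occurred.

```latex
\[
P(O \mid S)
  = \frac{P(S \mid O)\,P(O)}{P(S)}
  = \frac{P(S \cap O)}{P(S)}
  \approx \frac{N_{S \cap O}/N}{N_{S}/N}
  = \frac{N_{S \cap O}}{N_{S}}
\]
```

The last fraction is the "seashells picked up at the ocean" over "seashells picked up" ratio from the text, with the seashell count in the denominator.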


vshade

4373 days ago


São Paulo, Brazil

5 public comments

pepsy

4368 days ago


yeah!

sentenza

4373 days ago


Shouldn't the p(seashell) be in the denominator and the p(ocean) in the numerator?

lauratastic

4373 days ago

yep; fixed now

Michdevilish

4373 days ago


PS: It's a shell fact

Canada

adamgurri

4373 days ago


bayes law of seashells

New York, NY

Postmortem: Resident Evil 4
Sunday June 30th, 2013 at 12:10 PM

Gamasutra News

In this reprint from the October 2005 edition of GD Mag, Resident Evil 4 cinematic lead Yoshiaki Hirabayashi explores the game's graphical overhauls and hurdles. ...


vshade

4383 days ago


São Paulo, Brazil

Exploring game design through technology by David Rosen
Wednesday March 27th, 2013 at 10:27 AM

Wolfire Games Blog

This is a blog post adaptation of my GDC 2013 Indie Soapbox talk. I hope you like it! I will link to the GDC vault video of it if it becomes publicly available.

Working with technology can be intimidating as an indie developer. Isn’t tech the domain of AAA? How can we compete with their large teams of experienced and talented engineers? Any one of them is probably at least as skilled at programming as we are, and there are so many of them working together!

The secret is that we don’t have to compete with them, because their process adds so much inertia. With such large teams and budgets they have to avoid bottlenecks that might stall the content pipeline, and need to minimize uncertainty about meeting development milestones. The most efficient way to accomplish that is to make sure that the departments can all work perfectly in parallel, with minimal need for communication, so design doesn’t stall tech, and tech doesn’t stall content. This works well from a scheduling and budgeting perspective, but restricts engineers to side-effect-free technology like optimization and iteration.

This approach encourages technical innovation at the periphery of the game design, instead of at the center where it would make the most difference. For example, there has been a lot of brilliant technical innovation in the Halo series that set new benchmarks for visual fidelity. However, the design of Halo is all about shooting aliens in 30-second skirmishes. It’s not about taking in the sights! All of this iteration on rendering technology and art asset creation certainly improves the experience slightly, but not nearly as much as if the technical innovations were closer to the heart of the game design.

On the other hand, indie games don’t usually have large teams or hard deadlines. We can work more serially, allowing design and technology to inform one another very closely. Antichamber has really clever technical sleight-of-hand to give the impression of non-Euclidean space, and the design is all about exploring that space. Spelunky has solid procedural level creation technology that makes sure that the game is fun to play over and over, and the design is all about mastering the rules of each environment by repeatedly failing and trying again. In both games, the technology is at the very heart of the design, and greatly elevates the player experience.

That’s not to say that graphics technology is worthless, it’s just most effective when the design takes advantage of it. In Journey, the sense of awe, beauty, and immersion is critical to the game design, and it would have been difficult to achieve that without their unique sand and cloth rendering technology (one of the programmers explains their sand tech here and here). If you remove those technologies, the game would not just have a quantitative decrease in visual fidelity, but a major qualitative change as well -- it would not be the same experience at all.

Gamers are always excited to see new technology that is central to design. When Alex Austin posted this video demonstrating physics-based infantry movement in A New Zero, people didn’t really care that it didn’t have very high visual fidelity; they were just excited to see a more embodied approach to movement in a first-person shooter! That’s how it achieved almost 500,000 views on YouTube without any promotion at all.

I sometimes encounter the idea that technology is a natural enemy of design -- that the purest form of design is found in board games, using cards, dice and tokens. This doesn’t make sense to me, because those tools are all technology themselves, and they clearly expand the design space instead of restricting it! The dice enable randomness, the tokens enable stored information outside the players’ heads, and the cards enable hidden information. Why shouldn’t digital technology expand the design space in the same way?

Consider this visualization of the design space, where distance from the center represents use of new technology. At the center is a dense cluster of games that do not rely on much new technology at all -- they focus on design and content using mostly existing technology. It’s certainly possible to create excellent games this way, and there are many examples, but most of them are by design specialists: developers who have created dozens and dozens of games over the years, often in game jams or prototypes.

The AAA teams take a different approach. They more or less use an existing design (or, rarely, start a new franchise with a novel design) and then use sequels to compete with one another in a technological arms race, moving outwards from the low-tech cluster in straight lines. There’s one line for third-person cover shooters with regenerating health, one line for open-world games with linear quests, and one line for third-person brawlers with quick time events.

The part of this diagram that I’m interested in is the space between these lines, where there is just emptiness. If you follow these lines partway out, and then take a sharp turn to the left or right, you end up with a game like Amnesia, or Overgrowth, or Natural Selection 2. These games sort of look like a AAA game, but have major differences in their design, so they really have no direct competition. There aren’t really any other asymmetric FPS/RTS hybrids, or any other physics-based tactical martial arts games.

If you take this idea farther, you can end up with something like Minecraft, or Flower, or Proteus: games that are so far from any existing genres that people debate if they are even games at all. It really doesn’t matter though, because gamers love them, and they are very successful by any measure!

I would like to encourage indie developers to consider using technology to explore all of this unmapped design space. The AAA guys are likely to continue iterating outwards on their straight lines, and there’s really nobody else to turn to: if we don’t explore this space ourselves, it will simply never be explored.