TL;DR: Server-side game recording is awesome for performing scalability testing, as well as reproducing game bugs and enabling players to replay their game experiences.
How to load-test game servers
In anticipation of the launch of Guild Wars 2 in August I thought it would be interesting to talk about some of the techniques we used to ensure that the original game of Guild Wars was capable of scaling to the massive number of users we expected, hoped for and feared.
When developing game server software, being too successful can be just as scary as a flop: gamers who can’t play because the servers are overloaded don’t really care why the servers are failing; they just quit playing if they can’t get online.
Note: I don’t work at ArenaNet (makers of Guild Wars) anymore, though many of my friends still do; I’m eagerly awaiting the results of their efforts just like the legions of fans of the original game.
Simulating user load
There are lots of ways to load-test servers:
- Launch the game and watch users suffer, then fix the problems as they are discovered
- Write game “bots” to simulate the (expected) behavior of real players
- Run background processes on the server to consume CPU, memory and other resources
- Game recording and playback
Testing with real players
In the early days of online gaming (and I’m thinking here of the 1990s) players were so happy to be able to experience multiplayer games that they gladly suffered the pain of live-debugging issues for the development team. Creating this kind of user pain was never an acceptable solution, and few companies that chose this path continue to exist. Once better alternatives were available, gamers quickly became intolerant of online games that didn’t work correctly.
I’m convinced that the effort we put into releasing highly polished, mostly bug-free games at Blizzard, and later at ArenaNet, was a key ingredient in the success of games like Warcraft and Guild Wars. Diablo (the original version) was buggy as hell, but it was a fun enough game experience that it managed to succeed despite its flaws. It also had the advantage of releasing at the very end of 1996, during a period when players were still willing to suffer mightily just to play online.
On the other hand, it was pretty obvious to folks who worked at ArenaNet that Tabula Rasa — a game developed by a sister studio to ArenaNet that was released in late 2007 — was going to fail if for no other reason than the development team was using the beta-test audience to debug the game. By that late date players were already accustomed to bailing out of a game that didn’t have a good online play experience — no point in actually paying money for a game if the devs can’t release a decent beta experience.
So yeah, foisting a game that doesn’t scale on a beta or launch audience doesn’t cut it these days, so this strategy is of limited utility.
Simulating real players
Many dev-teams choose to write “bots” that endeavor to simulate real player behavior. This turns out to be challenging for lots of reasons.
First, the programming team is already busy working on the game. Any effort spent writing bots usually means sacrifices of other game-features, which is a poor trade-off. And those bots have to continually be updated and debugged as the game is modified so that the bot can keep playing, which takes even more time.
Second, anticipating all the actions that players might take is nigh impossible. Players are much more diverse than game designers anticipate, and consequently they discover corner-cases that might never occur to anyone on the development team. That’s worth a whole ‘nother article about obscure bugs.
Third, it may be impractical to simulate even a fraction of the game. In the original release of Guild Wars there were 600 different spells/skills in the magic system (and 1500 by the time the fourth game in the series rolled out). Trying to simulate all the combinations of spells that teams of players might cast leads to enormous combinatorial possibilities. If we limit a player-simulator bot to casting ten spells then there are 600^10 possible sequences of those spells: that’s 6,046,617,600,000,000,000,000,000,000. If we imagine that the game server can test one billion sets of those ten spells per second, it would take roughly 191 billion years to test them all. That’s a lot of room for error!
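The arithmetic above can be checked with a few lines of C. This is only a sketch: the function names are made up for illustration, it assumes ordered ten-spell sequences with repeats allowed, and it uses a 365.25-day year.

```c
#include <assert.h>

/* Number of ordered sequences of `slots` spells drawn from a pool of
   `skills` spells, repeats allowed: skills raised to the power slots. */
double skill_sequences(double skills, int slots) {
    double total = 1.0;
    assert(skills > 0.0 && slots >= 0);
    for (int i = 0; i < slots; i++)
        total *= skills;
    return total;
}

/* Years needed to try every sequence at `per_second` tests per second,
   assuming a 365.25-day year. */
double years_to_test(double sequences, double per_second) {
    const double seconds_per_year = 365.25 * 24.0 * 60.0 * 60.0;
    assert(per_second > 0.0);
    return sequences / per_second / seconds_per_year;
}
```

Calling skill_sequences(600.0, 10) and passing the result to years_to_test with 1e9 tests per second lands in the neighborhood of the figures quoted above; a double is not exact at this magnitude, but it is plenty for an order-of-magnitude argument.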
Every game-server developer I’ve talked to who went down the path of writing a player-simulation bot said that the actual behavior of game-players differed radically from what their bots simulated, limiting the overall usefulness of this load-testing method. YMMV definition #4.
Load the server with fake work
Running background processes that burn up lots of memory, CPU, network bandwidth, and disk IO is easy — there are plenty of open-source and proprietary tools that do the job. Here’s a program that uses an entire CPU core all by itself, which you’re welcome to use if you’ll only send me a small royalty (2% of game revenue?!? Negotiable!).
void main () {
    for (;;)
        ;
}
Using such programs can be useful for finding trivial problems, but doesn’t do a great job of helping the development team identify the hot-spots that can cause real problems.
A good case-in-point is Diablo 2, which was released after I left Blizzard Entertainment. While the QA testing team worked hard to identify game bugs, the job of load-testing was something that the development team didn’t tackle prior to launch.
It turned out that game-server memory access patterns were highly inefficient, leading the servers to spend a lot of time thrashing memory in and out of the cache, which caused more heat to be generated than was expected. This turned out to be a huge problem because it would cause the top-of-the-line Compaq servers which were used to host Diablo multiplayer games to overheat and crash frequently.
See, in a datacenter, “pizza box” servers are loaded into tall storage racks. It’s possible to cram 42 1U (for one unit high) servers into a single rack. That stack of servers running flat-out would create so much heat that the datacenter refrigeration system couldn’t provide enough cooling air to keep the servers happy.
The short term solution was to reduce the number of servers in the rack, leaving gaps between servers to allow for additional air-flow.
So yeah, some load testing is needed to discover problems not just in the game code, but in the infrastructure as well. But while this method can find some problems, it’s not going to help identify scalability problems in the game-server code itself.
Load testing using recorded games
Several of the things we wanted to do with Guild Wars turned out to overlap with our load-testing needs.
One of our primary goals in building Guild Wars was to create a game that could be an “e-sport”. And in fact we gave away over $300K in prizes to players competing in Guild Wars tournaments between 2005 and 2007. Our goal was to record games by top players and play them back for the entire user community, with the kind of color commentary that John Madden did so brilliantly for US football. We miss you, John.
Recording games is also useful because if — or rather when — the game crashes it’s possible to use the game recording to recreate and fix the bug. In programming, bugs that are reproducible are trivial to correct; most of the challenge in debugging games is in discovering why the game is crashing in the first place; once that’s known everything else is easy-peasy.
Some Guild Wars players are probably familiar with the “double patch” we usually used on patch-days. We’d patch the game, and often discover a critical issue affecting a percentage of players. We’d play-back recordings of crashed games, fix the bug, run the build system, and often turn around new builds of the game in under five minutes — this does wonders for the reputation of the game that problems can be fixed so rapidly.
But the relevant use of recorded games for this article is that they can be replayed to simulate exactly what real players do when they’re playing the game and create the desired server load. We were able to record hundreds of games and play them back on a server to (mostly) simulate how a server would behave when heavily loaded. I say mostly because our solution didn’t simulate sending network packets; the playback code just “pretended” to send because there weren’t any real players connected to the game. Not sending network packets didn’t turn out to cause any significant difference between simulated and real combat drops.
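To make the “pretend to send” idea concrete, here is a minimal sketch in C. It is not the Guild Wars code: the RecordedEvent layout, the SendFn hook and every function name are assumptions made up for illustration. The point is only that outbound traffic goes through a replaceable function, so playback can run the same game logic with a stub that sends nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical recorded input: which player acted, on which server tick,
   and an opaque action code. Not the actual Guild Wars format. */
typedef struct {
    unsigned tick;
    unsigned player;
    unsigned action;
} RecordedEvent;

/* Outbound traffic goes through a function pointer so playback can swap
   in a stub in place of the real network send. */
typedef void (*SendFn)(unsigned player, unsigned action, void *ctx);

/* Playback stub: counts the packet, puts nothing on the wire. */
void pretend_send(unsigned player, unsigned action, void *ctx) {
    (void)player;
    (void)action;
    ++*(size_t *)ctx;
}

/* Stand-in for the real simulation step: a real server would update game
   state here, then notify players through whichever send was supplied. */
void apply_event(const RecordedEvent *ev, SendFn send, void *ctx) {
    send(ev->player, ev->action, ctx);
}

/* Feed a recording back through the game logic in tick order.
   Returns the number of events applied. */
size_t replay(const RecordedEvent *events, size_t count, SendFn send, void *ctx) {
    size_t applied = 0;
    for (size_t i = 0; i < count; i++) {
        /* Events must be ordered by tick for playback to be repeatable. */
        assert(i == 0 || events[i - 1].tick <= events[i].tick);
        apply_event(&events[i], send, ctx);
        applied++;
    }
    return applied;
}
```

Running many such replays side by side on one machine is what generates the load; passing pretend_send with a counter as ctx also gives a cheap check that playback produced the expected amount of outbound traffic.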
The ability to perform load-testing by playing back recorded games had all sorts of secondary utility. When AMD released a new and inexpensive series of Opteron processors we had doubts about whether those chips would be good at running 32-bit Guild Wars server code. Intel had previously released a processor called the Itanium which turned out to be terrible; it was quickly dubbed the “Itanic”, in reference to the purportedly unsinkable ship Titanic, given its sinking reputation. We didn’t want to buy into hardware that would sink our company.
So to test the Opteron chips we simply gathered up a slew of game-recording files, tested the throughput on one of our current-gen servers, and compared the results to those of an Opteron, which were very favorable. We ended up choosing the Opteron for later hardware purchases, saving hundreds of thousands of dollars in hardware, as well as hosting, cooling and power costs, which are all of course critical to a hybrid free-to-play game like Guild Wars — you can’t host a free-to-play game if your operational costs are burning through your profit-margin.
References
Server programmers will probably be interested to read some recently-posted notes about scaling from the folks at DropBox. Check out Scaling lessons learned at Dropbox, part 1.
Important: I should mention that I don’t endorse the DropBox service. As someone with a security mindset it boggles my mind that DropBox user data can be decrypted by the DropBox staff — or a hacker — because all data is encrypted by the same key. If you’re looking for a good alternative check out SpiderOak. From their site: “SpiderOak never stores or knows a user’s password or the plaintext encryption keys which means not even SpiderOak employees can access the data”. Note that I don’t have any affiliation w/SpiderOak; I just use their service.
Conclusion
I hope I have one. Oh yeah: game recording is awesome, and not just for load-testing. It enables you to build much higher quality games than traditional debugging techniques alone, and it is worth the time to implement for your next online game project.
It’s definitely worth another article to talk about the coding complexities required to implement recording, like recording format, random-number generation, time-sequencing, compression, debugging, playback desynchronization and other esoterica, but I hope I’ve convinced you that it’s a powerful technique for writing online games.
I wonder how you would implement recording in an open-world MMO. The original Guild Wars was instance-based, so you could fairly easily gather all the data needed for the recording. In an open-world game, however, the world state is (probably) not so easily partitioned.
Omeg, another great question, thank you!
You’re right that game-recording an instance-based game is a far easier problem than recording an open-world game. Instances have a limited lifespan (generally on the order of minutes or hours), so when the game is over it is possible to throw away the recording. An open world, by definition, has no termination, except maybe weekly maintenance periods; it would be impractical to store a recording of 5000 gamers playing for a week!
First, let me mention some of the tricks we played to ensure that instance-recording in Guild Wars didn’t get too large, because they were stored in memory, not persisted to disk. Actually, one more step back — *why* did we store them in memory? Game recordings for instanced games were quite small because of the compression tricks we used, so we never felt the need to store most games on disk; we only wrote recording-files to disks when games crashed.
Since a game server might be running several thousand game instances, writing all those game-event streams to disk and managing the disk would have been unnecessary overhead. We tried to avoid hitting the disk on the game servers because, for cost reasons, we were using low-end hardware with non-RAID disks. The first-generation blade servers we used for Guild Wars had low-speed, laptop-quality 2.5″ drives, so reads and writes were bad for performance. In fact these hard drives were the #1 *hardware* cause of server downtime because of a manufacturing problem, but that’s another story. Incidentally, the #1 *overall* cause of server downtime was (of course) human error, but that’s true everywhere.
So in Guild Wars, we didn’t record “town” and “outpost” game instances because they were persistent and would get too big; we only recorded missions, dungeons, tournaments, guild-battles and arena games because of their limited duration. And even then, if the game recording was bigger than a certain size — 5 megabytes? Can’t remember — we would throw it away.
So I haven’t answered your question yet: when recording a persistent game, when the recording file for a game gets too big, take a memory snapshot of game state and throw away the recording, then start a new recording.
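Here is a rough sketch in C of the two policies just described: throw the recording away when a limited-lifetime instance exceeds the cap, and snapshot-then-restart for a persistent one. Everything in it is hypothetical, including the Recorder struct, the tiny 64-byte cap, and the single int standing in for “game state”; a real snapshot is the hard part and is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Tiny illustrative cap; the real limit was on the order of megabytes. */
#define RECORDING_CAP ((size_t)64)

typedef enum { INSTANCE_LIMITED, INSTANCE_PERSISTENT } InstanceKind;

typedef struct {
    InstanceKind kind;
    unsigned char bytes[RECORDING_CAP]; /* in-memory event stream */
    size_t used;
    int discarded;            /* limited instance: recording thrown away */
    int state;                /* stand-in for the live game state */
    int snapshot;             /* stand-in for a snapshot of that state */
    unsigned snapshots_taken;
} Recorder;

void recorder_init(Recorder *r, InstanceKind kind) {
    memset(r, 0, sizeof *r);
    r->kind = kind;
}

/* Append one event. Returns 1 if it was recorded, 0 if it was dropped. */
int recorder_append(Recorder *r, const void *event, size_t len) {
    assert(len <= RECORDING_CAP);
    if (r->discarded)
        return 0;
    if (r->used + len > RECORDING_CAP) {
        if (r->kind == INSTANCE_LIMITED) {
            /* Too big: give up on recording this instance. */
            r->discarded = 1;
            r->used = 0;
            return 0;
        }
        /* Persistent: capture state, then start a fresh recording
           that is relative to that snapshot. */
        r->snapshot = r->state;
        r->snapshots_taken++;
        r->used = 0;
    }
    memcpy(r->bytes + r->used, event, len);
    r->used += len;
    return 1;
}
```

Playback of a persistent instance would then mean restoring the latest snapshot and replaying only the events recorded since it was taken.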
I was starting to work on a project to do this in my spare time just before leaving ArenaNet because it can be used for all servers, not just game servers. But it is a complicated task. It’s either necessary to have every game-state-containing object support a “persist” operation, or necessary to snapshot the entire memory footprint of the game instance. The persist solution is probably too slow (unless you can afford to buy a lot of server hardware), and the latter solution is fragile because — in a language like C++, where pointers can point anywhere — you have to re-load the memory snapshot at the exact same location in memory for it to work.
I think in a higher-level language it would be possible to write a memory snapshot solution that would work. Alternately snap-shotting an entire running virtual machine would also do the trick. If you get the tech working I expect you’d find lots of companies willing to pay big money for it.
It is. Smalltalk, amongst others, has (almost) always worked with a memory snapshot (called an image) as the base development artifact.
For example, a recent effort connects serialization and testing frameworks, enabling one to open a debugger on a failing test’s stack in a different environment.
http://forum.world.st/fuelized-test-failures-td4640954.html
Not a glove fit for game development though :)
Thank you for sharing; that’s a great feature of the Smalltalk language that I had never heard about, and exactly the kind of reason to consider switching development environments.
C++ has a lot of power and can “do anything”, but it’s comically difficult to implement such a feature because the language allows direct access to memory.
The “load in the same location” problem can be mostly/partially addressed by disabling ASLR, but that’s a handy security feature you might not want to disable on production hardware.
I’m not sure if there are moderately easily accessible means to keep ASLR, store the random memory offset, and then spawn a process with that specified offset for the purpose of replaying it.
(& of course you’d have to reload it on very similar hardware/software, if not identical)
There are some ways to build some amount of pseudo-reflection on top of C++, but yeah – one way or another you’re going to have to taint a lot of your codebase to support any kind of automated persistence.
Taking snapshots is a valid solution to recording a persistent game, but it doesn’t address the issue of having to handle much more data in an open-world one.
In the original Guild Wars, there are never more than 8 players to record within an instanced zone. An open-world game, on the other hand, consists of thousands of concurrent players across multiple zones or, in a truly seamless world, a single very large one. The recording size limit used in Guild Wars would be reached in a fraction of the time in such a world.
Is server recording really feasible in a game of a much larger scale? Would some kind of partitioning be needed in order to limit the amount of data being recorded in order to lengthen the recording?
Were there other ways these recordings were used to help the development process?
I would guess that these recordings are an awesome way to integrate into the automated testing that runs with each build to ensure that bugs do not re-emerge with new code.
I also wonder how the use of user recordings can be used in other applications outside of games to help with development; I know that the Google QA team does something like this with its internal deployments of apps.
Game recordings were used primarily for reproducing crash-bugs so we could fix the code. It’s awesome to be able to replay a bug over and over until you figure out the root cause.
The second use of game-recordings was to allow players to view tournament battles, guild battles, and other games just like watching TV.
A third use was load-testing servers, as described in the article above.
And finally, we used recordings for regression testing. For example, I replaced the memory allocator (malloc, which uses HeapAlloc behind the scenes) on servers with the low-fragmentation heap allocator (http://msdn.microsoft.com/en-us/library/windows/desktop/aa366750%28v=vs.85%29.aspx) and needed to make sure that nothing amiss would happen when rolling that out across all our servers.
Unfortunately, it turned out not to be possible to use recordings for per-build automated testing because each new build of the game was likely to have incompatibilities that prevented game recordings from working. For example, if a game behavior changed slightly, the recording file wouldn’t play back the same way. That’s why, whenever there is a new build of Guild Wars, all the old recorded games disappear.
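One simple way to enforce that rule is to stamp each recording with the build that produced it and refuse playback on any other build. The sketch below is an assumption about how that could look in C, not a description of the Guild Wars file format; the header layout, the magic value and the function name are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arbitrary illustrative marker, not a real file signature. */
#define RECORDING_MAGIC 0x47524543u

/* Hypothetical header written at the start of every recording. */
typedef struct {
    uint32_t magic;   /* identifies the data as a recording */
    uint32_t build;   /* server build that produced the recording */
} RecordingHeader;

typedef enum {
    PLAYBACK_OK,
    PLAYBACK_NOT_A_RECORDING,
    PLAYBACK_BUILD_MISMATCH
} PlaybackCheck;

/* Decide whether a recording may be replayed on the running build.
   Any build difference is treated as incompatible, because even a small
   behavior change can make playback diverge from the original game. */
PlaybackCheck can_play_back(const RecordingHeader *h, uint32_t running_build) {
    assert(h != NULL);
    if (h->magic != RECORDING_MAGIC)
        return PLAYBACK_NOT_A_RECORDING;
    if (h->build != running_build)
        return PLAYBACK_BUILD_MISMATCH;
    return PLAYBACK_OK;
}
```

Rejecting the recording up front is cheaper to diagnose than letting playback silently drift out of sync halfway through a game.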
Hi Patrick,
Great article.
Recording really is extremely useful but not just for load testing. We have used offline recording for performance analysis playback and recently switched to using it for online real-time simulation/playback across a large number of computing nodes.
http://www.jinspired.com/site/jxinsight-simz-1-0
I see parallels with both trading and gaming platforms.
http://www.jinspired.com/site/jxinsight-simz-the-matrix-for-applications-threads-activities
Some like to refer to it as “event sourcing” but I am personally not too keen on this because events in the typical IT context seem lifeless and really just a means for transport…the real power is in the reconstruction of the essence of software in another container (or world).
Kinda off-topic, but since I’ve been on a Blizz emulation team for 7 years I have to say that the old dev teams did an awesome job on the low-level modules. Congrats on that. Too bad the new generation of coders is slowly running it into the ground. I can only hope you will bring your background experience to ArenaNet.
I had the feeling game recording was used when the first 3 exploit fixes were made for Diablo 3. What are the odds that some GM is watching your game where you can kill everything without taking any dmg :D. Awesome tech idea!
Informative article! One minor quibble: SpiderOak isn’t as secure as you might think. Sure, they might say they don’t store a plain-text password, but there’s absolutely nothing stopping a rogue employee from getting your password from the form post *when you actually log into their service*, so in essence, it’s no more secure than Dropbox. At the end of the day, both companies have policies in place to prevent employees from looking at your data, but as long as the companies use a login system (versus a private, client-side-only decryption key), your data is susceptible to rogue employees peeking at it. Period.
I’m going to have to disagree with your reasoning as to why SpiderOak and DropBox have equivalent security.
SpiderOak servers don’t contain the encryption keys for the data, as do DropBox servers. So DropBox is vulnerable to server theft, remote intrusion, and rootkits on their servers. But SpiderOak isn’t because their servers contain data that was already encrypted on the client computer.
But let’s look specifically at the “rogue employee” case. Is it possible for a rogue employee at SpiderOak to modify the client application code to steal keys, push a build to users, deploy server code to intercept those keys, steal the data, and decrypt it? Sure. But if SpiderOak has auditing in place they’re likely to see one or more of those actions.
Now consider DropBox and rogue employees: any employee who has administrative access to the servers has everything: keys + data. How are you going to stop them?
So pick your poison. I have no reason to trust one company over the other, but I prefer SpiderOak because there are fewer paths of attack since they’ve designed a more secure system.
> void main() {
Game developer quality!
Perfectly legal C89.
This person knows enough C to hand out coding tips. Why don’t we head over to his/her blog to read more? Oh right, because it’s easier to make snarky comments on someone else’s blog.
Lag much? Celebrating tech features while your clients consistently disconnect between updates and start looking for another, more stable product (MMO) is not THE best advertising. And while GW2 accounts are being resold for a handful of apples (break-even prices below 50 bucks), the game-making industry is completely blind to declining sales. Declining commitment from your playerbase is the downfall of any game company. Too bad that warning will not be heard until the company is broke & employees are fired.
Instead of flaming people on the Internet, I would encourage you to be civil as it’s more likely to have the effect you desire.
If it was easy to make games like GW2 with no bugs and perfect game balance, I feel sure that more companies would be doing it.
In any event, sorry to hear that you’re having trouble with GW2, but I don’t work at ArenaNet any more; I left the company in 2009.