In 2015 I participated in HandmadeCon, a convention created and run by programming guru Casey Muratori about game development. The event included five speakers: Tommy Refenes, Mike Acton, Jonathan Blow, Ron Gilbert, and myself. The event was hosted at the Seattle Public Library, with Casey interviewing each of us in turn about the technology behind the games we developed, the challenges we experienced, and the solutions we utilized.
Casey did an excellent job as an interviewer, came prepared with lists of questions, and had the programming & game-development expertise to break down our sometimes overly-technical answers with follow-up questions to unpack things for the audience.
I enjoyed being interviewed, and especially enjoyed listening to the other interviewees — and Casey — describe their thought-processes when solving the myriad challenges they faced making games.
The talks are all on YouTube:
- Tommy Refenes – 2015
- Mike Acton – 2015
- Pat Wyatt – 2015
- Jonathan Blow – 2015
- Ron Gilbert – 2015
- 2016 Follow-up Q&A
There are a lot more great talks for 2016 too.
In 2016, Brian Karpala transcribed my talk (and who knows, perhaps some other ones too) and sent me the transcript to post, and I just discovered it in a corner of my inbox, eight years later! Sorry Brian! In any case, here’s the transcript!
CM: Casey Muratori (interviewer)
PW: Patrick Wyatt (yours truly)
CM: Pat is the person who I have the most notes for and I pretty much looked at it and was like we’re never going to get through all of this. So we’ll do our best, but we really do need all the time.
CM: On Handmade Hero, one of the things I get asked all the time, like all the time, is about network programming. And people are like are we going to do networking, are we going to do networking, and my answer is always no we’re not going to do networking because I’m like the last person you should ask about network programming. I’ve literally never shipped a piece of code in anything that does networking. My experience with networking is like internal tools only, and I shudder to think what would have happened to these tools if they had been deployed in the wild. I’m sure they would break like immediately.
CM: And so what I wanted to do, for this session, was to sort of get the person who I thought knew absolutely the most about game networking to sort of fill that in, in whatever we can do in an hour to sort of answer those questions, where it’s like everyone wants to know more about this so maybe I felt like we could get a perspective on it. Pat Wyatt is somebody who is, I think I said this on the actual diagram, he’s the person I think of who had the only network game that shipped with no problems somehow. It was like perfect. And that was Guild Wars and it was just this, it was like this amazing feat of engineering that had actually had the launch that it did. And because basically everything else at that time was just like oh yeah the day that they launched the game, like World of Warcraft and the server immediately goes down, or whatever else happened. So I was like, he had plenty of experience with it working, on previous titles on Warcraft and on Diablo and then when he went to do Guild Wars that was kind of the Nth generation of his network coding so I was hoping, I could get at some of that knowledge and have him give us an overview of what does it take to ship these games that are played by literally millions of people and how do you go about designing it.
(01:46)
CM: So please give a warm welcome to Mister Pat Wyatt.
Casey and Audience Claps
PW: Thank you.
Casey and Audience Claps
CM: Alright so give people some context here because this was something that I was kind of curious about as well. The first network game you worked on, was it Warcraft?
PW: No actually it was Battle Chess. I ported Battle Chess from, there was a DOS and an Amiga version and I had to do a Windows 3.0 version. It was on modem of course.
CM: Okay. This is a modem version of Battle Chess was your first experience with network programming?
PW: That’s right.
CM: Okay. So Warcraft was actually like a second generation thing in some sense, you already have some experience and this was before the internet for the most part was used in gaming, I believe, right?
PW: That’s right.
CM: So local network only.
(02:33)
CM: Tell me a little bit about that development process. What was it like, how did you guys go about doing that network development? Cause I don’t even know like what was involved in that time.
PW: Yeah, well because the internet didn’t exist [ed: “wasn’t widely available”] I sort of had to invent the whole idea about how this would be done whole cloth. And, one of the first immediate obvious things was we had about I think 600 units that were going to be running around in the game, and when I started working out the math of well if every unit is sending its orders and what it’s doing all the time, that was just going to totally saturate our 2400 baud modem.
CM: [laughs] 2400 baud modem?
PW: That’s right.
CM: Okay.
PW: And so it was immediately obviously an impractical solution to try to send what each unit was doing, and I’m not sure exactly how I came up with it; but, I figured that what I had to do was instead just send the processed user input. So when the user clicked on a map, then I would send “well he clicked here” and he had these units selected, go tell those units to do that. And so both computers would work in lock-step with each other, so that you only had to send the processed input. Which is very very tiny, in fact you could fit like a 300 baud modem or something would do the job.
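[ed: A rough sketch of the arithmetic behind this, not the actual Warcraft code. The packet layout, the field sizes, and the helper names `pack_order` and `state_bytes_per_second` are all hypothetical, and the 10-bits-per-byte figure is an assumption about serial framing.]

```python
import struct

# Assumption: about 10 bits on the wire per byte (start bit + 8 data bits + stop bit).
MODEM_BYTES_PER_SECOND = 2400 // 10


def pack_order(turn, order, x, y, unit_ids):
    """Pack one processed input ("these units, do this, over there").

    Hypothetical layout: turn (u16), order code (u8), map x (u8), map y (u8),
    unit count (u8), followed by one u16 per selected unit.
    """
    header = struct.pack("<HBBBB", turn, order, x, y, len(unit_ids))
    return header + struct.pack(f"<{len(unit_ids)}H", *unit_ids)


def state_bytes_per_second(units, bytes_per_unit, updates_per_second):
    """Bandwidth needed if every unit reported its own state instead."""
    return units * bytes_per_unit * updates_per_second
```

Under these assumptions, 600 units reporting 4 bytes each four times a second would need 9600 bytes per second against roughly 240 available, while a click that orders three units around packs into 12 bytes.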
(03:40)
CM: For those synchronous models, obviously the latency becomes a bit of an issue potentially, no? I mean how did you guys…I mean Warcraft obviously being a real time strategy game isn’t the most twitch oriented but it’s still real time response, so how did that play out in the code?
PW: Actually it’s really sort of interesting. RTS by and large are not very twitchy, because what happens is you’re busy giving so many different orders at the same time that you don’t really need to see the action take effect instantaneously. In the single player version there is actually no delay: you click on a unit and then immediately it starts executing the action. But in the multiplayer version we have a tunable parameter for some of the earlier games and we set it to a hardcoded value in the later versions. Basically you say “you go to this” and it’s like “yes, my liege”. [pause] And then it starts going.
Casey laughs
(04:25)
PW: Right and it’s because what happens is you send the order out and you have several packets in flight. Everybody’s got these packets in flight all the time and so it takes a while for that to get to the other side. You might get the packet early but you don’t start processing that packet until it’s time for that packet to happen. There’s sort of a long queue of I think 400 milliseconds before your actions need to take effect. And, if it doesn’t get there by then, the game will still allow you to click buttons and do things; but, really the simulation has stalled out and the unit’s stopped moving until you start getting packets again from all the players.
(05:03)
CM: So sort of the architecture of that original Warcraft networking model was something like “I have a queue of user input”, basically, and they might just be doing a bunch of stuff. It’s getting stacked up and really…I start sending it out, until I get confirmation back that the next step has occurred. I can’t actually pull anything, I can’t actually consider one of these UI options executed. So I’m just trying to buy time until that happens. Is that…
PW: That’s right. I just waited until I get packets from every single player in the game, and the scale is two to eight players in Warcraft 2, and Starcraft. And as soon as you got everybody’s packet then you can execute the turn. Although you generally wait until enough time has elapsed. Which I think we were doing like four turns, four network turns per second in Warcraft 1.
CM: Basically what happens, is all of the people send I guess, what I’m hearing is not sort of the way I was picturing it when you first said it. There’s basically fixed slices of time, these 400 milliseconds, which like you said it was a tunable parameter. There’s fixed slices of time and basically everyone is going to get to make a move in that time and that move maybe I didn’t make a move.
PW: Correct.
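[ed: A minimal sketch of the turn structure described here, assuming each player sends exactly one packet per turn (possibly empty) and that orders are scheduled a fixed number of turns ahead. The class `LockstepSim` and the `delay_turns` parameter are hypothetical names, not taken from the Warcraft source.]

```python
from collections import defaultdict


class LockstepSim:
    def __init__(self, players, delay_turns=2):
        self.players = set(players)
        self.delay_turns = delay_turns      # the tunable input delay, in turns
        self.turn = 0
        self.inbox = defaultdict(dict)      # turn -> {player: [commands]}
        self.log = []                       # executed (turn, player, command)
        # Nobody can have issued orders for the first few turns, so treat
        # them as already filled with empty packets.
        for t in range(delay_turns):
            for p in self.players:
                self.inbox[t][p] = []

    def schedule_turn(self, issued_on):
        """Turn on which an order issued now will take effect."""
        return issued_on + self.delay_turns

    def receive(self, player, exec_turn, commands):
        """Store a player's packet; it may arrive early and simply waits."""
        self.inbox[exec_turn][player] = list(commands)

    def try_advance(self):
        """Run the current turn only if every player's packet has arrived."""
        packets = self.inbox.get(self.turn, {})
        if set(packets) != self.players:
            return False                    # stalled: someone's packet is missing
        for player in sorted(packets):      # same order on every machine
            for cmd in packets[player]:
                self.log.append((self.turn, player, cmd))
        del self.inbox[self.turn]
        self.turn += 1
        return True
```

A "no move" turn is still a packet, just an empty one, which is what lets every machine tell "this player did nothing" apart from "this player's packet has not arrived yet".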
(06:11)
CM: But everyone’s going to get to send what their move was, and if I don’t hear from anybody, we just keep running out the clock? If we don’t hear from everybody in that time, or?
PW: That’s right. For the next couple seconds I’ll still take input on local client side and then after if I haven’t heard from anybody from…I am just going to guess five seconds, then I actually grey out the screen so you can’t do anything because clearly we now have a network issue.
CM: Something went wrong. Okay.
PW: And so people get the idea “oh I am not enabled to do anything”.
CM: So that’s pretty interesting! That I guess in that sense, that’s a pretty clean division of the networking code from everything else. It’s like it almost exists in a vacuum because as long the only that I guess that you need from the game from that point is the ability to not just execute something; but, that should be pretty easy because the user is not doing anything it doesn’t execute something?
PW: That’s right.
CM: Okay so that’s that’s actually kind of cool. So…
PW: And I would say like it would be really nice “yes! It could be so easily [be] isolated and everything!” But, if you were to look at the code, you go like “where is it [the network code]…It’s everywhere! it’s spread out!”
CM: Oh really?
PW: Oh yeah, the first Warcraft was a mess.
CM: Okay, okay.
(07:12)
CM: But that would be something that is fixable, meaning the integration into it isn’t necessary, it is just how it happened to organically occur?
PW: That’s right.
CM: I see.
PW: Organic is a good word for it.
Audience laughs
PW: But that was my first original development project so I had to learn along the way.
CM: Yeah. Okay, so moving forward, I guess then for like Warcraft 2 for example, same sort of thing?
PW: Yeah. Really not a whole lot had to change. There are a lot of data structures that just assume two players and there would just be boolean values for them and then it’s like “oh we are going to do eight players, now!”. Well we can pack eight players into one byte, right? We just had one bit flag for each entry. And so, it was really more a failure of vision that Warcraft 2… Warcraft 1 didn’t have more players in it. It’s like we could do two players, oh my gosh that’s a lot, right? We could have done more but we were thinking about modem play.
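[ed: A small sketch of the "eight players in one byte" idea: one bit per player slot instead of one boolean per player. The function names are hypothetical.]

```python
MAX_PLAYERS = 8  # one bit per player fits exactly in a single byte


def set_flag(mask, player):
    """Return the mask with this player's bit turned on."""
    return mask | (1 << player)


def clear_flag(mask, player):
    """Return the mask with this player's bit turned off."""
    return mask & ~(1 << player) & 0xFF


def has_flag(mask, player):
    """True if this player's bit is set."""
    return bool(mask & (1 << player))
```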
CM: I see. I see. And then by that time it had kind of gotten a little bit better in terms of what it could support, and how many people. I guess with Warcraft you learned what people might want to do which is play eight player games or other things like this?
PW: That’s right.
(08:15)
CM: Okay so from there, the next title is sort of drastically different, I guess is my understanding. You then did the networking layer for Diablo. Which is, you did the part where it’s the actual in game networking layer, right? I guess someone else was responsible for the BattleNet back-end sort of system?
PW: That’s right.
CM: So, Diablo is not synchronous in this way, is that correct?
PW: Right, so Diablo was actually developed by another company that was acquired by Blizzard Entertainment. It was a bunch of guys at Condor, and they had this great idea like Diablo was their idea and when we heard about it, we were like “this is awesome! We need to be doing this game with them!” But they were making a single player, single character game and we convinced them that they should do a multiplayer. Oh and when we eventually convinced them to do a multiplayer they were like they wanted it to be turn based multiplayer.
PW: “No no no no! It’s gotta be real time!”
CM: Okay.
PW: When we convinced them to do that it’s like okay so [you folks at Condor] go do that! And then they really needed some help to get it done, and so I actually flew up there with a plan to go and figure out and help them and it was going to be a synchronous network model like the…
CM: Similar to Warcraft.
PW: Yeah exactly, because I had already done that and it would prevent cheating. I looked at the code [and thought] it’s going to be really hard to back-end [insert] a synchronous system into a game that’s already been written and doesn’t really have that data constructs where it’s… think about like let’s keep the server stuff over here, the simulation stuff over here, and the game stuff over here. It was all just intermixed. And so I just threw everything out the window and wrote a different network model.
(09:39)
CM: So when you say that, you kind of mean that my representation for a particular object in the world in something like Warcraft, is a little more split, in terms of… Explain a little bit more by what you mean by you looked at the code and found something you didn’t expect, in terms of that welding?
PW: Sure. So if you are doing a network simulation you ideally want to have all your networking data structures separate because they are going to get modified only when packets show up that should be affecting that data. And your simulation data, I mean I’m sorry, your user interface data is probably going to be separate because: I [the player] am tweaking things; I [the player] have status bar updates for these units. There’s lots of stuff that the network code and the client code are gonna have that are different because of the timing of the changes, right? The simulation can only change when packets show up. And the client data can change instantaneously as I [the player] clicked the mouse and the keyboard and things.
CM: Okay.
(10:28)
CM: And so in something like Warcraft 2 where you already had some experience with it, you guys had sort of gotten to the point where you were already dividing the code base up in this way I guess is sort of the thing; but, when you went to their code base, that’s all still kind of intermixed.
PW: Yeah also the other factor is when you are doing synchronous lock-step games the problem is you cannot have any coding errors. Because if you’ve played one game already and you initialized some of your internal state variables slightly differently than mine and I’m playing my first game, then you do a command and it doesn’t behave exactly the same on your system as it does on my system, and the game desynchronizes and then all we can do is disconnect because we don’t know what else to do.
CM: So basically there’s divergence there because of some state that was not cleanly initialized?
PW: That’s right. And you have to be exceptionally careful and it’s just ridiculous how hard this is. Accidental use of a pointer after it’s been freed or uninitialized variables or static variables that don’t get reinitialized. Lots of things can really impact this, and so, it takes a long time to find these bugs too and Diablo was written in haste and it just didn’t have the same level of carefulness, with respect to variables and so there was just no way to do a synchronous lock-step game with that.
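[ed: The interview does not say how desyncs were detected. One common approach, shown here purely as an assumption, is for each machine to checksum its simulation state every turn and compare the results. `state_checksum` and `is_desynced` are hypothetical helpers.]

```python
import zlib


def state_checksum(units):
    """CRC32 over a canonical dump of the simulation state.

    `units` is an iterable of (unit_id, x, y, hp) tuples. Sorting makes the
    result independent of the order in which a machine stores its units.
    """
    dump = ",".join(f"{u}:{x}:{y}:{hp}" for u, x, y, hp in sorted(units))
    return zlib.crc32(dump.encode("ascii"))


def is_desynced(checksum_by_player):
    """True if the players' checksums for the same turn do not all agree."""
    return len(set(checksum_by_player.values())) > 1
```

A stale static variable that changes a single hit point on one machine shows up as a mismatched checksum on the next comparison, which is at least a place to start looking.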
(11:37)
CM: So what was your approach then? You decide to change things up, you’re like we’re gonna have to do something different. What was that process like? Cause you said you came out there with a plan and now the plan is gone, right?
PW: That’s right. So I came out with a plan in like a day and it’s what I call asynchronous loosely-coupled games. We are all playing sort of the same game, but not exactly. You happened to be the first person to get onto level one of the game, and so you’re now the master for that level. And when I come to the game, to come down to that level of the game… So for people who have not played Diablo, you have a 16 level dungeon. There’s a master for each level, so I get to level…
CM: So the master happens to load-up that level first, in some way? How do we know that? Cause it’s network, so I don’t know who loaded first.
PW: Well you created the game, and then I join into your game, and so you have the authority. You’re the first person on level 0, the town level, and then you go down to level 1. And when I join your game, you see that I go down to level 2 first, and so you are like “okay you are the master for level 2”, and as long as I am in the game, I’m the master. If I were to drop out, there would be no master, or arguably you could say any of them could pick up the master position.
CM: So, let’s just drill down that for one second. That sounds like sort of a consensus problem across these clients though because when I say I see that you dropped down to level 2 first, how do I know? How do I know that I’ve seen that at the right time, or how does another client know? Is there some kind of centralized person who owns level 0 who is the person who decides who owns level 1, or how does that….cause I mean in networking things can come in at any time and you never really know who sent what when, right?
(13:12)
PW: That’s right. The person who created the game is the arbiter of who’s the level master for each level. And if that player drops out then the next player who came into the game is, like increasing the sequence numbers.
CM: I see. Okay so basically the person who started the game, they are the master and they will be periodically sending out a list of who’s in the game and your order in that, with they are first and the next person second, third, is like the order we will go through to keep trying to play this game if something bad happens, kind of a thing?
PW: That’s right.
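[ed: A minimal sketch of the ownership rules as described: the longest-standing player in the game arbitrates, and the first player the arbiter sees entering a level becomes that level's master. The class `Game` and its methods are hypothetical, and leaving a level without a master until someone else enters it is one reading of the description.]

```python
class Game:
    def __init__(self):
        self.join_order = []     # index 0 is the creator, who arbitrates
        self.level_master = {}   # level number -> player who owns it

    def join(self, player):
        self.join_order.append(player)

    def arbiter(self):
        """The earliest-joined player still in the game."""
        return self.join_order[0] if self.join_order else None

    def enter_level(self, player, level):
        """Arbiter's ruling: the first player seen on a level owns it."""
        return self.level_master.setdefault(level, player)

    def drop(self, player):
        """Remove a player; any levels they owned become master-less."""
        self.join_order.remove(player)
        for level, master in list(self.level_master.items()):
            if master == player:
                del self.level_master[level]
```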
CM: Okay.
PW: So you are the master of the game, I go to level 2 and you tell me okay you’re the level master for 2 and so “great I get to initialize everything myself and tell people what’s going on on that level”.
CM: I see.
PW: And then after that point, so now I’ve taken over ownership of level 2 and I am picking up things and dropping them and telling all the other players yeah I picked up this item and dropped this item. If I drop out of the game and then we need a new level 2 master then somebody else can pick it up and they can sort of apply all the knowledge of what happened there previously. Like there’s a dead monsters here and this treasure chest has been broken open, then they can re-apply all those rules so that the world seems consistent.
(14:17)
CM: I see and so they kind of keep a buffer of here’s all the stuff I’ve been told about level 2, even though I myself am not actually keeping a representation of level 2 yet?
PW: That’s right.
CM: So at that time if we do need to do a fail-over or if I come on to level 2 I guess would be another case where I just am starting to look at level 2. I’m gonna go in and take this buffer that I’ve been recording of everything that I’ve been told about level 2, and I’m going to replay it, in order to get my level 2 to that state.
PW: That’s right. And our lists may not match up exactly because of hacking or things like that and that’s why I call it loosely-coupled cause we are really playing two games that are nearly identical. But it’s obviously very very cheatable.
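[ed: A sketch of the "record everything I was told, replay it when I arrive" idea. The event names and the shape of the level state are hypothetical, chosen only to mirror the examples in the conversation (dead monsters, opened chests, items on the ground).]

```python
from collections import defaultdict


class LevelLog:
    def __init__(self):
        self.events = defaultdict(list)   # level -> [(kind, object_id), ...]

    def record(self, level, kind, object_id):
        """Remember a broadcast about a level, even one we are not on."""
        self.events[level].append((kind, object_id))

    def replay(self, level, fresh_state):
        """Apply every remembered event, in order, to a freshly built level."""
        state = {key: set(values) for key, values in fresh_state.items()}
        for kind, obj in self.events[level]:
            if kind == "monster_killed":
                state["monsters"].discard(obj)
            elif kind == "chest_opened":
                state["closed_chests"].discard(obj)
            elif kind == "item_dropped":
                state["ground_items"].add(obj)
            elif kind == "item_taken":
                state["ground_items"].discard(obj)
        return state
```

Because each client replays only what it happened to hear, two clients can end up with slightly different levels, which fits the "loosely-coupled" label.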
CM: And so what happens now as the game is being played? Two of us are both on the same level and we’re doing stuff, what does that networking process look like? One person is the master for the level and then we have N number of other people who are getting information about it; but, what does that actually look like?
(15:08)
PW: Yeah, so we’re all broadcasting messages to all the other players saying this is what happened. I pick up a sword and I tell all other players that sword is now gone; and, so you can have a race condition for that, and we just arbitrate it. The level master gets to make the determination of who actually got that.
CM: So basically two people in their separate copies of this game, they both go to pick up the sword, right and they both think that they’ve picked up the sword so they send out the packet “I picked up the sword”…
PW: Well more like I am attempting to pick up the sword.
CM: Okay so there’s actually like a two stage process? I’m attempting to pick up the sword. Two packets come in that we can’t really tell who maybe was doing it first or whatever so the level master goes I see two people are trying to pick up the sword and I decided to break the tie however I decide to break it?
PW: That’s right.
CM: Any particular nuance to that or it doesn’t really matter?
PW: No it’s pretty straightforward. It’s whoever got it first and of course if you are the level master you are always going to be faster because of local loopback.
Audience and Casey laugh
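[ed: A minimal sketch of the two-stage pick-up: clients send an attempt, and the level master grants the item to whichever attempt it processes first. The class name and method are hypothetical.]

```python
class LevelMaster:
    def __init__(self, ground_items):
        self.ground_items = set(ground_items)
        self.owner = {}   # item -> player who was granted it

    def attempt_pickup(self, player, item):
        """Grant the item to the first attempt that arrives; refuse the rest."""
        if item not in self.ground_items:
            return False
        self.ground_items.remove(item)
        self.owner[item] = player
        return True
```

The master's own attempts never cross the network, so in a close race they are processed first, which is the loopback advantage the audience is laughing at.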
(16:05)
CM: Okay and so is that roughly the way that most things in the game work? So if I try to attack somebody, is it a similar two stage process? I’m trying to attack this person and then the level master acknowledges it and only once it acknowledges it do I actually see the attack occur? Or is it more asynchronous than that in that I see the attack occur either way but maybe the damage goes back up later if it turns out it didn’t acknowledge? You know what I’m saying, like in terms of how does that work out when we’re actually doing other sorts of things besides something as discrete as picking up the sword?
PW: Right, so it’s really loose. You go attack a monster on your system and you kill him and you decide that he’s dead. Then you let me know that he’s dead and I’m the level master and I go: “okay great”!
CM: Okay.
PW: Yeah. So I mean it’s really possible to cheat very easily right and you can just kill anything as fast as you want.
CM: And that isn’t really…but ignoring the cheating part of it, in terms of actually keeping the stuff working in the actual game, it’s not really necessary to do anything beyond that? Meaning if a bunch of people who are all trying to play the game legitimately – not hacking it, not trying to do anything weird – then simply having the clients announce when they did something? If two people say that they killed the monster it doesn’t matter because, the state just goes to dead and so it’s fine, that kind of thing?
PW: That’s right.
(17:19)
PW: The only area of contention is in picking up things off the ground.
CM: Okay.
PW: Or breaking open a chest…actually no even that doesn’t because the chest is broken open, both players might have done it at the same time, but it leads to things popping out on the ground that you can then pick up, they [the items] are the areas that you can fight over.
CM: So the only…and is that because things like damage are just numbers that go down so if I get two people who say did damage I just apply them both? Or is that just because picking up is the one thing where it’s like there was only one item and we need to know who had it kind of a thing?
PW: Exactly.
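[ed: A small sketch of why kills and damage need no arbitration while pick-ups do, assuming kills are stored as a set of dead monsters and damage reports are simply summed. Both helpers are hypothetical.]

```python
def apply_kill(dead_monsters, monster_id):
    """Marking a monster dead twice leaves the same state as marking it once."""
    return dead_monsters | {monster_id}


def apply_damage(hit_points, reports):
    """Damage reports just add up, so their arrival order does not matter."""
    return max(0, hit_points - sum(reports))
```

Two players reporting the same kill, or two damage numbers arriving in either order, lead every client to the same result. A single sword on the ground has no such property, since only one player can end up holding it.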
CM: Okay. Alright, so I think that’s pretty good fundamentals. I’m trying to think if I’ve missed anything of the…
PW: Yes, you missed one thing: Don’t do a game like this!
Casey laughs
CM: Okay why not?
PW: Because people will cheat. In fact the thing that happened was Diablo came out and people really loved the game! A lot of players had fun, particularly when they played with their own friends; but, as soon as you played out there in the larger universe, there were just lots of people who were griefing. And in fact, it was really startling to me because I’d played MUDs before and things like that and occasionally there was griefing but usually you had GMs who would oversee things. Volunteer GMs who would sort of make sure that people weren’t complete assholes.
CM: Right.
(18:27)
PW: But a couple days after we shipped Diablo, there’s a friend of the company who came in and he was playing in an office adjacent to me. A guy named Yash, just going into dungeons and killing people because he’s a much higher level than them; and, there’s nothing that they can do about it. All their armour and swords and everything would pop-off onto the ground and then they get reincarnated in town and they come down in the dungeon naked trying to recover their stuff and then he kills them again! And again! And again and again and again!
CM: And this is at your own company?
PW: And he taunts them…well he was actually just a friend of somebody who worked on the game.
CM: Oh so just a friend.
PW: And I watched him and I realized that this was a microcosm of stuff that was going to happen. I didn’t say anything…I didn’t say “you shouldn’t do this” or anything like that, instead I was like “let me understand this”. This was really my first serious experience with griefers. I mean it was shocking at that time. And now every game I’ve ever designed since I have to think about how are people going to hack this, and how are people gonna especially abuse others which is in some ways even worse than hacking.
CM: Yeah, yeah.
(19:21)
CM: Well okay, so that’s actually kind of fascinating. So, moving forward, I guess the next one you did was Guild Wars, yes?
PW: Starcraft and then Guild Wars…
CM: Okay so you did the network on Starcraft as well?
PW: Yeah it was a lot of the same stuff.
CM: Was that asynchronous or was that synchronous? Was it more like the Warcraft model….
PW: Just like Warcraft 2.
CM: Okay just like Warcraft 2.
CM: Okay so after Starcraft, you go to start Guild Wars, and I remember Guild Wars was particularly impressive to me because I remember the download to download Guild Wars was this little tiny thing, right? It was some kind of little tiny executable and it just took care of everything from there and all these sorts of stuff.
PW: That’s right. It was a 200 KB executable roughly and the challenge was that it was so small that people would click on it to download and it would go so fast they would be like “did that work?”.
Casey laughs
PW: So I think we should have made it bigger.
(20:07)
CM: We had to pad it out just to make sure they can see the progress bar.
PW: That’s right.
CM: Guild Wars is in some ways a lot more ambitious than Diablo because this now is sort of designed upfront to have a lot of the permanents maybe backed up by…were authenticated sorts of things I guess. This comes out of your experience with the griefing on Diablo, I guess? Is that…would that be a fair comment?
PW: That’s right. It’s also the desire to have an online economy.
CM: Okay. Okay. So, this one, I guess will…I got kind of a list of things to drill down in here but before I ask any of those questions, can you give people a brief understanding of what was involved in making Guild Wars happen? Because it’s got both persistent state and it’s got a peer-to-peer [networking] thing happening and those things interact and there was just…I remember you sent me an email where you were like “here’s all the servers we wrote for Guild Wars”, and there was 12 or something on this list. So, could you just give a brief high level explanation of what was involved in just the basic architecture of this system.
(21:08)
PW: Sure. First, it was not a peer-to-peer game at all, it was client-server only. So…
CM: So even when people were playing inside their own…when they would kind of spawn off the dungeon, that was still always round trip through a separate server?
PW: That’s right.
CM: Okay, sorry.
PW: It was fully server hosted.
PW: The very first thing you do is connect to the server and say “hi I’m Guild Wars version 0”, cause you’ve got this 200 KB stub and it’s like “oh you need version 53, here you go”.
PW: Download.
PW: You reboot the game engine.
PW: “Hey I’m version 53 what should I get?”
PW: “Okay great you’re up to date! Here’s the manifests for every single piece of data in the whole world”.
PW: You download that file. And then you say “well I want to go to the login screen, give me all the files I need for that.”
PW: You download those and then progressively you start accumulating all the files you need. So that you never actually have to download the entire, roughly at that point, 3 gigabytes of data. You could just download the tiny portion that you need.
CM: So this was actually like an on-demand package manager, you basically had in here? We have a dependency system, we know everything in the world and what it depends on, so when the game starts up it just kind of asks for some kind of root object for what it’s trying to do. Like I’m trying to go into this dungeon, so I ask for that and then it starts pulling in things. And you are like okay, now I’m going to ask for this, and that pulls in more things. These get downloaded into sort of a cache, on the drive, that you keep?
PW: That’s right.
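[ed: A sketch of the two steps described above: the version handshake, then downloading only the files a given screen or area needs. The function names, the reply tuples, and the dependency table format are hypothetical, not the Guild Wars protocol.]

```python
def handshake(client_version, latest_version):
    """Server reply to "hi, I'm version N"."""
    if client_version < latest_version:
        return ("update", latest_version)    # send the newer executable first
    return ("manifest", latest_version)      # up to date: send the manifest


def files_needed(root, depends_on, cache):
    """Everything reachable from `root` that is not in the local cache yet."""
    needed, stack, seen = [], [root], set()
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        if name not in cache:
            needed.append(name)
        stack.extend(depends_on.get(name, []))
    return needed
```

Asking for the login screen pulls in only the login screen's dependencies, so the rest of the game's data stays on the server until the player actually goes somewhere that needs it.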
(22:21)
CM: Okay. And those are marked in some way to tell whether they’re stale? In case they’re updated or does that never happen?
PW: Yes. Every file effectively is an ID number, both the identity for it and the version number. You can know whether you are up to date on the file, and so if it turns out that you’re out of date, you can say “I have version 17 of the warrior armor”; and, if you look at the manifest it’s like “oh I need version 19”. You say to the file server “want 19, have 17”, and it loads up both of those files, delta compresses them…
CM: Seriously?
PW: Well actually it delta’s them, then it compresses that and sends you down that small chunk because this is 2005 when we were doing this…actually, 2002 when we were writing the code for it, and bandwidth was much lower than it is today so we had to make everything as small as possible otherwise it would have been hours and hours and hours before you can get into the game.
CM: That’s awesome.
(23:08)
CM: Okay, so you on-demand delta compressed the difference between the version they have and the version they need, and send only the delta down?
PW: Correct.
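[ed: A toy illustration of "delta first, then compress", using Python's standard `difflib` and `zlib`. The patch format (copy and insert operations) is invented for this sketch and is not the format Guild Wars used.]

```python
import difflib
import zlib


def make_patch(old: bytes, new: bytes) -> bytes:
    """Describe `new` as copies from `old` plus literal inserts, then compress."""
    ops = []
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Copy (offset, length) from the old file.
            ops.append(b"C" + i1.to_bytes(4, "big") + (i2 - i1).to_bytes(4, "big"))
        elif j2 > j1:
            # "replace" or "insert": ship the new bytes themselves.
            ops.append(b"I" + (j2 - j1).to_bytes(4, "big") + new[j1:j2])
        # "delete" needs no operation: those old bytes are just never copied.
    return zlib.compress(b"".join(ops))


def apply_patch(old: bytes, patch: bytes) -> bytes:
    """Rebuild the new file from the old file and a patch from make_patch."""
    raw = zlib.decompress(patch)
    out, pos = bytearray(), 0
    while pos < len(raw):
        if raw[pos:pos + 1] == b"C":
            start = int.from_bytes(raw[pos + 1:pos + 5], "big")
            length = int.from_bytes(raw[pos + 5:pos + 9], "big")
            out += old[start:start + length]
            pos += 9
        else:
            length = int.from_bytes(raw[pos + 1:pos + 5], "big")
            out += raw[pos + 5:pos + 5 + length]
            pos += 5 + length
    return bytes(out)
```

The client already holds the old version, so the patch only has to carry what changed. `SequenceMatcher` is slow on large inputs, so a real patcher would use a purpose-built binary diff.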
CM: That’s like better than most internet infrastructure probably [laughs] everywhere in the world right now but okay.
CM: Going back to that sort of process of getting things, you said it’s both the ID and the version, you mean they’re together? Or the ID is somehow intermixed with the version? It’s just two separate pieces of data: ID and version, or…?
PW: That’s right. So every file has a base ID which is the very first file number that was assigned when the asset was created. So I create a texture and it’s like you have number 3, and I can forever more refer to that texture as 3, but now [after it is changed by an artist] it has different version numbers. So it’s also file 17, and file 28. Every time the artist changes it, that texture gets a new number with it. I always refer to it by the base ID and I look at the manifest for it and for the base ID 3, the current version is 781, and then I say I need 781, and I have 3, or 17, or whatever the previous version that I had.
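[ed: A sketch of the lookup described here, assuming the manifest maps each base ID to its current file number and the client remembers which file number it holds for each base ID. `build_request` is a hypothetical helper.]

```python
def build_request(manifest, local_versions, base_id):
    """Return the "want X, have Y" request for an asset, or None if current."""
    want = manifest[base_id]
    have = local_versions.get(base_id)   # None if never downloaded
    if have == want:
        return None
    return {"want": want, "have": have}
```

With a manifest entry of `{3: 781}` and a local copy at 17, the request is "want 781, have 17", which gives the file server both versions it needs to build a delta.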
(24:12)
CM: This is just kind of fascinating. Sorry this is not technically…
PW: It was a couple years of my life so I understand.
CM: Yeah. And just one other question about that I guess, because we sort of talked about this when we were talking about Super Meat Boy as well and what Tommy (Refenes) was doing and it is just kind of fresh in my head. How do you assign these IDs to things that artists make? Cause by definition an artist is not somebody who goes and knows what a file ID necessarily is. So, how does that process work? I’m somebody who needs to add a texture to Guild Wars, is there a server that you query? Are there tools? Just give me a quick idea of what that looks like.
PW: We are using Perforce as our revision control system and so they check that file into Perforce. There would be some way that file was referenced, so perhaps there was a model file, and the model file using 3DSMax, refers to this texture file by name. And that model file is referenced in a level file. Let’s say it’s a tree seeing how it seems to be the model for today [ed: previous talks referenced tree assets].
Casey laughs
(25:06)
PW: The level file [that] contains a bunch of assets is actually referenced in some source code. Like this is level 13 of the game, or this is the ‘Lion’s Arch Outpost’. What happens is there’s a tool, FsBuild, and it picks up all the code references to assets, and it’s like “okay let me load this level file and then let me look at the dependencies this has. Oh it’s got a model file in it, let me load the model file and let me look at all the dependencies. Oh it’s got a texture file, and then we load that. Textures don’t have any dependencies, okay. So now I can process that file and that file is now file ID 3 because it’s the third file that was ever processed.”
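Editor’s sketch: the dependency walk Pat describes resembles a depth-first traversal in which a file is processed, and numbered, only after everything it depends on. The Python below is a minimal illustration under that assumption; `assign_file_ids`, the asset names, and the dictionary layout are hypothetical and are not FsBuild’s actual design. It also assumes the dependency graph has no cycles.

```python
def assign_file_ids(roots, deps, next_id=1):
    """Number files in the order they finish processing.

    roots: asset names referenced from code (assumed input).
    deps:  dict mapping an asset name to the names it depends on.
    Assumes the graph is acyclic.
    """
    ids = {}

    def visit(name):
        nonlocal next_id
        if name in ids:
            return  # already processed via another reference
        for dep in deps.get(name, []):
            visit(dep)  # dependencies are processed first
        ids[name] = next_id
        next_id += 1

    for root in roots:
        visit(root)
    return ids
```

With a level that references a tree model that references a bark texture, the texture is numbered first, then the model, then the level.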
CM: Okay and that whole stack gets processed up into the piece of data that’s actually referenced?
PW: That’s right.
CM: So the textures aren’t referenced separately? Or they are?
PW: They can be referenced separately, yeah that’s right. The process is it saves file ID 3 in this big giant — you can think of it as a database of files — that has like more or less every version of a file, back to say [the last] 90 days or something like that.
(26:05)
CM: And so I’m an artist, I made a new version of this texture, how does it correlate those? Is it just use the full file path in Perforce, as the thing that knows it’s seen this before? Kind of associate the IDs internally with that Perforce path? Or…
PW: That’s right. So FsBuild loads up all the assets and creates a graph of everything, and then it goes “oh well that file hasn’t changed so I don’t need to do anything for it”, right? Cause it has lots of metadata associated with the build process, so it knows when files were last touched, and if something breaks it knows which artist checked that file in. Maybe that’s not the reason why it broke. It could be the programmer changed some code which caused all the art files to break.
(26:52)
CM: Okay. Alright, sorry that was a bit of a rat hole. I just was kind of curious. I’d wanted to know about that system because it’s pretty interesting.
CM: Going back to the networking part of things, essentially you got this system where it sounds like there’s already one server that we’ve already talked about what it is, which is, we need something that is sitting there running that has every version of every asset that we’ve ever shipped on it, which can answer this question, right? And presumably we need more than one of these cause everyone’s getting the data from this. So talk a little bit about how that works.
PW: Sure so we have lots of these file servers all over the world. Guild Wars is kind of unique in its era for being a game that had no monthly fee and so we couldn’t have lots of [expensive] servers. We needed to be extremely efficient in the way that we did everything, so we wrote everything from scratch. Because there weren’t tools that would do this delta compression thing and stuff. And so we have lots of file servers out there and they’re spread out all over the world, and it’s kind of interesting why.
(27:43)
PW: If you connect to a really [remote] long distance TCP server, what happens is, there’s this thing called the TCP Window, and I send you a few messages and you send me back a bunch of stuff. As a server you’re like “I’m not going to send you more than 64 KB of data because I don’t want to flood out the internet. And until that 64 KB of data gets acknowledged, I’m not going to send you more stuff. So you can only have about at most 64 KB of data outstanding.”
CM: When you say server, who do you mean? Cause you control [the] server.
PW: In this case, it’s the file server.
CM: So it only sends 64 KB of data and you’ve set that up, you picked that?
PW: So the TCP stack actually set…The window size is 64 KB, and until we negotiate a bigger window size that’s about as much as it would send. And in Windows of that era, you had really limited ability to control TCP parameters, especially on the game client, because you can’t go on and hack people’s [Windows] registry [because it requires Administrator permission].
CM: I see. Yeah nowadays I guess you can possibly set those options dynamically in code but then you couldn’t?
PW: Yeah but even then [on] Windows XP you would have to reboot.
CM: Really?
(28:42)
PW: Yeah.
CM: Okay. Alright so keep going, so you’re about to say there were problems with this…
PW: Right, so what you want to do is instead of trying to get stuff from a really long distance away server, where 64 KB you send it and there’s a long pause before they acknowledge and it comes back so you’re not constantly filling the stream. You want to have the file server really close to people so that you can send a lot of data more quickly. So instead of the delay being the limiting factor, it’s the size of the channel of how much bandwidth you can send.
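Editor’s sketch: the effect Pat describes follows from simple arithmetic. If at most one window of data can be outstanding per round trip, a single connection’s throughput is bounded by window size divided by round-trip time. The 64 KB window is from the talk; the round-trip times below are illustrative assumptions, not measurements.

```python
def max_throughput_bytes_per_sec(window_bytes, rtt_seconds):
    """Upper bound for one TCP connection: one full window per round trip."""
    return window_bytes / rtt_seconds

WINDOW = 64 * 1024  # 64 KB window, as described in the talk

# Assumed round-trip times, for illustration only.
far = max_throughput_bytes_per_sec(WINDOW, 0.200)   # distant server, 200 ms
near = max_throughput_bytes_per_sec(WINDOW, 0.020)  # nearby server, 20 ms

print(f"far:  {far * 8 / 1e6:.1f} Mbit/s")
print(f"near: {near * 8 / 1e6:.1f} Mbit/s")
```

Under these assumed numbers the nearby server can deliver ten times as much per connection, which is the reason for placing file servers close to players.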
CM: And, is that work-aroundable? Meaning let’s say I actually only had one file server, it sounds like that’s per TCP connection? In theory could a client be written to open multiple TCP connections with the server…
PW: Certainly.
CM: …and start multiple 64 KB chunks in parallel? But you decided not to do that, I guess?
PW: That’s right. You certainly could do that. But we just distribute servers all over the world: we had them in Japan; Taiwan; Korea; two data centers in the US and a data center in Europe. That was enough to do the job, to cover most of our user base. Australian users had to suffer but we just couldn’t afford to put servers there too.
Casey laughs
(29:48)
CM: Fair enough.
PW: Yeah and so we had lots of those servers which creates another interesting problem which is whenever you want to do a build you have to make sure that every server has all the data before you can say that the build was truly live. So it’s a distributed database commit problem.
CM: How did you handle that?
PW: It’s actually fairly straightforward. It’s like “okay did everybody get the data? Everybody? Everybody ready? Ready? Go!”.
Casey and the audience laugh
PW: And then afterwards you’re like “okay did everybody get it? Okay we’re good!”, right. And if one of them didn’t get it you keep trying again and again.
CM: It is like get it, get it, get it, and eventually he wakes up.
PW: Yeah but you wait until everybody’s got all of the data and then you try and commit the metadata on all the servers simultaneously.
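Editor’s sketch: the “everybody ready? go!” approach resembles a two-phase publish, where data is pushed to every server with retries and the metadata is committed only once all of them have it. The Python below is a minimal illustration; `publish_build`, `FakeServer`, and the retry limit are hypothetical names and choices, not the actual ArenaNet implementation.

```python
class FakeServer:
    """Stand-in file server that fails a set number of pushes before succeeding."""

    def __init__(self, failures_before_success=0):
        self.failures_left = failures_before_success
        self.data = set()
        self.live = None  # build whose metadata is committed

    def push_data(self, build):
        if self.failures_left > 0:
            self.failures_left -= 1
            return False
        self.data.add(build)
        return True

    def commit_metadata(self, build):
        self.live = build


def publish_build(servers, build, max_attempts=5):
    """Phase 1: push data everywhere, retrying. Phase 2: commit metadata."""
    pending = list(servers)
    for _ in range(max_attempts):
        # Keep only the servers that still have not received the data.
        pending = [s for s in pending if not s.push_data(build)]
        if not pending:
            break
    if pending:
        return False  # nothing committed; a human needs to look at this
    for s in servers:
        s.commit_metadata(build)
    return True
```

The important property is that no server switches to the new build until every server holds its data.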
CM: Okay.
(30:26)
PW: And there’s only twice in five years that it ever failed.
CM: And, wow, okay. And that’s…
PW: By failed [I mean] in a way that required human intervention.
CM: When you say database, are you talking literally about something that you guys wrote? Or are you talking about…
PW: That’s right.
CM: So basically it’s not like we installed MySQL or something, it’s like we wrote this thing from scratch to do exactly this job of serving the art assets.
PW: That’s right.
CM: Okay.
PW: Yeah, and in fact it’s embedded in the Guild Wars client. When you see the gw.dat file, it’s sort of the database file format that we used for everything.
CM: What kind of load were these servers under? What were they designed to serve basically? How many people updating and how much traffic? I mean just to give people a ball-park. There weren’t that many of them. You said 6 or 7, something it sounded like?
PW: There were six data centers.
CM: Six data centers.
PW: But, within each data center there were lots of these file servers.
CM: Oh because you would just duplicate the file servers as many times as you needed to for the load in that area?
PW: That’s right.
(31:19)
CM: How much would one typically handle, do you think? Or is that…
PW: So each one would saturate its uplink at about 400 Megabit.
CM: Okay.
PW: And they would just run 400 Megabit sustained load during peak time.
CM: So, the majority of the thing that actually prevented you from serving more stuff on a particular server was actually just the outflow pipe?
PW: It was actually a combination of factors. File servers were actually really fascinating, and really easy to do things wrong. But there were certain things, like how much the network card could push onto the network. We were using blade servers and unlike individual pizza box servers, blade servers in that era had a shared network infrastructure for the entire row of 14 blade servers. So you could saturate that, which would be a problem. So you had to spread your file servers out so that they couldn’t all be in the same rack. They had to be in different racks.
CM: Oh my god.
(32:13)
PW: The other thing is the amount of data we are sending is very large, and these servers all had only 2.5 gigs of memory. You gotta subtract out some for the operating system and stuff like that. So you could have about a gig and a half of cache but you have to have enough space left over for “load file A, load file B, delta compress them”. Now you have file A, file B, and the delta in memory all at the same time. And then you want to keep that in memory as long as you can because presumably other users will want that. But if you keep it in memory too long then you run out of memory and crash, and so that’s no good.
CM: I see…
PW: So it’s this constant thing of trying not to run out of cache space and trying to avoid virtual memory fragmentation, cause these servers are supposed to run for years at a time. And memory fragmentation is a big issue, so we had to use a Fibonacci heap…
CM: Are you serious?
PW: Had various different sizes of peak buffers in order to make sure that there’s always a nice slot available for the file that you want to load up.
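Editor’s sketch: one possible reading of the remark above is an allocator with a fixed set of buffer size classes that grow like the Fibonacci sequence, so that any request fits a pre-sized slot and long-running servers avoid fragmentation. That reading is an assumption, as are the function names and the 4 KB base size below; the talk does not describe the actual allocator.

```python
import bisect

def fibonacci_size_classes(base, limit):
    """Sizes base*1, base*2, base*3, base*5, ... until one reaches `limit`."""
    sizes = [base, 2 * base]
    while sizes[-1] < limit:
        sizes.append(sizes[-1] + sizes[-2])
    return sizes

def pick_size_class(sizes, request):
    """Smallest size class that fits `request`, or None if it is too large."""
    i = bisect.bisect_left(sizes, request)
    return sizes[i] if i < len(sizes) else None
```

Compared with doubling, Fibonacci-style growth (roughly 1.6x per step) wastes less space per buffer when a request lands just above a class boundary.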
(33:02)
CM: That’s kind of amazing.
PW: Yeah.
CM: So…
PW: A lot of engineering just went into that one server by itself.
CM: Yeah, it looks like it…
PW: And to make the build process really fast, because, if there’s any one thing you can do in your games, you want to make your iteration speed as fast as you possibly can. Because that’s gonna determine the quality of your game. If you can only get one build done per day, then when people try and put something in the game and they have to wait to get user feedback, they just don’t get that feedback right away. Over the course of Guild Wars, we did an average of 20 builds a day for four years. There was roughly 10 a day in the beginning, and closer to 30 to 40 a day at the end of the project. So we’re just churning builds roughly every five minutes during the ordinary course of business and data builds took a bit longer.
PW: And so we spent a lot of engineering effort, when I say we, I mean me, myself and I…
Casey laughs
PW: There were other people who worked on [things] like texture compressors and loaders and things like that but FsBuild was my baby and I had to make it fast because that was the core of the build time for things.
(33:58)
PW: And so every time I make that faster, I could make it more efficient for like the entire rest of the 60 person dev team, to get their stuff in and see it.
CM: And that was kind of a live process because they wanted to be able to…I’m just not sure I totally understand, cause obviously iteration you can iterate internal to the company, but this sounds like something that you are iterating external to the company?
PW: Well it’s really both. You want all the people on your team to be able to see their stuff instantaneously; but, we started on an alpha program roughly April 2001, where we invited a few friends and family. That grew within a relatively short period of time to 500 to 2,000 people, and we kept [that number] through the entire alpha process. When some people would leave, get tired and our numbers were not high enough, because of the exhaustion of the game then we’d just bring in more people.
CM: Okay.
(34:41)
PW: And so our dev build was actually going out to 2,000 people every single time that we did it.
CM: So basically FsBuild was both the server for your internal testing, and the actual game? No difference, really.
PW: That’s right. We made no difference between devs and alphas.
CM: Okay, let me just ask one other question that I had about the server though I suppose there are a lot of questions I could ask; but, I want to make sure we get sort of coverage in different areas. When you said keeping the cache you were like “I gotta load these two files up, I will do the delta, and then I will try to keep the two files and the delta in it. I presumably flush the two files, just keep the delta with a tag on it that says this was the delta between these two files and if someone wants that delta, it will hit the cache and then I will send it out.” The idea is presumably that we want that second level of caching because probably most people are up to date, so probably when somebody wanted this delta, someone right after them is going to call for the same delta, that kind of thing?
PW: Exactly.
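Editor’s sketch: the delta cache Casey describes could look like a least-recently-used cache keyed by the version pair and bounded by a byte budget. The class name `DeltaCache`, the eviction policy, and the `compute_delta` callback are assumptions made for illustration.

```python
from collections import OrderedDict

class DeltaCache:
    """LRU cache of deltas keyed by (from_version, to_version), bounded by bytes."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0
        self.entries = OrderedDict()

    def get(self, have, need, compute_delta):
        key = (have, need)
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        # Cache miss: the expensive path that loads both files and diffs them.
        delta = compute_delta(have, need)
        self.entries[key] = delta
        self.used += len(delta)
        # Evict least recently used deltas until we are back under budget,
        # always keeping the entry that was just added.
        while self.used > self.budget and len(self.entries) > 1:
            _, evicted = self.entries.popitem(last=False)
            self.used -= len(evicted)
        return delta
```

Because most players update from the same previous build to the same current build, the second request for a given pair is served from memory.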
CM: Okay. Alright so let’s move from file serving, down to the next mode. Now my client can actually get all of the data that I need to play Guild Wars, and it’s self updating and sounds pretty awesome, in terms of keeping everyone on the same set of data. Now I need to start talking about how to actually play the game, like how that server infrastructure works. There’s a bunch of things involved there, why don’t you give a little overview of how that works and then we can kind of drill down into that as well.
(35:59)
PW: The next thing you talk to is the lobby server and that’s one that handles authentication and presence. Again for people who don’t have familiarity with Guild Wars, it’s not like a traditional MMO where you join one map and you stay there for a really long period of time. Instead, you’d join a particular game and you play there for a brief while, and then you join another game. So I might go to an Outpost where there’s 100-150 people running around, and [say to several players] “hey let’s go play in an instance map”, and then you and I get in our own private copy of a two, four or eight player world and then we play there for a while and we finish that, and then we go play a tournament game, which is five teams of eight people. And then we play an arena map, which is two teams of eight people. So we are playing lots of different maps and because you don’t have a persistent connection [to the game server], we don’t have a way to send you [chat and notification] messages all the time.
CM: Okay.
(36:40)
PW: Instead, you connect to the lobby server first, and your lobby server connection is persistent and when you get [chat] whispers from your friends or when we need to send you any notifications they can all go through that persistent connection with the lobby [server]. So the lobby, you authenticate yourself, get your friends list, get your guild information, cause it’s called Guild Wars, right?
Casey laughs
PW: And all the other information you need to maintain persistent state. We establish some encryption keys because all the data is encrypted back and forth.
CM: Is that process…so I guess this should be brought up now so it could be mentioned in different parts of it. TCP versus UDP was a big thing you put in the emails. Actually the first thing you said. Do you wanna mention briefly what these different systems are that we are talking about? Were they TCP? Were they UDP? And why? It sounds like the file server was TCP because you mentioned the send window. What is the lobby server? TCP as well, or was it UDP?
(37:34)
PW: So for Guild Wars, everything is TCP.
CM: Okay. So everything in the game is TCP, no UDP?
PW: That’s right.
CM: Alright. And…
PW: And incidentally the reason for that was we’re gonna use TCP and then decide later on if we needed to use UDP server to client protocol.
CM: Okay.
PW: We thought we would need to, but it turned out that it just wasn’t a big issue for the type of game that we were doing.
CM: Okay.
PW: But it was expedient for us to write a TCP-based game.
CM: Okay. Keep going with that. I’m gonna connect to the lobby server, so I TCP to the lobby server, and I get my information back from that. Presumably this is also going to involve — you said you don’t ever do peer-to-peer — going to have to route the packets for the rest of the game to the people I’m playing with through somebody. Is that also the lobby server? Or how does that process work?
(38:21)
PW: That’s right. So briefly, the lobby server talks to the database server, and pulls down all your data. After that it talks to the server which is called SiteSrv but it’s like a match-making server, and it says “player Pat was last in Lion’s Arch” and so find a Lion’s Arch server. The match-making server doesn’t have any of those yet because it’s a brand new build or everything crashed and we end up bringing everything up, the whole thing, or it’s day one of the game, whatever. And so, it has a whole bunch of game servers that have connected to the match-making server that said “hi! I’m available to host games and I have X capacity”. The match-making server says “you are going to host the game! Go create a Lion’s Arch, we’ll call it district one,” and then the [game] server would create that. [Game server] Send back a token to the match-server, the match-server sends it back to the lobby server, lobby server sends to the client…
CM: Oh my god.
PW: Client’s got this token, which is a token that says I’m allowed to join this game and the address of that game.
CM: Okay.
(39:17)
PW: So then I join that game and I say “hi! I’m Pat. Here’s my token”.
PW: Let’s encrypt this communication, then the game server then goes to the database cache server and says “get me Pat’s record!”.
PW: “Oh, he’s not loaded yet.” And the database cache server goes to the database server “get me the record!”
PW: [The database says] Here’s your entire record which has all your inventory, the parts of the map that you’ve explored and how much gold you have. Everything interesting about your character.
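Editor’s sketch: the join flow above can be condensed into a small model in which a match-making server tracks game servers and hands out single-use join tokens. This simplifies the talk (where the game server creates the token and passes it back through the match-making and lobby servers). The class and method names, the capacity policy, and the example addresses are all assumptions.

```python
import secrets

class MatchMaker:
    """Hypothetical stand-in for the match-making server (SiteSrv in the talk)."""

    def __init__(self):
        self.game_servers = {}  # address -> free capacity
        self.instances = {}     # map name -> address hosting it
        self.tokens = {}        # token -> (player, address)

    def register(self, address, capacity):
        # A game server announces "I'm available and I have X capacity".
        self.game_servers[address] = capacity

    def find_or_create(self, player, map_name):
        if map_name not in self.instances:
            # Assumed policy: host on the server with the most free capacity.
            address = max(self.game_servers, key=self.game_servers.get)
            self.instances[map_name] = address
        address = self.instances[map_name]
        self.game_servers[address] -= 1
        token = secrets.token_hex(16)
        self.tokens[token] = (player, address)
        return token, address

    def redeem(self, token, player):
        """A token is valid once, and only for the player it was issued to."""
        entry = self.tokens.pop(token, None)
        return entry is not None and entry[0] == player
```

The client only ever learns a token and an address, so it cannot join a game it was not assigned to.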
CM: So there’s a ton of servers in there that you’ve just mentioned…
PW: That’s all just the basics…
CM: Just for the basics. Okay so, the first question is, I guess before I ask any more specifics about those. Why are there so many servers? Was this a load balancing thing? Was this just a cleanliness thing? Or a testability thing? Like why do we have that many servers involved in just this one process?
(40:00)
PW: Sure, so it’s really good when you have a lot of complexity to try and simplify your program because if you are trying to do too many things at once then if one of them crashes it takes down everything else.
CM: Okay.
PW: Isolation of components, and also just from an “understanding something” stand-point. If you can’t understand something it’s really hard to modify. So the smaller something is the better it fits in the seven plus or minus two items in your brain.
Casey laughs
PW: Right? And each one of those things can be separated. Separation of concerns is the CS terminology for that.
CM: So, basically your philosophical… your approach to this basically would be “if we can define a set of operations that clearly form something that can all be done within the same set of data, we’re gonna make that into a separate server if we can”. Even if that’s gonna run on the same machine as some other server, we’d rather have it be like it’s own sort of self contained thing…
PW: That’s right.
CM: …rather than merging these two things together.
(40:57)
PW: That’s right. Another really compelling reason to do that is you can keep buying bigger and bigger servers but the cost for the server goes up. Instead of being a linear cost, it scales up exponentially. If you want a four processor server, it’s more than twice as expensive as a two processor server, and an eight processor server is more than four times as expensive as a two processor server.
CM: I see. So if you can have the ability to distribute these things across lots of cheaper machines, it’s a win there as well?
PW: That’s right. And when you design your stuff, it’s called horizontally scalable. Vertically scalable [means you] buy a bigger server, right? Which you can only get so big before you run out of… there wasn’t a 256 core server that you can just go out and buy. Or at least you couldn’t in 2004 era.
CM: Probably not even close but yeah…
PW: But that’s spending multi-millions of dollars probably. So you wanna scale horizontally and that causes you to re-architect the way that you write your software. And also “oh well one of our auth-servers blew up?” [We’ve got lots of them, so one is no] Big deal.
CM: Okay.
(41:51)
CM: Cause you just have lots of auth-servers running wherever and it’s totally fine.
PW: That’s right.
CM: Okay, going back to that server breakdown, so if my philosophy is lots of little servers, each that does a specific job. I can distribute wherever I want and I like programming it that way because then I have a much more well-defined problem space and solution to the problem. What about the extra interconnect that this generates? Meaning the fact that now these things have to talk to each other, don’t I sort of introduce a new failure point or new thing that I’ve gotta get right every time these two servers can’t just call a function. They’ve gotta actually now have a packet protocol. How did you approach that? Cause that seems like that would then be something you wanna systematize or something?
PW: That’s right.
CM: Tell me how you thought about that.
PW: Yeah absolutely. So, the servers speak a different protocol, an RPC protocol, and it happened to be a custom one although there’s lots of different versions of that stuff today, so today you would use protobufs or Thrift or MsgPack or something like that.
CM: Okay.
(42:48)
PW: Because lots of people have studied this problem and figured out [the right] thing to do. Another new one is Cap’n Proto.
CM: And you think these are good, meaning you’ve looked at them and these would be, we could have used these and they would have been okay?
PW: If they existed in that time period, yes.
CM: If they existed.
PW: Yeah, absolutely. You just need some standardized RPC protocol that in particular has the capability to do versioning because day one you do something and then three weeks later “oh we should have added this parameter”. Or we have this new feature. So you need to think about how you would add additional features without having to take down all of your servers and do a maintenance period.
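Editor’s sketch: one common way to get the versioning property Pat describes is to make new fields optional with defaults and to ignore fields the receiver does not know. The Python below uses JSON purely for illustration; the message layout, the `district` field, and the function names are assumptions and do not reflect the Guild Wars wire format.

```python
import json

def encode_request(version, **fields):
    """Serialize a request; `version` records which protocol the sender speaks."""
    return json.dumps({"v": version, **fields}).encode()

def decode_join_request(raw):
    """Accept old and new senders: default missing fields, ignore unknown ones."""
    msg = json.loads(raw)
    return {
        "player": msg["player"],
        "map": msg["map"],
        # Added in (hypothetical) protocol version 2; older senders omit it.
        "district": msg.get("district", 1),
    }
```

With this shape, a new server can be deployed before every peer is upgraded, because old messages still decode and new messages do not break old readers that follow the same rule.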
CM: Right.
PW: One of the other unique things about Guild Wars was that we tried never to do maintenance, we ran 24/7/365. No maintenance periods. Which is extremely unusual.
CM: Yeah that does sound extremely…
(43:30)
PW: What you had to do was rolling upgrades of your server. You would upgrade one, it had to continue to work, like new version and old version; well rather new protocol and old protocol had to continue to work but now that you’ve got a new server running it can speak the new protocol and it can do additional functionality.
PW: So you take down the servers “okay this server is going to go out of service and so everybody get off it!”
PW: Okay, it’s empty. Kill. Restart with new version.
PW: “Okay server number 2, get all the users off”.
Pat makes server shutting-off and turning-back-on sounds
CM: Is that sort of scale down, meaning I’ve gotta flush everything off, a thing that had to be built into all the servers? I assume with the concept of like “you have to go down for maintenance now”, and we’ve built that in from day one: that you can go down for maintenance, you get everyone off, they go to some other servers, and then we can do whatever kinds of upgrades we are going to do to you.
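Editor’s sketch: the drain, kill, restart loop can be modeled as below. Servers are plain dictionaries here and players are moved immediately, which is a simplification; in the talk, players leave a draining server as their games end. The function name and data layout are assumptions, and the sketch requires at least two servers.

```python
def rolling_upgrade(servers, new_version):
    """Upgrade one server at a time so the others keep serving players."""
    for server in servers:
        server["accepting"] = False  # "this server is going out of service"
        others = [s for s in servers if s is not server and s["accepting"]]
        if not others:
            raise RuntimeError("need at least one other server to drain to")
        # "Everybody get off it": move players to a server still in service.
        for player in server["players"]:
            others[0]["players"].append(player)
        server["players"] = []
        # "Okay, it's empty. Kill. Restart with new version."
        server["version"] = new_version
        server["accepting"] = True
```

At every point in the loop at least one server is accepting players, so there is no maintenance window.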
PW: That’s right.
CM: And this was built into every server? We talked about having so many servers…I can’t remember the number but you had listed so many servers there on that line when you sent me the email. There’s over 10, let’s put it that way.
(44:28)
PW: Yeah I think there’s 18 at launch.
CM: 18 servers. Did you develop that with a particular, I don’t wanna say framework because that’s kind of a loaded term; but, did you essentially have some code that is shared between all these servers which is like “okay we kind of get this now maybe a little bit into it, how we want these servers to work”. So we kind of built a standard server library.
PW: Absolutely.
CM: What does that look like?
PW: Well another interesting thing is the core of all the servers was one set of code that could do file handling, memory management, socket, and hot-reloading of modules.
CM: Okay.
PW: So a module would be the file server module or the match making server or the lobby server. A module can actually be dropped and reloaded dynamically.
CM: In a sense, a server instance may be thought of as the lobby server, but really it’s sort of the generic server, like, top thing, whatever this is, with only the lobby server loaded up and running?
PW: That’s right.
CM: But technically if we’d wanted to, we could have also had it load up a match making server?
PW: That’s right. The way to look at it is ArenaSrv, which was the loader module, is an executable and everything else is a DLL [hot-loadable Dynamic Link Library] and so on a developer system you would actually load up every single DLL so that you could have the entire stack of 18 servers running in one process.
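Editor’s sketch: the loader-plus-modules arrangement can be pictured as a host process that starts, stops, and replaces named modules. The real system used an executable loading DLLs; the Python classes below (`ServerHost`, `Module`) are hypothetical and only illustrate the lifecycle.

```python
class Module:
    """Stand-in for one server module, such as the lobby or match-making server."""

    def __init__(self, label):
        self.label = label
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False


class ServerHost:
    """Hypothetical core process: hosts modules that can be dropped and reloaded."""

    def __init__(self):
        self.modules = {}

    def load(self, name, factory):
        if name in self.modules:
            self.unload(name)  # hot reload: drop the old instance first
        module = factory()
        module.start()
        self.modules[name] = module

    def unload(self, name):
        self.modules.pop(name).stop()
```

A production host would load only one module, while a developer machine could load all of them into the same process for testing.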
CM: And test them just right there and see how they…
PW: That’s right.
CM: That’s pretty fascinating.
(45:45)
CM: Okay so, I guess…
PW: So, we had to create a [network] protocol to talk to everything, and that protocol library could have been part of the core [ArenaSrv] but because we wanted the capability to update really frequently every single server used the same code but we had it load in the DLL so it could be updated. And that [protocol] is called SrvSocket.
CM: Basically when you load a DLL, it also includes linked into its version of whatever the protocol was because it may be different than the one that some other thing that might be loaded on it or whatever. Or when we wanted to have to even if we are only loading one, when we want to go away and come back we don’t want to then have to go “oh right, he’s going away and changing his version so now I have to like switch out the version of my protocol that I’m using” and so it’s like no. It’s all just welded into that one DLL, it knows which protocol it was using; but, that did come from a shared library that we use internally so that everyone just links to that and knows they can use it.
PW: Correct.
CM: Okay. Wow.
(46:35)
PW: He knows his stuff.
CM: Alright. We don’t have nearly as much time left as I would like of course, but I guess I just wanna quickly kind of go through some of these things here and get a little bit, I guess, of your perspective on them. So talk a little bit about the TCP versus UDP thing because you’d mentioned that as a topic and said Guild Wars is all TCP, which is kind of surprising to me. I actually had this discussion with some friends recently where I was lamenting the fact that you can’t actually download a file in a browser…like downloading a file in a browser is like an unsolved problem. Half the time it just like stops, and you know nobody knows why, right?
PW: Uh hmm.
CM: And Guild Wars I don’t think ever had any of these sorts of problems so, can you give me a little bit of background about TCP versus UDP, and how if you’re gonna use TCP, how did you do it right? Versus doing it wrong, cause I see so many things where it’s obviously done wrong.
CM: Does that make sense as a question?
PW: Sure. Yeah.
(47:34)
PW: So, first thing is TCP versus UDP. TCP is a reliable, ordered protocol, which means even if you received data out of order then you’ll wait for the piece that was supposed to come before it and insert that so that everything shows up in a nice [ordered] stream. It’s like reading a file, except that sometimes the file handle gets broken and you can’t read anymore from it. Whereas [with] UDP you can send lots of packets, they arrive in any order, that is unguaranteed, and some of them don’t ever arrive at all.
CM: Yeah.
PW: Obviously TCP and UDP are built on the same underlying primitives and TCP has a lot of mechanisms to ensure that the ordered guarantee actually happens. That may be not what you want for your game, because if you’re doing a first person shooter, you’re sending position and angle… position and velocity. Something like that. So that at any moment in time when I receive a packet I know exactly where you are and where you are headed, but if I lose one of those I don’t care because I will get the next one which is your new position. And in fact it would be bad to receive an old stale packet because now you’re gonna pop backwards in time. So TCP would be a terrible protocol for that.
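Editor’s sketch: for the first person shooter case, a receiver over UDP typically tags each update with a sequence number and discards anything older than what it has already applied. The class below is a minimal illustration with assumed names, and it ignores sequence-number wraparound.

```python
class PositionReceiver:
    """Keep only the newest position update; drop late or duplicate packets."""

    def __init__(self):
        self.last_seq = -1
        self.state = None  # (position, velocity) of the latest accepted update

    def on_packet(self, seq, position, velocity):
        if seq <= self.last_seq:
            # Stale: applying it would pop the player backwards in time.
            return False
        # A gap (lost packet) is fine: this update supersedes the missing one.
        self.last_seq = seq
        self.state = (position, velocity)
        return True
```

Lost packets are never retransmitted here, which is exactly what makes this a poor fit for messages such as item pickups that must not be lost.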
(48:39)
PW: But for an MMO it [TCP] actually works really well because almost every single message that you would send is something that you would have wanted to know correctly, right? Like, I picked up this [item] right, that is not a message that you can lose because if you did then multiple people could pick up the same item. And so reliable, ordered protocol works really well. The downside to reliable ordered is on some devices like in mobile devices for example you lose those connections all the time, right? You’re on the bus, you go through a tunnel and TCP doesn’t survive the trip through the tunnel, whereas [with] UDP the phone keeps going “no really try again! I’m here! I’m here! I’m here …here”. So UDP works great for mobile devices if you need a high bandwidth connection.
PW: If you are going to do something low bandwidth, you’d just use a simple REST-ful interface like HTTP.
CM: But TCP connections are not necessarily reliable either so presumably the game kind of has to deal with that state as well? Using the straw-man of the browser that can’t seem to download a file for some reason, is that just basically about okay the game just has to be smart about checking the connections. If I haven’t received anything in a certain amount of time, kill it, start another one. How much logic would you say you have to put into making a TCP connection work the way it needs to for a game, versus the way it needs to apparently work for a file browser? Or a web browser?
(49:56)
PW: If you were to look at the socket library for all of our servers and the client, it is all like same kind of stuff, there’s probably a thousand lines for Windows 95, and a thousand lines for [Windows] NT.
CM: Okay.
PW: So it’s not a huge amount of code. I think it’s more the conceptual model that you have from working with sockets it is just that you have to assume that it could break at any moment in time.
CM: Okay.
PW: And plan for breaking. So we had sort of a coding standard in ArenaNet, don’t have functions that return failure. Don’t return boolean values “I’ve failed! Something went wrong!” Except for a couple of cases, one of them is sockets and the other one is files.
CM: Okay.
PW: Like if you open a file obviously it can fail, but for memory allocation it can either succeed or your program blows up.
CM: Okay.
PW: [Alternatively] You can set a flag saying, I’d [like] to know if this would fail and then you can…
CM: Like a probe…
PW: Yeah it’ll return null and say “I’m sorry I can’t allocate that memory.” And that code is obviously prepared to deal with that failure case. But the idea is you try and load a model file and it doesn’t load up, what the heck are you gonna do?
CM: Okay
(50:53)
PW: Well you could crash the game but that’s not very user friendly. So we just show a white six-foot square box.
CM: So you basically always had a case where it’s like when we do something we know that can fail, the code is designed around the fact that’s actually just a legal operating condition. I’ll either get this model or I won’t, it’s not an imperative, basically?
PW: Right, well for the model file I will always get a model file. I can ask if it’s valid after loading it, but for a socket you just have to look at it and go at any moment this thing could fail and so what’s my failure resolution for that. And of course a lot of testing doesn’t hurt either.
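Editor’s sketch: the “always get a model file” convention can be expressed as a loader that never reports failure and instead returns a visible placeholder the caller may optionally check. The names `Model`, `placeholder_box`, and `load_model` are hypothetical, and the placeholder geometry is only a marker for the white box mentioned in the talk.

```python
class Model:
    def __init__(self, name, vertices, valid=True):
        self.name = name
        self.vertices = vertices
        self.valid = valid


def placeholder_box(name):
    # Stand-in for the white six-foot box; not real mesh data.
    return Model(name, vertices=["box"], valid=False)


def load_model(name, read_file):
    """Always returns a Model; callers may check .valid but never see a failure."""
    try:
        return Model(name, vertices=read_file(name))
    except OSError:
        # Missing or unreadable file: keep the game running with a placeholder.
        return placeholder_box(name)
```

Callers that do not care simply draw whatever comes back, so a missing asset shows up on screen instead of crashing the game.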
CM: Yeah. Okay and so you never really had any misgivings about using TCP then? You felt that was totally fine as long as you architect around the fact that you expect it to maybe go down and go back up, that sort of thing?
PW: Yes. Although in practice the thing that really affected us the most was at any moment in time like somebody’s doing maintenance on their gear somewhere…
CM: Okay
(51:45)
PW: And so your packets are taking one path, like they’re hopping through 18 different routers to get from your house to the servers, and gamers play at night, maintenance periods are at night and so the routes change, or the company that is publishing our game decides that they are going to do maintenance in their data centre and switch their devices. And you would think, this is in 2005, right? Cisco had been making routers since the 1980s; you would think that they would have solved this problem. But there’s this thing called route convergence, and what happens is your router takes like 30 seconds before… you have like an active/passive set of routers, right? And the active gets taken down either by manual control or it just fails, and the other one’s supposed to pick up, and it takes such a long time that your TCP connections all get torn down.
CM: Okay.
PW: And so you talk to the folks at the data centres [who say] “oh yeah our routes are converging.” What the fuck does that mean?
Casey and the audience laugh
(52:40)
PW: Right, what it means is that 70,000 players just lost their connection, like half of our users just lost their connection.
CM: It does sound like something that in a sci-fi film could be about a bunch of people losing connection… the great convergence is happening… it’s upon us. Some routes are getting… yeah.
PW: Right, well if you’d use UDP then you have much more control over that process because you can just keep trying and eventually routes will converge, and your packets will start getting through again.
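To illustrate the point about control: with TCP the operating system's stack generally decides when a connection is dead, whereas over UDP the application chooses its own retry policy. The sketch below simulates that policy without real sockets. `FlakyLink`, `send_with_retry`, and all the timing values are assumptions made for the example.

```python
class FlakyLink:
    """Simulated network path that drops everything until routes converge."""

    def __init__(self, converge_at: float):
        self.converge_at = converge_at  # simulated time when routes recover

    def send(self, now: float, payload: bytes) -> bool:
        # True means an acknowledgement came back; False means the packet was lost.
        return now >= self.converge_at


def send_with_retry(link, payload, start=0.0, interval=1.0, give_up_after=120.0):
    """Resend at a fixed interval until acknowledged or the deadline passes.

    Returns the simulated time of the successful send, or None on timeout.
    """
    now = start
    while now - start <= give_up_after:
        if link.send(now, payload):
            return now
        now += interval  # a real client would wait on a timer here
    return None


link = FlakyLink(converge_at=30.0)
print(send_with_retry(link, b"move north"))  # 30.0
```

Here a 30-second outage costs the player a 30-second stall rather than a disconnect, because the application chose to wait up to two minutes.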
CM: Why didn’t you guys decide to do something like for example use your own reliable stuff on top of UDP or something like that? It just never became that much of an issue? Or?
PW: Sure yeah the game worked well enough and we had plenty of technical challenges to solve after launch.
CM: That weren’t that? Okay.
PW: Because that problem didn’t really crop up until after launch, when we started seeing maintenance periods from our publisher.
CM: I see.
PW: Really? Can you guys not do that? Just don’t change stuff, right? Cause changing things breaks [our servers].
(53:34)
CM: One more question on that was you were talking about first person shooter versus non or whatever, Guild Wars still is kind of an action-y game in the sense that I do see people walking around and doing these sorts of things. Why do I not want that sort of UDP likeness for that portion of it where I’m just sort of sending where guys are and I don’t really wanna know where they were 30 seconds ago, if that packet got dropped, I don’t care anymore. Was that just not that important? Or were there reasons why for Guild Wars that’s not true in the same way?
PW: Right, well so we could’ve designed the game differently so it was more first person shooter action-y but we wanted to make a game less twitch focused. And so, all these things feed into each other, right? We had lots of different business rationales and gameplay rationales and technical rationales that we merged together to form our game. And they all sort of overlap with each other, so yeah we could have done arrow combat where you fire an arrow and you have to be aiming the right direction but we decided that was not what the spirit of the game that we were trying to do was.
CM: Okay.
(54:30)
PW: So when you fire an arrow, you’re actually over there because you’ve been running this way and so I don’t know that yet because of network latency so I fire the arrow and the arrow doesn’t really curve in space but conceptually that’s what it’s doing even if you can’t see it. Cause on my system you’re standing over there and I fire and it hits you. But really you’re standing over there so you see me in a different place and the arrow on your system it looks one way and on my system it looks another, the same resolution happens, right? At the point the arrow is fired we decide whether it is going to hit or not. Has nothing to do with motion.
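A rough sketch of resolving the hit at the moment of firing, so that the outcome does not depend on where either client believes the target is standing. The interview does not say how the hit decision is made. The `hit_chance` roll below is purely an assumption for illustration, as are the names `ArrowShot`, `fire_arrow`, and `animate`.

```python
import random
from dataclasses import dataclass


@dataclass
class ArrowShot:
    shooter: str
    target: str
    hits: bool  # decided once, when the arrow is fired


def fire_arrow(shooter: str, target: str, hit_chance: float,
               rng: random.Random) -> ArrowShot:
    """Server-side resolution: positions and later movement play no part."""
    return ArrowShot(shooter, target, hits=rng.random() < hit_chance)


def animate(shot: ArrowShot, target_position_on_this_client: tuple) -> str:
    """Each client draws the arrow toward wherever it believes the target is."""
    outcome = "hits" if shot.hits else "misses"
    return f"arrow flies toward {target_position_on_this_client} and {outcome}"


shot = fire_arrow("pat", "casey", hit_chance=0.75, rng=random.Random(1))
# Two clients disagree about where the target stands but agree on the outcome.
print(animate(shot, (10, 4)))
print(animate(shot, (12, 5)))
```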
CM: And I suppose that’s a good… let’s see… like I said we… we’re super long time unfortunately but, so I suppose that’s a good thing to tie it back to the earlier games as well so we haven’t talked much about how the architecture works for the actual game play inside like Guild Wars, so it sounds like this is maybe sort of a hybrid of the two in some sense? Meaning it’s a little bit of lock-step code cause we’re going through a server now so obviously everything is getting resolved in one place which is a little bit more like a lock-step sort of thing. But, it’s also very asynchronous in the sense that what the client sees is sort of just you trying to represent what it got from the server in some way. So, could you talk a little bit about how you decided to do that, and was that formed a lot by sort of the experience with Warcraft, Diablo, Starcraft…?
(55:41)
PW: Right. In Guild Wars the server is authoritative for any important decisions, like where players are actually located. Who owns which items. But clients do their best to model that in a way that seems seamless to users. So if I start walking, I don’t have to go ask the server “is it okay if I walk over here?” I already know all the collision information that’s local to my system. I know where all the players are and I know where all the objects in the world are. So I can effectively perform path finding around those things, and if I bump into a wall my local client will tell me to back up a little bit. Or [it will say] you can’t interpenetrate into the wall. And I’m telling the server the whole time while I’m doing this and the server is telling me what is actually true. So it can provide modifications. If there are a bunch of people standing in my way and I didn’t quite know that was true yet, cause they all sort of formed a wall right in front of me so that I couldn’t pass through, then in my system I might have to teleport back a little bit. But by-and-large the rate of communication that you have with the server is high enough that you don’t see those artifacts.
(56:42)
PW: But if you did then the server just says nope. I [the server] am telling the truth. You have to move back.
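A minimal sketch of that arrangement: the client moves immediately using what it knows locally, and the server's answer overrides it when the two disagree. The grid cells, the `tolerance` value, and the class names are assumptions made for the example, not details from the game.

```python
from dataclasses import dataclass


@dataclass
class Position:
    x: float
    y: float


class Server:
    """Holds the authoritative position and refuses moves into blocked cells."""

    def __init__(self, start: Position, blocked: set):
        self.position = start
        self.blocked = blocked  # cells occupied by players the client may not know about

    def handle_move(self, requested: Position) -> Position:
        cell = (int(requested.x), int(requested.y))
        if cell not in self.blocked:
            self.position = requested
        return self.position  # the truth, sent back to the client


class Client:
    """Moves immediately from local knowledge, then accepts corrections."""

    def __init__(self, start: Position):
        self.position = start

    def walk(self, dx: float, dy: float) -> Position:
        # Predict locally; do not wait for a round trip.
        self.position = Position(self.position.x + dx, self.position.y + dy)
        return self.position

    def on_server_update(self, truth: Position, tolerance: float = 0.5) -> None:
        drift = abs(self.position.x - truth.x) + abs(self.position.y - truth.y)
        if drift > tolerance:
            self.position = truth  # the visible "teleport back"


server = Server(Position(0.0, 0.0), blocked={(2, 0)})
client = Client(Position(0.0, 0.0))
for _ in range(3):
    predicted = client.walk(1.0, 0.0)
    client.on_server_update(server.handle_move(predicted))
print(client.position)
```

The client ends up at the last cell the server accepted, because every attempt to step into the blocked cell is corrected.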
CM: Okay so basically you sort of have a model where the client side is effectively running the game as much as it can run the game. And it’s getting what amount to corrections from the server as to “hey yeah you moved there and I understand you drew that, but that was wrong so, move the guy back, and correct the state.” That sort of thing?
PW: That’s right. Or if you want to pick up an item then what happens is you initiate the pick-up request and now you have an animation sequence that takes a while to get done so that the likelihood is that the round trip to the server has already elapsed. So it’s not like you pick up something instantaneously and it disappears, and it’s like “oh wait, you didn’t get it!” It takes a while and like it could disappear before you pick up the object.
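The pick-up flow can be sketched as follows, assuming the server grants an item to the first request it sees and the client starts its animation at the same moment it sends the request. `ItemServer`, `pick_up`, and the timing defaults are hypothetical.

```python
class ItemServer:
    """Authoritative owner of world items; the first valid request wins."""

    def __init__(self, items):
        self.on_ground = set(items)
        self.owner = {}

    def request_pickup(self, player: str, item: str) -> bool:
        if item not in self.on_ground:
            return False  # someone else got there first
        self.on_ground.remove(item)
        self.owner[item] = player
        return True


def pick_up(server, player, item, animation_seconds=1.0, round_trip_seconds=0.25):
    """Start the animation and the request together; report what the player sees."""
    granted = server.request_pickup(player, item)  # answer arrives mid-animation
    hidden = round_trip_seconds <= animation_seconds  # latency covered by animation
    if granted:
        return f"{player} picks up {item}" + ("" if hidden else " (after a visible stall)")
    return f"{item} vanished before {player} could grab it"


server = ItemServer({"sword"})
print(pick_up(server, "pat", "sword"))    # pat picks up sword
print(pick_up(server, "casey", "sword"))  # sword vanished before casey could grab it
```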
(57:28)
CM: And what is it like programming with that understanding in general? Do you find that’s a completely new process every time or is there certain systemic nature to that programming? Do I have to think about it when I write the code to pick up the object, or is there more of a system that I did? Can you give a little bit of a perspective on how that works when I’m writing the code that’s different from if I wasn’t doing networking?
PW: Yeah so there’s two things that I think are really helpful. One is to model everything as if it takes a minute to do each action. Or you know, a ridiculous amount of round-trip time like “an hour” because you start to think about the problem differently. It’s like if I have to ask the server something, well if every time I do that it takes a minute round-trip and you start thinking about like “oh well there’s a lot of failure cases in there! Somebody could pick it up in that interval of time”. So what am I going to do about that? And, so it helps thinking about problems differently. And the other is that you should inject simulated latency into your servers, and in fact you should not run servers locally in your office. Since most people are doing cloud-based games these days that’s not such an issue, but even then locate your servers far away or if you have the code budget for it then put in simulated latency, especially simulated jitter.
(58:35)
CM: Right so that the latency changes over time like sometimes it’s really quick, sometimes it’s really long.
PW: That’s right. So a minimum of 200 milliseconds. The last game that I worked on we were using 250 to 500 milliseconds, so there was a lot of jitter in there. And that really helped cause you could feel when the problems showed up.
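One way to inject simulated latency and jitter is to hold each message in a queue until a randomly chosen delivery time. The sketch below uses the 250 to 500 millisecond range mentioned above and a caller-supplied clock so it can run in tests. The class name and interface are assumptions, not code from any shipped game.

```python
import heapq
import random


class LatencySimulator:
    """Delays each message by a random amount to imitate a distant server."""

    def __init__(self, min_delay=0.250, max_delay=0.500, seed=None):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.rng = random.Random(seed)
        self.queue = []    # heap of (deliver_at, sequence, message)
        self.sequence = 0  # tie-breaker so messages are never compared

    def send(self, now: float, message) -> float:
        deliver_at = now + self.rng.uniform(self.min_delay, self.max_delay)
        heapq.heappush(self.queue, (deliver_at, self.sequence, message))
        self.sequence += 1
        return deliver_at

    def receive(self, now: float) -> list:
        """Return every message whose simulated delivery time has arrived."""
        ready = []
        while self.queue and self.queue[0][0] <= now:
            ready.append(heapq.heappop(self.queue)[2])
        return ready


sim = LatencySimulator(seed=7)
sim.send(0.0, "click")
print(sim.receive(0.1))  # []
print(sim.receive(0.6))  # ['click']
```

Note that independent random delays can reorder messages, which a TCP stream would not do. A simulator meant to sit in front of TCP traffic would need to clamp each delivery time to be no earlier than the previous one.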
CM: I see.
PW: If you click on a button and then nothing happens? I think the user interaction first off has to change. When I click a button, something should happen on the local system, that indicates that I’ve accomplished that action. And then, the dialogue [button] changes now I’m trying to do something else, and that way you don’t have this stall where the client feels like [the game is] unresponsive.
CM: So you would say that pretty much all of the code that’s gameplay, user facing kinds of code at that point, pretty much all gets written with the network in mind? It’s not really a separate thing where for example in Warcraft 1 where you could almost imagine it being separate… two separate systems that don’t really have to think that hard about each other. In this case it’s no, if you’re a gameplay programmer on this thing you need to think about how that works and that’s just part of your day to day writing the code, kind of thing?
PW: Absolutely.
CM: Okay.
(59:45)
CM: So, let’s see here. Yeah. Oh man. We need another hour. I guess I’ll just quickly mention some of these things that you said before and could see if maybe you could give a little bit of a background here. So, some of the things that you talked about in the email that you sent me were things like hacking and encryption, and peer-to-peer versus server, and then there was OS choice, like were you gonna be Linux or Windows for hosting the server, and all these sorts of things. So, I guess my question… my sort of broad question on that stuff is, there’s a tremendous amount of new things you have to think about if you wanted to do one of these games. Is there a particular resource, like you said you did a talk on one of these at one point but has anyone compiled sort of like “things you should think about for your network game from some experts” that kind of just go through like you know it’s not trying to tell you how everything needs to be done, but it’s bringing up all of the points. You should think about all these things, cause there were things on that list that I had never even heard of before, right?
(1:00:53)
PW: Yeah uh, um…
CM: So where would you start?
PW: I couldn’t point you to a specific blog or a book, but they do exist out there.
CM: Okay.
PW: But you just have to look for like game network programming and the issues do get raised.
CM: Okay.
PW: I mean I think the biggest thing to be aware of is that whatever you create will be hacked, because some people really enjoy the process of hacking and so their game is different from what you think you are selling.
CM: Right right.
Audience laughs
PW: It’s like some people like to grief. If you don’t take griefing into account in your game, and it could be griefing your players or it could be griefing you the game operator, right like when they take down your database for fun.
CM: Yeah.
PW: You have to think about rate-limiting so that people can’t try and do too much stuff at one time. You have to think about the interaction model so that I can’t get my guild together and we can all attack you at one time with whisper messages that push you off the internet.
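As one common way to rate-limit, a token bucket allows short bursts while capping the sustained rate per player. This is a generic technique chosen for illustration. The interview does not say which scheme Guild Wars used, and the rate and burst values below are arbitrary.

```python
class TokenBucket:
    """Allow short bursts but cap the sustained rate of actions per player."""

    def __init__(self, rate_per_second: float, burst: int, now: float = 0.0):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_second=1.0, burst=3)
results = [bucket.allow(now=0.0) for _ in range(5)]
print(results)                # [True, True, True, False, False]
print(bucket.allow(now=2.0))  # True
```

A server would keep one bucket per player and per message type, so that a flood of whispers is dropped before it reaches the recipient.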
CM: I see.
PW: You don’t want to expose other players’ IP addresses because in the olden days I could send you the ping of death and kill your computer.
CM: Oh, the ping of death?
PW: The ping of death: I would send you an ICMP packet that would cause your computer, your Windows computer to crash.
Casey laughs
PW: Good stuff. How many players can you do that to today? For example we had a bug in a Guild Wars chat system at one point where we had these color codes that you could send along [with chat messages]. But it wasn’t perfectly validated so there was an [invalid] color code that [if] you sent it would cause clients to crash.
CM: Okay.
(1:02:17)
PW: And so all of a sudden we started seeing these weird effects in our concurrency graph. We would get alerts: something is going wrong! Lots of players are disconnecting, our concurrency isn’t stable and we’re also getting lots of crash reports. You have to have really good crash reporting for your game, right? So we would see stack dumps of what was going on, and we were able to immediately look at it and go, well look, the text processor is crashing and oh we’re able to figure out really quickly somebody is sending data that is not valid. Patch the game, five minutes later we have a new build of the game out and the problem’s fixed.
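A sketch of the kind of validation that prevents this class of bug: the server rejects any chat message containing markup it does not recognise, before relaying it to other clients. The tag syntax `<c=#RRGGBB>` and `</c>` is invented for this example. The actual Guild Wars color-code format is not described in the interview.

```python
import re

# Hypothetical markup: "<c=#RRGGBB>" opens a color and "</c>" closes it.
WELL_FORMED_TAG = re.compile(r"<c=#[0-9A-Fa-f]{6}>|</c>")


def is_valid_chat(message: str, max_length: int = 256) -> bool:
    """Accept a message only if every piece of markup in it is well formed."""
    if len(message) > max_length:
        return False
    # Remove the tags we understand; anything that still looks like markup
    # is rejected. A real system would also need a way to escape literal
    # angle brackets, which this sketch simply forbids.
    leftover = WELL_FORMED_TAG.sub("", message)
    return "<" not in leftover and ">" not in leftover


print(is_valid_chat("<c=#FF0000>hello</c>"))  # True
print(is_valid_chat("<c=#ZZ>boom</c>"))       # False
```

Validating on the server means a malformed message is dropped once, instead of being delivered to every client that would have to parse it.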
CM: Okay.
PW: Right, so that’s again where iteration speed is really important, but in literally everything you do you have to think about the griefing and hacking aspects of it…
CM: Yeah I guess that’s something that happens with network games. Really just doesn’t happen anywhere else, cause if someone wants to hack their own game, that they’re playing in single player, that’s just fine, right? It’s like there’s no problem but once you do networking you kind of open this whole…
PW: Yeah and even more than that it’s the responsibility of the programmers and designers. There’s no legal recourse you have, right, because the players could be living overseas, or they could be underage and so you can’t stop ’em. So the worst hacker we ever had in Guild Wars was this 15 year old kid named Pablo from France.
CM: Pablo from France.
PW: We were able to identify him, and we talked to his parents from time to time.
Pat, Casey and the audience laugh
PW: But he never stopped!
CM: How does that phone call go?
PW: Yeah, I don’t know. I was not on the call.
CM: We really would like little Pablo to stop taking down the Guild Wars servers.
PW: The problem was he was making too damn much money, so like he could sell items online. But the thing was we started watching him really closely and watching what his interactions with the game were, and we were able to identify bugs before he could really take advantage of them in a big way.
Audience laughs
(1:03:52)
CM: Okay. I think we’re just about out of time so before we finish I wanted to leave time for you told me, you walked up here right before you’re gonna start talking and you said you wanted to say something about “how do I get a job in the games industry?” Cause I had asked Mike (Acton) about that, you had something you wanted to say, so we’ve got time. What did you want to say?
PW: Sure, I mean it’s the number one question I get asked so I figured rather than everybody rushing down here afterwards and asking all of us about how to get a job I’d talk about what I think are the right things to do. I mean I think the games industry is great because I’ve worked here for so many years, but I think the one thing that people really need to have is demonstrable passion. And so passion is an excitement for stuff, it sort of relates to the curiosity that Mike (Acton) was talking about; but, demonstrable means that you have something you can show. It doesn’t have to be a game but it could be like here’s some code that I wrote that wasn’t just for a class project or something. Or if you’re a designer, here’s this level I designed in whatever, Half-Life 2, like my own level that I did that nobody else participated in it at all and look at all these cool things that I did that I can talk about. And it’s great stuff that you can bring to talk about in the interview, that shows that you have not just done your school work.
(1:05:03)
PW: I’ve interviewed tons of people and they always say, well not all of them, but a bunch of people will say “well you know I was so busy with school that I didn’t really have time to do extra curricular activities like writing code.”
CM: Yeah.
PW: Bullshit.
Casey laughs
PW: We obviously have hours and hours in a day and we just choose what we want to do. And it could be watching TV, or hanging out with friends or whatever. But if you’re really passionate and you want to be successful at something when you’re early in your career …
Audience laughs
PW: Nice. Then you have to find a way to demonstrate that, so I think that helps a lot. You can also contribute to other people’s projects but I think it’s harder to demonstrate “this is what I did”. And looking at hiring lots of people, I have tried programming tests and whiteboard tests and things but unlike the financial industry, past performance is a predictor of future success.
CM: Okay.
Casey laughs
(1:05:56)
PW: And so the best way to hire is to look back at what somebody’s done in the past because that’s the best predictor of what they will do in the future beyond any other thing that you can do.
CM: Okay.
PW: So demonstrate that you’ve done something. Also, coding interviews are really hard. Just about any company you go to is going to have a really hard coding test, they are going to grill you all day long. There’s a great book for this, it’s called Cracking the Coding Interview, and if you can just solve every problem in that book, you will get a job anywhere you want.
Casey and the audience laugh
PW: And it’s funny but people go to college for four years or something, and then they’re like “well maybe I study a little bit for the interview”. No, the interview is the most critical thing you can do to get a job at a place that you want, so treat it like it’s a college course and put a month into that book and solve every single problem in the book, right? And then you would get a job anywhere.
CM: I guess I didn’t know that there is such a book. Would these be applicable to game companies even though?
PW: Absolutely, yeah.
CM: Really? Okay.
PW: All of the stuff is the kind of questions that Amazon, Google, Facebook, Microsoft ask these days.
(1:06:56)
CM: But what about companies…
PW: Or for example the companies that I’ve worked at too like I wrote the coding tests there, some of the problems are very similar to things that I ask.
CM: Okay so you think it does map well even across the board?
PW: Yeah. The kind of stuff we did like reverse a linked list in place. A singly linked list in place, right? You know, stuff like that. It’s really common.
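For readers who have not seen the problem, here is one standard way to reverse a singly linked list in place, walking the list once and flipping each `next` pointer. The `Node` class and helper names are just for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: int
    next: Optional["Node"] = None


def reverse_in_place(head: Optional[Node]) -> Optional[Node]:
    """Reverse the list by re-pointing each node; uses O(1) extra space."""
    previous = None
    current = head
    while current is not None:
        following = current.next  # remember the rest of the list
        current.next = previous   # flip this node's pointer
        previous = current
        current = following
    return previous  # the old tail is the new head


def to_list(head: Optional[Node]) -> list:
    values = []
    while head is not None:
        values.append(head.value)
        head = head.next
    return values


head = Node(1, Node(2, Node(3)))
print(to_list(reverse_in_place(head)))  # [3, 2, 1]
```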
CM: Okay. Well I think that’s it. Thank you so much for talking with us Pat, that was fantastic. I wish we had more time.
PW: Thank you.