Battling the Lag Monster
Posted: Tue May 28, 2019 12:53 pm
One of the toughest beasties of them all, the Lag Monster has once again reared its head and attacked our beloved Waterdeep home. I've had a lot of personal inquiries and suggestions in the last week, so I thought it best to consolidate the information here so that I don't have to re-explain it. I genuinely appreciate that folks would like to help resolve the issue; however, without an in-depth understanding of the mod and its history, it's difficult for people to make sensible suggestions. I apologize for being curt in those replies, and I realize that the problem is a lack of communication on my part, which I hope to remedy here.
Background:
The Waterdeep module is one of ALFA's oldest, and I have the pleasure of saying I got to be one of the original testers when Indio was first assembling it. Not long after, it was passed to El Chip, who really put some scripting magic in and brought the place to life. The module, and ALFA in general, have been passed through many hands since then, with varying degrees of technical competency. This has led to a mishmash of things and lots of legacy stuff in need of fixing. There was never really a unified "plan" for things; people just built things for their own particular group and never bothered to clean up other things along the way. This, combined with a lack of basic coding principles, compounded many issues into the final "official" state of the mod that I began work on in 2017.
I have managed to make slow improvements to the module over time to increase its stability. When we first launched in May 2017 we had a hard time with even 5 players on at a time, and even then the server needed a reboot every day or so. Eventually I tracked the issue to a pseudo-heartbeat script inside of an actual heartbeat script, which put a huge load on the server. Correcting that got us stable up to about 5 people, which was a major hurdle to overcome.
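For anyone wondering why a pseudo-heartbeat inside a real heartbeat hurts so much, here is a toy Python model. It is only a sketch and not the actual module script: it assumes the pseudo-heartbeat is a chain that reschedules itself once per tick, and that the real heartbeat starts a fresh chain every time it fires unless something guards against it. The function name `simulate` and the `guard` flag are made up for illustration.

```python
def simulate(ticks, guard):
    """Count pseudo-heartbeat executions over `ticks` real heartbeats.

    Assumption: each chain, once started, runs once per tick forever.
    guard=False: the real heartbeat starts a new chain on every tick.
    guard=True:  the real heartbeat starts a chain only the first time.
    """
    chains = 0        # number of self-rescheduling chains currently alive
    started = False   # whether a chain has already been started
    executions = 0    # total pseudo-heartbeat runs so far
    for _ in range(ticks):
        if not guard or not started:
            chains += 1
            started = True
        executions += chains  # every live chain fires once this tick
    return executions


print(simulate(10, guard=False))  # 1 + 2 + ... + 10 = 55
print(simulate(10, guard=True))   # 10
```

Under those assumptions the unguarded version does work that grows with the square of the uptime, which would fit a server that needs a reboot every day or so no matter how few players are on.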
From there we held pretty well and eventually started growing. Around 12 people or so we started having problems again, and the lag monster returned. This time it became clear that the log file was ballooning very quickly. The log files were regularly 100MB of pure text, which naturally caused a lot of I/O slowdown as the server tried to write to the file. I tracked down the responsible scripts and reduced their output, which in the end worked and got us safely up to 15 or so. Quite a large number now!
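One general way to keep a chatty script from flooding a log is to throttle repeated messages. The Python sketch below shows the idea only; it is not how the module's scripts were actually changed, and the class name `ThrottledLog`, the `key` argument, and the 60-second interval are all assumptions for the example.

```python
import time


class ThrottledLog:
    """Keep at most one message per key within `min_interval` seconds."""

    def __init__(self, min_interval, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock   # injectable so the behavior can be tested
        self.last = {}       # key -> time of the last message written
        self.lines = []      # stand-in for the real log file

    def write(self, key, message):
        now = self.clock()
        prev = self.last.get(key)
        if prev is not None and now - prev < self.min_interval:
            return False     # too soon after the last one, drop it
        self.last[key] = now
        self.lines.append(message)
        return True


log = ThrottledLog(min_interval=60.0)
log.write("wealth-check", "wealth check ran")  # written
log.write("wealth-check", "wealth check ran")  # dropped, inside the window
```

Messages with different keys are throttled independently, so a rare error still gets through even while a noisy script is being suppressed.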
Current Situation:
We peaked around 23 people on Friday, wow! Such a huge turnout is really inspiring. Unfortunately, server performance took a big hit handling that many players. A lot of things in old ALFA were never built with scalability in mind, and in old ALFA 1 it was rare to see more than 10 players on a server at any given time, so probably no one ever even noticed. In a lot of ways we have outgrown the old implementations of yore and need new systems in place to handle core functionality. You see, in the old ALFA zealotry of hunting down PGers, they made complicated systems to track and watch players' wealth, which in the end made the game unplayable in more ways than one. To this they tied poorly implemented sub-systems such as horses and subraces. All of these add up as we get more and more players on at the same time.
To put it simply, the problem isn't finding what causes the lag, it's how to fix it without breaking everything else. I'm fairly confident I can make a few adjustments without editing the haks, but you might start seeing some weird behavior or features getting lost. I think some things, like subraces, were neither well implemented nor popular and can be cut, with a DM adjusting stats instead. Additionally, I will probably take horses out entirely (as every player gets a horse heartbeat every cycle) until Duck or someone else can fix the system. Hopefully this will go a long way toward making 20+ players a more stable environment.
In short, the problem isn't identifying the issue; the problem is fixing it without breaking everything else.
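To give a feel for how a per-player heartbeat (like the horse one mentioned above) adds up, here is a back-of-the-envelope calculation in Python. It is a sketch under assumptions: a 6-second heartbeat cycle, one such script per player, and a made-up helper name `heartbeat_calls`. Real numbers on the server will differ.

```python
def heartbeat_calls(players, scripts_per_player, cycle_seconds=6, window_seconds=3600):
    """Script executions in a time window when each player carries
    `scripts_per_player` heartbeat scripts that fire once per cycle.

    Assumption: the default 6-second cycle is a guess at the heartbeat period.
    """
    cycles = window_seconds // cycle_seconds
    return players * scripts_per_player * cycles


for n in (5, 12, 23):
    print(n, "players:", heartbeat_calls(n, 1), "runs per hour")
# 5 players: 3000 runs per hour
# 12 players: 7200 runs per hour
# 23 players: 13800 runs per hour
```

The growth is linear per system, but every extra per-player system (wealth tracking, horses, subraces) multiplies it again, which is why cutting a whole system can buy more headroom than tuning one script.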
FAQ:
1) "But what about X? I'm an expert in X and I think the problem is here." - You are right! That is a source of lag; however, it's not the main and most pressing source of lag we currently face. Additionally, finding sources of lag isn't really the problem. The Waterdeep module is a "target rich" environment for finding problems. I do, however, get a lot of weird ideas from folks about where they think the lag comes from, often delivered with near-religious fervor. Sadly, we are our own worst enemy, and deciding work priorities by committee as we did in the past is a sure way to fail. While your suggestions have good intent, the end result is a distraction from higher-priority fixes. The real issue isn't in identification, it's in correction.
2) "Can't we just buy a beefier server?" - Sadly no. At the height of our issues last week we had about 30% processor usage on one core (the other core idle) and about 35% memory usage, so the hardware is nowhere near its limits. Network capacity is also underutilized; it is probably one of the faster internet connections I've seen in my life, pulling my 100MB updates from Google in less than a second. I use an external cloud-based VM host from softsys, who have had excellent support, and at $20 a month I couldn't be happier with their service.
3) "What can I do to help?" - Be patient for the solution! This has been a much more successful project than I anticipated; however, at the end of the day I'm just one dude sitting around in his underwear writing code as some perverse idea of "fun". My time is somewhat limited by the fact that I manage several different (non-NWN) projects on the side, in addition to work and RL demands and, of course, the desire to run cool stories and events. A glutton for punishment, you might say, but if you feel like being crazy with me, then learn to code and put out some quality stuff. There's no shortage of scripts that need rework, and I'm sure we can find something that interests you, even easy stuff to start.