At consumer-facing web startups like Skillshare, we spend much of our time working for a hockey stick. Not the old-fashioned wooden kind that a kid might ask Santa for at Christmas, but the one from Paul Graham’s startup growth curve, where the curve resembles an upright hockey stick in its switch from flatlined growth to an explosive tick up and to the right. This section of Graham’s graph is called “The Promised Land,” and for good reason. Once your startup reaches the hockey stick, all of your hard work has been validated: the number of people using your product grows by an order of magnitude; the press comes calling; heck, you might even make some money!
But on the engineering side, achieving hockey-stick levels of growth can present a host of problems. Your old web server will struggle under the load of thousands of new users; processes which previously could be run instantly will take multiple seconds to finish; your database will explode in size, struggling to process queries which it previously laughed at.
Luckily, you can prepare for the hockey stick and its ensuing challenges. About a month ago, we launched Hybrid Classes, which allow anyone from around the world to enroll in a Skillshare class. We knew this would lead to a huge uptick in growth (potentially hockey-stick levels…). Here are the steps we took, which you can follow as a guide to avoiding downtime when taking your early-stage startup to the next level.
Budget Time for the Backend
Before you can solidify your application, you must budget some time to focus on scaling issues. This is harder than it sounds. Why? Well, when you push a new user-facing feature, you get support from your non-technical coworkers and your users. But when you upgrade the server or set up better caching? No one except you and the engineering team will know. The lack of visible results can be a deterrent to working on scaling solutions.
At Skillshare, we set a goal of spending eighty percent of our time on application features and twenty percent on scaling and engineering projects. This strategy gives us plenty of time to work on scaling the site, while still pushing out user-facing features every week. Now we get the positive feedback from our community while still ensuring that they have a reliable experience under any load.
Before you can begin to solidify your site, you must know how much work you have to do. We began by forecasting how much traffic we expected, and then tested the site under those conditions to learn how our setup would perform. There are a ton of free tools you can use to flood your site with traffic and track how it responds. We used autobench; ab is another popular option. Once we saw how much traffic we could handle, and at what load performance started to degrade, we knew how much work was ahead of us.
Getting an idea of how your system reacts to increased load can lead to some quick, specific changes. One thing we noticed is that our memory usage skyrocketed under increased load. But when we checked our server’s memory availability, we realized we could add much more RAM to the server. We did, and then under a repeated load test the memory performance was absolutely fine. Here, basic load testing led us to a scaling solution which didn’t require touching the application at all.
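To make the load-testing idea concrete, here is a minimal Python sketch of what such a tool does under the hood: fire many requests concurrently, time each one, and summarize the results. This is not how autobench or ab work internally, and it is not the setup we used; `run_load_test` and `fake_request` are hypothetical names, and `fake_request` is a stand-in you would replace with a real HTTP call against a staging server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(request_fn, total_requests, concurrency):
    """Call request_fn total_requests times across `concurrency` threads
    and return basic latency and failure statistics."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    def timed_call(_):
        # Time a single request; an exception counts as a failure.
        start = time.perf_counter()
        try:
            request_fn()
            ok = True
        except Exception:
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = sorted(elapsed for _, elapsed in results)
    failures = sum(1 for ok, _ in results if not ok)
    # Approximate 95th percentile using a simple rank on the sorted list.
    p95_index = max(0, int(len(latencies) * 0.95) - 1)
    return {
        "requests": total_requests,
        "failures": failures,
        "mean_s": sum(latencies) / len(latencies),
        "p95_s": latencies[p95_index],
        "max_s": latencies[-1],
    }


def fake_request():
    # Hypothetical stand-in for an HTTP request to the page under test.
    time.sleep(0.001)


if __name__ == "__main__":
    print(run_load_test(fake_request, total_requests=200, concurrency=20))
```

Running the same test at increasing `concurrency` values and watching where the mean and 95th-percentile latencies start to climb gives a rough picture of the load at which performance begins to degrade.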
The first step we took to speed up Skillshare was to audit our most expensive database queries and figure out ways to cache them. If you already have memcached or a similar caching system implemented, this process can take as little as a day or two and can speed up the application significantly. If you don’t have a caching system installed, this is the time to install it, as running the same database queries for hundreds of thousands of users is not feasible in the long run.
The way we kept the process quick was by identifying a few key queries which were bottlenecks on our most-visited pages, and caching them. Using Google Analytics, we figured out that our Browse page and Class Details page receive the most traffic. Then, we used a basic database query profiler to identify their most expensive queries and implement caching on those queries. Even before we began to see hockey stick growth, this helped the speed of the application. But once we did get an influx of new users, the pages we implemented heavy caching on kept on running as if nothing had changed.
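The pattern behind this kind of query caching is often called cache-aside: check the cache first, and only run the expensive query on a miss. The sketch below illustrates it with an in-process dictionary standing in for memcached. It is an illustration under assumptions, not our production code; `TTLCache`, `get_or_compute`, and the `"browse:featured"` key are hypothetical names.

```python
import time


class TTLCache:
    """Tiny in-process cache with per-entry expiry (a stand-in for memcached)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}     # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute() only on a miss
        or after the entry has expired."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._store[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop a key, e.g. after the underlying rows change."""
        self._store.pop(key, None)


if __name__ == "__main__":
    cache = TTLCache(ttl_seconds=60)

    def expensive_browse_query():
        # Hypothetical placeholder for a slow database query.
        time.sleep(0.1)
        return ["class-1", "class-2"]

    # The first call runs the query; the second is served from the cache.
    print(cache.get_or_compute("browse:featured", expensive_browse_query))
    print(cache.get_or_compute("browse:featured", expensive_browse_query))
```

The expiry time is the main design choice: a longer TTL takes more load off the database, at the cost of pages showing slightly stale data until the entry expires or is invalidated.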
The first place you’ll want to optimize for high growth is in your application. That makes sense. It’s where you hang out all day, slinging code and pushing out new features. But if you’re going to be completely ready for hockey stick growth, then the server which supports the application needs some loving too.
There are a few steps you can take to strengthen your hardware setup. If you use AWS, then tools like Elastic Beanstalk and EC2 let you spin up web servers behind a load balancer as you need them. If you are not on AWS, you can prepare by beefing up your components (we added a few dozen gigs of RAM in the weeks before Hybrid Classes launched). Additionally, you can migrate your database to its own server, to reduce load on both it and the web server. Isolating the components and beefing up your hardware will help keep your server performing at top speed.
Keep an Eye on Things
While we all wish to reach the hockey stick, how will you know when it has arrived? Explosive growth often comes as the result of a few press articles or influential people linking to your site; this can happen in a matter of a few hours. Google Analytics works well for long-term monitoring of your traffic, but preparing for the hockey stick requires a more granular view.
Throughout the day that we launched Hybrid Classes, we kept an eye on both our server and application logs. We use New Relic, which provides really easy-to-use, real-time monitoring of your application and server. Of course, it can’t hurt to set up your own logging as well, using a tool like statsd. Keeping an eye on your server and application’s performance gives you real feedback as to whether the caching, queueing and upgrading efforts you’ve put in have had any real effect.
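If you do roll your own metrics with statsd, the wire format is simple enough to emit without a client library: each metric is a plain-text line of the form `name:value|type` sent over UDP, where the type is `c` for a counter, `ms` for a timer, or `g` for a gauge. The sketch below assumes a statsd daemon listening on the conventional default of port 8125 on localhost; `format_metric`, `send_metric`, and the metric names are hypothetical.

```python
import socket


def format_metric(name, value, metric_type):
    """Build a statsd line such as 'signups:1|c' or 'browse.render:120|ms'."""
    return f"{name}:{value}|{metric_type}"


def send_metric(name, value, metric_type, host="127.0.0.1", port=8125):
    """Send one metric to a statsd daemon over UDP.

    UDP is fire-and-forget, so this does not fail or block the
    application if no daemon is listening.
    """
    payload = format_metric(name, value, metric_type).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    send_metric("signups", 1, "c")               # count one new signup
    send_metric("browse.render_ms", 120, "ms")   # time spent rendering a page
```

Counting signups and timing your most-visited pages this way gives you a minute-by-minute view of a traffic spike, rather than the longer-term picture Google Analytics provides.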
On launch day, we were relieved to find that all of our preparations had worked. We received unprecedented levels of traffic and user signups (for us, at least), and the site continued to run smoothly. We had no downtime, and only a slight dip in performance at peak load. A month after launch, we are still seeing increased usage of the site, and thanks to the caching and query batching we put into place, the site is handling the increase without trouble.
Do you have experience taking websites through similar high growth periods? Would you have done anything differently? What strategies have you taken?