TL;DR: Nodecraft moved 23TB of customer backup files from AWS S3 to Backblaze B2 in just 7 hours.
Overview
Nodecraft.com is a multiplayer cloud platform where gamers can rent our servers to build and share unique online multiplayer game servers with their friends and/or the public. As server owners run their game servers, backups are generated, including server files, game backups, and other files. It goes without saying that backup reliability is important for server owners.
In November 2018, it became clear to us at Nodecraft that we could improve our costs if we re-examined our cloud backup strategy. After looking at the current offerings, we decided to move our backups from Amazon’s S3 to Backblaze’s B2 service. This article describes how our team approached the move, why we did it, and what happened, specifically so we could share our experiences.
Benefits
Because S3 and B2 (along with many other providers) are at least nearly equally* accessible, reliable, and available, our primary reason for moving our backups came down to pricing. As we started into the effort, other factors such as the variety and quality of the APIs, real-life workability, and customer service started to surface.
After looking at a wide variety of considerations, we decided on Backblaze’s B2 service. A big part of the cost of an operation like this is bandwidth, and Backblaze’s is amazingly affordable.
The price gap between the two object storage systems comes in large part from the Bandwidth Alliance between Backblaze and Cloudflare, a group of providers that have agreed not to charge (or to heavily discount) fees for data moving between their networks (“egress” charges). We at Nodecraft use Cloudflare extensively, so this left only the egress charges from Amazon to Cloudflare to worry about.
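To make the pricing argument concrete, here is a back-of-the-envelope comparison. The per-GB prices are assumptions roughly in line with published list prices at the time, not quotes from either provider, so treat the output as ballpark only.

```python
# Ballpark only: per-GB prices below are assumed for illustration, not quotes.
GB_PER_TB = 1024
data_gb = 23 * GB_PER_TB              # the 23 TB of backups being moved

s3_storage_per_gb_month = 0.023       # assumed S3 Standard list price (USD)
b2_storage_per_gb_month = 0.005       # assumed B2 list price (USD)
s3_egress_per_gb = 0.09               # assumed S3 internet egress list price (USD)

print(f"S3 storage per month : ${data_gb * s3_storage_per_gb_month:,.0f}")
print(f"B2 storage per month : ${data_gb * b2_storage_per_gb_month:,.0f}")
print(f"One-time S3 egress   : ${data_gb * s3_egress_per_gb:,.0f}")
# Thanks to the Bandwidth Alliance, the Cloudflare-to-B2 leg (and later B2
# egress back out through Cloudflare) is free or heavily discounted, so the
# S3 egress above is the main one-time cost of the move.
```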
In normal operations, our customers constantly make backups and access them for various purposes, and there has been no change in their ability to perform these operations compared to the previous provider.
Considerations
As with any change in providers, the change-over must be thought out with great attention to detail. When there were no quality issues with the previous provider and circumstances allow a wide field of new providers to be considered, the final selection must be carefully evaluated. Our list of concerns included these:
- Safety: we needed to move our files and ensure they remain intact, in a redundant way
- Availability: the service must be both reliable and widely available (which means we needed to “point” at the right file after its move, during the entire process of moving all the files: different companies have different strategies, one bucket, many buckets, regions, zones, etc.)
- API: we are experienced, so we are not crazy about proprietary file transfer tools
- Speed: we needed to move the files in bulk without stalling on rate limits, and…
All these factors are good and important individually, but taken together they can add up to a significant service disruption. If things can move easily, quickly, and reliably, improper tuning could turn the operation into our own DDoS. We took thorough steps to make sure this wouldn’t happen, so an additional requirement was added:
Tuning: Don’t down your own services, or harm your neighbors
What this means to the lay person is “We have a lot of devices in our network, we can do this in parallel. If we do it at full-speed, we can make our multiple service providers not like us too much… maybe we should make this go at less than full speed.”
Important Parts
To take advantage of our own cloud processing capabilities, we knew we would have to take a two-tier approach, at both the Tactical (move a file) and Strategic (tell many nodes to move all the files) levels.
Strategic
Our goals here are simple: we want to move all the files, move them correctly, and move each only once, while making sure normal operations can continue during the move. This is key, because if we had used a single computer to move the files, it would have taken months.
The first step in making this work in parallel was to build a small web service that allowed us to hand out a single target file at a time to each worker node. This service provided a locking mechanism so that the same file wouldn’t be moved twice, whether concurrently or later. The timer for the lock to expire (with an error message) was set to a couple of hours. The service was intended to be accessed via simple tools such as curl.
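For illustration, a minimal sketch of such a queue-and-lock service might look like the following. This is not our production code: the endpoint names, in-memory storage, and error handling are simplified assumptions (a real version would persist state in a database and log the lock-expiry errors).

```python
# Minimal sketch of a queue-and-lock service (hypothetical: endpoint names,
# in-memory storage, and error handling are simplified, not production code).
import threading
import time

from flask import Flask, jsonify

app = Flask(__name__)
state_lock = threading.Lock()
LOCK_TTL = 2 * 60 * 60  # locks expire (and get logged as errors) after ~2 hours

# Seeded from an inventory of every backup object in the source bucket.
files = {}  # file_id -> {"status": "pending"|"working"|"done", "locked_at": float}

@app.route("/next", methods=["POST"])
def next_file():
    """Hand out one pending (or expired) file and lock it for this worker."""
    now = time.time()
    with state_lock:
        for file_id, entry in files.items():
            expired = entry["status"] == "working" and now - entry["locked_at"] > LOCK_TTL
            if entry["status"] == "pending" or expired:
                entry["status"] = "working"
                entry["locked_at"] = now
                return jsonify({"file_id": file_id})
    return jsonify({"file_id": None}), 404  # nothing left to hand out

@app.route("/done/<path:file_id>", methods=["POST"])
def done(file_id):
    """Worker reports success: release the lock and mark the file DONE."""
    with state_lock:
        files[file_id]["status"] = "done"
    return jsonify({"ok": True})
```

A worker needs nothing fancier than curl to participate, e.g. curl -X POST http://&lt;queue-host&gt;/next.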
We deployed each worker node as a Docker container, spread across our Docker Swarm. Using the parameters in a Docker stack file, we were able to define how many workers per node joined the task. This also ensured that nodes in more expensive bandwidth regions, like Asia Pacific, didn’t join the worker pool.
Tactical
Nodecraft has multiple fleets of servers spanning multiple datacenters, and our plan was to use spare capacity on most of them to move the backup files. Usage of our servers follows consistent patterns across our data centers around the world, so we knew there would be capacity available for our file-moving purposes.
Our goals in this part of the operation are also simple, but have more steps (a sketch in code follows the list):
- Get the name/ID/URL of a file to move which…
- locks the file, and…
- starts the fail timer
- Get the file info, including size
- DOWNLOAD: Copy the file to the local node (without limiting the node’s network availability)
- Verify the file (size, ZIP integrity, hash)
- UPLOAD: Copy the file to the new service (again without impacting the node)
- Report “done” with new ID/URL location information to the Strategic level, which…
- …releases the lock in the web service, cancels the timer, and marks the file DONE
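To make those steps concrete, here is a rough sketch of a single worker pass. It is illustrative only: the queue URL, the Cloudflare-fronted download hostname, and the bucket name are assumptions, and our real tooling was a mix of Python and a Bourne shell wrapper around the Backblaze b2 command-line utility (more on that below).

```python
# Sketch of one worker pass (illustrative: the queue URL, the Cloudflare-fronted
# hostname, and the bucket name are assumptions, not our real configuration).
import os
import subprocess
import zipfile

import requests

QUEUE = "http://queue.internal"           # hypothetical internal queue service
CDN_BASE = "https://backups.example.com"  # Cloudflare-fronted S3 access (hypothetical)
B2_BUCKET = "example-backups"             # hypothetical destination bucket

def move_one_file() -> bool:
    # Ask the queue for a file; this locks it and starts the fail timer.
    resp = requests.post(f"{QUEUE}/next", timeout=30)
    if resp.status_code == 404:
        return False                      # queue is empty, nothing left to move
    file_id = resp.json()["file_id"]
    local_path = os.path.join("/tmp", os.path.basename(file_id))

    # DOWNLOAD: stream the object to local disk through the CDN-fronted URL.
    with requests.get(f"{CDN_BASE}/{file_id}", stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(local_path, "wb") as out:
            for chunk in r.iter_content(chunk_size=1 << 20):
                out.write(chunk)

    # VERIFY: cheap sanity check that the archive survived the trip intact.
    with zipfile.ZipFile(local_path) as zf:
        if zf.testzip() is not None:
            raise RuntimeError(f"corrupt archive: {file_id}")

    # UPLOAD: hand the file to the (multi-threaded, fast) b2 command-line tool.
    # (b2 CLI syntax may differ by version; check b2 --help for yours.)
    subprocess.run(["b2", "upload-file", B2_BUCKET, local_path, file_id], check=True)

    # Report done: the queue releases the lock and cancels the fail timer.
    requests.post(f"{QUEUE}/done/{file_id}", timeout=30)
    os.remove(local_path)
    return True
```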
The Kill Switch
In the case of a potential runaway, where even the in-band Docker Swarm commands themselves might not stop things quickly enough, we decided to make sure we had a kill switch handy. In our case, it was our intrepid little web service: we made sure we could pause it. Looking back, it would be better if it used a consumable resource, such as a counter or a value in a database cell. If we didn’t refresh the counter, it would stop all on its own. More on “runaways” later.
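In hindsight terms, the “consumable resource” version might look like this: workers spend down a budget counter that an operator must keep topping up, so if nobody refreshes it, the fleet winds down on its own. The endpoint and field names are hypothetical.

```python
# Hindsight sketch of a self-expiring kill switch (endpoint names hypothetical):
# an operator periodically tops up a counter; workers consume it and stop
# on their own once it runs out, with no in-band Swarm command required.
import requests

QUEUE = "http://queue.internal"   # same hypothetical control service as above

def may_continue() -> bool:
    """Consume one unit of budget; False means 'stop now'."""
    resp = requests.post(f"{QUEUE}/budget/consume", timeout=10)
    return resp.ok and resp.json().get("remaining", 0) > 0

def worker_loop():
    # Each pass costs one budget unit; pausing the top-up halts the fleet.
    while may_continue():
        if not move_one_file():   # from the sketch above
            break
```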
Real Life Tuning
Our business has daily, weekly, and other cycles of activity that are predictable. Most important is our daily cycle, which trails after the Sun. We decided to use our nodes in low-activity regions to carry the work, and after testing we found that, tuned correctly, this didn’t affect the relatively light loads on the servers in those low-activity regions. This was backed up by verifying no change in customer service load, using our own metrics and those of our CRM tools. Back to tuning.
Initially we tuned the DOWNLOAD transfer speed to roughly 3/4ths of what wget(1) could do. We thought, “oh, the node’s normal network traffic will fit in between this, so it’s OK.” This is mostly true, but only mostly, and it is a problem in two ways. The cause of both problems is that isolated node tests are just that: isolated. When a large number of nodes in a datacenter are doing the actual production file transfers, there is a proportional impact that builds as the traffic is concentrated towards the egress point(s).
Problem 1: you are being a bad neighbor on the way to the egress points. OK, you say, “well, we pay for network access, let’s use it,” but there is only so much to go around: all the ports of a switch together have more bandwidth than its uplink ports, so there will be limits to hit.
Problem 2: you are being your own bad neighbor. Again, if your machines end up network-near to each other, in a network-coordinates kind of way, your attempts to “use all that bandwidth we paid for” will be throttled at the closest choke point, impacting only (or nearly only) yourself. If you’re going to use most of the bandwidth you CAN use, you might as well be mindful of it and choose where the chokepoint the entire operation creates will sit. If you are not cognizant of this concern, you can take down entire racks of your own equipment by choking the top-of-rack switch or other networking in between.
By reducing our 3/4ths-of-wget(1) tuning to 50% of what wget could do for a single file transfer, we saw our nodes still functioning properly. Your mileage will absolutely vary, and there are hidden concerns in the details of how your nodes might or might not be near each other, and their impact on the hardware between them and the Internet.
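For completeness, one simple way to implement the “fraction of what wget(1) can do” idea is to measure a baseline once and then cap each streaming download at a fraction of it. This is a sketch with placeholder numbers, not our exact tooling.

```python
# Sketch of a crude per-file rate cap (placeholder numbers, not exact tooling).
import time

import requests

BASELINE_BPS = 80 * 1024 * 1024 / 8  # placeholder: a measured wget baseline of ~80 Mbit/s
CAP_BPS = BASELINE_BPS * 0.5         # the "50% of wget" setting described above

def throttled_download(url: str, dest: str, cap_bps: float = CAP_BPS) -> None:
    """Stream url to dest, sleeping as needed to stay under cap_bps."""
    start = time.monotonic()
    received = 0
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(dest, "wb") as out:
            for chunk in r.iter_content(chunk_size=256 * 1024):
                out.write(chunk)
                received += len(chunk)
                # If we are ahead of the cap, sleep until we are back under it.
                expected_elapsed = received / cap_bps
                actual_elapsed = time.monotonic() - start
                if expected_elapsed > actual_elapsed:
                    time.sleep(expected_elapsed - actual_elapsed)
```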
Old Habits
Perhaps this is an annoying detail: based on previous experience in life, I put in some delays. We scripted these tools up in Python, with a Bourne shell wrapper to detect failures (there were some), and also because, for our upload step, we went against our DNA and used the Backblaze upload utility. By the way, it is multi-threaded and really fast. But in the wrapping shell script, as a matter of course, in the main loop that talked to our API, I put in a sleep 2 statement. This creates a small pause “at the top” between files.
This ended up being key, as we’ll see in a moment.
How It (The Service, Almost) All Went Down
What’s past is sometimes not prologue. Independent testing on a single node, or even a few nodes, was not totally instructive as to what was really going to happen as we throttled up the test. Now, when I say “test” I really mean “operation.”
Our initial testing was conducted “Tactically,” as above, for which we used test files and were very careful in verifying them. In general, we were sure that we could manage copying a file down (Python loop), verifying it (unzip -T), and operating the Backblaze b2 utility without getting into too much trouble…but it’s the Strategic level that taught us a few things.
Thinking back to a foggy past where “6% collisions on a 10BASE-T network and it’s game over”…yeah, that 6%. We throttled up the number of replicas in the Docker Swarm and didn’t have any problems. Good. “Alright.” Then we moved the throttle, so to speak, to the last detent.
We had nearly achieved self-DDoS.
It wasn’t all that bad, but we were suddenly very, very happy with our 50%-of-wget(1) tuning, our 2-second delays between transfers, and most of all, our kill switch.
Analysis
TL;DR — Things went great.
There were a couple of files that just didn’t want to transfer (they weren’t really there on S3, hmm). There were some DDoS alarms that tripped momentarily. There was a LOT of traffic…and then, the bandwidth bill.
Your mileage may vary, but there are some things to think about with regard to your bandwidth bill. When I say “bill,” it’s actually a few bills.
Moving a file can trigger multiple bandwidth charges along the way, especially as our customers began to download the files from B2 for instance deployment, etc. In our case, we now only had the S3 egress bill to worry about. Here’s why that works out (a small sketch follows the list):
- We have group (node) discount bandwidth agreements with our providers
- B2 is a member of the Bandwidth Alliance…
- …and so is Cloudflare
- We were accessing our S3 content through our (not free!) Cloudflare account public URLs, not by the (private) S3 URLs.
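That last point deserves a concrete note. The workers pulled objects through a Cloudflare-proxied hostname sitting in front of the bucket rather than hitting the S3 endpoint directly, so the S3-to-Cloudflare hop was the only metered egress. The hostnames below are hypothetical stand-ins.

```python
# Hypothetical hostnames, for illustration only.
S3_PRIVATE  = "https://example-backups.s3.amazonaws.com"  # direct S3: metered egress on every pull
CDN_FRONTED = "https://backups.example.com"               # Cloudflare-proxied front for the bucket

def download_url(key: str) -> str:
    """Build download URLs against the Cloudflare-proxied host, never the raw
    S3 endpoint, so the only metered egress is the S3 -> Cloudflare hop."""
    return f"{CDN_FRONTED}/{key}"
```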
Without saying anything about our confidential arrangements with our service partners, two things are generally true: you can talk to providers and sometimes work out reductions, and they especially like it when you call them (in advance) to discuss your plans to run their gear hard. For example, on another data move, one of the providers gave us a way to “mark” our traffic a certain way so that it would go through a quiet-but-not-often-traveled part of their network; win-win!
Want More?
Thanks for your attention, and good luck with your own byte slinging.
Gregory R. Sudderth
Nodecraft Senior DevOps Engineer
* Science is hard, blue keys on calculators are tricky, and we don’t have years to study things before doing them