sondehub

Data transfer cost improvements

Added 2022-02-16 00:20:33 +0000 UTC

As SondeHub grows in both the number of stations and the number of website users, our data transfer costs increase. When there was only a handful of stations using SondeHub this wasn't a big deal, but now that SondeHub is nearing 800 stations the data transfer fees are starting to add up. In the last month alone we've picked up an extra 100 stations! Added to this, when we migrated from habhub it was done in a way to minimize the time to switch over. APIs were rushed into production and not much thought went into optimisation. Last month data transfer costs became our number one expense. Because of this I spent some time over the last few weeks optimising our platform to reduce this cost, and I'll run through some of the savings we've made.

SNS message compression

To ensure reliability and performance within SondeHub we rely heavily on Simple Notification Service (SNS). SNS is a message distribution system. When we receive a batch of frames on the API we turn this into an SNS message. SNS then passes this on to Simple Queue Service (SQS) for processing into ElasticSearch, along with a Lambda function for processing onto websockets / MQTT.

A typical SNS message for SondeHub is a JSON array of payload data. In my test case this was 7879 bytes. (For the purposes of testing I've actually used payloads from several different receivers to make this the worst case scenario: much higher entropy.)

SNS doesn't charge per message for deliveries to SQS and Lambda (we get charged on the SQS and Lambda side), however you are still charged for data transfer: $0.09 per GB.

To reduce this overall cost we can compress the message. Using GZIP brings it down to 1401 bytes. There's a problem here though: SNS and SQS require strings, not binary data, so we then have to base64 the data. After base64ing the data we are back up to 1869 bytes. Overall this gives us roughly a 4.2x compression ratio.
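To put those numbers in context, here is a small back-of-the-envelope calculation using the sizes above. The message volume (one million messages) and the choice of 10^9 bytes per GB are assumptions for illustration only, not SondeHub's real figures.

```python
# Sizes measured in the test case above (bytes)
RAW_JSON = 7879
GZIPPED = 1401
GZIP_BASE64 = 1869

PRICE_PER_GB = 0.09    # data transfer price quoted above (USD)
BYTES_PER_GB = 10**9   # assumption: decimal gigabytes


def transfer_cost(message_bytes: int, messages: int) -> float:
    """Data transfer cost in USD for sending `messages` messages of a given size."""
    return message_bytes * messages / BYTES_PER_GB * PRICE_PER_GB


overall_ratio = RAW_JSON / GZIP_BASE64     # what is actually saved after base64
base64_overhead = GZIP_BASE64 / GZIPPED    # the price of needing a string

MESSAGES = 1_000_000  # hypothetical volume, for illustration only
print(f"compression ratio: {overall_ratio:.1f}x")
print(f"base64 overhead:   {base64_overhead:.2f}x")
print(f"uncompressed: ${transfer_cost(RAW_JSON, MESSAGES):.2f}")
print(f"compressed:   ${transfer_cost(GZIP_BASE64, MESSAGES):.2f}")
```

Base64 turns every 3 bytes into 4 characters, which is why the overhead comes out at about a third. Even after paying that, the message is still much smaller than the original JSON.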

The code to do this is actually quite simple:

<code>    import base64, gzip, json, zlib
    from io import BytesIO

    # compress: JSON -> gzip -> base64 text (SNS and SQS need strings)
    compressed = BytesIO()
    with gzip.GzipFile(fileobj=compressed, mode='w') as f:
        f.write(json.dumps(payload).encode('utf-8'))
    payload = base64.b64encode(compressed.getvalue()).decode("utf-8")

    # decompress: base64 -> gunzip -> JSON (16 + MAX_WBITS makes zlib expect a gzip header)
    decoded = json.loads(zlib.decompress(base64.b64decode(sns_message["Message"]), 16 + zlib.MAX_WBITS))</code>
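If you want to try this out yourself, the same idea can be wrapped into two helper functions and run as a round trip. This is only a sketch: the function names and the sample frames are made up for illustration and are not taken from the SondeHub code base.

```python
import base64
import gzip
import json
import zlib


def compress_payload(payload) -> str:
    """Serialise to JSON, gzip it, then base64 it so the result is a plain string."""
    raw = json.dumps(payload).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("utf-8")


def decompress_payload(message: str):
    """Reverse of compress_payload: base64 -> gunzip -> JSON."""
    gzipped = base64.b64decode(message)
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
    return json.loads(zlib.decompress(gzipped, 16 + zlib.MAX_WBITS))


# Made-up frames for illustration; real telemetry has many more fields
frames = [
    {"serial": f"S{i:07d}", "frame": i, "lat": -37.8 + i / 1000, "alt": 1000 + 5 * i}
    for i in range(100)
]

message = compress_payload(frames)
assert decompress_payload(message) == frames  # nothing is lost in the round trip

print(len(json.dumps(frames)), "bytes of JSON ->", len(message), "bytes on the wire")
```

The exact saving depends heavily on the data. Frames from a single receiver repeat a lot of values and will usually compress better than the mixed worst case measured above.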

And in practice we can see the decrease in our overall SNS spend.

AZ Traffic

Inside AWS you can choose where your data and compute are stored. There are two main concepts: regions and availability zones. An availability zone is basically one or more datacenters. Availability zones don't share any resources with each other. Regions contain multiple availability zones. Availability zones within the same region are interconnected, allowing for high speed traffic.

What moving data around costs depends on where you're moving it. The basic version is:

Traffic inside an availability zone is free
Traffic between availability zones is cheap
Traffic between regions is a little more expensive
Traffic out to the internet is very expensive

SondeHub is hosted in a single region, however we had some interesting traffic flows. Most of our Lambda functions don't use a VPC, which gives them quick startup times. The problem with this approach is that when we place messages on the websockets we are charged for traffic to a load balancer.

The Lambda function that posts to the websockets endpoint was updated to use the internal IP addresses and limited to a single availability zone. This required a little bit of extra logic to detect which IP address was active internally, but the end result is that we are no longer charged for this traffic.
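The post doesn't show how the active internal IP address is detected, so the following is only one possible way to do it, not the actual SondeHub logic. It assumes you have a list of candidate private addresses and simply picks the first one that accepts a TCP connection. The function name, addresses and port are all hypothetical.

```python
import socket
from typing import Iterable, Optional


def first_reachable(hosts: Iterable[str], port: int, timeout: float = 0.5) -> Optional[str]:
    """Return the first host that accepts a TCP connection on `port`, else None."""
    for host in hosts:
        try:
            # The connection is closed again straight away; we only care that it opened
            with socket.create_connection((host, port), timeout=timeout):
                return host
        except OSError:
            continue  # refused, unreachable or timed out: try the next candidate
    return None


# Example call with made-up private addresses and port:
#   active = first_reachable(["10.0.1.10", "10.0.1.11"], 8080)
```

A short timeout matters here, because in a Lambda function every second spent waiting on a dead address is billed.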

Our websocket servers were also modified to be single AZ to reduce traffic costs.

ElasticSearch compression

Another big saving was ElasticSearch compression. For a long while we have been compressing our requests/queries to ElasticSearch. However, we never sent the required headers to get a compressed result. This meant that all our responses (which sometimes contain thousands of documents) were completely uncompressed JSON.

Adding compression to the responses was pretty straightforward:

<code>    # Content-Encoding: our request body is gzipped; Accept-Encoding: ask for a gzipped response
    headers = {"Host": ES_HOST, "Content-Type": "application/json",
               "Content-Encoding": "gzip", "Accept-Encoding": "gzip"}
    ...
    # only decompress when the server actually compressed the response
    if (
        'Content-Encoding' in r.headers
        and r.headers['Content-Encoding'] == 'gzip'
    ):
        return json.loads(zlib.decompress(r.content, 16 + zlib.MAX_WBITS))</code>
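To make both halves of this easier to test without a cluster, here is a self-contained sketch of the two directions. The helper names `encode_body` and `decode_body` are made up for this example, and the response is simulated rather than fetched from a real ElasticSearch host.

```python
import gzip
import json
import zlib

# Same idea as the headers above, minus Host (which depends on your cluster)
HEADERS = {
    "Content-Type": "application/json",
    "Content-Encoding": "gzip",   # the request body we send is gzipped
    "Accept-Encoding": "gzip",    # and we accept a gzipped response
}


def encode_body(query: dict) -> bytes:
    """Gzip a JSON query so it matches the Content-Encoding request header."""
    return gzip.compress(json.dumps(query).encode("utf-8"))


def decode_body(headers: dict, content: bytes):
    """Parse a JSON response body, gunzipping it first if the server compressed it."""
    # HTTP header names are case-insensitive, so normalise before looking one up
    lowered = {k.lower(): v for k, v in headers.items()}
    if lowered.get("content-encoding") == "gzip":
        content = zlib.decompress(content, 16 + zlib.MAX_WBITS)
    return json.loads(content)


# Simulated compressed response, standing in for what the server would send back
fake_response = gzip.compress(json.dumps({"hits": {"total": 2}}).encode("utf-8"))
print(decode_body({"Content-Encoding": "gzip"}, fake_response))
```

Checking the response header before decompressing keeps the code working if the server decides not to compress a particular response, for example a very small one.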

This provides a significant saving on our data transfer costs.

Lightsail for data out

The final data saving was really about picking the right services for the job. Inside Fargate (where we host websockets) we are charged $92 per TB of data transfer out to the internet. However, if we host our application inside Lightsail, we can get a 1 core / 512MB instance with 1TB of traffic (in/out) for $3.50/month. Significantly cheaper, and we get compute as well! Lightsail does have some limitations, such as no autoscaling.

We switched to three of the $5/month Lightsail instances, as they provide 2TB a month each. This is well over our typical usage, so we shouldn't need autoscaling to cope with most of the traffic spikes we see, while still getting a significant reduction in overall data transfer out costs.
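As a rough illustration of the difference, the prices quoted above can be compared directly. The monthly traffic figures in the loop are hypothetical, and the sketch ignores Fargate compute, the Lightsail load balancer and any overage charges.

```python
FARGATE_EGRESS_PER_TB = 92.00   # USD per TB out to the internet, as quoted above
LIGHTSAIL_PRICE = 5.00          # USD per instance per month
LIGHTSAIL_INCLUDED_TB = 2       # transfer allowance per instance per month
INSTANCES = 3


def fargate_egress_cost(tb_out: float) -> float:
    """Data transfer charge only; Fargate compute is billed on top of this."""
    return tb_out * FARGATE_EGRESS_PER_TB


lightsail_monthly = INSTANCES * LIGHTSAIL_PRICE          # flat fee, compute included
lightsail_allowance = INSTANCES * LIGHTSAIL_INCLUDED_TB  # total TB included per month

for tb_out in (1, 3, 6):  # hypothetical monthly traffic in TB
    within = "within" if tb_out <= lightsail_allowance else "over"
    print(f"{tb_out} TB: Fargate egress ${fargate_egress_cost(tb_out):.2f} "
          f"vs Lightsail ${lightsail_monthly:.2f} ({within} allowance)")
```

The Lightsail cost stays flat as long as traffic fits inside the combined allowance, while the Fargate charge grows with every TB sent.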

As these are running inside Amazon, it was also possible to configure them as Elastic Container Service hosts, which made migrating over easy, as we just provisioned our websocket container tasks onto the new hosts.

The only tricky part of this is that we are forced to use the Lightsail load balancer and rely on health checks to add and remove the instances from the load balancer. However, this is fine for us.