sondehub

What's been happening over Jan and Feb

I'm going to try to be brief (future me here - turns out this update was less brief than I thought it would be - we've been doing a lot!) because everything has been a lot lately, but I wanted to cover some of the things we've been working on behind the scenes.

MQTT WebSocket compression issues

For several months we've been running "permessage-deflate" compression on the WebSockets solution that powers our tracker page and third-party access. This has saved us significantly on bandwidth. However, we've been seeing our WebSocket servers crash with a segfault. Unfortunately, when the crash occurs it hits all our servers at once, so the impact is significant.

We were able to get core dumps, but even with help from several people and a lot of GDB debugging we were unable to determine the cause or a suitable fix. The problem could lie within mosquitto or libwebsockets, but we have been unable to determine which.

Another approach we looked at was using a modified version of the Python wsproxy project to provide the compression. The problem with this solution is that it used significantly more CPU than was acceptable. At this stage we are running without compression, which is frustrating both for us from a cost perspective and for our clients, who have to consume more data.
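For anyone curious why losing compression hurts so much, here is a minimal sketch of what permessage-deflate does to each message, using Python's standard `zlib` module. It is not our server code (that lives inside mosquitto and libwebsockets). The telemetry fields and the `make_frame` helper are made up for illustration, and the sketch assumes the compression context is shared across messages, which is the extension's default behaviour.

```python
import json
import zlib

# A sync flush always ends the output with these four bytes.
# RFC 7692 has the sender strip them and the receiver add them back.
DEFLATE_TAIL = b"\x00\x00\xff\xff"


def deflate_message(compressor, payload: bytes) -> bytes:
    """Compress one message roughly the way permessage-deflate does."""
    data = compressor.compress(payload) + compressor.flush(zlib.Z_SYNC_FLUSH)
    return data[: -len(DEFLATE_TAIL)]


def inflate_message(decompressor, data: bytes) -> bytes:
    """Reverse deflate_message by restoring the stripped trailer."""
    return decompressor.decompress(data + DEFLATE_TAIL)


def make_frame(n: int) -> bytes:
    """Hypothetical telemetry frame; the field names are illustrative only."""
    frame = {
        "serial": "X1234567",
        "frame": n,
        "lat": round(-34.9 + n * 0.0001, 5),
        "lon": 138.6,
        "alt": 1000 + 5 * n,
    }
    return json.dumps(frame).encode()


# wbits=-15 selects raw deflate (no zlib header), as the extension requires.
compressor = zlib.compressobj(wbits=-15)
decompressor = zlib.decompressobj(wbits=-15)

raw = 0
packed = 0
for n in range(100):
    message = make_frame(n)
    wire = deflate_message(compressor, message)
    # Every message must round-trip exactly.
    assert inflate_message(decompressor, wire) == message
    raw += len(message)
    packed += len(wire)

print(f"raw={raw} bytes, compressed={packed} bytes")
```

Because consecutive frames from the same sonde look almost identical, reusing one compressor across messages is where most of the saving comes from. It also means the server has to hold compression state per connection, which costs memory and CPU.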

OpenSearch capacity

We've seen an increase in OpenSearch CPU usage. I haven't been able to determine the exact cause of this increase - it could be extra usage, more amateur usage, or querying over more data (e.g. we aren't cleaning stuff up properly).

Over the last few days and weeks, OpenSearch has often hit max CPU, which delayed predictions and caused other data features to fail.

What's worse is that when the WebSocket servers crash, it often causes additional CPU load on OpenSearch as clients fall back to API requests.

We've made some efforts to clean up older data. However, more research is required in this area.

Predictor issues

At some point we hit some weird limitation with the predictor where it would just go slow. I don't think we've truly worked out exactly why this started happening, but we've worked around it by improving the predictor's scaling system to handle the load and reduce the overall response time. This system is complex and very hard to troubleshoot. We even went as far as instrumenting the server in New Relic, however without past data it's hard to work out if something has changed or if we hit a performance limit.


Increased load

In the last few days, news cycles have made balloon tracking interesting to more and more people. As such, we've seen usage of our tracking websites more than double. I was hoping this would come and go fairly quickly, however it seems this extended usage (and possibly more) is likely to stick around for longer.

We've added 3 additional WebSocket servers and doubled the size of our OpenSearch cluster to handle the load.

I wanted to quickly mention what happens to SondeHub during high load:

- Our WebSocket servers have a lot of burst capacity - so they can often ride out increases in load (see above screenshot) provided it isn't sustained for multiple days.
- If OpenSearch is overloaded, cached versions of results are served up where possible - this is usually enough to get the tracker to connect to WebSockets for live data. Even if OpenSearch is completely offline, live data through WebSockets should work.
- Our ingestion pipeline is very robust. It's unlikely that we will drop uploaded data even if WebSockets or OpenSearch are entirely offline. It just might take a while before data shows up on the tracker or in databases.
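
The "serve cached results when the backend is struggling" idea from the second point can be sketched roughly as below. This is only an illustration of the general pattern, not how SondeHub actually implements it. The `StaleCache` class, its `fetch` callback, and the TTL value are all hypothetical names made up for this example.

```python
import time


class StaleCache:
    """Serve fresh results when possible, stale ones if the backend fails."""

    def __init__(self, fetch, ttl=30.0, clock=time.monotonic):
        self._fetch = fetch    # hypothetical callable that queries the backend
        self._ttl = ttl        # seconds a cached result counts as fresh
        self._clock = clock
        self._store = {}       # key -> (stored_at, value)

    def get(self, key):
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]      # still fresh, no backend call needed
        try:
            value = self._fetch(key)
        except Exception:
            if hit is not None:
                return hit[1]  # backend is down or overloaded: serve stale
            raise              # nothing cached, so the error has to surface
        self._store[key] = (now, value)
        return value


# Stand-in for a backend query that can be switched into a failing state.
backend_up = True


def query_backend(key):
    if not backend_up:
        raise RuntimeError("backend overloaded")
    return f"result for {key}"


cache = StaleCache(query_backend, ttl=0.0)  # ttl=0 forces a refetch each time
print(cache.get("latest"))   # fetched from the backend
backend_up = False
print(cache.get("latest"))   # backend fails, cached copy is served instead
```

The trade-off is that clients may briefly see old data, which is acceptable here because the tracker only needs enough to get connected to WebSockets for the live feed.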

Future

During all this troubleshooting, debugging and increasing of resources, many changes have been made without being committed to our infrastructure as code repo. Once things have calmed down, fixing this up will be the first task.

Me

During all of this I've been trying to take care of myself, but it's been fairly hard. At the moment I'll just be focusing on maintaining system uptime and availability. It may be a while until new features are added to our APIs and backend systems. Feel free to keep on adding issues to the sondehub-infra GitHub repo, but keep in mind it may be a while before I can get to them.

That's it for now, happy balloon flying and hunting.

~ Michaela.
