db0

Massive Refactoring

Added 2022-12-03 17:14:56 +0000 UTC

When I first designed the AI horde, I wanted to create something quickly so I could put it to use in the video game I'm making, Hypnagonia, and automatically generate AI story blurbs for my various Torments. For that I needed a system which provided a REST API without any complex configuration on the client side.

I expected the initial design to be little more than a way to better distribute the limited resources required to run KoboldAI, so a very optimized setup would not be necessary. We had about 4 workers between all of us.

Therefore the original code was very simple. Specifically, it didn't utilize a database of any kind, instead keeping everything in RAM as Python built-in structures and merely writing them as JSON to disk every 10 minutes or so.
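To make that setup concrete, here is a minimal sketch of the "everything in RAM, dumped to JSON every so often" approach. It is not the actual horde code: the `state` layout, the function names and the atomic-rename detail are all assumptions made for illustration.

```python
import json
import os
import tempfile
import threading

# Hypothetical in-memory state; the real horde's structures are not shown in this post.
state = {"users": {}, "workers": {}}
state_lock = threading.Lock()


def save_snapshot(path):
    """Write the whole in-memory state to disk as JSON."""
    with state_lock:
        payload = json.dumps(state)
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file first, then rename, so a crash never leaves half a file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(payload)
    os.replace(tmp_path, path)


def load_snapshot(path):
    """Restore the state from disk, if a snapshot exists."""
    if not os.path.exists(path):
        return
    with open(path) as handle:
        loaded = json.load(handle)
    with state_lock:
        state.clear()
        state.update(loaded)


def start_periodic_save(path, interval_seconds=600):
    """Save every interval_seconds in a daemon thread; returns an Event that stops it."""
    stop = threading.Event()

    def loop():
        # wait() returns False on timeout (time to save) and True once stop is set.
        while not stop.wait(interval_seconds):
            save_snapshot(path)

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

The appeal is obvious: no schema, no queries, just dicts. The catch is that all of this lives inside one process, and anything changed since the last snapshot is lost if that process dies.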

This was not only because I needed to go fast, but also because I just didn't have the knowledge needed to make it more advanced. When I started this project, I didn't know how to use Python imports correctly or how Python OOP works, and I had basically 0 database knowledge. So in order to be able to proceed, I used what I knew.

Almost at the same time as the KAI horde came up and running, Stable Diffusion was unleashed, and the world was never again the same! It immediately occurred to me that the system I had built to work with text could just as well work with images. So I forked my repo into the Stable Horde and quickly tweaked my inputs and outputs to work with images. I felt that there was going to be quite a bit of demand for free and open Stable Diffusion, even by people who couldn't afford a GPU or Midjourney.

I did not expect the meteoric usage that I would run into!

Soon after I had the Stable Horde running, I realized that it was getting impossible to maintain 2 forks of very similar code at the same time. So I decided to merge the codebases of the two hordes back in October. That was the first massive refactoring I undertook, and through it I had to quickly learn to use Python's OOP and imports, as well as a more advanced API framework, flask-restx.

This was good enough, as at that point the Stable Horde had fewer than 5 regular workers and a couple hundred registered users at best. The existing performance was plenty to handle it. However, deep inside I knew this was eventually going to need a database of some sort, because a single Python process can effectively use only 1 CPU core at a time. And even if I could undertake the very, very tricky work of converting my code to be multi-process, I would still run into the same problem once my VM ran out of CPUs and/or RAM.

No, the only "proper" solution was to implement a central database which would allow my frontend VMs to just take care of the number crunching. However, not only did I still have basically 0 knowledge of how to use databases in Python, but there were a lot of other things that always took priority in improving the horde. Img2img, inpainting, threading, bots...

And then Stable Diffusion 2.0 was released...

The Stable Horde was the first to implement support for Stable Diffusion 2.0, even faster than Dreamstudio (the official generator from the creators of Stable Diffusion - Stability.ai). A world first!

...and then my server figuratively melted from everyone rushing to try it out.

While we still had plenty of network capacity and other CPUs, the single process of the Stable Horde was simply incapable of keeping up. It didn't matter what I tweaked anymore: threads, connections, open files, etc. If your CPU core is full, it's full!

So at the moment of our greatest triumph to date, we were betrayed by my spaghetti code. It was time to refactor.

However, I still had practically no knowledge of how to use a database. I knew I had to use what is called an ORM, but I had no experience with it, nor with any of the databases that it would use. Initially I tried to use the existing redis database I was using for my various caching, but it quickly slowed to a crawl.

After a desperate cry to the Stable Horde community in my discord, 2 people stepped up to help me. Warlax#9278, who had a lot of experience with ORMs, provided some initial guidelines on how to format my objects, which I could glom onto and build upon. stuck)in_state_space#5254 gave advice and could help build some other components which I required for the scaling, such as a haproxy configuration for load balancing.

And so it began last Friday. I informed my family that I needed to "no-life" it until the refactoring was done, and I started grinding code that night.

Progress was fast and Warlax was there to give advice (and some SQLAlchemy code as well), so by the end of the weekend I had converted almost everything to ORM. On Monday, after another episode of the existing horde dying under the load, I decided to try to deploy it, since it was effectively down anyway.

It immediately dropped dead. Turns out, you can convert the code to ORM, but if it's using the same paradigm as before, where you expect everything to be in RAM, you're going to have a very very bad time.

Because data now had to be fetched from a remote DB, the previous greedy statements slowed everything to a crawl. The first deployment was so slow, it wouldn't even generate a single thing, no matter how many frontends I threw at it.
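A small sketch of what "greedy" means here, under stated assumptions: it uses Python's built-in sqlite3 with an in-memory database rather than the ORM and remote DB the horde actually uses, and the `workers` table and function names are made up for illustration. With everything in RAM, looping over all objects is cheap. Once the same loop has to pull every row from a database first, it is not.

```python
import sqlite3

# Hypothetical schema for illustration only; not the horde's real tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workers (id INTEGER PRIMARY KEY, name TEXT, last_seen INTEGER)"
)
conn.executemany(
    "INSERT INTO workers (name, last_seen) VALUES (?, ?)",
    [(f"worker-{i}", i) for i in range(10_000)],
)
conn.commit()


def count_active_greedy(cutoff):
    """RAM-era habit: fetch every row, then filter in Python."""
    rows = conn.execute("SELECT id, name, last_seen FROM workers").fetchall()
    return sum(1 for _id, _name, last_seen in rows if last_seen >= cutoff)


def count_active_targeted(cutoff):
    """Database-friendly version: let the DB filter and return a single number."""
    row = conn.execute(
        "SELECT COUNT(*) FROM workers WHERE last_seen >= ?", (cutoff,)
    ).fetchone()
    return row[0]
```

Both functions return the same answer, but the first one moves 10,000 rows to get it and the second moves one. Against a remote database, with a network round trip and object construction on top, that difference is multiplied on every request.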

So I had to roll back and try again. This time I decided to go about it more smartly. With the help of stuck)in_state_space#5254 I deployed a new load balancer in a different location (kindly offered by hlky from Sygil.dev) and set up some of my other frontends as dev.stablehorde.net. And let me tell you, I am very fortunate I knew Ansible by now, as the same code is the one that is going to be deployed in production as well.

I then spent the next 3 days just using the dev instance, crushing bugs and adding as many optimizations as I could find. Without exaggeration, this was the most productive, most stressful and most complex week of coding I've done in my life!

Finally, yesterday evening, I deployed the code to production, and this time it didn't fall down! In fact our speed has now increased by an order of magnitude and we can scale so much more easily with new frontends. Our only bottleneck in the future might be the Database itself, but I'm confident we can scale that up way easier.

Due to this code, my existing servers can now run 2-4 processes of the horde, fully utilizing all their CPUs. Thus I spread the load among all my (new) servers quite nicely and they're all chugging along happily.

There were of course a few small bugs discovered, some of which are still stubbornly persisting, but nothing which is a showstopper.

All in all, this refactoring took 300+ commits and 4000+ lines of changes! Pretty much nothing stayed untouched!

There's still plenty to do in this infrastructure, but that can be done in a more relaxed manner. The most important part is that the horde now has the capacity to grow 100-fold!

So by the time Stable Diffusion 3.0 comes out, we'll be ready!