Touhou-Project.com

No Escape

Added 2019-11-17 01:34:35 +0000 UTC

Hey all, I’d like to detail some of the stuff I’ve had to deal with lately and use the opportunity to explain a little about how the site works.

Whenever a post is made on THP, three basic things happen:

1) The board software checks the input fields to see if they’re valid, converts special rules (think “sage”) and otherwise formats the post in a particular fashion.

2) The formatted data is then inserted into the database.

3) Information is retrieved from the database and HTML pages are built.

Now, I’ve talked in the past about several of these stages and what they entail, but I wanted to be a little more general this time around and gloss over most of the details. What’s important to note here is that, at each stage, information is handled and converted to a specific syntax.

Yes, things like bold text and timers and whatever are all nice features to have. What’s even nicer is preventing arbitrary code execution. Transforming the input into an escaped format keeps people from accidentally or intentionally doing things on the site they’re not supposed to be doing. Every tag, bracket, quote and whatever else is treated as plain text before it’s inserted into the database.
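To make the idea of escaping concrete, here is a minimal Python sketch. It is an illustration only, not the board software’s actual code: the function name `escape_field` and the sample input are made up, and Python’s standard `html.escape` stands in for whatever escaping routine the site really uses.

```python
import html

def escape_field(raw: str) -> str:
    """Turn characters that are special in HTML into harmless entities.

    quote=True also escapes double and single quotes, so the result is
    safe inside attribute values as well as in normal text.
    """
    return html.escape(raw, quote=True)

# Hypothetical hostile input: a script tag typed into the message field.
sample = '<script>alert("hi")</script>'
escaped = escape_field(sample)
print(escaped)
```

After this step the angle brackets and quotes are gone from the stored text, so a browser would display the input as typed rather than run it.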

An abstraction layer in the code handles this input safely, minimizing the risk of SQL injection. This obviously adds complexity to the code base, but any website worth its salt will have these basic measures to transform any potentially exploitable input into harmless characters that can be parsed at leisure. If you were to copy and paste a database entry for the average post, it would not render as HTML but instead as text with weird markings and characters.
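One common way such a layer avoids SQL injection is by passing user input to the database as bound parameters instead of pasting it into the query string. The sketch below shows that technique under stated assumptions: it uses Python’s built-in `sqlite3` with an in-memory database as a stand-in, and the `posts` table and `insert_post` helper are hypothetical. The site’s real database and abstraction layer may well look different.

```python
import sqlite3

# In-memory database standing in for the real one (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, message TEXT)")

def insert_post(conn: sqlite3.Connection, message: str) -> int:
    """Insert a message using a bound parameter, never string formatting."""
    cur = conn.execute("INSERT INTO posts (message) VALUES (?)", (message,))
    conn.commit()
    return cur.lastrowid

# Input that would break a query built by naive string concatenation.
hostile = "x'); DROP TABLE posts; --"
post_id = insert_post(conn, hostile)

# The text comes back exactly as submitted and the table still exists.
stored = conn.execute(
    "SELECT message FROM posts WHERE id = ?", (post_id,)
).fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
```

Because the driver treats the bound value purely as data, the quote and semicolon in the input never become part of the SQL statement.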

The third part is taking all that transformed data and turning it back into something that is valid HTML and, more importantly, that your web browser treats as valid too. When pages are built, the database data is processed again and some of the earlier transformation is reversed. Things like [i] are converted into the appropriate HTML tags, and invalid tags (like, say, something to alter the text’s size) are ignored and output as plain text. When working well, this ensures that there’s no potential for exploits or trouble at any step of the way.
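The output step can be sketched roughly as follows. This is a simplified Python illustration, not the site’s parser: the whitelist `ALLOWED_TAGS`, the `[size=5]` example and the function `render_message` are assumptions made for the example, and a real parser would also need to deal with nesting and unbalanced tags.

```python
import html
import re

# Hypothetical whitelist mapping board markup to HTML elements.
ALLOWED_TAGS = {"b": "strong", "i": "em"}

def render_message(stored: str) -> str:
    """Convert whitelisted [tag]...[/tag] pairs in already-escaped text.

    Anything not on the whitelist is left untouched, so it shows up
    on the page as plain text.
    """
    out = stored
    for bb, tag in ALLOWED_TAGS.items():
        pattern = re.compile(rf"\[{bb}\](.*?)\[/{bb}\]", re.DOTALL)
        out = pattern.sub(rf"<{tag}>\1</{tag}>", out)
    return out

# The message is escaped before storage, then converted on output.
stored = html.escape("[i]hi[/i] [size=5]big[/size] <b>")
rendered = render_message(stored)
print(rendered)
```

Only the markup the software generates itself ends up as real HTML. The unsupported tag and the raw `<b>` typed by the user both stay inert.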

Now here’s where it gets tricky: the last step varies depending on what part of the site is being rendered. Because of a mix of historical reasons and how the code is structured, not every piece of data is reconstructed in the same way. So, for example, after switching to the Twig templating engine a while back, there were some issues with getting parts of the site to render as they should. Twig is fairly strict and errs on the side of not outputting variables (as in the parsed data from the database) as HTML. As a result, things like an apostrophe or a tag of any sort rendered as character codes instead of looking the way they should have.
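One plausible way character codes end up visible on a page is double escaping: data that was already escaped before storage gets escaped a second time by an auto-escaping template layer. The Python sketch below imitates that effect with `html.escape` applied twice. It is an assumption about the mechanism, not a reproduction of Twig or of the site’s templates, and the variable names are invented for the example.

```python
import html

raw = "it's <b>bold</b>"

# Escaped once before being stored in the database.
stored = html.escape(raw)

# Escaped again by an auto-escaping template layer (simulated here).
autoescaped = html.escape(stored)

# A browser decodes entities once when displaying text, so the visitor
# sees the output of a single unescape pass.
shown = html.unescape(autoescaped)
print(shown)
```

The visitor ends up looking at the once-escaped text, entity codes and all, instead of the original apostrophe and tag.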

When I first switched, I took that into consideration and added exceptions and code that marked bits as “safe” to render in their raw form. The problem is that, in some cases, this was too permissive. Some parts of the input were getting rendered anyway even though they weren’t supposed to be. It was subtle enough that I didn’t notice it for some edge cases, and for a little while it would have been possible to use HTML in some places to do things not normally allowed in posts.
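The difference between a too-permissive “safe” marker and a narrow one can be shown with a small Python sketch. Everything here is hypothetical: the `Safe` class and `render` function imitate the general idea of an auto-escaping renderer with an opt-out, and are not Twig’s API or the site’s code.

```python
import html

class Safe(str):
    """Marker type: the string is trusted HTML and must not be escaped."""

def render(value: str) -> str:
    # Auto-escape everything unless it was explicitly marked as safe.
    return value if isinstance(value, Safe) else html.escape(value)

# Hypothetical user-controlled field containing markup.
user_name = "<img src=x onerror=alert(1)>"

# Too permissive: the marker covers user input along with the markup,
# so the user's tag is output as real HTML.
too_permissive = render(Safe(f'<span class="name">{user_name}</span>'))

# Narrower: escape the user's part first, then mark only the markup
# the software composed itself as safe.
narrow = render(Safe(f'<span class="name">{html.escape(user_name)}</span>'))
```

Both calls produce the wrapping `<span>`, but only the second keeps the user’s tag as inert text.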

As I mentioned, not every part of the site is rendered with the same bits of code. Some features, like previewing posts or the thread updater, use semi-independent calls to handle what they render. They bypass parts of the other two steps because their scope is limited and they offer no attack vector for inserting anything into the database. This meant that data was escaped, or not escaped, differently than it was for regular posts.

In the end, I had to partially rewrite some bits here and there to a) make things as restrictive as possible and b) avoid needlessly complex code and special cases. The post previews, thread expansion, etc. still need to be consolidated into common chunks of code, but that’s more of a mid-term kind of thing: I plan to overhaul some of the HTML structures of the site in the near future, so it would be pointless to do it right away and have to revise it later.

The most tedious part of all of this was testing every possible permutation of input and rendering, which was pretty time-consuming. It may not be hard to tell the software to escape the message part of a post, but you then need to see if the same needs to be done for, say, the email field and whether or not that interferes with the special features of the site (like “noko” or “sage”). Most templates (there are about two dozen of them in total) had to be modified in some way as a result of these requirements.
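Checking every combination of field and input by hand is slow, so part of that kind of testing can be automated. The Python sketch below shows the general approach with `itertools.product`. The field names, payloads and the `process_field` function are all assumptions for illustration, as is the idea that keywords are recognized only in the email field. The real checks were made against the site’s own templates.

```python
import html
import itertools

# Hypothetical post fields and test inputs.
FIELDS = ["name", "email", "subject", "message"]
PAYLOADS = ["plain text", "it's", "<b>bold</b>",
            '"><script>x</script>', "sage", "noko"]
KEYWORDS = {"sage", "noko"}

def process_field(field: str, value: str) -> tuple[str, str]:
    """Recognize special keywords in the email field, escape the rest."""
    if field == "email" and value in KEYWORDS:
        return ("keyword", value)
    return ("text", html.escape(value))

# Try every field with every payload and record anything that would
# still contain characters capable of opening a tag or an attribute.
failures = []
for field, payload in itertools.product(FIELDS, PAYLOADS):
    kind, out = process_field(field, payload)
    if kind == "text" and any(ch in out for ch in '<>"'):
        failures.append((field, payload))

print(f"{len(FIELDS) * len(PAYLOADS)} combinations, {len(failures)} failures")
```

A loop like this only covers the escaping itself. Whether each template then renders the result correctly still has to be checked page by page.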

I know that this isn’t the most interesting topic to put up, but I’ve delayed most of my other work and rolling out changes until NaNoWriMo is over with. This is partly because I’ve simply had very little time to work on the site but also because I don’t wish to break anything needlessly until things quiet down a little more. Next time I hope to talk about some of the more consequential things going on.

Until then, take it easy!