SakeTami
dobiestation

Sins of the PS2: Harry Potter and the Prisoner of Azkaban

Every console has its sins, and the PS2 is no exception. Whether it be caused by tight deadlines imposed by the publisher or bad design practices, video games often have bugs hiding in them that are only exposed when running them in an emulator. Older consoles tend to be quite forgiving when a game uses undocumented or illegal behavior, which is a pain for emudevs as it then means figuring out what the console is supposed to do when it's fed invalid data.

The bugs in many PS2 games are easy enough to understand, since they tend to follow similar patterns, such as trying to start DMA transfers without checking if the DMAC is ready for another transfer. Some truly insane bugs, like the DMA bug True Crime: Streets of L.A. has, can take a while to understand, because they interact with the hardware in a way no other game has before, defying all expectations. Generally speaking though, it takes at most several days to identify a proper solution for a game.

Harry Potter: Prisoner of Azkaban is not one of those games. It relies on such strange behavior that the answer evaded me for literal months. Every time I thought I had found a solution, it was shut down, and I was left without any clues. Yet the fix for the game was right under my nose the entire time...

A Magical Journey

Prisoner of Azkaban is one of those games that has never really worked right on PCSX2. Although it is playable nowadays, it needs a rather invasive patch for it to boot properly, which is made worse by the game having multiple variants for each country it was released in, meaning the same patch has to be applied around 10 different times! Of course, the game also failed for the same reason on DobieStation, so I decided to see why the game needed a patch. I was not expecting this to take over a year, however.

The image above shows a decompilation of the game's timer interrupt handler. This function gets called at a frequency of 30 Hz, the game's framerate. Its main job is to start a DMA transfer to the VIF, a hardware interface used for processing 3D geometry and sending polygons to be drawn. The exact way this works isn't important for the purposes of this article - think of this code as the game saying "draw everything that needs to go on the screen".

The game maintains a linked list of DMA buffers, which this function pops from. It didn't take long for me to figure out this is the cause of the crash in PCSX2. During startup, the game would eventually run out of buffers to send, which makes this function read from NULL. On a real PS2, this would immediately cause a hang. PCSX2 is more lax with error handling, so the game ends up sending junk to the VIF, which ultimately causes a hang later on because the transfer never finishes.
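The failure mode described above can be sketched in a few lines of Python. This is only a toy model, not the game's code: `DmaBufferList`, `timer_interrupt_handler`, and `run_ticks` are hypothetical names, and the explicit empty-list check stands in for the NULL read that the real handler performs unchecked.

```python
from collections import deque


class DmaBufferList:
    """Toy stand-in for the game's list of pending VIF DMA buffers."""

    def __init__(self, buffers):
        self._pending = deque(buffers)

    def pop(self):
        # Returns None when the list is empty, mirroring the NULL pointer
        # the real handler ends up reading.
        return self._pending.popleft() if self._pending else None


def timer_interrupt_handler(buffer_list, sent):
    """One timer tick: pop a buffer and 'send' it to the VIF."""
    buf = buffer_list.pop()
    if buf is None:
        # The real handler has no such check; this is where it reads NULL.
        raise RuntimeError("timer handler read from an empty buffer list")
    sent.append(buf)


def run_ticks(initial_buffers, ticks):
    """Fire the handler `ticks` times while nothing refills the list."""
    buffer_list = DmaBufferList(initial_buffers)
    sent = []
    for _ in range(ticks):
        timer_interrupt_handler(buffer_list, sent)
    return sent
```

Calling `run_ticks(["frame0"], 2)` raises on the second tick, which is the situation the game gets into during startup: the handler keeps firing, but nothing is left to send.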

Obviously, this doesn't happen on real hardware, so the next step is figuring out why not. And... that's where I hit a wall.

Manager Bloat

Prisoner of Azkaban is a thoroughly unpleasant reverse-engineering experience. The codebase relies heavily on C++ virtual functions, meaning that a lot of the function references in the code cannot be resolved with a static analyzer like Ghidra. Furthermore, the game is highly inefficient, containing a mess of managers, custom libraries, and redundant code. It can be surmised that even the game's source code is a disaster - manager names found in the executable, such as "SpellManager", "BloomRenderManager", "PlayerCharacterManager", "ThingManager", and so on, imply a rigid and cumbersome design.

Ghidra was designed to decompile procedural languages like C, where functions are known ahead of time, so it has a difficult time dealing with concepts like virtual tables that are initialized at runtime. Some work can be done to at least understand which virtual functions are being called, but Ghidra is currently unable to track references to them, making reversing more of a chore than it usually is.

Complaints about the code aside, the first thing I wanted to do was figure out how the game refilled its VIF DMA buffer, and why it was unable to do so. The core game loop is split into a series of tasks - the snippet of code above handles some of the rendering logic, for instance. The very last function above, DisplayManager::SendDisplayList, is what ultimately fills the DMA buffer. After a lot of reversing, I found out that during startup, the game calls a huge script. The game gets stuck inside of the scripting task for a while, making it unable to call the rendering task. After three game frames, the timer interrupt handler would access NULL, eventually crashing the game.

I thought that maybe this is a general timing issue - perhaps the game doesn't have enough time to complete the script. This is not the case, as speeding up everything except timers by a wildly unrealistic amount still causes the game to crash. PoA does absolutely bizarre things like reading a large file a single DVD sector at a time. This is like driving to a supermarket, buying a single item, driving back home and dropping it off, then repeating this 20 more times for the rest of the stuff you need! I concluded that there's no way the script can execute in time on a real PS2.

I then figured that the game must have some other way of refilling its buffers, which is in fact true. Obviously, those places don't get called on an emulator, but perhaps they do on a real PS2?

Hope Spot

I managed to contact someone who owns a TOOL, a PS2 devkit. TOOLs are rare and valuable, often being sold for hundreds of dollars, and for good reason: they are development PS2s attached to a separate computer, which has its own PS2 debugging software! Our time to play around with it was limited, but I could at least see if the game refills its buffers elsewhere.

First, we ran into a weird issue: the game calls "interrupt" variants of syscalls in non-interrupt code. This works fine on a retail PS2, but the TOOL refused to boot the game. We managed to get around this by reflashing the BIOS with one found on a retail console.

Afterwards, we placed breakpoints on all of the other functions that call the DMA code. If my guess was correct, we would then be able to work backwards and see why the code doesn't get called on emulators.

We booted up the game, and... nothing. The game went to its memory card screen without breaking. My lead was incorrect, and I had no other clues.

I spent several months on-and-off with Prisoner of Azkaban to no avail. Each time I understood the codebase a bit better, but I still had no answers.

Cross-Platform Inspiration

Recently, a thought occurred to me. Prisoner of Azkaban wasn't just released for the PS2 - it was made for a variety of platforms. The same developer was responsible for the PS2, GameCube, and Xbox ports. Maybe looking at a different version of the game would reveal some insight?

I acquired the GameCube version of the game and looked at its executable. Not too surprisingly, the GC executable has nearly identical code, save for the "driver" portion that touches the hardware and console SDK. As the game does not crash on Dolphin like it does on PCSX2, I realized that the PS2-specific code was the problem.

One late night, I stared at the VIF DMA interrupt handler, which gets called whenever a DMA transfer finishes. I had known for a while that this code was responsible for starting up the dreaded timer. PS2 timers have two different kinds of interrupts: a target interrupt when the counter reaches a specific value, and an overflow interrupt. When VIF DMA was started manually and not by the timer, the DMA interrupt handler set up the timer's target so that its interrupt would trigger 30 times a second. It was rather strange how the DMA interrupt handler was also clearing the timer's interrupt flags...
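To make the two kinds of timer interrupt concrete, here is a minimal sketch of a counter with a target flag and an overflow flag. Everything in it is an assumption for illustration: the `Timer` class, the 16-bit counter width, and the tick rate passed to `target_for_rate` are hypothetical and not taken from the game or from hardware documentation.

```python
class Timer:
    """Toy counter with separate target and overflow interrupt flags."""

    COUNTER_MAX = 0xFFFF  # assumed counter width, for illustration only

    def __init__(self, target):
        self.counter = 0
        self.target = target
        self.target_flag = False    # set when the counter reaches `target`
        self.overflow_flag = False  # set when the counter wraps around

    def tick(self):
        self.counter += 1
        if self.counter > self.COUNTER_MAX:
            self.counter = 0
            self.overflow_flag = True
        if self.counter == self.target:
            self.target_flag = True


def target_for_rate(ticks_per_second, interrupts_per_second):
    """Target value that makes the counter hit it at the requested rate."""
    return ticks_per_second // interrupts_per_second
```

With a hypothetical tick rate of 600,000 ticks per second, `target_for_rate(600_000, 30)` gives a target of 20,000, which is the kind of value a handler would program to get 30 interrupts a second. Note that both flags stay set until something clears them, which is exactly the detail that matters later.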

Wait.

Are timer interrupts edge-triggered?!

Finally!

Interrupt controllers have a mask register and a flag register. The mask is set manually by the application and means "I want these interrupts to happen". The flag is set automatically by peripherals and means "These peripherals are requesting an interrupt." When interrupts are edge-triggered, the CPU only processes an interrupt when the bitwise AND of the mask and flag goes from 0 to a non-zero value. Assuming interrupts don't have a priority system, the CPU will not be interrupted again until the previous interrupt has been acknowledged, that is, until its flag has been cleared. The downside is that if the same interrupt is requested again before the CPU acknowledges the first request, the second request is completely lost.
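The mask/flag behavior described above can be modeled roughly like this. It is a simplified sketch, not DobieStation's code: `EdgeTriggeredController` and its methods are hypothetical names, and the model ignores edges caused by changing the mask.

```python
class EdgeTriggeredController:
    """Toy interrupt controller with a mask register and a flag register."""

    def __init__(self):
        self.mask = 0       # set by the application
        self.flag = 0       # set by peripherals
        self.delivered = 0  # how many times the CPU was interrupted

    def _pending(self):
        return self.mask & self.flag

    def request(self, bit):
        """A peripheral requests an interrupt by setting its flag bit."""
        before = self._pending()
        self.flag |= bit
        # Only a 0 -> non-zero transition of (mask & flag) interrupts the CPU.
        if before == 0 and self._pending() != 0:
            self.delivered += 1

    def acknowledge(self, bit):
        """The handler clears the flag bit, re-arming the interrupt."""
        self.flag &= ~bit
```

If a peripheral calls `request(0b1)` twice without an `acknowledge(0b1)` in between, `delivered` only goes up by one: the second request finds the flag already set, produces no edge, and is lost.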

In PoA, the timer interrupt handler does not clear the timer flags, and the VIF DMA handler only clears the flags if the transfer did not come from the timer. This means that if timer interrupts are edge-triggered... a second timer interrupt after the first one would just not occur! In other words, it would be impossible for the game to run out of buffers to send because the timer can't fire any more interrupts.
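As a rough illustration of that reasoning, the sketch below replays the startup scenario under both models. It is a simplification under stated assumptions: `simulate_startup` is a hypothetical helper, the number of queued buffers is a parameter rather than a value from the game, and the rendering task is assumed to be blocked for the whole run.

```python
def simulate_startup(timer_ticks, queued_buffers, edge_triggered):
    """Count timer-handler runs while the script task blocks rendering.

    Returns (handler_runs, buffers_left, crashed).
    """
    timer_flag = False
    handler_runs = 0
    crashed = False
    for _ in range(timer_ticks):
        already_set = timer_flag
        timer_flag = True  # the timer reaches its target
        if edge_triggered and already_set:
            continue  # no 0 -> 1 edge, so this request is lost
        handler_runs += 1
        if queued_buffers == 0:
            crashed = True  # the handler would read from NULL here
            break
        queued_buffers -= 1
        # The timer handler never clears timer_flag, and the VIF DMA handler
        # leaves it alone too because this transfer was started by the timer.
    return handler_runs, queued_buffers, crashed
```

With three queued buffers and ten timer ticks, the model that fires on every tick drains the list and crashes on the fourth handler run, while the edge-triggered model runs the handler exactly once and leaves two buffers untouched.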

I modified DobieStation so that timer interrupts could not occur again if the interrupt flag was already set...

And thus, the mystery of Harry Potter was solved after months of work. 

Closing

After finding out about this bug, I notified the PCSX2 team, who quickly wrote a PR to fix it and remove PoA's patches from the database. Chamber of Secrets, another HP game made by the same company, was also affected by this bug, and so patches for it could also be removed. It may even be possible that this fixes other games, but the PR has not been up very long and has yet to be tested.

The moral of the story is that finding bugs in games is a matter of intuition. When emudevs have no idea why a game is broken, they have to make a guess, and many times this guess will be incorrect. It can be demoralizing to spend weeks or even months on a problem and still have no answers. The payoff of finding a proper solution that can easily be implemented is worth it, however.

