The problem
Gateware's audio library runs through a few dozen unit tests to ensure it is fit to be used by developers. On Windows, Mac, and Linux, the tests normally all pass. Occasionally, however, the Linux run fails. The failures are random and do not always occur at the same test.
How GAudio works
GAudio uses the PulseAudio library to play sounds and music on Linux. Pulse provides a threaded implementation that GAudio uses for asynchronous operations. Threading is necessary here to prevent a sound from pausing the game or other sounds while it plays. Threading can also significantly speed up a program's operations. By breaking a problem into smaller chunks, working on each chunk on a separate thread, and then merging the results, solutions can be produced faster. In games, threading enables sounds to be played over one another while input, graphics, and AI are all being updated.
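The chunk, work, and merge idea can be sketched in a few lines of standard C++. This is only an illustration of the general technique, not code from GAudio or PulseAudio; the function name chunked_sum and its parameters are made up for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` by giving each thread its own chunk.
long long chunked_sum(const std::vector<int>& data, std::size_t thread_count) {
    if (thread_count == 0) thread_count = 1;

    // One slot per thread, so no two threads ever write to the same element.
    std::vector<long long> partial(thread_count, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + thread_count - 1) / thread_count;

    for (std::size_t i = 0; i < thread_count; ++i) {
        const std::size_t begin = std::min(i * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        workers.emplace_back([&data, &partial, i, begin, end] {
            partial[i] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0LL);
        });
    }

    // Wait for every chunk, then merge the partial results.
    for (auto& worker : workers) worker.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Because each thread writes only to its own slot, this version needs no locking. The trouble described in the rest of this post starts when threads share the same data.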
Why it might be threading issues
Threading can be challenging to program and debug. The order in which threads execute and how long they run cannot be guaranteed. As a result, a program with a threading bug can run hundreds of times before it crashes. That intermittent nature matches what we see with the unit test failures.
An issue with threading
One cause of threading bugs is how the operating system schedules threads. Since the operating system can interrupt a thread at any time, no assumptions should be made about where each thread will be at a given moment. This can be problematic when threads share data, for example:
- Thread1 and thread2 are started.
- Thread1 checks if x is 0, it's not, and the thread moves to the next instruction.
- Thread1 is interrupted by the thread scheduling system.
- Thread2 sets x to 0.
- Thread2 is interrupted.
- Thread1 attempts to divide 100 by x.
- A divide by zero error is thrown.
Normally, Thread1 dividing 100 by x would be safe since we checked whether x is 0. However, because Thread2 was able to set x to 0 after the check, we get an error. It is just as possible that the next time we run the program, Thread1 doesn't get interrupted until after the divide, and nothing goes wrong. Threads run by the will of the operating system unless we put methods in place to gain back some control.
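Here is a minimal C++ sketch of that interleaving. A real scheduler would only produce it occasionally, so the sketch forces the order with two std::promise signals that stand in for the interruptions. It also records the stale check instead of actually dividing by zero. The names are hypothetical and none of this is GAudio code.

```cpp
#include <atomic>
#include <future>
#include <iostream>
#include <thread>

// Returns true if x was changed to 0 between Thread1's check and its use.
bool check_went_stale() {
    std::atomic<int> x{5};
    std::promise<void> checked;  // Thread1 tells Thread2: "my check is done"
    std::promise<void> zeroed;   // Thread2 tells Thread1: "x is now 0"
    bool stale = false;

    std::thread thread1([&] {
        if (x.load() != 0) {              // the check passes
            checked.set_value();          // stand-in for being interrupted here
            zeroed.get_future().wait();   // Thread2 runs in the gap
            const int divisor = x.load();
            if (divisor == 0) {
                stale = true;             // a real 100 / x here would fault
            } else {
                std::cout << 100 / divisor << "\n";
            }
        }
    });

    std::thread thread2([&] {
        checked.get_future().wait();      // wait until Thread1 has checked
        x.store(0);
        zeroed.set_value();
    });

    thread1.join();
    thread2.join();
    return stale;
}
```

With the order forced like this, the check always goes stale. Without the signals, the same code would usually work, which is exactly what makes the bug hard to catch.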
How to fix it
The PulseAudio documentation states "[its threaded implementation] doesn't allow concurrent accesses to objects, a locking scheme must be used to guarantee safe usage." PulseAudio provides specific functions to lock and unlock Pulse objects between uses. The simplest approach is to lock each Pulse object before it is used and unlock it afterward. A thread that tries to lock an object already locked by another thread waits until the lock is released. This prevents threads from working with shared data at the same time. After implementing this system of locks and unlocks, all the tests passed with no crashes.
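The same lock-before-use pattern can be sketched with a plain std::mutex, applied to the divide-by-zero example from earlier. PulseAudio's threaded main loop documents pa_threaded_mainloop_lock() and pa_threaded_mainloop_unlock() for this purpose; the sketch uses std::mutex as a stand-in so that it runs without PulseAudio. It is an assumption about the general shape of the fix, not the actual GAudio change.

```cpp
#include <mutex>
#include <thread>

// Hypothetical demo: the check and the divide happen under one lock,
// so the other thread cannot change x in between.
int locked_divide() {
    int x = 5;
    int result = -1;        // -1 means "divide was skipped"
    std::mutex x_lock;      // guards both x and result

    std::thread thread1([&] {
        std::lock_guard<std::mutex> guard(x_lock);  // lock before use
        if (x != 0) {
            result = 100 / x;                       // x cannot change here
        }
    });                                             // unlocked when guard is destroyed

    std::thread thread2([&] {
        std::lock_guard<std::mutex> guard(x_lock);  // waits if thread1 holds the lock
        x = 0;
    });

    thread1.join();
    thread2.join();
    return result;
}
```

Depending on which thread gets the lock first, the function returns either 20 or -1. Both outcomes are safe. The order is still up to the operating system, but the check and the divide can no longer be split apart.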
How to test if the fix works
It seems like everything is fixed now, but we need to know for sure. I could not find a way to be completely certain. However, there are things we can do to build enough confidence that the bug is fixed, such as:
- Increase the number of tests.
- Run the program many times.
- Test on various machines.
I did this by running the tests on machines known to show the error, both physical hardware and virtual machines. I also increased the number of tests by running through the full suite ten times per program run, and I ran the program 25 times per machine.
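Repeating the suite inside one program run can be as simple as a loop around the test list. The sketch below is a generic illustration with made-up names (RepeatResult, run_repeated). It assumes each test is a function returning true on success, which may not match how Gateware's test framework is actually organized.

```cpp
#include <functional>
#include <vector>

// Hypothetical tally of how many test runs happened and how many failed.
struct RepeatResult {
    int runs = 0;
    int failures = 0;
};

// Runs every test in `tests` once per pass, for `passes` passes.
RepeatResult run_repeated(const std::vector<std::function<bool()>>& tests,
                          int passes) {
    RepeatResult result;
    for (int pass = 0; pass < passes; ++pass) {
        for (const auto& test : tests) {
            ++result.runs;
            if (!test()) {
                ++result.failures;  // keep going so every failure is counted
            }
        }
    }
    return result;
}
```

Calling run_repeated(tests, 10) corresponds to the ten passes per program run described above. Running the program 25 times per machine would still be done from outside, for example with a shell loop.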
Conclusion
Every test passed satisfactorily. Threading issues can be problematic, and I got lucky the solution worked out so quickly. During my research into this issue, I got a much-needed refresher on threading. I also got the sweet satisfaction of hearing the audio tests play one after the other without interruption, "Playing sound, sound resume, front, front, front, left, right, stream started..." (a few hundred times).