Friday, June 26, 2020

Music Resumed: Fixing Random Thread Crashing

The problem
Gateware's audio library runs through a few dozen unit tests to ensure it is fit to be used by developers. On Windows and Mac, all tests of the library pass consistently. On Linux they usually pass, but occasionally a run fails. The failures are random and do not always occur at the same test.

How GAudio works
GAudio uses the PulseAudio library to play sounds and music on Linux. Pulse has a threaded implementation that GAudio takes advantage of to allow for asynchronous operations. Threading is necessary here to prevent a sound from pausing the game or other sounds while it is being played. Threading can also significantly speed up a program: by breaking a problem into smaller chunks, working on each chunk on a separate thread, and then merging the results, solutions can be produced faster. In games, threading enables sounds to be played over one another while input, graphics, and AI are all being updated.

Why it might be threading issues
Threading can be challenging to program and debug. The order in which threads execute and how long they run cannot be guaranteed. As a result, a program with a threading bug can run hundreds of times before it crashes. That random, intermittent behavior matches what we see with the unit test failures.

An issue with threading
One cause of threading bugs is how the operating system schedules threads. Since the operating system can interrupt a thread at any time, no assumptions should be made about where each thread will be at a given time. This can be problematic when threads share data, for example:

  1. Thread1 and Thread2 are started. Both can access a shared variable x.
  2. Thread1 checks if x is 0. It is not, so the thread moves to the next instruction.
  3. Thread1 is interrupted by the thread scheduling system.
  4. Thread2 sets x to 0.
  5. Thread2 is interrupted.
  6. Thread1 attempts to divide 100 by x.
  7. A divide by zero error is thrown.

Normally, Thread1 dividing 100 by x would be safe since we checked whether x is 0. However, because Thread2 was able to set x to 0 after the check, we get an error. This time the program fails, but it is also possible that the next time we run it, Thread1 doesn't get interrupted until after the divide. Threads run by the will of the operating system unless we put methods in place to gain back some control.
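The sequence above is a check-then-act race. Here is a minimal C++ sketch of one common way to close that gap, assuming standard std::thread and std::mutex rather than anything from Gateware or PulseAudio. The names shared_x, safe_divide, set_x, and run_race_demo are made up for illustration. The check and the divide happen under the same lock that the writer takes, so Thread2 can no longer slip in between them.

```cpp
#include <mutex>
#include <thread>

// Hypothetical shared state standing in for the "x" in the steps above.
static int shared_x = 5;
static std::mutex x_mutex;

// Thread1's work: the zero check and the divide share one lock, so no other
// thread can change shared_x between the two.
bool safe_divide(int numerator, int& out) {
    std::lock_guard<std::mutex> guard(x_mutex);
    if (shared_x == 0) {
        return false;  // nothing safe to divide by
    }
    out = numerator / shared_x;
    return true;
}

// Thread2's work: the write takes the same lock as the reader.
void set_x(int value) {
    std::lock_guard<std::mutex> guard(x_mutex);
    shared_x = value;
}

// Starts both threads and reports whether Thread1 managed to divide.
bool run_race_demo(int& out) {
    shared_x = 5;  // reset before any thread is running
    bool divided = false;
    std::thread thread1([&divided, &out] { divided = safe_divide(100, out); });
    std::thread thread2([] { set_x(0); });
    thread1.join();
    thread2.join();
    return divided;
}
```

Which thread wins is still up to the scheduler, so run_race_demo may or may not perform the divide. What the lock removes is the third outcome: dividing by a value that changed after it was checked.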

How to fix it
The PulseAudio documentation states "[its threaded implementation] doesn't allow concurrent accesses to objects, a locking scheme must be used to guarantee safe usage." PulseAudio provides specific functions to lock and unlock around each use of Pulse objects. The easiest thing to do is lock before each Pulse object is used and unlock afterward. A thread that tries to take the lock will wait if a different thread already holds it. This system prevents threads from working with shared data at the same time. After implementing this system of locks and unlocks, all the tests passed with no crashes.
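The post doesn't name the lock functions. In PulseAudio's threaded main loop API they are pa_threaded_mainloop_lock() and pa_threaded_mainloop_unlock(), and I'm assuming those are the ones meant here. The sketch below shows one way to make "lock before, unlock after" hard to get wrong: a small scoped guard. To keep it buildable without the PulseAudio headers, FakeMainloop, fake_mainloop_lock, and fake_mainloop_unlock are stand-ins for the real type and calls, and ScopedMainloopLock is a hypothetical helper, not something from Gateware.

```cpp
#include <mutex>

// Stand-in for pa_threaded_mainloop; a real build would use the PulseAudio type.
struct FakeMainloop {
    std::mutex mutex;
    int lock_depth = 0;  // only touched while the mutex is held
};

// Stand-ins for pa_threaded_mainloop_lock() and pa_threaded_mainloop_unlock().
void fake_mainloop_lock(FakeMainloop* m) {
    m->mutex.lock();
    ++m->lock_depth;
}

void fake_mainloop_unlock(FakeMainloop* m) {
    --m->lock_depth;
    m->mutex.unlock();
}

// Hypothetical scoped guard: locks on construction and unlocks on destruction,
// so an early return cannot leave the main loop locked.
class ScopedMainloopLock {
public:
    explicit ScopedMainloopLock(FakeMainloop* m) : mainloop_(m) {
        fake_mainloop_lock(mainloop_);
    }
    ~ScopedMainloopLock() { fake_mainloop_unlock(mainloop_); }
    ScopedMainloopLock(const ScopedMainloopLock&) = delete;
    ScopedMainloopLock& operator=(const ScopedMainloopLock&) = delete;

private:
    FakeMainloop* mainloop_;
};

// Example use: work with the shared object only while the guard is alive.
int depth_while_locked(FakeMainloop* m) {
    ScopedMainloopLock guard(m);
    return m->lock_depth;  // read while locked; the guard unlocks on return
}
```

With the real library, the guard's constructor and destructor would call the PulseAudio functions directly. Plain paired lock and unlock calls work just as well; the guard only saves you from a forgotten unlock.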

How to test if the fix works
It seems like everything is fixed now, but we need to know for sure. With a random failure like this one, I could not find a way to be certain. However, there are things we can do to create enough confidence that the bug is fixed, such as:

  • Increase the number of tests.
  • Run the program many times.
  • Test on various machines.

I did this by running the tests on machines known to show the error, both physical hardware and virtual machines. I also increased the number of tests by running through all of them ten times per program run, and I ran the program 25 times per machine.
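Repeating the whole suite inside one program run can be sketched roughly like this. It is only an illustration: count_failures is a made-up helper, and each test is modeled as a callable that returns true on success, which is an assumption and not how Gateware's unit tests are actually written.

```cpp
#include <functional>
#include <vector>

// Runs every test `passes` times and returns how many individual runs failed.
int count_failures(const std::vector<std::function<bool()>>& tests, int passes) {
    int failures = 0;
    for (int pass = 0; pass < passes; ++pass) {
        for (const auto& test : tests) {
            if (!test()) {
                ++failures;  // keep going so every failure is counted
            }
        }
    }
    return failures;
}
```

Calling it with passes set to 10 mirrors the ten passes per program run described above. An intermittent bug that shows up once in a few hundred runs has a much better chance of being caught that way than in a single pass.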

Conclusion
Every test passed satisfactorily. Threading issues can be problematic, and I got lucky the solution worked out so quickly. During my research into this issue, I got a much-needed refresher on threading. I also got the sweet satisfaction of hearing the audio tests play one after the other without interruption, "Playing sound, sound resume, front, front, front, left, right, stream started..." (a few hundred times).

Friday, June 19, 2020

Watch out for Modularity

Background
A few days ago, I wrote about my encounter with the BadMatch error while running OpenGL tests on Linux. The error was caused by the context and the window not being created with compatible visuals. My fix for the issue ensured that the visuals were compatible. All the tests passed and the error disappeared. However, I broke modularity in the process.

Why modularity is important
The modularity of the Gateware framework ensures that the project is easy to maintain, understand, and use. The code is separated into logical groups, such as classes or libraries. Limiting the amount of interconnectedness between each module makes it easier for developers to understand the code, test it, and track down bugs.

How I broke it
GWindow_linux creates the window module on Linux using X11. GOpenGLSurface_linux sets up the OpenGL graphics module on a GWindow. Similarly, GVulkanSurface_linux sets up the Vulkan graphics module on a GWindow. The way the framework is designed, the graphics module knows about the window module but not the other way around. My solution introduced GLX, which is specific to OpenGL, into the window module. As a result, the window module now has code specific to the OpenGL graphics module, and modularity is broken. This means my fix for the bug is incomplete, and I have to find another way, one that preserves modularity.

Two potential solutions
Since the graphics module relies on the window module, my question became how to create or modify a window to ensure it has a 32-bit color buffer.

  • Solution 1: Creating a 32-bit color buffer window
    1. Window creation is done inside the window module. First, we get a list of visuals and search through them. This can be done by calling an X11 function called XGetVisualInfo().
    2. Determine which visual in the list has a 32-bit color buffer. This part is a problem because, in order to get the color buffer property of the visual, we need to use either a function from GLX or an X11 extension library. We already determined that using GLX inside the window module breaks modularity because it is specific to OpenGL. The X11 extension library is an additional library that needs to be installed. Ideally, we want to limit the number of steps and libraries the end-user has to install to use OpenGL with the framework.
  • Solution 2: Modify a window to have a 32-bit color buffer
    1. The ideal place to modify the window's properties would be from the graphics module. Using a reference to the window, we change its color buffer to be 32-bit. There is a function called XGetWindowAttributes() we can use to get a pointer to the window's visual. However, modifying the properties of the window fails to change its color buffer. There is also another function called XChangeWindowAttributes(), but that produces similar results.
    2. If the previous step had worked, we would then make sure the context is created with the same visual as the window.

Less than ideal solution
Unfortunately, I didn't find any other ideal solutions. The solution that ended up getting implemented and merged to the developer branch was a modified version of solution 1. The code is set up to be compiled when OpenGL is enabled; otherwise, the previous window creation code is used. Doing this prevents the OpenGL-specific code in the window module from affecting the other graphics modules like Vulkan.
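The compile-time switch described above might look something like the following sketch. EXAMPLE_ENABLE_OPENGL is a hypothetical macro name, since this post doesn't give the real one, and window_creation_path is a made-up function that only reports which branch was compiled in. The real code would create the window in each branch.

```cpp
#include <string>

// Hypothetical build flag; the real Gateware macro name is not given here.
// Uncomment (or pass -DEXAMPLE_ENABLE_OPENGL) to compile the OpenGL path.
// #define EXAMPLE_ENABLE_OPENGL

// Returns a label for the window-creation path that was compiled in.
std::string window_creation_path() {
#if defined(EXAMPLE_ENABLE_OPENGL)
    // OpenGL build: look for a visual with a 32-bit color buffer (needs GLX).
    return "32-bit visual via GLX";
#else
    // Any other build: previous behavior, use the parent window's visual.
    return "parent visual";
#endif
}
```

Because the preprocessor removes the unused branch, a Vulkan-only build never sees the GLX code at all. That is what keeps the OpenGL-specific part from affecting the other graphics modules, even though it still lives in the window module's source.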

Conclusion
While the solution wasn't desirable, it does fix this issue with the BadMatch error and 32-bit color buffer. Ultimately, it is good for developers to have stable error-free code. Even so, every effort should be made to preserve modularity. This solution will do for now, but I'll be on the lookout for a better solution as I move on to other bugs and features.

Monday, June 15, 2020

BadMatch Error: My First Linux Bug

First Some Background
My first task as a new member of the project involved setting up Gateware on Windows, Mac, and Linux. When running the Gateware unit tests on each operating system, all the tests should pass without error. However, Linux's graphics tests failed, and that is when I first encountered the BadMatch error. I am new to developing on Linux, and this bug was the perfect opportunity to sink my teeth in. The following is what I learned.


Introducing the Bug
The error message is helpful in that it provides the name of the function that generated it, glXMakeCurrent(). The GLX function produces a BadMatch error if:

  • The drawable is not created with the same screen and visual as the context.
  • The drawable is none, and the context is not null.

It doesn't take long to rule out the second cause of the error. The first cause, however, requires some background information.

X11 and GLX
X11 is the window system responsible for creating and managing windows on the version of Linux I'm using, Linux Mint. GLX is an extension of X11 that enables OpenGL to work within an X11 window. Surprisingly, I found both X11 and GLX documentation quite useful during my efforts to learn more about the error.

The Path to the Error
These are the steps Gateware takes leading up to the error:

  1. Create a display connection to the X11 server.
  2. Create a window using that connection, the root window, the depth of the root window, and the visual from the parent window.
  3. Find a config that matches the framebuffer attributes we want.
  4. Get a visual that matches that config.
  5. Create a context using the server connection and visual.
  6. glXMakeCurrent(server connection, window, context).
  7. BadMatch error.

I'll reference these steps as we go from here.

Visuals and Configs
When running "glxinfo" in the terminal, you can see a printout of the different GLX visuals and framebuffer configurations (configs) available. The visual and config must be compatible to avoid errors. On my machine, there are 180 visuals and 263 configs. Each visual has a corresponding config.



The Cause of the Error
In the path to the error, 'step 2' creates a window using the parent's visual. In 'step 4', a visual is found using a config. That visual is then used in 'step 5' to create the context. Because the window and the context passed to glXMakeCurrent() in 'step 6' were created with two different visuals, the function generates a BadMatch error.

A Fix
By choosing a config in 'step 3' that is compatible with the parent visual, the BadMatch error can be avoided. In my case, half of the GLX visuals available were compatible, allowing all of the unit tests to pass without error.
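To make the idea concrete, here is a small model of the selection in 'step 3'. It is not real GLX code: ConfigInfo and find_compatible_config are made up for illustration, and the sketch treats "compatible" as "same visual ID", which is a simplifying assumption. In a real program the visual ID for each config could be read with glXGetFBConfigAttrib() and the GLX_VISUAL_ID attribute, and the window's with XVisualIDFromVisual().

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the data GLX reports about one framebuffer config.
struct ConfigInfo {
    unsigned long visual_id;  // ID of the visual paired with this config
    int buffer_size;          // color buffer size in bits
};

// Returns the index of the first config whose visual matches the window's
// visual, or -1 if none does.
int find_compatible_config(const std::vector<ConfigInfo>& configs,
                           unsigned long window_visual_id) {
    for (std::size_t i = 0; i < configs.size(); ++i) {
        if (configs[i].visual_id == window_visual_id) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

If the function returns -1, no config matches the window's visual, and creating a context from any of them would lead straight back to the BadMatch error.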

A Better Fix
All of the visuals that were compatible only supported a 24-bit color buffer. What if we wanted to use a 32-bit color buffer? This would allow us to use 8 bits per RGBA color channel. The visual from the parent window in 'step 2' has a 24-bit color buffer. The solution is to find a visual that has a 32-bit depth and use it in 'step 2.' Then, in 'step 3', choose a compatible configuration that has a 32-bit color buffer.


Farewell BadMatch
This error was great to have as my first Linux bug. It caused me to learn more about X11 and GLX, how they function, and how Gateware uses them. I'm looking forward to my next bug and the other things I learn along the way.

Tuesday, June 9, 2020

Hello Gateware, I'm Ozzie

I'm excited to start working on this project as a Generalist Engineer! My responsibilities include fixing bugs, memory leaks, and assisting teams with framework integration. Last week, I got set up for multi-platform development and started getting familiar with Gateware. This week, I'm getting into a workflow and fixing bugs. There is still a lot more for me to understand about the project, and I'll be posting about the interesting things I'm learning along the way. Stay tuned.