Friday, May 3, 2024

Month 5 - Week 4 - Fixing Arch Linux X11 Bug

 Last week we touched upon a bug that was occurring with X11 for Arch Linux. The bug was specifically when creating a window for a test. The window would fail to create and now show up at all, failing the test. I was only modifying the unit tests at the time but took time this week to look more deeply into the actual code to see what could be going wrong.

One of the first things I decided to do was enable the macros inside of GWindow to see the X11 debug output. This would allow me to watch in real time what was failing, and better help towards finding the problem child in the code.


As you can see, the errors being reported are talking about a bad window. This normally means that the pixel map for the internal window is bad, or the window itself failed to create and can’t have anything done to it. I knew immediately the first place to check was the create function for GWindow.


Here we see two functions are called. The first function is just initializing some basic fields, the second function, however, is the one we’re interested in. `OpenWindow()` is the function responsible for creating the window and initializing all the information that GWindow needs to properly manage it.

Remember, we can’t debug, or the error won’t occur. So, instead, I decided to inspect some documentation and play around with the code, probing it to see which function was giving problems. Upon inspection, I noticed that anytime I messed with the `XInitThreads()` function, it had a change in the functionality itself. It seems this might be the problem child I mentioned earlier. I decided to put a `std::this_thread::sleep_for()` call right after calling for the initialization of threads. The sleep time was small, only 10ms, however, doing just this resulted in all of my tests passing consistently.


I found the bug that was occurring. So, how do I fix this then? Well, the moment I discovered this, I made a commit with the sleep call change as the fix. However, using this kind of call to fix the bug isn’t exactly good. Some issues can occur. For example:

  • How would this code react on faster or slower hardware? We can’t be certain that this sleep fix will work for all systems, it might even only work on mine.
  • Is this truly a threading issue, or a mutex issue? The display for X11 is multithreading capable, so if I don’t lock it, it could be other X11 functions are modifying or trying to modify the variable I’m reading immediately.
  • Lastly, what if this fix is temporary like the event system bug months prior? This is a real concern; it would not be the first time I “fixed” a bug only to have it come back spontaneously because I wasn’t doing the idiomatic approach expected.

So, what I have to do is test for different solutions and see what I can find. This was, unfortunately, all I could do for now. I wanted to do more but was constrained on time with due dates and graduation fast approaching. While this is going to be my last blog post, it will not be the last of the changes I make to Gateware. I have full intention of continuing work on Gateware to not only fix this bug, but also to see what I can do to fix, implement, and update features for the project. This is in the hope that I learn more from it over time, that and I simply enjoy working at a low level. So, with that, this is my last blog post.