After the final touches to audio, my focus has now shifted to a new and rather interesting bug involving X11 window events and, arguably more importantly, Arch Linux. This bug has shown itself to me for a little while but was largely inconsistent in how it appeared. It would show in the unit tests ~80% of the time, and the source of it seems to be unknown currently. This bug will be my last bug to fix for the project as part of FSU, however, if I can't fix this bug by the deadline, I will be continuing work to fix it after. This is largely because this is the only bug left to fix to finally add Arch Linux as a fully supported platform. This will be brief as I didn't have a lot of time to fully inspect the bug like I wanted to, but there will be enough here to mark my current progress. So, with that, let's get started.
The first thing to note is the description of the bug.
Here are the things to note:
- The bug has a visual aspect (the window will sometimes show, sometimes stay hidden, or show completely but be transparent).
- It fails specifically when the check is run to see if the window is fullscreen.
- Previous calls in the said test don't do anything.
Here is the test.
We can see on the overview that the window is initially created with a basic position (0, 0) and dimension (800, 500). It then reconfigures itself to a different position with different dimensions. Something else you might have taken notice of is the usage of `GiveWindowTimeBeforeRequiringPass`. From what I've gathered, this is here because sometimes X11 requires a bit of time to process events before it can check for what we want. So this is here to effectively wait out the processing of events to check if a function has passed properly. From here, this is where I started testing.
First off, breakpoint debugging is useless for this bug. Because if I placed a breakpoint and walked the function line by line, nothing would go wrong. Debugging would mainly be useful here if I could catch the bug failing in action, however, I'm not able to do that here. So, this leads me to test a little more conventionally. The thing I checked for immediately was timing. It could be that the code is going too quick and preventing X11 from processing everything in time. So I added small delays to check for this, however, it did not work. I also tried using the exact delays used in other tests, but the results were the same.
Interestingly enough, if I duplicated the exact same test and ran the full unit test suite, only the first test would fail, with the duplicate passing. Odd behavior.
From the discord messages above (which are posted in the Gateware Development channel), I described my difficulties with this bug as well as what I feel the bug is.
I feel this bug is a timing bug. Basically, one thread is doing things so quickly that the other thread can't keep up in time. This would explain the odd behavior, and why the test doesn't always fail. Of course, there is more testing and inspecting of the code that will be needed, but in total, this seems to be the main issue. So now, what has to happen, is we have to specialize the probes for this bug to target the timing over the functionality. If I can pinpoint that it is timing-related, then I focus my efforts on fixing that. However, if this isn't the case, I will have to inspect further to see what is exactly happening.
Currently, this is all that I have done for this bug. Next week should hopefully go into a deeper discussion on how I found the fix for it. But for now, this is all.