Again, a lot of questions, and I’m afraid I’m not able to answer all of them in detail, but let me try to give you some information that might help you gain further understanding and continue your own research. In the end, this performance optimisation topic is highly complex, and even as a developer it’s not always obvious why a system behaves the way it does and which of multiple optimisation strategies is the best one.
Processes and Threads
I think it’s important to clarify a bit what processes and threads are and how they are managed by an operating system. Both are abstractions of parallelism.
A process is basically one instance of some software running on a system. E.g. if a web browser and a text editor are running at the same time, those are two processes being executed in parallel. By default, these two pieces of software have no direct interconnection. An operating system tries to distribute time slices of the CPU processing time between all running processes, letting each of them run for a short timeframe and then halting execution of one process to let another process continue its work, giving the user the impression of fully parallel execution. So processes abstract system-level parallelism.
Threads are a way to implement process-level parallelism. By default, when a new process starts, there is exactly one thread that is created with the process and that runs the process’s main function. In the case of a UI-based process, this main thread usually becomes the system’s message thread, which means that after general application setup, this thread hooks into the operating system’s messaging system and then e.g. receives system events like mouse and keyboard inputs and runs all CPU-based UI rendering there. If the application needs parallel actions to happen, it usually starts an extra thread from within the application.

A good example is a download, which means talking to some network interface, waiting for data to arrive, processing a bit of data, waiting for more data and, once all data is there, maybe post-processing that data and writing it to a file. In theory this could be done on the main thread, but that would mean that while waiting for the next few bytes to arrive via network we could not react to mouse events, so the UI of the app would appear frozen. So we start a new thread from within the application and handle the relevant network I/O on that thread. Requesting data from the network works via system calls, i.e. functionality from the operating system that can be accessed from within an application. If the operating system figures out that no data will arrive soon because of a slow internet connection, it will probably stop execution of the thread serving the download for a moment and use the free resources for some other process. When new data arrives on the network, it will re-assign the sleeping thread to a CPU core and let it continue working on the received data, either until it has to wait for new data again or until there are other threads from other processes that would also be ready to do work. Then the scheduler might also decide to just halt execution of that download thread for a few milliseconds and let another thread use that CPU core to do some work, before it might resume the download thread or assign the core to another thread that is also waiting.
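To make that a bit more concrete, here is a minimal C++ sketch of moving such blocking work off the main thread. The fetchNextChunk function is a made-up stand-in for real network I/O, not an actual API:

```cpp
#include <chrono>
#include <iostream>
#include <thread>

// Hypothetical stand-in for real network I/O: pretend each chunk takes a while to arrive.
bool fetchNextChunk (int chunkIndex)
{
    std::this_thread::sleep_for (std::chrono::milliseconds (200)); // simulates waiting on the network
    return chunkIndex < 5;                                         // pretend the download has 5 chunks
}

int main()
{
    // Start the "download" on its own thread so the main thread can keep pumping UI events.
    std::thread downloadThread ([]
    {
        int chunk = 0;
        while (fetchNextChunk (chunk))
            std::cout << "received chunk " << chunk++ << '\n';
        std::cout << "download finished\n";
    });

    // The main thread is free to do other work here (in a real app: run the message loop).
    std::cout << "main thread stays responsive while the download runs\n";

    downloadThread.join(); // wait for the background work before shutting down
}
```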
One way the operating system decides how to prioritise the work of one thread over another is via thread priority. This is a value that the process that starts the thread assigns to it, and it communicates to the operating system how important the work executed on that thread is. The download thread from the example above should be created with a medium-low priority, simply because a download is slow and not time-critical compared to reacting to user interaction via the UI. The main thread will therefore usually have a medium-high priority. The audio processing threads of an audio application should have the highest priority, so that the system will always prioritise them over other threads, e.g. also over the main thread, because a stuttering UI is a lot less critical than stuttering audio. This prioritisation works well when the system is not under full load, but it gets more and more difficult as the system load rises. Especially with a lot of audio processing, which means a lot of load on high-priority threads, the system needs to halt even the highest-priority threads from time to time to handle system events etc. With macOS audio workgroups, the system’s scheduler is capable of making more informed decisions, as it knows that certain threads have to meet a deadline in order to avoid audio stuttering, but even then, when the system load is too high, the scheduler might have no option but to make bad decisions. So better performance of one macOS audio application over another in terms of robustness against interruption from UI events might be caused by better usage of the relatively new audio workgroup API in one application, but this is highly theoretical. I’m explicitly not stating that the dropouts after UI interaction that you notice with the Waves system are caused by poor audio workgroup usage in the application; I just don’t know anything about the internals of this application.
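As a rough illustration of how an application communicates such a priority to the operating system, here is a small C++ sketch using the plain POSIX API. Real macOS audio applications typically rely on time-constraint policies or the audio workgroup API instead, and the call may fail without sufficient privileges, so treat this only as a sketch of the general idea:

```cpp
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// The function that would run on the high-priority thread (here it just prints once).
void* audioWorker (void*)
{
    std::puts ("pretend this thread is doing time-critical audio work");
    return nullptr;
}

int main()
{
    pthread_t thread;
    pthread_create (&thread, nullptr, audioWorker, nullptr);

    // Ask the OS to treat this thread as (near) real-time. This is the portable POSIX way of
    // expressing "this thread's work is more urgent than others". Error handling omitted:
    // depending on the system, elevating to SCHED_FIFO may require special privileges.
    sched_param param {};
    param.sched_priority = sched_get_priority_max (SCHED_FIFO);
    pthread_setschedparam (thread, SCHED_FIFO, &param);

    pthread_join (thread, nullptr);
}
```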
But there is one more interesting aspect to consider, and that is:
Synchronization
Let’s stick to the example of our 96-channel audio host that runs 16 audio threads. Only one of these threads will actually wait on a system call from the audio driver. Just as the download thread described above has to wait for new data to arrive from the network, the main audio thread has to wait until the audio driver signals that e.g. 32 samples of audio for all 96 input channels have now been copied into a certain memory location that the application can read from. Only once this has happened can the app’s main audio thread start its computation on the samples. If the app had only one audio thread, that thread would work on the data and eventually write the output samples to a prepared memory location. Once it is done with that, it informs the audio driver, which then takes over and makes sure to push the new samples out to the soundcard via whatever hardware connection it’s connected with, and reads new samples back from that hardware connection into some memory accessible by the application. So we can identify an inter-process synchronisation point here. The system’s scheduler can react to that by assigning CPU resources to either the audio driver process or the audio application process each time these events happen, as it’s capable of tracking such system events.
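In code, this hand-over between driver and application usually takes the shape of a callback. The following signature is made up, but loosely modelled after what real audio APIs look like:

```cpp
#include <cstddef>

// A hypothetical callback signature, loosely modelled after real audio APIs (CoreAudio, ASIO, ...):
// the driver invokes this on the main audio thread once new input samples are available and expects
// the output buffers to be filled before the function returns.
void audioDeviceCallback (const float* const* inputChannels,   // 96 input channels in our example
                          float* const*       outputChannels,
                          int                 numChannels,
                          int                 numSamples)      // e.g. 32 samples per block
{
    // All processing for this block has to happen inside this call. Once we return, the driver
    // takes over again, pushes the samples out to the hardware and only calls us back when the
    // next block of input samples has arrived.
    for (int ch = 0; ch < numChannels; ++ch)
        for (int i = 0; i < numSamples; ++i)
            outputChannels[ch][i] = inputChannels[ch][i];       // trivial pass-through as a placeholder
}
```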
But what about the 15 other threads in the audio app? These threads also need some synchronisation: they basically need to wait for the main audio thread to start processing the new block of samples, then, following some distribution strategy, start working on a subset of the processing work and contribute their part to the final processed block of multichannel audio. The main audio thread will also do some processing, but then has to wait for all other processing threads to finish their work before handing over to the audio driver again. So in case other threads take longer than the main thread, it needs a way to wait on another thread. And this might be a more complex problem than it seems at first.

For thread synchronisation, there are system calls that tell the operating system “I want to pause here and resume working once another thread has reached a certain point”, and in general programming, using these system calls is considered best practice, because it helps to keep a system reactive. Whenever the system gets the information that one thread is currently waiting for another thread, it will usually take the opportunity to execute some other work in between and switch back to the waiting thread afterwards, once the condition it waited for is fulfilled. However, in audio software, we sometimes try to actively avoid using such system calls, because we might get scheduled back a bit too late. Especially if the main audio thread assumes that the other thread should be ready very soon, it might decide to rather switch to busy spinning, which means constantly checking whether the condition it’s waiting for is already true instead of telling the system to inform it once that event has happened. To the operating system this cannot be distinguished from actual useful processing work, so it looks like the thread is doing heavy work, which might even be beneficial, since the scheduler gets the impression that this thread should really not be blocked in order to get the audio processing done. And so the application might choose to burn CPU power with pointless checks, just in order not to miss any audio processing deadline. Of course, busy spinning is also a waste of resources, so this is a really tricky optimisation field, and good thread synchronisation is a genuinely hard problem.
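Here is a small, simplified C++ sketch of the two waiting strategies. The names are made up, and real audio engines often combine both approaches, e.g. spin briefly and then fall back to a blocking wait:

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

std::atomic<bool>       workerDone { false };
std::mutex              doneMutex;
std::condition_variable doneCondition;

// Option 1: the "textbook" way. Tell the OS we are waiting, so it can schedule other work on
// this core. The downside for audio: we might be woken up later than we would like.
void waitBlocking()
{
    std::unique_lock<std::mutex> lock (doneMutex);
    doneCondition.wait (lock, [] { return workerDone.load(); });
}

// Option 2: busy spinning. To the OS this looks like real work, so the thread is less likely to
// be descheduled, at the price of burning CPU cycles doing nothing useful.
void waitSpinning()
{
    while (! workerDone.load (std::memory_order_acquire))
        ; // keep checking; real implementations often add a CPU pause/yield hint here
}

// The worker thread calls this once its share of the block has been processed.
void signalDone()
{
    {
        std::lock_guard<std::mutex> lock (doneMutex);
        workerDone.store (true);
    }
    doneCondition.notify_one();
}
```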
A lot of general-purpose software libraries that are used for high-throughput number crunching, and that applications and plugins might use to speed up their processing, are not optimised in this regard. This is especially the case for highly complex AI-based computations: there are a lot of building blocks out there these days that will make AI algorithms run fast, but nearly none of them is optimised in terms of lock-free audio processing. So if you come across plugins that use AI for audio processing and show occasional load spikes, this might be caused by the use of general-purpose libraries in those products. I have identified such cases in the wild.
That brings me to a last point
Why is AI inference in audio plugins mostly CPU based?
Using external hardware like GPUs or the Apple Neural Engine requires synchronisation points again. The software interfaces that are available to access these units always follow more or less the pattern outlined above when talking to e.g. the audio driver or the network stack. If we choose to wait for e.g. our GPU to finish our sample processing in time, we have to compete with other applications using the GPU. So, theoretically, plugins that use the GPU for AI inference could suddenly be interrupted by opening the UI of a plugin that uses GPU-based UI rendering. And the scheduling of GPU access might follow completely different rules compared to the thread scheduling discussed above. Memory access latency is an issue here as well. Still, there are companies that are actively working on these issues and there seem to be solutions, but this is extremely niche knowledge and therefore not feasible for most plugin manufacturers. The Apple Neural Engine itself is interesting, but it cannot even be accessed directly the way a GPU can. Instead, you need an abstract description of your neural network, which you pass on to the Apple CoreML API, which then decides whether to run it on the Neural Engine, the GPU or the CPU. So you hand over a big amount of control to a system library, which might perform all kinds of multithreading and uncontrollable synchronisation internally, which is usually something you want to avoid at all costs for the reasons outlined above. So we often end up using the CPU as the most predictable execution unit for AI inference workloads in audio apps today, even if it’s not the most efficient hardware unit for the job.
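To illustrate why that synchronisation point is the core problem, here is a small C++ sketch that uses std::async/std::future merely as a stand-in for a real GPU or Neural Engine dispatch API; the function names are hypothetical:

```cpp
#include <chrono>
#include <future>
#include <vector>

// Stand-in for handing a block of samples to an accelerator (GPU / Neural Engine).
// The real APIs differ, but the shape is the same: enqueue work, then wait for a result.
std::future<std::vector<float>> enqueueInference (std::vector<float> samples)
{
    return std::async (std::launch::async, [s = std::move (samples)] () mutable
    {
        // ... imagine the actual neural network inference happening on another device ...
        return s;
    });
}

void processBlock (std::vector<float>& samples, std::chrono::microseconds deadline)
{
    auto result = enqueueInference (samples);

    // This is the problematic synchronisation point: if the accelerator is busy with another
    // application's work, the result may not be ready before our audio deadline.
    if (result.wait_for (deadline) == std::future_status::ready)
        samples = result.get();
    // else: we missed the deadline and have to output silence, the dry signal, or stale data.
}
```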
Final words
I think the level of detail that we reach here quickly gets a bit too much for a place like this, and unfortunately I’m not able to continue this conversation at this level of detail over a longer period of time. If you are really interested in the deep details of how audio applications work, you will probably need to start programming yourself and learn how it all works together behind the scenes. If you are interested in a behind-the-scenes look, I can recommend the Audio Developer Conference YouTube channel, where you’ll find a wide range of talks from all kinds of people who are working on audio software development.