I have a big question about VST3 processing systems

I have a question. It’s a fairly detailed one.

I would like to better understand the performance differences between Fourier’s processing approach, VST3 processing on Windows PCs, and VST3 processing on macOS, as well as the implications of how much access the operating system provides for plugin processing. In particular, I’m interested in the challenges faced by VST3 plugin developers and host application developers when working on an OS like macOS, which is relatively closed, compared to systems that are more open at the OS level.

I have been using the Fourier system since the very early beta stage, having received the hardware before the official release, and I’ve continued using it since day one of launch. I find it to be an excellent system, and I am very satisfied with the continuous software updates. I also feel that I have a reasonable understanding of its unique plugin processing architecture.

Recently, I’ve been running Waves Performer on an M3 Ultra 32-core Mac Studio, combined with a DAD CORE 256, and I’m able to use a very large number of VST3 plugins with extremely low latency. I’m inserting up to 8 plugins per slot across 64 Performer slots, processing 96 MADI channels at 96 kHz with buffer sizes of 64–96 samples. Apart from occasional, unexplained momentary CPU spikes, the system runs without any real issues. Overall latency is consistently in the 1.32–1.99 ms range.

Lately, I’ve been spending a lot of time studying defeedback plugins, which have become a hot topic. While I’m not a developer and therefore don’t know the exact nature of the computations these plugins require, the M3 Ultra system is currently handling more than 10 instances of plugins that are known to be very CPU intensive, and hundreds of plugins in total, all under 2 ms of latency. Plugin developers are also actively maintaining and updating their products.

However, at the same time, some developers state that they can no longer continue development due to macOS’s increasingly restrictive policies. From a user’s perspective, this naturally raises the question: are other developers somehow achieving what is supposedly “impossible”?

With Fourier as well, when appropriate plugins are chosen and used correctly, VST3 plugins can be run very comfortably and stably with no major issues. The relatively long I/O latency is really the only downside; aside from that, plugin compatibility testing and updates are extremely fast. This level of support is something that is hard to find in other systems.

That said, one common issue across both systems is the occurrence of unexplained CPU spikes. Given the large number of plugins involved and the extremely low-latency processing requirements, this feels like a genuinely hard-to-trace problem.

Could you provide more detailed insight into the causes of these momentary CPU spikes, as well as the differences and challenges involved in developing VST3 hosts and managing plugin processing on Windows versus macOS? After more than a year of building and operating these systems, I’ve come to believe that users need to understand the overall architecture of VST3 systems, not just how to “run plugins”, in order to achieve maximum stability.

I already have a general understanding of this. On Windows-based systems, I know that VST3 host applications can directly control and assign specific CPU cores, preventing other cores, such as efficiency cores, from interfering, and can be customized to use exactly the cores they want. This is why systems like NUC PCs, while not particularly high-spec on paper, can still exist as highly purpose-built and optimized solutions.

On macOS, however, due to Apple’s policies, this kind of direct control is largely not possible. As a result, processing has to be implemented through various workarounds. This can lead to situations where processing initially runs on performance cores and then, at some point, gets migrated to efficiency cores, causing issues such as unexpected performance drops or the inability to fully utilize the desired number of CPU cores.

That said, Waves Performer actually makes excellent use of 26 performance cores, so from that perspective it performs extremely well. Nevertheless, since I am not a developer, I have been trying for quite a long time to find a solution to the momentary CPU spike issue, at times even leveraging AI tools in an effort to better understand and resolve the problem.

Plugin Developer and transform engine user here :wink:

You are asking interesting questions here, and indeed, to answer them in detail you really have to get a fairly deep understanding of the low-level details.

As you already pointed out in your question, a key factor when building efficient software for modern CPUs is parallelization of workloads. A modern CPU has multiple cores that can run simultaneous workloads; on modern hardware like the Apple M-series CPUs, these cores are asymmetrical in their general processing power, and the clock speed is even adjustable. On top of that, CPUs have parallel execution units at the core level that allow applying the same processing steps to multiple data elements in parallel (SIMD).
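Just to make the SIMD idea tangible, here’s a minimal sketch of a gain loop written with x86 SSE intrinsics, where one instruction multiplies four samples at once (on Apple Silicon the same idea would be expressed with NEON intrinsics instead):

```cpp
// Minimal SIMD illustration (x86 SSE): apply one gain to four samples per instruction.
#include <immintrin.h>
#include <cstddef>

void apply_gain_simd(float* samples, std::size_t count, float gain)
{
    const __m128 g = _mm_set1_ps(gain);                // broadcast gain into all 4 lanes
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4)
    {
        __m128 v = _mm_loadu_ps(samples + i);          // load 4 samples
        _mm_storeu_ps(samples + i, _mm_mul_ps(v, g));  // multiply and store 4 at once
    }
    for (; i < count; ++i)                             // scalar tail for the leftovers
        samples[i] *= gain;
}
```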

Now if we want to use this hardware to process audio, we can think of multiple parallelization approaches that can take place at different levels. The most important one is multi-threading, and I think it’s good to start by sharing some information on that topic first.

Host Level
A simple approach for a plugin host is to distribute multiple plugin chains across multiple cores. E.g. processing 96 channels with one plugin chain per channel on a 16-core CPU results in each CPU core handling 6 plugin chains. All serious plugin hosts do at least that. In most cases, this works by creating as many threads as you need for your processing, telling the operating system that you need the highest priority for these threads, and then taking care of distributing the workload evenly across them.
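As a deliberately naive sketch of that idea (PluginChain and process_block are made-up names, and a real host would keep a persistent, priority-boosted thread pool instead of spawning threads per block):

```cpp
// Naive host-level distribution: one worker per core, each taking a fixed
// slice of the plugin chains. For illustration only.
#include <cstddef>
#include <thread>
#include <vector>

struct PluginChain { void process(float** buffers, int numSamples); };

void process_block(std::vector<PluginChain*>& chains, float** buffers, int numSamples)
{
    const unsigned numCores = std::thread::hardware_concurrency();
    std::vector<std::thread> workers;

    for (unsigned core = 0; core < numCores; ++core)
    {
        // Static distribution: worker N handles chains N, N+numCores, N+2*numCores, ...
        workers.emplace_back([&chains, buffers, numSamples, core, numCores]
        {
            for (std::size_t i = core; i < chains.size(); i += numCores)
                chains[i]->process(buffers, numSamples);
        });
        // A real host would also raise the worker's priority here, e.g. via
        // pthread_setschedparam or SetThreadPriority.
    }
    for (auto& w : workers)
        w.join();   // all chains must finish before the block goes to the driver
}
```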
However, while this seems straightforward, a big problem is achieving an actually even distribution. The assumption that distributing the 96 plugin chains to the 16 cores in the example above is best solved by just segmenting the 96 chains into blocks of 6 and handing them to the available cores is probably not true in reality, at least when your chains contain different plugins. There might be plugins in the chains that are relatively lightweight processing-wise and others that take more resources. And there is no way of figuring that out except for running the plugins: there is no standardised way to ask a plugin ahead of time how many resources it will take. So plugin host load balancing can only ever react to observations and to predictions derived from observed behaviour.

One thing that makes this even more challenging is that plugins might not even have a stable resource consumption over time. Especially plugins that use FFT-based algorithms might require something like 1024 samples for each processing step. Running a plugin like that at a 32-sample block size means that 31 out of every 32 processing callbacks consist of nothing but pushing new samples into a plugin-internal input buffer and pulling previously processed samples out of another internal buffer, which is a very cheap operation. Only on the 32nd callback are enough samples buffered to run the actual processing, leading to a much higher resource consumption for that single call, so the host suddenly sees a load spike. While this is periodic, you can assume that it is relatively complex to predict, and therefore hosts are often rather conservative about their load balancing strategies, leaving a lot of headroom for occasional load spikes.
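Here’s a hedged sketch of that internal-buffering pattern (names are made up, and it assumes the host block size divides the FFT size evenly):

```cpp
// 31 out of 32 callbacks just move samples; the 32nd runs the expensive step.
#include <cstring>

class FftStylePlugin
{
    static constexpr int kFftSize = 1024;
    float inputBuffer[kFftSize]  = {};
    float outputBuffer[kFftSize] = {};
    int   buffered = 0;

    void runExpensiveFftProcessing();       // the actual algorithm, run rarely

public:
    void process(const float* in, float* out, int numSamples)   // e.g. 32 samples
    {
        // Cheap path: push input into the internal buffer, pull precomputed output.
        std::memcpy(inputBuffer + buffered, in, numSamples * sizeof(float));
        std::memcpy(out, outputBuffer + buffered, numSamples * sizeof(float));
        buffered += numSamples;

        if (buffered == kFftSize)           // every 32nd callback at 32-sample blocks
        {
            runExpensiveFftProcessing();    // CPU cost is concentrated in this one call,
            buffered = 0;                   // so the host sees a periodic load spike
        }
    }
};
```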

Plugin Level
As already sketched above, a plugin is called by the host to process a block of samples. The plugin has to use the host-supplied thread that calls into the plugin to perform the processing, and it had better do that as quickly and efficiently as possible. Well-crafted plugins often optimise the actual processing by using the mentioned SIMD processing to e.g. process multiple audio channels in parallel on the same CPU core. But the time box for the processing is limited. If the processing takes too long, it won’t be ready at the time when the soundcard driver needs to push out the samples, and you will notice audio dropouts. Now if a plugin manufacturer wants to implement an algorithm that won’t make it within the given time box, but finds tricks to split the computation behind the algorithm into multiple independent parts, the plugin can just start extra high-priority threads of its own and distribute audio to these threads as soon as its processing function is called by the host, making the plugin use multiple CPU cores in parallel.
AI-based defeedback plugins tend to do exactly that; at least one famous example that I inspected a while ago did.
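The fork/join pattern inside the process callback looks roughly like this (HeavyPlugin and processHalf are made-up names, and a real plugin would use persistent worker threads with lock-free signalling rather than spawning a std::thread per block):

```cpp
// A plugin splitting its own work across an extra thread inside process().
#include <thread>

struct HeavyPlugin
{
    void processHalf(float* data, int numSamples);  // one independent half of the work

    void process(float* left, float* right, int numSamples)
    {
        // Fork: push half of the computation onto another core...
        std::thread helper([&] { processHalf(right, numSamples); });

        processHalf(left, numSamples);  // ...while the host-supplied thread does the rest.

        helper.join();                  // Join before returning: the host must get a
                                        // fully processed block.
    }
};
```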

Challenges
As soon as both plugins and hosts start launching their own processing threads, things can get messy quickly. A host application can no longer assume that it is the only instance running high-priority threads, and the predictions required for the workload distribution sketched out above will likely no longer be accurate: once more high-priority threads are requested on a system than there are cores available, the OS thread scheduler will pause some threads to give other threads a time slot.

The best way around this would be, instead of the plugin just creating extra threads on its own, to let the plugin ask the host for extra processing threads. To my knowledge, the relatively new CLAP plugin standard is the only one that allows this, as an optional feature, and both the host and the plugin in question have to support it. VST3 simply has no standardised way of managing this, so there is kind of a Wild West situation at the moment. I hope and expect this to be an area of further development in the near future, as there are only winners with such a development.
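For reference, this is roughly what that cooperative model looks like with CLAP’s thread-pool extension (clap/ext/thread-pool.h); this is simplified from memory, so check the actual CLAP headers for the exact interface:

```cpp
// CLAP thread-pool sketch: the plugin asks the host to run N tasks on the
// host's own pool instead of creating threads behind the host's back.
#include <clap/clap.h>

// Plugin side: called by the host's pool workers, once per task index.
static void my_thread_pool_exec(const clap_plugin_t* plugin, uint32_t task_index)
{
    // Process one independent slice of the work, e.g. one channel group.
}

static const clap_plugin_thread_pool_t my_thread_pool = { my_thread_pool_exec };

// Inside the plugin's process() callback:
//   auto* pool = static_cast<const clap_host_thread_pool_t*>(
//       host->get_extension(host, CLAP_EXT_THREAD_POOL));
//   if (pool && pool->request_exec(host, num_tasks))
//       { /* the host ran all num_tasks via my_thread_pool_exec */ }
//   else
//       { /* fall back to single-threaded processing */ }
```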

So for the time being, we are at the mercy of the operating system’s thread scheduler. Optimising a system’s thread scheduler is a highly complex topic, and macOS and Windows take different routes here.

On macOS, the scheduler has a specific concept of threads that contribute to the processing of realtime audio. Audio applications can use the Audio Workgroup API to give the system more information about the threads they create for audio-related workloads, and the system will prioritise these workloads as long as the application does not violate its own predictions. Here too, plugins can’t currently hook into the host’s audio workgroup, so if a plugin messes around with its own processing threads, those might not get the same high-priority resources as the host-managed threads, which might ultimately lead to priority inversion problems. Still, from my observation, the macOS scheduler often seems to react fairly well to e.g. plugins requesting extra threads of their own. So we have a rather dynamic scheduler that knows the concept of realtime processing, but we have to rely on it doing its best with the given information; we cannot pin a processing thread to a certain CPU core.

On Windows, on the other hand, there is no real system-level concept of dedicated audio threads, and audio threads therefore compete for resources with other high-priority threads. But on Windows, and also on Linux, we can pin threads to a certain CPU core, or block threads from running on a certain core, which is a more static approach to making sure that some resources of a CPU are used nearly exclusively by the intended threads or processes. It’s easy to degrade the overall performance of a system this way, so this should only be done in special cases where the application runs in an isolated environment. A plugin server could be seen as exactly such an environment, as long as plugins don’t start threads of their own and mess with the host’s assumptions. I also wouldn’t be surprised if the specially configured hardware that one manufacturer of defeedback plugins sells is basically a Windows or Linux system configured like that.
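To make the contrast concrete, here’s a hedged sketch of both models; error handling is omitted, and on macOS the workgroup would be obtained from Core Audio (e.g. via the kAudioDevicePropertyIOThreadOSWorkgroup device property):

```cpp
#if defined(__APPLE__)
#include <os/workgroup.h>

// macOS, dynamic: tell the scheduler this thread contributes to realtime audio.
void audio_worker(os_workgroup_t wg)
{
    os_workgroup_join_token_s token;
    if (os_workgroup_join(wg, &token) == 0)
    {
        // ... realtime audio processing loop ...
        os_workgroup_leave(wg, &token);
    }
}

#elif defined(_WIN32)
#include <windows.h>

// Windows, static: pin this thread to core 2 and raise its priority; there is
// no system-level "audio thread" concept to opt into.
void audio_worker()
{
    SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << 2);
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
    // ... realtime audio processing loop ...
}
#endif
```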

So that’s a basic overview of the threading challenges for audio plugin hosts. I realise that I didn’t even manage to answer half of your questions, but I hope it helped you gain some more understanding in this area. Also note that I don’t have in-depth knowledge of how the transform engine approaches some of these challenges in detail, so take this more as a general overview that helps to better understand some observations in the wild.