I have a big question about VST3 processing systems

I have a question. It’s a fairly detailed one.

I would like to better understand the performance differences between Fourier’s processing approach, VST3 processing on Windows PCs, and VST3 processing on macOS, as well as the implications of how much access the operating system provides for plugin processing. In particular, I’m interested in the challenges faced by VST3 plugin developers and host application developers when working on an OS like macOS, which is relatively closed, compared to systems that are more open at the OS level.

I have been using the Fourier system since the very early beta stage, having received the hardware before the official release, and I’ve continued using it since day one of launch. I find it to be an excellent system, and I am very satisfied with the continuous software updates. I also feel that I have a reasonable understanding of its unique plugin processing architecture.

Recently, I’ve been running Waves Performer on an M3 Ultra 32-core Mac Studio, combined with a DAD CORE 256, and I’m able to use a very large number of VST3 plugins with extremely low latency. I’m inserting up to 8 plugins per slot across 64 Performer slots, processing 96 MADI channels at 96 kHz with buffer sizes of 64–96 samples. Apart from occasional, unexplained momentary CPU spikes, the system runs without any real issues. Overall latency is consistently in the 1.32–1.99 ms range.

Lately, I’ve been spending a lot of time studying defeedback plugins, which have become a hot topic. While I’m not a developer and therefore don’t know the exact nature of the computations these plugins require, the M3 Ultra system is currently handling more than 10 instances of plugins that are known to be very CPU intensive, and hundreds of plugins in total, all under 2 ms of latency. Plugin developers are also actively maintaining and updating their products.

However, at the same time, some developers state that they can no longer continue development due to macOS’s increasingly restrictive policies. From a user’s perspective, this naturally raises the question: are other developers somehow achieving what is supposedly “impossible”?

With Fourier as well, when appropriate plugins are chosen and used correctly, VST3 plugins can be run very comfortably and stably with no major issues. The relatively long I/O latency is really the only downside; aside from that, plugin compatibility testing and updates are extremely fast. This level of support is something that is hard to find in other systems.

That said, one common issue across both systems is the occurrence of unexplained CPU spikes. Given the large number of plugins involved and the extremely low-latency processing requirements, this feels like a genuinely hard-to-trace problem.

Could you provide more detailed insight into the causes of these momentary CPU spikes, as well as the differences and challenges involved in developing VST3 hosts and managing plugin processing on Windows versus macOS? After more than a year of building and operating these systems, I’ve come to believe that users need to understand the overall architecture of VST3 systems, not just how to “run plugins”, in order to achieve maximum stability.

I have a general understanding of this to some extent. On Windows-based systems, I know that VST3 host applications can directly control and assign specific CPU cores, preventing other cores, such as efficiency cores, from interfering, and can be customized to use exactly the cores they want. This is why systems like NUC PCs, while not particularly high-spec on paper, can still exist as highly purpose-built and optimized solutions.

On macOS, however, due to Apple’s policies, this kind of direct control is largely not possible. As a result, processing has to be implemented through various workarounds. This can lead to situations where processing initially runs on performance cores and then, at some point, gets migrated to efficiency cores, causing issues such as unexpected performance drops or the inability to fully utilize the desired number of CPU cores.

That said, Waves Performer actually makes excellent use of 26 performance cores, so from that perspective it performs extremely well. Nevertheless, since I am not a developer, I have been trying for quite a long time to find a solution to the momentary CPU spike issue, at times even leveraging AI tools in an effort to better understand and resolve the problem.

Plugin Developer and transform engine user here :wink:

You are asking interesting questions here, and indeed, to answer them in detail you really have to get a fairly deep understanding of the low-level details.

As you already pointed out in your question, a key factor when building efficient software for modern CPUs is parallelization of workloads. A modern CPU has multiple cores that can run workloads simultaneously; on modern hardware like the Apple M-series CPUs these cores are asymmetrical in their general processing power, and the clock speed is even adjustable. On top of that, each core has parallel execution units that allow applying the same processing steps to multiple data elements at once (SIMD).
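To make the SIMD part a bit more tangible, here is a minimal sketch of what such data-parallel processing looks like. I’m assuming ARM NEON intrinsics here (what you would use on Apple M-series CPUs), and the function and gain value are just made up for illustration:

```cpp
#include <arm_neon.h>
#include <cstddef>

// Apply the same gain to a block of samples, four floats per instruction.
// Minimal sketch: real plugin code would also deal with channel layouts and
// may rely on the compiler auto-vectorising instead of hand-written intrinsics.
void applyGainSimd(float* samples, std::size_t numSamples, float gain)
{
    std::size_t i = 0;
    for (; i + 4 <= numSamples; i += 4)
    {
        float32x4_t v = vld1q_f32(samples + i);   // load 4 samples
        v = vmulq_n_f32(v, gain);                 // multiply all 4 by the gain
        vst1q_f32(samples + i, v);                // store them back
    }
    for (; i < numSamples; ++i)                   // scalar tail for leftovers
        samples[i] *= gain;
}
```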

Now if we want to use this hardware to process audio, we can think of multiple parallelization approaches that can take place at different levels. The most important one is multithreading, and I think it’s good to start by sharing some information on that topic first.

Host Level
A simple approach for a plugin host is to distribute multiple plugin chains across multiple cores. E.g. processing 96 channels with one plugin chain per channel on a 16-core CPU results in letting each CPU core handle 6 plugin chains. All serious plugin hosts do at least that. In most cases, this works by creating as many threads as you need for your processing, telling the operating system that you need the highest priority for these threads, and then taking care of evenly distributing the workload across these threads.
However, while this seems straightforward, a big problem is the actual even distribution. The assumption that distributing the 96 plugin chains to the 16 cores in the example above is best solved by just segmenting the 96 chains into blocks of 6 and handing them to the available cores is probably not true in reality, at least when your chains contain different plugins. There might be plugins in the chains that are relatively lightweight processing-wise and others that take more resources. And there is no way of figuring that out except for actually running the plugins: there is no standardised way to ask a plugin ahead of time how many resources it will need. So plugin host load balancing can only ever react to observations and to predictions derived from observed behaviour.

One thing that makes this even more challenging is that plugins might not even have a stable resource consumption over time. Plugins that use FFT-based algorithms, for example, might require something like 1024 samples for each processing step. Running such a plugin at a 32-sample block size means that 31 out of every 32 processing callbacks consist of nothing more than pushing new samples into a plugin-internal input buffer and pulling already processed samples out of another internal buffer, which is a very cheap operation. Only on the 32nd callback are enough samples buffered to run the actual processing, so that single run consumes far more resources and the host suddenly sees a load spike. While this is periodic, you can assume that it is relatively complex to predict, and therefore hosts are often rather conservative about their load balancing strategies, leaving a lot of headroom for occasional load spikes.
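To illustrate that buffering behaviour, here is a rough sketch of a hypothetical plugin with a 1024-sample internal hop size being fed 32-sample blocks. The class and constants are made up for the example (and it assumes the host block size divides the hop size evenly), but the pattern of 31 cheap callbacks followed by one expensive one is exactly what produces the periodic load spike:

```cpp
#include <array>
#include <cstring>

// Hypothetical plugin with an internal 1024-sample FFT hop, called with
// 32-sample blocks by the host: 31 cheap calls, then one expensive one.
class FftLikePlugin
{
public:
    static constexpr int kHopSize = 1024;

    void process(float* block, int numSamples) // host calls this with e.g. 32 samples
    {
        // Cheap part: push input into the internal buffer and pull already
        // processed output out of another internal buffer.
        std::memcpy(input.data() + fill, block, sizeof(float) * numSamples);
        std::memcpy(block, output.data() + fill, sizeof(float) * numSamples);
        fill += numSamples;

        if (fill >= kHopSize) // only every 32nd call at a 32-sample block size
        {
            runExpensiveFftProcessing();   // the actual heavy work: load spike
            fill = 0;
        }
    }

private:
    void runExpensiveFftProcessing()
    {
        // FFT, spectral processing, inverse FFT ... fills `output`.
    }

    std::array<float, kHopSize> input {};
    std::array<float, kHopSize> output {};
    int fill = 0;
};
```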

Plugin Level
As already sketched above, a plugin is called by the host to process a block of samples. The plugin has to use the host-supplied thread that calls into it to perform the processing, and it had better do that as quickly and efficiently as possible. Well-crafted plugins often optimise the actual processing by using the SIMD processing mentioned above, e.g. to process multiple audio channels in parallel on the same CPU core. But the time box for the processing is limited: if the processing takes too long, the result won’t be ready by the time the sound card driver needs to push out the samples, and you will notice audio dropouts. Now, if a plugin manufacturer wants to implement an algorithm that won’t fit into the given time box, but finds tricks to split the computation behind the algorithm into multiple independent parts, some plugins simply start extra high-priority threads of their own and begin distributing audio to these threads as soon as the plugin’s processing function is called by the host, making the plugin use multiple CPU cores in parallel that way.
AI-based defeedback plugins tend to do exactly that; at least one well-known plugin that I inspected a while ago does.
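Here is a deliberately simplified sketch of that pattern: a plugin that splits its per-block work onto one extra thread it manages itself. Everything here (class name, channel split, the busy-wait) is made up for illustration and ignores thread priorities and error handling; real plugins are more careful, but the overall shape is similar:

```cpp
#include <atomic>
#include <thread>

// Very rough sketch of a plugin that offloads half of its work to a
// self-managed worker thread. See the synchronisation discussion below for
// why the waiting strategy used here is both common and problematic.
class SelfThreadingPlugin
{
public:
    SelfThreadingPlugin()
    {
        worker = std::thread([this]
        {
            while (running.load())
            {
                if (workReady.exchange(false))
                {
                    processChannels(8, 16);      // second half of the channels
                    workDone.store(true);
                }
            }
        });
    }

    ~SelfThreadingPlugin()
    {
        running.store(false);
        worker.join();
    }

    void process() // called by the host on its audio thread
    {
        workDone.store(false);
        workReady.store(true);               // kick the worker
        processChannels(0, 8);               // first half on the host thread
        while (! workDone.load()) {}         // busy-wait for the worker
    }

private:
    void processChannels(int first, int last) { /* actual DSP on [first, last) */ }

    std::thread worker;
    std::atomic<bool> running { true }, workReady { false }, workDone { false };
};
```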

Challenges
As soon as both plugins and the host start launching their own processing threads, things can get messy quickly. A host application can no longer assume that it is the only instance running high-priority threads, and the predictions required for workload distribution as sketched out above will likely no longer be accurate, because once more high-priority threads are requested on a system than there are cores available, the OS thread scheduler will pause some threads to give other threads a time slot.

The best way around this would be to let the plugin ask the host for extra processing threads instead of just creating extra threads inside the plugin. To my knowledge, the relatively new CLAP plugin standard is the only plugin standard that allows this as an optional feature, and both the host and the plugin in question have to support that feature. VST3 simply has no standardised way of managing this, so there is kind of a Wild West situation at the moment. I hope and expect this to be an area of further development in the near future, as everyone would win from it.
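Conceptually, such a host-provided mechanism could look roughly like the sketch below. To be clear, these names are invented for illustration and are not the actual CLAP API (the real extension works through the CLAP C interface), but the idea is the same: the plugin asks the host to run N tasks on the host’s own worker threads instead of spawning its own.

```cpp
#include <cstdint>

// Hypothetical host-provided interface, loosely modelled on the idea behind
// CLAP's optional thread-pool extension (names made up for this sketch).
struct HostThreadPool
{
    // Runs `numTasks` invocations of the plugin's task callback on the
    // host's own real-time worker threads; returns when all are finished.
    virtual bool runTasks(std::uint32_t numTasks) = 0;
    virtual ~HostThreadPool() = default;
};

struct MyPlugin
{
    HostThreadPool* host = nullptr;   // provided by the host at load time

    void process()
    {
        // Ask the host for 4 parallel tasks instead of spawning threads.
        if (host == nullptr || ! host->runTasks(4))
            processEverythingOnThisThread();   // fallback: single-threaded
    }

    // Called by the host, possibly concurrently, once per task index.
    void runTask(std::uint32_t taskIndex) { /* process one slice of channels */ }

    void processEverythingOnThisThread()  { /* single-threaded fallback */ }
};
```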

So for the time being, we are at the mercy of the operating system’s thread scheduler. Optimising a system’s thread scheduler is a highly complex topic, and macOS and Windows take different routes here.

On macOS, the scheduler has a specific concept of threads that contribute to the processing of real-time audio. Audio applications can use the Audio Workgroup API to give the system more information about threads they create to compute audio-related workloads, and the system will prioritise these workloads as long as the application does not violate its own predictions. Here too, plugins currently can’t hook into the host’s audio workgroup, so if a plugin spins up its own processing threads, those might not get the same high-priority resources as the host-managed threads, which might ultimately lead to priority inversion problems. Still, from my observation, the macOS scheduler often seems to react fairly well to e.g. plugins also requesting extra threads. So we have a rather dynamic scheduler that knows the concept of real-time processing, but we have to rely on it doing its best with the given information; we cannot pin a processing thread to a certain CPU core.

On Windows, on the other hand, there is no real system-level concept of dedicated audio threads, and therefore audio threads compete for resources with other high-priority threads. But on Windows, and also on Linux, we can e.g. pin threads to a certain CPU core or block threads from running on a certain core, which is a more static approach to making sure that some CPU resources are used nearly exclusively by the intended threads or processes. It is easy to degrade the overall performance of a system this way, so it should only be done in special cases where the application runs in an isolated environment. A plugin server could be seen as exactly such an environment, as long as plugins don’t start threads of their own and mess with the host’s assumptions. I also wouldn’t be surprised if the specially configured hardware that one manufacturer of defeedback plugins sells is basically a Windows or Linux system configured like that.
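For completeness, here is what that static pinning approach looks like on Windows, as a minimal sketch. The core index is an arbitrary example value; a real host would pick cores according to its own allocation plan, and there is no public macOS equivalent of this call:

```cpp
#include <windows.h>

// Pin the calling audio worker thread to one specific CPU core and give it
// time-critical priority (Windows only).
bool pinCurrentThreadToCore(int coreIndex)
{
    const DWORD_PTR mask = DWORD_PTR(1) << coreIndex;

    if (SetThreadAffinityMask(GetCurrentThread(), mask) == 0)
        return false;   // the OS rejected the affinity mask

    return SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL) != 0;
}
```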

So that’s a basic overview of threading challenges for audio plugin hosts. I realise that I didn’t even manage to answer half of your questions, but I hope it helped you gain some more understanding in this area. Also note that I don’t have in-depth knowledge of how the transform engine approaches some of these challenges in detail, so take this more as a general overview that makes some observations in the wild easier to understand.


This is the first time I’ve received such a detailed and insightful answer to the questions I’ve had while building and using VST3-based systems over the years.

Thank you very much for taking the time to write such a thorough response.

Excluding transform-engine–based solutions, the only software hosts realistically available to general users are Waves Performer and Live Professor, and these two clearly operate on fundamentally different algorithms. In terms of performance and stability, Waves Performer stands out as the only host that can process a significantly larger number of plugins at extremely low buffer sizes on identical hardware. It’s evident that their long experience in live sound platforms is deeply reflected in the design.

Through this process, I assume that the software itself is already operating in an optimally tuned manner. Nevertheless, I personally experimented on macOS with forcibly assigning the highest possible priority (e.g. -20) by tracking the application’s PID and applying tools such as taskpolicy and renice. In practice, however, this resulted in unpredictable and random CPU spikes, so I have since reverted to a completely stock configuration and am currently re-testing the same session under default conditions.

From your perspective, to what extent do you think manually raising process priority via user-level commands can conflict with an application’s own internal scheduling and optimization strategies, potentially causing issues at the macOS level with thread prioritization or with how plugins request and receive processing time?

This is a continuation of the discussion. I believe users also need to understand how plugins and host applications work from multiple angles, so they can plan and use plugins more strategically.

Right now I’m using an M3 Ultra 32-core Mac Studio, but previously I used an M4 Max CTO MacBook. Both worked well overall, but the fundamental difference I noticed was this: the M4 tends to maintain a high clock more consistently, with less fluctuation, and it can hold its boost clock for longer. In contrast, the M3 Ultra appears to run with more variable clock behavior rather than staying at peak speed. Because of that, I can see more movement on the CPU meter, and I suspect this kind of variability may be contributing to unpredictable, “for no apparent reason” CPU spikes. When I used the M4 Max, I don’t think I experienced these random CPU spikes nearly as much.

I also have a few more questions.

In LiveProfessor, it feels like this issue has been largely resolved. But when using Waves Performer with a lot of plugins at a low buffer size, simply opening a plugin window can cause a sudden CPU load jump and result in audio dropouts. My understanding is that VST3 plugin GUI rendering is handled entirely by the CPU (with essentially no GPU involvement). So when GUI drawing suddenly consumes CPU resources at the wrong moment, it steals time from the real-time audio thread, creates a bottleneck, and eventually causes real-time audio processing failures. What I’m wondering is: from a technical standpoint, is this something that cannot realistically be improved at the host level? In practice, because of this, I almost never tweak plugins during a show. Leaving everything untouched is the safest approach.

Also, when using the same plugin on a mono channel versus a stereo channel, it feels like the stereo instance can take 2× or more processing resources. I understand it’s not simply “two mono tracks,” but I’d like to ask—since you’re a developer—why this difference can become so extreme, and why something that is totally usable on mono can become almost unusable on a stereo channel.

Finally, I want to ask about a very hot topic lately: Alpha Sound’s “de-feedback” plugin.

In LiveProfessor, it’s currently unusable due to CPU peak issues (though I understand they’re working on fixing it). On macOS it’s an even worse situation, but at the same time, I’ve seen posts from users saying it runs fine on base M3 Mac minis or other Macs, which is confusing.

This plugin seems like it only uses AI training data as the basis for processing, rather than doing real-time AI inference. If that’s true, then it may require significantly more processing power than typical audio DSP. And if it were truly doing AI inference, it could potentially leverage the Neural Engine—but it appears to be CPU-only. So to me, it looks like one of the many recent “AI-labeled” plugins that are essentially processing based on learned data, and it also seems like the internal code simply isn’t well optimized. I’d like to ask if you have any personal opinion about this plugin—not just performance-wise, but specifically about how its processing approach likely works.

On my system, as soon as I run audio through it, it just causes CPU spikes. I also get jitter. I’ve spoken with the developer a few times, but they’re basically abandoning optimization/stabilization of the current version and going all-in on the next version. From a customer’s perspective, it honestly feels like I paid money and got nothing but empty promises.

I’m currently processing 95 input channels using roughly 280 to 300 plugin instances (851 threads).
I’m not able to upload images right now, so I can’t attach a screenshot of the session I’m running (96 kHz, 96-sample buffer).

So I’m currently keeping the session as it is, while identifying which plugins are causing problems and simply not using them. I don’t think the computer’s processing power is the issue. Rather, if a plugin’s processing behavior or its graphical resource usage makes it difficult to use reliably in a live environment, then the correct answer is to avoid using that plugin.

Also, I’m curious how much disabling Wi-Fi or Bluetooth actually contributes to real-time processing stability. I always keep Wi-Fi off due to licensing concerns, but I do use Bluetooth because of my input devices. That said, I’ve seen a lot of opinions claiming that Bluetooth scanning/searching can affect performance in heavy sessions—especially in a low-buffer, high-plugin-count real-time processing setup.

In the case of Performer, they’ve completely blocked internal networking and wireless networking.

Again, a lot of questions, and I’m afraid I’m not able to answer all of them in detail, but let me try to give you some information that might help you build further understanding and continue your own research. In the end, this performance optimisation topic is highly complex, and even as a developer it’s not always obvious why a system behaves the way it does or which of multiple optimisation strategies is the best one.

Processes and Threads
I think it’s important to clarify a bit what processes and threads are and how they are managed by an operating system. So both are abstractions of parallelism.

A process is basically one instance of some software running on a system. E.g. if a web browser and a text editor run at the same time, these are two parallel processes being executed. By default, those two pieces of software have no direct interconnection. An operating system tries to distribute slices of the CPU’s processing time between all running processes, letting each of them run for a short timeframe and then halting one process to let another continue its work, giving the user the impression of fully parallel execution. So processes abstract system-level parallelism.

Threads are a way to implement parallelism within a single process. By default, when a new process starts, exactly one thread is created with it, and that thread runs the process’s main function. In the case of a UI-based process, this main thread usually becomes the system’s message thread, which means that after general application setup, this thread hooks into the operating system’s messaging system and then e.g. receives system events like mouse and keyboard inputs and runs all CPU-based UI rendering.

If the application needs parallel actions to happen, it usually starts an extra thread from within the application. A good example is a download, which means talking to some network interface, waiting for data to arrive, processing a bit of data, waiting for more data and, once all data is there, maybe post-processing that data and writing it to a file. In theory this could be done on the main thread, but that would mean that while waiting for the next few bytes to arrive via the network we could not react to mouse events, so the UI of the app would appear frozen. So we start a new thread from within the application and handle the relevant network I/O on that thread.

Requesting data from the network works via system calls, i.e. functionality from the operating system that can be accessed from within an application. If the operating system figures out that no data will arrive soon because of a slow internet connection, it will probably stop execution of the thread serving the download for a moment and use the free resources for some other process. When new data arrives on the network, it will re-assign the sleeping thread to a CPU core and let it continue working on the received data, either until it has to wait for new data again or until there are other threads from other processes that are also ready to do work. Then the scheduler might decide to halt execution of that download thread for a few milliseconds and let another thread use that CPU core before resuming the download thread or assigning the core to yet another waiting thread.
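As a toy illustration of that idea (nothing audio-specific, just the general pattern), this is roughly what starting such a worker thread looks like; the sleep stands in for waiting on the network:

```cpp
#include <chrono>
#include <iostream>
#include <thread>

int main()
{
    // The slow work runs on its own thread so the main thread stays free.
    std::thread download([]
    {
        // Stand-in for waiting on the network and writing the file.
        std::this_thread::sleep_for(std::chrono::seconds(2));
        std::cout << "download finished\n";
    });

    // Meanwhile the main thread could keep pumping UI events here.
    std::cout << "main thread stays responsive\n";

    download.join();   // wait for the worker before exiting
    return 0;
}
```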

One way the operating system decides how to prioritise the work of one thread over another is via thread priority. This is a value that the process starting the thread assigns to it, and it communicates to the operating system how important the work executed on that thread is. The download thread from the example above should be created with a medium-low priority, simply because a download is slow and not time-critical compared to reacting to user interaction via the UI. The main thread will therefore usually have a medium-high priority. The audio processing threads of an audio application should have the highest priority, so that the system always prioritises them over other threads, including the main thread, because a stuttering UI is a lot less critical than stuttering audio.

This prioritization works well when the system is not under full load but gets more and more difficult as the system load rises. Especially with a lot of audio processing, which means a lot of load on high-priority threads, the system needs to halt even the highest-priority threads from time to time to handle system events etc. With macOS audio workgroups, the system’s scheduler is capable of making more informed decisions, as it knows that certain threads have to meet a deadline in order to avoid audio stuttering, but even then, when the system load is too high, the scheduler might have no option but to make bad decisions. So better performance of one macOS audio application over another, in terms of robustness against interruptions from UI events, might be caused by better usage of the relatively new audio workgroup API by one application over the other. But this is purely hypothetical; I’m explicitly not stating that the dropouts after UI interaction that you notice with the Waves system are caused by poor audio workgroup usage of the application. I just don’t know anything about the internals of this application.
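For reference, joining an audio device’s workgroup from one of the host’s own worker threads looks roughly like the sketch below. This is my understanding of the CoreAudio/os_workgroup APIs from the public headers; error handling, releasing the workgroup object and leaving the workgroup again are omitted, so treat it as an outline rather than production code:

```cpp
#include <CoreAudio/CoreAudio.h>
#include <os/workgroup.h>

// Ask the audio device for the OS workgroup of its I/O thread and join it
// from the calling worker thread, so the scheduler knows this thread
// contributes to the same real-time deadline as the driver's I/O thread.
static bool joinDeviceWorkgroup(AudioObjectID deviceID)
{
    AudioObjectPropertyAddress address {
        kAudioDevicePropertyIOThreadOSWorkgroup,
        kAudioObjectPropertyScopeGlobal,
        kAudioObjectPropertyElementMain
    };

    os_workgroup_t workgroup = nullptr;
    UInt32 size = sizeof(workgroup);

    if (AudioObjectGetPropertyData(deviceID, &address, 0, nullptr, &size, &workgroup) != noErr)
        return false;

    os_workgroup_join_token_s token {};
    return os_workgroup_join(workgroup, &token) == 0;   // leave later with os_workgroup_leave()
}
```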

But there is one more interesting aspect to consider and that is

Synchronization
Let’s stick to the example of our 96-channel audio host that runs 16 audio threads. Only one of these threads will actually wait on a system call from the audio driver. Just as the download thread described above has to wait for new data to arrive from the network, the main audio thread has to wait until the audio driver signals that e.g. 32 samples of audio for all 96 input channels have been copied into a certain memory location that the application can read from. Only once this has happened can the app’s main audio thread start its computation on the samples. If the app had only one audio thread, that thread would work on the data and eventually write the output samples to a prepared memory location. Once it is done with that, it informs the audio driver, which then takes over, pushes the new samples out to the soundcard via whatever hardware connection it uses, and reads new samples back from that connection into memory accessible by the application. So we have identified an inter-process synchronisation point. The system’s scheduler can react to that by assigning CPU resources to either the audio driver process or the audio application process each time these events happen, as it is capable of tracking such system events.

But what about the 15 other threads in the audio app? These threads also need some synchronisation: they basically need to wait for the main audio thread to start processing the new block of samples, then, following some distribution strategy, start working on a subset of the processing work and contribute their part to the final processed block of multichannel audio. The main audio thread will also do some processing, but then has to wait for all other processing threads to finish their work before handing over to the audio driver again. So in case other threads take longer than the main thread, it needs a way to wait on another thread. And this is a more complex problem than it seems at first.

For thread synchronisation, there are system calls that tell the operating system “I want to pause here and resume working once another thread has reached a certain point”, and in general programming, using these system calls is considered best practice, because it helps keep a system responsive: whenever the system learns that one thread is currently waiting for another, it will usually take the opportunity to execute some other work in between and switch back to the waiting thread once the condition it waited for is fulfilled. In audio software, however, we sometimes actively avoid such system calls, because we might get scheduled back in a bit too late. Especially if the main audio thread assumes that the other thread should be ready very soon, it may decide to switch to busy spinning instead, which means constantly checking whether the condition it’s waiting for has become true rather than asking the system to be notified once that has happened. To the operating system this is indistinguishable from actual useful processing work, so it looks like the thread is doing heavy work, which might even be beneficial, since the scheduler then gets the impression that this thread should really not be blocked if the audio processing is to get done. So the application might choose to burn CPU power with pointless checks just in order not to miss any audio processing deadline. Of course, busy spinning is also a waste of resources, so this is a really tricky field to optimise, and good thread synchronisation is genuinely hard.
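To make the two waiting strategies concrete, here is a minimal sketch of both; the names are made up, and a real host would typically combine them (spin briefly, then fall back to blocking):

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

// 1) Blocking wait via a condition variable: the OS may deschedule the
//    waiting thread and run other work until signal() is called.
struct BlockingGate
{
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void signal() { { std::lock_guard<std::mutex> lock(m); done = true; } cv.notify_one(); }
    void wait()   { std::unique_lock<std::mutex> lock(m); cv.wait(lock, [this] { return done; }); }
};

// 2) Busy spinning on an atomic flag: looks like real work to the scheduler,
//    so the waiting thread is unlikely to be descheduled, at the cost of CPU.
struct SpinningGate
{
    std::atomic<bool> done { false };

    void signal() { done.store(true, std::memory_order_release); }
    void wait()   { while (! done.load(std::memory_order_acquire)) { /* spin */ } }
};
```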

A lot of the general-purpose software libraries used for high-throughput number crunching, which applications and plugins might use to speed up their processing, are not optimised in this regard. This is especially the case for highly complex AI-based computations: there are a lot of building blocks out there these days that make AI algorithms run fast, but nearly none of them is optimised for lock-free audio processing. So if you come across plugins that use AI for audio processing and show occasional load spikes, this might be caused by the use of such general-purpose libraries in those products. I have been able to identify such cases in the wild.

That brings me to a last point

Why is AI inference in audio plugins mostly CPU-based?
Using external hardware like GPUs or the Apple Neural Engine requires synchronisation points again. The software interfaces available to access these units always follow more or less the same pattern as outlined above for talking to e.g. the audio driver or the network stack. If we choose to wait for e.g. our GPU to finish our sample processing in time, we have to compete with other applications using the GPU. So theoretically, plugins that use the GPU for AI inference could suddenly be interrupted by opening the UI of a plugin that uses GPU-based UI rendering. And the scheduling of GPU access might follow completely different rules than the thread scheduling we discussed. Memory access latency is also an issue here. Still, there are companies actively working on these issues, and there seem to be solutions, but this is extremely niche knowledge and therefore not feasible for most plugin manufacturers.

The Apple Neural Engine itself is interesting, but it cannot even be accessed directly the way a GPU can. Instead, you need an abstract description of your neural network that you pass to the Apple CoreML API, which then decides whether to run it on the Neural Engine, on the GPU or on the CPU. So you hand over a big amount of control to a system library which might perform all kinds of multithreading and uncontrollable synchronisation internally, and that is usually something you want to avoid at all costs for the reasons outlined above. So we often end up using the CPU as the most predictable execution unit for AI inference workloads in audio apps today, even if it’s not the most efficient hardware unit for the job.

Final words
I think the level of detail we are reaching here quickly gets a bit too much for a place like this, and unfortunately I’m not able to continue this conversation at this level of detail over a longer period of time. If you are really interested in the deep details of how audio applications work, you will probably need to start programming yourself and learn how it all works together behind the scenes. If you are interested in a behind-the-scenes look, I can recommend the Audio Developer Conference YouTube channel, where you’ll find a wide range of talks from all kinds of people working on audio software development.


Really appreciate your answer!
