Regarding general performance, has anyone looked into the elephant in the room, namely the slightly weird 2x oversampling from 32,000 to 64,000 Hz? If it is not strictly necessary, and the whole digital circuit simulation (or a large portion of it) literally runs 64k times per second instead of just 32k times, getting rid of it might yield major dividends.
The best type of performance optimisation is not doing the work at all 😎
Does the hardware even do any 2x oversampling on the digital sample-generation side? Does anyone know? Or is this something that Nukey tacked on for whatever reason? I've spent ten minutes looking into it now, but I can't tell whether
1. the hardware does 2x oversampling, so emulating this accurately is a must.
2. it's a Nuked-SC55 special, and the digital circuit simulation is doing twice the work unnecessarily.
3. it's a Nuked-SC55 special, the digital circuit simulation runs at the correct 32k rate, then there's some extra interpolation/upsampling going on as a post-processing step.
If case 1 is true, probably there's nothing to do.
In case 3, there's not much to be gained by dropping the oversampling, because as an extra post-processing step it's cheap. (But the 2x upsampling is still very much questionable... Why 2x? The correct way would be to render at 32k, then resample to whatever the host rate is with a high-quality resampler. Currently, resampling from 64k to the host rate is done by the OS or the audio subsystem anyway, except you have no control over it, and you're resampling twice instead of just once.)
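To put the "resampling twice instead of once" point into numbers, here's a tiny sketch that lists the conversion ratios for both paths. The 44.1k and 48k host rates are just the usual examples, and the helper names are made up for illustration; nothing here is taken from the Nuked-SC55 code.

```python
from fractions import Fraction

NATIVE_RATE = 32_000       # rate the post assumes the hardware renders at
OVERSAMPLED_RATE = 64_000  # 2x rate the emulator currently outputs

def stages(path):
    """Exact output/input sample ratio of each conversion step along a path of rates."""
    return [Fraction(dst, src) for src, dst in zip(path, path[1:])]

for host_rate in (44_100, 48_000):  # example host rates, an assumption
    two_stage = stages([NATIVE_RATE, OVERSAMPLED_RATE, host_rate])
    one_stage = stages([NATIVE_RATE, host_rate])
    print(f"{host_rate} Hz: current path {[str(r) for r in two_stage]}, "
          f"direct path {[str(r) for r in one_stage]}")

# If case 2 holds, the simulation also ticks this many times more often than needed.
print("simulation work factor:", Fraction(OVERSAMPLED_RATE, NATIVE_RATE))
```

The overall ratio is the same either way, of course; the difference is that the direct path has one filter stage that you get to pick, instead of two where the second one is whatever the OS happens to use.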
But in case 2 things get interesting, and there are some major potential performance improvements to be had! Ideally, in case 2 we'd want to render at 32k, true to the real hardware, then bring in a high-quality, low-latency resampler library such as Speex DSP. The Speex DSP resampler has much lower latency than, say, libresample, is very good quality, and has negligible CPU overhead on modern hardware (like 1-2% of a single CPU core at high-quality settings). We use it in DOSBox Staging for most resampling duties now and it's great (e.g., to go from the weird native OPL rate to 48k or 44.1k, to do the "brickwall filter" style resampling on the SB16, etc.)
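For anyone who wants to see what "render at 32k, then resample once to the host rate" looks like, here's a minimal sketch. To be clear, this is not Speex DSP's API or algorithm; it's a plain windowed-sinc (Hann) resampler in pure Python as a stand-in, far too slow for real use, and the function name, the `half_width` parameter and the 1 kHz test tone are all my own assumptions.

```python
import math

def sinc(x):
    """Normalised sinc: sin(pi*x) / (pi*x), with sinc(0) = 1."""
    return 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)

def resample(samples, src_rate, dst_rate, half_width=16):
    """Single-stage Hann-windowed sinc resampler (illustrative, not Speex DSP)."""
    cutoff = min(1.0, dst_rate / src_rate)  # lower the cutoff when downsampling
    support = half_width / cutoff           # kernel half-width in input samples
    out_len = len(samples) * dst_rate // src_rate
    out = []
    for n in range(out_len):
        pos = n * src_rate / dst_rate       # output instant in input-sample units
        first = max(0, math.ceil(pos - support))
        last = min(len(samples) - 1, math.floor(pos + support))
        acc = 0.0
        for k in range(first, last + 1):
            x = pos - k
            window = 0.5 * (1.0 + math.cos(math.pi * x / support))  # Hann window
            acc += samples[k] * cutoff * sinc(cutoff * x) * window
        out.append(acc)
    return out

# Hypothetical usage: 20 ms of a 1 kHz tone "rendered" at 32k, converted straight to 48k.
SRC_RATE, DST_RATE, TONE_HZ = 32_000, 48_000, 1_000
tone = [math.sin(2 * math.pi * TONE_HZ * k / SRC_RATE) for k in range(640)]
out = resample(tone, SRC_RATE, DST_RATE)
print(len(tone), "input samples ->", len(out), "output samples")
```

A real library does the same job with precomputed filter tables and SIMD, which is where the low CPU overhead comes from; the point here is only that a single conversion from 32k to the host rate is all that's needed.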