I’m not particularly knowledgeable about the actual Neuro-sama
, but from what I’ve gathered by referencing threads and such, the difficulty in replication seems to lie more in the software engineering aspects than the AI models itself.
Since the models themselves have become more efficient over time, the core challenges are more on the software engineering side: things like pipeline structure, real-time performance, model integration, network processing, and leveraging existing components.
Regarding the AI models needed within the system, a basic understanding of LLMs for text, VAD, ASR, and TTS for audio should likely suffice for implementation. You might have to wrestle with Python or library version mismatches though…