Job Description
We are looking for a senior-level engineer to focus on high-performance model inference across PC and Android platforms. The role centers on optimizing LLM/multimodal models for low latency and efficient memory use, implementing C++ runtimes, applying advanced acceleration techniques, and collaborating closely with research teams to bring optimized inference solutions into production environments.
Key Responsibilities
• Design and implement optimized model inference pipelines for PC (x86, Intel/AMD) and Android (ARM).
• Apply quantization, operator/kernel fusion, memory optimization, and runtime scheduling techniques.
• Work with at least one major inference stack: llama.cpp, Qualcomm AI SDKs (QNN/QAIRT/QSDK), or MTK NeuroPilot.
• Profile and tune CPU/GPU/NPU performance using industry-standard profiling tools.
• Collaborate with model researchers to translate new methods into efficient runtime implementations.
Required Qualifications
• Master’s degree or above, with 3+ years of experience in model inference, runtime engineering, or performance optimization.
• Strong C++ programming skills; familiarity with Android NDK/JNI is a plus.
• Solid understanding of transformer architectures, inference mechanisms, and acceleration methods.
• Hands-on experience with at least one of: llama.cpp, Qualcomm AI SDKs (QNN/QAIRT/QSDK), or MTK NeuroPilot; experience with OpenVINO, Ryzen AI, or other inference SDKs is a plus.
• Ability to read technical papers and documentation in English; strong English communication skills preferred.
Preferred Qualifications
• Experience with ONNX Runtime, TVM, XNNPACK, or mobile performance tools.
• Contributions to open-source inference or optimization frameworks.