I'm partnered with a startup that is building advanced physics foundation models for climate prediction and control.
Key Responsibilities:
Build and Manage Large-Scale ML Infrastructure: Architect and maintain distributed systems to support training and inference of large machine learning models, ensuring optimal performance across all stages.
Design Scalable Pipelines: Develop and implement end-to-end data processing pipelines capable of handling massive datasets, from ingestion and transformation to model training and deployment.
Explore and Test New Training Techniques: Research cutting-edge training methods, including parallelization strategies and precision trade-offs, to improve the performance and scalability of model training (a minimal sketch of this kind of work follows this list).
Optimize GPU Performance: Analyze and enhance low-level GPU operations to improve efficiency, reduce latency, and maximize hardware utilization in complex ML tasks.
Stay Updated on Industry Trends: Continuously monitor advancements in ML research to incorporate new ideas and techniques into our systems.
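To give a flavor of the work described above, here is a minimal sketch of sharded data-parallel training with a bf16 mixed-precision policy using PyTorch FSDP. The model, data, and hyperparameters are placeholders, not the startup's actual stack.

```python
# Minimal sketch: sharded data-parallel training with a bf16 mixed-precision
# policy via PyTorch FSDP. Launch with: torchrun --nproc_per_node=<gpus> train.py
# Model, data, and hyperparameters are placeholders.
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision


def main():
    dist.init_process_group("nccl")           # torchrun sets RANK/WORLD_SIZE
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(              # placeholder model
        torch.nn.Linear(1024, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()

    # Precision trade-off: bf16 for compute and gradient reduction,
    # while fp32 master weights stay with the optimizer.
    mp_policy = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )
    model = FSDP(model, mixed_precision=mp_policy, device_id=local_rank)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):                   # placeholder training loop
        batch = torch.randn(32, 1024, device="cuda")
        loss = model(batch).pow(2).mean()     # dummy objective
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```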
What We're Looking For:
Strong Problem-Solving and Fast Execution: You should thrive on tackling complex problems with speed and creativity, and adapt quickly to new technologies or challenges.
Expertise in Optimizing ML Workloads: Proven experience in optimizing training and inference for large models, including leveraging advanced techniques like mixed-precision training and hardware optimization.
Experience with Distributed Training Frameworks: Deep familiarity with frameworks for distributed training of large models, such as FSDP or DeepSpeed.
Cloud Platform Knowledge: Hands-on experience with major cloud services (e.g., GCP, AWS, or Azure) and their AI/ML offerings for deploying and scaling models.
Containerization and Orchestration Skills: Proficient in tools like Docker and Kubernetes for deploying and managing containerized machine learning workloads in cloud environments.
Distributed Systems and Scalable Serving Expertise: Experience in building scalable task management systems and deploying machine learning models in production environments.
Monitoring and Observability Practices: Knowledge of best practices for monitoring, logging, and performance tracking in machine learning systems, keeping training and serving reliable and model versions traceable (illustrated in the sketch below).
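On the observability point, a minimal, hypothetical sketch of structured per-step metrics: one JSON line per training step with loss, throughput, and peak GPU memory, which a log shipper could forward to a backend such as Prometheus or Weights & Biases. The function name and fields are illustrative, not any specific tool's API.

```python
# Minimal sketch of per-step training metrics as structured JSON lines.
# The function name and fields are illustrative, not a specific tool's API.
import json
import time

import torch


def log_step_metrics(step: int, loss: float, tokens: int, step_start: float) -> None:
    """Emit one JSON line per training step for a log shipper to pick up."""
    elapsed = time.time() - step_start
    record = {
        "step": step,
        "loss": round(loss, 5),
        "tokens_per_s": round(tokens / elapsed, 1),
        "step_time_s": round(elapsed, 3),
        # Peak memory allocated on the current GPU, in GiB.
        "gpu_mem_gib": round(torch.cuda.max_memory_allocated() / 2**30, 2)
        if torch.cuda.is_available()
        else 0.0,
    }
    print(json.dumps(record), flush=True)


# Usage inside a training loop (train_step and batch_tokens are hypothetical):
# start = time.time()
# loss = train_step(batch)
# log_step_metrics(step, loss.item(), batch_tokens, start)
```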