I'm partnered with a startup that is building advanced physics foundation models for climate prediction and control.
Key Responsibilities:
Build and Manage Large-Scale ML Infrastructure: Architect and maintain distributed systems to support training and inference of large machine learning models, ensuring optimal performance across all stages.
Design Scalable Pipelines: Develop and implement end-to-end data processing pipelines capable of handling massive datasets, from ingestion and transformation to model training and deployment.
Explore and Test New Training Techniques: Research cutting-edge training methods, including parallelization strategies and precision trade-offs, to improve the performance and scalability of model training (a minimal sketch of this kind of work follows this list).
Optimize GPU Performance: Analyze and enhance low-level GPU operations to improve efficiency, reduce latency, and maximize hardware utilization in complex ML tasks.
Stay Updated on Industry Trends: Continuously monitor advancements in ML research to incorporate new ideas and techniques into our systems.
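To give a flavor of the work described above, here is a minimal sketch of sharded data-parallel training with a bf16 mixed-precision policy using PyTorch FSDP. The model, data, and hyperparameters are placeholders, not the startup's actual stack.

```python
# Minimal sketch: sharded data-parallel training with a bf16 mixed-precision
# policy via PyTorch FSDP. Launch with: torchrun --nproc_per_node=<gpus> train.py
# Model, data, and hyperparameters are placeholders.
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision


def main():
    dist.init_process_group("nccl")           # torchrun sets RANK/WORLD_SIZE
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(              # placeholder model
        torch.nn.Linear(1024, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()

    # Precision trade-off: bf16 for compute and gradient reduction,
    # while fp32 master weights stay with the optimizer.
    mp_policy = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )
    model = FSDP(model, mixed_precision=mp_policy, device_id=local_rank)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):                   # placeholder training loop
        batch = torch.randn(32, 1024, device="cuda")
        loss = model(batch).pow(2).mean()     # dummy objective
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```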
What We're Looking For:
Strong Problem-Solving and Fast Execution: You should thrive on tackling complex problems with speed and creativity, and adapt quickly to new technologies or challenges.
Expertise in Optimizing ML Workloads: Proven experience in optimizing training and inference for large models, including leveraging advanced techniques like mixed-precision training and hardware optimization.
Experience with Distributed Training Frameworks: Deep familiarity with frameworks for distributed training of large models, such as FSDP or DeepSpeed.
Cloud Platform Knowledge: Hands-on experience with major cloud services (e.g., GCP, AWS, or Azure) and their AI/ML offerings for deploying and scaling models.
Containerization and Orchestration Skills: Proficient in tools like Docker and Kubernetes for deploying and managing containerized machine learning workloads in cloud environments.
Distributed Systems and Scalable Serving Expertise: Experience in building scalable task management systems and deploying machine learning models in production environments.
Monitoring and Observability Practices: Knowledge of best practices for monitoring, logging, and performance tracking in machine learning systems, keeping training and serving reliable and model versions traceable (illustrated in the sketch below).
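On the observability point, a minimal, hypothetical sketch of structured per-step metrics: one JSON line per training step with loss, throughput, and peak GPU memory, which a log shipper could forward to a backend such as Prometheus or Weights & Biases. The function name and fields are illustrative, not any specific tool's API.

```python
# Minimal sketch of per-step training metrics as structured JSON lines.
# The function name and fields are illustrative, not a specific tool's API.
import json
import time

import torch


def log_step_metrics(step: int, loss: float, tokens: int, step_start: float) -> None:
    """Emit one JSON line per training step for a log shipper to pick up."""
    elapsed = time.time() - step_start
    record = {
        "step": step,
        "loss": round(loss, 5),
        "tokens_per_s": round(tokens / elapsed, 1),
        "step_time_s": round(elapsed, 3),
        # Peak memory allocated on the current GPU, in GiB.
        "gpu_mem_gib": round(torch.cuda.max_memory_allocated() / 2**30, 2)
        if torch.cuda.is_available()
        else 0.0,
    }
    print(json.dumps(record), flush=True)


# Usage inside a training loop (train_step and batch_tokens are hypothetical):
# start = time.time()
# loss = train_step(batch)
# log_step_metrics(step, loss.item(), batch_tokens, start)
```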