Member of Technical Staff - ML Infra at Impax Recruitment in Santa Clara, California

Posted in Other 3 days ago.

Type: full-time

Job Description:

I'm partnered with a startup that is building advanced physics foundation models for climate prediction and control.

Key Responsibilities:

  • Build and Manage Large-Scale ML Infrastructure: Architect and maintain distributed systems to support training and inference of large machine learning models, ensuring optimal performance across all stages.
  • Design Scalable Pipelines: Develop and implement end-to-end data processing pipelines capable of handling massive datasets, from ingestion and transformation to model training and deployment.
  • Explore and Test New Training Techniques: Research cutting-edge training methods, including parallelization strategies and precision trade-offs, to improve the performance and scalability of model training.
  • Optimize GPU Performance: Analyze and enhance low-level GPU operations to improve efficiency, reduce latency, and maximize hardware utilization in complex ML tasks.
  • Stay Updated on Industry Trends: Continuously monitor advancements in ML research to incorporate new ideas and techniques into our systems.

What We're Looking For:

  • Strong Problem-Solving and Fast Execution: You should thrive on tackling complex problems with speed and creativity, and adapt quickly to new technologies or challenges.
  • Expertise in Optimizing ML Workloads: Proven experience in optimizing training and inference for large models, including leveraging advanced techniques like mixed-precision training and hardware optimization.
  • Experience with Distributed Training Frameworks: Deep familiarity with distributed systems for training large models, such as FSDP or DeepSpeed.
  • Cloud Platform Knowledge: Hands-on experience with major cloud services (e.g., GCP, AWS, or Azure) and their AI/ML offerings for deploying and scaling models.
  • Containerization and Orchestration Skills: Proficient in tools like Docker and Kubernetes for deploying and managing containerized machine learning workloads in cloud environments.
  • Distributed Systems and Scalable Serving Expertise: Experience in building scalable task management systems and deploying machine learning models in production environments.
  • Monitoring and Observability Practices: Knowledge of best practices for monitoring, logging, and tracking performance in machine learning systems to ensure reliability and efficient version control.

Fully onsite in SF - Startup hours