Job ID: 2024-20065 | Type: Full-Time | # of Openings: 1 | Category: Information Technology
Overview
Princeton Plasma Physics Laboratory (PPPL), a U.S. Department of Energy (DOE) National Laboratory, is dedicated to pursuing one of the most transformative scientific goals of our time: nuclear fusion energy. Fusion, the process that powers the sun and stars, has the potential to provide a nearly limitless, clean energy source for the world. A key enabler of this pursuit is high-performance computing (HPC), which plays a pivotal role in advancing the complex science behind fusion. PPPL's researchers rely on cutting-edge computational capabilities to unlock knowledge that can only be revealed through large-scale simulations and data analysis.
We are seeking a dynamic and innovative Research Computing Specialist to join our dedicated team. This role bridges our Computational Sciences and IT departments, focusing on managing and enhancing PPPL’s research computing infrastructure. The successful candidate will play a critical role in supporting researchers by maintaining high-performance software stacks, managing in-house developed computational codes, and ensuring robust and efficient computational environments that are essential for our research mission.
Collaborate with researchers to understand and fulfill their computational requirements through effective technical solutions.
Install, maintain, and administer research computing systems, including clusters and individual systems.
Monitor and troubleshoot hardware and networking issues, ensuring optimal performance and reliability.
Assist in resolving system-level software, data, and job submission problems.
Develop and maintain comprehensive documentation for all implemented systems to facilitate collaboration and knowledge sharing.
Provide training and support to researchers and staff as needed.
Research and recommend new technologies and solutions to meet evolving research needs.
Contribute to the automation of configuration management in Linux-based systems, adhering to security best practices and laboratory cybersecurity policies.
Manage and maintain software stacks, including in-house developed research computational codes.
Apply AI and ML techniques to draw inferences from scientific data and improve computational research methods.
Create and maintain neural networks using proven technologies or develop new methods tailored to specific research needs.
A proud U.S. Department of Energy National Laboratory managed by Princeton University, Princeton Plasma Physics Laboratory (PPPL) is a longstanding leader in the science and innovation behind the development of fusion energy — a clean, safe, and virtually limitless energy source. With an eye on the future and in response to national priorities, PPPL also has begun a strategic shift from a singular focus on fusion energy to a multi-focus approach that includes microelectronics, quantum information science, and sustainability science. Whether it be through science, engineering, technology or professional services, every team member has an opportunity to make their mark on our world. PPPL aims to attract and support people with a rich variety of backgrounds, interests, experiences, and cultural viewpoints. We are committed to equity, diversity, inclusion and accessibility and believe that each member of our team contributes to our scientific mission in their own unique way. Come join us!
Responsibilities
Core Duties
35% Software Stack Development, Maintenance, and Support
Manage and maintain complex software stacks, including in-house developed research computational codes and external research applications.
Develop, optimize, and configure software environments to support various research workloads, ensuring compatibility and performance in high-performance computing (HPC) environments.
Troubleshoot software issues, ensuring seamless integration of new tools, libraries, and frameworks into the research computing environment.
Work with research teams to customize software solutions, ensuring that computational codes are tuned for optimal performance.
Provide ongoing support for software updates, security patches, and version control of research applications.
Automate software deployment and configuration management tasks using tools like Ansible to improve efficiency and reliability.
30% Collaboration with Researchers and User Support
Work directly with researchers to understand their computational needs and assist in integrating research-specific software into the HPC environment.
Provide user support for software and hardware issues, helping researchers optimize their workflows and computational models.
Deliver training on new software features and hardware capabilities to ensure researchers can effectively use the resources available.
15% AI/ML Integration and Development
Apply AI/ML techniques to optimize computational research, including the integration of neural networks or custom-built models into the research workflow.
Assist researchers in incorporating AI/ML tools where applicable, ensuring compatibility with existing hardware and software systems.
15% Hardware Installation, Maintenance, and Support
Install and configure HPC hardware, including clusters, servers, networking equipment, and specialized hardware for research.
Monitor hardware performance, troubleshoot failures, and perform preventive maintenance to ensure system reliability.
Work closely with vendors to manage hardware repairs, upgrades, and replacements as needed.
Maintain detailed documentation of hardware configurations and performance metrics for system tuning and future upgrades.
Manage storage systems (e.g., Ceph), ensuring that they meet the data needs of research projects.
5% Other Duties
Stay informed about new developments in HPC hardware and software, recommending improvements to the research computing environment.
Evaluate potential hardware and software upgrades, making recommendations to improve system performance and researcher productivity.
Other duties as assigned.
Qualifications
Education and Experience
Required Qualifications:
Bachelor’s degree in Computational Science, Information Technology, or a related field.
5+ years of experience managing research computing environments.
Proficiency in high-performance computing technologies and architectures.
Experience with parallel file systems (e.g., Ceph) and high-speed interconnects (e.g., InfiniBand, Ethernet fabrics).
Strong knowledge of job scheduling systems, such as SLURM.
Excellent written and verbal communication skills.
Ability to multitask and manage multiple projects effectively.
Experience with configuration management tools, such as Ansible.
Familiarity with automated deployment systems (e.g., Cobbler).
Knowledge of security benchmarks and best practices, such as CIS.
Ability to develop and implement technical solutions for specialized software and research data requirements.
General knowledge of networking equipment and techniques.
Experience with AI and ML technologies and their application in scientific research.
Preferred Qualifications:
Master’s degree in a relevant field.
Background or specialization in an area such as plasma physics or mathematics.
Experience working in a research or academic environment.
Familiarity with the development and management of computational codes used in scientific research.
Experience in training and supporting non-technical users in a research setting.
Proficiency in creating and maintaining neural networks, and in applying AI/ML methods to draw inferences from scientific data.
Proficiency in managing and administering HPC systems, including experience with parallel file systems (e.g., Ceph) and high-speed interconnects (e.g., InfiniBand, Ethernet fabrics).
Job Scheduling Systems Expertise
In-depth knowledge of job scheduling tools like SLURM, with an ability to configure and optimize job submissions for computational research needs.
System Administration and Troubleshooting
Expertise in Linux system administration, with a focus on monitoring and troubleshooting hardware, networking, and system-level software issues to ensure optimal system performance and uptime.
Software Stack Management
Ability to manage and maintain complex software stacks, including in-house developed research computational codes and various software packages used in scientific research.
Configuration Management and Automation
Proficiency in configuration management tools like Ansible and automated deployment systems, with experience automating and securing computational environments in line with cybersecurity policies.
AI/ML Workload Familiarity
Strong knowledge of artificial intelligence (AI) and machine learning (ML) techniques, including the ability to apply these methods to scientific research, build and maintain neural networks, and develop custom solutions for research-specific challenges.
Networking Expertise
General knowledge of networking technologies and techniques, particularly those required to support high-speed research computing environments.
Documentation and Collaboration
Proven ability to develop comprehensive system documentation and facilitate collaboration and knowledge sharing with researchers and IT staff.
User Support and Training
Strong communication skills, with experience providing technical support and training to researchers and non-technical users in the effective use of HPC resources and software tools.
Physical Requirements
Ability to lift and carry equipment (up to 50 lbs) as needed for the installation, maintenance, and troubleshooting of research computing hardware (e.g., servers, networking equipment).
On-site availability for physical access to server rooms, data centers, or networking locations to address hardware or networking issues as needed.
Ability to work in low-temperature environments typical of data centers or server rooms.
Princeton University is an Equal Opportunity/Affirmative Action Employer and all qualified applicants will receive consideration for employment without regard to age, race, color, religion, sex, sexual orientation, gender identity or expression, national origin, disability status, protected veteran status, or any other characteristic protected by law.
Please be aware that the Department of Energy (DOE) prohibits DOE employees and contractors from participation in certain foreign government talent recruitment programs. All PPPL employees are required to disclose any participation in a foreign government talent recruitment program and may be required to withdraw from such programs to remain employed under the DOE Contract.