Location: Cupertino, CA (onsite on Tuesday, Wednesday, and Thursday)
Job Summary
This role involves managing petabytes of data for machine learning applications and designing and implementing new frameworks for scalable, efficient data processing workflows and machine learning pipelines. The successful candidate will ensure complete data lineage and legal workflow integration while optimizing for performance and scalability. You will also monitor system performance, tune the system for cost and efficiency, and resolve any issues that arise. This is an exciting opportunity to work on cutting-edge technology and collaborate with cross-functional teams to deliver high-quality software solutions. The ideal candidate has a strong background in software development, experience with public cloud platforms, and familiarity with distributed databases.
Requirements:
10+ years of experience in software engineering with deep knowledge of computer science fundamentals.
Strong in data structures and algorithms; must write high-quality code with test cases and review PRs in a fast-paced environment.
Expert in one or more functional or object-oriented programming languages.
Fluent in Python.
Experience or knowledge in distributed data systems like Hadoop, Spark, Kafka, or Flink.
Experience or knowledge of public cloud platforms, preferably AWS, is a big plus.
Strong collaboration and communication (verbal and written) skills.
Experience with the DataHub open-source project is a plus.