We are looking for a skilled PySpark Engineer with hands-on experience on Google Cloud Platform (GCP) to join our team. The ideal candidate will have expertise in building scalable data processing pipelines with PySpark and in leveraging GCP services for big data solutions. Experience with Spring Boot is a plus, as it helps with integrating data processing pipelines into web-based applications or microservices.
Key Responsibilities:
Data Processing: Develop and maintain PySpark-based ETL pipelines to process and transform large datasets (a representative sketch follows this list).
Cloud-Based Data Solutions: Utilize GCP services (e.g., Google Cloud Storage, BigQuery, Dataflow, Dataproc) for data storage, processing, and analysis.
Optimization: Optimize the performance of PySpark jobs through cluster configuration and Spark best practices such as sensible partitioning, caching, and minimizing shuffles.
Big Data Management: Collaborate with data engineers to design and implement efficient storage systems, leveraging GCP's distributed data storage and compute resources.
Machine Learning: Work with data scientists to preprocess data for machine learning tasks using Spark MLlib and other tools in the GCP ecosystem (see the MLlib sketch after this list).
Data Integration: Integrate data pipelines with other GCP services like BigQuery for analysis, Cloud Pub/Sub for messaging, and Cloud Dataproc for Spark job management.
Spring Boot Integration (Optional): Develop RESTful APIs or microservices using Spring Boot to serve processed data or integrate PySpark pipelines with applications.
Collaboration: Work closely with other engineers, analysts, and data scientists to ensure the smooth operation of data pipelines.
Documentation: Write clean, efficient, and well-documented code. Maintain technical documentation for data pipelines and workflows.
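
To illustrate the day-to-day work, here is a minimal PySpark ETL sketch that reads raw CSV data from Google Cloud Storage, applies a simple transformation, and writes the result to BigQuery. All bucket, dataset, table, and column names below are placeholders, and the BigQuery write assumes the spark-bigquery connector is on the classpath (available on recent Dataproc images or added as a package).

```python
# Minimal PySpark ETL sketch: extract from GCS, transform, load into BigQuery.
# Bucket, dataset, table, and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("gcs-to-bigquery-etl").getOrCreate()

# Extract: load raw events from a GCS bucket (placeholder path)
raw = (
    spark.read
    .option("header", True)
    .csv("gs://example-bucket/raw/events/*.csv")
)

# Transform: basic cleansing and a daily aggregate per customer
daily_totals = (
    raw.withColumn("event_date", F.to_date("event_timestamp"))
       .filter(F.col("amount").isNotNull())
       .groupBy("event_date", "customer_id")
       .agg(F.sum("amount").alias("total_amount"))
)

# Load: write to BigQuery via the spark-bigquery connector,
# using a GCS bucket as the temporary staging area
(
    daily_totals.write
    .format("bigquery")
    .option("table", "example_dataset.daily_totals")
    .option("temporaryGcsBucket", "example-staging-bucket")
    .mode("overwrite")
    .save()
)
```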
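In the same spirit, the sketch below shows the kind of Spark MLlib preprocessing the role involves: indexing a categorical column, assembling numeric features, and scaling them. The columns and sample values are illustrative only.

```python
# Minimal Spark MLlib preprocessing sketch; column names and data are placeholders.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ml-preprocessing").getOrCreate()

# Placeholder training data; in practice this would come from GCS or BigQuery
df = spark.createDataFrame(
    [("US", 34.0, 120.5), ("DE", 29.0, 80.0), ("US", 41.0, 200.0)],
    ["country", "age", "spend"],
)

indexer = StringIndexer(inputCol="country", outputCol="country_idx")
assembler = VectorAssembler(
    inputCols=["country_idx", "age", "spend"], outputCol="features_raw"
)
scaler = StandardScaler(
    inputCol="features_raw", outputCol="features", withMean=True, withStd=True
)

# Chain the steps into a single reusable pipeline
pipeline = Pipeline(stages=[indexer, assembler, scaler])
model = pipeline.fit(df)
prepared = model.transform(df).select("features")
prepared.show(truncate=False)
```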
Required Skills and Qualifications:
Python: Strong experience with Python programming for data manipulation, processing, and building PySpark jobs.
PySpark: Extensive experience working with Apache Spark using the PySpark library for distributed data processing.
Google Cloud Platform (GCP): Hands-on experience with GCP services like BigQuery, Cloud Storage, Dataflow, Dataproc, and Pub/Sub.
SQL: Proficiency in writing complex SQL queries to process and analyze large datasets (an example of this style of query appears after this list).
Big Data Ecosystem: Familiarity with the big data ecosystem, including Hadoop, HDFS, and data warehousing concepts.
Data Storage: Experience working with cloud-based data storage (Google Cloud Storage, BigQuery, etc.).
Version Control: Experience with Git and other code versioning tools.
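
As a rough illustration of the SQL skills involved, the sketch below runs a window-function query through Spark's SQL engine against a temporary view; the table and column names are illustrative only.

```python
# Analytical SQL run through Spark's SQL engine; table and columns are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-analysis").getOrCreate()

orders = spark.createDataFrame(
    [("c1", "2024-01-01", 120.0), ("c1", "2024-01-03", 80.0), ("c2", "2024-01-02", 200.0)],
    ["customer_id", "order_date", "amount"],
)
orders.createOrReplaceTempView("orders")

# Window function: running total of spend per customer, ordered by date
running_totals = spark.sql("""
    SELECT
        customer_id,
        order_date,
        amount,
        SUM(amount) OVER (
            PARTITION BY customer_id
            ORDER BY order_date
        ) AS running_total
    FROM orders
""")
running_totals.show()
```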
Preferred Skills:
Spring Boot: Experience with Spring Boot for developing RESTful APIs or integrating PySpark pipelines with microservices or web applications.
Data Science/Machine Learning: Exposure to machine learning workflows and to the tools used to preprocess data for ML models.
CI/CD: Familiarity with continuous integration and deployment practices.
Containerization: Knowledge of Docker and container-based deployment (Kubernetes is a plus).
Education and Experience:
Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience.
3+ years of experience in data engineering or a related field with a focus on big data technologies.
Why Join Us:
Opportunity to work with cutting-edge cloud technologies and big data tools.