Highbrow LLC

via LinkedIn

PySpark Data Engineer

Hanover, NJ
Full-time
Posted 10/11/2025
Key Skills:
PySpark
Apache Spark
Python
Hadoop
HDFS
PII detection
Tokenization
SQL
Query optimization
Version control (Git)

Compensation

Salary Range

$120K - $160K a year

Responsibilities

Design, develop, and optimize PySpark ETL pipelines for large datasets with PII detection and tokenization integration ensuring data privacy compliance.
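The PII-detection step in a pipeline like this is often rules-based. As a minimal sketch (the pattern set and names are illustrative assumptions, not taken from the posting), a field-level detector might look like:

```python
import re

# Minimal rules-based PII field scanner. The patterns below are
# illustrative assumptions; a real pipeline would use a vetted pattern
# library and/or entity recognition for higher recall.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[A-Za-z]{2,}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def detect_pii(value):
    """Return the PII types whose patterns match a single field value."""
    if not isinstance(value, str):
        return []
    return [name for name, pattern in PII_PATTERNS.items()
            if pattern.search(value)]
```

In a PySpark job, a function like `detect_pii` could be registered via `pyspark.sql.functions.udf` and applied to each string column to flag attributes carrying PII.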

Requirements

5+ years of experience with PySpark and Hadoop, strong Python skills, expertise in PII identification and tokenization, and knowledge of big data query optimization.

Full Description

Job Title: PySpark Developer

Company Overview

[Insert Company Name] is a leading organization in [industry/sector, e.g., data analytics, finance, or technology services], committed to handling large-scale data securely and efficiently. We are seeking a talented PySpark Developer to join our data engineering team, focusing on processing high-volume datasets while ensuring compliance with data privacy standards through PII identification and tokenization.

Job Summary

As a PySpark Developer, you will build and optimize data pipelines that ingest massive datasets from Hadoop systems. Your primary focus will be scanning dataset fields to detect Personally Identifiable Information (PII), integrating tokenization services for data anonymization, and ensuring high-performance query execution. This role requires expertise in big data technologies, Python, and Apache Spark, with a strong emphasis on scalability, efficiency, and data security.

Key Responsibilities

• Design, develop, and maintain PySpark-based ETL pipelines to read and process high volumes of multiple datasets from the Hadoop Distributed File System (HDFS).
• Analyze and traverse fields within datasets to identify attributes containing PII, using pattern matching, rules-based logic, or machine-learning-assisted detection where applicable.
• Integrate and call external tokenization services to tokenize sensitive PII for secure storage and processing, and de-tokenize data when required for authorized access.
• Optimize PySpark queries and data processing workflows to handle large volumes of data efficiently, minimizing latency and resource consumption.
• Collaborate with data architects, security teams, and stakeholders to ensure compliance with data privacy regulations (e.g., GDPR, CCPA).
• Monitor and troubleshoot data pipeline performance, implementing best practices for partitioning, caching, and join optimization in PySpark.
• Document code, processes, and data flows to support team knowledge sharing and maintainability.
• Participate in code reviews, testing, and deployment of data solutions in a CI/CD environment.

Required Qualifications and Skills

• Bachelor's or Master's degree in Computer Science, Data Engineering, or a related field.
• 5+ years of hands-on experience with Apache Spark and PySpark for big data processing.
• Advanced proficiency in Python for data processing, scripting, and integration with Spark applications.
• Proven expertise with Hadoop ecosystems, including HDFS, YARN, and related tools.
• Strong understanding of data privacy concepts, including PII identification techniques (e.g., regex patterns, entity recognition).
• Experience integrating APIs or services for tokenization/de-tokenization (e.g., via RESTful services, cloud-based tools like AWS Macie, or custom microservices).
• Deep knowledge of handling large-scale data volumes, including data partitioning, shuffling, and broadcast joins in Spark.
• Strong grasp of query optimization strategies, such as cost-based optimization, predicate pushdown, and tuning Spark configurations (e.g., executor memory, parallelism).
• Proficiency in SQL for data querying.
• Experience with version control systems (e.g., Git) and agile methodologies.

Preferred Qualifications

• Certifications in big data technologies (e.g., Databricks Certified Developer for Apache Spark, Cloudera Certified Data Engineer).
• Familiarity with cloud platforms such as AWS, Azure, or GCP for big data processing.
• Knowledge of additional data security tools or frameworks (e.g., Apache Ranger, Kerberos for authentication).
• Experience with machine learning libraries in PySpark (e.g., MLlib) for advanced PII detection.
• Background in data governance or compliance roles.

What We Offer

• Competitive salary and benefits package.
• Opportunities for professional growth in a dynamic, innovative environment.
• Flexible work arrangements, including remote options.
• Access to cutting-edge tools and technologies for big data and AI.
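The partitioning, broadcast-join, and predicate-pushdown points above are typically steered through Spark configuration. The keys in the sketch below are standard Spark SQL options, but the values are illustrative assumptions that would be tuned per workload:

```
# spark-defaults.conf (illustrative values, tune per workload)
spark.sql.autoBroadcastJoinThreshold   64MB    # broadcast small dimension tables in joins
spark.sql.parquet.filterPushdown       true    # push predicates down to the Parquet reader
spark.sql.shuffle.partitions           400     # size shuffle partitions to the data volume
spark.executor.memory                  8g      # executor heap for large-scale processing
```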
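As a rough illustration of the tokenize/de-tokenize flow the description calls for, here is a minimal vault-style stand-in for an external tokenization service. The HMAC scheme, key, and in-memory vault are assumptions for this sketch; a real deployment would call a hardened service over REST and persist the token mapping in a secured store.

```python
import hashlib
import hmac

# Illustrative only: the key and in-memory "vault" are placeholders for
# a managed secret and a secured persistent token store.
SECRET_KEY = b"demo-key-not-for-production"
_vault = {}

def tokenize(value):
    """Replace a sensitive value with a deterministic surrogate token."""
    token = hmac.new(SECRET_KEY, value.encode("utf-8"),
                     hashlib.sha256).hexdigest()[:16]
    _vault[token] = value  # retained so authorized callers can de-tokenize
    return token

def detokenize(token):
    """Recover the original value for an authorized caller."""
    return _vault[token]
```

Deterministic tokens keep joins and group-bys working on tokenized columns, which is one reason HMAC-style schemes are common for this pattern.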

This job posting was last updated on 10/14/2025
