Stitch Fix
Employee
TELECOMMUTE – US National
4/30/22
Job Description
Title: Data Platform Engineer – Compute Infrastructure
Remote, USA
About the Team
The Compute Infrastructure team is the data platform team responsible for the frameworks and services we use to operate on our data, including Spark, Presto, and Kafka. The team also owns most of the initial ingestion of data into our data warehouse, both from our logging pipelines and from regular snapshots of transactional databases.
Our infrastructure is 100% deployed on AWS
We make heavy use of Spark to process and transform data
We use several Presto clusters for ad-hoc queries and analysis
Our Kafka-based logging pipeline collects messages from applications used broadly throughout Stitch Fix as well as from our internal applications
We run thousands of batch jobs each night, training hundreds of models that feed our recommendation engines and other data-driven APIs
The source of truth for our data warehouse is AWS S3, and we use the Hive Metastore to manage schemas, data locations, and versions (illustrated in the sketch after this list)
We write most of our code in JVM languages (Java and Scala) and Python
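To make the stack concrete, here is a minimal sketch (in Scala, one of the primary languages named above) of a Spark job that reads a table registered in the Hive Metastore and backed by S3, aggregates it, and writes the result back as another managed table. The table and column names are hypothetical and purely illustrative, not taken from this posting.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object NightlyAggregateJob {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport lets Spark resolve table names and their S3
    // locations through the Hive Metastore rather than raw paths.
    val spark = SparkSession.builder()
      .appName("nightly-aggregate-job")
      .enableHiveSupport()
      .getOrCreate()

    // "logs.client_events" is a hypothetical Metastore table whose
    // underlying files live in S3.
    val events = spark.table("logs.client_events")

    val counts = events
      .groupBy(col("client_id"))
      .agg(count("*").as("event_count"))

    // Persist the result as another Metastore-managed table.
    counts.write.mode("overwrite").saveAsTable("warehouse.client_event_counts")

    spark.stop()
  }
}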
About the Role
In this role, you will develop, monitor, and improve the services, libraries, and tools that operate on our data. In addition, you’ll consult with Data Scientists and Analysts to build robust pipelines that take advantage of Spark and our data ecosystem.
This is an individual contributor role on the Compute Infrastructure team, part of the larger data platform team in our tech organization
You’ll have opportunities to work on high impact projects that improve data availability and quality, and provide reliable access to data for the analytics division and the rest of the business
You will help make our infrastructure more scalable, more reliable, and easier to use
You’ll consult with others on the team, helping them with some of their daily data challenges
You’re excited about this opportunity because:
You’ll create services and tools that improve the experience for our data scientists, along with the abstractions they need to use our data ecosystem effectively
You’ll help us improve our Spark and Presto deployments so they function well under load in our AWS environment.
You’ll build services to ingest data into our warehouse and ensure it’s clean and consistent.
You’ll work on core pipelines that provide data lineage information and help ensure that our data warehouse is GDPR-compliant
You’ll help enhance our Spark infrastructure and develop libraries that improve reading and writing data and extend what users can do with UDFs (see the sketch after this list).
You’ll provide ETL patterns that others can follow and abstractions that make working with data easier, and you’ll consult with other team members to create new data pipelines and improve existing ones.
Many of the changes we need would also benefit others in the big data community. You’ll have the opportunity to contribute back.
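As an illustration of the UDF point above, here is a minimal sketch of defining a Scala UDF, registering it, and applying it to a DataFrame. The function and column names are hypothetical, not part of any actual Stitch Fix library.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object UdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("udf-sketch")
      .getOrCreate()
    import spark.implicits._

    // A hypothetical normalization function wrapped as a UDF.
    val normalizeSku = udf { raw: String =>
      Option(raw).map(_.trim.toUpperCase).orNull
    }

    // Registering the UDF also makes it callable from SQL.
    spark.udf.register("normalize_sku", normalizeSku)

    val items = Seq(" ab-123 ", "cd-456").toDF("sku")
    items.select(normalizeSku($"sku").as("sku")).show()

    spark.stop()
  }
}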
We’re excited about you because:
You have 5 or more years of relevant project experience with significant contributions.
You have exceptional coding and design skills, particularly in Java/Scala and Python.
You’ve used Spark extensively and are comfortable with the Hive Metastore. You know how to take advantage of Spark APIs as well as write SQL (see the sketch after this list).
You’ve worked on some challenging data migration and data transformation projects.
You …
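As a small, hypothetical illustration of the Spark API and SQL point above, the same aggregation can be expressed through the DataFrame API or as SQL against a Metastore table; both yield equivalent results.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ApiVersusSql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("api-versus-sql")
      .enableHiveSupport()
      .getOrCreate()

    // "warehouse.shipments" is a hypothetical Metastore table.
    // 1. The DataFrame API:
    val viaApi = spark.table("warehouse.shipments")
      .groupBy(col("region"))
      .agg(avg(col("days_in_transit")).as("avg_transit"))

    // 2. The same query in SQL:
    val viaSql = spark.sql(
      """SELECT region, AVG(days_in_transit) AS avg_transit
        |FROM warehouse.shipments
        |GROUP BY region""".stripMargin)

    viaApi.show()
    viaSql.show()

    spark.stop()
  }
}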
To apply for this job, please visit www.stitchfix.com.