We are seeking an experienced engineer to design, build, and maintain Kafka clusters and their associated ecosystem, ensuring the resilience, reliability, and scalability of critical services. In this role, you will play a key part in accelerating deployments that enhance products used by millions of users daily. Your responsibilities will span the entire lifecycle of the storage cluster and its ecosystem on AWS, from initial design through development and ongoing operation. You will also develop support tools and technical processes that simplify workflows and support engineers working across multiple services. A crucial aspect of the role is identifying opportunities to automate and scale systems efficiently while maintaining stringent security and reliability standards.
As part of your duties, you will participate in on-call rotations and contribute to improving incident response procedures, ensuring rapid and effective resolution of issues. This role requires a proactive approach to troubleshooting and a strong commitment to operational excellence.
To be successful, you should bring at least five years of relevant experience, particularly in developing, operating, and troubleshooting storage clusters or other highly available systems at scale. Hands-on experience running Kafka on Kubernetes is essential. You should be proficient in one or more programming languages such as Go, Python, Java, Groovy, Scala, or Ruby, enabling you to contribute effectively to both development and automation tasks.
A solid background in cloud infrastructure, preferably AWS, is required, along with experience automating infrastructure using Infrastructure as Code principles. You should also have a strong Unix or Linux foundation, including a good understanding of the network stack and solid scripting skills. Experience in incident response or incident management is a strong asset.
Additionally, familiarity with monitoring automation, continuous integration and continuous deployment (CI/CD), and security practices will be advantageous. These skills will help you build robust systems that are easier to manage and scale, while also strengthening the overall security posture.
This position offers the opportunity to work in a dynamic environment where your contributions directly impact the performance and reliability of services used by millions. If you are passionate about building scalable, secure, and highly available systems and enjoy collaborating across teams to drive innovation, we encourage you to apply.