UpGuard - Site Reliability Engineer
At UpGuard, our Platform team handles scale, deployment, uptime, monitoring and infrastructure for both our cloud and enterprise appliance customers. We build autonomous, self-healing clusters of systems using distributed consensus protocols and containers. Our internal tools are built with open-source projects like CoreOS, Etcd, Docker, Fleet, and Kubernetes. We follow a strong release process and collaborate with the Engineering and Product teams. We've built continuous integration and delivery mechanisms (DevOps) and test the resilience of our systems often with live host reboots in production. We’ve got experience building systems that scale and work across datacenter regions. We write code, so the ideal candidate will have experience in both systems and software development.
Our goal is to create an SRE team that incorporates many of the attributes that Google describes in O'Reilly's "Site Reliability Engineering" book. We are looking for candidates who are fast learners, great communicators (both within and outside the team), and strong troubleshooters, and who always strive to build better systems.
- Design, write and deliver software to improve our product's availability, scalability, latency and efficiency
- Troubleshoot and solve problems related to operation of our service
- Provide on-call support for production issues as part of a rotation among SRE team members
- Respond to trouble tickets in a timely manner, looping in members of other teams as needed
- Automate away all non-exceptional service issues to prevent recurrence and to reduce toil for SREs
- Work with Engineering to ensure that our software is compatible with designs, patterns and standards for large-scale distributed systems
- Provide capacity planning, demand forecasting, software performance analysis and system tuning
- BS degree in Computer Science or related technical field, or equivalent practical experience.
- Programming experience in one or more of C, C++, Java, Perl, Python, or Go, or scripting experience in Shell.
- Experience working with Unix/Linux systems from kernel to shell and beyond, with experience working with system libraries, file systems, and client-server protocols.
- Networking experience, including network theory (TCP/IP, UDP, ICMP), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
- Experience designing and troubleshooting distributed systems
- In-depth knowledge of operating systems (processes, threads, concurrency issues, locks, mutexes, semaphores, monitors and how they work)
- Familiarity with algorithms, data structures and complexity analysis. We like folks who are problem solvers and have a strong sense of ownership and drive
- Family Coverage. We cover 100% of the cost of medical, dental and vision for you and your family. It's the right thing to do.
- Free Lunch. Delivered every day. And there's a fully stocked kitchen and breakroom, because of course there is.
- Team Events. Barbecues in the summer, happy hours, and even the occasional after-hours D&D session. Why not?
- Commute Simplicity. Our headquarters is located within walking distance of the light rail and Caltrain transit options. Commuter benefits, too.
- Great Gadgetry. We give you the equipment you need to do your best work. A powerful laptop and additional widescreen monitor at your desk are standard-issue.
- Competitive Salary. When we find the right people, we want them to feel comfortable and enjoy their lives outside of work.