This role will have the primary accountability of designing, implementing, and operating Couchbase’s Cloud platforms. Golang knowledge is a huge plus! The team operates with a “run what you write” philosophy and each engineer is responsible for deploying and operating the code they write.
A successful candidate must have demonstrable experience in at least one programming language (preferably Go) and previous work in SaaS application development and operations. You will work closely with the Support and Development teams on the architecture and configuration of our AWS-hosted infrastructure. You will be responsible for ensuring the environment is built, deployed, configured, managed, and monitored correctly to support the business. You will drive decisions on the correct usage of cloud resources, troubleshoot performance issues, and ensure the highest level of reliability for the platform by tuning the environment for maximum scalability, cost efficiency, and security. Candidates must have experience developing and maintaining applications running on large public cloud platforms, ideally AWS, Azure, and GCP.
This role is also open to remote work as our teams are globally distributed. We are a remote-first team.
- Design, deploy, and maintain a large-scale cloud platform with a focus on the key pillars of the cloud: Reliability, Operational Excellence, Security, Performance, and Cost Optimization
- Own and be responsible for best practice use of our cloud ecosystem from the cloud infrastructure through to the use of our application
- Passionate about automating everything and proficient in at least one of the following languages (Golang, Python, Ruby)
- Understand why using infrastructure as code to efficiently provision infrastructure and services is the only way to build and maintain a large-scale cloud platform
- Develop comprehensive monitoring solutions to provide full visibility into the different platform components using tools and services such as Kubernetes, Prometheus, Grafana, ELK, Datadog, and New Relic
- Experience working within an Agile/Scrum SDLC
- Integrate different components and develop new services with a focus on open source to allow a minimal friction developer interaction with the platform and application services
- Identify and troubleshoot any availability and performance issues at multiple layers of deployment, from hardware, operating environment, network, and application
- Evaluate performance trends and expected changes in demand and capacity, and establish the appropriate scalability plans
- Troubleshoot and solve customer issues on production deployments
- Ensure that SLAs are met in executing operational tasks
- Collaborate with other engineers to implement operational solutions while defining and adhering to industry best practices
- Experience building and managing virtualized systems (KVM, OVM, containers/Docker) and the ability to read and understand source code
- Systematic problem-solving approach, combined with a strong sense of ownership and drive
- Conduct periodic on-call duties
- Working knowledge of information security issues
- Working knowledge of web and network protocols and standards (HTTP, TLS, DNS, etc)
- 5+ years related professional experience
- 2 to 5 years as a cloud administrator supporting enterprise computing platforms and systems
- Public cloud provider certifications are great to have
- Strong experience with Infrastructure as Code and configuration management tools, preferably Terraform
- Demonstrable experience of methods to promote the correct use of cloud platforms with multiple layers of abstraction and responsibility
- Experience with Prometheus/Grafana for metrics aggregation/visualization
- Experience configuring CI/CD pipelines, preferably with Spinnaker
- Experience using Kubernetes
- Experience with automation tools/platforms
- Experience with alerting and monitoring tools
- Experience working with NoSQL databases is a plus
- Experience working in a highly distributed company is a plus
- Experience writing backend applications is not required but definitely a plus
- Experience working within an Agile/Scrum SDLC
- Align a portion of your day with the business hours of the Pacific Time Zone (UTC-8)
What does success in this role look like?
- In three months, you have become the cloud administrator with respect to overall site availability, security, latency, system health, customer accounts, and billing. You’ll have taken on independent code review responsibilities and are collaborating on the design of new features
- In six months, you have earned the trust of the team and are delivering tasks through the entire SDLC, from design through development with minimal guidance, and are helping to effectively mentor new engineers joining the team
- In twelve months, you have established a cadence of predictable, on-time delivery without cutting corners