We are looking for an experienced Site Reliability Engineer (SRE) to harden infrastructure and customer resources and improve visibility into them, making them more robust and secure.
- Monitor alerts and respond to outages or performance degradation
- Develop tools to streamline alert response and automate as much of it as possible
- Reduce manual, repetitive, error-prone workload, freeing up engineering to take on longer-term projects and become more proactive than reactive
- Reduce alert fatigue by ensuring the monitoring system minimizes false positives and prioritizes clearly actionable, timely alerts
- Continuously improve and refine both monitoring systems and deployment workflows
- Prioritize both security and the end user experience
- Provide not only customer value, but also value to the rest of the Blockdaemon team
- Perform other duties and responsibilities as assigned
- Linux service administration (systemd, docker, etc.)
- Linux shell (bash, ssh, etc.)
- Certificate and key management (SSH keys, TLS, CA, PKI, etc.)
- Linux troubleshooting (curl, tcpdump, ps, top, swap, memory, CPU usage, kernel logs, etc.)
- Cloud VM/network provisioning and administration (AWS, GCE, Azure)
- Beats, Elasticsearch, Logstash, Kibana
- Terraform and/or Ansible, Vault a plus
- Git and continuous integration
- K8s experience a must
- Nginx/HTTP/HTTPS/JSON-RPC/REST a plus
With over 20 years of experience, we have come to understand that innovation is the only way to provide agile, practical solutions that transform businesses and careers.
Our resourcing and smart services help you to realize tomorrow’s potential. Discover the amazing things possible when you bring the right people and the right technologies together.
To apply for this job, please visit www.monstergulf.com.