Site Reliability Engineer - Linux, Observability, Containers
Visa
Company Description
Visa is a world leader in payments and technology, with over 259 billion payments transactions flowing safely between consumers, merchants, financial institutions, and government entities in more than 200 countries and territories each year. Our mission is to connect the world through the most innovative, convenient, reliable, and secure payments network, enabling individuals, businesses, and economies to thrive while driven by a common purpose – to uplift everyone, everywhere by being the best way to pay and be paid.
Make an impact with a purpose-driven industry leader. Join us today and experience Life at Visa.
Job Description
We are seeking a motivated Site Reliability Engineer (SRE) to join our Observability team. In this role, you will support the team in maintaining and improving the reliability, security, and performance of our systems. You will learn from experienced engineers while gaining hands-on experience with modern monitoring, logging, and automation tools.
As an SRE I, you will assist in day-to-day operational tasks, help monitor system health, and participate in basic troubleshooting. You will also contribute to the maintenance of documentation and develop your technical skills through training and on-the-job experience.
Responsibilities
Assist in maintaining system security by applying hotfixes and operating system patches under guidance to protect against cybersecurity threats.
Support the deployment and configuration of monitoring and logging tools.
Help automate routine operational tasks to improve efficiency and support system integration.
Assist with the maintenance and basic management of observability tools such as Splunk, ClickHouse, Grafana, Prometheus, OpenTelemetry, Fluent Bit, Elasticsearch, OpenSearch, and CloudWatch.
Work with team members to help implement and maintain monitoring solutions in development, staging, and production environments.
Learn and apply DevOps and SRE best practices as directed by senior engineers.
Contribute to the setup and maintenance of CI/CD pipelines to support automated build, test, and deployment processes.
Provide support in managing cloud infrastructure (AWS, GCP) to help ensure availability and security.
Learn to use infrastructure as code tools such as Terraform, Ansible, or CloudFormation to support environment configuration.
Monitor system performance and assist in identifying and escalating issues for resolution.
Support the implementation and management of containerization technologies like Docker and Kubernetes.
Participate in basic troubleshooting and assist with root cause analysis for production incidents.
Help create and update documentation for infrastructure, processes, and operational procedures.
Provide first-level support for routine infrastructure and deployment issues, escalating complex problems as needed.
Look for opportunities to automate repetitive tasks and suggest improvements to workflows.
Visa’s Observability ecosystem includes over 2,000 platform nodes, utilizing approximately 15 different tools for logging, monitoring, and tracing, alongside 80,000 client agents. The system handles daily log ingestion exceeding 100TB and oversees hundreds of critical applications, supporting vital alerts, dashboards, and reports. To maintain this high level of performance and reliability, we need a Site Reliability Engineer (SRE) with comprehensive knowledge and practical experience. This position requires an I4-level engineer who can operate independently with minimal supervision.
About Visa’s PRE Observability Team
Visa’s Product Reliability Engineering (PRE) Observability team partners with Product Development as well as Operations & Infrastructure teams to build and manage innovative, reliable, scalable, secure, and cost-effective observability platform solutions. We are looking for talented Senior Site Reliability Engineers to join our driven team, with a focus on maximizing system availability, performance, security, and reliability. This dynamic role requires technical leadership, strong problem-solving skills, and expertise in coding, testing, and debugging.
This is a hybrid position. Expectation of days in office will be confirmed by your hiring manager.
Qualifications
Basic Qualifications:
- Bachelor’s degree with 1-3 years of relevant work experience.
Preferred Qualifications:
- Hands-on experience with at least one observability tool (e.g., Splunk, ClickHouse, Grafana, Prometheus, OpenTelemetry, Fluent Bit, Elasticsearch, OpenSearch, or CloudWatch).
- Familiarity with setting up or configuring exporters (such as Node exporter or Cert exporter) for collecting metrics.
- Exposure to containerization technologies such as Docker or Kubernetes through coursework, projects, or internships.
- Basic understanding of or experience with CI/CD tools and pipelines (e.g., GitHub Actions, Jenkins, or Ansible).
- Introductory knowledge of Infrastructure as Code concepts and tools like Terraform or Ansible.
- Awareness of query languages such as PromQL, SQL, or Splunk SPL.
- Experience using Linux or Unix environments and basic scripting skills in Python and/or Shell.
- Interest in cloud platforms such as AWS or GCP; cloud certifications are a plus.
- Strong problem-solving and analytical skills, with a willingness to learn and grow in a collaborative environment.
- Effective verbal and written communication skills.
- Ability to work well in a team and take initiative in learning new technologies and practices.
Additional Information
Visa is an EEO Employer. Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability or protected veteran status. Visa will also consider for employment qualified applicants with criminal histories in a manner consistent with EEOC guidelines and applicable local law.