Agent Ops Engineer

Ho Chi Minh

Tech/Engineer

We’re hiring an Agent Ops Engineer to scale AI agent capabilities across HRS Domain and products. This is a high-impact role at the intersection of AI engineering, platform operations, and knowledge enablement. You’ll set direction and make AI agents reliable in production across teams by owning the lifecycle, quality gates, observability, and operational standards, while embedding with teams to accelerate adoption. The larger goal of this centralized Agent Ops model is to enable AI enablers and product builders within each product team to develop agents, and at the same time to contribute common best practices and guardrails to MFBS adoption across other domains such as ERP and SMB.

What you will do:

1) Agent Engineering & Operations

Design, build, and maintain production-grade AI agent systems, including: context engineering and instruction architecture, prompt hardening and safe execution boundaries, tool integrations and multi-step orchestration, memory strategies and reliability patterns.
Own the full agent lifecycle: prototype → evaluate → deploy → monitor → iterate.
Build and maintain an evaluation pipeline to measure agent quality, catch regressions, and enforce deployment gates (golden datasets, scenario suites, automated checks).
Instrument agents and agent platforms for production observability: structured logging, tracing, and metrics; latency and cost monitoring; tool-call success rates and failure analysis.
Define operational readiness standards, including rollback criteria, incident response playbooks, and recovery paths for common failure modes.

2) Team Enablement & Coaching

Embed with product engineering teams to identify high-value use cases ready for agent automation. We will operate as a central Agent Ops function, enabling AI product builders through AI enablers.
Translate business workflows into agent-executable tasks with clear contract boundaries/interfaces, assumptions and inputs/outputs, and failure modes and safe fallbacks.
Deliver targeted coaching to engineers on: context engineering best practices, harness design and regression testing patterns, agent skill design and tool-contract discipline.
Reduce onboarding time for teams adopting AI capabilities—from first conversation to a production-ready agent.
Train product engineers to extend and maintain agent skills independently.

3) Standards & Knowledge Operations

Author and maintain org-level standards for agents, including: naming conventions, context file structures and ownership rules, skill interface contracts (inputs/outputs, invariants, error handling), evaluation criteria and release quality bars.
Establish and enforce “repo-as-discipline” practices so agent knowledge is: versioned, reviewable, discoverable, reusable; not trapped in prompt snippets or individual heads.
Build and grow a shared agent skills library that teams can reuse and extend.
Track and aggregate AI tooling/framework updates and external best practices, serving as a central intake so product teams don’t each have to follow the entire AI landscape.
Run internal knowledge-sharing sessions, showcases, and retrospectives to propagate learnings efficiently.

What you bring:

Must have

12+ years of experience in the software development industry.
Hands-on experience building and deploying production AI agents using modern frameworks (LangGraph, LangChain, OpenAI Agents SDK, trueAI, or equivalent).
Strong understanding of context engineering, including instruction architecture, token management, caching strategies, and latency-aware design.
Experience building evaluation pipelines: golden datasets and scenario libraries; automated quality gates and regression detection.
Familiarity with agent observability: tracing, structured logging, latency, and cost monitoring; tool-call reliability metrics and failure analysis.
Ability to design guardrails: output validation; prompt injection mitigation; safe execution boundaries for tools/actions.
Solid backend engineering skills; comfortable owning services/APIs end-to-end.
Strong communicator who can coach engineers, facilitate cross-team discussions, and write clear technical documentation.
Experience with production reliability and platform operations, including: event-driven architectures (Kafka and/or message queues); retries/backoff, DLQs, idempotency, ordering, backpressure; CDC/outbox-style patterns (or similar asynchronous reliability patterns); Kubernetes-based deployment and day-2 operations; CI/CD pipelines and infrastructure as code; on-call, incident response, postmortems, and SRE-style practices (SLOs/SLIs, runbooks).

Nice to have

Experience with RAG systems: ingestion, chunking, embeddings, hybrid search, retrieval evaluation.
Familiarity with MCP / Model Context Protocol or similar agent tooling standards (e.g., “MPTV”), and tool integration ecosystems.
Proficiency across Java/Kotlin (Spring Boot) and Python in production environments.

Who thrives in this role?

Engineers with an SRE/DevOps background pivoting into AI who naturally think about reliability, observability, and incident response.
Backend engineers with hands-on LLM/agent framework experience who want to work cross-functionally and enable multiple teams.
MLOps/LLM engineers who want to embed in product orgs and ship applied systems (not only model infrastructure).
Engineers who treat documentation, standards, and knowledge transfer as first-class engineering outputs.

What you can expect

A greenfield mandate to define what “good AI operations” looks like at scale inside an engineering organization.
Direct influence on the standards, patterns, and tooling multiple product teams will adopt.
A role that grows from team-level impact to organization-wide impact as the practice matures.
Work at the frontier of applied AI engineering, where best practices are still being written.

Our stack

Agent frameworks and LLM APIs, OpenTelemetry, Kafka/event-driven systems, Kubernetes, Spring Boot, Java, Kotlin, Python, CI/CD pipelines, AWS/cloud infrastructure.

Our benefits

Caring Mental & Physical Recreation:

Hybrid working: 2 days at the office and 3 days WFH
Working hours: flexible start between 8 AM and 9 AM, Monday to Friday
Full salary during probation
Insurance (applied from the probation period):
- Social Insurance, Health Insurance, Unemployment Insurance (on 100% salary)
- Private health insurance & accident insurance. From manager level: additional coverage for family members
Bonus: 13th month salary
16 - 24 paid days off and more
Paternity leave: Extra 5 days
Annual company trip; Quarterly team building
Billiards & Running club
Annual health check
Well-equipped facilities: MacBook Pro, additional monitor, etc.