$120K - $160K a year
Design and iterate prompts for voice AI agents, build prompt-authoring tools, run evaluations and experiments, and ensure privacy compliance.
3+ years of production Python experience, hands-on prompt engineering, LLM API integration, an evaluation mindset, and customer collaboration.
Description:
• Design & iterate prompts (system, tool/function-calling, task prompts) to boost voice AI agent success, reliability, and tone.
• Build co-pilots for customers to author their own prompts: meta-prompted assistants that suggest structures, lint for risks, autocomplete tool schemas, critique drafts, and generate eval cases.
• Work directly with customer feedback and conversation logs to identify failure modes; translate them into prompt changes, guardrails, and data improvements.
• Build eval datasets (success labels, rubrics, edge cases, regressions) and run offline/online evaluations (A/B tests, canaries) to quantify impact.
• Create Python utilities/services for prompt versioning, config-as-code, rollout/rollback, and guardrails (policies, refusals, redaction).
• Partner with PM/Success to define success metrics (task completion, first-pass accuracy, cost, latency) and instrument dashboards/alerts.
• Own LLM integration details: function/tool schemas, output parsing/validation (pydantic), retrieval-aware prompting, and fallback strategies.
• Ensure privacy & compliance (PII handling, anonymization, regional data boundaries) in datasets and logs.
• Share learnings via concise docs, playbooks, and internal demos.
• Run a tight feedback loop with customers, turn real conversations into better prompts and eval datasets, and ship changes that measurably improve agent outcomes.

Requirements:
• Python: 3+ years writing clean, tested, production code (typing, pytest, profiling); experience building small services/APIs (FastAPI preferred).
• Prompt Engineering: hands-on experience designing system/tool prompts, meta-prompting, rubric graders, and iterative prompt tuning based on real user data.
• LLM Integration: comfortable with major APIs (OpenAI/Anthropic/Google/Mistral), function/tool calling, streaming, and robust output handling.
• Evaluation Mindset: ability to define measurable success, create labeled datasets, and run methodical experiments/A/B tests.
• Product Sense: comfortable talking with customers and turning qualitative feedback into shipped improvements.
• Data Hygiene: practical experience cleaning, labeling, and balancing datasets; awareness of privacy/PII constraints.

Nice-to-haves:
• Experience building prompt-authoring UIs/SDKs or internal tooling for prompt versioning and governance.
• Agentic frameworks & tooling: DSPy, MCP, LangGraph, LlamaIndex, Rasa; experience with agent/tool schemas and orchestration.
• Observability & eval tooling: Langfuse, LangSmith, Braintrust; building eval harnesses and experiment dashboards.
• RAG & vector stores: Qdrant/Weaviate/Pinecone and retrieval-aware prompting.
• Experimentation workflows: A/B testing, prompt diffing/versioning.
• Infra & analytics: light SQL/log analysis, metrics & tracing, simple Grafana/OTel dashboards.
• Writing public blog posts or giving talks about applied LLM techniques.

Benefits:
This job posting was last updated on 9/23/2025