Fully remote, with the exception of occasional meetings in San Francisco to collaborate.
Bay Area residency required.
We believe that everyone deserves their own personal army of AI helpers with deep access to company data to automate any task. Datagrid ingests business data continuously from 100+ sources, makes it all available to AI, and eliminates grunt work such as categorizing 10k support tickets in minutes.
We are a Series A startup headquartered in San Francisco, but we operate as a distributed company. We offer competitive salaries and health benefits, along with equity and respect for work/life balance.
Join our tight-knit team that ships fast and pushes the boundaries of AI! In the last few months, our Agents learned to use Microsoft Teams, write SQL queries, and automate tasks on complex schedules like "MWF at half past 9". Our Agents live where people work (Slack, Microsoft Teams, etc.) and automatically take useful actions like producing safety reports from worksite photos.
Responsibilities
Datagrid Agents operate where our customers work: across Teams, Slack, and even SMS. Agents make multistep plans, leverage vectorized data from 100+ sources, use tools like Docusign, and manipulate the Datagrid app. We cannot possibly test all of this manually.
Your job will be to:
Work closely with an ex-Googler who built Gemini evals to create a harness for evaluating Agent performance, make that harness available both for local development and in CI/CD pipelines, and set up alerting for when Agents misbehave.
Influence and contribute to the extension of Datagrid's Agentic capabilities.
Choose the best open/closed source components to build out the testing infra.
Integrate publicly available benchmarks such as RAGBench into the testing system.
Grant subject matter experts the ability to add to the test library using customer queries, manually authored cases, and synthetically generated questions.
Expose evaluation performance so the company can track improvement over time.
Desired Experience
Proven track record of building test harnesses for Chat Agents from 0 → 1.
10+ years of B2B software engineering experience.
Ability to write effective LLM prompts without assistance.
Proficiency with Node.js and server-side frameworks such as NestJS or Next.js.
Familiarity with JavaScript frameworks such as React and AngularJS.
Experience with databases such as Weaviate and BigQuery.
Experience working with GCP or similar cloud providers.
Salary Range: $200k - $240k
Equity
100% covered medical, dental, and vision
401k
All candidates for this role will be asked the following interview question: "Work with me to design a system to evaluate the Agent's performance at SQL queries." We don't expect you to have the perfect answer, but will evaluate you on your ability to clearly explain your thinking.