Runtime Evaluation
See the Chat Completion docs and API Reference for more information on making chat completions with OpenPipe.
When sending a request to the /chat/completions endpoint, you can specify a list of criteria to run immediately after a completion is generated. We recommend generating multiple responses from the same prompt, each of which will be scored by the specified criteria. The responses will be sorted by their combined score across all criteria, from highest to lowest. This technique is known as Best of N sampling.
To invoke criteria, add an op-criteria header to your request with a list of criterion IDs, like so:
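As a rough Python sketch, assuming the OpenPipe SDK's OpenAI-compatible client forwards extra request headers and that the header value is a JSON-encoded array of criterion IDs (the model name below is a placeholder):

```python
import json
import os

from openpipe import OpenAI  # OpenPipe's OpenAI-compatible Python client

client = OpenAI(openpipe={"api_key": os.environ["OPENPIPE_API_KEY"]})

completion = client.chat.completions.create(
    model="openpipe:your-fine-tuned-model",  # placeholder model slug
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    n=3,  # generate several candidates so they can be ranked (Best of N)
    extra_headers={
        # Assumed encoding: a JSON array of criterion IDs, either pinned to a
        # version (criterion-1@v1) or defaulting to the latest (criterion-2).
        "op-criteria": json.dumps(["criterion-1@v1", "criterion-2"]),
    },
)
```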
Each criterion ID can reference a specific version, like criterion-1@v1, or default to the latest criterion version, like criterion-2.
In addition to the usual fields, each chat completion choice will now include a criteria_results object, which contains the judgements of the specified criteria. The array of completion choices will take the following form:
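The sketch below, which continues the Python example above, illustrates one plausible shape; the inner field names (score, explanation) are assumptions rather than the documented schema, so consult the API Reference for the exact format.

```python
# Choices arrive sorted by combined criteria score, from highest to lowest,
# so choices[0] is the best-ranked response.
for choice in completion.choices:
    print(choice.message.content)

    # criteria_results is keyed by criterion ID; a plausible (assumed) shape:
    #   {
    #       "criterion-1": {"score": 0.9, "explanation": "..."},
    #       "criterion-2": {"score": 0.4, "explanation": "..."},
    #   }
    results = getattr(choice, "criteria_results", None)
    print(results)

best_choice = completion.choices[0]  # highest combined score across criteria
```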
Offline Testing
See the API Reference for more details.
To judge existing outputs offline, without generating new completions, send a request to the /criteria/judge endpoint. You can request judgements using either the TypeScript or Python SDKs, or through a cURL request.
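A minimal sketch over plain HTTP, assuming the same https://api.openpipe.ai/api/v1 base URL used for chat completions; the request-body field names (criterion_id, input, output) are assumptions, so check the API Reference for the exact schema:

```python
import os

import requests

# Illustrative request body: judge a single input/output pair against one
# criterion. The field names here are assumptions, not the documented schema.
response = requests.post(
    "https://api.openpipe.ai/api/v1/criteria/judge",
    headers={"Authorization": f"Bearer {os.environ['OPENPIPE_API_KEY']}"},
    json={
        "criterion_id": "criterion-1",
        "input": {"messages": [{"role": "user", "content": "Count to 10"}]},
        "output": {"role": "assistant", "content": "1, 2, 3, 4, 5, 6, 7, 8, 9, 10"},
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())  # the criterion's judgement for the supplied pair
```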