SDK Reference

Connectors Module

class AggregationLogicIdentifier(*, logic: str, evaluation_schema: dict[str, Any], aggregation_schema: dict[str, Any])[source]: Bases: BaseModel

class ArgillaClient[source]

Bases: ABC

Client interface for accessing an Argilla server.

Argilla supports human-in-the-loop evaluation. This class defines the API used by the intelligence layer to create feedback datasets and retrieve evaluation results.

abstract add_record(dataset_id: str, record: RecordData) → None[source]

Adds a new record to the given dataset.

Parameters:

dataset_id – id of the dataset the record is added to
record – the actual record data (i.e. content for the dataset’s fields)

add_records(dataset_id: str, records: Sequence[RecordData]) → None[source]

Adds new records to the given dataset.

Parameters:

dataset_id – id of the dataset the records are added to
records – list containing the record data (i.e. content for the dataset’s fields)

abstract create_dataset(workspace_id: str, dataset_name: str, fields: Sequence[Any], questions: Sequence[Any]) → str[source]

Creates and publishes a new feedback dataset in Argilla.

Raises an error if the name exists already.

Parameters:

workspace_id – the id of the workspace the feedback dataset should be created in. The user executing this request must have corresponding permissions for this workspace.
dataset_name – the name of the feedback-dataset to be created.
fields – all fields of this dataset.
questions – all questions for this dataset.

Returns:

The id of the created dataset.

abstract ensure_dataset_exists(workspace_id: str, dataset_name: str, fields: Sequence[Any], questions: Sequence[Any]) → str[source]

Retrieves an existing dataset or creates and publishes a new feedback dataset in Argilla.

Parameters:

workspace_id – the id of the workspace the feedback dataset should be created in. The user executing this request must have corresponding permissions for this workspace.
dataset_name – the name of the feedback-dataset to be created.
fields – all fields of this dataset.
questions – all questions for this dataset.

Returns:

The id of the retrieved or newly created dataset.
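The difference between create_dataset and ensure_dataset_exists can be sketched with an in-memory stand-in. The class InMemoryDatasets, its dict-based storage, the choice of ValueError, and the scoping of names per workspace are all assumptions made for illustration; a real ArgillaClient implementation talks to an Argilla server.

```python
import uuid
from collections.abc import Sequence
from typing import Any


class InMemoryDatasets:
    """Hypothetical stand-in that mimics only the documented create/ensure semantics."""

    def __init__(self) -> None:
        # Maps (workspace_id, dataset_name) to the generated dataset id.
        self._ids: dict[tuple[str, str], str] = {}

    def create_dataset(
        self,
        workspace_id: str,
        dataset_name: str,
        fields: Sequence[Any],
        questions: Sequence[Any],
    ) -> str:
        key = (workspace_id, dataset_name)
        if key in self._ids:
            # Documented behavior: an existing name is an error.
            # The concrete exception type is an assumption.
            raise ValueError(f"dataset {dataset_name!r} already exists")
        self._ids[key] = str(uuid.uuid4())
        return self._ids[key]

    def ensure_dataset_exists(
        self,
        workspace_id: str,
        dataset_name: str,
        fields: Sequence[Any],
        questions: Sequence[Any],
    ) -> str:
        key = (workspace_id, dataset_name)
        if key in self._ids:
            # Retrieve the existing dataset instead of failing.
            return self._ids[key]
        return self.create_dataset(workspace_id, dataset_name, fields, questions)


datasets = InMemoryDatasets()
first = datasets.ensure_dataset_exists("ws", "reviews", [], [])
second = datasets.ensure_dataset_exists("ws", "reviews", [], [])
```

Calling ensure_dataset_exists twice yields the same id, whereas a second create_dataset call with the same name raises.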

abstract evaluations(dataset_id: str) → Iterable[ArgillaEvaluation][source]

Returns all human-evaluated evaluations for the given dataset.

Parameters:: dataset_id – the id of the dataset.
Returns:: An Iterable over all human-evaluated evaluations for the given dataset.

abstract split_dataset(dataset_id: str, n_splits: int) → None[source]

Adds a new metadata property to the dataset and assigns a split to each record.

Parameters:

dataset_id – the id of the dataset
n_splits – the number of splits to create
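The reference does not state how records are distributed across splits. One plausible scheme is a round-robin assignment, sketched below; the helper assign_splits is hypothetical and is not part of the SDK.

```python
def assign_splits(record_ids: list[str], n_splits: int) -> dict[str, int]:
    """Assign each record a split index in round-robin order (assumed scheme)."""
    if n_splits < 1:
        raise ValueError("n_splits must be at least 1")
    # Record i goes to split i modulo n_splits, so split sizes differ by at most one.
    return {record_id: index % n_splits for index, record_id in enumerate(record_ids)}


splits = assign_splits(["r1", "r2", "r3", "r4", "r5"], 2)
```

In a real implementation the resulting split index would be written to the new metadata property of each record.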

class ArgillaEvaluation(*, example_id: str, record_id: str, responses: Mapping[str, Any], metadata: Mapping[str, Any])[source]

Bases: BaseModel

The evaluation result for a single rating record in an Argilla feedback-dataset.

example_id

the id of the example that was evaluated.

Type:: str

record_id

the id of the record that is evaluated.

Type:: str

responses

Maps question-names (Question.name) to response values.

Type:: collections.abc.Mapping[str, Any]

metadata

Metadata belonging to the evaluation, for example ids of completions.

Type:: collections.abc.Mapping[str, Any]

class BenchmarkLineage(*, trace_id: str, input: Input, expected_output: ExpectedOutput, output: Output, example_metadata: dict[str, Any] | None = None, evaluation: Any, run_latency: int, run_tokens: int)[source]: Bases: BaseModel, Generic[Input, ExpectedOutput, Output, Evaluation]

class EvaluationLogicIdentifier(*, logic: str, input_schema: dict[str, Any], output_schema: dict[str, Any], expected_output_schema: dict[str, Any], evaluation_schema: dict[str, Any])[source]: Bases: BaseModel

class GetBenchmarkLineageResponse(*, id: str, trace_id: str, benchmark_execution_id: str, input: JsonSerializable, expected_output: JsonSerializable, example_metadata: dict[str, JsonSerializable] | None = None, output: JsonSerializable, evaluation: JsonSerializable, run_latency: int, run_tokens: int)[source]: Bases: BaseModel

class GetBenchmarkResponse(*, id: str, project_id: str, dataset_id: str, name: str, description: str | None, benchmark_metadata: dict[str, Any] | None, evaluation_logic: EvaluationLogicIdentifier, aggregation_logic: AggregationLogicIdentifier, created_at: datetime, updated_at: datetime | None, last_executed_at: datetime | None, created_by: str | None, updated_by: str | None)[source]

Bases: BaseModel

classmethod transform_id_to_str(value) → str[source]

class GetDatasetExamplesResponse(*, total: int, page: int, size: int, num_pages: int, items: Sequence[StudioExample])[source]: Bases: BaseModel, Generic[Input, ExpectedOutput]

class PostBenchmarkExecution(*, name: str, description: str | None, labels: set[str] | None, metadata: dict[str, Any] | None, start: datetime, end: datetime, run_start: datetime, run_end: datetime, run_successful_count: int, run_failed_count: int, run_success_avg_latency: float, run_success_avg_token_count: float, eval_start: datetime, eval_end: datetime, eval_successful_count: int, eval_failed_count: int, aggregation_start: datetime, aggregation_end: datetime, statistics: JsonSerializable)[source]: Bases: BaseModel

class PostBenchmarkLineagesRequest(root: RootModelRootType = PydanticUndefined)[source]

Bases: RootModel[Sequence[BenchmarkLineage]]

classmethod model_construct(root: RootModelRootType, _fields_set: set[str] | None = None) → Self

Create a new model using the provided root object and update fields set.

Parameters:

root – The root object of the model.
_fields_set – The set of fields to be updated.

Returns:

The new model.

Raises:

NotImplemented – If the model is not a subclass of RootModel.

class PostBenchmarkLineagesResponse(root: RootModelRootType = PydanticUndefined)[source]

Bases: RootModel[Sequence[str]]

classmethod model_construct(root: RootModelRootType, _fields_set: set[str] | None = None) → Self

Create a new model using the provided root object and update fields set.

Parameters:

root – The root object of the model.
_fields_set – The set of fields to be updated.

Returns:

The new model.

Raises:

NotImplemented – If the model is not a subclass of RootModel.

class PostBenchmarkRequest(*, dataset_id: str, name: str, description: str | None, benchmark_metadata: dict[str, Any] | None, evaluation_logic: EvaluationLogicIdentifier, aggregation_logic: AggregationLogicIdentifier)[source]: Bases: BaseModel

class Record(*, content: ~collections.abc.Mapping[str, str], example_id: str, metadata: ~collections.abc.Mapping[str, str | int] = <factory>, id: str)[source]

Bases: RecordData

Represents an Argilla record of a feedback-dataset.

Adds only the id to a RecordData.

id

the Argilla generated id of the record.

Type:: str

class RecordData(*, content: ~collections.abc.Mapping[str, str], example_id: str, metadata: ~collections.abc.Mapping[str, str | int] = <factory>)[source]

Bases: BaseModel

Input-data for an Argilla evaluation record.

This can be used to add a new record to an existing Argilla feedback-dataset. Once added, it receives an Argilla-provided id and can be retrieved as a Record.

content

Maps field-names (Field.name) to string values that can be displayed to the user.

Type:: collections.abc.Mapping[str, str]

example_id

the id of the corresponding Example from a Dataset.

Type:: str

metadata

Arbitrary metadata in form of key/value strings that can be attached to a record.

Type:: collections.abc.Mapping[str, str | int]

class StudioClient(project: str, studio_url: str | None = None, auth_token: str | None = None, create_project: bool = False)[source]

Bases: object

Client for communicating with Studio.

project_id: The unique identifier of the project currently in use.

url: The url of your current Studio instance.

create_project(project: str, description: str | None = None, reuse_existing: bool = False) → str[source]

Creates a project in Studio.

Projects are uniquely identified by the user provided name.

Parameters:

project – User provided name of the project.
description – Description explaining the usage of the project. Defaults to None.
reuse_existing – Reuse project with specified name if already existing. Defaults to False.

Returns:

The ID of the newly created project.

get_benchmark(benchmark_id: str) → GetBenchmarkResponse | None[source]

get_benchmark_lineage(benchmark_id: str, execution_id: str, lineage_id: str) → GetBenchmarkLineageResponse | None[source]

get_dataset_examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) → Iterable[StudioExample][source]

static get_headers(auth_token: str | None = None) → dict[str, str][source]

static get_url(studio_url: str | None = None) → str[source]

submit_benchmark(dataset_id: str, eval_logic: EvaluationLogicIdentifier, aggregation_logic: AggregationLogicIdentifier, name: str, description: str | None = None, metadata: dict[str, Any] | None = None) → str[source]

submit_benchmark_execution(benchmark_id: str, data: PostBenchmarkExecution) → str[source]

submit_benchmark_lineages(benchmark_lineages: Sequence[BenchmarkLineage], benchmark_id: str, execution_id: str, max_payload_size: int = 52428800) → PostBenchmarkLineagesResponse[source]

Submit benchmark lineages in batches to avoid exceeding the maximum payload size.

Parameters:

benchmark_lineages – List of :class:`BenchmarkLineage` objects to submit.
benchmark_id – ID of the benchmark.
execution_id – ID of the execution.
max_payload_size – Maximum size of the payload in bytes. Defaults to 50MB.

Returns:

Response containing the results of the submissions.
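The batching idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the SDK's implementation: the helper batch_by_payload_size is hypothetical, it measures each item by its UTF-8 encoded JSON size, and it ignores the few bytes that list brackets and separators add.

```python
import json
from collections.abc import Iterator, Sequence
from typing import Any


def batch_by_payload_size(
    items: Sequence[Any], max_payload_size: int
) -> Iterator[list[Any]]:
    """Group items so the summed JSON size of each batch stays within the limit."""
    batch: list[Any] = []
    batch_size = 0
    for item in items:
        item_size = len(json.dumps(item).encode("utf-8"))
        # Start a new batch when adding this item would exceed the limit.
        # An item larger than the limit still gets a batch of its own.
        if batch and batch_size + item_size > max_payload_size:
            yield batch
            batch, batch_size = [], 0
        batch.append(item)
        batch_size += item_size
    if batch:
        yield batch


# Each of these items serializes to 9 bytes, e.g. '{"id": 0}'.
lineages = [{"id": i} for i in range(5)]
batches = list(batch_by_payload_size(lineages, max_payload_size=20))
```

With a 20-byte limit the five items end up in batches of two, two, and one; each batch would then be sent as a separate request.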

submit_dataset(dataset: StudioDataset, examples: Iterable[StudioExample]) → str[source]

Submits a dataset to Studio.

Parameters:

dataset – Dataset to be uploaded
examples – Examples of the Dataset

Returns:

ID of the created dataset

submit_from_tracer(tracer: Tracer) → list[str][source]

Sends all trace data from the Tracer to Studio.

Parameters:: tracer – Tracer to extract data from.
Returns:: List of created trace IDs.

submit_trace(data: Sequence[ExportedSpan]) → str[source]

Sends the provided spans to Studio as a singular trace.

The method fails if the span list is empty, if the trace has already been created, or if the spans belong to multiple traces.

Parameters:: data – Spans to create the trace from. Created by exporting from a Tracer.
Returns:: The ID of the created trace.
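Two of the failure conditions above can be checked locally before submitting. The sketch below assumes that every span exposes the ID of the trace it belongs to; SpanStub and single_trace_id are hypothetical names and do not reflect the actual ExportedSpan structure. Whether a trace already exists can only be determined by Studio itself.

```python
from collections.abc import Sequence
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanStub:
    """Hypothetical stand-in for an exported span."""

    trace_id: str
    name: str


def single_trace_id(spans: Sequence[SpanStub]) -> str:
    """Return the common trace ID, or raise if the spans cannot form one trace."""
    if not spans:
        raise ValueError("span list is empty")
    trace_ids = {span.trace_id for span in spans}
    if len(trace_ids) != 1:
        raise ValueError("spans belong to multiple traces")
    return next(iter(trace_ids))


trace_id = single_trace_id([SpanStub("t-1", "root"), SpanStub("t-1", "child")])
```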

class StudioDataset(*, id: str = <factory>, name: str, labels: set[str] = {}, metadata: dict[str, JsonSerializable] = {})[source]

Bases: BaseModel

Represents a Dataset linked to multiple examples as sent to Studio.

id

Dataset ID.

Type:: str

name

A short name of the dataset.

Type:: str

label: Labels for filtering datasets. Defaults to empty list.

metadata

Additional information about the dataset. Defaults to empty dict.

Type:: dict[str, JsonSerializable]

class StudioExample(*, input: ~pharia_studio_sdk.connectors.studio.studio.Input, expected_output: ~pharia_studio_sdk.connectors.studio.studio.ExpectedOutput, id: str = <factory>, metadata: dict[str, JsonSerializable] | None = None)[source]

Bases: BaseModel, Generic[Input, ExpectedOutput]

Represents an instance of :class:`Example` as sent to Studio.

input

Input for the Task. Has to be same type as the input for the task used.

Type:: pharia_studio_sdk.connectors.studio.studio.Input

expected_output

The expected output from a given example run. The evaluator compares the received output against it.

Type:: pharia_studio_sdk.connectors.studio.studio.ExpectedOutput

id

Identifier for the example, defaults to uuid.

Type:: str

metadata

Optional dictionary of custom key-value pairs.

Type:: dict[str, JsonSerializable] | None

Generics:: Input: Interface to be passed to the Task that shall be evaluated. ExpectedOutput: Output that is expected from the run with the supplied input.

class StudioProject(*, name: str, description: str | None)[source]: Bases: BaseModel

Evaluation Module

class AggregatedComparison(*, scores: Mapping[str, PlayerScore])[source]: Bases: BaseModel

class AggregationLogic[source]

Bases: ABC, Generic[Evaluation, AggregatedEvaluation]

abstract aggregate(evaluations: Iterable[Evaluation]) → AggregatedEvaluation[source]

Evaluator-specific method for aggregating individual Evaluations into a report-like AggregatedEvaluation.

This method is responsible for taking the results of an evaluation run and aggregating all the results. It should create an AggregatedEvaluation class and return it at the end.

Parameters:: evaluations – The results from running eval_and_aggregate_runs with a Task.
Returns:: The aggregated results of an evaluation run with a Dataset.
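A minimal sketch of an aggregation logic that turns per-example results into an accuracy report is shown below. To stay self-contained it defines its own base class AggregationLogicSketch with the same aggregate signature; that class, MatchEvaluation, AccuracyReport, and AccuracyAggregation are hypothetical names and are not part of the SDK.

```python
from abc import ABC, abstractmethod
from collections.abc import Iterable
from dataclasses import dataclass
from typing import Generic, TypeVar

Evaluation = TypeVar("Evaluation")
AggregatedEvaluation = TypeVar("AggregatedEvaluation")


class AggregationLogicSketch(ABC, Generic[Evaluation, AggregatedEvaluation]):
    """Local stand-in mirroring the documented aggregate() signature."""

    @abstractmethod
    def aggregate(self, evaluations: Iterable[Evaluation]) -> AggregatedEvaluation: ...


@dataclass(frozen=True)
class MatchEvaluation:
    correct: bool


@dataclass(frozen=True)
class AccuracyReport:
    evaluated: int
    accuracy: float


class AccuracyAggregation(AggregationLogicSketch[MatchEvaluation, AccuracyReport]):
    def aggregate(self, evaluations: Iterable[MatchEvaluation]) -> AccuracyReport:
        # Materialize first: the iterable may only be consumable once.
        results = list(evaluations)
        if not results:
            return AccuracyReport(evaluated=0, accuracy=0.0)
        correct = sum(1 for result in results if result.correct)
        return AccuracyReport(evaluated=len(results), accuracy=correct / len(results))


report = AccuracyAggregation().aggregate(
    [MatchEvaluation(True), MatchEvaluation(False), MatchEvaluation(True), MatchEvaluation(True)]
)
```

Returning a dedicated report type keeps the aggregated statistics typed, which matters when they are later loaded back with a given aggregation type.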

class AggregationOverview(*, evaluation_overviews: frozenset[EvaluationOverview], id: str, start: datetime, end: datetime, successful_evaluation_count: int, crashed_during_evaluation_count: int, description: str, statistics: Annotated[AggregatedEvaluation, SerializeAsAny()], labels: set[str] = {}, metadata: dict[str, JsonSerializable] = {})[source]

Bases: BaseModel, Generic[AggregatedEvaluation]

Complete overview of the results of evaluating a Task on a dataset.

Created when running Evaluator.eval_and_aggregate_runs(). Contains high-level information and statistics.

evaluation_overviews

:class:`EvaluationOverview`s used for aggregation.

Type:: frozenset[pharia_studio_sdk.evaluation.evaluation.domain.EvaluationOverview]

id

Aggregation overview ID.

Type:: str

start

Start timestamp of the aggregation.

Type:: datetime.datetime

end

End timestamp of the aggregation.

Type:: datetime.datetime


successful_evaluation_count

The number of examples that were successfully evaluated.

Type:: int

crashed_during_evaluation_count

The number of examples that crashed during evaluation.

Type:: int

failed_evaluation_count: The number of examples that crashed during evaluation plus the number of examples that failed to produce an output for evaluation.

run_ids: IDs of all :class:`RunOverview`s from all linked :class:`EvaluationOverview`s.

description

A short description.

Type:: str

statistics

Aggregated statistics of the run, as returned by Evaluator.aggregate().

Type:: pharia_studio_sdk.evaluation.aggregation.domain.AggregatedEvaluation

labels

Labels for filtering aggregations. Defaults to an empty set.

Type:: set[str]

metadata

Additional information about the aggregation. Defaults to empty dict.

Type:: dict[str, JsonSerializable]

raise_on_evaluation_failure() → None[source]

run_overviews() → Iterable[RunOverview][source]

class AggregationRepository[source]

Bases: ABC

Base aggregation repository interface.

Provides methods to store and load aggregated evaluation results: AggregationOverview.

abstract aggregation_overview(aggregation_id: str, aggregation_type: type[AggregatedEvaluation]) → AggregationOverview | None[source]

Returns an AggregationOverview for the given ID.

Parameters:

aggregation_id – ID of the aggregation overview to retrieve.
aggregation_type – Type of the aggregation.

Returns:

AggregationOverview if it was found, None otherwise.

abstract aggregation_overview_ids() → Sequence[str][source]

Returns sorted IDs of all stored :class:`AggregationOverview`s.

Returns:: A Sequence of the AggregationOverview IDs.

aggregation_overviews(aggregation_type: type[AggregatedEvaluation]) → Iterable[AggregationOverview][source]

Returns all :class:`AggregationOverview`s sorted by their ID.

Parameters:: aggregation_type – Type of the aggregation.
Yields:: :class:`AggregationOverview`s.

abstract store_aggregation_overview(aggregation_overview: AggregationOverview) → None[source]

Stores an AggregationOverview.

Parameters:: aggregation_overview – The aggregated results to be persisted.

class Aggregator(evaluation_repository: EvaluationRepository, aggregation_repository: AggregationRepository, description: str, aggregation_logic: AggregationLogic[Evaluation, AggregatedEvaluation])[source]

Bases: Generic[Evaluation, AggregatedEvaluation]

Aggregator that can handle automatic aggregation of evaluation scenarios.

This aggregator should be used for automatic eval. A user still has to implement :class:`AggregationLogic`.

Parameters:

evaluation_repository – The repository that will be used to store evaluation results.
aggregation_repository – The repository that will be used to store aggregation results.
description – Human-readable description for the evaluator.
aggregation_logic – The logic to aggregate the evaluations.

Generics:: Evaluation: Interface of the metrics that come from the evaluated Task. AggregatedEvaluation: The aggregated results of an evaluation run with a Dataset.

final aggregate_evaluation(*eval_ids: str, description: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) → AggregationOverview[source]

Aggregates all evaluations into an overview that includes high-level statistics.

Aggregates :class:`Evaluation`s according to the implementation of :func:`AggregationLogic.aggregate`.

Parameters:

*eval_ids – IDs of the evaluation overviews to be aggregated. The actual evaluations are not passed in; they are retrieved from the repository.
description – Optional description of the aggregation. Defaults to None.
labels – A set of labels for filtering. Defaults to an empty set.
metadata – A dict for additional information about the aggregation overview. Defaults to an empty dict.

Returns:

An overview of the aggregated evaluation.

aggregated_evaluation_type() → type[AggregatedEvaluation][source]

Returns the type of the aggregated result of a run.

Returns:: The type of the aggregation result.

evaluation_type() → type[Evaluation][source]

Returns the type of the evaluation result of an example.

This can be used to retrieve properly typed evaluations of an evaluation run from an EvaluationRepository.

Returns:: Returns the type of the evaluation result of an example.

class ArgillaEvaluationLogic(fields: Mapping[str, Any], questions: Sequence[Any])[source]

Bases: EvaluationLogicBase[Input, Output, ExpectedOutput, Evaluation], ABC

abstract from_record(argilla_evaluation: ArgillaEvaluation) → Evaluation[source]

This method takes the Argilla-specific evaluation format and converts it into a compatible Evaluation.

The format of argilla_evaluation.responses depends on the questions attribute. Each name of a question will be a key in the argilla_evaluation.responses mapping.

Parameters:: argilla_evaluation – Argilla-specific data for a single evaluation.
Returns:: An Evaluation that contains all evaluation specific data.

abstract to_record(example: Example, *output: SuccessfulExampleOutput) → RecordDataSequence[source]

This method is responsible for translating the Example and Output of the task to RecordData.

The specific format depends on the fields.

Parameters:

example – The example to be translated.
*output – The output of the example that was run.

Returns:

A RecordDataSequence that contains entries that should be evaluated in Argilla.
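The two directions of the translation can be sketched with plain stand-ins. Everything below is illustrative: ExampleStub, RecordDataStub, and ArgillaEvaluationStub only mirror the documented attributes, and the field names "input" and "output" as well as the question name "rating" are assumptions that would have to match the fields and questions passed to the constructor.

```python
from collections.abc import Mapping
from dataclasses import dataclass, field
from typing import Any, Union


@dataclass(frozen=True)
class ExampleStub:
    id: str
    input: str


@dataclass(frozen=True)
class RecordDataStub:
    content: Mapping[str, str]
    example_id: str
    metadata: Mapping[str, Union[str, int]] = field(default_factory=dict)


@dataclass(frozen=True)
class ArgillaEvaluationStub:
    example_id: str
    record_id: str
    responses: Mapping[str, Any]
    metadata: Mapping[str, Any]


def to_record(example: ExampleStub, *outputs: str) -> list[RecordDataStub]:
    """Create one record per output; content keys must match the dataset's field names."""
    return [
        RecordDataStub(content={"input": example.input, "output": output}, example_id=example.id)
        for output in outputs
    ]


def from_record(argilla_evaluation: ArgillaEvaluationStub) -> int:
    """Read the response stored under the (assumed) question name 'rating'."""
    return int(argilla_evaluation.responses["rating"])


records = to_record(ExampleStub(id="ex-1", input="What is 2 + 2?"), "4", "5")
rating = from_record(
    ArgillaEvaluationStub(example_id="ex-1", record_id="rec-1", responses={"rating": 4}, metadata={})
)
```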

class ArgillaEvaluator(dataset_repository: DatasetRepository, run_repository: RunRepository, evaluation_repository: AsyncEvaluationRepository, description: str, evaluation_logic: ArgillaEvaluationLogic[Input, Output, ExpectedOutput, Evaluation], argilla_client: ArgillaClient, workspace_id: str)[source]

Bases: AsyncEvaluator[Input, Output, ExpectedOutput, Evaluation]

Evaluator used to integrate with Argilla (https://github.com/argilla-io/argilla).

Use this evaluator if you would like to easily do human eval. This evaluator runs a dataset and sends the input and output to Argilla to be evaluated.

Parameters:

dataset_repository – The repository with the examples that will be taken for the evaluation.
run_repository – The repository of the runs to evaluate.
evaluation_repository – The repository that will be used to store evaluation results.
description – Human-readable description for the evaluator.
evaluation_logic – The logic to use for evaluation.
argilla_client – The client to interface with Argilla.
workspace_id – The Argilla workspace id where datasets are created for evaluation.

See the EvaluatorBase for more information.

evaluation_lineage(evaluation_id: str, example_id: str) → EvaluationLineage[Input, ExpectedOutput, Output, Evaluation] | None

Wrapper for RepositoryNavigator.evaluation_lineage.

Parameters:

evaluation_id – The id of the evaluation
example_id – The id of the example of interest

Returns:

The EvaluationLineage for the given evaluation id and example id. Returns None if the lineage is not complete because either an example, a run, or an evaluation does not exist.

evaluation_lineages(evaluation_id: str) → Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]

Wrapper for RepositoryNavigator.evaluation_lineages.

Parameters:: evaluation_id – The id of the evaluation
Returns:: An iterator over all :class:`EvaluationLineage`s for the given evaluation id.

evaluation_type() → type[Evaluation]

Returns the type of the evaluation result of an example.

This can be used to retrieve properly typed evaluations of an evaluation run from an EvaluationRepository.

Returns:: Returns the type of the evaluation result of an example.

expected_output_type() → type[ExpectedOutput]

Returns the type of the evaluated task’s expected output.

This can be used to retrieve properly typed :class:`Example`s of a dataset from a :class:`DatasetRepository`.

Returns:: The type of the evaluated task’s expected output.

failed_evaluations(evaluation_id: str) → Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]

Returns the EvaluationLineage objects for all failed example evaluations that belong to the given evaluation ID.

Parameters:: evaluation_id – The ID of the evaluation overview
Returns:: Iterable of :class:`EvaluationLineage`s.

input_type() → type[Input]

Returns the type of the evaluated task’s input.

This can be used to retrieve properly typed :class:`Example`s of a dataset from a :class:`DatasetRepository`.

Returns:: The type of the evaluated task’s input.

output_type() → type[Output]

Returns the type of the evaluated task’s output.

This can be used to retrieve properly typed outputs of an evaluation run from a RunRepository.

Returns:: The type of the evaluated task’s output.

retrieve(partial_evaluation_id: str) → EvaluationOverview[source]

Retrieves external evaluations and saves them to an evaluation repository.

Failed or skipped submissions should be viewed as failed evaluations. Evaluations that are submitted but not yet evaluated also count as failed evaluations.

Parameters:: partial_evaluation_id – The id of the corresponding PartialEvaluationOverview.
Returns:: An EvaluationOverview that describes the whole evaluation.

submit(*run_ids: str, num_examples: int | None = None, dataset_name: str | None = None, abort_on_error: bool = False, skip_example_on_any_failure: bool = True, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) → PartialEvaluationOverview[source]

Submits evaluations to external service to be evaluated.

Failed submissions are saved as FailedExampleEvaluations.

Parameters:

*run_ids – The runs to be evaluated. Each run is expected to have the same dataset as input (which implies their tasks have the same input-type) and their tasks have the same output-type. For each example in the dataset referenced by the runs, the outputs of all runs are collected and, if all of them were successful, passed on to the implementation-specific evaluation. The method compares all runs of the provided ids to each other.
num_examples – The number of examples which should be evaluated from the given runs. Always the first n runs stored in the evaluation repository. Defaults to None.
abort_on_error – Abort the whole submission process if a single submission fails. Defaults to False.

Returns:

A PartialEvaluationOverview containing submission information.

class AsyncEvaluationRepository[source]

Bases: EvaluationRepository

abstract evaluation_overview(evaluation_id: str) → EvaluationOverview | None

Returns an EvaluationOverview for the given ID.

Parameters:: evaluation_id – ID of the evaluation overview to retrieve.
Returns:: EvaluationOverview if it was found, None otherwise.

abstract evaluation_overview_ids() → Sequence[str]

Returns sorted IDs of all stored :class:`EvaluationOverview`s.

Returns:: A Sequence of the EvaluationOverview IDs.

evaluation_overviews() → Iterable[EvaluationOverview]

Returns all :class:`EvaluationOverview`s sorted by their ID.

Yields:: :class:`EvaluationOverview`s.

abstract example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) → ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation] | None

Returns an ExampleEvaluation for the given evaluation overview ID and example ID.

Parameters:

evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

ExampleEvaluation if it was found, None otherwise.

abstract example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation]]

Returns all :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of :class:`ExampleEvaluation`s.

failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation[FailedExampleEvaluation]]

Returns all failed :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of failed :class:`ExampleEvaluation`s.

initialize_evaluation() → str

Initializes an EvaluationOverview and returns its ID.

If no extra logic is required for the initialization, this function just returns a UUID as string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.

Returns:: The created ID.
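Both cases described above can be sketched as follows. The class names are hypothetical, uuid4 is an assumed choice of UUID variant, and the "external" repository merely simulates creating a resource elsewhere and returning its ID.

```python
import uuid


class EvaluationRepositorySketch:
    """Hypothetical repository with the default initialization behavior."""

    def initialize_evaluation(self) -> str:
        # No extra setup required: a fresh UUID string serves as the evaluation ID.
        return str(uuid.uuid4())


class ExternalDatasetRepositorySketch(EvaluationRepositorySketch):
    """Hypothetical repository whose IDs come from an external system."""

    def __init__(self) -> None:
        self.created_ids: list[str] = []

    def initialize_evaluation(self) -> str:
        # Simulates creating a dataset externally and returning the created ID.
        external_id = f"external-{len(self.created_ids)}"
        self.created_ids.append(external_id)
        return external_id


default_id = EvaluationRepositorySketch().initialize_evaluation()
external_id = ExternalDatasetRepositorySketch().initialize_evaluation()
```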

abstract partial_evaluation_overview(partial_evaluation_id: str) → PartialEvaluationOverview | None[source]

Returns a PartialEvaluationOverview for the given ID.

Parameters:: partial_evaluation_id – ID of the partial evaluation overview to retrieve.
Returns:: PartialEvaluationOverview if it was found, None otherwise.

abstract partial_evaluation_overview_ids() → Sequence[str][source]

Returns sorted IDs of all stored :class:`PartialEvaluationOverview`s.

Returns:: A Sequence of the PartialEvaluationOverview IDs.

partial_evaluation_overviews() → Iterable[PartialEvaluationOverview][source]

Returns all :class:`PartialEvaluationOverview`s sorted by their ID.

Yields:: :class:`PartialEvaluationOverview`s.

abstract store_evaluation_overview(evaluation_overview: EvaluationOverview) → None

Stores an EvaluationOverview.

Parameters:: evaluation_overview – The overview to be persisted.

abstract store_example_evaluation(example_evaluation: ExampleEvaluation) → None

Stores an ExampleEvaluation.

Parameters:: example_evaluation – The example evaluation to be persisted.

abstract store_partial_evaluation_overview(partial_evaluation_overview: PartialEvaluationOverview) → None[source]

Stores a PartialEvaluationOverview.

Parameters:: partial_evaluation_overview – The partial overview to be persisted.

successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation]

Returns all successful :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of successful :class:`ExampleEvaluation`s.

class AsyncFileEvaluationRepository(root_directory: Path)[source]

Bases: FileEvaluationRepository, AsyncEvaluationRepository

evaluation_overview(evaluation_id: str) → EvaluationOverview | None

Returns an EvaluationOverview for the given ID.

Parameters:: evaluation_id – ID of the evaluation overview to retrieve.
Returns:: EvaluationOverview if it was found, None otherwise.

evaluation_overview_ids() → Sequence[str]

Returns sorted IDs of all stored :class:`EvaluationOverview`s.

Returns:: A Sequence of the EvaluationOverview IDs.

evaluation_overviews() → Iterable[EvaluationOverview]

Returns all :class:`EvaluationOverview`s sorted by their ID.

Yields:: :class:`EvaluationOverview`s.

example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) → ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation] | None

Returns an ExampleEvaluation for the given evaluation overview ID and example ID.

Parameters:

evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

ExampleEvaluation if it was found, None otherwise.

example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation]]

Returns all :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of :class:`ExampleEvaluation`s.

exists(path: Path) → bool

failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation[FailedExampleEvaluation]]

Returns all failed :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of failed :class:`ExampleEvaluation`s.

file_names(path: Path, file_type: str = 'json') → Sequence[str]

initialize_evaluation() → str

Initializes an EvaluationOverview and returns its ID.

If no extra logic is required for the initialization, this function just returns a UUID as string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.

Returns:: The created ID.

mkdir(path: Path) → None

partial_evaluation_overview(evaluation_id: str) → PartialEvaluationOverview | None[source]

Returns a PartialEvaluationOverview for the given ID.

Parameters:: evaluation_id – ID of the partial evaluation overview to retrieve.
Returns:: PartialEvaluationOverview if it was found, None otherwise.

partial_evaluation_overview_ids() → Sequence[str][source]

Returns sorted IDs of all stored :class:`PartialEvaluationOverview`s.

Returns:: A Sequence of the PartialEvaluationOverview IDs.

partial_evaluation_overviews() → Iterable[PartialEvaluationOverview]

Returns all :class:`PartialEvaluationOverview`s sorted by their ID.

Yields:: :class:`PartialEvaluationOverview`s.

static path_to_str(path: Path) → str

Returns a string for the given Path so that it’s readable for the respective file system.

Parameters:: path – Given Path that should be converted.
Returns:: String representation of the given Path.

read_utf8(path: Path) → str

remove_file(path: Path) → None

store_evaluation_overview(overview: EvaluationOverview) → None

Stores an EvaluationOverview.

Parameters:: overview – The overview to be persisted.

store_example_evaluation(example_evaluation: ExampleEvaluation) → None

Stores an ExampleEvaluation.

Parameters:: example_evaluation – The example evaluation to be persisted.

store_partial_evaluation_overview(overview: PartialEvaluationOverview) → None[source]

Stores a PartialEvaluationOverview.

Parameters:: overview – The partial overview to be persisted.

successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation]

Returns all successful :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of successful :class:`ExampleEvaluation`s.

write_utf8(path: Path, content: str, create_parents: bool = False) → None

class AsyncInMemoryEvaluationRepository[source]

Bases: AsyncEvaluationRepository, InMemoryEvaluationRepository

evaluation_overview(evaluation_id: str) → EvaluationOverview | None

Returns an EvaluationOverview for the given ID.

Parameters:: evaluation_id – ID of the evaluation overview to retrieve.
Returns:: EvaluationOverview if it was found, None otherwise.

evaluation_overview_ids() → Sequence[str]

Returns sorted IDs of all stored :class:`EvaluationOverview`s.

Returns:: A Sequence of the EvaluationOverview IDs.

evaluation_overviews() → Iterable[EvaluationOverview]

Returns all :class:`EvaluationOverview`s sorted by their ID.

Yields:: :class:`EvaluationOverview`s.

example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) → ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation] | None

Returns an ExampleEvaluation for the given evaluation overview ID and example ID.

Parameters:

evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in Evaluator.do_evaluate()

Returns:

ExampleEvaluation if it was found, None otherwise.

example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation]]

Returns all :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of :class:`ExampleEvaluation`s.

failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation[FailedExampleEvaluation]]

Returns all failed :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of failed :class:`ExampleEvaluation`s.

initialize_evaluation() → str

Initializes an EvaluationOverview and returns its ID.

If no extra logic is required for the initialization, this function just returns a UUID as string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.

Returns:: The created ID.

partial_evaluation_overview(evaluation_id: str) → PartialEvaluationOverview | None[source]

Returns a PartialEvaluationOverview for the given ID.

Parameters:: evaluation_id – ID of the partial evaluation overview to retrieve.
Returns:: PartialEvaluationOverview if it was found, None otherwise.

partial_evaluation_overview_ids() → Sequence[str][source]

Returns sorted IDs of all stored :class:`PartialEvaluationOverview`s.

Returns:: A Sequence of the PartialEvaluationOverview IDs.

partial_evaluation_overviews() → Iterable[PartialEvaluationOverview]

Returns all :class:`PartialEvaluationOverview`s sorted by their ID.

Yields:: :class:`PartialEvaluationOverview`s.

store_evaluation_overview(overview: EvaluationOverview) → None

Stores an EvaluationOverview.

Parameters:: overview – The overview to be persisted.

store_example_evaluation(evaluation: ExampleEvaluation) → None

Stores an ExampleEvaluation.

Parameters:: evaluation – The example evaluation to be persisted.

store_partial_evaluation_overview(overview: PartialEvaluationOverview) → None[source]

Stores a PartialEvaluationOverview.

Parameters:: overview – The partial overview to be persisted.

successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation]

Returns all successful :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of successful :class:`ExampleEvaluation`s.

class BleuGrader[source]

Bases: object

calculate_bleu(hypothesis: str, reference: str) → float[source]

Calculates the BLEU-score for the given hypothesis and reference.

In the summarization use-case the BLEU-score roughly corresponds to the precision of the generated summary with regard to the expected summary.

Parameters:

hypothesis – The generation to be evaluated.
reference – The baseline for the evaluation.

Returns:

The BLEU-score as a float between 0 and 100, where 100 means a perfect match and 0 means no overlap.
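To make the 0 to 100 scale concrete, here is a minimal, self-contained sketch of a sentence-level BLEU computation. It is not the implementation behind BleuGrader, which may use a dedicated library with different tokenization and smoothing. The function name simple_bleu, the whitespace tokenization, and the absence of smoothing are assumptions made for illustration only.

```python
import math
from collections import Counter


def _ngrams(tokens: list[str], n: int) -> Counter:
    """Count all n-grams of length n in the token list."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Hypothetical, unsmoothed BLEU on a 0-100 scale (illustration only)."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not hyp:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = _ngrams(hyp, n)
        ref_counts = _ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clipped overlap: each hypothesis n-gram counts at most as often
        # as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        if total == 0 or overlap == 0:
            # Without smoothing, a single empty precision gives a score of 0.
            return 0.0
        log_precisions.append(math.log(overlap / total))

    # The brevity penalty punishes hypotheses shorter than the reference.
    brevity_penalty = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100.0 * brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

Because the n-gram precisions are computed over the hypothesis, the score behaves like a precision of the generated summary against the expected one, which matches the description above.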

class ComparisonEvaluation(*, first_player: str, second_player: str, outcome: MatchOutcome)[source]: Bases: BaseModel

class ComparisonEvaluationAggregationLogic[source]

Bases: AggregationLogic[ComparisonEvaluation, AggregatedComparison]

aggregate(evaluations: Iterable[ComparisonEvaluation]) → AggregatedComparison[source]

Evaluator-specific method for aggregating individual Evaluations into a report-like AggregatedEvaluation.

This method is responsible for taking the results of an evaluation run and aggregating all the results. It should create an AggregatedEvaluation class and return it at the end.

Parameters:: evaluations – The results from running eval_and_aggregate_runs with a Task.
Returns:: The aggregated results of an evaluation run with a Dataset.

class Dataset(*, id: str = <factory>, name: str, labels: set[str] = {}, metadata: dict[str, JsonSerializable] = {})[source]

Bases: BaseModel

Represents a dataset linked to multiple examples.

id

Dataset ID.

Type:: str

name

A short name of the dataset.

Type:: str

label: Labels for filtering datasets. Defaults to empty list.

metadata

Additional information about the dataset. Defaults to empty dict.

Type:: dict[str, JsonSerializable]

class DatasetRepository[source]

Bases: ABC

Base dataset repository interface.

Provides methods to store and load datasets and their linked examples (:class:`Example`s).

abstract create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) → Dataset[source]

Creates a dataset from the given :class:`Example`s and returns the created Dataset.

Parameters:

examples – An Iterable of :class:`Example`s to be saved in the same dataset.
dataset_name – A name for the dataset.
id – The dataset ID. If None, an ID will be generated.
labels – A set of labels for filtering. Defaults to an empty set.
metadata – A dict for additional information about the dataset. Defaults to an empty dict.

Returns:

The created Dataset.

abstract dataset(dataset_id: str) → Dataset | None[source]

Returns a dataset identified by the given dataset ID.

Parameters:: dataset_id – Dataset ID of the dataset to retrieve.
Returns:: Dataset if it was found, None otherwise.

abstract dataset_ids() → Iterable[str][source]

Returns all sorted dataset IDs.

Returns:: Iterable of dataset IDs.

datasets() → Iterable[Dataset][source]

Returns all :class:`Dataset`s sorted by their ID.

Yields:: :class:`Dataset`s.

abstract delete_dataset(dataset_id: str) → None[source]

Deletes a dataset identified by the given dataset ID.

Parameters:: dataset_id – Dataset ID of the dataset to delete.

abstract example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) → Example | None[source]

Returns an Example for the given dataset ID and example ID.

Parameters:

dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.

Returns:

Example if it was found, None otherwise.

abstract examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) → Iterable[Example][source]

Returns all :class:`Example`s for the given dataset ID sorted by their ID.

Parameters:

dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional set of example IDs. Those examples will be excluded from the output.

Returns:

Iterable of :class:`Example`s.
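The retrieval contract described above (examples sorted by ID, with an optional set of IDs to exclude) can be sketched with an in-memory stand-in. SketchExample and SketchDatasetRepository are hypothetical names and do not exist in the SDK. The real interface additionally takes input_type and expected_output_type so that stored examples can be deserialized into typed objects, which this sketch leaves out.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional
from uuid import uuid4


@dataclass(frozen=True)
class SketchExample:
    """Hypothetical stand-in for Example; the ID defaults to a UUID string."""

    input: str
    expected_output: str
    id: str = field(default_factory=lambda: str(uuid4()))


class SketchDatasetRepository:
    """Hypothetical in-memory repository illustrating the documented contract."""

    def __init__(self) -> None:
        self._names: dict[str, str] = {}
        self._examples: dict[str, list[SketchExample]] = {}

    def create_dataset(
        self,
        examples: Iterable[SketchExample],
        dataset_name: str,
        id: Optional[str] = None,
    ) -> str:
        # Generate an ID when none is given, mirroring the documented default.
        dataset_id = id if id is not None else str(uuid4())
        if dataset_id in self._examples:
            raise ValueError(f"Dataset {dataset_id} already exists.")
        self._names[dataset_id] = dataset_name
        self._examples[dataset_id] = list(examples)
        return dataset_id

    def examples(
        self,
        dataset_id: str,
        examples_to_skip: Optional[frozenset[str]] = None,
    ) -> list[SketchExample]:
        skip = examples_to_skip or frozenset()
        kept = (e for e in self._examples.get(dataset_id, []) if e.id not in skip)
        # Sorting by ID gives a stable order independent of insertion order.
        return sorted(kept, key=lambda e: e.id)
```

Returning examples in a stable order is what allows outputs from several runs over the same dataset to be lined up example by example.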

class EloCalculator(players: Iterable[str], k_start: float = 20.0, k_floor: float = 10.0, decay_factor: float = 0.0005)[source]

Bases: object

calculate(matches: Sequence[ComparisonEvaluation]) → None[source]
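The reference does not describe how calculate uses the constructor parameters. As a rough illustration, the sketch below applies the standard Elo update with a K-factor that decays from k_start towards k_floor as a player accumulates matches. The class name SketchEloCalculator, the initial rating of 1500, the exponential decay formula, and the tuple-based match format are all assumptions, not the SDK's confirmed behavior.

```python
import math
from collections import defaultdict
from typing import Iterable, Sequence


class SketchEloCalculator:
    """Hypothetical Elo calculator; mirrors the documented parameter names only."""

    def __init__(
        self,
        players: Iterable[str],
        k_start: float = 20.0,
        k_floor: float = 10.0,
        decay_factor: float = 0.0005,
    ) -> None:
        self.ratings: dict[str, float] = {player: 1500.0 for player in players}
        self._match_counts: dict[str, int] = defaultdict(int)
        self._k_start = k_start
        self._k_floor = k_floor
        self._decay_factor = decay_factor

    def _k_factor(self, player: str) -> float:
        # Assumed decay: K shrinks exponentially with matches played,
        # but never drops below k_floor.
        played = self._match_counts[player]
        return max(self._k_start * math.exp(-self._decay_factor * played), self._k_floor)

    def calculate(self, matches: Sequence[tuple[str, str, float]]) -> None:
        """Each match is (first_player, second_player, score_of_first).

        score_of_first is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
        """
        for first, second, score_first in matches:
            diff = self.ratings[second] - self.ratings[first]
            expected_first = 1.0 / (1.0 + 10 ** (diff / 400.0))
            expected_second = 1.0 - expected_first
            k_first = self._k_factor(first)
            k_second = self._k_factor(second)
            self.ratings[first] += k_first * (score_first - expected_first)
            self.ratings[second] += k_second * ((1.0 - score_first) - expected_second)
            self._match_counts[first] += 1
            self._match_counts[second] += 1
```

A decaying K-factor lets early matches move ratings quickly while later matches fine-tune them, which is one common motivation for parameters like k_start, k_floor, and decay_factor.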

class EloEvaluationLogic[source]

Bases: IncrementalEvaluationLogic[Input, Output, ExpectedOutput, Matches]

do_evaluate(example: Example, *output: SuccessfulExampleOutput) → Evaluation

Executes the evaluation for this specific example.

Responsible for comparing the input & expected output of a task to the actually generated output. Unlike the standard EvaluationLogic’s do_evaluate, this method separates already evaluated outputs from new ones before handing them over to do_incremental_evaluate.

Parameters:

example – Input data of Task to produce the output.
*output – Outputs of the Task.

Returns:

The metrics that come from the evaluated Task.

Return type:

Evaluation
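The separation step described above can be sketched as follows. It assumes that outputs are told apart by their run_id and that the IDs registered through set_previous_run_output_ids identify earlier evaluation rounds. SketchOutput and split_outputs are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchOutput:
    """Hypothetical stand-in for SuccessfulExampleOutput."""

    run_id: str
    output: str


def split_outputs(
    outputs: list[SketchOutput],
    previous_run_output_ids: list[set[str]],
) -> tuple[list[SketchOutput], list[list[SketchOutput]]]:
    """Split outputs into new ones and groups that were already evaluated.

    Each set in previous_run_output_ids is assumed to hold the run IDs
    covered by one earlier evaluation round.
    """
    seen_run_ids: set[str] = set()
    already_evaluated: list[list[SketchOutput]] = []
    for run_ids in previous_run_output_ids:
        seen_run_ids |= run_ids
        already_evaluated.append([o for o in outputs if o.run_id in run_ids])
    new_outputs = [o for o in outputs if o.run_id not in seen_run_ids]
    return new_outputs, already_evaluated
```

The two return values correspond to the outputs and already_evaluated_outputs arguments of do_incremental_evaluate, so only comparisons involving at least one new output need to be computed.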

do_incremental_evaluate(example: Example, outputs: list[SuccessfulExampleOutput], already_evaluated_outputs: list[list[SuccessfulExampleOutput]]) → Matches[source]

abstract grade(first: SuccessfulExampleOutput, second: SuccessfulExampleOutput, example: Example) → MatchOutcome[source]

Returns a :class:`MatchOutcome` for the two provided contestants on the given example.

Defines the use-case-specific logic for determining the winner of the two provided outputs.

Parameters:

first – Instance of :class:`SuccessfulExampleOutput[Output]` of the first contestant in the comparison.
second – Instance of :class:`SuccessfulExampleOutput[Output]` of the second contestant in the comparison.
example – Datapoint of :class:`Example` on which the two outputs were generated.

Returns:

Instance of :class:`MatchOutcome`.

set_previous_run_output_ids(previous_run_output_ids: list[set[str]]) → None[source]

class EloGradingInput(*, instruction: str, first_completion: str, second_completion: str)[source]: Bases: BaseModel

exception EvaluationFailed(evaluation_id: str, failed_count: int)[source]

Bases: Exception

add_note(): Exception.add_note(note) – add a note to the exception

args

description: str

end: datetime

failed_example_count: int

successful_example_count: int

with_traceback(): Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class EvaluationLogic[source]

Bases: ABC, EvaluationLogicBase[Input, Output, ExpectedOutput, Evaluation]

abstract do_evaluate(example: Example, *output: SuccessfulExampleOutput) → Evaluation[source]

Executes the evaluation for this specific example.

Responsible for comparing the input & expected output of a task to the actually generated output.

Parameters:

example – Input data of Task to produce the output.
*output – Output of the Task.

Returns:

The metrics that come from the evaluated Task.

class EvaluationOverview(*, run_overviews: frozenset[RunOverview], id: str, start_date: datetime, end_date: datetime, successful_evaluation_count: int, failed_evaluation_count: int, description: str, labels: set[str] = {}, metadata: dict[str, JsonSerializable] = {})[source]

Bases: BaseModel

Overview of the un-aggregated results of evaluating a Task on a dataset.

run_overviews

Overviews of the runs that were evaluated.

Type:: frozenset[pharia_studio_sdk.evaluation.run.domain.RunOverview]

id

The unique identifier of this evaluation.

Type:: str

start_date

The time when the evaluation run was started.

Type:: datetime.datetime

end_date

The time when the evaluation run was finished.

Type:: datetime.datetime

successful_evaluation_count

Number of successfully evaluated examples.

Type:: int

failed_evaluation_count

Number of examples that produced an error during evaluation. Note: failed runs are skipped in the evaluation and are therefore not counted as failures.

Type:: int

description

Human-readable description of the evaluator that created the evaluation.

Type:: str

labels

Labels for filtering evaluations. Defaults to an empty set.

Type:: set[str]

metadata

Additional information about the evaluation. Defaults to empty dict.

Type:: dict[str, JsonSerializable]

class EvaluationRepository[source]

Bases: ABC

Base evaluation repository interface.

Provides methods to store and load evaluation results: :class:`EvaluationOverview`s and :class:`ExampleEvaluation`s.
An EvaluationOverview is created from and is linked (by its ID) to multiple :class:`ExampleEvaluation`s.

abstract evaluation_overview(evaluation_id: str) → EvaluationOverview | None[source]

Returns an EvaluationOverview for the given ID.

Parameters:: evaluation_id – ID of the evaluation overview to retrieve.
Returns:: EvaluationOverview if it was found, None otherwise.

abstract evaluation_overview_ids() → Sequence[str][source]

Returns sorted IDs of all stored :class:`EvaluationOverview`s.

Returns:: A Sequence of the EvaluationOverview IDs.

evaluation_overviews() → Iterable[EvaluationOverview][source]

Returns all :class:`EvaluationOverview`s sorted by their ID.

Yields:: :class:`EvaluationOverview`s.

abstract example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) → ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation] | None[source]

Returns an ExampleEvaluation for the given evaluation overview ID and example ID.

Parameters:

evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in Evaluator.do_evaluate()

Returns:

ExampleEvaluation if it was found, None otherwise.

abstract example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation]][source]

Returns all :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of :class:`ExampleEvaluation`s.

failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation[FailedExampleEvaluation]][source]

Returns all failed :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of failed :class:`ExampleEvaluation`s.

initialize_evaluation() → str[source]

Initializes an EvaluationOverview and returns its ID.

If no extra logic is required for the initialization, this function just returns a UUID as string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.

Returns:: The created ID.

abstract store_evaluation_overview(evaluation_overview: EvaluationOverview) → None[source]

Stores an EvaluationOverview.

Parameters:: evaluation_overview – The overview to be persisted.

abstract store_example_evaluation(example_evaluation: ExampleEvaluation) → None[source]

Stores an ExampleEvaluation.

Parameters:: example_evaluation – The example evaluation to be persisted.

successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation][source]

Returns all successful :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of successful :class:`ExampleEvaluation`s.
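The storage and filtering behavior documented for this interface can be illustrated with a small in-memory stand-in. All names prefixed with Sketch are hypothetical and not part of the SDK. The sketch assumes that a failed evaluation is recognized by the type of its result, and it omits the evaluation_type parameter that the real methods use for typed deserialization.

```python
from dataclasses import dataclass
from typing import Union
from uuid import uuid4


@dataclass(frozen=True)
class SketchFailure:
    """Hypothetical stand-in for FailedExampleEvaluation."""

    error_message: str


@dataclass(frozen=True)
class SketchExampleEvaluation:
    """Hypothetical stand-in for ExampleEvaluation."""

    evaluation_id: str
    example_id: str
    result: Union[dict, SketchFailure]


class SketchEvaluationRepository:
    """Hypothetical in-memory repository following the documented contract."""

    def __init__(self) -> None:
        self._evaluations: dict[str, list[SketchExampleEvaluation]] = {}

    def initialize_evaluation(self) -> str:
        # No external resource has to be created here, so a UUID string suffices.
        return str(uuid4())

    def store_example_evaluation(self, evaluation: SketchExampleEvaluation) -> None:
        self._evaluations.setdefault(evaluation.evaluation_id, []).append(evaluation)

    def example_evaluations(self, evaluation_id: str) -> list[SketchExampleEvaluation]:
        stored = self._evaluations.get(evaluation_id, [])
        return sorted(stored, key=lambda e: e.example_id)

    def failed_example_evaluations(self, evaluation_id: str) -> list[SketchExampleEvaluation]:
        return [
            e for e in self.example_evaluations(evaluation_id)
            if isinstance(e.result, SketchFailure)
        ]

    def successful_example_evaluations(self, evaluation_id: str) -> list[SketchExampleEvaluation]:
        return [
            e for e in self.example_evaluations(evaluation_id)
            if not isinstance(e.result, SketchFailure)
        ]
```

In this sketch the failed and successful views are derived from example_evaluations, so a concrete repository only has to implement storage and the unfiltered lookup, which is consistent with those two filter methods not being marked abstract above.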

class Evaluator(dataset_repository: DatasetRepository, run_repository: RunRepository, evaluation_repository: EvaluationRepository, description: str, evaluation_logic: EvaluationLogic[Input, Output, ExpectedOutput, Evaluation])[source]

Bases: EvaluatorBase[Input, Output, ExpectedOutput, Evaluation]

Evaluator designed for most evaluation tasks. Only supports synchronous evaluation.

See the EvaluatorBase for more information.

final evaluate(example: Example, evaluation_id: str, abort_on_error: bool, *example_outputs: SuccessfulExampleOutput) → Evaluation | FailedExampleEvaluation[source]

evaluate_runs(*run_ids: str, num_examples: int | None = None, abort_on_error: bool = False, skip_example_on_any_failure: bool = True, description: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) → EvaluationOverview[source]

Evaluates all generated outputs in the run.

For each set of successful outputs in the referenced runs, EvaluationLogic.do_evaluate() is called and eval metrics are produced & stored in the provided EvaluationRepository.

Parameters:

*run_ids – The runs to be evaluated. Each run is expected to have the same dataset as input (which implies their tasks have the same input-type) and their tasks have the same output-type. For each example in the dataset referenced by the runs, the outputs of all runs are collected and, if all of them were successful, they are passed on to the implementation-specific evaluation. The method compares all runs of the provided IDs to each other.
num_examples – The number of examples which should be evaluated from the given runs. Always the first n runs stored in the evaluation repository. Defaults to None.
abort_on_error – Flag to abort all evaluations when an error occurs. Defaults to False.
skip_example_on_any_failure – Flag to skip evaluation on any example for which at least one run fails. Defaults to True.
description – Optional description of the evaluation. Defaults to None.
labels – A set of labels for filtering. Defaults to an empty set.
metadata – A dict for additional information about the evaluation overview. Defaults to an empty dict.

Returns:

An overview of the evaluation. Individual :class:`Evaluation`s will not be returned but instead stored in the :class:`EvaluationRepository` provided in the __init__.

Return type:

EvaluationOverview
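How num_examples and skip_example_on_any_failure interact when outputs are collected per example can be sketched as below. This is an illustration under stated assumptions, not the Evaluator's actual code: runs are modeled as plain dictionaries, SketchFailedRun and outputs_to_evaluate are hypothetical names, and the sketch assumes that with skip_example_on_any_failure=False the remaining successful outputs are still evaluated and that num_examples limits the examples that pass the filter.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterator, Optional, Union


@dataclass(frozen=True)
class SketchFailedRun:
    """Hypothetical stand-in for FailedExampleRun."""

    error_message: str


def outputs_to_evaluate(
    runs: dict[str, dict[str, Union[str, SketchFailedRun]]],
    num_examples: Optional[int] = None,
    skip_example_on_any_failure: bool = True,
) -> Iterator[tuple[str, list[str]]]:
    """Yield (example_id, successful outputs of all runs) pairs.

    runs maps run_id -> {example_id -> output or SketchFailedRun}.
    """
    example_ids = sorted(set().union(*(run.keys() for run in runs.values())))

    def candidates() -> Iterator[tuple[str, list[str]]]:
        for example_id in example_ids:
            outputs = [run[example_id] for run in runs.values() if example_id in run]
            any_failed = any(isinstance(o, SketchFailedRun) for o in outputs)
            if skip_example_on_any_failure and any_failed:
                continue
            successful = [o for o in outputs if not isinstance(o, SketchFailedRun)]
            if successful:
                yield example_id, successful

    # islice with None as the stop value yields every candidate.
    return islice(candidates(), num_examples)
```

Each yielded pair corresponds to one call of EvaluationLogic.do_evaluate(), which receives the example together with the collected successful outputs.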

evaluation_lineage(evaluation_id: str, example_id: str) → EvaluationLineage[Input, ExpectedOutput, Output, Evaluation] | None

Wrapper for RepositoryNavigator.evaluation_lineage.

Parameters:

evaluation_id – The id of the evaluation
example_id – The id of the example of interest

Returns:

The EvaluationLineage for the given evaluation id and example id. Returns None if the lineage is not complete because either an example, a run, or an evaluation does not exist.

evaluation_lineages(evaluation_id: str) → Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]

Wrapper for RepositoryNavigator.evaluation_lineages.

Parameters:: evaluation_id – The id of the evaluation
Returns:: An iterator over all :class:`EvaluationLineage`s for the given evaluation id.

evaluation_type() → type[Evaluation]

Returns the type of the evaluation result of an example.

This can be used to retrieve properly typed evaluations of an evaluation run from an EvaluationRepository.

Returns:: The type of the evaluation result of an example.

expected_output_type() → type[ExpectedOutput]

Returns the type of the evaluated task’s expected output.

This can be used to retrieve properly typed :class:`Example`s of a dataset from a :class:`DatasetRepository`.

Returns:: The type of the evaluated task’s expected output.

failed_evaluations(evaluation_id: str) → Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]

Returns the EvaluationLineage objects for all failed example evaluations that belong to the given evaluation ID.

Parameters:: evaluation_id – The ID of the evaluation overview
Returns:: Iterable of :class:`EvaluationLineage`s.

input_type() → type[Input]

Returns the type of the evaluated task’s input.

This can be used to retrieve properly typed :class:`Example`s of a dataset from a :class:`DatasetRepository`.

Returns:: The type of the evaluated task’s input.

output_type() → type[Output]

Returns the type of the evaluated task’s output.

This can be used to retrieve properly typed outputs of an evaluation run from a RunRepository.

Returns:: The type of the evaluated task’s output.

class Example(*, input: ~pharia_inference_sdk.core.task.Input, expected_output: ~pharia_studio_sdk.evaluation.dataset.domain.ExpectedOutput, id: str = <factory>, metadata: dict[str, JsonSerializable] | None = None)[source]

Bases: BaseModel, Generic[Input, ExpectedOutput]

Example case used for evaluations.

input

Input for the Task. Has to be same type as the input for the task used.

Type:: pharia_inference_sdk.core.task.Input

expected_output

The expected output from a given example run. The evaluator compares the received output against it.

Type:: pharia_studio_sdk.evaluation.dataset.domain.ExpectedOutput

id

Identifier for the example. Defaults to a UUID.

Type:: str

metadata

Optional dictionary of custom key-value pairs.

Type:: dict[str, JsonSerializable] | None

Generics:: Input: Interface to be passed to the Task that shall be evaluated. ExpectedOutput: Output that is expected from the run with the supplied input.

class ExampleEvaluation(*, evaluation_id: str, example_id: str, result: Annotated[Evaluation | FailedExampleEvaluation, SerializeAsAny()])[source]

Bases: BaseModel, Generic[Evaluation]

Evaluation of a single evaluated Example.

Created to persist the evaluation result in the repository.

evaluation_id

Identifier of the evaluation the evaluated example belongs to.

Type:: str

example_id

Identifier of the Example evaluated.

Type:: str

result

If the evaluation was successful, the evaluation’s result; otherwise, the exception raised during running or evaluating the Task.

Type:: pharia_studio_sdk.evaluation.evaluation.domain.Evaluation | pharia_studio_sdk.evaluation.evaluation.domain.FailedExampleEvaluation

Generics:: Evaluation: Interface of the metrics that come from the evaluated Task.

class ExampleOutput(*, run_id: str, example_id: str, output: Output | FailedExampleRun)[source]

Bases: BaseModel, Generic[Output]

Output of a single evaluated Example.

Created to persist the output (including failures) of an individual example in the repository.

run_id

Identifier of the run that created the output.

Type:: str

example_id

Identifier of the Example that provided the input for generating the output.

Type:: str

output

The output generated when running the Task. If running the task failed, this is a FailedExampleRun.

Type:: pharia_inference_sdk.core.task.Output | pharia_studio_sdk.evaluation.run.domain.FailedExampleRun

Generics:: Output: Interface of the output returned by the task.

class FScores(precision: float, recall: float, f_score: float)[source]: Bases: object

class FailedExampleEvaluation(*, error_message: str)[source]

Bases: BaseModel

Captures an exception raised when evaluating an ExampleOutput.

error_message

String-representation of the exception.

Type:: str

static from_exception(exception: Exception) → FailedExampleEvaluation[source]

class FileAggregationRepository(root_directory: Path)[source]

Bases: FileSystemAggregationRepository

aggregation_overview(aggregation_id: str, aggregation_type: type[AggregatedEvaluation]) → AggregationOverview | None

Returns an AggregationOverview for the given ID.

Parameters:

aggregation_id – ID of the aggregation overview to retrieve.
aggregation_type – Type of the aggregation.

Returns:

AggregationOverview if it was found, None otherwise.

aggregation_overview_ids() → Sequence[str]

Returns sorted IDs of all stored :class:`AggregationOverview`s.

Returns:: A Sequence of the AggregationOverview IDs.

aggregation_overviews(aggregation_type: type[AggregatedEvaluation]) → Iterable[AggregationOverview]

Returns all :class:`AggregationOverview`s sorted by their ID.

Parameters:: aggregation_type – Type of the aggregation.
Yields:: :class:`AggregationOverview`s.

exists(path: Path) → bool

file_names(path: Path, file_type: str = 'json') → Sequence[str]

mkdir(path: Path) → None

static path_to_str(path: Path) → str[source]

Returns a string for the given Path so that it’s readable for the respective file system.

Parameters:: path – Given Path that should be converted.
Returns:: String representation of the given Path.

read_utf8(path: Path) → str

remove_file(path: Path) → None

store_aggregation_overview(aggregation_overview: AggregationOverview) → None

Stores an AggregationOverview.

Parameters:: aggregation_overview – The aggregated results to be persisted.

write_utf8(path: Path, content: str, create_parents: bool = False) → None

class FileDatasetRepository(root_directory: Path)[source]

Bases: FileSystemDatasetRepository

create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) → Dataset

Creates a dataset from the given :class:`Example`s and returns the created Dataset.

Parameters:

examples – An Iterable of :class:`Example`s to be saved in the same dataset.
dataset_name – A name for the dataset.
id – The dataset ID. If None, an ID will be generated.
labels – A set of labels for filtering. Defaults to an empty set.
metadata – A dict for additional information about the dataset. Defaults to an empty dict.

Returns:

The created Dataset.

dataset(dataset_id: str) → Dataset | None

Returns a dataset identified by the given dataset ID.

Parameters:: dataset_id – Dataset ID of the dataset to retrieve.
Returns:: Dataset if it was found, None otherwise.

dataset_ids() → Iterable[str]

Returns all sorted dataset IDs.

Returns:: Iterable of dataset IDs.

datasets() → Iterable[Dataset]

Returns all :class:`Dataset`s sorted by their ID.

Yields:: :class:`Dataset`s.

delete_dataset(dataset_id: str) → None

Deletes a dataset identified by the given dataset ID.

Parameters:: dataset_id – Dataset ID of the dataset to delete.

example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) → Example | None

Returns an Example for the given dataset ID and example ID.

Parameters:

dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.

Returns:

Example if it was found, None otherwise.

examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) → Iterable[Example]

Returns all :class:`Example`s for the given dataset ID sorted by their ID.

Parameters:

dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional set of example IDs. Those examples will be excluded from the output.

Returns:

Iterable of :class:`Example`s.

exists(path: Path) → bool

file_names(path: Path, file_type: str = 'json') → Sequence[str]

mkdir(path: Path) → None

static path_to_str(path: Path) → str

Returns a string for the given Path so that it’s readable for the respective file system.

Parameters:: path – Given Path that should be converted.
Returns:: String representation of the given Path.

read_utf8(path: Path) → str

remove_file(path: Path) → None

write_utf8(path: Path, content: str, create_parents: bool = False) → None

class FileEvaluationRepository(root_directory: Path)[source]

Bases: FileSystemEvaluationRepository

evaluation_overview(evaluation_id: str) → EvaluationOverview | None

Returns an EvaluationOverview for the given ID.

Parameters:: evaluation_id – ID of the evaluation overview to retrieve.
Returns:: EvaluationOverview if it was found, None otherwise.

evaluation_overview_ids() → Sequence[str]

Returns sorted IDs of all stored :class:`EvaluationOverview`s.

Returns:: A Sequence of the EvaluationOverview IDs.

evaluation_overviews() → Iterable[EvaluationOverview]

Returns all :class:`EvaluationOverview`s sorted by their ID.

Yields:: :class:`EvaluationOverview`s.

example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) → ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation] | None

Returns an ExampleEvaluation for the given evaluation overview ID and example ID.

Parameters:

evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in Evaluator.do_evaluate()

Returns:

ExampleEvaluation if it was found, None otherwise.

example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation]]

Returns all :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of :class:`ExampleEvaluation`s.

exists(path: Path) → bool

failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation[FailedExampleEvaluation]]

Returns all failed :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of failed :class:`ExampleEvaluation`s.

file_names(path: Path, file_type: str = 'json') → Sequence[str]

initialize_evaluation() → str

Initializes an EvaluationOverview and returns its ID.

If no extra logic is required for the initialization, this function just returns a UUID as string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.

Returns:: The created ID.
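A minimal sketch of the default behaviour described above, assuming no external system has to be contacted during initialization. The function name mirrors the documented method, but the body is an illustration and not the library's source.

```python
import uuid


def initialize_evaluation() -> str:
    """Return a new evaluation ID (sketch: a random UUID rendered as a string)."""
    return str(uuid.uuid4())


evaluation_id = initialize_evaluation()
print(evaluation_id)
```

A repository backed by an external service would instead create the remote resource here and return whatever ID that service assigns.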

mkdir(path: Path) → None

static path_to_str(path: Path) → str[source]

Returns a string for the given Path so that it’s readable for the respective file system.

Parameters:: path – Given Path that should be converted.
Returns:: String representation of the given Path.

read_utf8(path: Path) → str

remove_file(path: Path) → None

store_evaluation_overview(overview: EvaluationOverview) → None

Stores an EvaluationOverview.

Parameters:: overview – The overview to be persisted.

store_example_evaluation(example_evaluation: ExampleEvaluation) → None

Stores an ExampleEvaluation.

Parameters:: example_evaluation – The example evaluation to be persisted.

successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation]

Returns all successful :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of successful :class:`ExampleEvaluation`s.

write_utf8(path: Path, content: str, create_parents: bool = False) → None

class FileRunRepository(root_directory: Path)[source]

Bases: FileSystemRunRepository

create_temporary_run_data(tmp_hash: str, run_id: str) → None

create_tracer_for_example(run_id: str, example_id: str) → Tracer

Creates and returns a Tracer for the given run ID and example ID.

Parameters:

run_id – The ID of the linked run overview.
example_id – ID of the example whose Tracer should be retrieved.

Returns:

A Tracer.

delete_temporary_run_data(tmp_hash: str) → None

example_output(run_id: str, example_id: str, output_type: type[Output]) → ExampleOutput | ExampleOutput[FailedExampleRun] | None

Returns ExampleOutput for the given run ID and example ID.

Parameters:

run_id – The ID of the linked run overview.
example_id – ID of the example to retrieve.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

ExampleOutput if it was found, None otherwise.

example_output_ids(run_id: str) → Sequence[str]

Returns the sorted IDs of all :class:`ExampleOutput`s for a given run ID.

Parameters:: run_id – The ID of the run overview.
Returns:: A Sequence of all ExampleOutput IDs.

example_outputs(run_id: str, output_type: type[Output]) → Iterable[ExampleOutput | ExampleOutput[FailedExampleRun]]

Returns all ExampleOutput for a given run ID sorted by their example ID.

Parameters:

run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

Iterable of :class:`ExampleOutput`s.

example_tracer(run_id: str, example_id: str) → Tracer | None

Returns an Optional[Tracer] for the given run ID and example ID.

Parameters:

run_id – The ID of the linked run overview.
example_id – ID of the example whose Tracer should be retrieved.

Returns:

A Tracer if it was found, None otherwise.

exists(path: Path) → bool

failed_example_outputs(run_id: str, output_type: type[Output]) → Iterable[ExampleOutput[FailedExampleRun]]

Returns all ExampleOutput for failed example runs with a given run-overview ID sorted by their example ID.

Parameters:

run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

Iterable of :class:`ExampleOutput`s.

file_names(path: Path, file_type: str = 'json') → Sequence[str]

finished_examples(tmp_hash: str) → RecoveryData | None

mkdir(path: Path) → None

static path_to_str(path: Path) → str[source]

Returns a string for the given Path so that it’s readable for the respective file system.

Parameters:: path – Given Path that should be converted.
Returns:: String representation of the given Path.

read_utf8(path: Path) → str

remove_file(path: Path) → None

run_overview(run_id: str) → RunOverview | None

Returns a RunOverview for the given ID.

Parameters:: run_id – ID of the run overview to retrieve.
Returns:: RunOverview if it was found, None otherwise.

run_overview_ids() → Sequence[str]

Returns sorted IDs of all stored :class:`RunOverview`s.

Returns:: A Sequence of the RunOverview IDs.

run_overviews() → Iterable[RunOverview]

Returns all :class:`RunOverview`s sorted by their ID.

Yields:: :class:`RunOverview`s.

store_example_output(example_output: ExampleOutput) → None

Stores an ExampleOutput.

Parameters:: example_output – The example output to be persisted.

store_example_output_parallel(tmp_hash: str, example_output: ExampleOutput) → None

store_run_overview(overview: RunOverview) → None

Stores a RunOverview.

Parameters:: overview – The overview to be persisted.

successful_example_outputs(run_id: str, output_type: type[Output]) → Iterable[ExampleOutput]

Returns all ExampleOutput for successful example runs with a given run-overview ID sorted by their example ID.

Parameters:

run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

Iterable of :class:`ExampleOutput`s.

temp_store_finished_example(tmp_hash: str, example_id: str) → None

write_utf8(path: Path, content: str, create_parents: bool = False) → None

class HighlightCoverageGrader(beta_factor: float = 1.0)[source]

Bases: object

Evaluates how well the generated highlights match the expected highlights (via precision, recall and f1-score).

Parameters:: beta_factor – factor to control weight of precision (0 <= beta < 1) vs. recall (beta > 1) when computing the f-score

compute_fscores(generated_highlight_indices: Sequence[tuple[int, int]], expected_highlight_indices: Sequence[tuple[int, int]]) → FScores[source]

Calculates how well the generated highlight ranges match the expected ones.

Parameters:

generated_highlight_indices – list of (start, end) tuples of the generated highlights
expected_highlight_indices – list of (start, end) tuples of the expected highlights

Returns:

FScores containing the precision, recall and f-score metrics. All values are floats between 0 and 1, where 1 means a perfect match and 0 means no overlap.
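The following sketch illustrates how precision, recall and an F-beta score can be derived from two lists of highlight ranges. It assumes half-open (start, end) ranges and measures overlap per covered position; `FScoresSketch`, `covered_positions` and `compute_fscores_sketch` are hypothetical names, and the library's actual overlap computation may differ.

```python
from typing import NamedTuple, Sequence


class FScoresSketch(NamedTuple):
    """Hypothetical stand-in for the FScores result type."""

    precision: float
    recall: float
    f_score: float


def covered_positions(ranges: Sequence[tuple[int, int]]) -> set[int]:
    """Expand half-open (start, end) ranges into the set of covered positions."""
    return {i for start, end in ranges for i in range(start, end)}


def compute_fscores_sketch(
    generated: Sequence[tuple[int, int]],
    expected: Sequence[tuple[int, int]],
    beta_factor: float = 1.0,
) -> FScoresSketch:
    gen = covered_positions(generated)
    exp = covered_positions(expected)
    overlap = len(gen & exp)
    # Precision: share of generated positions that were expected.
    precision = overlap / len(gen) if gen else 0.0
    # Recall: share of expected positions that were generated.
    recall = overlap / len(exp) if exp else 0.0
    denominator = beta_factor**2 * precision + recall
    f_score = (
        (1 + beta_factor**2) * precision * recall / denominator
        if denominator
        else 0.0
    )
    return FScoresSketch(precision, recall, f_score)


scores = compute_fscores_sketch([(0, 10)], [(5, 15)])
print(scores)
```

With a beta factor below 1 the score leans towards precision, above 1 towards recall, which matches the parameter description of the grader.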

class HuggingFaceAggregationRepository(repository_id: str, token: str, private: bool)[source]

Bases: FileSystemAggregationRepository, HuggingFaceRepository

aggregation_overview(aggregation_id: str, aggregation_type: type[AggregatedEvaluation]) → AggregationOverview | None

Returns an AggregationOverview for the given ID.

Parameters:

aggregation_id – ID of the aggregation overview to retrieve.
aggregation_type – Type of the aggregation.

Returns:

AggregationOverview if it was found, None otherwise.

aggregation_overview_ids() → Sequence[str]

Returns sorted IDs of all stored :class:`AggregationOverview`s.

Returns:: A Sequence of the AggregationOverview IDs.

aggregation_overviews(aggregation_type: type[AggregatedEvaluation]) → Iterable[AggregationOverview]

Returns all :class:`AggregationOverview`s sorted by their ID.

Parameters:: aggregation_type – Type of the aggregation.
Yields:: :class:`AggregationOverview`s.

create_repository(repository_id: str, token: str, private: bool) → None

delete_repository() → None

exists(path: Path) → bool

file_names(path: Path, file_type: str = 'json') → Sequence[str]

mkdir(path: Path) → None

static path_to_str(path: Path) → str

Returns a string for the given Path so that it’s readable for the respective file system.

Parameters:: path – Given Path that should be converted.
Returns:: String representation of the given Path.

read_utf8(path: Path) → str

remove_file(path: Path) → None

store_aggregation_overview(aggregation_overview: AggregationOverview) → None

Stores an AggregationOverview.

Parameters:: aggregation_overview – The aggregated results to be persisted.

write_utf8(path: Path, content: str, create_parents: bool = False) → None

class HuggingFaceDatasetRepository(repository_id: str, token: str, private: bool, caching: bool = True)[source]

Bases: HuggingFaceRepository, FileSystemDatasetRepository

create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) → Dataset

Creates a dataset from given :class:`Example`s and returns the created Dataset.

Parameters:

examples – An Iterable of :class:`Example`s to be saved in the same dataset.
dataset_name – A name for the dataset.
id – The dataset ID. If None, an ID will be generated.
labels – A set of labels for filtering. Defaults to an empty set.
metadata – A dict for additional information about the dataset. Defaults to an empty dict.

Returns:

The created Dataset.

create_repository(repository_id: str, token: str, private: bool) → None

dataset(dataset_id: str) → Dataset | None[source]

Returns a dataset identified by the given dataset ID.

This implementation should be backward compatible with datasets created without a dataset object (i.e., there is no dataset file with dataset metadata).

Parameters:: dataset_id – Dataset ID of the dataset to retrieve.
Returns:: Dataset if it was found, None otherwise.

dataset_ids() → Iterable[str]

Returns all sorted dataset IDs.

Returns:: Iterable of dataset IDs.

datasets() → Iterable[Dataset]

Returns all :class:`Dataset`s sorted by their ID.

Yields:: :class:`Dataset`s.

delete_dataset(dataset_id: str) → None[source]

Deletes a dataset identified by the given dataset ID.

This implementation should be backward compatible with datasets created without a dataset object (i.e., there is no dataset file with dataset metadata).

Note that the HuggingFace API does not seem to support deleting non-existent files.

Parameters:: dataset_id – Dataset ID of the dataset to delete.

delete_repository() → None

example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) → Example | None

Returns an Example for the given dataset ID and example ID.

Parameters:

dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.

Returns:

Example if it was found, None otherwise.

examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) → Iterable[Example]

Returns all :class:`Example`s for the given dataset ID sorted by their ID.

Parameters:

dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional set of example IDs. Those examples will be excluded from the output.

Returns:

Iterable of :class:`Example`s.

exists(path: Path) → bool

file_names(path: Path, file_type: str = 'json') → Sequence[str]

mkdir(path: Path) → None

static path_to_str(path: Path) → str

Returns a string for the given Path so that it’s readable for the respective file system.

Parameters:: path – Given Path that should be converted.
Returns:: String representation of the given Path.

read_utf8(path: Path) → str

remove_file(path: Path) → None

write_utf8(path: Path, content: str, create_parents: bool = False) → None

class HuggingFaceRepository(repository_id: str, token: str, private: bool)[source]

Bases: FileSystemBasedRepository

HuggingFace base repository.

create_repository(repository_id: str, token: str, private: bool) → None[source]

delete_repository() → None[source]

exists(path: Path) → bool

file_names(path: Path, file_type: str = 'json') → Sequence[str]

mkdir(path: Path) → None

static path_to_str(path: Path) → str[source]

Returns a string for the given Path so that it’s readable for the respective file system.

Parameters:: path – Given Path that should be converted.
Returns:: String representation of the given Path.

read_utf8(path: Path) → str

remove_file(path: Path) → None

write_utf8(path: Path, content: str, create_parents: bool = False) → None

class InMemoryAggregationRepository[source]

Bases: AggregationRepository

aggregation_overview(aggregation_id: str, aggregation_type: type[AggregatedEvaluation]) → AggregationOverview | None[source]

Returns an AggregationOverview for the given ID.

Parameters:

aggregation_id – ID of the aggregation overview to retrieve.
aggregation_type – Type of the aggregation.

Returns:

AggregationOverview if it was found, None otherwise.

aggregation_overview_ids() → Sequence[str][source]

Returns sorted IDs of all stored :class:`AggregationOverview`s.

Returns:: A Sequence of the AggregationOverview IDs.

aggregation_overviews(aggregation_type: type[AggregatedEvaluation]) → Iterable[AggregationOverview]

Returns all :class:`AggregationOverview`s sorted by their ID.

Parameters:: aggregation_type – Type of the aggregation.
Yields:: :class:`AggregationOverview`s.

store_aggregation_overview(aggregation_overview: AggregationOverview) → None[source]

Stores an AggregationOverview.

Parameters:: aggregation_overview – The aggregated results to be persisted.

class InMemoryDatasetRepository[source]

Bases: DatasetRepository

create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) → Dataset[source]

Creates a dataset from given :class:`Example`s and returns the created Dataset.

Parameters:

examples – An Iterable of :class:`Example`s to be saved in the same dataset.
dataset_name – A name for the dataset.
id – The dataset ID. If None, an ID will be generated.
labels – A set of labels for filtering. Defaults to an empty set.
metadata – A dict for additional information about the dataset. Defaults to an empty dict.

Returns:

The created Dataset.

dataset(dataset_id: str) → Dataset | None[source]

Returns a dataset identified by the given dataset ID.

Parameters:: dataset_id – Dataset ID of the dataset to retrieve.
Returns:: Dataset if it was found, None otherwise.

dataset_ids() → Iterable[str][source]

Returns all sorted dataset IDs.

Returns:: Iterable of dataset IDs.

datasets() → Iterable[Dataset]

Returns all :class:`Dataset`s sorted by their ID.

Yields:: :class:`Dataset`s.

delete_dataset(dataset_id: str) → None[source]

Deletes a dataset identified by the given dataset ID.

Parameters:: dataset_id – Dataset ID of the dataset to delete.

example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) → Example | None[source]

Returns an Example for the given dataset ID and example ID.

Parameters:

dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.

Returns:

Example if it was found, None otherwise.

examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) → Iterable[Example][source]

Returns all :class:`Example`s for the given dataset ID sorted by their ID.

Parameters:

dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional set of example IDs. Those examples will be excluded from the output.

Returns:

Iterable of :class:`Example`s.

class InMemoryEvaluationRepository[source]

Bases: EvaluationRepository

An EvaluationRepository that stores evaluation results in memory.

Preferred for quick testing or to be used in Jupyter Notebooks.
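To illustrate the idea of an in-memory repository, the sketch below keeps overviews in a dictionary, returns None for unknown IDs and yields entries sorted by ID. `OverviewSketch` and `InMemoryOverviewStore` are hypothetical, stripped-down stand-ins and do not reproduce the library's classes.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass(frozen=True)
class OverviewSketch:
    """Hypothetical, stripped-down stand-in for EvaluationOverview."""

    id: str
    description: str


class InMemoryOverviewStore:
    """Keeps overviews in a dict; nothing is written to disk."""

    def __init__(self) -> None:
        self._overviews: dict[str, OverviewSketch] = {}

    def store_evaluation_overview(self, overview: OverviewSketch) -> None:
        self._overviews[overview.id] = overview

    def evaluation_overview(self, evaluation_id: str) -> Optional[OverviewSketch]:
        # Unknown IDs yield None instead of raising.
        return self._overviews.get(evaluation_id)

    def evaluation_overview_ids(self) -> Sequence[str]:
        return sorted(self._overviews)

    def evaluation_overviews(self) -> Iterable[OverviewSketch]:
        for evaluation_id in self.evaluation_overview_ids():
            yield self._overviews[evaluation_id]


store = InMemoryOverviewStore()
store.store_evaluation_overview(OverviewSketch(id="b", description="second"))
store.store_evaluation_overview(OverviewSketch(id="a", description="first"))
print(store.evaluation_overview_ids())
```

Because all state lives in the process, the stored results disappear when the interpreter or notebook kernel exits.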

evaluation_overview(evaluation_id: str) → EvaluationOverview | None[source]

Returns an EvaluationOverview for the given ID.

Parameters:: evaluation_id – ID of the evaluation overview to retrieve.
Returns:: EvaluationOverview if it was found, None otherwise.

evaluation_overview_ids() → Sequence[str][source]

Returns sorted IDs of all stored :class:`EvaluationOverview`s.

Returns:: A Sequence of the EvaluationOverview IDs.

evaluation_overviews() → Iterable[EvaluationOverview]

Returns all :class:`EvaluationOverview`s sorted by their ID.

Yields:: :class:`EvaluationOverview`s.

example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) → ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation] | None[source]

Returns an ExampleEvaluation for the given evaluation overview ID and example ID.

Parameters:

evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in Evaluator.do_evaluate()

Returns:

ExampleEvaluation if it was found, None otherwise.

example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation]][source]

Returns all :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of :class:`ExampleEvaluation`s.

failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation[FailedExampleEvaluation]]

Returns all failed :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of failed :class:`ExampleEvaluation`s.

initialize_evaluation() → str

Initializes an EvaluationOverview and returns its ID.

If no extra logic is required for the initialization, this function just returns a UUID as string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.

Returns:: The created ID.

store_evaluation_overview(overview: EvaluationOverview) → None[source]

Stores an EvaluationOverview.

Parameters:: overview – The overview to be persisted.

store_example_evaluation(evaluation: ExampleEvaluation) → None[source]

Stores an ExampleEvaluation.

Parameters:: evaluation – The example evaluation to be persisted.

successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation]

Returns all successful :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of successful :class:`ExampleEvaluation`s.

class InMemoryRunRepository[source]

Bases: RunRepository

create_temporary_run_data(tmp_hash: str, run_id: str) → None

create_tracer_for_example(run_id: str, example_id: str) → Tracer[source]

Creates and returns a Tracer for the given run ID and example ID.

Parameters:

run_id – The ID of the linked run overview.
example_id – ID of the example whose Tracer should be retrieved.

Returns:

A Tracer.

delete_temporary_run_data(tmp_hash: str) → None

example_output(run_id: str, example_id: str, output_type: type[Output]) → ExampleOutput | ExampleOutput[FailedExampleRun] | None[source]

Returns ExampleOutput for the given run ID and example ID.

Parameters:

run_id – The ID of the linked run overview.
example_id – ID of the example to retrieve.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

ExampleOutput if it was found, None otherwise.

example_output_ids(run_id: str) → Sequence[str][source]

Returns the sorted IDs of all :class:`ExampleOutput`s for a given run ID.

Parameters:: run_id – The ID of the run overview.
Returns:: A Sequence of all ExampleOutput IDs.

example_outputs(run_id: str, output_type: type[Output]) → Iterable[ExampleOutput | ExampleOutput[FailedExampleRun]][source]

Returns all ExampleOutput for a given run ID sorted by their example ID.

Parameters:

run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

Iterable of :class:`ExampleOutput`s.

example_tracer(run_id: str, example_id: str) → Tracer | None[source]

Returns an Optional[Tracer] for the given run ID and example ID.

Parameters:

run_id – The ID of the linked run overview.
example_id – ID of the example whose Tracer should be retrieved.

Returns:

A Tracer if it was found, None otherwise.

failed_example_outputs(run_id: str, output_type: type[Output]) → Iterable[ExampleOutput[FailedExampleRun]]

Returns all ExampleOutput for failed example runs with a given run-overview ID sorted by their example ID.

Parameters:

run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

Iterable of :class:`ExampleOutput`s.

finished_examples(tmp_hash: str) → RecoveryData | None[source]

run_overview(run_id: str) → RunOverview | None[source]

Returns a RunOverview for the given ID.

Parameters:: run_id – ID of the run overview to retrieve.
Returns:: RunOverview if it was found, None otherwise.

run_overview_ids() → Sequence[str][source]

Returns sorted IDs of all stored :class:`RunOverview`s.

Returns:: A Sequence of the RunOverview IDs.

run_overviews() → Iterable[RunOverview]

Returns all :class:`RunOverview`s sorted by their ID.

Yields:: :class:`RunOverview`s.

store_example_output(example_output: ExampleOutput) → None[source]

Stores an ExampleOutput.

Parameters:: example_output – The example output to be persisted.

store_example_output_parallel(tmp_hash: str, example_output: ExampleOutput) → None

store_run_overview(overview: RunOverview) → None[source]

Stores a RunOverview.

Parameters:: overview – The overview to be persisted.

successful_example_outputs(run_id: str, output_type: type[Output]) → Iterable[ExampleOutput]

Returns all ExampleOutput for successful example runs with a given run-overview ID sorted by their example ID.

Parameters:

run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

Iterable of :class:`ExampleOutput`s.

temp_store_finished_example(tmp_hash: str, example_id: str) → None

class IncrementalEvaluationLogic[source]

Bases: EvaluationLogic[Input, Output, ExpectedOutput, Evaluation]

do_evaluate(example: Example, *output: SuccessfulExampleOutput) → Evaluation[source]

Executes the evaluation for this specific example.

Responsible for comparing the input & expected output of a task to the actually generated output. Unlike the standard EvaluationLogic’s do_evaluate, this method separates already evaluated outputs from new ones before handing them over to do_incremental_evaluate.

Parameters:

example – Input data of Task to produce the output.
*output – Outputs of the Task.

Returns:

The metrics that come from the evaluated Task.

Return type:

Evaluation

abstract do_incremental_evaluate(example: Example, outputs: list[SuccessfulExampleOutput], already_evaluated_outputs: list[list[SuccessfulExampleOutput]]) → Evaluation[source]

set_previous_run_output_ids(previous_run_output_ids: list[set[str]]) → None[source]
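The separation step performed before do_incremental_evaluate can be sketched as follows. The sketch assumes that each set passed to set_previous_run_output_ids holds the run IDs covered by one previous evaluation; `OutputSketch` and `split_outputs` are hypothetical names and not part of the library.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class OutputSketch:
    """Hypothetical stand-in for SuccessfulExampleOutput."""

    run_id: str
    text: str


def split_outputs(
    outputs: Sequence[OutputSketch],
    previous_run_output_ids: list[set[str]],
) -> tuple[list[OutputSketch], list[list[OutputSketch]]]:
    """Separate outputs of runs seen in earlier evaluations from new ones."""
    seen_run_ids: set[str] = set().union(*previous_run_output_ids)
    # One list per previous evaluation, holding the outputs it already covered.
    already_evaluated = [
        [o for o in outputs if o.run_id in run_ids]
        for run_ids in previous_run_output_ids
    ]
    new_outputs = [o for o in outputs if o.run_id not in seen_run_ids]
    return new_outputs, already_evaluated


outputs = [
    OutputSketch("run-1", "a"),
    OutputSketch("run-2", "b"),
    OutputSketch("run-3", "c"),
]
new_outputs, already_evaluated = split_outputs(outputs, [{"run-1", "run-2"}])
print(new_outputs)
print(already_evaluated)
```

An implementation of do_incremental_evaluate would then only need to compare the new outputs against each other and against the already evaluated groups, rather than repeating comparisons within a group.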

class IncrementalEvaluator(dataset_repository: DatasetRepository, run_repository: RunRepository, evaluation_repository: EvaluationRepository, description: str, incremental_evaluation_logic: IncrementalEvaluationLogic[Input, Output, ExpectedOutput, Evaluation])[source]

Bases: Evaluator[Input, Output, ExpectedOutput, Evaluation]

Evaluator for evaluating additional runs on top of previous evaluations. Intended for use with IncrementalEvaluationLogic.

Parameters:

dataset_repository – The repository with the examples that will be taken for the evaluation.
run_repository – The repository of the runs to evaluate.
evaluation_repository – The repository that will be used to store evaluation results.
description – Human-readable description for the evaluator.
incremental_evaluation_logic – The logic to use for evaluation.

Generics:

Input – Interface to be passed to the Task that shall be evaluated.
Output – Type of the output of the Task to be evaluated.
ExpectedOutput – Output that is expected from the run with the supplied input.
Evaluation – Interface of the metrics that come from the evaluated Task.

evaluate(example: Example, evaluation_id: str, abort_on_error: bool, *example_outputs: SuccessfulExampleOutput) → Evaluation | FailedExampleEvaluation

evaluate_additional_runs(*run_ids: str, previous_evaluation_ids: list[str] | None = None, num_examples: int | None = None, abort_on_error: bool = False, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) → EvaluationOverview[source]

Evaluates all runs while considering which runs have already been evaluated according to previous_evaluation_ids.

For each set of successful outputs in the referenced runs, EvaluationLogic.do_evaluate() is called and eval metrics are produced & stored in the provided EvaluationRepository.

Parameters:

*run_ids – The runs to be evaluated. Each run is expected to have the same dataset as input (which implies their tasks have the same input-type) and their tasks have the same output-type. For each example in the dataset referenced by the runs the outputs of all runs are collected and if all of them were successful they are passed on to the implementation-specific evaluation. The method compares all runs of the provided ids to each other.
previous_evaluation_ids – IDs of previous evaluations to consider.
num_examples – The number of examples which should be evaluated from the given runs. Always the first n runs stored in the evaluation repository. Defaults to None.
abort_on_error – Flag to abort all evaluations when an error occurs. Defaults to False.
labels – A set of labels for filtering. Defaults to an empty set.
metadata – A dict for additional information about the evaluation overview. Defaults to an empty dict.

Returns:

An overview of the evaluation. Individual :class:`Evaluation`s will not be returned but instead stored in the :class:`EvaluationRepository` provided in the __init__.

Return type:

EvaluationOverview

evaluate_runs(*run_ids: str, num_examples: int | None = None, abort_on_error: bool = False, skip_example_on_any_failure: bool = True, description: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) → EvaluationOverview[source]

Evaluates all generated outputs in the run.

For each set of successful outputs in the referenced runs, EvaluationLogic.do_evaluate() is called and eval metrics are produced & stored in the provided EvaluationRepository.

Parameters:

*run_ids – The runs to be evaluated. Each run is expected to have the same dataset as input (which implies their tasks have the same input-type) and their tasks have the same output-type. For each example in the dataset referenced by the runs the outputs of all runs are collected and if all of them were successful they are passed on to the implementation-specific evaluation. The method compares all runs of the provided ids to each other.
num_examples – The number of examples which should be evaluated from the given runs. Always the first n runs stored in the evaluation repository. Defaults to None.
abort_on_error – Flag to abort all evaluations when an error occurs. Defaults to False.
skip_example_on_any_failure – Flag to skip evaluation on any example for which at least one run fails. Defaults to True.
description – Optional description of the evaluation. Defaults to None.
labels – A set of labels for filtering. Defaults to an empty set.
metadata – A dict for additional information about the evaluation overview. Defaults to an empty dict.

Returns:

An overview of the evaluation. Individual :class:`Evaluation`s will not be returned but instead stored in the :class:`EvaluationRepository` provided in the __init__.

Return type:

EvaluationOverview
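The way outputs are collected per example and filtered by skip_example_on_any_failure can be sketched as below. The encoding of runs as nested dictionaries with None marking a failed run, as well as the function name `outputs_to_evaluate`, are assumptions made for this illustration. The behaviour with the flag set to False (evaluate the remaining successful outputs) is likewise an assumption.

```python
from typing import Optional


def outputs_to_evaluate(
    runs: dict[str, dict[str, Optional[str]]],
    skip_example_on_any_failure: bool = True,
) -> dict[str, list[str]]:
    """Collect, per example ID, the successful outputs of all runs.

    `runs` maps run ID -> example ID -> output, where None marks a failed run
    (a hypothetical encoding chosen for this sketch).
    """
    example_ids = sorted({eid for outputs in runs.values() for eid in outputs})
    collected: dict[str, list[str]] = {}
    for example_id in example_ids:
        outputs = [runs[run_id].get(example_id) for run_id in sorted(runs)]
        successful = [o for o in outputs if o is not None]
        if skip_example_on_any_failure and len(successful) < len(outputs):
            # At least one run failed for this example, so it is skipped.
            continue
        if successful:
            collected[example_id] = successful
    return collected


runs = {
    "run-a": {"ex-1": "A1", "ex-2": None},
    "run-b": {"ex-1": "B1", "ex-2": "B2"},
}
print(outputs_to_evaluate(runs))
print(outputs_to_evaluate(runs, skip_example_on_any_failure=False))
```

Each collected list corresponds to the set of outputs that would be handed to the evaluation logic for one example.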

evaluation_lineage(evaluation_id: str, example_id: str) → EvaluationLineage[Input, ExpectedOutput, Output, Evaluation] | None

Wrapper for RepositoryNavigator.evaluation_lineage.

Parameters:

evaluation_id – The id of the evaluation
example_id – The id of the example of interest

Returns:

The EvaluationLineage for the given evaluation id and example id. Returns None if the lineage is not complete because either an example, a run, or an evaluation does not exist.

evaluation_lineages(evaluation_id: str) → Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]

Wrapper for RepositoryNavigator.evaluation_lineages.

Parameters:: evaluation_id – The id of the evaluation
Returns:: An iterator over all :class:`EvaluationLineage`s for the given evaluation id.

evaluation_type() → type[Evaluation]

Returns the type of the evaluation result of an example.

This can be used to retrieve properly typed evaluations of an evaluation run from an EvaluationRepository

Returns:: The type of the evaluation result of an example.

expected_output_type() → type[ExpectedOutput]

Returns the type of the evaluated task’s expected output.

This can be used to retrieve properly typed :class:`Example`s of a dataset from a :class:`DatasetRepository`.

Returns:: The type of the evaluated task’s expected output.

failed_evaluations(evaluation_id: str) → Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]

Returns the EvaluationLineage objects for all failed example evaluations that belong to the given evaluation ID.

Parameters:: evaluation_id – The ID of the evaluation overview
Returns:: Iterable of :class:`EvaluationLineage`s.

input_type() → type[Input]

Returns the type of the evaluated task’s input.

This can be used to retrieve properly typed :class:`Example`s of a dataset from a :class:`DatasetRepository`.

Returns:: The type of the evaluated task’s input.

output_type() → type[Output]

Returns the type of the evaluated task’s output.

This can be used to retrieve properly typed outputs of an evaluation run from a RunRepository.

Returns:: The type of the evaluated task’s output.

class InstructComparisonArgillaEvaluationLogic(high_priority_runs: frozenset[str] | None = None)[source]

Bases: ArgillaEvaluationLogic[InstructInput, CompleteOutput, None, ComparisonEvaluation]

from_record(argilla_evaluation: ArgillaEvaluation) → ComparisonEvaluation[source]

This method takes the specific Argilla evaluation format and converts it into a compatible Evaluation.

The format of argilla_evaluation.responses depends on the questions attribute. Each name of a question will be a key in the argilla_evaluation.responses mapping.

Parameters:: argilla_evaluation – Argilla-specific data for a single evaluation.
Returns:: An Evaluation that contains all evaluation specific data.

to_record(example: Example[InstructInput, NoneType], *outputs: SuccessfulExampleOutput[CompleteOutput]) → RecordDataSequence[source]

This method is responsible for translating the Example and Output of the task to RecordData.

The specific format depends on the fields.

Parameters:

example – The example to be translated.
*outputs – The outputs of the example that was run.

Returns:

A RecordDataSequence that contains entries that should be evaluated in Argilla.

class LanguageMatchesGrader(acceptance_threshold: float = 0.1)[source]

Bases: object

Provides a method to evaluate whether two texts are of the same language.

Parameters:: acceptance_threshold – probability a language must surpass to be accepted

languages_match(input: str, output: str) → bool[source]

Calculates if the input and output text are of the same language.

The texts and their sentences should be reasonably long for the language detection to perform well.

Parameters:

input – text whose language serves as the reference for the comparison
output – text whose language is compared against that of the input

Returns:

Whether the input and output languages match. Returns True if no clear input language can be determined.

Return type:

bool
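
The decision rule described above can be sketched as follows. This is only an illustration: the language probabilities are passed in as plain dictionaries (how the SDK detects languages is not specified here), the helper names are hypothetical, and the rule for accepting the output language is an assumption.

```python
from typing import Dict, Optional


def clear_language(
    probabilities: Dict[str, float], acceptance_threshold: float = 0.1
) -> Optional[str]:
    """Return the most likely language if it surpasses the threshold, else None."""
    if not probabilities:
        return None
    language, probability = max(probabilities.items(), key=lambda item: item[1])
    return language if probability > acceptance_threshold else None


def languages_match_sketch(
    input_probabilities: Dict[str, float],
    output_probabilities: Dict[str, float],
    acceptance_threshold: float = 0.1,
) -> bool:
    input_language = clear_language(input_probabilities, acceptance_threshold)
    if input_language is None:
        # Documented behaviour: no clear input language counts as a match.
        return True
    # Assumption: the output matches if the input language also surpasses
    # the threshold for the output text.
    return output_probabilities.get(input_language, 0.0) > acceptance_threshold


same = languages_match_sketch({"en": 0.9}, {"en": 0.8, "de": 0.2})
different = languages_match_sketch({"en": 0.9}, {"de": 0.95, "en": 0.05})
undetermined = languages_match_sketch({"en": 0.05, "de": 0.05}, {"fr": 1.0})
```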

class MatchOutcome(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: str, Enum

static from_rank_literal(rank: int) → MatchOutcome[source]

class Matches(*, comparison_evaluations: Sequence[ComparisonEvaluation])[source]: Bases: BaseModel

class MatchesAggregationLogic[source]

Bases: AggregationLogic[Matches, AggregatedComparison]

aggregate(evaluations: Iterable[Matches]) → AggregatedComparison[source]

Evaluator-specific method for aggregating individual Evaluations into a report-like AggregatedEvaluation.

This method is responsible for taking the results of an evaluation run and aggregating all the results. It should create an AggregatedEvaluation class and return it at the end.

Parameters:: evaluations – The results from running eval_and_aggregate_runs with a Task.
Returns:: The aggregated results of an evaluation run with a Dataset.

class MeanAccumulator[source]

Bases: Accumulator[float, float]

add(value: float) → None[source]

Responsible for accumulating values.

Parameters:: value – the value to add
Returns:: nothing

extract() → float[source]

Returns the mean of the accumulated values.

Returns:: 0.0 if no values were added before, else the mean

standard_deviation() → float[source]: Calculates the standard deviation.

standard_error() → float[source]: Calculates the standard error of the mean.
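
A minimal sketch of how such an accumulator could work, written as a standalone class rather than the SDK's implementation. It assumes the population standard deviation and a standard error of standard deviation divided by the square root of the sample count; the SDK may use different conventions.

```python
import math


class MeanAccumulatorSketch:
    """Illustrative stand-in for a running-mean accumulator (not the SDK class)."""

    def __init__(self) -> None:
        self._n = 0
        self._sum = 0.0
        self._squares_sum = 0.0

    def add(self, value: float) -> None:
        self._n += 1
        self._sum += value
        self._squares_sum += value * value

    def extract(self) -> float:
        # 0.0 if no values were added before, else the mean (as documented).
        return 0.0 if self._n == 0 else self._sum / self._n

    def standard_deviation(self) -> float:
        # Assumption: population standard deviation, 0.0 when empty.
        if self._n == 0:
            return 0.0
        mean = self.extract()
        variance = max(self._squares_sum / self._n - mean * mean, 0.0)
        return math.sqrt(variance)

    def standard_error(self) -> float:
        # Assumption: standard deviation / sqrt(n), 0.0 when empty.
        if self._n == 0:
            return 0.0
        return self.standard_deviation() / math.sqrt(self._n)


accumulator = MeanAccumulatorSketch()
for value in (1.0, 2.0, 3.0, 4.0):
    accumulator.add(value)
```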

class MultipleChoiceInput(*, question: str, choices: Sequence[str])[source]: Bases: BaseModel

class RecordDataSequence(*, records: Sequence[RecordData])[source]: Bases: BaseModel

class RepositoryNavigator(dataset_repository: DatasetRepository, run_repository: RunRepository, evaluation_repository: EvaluationRepository | None = None)[source]

Bases: object

The RepositoryNavigator is used to retrieve coupled data from multiple repositories.

evaluation_lineage(evaluation_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], output_type: type[Output], evaluation_type: type[Evaluation]) → EvaluationLineage[Input, ExpectedOutput, Output, Evaluation] | None[source]

Retrieves the EvaluationLineage for the evaluation with id evaluation_id and example with id example_id.

Parameters:

evaluation_id – The id of the evaluation
example_id – The id of the example of interest
input_type – The type of the input as defined by the Example
expected_output_type – The type of the expected output as defined by the Example
output_type – The type of the run output as defined by the Output
evaluation_type – The type of the evaluation as defined by the Evaluation

Returns:

The EvaluationLineage for the given evaluation id and example id. Returns None if the lineage is not complete because either an example, a run, or an evaluation does not exist.

evaluation_lineages(evaluation_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], output_type: type[Output], evaluation_type: type[Evaluation]) → Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]][source]

Retrieves all EvaluationLineages for the evaluation with id evaluation_id.

Parameters:

evaluation_id – The id of the evaluation
input_type – The type of the input as defined by the Example
expected_output_type – The type of the expected output as defined by the Example
output_type – The type of the run output as defined by the Output
evaluation_type – The type of the evaluation as defined by the Evaluation

Yields:

All EvaluationLineages for the given evaluation id.

run_lineage(run_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], output_type: type[Output]) → RunLineage[Input, ExpectedOutput, Output] | None[source]

Retrieves the RunLineage for the run with id run_id and example with id example_id.

Parameters:

run_id – The id of the run
example_id – The id of the example
input_type – The type of the input as defined by the Example
expected_output_type – The type of the expected output as defined by the Example
output_type – The type of the run output as defined by the Output

Returns:

The RunLineage for the given run id and example id, None if the example or an output for the example does not exist.

run_lineages(run_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], output_type: type[Output]) → Iterable[RunLineage[Input, ExpectedOutput, Output]][source]

Retrieves all RunLineages for the run with id run_id.

Parameters:

run_id – The id of the run
input_type – The type of the input as defined by the Example
expected_output_type – The type of the expected output as defined by the Example
output_type – The type of the run output as defined by the Output

Yields:

An iterator over all RunLineages for the given run id.
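
The coupling of repositories performed by run_lineage and run_lineages can be pictured with a small in-memory sketch. The dictionaries and the RunLineageSketch type below are hypothetical stand-ins for a DatasetRepository, a RunRepository and the SDK's RunLineage; only the "return None when a part is missing" behaviour is taken from the description above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class RunLineageSketch:
    example_id: str
    example_input: str
    run_id: str
    output: str


# Hypothetical stand-ins for a dataset repository and a run repository.
EXAMPLES = {"ex-1": "What is 2 + 2?", "ex-2": "Name a prime number."}
OUTPUTS = {("run-a", "ex-1"): "4"}  # no output was stored for ex-2


def run_lineage(run_id: str, example_id: str) -> Optional[RunLineageSketch]:
    example_input = EXAMPLES.get(example_id)
    output = OUTPUTS.get((run_id, example_id))
    # None if the example or an output for the example does not exist.
    if example_input is None or output is None:
        return None
    return RunLineageSketch(example_id, example_input, run_id, output)


def run_lineages(run_id: str) -> Iterable[RunLineageSketch]:
    # Yield only complete lineages, in example-ID order.
    for example_id in sorted(EXAMPLES):
        lineage = run_lineage(run_id, example_id)
        if lineage is not None:
            yield lineage
```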

class RougeGrader[source]

Bases: object

calculate_rouge(hypothesis: str, reference: str) → FScores[source]

Calculates the ROUGE-score for the hypothesis and reference.

In the summarization use-case the ROUGE-score roughly corresponds to the recall of the generated summary with regard to the expected summary.

Parameters:

hypothesis – The generation to be evaluated.
reference – The baseline for the evaluation.

Returns:

The ROUGE-score, which contains precision, recall and f1 metrics. All are floats between 0 and 1, where 1 means a perfect match and 0 means no overlap.
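
To make the three metrics concrete, here is a simplified ROUGE-1 sketch based on whitespace-tokenized unigram overlap. It is not the SDK's implementation, which may use a different ROUGE variant and tokenization; FScoresSketch and its field names are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class FScoresSketch(NamedTuple):
    precision: float
    recall: float
    f_score: float


def rouge_1(hypothesis: str, reference: str) -> FScoresSketch:
    hypothesis_counts = Counter(hypothesis.lower().split())
    reference_counts = Counter(reference.lower().split())
    # Multiset intersection: each shared unigram counts at most as often
    # as it appears in both texts.
    overlap = sum((hypothesis_counts & reference_counts).values())
    hypothesis_total = sum(hypothesis_counts.values())
    reference_total = sum(reference_counts.values())
    precision = overlap / hypothesis_total if hypothesis_total else 0.0
    recall = overlap / reference_total if reference_total else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return FScoresSketch(precision, recall, f_score)


scores = rouge_1("the cat sat", "the cat sat on the mat")
```

Here every hypothesis word appears in the reference (precision 1.0), but only half of the reference words are covered (recall 0.5), which illustrates why recall is the more informative number for summaries.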

class RunOverview(*, dataset_id: str, id: str, start: datetime, end: datetime, failed_example_count: int, successful_example_count: int, description: str, labels: set[str] = {}, metadata: dict[str, JsonSerializable] = {})[source]

Bases: BaseModel

Overview of the run of a Task on a dataset.

dataset_id

Identifier of the dataset that was run.

Type:: str

id

The unique identifier of this run.

Type:: str

start

The time when the run was started.

Type:: datetime.datetime

end

The time when the run ended.

Type:: datetime.datetime

failed_example_count

The number of examples where an exception was raised when running the task.

Type:: int

successful_example_count

The number of examples that were successfully run.

Type:: int

description

Human-readable description of the runner that ran the task.

Type:: str

labels

Labels for filtering runs. Defaults to an empty set.

Type:: set[str]

metadata

Additional information about the run. Defaults to empty dict.

Type:: dict[str, JsonSerializable]
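
As an illustration of how these fields relate, the following sketch mirrors most of them in a plain dataclass and derives a run duration and a failure rate. RunOverviewSketch and both helper functions are hypothetical and not part of the SDK.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import FrozenSet


@dataclass(frozen=True)
class RunOverviewSketch:
    dataset_id: str
    id: str
    start: datetime
    end: datetime
    failed_example_count: int
    successful_example_count: int
    description: str
    labels: FrozenSet[str] = field(default_factory=frozenset)


def duration(overview: RunOverviewSketch) -> timedelta:
    return overview.end - overview.start


def failure_rate(overview: RunOverviewSketch) -> float:
    total = overview.failed_example_count + overview.successful_example_count
    return overview.failed_example_count / total if total else 0.0


overview = RunOverviewSketch(
    dataset_id="dataset-1",
    id="run-1",
    start=datetime(2024, 1, 1, 12, 0, 0),
    end=datetime(2024, 1, 1, 12, 5, 0),
    failed_example_count=1,
    successful_example_count=3,
    description="illustrative run",
)
```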

class RunRepository[source]

Bases: ABC

Base run repository interface.

Provides methods to store and load run results: RunOverview and ExampleOutput. A RunOverview is created from, and linked (by its ID) to, multiple ExampleOutputs representing results of a dataset.

final create_temporary_run_data(tmp_hash: str, run_id: str) → None[source]

abstract create_tracer_for_example(run_id: str, example_id: str) → Tracer[source]

Creates and returns a Tracer for the given run ID and example ID.

Parameters:

run_id – The ID of the linked run overview.
example_id – ID of the example whose Tracer should be retrieved.

Returns:

A Tracer.

final delete_temporary_run_data(tmp_hash: str) → None[source]

abstract example_output(run_id: str, example_id: str, output_type: type[Output]) → ExampleOutput | ExampleOutput[FailedExampleRun] | None[source]

Returns ExampleOutput for the given run ID and example ID.

Parameters:

run_id – The ID of the linked run overview.
example_id – ID of the example to retrieve.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

ExampleOutput if it was found, None otherwise.

abstract example_output_ids(run_id: str) → Sequence[str][source]

Returns the sorted IDs of all ExampleOutputs for a given run ID.

Parameters:: run_id – The ID of the run overview.
Returns:: A Sequence of all ExampleOutput IDs.

abstract example_outputs(run_id: str, output_type: type[Output]) → Iterable[ExampleOutput | ExampleOutput[FailedExampleRun]][source]

Returns all ExampleOutput for a given run ID sorted by their example ID.

Parameters:

run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

Iterable of ExampleOutputs.

abstract example_tracer(run_id: str, example_id: str) → Tracer | None[source]

Returns an Optional[Tracer] for the given run ID and example ID.

Parameters:

run_id – The ID of the linked run overview.
example_id – ID of the example whose Tracer should be retrieved.

Returns:

A Tracer if it was found, None otherwise.

failed_example_outputs(run_id: str, output_type: type[Output]) → Iterable[ExampleOutput[FailedExampleRun]][source]

Returns all ExampleOutput for failed example runs with a given run-overview ID sorted by their example ID.

Parameters:

run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

Iterable of ExampleOutputs.

abstract finished_examples(tmp_hash: str) → RecoveryData | None[source]

abstract run_overview(run_id: str) → RunOverview | None[source]

Returns a RunOverview for the given ID.

Parameters:: run_id – ID of the run overview to retrieve.
Returns:: RunOverview if it was found, None otherwise.

abstract run_overview_ids() → Sequence[str][source]

Returns the sorted IDs of all stored RunOverviews.

Returns:: A Sequence of the RunOverview IDs.

run_overviews() → Iterable[RunOverview][source]

Returns all RunOverviews sorted by their ID.

Yields:: Iterable of RunOverviews.

abstract store_example_output(example_output: ExampleOutput) → None[source]

Stores an ExampleOutput.

Parameters:: example_output – The example output to be persisted.

final store_example_output_parallel(tmp_hash: str, example_output: ExampleOutput) → None[source]

abstract store_run_overview(overview: RunOverview) → None[source]

Stores a RunOverview.

Parameters:: overview – The overview to be persisted.

successful_example_outputs(run_id: str, output_type: type[Output]) → Iterable[ExampleOutput][source]

Returns all ExampleOutput for successful example runs with a given run-overview ID sorted by their example ID.

Parameters:

run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

Iterable of ExampleOutputs.

final temp_store_finished_example(tmp_hash: str, example_id: str) → None[source]
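
A rough in-memory sketch of the storage and retrieval contract described for RunRepository: outputs are keyed by run ID and example ID, returned sorted by example ID, and split into failed and successful ones. All class names below are hypothetical stand-ins; the real interface is abstract and also covers overviews, tracers and recovery data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple, Union


@dataclass(frozen=True)
class FailedRunSketch:
    error_message: str


@dataclass(frozen=True)
class ExampleOutputSketch:
    run_id: str
    example_id: str
    output: Union[str, FailedRunSketch]


class InMemoryRunStoreSketch:
    def __init__(self) -> None:
        self._outputs: Dict[Tuple[str, str], ExampleOutputSketch] = {}

    def store_example_output(self, example_output: ExampleOutputSketch) -> None:
        key = (example_output.run_id, example_output.example_id)
        self._outputs[key] = example_output

    def example_output(
        self, run_id: str, example_id: str
    ) -> Optional[ExampleOutputSketch]:
        # None if nothing was stored for this run and example.
        return self._outputs.get((run_id, example_id))

    def example_output_ids(self, run_id: str) -> Sequence[str]:
        return sorted(eid for rid, eid in self._outputs if rid == run_id)

    def example_outputs(self, run_id: str) -> List[ExampleOutputSketch]:
        return [self._outputs[(run_id, eid)] for eid in self.example_output_ids(run_id)]

    def failed_example_outputs(self, run_id: str) -> List[ExampleOutputSketch]:
        return [
            o for o in self.example_outputs(run_id)
            if isinstance(o.output, FailedRunSketch)
        ]

    def successful_example_outputs(self, run_id: str) -> List[ExampleOutputSketch]:
        return [
            o for o in self.example_outputs(run_id)
            if not isinstance(o.output, FailedRunSketch)
        ]


store = InMemoryRunStoreSketch()
store.store_example_output(ExampleOutputSketch("run-1", "b", "fine"))
store.store_example_output(ExampleOutputSketch("run-1", "a", FailedRunSketch("boom")))
store.store_example_output(ExampleOutputSketch("run-2", "c", "other run"))
```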

class Runner(task: Task[Input, Output], dataset_repository: DatasetRepository, run_repository: RunRepository, description: str)[source]

Bases: Generic[Input, Output]

failed_runs(run_id: str, expected_output_type: type[ExpectedOutput]) → Iterable[RunLineage[Input, ExpectedOutput, Output]][source]

Returns the RunLineage objects for all failed example runs that belong to the given run ID.

Parameters:

run_id – The ID of the run overview
expected_output_type – The type of the expected output as defined by the Example

Returns:

Iterable of RunLineages.

input_type() → type[Input][source]

output_type() → type[Output][source]

Returns the type of the evaluated task’s output.

This can be used to retrieve properly typed outputs of an evaluation run from a RunRepository.

Returns:: the type of the evaluated task’s output.

run_dataset(dataset_id: str, tracer: Tracer | None = None, num_examples: int | None = None, abort_on_error: bool = False, max_workers: int = 10, description: str | None = None, trace_examples_individually: bool = True, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None, resume_from_recovery_data: bool = False) → RunOverview[source]

Generates all outputs for the provided dataset.

Will run each Example provided in the dataset through the Task.

Parameters:

dataset_id – The id of the dataset to generate output for. Consists of examples, each with an Input and an ExpectedOutput (can be None).
tracer – An optional Tracer to trace all the runs from each example. Use trace_examples_individually to trace each example with a dedicated tracer individually.
num_examples – An optional int to specify how many examples from the dataset should be run. Always the first n examples will be taken.
abort_on_error – Flag to abort the entire run when an error occurs. Defaults to False.
max_workers – Number of examples that can be evaluated concurrently. Defaults to 10.
description – An optional description of the run. Defaults to None.
trace_examples_individually – Flag to create individual tracers for each example. Defaults to True.
labels – A set of labels for filtering. Defaults to None (no labels).
metadata – A dict for additional information about the run overview. Defaults to an empty dict.
resume_from_recovery_data – Flag to resume if execution failed previously.

Returns:

An overview of the run. Outputs will not be returned but instead stored in the RunRepository provided in the __init__.
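
The interplay of num_examples, abort_on_error and max_workers can be sketched with a thread pool. This is an assumption-laden illustration rather than the Runner's actual code: the task is a plain callable, examples are (example_id, input) tuples, and failures are kept as exception objects instead of being written to a RunRepository.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Dict, Iterable, Optional, Tuple, Union


def run_examples(
    task: Callable[[str], str],
    examples: Iterable[Tuple[str, str]],
    num_examples: Optional[int] = None,
    abort_on_error: bool = False,
    max_workers: int = 10,
) -> Dict[str, Union[str, Exception]]:
    # Always the first n examples are taken; None means all of them.
    selected = list(islice(examples, num_examples))

    def run_one(example: Tuple[str, str]) -> Tuple[str, Union[str, Exception]]:
        example_id, task_input = example
        try:
            return example_id, task(task_input)
        except Exception as error:
            if abort_on_error:
                raise  # the error propagates to the caller
            return example_id, error  # recorded as a failed example

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return dict(executor.map(run_one, selected))


def shout(text: str) -> str:
    if not text:
        raise ValueError("empty input")
    return text.upper()


results = run_examples(shout, [("a", "hi"), ("b", ""), ("c", "yo")], num_examples=2)
```

With abort_on_error left at False, the failing example "b" is recorded next to the successful one; "c" is never run because only the first two examples were requested.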

run_is_already_computed(metadata: dict[str, JsonSerializable]) → bool[source]

Checks if a run with the given metadata has already been computed.

Parameters:: metadata – The metadata dictionary to check.
Returns:: True if a run with the same metadata has already been computed. False otherwise.
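
One plausible reading of this check is a plain equality comparison against the metadata of the stored run overviews. The sketch below assumes exactly that; the function name and the list of stored metadata dictionaries are hypothetical, and the SDK may compare differently.

```python
from typing import Any, Dict, Iterable


def run_is_already_computed_sketch(
    metadata: Dict[str, Any], stored_metadata: Iterable[Dict[str, Any]]
) -> bool:
    # Assumption: a run counts as computed if some stored run has equal metadata.
    return any(existing == metadata for existing in stored_metadata)


stored = [{"model": "small", "seed": 1}, {"model": "large", "seed": 1}]
already_done = run_is_already_computed_sketch({"seed": 1, "model": "large"}, stored)
not_done = run_is_already_computed_sketch({"model": "large", "seed": 2}, stored)
```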

run_lineage(run_id: str, example_id: str, expected_output_type: type[ExpectedOutput]) → RunLineage[Input, ExpectedOutput, Output] | None[source]

Wrapper for RepositoryNavigator.run_lineage.

Parameters:

run_id – The id of the run
example_id – The id of the example of interest
expected_output_type – The type of the expected output as defined by the Example

Returns:

The RunLineage for the given run id and example id, None if the example or an output for the example does not exist.

run_lineages(run_id: str, expected_output_type: type[ExpectedOutput]) → Iterable[RunLineage[Input, ExpectedOutput, Output]][source]

Wrapper for RepositoryNavigator.run_lineages.

Parameters:

run_id – The id of the run
expected_output_type – The type of the expected output as defined by the Example

Returns:

An iterator over all RunLineages for the given run id.

class SingleHuggingfaceDatasetRepository(huggingface_dataset: DatasetDict | Dataset | IterableDatasetDict | IterableDataset)[source]

Bases: DatasetRepository

create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) → Dataset[source]

Creates a dataset from the given Examples and returns the created Dataset.

Parameters:

examples – An Iterable of Examples to be saved in the same dataset.
dataset_name – A name for the dataset.
id – The dataset ID. If None, an ID will be generated.
labels – A set of labels for filtering. Defaults to None (no labels).
metadata – A dict for additional information about the dataset. Defaults to an empty dict.

Returns:

The created Dataset.

dataset(dataset_id: str) → Dataset | None[source]

Returns a dataset identified by the given dataset ID.

Parameters:: dataset_id – Dataset ID of the dataset to retrieve.
Returns:: Dataset if it was found, None otherwise.

dataset_ids() → Iterable[str][source]

Returns all sorted dataset IDs.

Returns:: Iterable of dataset IDs.

datasets() → Iterable[Dataset]

Returns all Datasets sorted by their ID.

Yields:: Datasets.

delete_dataset(dataset_id: str) → None[source]

Deletes a dataset identified by the given dataset ID.

Parameters:: dataset_id – Dataset ID of the dataset to delete.

example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) → Example | None[source]

Returns an Example for the given dataset ID and example ID.

Parameters:

dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.

Returns:

Example if it was found, None otherwise.

examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) → Iterable[Example][source]

Returns all Examples for the given dataset ID sorted by their ID.

Parameters:

dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional set of example IDs. Those examples will be excluded from the output.

Returns:

Iterable of Examples.
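
The effect of examples_to_skip combined with the sorted return order can be sketched as below. The dataset is a hypothetical dictionary mapping example IDs to inputs; the real method additionally takes input_type and expected_output_type to return typed Example objects.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Tuple


def examples_sketch(
    dataset: Dict[str, str], examples_to_skip: Optional[FrozenSet[str]] = None
) -> Iterable[Tuple[str, str]]:
    skip = examples_to_skip or frozenset()
    # Sorted by example ID, leaving out every ID listed in examples_to_skip.
    for example_id in sorted(dataset):
        if example_id not in skip:
            yield example_id, dataset[example_id]


dataset = {"ex-2": "second", "ex-1": "first", "ex-3": "third"}
remaining = list(examples_sketch(dataset, frozenset({"ex-2"})))
```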

class SingleOutputEvaluationLogic[source]

Bases: EvaluationLogic[Input, Output, ExpectedOutput, Evaluation]

final do_evaluate(example: Example, *output: SuccessfulExampleOutput) → Evaluation[source]

Executes the evaluation for this specific example.

Responsible for comparing the input & expected output of a task to the actually generated output.

Parameters:

example – Input data of Task to produce the output.
*output – Output of the Task.

Returns:

The metrics that come from the evaluated Task.

abstract do_evaluate_single_output(example: Example, output: Output) → Evaluation[source]
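
The split between the final do_evaluate and the abstract do_evaluate_single_output suggests a template-method pattern. The sketch below assumes that do_evaluate expects exactly one output and delegates it; the classes are simplified stand-ins without the SDK's generics, and ExactMatchLogicSketch is a hypothetical subclass.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class ExampleSketch:
    input: str
    expected_output: str


class SingleOutputLogicSketch(ABC):
    def do_evaluate(self, example: ExampleSketch, *output: str) -> bool:
        # Assumption: exactly one output is expected; anything else is rejected.
        if len(output) != 1:
            raise ValueError(f"expected exactly one output, got {len(output)}")
        return self.do_evaluate_single_output(example, output[0])

    @abstractmethod
    def do_evaluate_single_output(self, example: ExampleSketch, output: str) -> bool:
        ...


class ExactMatchLogicSketch(SingleOutputLogicSketch):
    def do_evaluate_single_output(self, example: ExampleSketch, output: str) -> bool:
        return example.expected_output.strip() == output.strip()


logic = ExactMatchLogicSketch()
example = ExampleSketch(input="2 + 2", expected_output="4")
```

Subclasses only implement the single-output comparison, while the variadic entry point stays the same for every evaluation logic.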

class StudioBenchmark(benchmark_id: str, name: str, dataset_id: str, eval_logic: EvaluationLogic[Input, Output, ExpectedOutput, Evaluation], aggregation_logic: AggregationLogic[Evaluation, AggregatedEvaluation], studio_client: StudioClient, **kwargs: Any)[source]

Bases: Benchmark

execute(task: Task[Input, Output], name: str, description: str | None = None, labels: set[str] | None = None, metadata: dict[str, Any] | None = None, max_workers: int = 10) → str[source]

Executes the benchmark on a given task.

Parameters:

task – The task to be evaluated in the benchmark.
name – Name of the benchmark execution.
description – Description of the task to be evaluated.
labels – Labels for filtering or categorizing the benchmark.
metadata – Additional information about the task for logging or configuration.
max_workers – Maximum number of concurrent workers to use for the benchmark execution.

Returns:

Identifier of the benchmark run.

class StudioBenchmarkRepository(studio_client: StudioClient)[source]

Bases: BenchmarkRepository

create_benchmark(dataset_id: str, eval_logic: EvaluationLogic[Input, Output, ExpectedOutput, Evaluation], aggregation_logic: AggregationLogic[Evaluation, AggregatedEvaluation], name: str, metadata: dict[str, Any] | None = None, description: str | None = None) → StudioBenchmark[source]

Creates a new benchmark and stores it in the repository.

Parameters:

dataset_id – Identifier for the dataset associated with the benchmark.
eval_logic – Evaluation logic to be applied in the benchmark.
aggregation_logic – Aggregation logic for combining individual evaluations.
name – Name of the benchmark.
metadata – Additional information about the benchmark, defaults to None.
description – Description of the benchmark, defaults to None.

Returns:

The created benchmark instance.

get_benchmark(benchmark_id: str, eval_logic: EvaluationLogic[Input, Output, ExpectedOutput, Evaluation], aggregation_logic: AggregationLogic[Evaluation, AggregatedEvaluation], allow_diff: bool = False) → StudioBenchmark | None[source]

Retrieves an existing benchmark from the repository.

Parameters:

benchmark_id – Unique identifier for the benchmark to retrieve.
eval_logic – Evaluation logic to apply.
aggregation_logic – Aggregation logic to apply.
allow_diff – Retrieve the benchmark even if the behaviour of the logics does not match.

Returns:

The retrieved benchmark instance. Raises ValueError if no benchmark is found.

class StudioDatasetRepository(studio_client: StudioClient)[source]

Bases: DatasetRepository

Dataset repository interface with Data Platform.

create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) → Dataset[source]

Creates a dataset from the given Examples and returns the created Dataset.

Parameters:

examples – An Iterable of Examples to be saved in the same dataset.
dataset_name – A name for the dataset.
id – ID is not used in the StudioDatasetRepository as it is generated by the Studio.
labels – A set of labels for filtering. Defaults to None.
metadata – A dict for additional information about the dataset. Defaults to None.

Returns:

The created Dataset.

dataset(dataset_id: str) → Dataset | None[source]

Returns a dataset identified by the given dataset ID.

Parameters:: dataset_id – Dataset ID of the dataset to retrieve.
Returns:: Dataset if it was found, None otherwise.

dataset_ids() → Iterable[str][source]

Returns all sorted dataset IDs.

Returns:: Iterable of dataset IDs.

datasets() → Iterable[Dataset][source]

Returns all Datasets. Sorting is not guaranteed.

Returns:: Sequence of Datasets.

delete_dataset(dataset_id: str) → None[source]

Deletes a dataset identified by the given dataset ID.

Parameters:: dataset_id – Dataset ID of the dataset to delete.

example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) → Example | None[source]

Returns an Example for the given dataset ID and example ID.

Parameters:

dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.

Returns:

Example if it was found, None otherwise.

examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) → Iterable[Example][source]

Returns all Examples for the given dataset ID sorted by their ID.

Parameters:

dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional set of example IDs. Those examples will be excluded from the output. Defaults to None.

Returns:

Iterable of Examples.

static map_to_example(example_to_map: StudioExample[Input, ExpectedOutput]) → Example[source]

static map_to_many_example(examples_to_map: Iterable[StudioExample[Input, ExpectedOutput]]) → Iterable[Example][source]

static map_to_many_studio_example(examples_to_map: Iterable[Example]) → Iterable[StudioExample[Input, ExpectedOutput]][source]

static map_to_studio_dataset(dataset_to_map: Dataset) → StudioDataset[source]

static map_to_studio_example(example_to_map: Example) → StudioExample[Input, ExpectedOutput][source]

class SuccessfulExampleOutput(*, run_id: str, example_id: str, output: Output)[source]

Bases: BaseModel, Generic[Output]

Successful output of a single evaluated Example.

run_id

Identifier of the run that created the output.

Type:: str

example_id

Identifier of the Example.

Type:: str

output

Generated when running the Task. This represents only the output of a successful run.

Type:: pharia_inference_sdk.core.task.Output

Generics:: Output: Interface of the output returned by the task.

class WinRateCalculator(players: Iterable[str])[source]

Bases: object

calculate(matches: Sequence[ComparisonEvaluation]) → Mapping[str, float][source]

aggregation_overviews_to_pandas(aggregation_overviews: Sequence[AggregationOverview], unwrap_statistics: bool = True, strict: bool = True, unwrap_metadata: bool = True) → DataFrame[source]

Converts aggregation overviews to a pandas table for easier comparison.

Parameters:

aggregation_overviews – Overviews to convert.
unwrap_statistics – Unwrap the statistics field in the overviews into separate columns. Defaults to True.
strict – Allow only overviews with exactly equal statistics types. Defaults to True.
unwrap_metadata – Unwrap the metadata field in the overviews into separate columns. Defaults to True.

Returns:

A pandas DataFrame containing an overview per row with fields as columns.
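
The unwrapping step can be illustrated without pandas by flattening plain dictionaries into rows, which could then be handed to a DataFrame constructor. The overview dictionaries and the function name are hypothetical, and strict is approximated here as "all overviews share the same statistics keys", which is an assumption about what "exactly equal statistics types" means.

```python
from typing import Any, Dict, List, Sequence


def overviews_to_rows(
    overviews: Sequence[Dict[str, Any]],
    unwrap_statistics: bool = True,
    strict: bool = True,
) -> List[Dict[str, Any]]:
    if strict:
        # Assumption: "equal statistics types" means identical statistic keys.
        key_sets = {tuple(sorted(o["statistics"])) for o in overviews}
        if len(key_sets) > 1:
            raise ValueError("overviews have differing statistics fields")
    rows = []
    for overview in overviews:
        row = dict(overview)  # shallow copy so the input stays untouched
        if unwrap_statistics:
            # Each statistic becomes its own column.
            row.update(row.pop("statistics"))
        rows.append(row)
    return rows


overviews = [
    {"id": "agg-1", "statistics": {"accuracy": 0.8, "count": 10}},
    {"id": "agg-2", "statistics": {"accuracy": 0.9, "count": 12}},
]
rows = overviews_to_rows(overviews)
```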

evaluation_lineages_to_pandas(evaluation_lineages: Sequence[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]) → DataFrame[source]

Converts a sequence of EvaluationLineage objects to a pandas DataFrame.

The EvaluationLineage objects are stored in the column “lineage”. The DataFrame is indexed by (example_id, evaluation_id, run_id). Each output of every lineage will contribute one row in the DataFrame.

Parameters:: evaluation_lineages – The lineages to convert.
Returns:: A pandas DataFrame with the data contained in the evaluation_lineages.
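
The indexing scheme described above, one row per output keyed by (example_id, evaluation_id, run_id), can be sketched with a dictionary in place of a DataFrame. The lineage and output classes are hypothetical stand-ins for the SDK types.

```python
from dataclasses import dataclass
from typing import Any, Dict, Sequence, Tuple


@dataclass(frozen=True)
class OutputSketch:
    run_id: str
    output: str


@dataclass(frozen=True)
class EvaluationLineageSketch:
    example_id: str
    evaluation_id: str
    outputs: Tuple[OutputSketch, ...]


def index_lineages(
    lineages: Sequence[EvaluationLineageSketch],
) -> Dict[Tuple[str, str, str], Dict[str, Any]]:
    rows: Dict[Tuple[str, str, str], Dict[str, Any]] = {}
    for lineage in lineages:
        # Every output of a lineage contributes its own row.
        for output in lineage.outputs:
            key = (lineage.example_id, lineage.evaluation_id, output.run_id)
            rows[key] = {"lineage": lineage, "output": output.output}
    return rows


lineage = EvaluationLineageSketch(
    example_id="ex-1",
    evaluation_id="eval-1",
    outputs=(OutputSketch("run-a", "4"), OutputSketch("run-b", "five")),
)
indexed = index_lineages([lineage])
```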

run_lineages_to_pandas(run_lineages: Sequence[RunLineage[Input, ExpectedOutput, Output]]) → DataFrame[source]

Converts a sequence of RunLineage objects to a pandas DataFrame.

The RunLineage objects are stored in the column “lineage”. The DataFrame is indexed by (example_id, run_id).

Parameters:: run_lineages – The lineages to convert.
Returns:: A pandas DataFrame with the data contained in the run_lineages.