SDK Reference
Connectors Module
- class AggregationLogicIdentifier(*, logic: str, evaluation_schema: dict[str, Any], aggregation_schema: dict[str, Any])[source]
Bases:
BaseModel
- class ArgillaClient[source]
Bases:
ABC
Client interface for accessing an Argilla server.
Argilla supports human-in-the-loop evaluation. This class defines the API used by the intelligence layer to create feedback datasets or retrieve evaluation results.
- abstract add_record(dataset_id: str, record: RecordData) None[source]
Adds a new record to the given dataset.
- Parameters:
dataset_id – id of the dataset the record is added to
record – the actual record data (i.e. content for the dataset’s fields)
- add_records(dataset_id: str, records: Sequence[RecordData]) None[source]
Adds new records to the given dataset.
- Parameters:
dataset_id – id of the dataset the records are added to
records – list containing the record data (i.e. content for the dataset’s fields)
- abstract create_dataset(workspace_id: str, dataset_name: str, fields: Sequence[Any], questions: Sequence[Any]) str[source]
Creates and publishes a new feedback dataset in Argilla.
Raises an error if the name exists already.
- Parameters:
workspace_id – the id of the workspace the feedback dataset should be created in. The user executing this request must have corresponding permissions for this workspace.
dataset_name – the name of the feedback-dataset to be created.
fields – all fields of this dataset.
questions – all questions for this dataset.
- Returns:
The id of the created dataset.
- abstract ensure_dataset_exists(workspace_id: str, dataset_name: str, fields: Sequence[Any], questions: Sequence[Any]) str[source]
Retrieves an existing dataset or creates and publishes a new feedback dataset in Argilla.
- Parameters:
workspace_id – the id of the workspace the feedback dataset should be created in. The user executing this request must have corresponding permissions for this workspace.
dataset_name – the name of the feedback-dataset to be created.
fields – all fields of this dataset.
questions – all questions for this dataset.
- Returns:
The id of the retrieved or newly created dataset.
- abstract evaluations(dataset_id: str) Iterable[ArgillaEvaluation][source]
Returns all human-evaluated evaluations for the given dataset.
- Parameters:
dataset_id – the id of the dataset.
- Returns:
An Iterable over all human-evaluated evaluations for the given dataset.
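To illustrate how the methods above relate to each other, here is a minimal in-memory sketch. It does not import the SDK: `InMemoryArgillaSketch`, its `records` helper, the use of plain dicts as records, and the choice of `ValueError` for duplicate names are all assumptions made for this example. Looping over `add_record` inside `add_records` is likewise only a guess at a plausible default. A real implementation would subclass ArgillaClient and talk to an Argilla server.

```python
import uuid
from collections.abc import Sequence
from typing import Any


class InMemoryArgillaSketch:
    """Hypothetical stand-in mirroring the documented ArgillaClient methods."""

    def __init__(self) -> None:
        # (workspace_id, dataset_name) -> dataset id
        self._dataset_ids: dict = {}
        # dataset id -> records added so far
        self._records: dict = {}

    def create_dataset(
        self,
        workspace_id: str,
        dataset_name: str,
        fields: Sequence[Any],
        questions: Sequence[Any],
    ) -> str:
        key = (workspace_id, dataset_name)
        if key in self._dataset_ids:
            # Assumption: the reference does not name the error type.
            raise ValueError(f"dataset {dataset_name!r} already exists")
        dataset_id = str(uuid.uuid4())
        self._dataset_ids[key] = dataset_id
        self._records[dataset_id] = []
        return dataset_id

    def ensure_dataset_exists(
        self,
        workspace_id: str,
        dataset_name: str,
        fields: Sequence[Any],
        questions: Sequence[Any],
    ) -> str:
        # Return the existing id if there is one, otherwise create the dataset.
        existing = self._dataset_ids.get((workspace_id, dataset_name))
        if existing is not None:
            return existing
        return self.create_dataset(workspace_id, dataset_name, fields, questions)

    def add_record(self, dataset_id: str, record: dict) -> None:
        self._records[dataset_id].append(record)

    def add_records(self, dataset_id: str, records: Sequence[dict]) -> None:
        for record in records:
            self.add_record(dataset_id, record)

    def records(self, dataset_id: str) -> list:
        # Hypothetical helper for inspection; not part of the documented API.
        return list(self._records[dataset_id])
```

Calling `ensure_dataset_exists` twice with the same workspace and name returns the same id, whereas a second `create_dataset` call with that name raises.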
- class ArgillaEvaluation(*, example_id: str, record_id: str, responses: Mapping[str, Any], metadata: Mapping[str, Any])[source]
Bases:
BaseModel
The evaluation result for a single rating record in an Argilla feedback-dataset.
- example_id
the id of the example that was evaluated.
- Type:
str
- record_id
the id of the record that is evaluated.
- Type:
str
- responses
Maps question-names (Question.name) to response values.
- Type:
collections.abc.Mapping[str, Any]
- metadata
Metadata belonging to the evaluation, for example ids of completions.
- Type:
collections.abc.Mapping[str, Any]
- class BenchmarkLineage(*, trace_id: str, input: Input, expected_output: ExpectedOutput, output: Output, example_metadata: dict[str, Any] | None = None, evaluation: Any, run_latency: int, run_tokens: int)[source]
Bases:
BaseModel,Generic[Input,ExpectedOutput,Output,Evaluation]
- class EvaluationLogicIdentifier(*, logic: str, input_schema: dict[str, Any], output_schema: dict[str, Any], expected_output_schema: dict[str, Any], evaluation_schema: dict[str, Any])[source]
Bases:
BaseModel
- class GetBenchmarkLineageResponse(*, id: str, trace_id: str, benchmark_execution_id: str, input: JsonSerializable, expected_output: JsonSerializable, example_metadata: dict[str, JsonSerializable] | None = None, output: JsonSerializable, evaluation: JsonSerializable, run_latency: int, run_tokens: int)[source]
Bases:
BaseModel
- class GetBenchmarkResponse(*, id: str, project_id: str, dataset_id: str, name: str, description: str | None, benchmark_metadata: dict[str, Any] | None, evaluation_logic: EvaluationLogicIdentifier, aggregation_logic: AggregationLogicIdentifier, created_at: datetime, updated_at: datetime | None, last_executed_at: datetime | None, created_by: str | None, updated_by: str | None)[source]
Bases:
BaseModel
- class GetDatasetExamplesResponse(*, total: int, page: int, size: int, num_pages: int, items: Sequence[StudioExample])[source]
Bases:
BaseModel,Generic[Input,ExpectedOutput]
- class PostBenchmarkExecution(*, name: str, description: str | None, labels: set[str] | None, metadata: dict[str, Any] | None, start: datetime, end: datetime, run_start: datetime, run_end: datetime, run_successful_count: int, run_failed_count: int, run_success_avg_latency: float, run_success_avg_token_count: float, eval_start: datetime, eval_end: datetime, eval_successful_count: int, eval_failed_count: int, aggregation_start: datetime, aggregation_end: datetime, statistics: JsonSerializable)[source]
Bases:
BaseModel
- class PostBenchmarkLineagesRequest(root: RootModelRootType = PydanticUndefined)[source]
Bases:
RootModel[Sequence[BenchmarkLineage]]
- classmethod model_construct(root: RootModelRootType, _fields_set: set[str] | None = None) Self
Create a new model using the provided root object and update fields set.
- Parameters:
root – The root object of the model.
_fields_set – The set of fields to be updated.
- Returns:
The new model.
- Raises:
NotImplemented – If the model is not a subclass of RootModel.
- class PostBenchmarkLineagesResponse(root: RootModelRootType = PydanticUndefined)[source]
Bases:
RootModel[Sequence[str]]
- classmethod model_construct(root: RootModelRootType, _fields_set: set[str] | None = None) Self
Create a new model using the provided root object and update fields set.
- Parameters:
root – The root object of the model.
_fields_set – The set of fields to be updated.
- Returns:
The new model.
- Raises:
NotImplemented – If the model is not a subclass of RootModel.
- class PostBenchmarkRequest(*, dataset_id: str, name: str, description: str | None, benchmark_metadata: dict[str, Any] | None, evaluation_logic: EvaluationLogicIdentifier, aggregation_logic: AggregationLogicIdentifier)[source]
Bases:
BaseModel
- class Record(*, content: ~collections.abc.Mapping[str, str], example_id: str, metadata: ~collections.abc.Mapping[str, str | int] = <factory>, id: str)[source]
Bases:
RecordData
Represents an Argilla record of a feedback-dataset.
Just adds the id to a RecordData.
- id
the Argilla generated id of the record.
- Type:
str
- class RecordData(*, content: ~collections.abc.Mapping[str, str], example_id: str, metadata: ~collections.abc.Mapping[str, str | int] = <factory>)[source]
Bases:
BaseModel
Input-data for an Argilla evaluation record.
This can be used to add a new record to an existing Argilla feedback-dataset. Once it is added it gets an Argilla provided id and can be retrieved as Record.
- content
Maps field-names (Field.name) to string values that can be displayed to the user.
- Type:
collections.abc.Mapping[str, str]
- example_id
the id of the corresponding Example from a Dataset.
- Type:
str
- metadata
Arbitrary metadata in the form of key/value strings that can be attached to a record.
- Type:
collections.abc.Mapping[str, str | int]
- class StudioClient(project: str, studio_url: str | None = None, auth_token: str | None = None, create_project: bool = False)[source]
Bases:
object
Client for communicating with Studio.
- project_id
The unique identifier of the project currently in use.
- url
The url of your current Studio instance.
- create_project(project: str, description: str | None = None, reuse_existing: bool = False) str[source]
Creates a project in Studio.
Projects are uniquely identified by the user provided name.
- Parameters:
project – User provided name of the project.
description – Description explaining the usage of the project. Defaults to None.
reuse_existing – Reuse project with specified name if already existing. Defaults to False.
- Returns:
The ID of the newly created project.
- get_benchmark(benchmark_id: str) GetBenchmarkResponse | None[source]
- get_benchmark_lineage(benchmark_id: str, execution_id: str, lineage_id: str) GetBenchmarkLineageResponse | None[source]
- get_dataset_examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) Iterable[StudioExample][source]
- submit_benchmark(dataset_id: str, eval_logic: EvaluationLogicIdentifier, aggregation_logic: AggregationLogicIdentifier, name: str, description: str | None = None, metadata: dict[str, Any] | None = None) str[source]
- submit_benchmark_execution(benchmark_id: str, data: PostBenchmarkExecution) str[source]
- submit_benchmark_lineages(benchmark_lineages: Sequence[BenchmarkLineage], benchmark_id: str, execution_id: str, max_payload_size: int = 52428800) PostBenchmarkLineagesResponse[source]
Submit benchmark lineages in batches to avoid exceeding the maximum payload size.
- Parameters:
benchmark_lineages – List of BenchmarkLineage objects to submit.
benchmark_id – ID of the benchmark.
execution_id – ID of the execution.
max_payload_size – Maximum size of the payload in bytes. Defaults to 52428800 (50 MiB).
- Returns:
Response containing the results of the submissions.
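The batching described for submit_benchmark_lineages can be pictured with the following sketch. It is not the SDK's implementation: `batch_by_payload_size` is a hypothetical helper, lineages are represented as plain dicts, and the payload size is approximated as the sum of the UTF-8 encoded JSON sizes of the items, ignoring the enclosing array syntax. An item that exceeds the limit on its own is emitted as a single-item batch here, which is also an assumption.

```python
import json
from collections.abc import Iterator, Sequence
from typing import Any


def batch_by_payload_size(
    items: Sequence[dict], max_payload_size: int
) -> Iterator[list]:
    """Yield consecutive batches whose summed JSON size stays within the limit."""
    batch: list = []
    batch_bytes = 0
    for item in items:
        item_bytes = len(json.dumps(item).encode("utf-8"))
        # Start a new batch when adding this item would exceed the limit.
        if batch and batch_bytes + item_bytes > max_payload_size:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(item)
        batch_bytes += item_bytes
    if batch:
        yield batch


# Each item below serialises to 17 bytes, so a 40-byte limit fits two per batch.
lineages = [{"trace_id": str(index)} for index in range(5)]
batch_lengths = [len(batch) for batch in batch_by_payload_size(lineages, 40)]
```

Each batch would then be sent as one request, and the responses combined into a single result.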
- submit_dataset(dataset: StudioDataset, examples: Iterable[StudioExample]) str[source]
Submits a dataset to Studio.
- Parameters:
dataset – Dataset to be uploaded
examples – Examples of the Dataset
- Returns:
ID of the created dataset
- submit_from_tracer(tracer: Tracer) list[str][source]
Sends all trace data from the Tracer to Studio.
- Parameters:
tracer – Tracer to extract data from.
- Returns:
List of created trace IDs.
- submit_trace(data: Sequence[ExportedSpan]) str[source]
Sends the provided spans to Studio as a single trace.
The method fails if the span list is empty, if the trace has already been created, or if the spans belong to multiple traces.
- Parameters:
data – Spans to create the trace from. Created by exporting from a Tracer.
- Returns:
The ID of the created trace.
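Two of the failure conditions listed for submit_trace (an empty span list, and spans from more than one trace) can be checked locally before sending anything. The sketch below assumes a simplified span type: `SpanSketch` and `single_trace_id` are hypothetical names, and the real ExportedSpan may store its trace id differently. Whether a trace already exists can only be determined by Studio and is not covered.

```python
from collections.abc import Sequence
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanSketch:
    """Hypothetical, simplified stand-in for an exported span."""

    trace_id: str
    name: str


def single_trace_id(spans: Sequence[SpanSketch]) -> str:
    """Return the trace id shared by all spans, or raise if there is none."""
    if not spans:
        raise ValueError("cannot create a trace from an empty span list")
    trace_ids = {span.trace_id for span in spans}
    if len(trace_ids) != 1:
        raise ValueError(f"spans belong to multiple traces: {sorted(trace_ids)}")
    return trace_ids.pop()
```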
- class StudioDataset(*, id: str = <factory>, name: str, labels: set[str] = {}, metadata: dict[str, JsonSerializable] = {})[source]
Bases:
BaseModel
Represents a Dataset linked to multiple examples as sent to Studio.
- id
Dataset ID.
- Type:
str
- name
A short name of the dataset.
- Type:
str
- labels
Labels for filtering datasets. Defaults to an empty set.
- metadata
Additional information about the dataset. Defaults to empty dict.
- Type:
dict[str, JsonSerializable]
- class StudioExample(*, input: ~pharia_studio_sdk.connectors.studio.studio.Input, expected_output: ~pharia_studio_sdk.connectors.studio.studio.ExpectedOutput, id: str = <factory>, metadata: dict[str, JsonSerializable] | None = None)[source]
Bases:
BaseModel,Generic[Input,ExpectedOutput]
Represents an instance of Example as sent to Studio.
- input
Input for the Task. Has to be the same type as the input for the task used.
- Type:
pharia_studio_sdk.connectors.studio.studio.Input
- expected_output
The expected output from a given example run. The evaluator compares the received output against it.
- Type:
pharia_studio_sdk.connectors.studio.studio.ExpectedOutput
- id
Identifier for the example, defaults to uuid.
- Type:
str
- metadata
Optional dictionary of custom key-value pairs.
- Type:
dict[str, JsonSerializable] | None
- Generics:
Input: Interface to be passed to the Task that shall be evaluated.
ExpectedOutput: Output that is expected from the run with the supplied input.
Evaluation Module
- class AggregationLogic[source]
Bases:
ABC,Generic[Evaluation,AggregatedEvaluation]
- abstract aggregate(evaluations: Iterable[Evaluation]) AggregatedEvaluation[source]
Evaluator-specific method for aggregating individual Evaluations into a report-like AggregatedEvaluation.
This method is responsible for taking the results of an evaluation run and aggregating all the results. It should create an AggregatedEvaluation class and return it at the end.
- Parameters:
evaluations – The results from running eval_and_aggregate_runs with a Task.
- Returns:
The aggregated results of an evaluation run with a Dataset.
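As an illustration of what an aggregate implementation might look like, the sketch below averages a numeric score. It is self-contained and does not import the SDK: `AggregationLogicSketch` only mirrors the documented abstract method, and `ScoreEvaluation`, `MeanScore` and `MeanScoreAggregation` are hypothetical names invented for this example. Plain dataclasses are used where the SDK would presumably expect pydantic models.

```python
from abc import ABC, abstractmethod
from collections.abc import Iterable
from dataclasses import dataclass
from typing import Generic, TypeVar

Evaluation = TypeVar("Evaluation")
AggregatedEvaluation = TypeVar("AggregatedEvaluation")


class AggregationLogicSketch(ABC, Generic[Evaluation, AggregatedEvaluation]):
    """Stand-in mirroring the documented abstract `aggregate` method."""

    @abstractmethod
    def aggregate(self, evaluations: Iterable[Evaluation]) -> AggregatedEvaluation:
        ...


@dataclass(frozen=True)
class ScoreEvaluation:
    score: float


@dataclass(frozen=True)
class MeanScore:
    count: int
    mean: float


class MeanScoreAggregation(AggregationLogicSketch[ScoreEvaluation, MeanScore]):
    def aggregate(self, evaluations: Iterable[ScoreEvaluation]) -> MeanScore:
        # Materialise the iterable once, since it may be a generator.
        scores = [evaluation.score for evaluation in evaluations]
        mean = sum(scores) / len(scores) if scores else 0.0
        return MeanScore(count=len(scores), mean=mean)
```

Returning 0.0 for an empty input is a design choice of this sketch; raising an error or returning an optional value would be equally defensible.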
- class AggregationOverview(*, evaluation_overviews: frozenset[EvaluationOverview], id: str, start: datetime, end: datetime, successful_evaluation_count: int, crashed_during_evaluation_count: int, description: str, statistics: Annotated[AggregatedEvaluation, SerializeAsAny()], labels: set[str] = {}, metadata: dict[str, JsonSerializable] = {})[source]
Bases:
BaseModel,Generic[AggregatedEvaluation]
Complete overview of the results of evaluating a Task on a dataset.
Created when running Evaluator.eval_and_aggregate_runs(). Contains high-level information and statistics.
- evaluation_overviews
EvaluationOverviews used for aggregation.
- Type:
frozenset[pharia_studio_sdk.evaluation.evaluation.domain.EvaluationOverview]
- id
Aggregation overview ID.
- Type:
str
- start
Start timestamp of the aggregation.
- Type:
datetime.datetime
- end
End timestamp of the aggregation.
- Type:
datetime.datetime
- end
The time when the evaluation run ended
- Type:
datetime.datetime
- successful_evaluation_count
The number of examples that were successfully evaluated.
- Type:
int
- crashed_during_evaluation_count
The number of examples that crashed during evaluation.
- Type:
int
- failed_evaluation_count
The number of examples that crashed during evaluation plus the number of examples that failed to produce an output for evaluation.
- description
A short description.
- Type:
str
- statistics
Aggregated statistics of the run. Whatever is returned by Evaluator.aggregate().
- Type:
pharia_studio_sdk.evaluation.aggregation.domain.AggregatedEvaluation
- labels
Labels for filtering aggregation. Defaults to an empty set.
- Type:
set[str]
- metadata
Additional information about the aggregation. Defaults to empty dict.
- Type:
dict[str, JsonSerializable]
- run_overviews() Iterable[RunOverview][source]
- class AggregationRepository[source]
Bases:
ABC
Base aggregation repository interface.
Provides methods to store and load aggregated evaluation results: AggregationOverview.
- abstract aggregation_overview(aggregation_id: str, aggregation_type: type[AggregatedEvaluation]) AggregationOverview | None[source]
Returns an AggregationOverview for the given ID.
- Parameters:
aggregation_id – ID of the aggregation overview to retrieve.
aggregation_type – Type of the aggregation.
- Returns:
AggregationOverview if it was found, None otherwise.
- abstract aggregation_overview_ids() Sequence[str][source]
Returns sorted IDs of all stored AggregationOverviews.
- Returns:
A Sequence of the AggregationOverview IDs.
- aggregation_overviews(aggregation_type: type[AggregatedEvaluation]) Iterable[AggregationOverview][source]
Returns all AggregationOverviews sorted by their ID.
- Parameters:
aggregation_type – Type of the aggregation.
- Yields:
AggregationOverviews.
- abstract store_aggregation_overview(aggregation_overview: AggregationOverview) None[source]
Stores an AggregationOverview.
- Parameters:
aggregation_overview – The aggregated results to be persisted.
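The contract described above (store, look up by ID, list IDs in sorted order, iterate overviews in ID order) can be sketched with a dictionary-backed store. This is an illustration only: `OverviewSketch` and `InMemoryAggregationStoreSketch` are hypothetical names, the overview is reduced to two fields, and the `aggregation_type` parameter of the documented methods is omitted because nothing needs to be deserialised here.

```python
from collections.abc import Iterator, Sequence
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OverviewSketch:
    """Hypothetical, reduced stand-in for an AggregationOverview."""

    id: str
    description: str


class InMemoryAggregationStoreSketch:
    def __init__(self) -> None:
        self._overviews: dict = {}

    def store_aggregation_overview(self, aggregation_overview: OverviewSketch) -> None:
        self._overviews[aggregation_overview.id] = aggregation_overview

    def aggregation_overview(self, aggregation_id: str) -> Optional[OverviewSketch]:
        # None signals "not found", matching the documented return contract.
        return self._overviews.get(aggregation_id)

    def aggregation_overview_ids(self) -> Sequence[str]:
        return sorted(self._overviews)

    def aggregation_overviews(self) -> Iterator[OverviewSketch]:
        # Built on the sorted ID list so iteration order follows the IDs.
        for aggregation_id in self.aggregation_overview_ids():
            yield self._overviews[aggregation_id]
```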
- class Aggregator(evaluation_repository: EvaluationRepository, aggregation_repository: AggregationRepository, description: str, aggregation_logic: AggregationLogic[Evaluation, AggregatedEvaluation])[source]
Bases:
Generic[Evaluation,AggregatedEvaluation]
Aggregator that can handle automatic aggregation of evaluation scenarios.
This aggregator should be used for automatic eval. A user still has to implement AggregationLogic.
- Parameters:
evaluation_repository – The repository that will be used to store evaluation results.
aggregation_repository – The repository that will be used to store aggregation results.
description – Human-readable description for the evaluator.
aggregation_logic – The logic to aggregate the evaluations.
- Generics:
Evaluation: Interface of the metrics that come from the evaluated Task.
AggregatedEvaluation: The aggregated results of an evaluation run with a Dataset.
- final aggregate_evaluation(*eval_ids: str, description: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) AggregationOverview[source]
Aggregates all evaluations into an overview that includes high-level statistics.
Aggregates Evaluations according to the implementation of AggregationLogic.aggregate.
- Parameters:
*eval_ids – IDs of the evaluation overviews to be aggregated. The actual evaluations are retrieved from the repository.
description – Optional description of the aggregation. Defaults to None.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the aggregation overview. Defaults to an empty dict.
- Returns:
An overview of the aggregated evaluation.
- aggregated_evaluation_type() type[AggregatedEvaluation][source]
Returns the type of the aggregated result of a run.
- Returns:
Returns the type of the aggregation result.
- evaluation_type() type[Evaluation][source]
Returns the type of the evaluation result of an example.
This can be used to retrieve properly typed evaluations of an evaluation run from an EvaluationRepository.
- Returns:
Returns the type of the evaluation result of an example.
- class ArgillaEvaluationLogic(fields: Mapping[str, Any], questions: Sequence[Any])[source]
Bases:
EvaluationLogicBase[Input,Output,ExpectedOutput,Evaluation],ABC
- abstract from_record(argilla_evaluation: ArgillaEvaluation) Evaluation[source]
This method takes the specific Argilla evaluation format and converts it into a compatible Evaluation.
The format of argilla_evaluation.responses depends on the questions attribute. Each name of a question will be a key in the argilla_evaluation.responses mapping.
- Parameters:
argilla_evaluation – Argilla-specific data for a single evaluation.
- Returns:
An Evaluation that contains all evaluation specific data.
- abstract to_record(example: Example, *output: SuccessfulExampleOutput) RecordDataSequence[source]
This method is responsible for translating the Example and Output of the task to RecordData.
The specific format depends on the fields.
- Parameters:
example – The example to be translated.
*output – The output of the example that was run.
- Returns:
A RecordDataSequence that contains entries that should be evaluated in Argilla.
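The two directions of the mapping (task output to Argilla record, and annotator response back to an evaluation) might look roughly like the sketch below. Everything in it is an assumption made for illustration: the stand-in dataclasses only copy the documented attribute names, the field names "question" and "answer" and the question name "rating" are invented, and the functions take simplified arguments instead of Example and SuccessfulExampleOutput.

```python
from collections.abc import Mapping
from dataclasses import dataclass, field
from typing import Any, Union


@dataclass(frozen=True)
class RecordDataSketch:
    """Hypothetical stand-in with the attributes documented for RecordData."""

    content: Mapping[str, str]
    example_id: str
    metadata: Mapping[str, Union[str, int]] = field(default_factory=dict)


@dataclass(frozen=True)
class ArgillaEvaluationSketch:
    """Hypothetical stand-in with the attributes documented for ArgillaEvaluation."""

    example_id: str
    record_id: str
    responses: Mapping[str, Any]
    metadata: Mapping[str, Any]


def to_record(example_id: str, question: str, *answers: str) -> list:
    # One record per output; the content keys must match the dataset's field names.
    return [
        RecordDataSketch(
            content={"question": question, "answer": answer},
            example_id=example_id,
            metadata={"output_index": index},
        )
        for index, answer in enumerate(answers)
    ]


def from_record(argilla_evaluation: ArgillaEvaluationSketch) -> int:
    # "rating" is the assumed name of the question shown to annotators.
    return int(argilla_evaluation.responses["rating"])
```

The point of the pairing is that the keys written into `content` correspond to the dataset's fields, while the keys read from `responses` correspond to its questions.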
- class ArgillaEvaluator(dataset_repository: DatasetRepository, run_repository: RunRepository, evaluation_repository: AsyncEvaluationRepository, description: str, evaluation_logic: ArgillaEvaluationLogic[Input, Output, ExpectedOutput, Evaluation], argilla_client: ArgillaClient, workspace_id: str)[source]
Bases:
AsyncEvaluator[Input,Output,ExpectedOutput,Evaluation]
Evaluator used to integrate with Argilla (https://github.com/argilla-io/argilla).
Use this evaluator if you would like to easily do human eval. This evaluator runs a dataset and sends the input and output to Argilla to be evaluated.
- Parameters:
dataset_repository – The repository with the examples that will be taken for the evaluation.
run_repository – The repository of the runs to evaluate.
evaluation_repository – The repository that will be used to store evaluation results.
description – Human-readable description for the evaluator.
evaluation_logic – The logic to use for evaluation.
argilla_client – The client to interface with argilla.
workspace_id – The argilla workspace id where datasets are created for evaluation.
See the EvaluatorBase for more information.
- evaluation_lineage(evaluation_id: str, example_id: str) EvaluationLineage[Input, ExpectedOutput, Output, Evaluation] | None
Wrapper for RepositoryNavigator.evaluation_lineage.
- Parameters:
evaluation_id – The id of the evaluation
example_id – The id of the example of interest
- Returns:
The EvaluationLineage for the given evaluation id and example id. Returns None if the lineage is not complete because either an example, a run, or an evaluation does not exist.
- evaluation_lineages(evaluation_id: str) Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]
Wrapper for RepositoryNavigator.evaluation_lineages.
- Parameters:
evaluation_id – The id of the evaluation
- Returns:
An iterator over all EvaluationLineages for the given evaluation id.
- evaluation_type() type[Evaluation]
Returns the type of the evaluation result of an example.
This can be used to retrieve properly typed evaluations of an evaluation run from an EvaluationRepository.
- Returns:
Returns the type of the evaluation result of an example.
- expected_output_type() type[ExpectedOutput]
Returns the type of the evaluated task’s expected output.
This can be used to retrieve properly typed Examples of a dataset from a DatasetRepository.
- Returns:
The type of the evaluated task’s expected output.
- failed_evaluations(evaluation_id: str) Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]
Returns the EvaluationLineage objects for all failed example evaluations that belong to the given evaluation ID.
- Parameters:
evaluation_id – The ID of the evaluation overview
- Returns:
Iterable of EvaluationLineages.
- input_type() type[Input]
Returns the type of the evaluated task’s input.
This can be used to retrieve properly typed Examples of a dataset from a DatasetRepository.
- Returns:
The type of the evaluated task’s input.
- output_type() type[Output]
Returns the type of the evaluated task’s output.
This can be used to retrieve properly typed outputs of an evaluation run from a RunRepository.
- Returns:
The type of the evaluated task’s output.
- retrieve(partial_evaluation_id: str) EvaluationOverview[source]
Retrieves external evaluations and saves them to an evaluation repository.
Failed or skipped submissions should be viewed as failed evaluations. Evaluations that are submitted but not yet evaluated also count as failed evaluations.
- Parameters:
partial_evaluation_id – The id of the corresponding PartialEvaluationOverview.
- Returns:
An EvaluationOverview that describes the whole evaluation.
- submit(*run_ids: str, num_examples: int | None = None, dataset_name: str | None = None, abort_on_error: bool = False, skip_example_on_any_failure: bool = True, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) PartialEvaluationOverview[source]
Submits evaluations to external service to be evaluated.
Failed submissions are saved as FailedExampleEvaluations.
- Parameters:
*run_ids – The runs to be evaluated. Each run is expected to have the same dataset as input (which implies their tasks have the same input-type) and their tasks have the same output-type. For each example in the dataset referenced by the runs the outputs of all runs are collected and if all of them were successful they are passed on to the implementation specific evaluation. The method compares all runs of the provided ids to each other.
num_examples – The number of examples which should be evaluated from the given runs. Always the first n runs stored in the evaluation repository. Defaults to None.
abort_on_error – Abort the whole submission process if a single submission fails. Defaults to False.
- Returns:
A PartialEvaluationOverview containing submission information.
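The selection rule described for run_ids and num_examples (collect the outputs of all runs per example, and pass an example on only if every run succeeded for it) can be sketched as follows. This is not the SDK's code: `collect_outputs` is a hypothetical helper, runs are modelled as nested dicts, and a failed output is represented as None. The behaviour corresponds to what the parameter name skip_example_on_any_failure=True suggests, which is an assumption.

```python
from collections.abc import Mapping, Sequence
from typing import Optional


def collect_outputs(
    runs: Mapping[str, Mapping[str, Optional[str]]],
    example_ids: Sequence[str],
    num_examples: Optional[int] = None,
) -> tuple:
    """Split examples into those ready for evaluation and those skipped.

    `runs` maps run id -> example id -> output, with None marking a failure.
    """
    # Slicing with None keeps all examples, mirroring num_examples=None.
    selected = list(example_ids)[:num_examples]
    ready: dict = {}
    skipped: list = []
    for example_id in selected:
        outputs = [runs[run_id].get(example_id) for run_id in runs]
        if all(output is not None for output in outputs):
            ready[example_id] = outputs
        else:
            skipped.append(example_id)
    return ready, skipped


runs = {
    "run-a": {"e1": "x", "e2": None},
    "run-b": {"e1": "y", "e2": "z"},
}
ready, skipped = collect_outputs(runs, ["e1", "e2"])
```

In this example "e1" is ready with both outputs, while "e2" is skipped because one of the two runs failed for it.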
- class AsyncEvaluationRepository[source]
Bases:
EvaluationRepository
- abstract evaluation_overview(evaluation_id: str) EvaluationOverview | None
Returns an EvaluationOverview for the given ID.
- Parameters:
evaluation_id – ID of the evaluation overview to retrieve.
- Returns:
EvaluationOverview if it was found, None otherwise.
- abstract evaluation_overview_ids() Sequence[str]
Returns sorted IDs of all stored EvaluationOverviews.
- Returns:
A Sequence of the EvaluationOverview IDs.
- evaluation_overviews() Iterable[EvaluationOverview]
Returns all EvaluationOverviews sorted by their ID.
- Yields:
EvaluationOverviews.
- abstract example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation] | None
Returns an ExampleEvaluation for the given evaluation overview ID and example ID.
- Parameters:
evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in Evaluator.do_evaluate()
- Returns:
ExampleEvaluation if it was found, None otherwise.
- abstract example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation]]
Returns all ExampleEvaluations for the given evaluation overview ID sorted by their example ID.
- failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation[FailedExampleEvaluation]]
Returns all failed ExampleEvaluations for the given evaluation overview ID sorted by their example ID.
- initialize_evaluation() str
Initializes an EvaluationOverview and returns its ID.
If no extra logic is required for the initialization, this function just returns a UUID as string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.
- Returns:
The created ID.
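For the default case described above, where no extra logic is required, the behaviour amounts to generating a UUID string. A minimal sketch, assuming a random version 4 UUID (the reference does not say which UUID version is used) and using the hypothetical name `initialize_evaluation_sketch`:

```python
import uuid


def initialize_evaluation_sketch() -> str:
    """Return a fresh identifier for a new evaluation overview."""
    # Assumption: a random (version 4) UUID; the reference only says "a UUID as string".
    return str(uuid.uuid4())


evaluation_id = initialize_evaluation_sketch()
```

A repository backed by an external service would instead create the remote resource here and return the id that service assigns.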
- abstract partial_evaluation_overview(partial_evaluation_id: str) PartialEvaluationOverview | None[source]
Returns a PartialEvaluationOverview for the given ID.
- Parameters:
partial_evaluation_id – ID of the partial evaluation overview to retrieve.
- Returns:
PartialEvaluationOverview if it was found, None otherwise.
- abstract partial_evaluation_overview_ids() Sequence[str][source]
Returns sorted IDs of all stored PartialEvaluationOverviews.
- Returns:
A Sequence of the PartialEvaluationOverview IDs.
- partial_evaluation_overviews() Iterable[PartialEvaluationOverview][source]
Returns all PartialEvaluationOverviews sorted by their ID.
- Yields:
PartialEvaluationOverviews.
- abstract store_evaluation_overview(evaluation_overview: EvaluationOverview) None
Stores an EvaluationOverview.
- Parameters:
evaluation_overview – The overview to be persisted.
- abstract store_example_evaluation(example_evaluation: ExampleEvaluation) None
Stores an ExampleEvaluation.
- Parameters:
example_evaluation – The example evaluation to be persisted.
- abstract store_partial_evaluation_overview(partial_evaluation_overview: PartialEvaluationOverview) None[source]
Stores a PartialEvaluationOverview.
- Parameters:
partial_evaluation_overview – The partial overview to be persisted.
- successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation]
Returns all successful ExampleEvaluations for the given evaluation overview ID sorted by their example ID.
- class AsyncFileEvaluationRepository(root_directory: Path)[source]
Bases:
FileEvaluationRepository,AsyncEvaluationRepository
- evaluation_overview(evaluation_id: str) EvaluationOverview | None
Returns an EvaluationOverview for the given ID.
- Parameters:
evaluation_id – ID of the evaluation overview to retrieve.
- Returns:
EvaluationOverview if it was found, None otherwise.
- evaluation_overview_ids() Sequence[str]
Returns sorted IDs of all stored EvaluationOverviews.
- Returns:
A Sequence of the EvaluationOverview IDs.
- evaluation_overviews() Iterable[EvaluationOverview]
Returns all EvaluationOverviews sorted by their ID.
- Yields:
EvaluationOverviews.
- example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation] | None
Returns an ExampleEvaluation for the given evaluation overview ID and example ID.
- Parameters:
evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in Evaluator.do_evaluate()
- Returns:
ExampleEvaluation if it was found, None otherwise.
- example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation]]
Returns all ExampleEvaluations for the given evaluation overview ID sorted by their example ID.
- exists(path: Path) bool
- failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation[FailedExampleEvaluation]]
Returns all failed ExampleEvaluations for the given evaluation overview ID sorted by their example ID.
- file_names(path: Path, file_type: str = 'json') Sequence[str]
- initialize_evaluation() str
Initializes an EvaluationOverview and returns its ID.
If no extra logic is required for the initialization, this function just returns a UUID as string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.
- Returns:
The created ID.
- mkdir(path: Path) None
- partial_evaluation_overview(evaluation_id: str) PartialEvaluationOverview | None[source]
Returns a PartialEvaluationOverview for the given ID.
- Parameters:
evaluation_id – ID of the partial evaluation overview to retrieve.
- Returns:
PartialEvaluationOverview if it was found, None otherwise.
- partial_evaluation_overview_ids() Sequence[str][source]
Returns sorted IDs of all stored PartialEvaluationOverviews.
- Returns:
A Sequence of the PartialEvaluationOverview IDs.
- partial_evaluation_overviews() Iterable[PartialEvaluationOverview]
Returns all PartialEvaluationOverviews sorted by their ID.
- Yields:
PartialEvaluationOverviews.
- static path_to_str(path: Path) str
Returns a string for the given Path so that it’s readable for the respective file system.
- Parameters:
path – Given Path that should be converted.
- Returns:
String representation of the given Path.
- read_utf8(path: Path) str
- remove_file(path: Path) None
- store_evaluation_overview(overview: EvaluationOverview) None
Stores an EvaluationOverview.
- Parameters:
overview – The overview to be persisted.
- store_example_evaluation(example_evaluation: ExampleEvaluation) None
Stores an ExampleEvaluation.
- Parameters:
example_evaluation – The example evaluation to be persisted.
- store_partial_evaluation_overview(overview: PartialEvaluationOverview) None[source]
Stores a PartialEvaluationOverview.
- Parameters:
overview – The partial overview to be persisted.
- successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation]
Returns all successful ExampleEvaluations for the given evaluation overview ID sorted by their example ID.
- write_utf8(path: Path, content: str, create_parents: bool = False) None
- class AsyncInMemoryEvaluationRepository[source]
Bases:
AsyncEvaluationRepository,InMemoryEvaluationRepository
- evaluation_overview(evaluation_id: str) EvaluationOverview | None
Returns an EvaluationOverview for the given ID.
- Parameters:
evaluation_id – ID of the evaluation overview to retrieve.
- Returns:
EvaluationOverview if it was found, None otherwise.
- evaluation_overview_ids() Sequence[str]
Returns sorted IDs of all stored EvaluationOverviews.
- Returns:
A Sequence of the EvaluationOverview IDs.
- evaluation_overviews() Iterable[EvaluationOverview]
Returns all EvaluationOverviews sorted by their ID.
- Yields:
EvaluationOverviews.
- example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation] | None
Returns an
ExampleEvaluationfor the given evaluation overview ID and example ID.- Parameters:
evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in
Evaluator.do_evaluate()
- Returns:
ExampleEvaluationif it was found, None otherwise.
- example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation]]
Returns all :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation[FailedExampleEvaluation]]
Returns all failed :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- initialize_evaluation() str
Initializes an EvaluationOverview and returns its ID.
If no extra logic is required for the initialization, this function just returns a UUID as a string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.
- Returns:
The created ID.
- partial_evaluation_overview(evaluation_id: str) PartialEvaluationOverview | None[source]
Returns a PartialEvaluationOverview for the given ID.
- Parameters:
evaluation_id – ID of the partial evaluation overview to retrieve.
- Returns:
PartialEvaluationOverview if it was found, None otherwise.
- partial_evaluation_overview_ids() Sequence[str][source]
Returns sorted IDs of all stored :class:`PartialEvaluationOverview`s.
- Returns:
A Sequence of the PartialEvaluationOverview IDs.
- partial_evaluation_overviews() Iterable[PartialEvaluationOverview]
Returns all :class:`PartialEvaluationOverview`s sorted by their ID.
- Yields:
:class:`PartialEvaluationOverview`s.
- store_evaluation_overview(overview: EvaluationOverview) None
Stores an EvaluationOverview.
- Parameters:
overview – The overview to be persisted.
- store_example_evaluation(evaluation: ExampleEvaluation) None
Stores an ExampleEvaluation.
- Parameters:
evaluation – The example evaluation to be persisted.
- store_partial_evaluation_overview(overview: PartialEvaluationOverview) None[source]
Stores a PartialEvaluationOverview.
- Parameters:
overview – The partial overview to be persisted.
- successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation]
Returns all successful :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- class BleuGrader[source]
Bases:
object
- calculate_bleu(hypothesis: str, reference: str) float[source]
Calculates the BLEU-score for the given hypothesis and reference.
In the summarization use-case the BLEU-score roughly corresponds to the precision of the generated summary with regard to the expected summary.
- Parameters:
hypothesis – The generation to be evaluated.
reference – The baseline for the evaluation.
- Returns:
BLEU score as a float between 0 and 100, where 100 means a perfect match and 0 means no overlap.
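The precision intuition described above can be illustrated with a deliberately simplified score. The sketch below computes clipped unigram precision on a 0 to 100 scale. It is not full BLEU (no higher-order n-grams, no brevity penalty) and it is not the SDK's implementation; the function name `clipped_unigram_precision` is hypothetical.

```python
from collections import Counter


def clipped_unigram_precision(hypothesis: str, reference: str) -> float:
    """Share of hypothesis tokens that also occur in the reference, scaled to 0..100.

    Simplified stand-in for BLEU: unigrams only, no brevity penalty.
    """
    hyp_tokens = hypothesis.lower().split()
    ref_counts = Counter(reference.lower().split())
    if not hyp_tokens:
        return 0.0
    # Clip each token's count by how often it occurs in the reference.
    matched = sum(
        min(count, ref_counts[token])
        for token, count in Counter(hyp_tokens).items()
    )
    return 100.0 * matched / len(hyp_tokens)


perfect = clipped_unigram_precision("the cat sat", "the cat sat")
partial = clipped_unigram_precision("the cat ran", "the cat sat")
disjoint = clipped_unigram_precision("dogs bark", "the cat sat")
```

Identical texts score 100, texts without shared tokens score 0, and everything else falls in between.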
- class ComparisonEvaluation(*, first_player: str, second_player: str, outcome: MatchOutcome)[source]
Bases:
BaseModel
- class ComparisonEvaluationAggregationLogic[source]
Bases:
AggregationLogic[ComparisonEvaluation, AggregatedComparison]
- aggregate(evaluations: Iterable[ComparisonEvaluation]) AggregatedComparison[source]
Evaluator-specific method for aggregating individual Evaluations into a report-like AggregatedEvaluation.
This method takes the results of an evaluation run and aggregates them. It should create an AggregatedEvaluation instance and return it.
- Parameters:
evaluations – The results from running eval_and_aggregate_runs with a Task.
- Returns:
The aggregated results of an evaluation run with a Dataset.
- class Dataset(*, id: str = <factory>, name: str, labels: set[str] = {}, metadata: dict[str, JsonSerializable] = {})[source]
Bases:
BaseModel
Represents a dataset linked to multiple examples.
- id
Dataset ID.
- Type:
str
- name
A short name of the dataset.
- Type:
str
- labels
Labels for filtering datasets. Defaults to an empty set.
- Type:
set[str]
- metadata
Additional information about the dataset. Defaults to empty dict.
- Type:
dict[str, JsonSerializable]
- class DatasetRepository[source]
Bases:
ABC
Base dataset repository interface.
Provides methods to store and load datasets and their linked examples (:class:`Example`s).
- abstract create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) Dataset[source]
Creates a dataset from the given :class:`Example`s and returns the created dataset.
- Parameters:
examples – An Iterable of :class:`Example`s to be saved in the same dataset.
dataset_name – A name for the dataset.
id – The dataset ID. If None, an ID will be generated.
labels – A set of labels for filtering. Defaults to an empty set.
metadata – A dict for additional information about the dataset. Defaults to an empty dict.
- Returns:
The created Dataset.
- abstract dataset(dataset_id: str) Dataset | None[source]
Returns a dataset identified by the given dataset ID.
- Parameters:
dataset_id – Dataset ID of the dataset to retrieve.
- Returns:
Dataset if it was found, None otherwise.
- abstract dataset_ids() Iterable[str][source]
Returns all sorted dataset IDs.
- Returns:
Iterable of dataset IDs.
- datasets() Iterable[Dataset][source]
Returns all :class:`Dataset`s sorted by their ID.
- Yields:
:class:`Dataset`s.
- abstract delete_dataset(dataset_id: str) None[source]
Deletes a dataset identified by the given dataset ID.
- Parameters:
dataset_id – Dataset ID of the dataset to delete.
- abstract example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) Example | None[source]
Returns an Example for the given dataset ID and example ID.
- Parameters:
dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
- Returns:
Example if it was found, None otherwise.
- abstract examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) Iterable[Example][source]
Returns all :class:`Example`s for the given dataset ID sorted by their ID.
- Parameters:
dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional set of example IDs. Those examples will be excluded from the output.
- Returns:
Iterable of :class:`Example`s.
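The repository contract above (create, look up, list sorted IDs, skip examples) can be sketched with an in-memory stand-in. All class names below (`ExampleSketch`, `DatasetSketch`, `InMemoryDatasetRepositorySketch`) are hypothetical simplifications rather than SDK classes, and the type arguments of the real interface are omitted.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Iterable, Optional


@dataclass
class ExampleSketch:
    input: Any
    expected_output: Any
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class DatasetSketch:
    name: str
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    labels: set = field(default_factory=set)
    metadata: dict = field(default_factory=dict)


class InMemoryDatasetRepositorySketch:
    """Keeps datasets and their examples in a dict keyed by dataset ID."""

    def __init__(self) -> None:
        self._store: dict = {}

    def create_dataset(
        self,
        examples: Iterable[ExampleSketch],
        dataset_name: str,
        id: Optional[str] = None,
        labels: Optional[set] = None,
        metadata: Optional[dict] = None,
    ) -> DatasetSketch:
        dataset = DatasetSketch(
            name=dataset_name, labels=labels or set(), metadata=metadata or {}
        )
        if id is not None:
            dataset.id = id  # otherwise keep the generated ID
        self._store[dataset.id] = (dataset, list(examples))
        return dataset

    def dataset(self, dataset_id: str) -> Optional[DatasetSketch]:
        entry = self._store.get(dataset_id)
        return entry[0] if entry else None

    def dataset_ids(self) -> list:
        return sorted(self._store)

    def examples(self, dataset_id: str, examples_to_skip: Optional[frozenset] = None) -> list:
        skip = examples_to_skip or frozenset()
        _, stored = self._store[dataset_id]
        # Sorted by example ID, minus the skipped ones.
        return sorted((e for e in stored if e.id not in skip), key=lambda e: e.id)


repo = InMemoryDatasetRepositorySketch()
created = repo.create_dataset(
    [ExampleSketch("q2", "a2", id="b"), ExampleSketch("q1", "a1", id="a")],
    dataset_name="demo",
    id="ds-1",
)
all_ids = [e.id for e in repo.examples("ds-1")]
remaining_ids = [e.id for e in repo.examples("ds-1", examples_to_skip=frozenset({"a"}))]
```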
- class EloCalculator(players: Iterable[str], k_start: float = 20.0, k_floor: float = 10.0, decay_factor: float = 0.0005)[source]
Bases:
object
- calculate(matches: Sequence[ComparisonEvaluation]) None[source]
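The constructor parameters suggest an Elo rating with a K-factor that starts at `k_start`, decays with the number of matches played, and never drops below `k_floor`. The sketch below shows one plausible reading of that scheme. The exponential decay formula, the initial rating of 1500, and the class `EloSketch` itself are assumptions for illustration, not the SDK's confirmed behavior.

```python
import math
from dataclasses import dataclass, field


@dataclass
class EloSketch:
    """Hypothetical stand-in for an Elo calculator; not the SDK class."""

    players: list
    k_start: float = 20.0
    k_floor: float = 10.0
    decay_factor: float = 0.0005
    ratings: dict = field(init=False)
    match_counts: dict = field(init=False)

    def __post_init__(self) -> None:
        # Assumption: every player starts at 1500, a common Elo convention.
        self.ratings = {p: 1500.0 for p in self.players}
        self.match_counts = {p: 0 for p in self.players}

    def _k_factor(self, player: str) -> float:
        # Assumed decay: exponential in the number of matches, bounded by the floor.
        decayed = self.k_start * math.exp(-self.decay_factor * self.match_counts[player])
        return max(decayed, self.k_floor)

    def _expected(self, a: str, b: str) -> float:
        # Standard Elo expected score of player a against player b.
        return 1.0 / (1.0 + 10 ** ((self.ratings[b] - self.ratings[a]) / 400.0))

    def record(self, first: str, second: str, score_first: float) -> None:
        """score_first is 1.0 for a win, 0.5 for a draw, 0.0 for a loss of `first`."""
        expected_first = self._expected(first, second)
        delta_first = self._k_factor(first) * (score_first - expected_first)
        delta_second = self._k_factor(second) * (expected_first - score_first)
        self.ratings[first] += delta_first
        self.ratings[second] += delta_second
        self.match_counts[first] += 1
        self.match_counts[second] += 1


elo = EloSketch(players=["model-a", "model-b"])
elo.record("model-a", "model-b", score_first=1.0)
```

With equal starting ratings the expected score is 0.5, so a win with K = 20 moves the winner up by 10 points and the loser down by 10.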
- class EloEvaluationLogic[source]
Bases:
IncrementalEvaluationLogic[Input, Output, ExpectedOutput, Matches]
- do_evaluate(example: Example, *output: SuccessfulExampleOutput) Evaluation
Executes the evaluation for this specific example.
Responsible for comparing the input & expected output of a task to the actually generated output. The difference to the standard EvaluationLogic’s do_evaluate is that this method separates already processed evaluations from new ones before handing them over to do_incremental_evaluate.
- Parameters:
example – Input data of Task to produce the output.
*output – Outputs of the Task.
- Returns:
The metrics that come from the evaluated Task.
- Return type:
Evaluation
- do_incremental_evaluate(example: Example, outputs: list[SuccessfulExampleOutput], already_evaluated_outputs: list[list[SuccessfulExampleOutput]]) Matches[source]
- abstract grade(first: SuccessfulExampleOutput, second: SuccessfulExampleOutput, example: Example) MatchOutcome[source]
Returns a :class:`MatchOutcome` for the two provided contestants on the given example.
Defines the use-case-specific logic for determining the winner of the two provided outputs.
- Parameters:
first – Instance of :class:`SuccessfulExampleOutput` [Output] of the first contestant in the comparison
second – Instance of :class:`SuccessfulExampleOutput` [Output] of the second contestant in the comparison
example – Datapoint of :class:`Example` on which the two outputs were generated
- Returns:
Instance of :class:`MatchOutcome`.
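A grading rule of this kind can be as simple as comparing both outputs against the expected output. The following sketch uses plain strings instead of SDK types. `MatchOutcomeSketch` and its member names are hypothetical stand-ins, and the token-overlap rule is only an illustrative choice.

```python
from enum import Enum


class MatchOutcomeSketch(Enum):
    """Hypothetical stand-in for the SDK's MatchOutcome; member names are assumptions."""

    FIRST_WINS = "first_wins"
    DRAW = "draw"
    SECOND_WINS = "second_wins"


def token_overlap(candidate: str, expected: str) -> int:
    """Number of distinct lower-cased tokens shared with the expected output."""
    return len(set(candidate.lower().split()) & set(expected.lower().split()))


def grade(first: str, second: str, expected: str) -> MatchOutcomeSketch:
    """Toy rule: the output sharing more tokens with the expected output wins."""
    first_score = token_overlap(first, expected)
    second_score = token_overlap(second, expected)
    if first_score > second_score:
        return MatchOutcomeSketch.FIRST_WINS
    if second_score > first_score:
        return MatchOutcomeSketch.SECOND_WINS
    return MatchOutcomeSketch.DRAW


outcome = grade("Paris is the capital", "I do not know", "The capital is Paris")
```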
- class EloGradingInput(*, instruction: str, first_completion: str, second_completion: str)[source]
Bases:
BaseModel
- exception EvaluationFailed(evaluation_id: str, failed_count: int)[source]
Bases:
Exception
- add_note()
Exception.add_note(note) – add a note to the exception
- args
- description: str
- end: datetime
- failed_example_count: int
- successful_example_count: int
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- class EvaluationLogic[source]
Bases:
ABC, EvaluationLogicBase[Input, Output, ExpectedOutput, Evaluation]
- abstract do_evaluate(example: Example, *output: SuccessfulExampleOutput) Evaluation[source]
Executes the evaluation for this specific example.
Responsible for comparing the input & expected output of a task to the actually generated output.
- Parameters:
example – Input data of Task to produce the output.
*output – Output of the Task.
- Returns:
The metrics that come from the evaluated Task.
- class EvaluationOverview(*, run_overviews: frozenset[RunOverview], id: str, start_date: datetime, end_date: datetime, successful_evaluation_count: int, failed_evaluation_count: int, description: str, labels: set[str] = {}, metadata: dict[str, JsonSerializable] = {})[source]
Bases:
BaseModel
Overview of the un-aggregated results of evaluating a Task on a dataset.
- run_overviews
Overviews of the runs that were evaluated.
- Type:
frozenset[pharia_studio_sdk.evaluation.run.domain.RunOverview]
- id
The unique identifier of this evaluation.
- Type:
str
- start_date
The time when the evaluation run was started.
- Type:
datetime.datetime
- end_date
The time when the evaluation run was finished.
- Type:
datetime.datetime
- successful_evaluation_count
Number of successfully evaluated examples.
- Type:
int
- failed_evaluation_count
Number of examples that produced an error during evaluation. Note: failed runs are skipped in the evaluation and therefore not counted as failures
- Type:
int
- description
Human-readable description of the evaluator that created the evaluation.
- Type:
str
- labels
Labels for filtering evaluations. Defaults to an empty set.
- Type:
set[str]
- metadata
Additional information about the evaluation. Defaults to empty dict.
- Type:
dict[str, JsonSerializable]
- class EvaluationRepository[source]
Bases:
ABC
Base evaluation repository interface.
Provides methods to store and load evaluation results: :class:`EvaluationOverview`s and :class:`ExampleEvaluation`s.
An EvaluationOverview is created from and is linked (by its ID) to multiple :class:`ExampleEvaluation`s.
- abstract evaluation_overview(evaluation_id: str) EvaluationOverview | None[source]
Returns an EvaluationOverview for the given ID.
- Parameters:
evaluation_id – ID of the evaluation overview to retrieve.
- Returns:
EvaluationOverview if it was found, None otherwise.
- abstract evaluation_overview_ids() Sequence[str][source]
Returns sorted IDs of all stored :class:`EvaluationOverview`s.
- Returns:
A Sequence of the EvaluationOverview IDs.
- evaluation_overviews() Iterable[EvaluationOverview][source]
Returns all :class:`EvaluationOverview`s sorted by their ID.
- Yields:
:class:`EvaluationOverview`s.
- abstract example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation] | None[source]
Returns an ExampleEvaluation for the given evaluation overview ID and example ID.
- Parameters:
evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in Evaluator.do_evaluate().
- Returns:
ExampleEvaluation if it was found, None otherwise.
- abstract example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation]][source]
Returns all :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation[FailedExampleEvaluation]][source]
Returns all failed :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- initialize_evaluation() str[source]
Initializes an EvaluationOverview and returns its ID.
If no extra logic is required for the initialization, this function just returns a UUID as a string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.
- Returns:
The created ID.
- abstract store_evaluation_overview(evaluation_overview: EvaluationOverview) None[source]
Stores an EvaluationOverview.
- Parameters:
evaluation_overview – The overview to be persisted.
- abstract store_example_evaluation(example_evaluation: ExampleEvaluation) None[source]
Stores an ExampleEvaluation.
- Parameters:
example_evaluation – The example evaluation to be persisted.
- successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation][source]
Returns all successful :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
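The split between successful and failed example evaluations can be pictured as a filter on the type of each stored result. The sketch below uses hypothetical dataclasses (`ScoreSketch`, `FailedSketch`, `ExampleEvaluationSketch`) in place of the SDK's models and a plain list in place of a repository.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class FailedSketch:
    error_message: str


@dataclass
class ScoreSketch:
    score: float


@dataclass
class ExampleEvaluationSketch:
    evaluation_id: str
    example_id: str
    result: Union[ScoreSketch, FailedSketch]


def example_evaluations(stored: list, evaluation_id: str) -> list:
    """All evaluations of one evaluation overview, sorted by example ID."""
    matching = (e for e in stored if e.evaluation_id == evaluation_id)
    return sorted(matching, key=lambda e: e.example_id)


def failed_example_evaluations(stored: list, evaluation_id: str) -> list:
    return [
        e for e in example_evaluations(stored, evaluation_id)
        if isinstance(e.result, FailedSketch)
    ]


def successful_example_evaluations(stored: list, evaluation_id: str) -> list:
    return [
        e for e in example_evaluations(stored, evaluation_id)
        if not isinstance(e.result, FailedSketch)
    ]


stored = [
    ExampleEvaluationSketch("eval-1", "b", ScoreSketch(0.5)),
    ExampleEvaluationSketch("eval-1", "a", FailedSketch("boom")),
    ExampleEvaluationSketch("eval-2", "c", ScoreSketch(1.0)),
]
all_ids = [e.example_id for e in example_evaluations(stored, "eval-1")]
failed_ids = [e.example_id for e in failed_example_evaluations(stored, "eval-1")]
successful_ids = [e.example_id for e in successful_example_evaluations(stored, "eval-1")]
```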
- class Evaluator(dataset_repository: DatasetRepository, run_repository: RunRepository, evaluation_repository: EvaluationRepository, description: str, evaluation_logic: EvaluationLogic[Input, Output, ExpectedOutput, Evaluation])[source]
Bases:
EvaluatorBase[Input, Output, ExpectedOutput, Evaluation]
Evaluator designed for most evaluation tasks. Only supports synchronous evaluation.
See the EvaluatorBase for more information.
- final evaluate(example: Example, evaluation_id: str, abort_on_error: bool, *example_outputs: SuccessfulExampleOutput) Evaluation | FailedExampleEvaluation[source]
- evaluate_runs(*run_ids: str, num_examples: int | None = None, abort_on_error: bool = False, skip_example_on_any_failure: bool = True, description: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) EvaluationOverview[source]
Evaluates all generated outputs in the run.
For each set of successful outputs in the referenced runs, EvaluationLogic.do_evaluate() is called and eval metrics are produced & stored in the provided EvaluationRepository.
- Parameters:
*run_ids – The runs to be evaluated. Each run is expected to have the same dataset as input (which implies their tasks have the same input type) and their tasks have the same output type. For each example in the dataset referenced by the runs, the outputs of all runs are collected and, if all of them were successful, they are passed on to the implementation-specific evaluation. The method compares all runs of the provided IDs to each other.
num_examples – The number of examples which should be evaluated from the given runs. Always the first n runs stored in the evaluation repository. Defaults to None.
abort_on_error – Flag to abort all evaluations when an error occurs. Defaults to False.
skip_example_on_any_failure – Flag to skip evaluation on any example for which at least one run fails. Defaults to True.
description – Optional description of the evaluation. Defaults to None.
labels – A set of labels for filtering. Defaults to an empty set.
metadata – A dict for additional information about the evaluation overview. Defaults to an empty dict.
- Returns:
An overview of the evaluation. Individual :class:`Evaluation`s will not be returned but instead stored in the :class:`EvaluationRepository` provided in the __init__.
- Return type:
EvaluationOverview
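The effect of skip_example_on_any_failure can be sketched without the SDK: outputs are grouped per example across all runs, and an example is dropped if any run failed on it. In the sketch below, `None` marks a failed example run and the function name `group_successful_outputs` is hypothetical. The behavior with the flag disabled (only the successful outputs are kept) is an assumption.

```python
from typing import Optional


def group_successful_outputs(
    runs: dict,
    skip_example_on_any_failure: bool = True,
    num_examples: Optional[int] = None,
) -> dict:
    """Map example ID to the outputs that would be handed to the evaluation logic.

    `runs` maps run ID -> example ID -> output, with None marking a failed run.
    """
    example_ids = sorted({ex_id for outputs in runs.values() for ex_id in outputs})
    if num_examples is not None:
        example_ids = example_ids[:num_examples]

    grouped = {}
    for ex_id in example_ids:
        outputs = [runs[run_id].get(ex_id) for run_id in sorted(runs)]
        successful = [o for o in outputs if o is not None]
        if not successful:
            continue  # nothing to evaluate for this example
        if skip_example_on_any_failure and len(successful) < len(outputs):
            continue  # at least one run failed on this example
        grouped[ex_id] = successful
    return grouped


runs = {
    "run-a": {"1": "x", "2": None},
    "run-b": {"1": "y", "2": "z"},
}
strict = group_successful_outputs(runs)
lenient = group_successful_outputs(runs, skip_example_on_any_failure=False)
```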
- evaluation_lineage(evaluation_id: str, example_id: str) EvaluationLineage[Input, ExpectedOutput, Output, Evaluation] | None
Wrapper for RepositoryNavigator.evaluation_lineage.
- Parameters:
evaluation_id – The id of the evaluation
example_id – The id of the example of interest
- Returns:
The EvaluationLineage for the given evaluation id and example id. Returns None if the lineage is not complete because either an example, a run, or an evaluation does not exist.
- evaluation_lineages(evaluation_id: str) Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]
Wrapper for RepositoryNavigator.evaluation_lineages.
- Parameters:
evaluation_id – The id of the evaluation
- Returns:
An iterator over all :class:`EvaluationLineage`s for the given evaluation id.
- evaluation_type() type[Evaluation]
Returns the type of the evaluation result of an example.
This can be used to retrieve properly typed evaluations of an evaluation run from an EvaluationRepository.
- Returns:
The type of the evaluation result of an example.
- expected_output_type() type[ExpectedOutput]
Returns the type of the evaluated task’s expected output.
This can be used to retrieve properly typed :class:`Example`s of a dataset from a :class:`DatasetRepository`.
- Returns:
The type of the evaluated task’s expected output.
- failed_evaluations(evaluation_id: str) Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]
Returns the EvaluationLineage objects for all failed example evaluations that belong to the given evaluation ID.
- Parameters:
evaluation_id – The ID of the evaluation overview
- Returns:
Iterable of :class:`EvaluationLineage`s.
- input_type() type[Input]
Returns the type of the evaluated task’s input.
This can be used to retrieve properly typed :class:`Example`s of a dataset from a :class:`DatasetRepository`.
- Returns:
The type of the evaluated task’s input.
- output_type() type[Output]
Returns the type of the evaluated task’s output.
This can be used to retrieve properly typed outputs of an evaluation run from a RunRepository.
- Returns:
The type of the evaluated task’s output.
- class Example(*, input: ~pharia_inference_sdk.core.task.Input, expected_output: ~pharia_studio_sdk.evaluation.dataset.domain.ExpectedOutput, id: str = <factory>, metadata: dict[str, JsonSerializable] | None = None)[source]
Bases:
BaseModel, Generic[Input, ExpectedOutput]
Example case used for evaluations.
- input
Input for the Task. Has to be the same type as the input for the task used.
- Type:
pharia_inference_sdk.core.task.Input
- expected_output
The expected output from a given example run. The evaluator compares the received output against it.
- Type:
pharia_studio_sdk.evaluation.dataset.domain.ExpectedOutput
- id
Identifier for the example, defaults to uuid.
- Type:
str
- metadata
Optional dictionary of custom key-value pairs.
- Type:
dict[str, JsonSerializable] | None
- Generics:
Input: Interface to be passed to the Task that shall be evaluated.
ExpectedOutput: Output that is expected from the run with the supplied input.
- class ExampleEvaluation(*, evaluation_id: str, example_id: str, result: Annotated[Evaluation | FailedExampleEvaluation, SerializeAsAny()])[source]
Bases:
BaseModel, Generic[Evaluation]
Evaluation of a single evaluated Example.
Created to persist the evaluation result in the repository.
- evaluation_id
Identifier of the evaluation the evaluated example belongs to.
- Type:
str
- result
If the evaluation was successful, the evaluation’s result; otherwise, the exception raised during running or evaluating the Task.
- Type:
pharia_studio_sdk.evaluation.evaluation.domain.Evaluation | pharia_studio_sdk.evaluation.evaluation.domain.FailedExampleEvaluation
- Generics:
Evaluation: Interface of the metrics that come from the evaluated Task.
- class ExampleOutput(*, run_id: str, example_id: str, output: Output | FailedExampleRun)[source]
Bases:
BaseModel, Generic[Output]
Output of a single evaluated Example.
Created to persist the output (including failures) of an individual example in the repository.
- run_id
Identifier of the run that created the output.
- Type:
str
- output
Generated when running the Task. When running the task failed, this is a FailedExampleRun.
- Type:
pharia_inference_sdk.core.task.Output | pharia_studio_sdk.evaluation.run.domain.FailedExampleRun
- Generics:
Output: Interface of the output returned by the task.
- class FailedExampleEvaluation(*, error_message: str)[source]
Bases:
BaseModel
Captures an exception raised when evaluating an ExampleOutput.
- error_message
String-representation of the exception.
- Type:
str
- static from_exception(exception: Exception) FailedExampleEvaluation[source]
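Capturing an exception as a string representation might look like the sketch below. The dataclass `FailedExampleEvaluationSketch` is a hypothetical stand-in for the SDK model, and the exact message format (type name plus message) is an assumption, since the reference does not specify it.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class FailedExampleEvaluationSketch:
    error_message: str

    @staticmethod
    def from_exception(exception: Exception) -> "FailedExampleEvaluationSketch":
        # Assumed format: "<ExceptionType>: <message>".
        return FailedExampleEvaluationSketch(
            error_message=f"{type(exception).__name__}: {exception}"
        )


def evaluate_safely(evaluate: Callable[[Any], Any], value: Any) -> Any:
    """Return the evaluation result, or a captured failure if evaluation raises."""
    try:
        return evaluate(value)
    except Exception as error:
        return FailedExampleEvaluationSketch.from_exception(error)


def always_fails(_: Any) -> Any:
    raise ValueError("boom")


failed = evaluate_safely(always_fails, "some output")
succeeded = evaluate_safely(len, "some output")
```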
- class FileAggregationRepository(root_directory: Path)[source]
Bases:
FileSystemAggregationRepository
- aggregation_overview(aggregation_id: str, aggregation_type: type[AggregatedEvaluation]) AggregationOverview | None
Returns an AggregationOverview for the given ID.
- Parameters:
aggregation_id – ID of the aggregation overview to retrieve.
aggregation_type – Type of the aggregation.
- Returns:
AggregationOverview if it was found, None otherwise.
- aggregation_overview_ids() Sequence[str]
Returns sorted IDs of all stored :class:`AggregationOverview`s.
- Returns:
A Sequence of the AggregationOverview IDs.
- aggregation_overviews(aggregation_type: type[AggregatedEvaluation]) Iterable[AggregationOverview]
Returns all :class:`AggregationOverview`s sorted by their ID.
- Parameters:
aggregation_type – Type of the aggregation.
- Yields:
:class:`AggregationOverview`s.
- exists(path: Path) bool
- file_names(path: Path, file_type: str = 'json') Sequence[str]
- mkdir(path: Path) None
- static path_to_str(path: Path) str[source]
Returns a string for the given Path so that it’s readable for the respective file system.
- Parameters:
path – Given Path that should be converted.
- Returns:
String representation of the given Path.
- read_utf8(path: Path) str
- remove_file(path: Path) None
- store_aggregation_overview(aggregation_overview: AggregationOverview) None
Stores an AggregationOverview.
- Parameters:
aggregation_overview – The aggregated results to be persisted.
- write_utf8(path: Path, content: str, create_parents: bool = False) None
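The file helpers listed for this class (exists, file_names, mkdir, read_utf8, remove_file, write_utf8) map naturally onto pathlib. The sketch below is a hypothetical local-disk stand-in, not the SDK class. In particular, returning file names without their extension from `file_names` is an assumption.

```python
import tempfile
from pathlib import Path


class LocalFilesSketch:
    """Hypothetical local-disk helper mirroring the method names listed above."""

    def __init__(self, root_directory: Path) -> None:
        self.root = root_directory

    def write_utf8(self, path: Path, content: str, create_parents: bool = False) -> None:
        if create_parents:
            path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content, encoding="utf-8")

    def read_utf8(self, path: Path) -> str:
        return path.read_text(encoding="utf-8")

    def exists(self, path: Path) -> bool:
        return path.exists()

    def file_names(self, path: Path, file_type: str = "json") -> list:
        # Assumption: names are returned sorted and without the file extension.
        return sorted(p.stem for p in path.glob(f"*.{file_type}"))

    def remove_file(self, path: Path) -> None:
        path.unlink()


with tempfile.TemporaryDirectory() as tmp:
    files = LocalFilesSketch(Path(tmp))
    target = files.root / "aggregation_overviews" / "agg-1.json"
    files.write_utf8(target, '{"id": "agg-1"}', create_parents=True)
    stored_ids = files.file_names(target.parent)
    round_trip = files.read_utf8(target)
    files.remove_file(target)
    exists_after_removal = files.exists(target)
```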
- class FileDatasetRepository(root_directory: Path)[source]
Bases:
FileSystemDatasetRepository
- create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) Dataset
Creates a dataset from the given :class:`Example`s and returns the created dataset.
- Parameters:
examples – An Iterable of :class:`Example`s to be saved in the same dataset.
dataset_name – A name for the dataset.
id – The dataset ID. If None, an ID will be generated.
labels – A set of labels for filtering. Defaults to an empty set.
metadata – A dict for additional information about the dataset. Defaults to an empty dict.
- Returns:
The created Dataset.
- dataset(dataset_id: str) Dataset | None
Returns a dataset identified by the given dataset ID.
- Parameters:
dataset_id – Dataset ID of the dataset to retrieve.
- Returns:
Dataset if it was found, None otherwise.
- dataset_ids() Iterable[str]
Returns all sorted dataset IDs.
- Returns:
Iterable of dataset IDs.
- datasets() Iterable[Dataset]
Returns all :class:`Dataset`s sorted by their ID.
- Yields:
:class:`Dataset`s.
- delete_dataset(dataset_id: str) None
Deletes a dataset identified by the given dataset ID.
- Parameters:
dataset_id – Dataset ID of the dataset to delete.
- example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) Example | None
Returns an Example for the given dataset ID and example ID.
- Parameters:
dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
- Returns:
Example if it was found, None otherwise.
- examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) Iterable[Example]
Returns all :class:`Example`s for the given dataset ID sorted by their ID.
- Parameters:
dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional set of example IDs. Those examples will be excluded from the output.
- Returns:
Iterable of :class:`Example`s.
- exists(path: Path) bool
- file_names(path: Path, file_type: str = 'json') Sequence[str]
- mkdir(path: Path) None
- static path_to_str(path: Path) str
Returns a string for the given Path so that it’s readable for the respective file system.
- Parameters:
path – Given Path that should be converted.
- Returns:
String representation of the given Path.
- read_utf8(path: Path) str
- remove_file(path: Path) None
- write_utf8(path: Path, content: str, create_parents: bool = False) None
- class FileEvaluationRepository(root_directory: Path)[source]
Bases:
FileSystemEvaluationRepository
- evaluation_overview(evaluation_id: str) EvaluationOverview | None
Returns an EvaluationOverview for the given ID.
- Parameters:
evaluation_id – ID of the evaluation overview to retrieve.
- Returns:
EvaluationOverview if it was found, None otherwise.
- evaluation_overview_ids() Sequence[str]
Returns sorted IDs of all stored :class:`EvaluationOverview`s.
- Returns:
A Sequence of the EvaluationOverview IDs.
- evaluation_overviews() Iterable[EvaluationOverview]
Returns all :class:`EvaluationOverview`s sorted by their ID.
- Yields:
:class:`EvaluationOverview`s.
- example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation] | None
Returns an ExampleEvaluation for the given evaluation overview ID and example ID.
- Parameters:
evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in Evaluator.do_evaluate().
- Returns:
ExampleEvaluation if it was found, None otherwise.
- example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation]]
Returns all :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- exists(path: Path) bool
- failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation[FailedExampleEvaluation]]
Returns all failed :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- file_names(path: Path, file_type: str = 'json') Sequence[str]
- initialize_evaluation() str
Initializes an EvaluationOverview and returns its ID.
If no extra logic is required for the initialization, this function just returns a UUID as a string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.
- Returns:
The created ID.
- mkdir(path: Path) None
- static path_to_str(path: Path) str[source]
Returns a string for the given Path so that it’s readable for the respective file system.
- Parameters:
path – Given Path that should be converted.
- Returns:
String representation of the given Path.
- read_utf8(path: Path) str
- remove_file(path: Path) None
- store_evaluation_overview(overview: EvaluationOverview) None
Stores an EvaluationOverview.
- Parameters:
overview – The overview to be persisted.
- store_example_evaluation(example_evaluation: ExampleEvaluation) None
Stores an ExampleEvaluation.
- Parameters:
example_evaluation – The example evaluation to be persisted.
- successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation]
Returns all successful :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- write_utf8(path: Path, content: str, create_parents: bool = False) None
- class FileRunRepository(root_directory: Path)[source]
Bases:
FileSystemRunRepository
- create_temporary_run_data(tmp_hash: str, run_id: str) None
- create_tracer_for_example(run_id: str, example_id: str) Tracer
Creates and returns a Tracer for the given run ID and example ID.
- Parameters:
run_id – The ID of the linked run overview.
example_id – ID of the example whose Tracer should be retrieved.
- Returns:
A :class:`Tracer`.
- delete_temporary_run_data(tmp_hash: str) None
- example_output(run_id: str, example_id: str, output_type: type[Output]) ExampleOutput | ExampleOutput[FailedExampleRun] | None
Returns the ExampleOutput for the given run ID and example ID.
- Parameters:
run_id – The ID of the linked run overview.
example_id – ID of the example to retrieve.
output_type – Type of output that the Task returned in Task.do_run().
- Returns:
:class:`ExampleOutput` if it was found, None otherwise.
- example_output_ids(run_id: str) Sequence[str]
Returns the sorted IDs of all :class:`ExampleOutput`s for a given run ID.
- Parameters:
run_id – The ID of the run overview.
- Returns:
A Sequence of all ExampleOutput IDs.
- example_outputs(run_id: str, output_type: type[Output]) Iterable[ExampleOutput | ExampleOutput[FailedExampleRun]]
Returns all :class:`ExampleOutput`s for a given run ID sorted by their example ID.
- Parameters:
run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run().
- Returns:
Iterable of :class:`ExampleOutput`s.
- example_tracer(run_id: str, example_id: str) Tracer | None
Returns an Optional[Tracer] for the given run ID and example ID.
- Parameters:
run_id – The ID of the linked run overview.
example_id – ID of the example whose Tracer should be retrieved.
- Returns:
A Tracer if it was found, None otherwise.
- exists(path: Path) bool
- failed_example_outputs(run_id: str, output_type: type[Output]) Iterable[ExampleOutput[FailedExampleRun]]
Returns all :class:`ExampleOutput`s for failed example runs with a given run-overview ID sorted by their example ID.
- Parameters:
run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run().
- Returns:
Iterable of :class:`ExampleOutput`s.
- file_names(path: Path, file_type: str = 'json') Sequence[str]
- finished_examples(tmp_hash: str) RecoveryData | None
- mkdir(path: Path) None
- static path_to_str(path: Path) str[source]
Returns a string for the given Path so that it’s readable for the respective file system.
- Parameters:
path – Given Path that should be converted.
- Returns:
String representation of the given Path.
- read_utf8(path: Path) str
- remove_file(path: Path) None
- run_overview(run_id: str) RunOverview | None
Returns a RunOverview for the given ID.
- Parameters:
run_id – ID of the run overview to retrieve.
- Returns:
RunOverview if it was found, None otherwise.
- run_overview_ids() Sequence[str]
Returns sorted IDs of all stored :class:`RunOverview`s.
- Returns:
A Sequence of the RunOverview IDs.
- run_overviews() Iterable[RunOverview]
Returns all :class:`RunOverview`s sorted by their ID.
- Yields:
Iterable of :class:`RunOverview`s.
- store_example_output(example_output: ExampleOutput) None
Stores an ExampleOutput.
- Parameters:
example_output – The example output to be persisted.
- store_example_output_parallel(tmp_hash: str, example_output: ExampleOutput) None
- store_run_overview(overview: RunOverview) None
Stores a RunOverview.
- Parameters:
overview – The overview to be persisted.
- successful_example_outputs(run_id: str, output_type: type[Output]) Iterable[ExampleOutput]
Returns all :class:`ExampleOutput`s for successful example runs with a given run-overview ID sorted by their example ID.
- Parameters:
run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run().
- Returns:
Iterable of :class:`ExampleOutput`s.
- temp_store_finished_example(tmp_hash: str, example_id: str) None
- write_utf8(path: Path, content: str, create_parents: bool = False) None
- class HighlightCoverageGrader(beta_factor: float = 1.0)[source]
Bases:
object
Evaluates how well the generated highlights match the expected highlights (via precision, recall and f1-score).
- Parameters:
beta_factor – factor to control weight of precision (0 <= beta < 1) vs. recall (beta > 1) when computing the f-score
- compute_fscores(generated_highlight_indices: Sequence[tuple[int, int]], expected_highlight_indices: Sequence[tuple[int, int]]) FScores[source]
Calculates how well the generated highlight ranges match the expected ones.
- Parameters:
generated_highlight_indices – list of (start, end) tuples of the generated highlights
expected_highlight_indices – list of (start, end) tuples of the expected highlights
- Returns:
FScores, which contains precision, recall and f-score metrics. All are floats between 0 and 1, where 1 means a perfect match and 0 means no overlap.
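The relationship between beta_factor, precision and recall can be sketched as follows. This is a minimal illustration, not the grader's implementation: the function names are hypothetical, ranges are assumed to be half-open (end exclusive), and overlap is assumed to be counted per character position.

```python
from typing import Sequence


def covered_positions(ranges: Sequence[tuple[int, int]]) -> set[int]:
    """Expand (start, end) ranges into the set of covered positions."""
    # Assumption: each range is half-open, so `end` itself is not covered.
    return {pos for start, end in ranges for pos in range(start, end)}


def highlight_fscores(
    generated: Sequence[tuple[int, int]],
    expected: Sequence[tuple[int, int]],
    beta: float = 1.0,
) -> tuple[float, float, float]:
    """Return (precision, recall, f_beta) for two sets of highlight ranges."""
    gen = covered_positions(generated)
    exp = covered_positions(expected)
    overlap = len(gen & exp)
    precision = overlap / len(gen) if gen else 0.0
    recall = overlap / len(exp) if exp else 0.0
    # Standard F-beta: beta < 1 favours precision, beta > 1 favours recall.
    denominator = beta**2 * precision + recall
    f_beta = (1 + beta**2) * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_beta


# Half of the generated range overlaps half of the expected range.
precision, recall, f1 = highlight_fscores([(0, 10)], [(5, 15)])
```

With beta = 1 the score is the harmonic mean of precision and recall, which is why all three values coincide in the usage line.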
- class HuggingFaceAggregationRepository(repository_id: str, token: str, private: bool)[source]
Bases:
FileSystemAggregationRepository, HuggingFaceRepository
- aggregation_overview(aggregation_id: str, aggregation_type: type[AggregatedEvaluation]) AggregationOverview | None
Returns an AggregationOverview for the given ID.
- Parameters:
aggregation_id – ID of the aggregation overview to retrieve.
aggregation_type – Type of the aggregation.
- Returns:
AggregationOverview if it was found, None otherwise.
- aggregation_overview_ids() Sequence[str]
Returns sorted IDs of all stored :class:`AggregationOverview`s.
- Returns:
A Sequence of the AggregationOverview IDs.
- aggregation_overviews(aggregation_type: type[AggregatedEvaluation]) Iterable[AggregationOverview]
Returns all :class:`AggregationOverview`s sorted by their ID.
- Parameters:
aggregation_type – Type of the aggregation.
- Yields:
:class:`AggregationOverview`s.
- create_repository(repository_id: str, token: str, private: bool) None
- delete_repository() None
- exists(path: Path) bool
- file_names(path: Path, file_type: str = 'json') Sequence[str]
- mkdir(path: Path) None
- static path_to_str(path: Path) str
Returns a string for the given Path so that it’s readable for the respective file system.
- Parameters:
path – Given Path that should be converted.
- Returns:
String representation of the given Path.
- read_utf8(path: Path) str
- remove_file(path: Path) None
- store_aggregation_overview(aggregation_overview: AggregationOverview) None
Stores an AggregationOverview.
- Parameters:
aggregation_overview – The aggregated results to be persisted.
- write_utf8(path: Path, content: str, create_parents: bool = False) None
- class HuggingFaceDatasetRepository(repository_id: str, token: str, private: bool, caching: bool = True)[source]
Bases:
HuggingFaceRepository, FileSystemDatasetRepository
- create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) Dataset
Creates a dataset from the given :class:`Example`s and returns the created dataset.
- Parameters:
examples – An Iterable of :class:`Example`s to be saved in the same dataset.
dataset_name – A name for the dataset.
id – The dataset ID. If None, an ID will be generated.
labels – A set of labels for filtering. Defaults to an empty set.
metadata – A dict for additional information about the dataset. Defaults to an empty dict.
- Returns:
The created Dataset.
- create_repository(repository_id: str, token: str, private: bool) None
- dataset(dataset_id: str) Dataset | None[source]
Returns a dataset identified by the given dataset ID.
This implementation should be backwards compatible to datasets created without a dataset object (i.e., there is no dataset file with dataset metadata).
- Parameters:
dataset_id – Dataset ID of the dataset to retrieve.
- Returns:
Dataset if it was found, None otherwise.
- dataset_ids() Iterable[str]
Returns all sorted dataset IDs.
- Returns:
Iterable of dataset IDs.
- datasets() Iterable[Dataset]
Returns all :class:`Dataset`s sorted by their ID.
- Yields:
:class:`Dataset`s.
- delete_dataset(dataset_id: str) None[source]
Deletes a dataset identified by the given dataset ID.
This implementation should be backwards compatible to datasets created without a dataset object (i.e., there is no dataset file with dataset metadata).
Note that the HuggingFace API does not seem to support deleting non-existent files.
- Parameters:
dataset_id – Dataset ID of the dataset to delete.
- delete_repository() None
- example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) Example | None
Returns an Example for the given dataset ID and example ID.
- Parameters:
dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
- Returns:
Example if it was found, None otherwise.
- examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) Iterable[Example]
Returns all :class:`Example`s for the given dataset ID sorted by their ID.
- Parameters:
dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional set of example IDs. Those examples will be excluded from the output.
- Returns:
Iterable of :class:`Example`s.
- exists(path: Path) bool
- file_names(path: Path, file_type: str = 'json') Sequence[str]
- mkdir(path: Path) None
- static path_to_str(path: Path) str
Returns a string for the given Path so that it’s readable for the respective file system.
- Parameters:
path – Given Path that should be converted.
- Returns:
String representation of the given Path.
- read_utf8(path: Path) str
- remove_file(path: Path) None
- write_utf8(path: Path, content: str, create_parents: bool = False) None
- class HuggingFaceRepository(repository_id: str, token: str, private: bool)[source]
Bases:
FileSystemBasedRepository
HuggingFace base repository.
- exists(path: Path) bool
- file_names(path: Path, file_type: str = 'json') Sequence[str]
- mkdir(path: Path) None
- static path_to_str(path: Path) str[source]
Returns a string for the given Path so that it’s readable for the respective file system.
- Parameters:
path – Given Path that should be converted.
- Returns:
String representation of the given Path.
- read_utf8(path: Path) str
- remove_file(path: Path) None
- write_utf8(path: Path, content: str, create_parents: bool = False) None
- class InMemoryAggregationRepository[source]
Bases:
AggregationRepository
- aggregation_overview(aggregation_id: str, aggregation_type: type[AggregatedEvaluation]) AggregationOverview | None[source]
Returns an AggregationOverview for the given ID.
- Parameters:
aggregation_id – ID of the aggregation overview to retrieve.
aggregation_type – Type of the aggregation.
- Returns:
AggregationOverview if it was found, None otherwise.
- aggregation_overview_ids() Sequence[str][source]
Returns sorted IDs of all stored :class:`AggregationOverview`s.
- Returns:
A Sequence of the AggregationOverview IDs.
- aggregation_overviews(aggregation_type: type[AggregatedEvaluation]) Iterable[AggregationOverview]
Returns all :class:`AggregationOverview`s sorted by their ID.
- Parameters:
aggregation_type – Type of the aggregation.
- Yields:
:class:`AggregationOverview`s.
- store_aggregation_overview(aggregation_overview: AggregationOverview) None[source]
Stores an AggregationOverview.
- Parameters:
aggregation_overview – The aggregated results to be persisted.
- class InMemoryDatasetRepository[source]
Bases:
DatasetRepository
- create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) Dataset[source]
Creates a dataset from the given :class:`Example`s and returns the created dataset.
- Parameters:
examples – An Iterable of :class:`Example`s to be saved in the same dataset.
dataset_name – A name for the dataset.
id – The dataset ID. If None, an ID will be generated.
labels – A set of labels for filtering. Defaults to an empty set.
metadata – A dict for additional information about the dataset. Defaults to an empty dict.
- Returns:
The created Dataset.
- dataset(dataset_id: str) Dataset | None[source]
Returns a dataset identified by the given dataset ID.
- Parameters:
dataset_id – Dataset ID of the dataset to retrieve.
- Returns:
Dataset if it was found, None otherwise.
- dataset_ids() Iterable[str][source]
Returns all sorted dataset IDs.
- Returns:
Iterable of dataset IDs.
- datasets() Iterable[Dataset]
Returns all :class:`Dataset`s sorted by their ID.
- Yields:
:class:`Dataset`s.
- delete_dataset(dataset_id: str) None[source]
Deletes a dataset identified by the given dataset ID.
- Parameters:
dataset_id – Dataset ID of the dataset to delete.
- example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) Example | None[source]
Returns an Example for the given dataset ID and example ID.
- Parameters:
dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
- Returns:
Example if it was found, None otherwise.
- examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) Iterable[Example][source]
Returns all :class:`Example`s for the given dataset ID sorted by their ID.
- Parameters:
dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional set of example IDs. Those examples will be excluded from the output.
- Returns:
Iterable of :class:`Example`s.
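The documented behaviour of examples (sorted by ID, with examples_to_skip excluded) can be illustrated with a small stand-in. This is a sketch under stated assumptions, not the repository's code: ExampleSketch and InMemoryDatasetsSketch are hypothetical names, and returning an empty list for an unknown dataset ID is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ExampleSketch:
    """Hypothetical stand-in for an Example with an ID, input and expected output."""

    id: str
    input: str
    expected_output: str


class InMemoryDatasetsSketch:
    """Keeps examples per dataset ID in a plain dict."""

    def __init__(self) -> None:
        self._examples: dict[str, list[ExampleSketch]] = {}

    def create_dataset(self, dataset_id: str, examples: Iterable[ExampleSketch]) -> None:
        self._examples[dataset_id] = list(examples)

    def examples(
        self, dataset_id: str, examples_to_skip: Optional[frozenset[str]] = None
    ) -> list[ExampleSketch]:
        skip = examples_to_skip or frozenset()
        # Assumption: an unknown dataset ID simply yields no examples.
        stored = self._examples.get(dataset_id, [])
        return sorted((e for e in stored if e.id not in skip), key=lambda e: e.id)


repository = InMemoryDatasetsSketch()
repository.create_dataset(
    "demo",
    [ExampleSketch("b", "2 + 2", "4"), ExampleSketch("a", "1 + 1", "2")],
)
all_ids = [e.id for e in repository.examples("demo")]
remaining_ids = [e.id for e in repository.examples("demo", frozenset({"a"}))]
```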
- class InMemoryEvaluationRepository[source]
Bases:
EvaluationRepository
An EvaluationRepository that stores evaluation results in memory.
Preferred for quick testing or to be used in Jupyter Notebooks.
- evaluation_overview(evaluation_id: str) EvaluationOverview | None[source]
Returns an EvaluationOverview for the given ID.
- Parameters:
evaluation_id – ID of the evaluation overview to retrieve.
- Returns:
EvaluationOverview if it was found, None otherwise.
- evaluation_overview_ids() Sequence[str][source]
Returns sorted IDs of all stored :class:`EvaluationOverview`s.
- Returns:
A Sequence of the EvaluationOverview IDs.
- evaluation_overviews() Iterable[EvaluationOverview]
Returns all :class:`EvaluationOverview`s sorted by their ID.
- Yields:
:class:`EvaluationOverview`s.
- example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation] | None[source]
Returns an ExampleEvaluation for the given evaluation overview ID and example ID.
- Parameters:
evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in Evaluator.do_evaluate()
- Returns:
ExampleEvaluation if it was found, None otherwise.
- example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation]][source]
Returns all :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation[FailedExampleEvaluation]]
Returns all failed :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- initialize_evaluation() str
Initializes an EvaluationOverview and returns its ID.
If no extra logic is required for the initialization, this function just returns a UUID as string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.
- Returns:
The created ID.
- store_evaluation_overview(overview: EvaluationOverview) None[source]
Stores an EvaluationOverview.
- Parameters:
overview – The overview to be persisted.
- store_example_evaluation(evaluation: ExampleEvaluation) None[source]
Stores an ExampleEvaluation.
- Parameters:
evaluation – The example evaluation to be persisted.
- successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation]
Returns all successful :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
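The in-memory storage pattern described for this class (generate an ID, store overviews, read them back sorted by ID) might look roughly like the sketch below. It is an illustration only: InMemoryOverviewStore is a hypothetical name, and overviews are represented as plain dicts with an "id" key rather than EvaluationOverview objects.

```python
import uuid
from typing import Optional


class InMemoryOverviewStore:
    """Hypothetical dict-backed store, for illustration only."""

    def __init__(self) -> None:
        self._overviews: dict[str, dict] = {}

    def initialize_evaluation(self) -> str:
        # No external system is involved, so a fresh UUID string is enough.
        return str(uuid.uuid4())

    def store_evaluation_overview(self, overview: dict) -> None:
        self._overviews[overview["id"]] = overview

    def evaluation_overview(self, evaluation_id: str) -> Optional[dict]:
        # Returns None when the ID is unknown, mirroring the documented contract.
        return self._overviews.get(evaluation_id)

    def evaluation_overview_ids(self) -> list[str]:
        return sorted(self._overviews)


store = InMemoryOverviewStore()
store.store_evaluation_overview({"id": "eval-b", "description": "second"})
store.store_evaluation_overview({"id": "eval-a", "description": "first"})
new_id = store.initialize_evaluation()
```

Because nothing is persisted, the stored data is lost when the process ends, which fits the stated use in quick tests and notebooks.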
- class InMemoryRunRepository[source]
Bases:
RunRepository
- create_temporary_run_data(tmp_hash: str, run_id: str) None
- create_tracer_for_example(run_id: str, example_id: str) Tracer[source]
Creates and returns a Tracer for the given run ID and example ID.
- Parameters:
run_id – The ID of the linked run overview.
example_id – ID of the example whose Tracer should be retrieved.
- Returns:
A :class:`Tracer`.
- delete_temporary_run_data(tmp_hash: str) None
- example_output(run_id: str, example_id: str, output_type: type[Output]) ExampleOutput | ExampleOutput[FailedExampleRun] | None[source]
Returns ExampleOutput for the given run ID and example ID.
- Parameters:
run_id – The ID of the linked run overview.
example_id – ID of the example to retrieve.
output_type – Type of output that the Task returned in Task.do_run()
- Returns:
:class:`ExampleOutput` if it was found, None otherwise.
- example_output_ids(run_id: str) Sequence[str][source]
Returns the sorted IDs of all :class:`ExampleOutput`s for a given run ID.
- Parameters:
run_id – The ID of the run overview.
- Returns:
A Sequence of all ExampleOutput IDs.
- example_outputs(run_id: str, output_type: type[Output]) Iterable[ExampleOutput | ExampleOutput[FailedExampleRun]][source]
Returns all ExampleOutput for a given run ID sorted by their example ID.
- Parameters:
run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()
- Returns:
Iterable of :class:`ExampleOutput`s.
- example_tracer(run_id: str, example_id: str) Tracer | None[source]
Returns an Optional[Tracer] for the given run ID and example ID.
- Parameters:
run_id – The ID of the linked run overview.
example_id – ID of the example whose Tracer should be retrieved.
- Returns:
A Tracer if it was found, None otherwise.
- failed_example_outputs(run_id: str, output_type: type[Output]) Iterable[ExampleOutput[FailedExampleRun]]
Returns all ExampleOutput for failed example runs with a given run-overview ID sorted by their example ID.
- Parameters:
run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()
- Returns:
Iterable of :class:`ExampleOutput`s.
- run_overview(run_id: str) RunOverview | None[source]
Returns a RunOverview for the given ID.
- Parameters:
run_id – ID of the run overview to retrieve.
- Returns:
RunOverview if it was found, None otherwise.
- run_overview_ids() Sequence[str][source]
Returns sorted IDs of all stored :class:`RunOverview`s.
- Returns:
A Sequence of the RunOverview IDs.
- run_overviews() Iterable[RunOverview]
Returns all :class:`RunOverview`s sorted by their ID.
- Yields:
Iterable of :class:`RunOverview`s.
- store_example_output(example_output: ExampleOutput) None[source]
Stores an ExampleOutput.
- Parameters:
example_output – The example output to be persisted.
- store_example_output_parallel(tmp_hash: str, example_output: ExampleOutput) None
- store_run_overview(overview: RunOverview) None[source]
Stores a RunOverview.
- Parameters:
overview – The overview to be persisted.
- successful_example_outputs(run_id: str, output_type: type[Output]) Iterable[ExampleOutput]
Returns all ExampleOutput for successful example runs with a given run-overview ID sorted by their example ID.
- Parameters:
run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()
- Returns:
Iterable of :class:`ExampleOutput`s.
- temp_store_finished_example(tmp_hash: str, example_id: str) None
- class IncrementalEvaluationLogic[source]
Bases:
EvaluationLogic[Input, Output, ExpectedOutput, Evaluation]
- do_evaluate(example: Example, *output: SuccessfulExampleOutput) Evaluation[source]
Executes the evaluation for this specific example.
Responsible for comparing the input & expected output of a task to the actually generated output. The difference to the standard EvaluationLogic’s do_evaluate is that this method will separate already processed evaluations from new ones before handing them over to do_incremental_evaluate.
- Parameters:
example – Input data of Task to produce the output.
*output – Outputs of the Task.
- Returns:
The metrics that come from the evaluated Task.
- Return type:
Evaluation
- abstract do_incremental_evaluate(example: Example, outputs: list[SuccessfulExampleOutput], already_evaluated_outputs: list[list[SuccessfulExampleOutput]]) Evaluation[source]
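The separation step that do_evaluate performs before calling do_incremental_evaluate could be pictured as below. This is a hedged sketch, not the library's logic: OutputSketch and split_outputs are hypothetical, and it assumes that previously evaluated outputs are recognised by their run ID, grouped per earlier evaluation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class OutputSketch:
    """Hypothetical stand-in for a successful example output."""

    run_id: str
    example_id: str
    output: str


def split_outputs(
    outputs: Sequence[OutputSketch],
    previous_run_id_groups: Sequence[set[str]],
) -> tuple[list[OutputSketch], list[list[OutputSketch]]]:
    """Split outputs into new ones and groups that earlier evaluations covered."""
    # Assumption: each group holds the run IDs handled by one earlier evaluation.
    seen = {run_id for group in previous_run_id_groups for run_id in group}
    new_outputs = [o for o in outputs if o.run_id not in seen]
    already_evaluated = [
        [o for o in outputs if o.run_id in group] for group in previous_run_id_groups
    ]
    return new_outputs, already_evaluated


outputs = [
    OutputSketch("run-1", "ex-1", "A"),
    OutputSketch("run-2", "ex-1", "B"),
    OutputSketch("run-3", "ex-1", "C"),
]
new_outputs, already_evaluated = split_outputs(outputs, [{"run-1", "run-2"}])
```

The two return values correspond to the outputs and already_evaluated_outputs arguments of do_incremental_evaluate, so an implementation only needs to compare the new outputs against what was judged before.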
- class IncrementalEvaluator(dataset_repository: DatasetRepository, run_repository: RunRepository, evaluation_repository: EvaluationRepository, description: str, incremental_evaluation_logic: IncrementalEvaluationLogic[Input, Output, ExpectedOutput, Evaluation])[source]
Bases:
Evaluator[Input, Output, ExpectedOutput, Evaluation]
Evaluator for evaluating additional runs on top of previous evaluations. Intended for use with IncrementalEvaluationLogic.
- Parameters:
dataset_repository – The repository with the examples that will be taken for the evaluation.
run_repository – The repository of the runs to evaluate.
evaluation_repository – The repository that will be used to store evaluation results.
description – Human-readable description for the evaluator.
incremental_evaluation_logic – The logic to use for evaluation.
- Generics:
Input: Interface to be passed to the Task that shall be evaluated.
Output: Type of the output of the Task to be evaluated.
ExpectedOutput: Output that is expected from the run with the supplied input.
Evaluation: Interface of the metrics that come from the evaluated Task.
- evaluate(example: Example, evaluation_id: str, abort_on_error: bool, *example_outputs: SuccessfulExampleOutput) Evaluation | FailedExampleEvaluation
- evaluate_additional_runs(*run_ids: str, previous_evaluation_ids: list[str] | None = None, num_examples: int | None = None, abort_on_error: bool = False, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) EvaluationOverview[source]
Evaluates all runs while considering which runs have already been evaluated according to previous_evaluation_ids.
For each set of successful outputs in the referenced runs, EvaluationLogic.do_evaluate() is called and eval metrics are produced & stored in the provided EvaluationRepository.
- Parameters:
*run_ids – The runs to be evaluated. Each run is expected to have the same dataset as input (which implies their tasks have the same input-type) and their tasks have the same output-type. For each example in the dataset referenced by the runs the outputs of all runs are collected and if all of them were successful they are passed on to the implementation specific evaluation. The method compares all runs of the provided ids to each other.
previous_evaluation_ids – IDs of previous evaluations to consider
num_examples – The number of examples which should be evaluated from the given runs. Always the first n runs stored in the evaluation repository. Defaults to None.
abort_on_error – Flag to abort all evaluations when an error occurs. Defaults to False.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the evaluation overview. Defaults to an empty dict.
- Returns:
An overview of the evaluation. Individual :class:`Evaluation`s will not be returned but instead stored in the :class:`EvaluationRepository` provided in the __init__.
- Return type:
EvaluationOverview
- evaluate_runs(*run_ids: str, num_examples: int | None = None, abort_on_error: bool = False, skip_example_on_any_failure: bool = True, description: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) EvaluationOverview[source]
Evaluates all generated outputs in the run.
For each set of successful outputs in the referenced runs, EvaluationLogic.do_evaluate() is called and eval metrics are produced & stored in the provided EvaluationRepository.
- Parameters:
*run_ids – The runs to be evaluated. Each run is expected to have the same dataset as input (which implies their tasks have the same input-type) and their tasks have the same output-type. For each example in the dataset referenced by the runs the outputs of all runs are collected and if all of them were successful they are passed on to the implementation specific evaluation. The method compares all runs of the provided ids to each other.
num_examples – The number of examples which should be evaluated from the given runs. Always the first n runs stored in the evaluation repository. Defaults to None.
abort_on_error – Flag to abort all evaluations when an error occurs. Defaults to False.
skip_example_on_any_failure – Flag to skip evaluation on any example for which at least one run fails. Defaults to True.
description – Optional description of the evaluation. Defaults to None.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the evaluation overview. Defaults to an empty dict.
- Returns:
An overview of the evaluation. Individual :class:`Evaluation`s will not be returned but instead stored in the :class:`EvaluationRepository` provided in the __init__.
- Return type:
EvaluationOverview
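How outputs from several runs are collected per example, and how skip_example_on_any_failure affects which examples reach the evaluation logic, can be sketched like this. The function name and the (run_id, example_id, succeeded) tuples are hypothetical simplifications, not the evaluator's data model.

```python
from collections import defaultdict
from typing import Iterable


def group_outputs_for_evaluation(
    outputs: Iterable[tuple[str, str, bool]],
    skip_example_on_any_failure: bool = True,
) -> dict[str, list[str]]:
    """Map each example ID to the run IDs whose outputs would be evaluated."""
    by_example: dict[str, list[tuple[str, bool]]] = defaultdict(list)
    for run_id, example_id, succeeded in outputs:
        by_example[example_id].append((run_id, succeeded))

    selected: dict[str, list[str]] = {}
    for example_id in sorted(by_example):
        runs = by_example[example_id]
        if skip_example_on_any_failure and not all(ok for _, ok in runs):
            continue  # at least one run failed on this example
        successful = [run_id for run_id, ok in runs if ok]
        if successful:
            selected[example_id] = successful
    return selected


outputs = [
    ("run-a", "ex-1", True),
    ("run-b", "ex-1", True),
    ("run-a", "ex-2", True),
    ("run-b", "ex-2", False),
]
strict = group_outputs_for_evaluation(outputs)
lenient = group_outputs_for_evaluation(outputs, skip_example_on_any_failure=False)
```

In the strict (default) case only ex-1 is evaluated, because run-b failed on ex-2; in the lenient case ex-2 is still evaluated using the one successful output.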
- evaluation_lineage(evaluation_id: str, example_id: str) EvaluationLineage[Input, ExpectedOutput, Output, Evaluation] | None
Wrapper for RepositoryNavigator.evaluation_lineage.
- Parameters:
evaluation_id – The id of the evaluation
example_id – The id of the example of interest
- Returns:
The EvaluationLineage for the given evaluation id and example id. Returns None if the lineage is not complete because either an example, a run, or an evaluation does not exist.
- evaluation_lineages(evaluation_id: str) Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]
Wrapper for RepositoryNavigator.evaluation_lineages.
- Parameters:
evaluation_id – The id of the evaluation
- Returns:
An iterator over all :class:`EvaluationLineage`s for the given evaluation id.
- evaluation_type() type[Evaluation]
Returns the type of the evaluation result of an example.
This can be used to retrieve properly typed evaluations of an evaluation run from an EvaluationRepository.
- Returns:
The type of the evaluation result of an example.
- expected_output_type() type[ExpectedOutput]
Returns the type of the evaluated task’s expected output.
This can be used to retrieve properly typed :class:`Example`s of a dataset from a :class:`DatasetRepository`.
- Returns:
The type of the evaluated task’s expected output.
- failed_evaluations(evaluation_id: str) Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]
Returns the EvaluationLineage objects for all failed example evaluations that belong to the given evaluation ID.
- Parameters:
evaluation_id – The ID of the evaluation overview
- Returns:
Iterable of :class:`EvaluationLineage`s.
- input_type() type[Input]
Returns the type of the evaluated task’s input.
This can be used to retrieve properly typed :class:`Example`s of a dataset from a :class:`DatasetRepository`.
- Returns:
The type of the evaluated task’s input.
- output_type() type[Output]
Returns the type of the evaluated task’s output.
This can be used to retrieve properly typed outputs of an evaluation run from a RunRepository.
- Returns:
The type of the evaluated task’s output.
- class InstructComparisonArgillaEvaluationLogic(high_priority_runs: frozenset[str] | None = None)[source]
Bases:
ArgillaEvaluationLogic[InstructInput, CompleteOutput, None, ComparisonEvaluation]
- from_record(argilla_evaluation: ArgillaEvaluation) ComparisonEvaluation[source]
This method takes the specific Argilla evaluation format and converts it into a compatible Evaluation.
The format of argilla_evaluation.responses depends on the questions attribute. Each name of a question will be a key in the argilla_evaluation.responses mapping.
- Parameters:
argilla_evaluation – Argilla-specific data for a single evaluation.
- Returns:
An Evaluation that contains all evaluation specific data.
- to_record(example: Example[InstructInput, NoneType], *outputs: SuccessfulExampleOutput[CompleteOutput]) RecordDataSequence[source]
This method is responsible for translating the Example and Output of the task to RecordData.
The specific format depends on the fields.
- Parameters:
example – The example to be translated.
*outputs – The outputs of the example that was run.
- Returns:
A RecordDataSequence that contains entries that should be evaluated in Argilla.
- class LanguageMatchesGrader(acceptance_threshold: float = 0.1)[source]
Bases:
object
Provides a method to evaluate whether two texts are of the same language.
- Parameters:
acceptance_threshold – probability a language must surpass to be accepted
- languages_match(input: str, output: str) bool[source]
Calculates if the input and output text are of the same language.
For reliable results, the texts and their sentences should be reasonably long.
- Parameters:
input – text whose language serves as the reference
output – text whose language is compared to that of the input
- Returns:
whether the input and output languages match; returns True if no clear input language can be determined
- Return type:
bool
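The role of acceptance_threshold and the "True when the input language is unclear" rule can be sketched with precomputed language probabilities. This is an assumption-laden illustration: the real grader detects languages from text, whereas here the probability mappings are passed in directly, and both helper names are hypothetical.

```python
from typing import Mapping, Optional


def dominant_language(
    probabilities: Mapping[str, float], acceptance_threshold: float
) -> Optional[str]:
    """Return the most likely language if it surpasses the threshold, else None."""
    if not probabilities:
        return None
    language, probability = max(probabilities.items(), key=lambda item: item[1])
    return language if probability > acceptance_threshold else None


def languages_match(
    input_probabilities: Mapping[str, float],
    output_probabilities: Mapping[str, float],
    acceptance_threshold: float = 0.1,
) -> bool:
    input_language = dominant_language(input_probabilities, acceptance_threshold)
    if input_language is None:
        # No clear input language: treated as a match, as documented.
        return True
    # Assumption: the output matches if the input language is accepted for it too.
    return output_probabilities.get(input_language, 0.0) > acceptance_threshold


same = languages_match({"en": 0.9, "de": 0.05}, {"en": 0.8, "fr": 0.1})
different = languages_match({"en": 0.9}, {"de": 0.95, "en": 0.02})
unclear = languages_match({"en": 0.05, "de": 0.04}, {"fr": 0.9})
```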
- class MatchOutcome(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases:
str, Enum
- static from_rank_literal(rank: int) MatchOutcome[source]
- class Matches(*, comparison_evaluations: Sequence[ComparisonEvaluation])[source]
Bases:
BaseModel
- class MatchesAggregationLogic[source]
Bases:
AggregationLogic[Matches, AggregatedComparison]
- aggregate(evaluations: Iterable[Matches]) AggregatedComparison[source]
Evaluator-specific method for aggregating individual Evaluations into a report-like Aggregated Evaluation.
This method is responsible for taking the results of an evaluation run and aggregating all the results. It should create an AggregatedEvaluation class and return it at the end.
- Parameters:
evaluations – The results from running eval_and_aggregate_runs with a Task.
- Returns:
The aggregated results of an evaluation run with a Dataset.
- class MeanAccumulator[source]
Bases:
Accumulator[float, float]
- add(value: float) None[source]
Responsible for accumulating values.
- Parameters:
value – the value to add
- Returns:
nothing
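An accumulator of this kind typically keeps a running sum and count so the mean can be read at any time without storing every value. The sketch below illustrates that idea only: RunningMean and its mean method are hypothetical, and returning 0.0 for an empty accumulator is an assumption.

```python
class RunningMean:
    """Hypothetical accumulator that tracks a sum and a count."""

    def __init__(self) -> None:
        self._count = 0
        self._total = 0.0

    def add(self, value: float) -> None:
        # Accumulate in constant memory; nothing is returned.
        self._count += 1
        self._total += value

    def mean(self) -> float:
        # Assumption: an empty accumulator reports 0.0 instead of raising.
        return self._total / self._count if self._count else 0.0


accumulator = RunningMean()
for value in (1.0, 2.0, 3.0):
    accumulator.add(value)
average = accumulator.mean()
```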
- class RecordDataSequence(*, records: Sequence[RecordData])[source]
Bases:
BaseModel
Bases:
object
The RepositoryNavigator is used to retrieve coupled data from multiple repositories.
Retrieves the EvaluationLineage for the evaluation with id evaluation_id and example with id example_id.
- Parameters:
evaluation_id – The id of the evaluation
example_id – The id of the example of interest
input_type – The type of the input as defined by the Example
expected_output_type – The type of the expected output as defined by the Example
output_type – The type of the run output as defined by the Output
evaluation_type – The type of the evaluation as defined by the Evaluation
- Returns:
The EvaluationLineage for the given evaluation id and example id. Returns None if the lineage is not complete because either an example, a run, or an evaluation does not exist.
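The "coupled data" idea, together with the rule that an incomplete lineage yields None, can be pictured as a join over three lookups. This is a simplified sketch: LineageSketch, the plain dict "repositories" and the string payloads are hypothetical stand-ins for the typed repositories and models.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass(frozen=True)
class LineageSketch:
    """Hypothetical bundle of an example, its run outputs and its evaluation."""

    example: str
    outputs: tuple[str, ...]
    evaluation: str


def evaluation_lineage(
    examples: Mapping[str, str],
    outputs: Mapping[str, Sequence[str]],
    evaluations: Mapping[tuple[str, str], str],
    evaluation_id: str,
    example_id: str,
) -> Optional[LineageSketch]:
    example = examples.get(example_id)
    example_outputs = outputs.get(example_id, ())
    evaluation = evaluations.get((evaluation_id, example_id))
    # The lineage is only complete if all three parts exist.
    if example is None or not example_outputs or evaluation is None:
        return None
    return LineageSketch(example, tuple(example_outputs), evaluation)


examples = {"ex-1": "What is 2 + 2?"}
outputs = {"ex-1": ["4"]}
evaluations = {("eval-1", "ex-1"): "correct"}
lineage = evaluation_lineage(examples, outputs, evaluations, "eval-1", "ex-1")
```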
Retrieves all :class:`EvaluationLineage`s for the evaluation with id evaluation_id.
- Parameters:
evaluation_id – The id of the evaluation
input_type – The type of the input as defined by the Example
expected_output_type – The type of the expected output as defined by the Example
output_type – The type of the run output as defined by the Output
evaluation_type – The type of the evaluation as defined by the Evaluation
- Yields:
All :class:`EvaluationLineage`s for the given evaluation id.
Retrieves the RunLineage for the run with id run_id and example with id example_id.
- Parameters:
- Returns:
The RunLineage for the given run id and example id, None if the example or an output for the example does not exist.
Retrieves all :class:`RunLineage`s for the run with id run_id.
- Parameters:
- Yields:
An iterator over all :class:`RunLineage`s for the given run id.
- class RougeGrader[source]
Bases:
object
- calculate_rouge(hypothesis: str, reference: str) FScores[source]
Calculates the ROUGE-score for the hypothesis and reference.
In the summarization use-case the ROUGE-score roughly corresponds to the recall of the generated summary with regard to the expected summary.
- Parameters:
hypothesis – The generation to be evaluated.
reference – The baseline for the evaluation.
- Returns:
ROUGE-score, which contains precision, recall and f1 metrics. All are floats between 0 and 1, where 1 means a perfect match and 0 means no overlap.
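To make the precision and recall behind a ROUGE score concrete, here is a sketch of the simplest variant, ROUGE-1 (unigram overlap). It is not the grader's implementation and may not be the variant the grader uses: the function name, whitespace tokenisation and lower-casing are assumptions for illustration.

```python
from collections import Counter


def rouge_1(hypothesis: str, reference: str) -> tuple[float, float, float]:
    """Return (precision, recall, f1) based on overlapping unigrams."""
    # Assumption: tokens are lower-cased, whitespace-separated words.
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((hyp & ref).values())
    precision = overlap / sum(hyp.values()) if hyp else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


precision, recall, f1 = rouge_1("the cat sat", "the cat sat on the mat")
```

In the usage line every hypothesis word appears in the reference (precision 1.0), but only three of the six reference words are covered (recall 0.5), which shows why recall is the informative number for summaries.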
- class RunOverview(*, dataset_id: str, id: str, start: datetime, end: datetime, failed_example_count: int, successful_example_count: int, description: str, labels: set[str] = {}, metadata: dict[str, JsonSerializable] = {})[source]
Bases:
BaseModel
Overview of the run of a Task on a dataset.
- dataset_id
Identifier of the dataset run.
- Type:
str
- id
The unique identifier of this run.
- Type:
str
- start
The time when the run was started
- Type:
datetime.datetime
- end
The time when the run ended
- Type:
datetime.datetime
- failed_example_count
The number of examples where an exception was raised when running the task.
- Type:
int
- successful_example_count
The number of examples that were successfully run.
- Type:
int
- description
Human-readable description of the runner that ran the task.
- Type:
str
- labels
Labels for filtering runs. Defaults to an empty set.
- Type:
set[str]
- metadata
Additional information about the run. Defaults to empty dict.
- Type:
dict[str, JsonSerializable]
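The attributes listed above can be mirrored in a small stand-in to show how they fit together. This sketch uses a standard-library dataclass rather than the actual BaseModel; RunOverviewSketch and the duration and failure_rate helpers are hypothetical additions for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RunOverviewSketch:
    """Hypothetical mirror of the documented RunOverview attributes."""

    dataset_id: str
    id: str
    start: datetime
    end: datetime
    failed_example_count: int
    successful_example_count: int
    description: str
    labels: set[str] = field(default_factory=set)
    metadata: dict = field(default_factory=dict)

    def duration(self) -> timedelta:
        # Hypothetical helper: wall-clock time of the run.
        return self.end - self.start

    def failure_rate(self) -> float:
        # Hypothetical helper: share of examples that raised an exception.
        total = self.failed_example_count + self.successful_example_count
        return self.failed_example_count / total if total else 0.0


overview = RunOverviewSketch(
    dataset_id="dataset-1",
    id="run-1",
    start=datetime(2024, 1, 1, 12, 0, 0),
    end=datetime(2024, 1, 1, 12, 0, 30),
    failed_example_count=1,
    successful_example_count=3,
    description="demo run",
)
```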
- class RunRepository[source]
Bases:
ABC
Base run repository interface.
Provides methods to store and load run results: RunOverview and ExampleOutput. A RunOverview is created from and is linked (by its ID) to multiple :class:`ExampleOutput`s representing results of a dataset.
- abstract create_tracer_for_example(run_id: str, example_id: str) Tracer[source]
Creates and returns a Tracer for the given run ID and example ID.
- Parameters:
run_id – The ID of the linked run overview.
example_id – ID of the example whose Tracer should be retrieved.
- Returns:
A :class:`Tracer`.
- abstract example_output(run_id: str, example_id: str, output_type: type[Output]) ExampleOutput | ExampleOutput[FailedExampleRun] | None[source]
Returns ExampleOutput for the given run ID and example ID.
- Parameters:
run_id – The ID of the linked run overview.
example_id – ID of the example to retrieve.
output_type – Type of output that the Task returned in Task.do_run()
- Returns:
:class:`ExampleOutput` if it was found, None otherwise.
- abstract example_output_ids(run_id: str) Sequence[str][source]
Returns the sorted IDs of all :class:`ExampleOutput`s for a given run ID.
- Parameters:
run_id – The ID of the run overview.
- Returns:
A Sequence of all ExampleOutput IDs.
- abstract example_outputs(run_id: str, output_type: type[Output]) Iterable[ExampleOutput | ExampleOutput[FailedExampleRun]][source]
Returns all ExampleOutput for a given run ID sorted by their example ID.
- Parameters:
run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()
- Returns:
Iterable of :class:`ExampleOutput`s.
- abstract example_tracer(run_id: str, example_id: str) Tracer | None[source]
Returns an Optional[Tracer] for the given run ID and example ID.
- Parameters:
run_id – The ID of the linked run overview.
example_id – ID of the example whose Tracer should be retrieved.
- Returns:
A Tracer if it was found, None otherwise.
- failed_example_outputs(run_id: str, output_type: type[Output]) Iterable[ExampleOutput[FailedExampleRun]][source]
Returns all ExampleOutput for failed example runs with a given run-overview ID sorted by their example ID.
- Parameters:
run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()
- Returns:
Iterable of :class:`ExampleOutput`s.
- abstract run_overview(run_id: str) RunOverview | None[source]
Returns a RunOverview for the given ID.
- Parameters:
run_id – ID of the run overview to retrieve.
- Returns:
RunOverview if it was found, None otherwise.
- abstract run_overview_ids() Sequence[str][source]
Returns the sorted IDs of all stored RunOverviews.
- Returns:
A Sequence of the RunOverview IDs.
- run_overviews() Iterable[RunOverview][source]
Returns all RunOverviews sorted by their ID.
- Yields:
Iterable of RunOverviews.
- abstract store_example_output(example_output: ExampleOutput) None[source]
Stores an ExampleOutput.
- Parameters:
example_output – The example output to be persisted.
- final store_example_output_parallel(tmp_hash: str, example_output: ExampleOutput) None[source]
- abstract store_run_overview(overview: RunOverview) None[source]
Stores a RunOverview.
- Parameters:
overview – The overview to be persisted.
- successful_example_outputs(run_id: str, output_type: type[Output]) Iterable[ExampleOutput][source]
Returns all ExampleOutputs for successful example runs with a given run-overview ID, sorted by their example ID.
- Parameters:
run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run().
- Returns:
Iterable of ExampleOutputs.
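The three query methods example_outputs, failed_example_outputs and successful_example_outputs can be pictured as filters over the same stored outputs. The following is a minimal sketch under that assumption, not the SDK's implementation: ExampleOutput and FailedExampleRun are simplified stand-in dataclasses, and InMemoryOutputs is a hypothetical helper introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Iterable


@dataclass(frozen=True)
class FailedExampleRun:
    """Stand-in for the SDK's failure marker (assumed shape)."""

    error_message: str


@dataclass(frozen=True)
class ExampleOutput:
    """Stand-in for the SDK's ExampleOutput (assumed shape)."""

    run_id: str
    example_id: str
    output: Any


class InMemoryOutputs:
    """Hypothetical in-memory store mimicking the three query methods."""

    def __init__(self) -> None:
        self._outputs: dict[tuple[str, str], ExampleOutput] = {}

    def store_example_output(self, example_output: ExampleOutput) -> None:
        key = (example_output.run_id, example_output.example_id)
        self._outputs[key] = example_output

    def example_outputs(self, run_id: str) -> Iterable[ExampleOutput]:
        # All outputs of one run, sorted by example ID.
        matching = [o for o in self._outputs.values() if o.run_id == run_id]
        return sorted(matching, key=lambda o: o.example_id)

    def failed_example_outputs(self, run_id: str) -> Iterable[ExampleOutput]:
        # Keep only outputs whose payload is a failure marker.
        return [
            o
            for o in self.example_outputs(run_id)
            if isinstance(o.output, FailedExampleRun)
        ]

    def successful_example_outputs(self, run_id: str) -> Iterable[ExampleOutput]:
        # Keep only outputs whose payload is a regular task output.
        return [
            o
            for o in self.example_outputs(run_id)
            if not isinstance(o.output, FailedExampleRun)
        ]


store = InMemoryOutputs()
store.store_example_output(ExampleOutput("run-1", "b", "answer b"))
store.store_example_output(ExampleOutput("run-1", "a", FailedExampleRun("timeout")))
store.store_example_output(ExampleOutput("run-2", "c", "answer c"))

all_ids = [o.example_id for o in store.example_outputs("run-1")]
failed_ids = [o.example_id for o in store.failed_example_outputs("run-1")]
successful_ids = [o.example_id for o in store.successful_example_outputs("run-1")]
```

In this sketch the failed and successful views partition the outputs of a run, which matches the return types listed above (ExampleOutput[FailedExampleRun] versus plain ExampleOutput).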
- class Runner(task: Task[Input, Output], dataset_repository: DatasetRepository, run_repository: RunRepository, description: str)[source]
Bases:
Generic[Input, Output]
- failed_runs(run_id: str, expected_output_type: type[ExpectedOutput]) Iterable[RunLineage[Input, ExpectedOutput, Output]][source]
Returns the RunLineage objects for all failed example runs that belong to the given run ID.
- Parameters:
run_id – The ID of the run overview.
expected_output_type – The type of the expected output as defined by the Example.
- Returns:
Iterable of RunLineages.
- output_type() type[Output][source]
Returns the type of the evaluated task’s output.
This can be used to retrieve properly typed outputs of an evaluation run from a RunRepository.
- Returns:
The type of the evaluated task’s output.
- run_dataset(dataset_id: str, tracer: Tracer | None = None, num_examples: int | None = None, abort_on_error: bool = False, max_workers: int = 10, description: str | None = None, trace_examples_individually: bool = True, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None, resume_from_recovery_data: bool = False) RunOverview[source]
Generates all outputs for the provided dataset.
Will run each Example provided in the dataset through the Task.
- Parameters:
dataset_id – The id of the dataset to generate output for. Consists of examples, each with an Input and an ExpectedOutput (can be None).
tracer – An optional Tracer to trace all the runs from each example. Use trace_examples_individually to trace each example with a dedicated tracer individually.
num_examples – An optional int to specify how many examples from the dataset should be run. The first n examples are always taken.
abort_on_error – Flag to abort all runs when an error occurs. Defaults to False.
max_workers – Number of examples that can be evaluated concurrently. Defaults to 10.
description – An optional description of the run. Defaults to None.
trace_examples_individually – Flag to create individual tracers for each example. Defaults to True.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the run overview. Defaults to an empty dict.
resume_from_recovery_data – Flag to resume if execution failed previously.
- Returns:
An overview of the run. Outputs will not be returned but instead stored in the RunRepository provided in the __init__.
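To make the interplay of num_examples, abort_on_error and max_workers more concrete, here is a minimal sketch of a concurrent run loop. It is an illustration under stated assumptions, not the SDK's implementation: run_examples is a hypothetical function, the task is modelled as a plain callable, examples are (example_id, input) tuples, and outputs are returned in a dict rather than stored in a RunRepository.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Optional, Union


def run_examples(
    task: Callable[[str], str],
    examples: Iterable[tuple[str, str]],
    num_examples: Optional[int] = None,
    abort_on_error: bool = False,
    max_workers: int = 10,
) -> dict[str, Union[str, Exception]]:
    """Run `task` on (example_id, input) pairs; collect outputs or errors."""
    # Only the first `num_examples` examples are used when a limit is given.
    selected = list(islice(examples, num_examples))

    def run_one(example: tuple[str, str]) -> tuple[str, Union[str, Exception]]:
        example_id, task_input = example
        try:
            return example_id, task(task_input)
        except Exception as error:
            if abort_on_error:
                raise  # propagate the error and stop the whole run
            return example_id, error  # record the failure and continue

    # Up to `max_workers` examples are processed concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return dict(executor.map(run_one, selected))


def shout(text: str) -> str:
    """Toy task: fails on empty input, otherwise upper-cases it."""
    if not text:
        raise ValueError("empty input")
    return text.upper()


results = run_examples(
    shout,
    [("1", "hello"), ("2", ""), ("3", "world")],
    num_examples=2,
)
```

With abort_on_error left at False, the failing example is recorded next to the successful one instead of interrupting the run, which mirrors the distinction between failed and successful example outputs in the RunRepository.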
- run_is_already_computed(metadata: dict[str, JsonSerializable]) bool[source]
Checks if a run with the given metadata has already been computed.
- Parameters:
metadata – The metadata dictionary to check.
- Returns:
True if a run with the same metadata has already been computed. False otherwise.
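The check can be thought of as a comparison of the given metadata against the metadata of previously stored run overviews. The sketch below assumes exact dictionary equality; how the SDK actually compares metadata is not stated here, and metadata_already_seen is a hypothetical helper.

```python
from typing import Any, Iterable, Mapping


def metadata_already_seen(
    stored_metadata: Iterable[Mapping[str, Any]],
    metadata: Mapping[str, Any],
) -> bool:
    """Return True if any stored run carries exactly the given metadata."""
    # Assumption: two runs count as "the same" when their metadata dicts are equal.
    return any(dict(existing) == dict(metadata) for existing in stored_metadata)


previous_runs = [
    {"model": "model-a", "temperature": 0.0},
    {"model": "model-b", "temperature": 0.0},
]

seen = metadata_already_seen(previous_runs, {"temperature": 0.0, "model": "model-a"})
unseen = metadata_already_seen(previous_runs, {"model": "model-c", "temperature": 0.0})
```

A typical use is to skip an expensive run when an identical configuration has already been evaluated.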
- run_lineage(run_id: str, example_id: str, expected_output_type: type[ExpectedOutput]) RunLineage[Input, ExpectedOutput, Output] | None[source]
Wrapper for RepositoryNavigator.run_lineage.
- Parameters:
run_id – The id of the run
example_id – The id of the example of interest
expected_output_type – The type of the expected output as defined by the Example.
- Returns:
The RunLineage for the given run id and example id, None if the example or an output for the example does not exist.
- class SingleHuggingfaceDatasetRepository(huggingface_dataset: DatasetDict | Dataset | IterableDatasetDict | IterableDataset)[source]
Bases:
DatasetRepository
- create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) Dataset[source]
Creates a dataset from the given Examples and returns the created Dataset.
- Parameters:
examples – An Iterable of Examples to be saved in the same dataset.
dataset_name – A name for the dataset.
id – The dataset ID. If None, an ID will be generated.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the dataset. Defaults to an empty dict.
- Returns:
The created Dataset.
- dataset(dataset_id: str) Dataset | None[source]
Returns a dataset identified by the given dataset ID.
- Parameters:
dataset_id – Dataset ID of the dataset to retrieve.
- Returns:
Dataset if it was found, None otherwise.
- dataset_ids() Iterable[str][source]
Returns all dataset IDs, sorted.
- Returns:
Iterable of dataset IDs.
- datasets() Iterable[Dataset]
Returns all Datasets sorted by their ID.
- Yields:
Datasets.
- delete_dataset(dataset_id: str) None[source]
Deletes a dataset identified by the given dataset ID.
- Parameters:
dataset_id – Dataset ID of the dataset to delete.
- example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) Example | None[source]
Returns an Example for the given dataset ID and example ID.
- Parameters:
dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
- Returns:
Example if it was found, None otherwise.
- examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) Iterable[Example][source]
Returns all Examples for the given dataset ID sorted by their ID.
- Parameters:
dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional list of example IDs. Those examples will be excluded from the output.
- Returns:
Iterable of Examples.
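The effect of examples_to_skip combined with the sorting guarantee can be sketched as a filter followed by a sort. This is an illustrative sketch only: Example is a simplified stand-in dataclass and select_examples is a hypothetical helper, not part of the repository API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Example:
    """Stand-in for the SDK's Example (assumed shape)."""

    id: str
    input: str
    expected_output: Optional[str] = None


def select_examples(
    examples: Iterable[Example],
    examples_to_skip: Optional[frozenset[str]] = None,
) -> list[Example]:
    """Drop examples whose ID should be skipped, then sort the rest by ID."""
    skip = examples_to_skip or frozenset()
    return sorted((e for e in examples if e.id not in skip), key=lambda e: e.id)


stored = [Example("c", "third"), Example("a", "first"), Example("b", "second")]
remaining_ids = [e.id for e in select_examples(stored, frozenset({"b"}))]
```

Passing the IDs of already processed examples this way is one plausible way to resume an interrupted run without repeating work.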
- class SingleOutputEvaluationLogic[source]
Bases:
EvaluationLogic[Input, Output, ExpectedOutput, Evaluation]
- final do_evaluate(example: Example, *output: SuccessfulExampleOutput) Evaluation[source]
Executes the evaluation for this specific example.
Responsible for comparing the input & expected output of a task to the actually generated output.
- Parameters:
example – Input data of Task to produce the output.
*output – Output of the Task.
- Returns:
The metrics that come from the evaluated Task.
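Because do_evaluate is final and accepts a variable number of outputs, a single-output logic presumably unpacks exactly one output and hands it to a subclass hook. The sketch below illustrates that pattern with stand-in classes; the hook name do_evaluate_single_output and the error raised for a wrong number of outputs are assumptions, not confirmed by this reference.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """Stand-in for the SDK's Example (assumed shape)."""

    input: str
    expected_output: str


@dataclass(frozen=True)
class SuccessfulExampleOutput:
    """Stand-in mirroring the documented fields run_id, example_id, output."""

    run_id: str
    example_id: str
    output: str


class SingleOutputLogicSketch(ABC):
    """Illustrative base class: forwards exactly one output to a hook."""

    def do_evaluate(self, example: Example, *output: SuccessfulExampleOutput) -> bool:
        # Assumption: anything other than exactly one output is an error.
        if len(output) != 1:
            raise ValueError(f"expected exactly one output, got {len(output)}")
        return self.do_evaluate_single_output(example, output[0].output)

    @abstractmethod
    def do_evaluate_single_output(self, example: Example, output: str) -> bool:
        """Hypothetical hook implemented by concrete evaluation logics."""


class ExactMatchLogic(SingleOutputLogicSketch):
    """Toy logic: the evaluation is True when output equals the expectation."""

    def do_evaluate_single_output(self, example: Example, output: str) -> bool:
        return output == example.expected_output


logic = ExactMatchLogic()
example = Example(input="Capital of France?", expected_output="Paris")
single = SuccessfulExampleOutput(run_id="run-1", example_id="ex-1", output="Paris")
result = logic.do_evaluate(example, single)
```

The design keeps the variadic signature of the general evaluation logic while letting subclasses work with one plain output value.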
- class StudioBenchmark(benchmark_id: str, name: str, dataset_id: str, eval_logic: EvaluationLogic[Input, Output, ExpectedOutput, Evaluation], aggregation_logic: AggregationLogic[Evaluation, AggregatedEvaluation], studio_client: StudioClient, **kwargs: Any)[source]
Bases:
Benchmark
- execute(task: Task[Input, Output], name: str, description: str | None = None, labels: set[str] | None = None, metadata: dict[str, Any] | None = None, max_workers: int = 10) str[source]
Executes the benchmark on a given task.
- Parameters:
task – The task to be evaluated in the benchmark.
name – Name of the benchmark execution.
description – Description of the task to be evaluated.
labels – Labels for filtering or categorizing the benchmark.
metadata – Additional information about the task for logging or configuration.
max_workers – Maximum number of concurrent workers to use for the benchmark execution.
- Returns:
Identifier of the benchmark run.
- class StudioBenchmarkRepository(studio_client: StudioClient)[source]
Bases:
BenchmarkRepository
- create_benchmark(dataset_id: str, eval_logic: EvaluationLogic[Input, Output, ExpectedOutput, Evaluation], aggregation_logic: AggregationLogic[Evaluation, AggregatedEvaluation], name: str, metadata: dict[str, Any] | None = None, description: str | None = None) StudioBenchmark[source]
Creates a new benchmark and stores it in the repository.
- Parameters:
dataset_id – Identifier for the dataset associated with the benchmark.
eval_logic – Evaluation logic to be applied in the benchmark.
aggregation_logic – Aggregation logic for combining individual evaluations.
name – Name of the benchmark.
metadata – Additional information about the benchmark, defaults to None.
description – Description of the benchmark, defaults to None.
- Returns:
The created benchmark instance.
- get_benchmark(benchmark_id: str, eval_logic: EvaluationLogic[Input, Output, ExpectedOutput, Evaluation], aggregation_logic: AggregationLogic[Evaluation, AggregatedEvaluation], allow_diff: bool = False) StudioBenchmark | None[source]
Retrieves an existing benchmark from the repository.
- Parameters:
benchmark_id – Unique identifier for the benchmark to retrieve.
eval_logic – Evaluation logic to apply.
aggregation_logic – Aggregation logic to apply.
allow_diff – Retrieve the benchmark even if the behaviour of the given logics does not match that of the stored benchmark.
- Returns:
The retrieved benchmark instance. Raises ValueError if no benchmark is found.
- class StudioDatasetRepository(studio_client: StudioClient)[source]
Bases:
DatasetRepository
Dataset repository interface with Data Platform.
- create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) Dataset[source]
Creates a dataset from the given Examples and returns the created Dataset.
- Parameters:
examples – An Iterable of Examples to be saved in the same dataset.
dataset_name – A name for the dataset.
id – ID is not used in the StudioDatasetRepository as it is generated by the Studio.
labels – A set of labels for filtering. Defaults to None.
metadata – A dict for additional information about the dataset. Defaults to None.
- Returns:
The created Dataset.
- dataset(dataset_id: str) Dataset | None[source]
Returns a dataset identified by the given dataset ID.
- Parameters:
dataset_id – Dataset ID of the dataset to retrieve.
- Returns:
Dataset if it was found, None otherwise.
- dataset_ids() Iterable[str][source]
Returns all dataset IDs, sorted.
- Returns:
Iterable of dataset IDs.
- datasets() Iterable[Dataset][source]
Returns all Datasets. Sorting is not guaranteed.
- Returns:
Sequence of Datasets.
- delete_dataset(dataset_id: str) None[source]
Deletes a dataset identified by the given dataset ID.
- Parameters:
dataset_id – Dataset ID of the dataset to delete.
- example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) Example | None[source]
Returns an Example for the given dataset ID and example ID.
- Parameters:
dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
- Returns:
Example if it was found, None otherwise.
- examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) Iterable[Example][source]
Returns all Examples for the given dataset ID sorted by their ID.
- Parameters:
dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional list of example IDs. Those examples will be excluded from the output. Defaults to None.
- Returns:
Iterable of Examples.
- static map_to_example(example_to_map: StudioExample[Input, ExpectedOutput]) Example[source]
- static map_to_many_example(examples_to_map: Iterable[StudioExample[Input, ExpectedOutput]]) Iterable[Example][source]
- static map_to_many_studio_example(examples_to_map: Iterable[Example]) Iterable[StudioExample[Input, ExpectedOutput]][source]
- static map_to_studio_dataset(dataset_to_map: Dataset) StudioDataset[source]
- static map_to_studio_example(example_to_map: Example) StudioExample[Input, ExpectedOutput][source]
- class SuccessfulExampleOutput(*, run_id: str, example_id: str, output: Output)[source]
Bases:
BaseModel, Generic[Output]
Successful output of a single evaluated Example.
- run_id
Identifier of the run that created the output.
- Type:
str
- output
Generated when running the Task. This represents only the output of a successful run.
- Type:
pharia_inference_sdk.core.task.Output
- Generics:
Output: Interface of the output returned by the task.
- class WinRateCalculator(players: Iterable[str])[source]
Bases:
object
- calculate(matches: Sequence[ComparisonEvaluation]) Mapping[str, float][source]
- aggregation_overviews_to_pandas(aggregation_overviews: Sequence[AggregationOverview], unwrap_statistics: bool = True, strict: bool = True, unwrap_metadata: bool = True) DataFrame[source]
Converts aggregation overviews to a pandas table for easier comparison.
- Parameters:
aggregation_overviews – Overviews to convert.
unwrap_statistics – Unwrap the statistics field in the overviews into separate columns. Defaults to True.
strict – Allow only overviews with exactly equal statistics types. Defaults to True.
unwrap_metadata – Unwrap the metadata field in the overviews into separate columns. Defaults to True.
- Returns:
A pandas DataFrame containing an overview per row with fields as columns.
- evaluation_lineages_to_pandas(evaluation_lineages: Sequence[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]) DataFrame[source]
Converts a sequence of EvaluationLineage objects to a pandas DataFrame.
The EvaluationLineage objects are stored in the column “lineage”. The DataFrame is indexed by (example_id, evaluation_id, run_id). Each output of every lineage will contribute one row in the DataFrame.
- Parameters:
evaluation_lineages – The lineages to convert.
- Returns:
A pandas DataFrame with the data contained in the evaluation_lineages.
- run_lineages_to_pandas(run_lineages: Sequence[RunLineage[Input, ExpectedOutput, Output]]) DataFrame[source]
Converts a sequence of RunLineage objects to a pandas DataFrame.
The RunLineage objects are stored in the column “lineage”. The DataFrame is indexed by (example_id, run_id).
- Parameters:
run_lineages – The lineages to convert.
- Returns:
A pandas DataFrame with the data contained in the run_lineages.