gec_metrics.metrics.llm_kobayashi24 module

class gec_metrics.metrics.llm_kobayashi24.LLMKobayashi24(config: Config = None)[source]

Bases: MetricBaseForReferenceFree

class Config(model: str = None, cache: str = None, seed: int = 777, verbose: bool = False, criteria: str = None, instruction_template: str = 'The goal of this task is to rank the presented targets based on the quality of the sentences.\nAfter reading the source sentence and target sentences, please assign a score from a minimum of 1 point to a maximum of 5 points to each target based on the quality of the sentence (note that you can assign the same score multiple times).\n\n# source\n[SOURCE]\n\n# targets\n[CORRECTION]\n\n# output format\nThe output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":\n```json\n{\n"target1_score": int // assigned score for target 1\n...\n"targetN_score": int // assigned score for target N\n}\n```\n')[source]

Bases: Config

LLMKobayashi24 configuration.

Parameters:

model (str) – The name of the LLM. This is typically the ID of a Hugging Face model or an OpenAI model.
cache (str) – Filename for caching model inputs and outputs.
seed (int) – Seed value.
criteria (str) – Specifies the evaluation aspect. Can be one of None, ‘grammaticality’, ‘fluency’, or ‘meaning’.
instruction_template (str) – Template for the instruction. Must include [SOURCE] and [CORRECTION] placeholders.
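As an illustration of the placeholder requirement above, the following minimal sketch fills a shortened template. The helper name fill_template and the "targetN: ..." numbering of the targets are assumptions made for this example; how the library actually renders [CORRECTION] is not stated here.

```python
# Shortened stand-in for the default instruction template.
TEMPLATE = "# source\n[SOURCE]\n\n# targets\n[CORRECTION]\n"


def fill_template(template: str, source: str, targets: list[str]) -> str:
    """Substitute the two required placeholders (hypothetical helper)."""
    for placeholder in ("[SOURCE]", "[CORRECTION]"):
        if placeholder not in template:
            raise ValueError(f"template is missing the {placeholder} placeholder")
    # Assumption: targets are listed one per line and numbered from 1.
    numbered = "\n".join(f"target{i}: {t}" for i, t in enumerate(targets, start=1))
    return template.replace("[SOURCE]", source).replace("[CORRECTION]", numbered)


prompt = fill_template(TEMPLATE, "He go home .", ["He goes home .", "He went home ."])
print(prompt)
```

A custom instruction_template that omits either placeholder would be rejected by this sketch before any model call is made.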

cache: str = None

criteria: str = None

instruction_template: str = 'The goal of this task is to rank the presented targets based on the quality of the sentences.\nAfter reading the source sentence and target sentences, please assign a score from a minimum of 1 point to a maximum of 5 points to each target based on the quality of the sentence (note that you can assign the same score multiple times).\n\n# source\n[SOURCE]\n\n# targets\n[CORRECTION]\n\n# output format\nThe output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":\n```json\n{\n"target1_score": int // assigned score for target 1\n...\n"targetN_score": int // assigned score for target N\n}\n```\n'

model: str = None

seed: int = 777

verbose: bool = False

class LLMSentOutputFormat1(*, target_score1: int)[source]

Bases: BaseModel

model_config = {}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

target_score1: int

class LLMSentOutputFormat2(*, target_score1: int, target_score2: int)[source]

Bases: BaseModel

model_config = {}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

target_score1: int

target_score2: int

class LLMSentOutputFormat3(*, target_score1: int, target_score2: int, target_score3: int)[source]

Bases: BaseModel

model_config = {}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

target_score1: int

target_score2: int

target_score3: int

class LLMSentOutputFormat4(*, target_score1: int, target_score2: int, target_score3: int, target_score4: int)[source]

Bases: BaseModel

model_config = {}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

target_score1: int

target_score2: int

target_score3: int

target_score4: int

class LLMSentOutputFormat5(*, target_score1: int, target_score2: int, target_score3: int, target_score4: int, target_score5: int)[source]

Bases: BaseModel

model_config = {}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

target_score1: int

target_score2: int

target_score3: int

target_score4: int

target_score5: int

append_to_jsonl(file_name, data)[source]

abstractmethod call_client(instruction: str, output_format: BaseModel)[source]: Implement the forward pass that sends the given instruction to the LLM. Implementations may refer to output_format to structure the response.

create_hash(prompt)[source]

hyp_form(src: str, hyp: str) → str[source]

Changes the format of the hypothesis, e.g., to an edit representation.

Parameters:

src (str) – Source sentence.
hyp (str) – Hypothesis sentence.

Returns:

Another representation of the hypothesis.

Return type:

str

index_multiple(elems: list, target_e) → list[int][source]

A multi-match version of list.index(): returns every index at which target_e occurs in elems.

Parameters:

elems (list[any]) – The list containing any values.
target_e (any) – Value for which you want to know the index.

Returns:

Indices of target_e in elems.

Return type:

list[int]
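The behavior described above can be sketched in a few lines. This is an illustrative standalone function written from the description, not necessarily the library's own code.

```python
def index_multiple(elems: list, target_e) -> list[int]:
    """Return every index i such that elems[i] == target_e."""
    return [i for i, e in enumerate(elems) if e == target_e]


indices = index_multiple(["a", "b", "a"], "a")
print(indices)
```

Unlike list.index(), this sketch returns an empty list instead of raising ValueError when the value is absent.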

abstractmethod load_client()[source]: Load the LLM client, e.g. OpenAI() or .from_pretrained() for a Hugging Face model.

load_json(file_name)[source]

output_formats = [<class 'gec_metrics.metrics.llm_kobayashi24.LLMKobayashi24.LLMSentOutputFormat1'>, <class 'gec_metrics.metrics.llm_kobayashi24.LLMKobayashi24.LLMSentOutputFormat2'>, <class 'gec_metrics.metrics.llm_kobayashi24.LLMKobayashi24.LLMSentOutputFormat3'>, <class 'gec_metrics.metrics.llm_kobayashi24.LLMKobayashi24.LLMSentOutputFormat4'>, <class 'gec_metrics.metrics.llm_kobayashi24.LLMKobayashi24.LLMSentOutputFormat5'>]
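The instruction templates ask the model to reply with a fenced JSON object holding one integer score per target. The sketch below shows one way such a reply could be parsed with the standard library only. The function name parse_scores is hypothetical, and this is not the library's own parsing code. Because the templates use keys such as target1_score while the output-format classes name their fields target_score1, the sketch orders scores by the number embedded in each key and so accepts either spelling.

```python
import json
import re

# Built at runtime so that this example can itself sit inside a fenced block.
FENCE = "`" * 3


def parse_scores(reply: str) -> list[int]:
    """Extract the scores from a fenced JSON reply, ordered by target number."""
    pattern = re.escape(FENCE) + r"json\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no fenced JSON block found in the reply")
    data = json.loads(match.group(1))

    def key_index(key: str) -> int:
        # Assumption: every key contains its target number as the first digit run.
        return int(re.search(r"\d+", key).group())

    return [int(data[k]) for k in sorted(data, key=key_index)]


reply = "Scores:\n" + FENCE + 'json\n{"target2_score": 3, "target1_score": 5}\n' + FENCE
print(parse_scores(reply))
```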

sample_sentences(hypotheses: list[str], max_n: int = 5) → list[str][source]

Sample max_n sentences from hypotheses. The LLMKobayashi24 metric accepts at most five hypotheses at a time, so when there are more than five distinct hypotheses, five of them must be sampled. This implementation uses a simple strategy: it chooses the five most frequent hypotheses.

Parameters:

hypotheses (list[str]) – Hypotheses of all systems.
max_n (int) – The number of sentences to be sampled.

Returns:

The sampled sentences.

Return type:

list[str]
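The frequency-based strategy described above can be sketched with collections.Counter. The function name sample_most_frequent is hypothetical, and the tie-breaking shown here (first-encountered order, which is what Counter.most_common provides) is an assumption about details the description leaves open.

```python
from collections import Counter


def sample_most_frequent(hypotheses: list[str], max_n: int = 5) -> list[str]:
    """Return up to max_n distinct hypotheses, most frequent first."""
    counts = Counter(hypotheses)
    # most_common() orders equal counts by first appearance.
    return [sentence for sentence, _ in counts.most_common(max_n)]


sampled = sample_most_frequent(["a", "b", "a", "c", "b", "a"], max_n=2)
print(sampled)
```

Choosing by frequency favors outputs that many systems agree on, so each scored sentence covers as many systems as possible.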

score_pairwise(sources: list[str], hypotheses: list[list[str]])[source]

Calculate pairwise scores for all combinations of hypotheses. By default, it simply compares the sentence-level scores.

Parameters:

sources (list[str]) – Source sentence. The shape is (num_sentences, )
hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

Returns:

Pairwise comparison results. The shape is (num_sentences, num_systems, num_systems). Each element is -1, 0, or 1:

0: tie
1: sys_id1 wins against sys_id2
-1: sys_id1 loses to sys_id2

Return type:

list[list[list]]
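To make the result layout concrete, the following sketch derives the pairwise matrix from already-computed sentence-level scores. The function name pairwise_from_scores is hypothetical, and the sketch assumes that a higher score means a better hypothesis; it illustrates the default comparison described above rather than reproducing the library's code.

```python
def pairwise_from_scores(scores: list[list[float]]) -> list[list[list[int]]]:
    """Compare systems sentence by sentence.

    scores has shape (num_systems, num_sentences); the result has shape
    (num_sentences, num_systems, num_systems).
    """
    num_systems = len(scores)
    num_sentences = len(scores[0]) if scores else 0
    results = []
    for sent_id in range(num_sentences):
        matrix = []
        for sys_id1 in range(num_systems):
            row = []
            for sys_id2 in range(num_systems):
                s1 = scores[sys_id1][sent_id]
                s2 = scores[sys_id2][sent_id]
                # 1 if sys_id1 wins, -1 if it loses, 0 on a tie.
                row.append((s1 > s2) - (s1 < s2))
            matrix.append(row)
        results.append(matrix)
    return results


pairwise = pairwise_from_scores([[5, 3], [4, 3]])
print(pairwise)
```

In this layout the diagonal is always 0, and each matrix is antisymmetric: the entry for (sys_id1, sys_id2) is the negation of the entry for (sys_id2, sys_id1).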

score_sentence(sources: list[str], hypotheses: list[str])[source]

Calculate sentence-level scores.

Parameters:

sources (list[str]) – Source sentence. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

Returns:

The sentence-level scores.

Return type:

list[float]

serialize(obj)[source]

class gec_metrics.metrics.llm_kobayashi24.LLMKobayashi24HFEdit(config: Config = None)[source]

Bases: LLMKobayashi24HFSent

LLM-E with Hugging Face models.

class Config(model: str = 'meta-llama/Llama-2-13b-chat-hf', cache: str = None, seed: int = 777, verbose: bool = False, criteria: str = None, instruction_template: str = 'The goal of this task is to rank the presented targets based on the quality of the sentences.\nAfter reading the source sentence and target sentences, please assign a score from a minimum of 1 point to a maximum of 5 points to each target based on the quality of the sentence (note that you can assign the same score multiple times).\nFor targets without any edits, if the sentence is correct, they will be awarded 5 points; if there is an error, they will receive 1 point.\nThe edits in each target are indicated as follows:\nInsert "the": [→the]\nDelete "the": [the→]\nReplace "the" with "a": [the→a]\n\n# context\n[SOURCE]\n\n# targets\n[CORRECTION]\n\n# output format\nThe output should be a markdown code snippet formatted in the\nfollowing schema, including the leading and trailing "```json" and "```":\n```json\n{\n"target1_score": int // assigned score for target 1\n...\n"targetN_score": int // assigned score for target N\n}\n```\n', quantization: str = '8bit', dtype: str = 'bfloat16')[source]

Bases: Config

instruction_template: str = 'The goal of this task is to rank the presented targets based on the quality of the sentences.\nAfter reading the source sentence and target sentences, please assign a score from a minimum of 1 point to a maximum of 5 points to each target based on the quality of the sentence (note that you can assign the same score multiple times).\nFor targets without any edits, if the sentence is correct, they will be awarded 5 points; if there is an error, they will receive 1 point.\nThe edits in each target are indicated as follows:\nInsert "the": [→the]\nDelete "the": [the→]\nReplace "the" with "a": [the→a]\n\n# context\n[SOURCE]\n\n# targets\n[CORRECTION]\n\n# output format\nThe output should be a markdown code snippet formatted in the\nfollowing schema, including the leading and trailing "```json" and "```":\n```json\n{\n"target1_score": int // assigned score for target 1\n...\n"targetN_score": int // assigned score for target N\n}\n```\n'

hyp_form(src: str, hyp: str) → str[source]

Convert the hypothesis sentence into an edit sequence.

Parameters:

src (str) – Source sentence.
hyp (str) – Hypothesis sentence.

Returns:

Another representation of the hypothesis.

Return type:

str
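The edit notation used in the instruction template ([→the] for an insertion, [the→] for a deletion, [the→a] for a replacement) can be illustrated with a token-level diff. The sketch below uses difflib.SequenceMatcher from the standard library purely for illustration; the function name to_edit_sequence is hypothetical, and the library may extract edits with a different tool and produce different edit boundaries.

```python
from difflib import SequenceMatcher


def to_edit_sequence(src: str, hyp: str) -> str:
    """Render hyp as src with bracketed edits, using whitespace tokens."""
    src_toks, hyp_toks = src.split(), hyp.split()
    matcher = SequenceMatcher(a=src_toks, b=hyp_toks, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(src_toks[i1:i2])
        else:
            # insert: empty left side; delete: empty right side; replace: both.
            before = " ".join(src_toks[i1:i2])
            after = " ".join(hyp_toks[j1:j2])
            out.append(f"[{before}→{after}]")
    return " ".join(out)


edited = to_edit_sequence("He go to school .", "He goes to the school .")
print(edited)  # He [go→goes] to [→the] school .
```

A hypothesis identical to the source comes back unchanged, which matches the template's separate instruction for targets without any edits.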

class gec_metrics.metrics.llm_kobayashi24.LLMKobayashi24HFSent(config=None)[source]

Bases: LLMKobayashi24

LLM-S with Hugging Face models.

class Config(model: str = 'meta-llama/Llama-2-13b-chat-hf', cache: str = None, seed: int = 777, verbose: bool = False, criteria: str = None, instruction_template: str = 'The goal of this task is to rank the presented targets based on the quality of the sentences.\nAfter reading the source sentence and target sentences, please assign a score from a minimum of 1 point to a maximum of 5 points to each target based on the quality of the sentence (note that you can assign the same score multiple times).\n\n# source\n[SOURCE]\n\n# targets\n[CORRECTION]\n\n# output format\nThe output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":\n```json\n{\n"target1_score": int // assigned score for target 1\n...\n"targetN_score": int // assigned score for target N\n}\n```\n', quantization: str = '8bit', dtype: str = 'bfloat16')[source]

Bases: Config

Hugging Face configuration.

Parameters:

model (str) – Hugging Face Model name.
quantization (str) – Quantization setting. None, ‘4bit’ or ‘8bit’.
dtype (str)

dtype: str = 'bfloat16'

model: str = 'meta-llama/Llama-2-13b-chat-hf'

quantization: str = '8bit'

call_client(instruction, response_format)[source]: Run the forward pass that sends the given instruction to the LLM. The implementation may refer to response_format to structure the response.

load_client()[source]: Load the LLM client, e.g. OpenAI() or .from_pretrained() for a Hugging Face model.

class gec_metrics.metrics.llm_kobayashi24.LLMKobayashi24OpenAIEdit(config: Config = None)[source]

Bases: LLMKobayashi24OpenAISent

LLM-E with OpenAI models.

class Config(model: str = 'gpt-4o-mini-2024-07-18', cache: str = None, seed: int = 777, verbose: bool = False, criteria: str = None, instruction_template: str = 'The goal of this task is to rank the presented targets based on the quality of the sentences.\nAfter reading the source sentence and target sentences, please assign a score from a minimum of 1 point to a maximum of 5 points to each target based on the quality of the sentence (note that you can assign the same score multiple times).\nFor targets without any edits, if the sentence is correct, they will be awarded 5 points; if there is an error, they will receive 1 point.\nThe edits in each target are indicated as follows:\nInsert "the": [→the]\nDelete "the": [the→]\nReplace "the" with "a": [the→a]\n\n# context\n[SOURCE]\n\n# targets\n[CORRECTION]\n\n# output format\nThe output should be a markdown code snippet formatted in the\nfollowing schema, including the leading and trailing "```json" and "```":\n```json\n{\n"target1_score": int // assigned score for target 1\n...\n"targetN_score": int // assigned score for target N\n}\n```\n', organization: str = None, api_key: str = None, base_url: str = None)[source]

Bases: Config

instruction_template: str = 'The goal of this task is to rank the presented targets based on the quality of the sentences.\nAfter reading the source sentence and target sentences, please assign a score from a minimum of 1 point to a maximum of 5 points to each target based on the quality of the sentence (note that you can assign the same score multiple times).\nFor targets without any edits, if the sentence is correct, they will be awarded 5 points; if there is an error, they will receive 1 point.\nThe edits in each target are indicated as follows:\nInsert "the": [→the]\nDelete "the": [the→]\nReplace "the" with "a": [the→a]\n\n# context\n[SOURCE]\n\n# targets\n[CORRECTION]\n\n# output format\nThe output should be a markdown code snippet formatted in the\nfollowing schema, including the leading and trailing "```json" and "```":\n```json\n{\n"target1_score": int // assigned score for target 1\n...\n"targetN_score": int // assigned score for target N\n}\n```\n'

hyp_form(src: str, hyp: str) → str[source]

Changes the format of the hypothesis, e.g., to an edit representation.

Parameters:

src (str) – Source sentence.
hyp (str) – Hypothesis sentence.

Returns:

Another representation of the hypothesis.

Return type:

str

class gec_metrics.metrics.llm_kobayashi24.LLMKobayashi24OpenAISent(config: Config = None)[source]

Bases: LLMKobayashi24

LLM-S with OpenAI models.

class Config(model: str = 'gpt-4o-mini-2024-07-18', cache: str = None, seed: int = 777, verbose: bool = False, criteria: str = None, instruction_template: str = 'The goal of this task is to rank the presented targets based on the quality of the sentences.\nAfter reading the source sentence and target sentences, please assign a score from a minimum of 1 point to a maximum of 5 points to each target based on the quality of the sentence (note that you can assign the same score multiple times).\n\n# source\n[SOURCE]\n\n# targets\n[CORRECTION]\n\n# output format\nThe output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":\n```json\n{\n"target1_score": int // assigned score for target 1\n...\n"targetN_score": int // assigned score for target N\n}\n```\n', organization: str = None, api_key: str = None, base_url: str = None)[source]

Bases: Config

OpenAI configuration.

Parameters:

model (str) – Model name.
organization (str) – Your organization key.
api_key (str) – Your API key.
base_url (str) – When using Gemini models, specify the appropriate URL.

api_key: str = None

base_url: str = None

model: str = 'gpt-4o-mini-2024-07-18'

organization: str = None

call_client(instruction, response_format)[source]: Run the forward pass that sends the given instruction to the LLM. The implementation may refer to response_format to structure the response.

load_client()[source]: Load the LLM client, e.g. OpenAI() or .from_pretrained() for a Hugging Face model.