gec_metrics.metrics package

Submodules

Module contents

class gec_metrics.metrics.BertScore(config: Config = None)[source]

Bases: MetricBaseForSourceFree

class Config(model_type: str = 'bert-base-uncased', num_layers: int = None, batch_size: int = 64, nthreads: int = 4, all_layers: bool = False, idf: bool = False, idf_sents: list[str] = None, lang: str = 'en', rescale_with_baseline: bool = True, baseline_path: str = None, use_fast_tokenizer: bool = False, score_type: str = 'f')[source]

Bases: Config

BERTScore configuration.

  • model_type (str): Embedding model.

  • num_layers (int): The layer of representation to use.

    If None, the pre-defined one is used. (See bert_score.utils.model2layers.)

  • nthreads (int): Number of threads.

  • idf (bool): Whether to use idf or not.

  • idf_sents (list[str]): Sentences to compute idf weights.

  • rescale_with_baseline (bool): Whether to rescale scores with the baseline.

  • baseline_path (str): Path to .tsv file.

    If None, the pre-defined one is used. (See bert_score.rescale_baseline.*.tsv)

  • use_fast_tokenizer (bool): Whether to use fast tokenizer.

  • score_type (str): “p” (precision) or “r” (recall) or “f” (F1) score.

all_layers: bool = False
baseline_path: str = None
batch_size: int = 64
idf: bool = False
idf_sents: list[str] = None
lang: str = 'en'
model_type: str = 'bert-base-uncased'
nthreads: int = 4
num_layers: int = None
rescale_with_baseline: bool = True
score_type: str = 'f'
use_fast_tokenizer: bool = False
score_sentence(hypotheses: list[str], references: list[list[str]]) list[float][source]

Calculate sentence-level scores.

Parameters:
  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

Returns:

The sentence-level scores.

Return type:

list[float]
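
The `references` argument is laid out reference-major, i.e. (num_references, num_sentences), while datasets often store references per sentence. The following is a minimal sketch of the conversion, assuming every sentence has the same number of references. The helper name `to_reference_major` is hypothetical and not part of gec_metrics.

```python
def to_reference_major(per_sentence_refs: list[list[str]]) -> list[list[str]]:
    """Transpose (num_sentences, num_references) into (num_references, num_sentences)."""
    ref_counts = {len(refs) for refs in per_sentence_refs}
    if len(ref_counts) > 1:
        raise ValueError("every sentence must have the same number of references")
    # zip(*rows) yields one column per reference index.
    return [list(column) for column in zip(*per_sentence_refs)]


per_sentence = [
    ["He goes to school .", "He went to school ."],  # references for sentence 0
    ["I like apples .", "I love apples ."],          # references for sentence 1
]
references = to_reference_major(per_sentence)
```

After the conversion, `references[r][s]` is the r-th reference for the s-th sentence, which is the layout the scoring methods expect.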

class gec_metrics.metrics.ERRANT(config: Config = None)[source]

Bases: MetricBaseForReferenceBased

class Config(beta: float = 0.5, language: str = 'en')[source]

Bases: Config

ERRANT configuration.

  • beta (float): The beta for F-beta score.

  • language (str): The language for spaCy.

beta: float = 0.5
language: str = 'en'
aggregate_to_overall(scores: dict[str, Score]) Score[source]

Convert error type-wise scores into an overall score.

Parameters:

scores (dict[str, "Score"]) – Error type-wise scores.

Returns:

The aggregated score.

Return type:

Score

cached_parse(sent: str) Doc[source]

Efficient parse() by caching.

Parameters:

sent (str) – The sentence to be parsed.

Returns:

The parse results.

Return type:

spacy.tokens.doc.Doc

edit_extraction(src: str, trg: str) list[Edit][source]

Extract edits given a source and a corrected sentence.

Parameters:
  • src (str) – The source sentence.

  • trg (str) – The corrected sentence.

Returns:

Extracted edits.

Return type:

list[errant.edit.Edit]

filter_edits(edits: list[Edit]) list[Edit][source]

Handle edits that will be ignored.

score_base(sources: list[str], hypotheses: list[str], references: list[list[str]]) list[list[dict[str, Score]]][source]
Calculate scores while retaining sentence and reference boundaries. The results can be aggregated according to the purpose, e.g., at sentence level or corpus level.

Parameters:
  • sources (list[str]) – Source sentence.

  • hypotheses (list[str]) – Corrected sentences.

  • references (list[list[str]]) – Reference sentences. The shape is (the number of references, the number of sentences).

Returns:

The verbose scores.
  • The list shape is (num_sents, num_refs)

  • The dict contains error type-wise scores.

Return type:

list[list[dict[str, “Score”]]]

score_corpus(sources: list[str], hypotheses: list[str], references: list[list[str]]) float[source]

Calculate a corpus-level score. This accumulates the edit counts for TP, FP, and FN, then calculates the F-beta score.

Parameters:
  • sources (list[str]) – Source sentence. The shape is (num_sentences, )

  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

Returns:

The corpus-level score.

Return type:

float
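
The accumulation described for score_corpus can be sketched as follows: counts are summed over sentences first, and a single F-beta is computed from the totals. This is an illustration only. The function `f_beta`, the sample counts, and the choice to return 0.0 for empty denominators are assumptions (implementations differ on that edge case).

```python
def f_beta(tp: float, fp: float, fn: float, beta: float = 0.5) -> float:
    """F-beta from raw counts; empty denominators yield 0.0 here (an assumption)."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical per-sentence edit counts: (TP, FP, FN).
counts = [(2, 1, 0), (0, 1, 2), (1, 0, 1)]

# Corpus level: sum the counts first, then compute one F-beta from the totals.
tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
corpus_f = f_beta(tp, fp, fn, beta=0.5)

# For contrast: averaging sentence-level F-beta scores generally gives a different number.
mean_sentence_f = sum(f_beta(*c, beta=0.5) for c in counts) / len(counts)
```

With beta = 0.5, precision is weighted more heavily than recall, which matches the default of ERRANT.Config.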

score_corpus_etype(sources: list[str], hypotheses: list[str], references: list[list[str]], cat: int = 2)[source]

Calculate error-type-level scores.

Parameters:
  • sources (list[str]) – Source sentence. The shape is (num_sentences, )

  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

  • cat (int) –

    Error type category, following the original ERRANT:

    • cat=1: Operation, e.g. M, R, U.

    • cat=2: Main types, e.g. NOUN, VERB.

    • cat=3: All types, e.g. M:NOUN, R:VERB.

Returns:

The error-type-level score.

Each key is an error type, and value is the Score instance.

Return type:

dict[str, Score]
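
The three granularities of `cat` can be illustrated with a small helper. This is a sketch that assumes error types are strings of the form "<operation>:<type>", such as "R:VERB". The function name `coarsen_error_type` is hypothetical and the library may group types differently.

```python
def coarsen_error_type(error_type: str, cat: int = 2) -> str:
    """Map a full ERRANT-style type such as 'R:VERB' to the requested granularity."""
    if cat == 1:
        return error_type[0]    # operation only: 'M', 'R' or 'U'
    if cat == 2:
        return error_type[2:]   # drop the '<operation>:' prefix
    if cat == 3:
        return error_type       # keep the full type
    raise ValueError("cat must be 1, 2 or 3")


keys = [coarsen_error_type("R:VERB", cat) for cat in (1, 2, 3)]
```

Scores that share the same coarsened key would then be accumulated into one Score per key.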

score_corpus_verbose(sources: list[str], hypotheses: list[str], references: list[list[str]]) Score[source]

Calculate a corpus-level score by aggregating verbose scores.

Parameters:
  • sources (list[str]) – Source sentence.

  • hypotheses (list[str]) – Corrected sentences.

  • references (list[list[str]]) – Reference sentences. The shape is (the number of references, the number of sentences).

Returns:

It contains TP, FP, FN, Precision, Recall, and F-beta.

Return type:

Score

score_sentence(sources: list[str], hypotheses: list[str], references: list[list[str]]) list[float][source]

Calculate sentence-level scores.

Parameters:
  • sources (list[str]) – Source sentence. The shape is (num_sentences, )

  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

Returns:

The sentence-level scores.

Return type:

list[float]

score_sentence_verbose(sources: list[str], hypotheses: list[str], references: list[list[str]]) list[Score][source]

Calculate sentence-level scores by aggregating verbose scores. “Verbose” means that TP, FP, FN, Precision, Recall, and F are available.

Parameters:
  • sources (list[str]) – Source sentence. The shape is (num_sentences, )

  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

Returns:

The sentence-level scores.

Return type:

list[Score]

class gec_metrics.metrics.GLEU(config: Config = None)[source]

Bases: GREEN

GLEU implemented using GREEN reformulation (https://aclanthology.org/2024.inlg-main.25.pdf).

class Config(iter: int = 500, n: int = 4, unit: str = 'word')[source]

Bases: Config

GLEU configuration.

  • iter (int): The number of iterations.

  • n (int): The maximum n of n-gram.

  • unit (str): Word-level or character-level. Can be ‘word’ or ‘char’.

iter: int = 500
n: int = 4
unit: str = 'word'
aggregate_score(scores: list[Score], hyp_len: int, ref_len: int) float[source]

Aggregate n-gram scores to an overall score by the geometric mean.

Parameters:
  • scores (list[Score]) – The scores keeping n-gram boundary. The shape is (n, )

  • hyp_len (int) – The length of the hypothesis.

  • ref_len (int) – The length of the reference.

Returns:

The aggregated score.

Return type:

float

score_base(sources: list[str], hypotheses: list[str], references: list[list[str]]) float[source]
Compute True Positive and False Positive counts using GREEN’s reformulation (https://aclanthology.org/2024.inlg-main.25.pdf).

The actual equation is (TI + TK - UD) / (TI + TK + OI + UD), thus we regard

  • True Positive (TP) as TI + TK - UD,

  • False Positive (FP) as OI + 2*UD.

Finally, precision = TP / (TP + FP) will be the GLEU score.

Parameters:
  • sources (list[str]) – Source sentence.

  • hypotheses (list[str]) – Corrected sentences.

  • references (list[list[str]]) – Reference sentences. The shape is (the number of references, the number of sentences).

Returns:

The verbose scores.

The shape is (num_iterations, num_sents, max_ngram).

list[list[int]]: The length for the hypotheses.

The shape is (num_iterations, num_sents)

list[list[int]]: The length for the references.

The shape is (num_iterations, num_sents)

Return type:

list[list[list[“Score”]]]
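
The TP/FP reformulation used by GLEU.score_base can be checked numerically. The sketch below takes the four counts TI, TK, OI, and UD as plain numbers (the sample values are made up) and confirms that TP / (TP + FP) equals the direct equation. The function name `gleu_precision` is hypothetical.

```python
def gleu_precision(ti: float, tk: float, oi: float, ud: float) -> float:
    """Return TP / (TP + FP) with TP = TI + TK - UD and FP = OI + 2 * UD."""
    tp = ti + tk - ud
    fp = oi + 2 * ud
    return tp / (tp + fp) if tp + fp > 0 else 0.0


# Hypothetical n-gram counts for one sentence and one n-gram order.
ti, tk, oi, ud = 2.0, 5.0, 1.0, 1.0
score = gleu_precision(ti, tk, oi, ud)

# The direct form of the equation gives the same value,
# because TP + FP = (TI + TK - UD) + (OI + 2*UD) = TI + TK + OI + UD.
direct = (ti + tk - ud) / (ti + tk + oi + ud)
```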

score_corpus(sources: list[str], hypotheses: list[str], references: list[list[str]]) float[source]

Calculate a corpus-level score.

Parameters:
  • sources (list[str]) – Source sentence. The shape is (num_sentences, )

  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

Returns:

The corpus-level score.

Return type:

float

score_sentence(sources: list[str], hypotheses: list[str], references: list[list[str]]) float[source]

Calculate sentence-level scores.

Parameters:
  • sources (list[str]) – Source sentence. The shape is (num_sentences, )

  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

Returns:

The sentence-level scores.

Return type:

list[float]

class gec_metrics.metrics.GLEUOfficial(config: Config = None)[source]

Bases: GLEU

score_base(sources: list[str], hypotheses: list[str], references: list[list[str]]) Tuple[list[list[list[Score]]], list[list[int]], list[list[int]]][source]
The official implementation contains an error where the frequency of n-grams is ignored in the calculation of SR. As a result, when an n-gram is classified into both TK and UD, it is entirely counted as TK.

Parameters:
  • sources (list[str]) – Source sentence.

  • hypotheses (list[str]) – Corrected sentences.

  • references (list[list[str]]) – Reference sentences. The shape is (the number of references, the number of sentences).

Returns:

The verbose scores.

The shape is (num_iterations, num_sents, max_ngram).

list[list[int]]: The length for the hypotheses.

The shape is (num_iterations, num_sents)

list[list[int]]: The length for the references.

The shape is (num_iterations, num_sents)

Return type:

list[list[list[“Score”]]]

class gec_metrics.metrics.GREEN(config: Config = None)[source]

Bases: MetricBaseForReferenceBased

class Config(n: int = 4, beta: float = 2.0, unit: str = 'word')[source]

Bases: Config

GREEN configuration.

  • n (int): Maximum n for n-gram.

  • beta (float): The beta for F-beta score.

  • unit (str): Word-level or character-level. Can be ‘word’ or ‘char’.

beta: float = 2.0
n: int = 4
unit: str = 'word'
aggregate_score(scores: list[Score]) float[source]

Aggregate n-gram scores to an overall score by the geometric mean.

Parameters:

scores (list[Score]) – The scores keeping n-gram boundary. The shape is (n, )

Returns:

The aggregated score.

Return type:

float
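
Aggregation by the geometric mean can be sketched as below. The example assumes one scalar score per n-gram order (for instance an F-beta value for each n) and computes the mean in log space. The function name, the sample values, and the rule that any non-positive score makes the result 0.0 are assumptions.

```python
import math


def geometric_mean(values: list[float]) -> float:
    """Geometric mean computed in log space; any non-positive value gives 0.0."""
    if not values:
        raise ValueError("values must not be empty")
    if any(v <= 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-n scores for n = 1..4.
per_n_scores = [0.8, 0.6, 0.5, 0.4]
overall = geometric_mean(per_n_scores)
```

Unlike the arithmetic mean, the geometric mean drops to zero as soon as one n-gram order scores zero, so short sentences with no matching 4-grams are penalized heavily.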

cached_get_all_ngrams(sentence: str) dict[str, int][source]

Get frequency of n-gram for all n (1 <= n <= config.n)
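
A cached n-gram counter of this kind might look like the following sketch. It assumes whitespace tokenization for unit='word' and per-character tokens for unit='char', with word n-grams keyed by space-joined tokens. The names `get_all_ngrams` and `_ngram_items` are hypothetical stand-ins, not the library's implementation.

```python
from collections import Counter
from functools import lru_cache


@lru_cache(maxsize=None)
def _ngram_items(sentence: str, max_n: int, unit: str) -> tuple[tuple[str, int], ...]:
    """Count all n-grams for 1 <= n <= max_n; the immutable result is safe to cache."""
    tokens = sentence.split() if unit == "word" else list(sentence)
    sep = " " if unit == "word" else ""
    counts = Counter(
        sep.join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    )
    return tuple(counts.items())


def get_all_ngrams(sentence: str, max_n: int = 4, unit: str = "word") -> dict[str, int]:
    """Return a fresh dict so callers cannot mutate the cached value."""
    return dict(_ngram_items(sentence, max_n, unit))


ngrams = get_all_ngrams("a b a b", max_n=2)
```

Caching pays off because the same source and reference sentences are counted repeatedly when several systems are scored on one dataset.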

score_base(sources: list[str], hypotheses: list[str], references: list[list[str]]) list[list[list[Score]]][source]
Calculate scores while retaining sentence and reference boundaries. The results can be aggregated according to the purpose, e.g., at sentence level or corpus level.

Parameters:
  • sources (list[str]) – Source sentence.

  • hypotheses (list[str]) – Corrected sentences.

  • references (list[list[str]]) – Reference sentences. The shape is (the number of references, the number of sentences).

Returns:

The verbose scores.

The shape is (num_iterations, num_sents, max_ngram).

Return type:

list[list[list[“Score”]]]

score_corpus(sources: list[str], hypotheses: list[str], references: list[list[str]]) float[source]

Calculate a corpus-level score. This accumulates the n-gram counts for TP, FP, and FN, then calculates the F-beta score.

Parameters:
  • sources (list[str]) – Source sentence. The shape is (num_sentences, )

  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

Returns:

The corpus-level score.

Return type:

float

score_sentence(sources: list[str], hypotheses: list[str], references: list[list[str]]) list[float][source]

Calculate sentence-level scores.

Parameters:
  • sources (list[str]) – Source sentence. The shape is (num_sentences, )

  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

Returns:

The sentence-level scores.

Return type:

list[float]

class gec_metrics.metrics.GoToScorer(config: Config = None)[source]

Bases: ERRANT

class Chunk(o_start: int = 0, o_end: int = 0, c_str: str = '', type: str = '', weight: float = 1.0, is_edited: bool = False)[source]

Bases: object

Class to represent a chunk.

  • o_start (int): Start index of the span for the source words.

  • o_end (int): End index of the span for the source words.

  • c_str (str): Corrected version of the span.

  • type (str): Error type.

  • weight (float): Weight for the chunk.

  • is_edited (bool): Flag indicating whether the GEC system tried to edit this span. This is used to distinguish FP from FN, and TP from TN.

c_str: str = ''
is_edited: bool = False
o_end: int = 0
o_start: int = 0
type: str = ''
weight: float = 1.0
class Config(beta: float = 0.5, language: str = 'en', weight_file: str = '', ref_id: int = 0, no_weight: bool = True)[source]

Bases: Config

GoToScorer configuration.

  • weight_file (str): The JSON file containing the pre-computed weights.

    This can be generated by the gecmetrics-gen-gotoscorer-weight script.

  • ref_id (int): The reference id.

  • no_weight (bool): If True, the weights of all chunks are 1.0.

no_weight: bool = True
ref_id: int = 0
weight_file: str = ''
annotate(r_chunk: Chunk, hyp_chunks: list[Chunk]) tuple[bool][source]
Annotate whether the reference chunk is correct and whether the system attempted to edit it.

Parameters:
  • r_chunk (Chunk) – The chunk to be evaluated.

  • hyp_chunks (list[Chunk]) – The chunk sequence for one GEC system.

Returns:

This contains two elements.

The first one represents correctness. The second one represents whether the system tried to edit or not.

Return type:

tuple[bool]

generate_chunks(edits: list[Edit], tokens: list[str]) list[Chunk][source]
Generate a chunk sequence given an edit sequence.
  • Tokens included in each edit become a single chunk.

  • Each token outside of the edits becomes a chunk respectively.

  • Dummy chunks will be inserted between all tokens to account for possible insertions.

Parameters:
  • edits (list[errant.edit.Edit]) – The edit sequence that can be obtained via errant.annotate()

  • tokens (list[str]) – The source tokens.

Returns:

The chunk sequence.

Return type:

list[Chunk]
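
The three chunking rules can be sketched without ERRANT by representing each edit as a plain (o_start, o_end, c_str) tuple over source token indices. This is a simplified illustration, not the library's implementation. It assumes sorted, non-overlapping edits, treats an edit with o_start == o_end as an insertion that fills the dummy slot at that position, and uses a trimmed-down, hypothetical `SimpleChunk` class.

```python
from dataclasses import dataclass


@dataclass
class SimpleChunk:
    o_start: int = 0
    o_end: int = 0
    c_str: str = ""


def generate_chunks(edits: list[tuple[int, int, str]], tokens: list[str]) -> list[SimpleChunk]:
    """Build the chunk sequence: a slot before every token plus one after the last token."""
    insertions = {s: c for s, e, c in edits if s == e}   # fill a dummy slot
    spans = {s: (e, c) for s, e, c in edits if s < e}    # replace or delete source tokens
    chunks = []
    i = 0
    while i <= len(tokens):
        # Slot at position i: an inserted string, or an empty dummy chunk.
        chunks.append(SimpleChunk(i, i, insertions.get(i, "")))
        if i == len(tokens):
            break
        if i in spans:
            # All tokens covered by one edit become a single chunk.
            end, c_str = spans[i]
            chunks.append(SimpleChunk(i, end, c_str))
            i = end
        else:
            # A token outside every edit becomes its own chunk.
            chunks.append(SimpleChunk(i, i + 1, tokens[i]))
            i += 1
    return chunks


tokens = "This sentences contain gramamtical error .".split()
edits = [(1, 2, "sentence"), (2, 3, "contains"), (3, 3, "a"), (3, 4, "grammatical")]
chunks = generate_chunks(edits, tokens)
corrected = [c.c_str for c in chunks]
```

For six source tokens with only single-token edits, this yields 13 chunks: six token chunks and seven slots.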

score_base(sources: list[str], hypotheses: list[str], references: list[list[str]]) list[list[dict[str, Score]]][source]
Calculate scores while retaining sentence and reference boundaries. The results can be aggregated according to the purpose, e.g., at sentence level or corpus level.

Parameters:
  • sources (list[str]) – Source sentence.

  • hypotheses (list[str]) – Corrected sentences.

  • references (list[list[str]]) – Reference sentences. The shape is (the number of references, the number of sentences).

Returns:

The verbose scores.
  • The list shape is (num_sents, num_refs)

  • The dict contains error type-wise scores.

Return type:

list[list[dict[str, “Score”]]]

visualize_chunk(chunks: list[Chunk], tokens: str) None[source]

The visualizer.

Parameters:
  • chunks (list[Chunk]) – The chunk sequence.

  • tokens (list[str]) – The source tokens.

Returns: None

Example:

```
from gec_metrics.metrics.gotoscorer import GoToScorer

scorer = GoToScorer(GoToScorer.Config(no_weight=True))
src = 'This sentences contain gramamtical error .'
trg = 'This sentence contains a grammatical error .'
edits = scorer.edit_extraction(src, trg)
chunks = scorer.generate_chunks(edits, src.split(' '))
scorer.visualize_chunk(chunks, src.split(' '))

# Output:
# |   |This|   |sentences|   |contain |   |gramamtical|   |error|   | . |   |
# |   |This|   |sentence |   |contains| a |grammatical|   |error|   | . |   |
# |1.0|1.0 |1.0|   1.0   |1.0|  1.0   |1.0|    1.0    |1.0| 1.0 |1.0|1.0|1.0|
```

class gec_metrics.metrics.IMPARA(config: Config = None)[source]

Bases: MetricBaseForReferenceFree

class Config(model_qe: str = 'gotutiyan/IMPARA-QE', model_se: str = 'google-bert/bert-base-cased', pooling: str = 'cls', max_length: int = 128, threshold: float = 0.9, no_cuda: bool = False, batch_size: int = 32)[source]

Bases: Config

IMPARA configuration.

Parameters:
  • model_qe (str) – Quality estimation model.

  • model_se (str) – Similarity estimation model.

  • pooling (str) – Pooling method. ‘cls’ or ‘mean’.

  • max_length (int) – Maximum length of inputs.

  • threshold (float) – Threshold for the similarity score.

  • no_cuda (bool) – If True, work on CPU.

  • batch_size (int) – Batch size for the inference.

batch_size: int = 32
max_length: int = 128
model_qe: str = 'gotutiyan/IMPARA-QE'
model_se: str = 'google-bert/bert-base-cased'
no_cuda: bool = False
pooling: str = 'cls'
threshold: float = 0.9
score_sentence(sources: list[str], hypotheses: list[str]) list[float][source]

Calculate sentence-level scores.

Parameters:
  • sources (list[str]) – Source sentence. The shape is (num_sentences, )

  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

Returns:

The sentence-level scores.

Return type:

list[float]

score_sentence_qe(hypotheses: list[str]) list[float][source]

Compute quality scores.

Parameters:

hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

Returns:

The quality scores.

Return type:

list[float]

score_sentence_se(sources: list[str], hypotheses: list[str]) list[float][source]

Compute similarity scores.

Parameters:
  • sources (list[str]) – Source sentence. The shape is (num_sentences, )

  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

Returns:

The similarity scores.

Return type:

list[float]
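
The documentation above does not state how the quality and similarity scores are combined. One plausible rule, sketched here as an assumption, keeps the quality score only when the similarity exceeds `threshold` and returns 0.0 otherwise. The function name `combine_scores` and the strict comparison are hypothetical.

```python
def combine_scores(
    qe_scores: list[float], se_scores: list[float], threshold: float = 0.9
) -> list[float]:
    """Keep the quality score when similarity passes the threshold, else 0.0 (assumed rule)."""
    if len(qe_scores) != len(se_scores):
        raise ValueError("qe_scores and se_scores must have the same length")
    return [qe if se > threshold else 0.0 for qe, se in zip(qe_scores, se_scores)]


# Hypothetical outputs of score_sentence_qe() and score_sentence_se().
final = combine_scores([0.7, 0.8], [0.95, 0.5], threshold=0.9)
```

Under this rule, a fluent hypothesis that drifts too far from the source meaning receives no credit.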

class gec_metrics.metrics.LLMKobayashi24HFEdit(config: Config = None)[source]

Bases: LLMKobayashi24HFSent

LLM-E with huggingface models.

class Config(model: str = 'meta-llama/Llama-2-13b-chat-hf', cache: str = None, seed: int = 777, verbose: bool = False, criteria: str = None, instruction_template: str = 'The goal of this task is to rank the presented targets based on the quality of the sentences.\nAfter reading the source sentence and target sentences, please assign a score from a minimum of 1 point to a maximum of 5 points to each target based on the quality of the sentence (note that you can assign the same score multiple times).\nFor targets without any edits, if the sentence is correct, they will be awarded 5 points; if there is an error, they will receive 1 point.\nThe edits in each target are indicated as follows:\nInsert "the": [→the]\nDelete "the": [the→]\nReplace "the" with "a": [the→a]\n\n# context\n[SOURCE]\n\n# targets\n[CORRECTION]\n\n# output format\nThe output should be a markdown code snippet formatted in the\nfollowing schema, including the leading and trailing "```json" and "```":\n```json\n{\n"target1_score": int // assigned score for target 1\n...\n"targetN_score": int // assigned score for target N\n}\n```\n', quantization: str = '8bit', dtype: str = 'bfloat16')[source]

Bases: Config

instruction_template: str = 'The goal of this task is to rank the presented targets based on the quality of the sentences.\nAfter reading the source sentence and target sentences, please assign a score from a minimum of 1 point to a maximum of 5 points to each target based on the quality of the sentence (note that you can assign the same score multiple times).\nFor targets without any edits, if the sentence is correct, they will be awarded 5 points; if there is an error, they will receive 1 point.\nThe edits in each target are indicated as follows:\nInsert "the": [→the]\nDelete "the": [the→]\nReplace "the" with "a": [the→a]\n\n# context\n[SOURCE]\n\n# targets\n[CORRECTION]\n\n# output format\nThe output should be a markdown code snippet formatted in the\nfollowing schema, including the leading and trailing "```json" and "```":\n```json\n{\n"target1_score": int // assigned score for target 1\n...\n"targetN_score": int // assigned score for target N\n}\n```\n'
hyp_form(src: str, hyp: str) str[source]

Convert a hypothesis sentence into an edit sequence.

Parameters:
  • src (str) – Source sentence.

  • hyp (str) – Hypothesis sentence.

Returns:

Another representation of the hypothesis.

Return type:

str
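
The edit notation named in the instruction template ([→the] for insertion, [the→] for deletion, [the→a] for replacement) can be produced with a token-level diff. The sketch below uses difflib from the standard library as a stand-in alignment. The real hyp_form may derive its edits differently (for example from ERRANT), and the function name `to_edit_markup` is hypothetical.

```python
from difflib import SequenceMatcher


def to_edit_markup(src: str, hyp: str) -> str:
    """Rewrite `hyp` as the source sentence with inline [old→new] edit markers."""
    src_tokens, hyp_tokens = src.split(), hyp.split()
    matcher = SequenceMatcher(a=src_tokens, b=hyp_tokens, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(src_tokens[i1:i2])
        else:
            # insert: empty left side; delete: empty right side; replace: both sides.
            old = " ".join(src_tokens[i1:i2])
            new = " ".join(hyp_tokens[j1:j2])
            out.append(f"[{old}→{new}]")
    return " ".join(out)


deleted = to_edit_markup("I like the apples .", "I like apples .")
inserted = to_edit_markup("I went to park .", "I went to the park .")
replaced = to_edit_markup("He go home .", "He goes home .")
```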

class gec_metrics.metrics.LLMKobayashi24HFSent(config=None)[source]

Bases: LLMKobayashi24

LLM-S with huggingface models.

class Config(model: str = 'meta-llama/Llama-2-13b-chat-hf', cache: str = None, seed: int = 777, verbose: bool = False, criteria: str = None, instruction_template: str = 'The goal of this task is to rank the presented targets based on the quality of the sentences.\nAfter reading the source sentence and target sentences, please assign a score from a minimum of 1 point to a maximum of 5 points to each target based on the quality of the sentence (note that you can assign the same score multiple times).\n\n# source\n[SOURCE]\n\n# targets\n[CORRECTION]\n\n# output format\nThe output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":\n```json\n{\n"target1_score": int // assigned score for target 1\n...\n"targetN_score": int // assigned score for target N\n}\n```\n', quantization: str = '8bit', dtype: str = 'bfloat16')[source]

Bases: Config

Hugging Face configuration.

Parameters:
  • model (str) – Hugging Face Model name.

  • quantization (str) – Quantization setting. None, ‘4bit’ or ‘8bit’.

  • dtype (str)

dtype: str = 'bfloat16'
model: str = 'meta-llama/Llama-2-13b-chat-hf'
quantization: str = '8bit'
call_client(instruction, response_format)[source]

Implement the forward call to the LLM for the given instruction. The expected output format is available in response_format.

load_client()[source]

This function loads the LLM client, e.g. OpenAI() or .from_pretrained() for a Hugging Face model.

class gec_metrics.metrics.LLMKobayashi24OpenAIEdit(config: Config = None)[source]

Bases: LLMKobayashi24OpenAISent

LLM-E with OpenAI models.

class Config(model: str = 'gpt-4o-mini-2024-07-18', cache: str = None, seed: int = 777, verbose: bool = False, criteria: str = None, instruction_template: str = 'The goal of this task is to rank the presented targets based on the quality of the sentences.\nAfter reading the source sentence and target sentences, please assign a score from a minimum of 1 point to a maximum of 5 points to each target based on the quality of the sentence (note that you can assign the same score multiple times).\nFor targets without any edits, if the sentence is correct, they will be awarded 5 points; if there is an error, they will receive 1 point.\nThe edits in each target are indicated as follows:\nInsert "the": [→the]\nDelete "the": [the→]\nReplace "the" with "a": [the→a]\n\n# context\n[SOURCE]\n\n# targets\n[CORRECTION]\n\n# output format\nThe output should be a markdown code snippet formatted in the\nfollowing schema, including the leading and trailing "```json" and "```":\n```json\n{\n"target1_score": int // assigned score for target 1\n...\n"targetN_score": int // assigned score for target N\n}\n```\n', organization: str = None, api_key: str = None, base_url: str = None)[source]

Bases: Config

instruction_template: str = 'The goal of this task is to rank the presented targets based on the quality of the sentences.\nAfter reading the source sentence and target sentences, please assign a score from a minimum of 1 point to a maximum of 5 points to each target based on the quality of the sentence (note that you can assign the same score multiple times).\nFor targets without any edits, if the sentence is correct, they will be awarded 5 points; if there is an error, they will receive 1 point.\nThe edits in each target are indicated as follows:\nInsert "the": [→the]\nDelete "the": [the→]\nReplace "the" with "a": [the→a]\n\n# context\n[SOURCE]\n\n# targets\n[CORRECTION]\n\n# output format\nThe output should be a markdown code snippet formatted in the\nfollowing schema, including the leading and trailing "```json" and "```":\n```json\n{\n"target1_score": int // assigned score for target 1\n...\n"targetN_score": int // assigned score for target N\n}\n```\n'
hyp_form(src: str, hyp: str) str[source]

Change the format of the hypothesis, e.g., into an edit representation.

Parameters:
  • src (str) – Source sentence.

  • hyp (str) – Hypothesis sentence.

Returns:

Another representation of the hypothesis.

Return type:

str

class gec_metrics.metrics.LLMKobayashi24OpenAISent(config: Config = None)[source]

Bases: LLMKobayashi24

LLM-S with OpenAI models.

class Config(model: str = 'gpt-4o-mini-2024-07-18', cache: str = None, seed: int = 777, verbose: bool = False, criteria: str = None, instruction_template: str = 'The goal of this task is to rank the presented targets based on the quality of the sentences.\nAfter reading the source sentence and target sentences, please assign a score from a minimum of 1 point to a maximum of 5 points to each target based on the quality of the sentence (note that you can assign the same score multiple times).\n\n# source\n[SOURCE]\n\n# targets\n[CORRECTION]\n\n# output format\nThe output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":\n```json\n{\n"target1_score": int // assigned score for target 1\n...\n"targetN_score": int // assigned score for target N\n}\n```\n', organization: str = None, api_key: str = None, base_url: str = None)[source]

Bases: Config

OpenAI configuration.

Parameters:
  • model (str) – Model name.

  • organization (str) – Your organization key.

  • api_key (str) – Your api key.

  • base_url (str) – When using Gemini models, specify an appropriate url.

api_key: str = None
base_url: str = None
model: str = 'gpt-4o-mini-2024-07-18'
organization: str = None
call_client(instruction, response_format)[source]

Implement the forward call to the LLM for the given instruction. The expected output format is available in response_format.

load_client()[source]

This function loads the LLM client, e.g. OpenAI() or .from_pretrained() for a Hugging Face model.

class gec_metrics.metrics.MetricBase(config: Config = None)[source]

Bases: ABC

class Config[source]

Bases: object

make_pairwise_scores(scores: list[list[float]]) list[list[list]][source]

Convert sentence-level scores into pairwise comparison results.

Parameters:

scores (list[list[float]]) – Sentence-level score. The shape is (num_systems, num_sentences).

Returns:

Pairwise comparison results for all combinations of the systems. The shape is (num_sents, num_systems, num_systems). You can refer to the comparison result by [sent_id][sys_id1][sys_id2]. Each element is -1, 0, or 1:

  • 0: tie

  • 1: sys_id1 wins against sys_id2

  • -1: sys_id1 loses to sys_id2

Return type:

list[list[list]]
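
The conversion from sentence-level scores to pairwise results can be sketched as follows, assuming that a higher score is better. The function name `pairwise_from_scores` is a hypothetical re-implementation for illustration, not the library code.

```python
def pairwise_from_scores(scores: list[list[float]]) -> list[list[list[int]]]:
    """Turn (num_systems, num_sentences) scores into (num_sents, num_systems, num_systems)."""
    num_systems = len(scores)
    num_sents = len(scores[0]) if scores else 0
    pairwise = []
    for sent_id in range(num_sents):
        matrix = []
        for a in range(num_systems):
            sa = scores[a][sent_id]
            # (x > y) - (x < y) is 1 for a win, -1 for a loss, and 0 for a tie.
            matrix.append([(sa > scores[b][sent_id]) - (sa < scores[b][sent_id])
                           for b in range(num_systems)])
        pairwise.append(matrix)
    return pairwise


# Two systems, two sentences: system 0 wins on sentence 0, and sentence 1 is a tie.
pairwise = pairwise_from_scores([[0.9, 0.2], [0.5, 0.2]])
```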

run_expected_wins(pairwise_scores: list[list[list[int]]]) list[float][source]

Apply Expected Wins given pairwise comparison scores. This is the [Bojar+ 11] version: https://aclanthology.org/W11-2101/

Score(i) = sum_j (wins(i, j) / (wins(i, j) + wins(j, i)))

Parameters:

pairwise_scores (list[list[list[int]]]) – Pairwise comparison results. The shape is (num_sents, num_systems, num_systems).

Returns:

System-level scores.

Return type:

list[float]
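
The Expected Wins formula above can be implemented directly from the pairwise results. In this sketch, ties are ignored and a pair of systems with no decisive comparison contributes 0 to the sum. Both choices are assumptions, as is the function name `expected_wins`.

```python
def expected_wins(pairwise_scores: list[list[list[int]]]) -> list[float]:
    """Score(i) = sum_j wins(i, j) / (wins(i, j) + wins(j, i)), with ties ignored."""
    num_systems = len(pairwise_scores[0]) if pairwise_scores else 0
    # wins[i][j] counts the sentences on which system i beats system j.
    wins = [[0] * num_systems for _ in range(num_systems)]
    for matrix in pairwise_scores:
        for i in range(num_systems):
            for j in range(num_systems):
                if i != j and matrix[i][j] == 1:
                    wins[i][j] += 1
    result = []
    for i in range(num_systems):
        total = 0.0
        for j in range(num_systems):
            decisive = wins[i][j] + wins[j][i]
            if i != j and decisive > 0:
                total += wins[i][j] / decisive
        result.append(total)
    return result


# Two systems, three sentences: system 0 wins twice and system 1 wins once.
pairwise_scores = [
    [[0, 1], [-1, 0]],
    [[0, 1], [-1, 0]],
    [[0, -1], [1, 0]],
]
system_scores = expected_wins(pairwise_scores)
```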

run_trueskill(pairwise_scores: list[list[list[int]]]) list[float][source]

Apply TrueSkill given pairwise comparison scores.

Parameters:

pairwise_scores (list[list[list[int]]]) – Pairwise comparison results. The shape is (num_sents, num_systems, num_systems).

Returns:

System-level scores.

Return type:

list[float]

class gec_metrics.metrics.MetricBaseForReferenceBased(config: Config = None)[source]

Bases: MetricBase

Abstract class for reference-based metrics. All reference-based metrics must be implemented by inheriting from this class.

class Config[source]

Bases: Config

class Score(tp: float = 0.0, fp: float = 0.0, fn: float = 0.0, tn: float = 0.0, beta: float = 0.5)[source]

Bases: object

Handle edit or n-gram counts.

  • tp: True Positive.

  • fp: False Positive.

  • fn: False Negative.

  • tn: True Negative.

  • beta: The beta for F-beta score.

property accuracy: float

Calculate the accuracy.

beta: float
property f: float

Calculate the F-beta score.

fn: float
fp: float
property precision: float

Calculate the precision.

property recall: float

Calculate the recall.

tn: float
tp: float
rank_systems(sources: list[str], hypotheses: list[list[str]], references: list[list[str]], aggregation='default') list[float][source]

Compute ranking score for multiple systems.

Parameters:
  • sources (list[str]) – Source sentence. The shape is (num_sentences, )

  • hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

  • aggregation (str) –

    How to aggregate sentence-level scores.

    • “default”: follows the original aggregation, e.g., average or accumulation.

    • “trueskill”: converts sentence-level scores into pairwise comparison results, then applies TrueSkill. This is motivated by https://arxiv.org/abs/2502.09416.

    • “expected_wins”: converts sentence-level scores into pairwise comparison results, then applies Expected Wins.

Returns:

System-level scores.

Return type:

list[float]

score_corpus(sources: list[str], hypotheses: list[str], references: list[list[str]]) float[source]

Calculate a corpus-level score. By default, we use the average of the sentence-level scores.

Parameters:
  • sources (list[str]) – Source sentence. The shape is (num_sentences, )

  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

Returns:

The corpus-level score.

Return type:

float

score_pairwise(sources: list[str], hypotheses: list[list[str]], references: list[list[str]]) list[list[list[int]]][source]

Calculate pairwise scores for all combinations of hypotheses. By default, it simply compares the sentence-level scores.

Parameters:
  • sources (list[str]) – Source sentence. The shape is (num_sentences, )

  • hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

Returns:

Pairwise comparison results.

The shape is (num_sentences, num_systems, num_systems).

Return type:

list[list[list]]

abstractmethod score_sentence(sources: list[str], hypotheses: list[str], references: list[list[str]]) list[float][source]

Calculate sentence-level scores.

Parameters:
  • sources (list[str]) – Source sentence. The shape is (num_sentences, )

  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

Returns:

The sentence-level scores.

Return type:

list[float]

class gec_metrics.metrics.MetricBaseForReferenceFree(config: Config = None)[source]

Bases: MetricBase

class Config[source]

Bases: Config

rank_systems(sources: list[str], hypotheses: list[list[str]], aggregation='default')[source]

Compute ranking score for multiple systems.

Parameters:
  • sources (list[str]) – Source sentence. The shape is (num_sentences, )

  • hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).

  • aggregation (str) – How to aggregate sentence-level scores.

    • “default”: follow the metric's original aggregation, e.g., average or accumulation.

    • “trueskill”: convert sentence-level scores into pairwise comparison results, then apply TrueSkill. This is motivated by https://arxiv.org/abs/2502.09416.

    • “expected_wins”: convert sentence-level scores into pairwise comparison results, then apply Expected Wins.

Returns:

System-level scores.

Return type:

list[float]
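
As a rough illustration of the “expected_wins” aggregation, the sketch below turns pairwise results into one score per system. It assumes one common formulation (for each opponent, the fraction of decided comparisons that were wins, averaged over opponents) and the same 1 / -1 / 0 outcome encoding as an assumption; the function name `expected_wins` and the 0.5 fallback for opponents with no decided comparison are hypothetical choices, not taken from the library.

```python
def expected_wins(pairwise: list[list[list[int]]]) -> list[float]:
    """Aggregate pairwise outcomes into one score per system.

    `pairwise` has shape (num_sentences, num_systems, num_systems); entry
    [s][i][j] is assumed to be 1 if system i beats system j on sentence s,
    -1 if it loses, and 0 for a tie.
    """
    num_systems = len(pairwise[0]) if pairwise else 0
    scores = []
    for i in range(num_systems):
        ratios = []
        for j in range(num_systems):
            if i == j:
                continue
            wins = sum(1 for table in pairwise if table[i][j] > 0)
            losses = sum(1 for table in pairwise if table[i][j] < 0)
            # Ties are ignored; with no decided comparison, fall back to 0.5.
            ratios.append(wins / (wins + losses) if wins + losses else 0.5)
        scores.append(sum(ratios) / len(ratios) if ratios else 0.0)
    return scores


# Three sentences, two systems: system 0 wins twice and loses once.
toy = [
    [[0, 1], [-1, 0]],
    [[0, 1], [-1, 0]],
    [[0, -1], [1, 0]],
]
system_scores = expected_wins(toy)
```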

score_corpus(sources: list[str], hypotheses: list[str]) float[source]

Calculate a corpus-level score. By default, we use the average of the sentence-level scores.

Parameters:
  • sources (list[str]) – Source sentences. The shape is (num_sentences, )

  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

Returns:

The corpus-level score.

Return type:

float

score_pairwise(sources: list[str], hypotheses: list[list[str]]) list[list[list[int]]][source]

Calculate pairwise scores for all combinations of hypotheses. By default, it simply compares the sentence-level scores.

Parameters:
  • sources (list[str]) – Source sentences. The shape is (num_sentences, )

  • hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).

Returns:

Pairwise comparison results.

The shape is (num_sentences, num_systems, num_systems).

Return type:

list[list[list[int]]]

abstractmethod score_sentence(sources: list[str], hypotheses: list[str]) list[float][source]

Calculate sentence-level scores.

Parameters:
  • sources (list[str]) – Source sentences. The shape is (num_sentences, )

  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

Returns:

The sentence-level scores.

Return type:

list[float]
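
To show how the abstract score_sentence and the default score_corpus fit together, here is a self-contained toy. `ToyReferenceFreeMetric` is a stand-in that only mimics the documented interface (it is not MetricBaseForReferenceFree), and `LengthRatio` is a hypothetical metric invented for the example.

```python
from abc import ABC, abstractmethod


class ToyReferenceFreeMetric(ABC):
    """Stand-in for the documented interface; not the library's base class."""

    @abstractmethod
    def score_sentence(self, sources: list[str], hypotheses: list[str]) -> list[float]:
        ...

    def score_corpus(self, sources: list[str], hypotheses: list[str]) -> float:
        # Documented default: the average of the sentence-level scores.
        scores = self.score_sentence(sources, hypotheses)
        return sum(scores) / len(scores) if scores else 0.0


class LengthRatio(ToyReferenceFreeMetric):
    """Hypothetical metric: ratio of the shorter to the longer token count."""

    def score_sentence(self, sources, hypotheses):
        scores = []
        for src, hyp in zip(sources, hypotheses):
            n_src, n_hyp = len(src.split()), len(hyp.split())
            longest = max(n_src, n_hyp)
            scores.append(min(n_src, n_hyp) / longest if longest else 1.0)
        return scores


metric = LengthRatio()
sources = ["This are a pen .", "I go school ."]
hypotheses = ["This is a pen .", "I go to school ."]
sentence_scores = metric.score_sentence(sources, hypotheses)
corpus_score = metric.score_corpus(sources, hypotheses)
```

A concrete metric only has to supply score_sentence; the corpus-level score then follows from the default averaging unless the subclass overrides it.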

class gec_metrics.metrics.MetricBaseForSourceFree(config: Config = None)[source]

Bases: MetricBase

Base class for metrics that do not use source sentences. It is mainly intended for BERTScore or BARTScore (which are used as components of PT-{ERRANT, M2}).

class Config[source]

Bases: Config

rank_systems(hypotheses: list[list[str]], references: list[list[str]], aggregation='default')[source]

Compute ranking score for multiple systems.

Parameters:
  • hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

  • aggregation (str) – How to aggregate sentence-level scores.

    • “default”: follow the metric's original aggregation, e.g., average or accumulation.

    • “trueskill”: convert sentence-level scores into pairwise comparison results, then apply TrueSkill. This is motivated by https://arxiv.org/abs/2502.09416.

    • “expected_wins”: convert sentence-level scores into pairwise comparison results, then apply Expected Wins.

Returns:

System-level scores.

Return type:

list[float]

score_corpus(hypotheses: list[str], references: list[list[str]]) float[source]

Calculate a corpus-level score. By default, we use the average of the sentence-level scores.

Parameters:
  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

Returns:

The corpus-level score.

Return type:

float

score_pairwise(hypotheses: list[list[str]], references: list[list[str]]) list[list[list[int]]][source]

Calculate pairwise scores for all combinations of hypotheses. By default, it simply compares the sentence-level scores.

Parameters:
  • hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

Returns:

Pairwise comparison results.

The shape is (num_sentences, num_systems, num_systems).

Return type:

list[list[list[int]]]

abstractmethod score_sentence(hypotheses: list[str], references: list[list[str]]) list[float][source]

Calculate sentence-level scores.

Parameters:
  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

Returns:

The sentence-level scores.

Return type:

list[float]
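
The shape convention for references, (num_references, num_sentences), means that each inner list is one complete reference set. The sketch below regroups such input per sentence, which is often convenient when inspecting data; the helper name `references_per_sentence` is hypothetical and not part of the library.

```python
def references_per_sentence(references: list[list[str]]) -> list[list[str]]:
    """Turn (num_references, num_sentences) into (num_sentences, num_references)."""
    if len({len(ref_set) for ref_set in references}) > 1:
        raise ValueError("every reference set must cover the same sentences")
    return [list(group) for group in zip(*references)]


hypotheses = ["This is a pen .", "I go to school ."]
references = [
    ["This is a pen .", "I go to school ."],     # first reference set
    ["This is a pen .", "I went to school ."],   # second reference set
]
grouped = references_per_sentence(references)
```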

class gec_metrics.metrics.PTERRANT(config: Config = None)[source]

Bases: ERRANT

class Config(beta: float = 0.5, language: str = 'en', weight_model_name: str = 'bertscore', weight_model_config: Config = None)[source]

Bases: Config

Configuration of PTERRANT

  • weight_model_name (str): Model to compute edit-level weights.

    Currently only “bertscore” is available.

  • weight_model_config (MetricBaseForSourceFree.Config):

    The config instance of the weight model. If not specified, it uses the default one.

Also, you can use the same configurations as ERRANT.

weight_model_config: Config = None
weight_model_name: str = 'bertscore'
calc_edit_weights(src: str, ref: str, edits: list[Edit]) list[float][source]

Calculate a weight for each edit.

Parameters:
  • src (str) – Source sentence.

  • ref (str) – Reference sentence.

  • edits (list[errant.edit.Edit]) – Edits.

Returns:

The weight of each edit.

Return type:

list[float]
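
One way to picture edit-level weighting is sketched below, under the assumption that an edit's weight is the absolute change in a sentence-similarity score when only that edit is applied to the source. Everything here is illustrative: `ToyEdit` is a hypothetical stand-in for an edit object, and `token_f1` is a toy similarity used in place of the configured weight model so that the example runs without any model.

```python
from collections import Counter
from typing import Callable, NamedTuple


class ToyEdit(NamedTuple):
    """Hypothetical edit: replace source tokens [start, end) with `correction`."""
    start: int
    end: int
    correction: str


def token_f1(hyp: str, ref: str) -> float:
    """Toy similarity standing in for the weight model: bag-of-tokens F1."""
    hyp_tokens, ref_tokens = hyp.split(), ref.split()
    overlap = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp_tokens), overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def edit_weights(src: str, ref: str, edits: list[ToyEdit],
                 scorer: Callable[[str, str], float] = token_f1) -> list[float]:
    """Weight each edit by how much applying it alone changes the score against `ref`."""
    base = scorer(src, ref)
    tokens = src.split()
    weights = []
    for edit in edits:
        edited = tokens[:edit.start] + edit.correction.split() + tokens[edit.end:]
        weights.append(abs(scorer(" ".join(edited), ref) - base))
    return weights


src = "This are a pen"
ref = "This is a pen"
weights = edit_weights(src, ref, [ToyEdit(1, 2, "is"), ToyEdit(3, 4, "pencil")])
```

The result has one weight per edit, in the same order as the input edits, which matches the documented return type.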

score_base(sources: list[str], hypotheses: list[str], references: list[list[str]]) list[list[dict[str, Score]]][source]

Calculate scores while retaining sentence and reference boundaries. The results can be aggregated according to the purpose, e.g., at the sentence level or the corpus level.

Parameters:
  • sources (list[str]) – Source sentences.

  • hypotheses (list[str]) – Corrected sentences.

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

Returns:

The verbose scores.
  • The list shape is (num_sents, num_refs)

  • The dict contains error type-wise scores.

Return type:

list[list[dict[str, Score]]]
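
The remark that verbose scores can be aggregated at the sentence level or the corpus level can be illustrated with plain counts. The sketch assumes each score boils down to true-positive, false-positive and false-negative counts (possibly weighted, hence floats) and uses the standard F-beta formula with the documented default beta of 0.5. The dictionary layout and the convention of returning 1.0 for precision or recall when the denominator is zero are assumptions, not the fields of the library's Score class.

```python
def f_beta(tp: float, fp: float, fn: float, beta: float = 0.5) -> float:
    """F-beta from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical per-sentence counts, one dict per sentence.
per_sentence = [
    {"tp": 2, "fp": 1, "fn": 2},
    {"tp": 0, "fp": 0, "fn": 0},
]

# Sentence level: score each sentence on its own.
sentence_level = [f_beta(**counts) for counts in per_sentence]

# Corpus level: accumulate the counts first, then score once.
totals = {key: sum(counts[key] for counts in per_sentence) for key in ("tp", "fp", "fn")}
corpus_level = f_beta(**totals)
```

The two aggregations generally differ: averaging `sentence_level` gives every sentence equal weight, while `corpus_level` gives every edit equal weight.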

class gec_metrics.metrics.SOME(config: Config = None)[source]

Bases: MetricBaseForReferenceFree

class Config(model_g: str = 'gfm-models/grammer', model_f: str = 'gfm-models/fluency', model_m: str = 'gfm-models/meaning', weight_g: float = 0.55, weight_f: float = 0.43, weight_m: float = 0.02, no_cuda: bool = False, batch_size: int = 32, max_length: int = 128)[source]

Bases: Config

SOME configuration.

  • model_g (str): Model for grammaticality.

  • model_f (str): Model for fluency.

  • model_m (str): Model for meaning preservation.

  • weight_g (float): Weight for the grammaticality score.

  • weight_f (float): Weight for the fluency score.

  • weight_m (float): Weight for the meaning preservation score.

  • no_cuda (bool): If True, run on CPU.

  • batch_size (int): Batch size for inference.

  • max_length (int): Maximum length of inputs.

batch_size: int = 32
max_length: int = 128
model_f: str = 'gfm-models/fluency'
model_g: str = 'gfm-models/grammer'
model_m: str = 'gfm-models/meaning'
no_cuda: bool = False
weight_f: float = 0.43
weight_g: float = 0.55
weight_m: float = 0.02
min_max_normalize(x: int, x_min: int = 1, x_max: int = 4)[source]

Normalize the input value with respect to the range x_min to x_max.

Parameters:
  • x (int) – Input value.

  • x_min (int) – Lower bound of the range.

  • x_max (int) – Upper bound of the range.
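
A minimal sketch of the arithmetic these options suggest, assuming that min-max normalization maps the range [x_min, x_max] onto [0, 1] and that the three sub-scores are combined as a weighted sum with the default weights. The `combine` helper and the order of operations (normalize first, then combine) are assumptions; the documentation above does not state how the library applies them.

```python
def min_max_normalize(x: float, x_min: float = 1, x_max: float = 4) -> float:
    """Map x from the range [x_min, x_max] onto [0, 1] (assumed formula)."""
    return (x - x_min) / (x_max - x_min)


def combine(g: float, f: float, m: float,
            weight_g: float = 0.55, weight_f: float = 0.43,
            weight_m: float = 0.02) -> float:
    """Weighted sum of grammaticality, fluency and meaning scores (assumed)."""
    return weight_g * g + weight_f * f + weight_m * m


# Made-up raw sub-scores on a 1-4 scale.
raw = {"g": 4.0, "f": 2.5, "m": 1.0}
normalized = {name: min_max_normalize(value) for name, value in raw.items()}
total = combine(normalized["g"], normalized["f"], normalized["m"])
```

With the default weights summing to 1, a hypothesis that reaches the top of the scale on all three sub-scores gets a combined score of 1 under these assumptions.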

score_sentence(sources: list[str], hypotheses: list[str]) list[float][source]

Calculate sentence-level scores.

Parameters:
  • sources (list[str]) – Source sentences. The shape is (num_sentences, )

  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

Returns:

The sentence-level scores.

Return type:

list[float]

class gec_metrics.metrics.Scribendi(config: Config = None)[source]

Bases: MetricBaseForReferenceFree

class Config(model: str = 'gpt2', threshold: float = 0.8, no_cuda: bool = False, batch_size: int = 32)[source]

Bases: Config

Scribendi configuration.

  • model (str): Model ID of a language model.

  • threshold (float): Threshold for the maximum of the token sort ratio and the Levenshtein distance ratio.

  • no_cuda (bool): If True, run on CPU.

  • batch_size (int): Batch size for inference.

batch_size: int = 32
model: str = 'gpt2'
no_cuda: bool = False
threshold: float = 0.8
levenshtein_distance_ratio(src: str, pred: str) float[source]

The word-level Levenshtein distance ratio.

Parameters:
  • src (str) – The source sentence.

  • pred (str) – The corrected sentence.

Returns:

The Levenshtein distance ratio.

Return type:

float
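
A word-level Levenshtein ratio can be sketched as below. The exact definition used by the library is not given here, so this assumes one widely used variant: insertions and deletions cost 1, substitutions cost 2, and the ratio is (len(src) + len(pred) - distance) / (len(src) + len(pred)), computed over whitespace tokens.

```python
def levenshtein_distance_ratio(src: str, pred: str) -> float:
    """Word-level Levenshtein ratio in [0, 1]; 1.0 means identical token sequences."""
    a, b = src.split(), pred.split()
    if not a and not b:
        return 1.0
    # prev[j]: distance between the tokens of `a` processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if tok_a == tok_b else 2)  # assumed cost 2
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    total = len(a) + len(b)
    return (total - prev[-1]) / total


ratio = levenshtein_distance_ratio("I go school .", "I go to school .")
```

Here a single inserted word gives a distance of 1 over 9 tokens in total, so the ratio stays close to 1.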

ppl(sents: list[str]) list[float][source]

Compute perplexity using a language model.

Parameters:

sents (list[str]) – The sentences whose perplexity is computed.

Returns:

The perplexity of each sentence.

Return type:

list[float]
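
For readers unfamiliar with the quantity, perplexity is the exponential of the average negative log-probability per token. The sketch below shows only that arithmetic on made-up log-probabilities; it does not call a language model, and the helper name `perplexity` is hypothetical.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities: exp(-mean(log p))."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Made-up log-probabilities: every token has probability 0.5.
ppl_value = perplexity([math.log(0.5)] * 4)
```

Lower values mean the model finds the sentence more predictable, which is why a drop in perplexity from source to correction is read as an improvement.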

score_corpus(sources: list[str], hypotheses: list[str]) float[source]

Calculate a corpus-level score.

Parameters:
  • sources (list[str]) – Source sentences. The shape is (num_sentences, )

  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

Returns:

The corpus-level score.

Return type:

float

score_sentence(sources: list[str], hypotheses: list[str]) list[float][source]

Calculate sentence-level scores.

Parameters:
  • sources (list[str]) – Source sentences. The shape is (num_sentences, )

  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

Returns:

The sentence-level scores.

Return type:

list[float]
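
The pieces documented for this class (perplexity, the two ratios and the threshold) can be combined into a per-sentence decision roughly as follows. This is a sketch of the rule as commonly described for the Scribendi score, stated here as an assumption: 0 when the hypothesis equals the source, 1 when perplexity drops and the larger of the two ratios reaches the threshold, and -1 otherwise. The function name and the choice to pass precomputed values are hypothetical.

```python
def scribendi_sentence_score(src: str, pred: str,
                             ppl_src: float, ppl_pred: float,
                             token_sort: float, levenshtein: float,
                             threshold: float = 0.8) -> int:
    """Assumed rule: 0 for no change, 1 for an accepted improvement, -1 otherwise."""
    if src == pred:
        return 0
    if ppl_pred >= ppl_src:
        return -1  # the correction did not lower perplexity
    if max(token_sort, levenshtein) >= threshold:
        return 1   # lower perplexity and still close to the source
    return -1      # lower perplexity but too far from the source


# Made-up perplexities and ratios for one sentence pair.
decision = scribendi_sentence_score(
    "I go school .", "I go to school .",
    ppl_src=50.0, ppl_pred=30.0, token_sort=0.9, levenshtein=0.7,
)
```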

token_sort_ratio(src: str, pred: str) float[source]

Parameters:
  • src (str) – The source sentence.

  • pred (str) – The corrected sentence.

Returns:

The token sort ratio.

Return type:

float
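
A token sort ratio compares two sentences after sorting their tokens, so word order does not affect the result. The sketch below is one way to compute such a ratio with the standard library. It assumes whitespace tokenization and uses difflib.SequenceMatcher as the underlying string similarity, which may differ from what the library uses.

```python
from difflib import SequenceMatcher


def token_sort_ratio(src: str, pred: str) -> float:
    """Similarity in [0, 1] after sorting the tokens of each sentence."""
    sorted_src = " ".join(sorted(src.split()))
    sorted_pred = " ".join(sorted(pred.split()))
    return SequenceMatcher(None, sorted_src, sorted_pred).ratio()


same_words = token_sort_ratio("to school I go", "I go to school")
different = token_sort_ratio("I go to school", "completely unrelated text")
```

Because both sentences in the first call contain the same tokens, sorting makes the strings identical and the ratio reaches its maximum.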