gec_metrics.metrics package
Submodules
- gec_metrics.metrics.base module
- gec_metrics.metrics.bertscore module
- gec_metrics.metrics.errant module
- gec_metrics.metrics.gleu module
- gec_metrics.metrics.gotoscorer module
- gec_metrics.metrics.green module
- gec_metrics.metrics.impara module
- gec_metrics.metrics.llm_kobayashi24 module
- gec_metrics.metrics.pt_errant module
- gec_metrics.metrics.scribendi module
- gec_metrics.metrics.some module
- gec_metrics.metrics.utils module
Module contents
- class gec_metrics.metrics.BertScore(config: Config = None)[source]
Bases:
MetricBaseForSourceFree
- class Config(model_type: str = 'bert-base-uncased', num_layers: int = None, batch_size: int = 64, nthreads: int = 4, all_layers: bool = False, idf: bool = False, idf_sents: list[str] = None, lang: str = 'en', rescale_with_baseline: bool = True, baseline_path: str = None, use_fast_tokenizer: bool = False, score_type: str = 'f')[source]
Bases:
Config
BERTScore configuration.
model_type (str): Embedding model.
- num_layers (int): The layer of representation to use.
If None, the pre-defined one is used. (See bert_score.utils.model2layers.)
nthreads (int): Number of threads.
idf (bool): Whether to use idf or not.
idf_sents (list[str]): Sentences to compute idf weights.
rescale_with_baseline (bool): Whether to rescale scores.
- baseline_path (str): Path to .tsv file.
If None, the pre-defined one is used. (See bert_score.rescale_baseline.*.tsv)
use_fast_tokenizer (bool): Whether to use fast tokenizer.
score_type (str): “p” (precision) or “r” (recall) or “f” (F1) score.
- all_layers: bool = False
- baseline_path: str = None
- batch_size: int = 64
- idf: bool = False
- idf_sents: list[str] = None
- lang: str = 'en'
- model_type: str = 'bert-base-uncased'
- nthreads: int = 4
- num_layers: int = None
- rescale_with_baseline: bool = True
- score_type: str = 'f'
- use_fast_tokenizer: bool = False
- score_sentence(hypotheses: list[str], references: list[list[str]]) list[float][source]
Calculate sentence-level scores.
- Parameters:
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
- Returns:
The sentence-level scores.
- Return type:
list[float]
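The shape convention is easy to get backwards: `references` is indexed by reference first and by sentence second. A minimal sketch in plain Python, independent of this package, of converting per-sentence reference lists into that layout. The helper name `to_reference_major` and the sample sentences are hypothetical.

```python
def to_reference_major(per_sentence_refs: list[list[str]]) -> list[list[str]]:
    """Transpose (num_sentences, num_references) into (num_references, num_sentences)."""
    counts = {len(refs) for refs in per_sentence_refs}
    if len(counts) > 1:
        raise ValueError("every sentence needs the same number of references")
    return [list(column) for column in zip(*per_sentence_refs)]


hypotheses = ["He goes to school .", "She likes apples ."]
per_sentence_refs = [
    ["He goes to school .", "He went to school ."],    # references for sentence 0
    ["She likes apples .", "She likes an apple ."],    # references for sentence 1
]
references = to_reference_major(per_sentence_refs)
# references[r][s] is now the r-th reference for the s-th sentence.
```

After the transpose, each inner list of `references` has the same length as `hypotheses`, which is the layout the documented methods expect.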
- class gec_metrics.metrics.ERRANT(config: Config = None)[source]
Bases:
MetricBaseForReferenceBased
- class Config(beta: float = 0.5, language: str = 'en')[source]
Bases:
Config
ERRANT configuration.
- beta (float): The beta for F-beta score.
- language (str): The language for spaCy.
- beta: float = 0.5
- language: str = 'en'
- aggregate_to_overall(scores: dict[str, Score]) Score[source]
Convert error type-wise scores into an overall score.
- Parameters:
scores (dict[str, "Score"]) – Error type-wise scores.
- Returns:
The aggregated score.
- Return type:
Score
- cached_parse(sent: str) Doc[source]
Efficient parse() by caching.
- Parameters:
sent (str) – The sentence to be parsed.
- Returns:
The parse results.
- Return type:
spacy.tokens.doc.Doc
- edit_extraction(src: str, trg: str) list[Edit][source]
Extract edits given a source and a corrected sentence.
- Parameters:
src (str) – The source sentence.
trg (str) – The corrected sentence.
- Returns:
Extracted edits.
- Return type:
list[errant.edit.Edit]
- score_base(sources: list[str], hypotheses: list[str], references: list[list[str]]) list[list[dict[str, Score]]][source]
- Calculate scores while retaining sentence and reference boundaries.
- The results can be aggregated according to the purpose,
e.g., at sentence-level or corpus-level.
- Parameters:
sources (list[str]) – Source sentence.
hypotheses (list[str]) – Corrected sentences.
references (list[list[str]]) – Reference sentences. The shape is (the number of references, the number of sentences).
- Returns:
- The verbose scores.
The list shape is (num_sents, num_refs)
The dict contains error type-wise scores.
- Return type:
list[list[dict[str, “Score”]]]
- score_corpus(sources: list[str], hypotheses: list[str], references: list[list[str]]) float[source]
Calculate a corpus-level score. This accumulates edit count for TP, FP, FN
and calculates f-beta score.
- Parameters:
sources (list[str]) – Source sentence. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
- Returns:
The corpus-level score.
- Return type:
float
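A minimal sketch of the accumulation described above: per-sentence TP, FP, and FN counts are summed first, and a single F-beta is computed from the totals. The function name `corpus_f_beta` is hypothetical, and returning 1.0 for precision or recall when its denominator is zero is an assumption, not something this page specifies.

```python
def corpus_f_beta(counts: list[tuple[int, int, int]], beta: float = 0.5) -> float:
    """Sum (TP, FP, FN) over sentences, then compute one F-beta from the totals."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    # Assumption: an empty denominator yields 1.0 for precision/recall.
    precision = tp / (tp + fp) if tp + fp > 0 else 1.0
    recall = tp / (tp + fn) if tp + fn > 0 else 1.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two sentences: (TP, FP, FN) = (2, 1, 1) and (1, 0, 2).
score = corpus_f_beta([(2, 1, 1), (1, 0, 2)], beta=0.5)
```

Accumulating counts before dividing weights each edit equally, which differs from averaging per-sentence F-beta scores.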
- score_corpus_etype(sources: list[str], hypotheses: list[str], references: list[list[str]], cat: int = 2)[source]
Calculate error-type-level scores.
- Parameters:
sources (list[str]) – Source sentence. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
cat (int) –
Error type category. By following the original ERRANT,
cat=1: Operation: e.g. M, R, U cat=2: Main types: e.g. NOUN, VERB cat=3: All types: e.g. M:NOUN, R:VERB
- Returns:
- The error-type-level score.
Each key is an error type, and value is the Score instance.
- Return type:
dict[str, Score]
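The three `cat` levels can be illustrated as a reduction of an ERRANT-style type string. This is a sketch under the assumption that types look like `M:NOUN` or `R:VERB` (an operation letter, a colon, then the type); the helper `etype_key` is hypothetical and is not part of this package.

```python
def etype_key(error_type: str, cat: int = 2) -> str:
    """Reduce an ERRANT-style type such as 'R:VERB' to the requested granularity."""
    if cat == 1:
        return error_type[0]    # operation only: M, R, or U
    if cat == 2:
        return error_type[2:]   # drop the 'X:' operation prefix
    if cat == 3:
        return error_type       # operation and type together
    raise ValueError("cat must be 1, 2, or 3")


keys = [etype_key("M:NOUN", cat) for cat in (1, 2, 3)]
```

The keys of the returned dict would then be grouped at whichever granularity `cat` selects.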
- score_corpus_verbose(sources: list[str], hypotheses: list[str], references: list[list[str]]) Score[source]
Calculate a corpus-level score by aggregating verbose scores.
- Parameters:
sources (list[str]) – Source sentence.
hypotheses (list[str]) – Corrected sentences.
references (list[list[str]]) – Reference sentences. The shape is (the number of references, the number of sentences).
- Returns:
It contains TP, FP, FN, Precision, Recall, and F-beta.
- Return type:
Score
- score_sentence(sources: list[str], hypotheses: list[str], references: list[list[str]]) list[float][source]
Calculate sentence-level scores.
- Parameters:
sources (list[str]) – Source sentence. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
- Returns:
The sentence-level scores.
- Return type:
list[float]
- score_sentence_verbose(sources: list[str], hypotheses: list[str], references: list[list[str]]) list[Score][source]
Calculate sentence-level scores by aggregating verbose scores. “Verbose” means that TP, FP, FN, Precision, Recall, and F are available.
- Parameters:
sources (list[str]) – Source sentence. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
- Returns:
The sentence-level scores.
- Return type:
list[Score]
- class gec_metrics.metrics.GLEU(config: Config = None)[source]
Bases:
GREEN
GLEU implemented using the GREEN reformulation (https://aclanthology.org/2024.inlg-main.25.pdf).
- class Config(iter: int = 500, n: int = 4, unit: str = 'word')[source]
Bases:
Config
GLEU configuration.
- iter (int): The number of iterations.
- n (int): The maximum n of n-gram.
- unit (str): Word-level or character-level. Can be ‘word’ or ‘char’.
- iter: int = 500
- n: int = 4
- unit: str = 'word'
- aggregate_score(scores: list[Score], hyp_len: int, ref_len: int) float[source]
Aggregate n-gram scores to an overall score by the geometric mean.
- Parameters:
scores (list[Score]) – The scores keeping n-gram boundary. The shape is (n, )
hyp_len (int) – The length of the hypothesis.
ref_len (int) – The length of the reference.
- Returns:
The aggregated score.
- Return type:
float
- score_base(sources: list[str], hypotheses: list[str], references: list[list[str]]) float[source]
Compute True Positive and False Positive counts using GREEN’s reformulation
(https://aclanthology.org/2024.inlg-main.25.pdf).
The actual equation is (TI + TK - UD) / (TI + TK + OI + UD), so we regard
- True Positive (TP) as TI + TK - UD,
- False Positive (FP) as OI + 2*UD.
Since TP + FP = TI + TK + OI + UD, precision = TP / (TP + FP) is the GLEU score.
- Parameters:
sources (list[str]) – Source sentence.
hypotheses (list[str]) – Corrected sentences.
references (list[list[str]]) – Reference sentences. The shape is (the number of references, the number of sentences).
- Returns:
- The verbose scores.
The shape is (num_iterations, num_sents, max_ngram).
- list[list[int]]: The length for the hypotheses.
The shape is (num_iterations, num_sents)
- list[list[int]]: The length for the references.
The shape is (num_iterations, num_sents)
- Return type:
list[list[list[“Score”]]]
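The relation stated for `score_base` (TP = TI + TK - UD, FP = OI + 2*UD, GLEU = TP / (TP + FP)) can be checked with a few lines of arithmetic. This sketch takes the four counts as given; how they are extracted from n-grams is not shown, the function name `gleu_from_counts` is hypothetical, and returning 0.0 for an empty denominator is an assumption.

```python
def gleu_from_counts(ti: int, tk: int, oi: int, ud: int) -> float:
    """Precision-style GLEU from the counts named in the GREEN reformulation."""
    tp = ti + tk - ud
    fp = oi + 2 * ud
    denom = tp + fp  # algebraically equal to ti + tk + oi + ud
    return tp / denom if denom > 0 else 0.0


ti, tk, oi, ud = 3, 5, 1, 1
value = gleu_from_counts(ti, tk, oi, ud)
direct = (ti + tk - ud) / (ti + tk + oi + ud)
```

Both `value` and `direct` evaluate the same fraction, which is why treating the score as a precision lets GLEU reuse the TP/FP bookkeeping of GREEN.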
- score_corpus(sources: list[str], hypotheses: list[str], references: list[list[str]]) float[source]
Calculate a corpus-level score.
- Parameters:
sources (list[str]) – Source sentence. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
- Returns:
The corpus-level score.
- Return type:
float
- score_sentence(sources: list[str], hypotheses: list[str], references: list[list[str]]) float[source]
Calculate sentence-level scores.
- Parameters:
sources (list[str]) – Source sentence. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
- Returns:
The sentence-level scores.
- Return type:
list[float]
- class gec_metrics.metrics.GLEUOfficial(config: Config = None)[source]
Bases:
GLEU
- score_base(sources: list[str], hypotheses: list[str], references: list[list[str]]) Tuple[list[list[list[Score]]], list[list[int]], list[list[int]]][source]
- The official implementation contains an error
where the frequency of n-grams is ignored in the calculation of SR.
- As a result, when an n-gram is classified into both TK and UD,
it is entirely counted as TK.
- Parameters:
sources (list[str]) – Source sentence.
hypotheses (list[str]) – Corrected sentences.
references (list[list[str]]) – Reference sentences. The shape is (the number of references, the number of sentences).
- Returns:
- The verbose scores.
The shape is (num_iterations, num_sents, max_ngram).
- list[list[int]]: The length for the hypotheses.
The shape is (num_iterations, num_sents)
- list[list[int]]: The length for the references.
The shape is (num_iterations, num_sents)
- Return type:
list[list[list[“Score”]]]
- class gec_metrics.metrics.GREEN(config: Config = None)[source]
Bases:
MetricBaseForReferenceBased
- class Config(n: int = 4, beta: float = 2.0, unit: str = 'word')[source]
Bases:
Config
GREEN configuration.
- n (int): Maximum n for n-gram.
- beta (float): The beta for F-beta score.
- unit (str): Word-level or character-level. Can be ‘word’ or ‘char’.
- beta: float = 2.0
- n: int = 4
- unit: str = 'word'
- aggregate_score(scores: list[Score]) float[source]
Aggregate n-gram scores to an overall score by the geometric mean.
- Parameters:
scores (list[Score]) – The scores keeping n-gram boundary. The shape is (n, )
- Returns:
The aggregated score.
- Return type:
float
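A minimal sketch of a geometric mean over per-n scores, as the description of `aggregate_score` suggests. It assumes the inputs are already plain floats (for example one F-beta per n-gram order) and, as a further assumption, returns 0.0 when any value is zero; the library's actual handling of such cases is not documented here.

```python
import math


def geometric_mean(values: list[float]) -> float:
    """Geometric mean of non-negative scores; 0.0 if empty or any value is zero."""
    if not values or any(v <= 0 for v in values):
        return 0.0
    # Work in log space to avoid underflow for long lists of small scores.
    return math.exp(sum(math.log(v) for v in values) / len(values))


overall = geometric_mean([1.0, 0.25])
```

Unlike an arithmetic mean, a single weak n-gram order pulls the overall score down sharply, which is the usual motivation for this choice in n-gram metrics.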
- cached_get_all_ngrams(sentence: str) dict[str, int][source]
Get frequency of n-gram for all n (1 <= n <= config.n)
- score_base(sources: list[str], hypotheses: list[str], references: list[list[str]]) list[list[list[Score]]][source]
- Calculate scores while retaining sentence and reference boundaries.
- The results can be aggregated according to the purpose,
e.g., at sentence-level or corpus-level.
- Parameters:
sources (list[str]) – Source sentence.
hypotheses (list[str]) – Corrected sentences.
references (list[list[str]]) – Reference sentences. The shape is (the number of references, the number of sentences).
- Returns:
- The verbose scores.
The shape is (num_iterations, num_sents, max_ngram).
- Return type:
list[list[list[“Score”]]]
- score_corpus(sources: list[str], hypotheses: list[str], references: list[list[str]]) float[source]
Calculate a corpus-level score. This accumulates n-gram count for TP, FP, FN
and calculates f-beta score.
- Parameters:
sources (list[str]) – Source sentence. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
- Returns:
The corpus-level score.
- Return type:
float
- score_sentence(sources: list[str], hypotheses: list[str], references: list[list[str]]) list[float][source]
Calculate sentence-level scores.
- Parameters:
sources (list[str]) – Source sentence. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
- Returns:
The sentence-level scores.
- Return type:
list[float]
- class gec_metrics.metrics.GoToScorer(config: Config = None)[source]
Bases:
ERRANT
- class Chunk(o_start: int = 0, o_end: int = 0, c_str: str = '', type: str = '', weight: float = 1.0, is_edited: bool = False)[source]
Bases:
object
Class to represent a chunk.
- o_start (int): Start index of the span for the source words.
- o_end (int): End index of the span for the source words.
- c_str (str): Corrected version of the span.
- type (str): Error type.
- weight (float): Weight for the chunk.
- is_edited (bool): Flag whether the GEC system tried to edit this span.
This is used to distinguish FP from FN, and TP from TN.
- c_str: str = ''
- is_edited: bool = False
- o_end: int = 0
- o_start: int = 0
- type: str = ''
- weight: float = 1.0
- class Config(beta: float = 0.5, language: str = 'en', weight_file: str = '', ref_id: int = 0, no_weight: bool = True)[source]
Bases:
Config
GoToScorer configuration.
- weight_file (str): The JSON file containing the pre-computed weight.
This can be generated by the gecmetrics-gen-gotoscorer-weight script.
- ref_id (int): The reference id.
- no_weight (bool): If True, the weight of all chunks is 1.0.
- no_weight: bool = True
- ref_id: int = 0
- weight_file: str = ''
- annotate(r_chunk: Chunk, hyp_chunks: list[Chunk]) tuple[bool][source]
- Annotate whether the reference chunk is correct
and whether the system attempted to edit it.
- generate_chunks(edits: list[Edit], tokens: list[str]) list[Chunk][source]
- Generate a chunk sequence given an edit sequence.
Tokens included in each edit become a single chunk.
Each token outside of the edits becomes a chunk respectively.
- Dummy chunks will be inserted between all tokens
to account for possible insertions.
- Parameters:
edits (list[errant.edit.Edit]) – The edit sequence that can be obtained via errant.annotate()
tokens (list[str]) – The source tokens.
- Returns:
The chunk sequence.
- Return type:
list[Chunk]
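The chunking rule described for `generate_chunks` can be sketched without ERRANT or spaCy. This is an illustration only: `SketchChunk` is a simplified stand-in for the documented `Chunk` class, and edits are given as hypothetical `(o_start, o_end, c_str)` tuples rather than `errant.edit.Edit` objects. An edit with `o_start == o_end` is treated as an insertion that fills the dummy chunk at that position.

```python
from dataclasses import dataclass


@dataclass
class SketchChunk:
    o_start: int
    o_end: int
    c_str: str


def generate_chunks_sketch(edits, tokens):
    """Interleave dummy (insertion) chunks with token or edit chunks."""
    insertions = {s: c for s, e, c in edits if s == e}
    spans = {s: (e, c) for s, e, c in edits if s < e}
    chunks = []
    i = 0
    while i < len(tokens):
        # Dummy chunk before token i; filled only if an insertion lands here.
        chunks.append(SketchChunk(i, i, insertions.get(i, "")))
        if i in spans:
            end, c_str = spans[i]
            chunks.append(SketchChunk(i, end, c_str))  # whole edit span = one chunk
            i = end
        else:
            chunks.append(SketchChunk(i, i + 1, tokens[i]))  # unedited token
            i += 1
    n = len(tokens)
    chunks.append(SketchChunk(n, n, insertions.get(n, "")))  # trailing dummy chunk
    return chunks


tokens = "This sentences contain gramamtical error .".split(" ")
edits = [(1, 2, "sentence"), (2, 3, "contains"), (3, 3, "a"), (3, 4, "grammatical")]
chunks = generate_chunks_sketch(edits, tokens)
corrected = [c.c_str for c in chunks]
```

For six source tokens with single-token edits this yields 2 * 6 + 1 = 13 chunks, matching the number of columns in the visualizer output shown for `visualize_chunk`.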
- score_base(sources: list[str], hypotheses: list[str], references: list[list[str]]) list[list[dict[str, Score]]][source]
- Calculate scores while retaining sentence and reference boundaries.
- The results can be aggregated according to the purpose,
e.g., at sentence-level or corpus-level.
- Parameters:
sources (list[str]) – Source sentence.
hypotheses (list[str]) – Corrected sentences.
references (list[list[str]]) – Reference sentences. The shape is (the number of references, the number of sentences).
- Returns:
- The verbose scores.
The list shape is (num_sents, num_refs)
The dict contains error type-wise scores.
- Return type:
list[list[dict[str, “Score”]]]
- visualize_chunk(chunks: list[Chunk], tokens: str) None[source]
The visualizer.
- Parameters:
chunks (list[Chunk]) – The chunk sequence.
tokens (list[str]) – The source tokens.
Returns: None
from gec_metrics.metrics.gotoscorer import GoToScorer
scorer = GoToScorer(GoToScorer.Config(no_weight=True))
src = 'This sentences contain gramamtical error .'
trg = 'This sentence contains a grammatical error .'
edits = scorer.edit_extraction(src, trg)
chunks = scorer.generate_chunks(edits, src.split(' '))
scorer.visualize_chunk(chunks, src.split(' '))
# Output:
# |   |This|   |sentences|   |contain |   |gramamtical|   |error|   | . |   |
# |   |This|   |sentence |   |contains| a |grammatical|   |error|   | . |   |
# |1.0|1.0 |1.0|   1.0   |1.0|  1.0   |1.0|    1.0    |1.0| 1.0 |1.0|1.0|1.0|
- class gec_metrics.metrics.IMPARA(config: Config = None)[source]
Bases:
MetricBaseForReferenceFree
- class Config(model_qe: str = 'gotutiyan/IMPARA-QE', model_se: str = 'google-bert/bert-base-cased', pooling: str = 'cls', max_length: int = 128, threshold: float = 0.9, no_cuda: bool = False, batch_size: int = 32)[source]
Bases:
Config
IMPARA configuration.
- Parameters:
model_qe (str) – Quality estimation model.
model_se (str) – Similarity estimation model.
pooling (str) – Pooling method. ‘cls’ or ‘mean’.
max_length (int) – Maximum length of inputs.
threshold (float) – Threshold for the similarity score.
no_cuda (bool) – If True, work on CPU.
batch_size (int) – Batch size for the inference.
- batch_size: int = 32
- max_length: int = 128
- model_qe: str = 'gotutiyan/IMPARA-QE'
- model_se: str = 'google-bert/bert-base-cased'
- no_cuda: bool = False
- pooling: str = 'cls'
- threshold: float = 0.9
- score_sentence(sources: list[str], hypotheses: list[str]) list[float][source]
Calculate sentence-level scores.
- Parameters:
sources (list[str]) – Source sentence. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
- Returns:
The sentence-level scores.
- Return type:
list[float]
- score_sentence_qe(hypotheses: list[str]) list[float][source]
Compute quality scores.
- Parameters:
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
- Returns:
The quality scores.
- Return type:
list[float]
- score_sentence_se(sources: list[str], hypotheses: list[str]) list[float][source]
Compute similarity scores.
- Parameters:
sources (list[str]) – Source sentence. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
- Returns:
The similarity scores.
- Return type:
list[float]
- class gec_metrics.metrics.LLMKobayashi24HFEdit(config: Config = None)[source]
Bases:
LLMKobayashi24HFSent
LLM-E with Hugging Face models.
- class Config(model: str = 'meta-llama/Llama-2-13b-chat-hf', cache: str = None, seed: int = 777, verbose: bool = False, criteria: str = None, instruction_template: str = 'The goal of this task is to rank the presented targets based on the quality of the sentences.\nAfter reading the source sentence and target sentences, please assign a score from a minimum of 1 point to a maximum of 5 points to each target based on the quality of the sentence (note that you can assign the same score multiple times).\nFor targets without any edits, if the sentence is correct, they will be awarded 5 points; if there is an error, they will receive 1 point.\nThe edits in each target are indicated as follows:\nInsert "the": [→the]\nDelete "the": [the→]\nReplace "the" with "a": [the→a]\n\n# context\n[SOURCE]\n\n# targets\n[CORRECTION]\n\n# output format\nThe output should be a markdown code snippet formatted in the\nfollowing schema, including the leading and trailing "```json" and "```":\n```json\n{\n"target1_score": int // assigned score for target 1\n...\n"targetN_score": int // assigned score for target N\n}\n```\n', quantization: str = '8bit', dtype: str = 'bfloat16')[source]
Bases:
Config- instruction_template: str = 'The goal of this task is to rank the presented targets based on the quality of the sentences.\nAfter reading the source sentence and target sentences, please assign a score from a minimum of 1 point to a maximum of 5 points to each target based on the quality of the sentence (note that you can assign the same score multiple times).\nFor targets without any edits, if the sentence is correct, they will be awarded 5 points; if there is an error, they will receive 1 point.\nThe edits in each target are indicated as follows:\nInsert "the": [→the]\nDelete "the": [the→]\nReplace "the" with "a": [the→a]\n\n# context\n[SOURCE]\n\n# targets\n[CORRECTION]\n\n# output format\nThe output should be a markdown code snippet formatted in the\nfollowing schema, including the leading and trailing "```json" and "```":\n```json\n{\n"target1_score": int // assigned score for target 1\n...\n"targetN_score": int // assigned score for target N\n}\n```\n'
- class gec_metrics.metrics.LLMKobayashi24HFSent(config=None)[source]
Bases:
LLMKobayashi24
LLM-S with Hugging Face models.
- class Config(model: str = 'meta-llama/Llama-2-13b-chat-hf', cache: str = None, seed: int = 777, verbose: bool = False, criteria: str = None, instruction_template: str = 'The goal of this task is to rank the presented targets based on the quality of the sentences.\nAfter reading the source sentence and target sentences, please assign a score from a minimum of 1 point to a maximum of 5 points to each target based on the quality of the sentence (note that you can assign the same score multiple times).\n\n# source\n[SOURCE]\n\n# targets\n[CORRECTION]\n\n# output format\nThe output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":\n```json\n{\n"target1_score": int // assigned score for target 1\n...\n"targetN_score": int // assigned score for target N\n}\n```\n', quantization: str = '8bit', dtype: str = 'bfloat16')[source]
Bases:
Config
Hugging Face configuration.
- Parameters:
model (str) – Hugging Face Model name.
quantization (str) – Quantization setting. None, ‘4bit’ or ‘8bit’.
dtype (str)
- dtype: str = 'bfloat16'
- model: str = 'meta-llama/Llama-2-13b-chat-hf'
- quantization: str = '8bit'
- class gec_metrics.metrics.LLMKobayashi24OpenAIEdit(config: Config = None)[source]
Bases:
LLMKobayashi24OpenAISent
LLM-E with OpenAI models.
- class Config(model: str = 'gpt-4o-mini-2024-07-18', cache: str = None, seed: int = 777, verbose: bool = False, criteria: str = None, instruction_template: str = 'The goal of this task is to rank the presented targets based on the quality of the sentences.\nAfter reading the source sentence and target sentences, please assign a score from a minimum of 1 point to a maximum of 5 points to each target based on the quality of the sentence (note that you can assign the same score multiple times).\nFor targets without any edits, if the sentence is correct, they will be awarded 5 points; if there is an error, they will receive 1 point.\nThe edits in each target are indicated as follows:\nInsert "the": [→the]\nDelete "the": [the→]\nReplace "the" with "a": [the→a]\n\n# context\n[SOURCE]\n\n# targets\n[CORRECTION]\n\n# output format\nThe output should be a markdown code snippet formatted in the\nfollowing schema, including the leading and trailing "```json" and "```":\n```json\n{\n"target1_score": int // assigned score for target 1\n...\n"targetN_score": int // assigned score for target N\n}\n```\n', organization: str = None, api_key: str = None, base_url: str = None)[source]
Bases:
Config- instruction_template: str = 'The goal of this task is to rank the presented targets based on the quality of the sentences.\nAfter reading the source sentence and target sentences, please assign a score from a minimum of 1 point to a maximum of 5 points to each target based on the quality of the sentence (note that you can assign the same score multiple times).\nFor targets without any edits, if the sentence is correct, they will be awarded 5 points; if there is an error, they will receive 1 point.\nThe edits in each target are indicated as follows:\nInsert "the": [→the]\nDelete "the": [the→]\nReplace "the" with "a": [the→a]\n\n# context\n[SOURCE]\n\n# targets\n[CORRECTION]\n\n# output format\nThe output should be a markdown code snippet formatted in the\nfollowing schema, including the leading and trailing "```json" and "```":\n```json\n{\n"target1_score": int // assigned score for target 1\n...\n"targetN_score": int // assigned score for target N\n}\n```\n'
- class gec_metrics.metrics.LLMKobayashi24OpenAISent(config: Config = None)[source]
Bases:
LLMKobayashi24
LLM-S with OpenAI models.
- class Config(model: str = 'gpt-4o-mini-2024-07-18', cache: str = None, seed: int = 777, verbose: bool = False, criteria: str = None, instruction_template: str = 'The goal of this task is to rank the presented targets based on the quality of the sentences.\nAfter reading the source sentence and target sentences, please assign a score from a minimum of 1 point to a maximum of 5 points to each target based on the quality of the sentence (note that you can assign the same score multiple times).\n\n# source\n[SOURCE]\n\n# targets\n[CORRECTION]\n\n# output format\nThe output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":\n```json\n{\n"target1_score": int // assigned score for target 1\n...\n"targetN_score": int // assigned score for target N\n}\n```\n', organization: str = None, api_key: str = None, base_url: str = None)[source]
Bases:
Config
OpenAI configuration.
- Parameters:
model (str) – Model name.
organization (str) – Your organization key.
api_key (str) – Your API key.
base_url (str) – When using Gemini models, specify an appropriate url.
- api_key: str = None
- base_url: str = None
- model: str = 'gpt-4o-mini-2024-07-18'
- organization: str = None
- class gec_metrics.metrics.MetricBase(config: Config = None)[source]
Bases:
ABC
- make_pairwise_scores(scores: list[list[float]]) list[list[list]][source]
Convert sentence-level scores into pairwise comparison results.
- Parameters:
scores (list[list[float]]) – Sentence-level score. The shape is (num_systems, num_sentences).
- Returns:
Pairwise comparison results for all combinations of the systems. The shape is (num_sents, num_systems, num_systems). You can refer to the comparison result by [sent_id][sys_id1][sys_id2]. Each element is -1, 0, or 1:
0: tie
1: sys_id1 wins against sys_id2
-1: sys_id1 loses to sys_id2
- Return type:
list[list[list]]
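A minimal sketch of the conversion described for `make_pairwise_scores`, written in plain Python. It assumes that a higher sentence-level score is better; the function name `pairwise_from_scores` is hypothetical and this is not the library's implementation.

```python
def pairwise_from_scores(scores: list[list[float]]) -> list[list[list[int]]]:
    """Turn (num_systems, num_sentences) scores into per-sentence win/tie/loss tables."""
    num_systems = len(scores)
    num_sents = len(scores[0]) if scores else 0
    result = []
    for s in range(num_sents):
        table = [[0] * num_systems for _ in range(num_systems)]
        for a in range(num_systems):
            for b in range(num_systems):
                if scores[a][s] > scores[b][s]:
                    table[a][b] = 1     # system a wins against system b
                elif scores[a][s] < scores[b][s]:
                    table[a][b] = -1    # system a loses to system b
        result.append(table)
    return result


# Two systems, two sentences.
pairwise = pairwise_from_scores([[0.9, 0.5], [0.7, 0.5]])
```

Note the axis swap: the input is indexed by system first, while the output is indexed by sentence first.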
- run_expected_wins(pairwise_scores: list[list[list[int]]]) list[float][source]
Apply Expected Wins given pairwise comparison scores. This is the [Bojar+ 11] version: https://aclanthology.org/W11-2101/
Score(i) = sum_j (wins(i, j) / (wins(i, j) + wins(j, i)))
- Parameters:
pairwise_scores (list[list[list[int]]]) – Pairwise comparison results. The shape is (num_sents, num_systems, num_systems).
- Returns:
System-level scores.
- Return type:
list[float]
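The formula Score(i) = sum_j (wins(i, j) / (wins(i, j) + wins(j, i))) can be sketched directly over the pairwise tables. In this sketch ties contribute to neither win count, and pairs of systems with no decided comparison are skipped; both choices are assumptions, as is the function name `expected_wins_sketch`.

```python
def expected_wins_sketch(pairwise_scores: list[list[list[int]]]) -> list[float]:
    """Sum, over opponents j, the fraction of decided comparisons that i won."""
    n = len(pairwise_scores[0]) if pairwise_scores else 0
    wins = [[0] * n for _ in range(n)]
    for table in pairwise_scores:
        for i in range(n):
            for j in range(n):
                if i != j and table[i][j] == 1:
                    wins[i][j] += 1
    scores = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            decided = wins[i][j] + wins[j][i]
            if i != j and decided > 0:
                total += wins[i][j] / decided
        scores.append(total)
    return scores


# System 0 beats system 1 on two sentences and loses on one.
tables = [
    [[0, 1], [-1, 0]],
    [[0, 1], [-1, 0]],
    [[0, -1], [1, 0]],
]
system_scores = expected_wins_sketch(tables)
```

With two systems the scores sum to 1 whenever at least one comparison is decided, so they can be read as each system's share of wins.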
- run_trueskill(pairwise_scores: list[list[list[int]]]) list[float][source]
Apply TrueSkill given pairwise comparison scores.
- Parameters:
pairwise_scores (list[list[list[int]]]) – Pairwise comparison results. The shape is (num_sents, num_systems, num_systems).
- Returns:
System-level scores.
- Return type:
list[float]
- class gec_metrics.metrics.MetricBaseForReferenceBased(config: Config = None)[source]
Bases:
MetricBase
Abstract class for reference-based metrics. All reference-based metrics must be implemented by inheriting from this class.
- class Score(tp: float = 0.0, fp: float = 0.0, fn: float = 0.0, tn: float = 0.0, beta: float = 0.5)[source]
Bases:
object
Handle edit or n-gram counts.
- tp: True Positive.
- fp: False Positive.
- fn: False Negative.
- tn: True Negative.
- beta: The beta for F-beta score.
- property accuracy: float
Calculate the accuracy.
- beta: float
- property f: float
Calculate the F-beta score.
- fn: float
- fp: float
- property precision: float
Calculate the precision.
- property recall: float
Calculate the recall.
- tn: float
- tp: float
- rank_systems(sources: list[str], hypotheses: list[list[str]], references: list[list[str]], aggregation='default') list[float][source]
Compute ranking score for multiple systems.
- Parameters:
sources (list[str]) – Source sentence. The shape is (num_sentences, )
hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
aggregation (str) – How to aggregate sentence-level scores.
- “default”: follows the original aggregation, e.g., average or accumulation.
- “trueskill”: converts sentence-level scores into pairwise comparison results, then applies TrueSkill. This is motivated by https://arxiv.org/abs/2502.09416.
- “expected_wins”: converts sentence-level scores into pairwise comparison results, then applies Expected Wins.
- Returns:
System-level scores.
- Return type:
list[float]
- score_corpus(sources: list[str], hypotheses: list[str], references: list[list[str]]) float[source]
Calculate a corpus-level score. By default, we use the average of the sentence-level scores.
- Parameters:
sources (list[str]) – Source sentence. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
- Returns:
The corpus-level score.
- Return type:
float
- score_pairwise(sources: list[str], hypotheses: list[list[str]], references: list[list[str]]) list[list[list[int]]][source]
Calculate pairwise scores for all combinations of hypotheses. By default, it simply compares the sentence-level scores.
- Parameters:
sources (list[str]) – Source sentence. The shape is (num_sentences, )
hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
- Returns:
- Pairwise comparison results.
The shape is (num_sentences, num_systems, num_systems).
- Return type:
list[list[list]]
- abstractmethod score_sentence(sources: list[str], hypotheses: list[str], references: list[list[str]]) list[float][source]
Calculate sentence-level scores.
- Parameters:
sources (list[str]) – Source sentence. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
- Returns:
The sentence-level scores.
- Return type:
list[float]
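The contract described for this class (subclasses implement `score_sentence`, and `score_corpus` defaults to the average of the sentence-level scores) can be sketched with a stand-in base class. `ReferenceBasedSketch` and `ExactMatch` are hypothetical names used only for illustration; they do not import or reproduce the real `MetricBaseForReferenceBased`.

```python
from abc import ABC, abstractmethod


class ReferenceBasedSketch(ABC):
    """Hypothetical stand-in mirroring the documented contract."""

    @abstractmethod
    def score_sentence(self, sources, hypotheses, references) -> list[float]:
        ...

    def score_corpus(self, sources, hypotheses, references) -> float:
        # Default aggregation: average of the sentence-level scores.
        scores = self.score_sentence(sources, hypotheses, references)
        return sum(scores) / len(scores) if scores else 0.0


class ExactMatch(ReferenceBasedSketch):
    """Toy metric: 1.0 if the hypothesis equals any reference, else 0.0."""

    def score_sentence(self, sources, hypotheses, references):
        # references has shape (num_references, num_sentences).
        return [
            1.0 if any(refs[i] == hyp for refs in references) else 0.0
            for i, hyp in enumerate(hypotheses)
        ]


metric = ExactMatch()
corpus_score = metric.score_corpus(
    sources=["He go home .", "I has a pen ."],
    hypotheses=["He goes home .", "I has a pen ."],
    references=[["He goes home .", "I have a pen ."]],
)
```

A metric that needs a different corpus-level aggregation, such as accumulating counts before dividing, would override `score_corpus` as well.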
- class gec_metrics.metrics.MetricBaseForReferenceFree(config: Config = None)[source]
Bases:
MetricBase
- rank_systems(sources: list[str], hypotheses: list[list[str]], aggregation='default')[source]
Compute ranking score for multiple systems.
- Parameters:
sources (list[str]) – Source sentence. The shape is (num_sentences, )
hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).
aggregation (str) – How to aggregate sentence-level scores.
- “default”: follows the original aggregation, e.g., average or accumulation.
- “trueskill”: converts sentence-level scores into pairwise comparison results, then applies TrueSkill. This is motivated by https://arxiv.org/abs/2502.09416.
- “expected_wins”: converts sentence-level scores into pairwise comparison results, then applies Expected Wins.
- Returns:
System-level scores.
- Return type:
list[float]
- score_corpus(sources: list[str], hypotheses: list[str]) float[source]
Calculate a corpus-level score. By default, we use the average of the sentence-level scores.
- Parameters:
sources (list[str]) – Source sentence. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
- Returns:
The corpus-level score.
- Return type:
float
- score_pairwise(sources: list[str], hypotheses: list[list[str]]) list[list[list[int]]][source]
Calculate pairwise scores for all combinations of hypotheses. By default, it simply compares the sentence-level scores.
- Parameters:
sources (list[str]) – Source sentence. The shape is (num_sentences, )
hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).
- Returns:
- Pairwise comparison results.
The shape is (num_sentences, num_systems, num_systems).
- Return type:
list[list[list]]
- abstractmethod score_sentence(sources: list[str], hypotheses: list[str]) list[float][source]
Calculate sentence-level scores.
- Parameters:
sources (list[str]) – Source sentence. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
- Returns:
The sentence-level scores.
- Return type:
list[float]
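The relationship between the abstract `score_sentence` and the default `score_corpus` can be sketched without the library. Here `ReferenceFreeSketch` is a hypothetical stand-in that only mimics the documented interface, and `TokenRetention` is a toy metric invented for this example; neither is part of gec_metrics.

```python
from abc import ABC, abstractmethod


class ReferenceFreeSketch(ABC):
    """Hypothetical stand-in mimicking the documented interface."""

    @abstractmethod
    def score_sentence(self, sources: list[str], hypotheses: list[str]) -> list[float]:
        """Return one score per (source, hypothesis) pair."""

    def score_corpus(self, sources: list[str], hypotheses: list[str]) -> float:
        # Documented default: the average of the sentence-level scores.
        scores = self.score_sentence(sources, hypotheses)
        return sum(scores) / len(scores)


class TokenRetention(ReferenceFreeSketch):
    """Toy metric: fraction of source tokens that survive in the hypothesis."""

    def score_sentence(self, sources: list[str], hypotheses: list[str]) -> list[float]:
        scores = []
        for src, hyp in zip(sources, hypotheses):
            src_tokens = src.split()
            hyp_tokens = set(hyp.split())
            kept = sum(tok in hyp_tokens for tok in src_tokens)
            scores.append(kept / len(src_tokens) if src_tokens else 1.0)
        return scores


metric = TokenRetention()
sources = ["a b c d", "x y"]
hypotheses = ["a b c e", "x y"]
sentence_scores = metric.score_sentence(sources, hypotheses)
corpus_score = metric.score_corpus(sources, hypotheses)
```

A subclass only needs to implement `score_sentence`; the corpus-level score then follows from the default averaging unless the subclass overrides it.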
- class gec_metrics.metrics.MetricBaseForSourceFree(config: Config = None)[source]
Bases:
MetricBase
Metric without source sentences. This is mainly intended for BERTScore or BARTScore (which will be components of PT-{ERRANT, M2}).
- rank_systems(hypotheses: list[list[str]], references: list[list[str]], aggregation='default')[source]
Compute ranking score for multiple systems.
- Parameters:
hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
aggregation (str) – How to aggregate sentence-level scores.
  - "default": follows the metric's original aggregation, e.g., average or accumulation.
  - "trueskill": converts sentence-level scores into pairwise comparison results, then applies TrueSkill. This is motivated by https://arxiv.org/abs/2502.09416.
  - "expected_wins": converts sentence-level scores into pairwise comparison results, then applies Expected Wins.
- Returns:
list[float]: System-level scores.
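To make the "expected_wins" option more concrete, here is a minimal sketch of one common formulation: for each system, average over its opponents the share of decisive comparisons it won. The function name, the input encoding (positive entry means the row system won), and the handling of pairs without any decisive comparison are all assumptions; the library's implementation may differ.

```python
def expected_wins(pairwise: list[list[list[int]]]) -> list[float]:
    """Aggregate pairwise results of shape (num_sentences, num_systems, num_systems)."""
    num_systems = len(pairwise[0]) if pairwise else 0
    # wins[i][j] counts the sentences on which system i beat system j.
    wins = [[0] * num_systems for _ in range(num_systems)]
    for matrix in pairwise:
        for i in range(num_systems):
            for j in range(num_systems):
                if matrix[i][j] > 0:
                    wins[i][j] += 1
    scores = []
    for i in range(num_systems):
        ratios = []
        for j in range(num_systems):
            if i == j:
                continue
            decided = wins[i][j] + wins[j][i]  # ties are ignored
            # Assumption: a pair with no decisive comparison counts as 0.5.
            ratios.append(wins[i][j] / decided if decided else 0.5)
        scores.append(sum(ratios) / len(ratios) if ratios else 0.0)
    return scores


# Three systems, two sentences (made-up outcomes).
example = [
    [[0, 1, 1], [-1, 0, 1], [-1, -1, 0]],  # sentence 0: system 0 > 1 > 2
    [[0, -1, 1], [1, 0, 1], [-1, -1, 0]],  # sentence 1: system 1 > 0 > 2
]
system_scores = expected_wins(example)
```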
- score_corpus(hypotheses: list[str], references: list[list[str]]) float[source]
Calculate a corpus-level score. By default, we use the average of the sentence-level scores.
- Parameters:
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
- Returns:
The corpus-level score.
- Return type:
float
- score_pairwise(hypotheses: list[list[str]], references: list[list[str]]) list[list[list[int]]][source]
Calculate pairwise scores for all combinations of hypotheses. By default, it simply compares the sentence-level scores.
- Parameters:
hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
- Returns:
Pairwise comparison results. The shape is (num_sentences, num_systems, num_systems).
- Return type:
list[list[list[int]]]
- abstractmethod score_sentence(hypotheses: list[str], references: list[list[str]]) list[float][source]
Calculate sentence-level scores.
- Parameters:
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
- Returns:
The sentence-level scores.
- Return type:
list[float]
- class gec_metrics.metrics.PTERRANT(config: Config = None)[source]
Bases:
ERRANT
- class Config(beta: float = 0.5, language: str = 'en', weight_model_name: str = 'bertscore', weight_model_config: Config = None)[source]
Bases:
Config
Configuration of PTERRANT.
- weight_model_name (str): Model to compute edit-level weights. Currently only "bertscore" is available.
- weight_model_config (MetricBaseForSourceFree.Config): The config instance of the weight model. If not specified, the default one is used.
You can also use the same configuration options as ERRANT.
- weight_model_name: str = 'bertscore'
- calc_edit_weights(src: str, ref: str, edits: list[Edit]) list[float][source]
Calculate a weight for each edit.
- Parameters:
src (str) – Source sentence.
ref (str) – Reference sentence.
edits (list[errant.edit.Edit]) – Edits.
- Returns:
The weight of each edit.
- Return type:
list[float]
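One way to picture edit-level weighting is to apply each edit to the source in isolation and measure how much a similarity score against the reference changes. The sketch below follows that idea under loud assumptions: `toy_similarity` is a token-overlap F1 standing in for BERTScore, edits are plain `(start, end, replacement)` token spans rather than `errant.edit.Edit` objects, and taking the absolute score difference is an assumed weighting rule, not something the text above confirms.

```python
from collections import Counter


def toy_similarity(candidate: str, reference: str) -> float:
    """Stand-in for BERTScore: token-overlap F1 (illustration only)."""
    cand, ref = candidate.split(), reference.split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def apply_edit(src_tokens: list[str], edit: tuple[int, int, str]) -> list[str]:
    """Apply one hypothetical (start, end, replacement) edit to the source tokens."""
    start, end, replacement = edit
    return src_tokens[:start] + replacement.split() + src_tokens[end:]


def edit_weights(src: str, ref: str, edits: list[tuple[int, int, str]]) -> list[float]:
    """Weight each edit by how much applying it alone changes the similarity."""
    base = toy_similarity(src, ref)
    weights = []
    for edit in edits:
        partially_corrected = " ".join(apply_edit(src.split(), edit))
        weights.append(abs(toy_similarity(partially_corrected, ref) - base))
    return weights


src = "He go to school yesterday ."
ref = "He went to school yesterday ."
# First edit fixes the verb; second edit wrongly deletes "yesterday".
edits = [(1, 2, "went"), (4, 5, "")]
weights = edit_weights(src, ref, edits)
```

With a real weight model, the stand-in scorer would be replaced by the configured source-free metric.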
- score_base(sources: list[str], hypotheses: list[str], references: list[list[str]]) list[list[dict[str, Score]]][source]
Calculate scores while retaining sentence and reference boundaries. The results can be aggregated according to the purpose, e.g., at sentence-level or corpus-level.
- Parameters:
sources (list[str]) – Source sentences.
hypotheses (list[str]) – Corrected sentences.
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
- Returns:
The verbose scores. The list shape is (num_sents, num_refs). Each dict contains error type-wise scores.
- Return type:
list[list[dict[str, Score]]]
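Aggregating such verbose scores typically comes down to summing counts and computing an F-beta score (the Config default is beta = 0.5). The sketch below is only an illustration: `Counts` is a hypothetical container with tp / fp / fn fields, not the library's `Score` class, and returning 1.0 for precision or recall when the denominator is zero is an assumed convention.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical container for (possibly weighted) edit counts."""
    tp: float = 0.0
    fp: float = 0.0
    fn: float = 0.0


def f_beta(counts: Counts, beta: float = 0.5) -> float:
    """Standard F-beta from true positives, false positives, false negatives."""
    proposed = counts.tp + counts.fp
    gold = counts.tp + counts.fn
    precision = counts.tp / proposed if proposed > 0 else 1.0  # assumed convention
    recall = counts.tp / gold if gold > 0 else 1.0             # assumed convention
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


def sum_counts(items: list[Counts]) -> Counts:
    """Corpus-level aggregation by accumulating counts over sentences."""
    return Counts(
        tp=sum(c.tp for c in items),
        fp=sum(c.fp for c in items),
        fn=sum(c.fn for c in items),
    )


per_sentence = [Counts(tp=2, fp=0, fn=1), Counts(tp=1, fp=1, fn=2)]
corpus_counts = sum_counts(per_sentence)  # tp=3, fp=1, fn=3
corpus_f = f_beta(corpus_counts, beta=0.5)
```

How the library picks among multiple references before aggregating is not covered by this sketch.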
- class gec_metrics.metrics.SOME(config: Config = None)[source]
Bases:
MetricBaseForReferenceFree
- class Config(model_g: str = 'gfm-models/grammer', model_f: str = 'gfm-models/fluency', model_m: str = 'gfm-models/meaning', weight_g: float = 0.55, weight_f: float = 0.43, weight_m: float = 0.02, no_cuda: bool = False, batch_size: int = 32, max_length: int = 128)[source]
Bases:
Config
SOME configuration.
- model_g (str): Model for grammaticality.
- model_f (str): Model for fluency.
- model_m (str): Model for meaning preservation.
- weight_g (float): Weight for the grammaticality score.
- weight_f (float): Weight for the fluency score.
- weight_m (float): Weight for the meaning preservation score.
- no_cuda (bool): If True, work on CPU.
- batch_size (int): Batch size for inference.
- max_length (int): Maximum length of inputs.
- batch_size: int = 32
- max_length: int = 128
- model_f: str = 'gfm-models/fluency'
- model_g: str = 'gfm-models/grammer'
- model_m: str = 'gfm-models/meaning'
- no_cuda: bool = False
- weight_f: float = 0.43
- weight_g: float = 0.55
- weight_m: float = 0.02
- min_max_normalize(x: int, x_min: int = 1, x_max: int = 4)[source]
Normalizes the input values in the range x_min to x_max.
- x (int): Input value.
- x_min (int): Lower bound of the range.
- x_max (int): Upper bound of the range.
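The configuration and the normalization helper suggest a weighted combination of three sub-scores. The following is a minimal sketch under two assumptions that the text above does not confirm: that normalization is the usual min-max mapping of [x_min, x_max] onto [0, 1], and that the final score is a plain weighted sum. The function `combine` and the raw values are made up for illustration.

```python
def min_max_normalize(x: float, x_min: float = 1, x_max: float = 4) -> float:
    """Assumed formula: map a value in [x_min, x_max] onto [0, 1]."""
    return (x - x_min) / (x_max - x_min)


def combine(
    g: float,
    f: float,
    m: float,
    weight_g: float = 0.55,
    weight_f: float = 0.43,
    weight_m: float = 0.02,
) -> float:
    """Assumed combination: weighted sum using the default Config weights."""
    return weight_g * g + weight_f * f + weight_m * m


# Made-up raw sub-scores on a 1-to-4 scale.
raw_g, raw_f, raw_m = 4.0, 2.5, 1.0
score = combine(
    min_max_normalize(raw_g),
    min_max_normalize(raw_f),
    min_max_normalize(raw_m),
)
```

With the default weights, grammaticality and fluency dominate the result, while meaning preservation contributes only 2% of the total.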
- score_sentence(sources: list[str], hypotheses: list[str]) list[float][source]
Calculate sentence-level scores.
- Parameters:
sources (list[str]) – Source sentences. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
- Returns:
The sentence-level scores.
- Return type:
list[float]
- class gec_metrics.metrics.Scribendi(config: Config = None)[source]
Bases:
MetricBaseForReferenceFree
- class Config(model: str = 'gpt2', threshold: float = 0.8, no_cuda: bool = False, batch_size: int = 32)[source]
Bases:
Config
Scribendi configuration.
- model (str): Model id of a language model.
- threshold (float): Threshold for the maximum of the token sort ratio and the Levenshtein distance ratio.
- no_cuda (bool): If True, work on CPU.
- batch_size (int): Batch size for inference.
- batch_size: int = 32
- model: str = 'gpt2'
- no_cuda: bool = False
- threshold: float = 0.8
- levenshtein_distance_ratio(src: str, pred: str) float[source]
The word-level Levenshtein distance ratio.
- Parameters:
src (str) – The source sentence.
pred (str) – The corrected sentence.
- Returns:
The Levenshtein distance ratio.
- Return type:
float
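A word-level Levenshtein ratio can be sketched as follows. The exact normalization used by the library is not stated above, so this example assumes one common definition: substitutions cost 2 (equivalent to a deletion plus an insertion), and the ratio is `(len(a) + len(b) - distance) / (len(a) + len(b))`. The function name `word_levenshtein_ratio` is hypothetical.

```python
def word_levenshtein_ratio(src: str, pred: str) -> float:
    """Similarity in [0, 1] based on a word-level edit distance."""
    a, b = src.split(), pred.split()
    if not a and not b:
        return 1.0  # two empty sentences are treated as identical
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        cur = [i]
        for j, word_b in enumerate(b, start=1):
            sub_cost = 0 if word_a == word_b else 2  # assumed substitution cost
            cur.append(min(
                prev[j] + 1,             # delete word_a
                cur[j - 1] + 1,          # insert word_b
                prev[j - 1] + sub_cost,  # keep or substitute
            ))
        prev = cur
    distance = prev[-1]
    total = len(a) + len(b)
    return (total - distance) / total


same = word_levenshtein_ratio("He goes home", "He goes home")
one_change = word_levenshtein_ratio("He go home", "He goes home")
```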
- ppl(sents: list[str]) list[float][source]
Compute perplexity using a language model.
- Parameters:
sents (list[str]) – The sentences whose perplexity is computed.
- Returns:
The perplexity of each sentence.
- Return type:
list[float]
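For reference, perplexity is the exponential of the negative mean token log-probability. The sketch below shows only that formula on precomputed natural-log probabilities; it does not call a language model, and the helper name `perplexity` and the input values are made up for illustration.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """exp(-mean log-probability) over the tokens of one sentence."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# If every token has probability 0.25, the perplexity is 4.
ppl_uniform = perplexity([math.log(0.25)] * 5)

# A sentence the model finds more likely gets a lower perplexity.
ppl_likely = perplexity([math.log(0.5), math.log(0.8), math.log(0.9)])
```

In the metric itself, the log-probabilities would come from the configured language model (`gpt2` by default).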
- score_corpus(sources: list[str], hypotheses: list[str]) float[source]
Calculate a corpus-level score.
- Parameters:
sources (list[str]) – Source sentences. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
- Returns:
The corpus-level score.
- Return type:
float
- score_sentence(sources: list[str], hypotheses: list[str]) list[float][source]
Calculate sentence-level scores.
- Parameters:
sources (list[str]) – Source sentences. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
- Returns:
The sentence-level scores.
- Return type:
list[float]
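Putting the pieces together, the decision rule usually described for the Scribendi score can be sketched as below. This is an assumption-laden illustration, not the library's code: it takes precomputed perplexities and a precomputed similarity (the maximum of the token sort ratio and the Levenshtein distance ratio), uses hypothetical function names, and assumes the corpus-level score accumulates the sentence-level scores.

```python
def scribendi_sentence_score(
    src: str,
    hyp: str,
    ppl_src: float,
    ppl_hyp: float,
    similarity: float,
    threshold: float = 0.8,
) -> float:
    """Assumed rule: 0 if unchanged, 1 if improved and still similar, else -1."""
    if src == hyp:
        return 0.0
    if ppl_hyp < ppl_src and similarity >= threshold:
        return 1.0
    return -1.0


def scribendi_corpus_score(sentence_scores: list[float]) -> float:
    """Assumed aggregation: sum of the sentence-level scores."""
    return sum(sentence_scores)


sentence_scores = [
    scribendi_sentence_score("He go home", "He goes home", 120.0, 45.0, 0.9),  # improved
    scribendi_sentence_score("I like it", "I like it", 30.0, 30.0, 1.0),      # unchanged
    scribendi_sentence_score("She sing", "They left early", 80.0, 60.0, 0.2), # too different
]
corpus_score = scribendi_corpus_score(sentence_scores)
```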