gec_metrics.metrics.base module

class gec_metrics.metrics.base.MetricBase(config: Config = None)[source]

Bases: ABC

class Config[source]

Bases: object

make_pairwise_scores(scores: list[list[float]]) → list[list[list]][source]

Convert sentence-level scores into pairwise comparison results.

Parameters:

scores (list[list[float]]) – Sentence-level scores. The shape is (num_systems, num_sentences).

Returns:

Pairwise comparison results for all combinations of the systems. The shape is (num_sents, num_systems, num_systems). You can refer to a comparison result by [sent_id][sys_id1][sys_id2]. Each element is -1, 0, or 1:

  • 0: tie

  • 1: sys_id1 wins against sys_id2

  • -1: sys_id1 loses to sys_id2

Return type:

list[list[list]]
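The conversion described above can be sketched as follows. This is a minimal illustration, not the library's code: the function name pairwise_from_scores is hypothetical, and it assumes that a higher sentence-level score is better.

```python
def pairwise_from_scores(scores: list[list[float]]) -> list[list[list[int]]]:
    """Hypothetical stand-in for make_pairwise_scores (not the library code)."""
    num_systems = len(scores)
    num_sents = len(scores[0]) if scores else 0
    pairwise = []
    for sent_id in range(num_sents):
        # One (num_systems x num_systems) comparison matrix per sentence.
        matrix = [[0] * num_systems for _ in range(num_systems)]
        for i in range(num_systems):
            for j in range(num_systems):
                a, b = scores[i][sent_id], scores[j][sent_id]
                # Assumption: a higher score wins. Yields 1, 0, or -1.
                matrix[i][j] = (a > b) - (a < b)
        pairwise.append(matrix)
    return pairwise


# Two systems, two sentences: shape (num_systems, num_sentences).
scores = [[0.9, 0.2], [0.5, 0.2]]
result = pairwise_from_scores(scores)
print(result[0][0][1])  # comparison of system 0 vs. system 1 on sentence 0
```

Note that the input is indexed by system first, while the output is indexed by sentence first.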

run_expected_wins(pairwise_scores: list[list[list[int]]]) → list[float][source]

Apply Expected Wins given pairwise comparison scores. This is the [Bojar+ 11] version: https://aclanthology.org/W11-2101/

Score(i) = sum_j (wins(i, j) / (wins(i, j) + wins(j, i)))

Parameters:

pairwise_scores (list[list[list[int]]]) – Pairwise comparison results. The shape is (num_sents, num_systems, num_systems).

Returns:

System-level scores.

Return type:

list[float]
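As a rough sketch of the formula above, the following counts wins over all sentences and then sums the win ratios. The function name expected_wins is hypothetical, and skipping pairs that have no decisive comparison (a zero denominator) is an assumption; the library may handle that case differently.

```python
def expected_wins(pairwise_scores: list[list[list[int]]]) -> list[float]:
    """Hypothetical sketch of Score(i) = sum_j wins(i, j) / (wins(i, j) + wins(j, i))."""
    num_systems = len(pairwise_scores[0]) if pairwise_scores else 0
    # wins[i][j]: number of sentences on which system i beats system j.
    wins = [[0] * num_systems for _ in range(num_systems)]
    for matrix in pairwise_scores:
        for i in range(num_systems):
            for j in range(num_systems):
                if i != j and matrix[i][j] == 1:
                    wins[i][j] += 1

    system_scores = []
    for i in range(num_systems):
        total = 0.0
        for j in range(num_systems):
            if i == j:
                continue
            decisive = wins[i][j] + wins[j][i]
            # Assumption: pairs with only ties contribute nothing.
            if decisive > 0:
                total += wins[i][j] / decisive
        system_scores.append(total)
    return system_scores


# System 0 beats system 1 on two sentences and loses on one.
pairwise = [
    [[0, 1], [-1, 0]],
    [[0, 1], [-1, 0]],
    [[0, -1], [1, 0]],
]
ew = expected_wins(pairwise)
print(ew)
```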

run_trueskill(pairwise_scores: list[list[list[int]]]) → list[float][source]

Apply TrueSkill given pairwise comparison scores.

Parameters:

pairwise_scores (list[list[list[int]]]) – Pairwise comparison results. The shape is (num_sents, num_systems, num_systems).

Returns:

System-level scores.

Return type:

list[float]

class gec_metrics.metrics.base.MetricBaseForReferenceBased(config: Config = None)[source]

Bases: MetricBase

Abstract class for reference-based metrics. All reference-based metrics must be implemented by inheriting from this class.

class Config[source]

Bases: Config

class Score(tp: float = 0.0, fp: float = 0.0, fn: float = 0.0, tn: float = 0.0, beta: float = 0.5)[source]

Bases: object

Holds edit or n-gram counts.

  • tp: True positives.

  • fp: False positives.

  • fn: False negatives.

  • tn: True negatives.

  • beta: The beta for the F-beta score.

property accuracy: float

Calculate the accuracy.

beta: float
property f: float

Calculate the F-beta score.

fn: float
fp: float
property precision: float

Calculate the precision.

property recall: float

Calculate the recall.

tn: float
tp: float
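For orientation, the standard definitions behind these properties can be sketched as below. CountScore is a hypothetical stand-in rather than the library's Score class, and returning 0.0 when a denominator is zero is an assumption; the actual implementation may use a different convention.

```python
from dataclasses import dataclass


@dataclass
class CountScore:
    """Hypothetical stand-in for Score, using the textbook definitions."""
    tp: float = 0.0
    fp: float = 0.0
    fn: float = 0.0
    tn: float = 0.0
    beta: float = 0.5

    @property
    def precision(self) -> float:
        denom = self.tp + self.fp
        # Assumption: 0.0 when nothing was proposed.
        return self.tp / denom if denom > 0 else 0.0

    @property
    def recall(self) -> float:
        denom = self.tp + self.fn
        return self.tp / denom if denom > 0 else 0.0

    @property
    def f(self) -> float:
        # F-beta = (1 + beta^2) * P * R / (beta^2 * P + R)
        p, r, b2 = self.precision, self.recall, self.beta ** 2
        denom = b2 * p + r
        return (1 + b2) * p * r / denom if denom > 0 else 0.0

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total > 0 else 0.0


s = CountScore(tp=6, fp=2, fn=4)
print(s.precision, s.recall, s.f)
```

With beta = 0.5, precision is weighted more heavily than recall, which is the usual choice for grammatical error correction.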
rank_systems(sources: list[str], hypotheses: list[list[str]], references: list[list[str]], aggregation='default') → list[float][source]

Compute ranking score for multiple systems.

Parameters:
  • sources (list[str]) – Source sentences. The shape is (num_sentences, )

  • hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

  • aggregation (str) – How to aggregate sentence-level scores.

    • “default”: follows the metric's original aggregation, e.g., average or accumulation.

    • “trueskill”: converts sentence-level scores into pairwise comparison results, then applies TrueSkill. This is motivated by https://arxiv.org/abs/2502.09416.

    • “expected_wins”: converts sentence-level scores into pairwise comparison results, then applies Expected Wins.

Returns:

System-level scores.

Return type:

list[float]

score_corpus(sources: list[str], hypotheses: list[str], references: list[list[str]]) → float[source]

Calculate a corpus-level score. By default, we use the average of the sentence-level scores.

Parameters:
  • sources (list[str]) – Source sentences. The shape is (num_sentences, )

  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

Returns:

The corpus-level score.

Return type:

float

score_pairwise(sources: list[str], hypotheses: list[list[str]], references: list[list[str]]) → list[list[list[int]]][source]

Calculate pairwise scores for all combinations of hypotheses. By default, it simply compares the sentence-level scores.

Parameters:
  • sources (list[str]) – Source sentences. The shape is (num_sentences, )

  • hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

Returns:

Pairwise comparison results. The shape is (num_sentences, num_systems, num_systems).

Return type:

list[list[list]]

abstractmethod score_sentence(sources: list[str], hypotheses: list[str], references: list[list[str]]) → list[float][source]

Calculate sentence-level scores.

Parameters:
  • sources (list[str]) – Source sentences. The shape is (num_sentences, )

  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

Returns:

The sentence-level scores.

Return type:

list[float]
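The relationship between score_sentence and the default score_corpus can be illustrated with a toy sketch that does not import gec_metrics. ToyReferenceBased and ExactMatch are hypothetical names; only the argument shapes and the "average of sentence-level scores" default are taken from the descriptions above.

```python
from abc import ABC, abstractmethod


class ToyReferenceBased(ABC):
    """Hypothetical stand-in that mirrors the documented interface."""

    @abstractmethod
    def score_sentence(
        self, sources: list[str], hypotheses: list[str], references: list[list[str]]
    ) -> list[float]:
        ...

    def score_corpus(
        self, sources: list[str], hypotheses: list[str], references: list[list[str]]
    ) -> float:
        # Default described above: average of the sentence-level scores.
        scores = self.score_sentence(sources, hypotheses, references)
        return sum(scores) / len(scores) if scores else 0.0


class ExactMatch(ToyReferenceBased):
    """Toy metric: 1.0 if the hypothesis equals any reference, else 0.0."""

    def score_sentence(self, sources, hypotheses, references):
        # references has shape (num_references, num_sentences).
        return [
            1.0 if any(refs[i] == hyp for refs in references) else 0.0
            for i, hyp in enumerate(hypotheses)
        ]


sources = ["He go home .", "I has a pen ."]
hypotheses = ["He goes home .", "I has a pen ."]
references = [
    ["He goes home .", "I have a pen ."],
    ["He went home .", "I have a pen ."],
]
metric = ExactMatch()
sentence_scores = metric.score_sentence(sources, hypotheses, references)
corpus_score = metric.score_corpus(sources, hypotheses, references)
print(sentence_scores, corpus_score)
```

A subclass only has to supply score_sentence; the corpus-level score then follows from the inherited default.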

class gec_metrics.metrics.base.MetricBaseForReferenceFree(config: Config = None)[source]

Bases: MetricBase

class Config[source]

Bases: Config

rank_systems(sources: list[str], hypotheses: list[list[str]], aggregation='default')[source]

Compute ranking score for multiple systems.

Parameters:
  • sources (list[str]) – Source sentences. The shape is (num_sentences, )

  • hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).

  • aggregation (str) – How to aggregate sentence-level scores.

    • “default”: follows the metric's original aggregation, e.g., average or accumulation.

    • “trueskill”: converts sentence-level scores into pairwise comparison results, then applies TrueSkill. This is motivated by https://arxiv.org/abs/2502.09416.

    • “expected_wins”: converts sentence-level scores into pairwise comparison results, then applies Expected Wins.

Returns:

System-level scores.

Return type:

list[float]

score_corpus(sources: list[str], hypotheses: list[str]) → float[source]

Calculate a corpus-level score. By default, we use the average of the sentence-level scores.

Parameters:
  • sources (list[str]) – Source sentences. The shape is (num_sentences, )

  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

Returns:

The corpus-level score.

Return type:

float

score_pairwise(sources: list[str], hypotheses: list[list[str]]) → list[list[list[int]]][source]

Calculate pairwise scores for all combinations of hypotheses. By default, it simply compares the sentence-level scores.

Parameters:
  • sources (list[str]) – Source sentences. The shape is (num_sentences, )

  • hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).

Returns:

Pairwise comparison results. The shape is (num_sentences, num_systems, num_systems).

Return type:

list[list[list]]

abstractmethod score_sentence(sources: list[str], hypotheses: list[str]) → list[float][source]

Calculate sentence-level scores.

Parameters:
  • sources (list[str]) – Source sentences. The shape is (num_sentences, )

  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

Returns:

The sentence-level scores.

Return type:

list[float]

class gec_metrics.metrics.base.MetricBaseForSourceFree(config: Config = None)[source]

Bases: MetricBase

Metric that does not use source sentences. This is mainly intended for BERTScore or BARTScore (which will be used as a component of PT-{ERRANT, M2}).

class Config[source]

Bases: Config

rank_systems(hypotheses: list[list[str]], references: list[list[str]], aggregation='default')[source]

Compute ranking score for multiple systems.

Parameters:
  • hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

  • aggregation (str) – How to aggregate sentence-level scores.

    • “default”: follows the metric's original aggregation, e.g., average or accumulation.

    • “trueskill”: converts sentence-level scores into pairwise comparison results, then applies TrueSkill. This is motivated by https://arxiv.org/abs/2502.09416.

    • “expected_wins”: converts sentence-level scores into pairwise comparison results, then applies Expected Wins.

Returns:

System-level scores.

Return type:

list[float]

score_corpus(hypotheses: list[str], references: list[list[str]]) → float[source]

Calculate a corpus-level score. By default, we use the average of the sentence-level scores.

Parameters:
  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

Returns:

The corpus-level score.

Return type:

float

score_pairwise(hypotheses: list[list[str]], references: list[list[str]]) → list[list[list[int]]][source]

Calculate pairwise scores for all combinations of hypotheses. By default, it simply compares the sentence-level scores.

Parameters:
  • hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

Returns:

Pairwise comparison results. The shape is (num_sentences, num_systems, num_systems).

Return type:

list[list[list]]

abstractmethod score_sentence(hypotheses: list[str], references: list[list[str]]) → list[float][source]

Calculate sentence-level scores.

Parameters:
  • hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )

  • references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).

Returns:

The sentence-level scores.

Return type:

list[float]

gec_metrics.metrics.base.inputs_handler(metric: MetricBase, sources: list[str], hypotheses: list[str], references: list[list[str]]) → dict[source]

Handles the differing input interfaces of the metric classes. Given sources, hypotheses, and references, this function selects the inputs appropriate to the metric class.

Returns:

A dictionary containing some of the keys “sources”, “hypotheses”, and “references”. It can be passed as **<variable> to the metric.score_*() functions.

Return type:

dict

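from gec_metrics.metrics import ERRANT, inputs_handler

metric = ERRANT()
# Keep only the arguments that this metric's interface accepts.
inputs = inputs_handler(
    metric=metric,
    sources=[],
    hypotheses=[],
    references=[[]]
)
# Unpack the selected arguments into the scoring call.
metric.score_corpus(**inputs)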