gec_metrics.metrics.base module
- class gec_metrics.metrics.base.MetricBase(config: Config = None)[source]
Bases: ABC
- make_pairwise_scores(scores: list[list[float]]) list[list[list]][source]
Convert sentence-level scores into pairwise comparison results.
- Parameters:
scores (list[list[float]]) – Sentence-level scores. The shape is (num_systems, num_sentences).
- Returns:
Pairwise comparison results for all combinations of the systems. The shape is (num_sents, num_systems, num_systems). You can refer to a comparison result by [sent_id][sys_id1][sys_id2]. Each element is -1, 0, or 1:
- 0: tie
- 1: sys_id1 wins against sys_id2
- -1: sys_id1 loses to sys_id2
- Return type:
list[list[list]]
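The conversion described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name `pairwise_from_scores` is hypothetical, and the sketch assumes that a higher sentence-level score is better.

```python
def pairwise_from_scores(scores: list[list[float]]) -> list[list[list[int]]]:
    """Convert (num_systems, num_sentences) scores into
    (num_sentences, num_systems, num_systems) comparison results."""
    num_systems = len(scores)
    num_sents = len(scores[0]) if scores else 0
    pairwise = []
    for sent_id in range(num_sents):
        table = [[0] * num_systems for _ in range(num_systems)]
        for i in range(num_systems):
            for j in range(num_systems):
                a, b = scores[i][sent_id], scores[j][sent_id]
                # 1: system i wins, -1: system i loses, 0: tie.
                # Assumption: higher scores are better.
                table[i][j] = (a > b) - (a < b)
        pairwise.append(table)
    return pairwise


# Two systems, two sentences: system 0 wins on sentence 0, sentence 1 is a tie.
example = pairwise_from_scores([[0.9, 0.2], [0.5, 0.2]])
```

With this layout, `example[0][0][1]` holds the outcome of system 0 against system 1 on the first sentence.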
- run_expected_wins(pairwise_scores: list[list[list[int]]]) list[float][source]
Apply Expected Wins given pairwise comparison scores. This is the [Bojar+ 11] version: https://aclanthology.org/W11-2101/
Score(i) = sum_j (wins(i, j) / (wins(i, j) + wins(j, i)))
Here wins(i, j) is the number of sentences on which system i beats system j.
- Parameters:
pairwise_scores (list[list[list[int]]]) – Pairwise comparison results. The shape is (num_sents, num_systems, num_systems).
- Returns:
System-level scores.
- Return type:
list[float]
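As a rough illustration of the formula above, the sketch below counts wins from the pairwise results and sums the win ratios per system. The function name `expected_wins` is hypothetical, and skipping system pairs that have no decisive comparison (all ties) is an assumption of this sketch; the library may handle that case differently.

```python
def expected_wins(pairwise_scores: list[list[list[int]]]) -> list[float]:
    """Score(i) = sum_j wins(i, j) / (wins(i, j) + wins(j, i))."""
    num_systems = len(pairwise_scores[0]) if pairwise_scores else 0
    # wins[i][j]: number of sentences on which system i beats system j.
    wins = [[0] * num_systems for _ in range(num_systems)]
    for table in pairwise_scores:
        for i in range(num_systems):
            for j in range(num_systems):
                if table[i][j] == 1:
                    wins[i][j] += 1
    system_scores = []
    for i in range(num_systems):
        total = 0.0
        for j in range(num_systems):
            if i == j:
                continue
            decided = wins[i][j] + wins[j][i]
            # Assumption: a pair with only ties contributes nothing.
            if decided > 0:
                total += wins[i][j] / decided
        system_scores.append(total)
    return system_scores


# System 0 beats system 1 on two sentences and loses on one.
results = expected_wins([
    [[0, 1], [-1, 0]],
    [[0, 1], [-1, 0]],
    [[0, -1], [1, 0]],
])
```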
- run_trueskill(pairwise_scores: list[list[list[int]]]) list[float][source]
Apply TrueSkill given pairwise comparison scores.
- Parameters:
pairwise_scores (list[list[list[int]]]) – Pairwise comparison results. The shape is (num_sents, num_systems, num_systems).
- Returns:
System-level scores.
- Return type:
list[float]
- class gec_metrics.metrics.base.MetricBaseForReferenceBased(config: Config = None)[source]
Bases: MetricBase
Abstract class for reference-based metrics. All reference-based metrics must be implemented by inheriting from this class.
- class Score(tp: float = 0.0, fp: float = 0.0, fn: float = 0.0, tn: float = 0.0, beta: float = 0.5)[source]
Bases: object
Handles edit or n-gram counts.
- tp: True positives.
- fp: False positives.
- fn: False negatives.
- tn: True negatives.
- beta: The beta for the F-beta score.
- property accuracy: float
Calculate the accuracy.
- beta: float
- property f: float
Calculate the F-beta score.
- fn: float
- fp: float
- property precision: float
Calculate the precision.
- property recall: float
Calculate the recall.
- tn: float
- tp: float
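The properties listed above correspond to the standard count-based definitions. The sketch below is a stand-in written for illustration: the class name `CountScore` is hypothetical, and returning 0.0 when a denominator is zero is an assumption, since the library's handling of empty counts is not stated here.

```python
from dataclasses import dataclass


@dataclass
class CountScore:
    """Hypothetical stand-in for a tp/fp/fn/tn container."""
    tp: float = 0.0
    fp: float = 0.0
    fn: float = 0.0
    tn: float = 0.0
    beta: float = 0.5

    @property
    def precision(self) -> float:
        denom = self.tp + self.fp
        return self.tp / denom if denom > 0 else 0.0  # assumption: 0.0 on empty

    @property
    def recall(self) -> float:
        denom = self.tp + self.fn
        return self.tp / denom if denom > 0 else 0.0  # assumption: 0.0 on empty

    @property
    def f(self) -> float:
        # F-beta = (1 + beta^2) * P * R / (beta^2 * P + R)
        p, r, b2 = self.precision, self.recall, self.beta ** 2
        denom = b2 * p + r
        return (1 + b2) * p * r / denom if denom > 0 else 0.0

    @property
    def accuracy(self) -> float:
        denom = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / denom if denom > 0 else 0.0


counts = CountScore(tp=2, fp=2, fn=6, tn=0)
```

With the default beta of 0.5, precision is weighted more heavily than recall, which is why `counts.f` lands closer to the precision than a balanced F1 would.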
- rank_systems(sources: list[str], hypotheses: list[list[str]], references: list[list[str]], aggregation='default') list[float][source]
Compute ranking score for multiple systems.
- Parameters:
sources (list[str]) – Source sentences. The shape is (num_sentences, )
hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
aggregation (str) – How to aggregate sentence-level scores.
- "default": follows the metric's original aggregation, e.g., average or accumulation.
- "trueskill": converts sentence-level scores into pairwise comparison results, then applies TrueSkill. This is motivated by https://arxiv.org/abs/2502.09416.
- "expected_wins": converts sentence-level scores into pairwise comparison results, then applies Expected Wins.
- Returns:
System-level scores.
- Return type:
list[float]
- score_corpus(sources: list[str], hypotheses: list[str], references: list[list[str]]) float[source]
Calculate a corpus-level score. By default, we use the average of the sentence-level scores.
- Parameters:
sources (list[str]) – Source sentences. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
- Returns:
The corpus-level score.
- Return type:
float
- score_pairwise(sources: list[str], hypotheses: list[list[str]], references: list[list[str]]) list[list[list[int]]][source]
Calculate pairwise scores for all combinations of hypotheses. By default, it simply compares the sentence-level scores.
- Parameters:
sources (list[str]) – Source sentences. The shape is (num_sentences, )
hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
- Returns:
Pairwise comparison results. The shape is (num_sentences, num_systems, num_systems).
- Return type:
list[list[list]]
- abstractmethod score_sentence(sources: list[str], hypotheses: list[str], references: list[list[str]]) list[float][source]
Calculate sentence-level scores.
- Parameters:
sources (list[str]) – Source sentences. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
- Returns:
The sentence-level scores.
- Return type:
list[float]
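The contract described above (an abstract `score_sentence` plus a `score_corpus` that averages sentence-level scores by default) can be illustrated with a self-contained stand-in. Nothing here imports gec_metrics: `ToyReferenceBased` and `ExactMatch` are hypothetical names, and returning 0.0 for an empty input is an assumption of this sketch.

```python
from abc import ABC, abstractmethod


class ToyReferenceBased(ABC):
    """Hypothetical stand-in mirroring the documented interface."""

    @abstractmethod
    def score_sentence(self, sources: list[str], hypotheses: list[str],
                       references: list[list[str]]) -> list[float]:
        """Return one score per sentence."""

    def score_corpus(self, sources: list[str], hypotheses: list[str],
                     references: list[list[str]]) -> float:
        # Default aggregation: average of the sentence-level scores.
        scores = self.score_sentence(sources, hypotheses, references)
        return sum(scores) / len(scores) if scores else 0.0


class ExactMatch(ToyReferenceBased):
    """Toy metric: 1.0 if the hypothesis equals any reference, else 0.0."""

    def score_sentence(self, sources, hypotheses, references):
        # references has shape (num_references, num_sentences).
        return [
            1.0 if any(ref[i] == hyp for ref in references) else 0.0
            for i, hyp in enumerate(hypotheses)
        ]


toy = ExactMatch()
srcs = ["a b", "c d"]
hyps = ["a b", "c e"]
refs = [["a b", "c d"], ["a b .", "c f"]]
sentence_scores = toy.score_sentence(srcs, hyps, refs)
corpus_score = toy.score_corpus(srcs, hyps, refs)
```

Only `score_sentence` has to be written by the subclass here; the corpus-level score falls out of the default averaging.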
- class gec_metrics.metrics.base.MetricBaseForReferenceFree(config: Config = None)[source]
Bases: MetricBase
- rank_systems(sources: list[str], hypotheses: list[list[str]], aggregation='default')[source]
Compute ranking score for multiple systems.
- Parameters:
sources (list[str]) – Source sentences. The shape is (num_sentences, )
hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).
aggregation (str) – How to aggregate sentence-level scores.
- "default": follows the metric's original aggregation, e.g., average or accumulation.
- "trueskill": converts sentence-level scores into pairwise comparison results, then applies TrueSkill. This is motivated by https://arxiv.org/abs/2502.09416.
- "expected_wins": converts sentence-level scores into pairwise comparison results, then applies Expected Wins.
- Returns:
System-level scores.
- Return type:
list[float]
- score_corpus(sources: list[str], hypotheses: list[str]) float[source]
Calculate a corpus-level score. By default, we use the average of the sentence-level scores.
- Parameters:
sources (list[str]) – Source sentences. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
- Returns:
The corpus-level score.
- Return type:
float
- score_pairwise(sources: list[str], hypotheses: list[list[str]]) list[list[list[int]]][source]
Calculate pairwise scores for all combinations of hypotheses. By default, it simply compares the sentence-level scores.
- Parameters:
sources (list[str]) – Source sentences. The shape is (num_sentences, )
hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).
- Returns:
Pairwise comparison results. The shape is (num_sentences, num_systems, num_systems).
- Return type:
list[list[list]]
- abstractmethod score_sentence(sources: list[str], hypotheses: list[str]) list[float][source]
Calculate sentence-level scores.
- Parameters:
sources (list[str]) – Source sentences. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
- Returns:
The sentence-level scores.
- Return type:
list[float]
- class gec_metrics.metrics.base.MetricBaseForSourceFree(config: Config = None)[source]
Bases: MetricBase
Metric without source sentences. This is mainly intended for BERTScore or BARTScore (which will be components of PT-{ERRANT, M2}).
- rank_systems(hypotheses: list[list[str]], references: list[list[str]], aggregation='default')[source]
Compute ranking score for multiple systems.
- Parameters:
hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
aggregation (str) – How to aggregate sentence-level scores.
- "default": follows the metric's original aggregation, e.g., average or accumulation.
- "trueskill": converts sentence-level scores into pairwise comparison results, then applies TrueSkill. This is motivated by https://arxiv.org/abs/2502.09416.
- "expected_wins": converts sentence-level scores into pairwise comparison results, then applies Expected Wins.
- Returns:
System-level scores.
- Return type:
list[float]
- score_corpus(hypotheses: list[str], references: list[list[str]]) float[source]
Calculate a corpus-level score. By default, we use the average of the sentence-level scores.
- Parameters:
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
- Returns:
The corpus-level score.
- Return type:
float
- score_pairwise(hypotheses: list[list[str]], references: list[list[str]]) list[list[list[int]]][source]
Calculate pairwise scores for all combinations of hypotheses. By default, it simply compares the sentence-level scores.
- Parameters:
hypotheses (list[list[str]]) – Corrected sentences. The shape is (num_systems, num_sentences).
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
- Returns:
Pairwise comparison results. The shape is (num_sentences, num_systems, num_systems).
- Return type:
list[list[list]]
- abstractmethod score_sentence(hypotheses: list[str], references: list[list[str]]) list[float][source]
Calculate sentence-level scores.
- Parameters:
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
references (list[list[str]]) – Reference sentences. The shape is (num_references, num_sentences).
- Returns:
The sentence-level scores.
- Return type:
list[float]
- gec_metrics.metrics.base.inputs_handler(metric: MetricBase, sources: list[str], hypotheses: list[str], references: list[list[str]]) dict[source]
Handles the different input interfaces of the metric classes.
Given sources, hypotheses, and references, this function chooses the appropriate inputs according to the metric class.
- Returns:
A dictionary containing some of the keys "sources", "hypotheses", and "references". It can be passed as **<variable> to the metric.score_*() functions.
- Return type:
dict
from gec_metrics.metrics import ERRANT, inputs_handler
metric = ERRANT()
inputs = inputs_handler(
    metric=metric,
    sources=[],
    hypotheses=[],
    references=[[]]
)
metric.score_corpus(**inputs)
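The selection logic described for inputs_handler can be pictured with the stand-alone sketch below. It does not use gec_metrics: the three marker classes and the function `select_inputs` are hypothetical, and dispatching with `isinstance` is an assumption about how such a handler could work, not a description of the library's code.

```python
class ReferenceBased:
    """Hypothetical marker: needs sources, hypotheses, and references."""


class ReferenceFree:
    """Hypothetical marker: needs sources and hypotheses."""


class SourceFree:
    """Hypothetical marker: needs hypotheses and references."""


def select_inputs(metric, sources, hypotheses, references) -> dict:
    """Keep only the keyword arguments that the metric's interface accepts."""
    if isinstance(metric, ReferenceFree):
        return {"sources": sources, "hypotheses": hypotheses}
    if isinstance(metric, SourceFree):
        return {"hypotheses": hypotheses, "references": references}
    # Assumption: anything else takes all three inputs.
    return {
        "sources": sources,
        "hypotheses": hypotheses,
        "references": references,
    }


free_inputs = select_inputs(ReferenceFree(), ["s"], ["h"], [["r"]])
full_inputs = select_inputs(ReferenceBased(), ["s"], ["h"], [["r"]])
```

The returned dictionary can then be unpacked with `**` into a scoring call, so calling code does not need to branch on the metric type.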