gec_metrics.meta_eval.base module
- class gec_metrics.meta_eval.base.MetaEvalBase(config: Config = None)
Bases: ABC
- class Corr(pearson: float = None, spearman: float = None, accuracy: float = None, kendall: float = None, human_scores: list[float] = None, metric_scores: list[float] = None)
Bases: object
- accuracy: float = None
- human_scores: list[float] = None
- kendall: float = None
- metric_scores: list[float] = None
- pearson: float = None
- spearman: float = None
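To make the role of these fields concrete, here is a minimal sketch of a container with the same fields, filled with Pearson and Spearman coefficients computed in plain Python. `CorrSketch`, `pearson`, `ranks`, and `spearman` are hypothetical names used only for illustration; they are not part of gec_metrics, and the library may compute its coefficients differently (for example, with tie handling).

```python
from dataclasses import dataclass
from math import sqrt
from typing import Optional


@dataclass
class CorrSketch:
    """Hypothetical stand-in mirroring the fields documented for Corr."""
    pearson: Optional[float] = None
    spearman: Optional[float] = None
    accuracy: Optional[float] = None
    kendall: Optional[float] = None
    human_scores: Optional[list] = None
    metric_scores: Optional[list] = None


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks in ascending order (simplified: ties are not averaged)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = float(rank)
    return out


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up scores for four systems, purely for illustration.
human = [0.9, 0.4, 0.1, 0.6]
metric = [0.8, 0.5, 0.2, 0.3]
corr = CorrSketch(
    pearson=pearson(human, metric),
    spearman=spearman(human, metric),
    human_scores=human,
    metric_scores=metric,
)
```

Fields that a given analysis does not compute (here `accuracy` and `kendall`) simply stay at `None`, which matches the defaults in the signature above.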
- abstractmethod corr_sentence(metric: MetricBase)
Compute sentence-level correlations.
- Parameters:
metric (MetricBase) – The metric to be evaluated.
- Returns:
The sentence-level correlations output.
- Return type:
**SentenceCorrOutput
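The reference does not say how sentence-level agreement is computed. One common approach, assumed here only for illustration, compares every pair of systems on the same sentence and checks whether the metric orders them the same way the human scores do. The function name `pairwise_agreement` and the convention that higher scores are better are assumptions, not the library's API.

```python
from itertools import combinations


def pairwise_agreement(human, metric):
    """Return (accuracy, kendall) for one sentence.

    human, metric: per-system scores, higher meaning better (an assumption).
    Pairs tied in the human scores are skipped; pairs tied only in the
    metric scores count as discordant in this simplified version.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(human)), 2):
        h_diff = human[i] - human[j]
        if h_diff == 0:
            continue  # no human preference for this pair
        if h_diff * (metric[i] - metric[j]) > 0:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    if total == 0:
        return None, None  # nothing to compare
    return concordant / total, (concordant - discordant) / total


# Made-up scores for three systems on a single sentence.
accuracy, kendall = pairwise_agreement([3, 1, 2], [0.9, 0.2, 0.1])
```

In this sketch, accuracy is the fraction of concordant pairs and Kendall's coefficient is the normalized difference between concordant and discordant pairs, which would fit the `accuracy` and `kendall` fields of `Corr`.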
- abstractmethod corr_system(metric: MetricBase, aggregation: str = 'default')
Compute system-level correlations.
- Parameters:
metric (MetricBase) – The metric to be evaluated.
aggregation (str) – How to aggregate sentence-level scores into system rankings. One of 'default' (default aggregation, e.g., average or accumulation) or 'trueskill' (TrueSkill aggregation).
- Returns:
The system-level correlations output.
- Return type:
**SystemCorrOutput
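As a rough illustration of what 'default' aggregation could look like, the sketch below averages each system's sentence-level scores and derives a ranking from the averages. The helpers `aggregate_default` and `ranking` are hypothetical; the actual aggregation depends on the metric (the text above mentions both average and accumulation), and TrueSkill aggregation is not shown.

```python
def aggregate_default(sent_scores):
    """Average sentence-level scores per system.

    sent_scores: [num_systems][num_sentences] (assumed layout).
    """
    return [sum(scores) / len(scores) for scores in sent_scores]


def ranking(system_scores):
    """System indices ordered from best to worst (higher score is better)."""
    return sorted(
        range(len(system_scores)),
        key=lambda i: system_scores[i],
        reverse=True,
    )


# Made-up sentence-level scores for three systems on two sentences.
sent_scores = [[0.5, 0.7], [0.9, 0.8], [0.2, 0.4]]
system_scores = aggregate_default(sent_scores)
system_ranking = ranking(system_scores)
```

The resulting system scores are what would then be correlated with the human system scores to obtain system-level Pearson and Spearman values.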
- pairwise_analysis(metric: MetricBase)
Compute sentence-level correlations.
- Parameters:
metric (MetricBase) – The metric to be evaluated.
- Returns:
The sentence-level correlations output.
- Return type:
**SentenceCorrOutput
- rearange_sent_data(data)
Rearrange the format and content of sentence-level evaluation results. This is intended for use with the LLMKobayashi24** metric. The LLMKobayashi24** sentence-level meta-evaluation requires as input
the same set of corrected sentences as in the SEEDA manual evaluation.
- However, when the number of corrected sentences is larger than 5,
the sentences that are sampled and evaluated are not necessarily the same as those in SEEDA.
- To ensure that the evaluation uses the same sentences as SEEDA,
this function replaces the unused sentences with the used ones,
so that only the sentences evaluated in SEEDA are present in the hypothesis set.
- In addition, different annotators evaluate different subsets of the same hypothesis set,
so the data is flattened to make it easier to handle.
- Specifically, human scores are reshaped
from [num_sentences][num_annotations][num_systems] to [num_sentences * num_annotations][1][num_systems].
The hypotheses are expanded accordingly: [num_systems][num_sentences] -> [num_systems][num_sentences * num_annotations].
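The flattening step described above can be sketched as follows. This is a minimal illustration of the shape change only; it does not perform the replacement of unused sentences, and `flatten_annotations` is a hypothetical helper rather than the library's implementation.

```python
def flatten_annotations(human_scores, hypotheses):
    """Flatten the annotation axis into the sentence axis.

    human_scores: [num_sentences][num_annotations][num_systems]
    hypotheses:   [num_systems][num_sentences]
    Returns scores shaped [num_sentences * num_annotations][1][num_systems]
    and hypotheses shaped [num_systems][num_sentences * num_annotations].
    """
    flat_scores = []
    flat_hyps = [[] for _ in hypotheses]
    for sent_id, annotations in enumerate(human_scores):
        for annotation in annotations:
            # Each annotation becomes its own "sentence" with one annotation.
            flat_scores.append([annotation])
            # Repeat the sentence's hypothesis once per annotation.
            for sys_id, sys_hyps in enumerate(hypotheses):
                flat_hyps[sys_id].append(sys_hyps[sent_id])
    return flat_scores, flat_hyps


# Two sentences, two systems; the first sentence has two annotations.
human_scores = [[[1, 2], [2, 1]], [[1, 1]]]
hypotheses = [["a0", "a1"], ["b0", "b1"]]
flat_scores, flat_hyps = flatten_annotations(human_scores, hypotheses)
```

Because the loop iterates over whatever annotations each sentence has, the sketch also works when sentences have different numbers of annotations; the flattened length is then the total annotation count.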
- window_analysis_system(metric: MetricBase, window: int = 4, aggregation='default') -> SEEDAWindowAnalysisSystemCorrOutput
System-level window analysis.
- Parameters:
metric (MetricBase) – The metric to be evaluated.
window (int) – The window size.
- Returns:
The correlations.
Contains .ew_edit, .ew_sent, .ts_edit, and .ts_sent.
Each is a dictionary: {(start_rank, end_rank): Corr}.
- Return type: