gec_metrics.meta_eval package
Submodules
Module contents
- class gec_metrics.meta_eval.MetaEvalBase(config: Config = None)[source]
Bases:
ABC
- class Corr(pearson: float = None, spearman: float = None, accuracy: float = None, kendall: float = None, human_scores: list[float] = None, metric_scores: list[float] = None)[source]
Bases:
object
- accuracy: float = None
- human_scores: list[float] = None
- kendall: float = None
- metric_scores: list[float] = None
- pearson: float = None
- spearman: float = None
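To illustrate how the pearson and spearman fields relate to human_scores and metric_scores, here is a minimal sketch using only the standard library. CorrSketch is a hypothetical stand-in, not the library's Corr class, and the score values are toy data; how the library itself computes these values is not shown in this reference.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Optional


@dataclass
class CorrSketch:
    """Hypothetical stand-in mirroring some fields of MetaEvalBase.Corr."""
    pearson: Optional[float] = None
    spearman: Optional[float] = None
    human_scores: Optional[list[float]] = None
    metric_scores: Optional[list[float]] = None


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length, non-constant score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy scores for four systems (illustrative values only).
human = [0.9, 0.4, 0.7, 0.1]
metric = [0.8, 0.5, 0.6, 0.2]
corr = CorrSketch(
    pearson=pearson(human, metric),
    spearman=spearman(human, metric),
    human_scores=human,
    metric_scores=metric,
)
```

In this toy data both lists order the systems identically, so the rank correlation is perfect while the Pearson value is slightly lower.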
- abstractmethod corr_sentence(metric: MetricBase)[source]
Compute sentence-level correlations.
- Parameters:
metric (MetricBase) – The metric to be evaluated.
- Returns:
The sentence-level correlations output.
- Return type:
SentenceCorrOutput
- abstractmethod corr_system(metric: MetricBase, aggregation: str = 'default')[source]
Compute system-level correlations.
- Parameters:
metric (MetricBase) – The metric to be evaluated.
aggregation (str) – How to aggregate sentence-level scores into system rankings.
- ‘default’: Default aggregation, e.g., average or accumulation.
- ‘trueskill’: TrueSkill aggregation.
- Returns:
The system-level correlations output.
- Return type:
SystemCorrOutput
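As an illustration of averaging, one of the 'default' aggregation examples named above, the sketch below reduces sentence-level metric scores to one score per system and derives a ranking. The function names and data are hypothetical; the aggregation a given metric actually uses (average or accumulation) is not specified here, and TrueSkill aggregation is not shown.

```python
def average_aggregation(sentence_scores: list[list[float]]) -> list[float]:
    """Average each system's sentence-level scores.

    Input shape: (num_systems, num_sentences). Output shape: (num_systems, ).
    """
    return [sum(scores) / len(scores) for scores in sentence_scores]


def ranking(system_scores: list[float]) -> list[int]:
    """Return system indices ordered from the highest score to the lowest."""
    return sorted(range(len(system_scores)), key=lambda i: system_scores[i], reverse=True)


# Toy sentence-level scores for three systems over three sentences.
sentence_scores = [
    [0.5, 0.7, 0.6],
    [0.9, 0.8, 1.0],
    [0.2, 0.4, 0.3],
]
system_scores = average_aggregation(sentence_scores)
system_ranking = ranking(system_scores)
```

The resulting per-system scores are what a system-level correlation would compare against the human scores.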
- pairwise_analysis(metric: MetricBase)[source]
Compute sentence-level correlations.
- Parameters:
metric (MetricBase) – The metric to be evaluated.
- Returns:
The sentence-level correlations output.
- Return type:
SentenceCorrOutput
- rearange_sent_data(data)[source]
Rearrange the format and content of sentence-level evaluation results. This is intended for use with the LLMKobayashi24 metric, whose sentence-level meta-evaluation requires as input
the same set of corrected sentences as in the SEEDA manual evaluation.
- However, when there are more than 5 corrected sentences,
the sentences sampled for evaluation are not necessarily the same as those in SEEDA.
- To ensure that the same sentences as in SEEDA are used in the evaluation,
this function replaces the unused sentences with the used ones,
so that only the sentences evaluated in SEEDA are present in the hypothesis set.
- Also, different annotators evaluate different subsets of the same hypothesis set,
so the data is flattened to make this easier to handle.
- Specifically, human scores are reshaped
from [num_sentences][num_annotations][num_systems] to [num_sentences * num_annotations][1][num_systems].
The hypotheses are also expanded: [num_systems][num_sentences] -> [num_systems][num_sentences * num_annotations].
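The flattening step described above can be sketched as follows. This is a minimal illustration of the shape change only, with a hypothetical function name and toy data; it does not reproduce the replacement of unused sentences, and the library's own implementation may differ.

```python
def flatten_annotations(human_scores, hypotheses):
    """Flatten the annotation dimension into the sentence dimension.

    human_scores: [num_sentences][num_annotations][num_systems]
    hypotheses:   [num_systems][num_sentences]
    Returns scores shaped [num_sentences * num_annotations][1][num_systems]
    and hypotheses shaped [num_systems][num_sentences * num_annotations].
    """
    # One entry per (sentence, annotation) pair, each wrapped in a length-1 list.
    flat_scores = [[annotation] for sentence in human_scores for annotation in sentence]
    # Repeat each hypothesis once per annotation of its sentence.
    repeats = [len(sentence) for sentence in human_scores]
    flat_hyps = [
        [hyp for hyp, n in zip(system_hyps, repeats) for _ in range(n)]
        for system_hyps in hypotheses
    ]
    return flat_scores, flat_hyps


# Toy data: 2 sentences, 2 annotations each, 3 systems.
human_scores = [
    [[-1, -2, -3], [-2, -1, -3]],
    [[-3, -1, -2], [-1, -3, -2]],
]
hypotheses = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"]]
flat_scores, flat_hyps = flatten_annotations(human_scores, hypotheses)
```

After flattening, every row of flat_scores holds exactly one annotation, so each annotator's judgments can be treated as a separate sentence.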
- window_analysis_system(metric: MetricBase, window: int = 4, aggregation='default') SEEDAWindowAnalysisSystemCorrOutput[source]
System-level window analysis.
- Parameters:
metric (MetricBase) – The metric to be evaluated.
window (int) – The window size.
- Returns:
- The correlations.
Contains .ew_edit, .ew_sent, .ts_edit, .ts_sent.
Each is a dictionary: {(start_rank, end_rank): Corr}.
- Return type:
SEEDAWindowAnalysisSystemCorrOutput
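The {(start_rank, end_rank): Corr} layout can be illustrated with a small sliding-window sketch. It assumes systems are ordered by human score, that rank indices are 0-based and inclusive, and that a plain Pearson value stands in for the Corr object; these conventions are assumptions for illustration, not confirmed details of the library.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length, non-constant score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def window_analysis(human: list[float], metric: list[float], window: int = 4) -> dict:
    """Correlate human and metric scores within each window of consecutive human ranks."""
    # Order systems from the best to the worst human score.
    order = sorted(range(len(human)), key=lambda i: human[i], reverse=True)
    results = {}
    for start in range(len(order) - window + 1):
        ids = order[start:start + window]
        key = (start, start + window - 1)  # assumed 0-based, inclusive ranks
        results[key] = pearson([human[i] for i in ids], [metric[i] for i in ids])
    return results


# Toy system-level scores for six systems.
human = [0.9, 0.1, 0.5, 0.7, 0.3, 0.6]
metric = [0.8, 0.2, 0.4, 0.9, 0.1, 0.5]
windows = window_analysis(human, metric, window=4)
```

With six systems and a window of 4, the sketch yields three windows, each covering four systems with adjacent human ranks.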
- class gec_metrics.meta_eval.MetaEvalGJG(config: Config = None)[source]
Bases:
MetaEvalBase
- class GJGSentenceCorrOutput(corr: Corr = None)[source]
Bases:
Output
The dataclass to store the meta-evaluation results.
- Parameters:
ts (MetaEvalBase.Corr) – The correlation using TrueSkill-based human evaluation.
ew (MetaEvalBase.Corr) – The correlation using Expected Wins-based human evaluation.
- class GJGSystemCorrOutput(ew: Corr = None, ts: Corr = None)[source]
Bases:
Output
The dataclass to store the meta-evaluation results.
- Parameters:
ts (MetaEvalBase.Corr) – The correlation using TrueSkill-based human evaluation.
ew (MetaEvalBase.Corr) – The correlation using Expected Wins-based human evaluation.
- class GJGWindowAnalysisSystemCorrOutput(ew: dict = None, ts: dict = None)[source]
Bases:
Output
The dataclass to store the meta-evaluation results.
- Parameters:
ts (MetaEvalBase.Corr) – The correlation using TrueSkill-based human evaluation.
ew (MetaEvalBase.Corr) – The correlation using Expected Wins-based human evaluation.
- ew: dict = None
- ts: dict = None
- MODELS = ['AMU', 'RAC', 'CAMB', 'CUUI', 'POST', 'UFC', 'PKU', 'UMC', 'IITB', 'SJTU', 'INPUT', 'NTHU', 'IPN']
- SCORE_ID = ['ew', 'ts']
- corr_sentence(metric: MetricBase) GJGSentenceCorrOutput[source]
Compute sentence-level correlations.
- Parameters:
metric (MetricBase) – The metric to be evaluated.
- Returns:
The correlations.
- Return type:
GJGSentenceCorrOutput
- corr_system(metric: MetricBase, aggregation='default') GJGSystemCorrOutput[source]
Compute system-level correlations.
- Parameters:
metric (MetricBase) – The metric to be evaluated.
- Returns:
The correlations.
- Return type:
GJGSystemCorrOutput
- load_sentence_data() dict[str, list][source]
Load sentence-level meta-evaluation data.
- Returns:
- The meta-evaluation data containing the following keys:
“sources”: Source sentences. The shape is (num_sentences, ).
“hypotheses”: Hypothesis sentences. The shape is (num_systems, num_sentences).
“references”: Reference sentences. The shape is (num_references, num_sentences).
“models”: The model names. The index corresponds to the first dimension of “hypotheses”.
- “human_scores”: Human scores for the systems.
- “ew” is human Expected Wins scores.
The shape is (num_sentences, num_systems, num_systems).
- “ts” is human TrueSkill scores.
The shape is (num_sentences, num_systems, num_systems).
- Return type:
dict[str, list]
- load_system_data() dict[str, list][source]
Load system-level meta-evaluation data.
- Returns:
- The meta-evaluation data containing the following keys:
“sources”: Source sentences. The shape is (num_sentences, ).
“hypotheses”: Hypothesis sentences. The shape is (num_systems, num_sentences).
“references”: Reference sentences. The shape is (num_references, num_sentences).
“models”: The model names. The index corresponds to the first dimension of “hypotheses”.
- “human_scores”: Human scores for the systems. The shape is (num_systems, ).
“ew” is human Expected Wins scores.
“ts” is human TrueSkill scores.
- Return type:
dict[str, list]
- load_xml(xml_path: str, target_models: list[str]) dict[int, list[list[int]]][source]
Load an XML file.
- Parameters:
xml_path (str) – Path to an XML file.
target_models (list[str]) – Model names to be evaluated.
- Returns:
Dictionary containing sentence-level human evaluation rankings, stored per source and annotator. A ranking is accessed as dict[src_id][annotator_id][system_id] = -rank. Note that each element is the negated rank, so higher values indicate higher quality.
- Return type:
dict[int, list[list[int]]]
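Because each stored element is a negated rank, two systems can be compared directly with `>`. The sketch below uses that to compute a pairwise agreement rate between human rankings and metric scores. It is an illustration under stated assumptions: the function name and the metric_scores layout are hypothetical, every annotator is assumed to have ranked every system, and this is not necessarily how the library computes its accuracy value.

```python
from itertools import combinations


def pairwise_accuracy(
    neg_ranks: dict[int, list[list[int]]],
    metric_scores: dict[int, list[float]],
) -> float:
    """Fraction of human-ordered system pairs on which the metric agrees.

    neg_ranks[src_id][annotator_id][system_id] = -rank (higher is better).
    metric_scores[src_id][system_id] = metric score (hypothetical layout).
    """
    agree = total = 0
    for src_id, annotators in neg_ranks.items():
        scores = metric_scores[src_id]
        for judged in annotators:
            for a, b in combinations(range(len(judged)), 2):
                if judged[a] == judged[b]:
                    continue  # skip pairs the annotator ranked as tied
                total += 1
                human_prefers_a = judged[a] > judged[b]
                # A metric tie is counted here as preferring system b.
                metric_prefers_a = scores[a] > scores[b]
                agree += human_prefers_a == metric_prefers_a
    return agree / total


# Toy data: source 0 has two annotators, source 1 has one (with a tie).
neg_ranks = {0: [[-1, -2, -3], [-2, -1, -3]], 1: [[-1, -1, -2]]}
metric_scores = {0: [0.9, 0.5, 0.1], 1: [0.3, 0.6, 0.2]}
accuracy = pairwise_accuracy(neg_ranks, metric_scores)
```

In this toy data the metric disagrees with one of the eight untied human pairs.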
- window_analysis_system(metric: MetricBase, window: int = 4, aggregation='default') GJGWindowAnalysisSystemCorrOutput[source]
System-level window analysis.
- Parameters:
metric (MetricBase) – The metric to be evaluated.
window (int) – The window size.
- Returns:
- The correlations.
Contains .ew, .ts.
Each is a dictionary: {(start_rank, end_rank): Corr}.
- Return type:
GJGWindowAnalysisSystemCorrOutput
- class gec_metrics.meta_eval.MetaEvalSEEDA(config: Config = None)[source]
Bases:
MetaEvalBase
- MODELS = ['BART', 'BERT-fuse', 'GECToR-BERT', 'GECToR-ens', 'GPT-3.5', 'INPUT', 'LM-Critic', 'PIE', 'REF-F', 'REF-M', 'Riken-Tohoku', 'T5', 'TemplateGEC', 'TransGEC', 'UEDIN-MS']
- SCORE_ID = ['EW_edit', 'EW_sent', 'TS_edit', 'TS_sent']
- class SEEDASentenceCorrOutput(sent: Corr = None, edit: Corr = None)[source]
Bases:
Output
The dataclass to store sentence-level correlations.
- Parameters:
sent (MetaEvalBase.Corr) – SEEDA-S sentence-level correlation.
edit (MetaEvalBase.Corr) – SEEDA-E sentence-level correlation.
- class SEEDASystemCorrOutput(ew_edit: Corr = None, ew_sent: Corr = None, ts_edit: Corr = None, ts_sent: Corr = None)[source]
Bases:
Output
The dataclass to store system-level correlations.
- Parameters:
ew_sent (MetaEvalBase.Corr) – SEEDA-S correlation based on Expected Wins-based human evaluation.
ew_edit (MetaEvalBase.Corr) – SEEDA-E correlation based on Expected Wins-based human evaluation.
ts_sent (MetaEvalBase.Corr) – SEEDA-S correlation based on TrueSkill-based human evaluation.
ts_edit (MetaEvalBase.Corr) – SEEDA-E correlation based on TrueSkill-based human evaluation.
- class SEEDAWindowAnalysisSystemCorrOutput(ew_edit: dict = None, ew_sent: dict = None, ts_edit: dict = None, ts_sent: dict = None)[source]
Bases:
Output
The dataclass to store system-level correlations.
- Parameters:
ew_sent (MetaEvalBase.Corr) – SEEDA-S correlation based on Expected Wins-based human evaluation.
ew_edit (MetaEvalBase.Corr) – SEEDA-E correlation based on Expected Wins-based human evaluation.
ts_sent (MetaEvalBase.Corr) – SEEDA-S correlation based on TrueSkill-based human evaluation.
ts_edit (MetaEvalBase.Corr) – SEEDA-E correlation based on TrueSkill-based human evaluation.
- ew_edit: dict = None
- ew_sent: dict = None
- ts_edit: dict = None
- ts_sent: dict = None
- corr_sentence(metric: MetricBase) SEEDASentenceCorrOutput[source]
Compute sentence-level correlations.
- Parameters:
metric (MetricBase) – The metric to be evaluated.
- Returns:
The correlations.
- Return type:
SEEDASentenceCorrOutput
- corr_system(metric: MetricBase, aggregation='default') SEEDASystemCorrOutput[source]
Compute system-level correlations.
- Parameters:
metric (MetricBase) – The metric to be evaluated.
- Returns:
The correlations.
- Return type:
SEEDASystemCorrOutput
- load_sentence_data() dict[str, int][source]
Load sentence-level meta-evaluation data.
- Returns:
- The meta-evaluation data containing the following keys:
“sources”: Source sentences. The shape is (num_sentences, ).
“hypotheses”: Hypothesis sentences. The shape is (num_systems, num_sentences).
“references”: Reference sentences. The shape is (num_references, num_sentences).
“models”: The model names. The index corresponds to the first dimension of “hypotheses”.
- “human_scores”: Dictionary of human scores for the systems. The shape is (num_sentences, num_systems, num_systems).
“EW_edit”: Expected Wins scores using edit-based human evaluation.
“EW_sent”: Expected Wins scores using sentence-based human evaluation.
“TS_edit”: TrueSkill scores using edit-based human evaluation.
“TS_sent”: TrueSkill scores using sentence-based human evaluation.
- Return type:
dict[str, list]
- load_system_data() dict[str, list][source]
Load system-level meta-evaluation data.
- Returns:
- The meta-evaluation data containing the following keys:
“sources”: Source sentences. The shape is (num_sentences, ).
“hypotheses”: Hypothesis sentences. The shape is (num_systems, num_sentences).
“references”: Reference sentences. The shape is (num_references, num_sentences).
“models”: The model names. The index corresponds to the first dimension of “hypotheses”.
- “human_scores”: Dictionary of human scores. The shape is (num_systems, ).
“EW_edit”: Expected Wins scores using edit-based human evaluation.
“EW_sent”: Expected Wins scores using sentence-based human evaluation.
“TS_edit”: TrueSkill scores using edit-based human evaluation.
“TS_sent”: TrueSkill scores using sentence-based human evaluation.
- Return type:
dict[str, list]
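The layout described above can be made concrete with a small mock dictionary. The sketch below builds toy data with the documented keys, checks that the dimensions line up, and ranks the models by one human score. The helper names, model names, and all values are hypothetical placeholders, not real SEEDA data.

```python
def check_layout(data: dict) -> None:
    """Check that the documented system-level shapes are mutually consistent."""
    num_systems = len(data["models"])
    num_sentences = len(data["sources"])
    # "hypotheses" is (num_systems, num_sentences), indexed like "models".
    assert len(data["hypotheses"]) == num_systems
    assert all(len(hyps) == num_sentences for hyps in data["hypotheses"])
    # "references" is (num_references, num_sentences).
    assert all(len(refs) == num_sentences for refs in data["references"])
    # Each human score list is (num_systems, ).
    assert all(len(s) == num_systems for s in data["human_scores"].values())


def rank_models(data: dict, score_id: str) -> list[str]:
    """Return model names ordered from the highest human score to the lowest."""
    scores = data["human_scores"][score_id]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [data["models"][i] for i in order]


# Toy data with placeholder names and scores.
data = {
    "sources": ["src 0", "src 1"],
    "hypotheses": [["sys_a hyp 0", "sys_a hyp 1"], ["sys_b hyp 0", "sys_b hyp 1"]],
    "references": [["ref 0", "ref 1"]],
    "models": ["sys_a", "sys_b"],
    "human_scores": {
        "EW_edit": [0.4, 0.6],
        "EW_sent": [0.45, 0.55],
        "TS_edit": [-0.2, 0.2],
        "TS_sent": [-0.1, 0.1],
    },
}
check_layout(data)
ts_sent_ranking = rank_models(data, "TS_sent")
```

The same indexing convention ties a row of "hypotheses" to its entry in "models" and to its position in each human score list.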
- load_xml(xml_path: str, target_models: list[str]) dict[str, list[list[int]]][source]
Load an XML file.
- Parameters:
xml_path (str) – Path to an XML file.
target_models (list[str]) – Model names to be evaluated.
- Returns:
Dictionary containing sentence-level human evaluation rankings, stored per source and annotator. A ranking is accessed as dict[src_id][annotator_id][system_id] = -rank. Note that each element is the negated rank, so higher values indicate higher quality.
- Return type:
dict[int, list[list[int]]]
- window_analysis_system(metric: MetricBase, window: int = 4, aggregation='default') SEEDAWindowAnalysisSystemCorrOutput[source]
System-level window analysis.
- Parameters:
metric (MetricBase) – The metric to be evaluated.
window (int) – The window size.
- Returns:
- The correlations.
Contains .ew_edit, .ew_sent, .ts_edit, .ts_sent.
Each is a dictionary: {(start_rank, end_rank): Corr}.
- Return type:
SEEDAWindowAnalysisSystemCorrOutput