gec_metrics.metrics.gotoscorer module
- class gec_metrics.metrics.gotoscorer.GoToScorer(config: Config = None)[source]
Bases: ERRANT
- class Chunk(o_start: int = 0, o_end: int = 0, c_str: str = '', type: str = '', weight: float = 1.0, is_edited: bool = False)[source]
Bases: object
Class to represent a chunk.
- o_start (int): Start index of the span for the source words.
- o_end (int): End index of the span for the source words.
- c_str (str): Corrected version of the span.
- type (str): Error type.
- weight (float): Weight for the chunk.
- is_edited (bool): Flag indicating whether the GEC system tried to edit this span. This is used to distinguish FP from FN, and TP from TN.
- c_str: str = ''
- is_edited: bool = False
- o_end: int = 0
- o_start: int = 0
- type: str = ''
- weight: float = 1.0
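The role of is_edited can be sketched as a small decision table. The helper below is illustrative only and is not part of gec_metrics: the name classify_chunk and the is_correct argument are assumptions, with is_correct standing for whether the system output for the chunk matches the reference.

```python
def classify_chunk(is_correct: bool, is_edited: bool) -> str:
    """Map (correctness, edit attempt) to a confusion-matrix label.

    Hypothetical helper, not part of gec_metrics.
    """
    if is_correct:
        # Correct after an edit is a true positive; correct because the
        # system left an already-correct span alone is a true negative.
        return "TP" if is_edited else "TN"
    # Wrong after an edit is a false positive; wrong because the system
    # did not touch a span that needed correction is a false negative.
    return "FP" if is_edited else "FN"
```

Without the flag, an incorrect chunk could not be attributed to either a bad edit (FP) or a missed correction (FN).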
- class Config(beta: float = 0.5, language: str = 'en', weight_file: str = '', ref_id: int = 0, no_weight: bool = True)[source]
Bases: Config
GoToScorer configuration.
- weight_file (str): The JSON file containing the pre-computed weights. This can be generated by the gecmetrics-gen-gotoscorer-weight script.
- ref_id (int): The reference id.
- no_weight (bool): If True, the weight of every chunk is 1.0.
- no_weight: bool = True
- ref_id: int = 0
- weight_file: str = ''
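The beta field is not described above. Assuming it is the β of the usual F-β score computed from (possibly weighted) TP, FP and FN totals, a minimal sketch looks as follows. The function name f_beta and the handling of empty denominators are assumptions, not the library's implementation.

```python
def f_beta(tp: float, fp: float, fn: float, beta: float = 0.5) -> float:
    """F-beta from (weighted) counts. Hypothetical helper, not gec_metrics API."""
    # Assumed convention: an empty denominator yields 1.0 for P and R.
    precision = tp / (tp + fp) if tp + fp > 0 else 1.0
    recall = tp / (tp + fn) if tp + fn > 0 else 1.0
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Counts may be fractional when chunk weights are not all 1.0.
print(f_beta(tp=3.0, fp=1.0, fn=2.0))
```

With beta = 0.5, precision counts more heavily than recall.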
- annotate(r_chunk: Chunk, hyp_chunks: list[Chunk]) → tuple[bool][source]
- Annotate whether the reference chunk is correct and whether the system attempted to edit it.
- generate_chunks(edits: list[Edit], tokens: list[str]) → list[Chunk][source]
- Generate a chunk sequence given an edit sequence.
The tokens covered by each edit become a single chunk.
Each token outside the edits becomes its own chunk.
- Dummy chunks are inserted between all tokens to account for possible insertions.
- Parameters:
edits (list[errant.edit.Edit]) – The edit sequence that can be obtained via errant.annotate()
tokens (list[str]) – The source tokens.
- Returns:
The chunk sequence.
- Return type:
list[Chunk]
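The chunking rule above can be sketched without ERRANT. This is a simplified illustration, not the library's code: SimpleEdit and SimpleChunk are hypothetical stand-ins for errant.edit.Edit and Chunk, and the sketch assumes that edits do not overlap and that a dummy chunk is also placed before the first and after the last token, which is consistent with the 13 chunks in the example output at the end of this section.

```python
from dataclasses import dataclass


@dataclass
class SimpleEdit:  # hypothetical stand-in for errant.edit.Edit
    o_start: int
    o_end: int
    c_str: str


@dataclass
class SimpleChunk:  # hypothetical stand-in for Chunk
    o_start: int
    o_end: int
    c_str: str


def sketch_generate_chunks(edits, tokens):
    # Insertions have an empty source span (o_start == o_end).
    inserts = {e.o_start: e for e in edits if e.o_start == e.o_end}
    spans = {e.o_start: e for e in edits if e.o_start < e.o_end}
    chunks = []
    i = 0
    while True:
        # Dummy chunk for the insertion slot before token i.
        ins = inserts.get(i)
        chunks.append(SimpleChunk(i, i, ins.c_str if ins else ""))
        if i == len(tokens):
            break
        span = spans.get(i)
        if span is not None:
            # All tokens covered by one edit collapse into a single chunk.
            chunks.append(SimpleChunk(span.o_start, span.o_end, span.c_str))
            i = span.o_end
        else:
            # An unedited token is its own chunk.
            chunks.append(SimpleChunk(i, i + 1, tokens[i]))
            i += 1
    return chunks


tokens = "This sentences contain gramamtical error .".split(" ")
edits = [
    SimpleEdit(1, 2, "sentence"),
    SimpleEdit(2, 3, "contains"),
    SimpleEdit(3, 3, "a"),
    SimpleEdit(3, 4, "grammatical"),
]
chunks = sketch_generate_chunks(edits, tokens)
print([c.c_str for c in chunks])
```

For a source of n tokens with only single-token edits, this yields 2n + 1 chunks: n token chunks and n + 1 insertion slots.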
- score_base(sources: list[str], hypotheses: list[str], references: list[list[str]]) → list[list[dict[str, Score]]][source]
- Calculate scores while retaining sentence and reference boundaries.
- The results can be aggregated according to the purpose, e.g., at sentence-level or corpus-level.
- Parameters:
sources (list[str]) – Source sentences.
hypotheses (list[str]) – Corrected sentences.
references (list[list[str]]) – Reference sentences. The shape is (num_refs, num_sents).
- Returns:
The verbose scores. The list shape is (num_sents, num_refs). The dict contains error type-wise scores.
- Return type:
list[list[dict[str, "Score"]]]
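As an illustration of corpus-level aggregation over the (num_sents, num_refs) structure, the sketch below sums per-error-type counts across sentences for one reference. The fields of the real Score class are not shown in this section, so Counts (with tp, fp, fn) is a hypothetical stand-in, and corpus_counts is not a gec_metrics function.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Counts:  # hypothetical stand-in for Score
    tp: float = 0.0
    fp: float = 0.0
    fn: float = 0.0


def corpus_counts(verbose, ref_id=0):
    """Sum error type-wise counts over all sentences for one reference."""
    totals = defaultdict(Counts)
    for per_ref in verbose:  # outer list: one entry per sentence
        for etype, c in per_ref[ref_id].items():  # inner list: one dict per reference
            totals[etype].tp += c.tp
            totals[etype].fp += c.fp
            totals[etype].fn += c.fn
    return dict(totals)
```

Sentence-level aggregation would instead sum over the error types of each per_ref[ref_id] dict and keep one total per sentence.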
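- visualize_chunk(chunks: list[Chunk], tokens: str) → None[source]
Print a chunk sequence as an aligned table showing the source tokens, the corrected strings, and the chunk weights.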
- Parameters:
chunks (list[Chunk]) – The chunk sequence.
tokens (list[str]) – The source tokens.
Returns: None
from gec_metrics.metrics.gotoscorer import GoToScorer
scorer = GoToScorer(GoToScorer.Config(no_weight=True))
src = 'This sentences contain gramamtical error .'
trg = 'This sentence contains a grammatical error .'
edits = scorer.edit_extraction(src, trg)
chunks = scorer.generate_chunks(edits, src.split(' '))
scorer.visualize_chunk(chunks, src.split(' '))
# Output:
# | |This| |sentences| |contain | |gramamtical| |error| | . | |
# | |This| |sentence | |contains| a |grammatical| |error| | . | |
# |1.0|1.0 |1.0| 1.0 |1.0| 1.0 |1.0| 1.0 |1.0| 1.0 |1.0|1.0|1.0|