Evaluation

Participating systems will be evaluated on the official test dataset using automated metrics that measure both the syntactic correctness and the semantic fidelity of the generated Structured Text and Rust programs.

Task A: Natural Language to Structured Text

Task A evaluates the ability of a system to generate IEC 61131-3 Structured Text (ST) programs from natural language descriptions.

Metrics

Metric | Description
Cosine Similarity (CS) | Semantic similarity between the generated and reference code, computed as the cosine similarity of their embedding vectors.
Structural Match Score (SMS) | Similarity of conditions, assignments, timers, loops, and control-flow structures between the generated and reference ST code.
Program Dependence Graph Similarity (PDG) | Similarity of data and control dependencies between the generated and reference programs, computed on their Program Dependence Graphs.
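
As an illustration of the Cosine Similarity metric, the sketch below computes the cosine of the angle between two embedding vectors. It is a minimal example, not the official scorer: the embedding model used by the organizers is not specified here, so the vectors `generated` and `reference` are hypothetical values, and returning 0.0 for a zero vector is a convention chosen for this sketch.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine similarity of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption for this sketch: a zero vector has no direction, score it 0.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical embeddings of a generated and a reference program.
generated = [0.2, 0.1, 0.7, 0.0]
reference = [0.1, 0.1, 0.8, 0.1]
score = cosine_similarity(generated, reference)
```

Scores near 1 indicate embeddings that point in nearly the same direction; orthogonal embeddings score 0.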

Task B: Natural Language to Rust Program

Task B evaluates the ability of a system to generate Rust programs from natural language descriptions.

Metrics

Metric | Description
Cosine Similarity (CS) | Semantic similarity between the generated and reference code, computed as the cosine similarity of their embedding vectors.
Structural Match Score (SMS) | Similarity of conditions, assignments, timers, loops, and control-flow structures between the generated and reference Rust code.
Program Dependence Graph Similarity (PDG) | Similarity of data and control dependencies between the generated and reference programs, computed on their Program Dependence Graphs.
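
To give a feel for what a dependence-graph comparison can look like, the sketch below scores two graphs by the Jaccard overlap of their labeled edges. This is an assumption made for illustration only: how the official PDG metric builds and matches graphs is not stated here. The function name `pdg_edge_jaccard`, the node names, and the `(source, target, kind)` edge encoding are all hypothetical, and the sketch assumes node labels have already been normalized so that equivalent statements in both programs share a label.

```python
def pdg_edge_jaccard(generated_edges, reference_edges):
    """Jaccard overlap of two sets of (source, target, kind) dependence edges."""
    gen = set(generated_edges)
    ref = set(reference_edges)
    if not gen and not ref:
        # Assumption for this sketch: two empty graphs count as identical.
        return 1.0
    return len(gen & ref) / len(gen | ref)


# Hypothetical edges; "data" and "control" mark the dependence kind.
reference_pdg = {
    ("read_sensor", "check_limit", "data"),
    ("check_limit", "set_alarm", "control"),
}
generated_pdg = {
    ("read_sensor", "check_limit", "data"),
    ("check_limit", "log_value", "control"),
}
similarity = pdg_edge_jaccard(generated_pdg, reference_pdg)
```

In this example the graphs share one edge out of three distinct edges, so the score is 1/3. A production metric would also need a way to align nodes between programs, which this sketch leaves out.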

Final Ranking

The final ranking of participating teams will be based on their results in both subtasks, Task A and Task B.