Spec-to-RTL and Code Completion metrics
Given a benchmark with $n$ problems, the TuRTLe framework generates $m=5$ candidate solutions for each. Each generation is then processed sequentially through the evaluation pipeline: Syntax Correctness (STX), Functional Correctness (FNC), Synthesizability (SYN), and Post-Synthesis Quality (PSQ). Each stage only processes results that passed the previous one, and failures in earlier stages are reported as automatic fails in the later ones, creating a cascade where STX $\ge$ FNC $\ge$ SYN. PSQ, however, can exceed SYN, because we account for the (unlikely) possibility that a generated design has better PPA than the human reference.
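As a rough illustration of this cascade (not the framework's actual code), the sketch below evaluates a candidate stage by stage and stops at the first failure, so later stages automatically count as fails; the stage checkers are passed in as hypothetical callables.

```python
from typing import Callable

def evaluate_candidate(
    design: str,
    testbench: str,
    check_stx: Callable[[str, str], bool],
    check_fnc: Callable[[str, str], bool],
    check_syn: Callable[[str], bool],
    measure_ppa: Callable[[str], dict],
) -> dict:
    """Cascaded evaluation: each stage runs only if all previous stages passed."""
    result: dict = {"STX": False, "FNC": False, "SYN": False, "PSQ": None}

    result["STX"] = check_stx(design, testbench)   # compile check (Icarus Verilog)
    if not result["STX"]:
        return result                              # failure cascades to later stages

    result["FNC"] = check_fnc(design, testbench)   # run the compiled simulation
    if not result["FNC"]:
        return result

    result["SYN"] = check_syn(design)              # elaboration/synthesis check
    if not result["SYN"]:
        return result

    result["PSQ"] = measure_ppa(design)            # power / performance / area report
    return result
```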
Tools for the Evaluation
STX is evaluated by compiling a design along with its testbench using Icarus Verilog and checking for errors. If the compiler reports no errors, FNC is evaluated by running the simulation executable produced by the compiler and checking whether the testbench passes. Functionally correct designs are then tested for SYN by elaborating the design with OpenLANE (Yosys).
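For illustration only, the STX and FNC checks could be driven from Python roughly as below. The `iverilog` and `vvp` invocations are standard Icarus Verilog usage; the assumption that a failing testbench prints a message containing `FAIL` is ours, not necessarily the convention used by every benchmark.

```python
import subprocess

def check_stx_and_fnc(design="design.v", testbench="tb.v", sim_bin="sim.out"):
    """Compile with Icarus Verilog (STX); if that succeeds, run the simulation (FNC)."""
    compile_run = subprocess.run(
        ["iverilog", "-o", sim_bin, design, testbench],
        capture_output=True, text=True,
    )
    if compile_run.returncode != 0:
        return False, False  # compile errors: STX fails, so FNC fails automatically

    sim_run = subprocess.run(["vvp", sim_bin], capture_output=True, text=True)
    # Assumed convention: the testbench prints a line containing "FAIL" on a mismatch.
    fnc = sim_run.returncode == 0 and "FAIL" not in sim_run.stdout.upper()
    return True, fnc
```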
For designs that pass SYN, PSQ is measured from the PPA report, which is obtained by synthesizing the code into a netlist with OpenLANE. Designs are synthesized using the SKY130A open-source PDK with a $10\,\text{ns}$ delay constraint. We extract area and power directly from the PPA report and evaluate performance as the maximum delay, defined as the difference between the clock period and the worst slack reported by static timing analysis. In this way, all three PPA metrics are represented as positive numbers, where 0 is the minimum possible value and the maximum is unbounded.
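As a small worked example, the performance figure follows directly from the timing report: with the $10\,\text{ns}$ constraint above, the maximum delay is the clock period minus the worst slack (the function name below is illustrative).

```python
CLOCK_PERIOD_NS = 10.0  # delay constraint used during synthesis

def max_delay_ns(worst_slack_ns: float) -> float:
    """Performance metric: maximum delay = clock period - worst slack from STA."""
    return CLOCK_PERIOD_NS - worst_slack_ns

print(max_delay_ns(3.2))   # 6.8  -> the design meets timing with margin
print(max_delay_ns(-0.5))  # 10.5 -> negative slack pushes the delay past the constraint
```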
Compute Metrics
Since STX, FNC, and SYN are binary evaluations (pass or fail), their scores are computed using Pass@1. Recall that failures in earlier stages are reported as automatic fails in the later ones.
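With a fixed number of candidates per problem, Pass@1 reduces to the average pass rate over all generations, as in this minimal sketch (not the framework's exact code):

```python
import numpy as np

def pass_at_1(passes: np.ndarray) -> float:
    """passes: boolean array of shape (n_problems, m_candidates), True only if the
    candidate cleared this stage and every previous one (failures cascade)."""
    return float(passes.mean())

# Example with n=2 problems and m=5 candidates each:
stx = np.array([[1, 1, 0, 1, 1],
                [1, 0, 0, 1, 1]], dtype=bool)
print(pass_at_1(stx))  # 0.7
```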
On the other hand, PPA is represented as real values that must be evaluated against a golden solution (crafted by humans in our case). Therefore, we need to introduce a new formulation that takes this into account.
PPA-Score
Let $p_{i,j}$ denote the PPA metric (power, performance, or area) obtained from the LLM for candidate $j$ of problem $i$. We compute a score $\hat{p}_{i,j}$ for each generation according to the following steps:
- Each $p_{i,j}$ is compared against the corresponding PPA value $g_i$ of the golden solution. For that, instead of aggregating $p_{i,j}$ we compute $p_{i,j}/g_i\in(0,+\infty)$.
- For generations that do not pass STX, FNC and SYN evaluations, $p_{i,j}$ cannot be computed. As a result, we set a failure value of $2\cdot g_i$ (e.g. producing a design two times bigger than the human reference in the case of the area metric). This approach also requires us to clip results that pass the previous evaluations but perform worse than this threshold.
- We flip the result so that the metric behaves like the other stages (STX, FNC, and SYN), where higher is better.
The resulting score $\hat{p}_{i,j}$ of generation $j$ for problem $i$ is defined as

$$\hat{p}_{i,j} = \begin{cases} 2 - \min\left(\dfrac{p_{i,j}}{g_i},\, 2\right) & \text{if generation } j \text{ of problem } i \text{ passes STX, FNC, and SYN,} \\ 0 & \text{otherwise,} \end{cases}$$

which has the following interpretation:
- $\hat{p}_{i,j} = 0$: the generation did not pass the previous stages, or it requires at least twice the area, power, or delay (performance) of the human reference.
- $\hat{p}_{i,j} = 1$: the design has the same area, power, or performance as the human reference.
- $\hat{p}_{i,j} = 2$: only attainable by a chip that occupies no space, executes in no time, or consumes no energy (perfect, but impossible to achieve).
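A direct translation of this definition into code could look like the following sketch (function and argument names are illustrative):

```python
from typing import Optional

def ppa_generation_score(p_ij: Optional[float], g_i: float) -> float:
    """Score one PPA metric of candidate j for problem i against the golden value g_i.

    p_ij is None when the candidate failed STX, FNC, or SYN and the metric could not
    be measured; such candidates receive the failure value 2 * g_i.
    """
    ratio = 2.0 if p_ij is None else min(p_ij / g_i, 2.0)  # clip at the failure threshold
    return 2.0 - ratio                                      # flip so that higher is better

print(ppa_generation_score(None, 100.0))   # 0.0 -> failed an earlier stage
print(ppa_generation_score(100.0, 100.0))  # 1.0 -> matches the human reference
print(ppa_generation_score(50.0, 100.0))   # 1.5 -> half the area/power/delay of the reference
```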
The final formula considering all generations of an LLM for a given benchmark is simply the average of these scores:

$$\text{PPA}_x = \frac{1}{n \cdot m} \sum_{i=1}^{n} \sum_{j=1}^{m} \hat{p}_{i,j},$$

where $x \in \{\text{Power}, \text{Performance}, \text{Area}\}$ denotes the PPA metric from which the $\hat{p}_{i,j}$ values are computed. Finally, we define the PPA-score as the average of these three metrics.
Note that, under the assumption that all generations are synthesizable and produce the same PPA as the golden solution (i.e., the human baseline), the PPA-score would be 100%. The theoretical upper bound is 200%, but it is impossible to achieve in practice.
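Putting the pieces together, a benchmark-level PPA-score could be computed as in the sketch below (illustrative names; the score is reported as a percentage):

```python
import numpy as np

def ppa_score(scores: dict) -> float:
    """scores maps 'power', 'performance', and 'area' to (n, m) arrays of per-generation
    values in [0, 2]; the PPA-score averages over generations and over the three metrics."""
    per_metric = [scores[x].mean() for x in ("power", "performance", "area")]
    return 100.0 * float(np.mean(per_metric))
```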
Aggregated Scores
In the case of MC and S2R tasks, we report all numbers from the four evaluation stages (STX, FNC, SYN, and PSQ). However, to determine the best model for these tasks, it is necessary to aggregate the results.
For a single benchmark
Due to the cascade effect previously explained, the final stage of the evaluation (i.e., the PPA-Score) compresses the information from all previous stages. Therefore, we interpret it as the model's final score for the benchmark.
Across Multiple Benchmarks of the same task
When aggregating results across benchmarks for the same task, we cannot take a straightforward average, because some benchmarks are much larger than others. Instead, we weight each benchmark's contribution by its size, so that each individual sample contributes equally to the overall result.
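Concretely, weighting by benchmark size amounts to a sample-weighted mean, as in this illustrative sketch:

```python
def aggregate_across_benchmarks(scores, sizes):
    """Size-weighted average: each benchmark contributes in proportion to its number of
    problems, so every individual sample carries the same weight."""
    total = sum(sizes)
    return sum(score * n for score, n in zip(scores, sizes)) / total

# Example: a small benchmark (50 problems) and a large one (400 problems).
print(aggregate_across_benchmarks([80.0, 60.0], [50, 400]))  # ~62.2
```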