Simulation Performance

Higher Level of Abstraction

It's almost universally true that the higher the level of abstraction of a system, the faster it will simulate. The creation of a system goes through multiple modeling levels, from specification all the way down to the transistor level. A model at a higher abstraction level emphasizes the system's operating principles rather than implementation details, thus offering more freedom for implementation variants. As the level gets lower, operating concepts gradually transform into concrete implementation details. The highest abstraction level is the specification, where only the I/O behavior is specified and there is no information about the internal system. The next level of abstraction is the algorithmic or behavioral level, where the system's functions to meet the specification are described. This description gives an architectural view of the system as opposed to an implementational view. In Verilog, a behavioral model usually contains constructs such as if-then-else, case, loops, functions, and tasks. Further down is the RTL, in which the states of the system have been identified: the registers are defined, and the functionality affecting state transitions remains at the behavioral level. The next level, the gate netlist, is a complete implementation of the system specification. A design process that proceeds from a high abstraction level to lower levels is called a top-down process; the other way around is called a bottom-up process.

Here we'll look at various levels of abstraction through a design example of a simplified traffic controller in which cars cannot make turns but can only go straight across the intersection. A segment of a behavioral model follows, where EW and NS denote the east-west and north-south traffic lights, respectively:

initial begin // initial state of the system
   EW = GREEN;
   NS = RED;
end

always @(posedge clock)
   if ((EW == GREEN) && (NS == RED)) EW <= YELLOW;
   else if ((EW == YELLOW) && (NS == RED)) EW <= RED;
   else if ((EW == RED) && (NS == YELLOW)) EW <= GREEN;

always @(posedge clock)
   if ((NS == GREEN) && (EW == RED)) NS <= YELLOW;
   else if ((NS == YELLOW) && (EW == RED)) NS <= RED;
   else if ((NS == RED) && (EW == YELLOW)) NS <= GREEN;

In this model, the order of light changes is described using behavioral constructs. There are no details about how such changes are to be effected; these are refined at the next level (RTL), shown in the following code. An RTL model identifies the states of the system and assigns registers or FFs to represent them. In the following code segment, FF array F1, of 2 bits, represents the state of the east-west light; F2, also of 2 bits, represents north-south. In Verilog, array instantiation creates an array of FFs instead of having the designer instantiate one FF at a time. In this example, each array instantiation creates two FFs, as indicated by the range [1:0]:

initial begin // initial state of the system
   EW = GREEN;
   NS = RED;
end

FlipFlop F1[1:0] (.Q(EW),.D(next_EW),.clock(clock));
FlipFlop F2[1:0] (.Q(NS),.D(next_NS),.clock(clock));

always @(EW or NS) // next_state_function, combinational
begin
   if ((EW == GREEN) && (NS == RED)) next_EW = YELLOW;
   else if ((EW == YELLOW) && (NS == RED)) next_EW = RED;
   else if ((EW == RED) && (NS == YELLOW)) next_EW = GREEN;
   else next_EW = EW;

   if ((NS == GREEN) && (EW == RED)) next_NS = YELLOW;
   else if ((NS == YELLOW) && (EW == RED)) next_NS = RED;
   else if ((NS == RED) && (EW == YELLOW)) next_NS = GREEN;
   else next_NS = NS;
end

The always block is very much the same as that in the behavioral model, except that it is now completely combinational: the if-then-else statements cover all cases, and all variables read (EW and NS) appear in the sensitivity list. The next level is the gate netlist, which implements the combinational always block with gates. The following shows the vast complexity of the gate-level model and provides the intuition behind using models at higher levels of abstraction. Assuming GREEN, YELLOW, and RED are encoded as 2'b00, 2'b01, and 2'b10, the combinational always block translates into the following AND-OR expressions, where ~ denotes complement, juxtaposition denotes AND, and + denotes OR:

next_NS[0] = ~NS[1]~NS[0]EW[1]~EW[0] + NS[1]NS[0] + NS[0]~EW[1] + NS[0]EW[0];
next_NS[1] = NS[0]EW[1]~EW[0] + NS[1]NS[0] + NS[1]EW[1] + NS[1]~EW[0];
next_EW[1] = EW[0]NS[1]~NS[0] + EW[1]EW[0] + EW[1]NS[1] + EW[1]~NS[0];
next_EW[0] = ~EW[1]~EW[0]NS[1]~NS[0] + EW[1]EW[0] + EW[0]~NS[1] + EW[0]NS[0];

If only inverters and 2-input AND and OR gates are used, the gate-level model consists of 34 AND/OR gate instantiations, plus four inverters, for the previous four lines of code, and a couple more lines for the FFs and the reset circuit. Hopefully, with this demonstration, you understand why gate-level models should be avoided when simulation performance matters.
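For concreteness, here is a sketch of what the netlist fragment for just next_EW[1] might look like; the gate instance and intermediate net names are illustrative:

// netlist fragment for next_EW[1] (instance and net names illustrative)
not g0 (ns0_n, NS[0]);        // ~NS[0]
and g1 (t0, EW[1], EW[0]);    // EW[1]EW[0]
and g2 (t1, EW[1], NS[1]);    // EW[1]NS[1]
and g3 (t2, EW[1], ns0_n);    // EW[1]~NS[0]
and g4 (t3, EW[0], NS[1]);
and g5 (t4, t3, ns0_n);       // EW[0]NS[1]~NS[0]
or g6 (t5, t0, t1);
or g7 (t6, t5, t2);
or g8 (next_EW[1], t6, t4);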

Simulator Recognizable Components

Many simulators attempt to recognize certain standard components, such as FFs, latches, and memory arrays, to make use of internal optimizations for performance. The recognizable styles depend on the simulator; therefore, consult the simulator manual and make an effort to conform to its coding styles. In the event that you must deviate from the recommended styles, or no such recommendation is provided, code in a style as simple and as close to the "common" style as possible, which generally means avoiding asynchronous signals, delays, and complex branching constructs. For FFs and latches, here are some sample "common" styles:

// positive-edge-triggered D-flip-flop
always @(posedge clock)
   q <= d;

// positive-edge-triggered DFF with synchronous active-high reset
always @(posedge clock)
   if (reset)
      q <= 1'b0;
   else
      q <= d;

// DFF with asynchronous active high reset
always @(posedge clock or posedge reset)
   if(reset)
      q <= 1'b0;
   else
      q <= d;

// active high latch
always @(clock or data)
   if (clock)
      q <= data;

// active high latch with asynchronous reset
always @(clock or data or reset)
   if (reset)
      q <= 1'b0;
   else if (clock)
      q <= data;

Memory arrays are another tricky component: They don't have generally recognizable forms. Furthermore, there are synchronous and asynchronous memory arrays; the former simulate faster and hence are preferred. Synchronous memory must be strobed by a clock to produce output after data, address, and read/write enable are present, whereas asynchronous memory produces output immediately once data, address, and read/write enable are present. Figure 9 presents a block diagram of the two types of memory.

Figure 9. Synchronous and asynchronous memory
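A rough sketch of the two modeling styles follows; the signal names and sizes are illustrative, not a prescribed form:

// synchronous RAM model: access is strobed by the clock
reg [7:0] sync_mem [0:255];
reg [7:0] sync_out;
always @(posedge clk) begin
   if (we)
      sync_mem[addr] <= din;
   sync_out <= sync_mem[addr]; // read data appears after the clock edge
end

// asynchronous RAM model: output follows the inputs immediately
reg [7:0] async_mem [0:255];
wire [7:0] async_out = async_mem[addr];
always @(addr or din or we)
   if (we)
      async_mem[addr] = din;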


To aid a simulator, a user directive in the form of a stylized comment, such as

//my_simulator memory: clock is clk, data is in1, address is addr1

can be attached to help the simulator determine the clock, data, and address. The directive is usually simulator specific.

Coding finite-state machines requires special attention, and the preferred style is dictated by the simulator. As a rule of thumb, separate as much combinational logic from the states as possible. For example, never encompass the next-state transition function and the sequential elements in one large sequential always block. Code the next-state transition function as a combinational always block using a case statement and mnemonics for the states. An example of a next-state function is

//combinational next-state function. Recommended
always @(presentState) begin
   case (presentState)
      IDLE: nextState = WAIT;
      REQUEST: nextState = GRANT;
      ...
      default: $display ("error");
   endcase
end

Avoid using Verilog @ inside an always block, as in the following example. When @ is encountered, the simulation is blocked until the event of the @ occurs; in other words, variable values are held while waiting for the event. Therefore, an @ construct has a state embedded in it. The style using @ mimics traversing a state diagram in which a transition is made when a clock transition occurs. Most simulators do not recognize this style as a finite-state machine:

//bad coding style
always @(clock) begin
   nextState = ...;
   @(clock) // on a clock edge, go to another state
      if(presentState ==..) nextState = ...;
   @(clock) // on a clock edge, go to another state
      if(presentState ==..) nextState = ...;
end

This style can always be recoded into a tandem of FFs for the states and a combinational block for the next-state function. The idea is to create a state for every @, so that every time the clock ticks, the state machine transitions to the state corresponding to the following @. Once the states are mapped to @s, the next-state function is written to produce the required outputs and next state.
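A minimal sketch of such a recoding follows, assuming the block above traverses three states (the top of the block plus the two @s); the state names and 2-bit encoding are illustrative:

reg [1:0] phase;              // one state per @ in the original block
reg [1:0] next_phase;
parameter P0 = 2'd0, P1 = 2'd1, P2 = 2'd2;

always @(posedge clock)       // sequential part: just the state FFs
   phase <= next_phase;

always @(phase) begin         // combinational next-state function
   case (phase)
      P0: next_phase = P1;    // first @ consumed one clock edge
      P1: next_phase = P2;    // second @ consumed another
      P2: next_phase = P0;    // wrap around to the top of the block
      default: next_phase = P0;
   endcase
end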

Vector versus Scalar

Whenever possible, use the entire bus as an operand instead of just its bits, because the simulator needs more internal operations to access bits or ranges of bits than the entire bus. Moreover, when a statement involves bits of the same bus, the simulator may need to visit each bit during the execution of the statement, whereas for a statement on the whole bus, the simulator just needs to access the bus once, even if there are multiple operations on the whole bus. Finally, converting bit operations into bus operations often simplifies the code. To convert bit operations to bus operations, the concatenation, reduction, and mask operators play an important role.

When bits are assigned to different values, the operation can be made into a vectored operation by concatenating the scalar operations:

scalar operation:
   assign bus[15:0] = a & b;
   assign bus[31:16] = x | y;

vectored operation:
   assign bus = {x | y, a & b}; // {} is the concatenation operator

scalar operation:
   assign out = (bus[1] & bus[2]) | (bus[3] ^ bus[0]);

vectored operation:
   assign out = (&(4'b1001 | bus)) | (^(4'b1001 & bus));

The first example simply concatenates the two scalar operations and assigns the result to the bus. The second example uses masks to select the bits and then applies reduction operators to the selected bits. For instance, the masking operation 4'b1001 | bus does a bitwise OR and gives the vector (1,bus[2],bus[1],1). Then the reduction & operator ANDs the vector bit by bit to give bus[2]&bus[1]. Similarly, 4'b1001 & bus produces the vector (bus[3],0,0,bus[0]). Then the ^ reduction operator XORs the vector bit by bit to give bus[3]^bus[0]. Finally, the intermediate results are ORed together.
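To convince yourself of the equivalence, a throwaway self-checking testbench along these lines (the module and net names are ours, not from the text) can sweep all input values:

module tb;
   reg [3:0] bus;
   wire out_scalar = (bus[1] & bus[2]) | (bus[3] ^ bus[0]);
   wire out_vector = (&(4'b1001 | bus)) | (^(4'b1001 & bus));
   integer i;
   initial
      for (i = 0; i < 16; i = i + 1) begin
         bus = i;                       // try all 16 bus values
         #1 if (out_scalar !== out_vector)
            $display("mismatch at bus=%b", bus);
      end
endmodule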

The previous conversion technique applies to any expression. A general algorithm follows.

Algorithm for Converting Scalar Operations to Vectored Operations

Input: a set of assignments to bits or ranges of buses

Output: a set of assignments with whole-bus operands

1. Repeat steps 2 and 3 for each of the assignments.

2. Express the right-hand side as a sum of products.

3. For each product term, if there are complemented variables, use a mask operation with bitwise XOR to invert the bits/ranges. Bit 1 corresponds to an inverted bit. Then select the bits/ranges of the bus in the term by bitwise ORing with a mask. Bit 0 corresponds to a selected bit. Next, reduction AND the result to produce a partial term. Finally, AND the result with the remaining literals in the product term.

4. Use the concatenation operator to make the set of assignments into a bus assignment.


To illustrate the algorithm, consider the following set of assignments, where bus A has 2 bits and bus B has 6 bits:

assign A[0] = ~B[3] & B[5] & x | B[3] & y;
assign A[1] = B[4] | B[0];

The right-hand sides are already in sum-of-products form, so step 2 is skipped (it is always possible to transform an expression into a sum of products). In step 3, for the first product term, the partial term made up of only bus bits is ~B[3] & B[5]. Use the mask (6'b001000 ^ B) to invert B[3], giving (B[5],B[4],~B[3],B[2],B[1],B[0]); then mask with a bitwise OR to select B[5] and the inverted B[3] (in other words, 6'b010111 | (6'b001000 ^ B), giving (B[5],1,~B[3],1,1,1)). Lastly, reduction AND the result to produce the partial term. Putting it all together, we have &(6'b010111 | (6'b001000 ^ B)). Finally, AND it with x. The remaining terms are transformed similarly. After step 3, we have

assign A[0] = (&(6'b010111 | (6'b001000 ^ B))) & x |
              (&(6'b110111 | B)) & y;
assign A[1] = (&(6'b101111 | B)) | (&(6'b111110 | B));

In step 4, concatenate the two right-hand sides into a single bus assignment, placing the A[1] expression first because it is the more significant bit:

assign A = {(&(6'b101111 | B)) | (&(6'b111110 | B)),
            (&(6'b010111 | (6'b001000 ^ B))) & x | (&(6'b110111 | B)) & y};

It's not necessary to first put all terms in sum-of-products form, which is usually messy, if applications of mask, reduction, and bitwise operators can produce the required expressions.

This transformation is especially useful for error-correcting code (ECC) such as cyclic redundancy code (CRC), in which operations on individual bits can be neatly transformed into bus operations. For instance, the ECC code

C = A[0]^A[1]^A[2]^A[9]^A[10]^A[15]

can be recast into bus form:

C = ^(A & 16'b1000011000000111);

A variation of the previous conversion can prove to be useful when applied to operations on bits inside a loop, as illustrated in the following example,

genvar i;
for (i = 0; i <= 31; i = i + 1)
   assign A[i] = B[i] ^ C[i];

can be recoded as

assign A = B ^ C;

Related to this vectorization is the instantiation of an array of gates, which often occurs in memory design. Instead of instantiating one gate at a time, use array instantiation so that the simulator recognizes the inherent bus structure. An example of array instantiation is

FlipFlop FFs [31:0] (.Q(q_bus),.D(d_bus),.clock(clock));

where the range [31:0] generates 32 FFs, with the inputs connected to bus d_bus, the outputs to bus q_bus, and the clocks to clock.

Minimization of the Interface to Other Simulation Systems

The other systems can be C code simulations or I/Os to the host. In Verilog, it is common practice to cosimulate a Verilog model with a C model through the programming language interface (PLI). The PLI is a communication channel through which values are passed from the Verilog side to the C side and vice versa, and it is a major performance bottleneck. A strategy to reduce PLI impact is to communicate the minimum amount of information, accumulating data and calling the PLI only when absolutely necessary.
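As a sketch of this batching strategy, suppose a user-defined PLI task $send_to_c (a hypothetical name) ships data to the C side, which reads the buffer through the PLI access routines; the clock, data_word, and buffer size here are also illustrative:

// accumulate 16 samples locally, then make a single PLI call
reg [31:0] buffer [0:15];
integer count;
initial count = 0;

always @(posedge clock) begin
   buffer[count] = data_word;  // no PLI traffic on this path
   count = count + 1;
   if (count == 16) begin
      $send_to_c;              // one PLI crossing for 16 samples
      count = 0;
   end
end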

Another common cause of slow performance is displaying or dumping too much data to the host during runtime, which can easily slow a simulation by a factor of ten or more. Thus, all display and dump code should be written so that it can be turned on and off, and it should be turned on only during debug mode.
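A common way to make dumping switchable is to guard it with the standard $test$plusargs system function; the plusarg, file name, and top-level module name below are illustrative:

// waveform dumping occurs only when +dump appears on the command line
initial
   if ($test$plusargs("dump")) begin
      $dumpfile("waves.vcd");
      $dumpvars(0, testbench);
   end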

Low-Level/Component-Level Optimization

In Verilog there are data types and constructs that most simulators do not accelerate and that hence should be avoided. These are time, real, named event, trireg, integer array, hierarchical path reference, UDP, strength, transistor constructs (for example, nmos), force/release, assign/deassign, fork/join, wait, disable, and repeat. In general, anything related to time, low-level modeling constructs, and test bench constructs should be flagged.
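As one illustration of removing such a construct, a wait-based handshake can often be recoded into an edge-triggered form that simulators handle more efficiently; the ready/data/out names are our own:

// discouraged: wait suspends the process and is rarely accelerated
always begin
   wait (ready);
   out = data;
   wait (!ready);
end

// preferred: the same capture, recoded as an edge-triggered block
always @(posedge ready)
   out = data;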

Remove redundant code (for example, modules, functions, or tasks that are never instantiated or called), dangling nets, and unreferenced variables. Not all simulators remove redundant code, and for the ones that do, a longer compile time results.

Rules to Check

  1. Encourage the use of a high-level abstraction model. Issue warnings when transistor-level and strength constructs are detected. If a design has a simulation model, ensure that the model is invoked.

  2. Issue warnings if FFs or latches are not instantiated from the library.

  3. Sequential components in libraries are recognized as such by the target simulators.

  4. Synchronous RAM is preferred over asynchronous RAM.

  5. Issue warnings on operations on bus bits if the number of bit operands exceeds a limit.

  6. Issue warnings if the number of system task or user task calls exceeds a limit.

  7. Low-level constructs and nonsynthesizable constructs are discouraged, such as time, real, named event, trireg, UDP, strength, transistor, force/release, assign/deassign, fork/join, wait, disable, and repeat.


Code Profiling

A code profiler is a program that attaches to a simulation and collects data about the distribution of simulation time in the circuit. From the profiler's report, the user can determine the bottlenecks in a simulation. The profiler calculates the cumulative time spent on module definitions and instances (meaning the total time a particular module, or a particular instance of the module, is simulated). It also computes the time spent on blocks (such as an always block), functions/tasks, timing checks, and others.

