Isolation of Underlying Causes






In this section we will look at the basic principles and practices in debugging hardware designs. First we will discuss using expected results as a guide to trace errors, and then we will study how erroneous signals are traced forward and backward to locate their root cause. There are many forking points during tracing, and we will introduce a branching diagram to keep track of the paths. As sequential elements are traced, the time frame changes. We will consider the movement of the time frame for latches, FFs, and memory. Next we will look at the four basic views of a design and their interaction in debugging, along with some common features of a typical debugger.

Reference Value, Propagation, and Bifurcation

Debugging starts with an observed error, which can come from a $display output, a failed assertion, or a message from a monitor, and ends when the root cause of the error is determined. Before the user can debug, he must know what is correct (called the reference value) and what is not. Correctness is defined with respect to the specifications, not with respect to the implementation. For example, if a 2-input AND gate is mistakenly replaced by an OR gate and the inputs to the OR gate are 1 and 0, then the person debugging must know that the expected output should be 0 instead of 1. That is, the output value of 1 is incorrect even though the OR gate correctly produces 1 from the inputs. The distinction between whether the implementation is correct and whether the implementation responds correctly to its inputs must stay fresh in the mind of the person who is debugging. When one gets tired after hours of tracing, it is very easy to confuse these two concepts and get lost. The reference behavior is the guide for a designer to trace the problem. Without knowledge of the reference behavior, one will not be able to trace at all. The questions to ask in debugging are the following: What is the reference value for this node at this time? Does the node value deviate from the reference value? If the values are the same, tracing should stop at the node. If they are different, one follows the drivers or the loads of the node to continue the investigation. A node value that is the same as the reference value for the node is called expected.

The root cause is the function that maps expected inputs to unexpected outputs. Therefore, a module that accepts expected inputs but produces unexpected outputs must have an error in it. Using the previous example of a misplaced OR gate, the OR gate takes in the expected inputs, 0 and 1, and produces 1, which is unexpected. Therefore, the OR gate is the root cause. On the other hand, a module accepting unexpected inputs and producing unexpected outputs may or may not have an error. Furthermore, an error can have multiple root causes.
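The rule in this paragraph can be written down as a small decision function. The following Python sketch is only an illustration: the function name `classify` and its return strings are hypothetical, and the reference value is assumed to come from the specification (here, an AND gate).

```python
def classify(inputs_ok: bool, outputs_ok: bool) -> str:
    """Classify a traced block by whether its inputs/outputs match the reference."""
    if outputs_ok:
        return "stop"            # outputs are expected: tracing stops here
    if inputs_ok:
        return "root cause"      # expected inputs, unexpected outputs
    return "keep tracing"        # unexpected in and out: may or may not be faulty


# The misplaced OR gate: the specification calls for an AND gate.
a, b = 1, 0
actual = a | b        # value produced by the faulty implementation
reference = a & b     # value required by the specification
verdict = classify(inputs_ok=True, outputs_ok=(actual == reference))
```

Here `verdict` is "root cause" because the inputs 1 and 0 are expected while the output 1 differs from the reference value 0.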

As debugging progresses, the reference behavior can take on different but equivalent forms. For example, if a reference value on a bus is 1010, then as we trace backward to the drivers of the bus, the reference behavior becomes the following: Exactly one bus driver is enabled and the input to the driver is 1010. Similarly, if we trace forward and see that this reference value is propagated to a decoder, then the reference value for the output bits of the decoder becomes the following: Only the tenth bit is active (1010 is decimal 10). Furthermore, a reference value can bifurcate and become uncertain, creating more possible routes to trace. A case in point is that the reference value of the output of a 2-input AND gate is 0, but the actual value is 1. Moving toward the inputs, the reference behavior bifurcates into three cases: either or both inputs are 0. To investigate further, you must assume one case of reference behavior to proceed. If you end up at a gate or module with outputs that are all expected, the assumption is wrong. Then the next case of reference behavior is pursued. This phenomenon of uncertainty and bifurcation is the major cause of debugging complexity. Therefore, a key to effective debugging is to compute the reference values correctly during tracing.
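To make the bifurcation concrete, the sketch below enumerates the input reference cases for a gate, given the reference value of its output. It is a minimal Python illustration; `input_reference_cases` and `and_gate` are hypothetical names, not part of any tool.

```python
from itertools import product


def input_reference_cases(gate, out_ref, n_inputs=2):
    """All input combinations for which `gate` yields the reference output."""
    return [bits for bits in product((0, 1), repeat=n_inputs)
            if gate(*bits) == out_ref]


def and_gate(a, b):
    return a & b


# Reference output 0 for a 2-input AND gate splits into three cases to try.
cases = input_reference_cases(and_gate, 0)

# The bus value 1010 selects decoder output number 10.
active_bit = int("1010", 2)
```

Each tuple in `cases` is one assumption to pursue. A reference output of 1 would leave only the single case (1, 1), so no bifurcation occurs.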

Forward and Backward Debugging

There are two methods of debugging: forward tracing and backward tracing. Forward tracing starts at a time before the error is activated. Note that the effect of an activated error may not be seen immediately, but only after cycles of operation. Therefore, a critical step in forward tracing is finding the latest starting time before the error is activated, and there is no general algorithm to determine such times. Assuming that such a time is given, we must assume all node values are correct, or expected, and move along the flow of signals, forward in time, to get to the point when the error is activated. During this search, the first statement, gate, or block producing unexpected outputs contains a root cause. Besides finding a good starting time, another difficulty in forward tracing is knowing where in the circuit to start so that the trace eventually leads to the error site. Figure shows forward tracing paths for node B. The shaded OR gate is the root of the problem. Of the two forward tracing paths, one leads to the error site and the other does not. When we come to a node with multiple fanouts, we must decide which paths to pursue, and there are exponentially many such paths. The ability to locate the starting point and make wise decisions at multiple-fanout forks can only be acquired through understanding the design and the nature of the bug.

Figure. Forward tracing paths for node B


Backward tracing proceeds against the flow of signals and backward in time to find the first statement, gate, or block that yields unexpected outputs from expected inputs. Unlike the uncertainties faced in forward tracing, the starting site and time are the location and time the error was observed, and one moves backward toward the fanins. With this method, the error is an unexpected behavior. The person debugging must know the reference behavior and be able to translate the reference behavior as he proceeds backward. The major difficulty in backward tracing, shared with forward tracing, is that when a gate, statement, or a block has multiple fanins, a decision must be made regarding which fanin to follow next, and there are exponentially many possible paths. When a multiple-fanin gate is encountered, the path or paths to pursue are the ones that show unexpected values. However, it is often possible that several paths show unexpected values. Figure shows three backward tracing paths from node X.

Figure. Backward tracing paths from node X
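Backward tracing through combinational logic can be sketched as a search that prunes fanins with expected values and follows the unexpected ones. The Python below is a minimal illustration under stated assumptions: the netlist, node names, and the `reference` table (which stands in for the designer's knowledge of the specification) are all hypothetical, and time framing across sequential elements is ignored.

```python
def back_trace(node, drivers, actual, reference):
    """Return nodes whose fanins are all expected but whose own value is not."""
    roots, stack, seen = [], [node], set()
    while stack:
        n = stack.pop()
        if n in seen or actual[n] == reference[n]:
            continue                      # already visited, or expected: prune
        seen.add(n)
        bad = [d for d in drivers.get(n, []) if actual[d] != reference[d]]
        if bad:
            stack.extend(bad)             # follow every unexpected fanin
        else:
            roots.append(n)               # expected inputs, unexpected output
    return roots


# x = g1 AND g2; g1 should be AND(a, b) but was built as OR(a, b); g2 = NOT c.
drivers = {"x": ["g1", "g2"], "g1": ["a", "b"], "g2": ["c"]}
actual = {"a": 1, "b": 0, "c": 0, "g1": 1, "g2": 1, "x": 1}
reference = {"a": 1, "b": 0, "c": 0, "g1": 0, "g2": 1, "x": 0}
roots = back_trace("x", drivers, actual, reference)
```

Starting from the observed error at x, the search prunes g2 (expected) and stops at g1. A primary input with an unexpected value would also be reported, which points at the stimulus rather than the design.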


Tracing Diagram

With either tracing method, the fanin and fanout points are branching points that require making decisions. If a selection does not turn up root causes, we need to backtrack to the decision points to select other paths. To keep track of what has been visited and what has not, a tracing diagram comes in handy. The branching points in a tracing diagram systematically enumerate all possible selections and guide the selection decision in backtracking. Tracing diagrams are usually generated by a software program instead of being created by hand.

A node in a tracing diagram is a primary input, a port of a gate or module, or a user-defined node. A user-defined node is a net that terminates tracing, e.g. a known good net. An arrow from node A to node B means that there is a path from A to B. The path is a forward path in forward tracing, and is a backward path in backward tracing. A reduced tracing diagram contains only nodes with more than one outgoing arrow, except for primary inputs and user-defined nodes, which are always kept.
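One possible way to derive a reduced tracing diagram from a connectivity graph is sketched below in Python. It is an assumption-laden illustration rather than the algorithm of any particular tool: `reduced_diagram`, the graph, and the node names are hypothetical, and chains that end at a node with no successors are simply dropped.

```python
def reduced_diagram(fanout, start, terminals):
    """Collapse single-successor nodes; keep start, terminals, and branch points."""
    def keep(n):
        return n == start or n in terminals or len(fanout.get(n, [])) > 1

    edges, todo, done = {}, [start], set()
    while todo:
        n = todo.pop()
        if n in done:
            continue
        done.add(n)
        edges[n] = []
        if n in terminals:
            continue                       # user-defined nodes end the trace
        for s in fanout.get(n, []):
            hops = set()
            while not keep(s) and fanout.get(s) and s not in hops:
                hops.add(s)
                s = fanout[s][0]           # skip a node with a single arrow
            if keep(s):
                edges[n].append(s)
                todo.append(s)
    return edges


# Hypothetical forward-tracing graph: B forks, n3 forks, n5 loops back to n3.
fanout = {"B": ["n1", "n2"], "n1": ["n3"], "n3": ["n4", "n5"],
          "n4": [], "n5": ["n3"], "n2": []}
diagram = reduced_diagram(fanout, start="B", terminals={"n2"})
```

The result keeps only B, n2, and n3. The arrow from n3 to itself records the loop through n5, which is the kind of loop that may be traversed several times during tracing.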

Figure shows two reduced tracing diagrams: one for forward tracing from primary input B and the other for backward tracing from net X. The convention used here is that the input pins of a gate are numbered from top to bottom starting from 1. Outputs are similarly numbered. A node labeled as G.i represents the i-th input of gate G in forward tracing, and the i-th output in backward tracing. The rectangular nodes are user-defined nodes that, in this case, are the fault site. Fault sites are not known in advance in practice; they are shown here for illustration. The shaded nodes are primary inputs.

Figure. Tracing diagrams for forward tracing of primary input B (A) and for backward tracing of wire X (B)


When obtaining a reduced forward tracing diagram, gates with only one fanout are not represented in the tracing diagram because these gates have only one outgoing arrow. Similarly, nodes having only one fanin in a reduced backward tracing diagram are not shown. Forward tracing starts from primary input B. At the outset there are two fanouts: g4.2 and g5.2. Thus, node B in part A of the figure has two branches: one leading to node g4.2 and the other leading to node g5.2. The node inside the box, g5.2, is the root cause of the problem, and we assume that the debugging process ends when that node is reached.

If there are loops, a loop may be traversed several times. Each time a sequential element is crossed, the time frame may change. For instance, the loop in part A of the figure, consisting of g7.1 and g6.2, can be traversed multiple times, and each traversal advances the time by one cycle because the loop contains FF F3. Similarly, the loop in part B of the figure, consisting of g7.1, F2.1, g2.1, F3.1, and g3.1, contains two FFs, and therefore time retracts by two cycles whenever the loop is traversed once.

Time Framing

In tracing, when a combinational gate is traversed, either forward or backward, the current time of the simulation does not change. When a sequential element is traversed, the time of the simulation changes depending on whether it is forward or backward tracing. For example, when forward traversing an FF (such as from data to output), the time of the simulation advances by one clock cycle because the value at the output happens one cycle after the data input. On the contrary, in backward traversing (from output to data), the time of the simulation retracts by one cycle. Consider forward tracing from node n1 of Figure, and suppose the current time of the simulation is N. When we arrive at node n2, time advances to N+1, because FF1 has been traversed. When we continue to node n3, the simulation time stays at N+1 because the NOR gate is a combinational gate. In general, to compute the amount of time movement when traversing from node A to node B across a sequential circuit, we determine the time for data to propagate from A to B. Time moves forward in forward tracing and backward in backward tracing.
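For a single-clock circuit, the bookkeeping described above reduces to counting the FFs on the traced path. The Python sketch below illustrates this; `time_after_path` and the element names are hypothetical, and time is measured in clock cycles.

```python
def time_after_path(start_time, path, sequential, direction):
    """Cycle count after walking `path`; each FF crossed moves time one cycle."""
    step = 1 if direction == "forward" else -1
    crossed = sum(1 for element in path if element in sequential)
    return start_time + step * crossed


N = 10                                   # current simulation time, in cycles
sequential = {"FF1"}
forward_time = time_after_path(N, ["FF1", "NOR"], sequential, "forward")
backward_time = time_after_path(N, ["NOR", "FF1"], sequential, "backward")
```

Crossing FF1 and then the NOR gate moves the time from N to N+1 in forward tracing, and the same path traversed backward moves it to N-1. Combinational gates contribute nothing.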

In a circuit with multiple clocks, time advance is with respect to the clock of the sequential element that has just been traversed. Consider the multiple-clock domain circuit in Figure. Suppose we are looking at node D and we want to determine the time at node A, which affected the current value at node D. Assume the current time at node D is the last rising edge of clock clk2 at time 19. Moving to the input, the time at which the value at node C might have changed can be anywhere between 9 and 14, during which the latch was transparent. To determine exactly when, we need to examine the drivers to the latch. Going over the AND gate does not change time. Node B could change at a falling transition of clock clk2. Therefore, node C might change at time 6. Moving backward further, node A might change only at a rising transition of clock clk1. Therefore, the value of node A at time 1 affects the current value at node D. If the current value of D is erroneous, the value of A at time 1 is a candidate to be examined.

Figure 8. Time frame determination in traversing a multiple-clock domain circuit
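With multiple clocks, each backward step across an edge-triggered element amounts to finding the most recent active edge of that element's clock before the current time. The Python sketch below shows only this lookup. The edge lists are hypothetical and are not taken from the figure, and latches, whose transparent window yields a range of times rather than a single time, are not modeled.

```python
from bisect import bisect_left


def last_edge_before(edges, t):
    """Latest edge time strictly earlier than t, or None if there is none."""
    i = bisect_left(edges, t)
    return edges[i - 1] if i else None


def back_trace_times(t, edge_lists):
    """Step backward across elements, one sorted list of active edges each."""
    times = []
    for edges in edge_lists:
        t = last_edge_before(edges, t)
        if t is None:
            break                          # no earlier edge: tracing ends
        times.append(t)
    return times


# Two hypothetical clocks: crossing an element on the first, then the second.
times = back_trace_times(19, [[4, 9, 14], [1, 11]])
```

Starting at time 19, the first element's last active edge is at 14, and the second element's last active edge before 14 is at 11, so the value to examine upstream is the one at time 11.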


The same principle can be applied to circuits in RTL. Consider the following sequential element:

DFF g1 (.clk(clk1), .Q(A), .D(D), ...);

always @(clk2) begin
   data = A;
   @ (posedge clk2) begin
     state <= data << guard;
     out <= state ^ mask;
   end
end

We want to determine the time of variable D that affected the current value of out. The current time is 17, using the waveforms in Figure. To trace backward, we need to determine the last time clock clk2 had a positive transition, which, based on Figure, last changed at time 11.5. The assignment to data was executed when clk2 changed at time 6. Hence, the value assigned to data from A is the value of A at time 6. Because A is the output of the DFF g1, which is clocked by clk1, the time of D that affected A at time 6 is 1. Therefore, the time of D that affected out is 1. Any error in variable D at time 1 will be observed in variable out at time 17.

Load, Driver, and Cone Tracing

To understand the cause of a symptom at a node, the logic or circuitry potentially contributing to the node needs to be traced. Three common items are traced in practice: load, driver, and cone. Load tracing finds all fanouts to the node and is often used in forward debugging. Finding all fanouts of a node, which can be difficult in a large design in which the fanouts are spread over several files and different directories, is done with a tool that constructs connectivity of the design. Similarly, such a tool is used to find all drivers, or fanins, of a node. Tracing fanins or fanouts transitively (finding fanins of fanins) is called fanin or fanout cone tracing. A fanin cone to a node is the combinational subcircuit that ends at the node and starts at outputs of sequential elements or PIs. Similarly, a fanout cone is the combinational subcircuit that starts at the node and ends at inputs of sequential elements or POs.
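Fanin cone tracing is a transitive driver search that stops at sequential elements and primary inputs. A minimal Python sketch follows; the netlist and the name `fanin_cone` are hypothetical, and each node is identified with the gate that drives it.

```python
def fanin_cone(node, drivers, sequential, primary_inputs):
    """Combinational gates feeding `node`, plus the cone's boundary inputs."""
    gates, inputs, stack = set(), set(), [node]
    while stack:
        n = stack.pop()
        for d in drivers.get(n, []):
            if d in sequential or d in primary_inputs:
                inputs.add(d)              # cone starts at FF outputs and PIs
            elif d not in gates:
                gates.add(d)
                stack.append(d)
    return gates, inputs


drivers = {"x": ["g1", "g2"], "g1": ["a", "F1"], "g2": ["g1", "b"], "F1": ["g3"]}
gates, inputs = fanin_cone("x", drivers, sequential={"F1"},
                           primary_inputs={"a", "b"})
```

The search never looks past F1, so g3 stays outside the cone. A fanout cone can be computed the same way by swapping the driver table for a load table.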

Let's consider an example of debugging that requires driver and cone tracing. Consider the circuit in Figure in which a data bit at node a has an unexpectedly unknown or indeterminate value x at the current time 5. Assume that all FFs and latches are clocked by the same clock clk, with the waveform shown. For simplicity, let's assume that all clock pins are operational and free of bugs. To debug, we trace all drivers to node a. Because node a is the output of a transparent low latch, the time of the latch's input that affected node a at time 5 is between 4 and 5. Therefore, as we backtrack across the latch, the current time frame changes from 5 to 4. The driver to the latch is an XOR gate with an output value that is unknown. The XOR gate has two fanins, both of which have unknown values. Selecting the lower fanin, we arrive at an OR gate. One of its fanins, node f, has an unknown value. Node f is driven by an FF that had an unknown input value. Because this FF is positive-edge triggered, crossing it backward moves the current time frame from 4 to 3. The driver to the FF is a tristate buffer that is enabled at time 3; thus, the unknown value comes from the bus. The bus indeed had an unknown value. Now we find all drivers to the bus and determine which ones are active at time 3. There are two active drivers to the bus and they are driving opposite values because their inputs are opposite. Further investigation into why both drivers are turned on at the same time reveals the root cause: one of the buffers to the bus drivers should be an inverter.

Figure 9. Debugging an unknown data bit via driver and cone tracing


In a large circuit, instead of tracing drivers one gate level at a time, the entire cone of logic can be carved out for examination. Cone tracing is not limited to combinational cones; it can be a cone of logic spanning a couple of cycles. Three fanin cones for nodes x and y are shown in Figure. One-cycle cones are just combinational cones. Multiple-cycle cones are derived by unrolling the combinational logic multiple cycles and removing the sequential elements between cycles. For example, the two-cycle cone consists of all gates that can be reached from node x or y without crossing more than one FF. Similarly, the three-cycle cones include all gates reached without crossing more than two FFs. The primary inputs to a cone are the original primary inputs and outputs of the FFs. The cone's primary inputs are marked with the cycle numbers. For example, P3 means the value of P three cycles backward from the current cycle. The current cycle is 1. Note the fast growth of cone size as the number of cycles increases; thus, in practice, only a small number of cycles are expanded for logic cones.

Figure 10. Unrolling to obtain a multiple-cycle logic cone. (A) Original circuit (B) Combinational cone (C) Two-cycle cone (D) Three-cycle cone
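The definition of a multiple-cycle cone (all gates reachable without crossing more than a given number of FFs) can be sketched as a bounded backward search. The Python below is an illustration only: the chain-shaped netlist is hypothetical and far simpler than the circuit in the figure, and a gate that appears in several cycles of the unrolled cone is listed once rather than once per cycle.

```python
def multi_cycle_cone(node, drivers, sequential, cycles):
    """Gates reachable backward from `node` crossing at most cycles-1 FFs."""
    best = {}                               # node -> fewest FFs crossed so far
    stack = [(node, 0)]
    while stack:
        n, crossed = stack.pop()
        for d in drivers.get(n, []):
            c = crossed + (1 if d in sequential else 0)
            if c > cycles - 1 or best.get(d, cycles) <= c:
                continue                    # too many FFs, or already reached
            best[d] = c
            stack.append((d, c))
    # Report gates only: nodes that have drivers and are not FFs.
    return {g for g in best if g in drivers and g not in sequential}


drivers = {"x": ["g1"], "g1": ["F1", "a"], "F1": ["g2"],
           "g2": ["F2", "b"], "F2": ["g3"], "g3": ["c"]}
ffs = {"F1", "F2"}
one_cycle = multi_cycle_cone("x", drivers, ffs, 1)
two_cycle = multi_cycle_cone("x", drivers, ffs, 2)
three_cycle = multi_cycle_cone("x", drivers, ffs, 3)
```

Each additional cycle admits the logic behind one more layer of FFs, which is why cone size can grow quickly in a real design.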


Memory and Array Tracing

Whenever an FF or a latch is crossed, time progresses or regresses. When memory or an array is crossed, the number of cycles that the time changes is a function of what data are being traced and can be deduced as follows. Suppose that we find out that the output of memory is wrong and we want to back trace to the root cause. Assuming the current time is T, we first determine whether the address and control signals to the memory (such as read, write, and CS) at time T are correct. If any of these signals is not correct, tracing continues from that line and time does not change. However, if the address and control signals are correct, then the wrong data were caused by either a bug in the memory model itself or by writing wrong data to that address. We search for the most recent time at which that address was written. Let this be time W. If the data at time W are not identical to the output at time T, the memory model has a problem. If the data are the same, the input data are wrong, tracing follows the input data, and the time frame becomes W. That is, the amount of the backward time lapse is T - W. To illustrate this algorithm with the memory and waveforms in Figure, let's assume that the output of memory at address 8'h2c is expected to be 32'hc7f3 at time 1031, but is 32'h0a71 instead. We back trace across memory. From the waveforms, the control signals, CS, W/R, and address are correct. That is, CS is active, W/R is READ, and the address is 8'h2c. So we search for the last time the memory was written at address 8'h2c, and the time was 976. Because the input data, the value of in_data, are identical to the output data at time 1031, the memory model is fine and the error tracing continues from time 976 to determine why the input data had the wrong value of 32'h0a71. The algorithm for back tracing memory is shown here:

Memory Back Tracing Algorithm

  1. Assume the current time is T. Examine all inputs except data for correctness. If any of these is incorrect, back trace from that input and time remains T.

  2. If the inputs are correct at time T, search for the last time the address was written and mark this time W.

  3. If input data at time W are not the same as the output at time T, the memory model has a fault; otherwise, back trace from the data input and the time changes to W.
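The three steps above can be sketched in Python as follows. The function name, the write-log format, and the returned labels are hypothetical. The "no prior write" outcome is an added assumption for a case the algorithm does not discuss, and the write at time 500 is invented for the example.

```python
def back_trace_memory(t, address, writes, observed, controls_ok=True):
    """Apply the three steps; `writes` is a list of (time, address, data)."""
    if not controls_ok:
        return ("trace address/control", t)        # step 1: time stays T
    earlier = [(w, d) for (w, a, d) in writes if a == address and w < t]
    if not earlier:
        return ("no prior write", t)               # assumption: not in the text
    w, data = max(earlier)                         # step 2: most recent write
    if data != observed:
        return ("memory model fault", t)           # step 3, first outcome
    return ("trace data input", w)                 # step 3: time becomes W


# Values from the example: read of 8'h2c at time 1031 returns 32'h0a71.
writes = [(500, 0x2C, 0xC7F3), (976, 0x2C, 0x0A71), (990, 0x10, 0x1234)]
action, new_time = back_trace_memory(1031, 0x2C, writes, observed=0x0A71)
lapse = 1031 - new_time                            # backward lapse T - W
```

For these values the result is to continue tracing from the data input at time 976, a backward lapse of 55 time units.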


Figure 11. Illustration of back tracing across memory


In summary, forward tracing is just simulation. Backward tracing is searching for the current or last input combination or condition that produced the current output. If it is found, time moves to that time and back tracing continues from there.

Zero Time Loop Constructs

Loop constructs occur most often in test benches, as used when iterating array elements, and usually do not contain delays. Hence they are executed in zero simulation time. That is, variables of the loop are computed at multiple times at the same, current simulation time as the loop is iterated. An example of such a loop is as follows:

always @(posedge clock) begin
   if (check_array == 1'b1)
      for (i=0; i<= `ARRAY_SIZE; i=i+1) begin
        var = array[i];
        if( var == pattern ) found = 1;
       ...
      end // end of for loop

   if (found) ...
end // end of always block

The loop is computed with no simulation time advancement. Variable var is assigned array[i] once per iteration, `ARRAY_SIZE + 1 times in all (for i from 0 through `ARRAY_SIZE), at the current simulation time.

Multiple writes and reads to the same variable at the same simulation time cause difficulties in debugging, because when the simulation is paused, the variable value displayed is that of the last write. For example, the value of var displayed when the simulation is paused is array[`ARRAY_SIZE]. If a bug is caused during the loop computation, seeing only the last value of the variable is not enough. To circumvent this problem, variables inside a zero time loop need to be saved for each loop iteration so that their entire history can be displayed at the end of the simulation time. For the previous example, the intraloop values of var can be pushed to a circular queue every time var is written:

always @(posedge clock) begin
   if (check_array == 1'b1)
      for (i=0; i<= `ARRAY_SIZE; i=i+1) begin
        var = array[i];
        queue_push(var); // save each intraloop value of var
        if( var == pattern ) found = 1;
       ...
      end // end of for loop

   if (found) ...
end // end of always block

Some debuggers show all intraloop values of loop variables (such as var[1], ..., var[`ARRAY_SIZE]) when the variables are displayed.
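The difference between the last value and the full intraloop history can be modeled in software. The short Python sketch below is only an analogy for the circular queue used above: the array contents and the queue depth of 8 are hypothetical.

```python
from collections import deque

array = [3, 7, 9, 4]
history = deque(maxlen=8)      # circular queue: oldest entries drop when full
for value in array:            # all iterations share one simulation time
    var = value
    history.append(var)        # plays the role of queue_push(var)
```

After the loop, `var` holds only the last value written, 4, whereas `history` still holds every value var took during the loop.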

The Four Basic Views of Design

In RTL debugging, four views of a circuit are essential (RTL, schematic, finite-state machine, and waveform), although other views exist, such as layout and DFT. A circuit and waveform viewer displays these four views and allows the user to switch among views. An RTL view shows the design code. A schematic view is a circuit diagram representation of the design code. The viewer creates the schematic by mapping simple code constructs in the design to a library of common gates. For example, a conditional operation x ? y : z is mapped to a multiplexor. Other simple constructs are AND, OR, multiplexor, bus, and tristate buffers. Finite-state machines and memory, if they conform to a set of coding guidelines, will also be recognized. The mapper attempts to recognize as many common constructs as possible. If recognized, the constructs are represented with graphical symbols in the schematic view. The constructs not recognized are "black boxed." A schematic view preserves the module boundaries of the design so that a module instantiation is represented as a box labeled with the module instance name. To go inside the module, simply click on the box. The finite-state machine view shows state diagrams of finite-state machines. To recognize finite-state machines, many viewers assume certain finite-state machine coding styles. Finally, a waveform viewer displays waveforms of nodes. The waveforms are created from dumped data files in either a standard format, such as VCD, or a vendor-specific format, such as fsdb. Figure shows an example of the four views. In the schematic view, the reduction XOR ^ is not recognized as a common construct and hence is black boxed (the shaded box labeled ^ OP). All other constructs are recognized and are represented by standard circuit symbols. The coding style of this example conforms to the viewer's finite-state machine coding style; hence, it is recognized as a finite-state machine and its state diagram is shown in the state machine view. The waveform view displays signals or variables specified by the user.

Figure 12. RTL, schematic, finite-state machine, and waveform views of a design


For most viewers, the different views of a circuit are coordinated by drags and drops. For example, to switch from RTL to schematic view, click on a variable or signal and drag it to the schematic view. The schematic view will display the scope (such as a module) in which the variable or signal resides. To see the waveform of a signal, simply drag and drop the signal to the waveform viewer. The different views offer their unique benefits. The RTL view shows the exact functionality of the design unit, the schematic view best displays connectivity, the state diagram view offers a functional and graphical description of the RTL code, and the waveform view reveals the temporal behavior of signals.

Typical Debugger Functionality

Let's discuss some typical functionality in a debugger. The most basic functionality is tracing of drivers and loads of a node in the RTL and schematic views. With the schematic view, a command to trace a signal highlights all drivers or loads, depending on whether the driver or load option is set. Such a command can be a simple click on the signal. A continual command on a highlighted driver or load effects transitive tracing. With RTL view, a list of drivers or loads is shown when a net is traced. It is also possible to select a cone tracing option and have the debugger show a fanin or fanout cone of a variable or a net.

Tracing must be coupled with simulation values to be useful. When the user comes to a decision point (a multiple-fanin point for backward tracing or a multiple-fanout point for forward tracing), she needs to know the fanin or fanout that has the wrong value to continue. A convenient feature is annotation of simulation values in the RTL and schematic views (that is, signal values at the current time are appended to signals or variables). An example is shown in Figure, in which the annotated values are in bold. At the time shown, clock clk is in a falling transition. If the current time is changed, the values will change to reflect the simulation results. Based on the annotated values, the branches with expected values are pruned. Tracing follows the paths with unexpected values. To assist in keeping track of a tracing, branching points can be bookmarked and later revisited. An application of bookmarking is that after the current selection at a branching point turns out to be a dead end, the saved branching point is restored so that another path can be pursued.

Figure 13. Annotation of simulation values to RTL and schematic views


With waveform view, waveforms can be searched for values or transitions. For instance, a waveform on a bus can be searched to find the time the bus takes on a specific value. Furthermore, two sets of waveforms can be compared, and the differences are displayed at the times they differ.
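Both operations are simple once a waveform is stored as a list of value changes. The Python sketch below assumes a hypothetical representation, a time-sorted list of (time, value) pairs, and hypothetical function names. It does not reflect the file format or API of any specific waveform tool.

```python
def find_value(wave, value):
    """Times at which the signal changes to `value`."""
    return [t for t, v in wave if v == value]


def value_at(wave, t):
    """Signal value at time t, or None before the first recorded change."""
    current = None
    for time, v in wave:
        if time > t:
            break
        current = v
    return current


def diff_times(wave_a, wave_b):
    """Change times at which the two waveforms hold different values."""
    times = sorted({t for t, _ in wave_a} | {t for t, _ in wave_b})
    return [t for t in times if value_at(wave_a, t) != value_at(wave_b, t)]


bus_a = [(0, 0), (10, 5), (20, 7)]
bus_b = [(0, 0), (10, 5), (25, 7)]
hits = find_value(bus_a, 5)
mismatches = diff_times(bus_a, bus_b)
```

Here the bus takes the value 5 at time 10, and the two waveforms disagree starting at time 20, when only the first has switched to 7. They agree again from time 25.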

Finally, a debug session can be saved to a file and can be restored later. This is useful when a debug session needs to be shown to another person at a remote location. Then the saved session file is sent to that person.

Example 1.

The following example illustrates a typical debugging process using the backward tracing method. The first sign of a bug is an error message from a simulation run. Suppose error message "Missing packet" appears at time 3000. The following sequence of actions will then take place:

  1. Determine from the nature of the error whether the error can be reproduced with only a portion of the circuit and with a smaller test case.

  2. If the test case simulation is long, rerun the simulation from the last check point to determine whether the same error message occurs. If it does, debug from this check point; otherwise, try the previous check point until one is found that produces the error.

  3. Determine the neighborhood around the site of the error for the purpose of signal dumping. The site of the error is the statement that causes the printing of the error message, as shown in the following code:

       reg [3:1] index;
       ...
       always @(clock) begin
         // clock: 0->1
         for (i=0; i<=128; i=i+1) begin
             item = in_queue[i];
             // item: 32'h0ffa3, i: 128
             if (item == packet) packet_received = 1'b1;
             // item: 32'h0ffa3, packet: 32'h9ba0
         end
         if (packet_received!=1'b1) $display ("Missing packet"); // error site
         // packet_received: 1'b0
       end // end of always
       ...
       always @(clk) begin
         // clk: 0->1
         if (ready_xmit) begin
            // ready_xmit: 1'b0
            in_queue [index] = data_out;
            // index: 5, data_out: 32'hffff
         end
         ...
         index = base + inc ;
         // index: 5, base: 2, inc: 3
       end
    

    The smaller the scope, the faster the simulation will be. However, too small a scope runs the risk of having to rerun the simulation for a larger scope if a traced signal goes out of bounds.

  4. Simulate the reduced circuit and test case and dump out the data from the selected scope.

  5. Load the dumped data into a debugger and annotate the signal values into the RTL or schematic views. The annotated values appear beneath the corresponding statements in the previous code.

  6. Search for the site of the error in the RTL or schematic views and back trace the drivers that trigger the error.

  7. Based on the statement at the error site, the error message was triggered by signal packet_received equal to 1'b0. Tracing drivers to packet_received, we find the for loop just above the error site to be a driver. Because of the zero simulation time of the loop, only the last values of the loop variables are shown. Signal packet_received is driven by the condition if(item == packet), where item is an item of in_queue and packet is the expected packet. If this condition fails, the array in_queue does not contain the data equal to packet. To find out why in_queue does not have the data, we trace the drivers of in_queue and locate the sole driver to be in_queue [index] = data_out. Then we search in the waveform of data_out for 32'h9ba0 and determine whether the expected packet was ever sent (in other words, assigned to in_queue). A search of data_out for 32'h9ba0 turns up the data at time 1200, as shown in Figure. The index value at 1200 was 6, meaning in_queue[6] had 32'h9ba0, but somehow it was lost. The error must have occurred between time 1200 and time 3000. So we need to search for any write to in_queue at index 6. We find that at time 2200, location 6 of in_queue was overwritten by another value. Suppose it is correct to have a write to in_queue at time 2200, but the location should not be 6. We then trace the drivers of index, which is index = base+inc, as shown in the previous code. Moving time to 2200, annotated values reveal that both variables base and inc take on a value of 7, but index takes on 6. Checking the declaration of index, we discover that index was declared to be reg [3:1], a 3-bit variable that should have been reg [3:0]. Thus the MSB was truncated, producing 6 from 14. This is the root cause of the error.
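The arithmetic behind the root cause in step 7 can be checked with a few lines of Python. The helper `assign_truncated` is a hypothetical stand-in for assigning a wider result to a narrower register, which keeps only the low-order bits.

```python
def assign_truncated(value, width):
    """Keep the low `width` bits, as an assignment to a narrower reg does."""
    return value & ((1 << width) - 1)


base, inc = 7, 7                                  # annotated values at time 2200
index_3bit = assign_truncated(base + inc, 3)      # reg [3:1] index (3 bits)
index_4bit = assign_truncated(base + inc, 4)      # reg [3:0] index (4 bits)
```

The sum 14 is 1110 in binary. Three bits keep 110, which is 6 and matches the observed index, whereas four bits keep the full value 14.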

Figure 14. Array waveforms used in back tracing an error


