Clock Rate Pipelining

This example shows how to apply clock-rate pipelining to optimize slow paths in your design and thereby reduce latency, increase clock frequency, and decrease area usage. For more information on how to use clock-rate pipelining, refer to the clock-rate pipelining documentation.

Introduction

Algorithmic design with Simulink may introduce many slow-rate datapaths in the generated HDL design. These slow paths arise either from operations that run at a slower Simulink sample time or from an algorithmic data-rate that is slower than the HDL clock rate.

Consider the Field-Oriented control example. It describes a motor-control design to be mapped to an FPGA. The input samples in this design arrive every 20 $\mu s$, that is, at 50 kHz. In a closed control loop, it is essential that the controller's latency is within the desired response time. In this model, there is a delay on the output port resulting in a latency of 20 $\mu s$.

To meet design constraints like timing and area, we may want to apply several optimizations like input/output pipelining, distributed pipelining, streaming and/or sharing. Further, non-trivial math functions like sqrt or divide may have to be implemented as multi-cycle pipelined operations. Pipelines introduced by any of the above features and optimizations are applied at the same rate at which the signal path operates, which is 20 $\mu s$. Thus, each additional pipeline stage adds a full 20 $\mu s$ sample of latency and may violate the closed loop latency budget.

However, the FPGA can run this controller at a clock rate on the order of MHz, which means that the introduced pipelines can operate at the MHz rate, thereby minimizing the impact on latency. Clock-rate pipelining is a technique to leverage this rate differential, pipeline the controller and thereby improve its area and timing characteristics on the FPGA. This example walks through the steps for taking this design and incrementally applying timing and area optimizations using clock-rate pipelining.
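The latency argument above can be made concrete with a small calculation. This is only an illustrative sketch: the 20 $\mu s$ data-rate comes from the example, the 25 ns clock period is the 40 MHz target discussed later in this example, and the number of pipeline stages (numStages) is a hypothetical value chosen for illustration.

```matlab
% Compare the latency cost of the same pipeline depth at two rates.
dataPeriod  = 20e-6;  % data-rate sample time from the example, in seconds
clockPeriod = 25e-9;  % clock period for a 40 MHz FPGA clock, in seconds
numStages   = 5;      % hypothetical number of added pipeline registers

latencyAtDataRate  = numStages * dataPeriod;   % each register costs one data sample
latencyAtClockRate = numStages * clockPeriod;  % each register costs one clock cycle

fprintf('Data-rate pipelining:  %g us\n', latencyAtDataRate  * 1e6);
fprintf('Clock-rate pipelining: %g ns\n', latencyAtClockRate * 1e9);
```

With these assumed numbers, five data-rate registers cost about 100 $\mu s$, while the same five registers at the clock-rate cost about 125 ns.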

Preparing the model

An important first step in applying clock-rate pipelining is to prepare the model so that it is amenable to clock-rate pipelining. Below are some of the main steps:

Defining the rate differential: Signal paths in Simulink end up on slow paths in HDL for two primary reasons. First, the signal path operates at a sample time that is slower than the base sample time of the model. Second, the Simulink base sample time may correspond to the data-rate instead of the clock-rate. For example, the base sample time in the hdlcoderFocCurrentFixptHdl.slx model is 20 $\mu s$. The final FPGA implementation of the controller may target 40 MHz (a 25 ns clock period). The trouble with setting the model's sample time to 25 ns is that it drastically slows down Simulink simulation performance. To get around this, HDL Coder provides a setting, called Oversampling, which specifies how much faster the FPGA clock rate runs with respect to the Simulink base sample time. Thus, in this case, we require an 800x oversampling factor (20 $\mu s$ / 25 ns = 800).

Turn on flattening: Clock-rate pipelining works by finding the maximal sub-regions (called clock-rate regions) of the same slow rate that are delimited either by rate-change blocks, delay blocks or subsystem boundaries. If the output of a clock-rate region is a data-rate delay block, then HDL Coder absorbs that delay which allows a budget of several clock-rate pipelines corresponding to the ratio of data-rate to clock-rate. For all other sub-region outputs, a data-rate delay is introduced. These additional data-rate delays can be avoided by flattening internal subsystems. This can be done by turning on flattening globally by setting it on the DUT subsystem. Since the default value is set to 'inherit', all subsystems underneath will be flattened. See hierarchy flattening for more information on how to use the flattening feature.
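The Oversampling value from the first step above follows directly from the two rates. A minimal sketch of that arithmetic, using the 20 $\mu s$ base sample time and the 40 MHz target from this example (the variable names are illustrative only):

```matlab
% Derive the Oversampling factor from the data-rate and the target clock.
baseSampleTime = 20e-6;  % Simulink base sample time (data-rate), in seconds
targetClockHz  = 40e6;   % intended FPGA clock frequency, in Hz

% Number of clock cycles available per data-rate sample.
oversampling = round(baseSampleTime * targetClockHz);

% Resulting clock period seen by the generated HDL.
clockPeriod = baseSampleTime / oversampling;

fprintf('Oversampling = %d, clock period = %g ns\n', ...
        oversampling, clockPeriod * 1e9);
```

The same factor is also the number of clock cycles that fit in one data-rate sample, which is the ratio that bounds the clock-rate pipeline budget mentioned in the flattening step.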

Timing optimization

Now, we are ready to apply clock-rate pipelining. The feature option is on by default and will automatically find clock-rate regions. See clock-rate pipelining documentation to understand how the pipeline budget is determined and how clock-rate regions are formed. The goal of this section is to improve timing on slow paths using distributed pipelining but without introducing additional delays. We will create a local copy of the hdlcoderFocCurrentFixptHdl.slx model to demonstrate these concepts.

srcHdlModel = 'hdlcoderFocCurrentFixptHdl';
dstHdlModel = 'hdlcoderFocRetiming';
dstHdlDut   = [dstHdlModel '/FOC_Current_Control'];
gmHdlModel  = ['gm_' dstHdlModel];
gmHdlDut    = ['gm_' dstHdlDut];

open_system(srcHdlModel);
save_system(srcHdlModel,dstHdlModel);

The subsystem FOC_Current_Control contains the algorithm from which we will generate HDL code.

open_system(dstHdlDut);

We can now configure the model to use clock-rate pipelining to break the critical path by introducing clock-rate pipeline delays throughout the slow datapath logic. To see the impact of clock-rate pipelining, generate HDL code and look inside the top-level subsystem of the generated model.

hdlset_param(dstHdlModel, 'ClockRatePipelining', 'on');
hdlset_param(dstHdlModel, 'Oversampling', 800);
hdlset_param(dstHdlDut, 'FlattenHierarchy', 'on');
hdlset_param(dstHdlDut, 'DistributedPipelining', 'on');
save_system(dstHdlModel);

makehdl(dstHdlDut);

### Generating HDL for 'hdlcoderFocRetiming/FOC_Current_Control'.
### Starting HDL check.
### To highlight blocks that obstruct distributed pipelining, click the following MATLAB script: hdlsrc/hdlcoderFocRetiming/highlightDistributedPipeliningBarriers.m
### To clear highlighting, click the following MATLAB script: hdlsrc/hdlcoderFocRetiming/clearhighlighting.m
### Generating new validation model: gm_hdlcoderFocRetiming_vnl.
### Validation model generation complete.
### Begin VHDL Code Generation for 'hdlcoderFocRetiming'.
### MESSAGE: The design requires 800 times faster clock with respect to the base rate = 2e-05.
### Working on Max/Max as hdlsrc/hdlcoderFocRetiming/Max.vhd.
### Working on Min/Min as hdlsrc/hdlcoderFocRetiming/Min.vhd.
### Working on FOC_Current_Control_tc as hdlsrc/hdlcoderFocRetiming/FOC_Current_Control_tc.vhd.
### Working on hdlcoderFocRetiming/FOC_Current_Control as hdlsrc/hdlcoderFocRetiming/FOC_Current_Control.vhd.
### Generating package file hdlsrc/hdlcoderFocRetiming/FOC_Current_Control_pkg.vhd.
### Generating HTML files for code generation report at hdlcoderFocRetiming_codegen_rpt.html
### Creating HDL Code Generation Check Report file:///tmp/BR2016bd_418663_21018/publish_examples3/tpad7ded80_4423_490b_af84_8dadc224750c/hdlsrc/hdlcoderFocRetiming/FOC_Current_Control_report.html
### HDL check for 'hdlcoderFocRetiming' complete with 0 errors, 0 warnings, and 2 messages.
### HDL code generation complete.

We can review the generated model and observe that the entire design has been flattened into one subsystem.

open_system(gmHdlDut);
set_param(gmHdlModel, 'SimulationCommand', 'update');
set_param(gmHdlDut, 'ZoomFactor', 'FitSystem');

Further, rate-transitions are introduced on the design inputs to bring them to the clock-rate. The clock-rate is the original base sample time divided by the Oversampling factor: 2e-5/800 = 2.5e-8 s, or 25 ns. All pipelines are introduced at this rate and thus operate at the clock-rate. Finally, observe that the output-side delay has been replaced by a down-sampling rate transition that brings the signal back to the data-rate. The clock frequency of the design was improved by inserting pipelines at the clock-rate, without incurring any additional sample time delays.

As with all optimizations, it is recommended that the validation model and co-simulation model are generated and the user verifies that the functional behavior of the design is unchanged. The verification documentation pages describe these concepts in more depth.
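A minimal sketch of requesting these verification artifacts from the command line, assuming the dstHdlDut variable from the listing above is still in the workspace. The cosimulation step additionally assumes that HDL Verifier and a supported HDL simulator are installed; ModelSim is named here purely as an illustration, not as a requirement of this example.

```matlab
% Regenerate HDL and explicitly request a validation model, which
% compares the original DUT against the generated (optimized) model.
makehdl(dstHdlDut, 'GenerateValidationModel', 'on');

% Generate an HDL test bench together with a cosimulation model.
% Assumption: HDL Verifier and the named simulator are available.
makehdltb(dstHdlDut, 'GenerateCosimModel', 'ModelSim');
```

In the runs shown in this example, the log already reports a validation model (gm_..._vnl), so the first call mainly makes that choice explicit.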

Area optimization

The rate differential on a slow path implies that computation along this path can take several clock cycles. Specifically, the allowed latency is defined by the clock-rate budget (see clock-rate pipelining documentation). Apart from adding pipelines to improve clock frequency, we can reuse hardware resources by leveraging the latency budget. Setting resource sharing options like StreamingFactor and SharingFactor in a slow-path region does exactly that. This section demonstrates how resource sharing is applied within clock-rate regions.

When resource sharing is applied to a clock-rate path, HDL Coder oversamples the shared resource architecture for time-multiplexing as illustrated in the Resource sharing example. However, if sharing or streaming is requested in a slow datapath, then HDL Coder implements resource sharing without oversampling. To trigger such sharing, set either sharing or streaming on the top-level subsystem. The SharingFactor value acts as an upper bound on how many resources are shared in one group. To determine a good value, analyze the resource usage of the design.

srcHdlModel = 'hdlcoderFocRetiming';
dstHdlModel = 'hdlcoderFocSharing';
dstHdlDut   = [dstHdlModel '/FOC_Current_Control'];
gmHdlModel  = ['gm_' dstHdlModel];
gmHdlDut    = ['gm_' dstHdlDut];

open_system(srcHdlModel);
save_system(srcHdlModel,dstHdlModel);

open_system(dstHdlDut);
hilite_system([dstHdlDut '/Park_Transform']);
hilite_system([dstHdlDut '/Inverse_Park_Transform']);
hilite_system([dstHdlDut '/Clarke_Transform']);
hilite_system([dstHdlDut '/Inverse_Clarke_Transform']);

The Park_Transform subsystem and the Inverse_Park_Transform subsystem each use 4 multipliers within them that can be potentially shared. Additionally, the Clarke_Transform subsystem and the Inverse_Clarke_Transform subsystem each use 2 gains, which may be potentially shared, unless they are simply power-of-2 gains, which results in shifts instead of multiplications. Therefore, we can choose the upper-bound value of 4 for SharingFactor and generate code.

hdlset_param(dstHdlDut, 'SharingFactor', 4);
save_system(dstHdlModel);

makehdl(dstHdlDut);

### Generating HDL for 'hdlcoderFocSharing/FOC_Current_Control'.
### Starting HDL check.
### To highlight blocks that obstruct distributed pipelining, click the following MATLAB script: hdlsrc/hdlcoderFocSharing/highlightDistributedPipeliningBarriers.m
### To clear highlighting, click the following MATLAB script: hdlsrc/hdlcoderFocSharing/clearhighlighting.m
### Generating new validation model: gm_hdlcoderFocSharing_vnl.
### Validation model generation complete.
### Begin VHDL Code Generation for 'hdlcoderFocSharing'.
### MESSAGE: The design requires 800 times faster clock with respect to the base rate = 2e-05.
### Working on crp_temp_shared as hdlsrc/hdlcoderFocSharing/crp_temp_shared.vhd.
### Working on crp_temp_shared_block as hdlsrc/hdlcoderFocSharing/crp_temp_shared_block.vhd.
### Working on Max/Max as hdlsrc/hdlcoderFocSharing/Max.vhd.
### Working on Max_nw as hdlsrc/hdlcoderFocSharing/Max_nw.vhd.
### Working on Min/Min as hdlsrc/hdlcoderFocSharing/Min.vhd.
### Working on Min_nw as hdlsrc/hdlcoderFocSharing/Min_nw.vhd.
### Working on crp_temp_shared_block1 as hdlsrc/hdlcoderFocSharing/crp_temp_shared_block1.vhd.
### Working on crp_temp_shared_block2 as hdlsrc/hdlcoderFocSharing/crp_temp_shared_block2.vhd.
### Working on crp_temp_shared_block3 as hdlsrc/hdlcoderFocSharing/crp_temp_shared_block3.vhd.
### Working on crp_temp_shared_block4 as hdlsrc/hdlcoderFocSharing/crp_temp_shared_block4.vhd.
### Working on FOC_Current_Control_tc as hdlsrc/hdlcoderFocSharing/FOC_Current_Control_tc.vhd.
### Working on hdlcoderFocSharing/FOC_Current_Control as hdlsrc/hdlcoderFocSharing/FOC_Current_Control.vhd.
### Generating package file hdlsrc/hdlcoderFocSharing/FOC_Current_Control_pkg.vhd.
### Generating HTML files for code generation report at hdlcoderFocSharing_codegen_rpt.html
### Creating HDL Code Generation Check Report file:///tmp/BR2016bd_418663_21018/publish_examples3/tpad7ded80_4423_490b_af84_8dadc224750c/hdlsrc/hdlcoderFocSharing/FOC_Current_Control_report.html
### HDL check for 'hdlcoderFocSharing' complete with 0 errors, 0 warnings, and 2 messages.
### HDL code generation complete.

We can review the generated model and observe that HDL Coder implements time-multiplexing in the clock-rate using knowledge of the available latency budget due to the slow datapath.

open_system(gmHdlDut);
set_param(gmHdlModel, 'SimulationCommand', 'update');
set_param(gmHdlDut, 'ZoomFactor', 'FitSystem');
hilite_system([gmHdlDut '/ctr_799']);
hilite_system([gmHdlDut '/crp_temp_shared']);
hilite_system([gmHdlDut '/crp_temp_shared1']);
hilite_system([gmHdlDut '/crp_temp_shared2']);
hilite_system([gmHdlDut '/crp_temp_shared3']);
hilite_system([gmHdlDut '/crp_temp_shared4']);
hilite_system([gmHdlDut '/crp_temp_shared5']);

The time-multiplexing architecture, also known as the single-rate sharing architecture, is the same as the architecture described in the Resource sharing with oversampling constraints example. A global scheduler is created to enable and disable different regions of the design using enabled subsystems. The enable/disable control is implemented using a limited counter, ctr_799, that counts through the latency budget (0 to 799). The shared regions are implemented as enabled subsystems that are enabled according to an automatically determined schedule order. In this design, HDL Coder found 6 groups of multipliers, each shared 4 ways or fewer. These 6 subsystems have crp_temp_shared as part of their names.

In summary, the multiplier count for the design has been reduced from 20 to 10 without any latency penalties.
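The counter-based enable scheme described above can be illustrated with a short behavioral sketch. This is not the logic HDL Coder generates: the start phase (startPhase) is a hypothetical value, and the sketch only shows how a counter spanning the 800-cycle budget can gate one shared group that needs 4 clock cycles per data sample.

```matlab
% Behavioral sketch of a limited counter gating one shared region.
budget        = 800;  % clock cycles per data-rate sample (Oversampling)
sharingFactor = 4;    % cycles needed to time-multiplex one shared group
startPhase    = 1;    % hypothetical cycle at which the group is scheduled

ctr = 0:budget-1;     % values a counter such as ctr_799 steps through

% The shared group is enabled only during its scheduled window.
enable = (ctr >= startPhase) & (ctr < startPhase + sharingFactor);

fprintf('Shared group enabled for %d of %d clock cycles per sample\n', ...
        nnz(enable), budget);
```

Because each group occupies only a handful of the 800 available cycles, several groups can be scheduled one after another within a single data-rate sample.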

Minimizing latency

As an advanced maneuver, it is possible to reduce the output latency by removing the output Delay_Register and instead using the option to allow clock-rate pipelining of DUT output ports.

srcHdlModel = 'hdlcoderFocSharing';
dstHdlModel = 'hdlcoderFocMinLatency';
dstHdlDut   = [dstHdlModel '/FOC_Current_Control'];
gmHdlModel  = ['gm_' dstHdlModel];
gmHdlDut    = ['gm_' dstHdlDut];

open_system(srcHdlModel);
save_system(srcHdlModel,dstHdlModel);

delete_line(dstHdlDut,'Space_Vector_Modulation/1','Delay_Register/1');
delete_line(dstHdlDut,'Delay_Register/1','Phase_Voltage/1');
delete_block([dstHdlDut,'/Delay_Register'])
add_line(dstHdlDut,'Space_Vector_Modulation/1','Phase_Voltage/1');

open_system(dstHdlDut);

The clock-rate pipelining for output ports option is available in the configuration parameters dialog under the 'HDL Code Generation' -> 'Global Settings' -> 'Optimization' tab: check the 'Allow clock-rate pipelining of DUT output ports' option. The command-line property name for this option is 'ClockRatePipelineOutputPorts'. When the 'ClockRatePipelineOutputPorts' option is turned on and the output register is removed, the generated HDL code does not wait for the full sample step to produce the output. Instead, it produces the output at the clock-rate, within a few clock cycles of the data being ready.

hdlset_param(dstHdlModel, 'ClockRatePipelineOutputPorts', 'on');
save_system(dstHdlModel);

makehdl(dstHdlDut);

### Generating HDL for 'hdlcoderFocMinLatency/FOC_Current_Control'.
### Starting HDL check.
### Clock-rate pipelining was applied on signals connected to the DUT's output ports. The DUT output port values are therefore updated at the clock-rate. The following ports are phase-offset by the stated number of clock cycles.
### Phase of output port 0: 8 clock cycles.
### To highlight blocks that obstruct distributed pipelining, click the following MATLAB script: hdlsrc/hdlcoderFocMinLatency/highlightDistributedPipeliningBarriers.m
### To clear highlighting, click the following MATLAB script: hdlsrc/hdlcoderFocMinLatency/clearhighlighting.m
### Generating new validation model: gm_hdlcoderFocMinLatency_vnl.
### Validation model generation complete.
### Begin VHDL Code Generation for 'hdlcoderFocMinLatency'.
### MESSAGE: The design requires 800 times faster clock with respect to the base rate = 2e-05.
### Working on Max/Max as hdlsrc/hdlcoderFocMinLatency/Max.vhd.
### Working on Min/Min as hdlsrc/hdlcoderFocMinLatency/Min.vhd.
### Working on crp_temp_shared as hdlsrc/hdlcoderFocMinLatency/crp_temp_shared.vhd.
### Working on crp_temp_shared_block as hdlsrc/hdlcoderFocMinLatency/crp_temp_shared_block.vhd.
### Working on crp_temp_shared_block1 as hdlsrc/hdlcoderFocMinLatency/crp_temp_shared_block1.vhd.
### Working on crp_temp_shared_block2 as hdlsrc/hdlcoderFocMinLatency/crp_temp_shared_block2.vhd.
### Working on crp_temp_shared_block3 as hdlsrc/hdlcoderFocMinLatency/crp_temp_shared_block3.vhd.
### Working on crp_temp_shared_block4 as hdlsrc/hdlcoderFocMinLatency/crp_temp_shared_block4.vhd.
### Working on Max_nw as hdlsrc/hdlcoderFocMinLatency/Max_nw.vhd.
### Working on Min_nw as hdlsrc/hdlcoderFocMinLatency/Min_nw.vhd.
### Working on FOC_Current_Control_tc as hdlsrc/hdlcoderFocMinLatency/FOC_Current_Control_tc.vhd.
### Working on FOC_Current_Control as hdlsrc/hdlcoderFocMinLatency/FOC_Current_Control.vhd.
### Generating package file hdlsrc/hdlcoderFocMinLatency/FOC_Current_Control_pkg.vhd.
### Generating HTML files for code generation report at hdlcoderFocMinLatency_codegen_rpt.html
### Creating HDL Code Generation Check Report file:///tmp/BR2016bd_418663_21018/publish_examples3/tpad7ded80_4423_490b_af84_8dadc224750c/hdlsrc/hdlcoderFocMinLatency/FOC_Current_Control_report.html
### HDL check for 'hdlcoderFocMinLatency' complete with 0 errors, 0 warnings, and 2 messages.
### HDL code generation complete.

Notice that the 'makehdl' command has generated a message, '### Phase of output port 0:'. This message tells the user when to sample the DUT's outputs. The number of clock cycles reported there is how soon the DUT's outputs can be sampled and is, in essence, the latency of the design. Thus, the total latency of the design is down from a data-rate sample step of 20 $\mu s$ to 8 clock cycles, which is 200 ns at the 25 ns clock period.
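The latency figure can be worked out from the reported phase and the clock period. A short sketch using only values that appear in this example (the 8-cycle phase from the makehdl log, the 2e-5 s base sample time, and the Oversampling factor of 800); the variable names are illustrative:

```matlab
% Convert the reported output phase into an absolute latency.
baseSampleTime = 2e-5;  % data-rate sample time, in seconds
oversampling   = 800;   % Oversampling setting used in this example
phaseCycles    = 8;     % from '### Phase of output port 0: 8 clock cycles.'

clockPeriod   = baseSampleTime / oversampling;  % 25 ns
outputLatency = phaseCycles * clockPeriod;      % latency in seconds

fprintf('Output latency: %g ns, compared with %g us before\n', ...
        outputLatency * 1e9, baseSampleTime * 1e6);
```

Under these numbers the output is available after roughly 200 ns, about one hundredth of the original 20 $\mu s$ sample step.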

We can review the generated model to observe that a new DUT subsystem is created whose output operates at the clock-rate, which is 25 ns.

open_system(gmHdlDut);
set_param(gmHdlModel, 'SimulationCommand', 'update');
set_param(gmHdlDut, 'ZoomFactor', 'FitSystem');

We must be careful when using this option since additional latency is introduced into the generated HDL code that was not in the original simulation model. In doing this, the sample-time of the output port has changed to the clock-rate. This introduces a possible discrepancy in results during the validation and verification flow since the test-harness expects the design to generate outputs at the data-rate. The validation model addresses this problem by inserting a down-sampling rate-transition to bring the output back to the data-rate. Thus, the validation model still compares outputs at the data-rate. The HDL testbench will, however, compare the new DUT's outputs at the clock-rate since the generated HDL outputs are emitted at the clock-rate.

Fine-tuning for performance

While this example illustrates the basic workflow to use clock-rate pipelining to minimize design latency, there are many other options available for fine-tuning HDL performance. The following are tips to leverage the feature's full potential. Note that these guidelines may not correspond to good modeling practices, but rather they are good practices for preparing your implementation model for HDL code generation and optimization.

Multi-rate designs: In this example, the source model is operating at a single rate, which is the data-rate. The Oversampling option specifies its relationship to the clock-rate. This setup works best for minimizing design latency. Clock-rate pipelining also works well in multi-rate designs by optimizing the slow-paths, but may introduce sample delays at the rate-transition boundaries. Thus, for minimizing latency, use a single-rate (the data-rate) for the whole design.
Clock frequency: You will notice in this design that distributed pipelining did not pipeline the whole datapath. This is because the optimization avoids retiming across certain blocks where doing so may cause a numerical mismatch; see distributed pipelining documentation for more details. Often, these numerical integrity issues occur at boundary conditions. If the user is confident that the design does not hit these boundary conditions, the user can turn on the performance-mode of distributed pipelining. In this case, the user must do thorough validation to confirm that the design is working properly and is robust to all operating conditions.
Flatten DUT hierarchy: For effective clock-rate pipelining, it is advisable to flatten all hierarchy. This is because clock-rate pipelining works best when all data-rate delays are at the same level of subsystem hierarchy. Setting the 'FlattenHierarchy' option on the top-level DUT will ensure this. However, to be effective, please check that all the requirements for flattening are satisfied for the lower level subsystems.
Provide sufficient budget: When the total number of clock-rate pipelines applied is equal to or more than the available oversampling budget, understanding the timing impact can be hard. Therefore, provide sufficient budget, that is, a large enough Oversampling value, for clock-rate pipelining. The only drawback of a very large Oversampling value is that the counters used by the timing controller and scheduler may be larger. The area overhead is, therefore, quite small.
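To see why the area cost of a larger budget is small, consider how the width of a counter that spans the budget grows with the Oversampling value. This is a rough sketch under the assumption that such a counter counts from 0 to Oversampling-1; the value 800 is from this example, while the two larger values are hypothetical.

```matlab
% Counter width needed to count 0 .. Oversampling-1.
oversamplingValues = [800 8000 80000];  % 800 from the example, others hypothetical

% A counter with N states needs ceil(log2(N)) bits.
counterBits = ceil(log2(oversamplingValues));

for k = 1:numel(oversamplingValues)
    fprintf('Oversampling %6d -> %2d-bit counter\n', ...
            oversamplingValues(k), counterBits(k));
end
```

The width grows only logarithmically, so a hundredfold larger budget adds just a few counter bits.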

Summary

Clock-rate pipelining is a technique to optimize and pipeline slow paths in your design. Clock-rate pipelining ensures that pipelines are introduced at the clock-rate for the following HDL Coder constructs and features:

Pipelined math operations: Several math blocks implement a multi-cycle, pipelined HDL implementation, e.g., the Newton-Raphson method for sqrt or recip, or the CORDIC algorithm for trigonometric functions. These pipelines are introduced at the clock-rate if the block operates on a slow path.
Floating point mapping: Floating point library mapping utilizes clock-rate pipelines when implementing floating point math.
Pipelining optimizations: All pipelining optimizations, including input/output pipelining and distributed pipelining, use clock-rate registers on slow paths.
Resource sharing and streaming: Time-multiplexing of resource-shared architectures is implemented at the clock-rate.

Slow paths are identified as paths that use a slower Simulink sample time, or as all paths when the Oversampling parameter is set in the HDL Coder settings. Using clock-rate pipelining, the design's speed and area properties can be improved without compromising the design's total latency.
