9.10.08

tool chain gripes... again...

Since I've been spending these past few weeks on FPGA programming, my thoughts have inevitably drifted to all the things that are wrong with the whole workflow...

The latest component I've been working on is a FIR filter... Now, this _SHOULD_ be as easy as just writing the equivalent of

y := 0;
D <= din & D(0 to NTAPS-2);
for i in 0 to NTAPS-1 loop y := y + COEFF(i)*D(i); end loop;

or something similar...
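Spelled out properly, the behavioural version I have in mind would look roughly like the sketch below. The entity name fir_beh, the WIDTH and ACC_WIDTH generics and the all-ones coefficients are placeholders I made up for illustration (ACC_WIDTH is assumed to be wide enough for the sum of all the products); only NTAPS, COEFF, D, din and y come from the snippet above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fir_beh is
  generic (
    NTAPS     : positive := 4;
    WIDTH     : positive := 16;  -- assumed sample/coefficient width
    ACC_WIDTH : positive := 40   -- assumed accumulator width, >= 2*WIDTH
  );
  port (
    clk  : in  std_logic;
    din  : in  signed(WIDTH-1 downto 0);
    dout : out signed(ACC_WIDTH-1 downto 0)
  );
end entity fir_beh;

architecture behavioural of fir_beh is
  type sample_array is array (natural range <>) of signed(WIDTH-1 downto 0);

  -- Placeholder coefficients (all ones); a real filter has its own table.
  constant COEFF : sample_array(0 to NTAPS-1) := (others => to_signed(1, WIDTH));

  -- Delay line holding the last NTAPS samples.
  signal D : sample_array(0 to NTAPS-1) := (others => (others => '0'));
begin
  process (clk)
    variable taps : sample_array(0 to NTAPS-1);
    variable y    : signed(ACC_WIDTH-1 downto 0);
  begin
    if rising_edge(clk) then
      -- Shift the new sample in. A variable is used so that the sum below
      -- sees the updated delay line in the same clock cycle.
      taps := din & D(0 to NTAPS-2);
      D    <= taps;

      -- Multiply-accumulate over all the taps in one go.
      y := (others => '0');
      for i in 0 to NTAPS-1 loop
        y := y + resize(COEFF(i) * taps(i), ACC_WIDTH);
      end loop;
      dout <= y;
    end if;
  end process;
end architecture behavioural;
```

Note that this puts every multiply and the whole adder chain between two register stages, which is exactly why it has no hope of meeting timing as written.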


BUT, because of the *inadequacies* of the toolchain, I can't do this, because I will inevitably end up with timing closure problems (I'm running a Xilinx Virtex-II Pro (v2p) at 200MHz, after all). So instead of the much simpler behavioural description, I have to do this sort of thing structurally, and it becomes a mass (or mess) of code... grrrr...

It doesn't even work if I put a few delays at the end to let the synthesis tool rebalance them (in the first place, why can't I just say "add as many delays as the tool requires"?), because the placer will inevitably put the coefficient memory in a different location from the multiplier, and not realise that the FFs feeding the multiplier can only go in the CLBs near the multiplier! So the P&R (place-and-route) will run for about an hour before giving up and saying it's not possible...
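For reference, "a few delays at the end" means something like the generic register chain below, hung off the output of the behavioural block in the hope that the tool's retiming will push the registers back into the logic. This is only a sketch: delay_line, WIDTH and STAGES are names I picked for illustration, and whether the registers actually get moved depends entirely on the synthesis tool and its settings.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity delay_line is
  generic (
    WIDTH  : positive := 16;  -- assumed data width
    STAGES : positive := 3    -- number of registers offered for retiming
  );
  port (
    clk : in  std_logic;
    d   : in  std_logic_vector(WIDTH-1 downto 0);
    q   : out std_logic_vector(WIDTH-1 downto 0)
  );
end entity delay_line;

architecture rtl of delay_line is
  type pipe_t is array (natural range <>) of std_logic_vector(WIDTH-1 downto 0);
  signal pipe : pipe_t(0 to STAGES-1);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Plain shift of the data through STAGES registers. There is no
      -- reset, so nothing ties the registers to a fixed position.
      pipe <= d & pipe(0 to STAGES-2);
    end if;
  end process;

  q <= pipe(STAGES-1);
end architecture rtl;
```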



Because of this, I wrote a set of primitive components (like an adder which can also do loading, or a loadable counter) a year ago. However, this kind of hand-coded thing is never flexible enough... e.g. take the adder-with-load example. Its declaration is something like..

entity addld is port (a, b, ld_val: in std_logic_vector(...); cin, ld_n: in std_logic; cout: out std_logic; s: out std_logic_vector(...)); end addld;

and its behaviour is something like... (loose syntax again)

cout & s <= "0" & ld_val when ld_n = '0' else ("0" & a) + ("0" & b);

Incidentally, this component has the same delay (and takes up the same space) as a normal adder (without the bypass load functionality). AND, it cannot be inferred by the synthesis tool because it uses the MULT_AND part of a CLB (this idea was taken from one of Xilinx's appnotes), which can only be instantiated... Even the LUT part cannot be inferred, because Xilinx's synthesis tool is too stupid to realise that the A and B inputs must be equal to the MULT_AND inputs, and will happily optimise them away, leaving the mapper to complain and bomb out. Great, eh?
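For comparison, a purely behavioural model of addld in legal VHDL might look like the sketch below. It is a simulation reference only, not the LUT-and-MULT_AND structure, so a synthesis tool will most likely build it as an adder followed by a separate mux. The name addld_beh and the WIDTH generic are mine, and adding cin into the sum is my assumption, since the loose one-liner above leaves cin out.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity addld_beh is
  generic (WIDTH : positive := 8);  -- assumed operand width
  port (
    a, b, ld_val : in  std_logic_vector(WIDTH-1 downto 0);
    cin, ld_n    : in  std_logic;
    cout         : out std_logic;
    s            : out std_logic_vector(WIDTH-1 downto 0)
  );
end entity addld_beh;

architecture behavioural of addld_beh is
  -- One extra bit on top to hold the carry out.
  signal sum : unsigned(WIDTH downto 0);
begin
  process (a, b, ld_val, cin, ld_n)
    variable c : unsigned(0 downto 0);
  begin
    if ld_n = '0' then
      -- Load (active low): pass ld_val straight through, carry out is '0'.
      sum <= '0' & unsigned(ld_val);
    else
      -- Add: zero-extend both operands by one bit so the carry is kept.
      c(0) := cin;  -- assumption: cin is simply added in
      sum  <= ('0' & unsigned(a)) + ('0' & unsigned(b)) + c;
    end if;
  end process;

  cout <= sum(WIDTH);
  s    <= std_logic_vector(sum(WIDTH-1 downto 0));
end architecture behavioural;
```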

BUT today, I wanted to load a particular pattern... (I want to load the extremal value of a 2's complement number)

ld_val <= neg & repeat(not neg, WIDTH-1);

(repeat will form a vector out of [arg1] repeated [arg2] times)...
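In case it isn't obvious, repeat is not a built-in; a minimal version could be written as below. The package name repeat_pkg and the little extremal wrapper entity are hypothetical, just there to make the sketch complete.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

package repeat_pkg is
  -- Returns a vector made of n copies of b (a null vector when n = 0).
  function repeat (b : std_logic; n : natural) return std_logic_vector;
end package repeat_pkg;

package body repeat_pkg is
  function repeat (b : std_logic; n : natural) return std_logic_vector is
    variable v : std_logic_vector(n-1 downto 0) := (others => b);
  begin
    return v;
  end function repeat;
end package body repeat_pkg;

library ieee;
use ieee.std_logic_1164.all;
use work.repeat_pkg.all;

entity extremal is
  generic (WIDTH : positive := 8);  -- assumed width
  port (
    neg    : in  std_logic;
    ld_val : out std_logic_vector(WIDTH-1 downto 0)
  );
end entity extremal;

architecture rtl of extremal is
begin
  -- neg = '1' gives "100...0" (most negative 2's complement value),
  -- neg = '0' gives "011...1" (most positive 2's complement value).
  ld_val <= neg & repeat(not neg, WIDTH-1);
end architecture rtl;
```

So every bit of ld_val is just neg or its inverse, which is why it is so annoying to pay for a separate level of logic to produce it.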

So because the addld is hard-coded with LUTs, the synthesis tool obviously doesn't figure out that, to implement this particular ld_val, it is not necessary to incur an additional LUT delay computing (not neg), and so I might get silly timing closure problems related to this again! grrr... Of course, I can build in the capability to have ld_val(i) optionally inverted etc etc, but the point is... THE TOOLCHAIN SHOULD BE THE ONE DOING THIS NONSENSE!!

yeah, so basically... sian...
