# CS184a: Computer Architecture (Structure and Organization)

Day 19: February 26, 2003
Time Multiplexing



Caltech CS184 Winter2003 -- DeHon

#### Last Week

- Saw how to pipeline architectures
  - specifically interconnect
  - talked about general case
- Including how to map to them
- Saw how to reuse resources at maximum rate to do the same thing

#### Today

- Multicontext
  - Review why
  - Cost
  - Packing into contexts
  - Retiming implications
- [material seen in overview weeks 2-3; now can dig deeper into details]


#### How often is reuse of the same operation applicable?

- Can we exploit higher frequency offered?
  - High throughput, feed-forward (acyclic)
  - Cycles in flowgraph
    - abundant data level parallelism [C-slow, last time]
    - no data level parallelism
  - Low throughput tasks
    - structured (e.g. datapaths) [serialize datapath]
    - unstructured
  - Data dependent operations
    - similar ops [local control -- next time]
    - dis-similar ops

#### Structured Datapaths

- Datapaths: same pinst for all bits
- Can serialize and reuse the same data elements in succeeding cycles

  - example: adder
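
As a rough illustration of serializing the adder, the sketch below reuses a single full adder for one bit position per cycle, keeping only the carry as state between cycles. The function names and the LSB-first ordering are assumptions made for this example, not taken from the slides.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: the single piece of logic that gets reused."""
    partial = a ^ b
    return partial ^ carry_in, (a & b) | (partial & carry_in)


def bit_serial_add(x, y, width):
    """Add two width-bit unsigned values with one full adder over width cycles."""
    carry = 0  # the only state carried from cycle to cycle
    total = 0
    for cycle in range(width):  # one bit position per cycle, LSB first
        bit_sum, carry = full_adder((x >> cycle) & 1, (y >> cycle) & 1, carry)
        total |= bit_sum << cycle
    return total, carry


# 11 + 6 = 17: the low four bits are 1 and the carry out is 1.
sum_bits, carry_out = bit_serial_add(11, 6, width=4)
```

The same instruction (pinst) is applied every cycle, so no extra contexts are needed; the cost is that a result appears only once every `width` cycles.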



#### Throughput Yield

[Figure: throughput yield]


#### Remaining Cases

- Benefit from multicontext as well as high clock rate
  - cycles, no parallelism
  - data dependent, dissimilar operations
  - low throughput, irregular (can't afford swap?)

#### Single Context

- When have:
  - cycles and no data parallelism
  - low throughput, unstructured tasks
  - dis-similar data dependent tasks
- Active resources sit idle most of the time
  - Waste of resources
- Cannot reuse resources to perform different function, only same


#### Resource Reuse

- To use resources in these cases
  - must direct to do different things.
- Must be able to tell resources how to behave
- => separate instructions (pinsts) for each behavior









#### Example: DPGA Area

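| Process | Chip | AEs | Contexts | AE Area | $A_{base}$ | $A_{ctx}$ | $A_{base}:A_{ctx}$ | Nominal delay |
|---------|------|-----|----------|---------|------------|-----------|--------------------|---------------|
| $1.0\mu$ CMOS | 7.1mm×6.8mm | 144 | 4 | $640K\lambda^2$ | $544K\lambda^2$ | $24K\lambda^2$ | 20+:1 | 9ns |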
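
The DPGA numbers are consistent with a simple model in which each array element (AE) costs a fixed base area plus one context's worth of memory per context. The short check below assumes that model; the constant names are introduced here for illustration only.

```python
# Areas in lambda^2, taken from the DPGA row above.
A_BASE = 544_000   # context-independent area per AE
A_CTX = 24_000     # area of one context's configuration per AE
CONTEXTS = 4

# Assumed model: AE area = A_base + c * A_ctx
ae_area = A_BASE + CONTEXTS * A_CTX   # 640_000, matching the AE Area entry
ratio = A_BASE / A_CTX                # about 22.7, i.e. the "20+:1" ratio

print(ae_area, round(ratio, 1))
```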




#### **Multicontext Tradeoff Curves**

- Assume ideal packing: $N_{active} = N_{total}/L$



Reminder: Robust point: c\*A<sub>ctxt</sub>=A<sub>base</sub>
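
A minimal sketch of one point on such a curve, under two assumptions that are not stated on the slide: the design is evaluated in exactly c = L contexts, and each active element costs $A_{base} + c \cdot A_{ctxt}$. The default areas are the DPGA values from the earlier example; `ideal_area` is a hypothetical helper name.

```python
def ideal_area(n_total, contexts, a_base=544_000, a_ctx=24_000):
    """Total area (lambda^2) with ideal packing into `contexts` contexts."""
    n_active = n_total / contexts          # ideal packing: N_active = N_total / L
    return n_active * (a_base + contexts * a_ctx)


# Context count at which context memory area equals base area (c * A_ctxt = A_base).
robust_c = 544_000 / 24_000

# Area of a hypothetical 1000-LUT design for several context counts.
areas = {c: ideal_area(1000, c) for c in (1, 2, 4, 8, 16, 32)}
```

Under these assumptions the area falls as contexts are added but flattens toward `n_total * a_ctx`, since every LUT still needs its own instruction storage.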


#### In Practice

- Scheduling Limitations
- Retiming Limitations


#### **Scheduling Limitations**

- N<sub>A</sub> (active)
  - size of largest stage
- Precedence:
  - can evaluate a LUT only after predecessors have been evaluated
  - → cannot always completely equalize stage requirements

#### Scheduling

- Precedence limits packing freedom
- Freedom we do have
  - shows up as slack in network





#### Scheduling

- Computing Slack:
  - ASAP (As Soon As Possible) Schedule
    - propagate depth forward from primary inputs
      - depth = 1 + max input depth
  - ALAP (As Late As Possible) Schedule
    - propagate distance from outputs back toward inputs
      - level = 1 + max output consumption level
  - Slack
    - slack = L+1-(depth+level) [PI depth=0, PO level=0]
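
The depth, level, and slack definitions above can be sketched as follows. The netlist encoding (a dict from each LUT to the signals it reads, with any name that is not a key treated as a primary input) and the small example circuit are assumptions made for illustration; LUTs with no LUT consumers are assumed to drive primary outputs.

```python
def asap_alap_slack(fanin, num_slots=None):
    """Return (depth, level, slack) dicts for every LUT in `fanin`."""
    fanout = {lut: [] for lut in fanin}
    for lut, inputs in fanin.items():
        for src in inputs:
            if src in fanout:              # skip primary inputs
                fanout[src].append(lut)

    depth, level = {}, {}

    def get_depth(node):                   # ASAP: forward from primary inputs
        if node not in fanin:
            return 0                       # PI depth = 0
        if node not in depth:
            depth[node] = 1 + max((get_depth(s) for s in fanin[node]), default=0)
        return depth[node]

    def get_level(node):                   # ALAP: backward from primary outputs
        if node not in level:
            level[node] = 1 + max((get_level(c) for c in fanout[node]), default=0)
        return level[node]

    for lut in fanin:
        get_depth(lut)
        get_level(lut)

    L = num_slots if num_slots is not None else max(depth.values())
    slack = {lut: L + 1 - (depth[lut] + level[lut]) for lut in fanin}
    return depth, level, slack


# Hypothetical circuit: a chain x -> y -> z -> out plus a side LUT w feeding out.
netlist = {"x": ["a", "b"], "y": ["x", "c"], "z": ["y", "d"],
           "w": ["a", "d"], "out": ["z", "w"]}
depth, level, slack = asap_alap_slack(netlist)
```

In this example the chain is the critical path (L = 4), so its LUTs have zero slack, while `w` has depth 1 and level 2 and therefore slack 2.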





#### Sequentialization

- Adding time slots
  - more sequential (more latency)
  - add slack
    - allows better balance



L=4  $\rightarrow$ N<sub>A</sub>=2 (4 or 3 contexts)


#### **Multicontext Scheduling**

- "Retiming" for multicontext
  - goal: minimize peak resource requirements
    - resources: logic blocks, retiming inputs, interconnect
- NP-complete
- list schedule, anneal
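
One simple list-scheduling heuristic, sketched here as an illustration rather than as the scheduler used for the results in this lecture: place LUTs in order of depth (least slack first) and put each one in the least-loaded context between its earliest and latest legal slots. Only logic blocks are balanced; retiming inputs and interconnect are ignored. The netlist encoding and the example circuit are hypothetical.

```python
def schedule_contexts(fanin, num_slots=None):
    """Greedy list schedule; returns (slot per LUT, peak LUTs in any slot).

    num_slots must be at least the critical path length.
    """
    fanout = {lut: [] for lut in fanin}
    for lut, inputs in fanin.items():
        for src in inputs:
            if src in fanout:              # names that are not keys are primary inputs
                fanout[src].append(lut)

    depth, level = {}, {}

    def get_depth(node):
        if node not in fanin:
            return 0
        if node not in depth:
            depth[node] = 1 + max((get_depth(s) for s in fanin[node]), default=0)
        return depth[node]

    def get_level(node):
        if node not in level:
            level[node] = 1 + max((get_level(c) for c in fanout[node]), default=0)
        return level[node]

    for lut in fanin:
        get_depth(lut)
        get_level(lut)

    L = num_slots if num_slots is not None else max(depth.values())
    load = [0] * (L + 1)                   # load[s] = LUTs placed in slot s (1-based)
    slot = {}
    # Depth order guarantees predecessors are placed first; ties go to least slack.
    order = sorted(fanin, key=lambda n: (depth[n], L + 1 - depth[n] - level[n]))
    for lut in order:
        earliest = 1 + max((slot[s] for s in fanin[lut] if s in slot), default=0)
        latest = L + 1 - level[lut]        # ALAP bound keeps successors feasible
        best = min(range(earliest, latest + 1), key=lambda s: (load[s], s))
        slot[lut] = best
        load[best] += 1
    return slot, max(load)


# Hypothetical circuit: chain x -> y -> z -> out, with side LUTs w and v feeding out.
netlist = {"x": ["a", "b"], "y": ["x", "c"], "z": ["y", "d"],
           "w": ["a", "d"], "v": ["b", "c"], "out": ["z", "w", "v"]}
slot, peak = schedule_contexts(netlist)
```

For this circuit an ASAP schedule puts `x`, `w`, and `v` in the first context (peak 3), while using the slack on `w` and `v` brings the peak down to 2.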

#### Multicontext Data Retiming

- How do we accommodate intermediate data?
- Effects?


#### Signal Retiming

- Non-pipelined
  - hold value on LUT output (wire)
    - from production through consumption
  - Wastes wire and switches by occupying them
    - for entire critical path delay L
    - not just for the 1/L'th of the cycle it takes to cross a wire segment
  - How does this show up in multicontext?




#### Alternate Retiming

- Recall from last time (Day 18)
  - Net buffer
    - smaller than LUT
  - Output retiming
    - may have to route multiple times
  - Input buffer chain
    - only need LUT every depth cycles





#### Input Buffer Retiming

- Can only take K unique inputs per cycle
- Configuration depth differs from context to context






#### DES Latency Example

[Figure: DES, single output case. Mapped area and ideal area (in $M\lambda^2$), and LUTs and ideal LUTs, plotted against number of contexts.]





#### ASCII→Hex Example

- All retiming on wires (active outputs)
  - saturation based on inputs to largest stage



Ideal=Perfect scheduling spread + no retime overhead


#### ASCII→Hex Example (input retime)

#### General throughput mapping:

- If only want to achieve limited throughput
- Target: produce new result every t cycles

1. Spatially pipeline every t stages
   - cycle = t
2. Retime to minimize register requirements
3. Multicontext evaluation w/in a spatial stage
   - retime (list schedule) to minimize resource usage
4. Map for depth (i) and contexts (c)


#### Benchmark Set

- 23 MCNC circuits
  - area mapped with SIS and Chortle

| Circuit | Mapped LUTs | Path Length | Circuit | Mapped LUTs | Path Length |
|---------|-------------|-------------|---------|-------------|-------------|
| 5xp1    | 46          | 10          | des     | 1267        | 13          |
| 9sym    | 123         | 7           | e64     | 230         | 9           |
| 9symml  | 108         | 8           | f51m    | 45          | 17          |
| C499    | 85          | 10          | misex1  | 20          | 6           |
| C880    | 176         | 21          | misex2  | 38          | 8           |
| alu2    | 169         | 19          | rd73    | 105         | 10          |
| apex6   | 248         | 9           | rd84    | 150         | 9           |
| apex7   | 77          | 7           | rot     | 293         | 16          |
| b9      | 46          | 7           | sao2    | 73          | 9           |
| clip    | 121         | 9           | vg2     | 60          | 9           |
| cordic  | 367         | 13          | z4ml    | 8           | 7           |
| count   | 46          | 16          |         |             |             |








#### **Admin**

- No Lecture Friday
- No Lecture Wed., March 5<sup>th</sup>

#### Big Ideas [MSB Ideas]

- Several cases cannot profitably reuse same logic at device cycle rate
  - cycles, no data parallelism
  - low throughput, unstructured
  - dis-similar data dependent computations
- These cases benefit from more than one instruction/operation per active element
- A<sub>ctxt</sub> << A<sub>active</sub> makes this interesting
  - save area by sharing active among instructions


#### Big Ideas [MSB-1 Ideas]

- Economical retiming becomes important here to achieve active LUT reduction
  - one output reg/LUT leads to early saturation
- c=4--8, I=4--6: automatically mapped designs are 1/2 to 1/3 of single-context size
- Most FPGAs typically run in realm where multicontext is smaller
  - How many for intrinsic reasons?
  - How many for lack of HSRA-like register/CAD support?