# CS184a: Computer Architecture (Structure and Organization)

Day 13: February 4, 2005 Interconnect 1: Requirements



Caltech CS184 Winter2005 -- DeHon

#### Last Time

- Saw various compute blocks
- To exploit structure in typical designs we need programmable interconnect
- All reasonable, scalable structures:
  - small to moderate sized logic blocks
  - connected via programmable interconnect
- Have been saying that delay across programmable interconnect is a big factor


### Today

- Interconnect Design Space
- Dominance of Interconnect
- Interconnect Delay
- Simple things
  - and why they don't work


$$A_{bit\_elm} = A_{fixed} + \underbrace{N_{SW}(N_p, w, p) \cdot A_{SW}}_{\text{Interconnect}} + \underbrace{\left(\frac{c}{w}\right) \cdot n_{ibits} \cdot A_{mem\_cell}}_{\text{Instruction}} + \left(\frac{c}{w}\right) \cdots$$



#### Dominant Time

| Design        | Path                  | Total Delay | Delay  | Inter. |
|---------------|-----------------------|--------|--------|--------|
| DPGA          | LUT-LUT (in subarray) | 3.5 ns | 1.5 ns | 60%    |
|               | LUT-xbar-LUT          | 7 ns   | 1.5 ns | 80%    |
| HSRA          | LUT-LUT               | 8 ns   | <2 ns  | 25%    |
|               | LUT-cascade           | 4 ns/4 | <2 ns  | 0%     |
|               | LUT-4tree-LUT         | 8 ns   | <2 ns  | 80%    |
|               | LUT-8tree-LUT         | 12 ns  | <2 ns  | 83%    |
|               | LUT-16tree-LUT        | 16 ns  | <2 ns  | 88%    |
|               | LUT-64tree-LUT        | 20 ns  | <2 ns  | 90%    |



### For Spatial Architectures

- Interconnect dominant
  - area
  - power
  - time
- ...so need to understand in order to optimize architectures


#### Interconnect

#### Problem

- Thousands of independent (bit) operators producing results
  - true of FPGAs today
  - ...true for \*LIW, multi-uP, etc. in future
- Each taking as inputs the results of other (bit) processing elements
- Interconnect is late bound
  - don't know until after fabrication


#### Design Issues

- Flexibility -- route "anything"
  - (w/in reason?)
- Area -- wires, switches
- Delay -- switches in path, stubs, wire length
- Power -- switch, wire capacitance
- Routability -- computational difficulty finding routes


## Delay


# Wiring Delay

- Delay on wire of length $L_{seg}$:

$$T_{seg} = T_{gate} + 0.4 RC$$

- $C = L_{seg} \times C_{sq}$
- $R = L_{seg} \times R_{sq}$

$$T_{seg} = T_{gate} + 0.4 \, C_{sq} \times R_{sq} \times L_{seg}^2$$


#### Wire Numbers

- $R_{sq} = 0.17 \, \Omega/sq$
  - from ITRS:Interconnect
    - Conductor effective resistance
    - A/R (aspect ratio)
- $C_{sq} = 7 \times 10^{-18} \, F/sq$
- $R_{sq} \times C_{sq} \approx 10^{-18} \, s$
- $T_{gate} = 30$ ps
- Chip: 7mm side, 70nm sq. (45nm process)
  - $10^5$ squares across chip


#### Wiring Delay

- Wire delay for an unbuffered cross-chip wire ($L_{seg} = 10^5$ squares, so $L_{seg}^2 = 10^{10}$):

$$T_{seg} = T_{gate} + 0.4 \, C_{sq} \times R_{sq} \times L_{seg}^2$$

$$T_{seg} = 30 \, \text{ps} + 0.4 \times 10^{-18} \, \text{s} \times 10^{10}$$

$$T_{seg} = 30 \, \text{ps} + 4 \, \text{ns} \approx 4 \, \text{ns}$$
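The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the names `wire_delay`, `T_GATE`, and `RC_SQ` are illustrative, and the constants are the rounded values from the Wire Numbers slide.

```python
# Constants from the "Wire Numbers" slide (rounded as on the slide).
T_GATE = 30e-12   # gate delay, seconds
RC_SQ = 1e-18     # R_sq * C_sq, seconds (per square, squared with length)


def wire_delay(l_seg_squares, t_gate=T_GATE, rc_sq=RC_SQ):
    """Delay of one driven segment: T_seg = T_gate + 0.4 * R_sq * C_sq * L_seg^2."""
    return t_gate + 0.4 * rc_sq * l_seg_squares ** 2


# Cross-chip wire: 7 mm side divided by 70 nm per square.
chip_squares = 7e-3 / 70e-9
t_unbuffered = wire_delay(chip_squares)
print(f"{chip_squares:.0f} squares -> {t_unbuffered * 1e9:.2f} ns")
```

Because the wire term grows with the square of the length, the 30 ps gate delay is negligible at this length.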


#### **Buffer Wire**

- Buffer every $L_{seg}$
- $T_{cross} = (L_{cross}/L_{seg}) \, T_{seg}$

$$\begin{split} T_{cross} &= (L_{cross}/L_{seg}) \, (T_{gate} + 0.4 \, C_{sq} \times R_{sq} \times L_{seg}^2) \\ &= L_{cross} \, (T_{gate}/L_{seg} + 0.4 \, C_{sq} \times R_{sq} \times L_{seg}) \end{split}$$




#### Opt. Buffer Wire

- $T_{cross} = L_{cross} \, (T_{gate}/L_{seg} + 0.4 \, C_{sq} \times R_{sq} \times L_{seg})$
- Minimize:
  - Take $d(T_{cross})/d(L_{seg}) = 0$
  - $0 = L_{cross} \, (-T_{gate}/L_{seg}^2 + 0.4 \, C_{sq} \times R_{sq})$
  - $T_{gate} = 0.4 \, C_{sq} \times R_{sq} \times L_{seg}^2$


## **Optimization Point**

- Optimized:

$$\begin{split} T_{cross} &= (L_{cross}/L_{seg}) \; (T_{gate} + 0.4 \; C_{sq} \times R_{sq} \times L_{seg}^2) \\ T_{gate} &= 0.4 \; C_{sq} \times R_{sq} \; L_{seg}^2 \end{split}$$

This says: at the optimum, equalize gate delay and wire delay in each segment.



# Optimal Segment Length

- $T_{gate} = 0.4 \, C_{sq} \times R_{sq} \times L_{seg}^2$
- $L_{seg} = \sqrt{T_{gate} / (0.4 \, C_{sq} \times R_{sq})}$
- $L_{seg} = \sqrt{30 \times 10^{-12} \, s / (0.4 \times 10^{-18} \, s)}$
- $L_{seg} \approx \sqrt{10^8} \approx 10^4$ sq.
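As a rough numerical check of the optimum, the sketch below evaluates the segment length and the resulting buffered cross-chip delay. The function names are illustrative, the constants are the rounded slide values, and the model assumes a fractional number of segments is acceptable (an idealization).

```python
import math

T_GATE = 30e-12   # gate delay, seconds
RC_SQ = 1e-18     # R_sq * C_sq, seconds


def optimal_segment(t_gate=T_GATE, rc_sq=RC_SQ):
    """L_seg that satisfies T_gate = 0.4 * R_sq * C_sq * L_seg^2."""
    return math.sqrt(t_gate / (0.4 * rc_sq))


def buffered_delay(l_cross, l_seg, t_gate=T_GATE, rc_sq=RC_SQ):
    """T_cross = (L_cross / L_seg) * (T_gate + 0.4 * R_sq * C_sq * L_seg^2)."""
    return (l_cross / l_seg) * (t_gate + 0.4 * rc_sq * l_seg ** 2)


l_opt = optimal_segment()
t_cross = buffered_delay(1e5, l_opt)   # 1e5 squares across the chip
print(f"L_seg ~ {l_opt:.0f} squares, T_cross ~ {t_cross * 1e12:.0f} ps")
```

At the optimum each segment costs twice the gate delay, so the buffered crossing time grows linearly with $L_{cross}$ rather than quadratically.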
















#### Shared Bus

- Consider operation: $y = Ax^2 + Bx + C$
  - 3 mpys
  - 2 adds
  - ~5 values need to be routed from producer to consumer
- Performance lower bound if have design w/:
  - m multipliers
  - u madd units
  - a adders
  - i simultaneous interconnection busses


# Resource Bounded Scheduling

- Scheduling in general NP-hard
  - (find optimum)
  - can approximate in O(E) time


#### Lower Bound: Critical Path

- ASAP schedule ignoring resource constraints
  - (look at length of remaining critical path)
- Certainly cannot finish any faster than that
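A minimal sketch of the critical-path bound follows. It assumes unit latency per operation and one particular way of evaluating $y = Ax^2 + Bx + C$ (three multiplies, two adds); the operation names and the dependence graph are illustrative, not taken from the slides.

```python
def asap_schedule(deps, latency):
    """ASAP finish time of each op, ignoring resource limits.

    deps maps each op to the ops whose results it consumes.
    """
    finish = {}

    def visit(op):
        if op not in finish:
            start = max((visit(p) for p in deps[op]), default=0)
            finish[op] = start + latency[op]
        return finish[op]

    for op in deps:
        visit(op)
    return finish


# Assumed dataflow: m1 = x*x, m2 = A*m1, m3 = B*x, a1 = m2+m3, a2 = a1+C
deps = {"m1": [], "m2": ["m1"], "m3": [], "a1": ["m2", "m3"], "a2": ["a1"]}
latency = {op: 1 for op in deps}          # assumption: every op takes 1 cycle

finish = asap_schedule(deps, latency)
critical_path = max(finish.values())      # lower bound on schedule length
print(finish, critical_path)
```

No schedule can finish in fewer cycles than `critical_path`, however many multipliers, adders, or busses are available.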


# Lower Bound: Resource Capacity

- Sum up all capacity required per resource
- Divide by total resource (for type)
- Lower bound on remaining schedule time
  - (best can do is pack all use densely)
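The resource-capacity bound can be sketched as below. It assumes each use occupies one unit of its resource for one cycle and that every resource type has at least one unit; the dictionary keys and the function name are illustrative, and the demand figures are the 3 multiplies, 2 adds, and roughly 5 routed values of $y = Ax^2 + Bx + C$.

```python
import math


def resource_bound(demand, capacity):
    """Max over resource types of ceil(uses required / units available)."""
    return max(math.ceil(demand[r] / capacity[r]) for r in demand)


# Demand for y = A*x^2 + B*x + C (bus count is the slide's approximate figure).
demand = {"mpy": 3, "add": 2, "bus": 5}

one_of_each = resource_bound(demand, {"mpy": 1, "add": 1, "bus": 1})
bus_limited = resource_bound(demand, {"mpy": 3, "add": 2, "bus": 1})
wide_bus = resource_bound(demand, {"mpy": 1, "add": 1, "bus": 5})
print(one_of_each, bus_limited, wide_bus)
```

In the second case, adding compute units does not help because the single bus still has to carry every value, which is the sense in which interconnect is a resource like any other.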







# Viewpoint

- Interconnect is a resource
- Bottleneck for design can be in availability of any resource
- Lower Bound on Delay: Logical Resource / Physical Resources
- May be worse
  - Dependencies (critical path bound)
  - ability to use resource


# **Shared Bus**

- Flexibility (+)
  - routes everything (given enough time)
  - can be tricky to schedule use optimally
- Delay (Power) (--)
  - wire length O(kn)
  - parasitic stubs: kn+n
  - series switch: 1
  - O(kn)
- sequentialize I/B



#### Term: Bisection Bandwidth

- Partition design into two equal-size halves
- Minimize wires (nets) with ends in both halves
- Number of wires crossing is bisection
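For a tiny design the definition can be evaluated by brute force, as in the sketch below. The function name and the example netlists are illustrative; the search is exponential in the number of nodes, so this is only meant to make the definition concrete, not to be a practical partitioner.

```python
from itertools import combinations


def bisection_width(nodes, nets):
    """Minimum number of nets with ends in both halves, over all equal splits."""
    nodes = list(nodes)
    half = len(nodes) // 2
    best = None
    for left in combinations(nodes, half):
        left = set(left)
        # A net is cut if it has endpoints on both sides of the partition.
        cut = sum(1 for net in nets if 0 < len(left & set(net)) < len(set(net)))
        if best is None or cut < best:
            best = cut
    return best


chain = [(0, 1), (1, 2), (2, 3)]
ring = chain + [(3, 0)]
full = [(a, b) for a, b in combinations(range(4), 2)]   # every pair connected

print(bisection_width(range(4), chain),
      bisection_width(range(4), ring),
      bisection_width(range(4), full))
```

The fully connected example shows why designs without locality are expensive: its bisection grows much faster than that of the chain or ring.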




# (2) Crossbar

- Avoid bottleneck
- Every output gets its own interconnect channel









# **Avoiding Crossbar Costs**

- Typical architecture trick:
  - exploit expected problem structure
- We have **freedom** in operator placement
- Designs have spatial locality
- → place connected components "close" together
  - don't need full interconnect?


# Exploit Locality

- Wires expensive
- Local interconnect cheap
- 1D versions
- What does this do to
  - Switches?
  - Delay?
- (quantify on homework)









# Mesh

- Plausible
- ...but What's w
- ...and how does it grow?


# Big Ideas [MSB Ideas]

- Interconnect Dominant
  - power, delay, area
- Can be bottleneck for designs
- Can't afford full crossbar
- Need to exploit locality
- Can't have everything close
