In this exercise we look at memory locality properties of matrix computation.
本题考察矩阵计算中存储器的局部特性。
The following code is written in C, where elements within the same row are stored contiguously.
下面的代码用 C 语言编写,同一行中的元素连续存放。
Assume each word is a 32-bit integer.
假定每个字是 32 位整数。
for (I = 0; I < 8; I++)
for (J = 0; J < 8000; J++)
A[I][J] = B[I][0] + A[J][I]
How many 32-bit integers can be stored in a 16-byte cache block?
16 字节的 cache 块中可以存放多少个 32 位的整数?
References to which variables exhibits temporal locality?
访问哪些变量会显示出时间局部性?
References to which variables exhibits spatial locality?
访问哪些变量会显示出空间局部性?
Locality is affected by both the reference order and data layout.
局部性同时受访问顺序和数据存放位置的影响。
The same computation can also be written below in Matlab, which differs from C by storing matrix elements within the same column contiguously in memory.
同样的计算也可以用下面的 MATLAB 语言编写,但与 C 语言代码不同的是,矩阵元素按列连续存放。
for I=1:8
for J=1:8000
A(I,J)=B(I,0)+A(J,I);
end
end
How many 16-byte cache blocks are needed to store all 32-bit matrix elements being referenced?
存放全部被访问的 32 位矩阵元素需要多少 16 字节的 cache 块?
References to which variables exhibits temporal locality?
访问哪些变量会显示出时间局部性?
References to which variables exhibits spatial locality?
访问哪些变量会显示出空间局部性?
For a direct-mapped cache design with a 32-bit address, the following bits of the address are used to access the cache.
对于一个 32 位地址的直接映射 cache,下面的地址位用来访问 cache。
Tag | Index | Offset |
---|---|---|
31~10 | 9~5 | 4~0 |
What is the cache block size (in words)?
cache 块大小是多少(单位为字)?
How many entries does the cache have?
cache 有多少项?
What is the ratio between total bits required for such a cache implementation over the data storage bits?
这样的 cache 实现时所需的总位数与数据存储位数之间的比率是多少?
下表记录了从上电开始的 cache 访问的字节地址。
(hexadecimal) 00 04 10 84 E8 A0 400 1E 8C C1C B4 884
(decimal) 0 4 16 132 232 160 1024 30 140 3100 180 2180
对于每个访问,请列出
- 标记、索引、偏移
- 是否命中或缺失
- 哪些字节被替换(如果有)
命中率是多少?
列出 cache 的最终状态,每个有效项表示为 < 索引,标记,数据 > 的形式。
In this exercise, we will look at the different ways capacity affects overall performance.
本题将研究不同容量对整体性能的影响。
In general, cache access time is proportional to capacity.
通常来说,cache 访问时间与 cache 容量成比例。
Assume that main memory accesses take 70 ns and that memory accesses are 36% of all instructions.
假设访问主存需要 70ns,并且在所有指令中有 36% 的指令需要访问数据存储器。
The following table shows data for L1 caches attached to each of two processors, P1 and P2.
下表是 P1 和 P2 两个处理器各自的 L1 cache 数据。
L1 Size | L1 Miss Rate | L1 Hit Time | |
---|---|---|---|
P1 | 2 KiB | 8.0% | 0.66 ns |
P2 | 4 KiB | 6.0% | 0.90 ns |
假定 L1 cache 的命中时间决定了 P1 和 P2 的周期时间,它们各自的时钟频率是多少?
What is the Average Memory Access Time for P1 and P2?
P1 和 P2 各自的 AMAT 分别是多少?
Assuming a base CPI of 1.0 without any memory stalls, what is the total CPI for P1 and P2?
假定在没有任何存储器阻塞时基本的 CPI 为 1.0,P1 和 P2 各自的总 CPI 分别是多少?
Which processor is faster?
哪个处理器更快?
Note
当我们说“基本 CPI 为 1.0”时,是指指令在一个时钟周期完成,除非访问指令或数据时发生 cache 缺失。
For the next three problems, we will consider the addition of an L2 cache to P1 to presumably make up for its limited L1 cache capacity.
对下面三个问题,我们考虑在 P1 中增加 L2 cache,以弥补 L1 cache 的容量限制。
Use the L1 cache capacities and hit times from the previous table when solving these problems.
在解决这些问题时,依然使用上表中 L1 cache 的容量和命中时间。
The L2 miss rate indicated is its local miss rate.
L2 cache 的缺失率是它的局部缺失率。
L2 Size | L2 Miss Rate | L2 Hit Time |
---|---|---|
1 MiB | 95% | 5.62 ns |
What is the AMAT for P1 with the addition of an L2 cache?
增加 L2 cache 后,P1 的 AMAT 是多少?
Is the AMAT better or worse with the L2 cache?
有了 L2 cache,AMAT 是更好还是更差了?
Assuming a base CPI of 1.0 without any memory stalls, what is the total CPI for P1 with the addition of an L2 cache?
假定在没有任何存储器阻塞时基本的 CPI 为 1.0,增加 L2 cache 后,P1 的总的 CPI 是多少?
要使有 L2 cache 的 P1 比没有 L2 cache 的 P1 更快,L2 cache 的缺失率应该是多少?
要使有 L2 cache 的 P1 比没有 L2 cache 的 P2 更快,L2 cache 的缺失率应该是多少?