[Alder Lake] RAM geometry detected incorrectly #414
Comments
Hello, First CoreFreq report with Please post here the full output of |
Hi! Here is
|
Hello, Can you pull and try the new changes in the I will need a refresh of Thank you |
Here's Seems to detect all 32GB correctly now. Thanks!
|
It works, but not the way I was expecting it to! Bus rate has decreased to 1900 MHz, with DDR5 at about 2400 MHz. Does that sound right to you? Other addition for your platform is |
EDIT: Sorry, I meant to say IBT (indirect branch tracking), not BTI. The kernel option is Original post below. Yes, the geometry seems weird. As for the bus rate, I'll need to check what the BIOS reports, and post back later today. I have to admit I'm not that familiar with processor internals. What is TCO in this context? I didn't find anything useful on the internet due to the overloaded acronym - even Intel themselves only talk about total cost of ownership. I'll look in the BIOS for a TCO setting. Speaking of processor features, one detail I forgot to mention - I have Also, for what it's worth, this is a new system, and there are still some other random issues I need to debug:
These seem unlikely to be CPU-related, so just for information :) I'll get back to you later today after I've looked at the BIOS settings. |
This definition of TCO and datasheet I'm using to program Registers |
BIOS reports DRAM frequency as 4800 MHz. (I suppose this is including the "double" in the DDR.) And after booting Linux again, with nothing changed, the memory bus is now at almost 4 GHz?!
Definition and datasheet - thank you, very interesting! Summarizing, in this context TCO is a low-level system crash watchdog, and the acronym indeed means total cost of ownership. No wonder it was hard to find. :) The BIOS on this machine has no settings for TCO - or indeed that many settings at all. It is Here's a list of available BIOS settings, other than TPM setup:
I'm not seeing any kernel modules with "TCO" in the name. Here is the full output of
|
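As a hedged aside for anyone checking the same thing: on Intel platforms the TCO watchdog is normally driven by the iTCO_wdt kernel module, so a quick way to see whether the kernel has picked it up (these commands are illustrative, not taken from the original post) is:

lsmod | grep -i tco
dmesg | grep -i -e tco -e watchdog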
Ah, and I tested enabling TCO in CoreFreq, in Window ⊳ Technologies . Here's
|
Cause of DRAM frequency discrepancy possibly found. After booting the system, the memory bus is at 4000 MHz. But if I suspend and resume, it drops to 1900 MHz. Next, to figure out why this happens... By the way, thanks for developing CoreFreq, which makes this kind of detailed analysis possible! I originally installed CoreFreq to be able to monitor individual core clock rates and temperatures, for performance profile tuning. :) |
Thanks for your feedback. (See line 4544 in 1cd8f35.)
|
Ok, good to know, thanks. The weird thing is, once I suspend/resume, the reported bus rate stays around 1900 MHz all the way until the next boot - it doesn't fluctuate. Also, the 4000 MHz after a fresh boot remains stable all the way until I suspend/resume the machine, it doesn't fluctuate either. It also doesn't matter whether the machine is stressed or not, so if the reading is accurate, whatever is happening, it doesn't seem a dynamic performance scaling issue. I read about Alder Lake and XMP 3.0, but this BIOS has no settings for that, either, so all I have to go on regarding the memory bus rate is what CoreFreq tells me. Or what other tools tell me - for comparison, I tried CPU-X, which says "Kingston KF548S38-16, 16384 MB @ 4800 MHz (SODIMM DDR5)" for each of the two RAM slots (this is after a suspend and resume). I've run some test loads (AI training on GPU using a custom code built on TensorFlow; and an MPI-distributed FEM code on CPU, specifically a custom Navier-Stokes solver built on FEniCS). Performance for both seems normal after suspend/resume, but to be sure, I'll have to re-check with a fresh boot. |
What is not trivial to read from source code is that CSR Registers like |
To read IMC data and up to the third group of timings: HWiNFO and OCCT in the Windows world. Memtest86+, run bare-metal, may provide the primary group of timings on Alder Lake, as well as the IMC frequency. |
Ah, thanks! Restarting CoreFreq ( Windows solutions are unfortunately not applicable, as this machine is Linux-only. I can download and try a recent Memtest86+, though, to see what it reports on the bare metal. Must be a decade since I've last run that utility :) |
Other settings, but same story in the past: |
Ok. This is getting weirder. After leaving the machine idle for half an hour, the memory bus clock rate had dropped to 400 MHz.
That in itself fair enough for power saving, as 400 MHz is the lowest supported frequency, at least on the CPU - but when did the clock rate switch happen? I haven't seen the 4 GHz appear again. Other than that 400 MHz after a long idle break, the memory bus clock rate has been at 1900 MHz regardless of load. Performance of my test loads after a fresh boot was pretty much identical to my earlier results after a suspend/resume cycle, which might hint that the memory bus ran at the same speed. Either that, or the loads are not memory intensive enough for the performance to differ. :) In other news, upgraded kernel to Running the new kernel, I was able to run a session of Cyberpunk without crashing, but oddly enough, only from a fresh boot. If the machine has been suspended and resumed, the game still exhibits the same random crashing as before. But the crashes have been random enough that I can't be sure if the kernel change fixed them or not - need more testing. To narrow down the cause for the crashes, I tried also the older 515 version of the NVIDIA drivers. No change other than no software TDP limiter support in Haven't looked at the bare metal readings with MemTest86+ yet - I'll keep you posted. |
EDIT: I will come back with another monitoring of those frequencies in about two weeks. |
Hello, You can pull Reading this Intel post, I believe that the Gear mode should be taken into account, but I'm not sure which registers are involved. Can you please show the refresh of |
Hi, Nice. Sure, here's It's now reporting ~2.4 GHz with ~4.8 GT/s, as one would expect from this setup.
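(A quick sanity check of those figures, assuming the usual DDR convention of two transfers per clock: 2.4 GHz × 2 = 4.8 GT/s, i.e. DDR5-4800, which matches the KF548S38-16 modules mentioned earlier.)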
Also, in other news - for debugging the random crashes (of this machine, not CoreFreq :) ), I'm now running on the discrete NVIDIA GPU only, to take Optimus out of the equation. Disabling the integrated GPU seems to have solved a couple of the issues I was having with the machine:
Optimus has worked fine on other Intel/NVIDIA laptops I've used it on (4th and 10th gen i7), but new gen, new bugs, I suppose. I'd have preferred to use Optimus to squeeze out the last bit of discrete GPU memory for AI and games, as well as to save a few watts when idle, but if disabling Optimus means the machine runs without issues, I'll do so for now, and check again with new driver and kernel versions in a year or two. I still need to test with MemTest86+, to get us some numbers for the IMC from the bare metal. I'll get back to you on that as soon as I have the time. |
Regarding gear mode, yeah, should be taken into account. Also, doesn't seem to be documented well. Out of curiosity, I took a look at the guide, and gear is not mentioned anywhere on any of the 5060 pages of the document, despite being up to date up to 13th gen. :) I noticed that Tom's hardware mentioned that a tool called EDIT: Ah, right, you mentioned HWInfo already :) |
You rather read the datasheets of the 13th generation where gear is somehow specified. See my wiki for doc references. |
Ah, thank you! According to vol. 2 of the Raptor Lake datasheet, section Scheduler Configuration (SC_GS_CFG_0_0_0_MCHBAR) — If bit 31 is set, the MC is in GEAR2, and if bit 15 is set, the MC is in GEAR4:
On a side note, OS access to this register is R - only the BIOS has RW access. Which is probably a good thing. :P For comparison, I also checked vol. 2 of the Alder Lake datasheet, same section (pp. 137-139), and the offset and the bit numbers are the same. So we should be able to read the gear the same way in both gen 12 and gen 13. Do you want to have a go at implementing this (I can test it), or alternatively, care to point me at the relevant part of the source code so I can try? EDIT: add the note this register is used for Scheduler configuration, as per the docs. |
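In case it helps, here is a minimal sketch of what decoding that could look like, assuming only the two bit positions quoted above; the helper name and the Gear 1 fallback are my own guesses, not CoreFreq code:

/* Hypothetical helper: decode the gear mode from a raw SC_GS_CFG_0_0_0_MCHBAR
 * value. Bit 31 set -> Gear 2, bit 15 set -> Gear 4, otherwise assume Gear 1. */
static unsigned int Gear_From_SC_GS_CFG(unsigned int cfg)
{
	if (cfg & (1U << 31))
		return 2;
	if (cfg & (1U << 15))
		return 4;
	return 1;
}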
Hmm, there's also a 32-bit MMIO register at Memory Controller BIOS Request (MC_BIOS_REQ_0_0_0_MCHBAR_PCU) — Offset 5E00h (pp. 202-203 in Raptor Lake docs, pp. 184-185 in Alder Lake docs):
And then there's a 32-bit MMIO register at Memory Controller BIOS Data (MC_BIOS_DATA_0_0_0_MCHBAR_PCU) — Offset 5E04h (pp. 203-204; respectively pp. 186-187):
OS has R access to both of these registers, too. Right now, I don't know which of these is best - maybe try all of them and see what they report? |
@Technologicat You will peek the register in Query_ADL_IMC():

void Query_ADL_IMC(void __iomem *mchmap, unsigned short mc)
{	/* Source: 12th Generation Intel® Core Processor Datasheet Vol 2 */
	unsigned short cha;
	unsigned int value = 0;

	value = readl(mchmap + 0x5E04);	/* MC_BIOS_DATA_0_0_0_MCHBAR_PCU */
	printk("Register=%x\n", value);
	/* ... rest of the existing function unchanged ... */
}

Next, rebuild, reload the driver and print the kernel log to read the register output in hexadecimal:

make clean all
rmmod corefreqk
insmod ./corefreqk.ko
dmesg

which should show something like:

Register=abcd1234 |
I'll also be away for a few days, so a short update for now:
Also, while strictly unrelated, but I've babbled so much about my setup in this thread that other users with a CLEVO (More related to RAM, in general, is the random fact that the VRAM on the GPU runs on a 7 GHz clock rate. I hadn't realized GDDR was that fast.) So I'll report that I tried underclocking the GPU (cores as well as VRAM) by 10%. This did it - it seems the crashes are gone. My hypothesis is that the crashing is likely caused by the infamous transient power spikes of the RTX 30xx GPU series, briefly overwhelming the power supply capabilities of the laptop. The power brick is 200W, but there is also the battery subsystem to consider. (I have no idea whether the power always passes through the battery subsystem on this model.) Note the GPU TDP is 125W, and for the i7-12700H CPU, 45W. The rest of the system also needs some power. So if the GPU draws much more power even for a short while, the system may brown-out and crash. Also note that the crashes occur even with the GPU sustained power draw limited to 80W ( So if you ended up here from a search engine, and are experiencing similar issues on a similar laptop, here's what I did:
As a concluding side note, another solution for troubleshooting GPU power issues, seen on the internet, is:
This disables performance scaling of the GPU, leading to a more predictable power draw. Note that in this mode, the GPU will consume more power when idle. You can also change the setting in the GUI, as well as see the current value of the setting, by running just Also note that changing the PowerMizer mode has mostly been suggested for "the GPU has fallen off the bus" errors, not for random system crashes. I tried on both |
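For anyone else reading: the setting in question is NVIDIA's PowerMizer preferred mode, which can also be changed from a terminal. A hedged example (attribute and target names as I recall them, not quoted from the post):

# Force "Prefer Maximum Performance" (mode 1) on the first GPU:
nvidia-settings -a "[gpu:0]/GpuPowerMizerMode=1"
# Query the current value:
nvidia-settings -q "[gpu:0]/GpuPowerMizerMode"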
Ok, I'm back. Here's the raw data from registers 0x5E00 and 0x5E04 (sampled both out of curiosity):
Or in binary,
As for In unrelated news, I spoke too soon - the crashes weren't gone yet, just occurred much less often. No crash in up to 2h of gaming, then boom. It's still very rare to get the system to crash with anything other than Cyberpunk, but the fact it has happened twice with other loads tells me it's not the game. But because the crash occurs the most often with it, this specific game is an ideal test load. Trying the Unigine Valley GPU benchmark, I noticed that when power-limited to 80W, the GPU clock rate jumps around a lot. Reading this comment on undervolting got me thinking, and I recalled Indeed, locking the maximum GPU clock rate at 1.785 GHz * (80W / 125W) ~ 1.1 GHz, where 125W is the default TDP and 1.785 GHz the default GPU clock rate, the GPU at full load stays just under the power limit, and does not need to throttle the clock rate, according to monitoring via nvtop. No need to touch the VRAM clock rate, it can run at the default 7 GHz. Tested yesterday. Two hours of idling in-game, and one hour of gaming (later, in another session). So far stable. (And as a note for other gamers who happen upon this, 80W / 1.1 GHz on the RTX 3070 Ti mobile gives about 30-40 FPS, which to me is perfectly playable in a role-playing game. The much lower fan noise level is well worth the FPS hit. Your mileage may vary.) |
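For reference, a sketch of how such a power limit and clock cap can be applied with nvidia-smi; the 80 W and ~1.1 GHz values come from the post above, while the 210 MHz floor is just a typical idle clock I picked for illustration:

sudo nvidia-smi -pl 80          # cap sustained board power at 80 W
sudo nvidia-smi -lgc 210,1100   # lock GPU core clocks to the 210-1100 MHz range
sudo nvidia-smi -rgc            # undo the clock lock later, if needed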
Perhaps, if not null (aka register value not equal
If this can help, CoreFreq is monitoring some hardware event bits. |
One important thing I forgot to mention: upon loading the module, CoreFreq read the IMC registers 8 times. Only the first read gives any useful results. On second and further reads, the value in both registers is
Ah, thank you! It's very nice that CoreFreq runs in a terminal, so I can SSH in from another machine and run it over the SSH session when a fullscreen benchmark is running :) Also, for NVIDIA GPU tuning, I forgot to mention that From the (I don't know why the extra "1". My first thoughts were that either someone at NVIDIA likes 2001: A Space Odyssey, or has set things up well in advance for a reference to the old Over 9000! internet meme, once future VRAM gens hit 9 GHz. But more likely it's due to rounding.) |
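As an aside, one place such a memory clock reading shows up (not necessarily the tool referred to above) is the clock dump from nvidia-smi:

nvidia-smi -q -d CLOCK          # lists Graphics/SM/Memory/Video clocks in MHz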
Is this on the same channel, same controller? Because the driver loops over 8 possible controllers, 12 channels each, considering the latest Zen architectures. (See line 2000 in 34efe5d.)
Different controllers and different channels will certainly return the same timings, but registers may be slightly different. So it appears better to probe all of them when feasible. |
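As a rough illustration of "probe all of them" (a simplified sketch, not CoreFreq's actual loop; Map_MCH_BAR is a hypothetical helper standing in for however the driver maps each controller):

/* Walk every potential memory controller and dump its registers. */
unsigned short mc;
for (mc = 0; mc < 8; mc++) {
	void __iomem *mchmap = Map_MCH_BAR(mc);	/* hypothetical mapping helper */
	if (mchmap != NULL) {
		Query_ADL_IMC(mchmap, mc);
		iounmap(mchmap);
	}
}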
So forget my GFX events, those are for the Intel integrated iGPU. |
Thanks. Good catch. Printing the value of the
so once per controller only, as expected.
Ah, right, good point. I have that disabled for now. |
Original issue covered |
Yes, working correctly. Thank you! P.S. Noticed that |
P.P.S. Final note for other gamers: again, the crashes were not yet gone, just occurred less often. The real culprit turned out to be the CPU - most likely, a compatibility issue either with the Linux kernel or with some specific software (such as Cyberpunk). I disabled the e-cores, and haven't had a single crash since then. The way I thought of testing this possibility was reading some anecdotal reports on the internet about game instability with e-cores, and about PC crashes in video transcoding with e-cores enabled. Also, I got a semi-reproducible crash by merging Stable Diffusion checkpoints, making this easier to test. When the crash happened, the hw monitors ( Note that the BIOS setting for legacy game compatibility mode does nothing in Linux; instead, use the features provided by Linux to turn off individual cores. An easy GUI way is to create a profile in TUXEDO Control Center, and in that profile, set the number of logical cores to 12. Then the system will use the p-cores and their hyperthreads, as can be confirmed by |
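For completeness, the non-GUI route is the CPU hotplug interface in sysfs. A hedged sketch, assuming the usual enumeration on the i7-12700H where cpu0-cpu11 are the P-cores plus their hyperthreads and cpu12-cpu19 are the E-cores (verify with lscpu first, since the numbering is not guaranteed):

lscpu --extended                          # check which CPU numbers are the E-cores
for n in $(seq 12 19); do                 # offline cpu12..cpu19 (assumed E-cores)
    echo 0 | sudo tee /sys/devices/system/cpu/cpu$n/online
done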
This issue was moved to a discussion.
You can continue the conversation there.
Hi,
I have a CLEVO PD70PNN1 laptop with 32GB of DDR5 RAM (2 x 16GB), but CoreFreq reports only a single 16GB DIMM in the DIMM geometry. However, the Total RAM is reported correctly (32GB). Output of corefreq-cli -s -n -m -n -c 1 -n -k -n -B -n -M below. If it matters, I'm running Linux Mint 21.
Is there any other information I can provide to help?