Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: SMC hwmon driver #299

Closed
wants to merge 2,545 commits into from
Closed

Conversation

chadmed
Copy link
Member

@chadmed chadmed commented Jun 5, 2024

As it says on the tin, just an RFC for an improved hwmon driver which pulls data from the SMC.

The only real major issue is that there seems to be no reliable or accurate source of information
for which keys are present on which machine. The data in this MR comes mostly from poking
around with m1n1 on various devices. Some additional sensors come from exelban/stats,
however granularity at the SoC level seems insufficient as many devices are missing sensors which
are supposed to be there (e.g. random CPU core temperatures).

My preferred approach for this would be to simply ditch the attempt to keep SMC sensors in .dtsi files
and just experimentally determine which are present on each device (or class of device) and populate
&smc at that level.

@chadmed chadmed force-pushed the asahi-hwmon-merge-v2 branch from c747afc to 50cce7b Compare June 9, 2024 00:29
if (!hwmon->fan)
return -ENOMEM;

hwmon_node = of_find_node_by_name(NULL, "macsmc-hwmon");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use of_get_child_by_name(pdev->dev.parent->of_node, "macsmc-hwmon"); the "macsmc-" isn't really needed in the name.
I would put the device_node into hwmon struct

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to keep a reference to it around? We don't use it after the sensors have been populated.

for (idx = 0; idx < num_keys; idx += 1)
configs[idx] = flags;

configs[idx + 1] = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function can be replaced by memset32, it's already 0-termination though the kzalloc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting rid of the 0 assignment makes sense, but I don't see how we do away with the loop since we need the config flag mask set for every sensor/channel. We can't just memset32 one location.

@chadmed chadmed force-pushed the asahi-hwmon-merge-v2 branch from b96b0ac to 31c9fad Compare June 15, 2024 13:05
@chadmed chadmed force-pushed the asahi-hwmon-merge-v2 branch 2 times, most recently from 070a639 to 9345bce Compare June 16, 2024 02:20
@chadmed
Copy link
Member Author

chadmed commented Jun 16, 2024

I am very unhappy with how we are handling the "shared" keys here. There doesn't seem to be much at all in common between logic boards, even keys you'd expect wouldn't change. I don't think having a series of cascaded includes is worth it given how wildly the boards seem to differ, and we should probably just specify the keys directly at the highest level we can empirically demonstrate that they are shared, e.g. t600x-j314-j416.dtsi

@jannau jannau force-pushed the asahi-wip branch 5 times, most recently from dc04487 to 138f32e Compare June 22, 2024 08:58
@chadmed chadmed force-pushed the asahi-hwmon-merge-v2 branch 2 times, most recently from db96da1 to 2b0ad20 Compare June 22, 2024 13:26
luizluca and others added 15 commits June 27, 2024 13:52
[ Upstream commit 4f580e9 ]

The necessity of asserting the reset on removal was previously
questioned, as DSA's own cleanup methods should suffice to prevent
traffic leakage[1].

When a driver has subdrivers controlled by devres, they will be
unregistered after the main driver's .remove is executed. If it asserts
a reset, the subdrivers will be unable to communicate with the hardware
during their cleanup. For LEDs, this means that they will fail to turn
off, resulting in a timeout error.

[1] https://lore.kernel.org/r/[email protected]/

Signed-off-by: Luiz Angelo Daros de Luca <[email protected]>
Reviewed-by: Linus Walleij <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 7c52c70 ]

Like various other Asus Vivobook and Expertbook laptops, the Asus
Vivobook Pro N6506MV has a DSDT table that describes IRQ 1 as ActiveLow
while the kernel is overriding it to Edge_High. This prevents the internal
keyboard from working. This patch prevents this issue by adding this laptop
to the override table that prevents the kernel from overriding this IRQ

Link: https://bugzilla.kernel.org/show_bug.cgi?id=218745
Tested-by: Gianni <[email protected]>
Signed-off-by: Tamim Khan <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit c2e6a87 ]

KCSAN detected a race condition in netpoll:

	BUG: KCSAN: data-race in net_rx_action / netpoll_send_skb
	write (marked) to 0xffff8881164168b0 of 4 bytes by interrupt on cpu 10:
	net_rx_action (./include/linux/netpoll.h:90 net/core/dev.c:6712 net/core/dev.c:6822)
<snip>
	read to 0xffff8881164168b0 of 4 bytes by task 1 on cpu 2:
	netpoll_send_skb (net/core/netpoll.c:319 net/core/netpoll.c:345 net/core/netpoll.c:393)
	netpoll_send_udp (net/core/netpoll.c:?)
<snip>
	value changed: 0x0000000a -> 0xffffffff

This happens because netpoll_owner_active() needs to check if the
current CPU is the owner of the lock, touching napi->poll_owner
non atomically. The ->poll_owner field contains the current CPU holding
the lock.

Use an atomic read to check if the poll owner is the current CPU.

Signed-off-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit bf76b14 ]

Currently, when using WCN7850 or QCN9274 as AP, ath12k always performs down
grade phy mode operation regardless of whether the firmware supports EHT
capability or not and then vdev will start in HE mode. When stations that
support EHT capability try to connect to the AP, the AP will set phy mode
to EHT after receiving the association request packet, and then send
WMI_PEER_ASSOC_CMDID command to firmware, AP’s firmware will crash.

This is because when the ath12k_mac_copy_sband_iftype_data() function
handles EHT capability, it does not copy the EHT capability into the
iftype[band][type] array according to the interface type. So, interface
type should not be used as an index to get eht_cap in
ath12k_mac_check_down_grade_phy_mode() function.

To address this issue, use types_mask to select the eht_cap in
ath12k_mac_check_down_grade_phy_mode() function.

This patch affects QCN9274 and WCN7850 because they have the same issue.

Hostapd log:
wlo1: STA 02:03:7f:37:12:34 IEEE 802.11: Could not set STA to kernel driver

Kernel log:
[270894.816076] ath12k_pci 0000:03:00.0: failed to send WMI_PEER_SET_PARAM cmd
[270894.816111] ath12k_pci 0000:03:00.0: failed to setup peer SMPS for vdev 0: -108
[270894.816122] ath12k_pci 0000:03:00.0: Failed to associate station: 02:03:7f:37:12:34
[270894.843389] ieee80211 phy5: Hardware restart was requested
[270894.843517] ath12k_pci 0000:03:00.0: failed to lookup peer 02:03:7f:37:12:34 on vdev 0
[270894.843616] ath12k_pci 0000:03:00.0: failed to send WMI_PEER_DELETE cmd
[270894.843650] ath12k_pci 0000:03:00.0: failed to delete peer vdev_id 0 addr 02:03:7f:37:12:34 ret -108
[270894.843663] ath12k_pci 0000:03:00.0: Failed to delete peer: 02:03:7f:37:12:34 for VDEV: 0

Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.0.1-00029-QCAHKSWPL_SILICONZ-1

Signed-off-by: Lingbo Kong <[email protected]>
Acked-by: Jeff Johnson <[email protected]>
Acked-by: Jeff Johnson <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://msgid.link/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit ecf0b2b ]

During chip recovery (e.g. chip reset), there is a possible situation that
kernel worker reset_work is holding the lock and waiting for kernel thread
stat_worker to be parked, while stat_worker is waiting for the release of
the same lock.
It causes a deadlock resulting in the dumping of hung tasks messages and
possible rebooting of the device.

This patch prevents the execution of stat_worker during the chip recovery.

Signed-off-by: Leon Yen <[email protected]>
Signed-off-by: Ming Yen Hsieh <[email protected]>
Signed-off-by: Felix Fietkau <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit dd2c345 ]

This device sometimes doesn't send touch release signals when moving
from >=4 fingers to <4 fingers. Using MT_QUIRK_NOT_SEEN_MEANS_UP instead
of MT_QUIRK_ALWAYS_VALID makes sure that no touches become stuck.

MT_QUIRK_FORCE_MULTI_INPUT is not necessary for this device, but does no
harm.

Signed-off-by: Sean O'Brien <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 59d2f5b ]

Adjusts the report descriptor for N-Key devices to
make the output count 0x01 which completely avoids
the need for a block of filtering.

Signed-off-by: Luke D. Jones <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit c901f63 ]

Lenovo Slim 7 16ARH7 is a machine with switchable graphics between AMD
and Nvidia, and the backlight can't be adjusted properly unless
acpi_backlight=native is passed.  Although nvidia-wmi-backlight is
present and loaded, this doesn't work as expected at all.

For making it working as default, add the corresponding quirk entry
with a DMI matching "LENOVO" "82UX".

Link: https://bugzilla.suse.com/show_bug.cgi?id=1217750
Signed-off-by: Takashi Iwai <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit cd3fc3b ]

[Changes from V1:
- The warning to disable is -Wmaybe-uninitialized, not -Wuninitialized.
- This warning is only supported in GCC.]

The BPF selftest verifier_global_subprogs.c contains code that
purposedly performs out of bounds access to memory, to check whether
the kernel verifier is able to catch them.  For example:

  __noinline int global_unsupp(const int *mem)
  {
	if (!mem)
		return 0;
	return mem[100]; /* BOOM */
  }

With -O1 and higher and no inlining, GCC notices this fact and emits a
"maybe uninitialized" warning.  This is by design.  Note that the
emission of these warnings is highly dependent on the precise
optimizations that are performed.

This patch adds a compiler pragma to verifier_global_subprogs.c to
ignore these warnings.

Tested in bpf-next master.
No regressions.

Signed-off-by: Jose E. Marchesi <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Yonghong Song <[email protected]>
Cc: Eduard Zingerman <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 2d3b8df ]

On slow machines the SND timestamp sometimes doesn't arrive before
we quit. The test only waits as long as the packet delay, so it's
easy for a race condition to happen.

Double the wait but do a bit of polling, once the SND timestamp
arrives there's no point to wait any longer.

This fixes the "TXTIME abs" failures on debug kernels, like:

   Case ICMPv4  - TXTIME abs returned '', expected 'OK'

Reviewed-by: Willem de Bruijn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit b37b98a ]

Validate that `first` flag is set only for the first
descriptor in multi-buffer packets.
In case of an invalid descriptor, a reset will occur.
A new reset reason for RX data corruption has been added.

Signed-off-by: Shahar Itzko <[email protected]>
Signed-off-by: David Arinzon <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 68aa210 ]

This fixes some CHECKs reported by the checkpatch script.

Issues reported in ath3k.c:
-------
ath3k.c
-------
CHECK: Please don't use multiple blank lines
+
+

CHECK: Blank lines aren't necessary after an open brace '{'
+static const struct usb_device_id ath3k_blist_tbl[] = {
+

CHECK: Alignment should match open parenthesis
+static int ath3k_load_firmware(struct usb_device *udev,
+                               const struct firmware *firmware)

CHECK: Alignment should match open parenthesis
+               err = usb_bulk_msg(udev, pipe, send_buf, size,
+                                       &len, 3000);

CHECK: Unnecessary parentheses around 'len != size'
+               if (err || (len != size)) {

CHECK: Alignment should match open parenthesis
+static int ath3k_get_version(struct usb_device *udev,
+                       struct ath3k_version *version)

CHECK: Alignment should match open parenthesis
+static int ath3k_load_fwfile(struct usb_device *udev,
+               const struct firmware *firmware)

CHECK: Alignment should match open parenthesis
+               err = usb_bulk_msg(udev, pipe, send_buf, size,
+                                       &len, 3000);

CHECK: Unnecessary parentheses around 'len != size'
+               if (err || (len != size)) {

CHECK: Blank lines aren't necessary after an open brace '{'
+       switch (fw_version.ref_clock) {
+

CHECK: Alignment should match open parenthesis
+       snprintf(filename, ATH3K_NAME_LEN, "ar3k/ramps_0x%08x_%d%s",
+               le32_to_cpu(fw_version.rom_version), clk_value, ".dfu");

CHECK: Alignment should match open parenthesis
+static int ath3k_probe(struct usb_interface *intf,
+                       const struct usb_device_id *id)

CHECK: Alignment should match open parenthesis
+                       BT_ERR("Firmware file \"%s\" not found",
+                                                       ATH3K_FIRMWARE);

CHECK: Alignment should match open parenthesis
+               BT_ERR("Firmware file \"%s\" request failed (err=%d)",
+                                               ATH3K_FIRMWARE, ret);

total: 0 errors, 0 warnings, 14 checks, 540 lines checked

Signed-off-by: Uri Arev <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit f30a3be ]

[WHY]
PSP can access DCN registers during command submission and we need
to ensure that DCN is not in PG before doing so.

[HOW]
Add a callback to DM to lock and notify DC for idle optimization exit.
It can't be DC directly because of a potential race condition with the
link protection thread and the rest of DM operation.

Cc: Mario Limonciello <[email protected]>
Cc: Alex Deucher <[email protected]>
Reviewed-by: Charlene Liu <[email protected]>
Acked-by: Alex Hung <[email protected]>
Signed-off-by: Nicholas Kazlauskas <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit b5b6d62 ]

[Why]
Cursor update can be pre-empted by a request for setting target flip
submission.

This causes an issue where we're in the middle of the exit sequence
trying to log to DM, but the pre-emption starts another DMCUB
command submission that requires being out of idle.

The DC lock aqusition can fail, and depending on the DM/OS interface
it's possible that the function inserted into this thread must not fail.

This means that lock aqusition must be skipped and exit *must* occur.

[How]
Modify when we consider idle as active. Consider it exited only once
the exit has fully finished.

Consider it as entered prior to actual notification.

Since we're on the same core/thread the cached values are coherent
and we'll see that we still need to exit. Once the cursor update resumes
it'll continue doing the double exit but this won't cause a functional
issue, just a (potential) redundant operation.

Reviewed-by: Duncan Ma <[email protected]>
Acked-by: Wayne Lin <[email protected]>
Signed-off-by: Nicholas Kazlauskas <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 23f1d8b ]

The Z830 has some buttons that will only work properly as "quickstart"
buttons. To enable them in that mode, a value between 1 and 7 must be
used for HCI_HOTKEY_EVENT. Windows uses 0x5 on this laptop so use that for
maximum predictability and compatibility.

As there is not yet a known way of auto detection, this patch uses a DMI
quirk table. A module parameter is exposed to allow setting this on other
models for testing.

Signed-off-by: Arvid Norlander <[email protected]>
Tested-by: Hans de Goede <[email protected]>
Reviewed-by: Hans de Goede <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Hans de Goede <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
jannau and others added 4 commits July 1, 2024 16:06
"ioft" is a 48.16 fixed point type.

Signed-off-by: Janne Grunau <[email protected]>
Apple Silicon devices, particularly Macs, continue Apple's tradition of
integrating a ridiculous number of sensors on their products. Most of these
report to the System Management Controller, and their data is exposed by
the SMC firmware as simple key-value pairs.

This commit adds an hwmon drive for retrieving the data reported by the
temperature, voltage, current and power sensors, as well as fan data.

Devices each expose their own unique set of sensors and fans, even with
the SMC being baked into the SoC. This driver walks a devicetree node
in order to discover the sensors present on the current device at runtime.

Co-developed-by: Janne Grunau <[email protected]>
Signed-off-by: Janne Grunau <[email protected]>
Signed-off-by: James Calligeros <[email protected]>
@chadmed chadmed force-pushed the asahi-hwmon-merge-v2 branch from 10e36ca to 2806038 Compare July 1, 2024 06:06
chadmed added 5 commits July 1, 2024 16:07
Each SoC's SMC exposes a different set of hardware sensor keys,
however there are a number that are shared between all currently
supported SoCs. Describe these in a .dtsi so that we don't need to
duplicate them across every SoC.

Likewise, the fans on every machine are exposed as the same set of
keys on each. Add .dtsis for these too.

Co-developed-by: Janne Grunau <[email protected]>
Signed-off-by: Janne Grunau <[email protected]>
Signed-off-by: James Calligeros <[email protected]>
@chadmed chadmed force-pushed the asahi-hwmon-merge-v2 branch from 2806038 to 77b479a Compare July 1, 2024 06:08
@chadmed
Copy link
Member Author

chadmed commented Jul 1, 2024

I have pulled in @jannau's fixes and squashed the fan changes into a single commit adding all the "common" files. I have also split out each SoC into a separate commit and completely removed keys that were not readable on any of the machines I have access to.

EDIT: also rebased

@jannau
Copy link
Member

jannau commented Jul 8, 2024

I've pulled the code into bits/110-smc but haven't merged any of the hwmon devicetree changes. Mostly over the CPU core temp sensors. I still not convinced that there are temp sensors for the efficiency cores. I can't identify any temp sensor with a clear temp increase signal when running a CPU stressor on any of the efficiency cores. I would expect that ~0.6 W power use on the area of an efficiency core would be clearly identifiable.

misc issues spotted so far: hwmon-t8112.dtsi has only 7 CPU core sensors, we can't use t600x / t602x as includes since we need to handle M1 / M2 Ultra separately, we should think about how we can handle disabled CPU cores like on M1/M2 Pro.

@chadmed
Copy link
Member Author

chadmed commented Jul 8, 2024

I agree with not merging the DT changes until we figure out more stuff. FWIW, the reason t8112 lost a sensor is because I have no t8112 devices where the dropped CPU temp sensor was able to be read, and part of the final cleanup I did was getting rid of stuff that I could not verify as working on any hardware.

Same deal with the ecore sensors on t8112 - I could not get them to light up with an ecore load however they did materially change when burning even one of the pcores... The only solution here may be to brute force every T*** combination with a m1n1 experiment or something :/

@jannau jannau force-pushed the asahi-wip branch 4 times, most recently from 83a2784 to 15b881f Compare July 29, 2024 19:30
@jannau jannau force-pushed the asahi-wip branch 2 times, most recently from ccc12d2 to 24c1838 Compare August 21, 2024 20:21
@chadmed chadmed closed this Aug 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.