Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide an option to run DSA, IAA tests without device reset #38

Closed
ozhuraki opened this issue May 16, 2023 · 25 comments
Closed

Provide an option to run DSA, IAA tests without device reset #38

ozhuraki opened this issue May 16, 2023 · 25 comments

Comments

@ozhuraki
Copy link

No description provided.

@mythi
Copy link

mythi commented May 16, 2023

@ozhuraki same as #36?

@ozhuraki
Copy link
Author

@mythi
This is related but slighty different, i.e. when tests are run on an already configured wq, the device is reset which trigger EROFS in some cases (kernel version, etc):
https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/demo/accel-config-demo/idxd-reset.patch

@ramesh-thomas
Copy link
Contributor

@xinzhanz please see whether this can be changed.

@xinzhanz
Copy link
Contributor

Don't find the difference when apply this patch. When there is wq is enabled, run make check libaccfg, test is skip whatever the patch is added or not.

@mythi
Copy link

mythi commented Sep 15, 2023

Without the patch, this is what we see:

# ./dsa_test -w 1 -l 1048576 -o 0x10 -f 0x1 t200 -d dsa4/wq2.0
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 0: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 10: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 12: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 14: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 2: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 4: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 6: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 8: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 1: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 11: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 13: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 15: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 3: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 5: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 7: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 9: add_dev() failed
[error] No usable wq found

when adding the patch, everything works OK:

# ./dsa_test -w 1 -l 1048576 -o 0x10 -f 0x1 t200 -d dsa4/wq2.0
[ info] alloc wq 1 shared size 16 addr 0x7f2328f76000 batch sz 0x400 xfer sz 0x80000000
[ info] testmemory: opcode 16 len 0x100000 tflags 0x1 num_desc 1
[ info] preparing descriptor for crcgen
[ info] Submitted all crcgen jobs
[ info] verifying task result for 0x55c7f7b1ecf0
expected crc = 1419fe20

@mythi
Copy link

mythi commented Sep 15, 2023

accel-config version v4.1, linux 6.2

@mythi
Copy link

mythi commented Sep 26, 2023

@xinzhanz any suggestions we could try or what might be missing?

@xinzhanz
Copy link
Contributor

As I said, I didn't find the function difference. You can ask Ramesh to merge it.

@mythi
Copy link

mythi commented Sep 27, 2023

Do we know what's causing the error we see when running these tests in a container?

@mythi
Copy link

mythi commented Oct 23, 2023

Do we know what's causing the error we see when running these tests in a container?

@xinzhanz @ramesh-thomas any suggestions what to try?

@xinzhanz
Copy link
Contributor

Could you remove "-w 1" and try, ./dsa_test -l 1048576 -o 0x10 -f 0x1 t200 -d dsa4/wq2.0?
If it is ok, please send your patch out for merging.

@ramesh-thomas
Copy link
Contributor

@mythi you can either create a pull request or send the patch to [email protected] after checking the above.

@mythi
Copy link

mythi commented Oct 26, 2023

I'm not comfortable sending the patch because it just ignores the error and we don't know why the error happens. Is there some writing to sysfs during the tests?

@ramesh-thomas
Copy link
Contributor

I'm not comfortable sending the patch because it just ignores the error and we don't know why the error happens. Is there some writing to sysfs during the tests?

Yes, I agree. I did not know the patch was the one linked above adding a check for EROFS

Can you please check the cmd_status attribute? ls -l /sys/bus/dsa/devices/dsa0/cmd_status

This has never happened. Are you having a very old kernel and idxd driver?

Thanks

@mythi
Copy link

mythi commented Oct 26, 2023

This has never happened. Are you having a very old kernel and idxd driver?

my log was from 6.2. Is that old? :-) One reason I can think of is is that sysfs is often mounted read-only to containers. That's the reason I was asking if the tests need to write anything there.

@ramesh-thomas
Copy link
Contributor

Yes, it clears the command status attribute at the start. How can anything be configured if the attributes are made read-only?

@mythi
Copy link

mythi commented Oct 26, 2023

WQ setup is done separately by the node admin. The app container just uses the pre-configured queues

@mythi
Copy link

mythi commented Oct 27, 2023

WQ setup is done separately by the node admin. The app container just uses the pre-configured queues

@ramesh-thomas

What:		/sys/bus/dsa/devices/dsa<m>/cmd_status
Date:		Aug 28, 2020
KernelVersion:	5.10.0
Contact:	[email protected]
Description:	The last executed device administrative command's status/error.
		Also last configuration error overloaded.
		Writing to it will clear the status.

suggests that the apps are not supposed to write to it. Is my understanding correct?

@ramesh-thomas
Copy link
Contributor

Description: The last executed device administrative command's status/error.
Also last configuration error overloaded.
Writing to it will clear the status.

suggests that the apps are not supposed to write to it. Is my understanding correct?

I did not understand your question. Writing to it clears the status as the description states. We need to clear the status to avoid reading any stale value.

@mythi
Copy link

mythi commented Oct 30, 2023

We need to clear the status to avoid reading any stale value.

My question was if these 'device administrative commands' are related to WQ config which in our case are not done by the application container

@mythi
Copy link

mythi commented Nov 7, 2023

I've not looked deep into the code. Is the cmd_status writing done in the test code? Can it be done conditionally or when the reset is really needed (read returns an error).

@ramesh-thomas
Copy link
Contributor

Can you please try the latest in pending branch? I have added a fix.

@mythi
Copy link

mythi commented Dec 7, 2023

Can you please try the latest in pending branch? I have added a fix.

OK, checking.

@mythi
Copy link

mythi commented Dec 8, 2023

Can you please try the latest in pending branch? I have added a fix.

@ramesh-thomas our CI gives green with the patch so it seems to work well, thanks a lot!

@ramesh-thomas
Copy link
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants