[ci] Upgrade Azure VMSS to use Mariner Linux #6222
Upgrade Azure VM Scale Set to use Mariner (Azure Linux) systems.
What is the benefit of this change? |
I've been asked by the service team within the company to upgrade the OS of the Azure Virtual Machine Scale Set (VMSS) from the open-source Ubuntu image to an Azure-native Linux distribution called Mariner... Since an in-place upgrade is simply not feasible, I have to shut down the old VMSS and create a new one using Mariner. I'm almost done, with only one CI test reporting a segmentation fault. I'm trying to fix that. |
It is weird that the cpp tests sometimes randomly hit segmentation faults. Note that even though we are using the Mariner image for the VMSS, we still use docker to pull the Ubuntu image within the VMSS and run the tests in the same Ubuntu container as before. |
Got it, ok! I guess that's this? https://github.com/microsoft/CBL-Mariner No problem, like you said...we run all the Linux tests in containers, so hopefully it shouldn't be too disruptive.
Thanks very much for providing that link to the CI run you're talking about! Yeah, I'm not sure if that's related to this change or not. There's been a very low level of activity in the repo over the last few weeks...I haven't seen that specific job segfault in other commits outside this PR, but there also just have not been very many CI runs recently. I think it's ok to merge this and see if we observe more such problems. If we do that, can you leave the existing hosted runner VM running for a week or two so that we'd be able to switch back if the segfaults become more common? |
Yes. But I've debugged this using a VM with the same settings as the VMSS and found that the segmentation fault happens very frequently. I'm afraid that wouldn't be acceptable. Hopefully I can find some more time to fix the issue. It is weird because I'm debugging using a docker container created from an Ubuntu 22.04 image within the Mariner system, which is the same container image as in our previous CI VMSS. I manually triggered CI on the master branch several times yesterday and found no segmentation faults. |
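Since the crash is described as intermittent, one way to put a number on "very frequently" is to rerun the failing binary many times and count the non-zero exits. The following is only a rough sketch: `count_failures` is a hypothetical helper name, and `./test.o` in the usage comment stands in for whichever test binary is crashing.

```shell
# Sketch: run a command N times and print how many runs exited non-zero.
# `count_failures` is a hypothetical helper, not something from this PR.
# The body runs in a subshell ( ... ) so its variables do not leak out.
count_failures() (
    runs="$1"
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Output is discarded; only the exit status matters here.
        # A process killed by SIGSEGV typically shows up as status 139 (128 + 11).
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
)

# Example usage (placeholder binary name):
# count_failures 50 ./test.o
```

Running the same count on the old and new runner images would give a like-for-like comparison, under the assumption that both use the same container image and binary.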
Oh! Ok you're right, that's a serious problem and we should figure it out before merging. Thank you for being so thorough! Let me know if there's anything I can do to help. |
The root cause is the sanitizers in clang. Here is a very minimal reproducible case
Even this program compiled with In other words, it has nothing to do with the code in our repo. Do you have any suggestion on this? @jameslamb @guolinke @jmoralez @StrikerRUS |
People encountered the same issue here |
hmmmm I wasn't able to reproduce that with
docker run \
    --rm \
    -it ubuntu:latest \
    /bin/bash
apt-get update
apt-get install -y \
    clang-14
cat << EOF >test.cpp
int main(int argc, char** argv) {
    return 0;
}
EOF
clang++-14 -Wall -O3 -fsanitize=address -o test.o ./test.cpp
./test.o
echo $?
# 0
Maybe this hack to allow the use of Lines 138 to 140 in 074b3e8
Like maybe Mariner (or the I have another idea too... maybe it's because Mariner is using such an old version of the Docker Engine API? On a failed build on the new runner (build link), I see the following:
On the most recent build of
The first release of v1.41 of the Docker Engine API was in May 2019 (moby/moby#39208). That series ran until... February 2021, I think? (moby/moby#42063). v1.43 is the most recent version of the Docker Engine API (https://docs.docker.com/engine/api/version-history/). |
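To compare Docker Engine API versions between runners, a small version check along these lines might help. This is a sketch under assumptions: `version_lt` is a hypothetical helper, the `1.43` threshold is purely illustrative, and the check relies on `sort -V` being available (it is in GNU coreutils). The usage comment reads the server-side API version with `docker version --format`.

```shell
# version_lt A B: succeeds when dotted version A sorts strictly before B.
# `version_lt` is a hypothetical helper; it depends on `sort -V` (version sort).
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example usage on a runner (threshold chosen only for illustration):
# api="$(docker version --format '{{.Server.APIVersion}}')"
# if version_lt "$api" "1.43"; then
#     echo "Docker Engine API is older than 1.43: $api"
# fi
```

Version sort is used instead of a plain string comparison so that, for example, 1.9 is treated as older than 1.41.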
Just tried with clang++17 and the issue disappears. I'll upgrade clang to clang-17. |
Ok I think that's alright for this. |
The Docker version could also be a potential cause. But given that we urgently need to migrate to Mariner, I'll upgrade clang for now to quickly solve this issue, and investigate the Docker version later on. |
Yeah I think that's ok! I understand it's something Microsoft considers time-sensitive. And we get so much more test coverage per commit by having that hosted runner than we'd be able to with only free-tier resources...I think it's worth it. |
…t/LightGBM into azure-pipelines-mariner
The gpu_source task failed with clang 17. Retriggering to see if we can reproduce this error... If this occurs frequently, I will restrict the usage of clang 17 to only the cpp tests task. |
I found that without -DUSE_GPU=ON and with I guess that with |
I'll double the RAM of the VMSS... |
interesting! So then could we revert the use of |
Perhaps not, because we have two issues here
So we have to keep the |
Oh I see. Ok no problem. |
Note that the Mariner image does not natively contain docker and git, so we use the cloud-init tool to install them whenever a VM is created in the VMSS. However, this doesn't seem to be 100% successful. For example, this job fails with an undefined reference to git But according to the jobs we've run so far, I think this only occurs rarely (in fewer than 1 in 10 Azure Pipeline runs). So maybe we can ignore this for now and merge the changes, and leave a better initialization of the VMs for the future. WDYT? @jameslamb |
Previously we had 8GB RAM for each VM in the VMSS (as well as in the previous VMSS using Ubuntu images). Now we have 16GB RAM. I see that more RAM does bring stability to the CI jobs. I manually triggered 5 pipeline runs using the VMSS with 16GB RAM, and 4 out of 5 succeeded. Only one failed, due to the initialization issue (git was not successfully installed). However, the initialization issue rarely happens. |
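One possible way to make the rare initialization failures fail fast and obviously is a pre-flight check at the start of each job. This is a minimal sketch, not what the PR does: `missing_tools` is a hypothetical helper name, and the tool list (git, docker) is taken from the comment above.

```shell
# missing_tools TOOL...: print each tool that is not found on PATH and
# exit non-zero if any are missing. Hypothetical helper, not part of this PR.
# The body runs in a subshell ( ... ), so `exit` only ends the function call.
missing_tools() (
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    exit "$status"
)

# Example usage as the first step of a CI job:
# if ! missing="$(missing_tools git docker)"; then
#     echo "VM not fully initialized, missing: $missing" >&2
#     exit 1
# fi
```

A check like this would not fix the cloud-init problem itself, but it would separate "the VM was not set up" failures from genuine test failures.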
I think it's ok, let's merge this and see how often the jobs fail.
Even before this PR, I've observed Azure DevOps initialization errors, timeouts, and other not-related-to-LightGBM failures pretty regularly (maybe 1 out of every 5 commits has at least 1 failed job).
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |