GBK encoding caused CodeQL to detect code written in Java/Kotlin, but it was unable to process any of it #18527

Open
Weijin-wj opened this issue Jan 17, 2025 · 7 comments
Labels
question Further information is requested

Comments

@Weijin-wj

Hello, I encountered the following issue while creating a database using codeql:

CodeQL detected code written in Java/Kotlin but could not process any of it. For more information, review our troubleshooting guide at https://gh.io/troubleshooting-code-scanning/no-source-code-seen-during-build.

Later I found out it was because the Maven project used GBK encoding, and the pom file is configured as follows:

<project.build.sourceEncoding>GBK</project.build.sourceEncoding>
<project.reporting.outputEncoding>GBK</project.reporting.outputEncoding>

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <configuration>
        <source>1.8</source>
        <target>1.8</target>
        <encoding>GBK</encoding>
    </configuration>
</plugin>

I then changed GBK to UTF-8 in the pom file, and the database was created successfully without the error above. Could you please explain why GBK encoding causes this issue? My workaround does not feel very elegant. Is there a better approach? Thank you.

Weijin-wj added the question label Jan 17, 2025
@redsun82
Contributor

👋 @Weijin-wj I'm glad you found a workaround! Let me circle back to the internal team to see whether UTF-8 encoding is a known requirement or whether this is a bug we need to fix. Even in the former case, we would definitely need to improve how this is reported.

@smowton
Contributor

smowton commented Jan 21, 2025

I have tried extracting a few repositories that use GBK encoding, but I am not able to reproduce what you describe. Could you please give an example of a repository that fails? Are you using CodeQL on the command line? If so, what command did you run? Alternatively, are you using it via GitHub Actions? If so, did you use CodeQL default setup or an advanced configuration with an explicit action YAML file?

@Weijin-wj
Author

@redsun82 @smowton Sorry, I can't provide the original code, but I reproduced the issue with the attached project. I created the database from the command line with the following command:

codeql database create ./test-db --language=java --command="mvn clean install --file pom.xml -Dmaven.test.skip=true"

I cannot tell whether the issue is caused by something I did or by something else. I haven't been working with CodeQL for very long, and there are still many areas I am not familiar with. I apologize if a mistake on my part has caused you any trouble.
Hello-Java-Sec-master.zip

@smowton
Contributor

smowton commented Jan 21, 2025

If I download your Hello-Java-Sec-master.zip file and use that command I observe:

  • Doesn't compile at all under Java 21, probably because Lombok needs to be explicitly activated.
  • Extracts successfully under Java 17, although the GBK-encoded XML documents cannot be extracted.

Could you provide the full log from the terminal and the content of the test-db/log directory for your failure case?

@Weijin-wj
Author

Of course, I'm happy to provide it. I've put the terminal log and the database creation logs into a zip file.

test_db_log.zip

@smowton
Contributor

smowton commented Jan 21, 2025

Thanks. I suspect your JAVA_HOME might be Java 8 or lower, which is causing us to use our minimal shipped JDK to run extraction, which in turn doesn't support many character encodings.

Workaround: set your JAVA_HOME to a Java >= 9 that supports GBK encoding (likely, any non-minified JDK).
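One way to check whether a candidate JDK is suitable before pointing JAVA_HOME at it is to ask that JVM directly whether it can decode GBK. The following is only a minimal sketch: the class name `CharsetCheck` and its helper method are hypothetical names chosen for illustration, and it only tests the JVM that runs it, not the JDK bundled with CodeQL. It relies solely on the standard `java.nio.charset.Charset.isSupported` call.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper: reports whether the running JVM can handle a charset.
public class CharsetCheck {

    // Returns true if this JVM has the named charset available.
    // A syntactically illegal charset name is treated as "not available".
    public static boolean isCharsetAvailable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to GBK, the encoding configured in the pom above.
        String name = args.length > 0 ? args[0] : "GBK";
        System.out.println("java.home    = " + System.getProperty("java.home"));
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println(name + " supported: " + isCharsetAvailable(name));
    }
}
```

Compile it with `javac CharsetCheck.java` and run it with the `java` binary from the JDK that JAVA_HOME will point to. If the last line prints `false`, that JDK is probably a reduced runtime without the extended charsets (on modular JDKs these are normally supplied by the `jdk.charsets` module) and would likely hit the same problem.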

As a proper fix I will revise the CodeQL JDK to support more charsets so this doesn't occur in future.

@Weijin-wj
Author

@smowton Thank you for clearing up my confusion. Without knowing the cause, I think I would have been stuck for a long time. CodeQL is really a great tool that has helped me a lot. Thank you again for your response.
