corpus.mallet returned non-zero exit status 1. #2876

Closed
T4rQu1N opened this issue Jul 7, 2020 · 14 comments
Labels
need info Not enough information for reproduce an issue, need more info from author

Comments

T4rQu1N commented Jul 7, 2020

Problem description

So many people have had this issue, and I have tried all the fixes suggested, to no avail. The path is correct, and I have changed it multiple times to remove spaces etc. I get the same error. Bearing in mind I have no idea how to provide all the information necessary, please respond with precise instructions as to how to debug this issue.

Error is
CalledProcessError: Command 'mallet-2.0.8/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input C:\Users\DraGoN\AppData\Local\Temp\b76d8f_corpus.txt --output C:\Users\DraGoN\AppData\Local\Temp\b76d8f_corpus.mallet' returned non-zero exit status 1.

I have looked and the temp files do exist in the temp directory. I have even tried editing the .bat file to hard code the mallet_home directory, and the java installation directory. Nothing works. I get the same error.

Steps/code/corpus to reproduce

import os
from gensim.models.wrappers import LdaMallet

os.environ.update({'MALLET_HOME':r'C:/Users/DraGoN/Documents/python/mallet-2.0.8'})
mallet_path = 'mallet-2.0.8/bin/mallet' # update this path

# Alternative LDA model: download here and put in directory - https://www.machinelearningplus.com/wp-content/uploads/2018/03/mallet-2.0.8.zip
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)

Versions

Please provide the output of:

Windows-10-10.0.18362-SP0
Python 3.7.6 (default, Jan  8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)]
NumPy 1.19.0
SciPy 1.4.1
gensim 3.8.3
FAST_VERSION 1
@piskvorky
Owner

Quoting T4rQu1N: "I have tried all the fixes suggested, to no avail"

What fixes have you tried?

What output do you see when you run the mallet-2.0.8/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input C:\Users\DraGoN\AppData\Local\Temp\b76d8f_corpus.txt --output C:\Users\DraGoN\AppData\Local\Temp\b76d8f_corpus.mallet command manually (outside of Gensim, directly)?

I'd definitely recommend using an absolute path for the executable, not just mallet-2.0.8/bin/mallet.
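
A minimal sketch of the absolute-path advice, assuming the launcher lives in the `bin` folder of the unzipped Mallet directory (`mallet.bat` on Windows, `mallet` elsewhere). The helper name `find_mallet_launcher` is hypothetical and not part of Gensim.

```python
import os
import tempfile


def find_mallet_launcher(mallet_home):
    """Return an absolute path to the Mallet launcher under mallet_home.

    Raises FileNotFoundError if no launcher is found, so that a wrong
    path fails early instead of surfacing as a non-zero exit status.
    """
    bin_dir = os.path.join(os.path.abspath(mallet_home), "bin")
    # Assumption: Windows uses mallet.bat, other systems the plain script.
    names = ("mallet.bat", "mallet") if os.name == "nt" else ("mallet",)
    for name in names:
        candidate = os.path.join(bin_dir, name)
        if os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError("no Mallet launcher found in " + bin_dir)
```

The returned string could then be passed as `mallet_path` in place of the relative `mallet-2.0.8/bin/mallet`.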

@piskvorky added the need info label on Jul 8, 2020
@T4rQu1N
Author

T4rQu1N commented Jul 8, 2020

Thanks for your reply. Fixes I have tried as follows:

  • Set the MALLET_HOME variable in the .bat file
  • Changed the path in the code to various combinations of /, \, // and \\, as suggested by many other posts
  • Changed mallet_path to an absolute path
  • Removed the trailing \ or / from the ends of the paths
  • Reinstalled Java, both the JRE and the JDK
  • Set the Java path manually in the .bat file

All give the same error (non zero exit status).

How do I run the command manually outside of Gensim? When I navigate to the mallet bat directory, and copy and paste into CMD (anaconda):

mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input C:\Users\DraGoN\AppData\Local\Temp\b76d8f_corpus.txt --output C:\Users\DraGoN\AppData\Local\Temp\b76d8f_corpus.mallet 

I get the error "MALLET requires an environment variable MALLET_HOME". I should mention that I have reverted all the above fixes and am working with a default unzipped mallet-2.0.8.

Kind regards,

@piskvorky
Owner

piskvorky commented Jul 8, 2020

Set your MALLET_HOME environment variable, as per the Mallet instructions.

Also, that's not the command you posted earlier (the path at the beginning is different).
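
One way to see whether the current Python process has a usable MALLET_HOME is sketched below. It assumes only that the variable should point at an existing folder containing a `bin` subfolder. `check_mallet_home` is a hypothetical helper, not a Gensim or Mallet function.

```python
import os


def check_mallet_home(environ=None):
    """Return a list of problems found with MALLET_HOME (empty if none)."""
    environ = os.environ if environ is None else environ
    home = environ.get("MALLET_HOME")
    if not home:
        return ["MALLET_HOME is not set in this process"]
    problems = []
    if not os.path.isdir(home):
        problems.append("MALLET_HOME is not an existing directory: " + home)
    elif not os.path.isdir(os.path.join(home, "bin")):
        problems.append("MALLET_HOME has no bin folder: " + home)
    return problems


# An empty list means the variable is set and looks plausible.
print(check_mallet_home())
```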

@T4rQu1N
Author

T4rQu1N commented Jul 8, 2020

Thanks for your speedy responses. The command I pasted was adjusted slightly because I navigated into the bin folder directly in CMD.

I have now added MALLET_HOME to the environment variables in Windows, and run the command again in CMD. No error message here, so I assume it worked?

For future reference, if anyone else doesn't know how to do that step: go to System Properties and add a user environment variable (http://shiningmeadow.blogspot.com/2016/04/tutorial-for-installing-mallet-on.html). This was not clear to me when installing MALLET, as I assumed the os.environ.update command in Python would take care of this on a temporary basis.
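
For reference, Python documents that changes made through `os.environ` are passed on to child processes started afterwards. Whether that alone was enough in this particular setup is not settled in this thread. A small sketch of the general behaviour follows; the folder `C:\mallet-2.0.8` is a made-up placeholder.

```python
import os
import subprocess
import sys

# Hypothetical location; replace with your own unzipped Mallet folder.
os.environ["MALLET_HOME"] = r"C:\mallet-2.0.8"

# A child process started after the assignment inherits the variable.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ.get('MALLET_HOME'))"],
    capture_output=True,
    text=True,
    check=True,
)
seen_by_child = child.stdout.strip()
print(seen_by_child)
```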

Running again in Python, I still get the error, however the number has changed. It is now a non-zero exit status 2 (instead of 1).

@piskvorky
Owner

piskvorky commented Jul 8, 2020

OK. Please post the exact command you're running from the CLI (which you say works), and the exact command that Gensim outputs (when it fails with exit status 2). Exact, character-for-character. Cheers.

@T4rQu1N
Author

T4rQu1N commented Jul 8, 2020

Sure, though I can't imagine it makes much difference, since multiple commands all give the same error.

CMD command:

 C:\Users\DraGoN\Documents\python\mallet-2.0.8\bin>mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input C:\Users\DraGoN\AppData\Local\Temp\b76d8f_corpus.txt --output C:\Users\DraGoN\AppData\Local\Temp\b76d8f_corpus.mallet

Works fine. No error in CMD.

In python:

os.environ.update({'MALLET_HOME':r'C:/Users/DraGoN/Documents/python/mallet-2.0.8/'})
mallet_path = r'C:/Users/DraGoN/Documents/python/mallet-2.0.8/bin/mallet' # update this path

ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)

I have tried hashing out the os.environ.update command now too. And removing/adding the "/" at the end of the dir, but it makes no difference. This is the error in full:

CalledProcessError                        Traceback (most recent call last)
<ipython-input-19-d0c4d0ee93c2> in <module>
      6 mallet_path = r'C:/Users/DraGoN/Documents/python/mallet-2.0.8/bin/mallet' # update this path
      7 
----> 8 ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)
      9 
     10 # Show Topics

D:\Apps\Anaconda\lib\site-packages\gensim\models\wrappers\ldamallet.py in __init__(self, mallet_path, corpus, num_topics, alpha, id2word, workers, prefix, optimize_interval, iterations, topic_threshold, random_seed)
    129         self.random_seed = random_seed
    130         if corpus is not None:
--> 131             self.train(corpus)
    132 
    133     def finferencer(self):

D:\Apps\Anaconda\lib\site-packages\gensim\models\wrappers\ldamallet.py in train(self, corpus)
    270 
    271         """
--> 272         self.convert_input(corpus, infer=False)
    273         cmd = self.mallet_path + ' train-topics --input %s --num-topics %s  --alpha %s --optimize-interval %s '\
    274             '--num-threads %s --output-state %s --output-doc-topics %s --output-topic-keys %s '\

D:\Apps\Anaconda\lib\site-packages\gensim\models\wrappers\ldamallet.py in convert_input(self, corpus, infer, serialize_corpus)
    259             cmd = cmd % (self.fcorpustxt(), self.fcorpusmallet())
    260         logger.info("converting temporary corpus to MALLET format with %s", cmd)
--> 261         check_output(args=cmd, shell=True)
    262 
    263     def train(self, corpus):

D:\Apps\Anaconda\lib\site-packages\gensim\utils.py in check_output(stdout, *popenargs, **kwargs)
   1930             error = subprocess.CalledProcessError(retcode, cmd)
   1931             error.output = output
-> 1932             raise error
   1933         return output
   1934     except KeyboardInterrupt:

CalledProcessError: Command 'C:/Users/DraGoN/Documents/python/mallet-2.0.8/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input C:\Users\DraGoN\AppData\Local\Temp\ea532c_corpus.txt --output C:\Users\DraGoN\AppData\Local\Temp\ea532c_corpus.mallet' returned non-zero exit status 2.
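
The `CalledProcessError` message above shows only the command and its exit status, not what the command printed. A minimal sketch for capturing that output from Python is given here; `run_and_report` is a hypothetical helper, and the child process in the example is a stand-in for the failing import-file command, not Mallet itself.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd and return (exit status, stdout, stderr) instead of raising."""
    # A string goes through the shell, as in the traceback above;
    # a list of arguments is executed directly.
    result = subprocess.run(
        cmd, shell=isinstance(cmd, str), capture_output=True, text=True
    )
    return result.returncode, result.stdout, result.stderr


# Stand-in command: writes a message to stderr and exits with status 2.
status, out, err = run_and_report(
    [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(2)"]
)
print(status, repr(err))
```

To apply this to the real failure, pass the exact command string from the error message and read the returned stderr text.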

@piskvorky
Owner

piskvorky commented Jul 8, 2020

That's weird. Can you try the exact same command, but with the full path C:/Users/DraGoN/Documents/python/mallet-2.0.8/bin/mallet, instead of changing to some specific dir and then running just mallet?

@T4rQu1N
Author

T4rQu1N commented Jul 8, 2020

Yes, have tried it now in CMD. No error, everything seems fine.

Out of interest, what does the non-zero exit status 2 mean? And how can I confirm that Mallet is running correctly in CMD? Is there a way to import the output file manually into Gensim to see if it produces anything?

@piskvorky
Owner

piskvorky commented Jul 8, 2020

Mallet is running correctly if its output indicates training without errors. It prints a lot of information. And at the end of training, it will have created the requested output files (new files appear on your disk).

Not sure what exit status 2 is.
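
In general, an exit status is a small integer chosen by the program being run, with 0 conventionally meaning success. What 2 means for this particular command is not established in this thread. The sketch below only shows how a non-zero status surfaces in Python, using a stand-in child process rather than Mallet.

```python
import subprocess
import sys

# A child process that deliberately exits with status 2.
cmd = [sys.executable, "-c", "import sys; sys.exit(2)"]

try:
    subprocess.run(cmd, check=True)
    returncode = 0
except subprocess.CalledProcessError as exc:
    # check=True raises for any non-zero status and records it here.
    returncode = exc.returncode

print(returncode)
```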

@T4rQu1N
Author

T4rQu1N commented Jul 8, 2020

Ok, so new files do appear in the temp directory. So I can assume manually, it works. Interestingly, if I restart the kernel and hash the os.environ.update so that it doesn't run, I then get the error:

FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\DraGoN\AppData\Local\Temp\d941f0_state.mallet.gz'

As soon as I unhash it, and re run the code, then it goes back to the normal non-zero exit status 2 error.

Does this help diagnose it at all?

@piskvorky
Owner

What kernel, what "manually"? What are you actually doing?

@T4rQu1N
Author

T4rQu1N commented Jul 8, 2020

Sorry, I always assume everyone does it the same way.

I am using Jupyter Notebook. Restarting the kernel basically clears the cache and reloads the code. I thought that since I had already set the MALLET_HOME environment variable in Windows, I wouldn't need to set it in Python as well. However, not doing so gave the FileNotFoundError. So I uncommented that particular line, and it went back to the usual non-zero exit status error. Basically, I am trying to figure out what exactly is causing the issue, and to separate the parts that are working from those that aren't.

When I refer to manually, I mean, putting the code into CMD. Don't forget I am a horrible noob, and don't really understand gensim at the best of times. I was hoping, since I can't get the wrapper to work in jupyter/python, that perhaps there is a way to import the output file from mallet directly, so that I can use the Lda analysis and continue working in jupyter after this step.
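
On reading Mallet output by hand: the train-topics command in the traceback above is given an `--output-topic-keys` file. The sketch below parses such a file under the assumption that each line holds a topic id, a weight and a space-separated word list, separated by tabs. That layout is an assumption to check against the actual file, the sample data is made up, and `parse_topic_keys` is a hypothetical helper, not a Gensim function.

```python
def parse_topic_keys(text):
    """Parse tab-separated lines of the assumed form: id, weight, words."""
    topics = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        topic_id, weight, words = line.split("\t", 2)
        topics[int(topic_id)] = (float(weight), words.split())
    return topics


# Made-up sample in the assumed layout, not real Mallet output.
sample = "0\t2.5\tcat dog bird\n1\t2.5\tcar road wheel\n"
topics = parse_topic_keys(sample)
print(topics[0])
```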

Any ideas as to what else could be wrong?

@T4rQu1N
Author

T4rQu1N commented Jul 8, 2020

Ok, scrap that. For whatever reason, it now works. I restarted my computer and then, boom, it spat out the LDA. I can only assume that one of the solutions above worked, or perhaps that the environment variable was not properly set until Windows had been restarted. Fingers crossed it stays working.

Thanks for your speedy responses, and sticking in there with me today. It means a lot!

@piskvorky
Owner

No problem :)
