Add corpora importer and first corpora generators #47
Once the importer is in place, we can discuss in #30 which generators to add.
	}

	// Fix formatting
	value := strings.Replace(fmt.Sprintf("%#v\n", jsonData[d.Key]), "interface {}", "string", 1)
If anyone has a nice way of directly getting a string slice from the JSON, I would be happy to change this :)
I'll look into it!
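One possible way to get a string slice directly, sketched here as an illustration rather than as what the PR ended up doing: decode the top level into `map[string]json.RawMessage` and then unmarshal only the wanted key into a `[]string`. The function name `stringsAtKey` and the sample input are made up for this example. Since `%#v` on a `[]string` already prints valid Go source, the `strings.Replace` step would no longer be needed.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stringsAtKey decodes only the value stored under key and requires it to be
// a JSON array of strings. The function name is hypothetical.
func stringsAtKey(data []byte, key string) ([]string, error) {
	// Keep every top-level value undecoded so the types of other keys do not matter.
	var fields map[string]json.RawMessage
	if err := json.Unmarshal(data, &fields); err != nil {
		return nil, fmt.Errorf("decode object: %w", err)
	}
	raw, ok := fields[key]
	if !ok {
		return nil, fmt.Errorf("key %q not found", key)
	}
	var values []string
	if err := json.Unmarshal(raw, &values); err != nil {
		return nil, fmt.Errorf("decode %q as []string: %w", key, err)
	}
	return values, nil
}

func main() {
	// Sample input invented for illustration.
	input := []byte(`{"description": "Cat breeds", "cats": ["Abyssinian", "Bengal"]}`)
	cats, err := stringsAtKey(input, "cats")
	if err != nil {
		panic(err)
	}
	// %#v on a []string already yields valid Go source for the slice.
	fmt.Printf("%#v\n", cats) // prints []string{"Abyssinian", "Bengal"}
}
```

A wrong element type (for example a number inside the array) surfaces as a decode error instead of ending up in the generated file.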
We're off to a great start! Thank you for taking care of this 🎉 I'll leave minor comments in the code directly, but I do have a more general question about the approach. As far as I can understand, we're downloading the entire repo as a zip, extracting it to a tmp location, and then importing the JSON files one by one. If we relied on the URL of each file we want to import, we could fetch the JSON and write it to a file. It would most likely be slower right now, but I think it would have interesting consequences:
For the speed part, I think we could make this version much faster by spawning a goroutine per file to import. But I admit I'd be doing it mostly because it's fun :) We could rely on the GitHub API and store some metadata in the comments of the imported file so that, in the long run, that information could act as a cache (as in, fetch only the files that changed). I wouldn't do that right now. I realise I'm asking you for some sort of a rewrite (though I don't think it's that much work), so if you want I can take it from where it is and move it in that direction. This is, as it is, a very valuable contribution already!
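For reference, the goroutine-per-file idea could look roughly like the sketch below. It is not code from this PR: `fetchFunc`, `httpFetch`, `result`, and `fetchAll` are hypothetical names, and the fetcher is passed in as a parameter so the fan-out can be tried without touching the network.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

// fetchFunc downloads one URL. It is injected so the fan-out can run offline.
type fetchFunc func(url string) ([]byte, error)

// httpFetch is one possible fetchFunc based on net/http.
func httpFetch(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("GET %s: %s", url, resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// result holds the outcome of one download.
type result struct {
	body []byte
	err  error
}

// fetchAll starts one goroutine per URL and waits for all of them.
// Results are returned in the same order as urls.
func fetchAll(urls []string, fetch fetchFunc) []result {
	results := make([]result, len(urls))
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			body, err := fetch(u)
			// Each goroutine writes only its own slot, so no mutex is needed.
			results[i] = result{body: body, err: err}
		}(i, u)
	}
	wg.Wait()
	return results
}

func main() {
	// Stub fetcher so the example runs offline; pass httpFetch for real downloads.
	stub := func(url string) ([]byte, error) { return []byte("data from " + url), nil }
	for i, r := range fetchAll([]string{"a.json", "b.json"}, stub) {
		fmt.Println(i, string(r.body), r.err)
	}
}
```

With only a handful of files an unbounded fan-out like this is fine; a longer list would probably want a limit on concurrent requests.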
cmd/importcorpora/main.go
Outdated
// Content of a Go file
const fileTemplate = `package data

var %s = %s`
It's probably a good idea to add a comment here because the variable is exported and linters would complain. Not a big deal, but something like `// %s is an array of %s` should suffice.
Yes, I couldn't come up with a good use of the comment, but we should add it to make the linters happy 😊
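A minimal sketch of what the template could look like with the suggested doc comment. The helper `renderFile`, its parameters, and the trailing newline in the template are assumptions made for this example; only the `package data` / `var %s = %s` shape and the comment wording come from the thread.

```go
package main

import "fmt"

// fileTemplate extends the template from the diff with a doc comment line.
// The trailing newline is an addition made for this sketch.
const fileTemplate = `package data

// %s is an array of %s
var %s = %s
`

// renderFile fills the template. The function name and the elemDesc
// parameter are hypothetical.
func renderFile(name, elemDesc, value string) string {
	// name appears twice: once in the doc comment, once in the declaration.
	return fmt.Sprintf(fileTemplate, name, elemDesc, name, value)
}

func main() {
	fmt.Print(renderFile("Cat", "cat breeds", `[]string{"Abyssinian", "Bengal"}`))
}
```

Starting the comment with the variable name matches the form that Go linters usually expect for exported identifiers.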
cmd/importcorpora/main.go
Outdated
// Usage: go run cmd/importcorpora/main.go
//
// Updates the data files specified at the bottom of this file with content from dariusk/corpora.
package main
I debated the naming with myself a lot here :) But I like how explicit `importcorpora` is 👍
I think we can also add a task to the Makefile. Then it will be easier to use.
I noted down the Makefile comment somewhere and then forgot to add it. Completely agree! 👍
Wow, I totally didn't see that approach :D
@jorinvo yeah, I agree using the GitHub API isn't necessary (simpler without!). Sure, of course it's OK :)
Will have a look tonight!
pkg/fakedata/generator.go
Outdated
		return source[rand.Intn(len(source))]
	}
}
While working on something else I remembered I didn't comment here.
I think you can use `withList`; it has the exact same behavior (I'm removing `withEnum` for the very same reason).
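For readers without the repository at hand, a helper like `withList` presumably looks something like the sketch below. Only the name and the `source[rand.Intn(len(source))]` line come from this thread; the signature and the closure shape are assumptions.

```go
package main

import (
	"fmt"
	"math/rand"
)

// withList returns a generator that picks a random element of source on
// every call. The signature is an assumption; the real helper in
// pkg/fakedata may differ.
func withList(source []string) func() string {
	return func() string {
		// rand.Intn panics for n <= 0, so source must not be empty.
		return source[rand.Intn(len(source))]
	}
}

func main() {
	// Sample data invented for illustration.
	cat := withList([]string{"Abyssinian", "Bengal", "Siamese"})
	fmt.Println(cat())
}
```

A generator backed by an imported corpus would then be a single call with the generated slice as its argument.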
instead of the whole repository.
I applied the discussed changes. The script is way simpler now without the zip mess. I was thinking about parallelizing the download, but I can't justify the extra complexity in the script. For a one-off task this seems fast enough:
Add corpora importer and first corpora generators
- Using json.RawMessage makes the code a bit more robust
- Getting rid of baseURL makes the task independent from corpora. Now the only assumption is that the URL returns a JSON with an array of strings at a given key
- Namespacing cat under animal so that we follow the existing convention
- Renamed emojis to emoji.go as they're both valid plurals but the latter is shorter

At this point I'm not sure the naming `importcorpora` and `make corpora` is right anymore but, as we're going to work on this soon, it's safe to delay the decision
I created an import script to get or update information from https://github.com/dariusk/corpora.
We only need to specify at the bottom of the script which corpora we'd like to import, and every time the script is run they are updated from the repo.
Generators still have to be defined manually, but I think that's a good thing, so we have control over the help description and so on. Also, while the data might change, the generators only have to be set up once.
I already added 4 new generators as examples. They can be adjusted afterwards, and more can be added easily.
For now I put the script in the `cmd` directory. Of course, things can be renamed and moved around.
Please let me know what you think :)