Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add corpora importer and first corpora generators #47

Merged
merged 5 commits into from
Jun 5, 2017
Merged

Add corpora importer and first corpora generators #47

merged 5 commits into from
Jun 5, 2017

Conversation

jorinvo
Copy link
Contributor

@jorinvo jorinvo commented Jun 3, 2017

I created an import script to get or update information from https://github.com/dariusk/corpora.
We only need to specify at the bottom which corpora we like to import and every time the script is run they are updated from the repo.

Generators still have to be defined manually, but I think that's a good thing so we have control over help description and so on. Also, while the data might change, the generators only have have to be setup once.

I already added 4 new generators as examples. They can be adjusted after. And more ones can be added easily.

For now I put the script in the cmd directory.

Of course, things can be renamed and moved around.

Please let me know, what you think :)

@jorinvo
Copy link
Contributor Author

jorinvo commented Jun 3, 2017

After having the importer, we can discuss in #30 about which generators to add.
Or just add some without discussions is also fine.

}

// Fix formatting
value := strings.Replace(fmt.Sprintf("%#v\n", jsonData[d.Key]), "interface {}", "string", 1)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If anyone has a nice way of directly getting s string slice from the JSON, would be happy to change this :)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ill look into it!

@lucapette
Copy link
Owner

We're off a great start! Thank you for taking care of this 🎉

I'll leave minor comments in the code directly, but I do have a more general question about the approach.

As far as I can understand, we're downloading the entire repo as a zip, extracting it to a tmp location and then import the json files one by own.

If we'd be relying on the url of the file we want to import, we could fetch the json and write to a file. It would be most likely slower right now but I think it would have interesting consequences:

  • I believe it would be a lot less code to maintain (which is a big plus)
  • the growing size of the entire corpora repository wouldn't matter

For the speed part, I think we could make this version much faster spawning a go routine per file to import. But I admit I'd be doing it mostly because it's fun :)

We could rely on github api and store some metadata in the comments of the imported file so that, in the long run, that information could act as a cache (like in fetch only the files that changed). I wouldn't do that right now.

I realise I'm asking you for some sort of a rewrite (but I don't think it's that much work) so if you want I can take it from where it is and move it in that direction. This is, as it is, a very valuable contribution already!

// Content of a Go file
const fileTemplate = `package data

var %s = %s`
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably a good idea to add a comment here because the variable is exported and linters would complain... not a big deal but something like // %s is an array of %s should suffice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I couldn't come up with a good use of the comment, but we should add it to make the linters happy 😊

// Usage: go run cmd/importcorpora/main.go
//
// Updates the at the bottom of this file specified data files with content from dariusk/corpora.
package main
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I debated naming with myself here a lot :) But I like how explicit importcorpora is 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can also at a task to the Makefile. Then it will be easier to use it.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noted down the makefile comment somewhere and then forgot to add. Completely agree! 👍

@jorinvo
Copy link
Contributor Author

jorinvo commented Jun 4, 2017

Wow, I totally didn't see that approach :D
It's definitely way simpler!
The complexity of doing the zip thingy was also bugging me!
For now, can make it simple http calls. No need to use the Github API I'd say.
I can do the changes if you that's Ok for you.

@lucapette
Copy link
Owner

@jorinvo yeah I agree using github api isn't necessary (simpler without!) Sure, of course it's ok :)

@jorinvo
Copy link
Contributor Author

jorinvo commented Jun 4, 2017

Will have a look tonight!

return source[rand.Intn(len(source))]
}
}

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While working on something else I remembered I didn't comment here.

I think you can use withList, it has the exact same behavior (I'm removing withEnum for the very same reason)

@jorinvo
Copy link
Contributor Author

jorinvo commented Jun 4, 2017

I applied the discussed changes.

Script is way simpler now without the zip mess.

Was thinking about parallelizing the download but I can't justify having the extra complexity in the script. For a one off task this seems fast enough:

$ time make corpora
go run cmd/importcorpora/main.go

real    0m1.004s
user    0m0.706s
sys     0m0.177s

@lucapette lucapette merged commit eace86f into lucapette:master Jun 5, 2017
lucapette added a commit that referenced this pull request Jun 5, 2017
Add corpora importer and first corpora generators
lucapette added a commit that referenced this pull request Jun 5, 2017
- Using json.RawMessage makes the code a bit more robust
- Getting rid of baseURL makes the task independent from corpora. Now
the only assumption is that the URL returns a JSON with an array of
strings at a given key
- Namespacing cat under animal so that we follow the existing convention
- Renamed emojis to emoji.go as they're both valid plurals but the
latter is shorter

At this point I'm not sure the naming `importcorpora` and `make corpora`
is right anymore but, as we're going to work on this soon, it's safe to
delay the decision
@lucapette
Copy link
Owner

@jorinvo just merged it. I applied minor changes here. the only thing worth noticing is the usage of the type json.RawMessage (TIL).

I think it's good we discuss a bit what we want to import into #30

Thank you very much for taking care of this. It's awesome!

@jorinvo jorinvo deleted the corpora branch June 5, 2017 17:56
@lucapette lucapette mentioned this pull request Jun 25, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants