Add German protestant and catholic churches #579

Open
baltpeter opened this issue Apr 27, 2020 · 5 comments
Labels
discussion hacktoberfest help wanted template Issue related to the letter templates

Comments

@baltpeter
Member

baltpeter commented Apr 27, 2020

After Lars' recent blog post and the templates helpfully provided by @thedrstrangelove (#565, #571), we can finally start adding support for the churches in Germany (well, the catholic and protestant churches, anyway).

Our SVA finder has supported their respective supervisory authorities for a while now, so that pretty much just leaves us with adding the churches to the database. That is a lot harder than it seems at first glance, though. A couple of observations and thoughts:

  1. The churches I have looked at seem to act as the controller themselves, not through their diocese/national church (Landeskirche)/whatever. That is an issue, as there are a lot of churches in Germany.
  2. From what I have seen, the privacy policies within one diocese/national church/… seem to be fairly standardised. Within Evangelisch-lutherische Landeskirche in Braunschweig, for example, the privacy policy is identical for every church, simply linking to their legal notice (Impressum) for the contact details.
  3. I was pretty surprised to find out that almost all dioceses in northern Germany seem to have hired datenschutz nord GmbH as their external DPO. And not just that company but a single person within that company acts as the DPO for all those churches.
    That leaves us in quite a weird position: According to our current rules (and my previous points), all those churches should get their own entry in our company database. It's just that they would all have the exact same contact details. The alternative isn't better either: A single entry with literally thousands or even tens of thousands of 'run' entries. Our system definitely can't handle that.
  4. So, how would we go about supporting churches? Even if we decided to add them all (or, hopefully, found a way to automate that), we would run into similar issues as with schools (where we have ignored those problems for now). And I don't see much of a point in just adding the churches in Braunschweig. If, however, we were to start adding churches on a large scale, there are two main problems I see:
    • Typesense, our search engine, will explode
    • the search results will become more or less unusable, as for just about every possible query, there are going to be churches blocking the actual results

Maybe our switch to Xapiand might help somewhat with this last issue. @zner0L is working on that, ETA very much unclear.

@kishorenc

Maybe our switch to Xapiand might help somewhat with this last issue.

👋 I work on Typesense -- if you are running into any issues, I'm all ears, especially on how a switch to Xapiand might help overcome some of these issues. Maybe I can learn something and improve Typesense.

@baltpeter
Member Author

@kishorenc Thanks for stopping by! I definitely need to put my statement into context here.

We were really glad about Typesense as it has allowed us to deploy a search engine on our own infrastructure, without analytics (unlike Algolia) and without the ridiculous system requirements and complexity of Elasticsearch et al. So seriously, huge thanks for that! Typesense has served us very well so far.

The main issue we've been having is that I believe we are running a very suboptimal use case for Typesense (and probably similar products). Our primary source of truth for our data is this repository of JSON files and they don't have any sense of a version or "last updated" date. So, whenever a new entry is added or an existing one is updated, our deploy script literally runs through the whole list of entries and individually deletes and then re-adds them. As you can imagine, the overhead of that is quite ridiculous and due to the fairly high number of entries (1300+ and growing), our deploys are taking almost half an hour by now (example).
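For illustration only, here is a minimal sketch of one way around the "delete and re-add everything" loop when the records carry no version field: compare the current entries against a snapshot of what was deployed last time and only touch what differs. This is not our actual deploy script. The `Entry` shape, the `slug` key, the function names and the idea of keeping a previous snapshot around (e.g. from the previously deployed commit) are all assumptions.

```typescript
// Hypothetical record shape; the real JSON entries have more fields.
type Entry = { slug: string; [key: string]: unknown };

// Serialize with sorted keys so that a mere reordering of keys in a JSON
// file does not count as a change.
function stableStringify(value: unknown): string {
    if (Array.isArray(value)) {
        return '[' + value.map(stableStringify).join(',') + ']';
    }
    if (value !== null && typeof value === 'object') {
        const obj = value as Record<string, unknown>;
        const keys = Object.keys(obj).sort();
        return '{' + keys.map((k) => JSON.stringify(k) + ':' + stableStringify(obj[k])).join(',') + '}';
    }
    return JSON.stringify(value);
}

type SyncPlan = { upsert: Entry[]; remove: string[] };

// Compare the previously deployed snapshot with the current entries and
// return only the work that actually needs to be done.
function planSync(previous: Entry[], current: Entry[]): SyncPlan {
    const before: Record<string, string> = Object.create(null);
    for (const entry of previous) before[entry.slug] = stableStringify(entry);

    const seen: Record<string, boolean> = Object.create(null);
    const upsert: Entry[] = [];
    for (const entry of current) {
        seen[entry.slug] = true;
        // New entries have no previous serialization, changed ones differ.
        if (before[entry.slug] !== stableStringify(entry)) upsert.push(entry);
    }

    // Anything that was deployed before but no longer exists gets removed.
    const remove = Object.keys(before).filter((slug) => !seen[slug]);
    return { upsert, remove };
}
```

With something like this, a deploy that changes a handful of records would only send those records to the search engine, regardless of how many entries the database holds in total.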

The problems I have mentioned in this issue would mainly be linked to that. Adding all those churches would massively increase the number of entries, thus significantly magnifying the problem.
In addition, this brings forward a problem of ranking. I am worried that these new companies would displace the existing, manually added, ones which are more important.

To be clear, we are not sure if we actually want to switch to Xapiand. @zner0L has been doing the research on that and while he seemed quite sure to me for a while, this has somewhat changed recently. We also know that you have been doing quite a bit of work on Typesense in the meantime (we've had this discussion internally for almost a year now) that I need to catch up on. For example, I was very happy to read about the bulk import endpoint you added.
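As a rough sketch of what using the bulk import endpoint could look like on our side: Typesense's import endpoint accepts newline-delimited JSON (one document per line), so the deploy script would mostly need to serialize the entries accordingly. The helper names and the `Entry` shape below are assumptions, and the exact path and the `action=upsert` parameter should be checked against the Typesense version in use.

```typescript
// Hypothetical record shape; the real JSON entries have more fields.
type Entry = { slug: string; [key: string]: unknown };

// Turn a list of entries into a newline-delimited JSON (JSONL) payload.
function toJsonl(entries: Entry[]): string {
    return entries.map((entry) => JSON.stringify(entry)).join('\n');
}

// Build the request path for the import endpoint of a collection.
// `action=upsert` is assumed to be supported by the deployed version.
function importPath(collection: string): string {
    return '/collections/' + encodeURIComponent(collection) + '/documents/import?action=upsert';
}
```

The resulting string would be sent as the body of a single POST request instead of one request per entry.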

To keep this issue on-topic, I have created a separate issue for this (#584) with a lot more details on the problems we have experienced as well as the features we need. I would love to hear your thoughts there!

I had also considered opening issues for some of our problems with you but in the end, I decided that these were mostly problems on our end due to our unusual use case and that it would be unreasonable to expect you to implement solutions for them.
So let me stress that we definitely don't want to 'blame' Typesense for any of these issues. And also, once again: Thank you!

@zner0L
Member

zner0L commented Aug 10, 2020

Ok, I feel like, with the new version of Typesense ready to go, I can see a way to include churches in our database. My preferred way would be to support search on multiple indices, but from what I gather, Typesense still doesn't support that. So IMO the best way would be to create a type field in the schema (which should be faceted and different for companies or churches, kinda like a category but in a many-to-one relationship) and then use group_by and group_limit to limit the amount of churches flooding the search results. And we need some kind of clever sorting to discourage churches from always popping up on top (e.g. via the sort-index).

And maybe we should also give power to the user and let them choose what type of entity they want if the search results are too bloated. Some kind of filter option to facet the search by.
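To make the idea a bit more concrete, here is a hedged sketch of what the schema fields and search parameters might look like. Only the `type` field, `group_by`, `group_limit` and the sort-index come from the discussion above. The other field names, the field types, the type values and the helper function are assumptions. Two things worth keeping in mind: as far as I know, Typesense only allows grouping on faceted fields, and `group_limit` applies to every group, so grouping by `type` would cap the companies shown per query as well, not just the churches.

```typescript
// Hypothetical schema fragment. `type` is faceted so that it can be used for
// grouping, faceting and filtering.
const schemaFields = [
    { name: 'name', type: 'string' },
    { name: 'type', type: 'string', facet: true }, // e.g. 'company' or 'church'
    { name: 'sort-index', type: 'int32' },
];

type SearchParams = { [key: string]: string | number };

// Build the parameters for a search request. If the user picked an entity
// type, restrict the results to it via `filter_by`.
function buildSearchParams(query: string, onlyType?: string): SearchParams {
    const params: SearchParams = {
        q: query,
        query_by: 'name',
        facet_by: 'type', // lets the UI show how many hits each type has
        group_by: 'type',
        group_limit: 3, // at most 3 hits per type in the result list
        sort_by: 'sort-index:desc',
    };
    if (onlyType !== undefined) {
        params['filter_by'] = 'type:=' + onlyType;
    }
    return params;
}
```

A call like `buildSearchParams('braunschweig')` would give the grouped default view, while `buildSearchParams('braunschweig', 'church')` corresponds to the user explicitly asking for churches only.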

@baltpeter
Member Author

So IMO the best way would be to create a type field in the schema (which should be faceted and different for companies or churches, kinda like a category but in a many-to-one relationship) and then use group_by and group_limit to limit the amount of churches flooding the search results.

Those are good ideas. I can also see this working with the new Typesense version now.

We are still left with the problem of collecting those details in the first place, though…

@baltpeter
Member Author

Just to update this issue: I started the actual work on this a few weeks ago. I have opened datenanfragen/website#451 to prepare the site for the massive influx of unverified records and #773 as the first example of that. As we will never be able to add all churches manually, I have looked into ways of importing or scraping the necessary data. We will use the new datenanfragen/data-imports repo for that (it already contains the code for the first import).


3 participants