docs: add mkdocs Documentation, Examples and Quick Start
1 parent 6ebedf1, commit c7d3d47
Showing 15 changed files with 384 additions and 107 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
lbsntransform package is maintained by Alexander Dunkel and [vgiscience.org](https://vgiscience.org) | ||
|
||
Found any errors or bugs? Please email me alexander.dunkel[ät]tu-dresden.de | ||
|
||
lbsntransform docs built with [mkdocs](https://github.com/mkdocs/mkdocs) and [ReadTheDocs theme](https://github.com/mkdocs/mkdocs/tree/master/mkdocs/themes/readthedocs). | ||
|
||
lbsntransform is developed under open source GNU GPLv3 License. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
<header> | ||
<h2> | ||
<a class="homelink" rel="home" title="lbsntransform API Reference" href="https://lbsn.vgiscience.org/lbsntransform/docs/api/lbsntransform_.html"> | ||
lbsntransform API Reference | ||
</a> | ||
</h2> | ||
<h5 style='line-height: 5px;'> | ||
<a class="homelink" rel="home" title="lbsntransform Documentation Home" href="https://lbsn.vgiscience.org/lbsntransform/docs/"> | ||
lbsntransform Documentation (external) | ||
</a> | ||
</h5> | ||
<h5 style='line-height: 0px;'> | ||
<a href="https://github.com/Sieboldianus/lbsntransform">Edit on GitHub</a> | ||
</h5> | ||
|
||
</header> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
"""Script to allow argdown parse_args to markdown conversion""" | ||
|
||
import inspect | ||
from lbsntransform import BaseConfig | ||
from lbsntransform import __version__ | ||
|
||
|
||
def extract_argscode(): | ||
    """Extracts command line args code to separate file | ||
    Preparation step for processing with argdown | ||
    """ | ||
    # extract source code of parse_args | ||
    parse_args_source = inspect.getsource(BaseConfig.parse_args) | ||
    # remove first line | ||
    parse_args_source = parse_args_source[parse_args_source.index('\n')+1:] | ||
    # unindent all other lines | ||
    parse_args_source = parse_args_source.lstrip().replace('\n ', '\n') | ||
    # replace version string | ||
    parse_args_source = parse_args_source.replace( | ||
        'lbsntransform {__version__}', f'lbsntransform {__version__}') | ||
    # replace package name | ||
    parse_args_source = parse_args_source.replace( | ||
        'usage: argdown', 'usage: lbsntransform') | ||
    # fix argparse name | ||
    parse_args_source = parse_args_source.replace( | ||
        'ArgumentParser()', 'ArgumentParser(prog="lbsntransform")') | ||
    # open the output file; the context manager closes it even on errors | ||
    with open("parse_args.py", "w") as source_file: | ||
        # write argdown and argparse imports first | ||
        source_file.write('import argparse\n') | ||
        source_file.write('import argdown\n') | ||
        source_file.write('from pathlib import Path\n') | ||
        # write prepared source code | ||
        source_file.write(parse_args_source) | ||
|
||
|
||
# run script | ||
extract_argscode() |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,118 @@ | ||
lbsntransform has a Command Line Interface (CLI) that converts many input formats to the common lbsnstructure, including its privacy-aware HLL implementation. | ||
|
||
!!! Note | ||
    Substitute the bash line-continuation character `\` in the examples below with `^` if you're on the Windows command line. | ||
|
||
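The substitution described in the note can be scripted. The following is a minimal sketch, not part of lbsntransform: the helper name `to_windows_cmd` is hypothetical, and it only swaps trailing continuation characters, leaving quoting untouched.

```python
def to_windows_cmd(command: str) -> str:
    """Swap trailing bash continuation backslashes for cmd.exe carets."""
    converted = []
    for line in command.splitlines():
        stripped = line.rstrip()
        # Only a backslash at the very end of a line is a continuation.
        if stripped.endswith("\\"):
            stripped = stripped[:-1] + "^"
        converted.append(stripped)
    return "\n".join(converted)


bash_command = "lbsntransform --origin 3 \\\n    --file_input \\\n    --csv_output"
print(to_windows_cmd(bash_command))
```

Quoting rules differ between bash and the Windows command line (for example `$'\t'`), so arguments may still need manual adjustment.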
# Basic examples | ||
|
||
Key to mappings in lbsntransform is the origin id, which refers to the different mappings specified in modules `input/mappings/*.py`. For example, | ||
id `3` refers to Twitter (`field_mapping_twitter.py`). | ||
|
||
If you've retrieved Twitter JSON files from the official API, place those files somewhere in a subfolder `.01_Input/` and run the following: | ||
|
||
```bash | ||
lbsntransform --origin 3 \ | ||
--file_input \ | ||
--file_type 'json' \ | ||
--transferlimit 1000 \ | ||
--csv_output | ||
|
||
``` | ||
|
||
lbsntransform will then create a subfolder `.02_Output/` and store converted data as CSV (specified with `--csv_output` flag). | ||
|
||
* `--transferlimit 1000` means: stop the transfer after 1000 lbsn records | ||
* `--file_input`: read from local files (and not from database). Default local input is subfolder `.01_Input/`. This path can be modified with the flag `--input_path_url my-input-path` | ||
* `--file_type 'json'` refers to the file ending to look for in `.01_Input/` folder | ||
|
||
If your files are spread across subdirectories in (e.g.) `.01_Input/`, add `--recursive_load` flag. | ||
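To illustrate how a file ending and a recursive flag typically interact, here is a minimal sketch using only the standard library. It is not lbsntransform's actual file discovery code; the function `find_input_files` and its parameters are hypothetical names chosen for this example.

```python
import tempfile
from pathlib import Path


def find_input_files(input_path, file_type, recursive=False):
    """Return sorted files below input_path whose ending matches file_type."""
    pattern = f"*.{file_type}"
    root = Path(input_path)
    # rglob descends into subdirectories, glob only looks at the top level.
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(p for p in matches if p.is_file())


# Build a small throwaway folder tree to demonstrate the difference.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.json").write_text("{}")
    (root / "sub").mkdir()
    (root / "sub" / "b.json").write_text("{}")
    flat = find_input_files(root, "json")
    nested = find_input_files(root, "json", recursive=True)
    print(len(flat), len(nested))
```

Without the recursive option only the top-level file is found; with it, the file in the subfolder is included as well.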
|
||
# Flickr YFCC100m | ||
|
||
A specific mapping is provided for the [YFCC100m Dataset](https://multimediacommons.wordpress.com/yfcc100m-core-dataset/). | ||
|
||
The YFCC100m Dataset consists of multiple files, with the core dataset of 100 Million Flickr photo metadata records (yfcc100m_dataset.csv) and several "expansion sets". | ||
|
||
The only expansion-set that is available for mapping is places-expansion (yfcc100m_places.csv). | ||
|
||
Both photo metadata and places metadata can be processed in parallel by using `--zip_records`. | ||
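Conceptually, zipping records means reading two files in lockstep so that line *n* of one file is paired with line *n* of the other. The sketch below shows this idea with Python's built-in `zip` on tab-separated data. It is an illustration under assumptions, not lbsntransform's implementation: the function name `zip_records` and the sample rows are made up for this example.

```python
import csv
import io

# Made-up sample rows standing in for the two tab-separated files.
photos_tsv = "1\tphoto-a\n2\tphoto-b\n"
places_tsv = "1\tDresden\n2\tBerlin\n"


def zip_records(*sources, delimiter="\t"):
    """Yield one tuple per line position, holding the row from each source."""
    readers = [csv.reader(src, delimiter=delimiter) for src in sources]
    # zip stops as soon as the shortest source is exhausted.
    yield from zip(*readers)


pairs = list(zip_records(io.StringIO(photos_tsv), io.StringIO(places_tsv)))
for photo_row, place_row in pairs:
    print(photo_row, place_row)
```

This pairing only makes sense if both files list their records in the same order.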
|
||
Before executing the following, make sure you've started the [lbsn-raw database docker](https://gitlab.vgiscience.de/lbsn/databases/rawdb). This includes the postgres implementation of the common lbsn structure format. You can run the docker db container on any host, but we suggest testing your setup locally - in this case, `127.0.0.1` refers to _localhost_ and port `15432` (the default for lbsn-raw). | ||
|
||
|
||
```bash | ||
lbsntransform --origin 21 \ | ||
--file_input \ | ||
--input_path_url "https://myurltoflickrymcc.dataset.org/yfcc100m_dataset.csv;https://myurltoflickrymcc.dataset.org/flickr_yfcc100m/yfcc100m_places.csv" \ | ||
--dbpassword_output "your-db-password" \ | ||
--dbuser_output "postgres" \ | ||
--dbserveraddress_output "127.0.0.1:15432" \ | ||
--dbname_output "rawdb" \ | ||
--csv_delimiter $'\t' \ | ||
--file_type "csv" \ | ||
--zip_records \ | ||
--skip_until_record 7373485 \ | ||
--transferlimit 10000 | ||
``` | ||
|
||
In the example above, | ||
|
||
```bash | ||
--skip_until_record 7373485 | ||
``` | ||
... is used to skip input records up to record `7373485`. This is an example of how to resume processing (e.g. if your previous transform job was aborted for any reason). | ||
|
||
|
||
Also, the transfer is limited to the first 10000 records: | ||
|
||
```bash | ||
--transferlimit 10000 | ||
``` | ||
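The combination of skipping and limiting can be pictured as slicing a record stream. The following sketch uses `itertools.islice` and assumes that the first N records are skipped and that the limit counts records after the skip; whether lbsntransform counts exactly this way is an assumption, and `select_records` is a hypothetical helper, not part of the package.

```python
from itertools import islice


def select_records(records, skip_until_record=0, transferlimit=None):
    """Skip the first skip_until_record items, then yield at most transferlimit."""
    stop = None if transferlimit is None else skip_until_record + transferlimit
    return islice(records, skip_until_record, stop)


# Records numbered 1..20; skip the first 5, then take 3.
selected = list(select_records(range(1, 21), skip_until_record=5, transferlimit=3))
print(selected)  # [6, 7, 8]
```

Because `islice` consumes the input lazily, skipped records are read but never held in memory.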
|
||
If you have stored the Flickr dataset locally, simply replace the URLs with: | ||
|
||
```bash | ||
--input_path_url "/data/flickr_yfcc100m/" | ||
``` | ||
|
||
|
||
# Privacy-aware output (HyperLogLog) | ||
|
||
We've developed a privacy-aware implementation of the lbsn-raw format, based on the probabilistic data structure HyperLogLog and the postgres implementation from [Citus](https://github.com/citusdata/postgresql-hll). | ||
|
||
Two preparation steps are necessary: | ||
|
||
* Prepare a postgres database with the HLL version of lbsnstructure. You can use the [lbsn-hll database docker](https://gitlab.vgiscience.de/lbsn/databases/hlldb) | ||
* Prepare a read-only (empty) database with the Citus HyperLogLog extension installed. You can use the [hll importer docker](https://gitlab.vgiscience.de/lbsn/tools/importer) | ||
|
||
We've designed this rather complex setup to separate concerns: | ||
- the importer db (called `hllworkerdb` in the command below) will be used by lbsntransform to calculate hll `shards` from raw data - it will not store any data, nor will it get any additional (privacy-relevant) information. Shards are calculated in-memory and returned. The importer is prepared with global hll-settings that must not change during the whole lifetime of the final output. | ||
|
||
For example, as a means of additional security, before creating shards, distinct values can be one-way-hashed. This hashing can be improved using a `salt` that is only known to **importer**. | ||
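The idea of salted one-way hashing can be sketched in a few lines. This example uses HMAC-SHA256 from the Python standard library as a stand-in; the hash function actually used by the importer is not stated here, so both the algorithm and the function name `salted_hash` are assumptions for illustration only.

```python
import hashlib
import hmac


def salted_hash(value: str, salt: bytes) -> str:
    """Return a keyed one-way hash of value; salt acts as the secret key."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder salt; in the setup described above it would stay on the importer.
secret_salt = b"known-only-to-the-importer"

digest_a = salted_hash("user-123", secret_salt)
digest_b = salted_hash("user-123", secret_salt)
digest_other_salt = salted_hash("user-123", b"a-different-salt")

print(digest_a == digest_b)           # same value and salt give the same hash
print(digest_a == digest_other_salt)  # a different salt gives a different hash
```

Equal inputs still map to equal hashes, which is what allows distinct values to be counted; without the salt, however, an outsider cannot recompute the hash of a guessed value.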
|
||
As a result, the output hll db will not receive any privacy-relevant data, because this is removed before transmission. | ||
|
||
!!! Note | ||
    Depending on the chosen `bases` and the type of input data, the output may still contain privacy-sensitive references. Have a look at the [lbsn-docs](https://lbsn.vgiscience.org) for further information. | ||
|
||
To convert YFCC100m photo metadata and places and transfer to a local hll-db, use: | ||
|
||
```bash | ||
lbsntransform --origin 21 \ | ||
--file_input \ | ||
--input_path_url "/data/flickr_yfcc100m/" \ | ||
--dbpassword_output "your-db-password" \ | ||
--dbuser_output "postgres" \ | ||
--dbserveraddress_output "127.0.0.1:25432" \ | ||
--dbname_output "hlldb" \ | ||
--dbformat_output "hll" \ | ||
--dbpassword_hllworker "your-db-password" \ | ||
--dbuser_hllworker "postgres" \ | ||
--dbserveraddress_hllworker "127.0.0.1:20432" \ | ||
--dbname_hllworker "hllworkerdb" \ | ||
--csv_delimiter $'\t' \ | ||
--file_type "csv" \ | ||
--zip_records | ||
``` | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
**lbsntransform** is a Python package that uses the [common location based social network (LBSN) data structure concept](https://pypi.org/project/lbsnstructure/) (ProtoBuf) to import, transform and export Social Media data such as Twitter and Flickr. | ||
|
||
![](inputoutput.svg) | ||
|
||
Import, convert and export Location Based Social Media (LBSM) data, such as from Twitter and Flickr, to a common data structure format (lbsnstructure). lbsntransform can also anonymize data into a privacy-aware version of lbsnstructure using HyperLogLog. | ||
|
||
Input can be: | ||
- local CSV or JSON (stacked/regular/line separated) | ||
- a web-url to CSV/JSON | ||
- Postgres DB connection | ||
|
||
Output can be: | ||
- local CSV | ||
- local file with ProtoBuf encoded records | ||
- local SQL file ready for "Import from" in Postgres LBSN db | ||
- Postgres DB connection with existing [LBSN RAW Structure](https://gitlab.vgiscience.de/lbsn/databases/rawdb) | ||
- Postgres DB connection with existing [LBSN HLL Structure](https://gitlab.vgiscience.de/lbsn/databases/hlldb), which is a privacy-aware version of lbsnstructure |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
Title: lbsntransform | ||
Note left of Input: Twitter / \\nFlickr/\\nYFCC100m ... | ||
Input->Output: from Local CSV or JSON | ||
Input->Output: stacked/ regular/ line separated | ||
Input->Output: recursive subfolders | ||
Input->Output: zip inputs | ||
Input->Output: stream from web | ||
Note right of Output: LBSN Raw DB | ||
Note right of Output: LBSN Hll DB | ||
Input->Output: from live DB | ||
Output-->Input: to live DB | ||
Note left of Input: LBSN Raw DB |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
# Windows | ||
|
||
There are many ways to install python tools: | ||
|
||
1. The recommended way to install the package is with `pip install lbsntransform` | ||
2. For Windows users, an alternative is to download the newest pre-compiled build from [releases](../../releases) and run `lbsntransform.exe` | ||
3. If you have problems with dependencies under Windows, use [Gohlke wheels](<https://www.lfd.uci.edu/~gohlke/pythonlibs/>) or create an environment with the conda package manager, install all dependencies manually and then run `pip install lbsntransform --no-deps` | ||
|
||
# Linux | ||
|
||
* `pip install lbsntransform` is the recommended way to install lbsntransform on Linux. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
|
||
- see the [lbsn docs](https://lbsn.vgiscience.org) for further info regarding the underlying data structure concept |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -17,7 +17,6 @@ dependencies: | |
- bitarray | ||
- nltk | ||
- pip: | ||
- anybadge | ||
- pylint-exit | ||
- ppygis3 | ||
- lbsnstructure>=0.5.0 | ||
|