For any conversion, a mapping must exist. A mapping is defined in
a Python file (`.py`) and describes how the input data is converted
to the [common lbsn structure](https://lbsn.vgiscience.org/), which
is available from the Python version of the Proto Buf Spec.

Mappings are loaded dynamically. You can provide a path to a folder
containing mappings with the flag `--mappings_path ./subfolder`.

If no path is provided, `lbsn raw` is assumed as input, for which
the file mapping is available in [lbsntransform/input/field_mapping_lbsn.py](/api/input/mappings/field_mapping_lbsn.html),
including the lbsn db query syntax defined in [lbsntransform/input/db_query.py](/api/input/mappings/db_query.html).

Predefined mappings exist for Flickr (CSV/JSON) and Twitter (JSON)
in the [resources folder](https://gitlab.vgiscience.de/lbsn/lbsntransform/resources).
If the git repository is cloned to a local folder, use
`--mappings_path ./resources/mappings/` to load the Flickr or Twitter mappings.

Input mappings must define some specific attributes to be recognized.

Primarily, the module-level constant `MAPPING_ID` is used to load mappings;
e.g., [field_mapping_lbsn.py](/api/input/mappings/field_mapping_lbsn.html)
defines the following module-level constant:
```py
MAPPING_ID = 0
```
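The selection mechanism can be pictured with a small sketch. This is a simplified, hypothetical illustration, not lbsntransform's actual loader code: mapping modules are indexed by their `MAPPING_ID` constant and selected via the `--origin` value (`0` for lbsn raw, `3` for Twitter, per the examples below).

```python
# Hypothetical sketch: index mapping classes by their MAPPING_ID
# constant and select one via the --origin argument.
# (Simplified; not lbsntransform's actual loader.)

class LbsnMapping:
    """Stands in for field_mapping_lbsn.py."""
    MAPPING_ID = 0

class TwitterMapping:
    """Stands in for field_mapping_twitter.py."""
    MAPPING_ID = 3

# index all discovered mapping classes by their MAPPING_ID
MAPPINGS = {m.MAPPING_ID: m for m in (LbsnMapping, TwitterMapping)}

def select_mapping(origin: int):
    """Return the mapping registered for the given --origin value."""
    return MAPPINGS[origin]
```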
**Examples:**

To load data with the default mapping, use `lbsntransform --origin 0`.

To load data from Twitter json, use
```bash
lbsntransform --origin 3 \
    --mappings_path ./resources/mappings/ \
    --file_input \
    --file_type "json"
```

To load data from Flickr YFCC100M, use

```bash
lbsntransform --origin 21 \
    --mappings_path ./resources/mappings/ \
    --file_input \
    --file_type "csv" \
    --csv_delimiter $'\t'
```
# Custom Input Mappings

Start with any of the predefined mappings: either [field_mapping_lbsn.py](/api/input/mappings/field_mapping_lbsn.html),
[field_mapping_twitter.py](https://gitlab.vgiscience.de/lbsn/lbsntransform/resources/field_mapping_twitter.py) (JSON), or
[field_mapping_yfcc100m.py](https://gitlab.vgiscience.de/lbsn/lbsntransform/resources/field_mapping_yfcc100m.py) (CSV).

A minimal template looks as follows:

```py
# -*- coding: utf-8 -*-

"""
Module for mapping example Posts dataset to common LBSN Structure.
"""

from typing import Optional
from lbsnstructure import lbsnstructure_pb2 as lbsn
from lbsntransform.tools.helper_functions import HelperFunctions as HF

MAPPING_ID = 99


class importer():
    """ Provides mapping function from Example Post Source to
        protobuf lbsnstructure
    """
    ORIGIN_NAME = "Example Post Source"
    ORIGIN_ID = 2

    def __init__(self,
                 disable_reaction_post_referencing=False,
                 geocodes=False,
                 map_full_relations=False,
                 map_reactions=True,
                 ignore_non_geotagged=False,
                 ignore_sources_set=None,
                 min_geoaccuracy=None):
        origin = lbsn.Origin()
        origin.origin_id = 99
        self.origin = origin
        self.null_island = 0
        self.skipped_count = 0
        self.skipped_low_geoaccuracy = 0

    def parse_csv_record(self, record, record_type: Optional[str] = None):
        """Entry point for processing CSV data:
        Attributes:
            record    A single row from CSV, stored as list type.
        """
        # extract/convert all lbsn records
        lbsn_records = self.extract_post(record)
        return lbsn_records

    def extract_post(self, record):
        # derive a unique id for the post, e.g. from the first CSV column
        post_guid = record[0]
        post_record = HF.new_lbsn_record_with_id(
            lbsn.Post(), post_guid, self.origin)
        return [post_record]
```
!!! note
    For one lbsn origin, many mappings may exist. For example,
    for the above example origin with id "99", you may have
    mappings with ids 991, 992, 993 etc. This can be used to
    create separate mappings for json, csv etc.
# Input type: file, url, or database?

lbsntransform can read data from several common types of data sources.

The following cli arguments are available:

* file input `--file_input`
    * json files `--file_type json`
        * stacked `--is_stacked_json`
          The typical form for json is `[{json1},{json2}]`. If `--is_stacked_json` is set,
          jsons in the form of `{json1}{json2}` (no comma) can be imported.
        * line separated `--is_line_separated_json`
          If this flag is set, lbsntransform expects one json per line (separated by a line break).
    * csv files `--file_type csv`
        * Set the CSV delimiter with `--csv_delimiter`; common values are:
            * Comma: `','` (default)
            * Semicolon: `';'`
            * Tab: `$'\t'`
    * Additional flags for file input:
        * `--input_path_url` the folder, path or url to read from, e.g.:
            * `--input_path_url 01_Input` Read from the relative subfolder "01_Input" (default).
            * `--input_path_url ~/data/` Read from the user's home folder "data".
            * `--input_path_url /c/tmp/data` Read from a WSL-mounted subdir from Windows.
            * `--input_path_url "/d/03_EvaVGI/01_Daten/02_FlickrCommons/Flickr_Commons_100Million_YFCC100M_dataset/"` Read from an absolute path.
        * `--recursive_load` to recursively process local sub directories (default depth: 2).
        * `--skip_until_file x` to skip all files until a file named `x` is found.
        * `--zip_records` allows zipping records from multiple sources, separated by a semicolon (`;`), e.g.:
            * `--input_path_url "https://mypage.org/dataset_col1.csv;https://mypage.org/dataset_col2.csv"`
              Records from both csv files will be processed in parallel, by zipping them.
* database input (Postgres)
    * `--dbuser_input "postgres"` the name of the db user.
    * `--dbserveraddress_input "127.0.0.1:5432"` the address and (optionally) the port to use. The default postgres port is `5432`.
    * `--dbname_input "rawdb"` the name of the database.
    * `--dbpassword_input "mypw"` the password to use when connecting.
    * `--dbformat_input "lbsn"` the format of the database. Currently, only "lbsn" and "json" are supported.
    * Additional flags for db input:
        * `--records_tofetch 1000` If retrieving from a db, limit the number of records to fetch per batch. Defaults to 10000.
        * `--startwith_db_rownumber xyz` To resume processing from an arbitrary ID.
          If the input db type is "lbsn", provide the primary key to start from (e.g. post_guid, place_guid etc.).
          This flag will only work when processing a single lbsnObject (e.g. lbsnPost).
        * `--endwith_db_rownumber xyz` To stop processing at a particular row-id.
        * `--include_lbsn_objects` If processing from lbsn rawdb, provide a comma-separated list of
          [lbsn objects](https://lbsn.vgiscience.org/structure/) to include. May contain:
          `origin,country,city,place,user_groups,user,post,post_reaction,event`.
          Excluded objects will not be queried, but empty objects may be created due to referenced
          foreign key relationships. Defaults to `origin,post`.
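The three json layouts that the flags above distinguish can be illustrated with plain Python (stdlib only; the sample records are made up for illustration):

```python
import json

# default layout: a json array, i.e. [{json1},{json2}]
array_doc = '[{"id": 1}, {"id": 2}]'
records = json.loads(array_doc)

# --is_stacked_json: concatenated jsons without commas, i.e. {json1}{json2};
# raw_decode() reads one object at a time and reports where it ended
stacked_doc = '{"id": 1}{"id": 2}'
decoder = json.JSONDecoder()
stacked, pos = [], 0
while pos < len(stacked_doc):
    obj, pos = decoder.raw_decode(stacked_doc, pos)
    stacked.append(obj)

# --is_line_separated_json: one json per line
line_doc = '{"id": 1}\n{"id": 2}'
lines = [json.loads(line) for line in line_doc.splitlines()]
```

All three variants yield the same list of records; the flags only tell lbsntransform which layout to expect.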
---
**lbsntransform** can output data to a database with the [common lbsn structure](https://lbsn.vgiscience.org/),
called [rawdb](https://gitlab.vgiscience.de/lbsn/databases/rawdb),
or to the privacy-aware version, called [hlldb](https://gitlab.vgiscience.de/lbsn/databases/hllb).

**Examples:**

To output data to rawdb:

```bash
lbsntransform --dbpassword_output "sample-key" \
    --dbuser_output "postgres" \
    --dbserveraddress_output "127.0.0.1:5432" \
    --dbname_output "rawdb" \
    --dbformat_output "lbsn"
```

The syntax for conversion to hlldb is a little more complex,
since the output structure may vary considerably, depending
on the use case.

!!! note
    The hlldb and its structure are still at an early stage of development.
    We're beyond the initial proof of concept and are working on
    simplifying custom mappings.

To output data to hlldb:
```bash
lbsntransform --dbpassword_output "sample-key" \
    --dbuser_output "postgres" \
    --dbserveraddress_output "127.0.0.1:25432" \
    --dbname_output "hlldb" \
    --dbformat_output "hll" \
    --dbpassword_hllworker "sample-key" \
    --dbuser_hllworker "postgres" \
    --dbserveraddress_hllworker "127.0.0.1:15432" \
    --dbname_hllworker "hllworkerdb" \
    --include_lbsn_objects "origin,post"
```
Above, a separate connection to a "hll_worker" database is provided.
It is used to perform the hll calculations (union, hashing etc.). No items
will be written to this database; a read-only user will suffice. A
[Docker container with a predefined user](https://gitlab.vgiscience.de/lbsn/databases/pg-hll-empty)
is available.

Having two hll databases, one for calculations and one for storage, means
that concerns can be separated: there is no need for hlldb to receive any
raw data. Likewise, the hll worker does not need to know any contextual data
to compute unions of specific hll sets. Such a setup improves robustness and
privacy, and it further allows processing to be separated into individual components.

If no hll worker is available, hlldb may be used.

Use `--include_lbsn_objects` to specify which input data you want to convert to
the privacy-aware version. For example, `--include_lbsn_objects "origin,post"`
would process [lbsn objects](https://lbsn.vgiscience.org/structure/)
of type origin and post (the default).
Use `--include_lbsn_bases` to specify which output structures you want the data to be converted to.

We call these "bases"; they are defined in the output mappings in
[lbsntransform/output/hll/hll_bases.py](/api/output/hll/hll_bases.html).

Bases can be separated by comma and may include:

- Temporal Facet:
    - `monthofyear`
    - `month`
    - `dayofmonth`
    - `dayofweek`
    - `hourofday`
    - `year`
    - `date`
    - `timestamp`
- Spatial Facet:
    - `country`
    - `region`
    - `city`
    - `place`
    - `latlng`

- Social Facet:
    - `community`

- Topical Facet:
    - `hashtag`
    - `emoji`
    - `term`

- Composite Bases:
    - `_hashtag_latlng`
    - `_term_latlng`
    - `_emoji_latlng`
For example:
```bash
lbsntransform --include_lbsn_bases hashtag,place,date,community
```

would fill/update entries in the following hlldb structures:

- topical.hashtag
- spatial.place
- temporal.date
- social.community

These names refer to `schema.table`.
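The relation between base names and target tables can be sketched as follows. This is an illustrative snippet, not lbsntransform's actual code; the facet assignment simply follows the facet lists above:

```python
# Illustrative only: map each base to its facet (= hlldb schema), per
# the facet lists above, and derive the target "schema.table" name.
FACET_OF = {
    "hashtag": "topical", "emoji": "topical", "term": "topical",
    "country": "spatial", "region": "spatial", "city": "spatial",
    "place": "spatial", "latlng": "spatial",
    "community": "social",
    "date": "temporal", "month": "temporal", "year": "temporal",
}

def target_table(base: str) -> str:
    """Return the hlldb table a base maps to, as schema.table."""
    return f"{FACET_OF[base]}.{base}"
```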
It is possible to define your own output hll db mappings. The best place
to start is [lbsntransform/output/hll/hll_bases.py](/api/output/hll/hll_bases.html).
Have a look at the predefined bases and add any additional ones needed; it is
recommended to use inheritance. After adding your own mappings, the hlldb must
be prepared with the respective table structures. Have a look at the
[predefined structures available](https://gitlab.vgiscience.de/lbsn/structure/hlldb).
---
For in-memory conversion, it is possible to import lbsntransform as a package:

```py
import lbsntransform as lt

# record: a single input record, e.g. a post retrieved from an API
lt.add_processed_records(record)
lt.store_lbsn_records()
```

As a starting point, have a look at
[lbsntransform/__main__.py](https://gitlab.vgiscience.de/lbsn/lbsntransform/-/blob/master/lbsntransform/__main__.py),
which includes the code that is invoked on command line use.

We plan to update this section with a Jupyter Lab example notebook.
---
If you're using the command line interface, a common use of lbsntransform is to
import/convert arbitrary social media data, e.g. from Flickr or Twitter, to a
Postgres database with the [common lbsn structure](https://lbsn.vgiscience.org/).

The following use cases exist:

1. Importing lbsntransform as a package.

   Use this approach to convert data, such as individual posts
   retrieved from an API, on the fly (in-memory), in your own
   Python package.

2. Using the command line interface (cli) to perform batch conversions.

   Use this approach if you want to convert batches of data stored as
   arbitrary json/csv files, or if you want to convert from a database
   with the raw lbsn structure to a database with the privacy-aware hll
   format.

For any conversion,

- the input type must be provided, see [input-types](input-types)
- a mapping must exist, see [input-mappings](input-mappings)