Slight refactor, added docs, no other significant changes in functionality
- refactor node creation to be in its own module
- fix some backwards compatibility problems in 0.6.3
- fix bug when fixing overlapping blocks and one block is None
- fix bug that assigned all properties the same label
- improve algorithm for detecting 1D blocks
- fix bug where block suggestions returned overlapping blocks
- don't allow overlapping or identical annotations
- remove very old defunct wikifierservice
- performance improvements
- faster wikifier
- improved alignment algorithm
- treat multi-column date blocks as concatenated dates
- remove post-processing of pandas data for multi-factor speedup of loading large spreadsheets
- get rid of spreadsheet cache handling in api
- change tags to a dictionary keyed by id rather than an array
- speed up suggest_block algorithm for datasets with more than 300 rows by sampling top and bottom
- change iterator in statement mapper to accept start and end index, add option for "count" as well
- cache index_dict instead of recalculating each time.
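
The top-and-bottom sampling mentioned above for suggest_block can be sketched roughly as follows. This is only an illustration, not the package's code: the function name `sample_top_and_bottom` and the `sample_size` parameter are hypothetical, and only the 300-row threshold comes from the entry itself.

```python
def sample_top_and_bottom(rows, threshold=300, sample_size=150):
    """Return all rows for small inputs, otherwise only the first and last rows.

    `threshold` mirrors the 300-row cutoff named in the changelog entry;
    `sample_size` is an assumed value chosen for this sketch.
    """
    if len(rows) <= threshold:
        return list(rows)
    # Keep a slice from the top and a slice from the bottom, skip the middle.
    return list(rows[:sample_size]) + list(rows[-sample_size:])


if __name__ == "__main__":
    sampled = sample_top_and_bottom(list(range(1000)))
    print(len(sampled), sampled[0], sampled[-1])
```

The idea is that block-type heuristics usually only need a representative slice of a long column, so the middle rows can be skipped.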
- update country_wikifier_cache:
- add more countries
- standardize (lowercase, replace punctuation)
- add causx only countries for suggesting blocks
- bug fixes:
- annotation parsing: fix bug with range strings
- critical bug fix in cleaning id
- additional fixes to -$n behavior
- conditionally use -$n in blocks that are 50% or more blank spaces
- bug fix when adding entities
- strip start/end whitespace from auto-generated node labels
- when auto-creating nodes, check if changed from property to qnode or vice versa
- bug fix to exception raised when label resolves to empty string
- stop using -$n in auto-generated blocks
- when generating kgtk output, skip cells with problem (instead of erroring out)
- do not send custom properties to sparql query service
- return filepath when getting entities from kgtk file
- remove every scrap of a print statement
- complete overhaul of wikification, extremely not backwards compatible (those with scripts can use OldItemTable if they need)
- remove wikifier_files from project
- tweak: raise error if clean_id returns empty string
- improve suggestion heuristic to check multiple cells and average them
- fix bug where suggestion heuristic treated empty cells as dates
- some changes to suggestion format
- annotation can accept a node dictionary or a node id string
- changes to autonode creation
Small refactors including adding things to api.py
A variety of stability and performance fixes
- don't autosuggest P1114
- change format of suggestion
- cache custom nodes in memory
- strict_make_numeric added
- some additional entries in country wikifier
- in auto blocks, find the longest stretch and allow up to 10% of the total length to be skips
- fix race condition for wikified labels
- much faster block suggestion code
- added some countries to country wikifier
- improved string_is_valid check
- delete links to nonexistent blocks in autosuggested links
- add field entity_file to project
- add links field to annotation
- fix some bugs with subject field in annotation
- normalize block edges
- use subject from annotation if provided
- fix bug in overlap checking
- add auto block detection for dates, countries, and quantities
- add support for partial statements
- canonical spreadsheet:
- fix bug in canonical spreadsheet that wasn't fetching labels
- add dataset id
- preliminary support for limiting number of statements fetched (wip)
- order of statements switched to row-major
- add method for loading a Sheet from an in-memory csv string
- add support for simple imperfect alignments
- if value resolves to empty string don't create statement
- fix bug in warn_for_empty_cells
- refine make_numeric to only replace in beginning and end and to do a better job stripping whitespace and commas
- smoother handling of irregular numbers in annotations:
- default to using make_numeric for anything of type quantity when creating yaml
- use make_numeric when checking suggestion for quantityness
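
As a rough illustration of the make_numeric behavior described in these entries (strip only at the beginning and end, remove whitespace and commas, return an empty string on failure, leave floats as floats), here is a minimal sketch. The name `make_numeric_sketch` and the exact character classes are assumptions; the real function may differ.

```python
import re


def make_numeric_sketch(value):
    """Hypothetical stand-in for make_numeric; returns a numeric string or ""."""
    text = str(value).strip()
    # Remove non-numeric characters only at the beginning and at the end.
    text = re.sub(r"^[^\d.+-]+", "", text)
    text = re.sub(r"[^\d.]+$", "", text)
    # Drop thousands separators and stray whitespace inside the number.
    text = text.replace(",", "").replace(" ", "")
    try:
        float(text)
    except ValueError:
        # Unparseable input yields an empty string instead of raising.
        return ""
    # Return the cleaned text as-is so "3.0" is not converted to "3".
    return text


if __name__ == "__main__":
    for raw in [" $1,234.50 ", "12 kg", "3.0", "abc"]:
        print(repr(raw), "->", repr(make_numeric_sketch(raw)))
```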
- don't send "title" in yaml
- canonical spreadsheet returns labels not qnodes
- added a function to suggest possible annotation choices for annotation blocks
- bug fix: added preloaded_properties.tsv to manifest.in when building package
- bug fix to creation of additional kgtk edges
- support additional data file formats (primarily tsv, but anything pandas' parser can detect should work in the api)
- added another 520 preloaded properties to the default provider
- in annotations:
- only create custom properties for custom-node-id properties
- drop duplicates from auto-gen wikifier
- add file/sheet to autogen wikifier
- don't overwrite user changes to auto-gen properties/items
- make sure to update property types if linked annotation block type changes
- allow value to be a wikidata item
- major bug fix: non-existent properties were marked as existing and not generated properly
- changed default descriptions to be empty strings (instead of "a item", "property relation")
- datamart mapping:
- sanitize leading/trailing whitespace and trailing colon of names
- add flag to validate variables as quantity
- bug fixes in error reporting during annotation generation (caused by index errors)
- add support for specifying precedence (True=overwrites others, False=all others overwrite) when adding entity and wikifier files
- during auto-creation of entities/wikification in annotations, precedence is lowest, and don't create empty files
- if a field is filled with an empty string in annotation, treat it as empty when assigning blocks
- also generate entities and wikification for units
- add id, userlink, and link fields to annotations
- support specifying sheet name and data path in wikifier file
- fix bug when autogen dir doesn't exist in filesystem
- remove project.t2wml file compatibility for formats from version 0.0.16 and earlier (was deprecated from 0.0.17 onwards)
- add some additional validation to annotation
- overhaul of kgtk output to include custom edges for dataset, variable nodes, qualifier nodes, custom QNodes, and linking statement edges to dataset
- add support for tags in entity upload
- add all properties from wikidata to a kgtk file and change the default wikidataprovider for api script users to be the kgtkfileprovider
- change the format of errors to be a flat array of dictionaries instead of a layered dictionary
- in annotations, selections -> selection (backwards compatibility will select first in list + print warning)
- all rectangle orientations are normalized to top left corner to bottom right
- bug fixes:
- handle $n cases in the annotations
- specify an escape char in to_csv
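
The rectangle normalization noted above (any orientation becomes top-left to bottom-right) amounts to sorting the corner coordinates. A minimal sketch, assuming a rectangle is given as two corner points; the function name and the tuple layout are hypothetical and not the annotation format itself:

```python
def normalize_rectangle(x1, y1, x2, y2):
    """Return (left, top, right, bottom) regardless of how the corners were drawn."""
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))


if __name__ == "__main__":
    # A selection dragged from bottom-right to top-left.
    print(normalize_rectangle(5, 7, 2, 3))
```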
- remove imprecise alignment support in annotations until we have a better working version
- add preliminary support for no_wikification setting and saving to a csv file
- add functions for renaming and deleting files in a project
- add description and url fields to project
- HOTFIX: broken addition of wikifier files
- bug fix: not normalizing path when adding files
- various tweaks to annotation parsing:
- rudimentary support for annotating with imprecise alignments
- in annotationmapper, if there isn't a valid annotation (dependent variable + main subject), don't try generating statements
- add basic annotation validation (must be a list containing dict entries with a key "role")
- munkres package added to requirements
- some module reorganization and cleanup
- added annotation_parsing module and AnnotationMapping class
- fix pandas xlrd bug
- until proper multi-wikifier file management is added, always use most recently applied wikifier file.
- added cleaning function fill_empty
- region field of yaml no longer expects/requires a list (list support now prints deprecation warning)
- added munkres to requirements
- added support for qval, qcol, and region in qualifier
- item in yaml and statements has been renamed to "subject". for now there is backwards compatibility for older style yamls.
- statements no longer returns individual "cell" for item cell of statement and value of qualifier. instead, "cells" is returned, a dictionary of every field in the result (statement/qualifier) that was derived from a cell. (it will not work with concat, which does not support returning cell)
- added some more T2WML exception types, for project in particular
- fixed a small bug with references
- add support for leave/replace/add ethiopian calendar to gregorian
- change order of precedence when parsing regions - `cells` now supersedes all else
- change to project file format- save selected yamls and sheets
- in make_numeric, return an empty string if fail to parse to number, and don't convert floats to ints
- hot fix to how sparql queries are passed
- added sheet to knowledge graph (optional, backwards compatible) and to_json functionality to sheet
- bug fixes:
- cast to path before calling as_posix
- because of issues with numpy 1.19.4, make sure to list numpy version in requirements and setup
- slight output tweaks:
- return key for unexpected errors in statement as well
- slightly more information when wikify_region fails
- get rid of outdated error handling that included error codes for web
- update to valid property types: case insensitively: "globecoordinate", "quantity", "time", "string", "monolingualtext", "externalid", "wikibaseitem", "wikibaseproperty", "url" (externalid replaces externalidentifier, which was wrong)
- backwards incompatible: completely remove remaining ttl/rdf support
- backwards incompatible: the setting cache_data_files is now a property and cannot be directly set (instead, it is true when cache_data_files_folder is provided)
- new feature: as documented in grammar.md, a lot of cleaning functions added
- IMPORTANT: this includes an added requirement, text-unidecode
- backwards incompatible: some functions were renamed or replaced:
- `replace` is gone, there's `replace_regex` now
- `clean` has been renamed to `ftfy`
- `title`, `upper`, `lower` are gone, there's a function `change_case` instead
- `strip` is gone, use `strip_whitespace` instead
- included in the added cleaning functionality is the ability to add a `cleaningMapping` section to the yaml file to apply cleaning functions to specific sections of the calculated sheet
- the default sparql endpoint is now the public wikidata endpoint
- default WikidataProvider is now a DictionaryProvider with preloaded properties (we may change how preloading works in future version)
- allow uploading entities with wikidata IDs if user specifies `allow_wikidata_ids=True` in add_entities_from_file
- add state to project file
- bug fix when adding absolute file paths to project
- change format of project file to include sheet names
- remove specific wikifier support
- removed the ProjectRunner class
- add cache_id and handle_calendar properties to Project class
- pre-release: some of the cleaning functions (strip_whitespace, remove_numbers, truncate, normalize_whitespace, change_case)
- validate P/Q node definitions (must begin with P/Q, Pnum where num<10000 or Qnum where num<1 billion not allowed)
- when fetching property type, fetch description and label while you're at it
- stop saving directory to project file
- deprecation: rename add_nodes_from_file to add_entities_from_file
- backwards incompatible: rename .t2wmlproj to project.t2wml, so that it can be viewed on mac and linux
- change default project title from "Untitled" to folder name
- rename add_wikidata_file to more accurate add_entity_file
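
The P/Q node validation rule listed above (ids must begin with P or Q, and numeric ids in the ranges used by Wikidata are rejected) could look roughly like this. The function name is hypothetical, and the `allow_wikidata_ids` parameter is only modeled on the flag mentioned elsewhere in this changelog; treat the whole block as a sketch of the rule, not the library's implementation.

```python
import re

# Limits taken from the changelog entry: Pnum with num < 10000 and
# Qnum with num < 1 billion are reserved.
RESERVED_LIMITS = {"P": 10_000, "Q": 1_000_000_000}


def is_valid_custom_id(node_id, allow_wikidata_ids=False):
    """Check a custom node id against the P/Q rule (illustrative only)."""
    if not node_id or node_id[0] not in RESERVED_LIMITS:
        return False
    match = re.fullmatch(r"([PQ])(\d+)", node_id)
    if match is None:
        # Non-numeric suffixes are assumed to be acceptable custom ids.
        return True
    prefix, number = match.group(1), int(match.group(2))
    if number < RESERVED_LIMITS[prefix]:
        return allow_wikidata_ids
    return True


if __name__ == "__main__":
    for candidate in ["P9999", "P10000", "Q42", "Xfoo", "Pcustom"]:
        print(candidate, is_valid_custom_id(candidate))
```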
- All at least somewhat backwards incompatible:
- Changed project class filename to .t2wmlproj
- added sparql endpoint and warn for empty cells settings to project
- change add_properties_from_file to add_nodes_from_file, some changes to wikidata_provider interface (documented in api.md)
- change returned dict key (from adding properties to file) from "present" to "updated"
- Bug fix:
- when set to something like $sheet, which doesn't return ReturnClass, parsing would fail on attribute error
- added the Project class
- change settings to be class-based instead of a dictionary
- when caching, use an underscored version of the full path to the original file to create the cache name
- add setting `warn_for_empty_cells`, default False
- change how date parsing is handled: if format is provided, must strictly match format, otherwise will fuzzy-guess
- change default behavior for regex to return first group
- support utf-8 encoding for yaml files
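
The date-parsing change above (strict matching when a format is given, guessing otherwise) can be sketched with the standard library alone. This is an assumption-laden illustration: `parse_date`, `FALLBACK_FORMATS`, and the idea of trying a short list of common formats are stand-ins for whatever fuzzy parser the package actually uses.

```python
from datetime import datetime

# Hypothetical fallback list standing in for a real fuzzy date parser.
FALLBACK_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y")


def parse_date(text, formats=None):
    """Parse strictly if formats are given, otherwise try common formats."""
    text = text.strip()
    if formats:
        if isinstance(formats, str):
            formats = [formats]
        for fmt in formats:
            try:
                return datetime.strptime(text, fmt)
            except ValueError:
                continue
        # A format was provided, so no guessing is attempted.
        raise ValueError(f"{text!r} does not match any of {list(formats)}")
    for fmt in FALLBACK_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"could not guess a date format for {text!r}")


if __name__ == "__main__":
    print(parse_date("2020-01-05", "%Y-%m-%d"))
    print(parse_date("05/01/2020"))
```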
- bug fixes:
- bug fix for spacey import errors
- fixed bug reading value definitions in wikifier where value is numeric
- read pandas dataframes as strings (fixes bug: Object of type Timestamp is not JSON serializable)
- do not re-apply same skips (fixes bug: Multiple skip_rows conditions that apply to the same row causes error)
- do not generate qualifier edges for empty values
- empty data cells were not being skipped
- skip cells were being saved in columns
- "node2;kgtk:data_type" for location changed from "coordinate" to "location_coordinate"
- added globe field to kgtk output
- change settings to be class-based instead of a dictionary
- add setting `cache_data_files_folder`
- when caching, use an underscored version of the full path to the original file to create the cache name
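
One way to derive a cache name from the full path, as the entry above describes, is to replace every run of non-alphanumeric characters with an underscore. The function name and the exact replacement rule are assumptions made for this sketch; the package's actual naming scheme may differ.

```python
import re


def cache_name_for(file_path):
    """Turn a full file path into a flat, underscored cache name (illustrative)."""
    # Collapse path separators, dots, spaces, and so on into single underscores.
    return re.sub(r"[^A-Za-z0-9]+", "_", str(file_path)).strip("_")


if __name__ == "__main__":
    print(cache_name_for("/home/user/data/input file.xlsx"))
```

Using the full path rather than the bare file name keeps two files with the same name in different folders from sharing a cache entry.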
- complete overhaul of region definition, added support for 'columns'/'rows'/'cells'
- backwards incompatible:
- changed 'skip_column/row/cell' to 'skip_columns/rows/cells'
- renamed BaseStatementMapper to StatementMapper
- KnowledgeGraph.save_download renamed to KnowledgeGraph.save_file
- added case insensitivity to property typing (ie, url, Url, URL now all valid)
- stop printing template errors
- bug fixes:
- critical bug: did not recognise 0 column/0 row when wikifying
- x->y is now valid (previously only worked with spaces, x -> y)
- error when not sending any date formats
- item and cell falsiness now explicitly defined
- add support for list of date formats
- etk is now optional
- class SpreadsheetFile has been refined
- added class Statement
- statements are returned if they are valid (no errors in value, property, and item)
- qualifiers are included if they are valid (no errors in value, property)
- yet more docs, examples, tests
- A lot more classes:
- Wikifier class fully working, supports multiple wikifier definitions
- ItemTable completely revamped, totally different storage and lookup mechanism including preferential lookup
- Revamp of cell mapping, now with base class BaseStatementMapper and implementation YamlMapper
- WikifierService class
- DataFile is now SpreadsheetFile
- bug fix: bad access to sparql endpoint
- add support for adding label and description when uploading properties in tsv
- when wikify_region fails on specific cells, return error listing those cells, and wikify the rest
- create temporary csv with tempfile rather than manually
- add support for $filename to t2wml syntax
- add support for Url as property type
- do not include project name in kgtk id
- add metadata (sheet and filename) to results
- continued cleaning of server-specific code from the API
- separated from the server code into own package