This major release changes the way the data is organized. In version 1.X, the creation process produced several distinct datasets with redundant information across them. The final datasets were one-hot encoded, which is not suitable for many applications.
Therefore, the nature of the output has changed from a collection of datasets to one large database available in different formats. Users specifically interested in classification can still generate the datasets from this version, or simply extract a specific column to predict.
On top of this fundamental change, the major improvements include:
- Pandas-compatible data: structured information can be read directly with `pd.read_json()` (with the exception of the Bag-of-Words and TF-IDF files in libsvm format, which must first be loaded with `sklearn.datasets.load_svmlight_file`). Some information (citations of applications, decision body, representatives and Strasbourg case law citations) is stored as matrices that can also be loaded directly by Pandas. A loading sketch is shown after this list.
- Python 3.X compatibility: Python 2.X is replaced by Python 3.X.
- The number of tokens in the dictionary has been increased to 10,000. Additionally, there is no longer a minimum number of cases per article type, which slightly increases the number of cases.
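As a minimal loading sketch (Python; file names are taken from the database structure below, and the path prefix and `<itemid>` are placeholders to adapt to your local copy):

```python
import pandas as pd
from sklearn.datasets import load_svmlight_file

# Structured information is plain JSON, directly readable by Pandas.
cases = pd.read_json("cases.json")

# Bag-of-Words and TF-IDF files are in libsvm format, so they must be
# loaded with scikit-learn first; replace <itemid> with an actual case id.
X, y = load_svmlight_file("bow/<itemid>_bow.txt")
bow = pd.DataFrame.sparse.from_spmatrix(X)
```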
- #46 Corner cases in article parsing
- #44 Generate Pandas compatible data
- #42 Update raw information naming convention
- #37 Fetching documents fails on "Connection reset by peer"
- #34 Remove the minimal limit of cases per datasets
- #32 Retrieving case info fails on "Connection reset by peer"
- #30 Allow to select the version to build from build.py
- #26 Increase the default number of tokens to be generated
- #24 Typo in max_documents default value
- #13 Migration to Python 3.X
|- cases.json # List of cases. Each element contains all the information about a case, including the judgment document. The format is nested.
|- bow # Bag-of-Words representation of judgment documents
| |- <itemid>_bow.txt # The format is X:Y where X is the word id and Y the number of occurrences of the word in the judgment
| |- ...
|
|- tfidf
| |- <itemid>_tfid.txt # TF-IDF representation of judgment documents. The format is X:Y where X is the word id and Y the word weight
| |- ...
|
|- cases_flat_domain_mapping.json # Set of possible values across the dataset for each variable
|
|- cases_schema.json # Schema of cases
|
|- cases.csv # All cases in CSV format (directly compatible with Pandas)
|
|- cases.json # List of cases in JSON format (directly compatible with Pandas)
|
|- features_text.json # Mapping of N-grams X:Y where X is the N-gram and Y its id used for bow and tfidf
|
|- matrice_appnos.json # Matrix of application number (appnos) citations (directly compatible with Pandas)
|
|- matrice_decision_body.json # Matrix of decision body (directly compatible with Pandas)
|
|- matrice_representatives.json # Matrix of representatives (directly compatible with Pandas)
|
|- matrice_scl.json # Matrix of Strasbourg Case Law (directly compatible with Pandas)
|
|- schema_hint.json # Schema hint (internal hints describing how to encode unstructured data into structured data)
|
|- judgments.zip # Archive with the raw MS Word documents
|
|- normalized.zip # Normalized documents, before conversion into Bag-of-Words
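The matrix files are likewise plain JSON. A minimal sketch of loading them, assuming the file names listed above (the column layout is not specified in these notes):

```python
import json
import pandas as pd

# Citation matrices can be loaded directly by Pandas.
appnos = pd.read_json("matrice_appnos.json")
scl = pd.read_json("matrice_scl.json")

# Mapping from N-gram to the id used by the bow and tfidf files.
with open("features_text.json") as f:
    ngram_to_id = json.load(f)
```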