This major release changes the way the data is organized. In version 1.X, the creation process produced several distinct datasets with redundant information across them. The final datasets were one-hot encoded, which is not suitable for many applications.
Therefore, the nature of the output has changed from a collection of datasets to one large database available in different formats. Users specifically interested in classification can still generate the datasets from this version, or simply extract a specific column to predict.
On top of this fundamental change, the major improvements include:
- Pandas-compatible data: structured information can be read directly with `pd.read_json()` (with the exception of the Bag-of-Words and TF-IDF files in libsvm format, which must first be loaded with `sklearn.datasets.load_svmlight_file`). Some information (citations of applications, decision body, representatives and Strasbourg case law citations) is stored as matrices that can also be loaded directly by Pandas. A loading sketch is shown after this list.
- Python 3.X compatibility: Python 2.X is replaced by Python 3.X.
- The number of tokens in the dictionary has been increased to 10,000. Additionally, there is no longer a minimum number of cases per article type, which slightly increases the number of cases.
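As a minimal loading sketch (Python; file names are taken from the database structure below, and the path prefix and `<itemid>` are placeholders to adapt to your local copy):

```python
import pandas as pd
from sklearn.datasets import load_svmlight_file

# Structured information is plain JSON, directly readable by Pandas.
cases = pd.read_json("cases.json")

# Bag-of-Words and TF-IDF files are in libsvm format, so they must be
# loaded with scikit-learn first; replace <itemid> with an actual case id.
X, y = load_svmlight_file("bow/<itemid>_bow.txt")
bow = pd.DataFrame.sparse.from_spmatrix(X)
```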
- #46 Corner cases in article parsing
- #44 Generate Pandas compatible data
- #42 Update raw information naming convention
- #37 Fetching documents fails on "Connection reset by peer"
- #34 Remove the minimal limit of cases per datasets
- #32 Retrieving case info fails on "Connection reset by peer"
- #30 Allow to select the version to build from build.py
- #26 Increase the default number of tokens to be generated
- #24 Typo in max_documents default value
- #13 Migration to Python 3.X
|- cases.json # List of cases. Each element contains all the information about a case, including the judgment document. The format is nested.
|- bow # Bag-of-Words representation of judgment documents
| |- <itemid>_bow.txt # The format is X:Y where X is the word id and Y the number of occurrences of the word in the judgment
| |- ...
|
|- tfidf
| |- <itemid>_tfid.txt # TF-IDF representation of judgment documents. The format is X:Y where X is the word id and Y the word weight
| |- ...
|
|- cases_flat_domain_mapping.json # Set of possible values across the dataset for each variable
|
|- cases_schema.json # Schema of cases
|
|- cases.csv # All cases in CSV format (directly compatible with Pandas)
|
|- cases.json # List of cases in JSON format (directly compatible with Pandas)
|
|- features_text.json # Mapping of N-grams X:Y where X is the N-gram and Y its id used for bow and tfidf
|
|- matrice_appnos.json # Matrix of application number (appnos) citations (directly compatible with Pandas)
|
|- matrice_decision_body.json # Matrix of decision body (directly compatible with Pandas)
|
|- matrice_representatives.json # Matrix of representatives (directly compatible with Pandas)
|
|- matrice_scl.json # Matrix of Strasbourg Case Law (directly compatible with Pandas)
|
|- schema_hint.json # Schema hint (internal hints describing how to encode unstructured data into structured data)
|
|- judgments.zip # Archive with the raw MS Word documents
|
|- normalized.zip # Normalized documents, before conversion into Bag-of-Words
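The matrix files are likewise plain JSON. A minimal sketch of loading them, assuming the file names listed above (the column layout is not specified in these notes):

```python
import json
import pandas as pd

# Citation matrices can be loaded directly by Pandas.
appnos = pd.read_json("matrice_appnos.json")
scl = pd.read_json("matrice_scl.json")

# Mapping from N-gram to the id used by the bow and tfidf files.
with open("features_text.json") as f:
    ngram_to_id = json.load(f)
```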