Releases: scribe-org/Scribe-Data
Scribe-Data 4.1.0
✨ Features
- Queries for noun genders and other properties that require the Wikidata label service now return their English label rather than the auto label, which was returning only the Wikidata QID.
- SPARQL queries for English and Portuguese prepositions were added to allow the CLI to query these types of data.
- The convert functionality once again works for lists of languages and for all of their data types.
🐞 Bug Fixes
- SQLite conversion was fixed for all queries (#527).
- The outputs of the data conversion process were improved: language names are now capitalized and repeat notices to the user were removed.
- The CLI's `get` command now returns all data types if none is passed.
- The Portuguese verbs query was fixed as it wasn't formatted correctly.
- The emoji keyword functionality was fixed given the new lexeme ID based form of the data.
- Arguments were fixed that were breaking the functionality.
- Languages for the user were capitalized.
- `case` has been renamed `grammaticalCase` in preposition queries to ensure that SQLite reserved keywords are not used.
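The rename of `case` to `grammaticalCase` can be motivated with a minimal sketch using Python's built-in `sqlite3` module. The table and helper names here are hypothetical and are not taken from Scribe-Data; the only point illustrated is that `CASE` is a reserved keyword in SQLite and cannot be used as an unquoted column name.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")


def can_create_column(column_name: str) -> bool:
    """Return True if a table with this unquoted column name can be created.

    The table name and the `preposition` column are hypothetical.
    """
    try:
        conn.execute(
            f"CREATE TABLE t_{column_name} (preposition TEXT, {column_name} TEXT)"
        )
        return True
    except sqlite3.OperationalError:
        # SQLite rejects reserved keywords used as unquoted identifiers.
        return False


case_ok = can_create_column("case")
grammatical_case_ok = can_create_column("grammaticalCase")
print(case_ok, grammatical_case_ok)
```

Quoting the identifier would also work in SQLite, but avoiding the keyword altogether keeps every downstream query free of special quoting.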
Scribe-Data 4.0.0
✨ Features
- Queries for countless data types for countless languages were expanded and added ❤️
- Scribe-Data is now a fully functional CLI.
- Querying Wikidata lexicographical data can be done via the `get` command (#159).
- The output type of queries can be in JSON, CSV, TSV and SQLite, with converting output types also being possible (#145, #146).
- Output paths can be set for query results (#144).
- The version of the CLI can be printed to the command line and the CLI can further be used to upgrade itself (#186, #157).
- Total Wikidata lexemes for languages and data types can be derived with the `total` command (#147).
- Interactive and total commands can be used via an interactive mode with the `--interactive` argument (#158, #203).
- Outputs were standardized to ensure that the CLI experience is consistent.
- The machine translation process has been removed to make way for the Wiktionary based implementation (#292).
- Package metadata files were standardized for languages, data types and Wikidata lexeme forms.
- CLI commands have an argument check that can suggest correct languages and data types (#341).
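The argument check mentioned in the last item above (#341) can be pictured with a small sketch based on the standard library's `difflib`. This is an assumption about one way such a check could work, not the project's implementation; the `KNOWN_LANGUAGES` list and the `suggest_correction` function are hypothetical.

```python
from difflib import get_close_matches
from typing import List, Optional

# Hypothetical list of supported languages, for illustration only.
KNOWN_LANGUAGES = [
    "English", "French", "German", "Italian",
    "Portuguese", "Russian", "Spanish", "Swedish",
]


def suggest_correction(value: str, valid_options: List[str]) -> Optional[str]:
    """Return the valid option closest to `value`, or None if nothing is close."""
    # Compare case-insensitively, but return the canonical spelling.
    lookup = {option.lower(): option for option in valid_options}
    if value.lower() in lookup:
        return lookup[value.lower()]
    matches = get_close_matches(value.lower(), list(lookup), n=1, cutoff=0.6)
    return lookup[matches[0]] if matches else None


print(suggest_correction("Englsh", KNOWN_LANGUAGES))
print(suggest_correction("german", KNOWN_LANGUAGES))
print(suggest_correction("xyz", KNOWN_LANGUAGES))
```

The `cutoff` of 0.6 is the `difflib` default; a stricter value would suggest fewer, safer corrections.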
🐞 Bug Fixes
- Wikidata query process stages no longer trigger the tqdm progress bar when they're unsuccessful (#155).
✅ Tests
- Tests have been written for the CLI to ensure that its functionality remains consistent.
- Workflows were created to check that the Wikidata queries and project structure are consistent, which in turn protects package functionality (#339, #357).
- Project queries and structure have been updated to match the rules developed for the checks.
📝 Documentation
- The CLI's functionality has been fully documented (#152, #208).
- Documentation was created to show how to write Scribe-Data queries (#395).
♻️ Code Refactoring
- `word_type` has been switched to `data_type` throughout the codebase (#160).
- Case, gender and annotation utility functions were removed as the formatting process that used them has changed.
- The SPARQLWrapper access method has been extracted to the Wikidata utils and is imported into the files that need it (#164).
- Export data paths have been converted to centrally saved variables to reduce hard coded string repetition.
- Many files were renamed including `update_data.py` being renamed `query_data.py`.
- Paths within the package have been updated to work for all operating systems via `pathlib` (#125).
- The language formatting scripts have been dramatically simplified given changes to export paths all being the same.
- The `update_files` directory was removed in preparation of other means of showing data totals.
- The `language_data_extraction` directory was moved under the Wikidata directory as it's only used for those processes now (#446).
- The emoji keyword process was centralized to simplify project maintenance (#359).
- PyICU was removed as a dependency and a process was made to install it and its needed dependencies given the operating system of the user (#196).
- The data formatting step was centralized such that we only have one for all languages (#142).
- Sub-query processes are no longer hard coded, so the total number of possible sub-queries no longer needs to be maintained within the `query_data.py` process.
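The `pathlib` change listed above (#125) can be illustrated with a short sketch. The directory and file names are placeholders chosen for illustration; the point is only that building a path from parts lets `pathlib` choose the separator, instead of hard coding `/` or `\` in strings.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# Placeholder path components, not the package's actual layout.
parts = ("scribe_data_json_export", "English", "nouns.json")

# Pure paths render the same parts with each platform's separator,
# without touching the file system.
posix_path = PurePosixPath(*parts)
windows_path = PureWindowsPath(*parts)

# Path picks the right flavor for the operating system it runs on.
local_path = Path(*parts)

print(posix_path)
print(windows_path)
print(local_path.name)
```

Code written against `Path` therefore behaves the same on Linux, macOS and Windows without any per-platform branches.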
Scribe-Data v3.3.0
✨ Features
- The translation process has been updated to allow for translations from non-English languages (#72, #73, #74, #75, #76, #77, #78, #79).
📝 Documentation
- The documentation has been given a new layout with the logo in the top left (#90).
- The documentation now has links to the code at the top of each page (#91).
🐞 Bug Fixes
- Annotation bugs such as repeated or empty values were removed.
- Perfect tenses of Portuguese verbs were fixed via finding the appropriate PID (#68).
- Note that the most common past perfect property is not the standard one, so this will need to be fixed.
♻️ Code Refactoring
- pre-commit has been added to the repo to improve the development experience (#137).
- Code formatting was shifted from black to Ruff.
- A Ruff based GitHub workflow was added to check the code formatting and lint the codebase on each pull request (#109).
- The `_update_files` directory was renamed `update_files` as these files are used in non-internal manners now (#57).
- A common function has been created to map Wikidata ids to noun genders (#69).
- The project now is installed locally for development and command line usage, so usages of `sys.path` have been removed from files (#122).
- The directory structure has been dramatically streamlined and includes folders for future projects where language data could come from other sources like Wiktionary (#139).
- Translation files are moved to their own directory.
- The `extract_transform` directory has been removed and all files within it have been moved one level up.
- The `languages` directory has been renamed `language_data_extraction`.
- All files within `wikidata/_resources` have been moved to the `resources` directory.
- The gender and case annotations for data formatting have now been commonly defined.
- All language directory `formatted_data` files have now been moved to the `scribe_data_json_export` directory to prepare for outputs being required to be directed to a directory outside of the package.
- Path computing has been refactored throughout the codebase, and unneeded functions for data transfers have been removed.
Scribe-Data v3.2.2
- Minor fixes to documentation index and file docstrings to fix errors.
- Revert change to package path definition to hopefully register the resources directory.
Scribe-Data v3.2.1
♻️ Code Refactoring
- The docs and tests were grafted into the package using `MANIFEST.in`.
- Minor fixes to file and function docstrings and documentation files.
- `include_package_data=True` is used in `setup.py` to hopefully include all files in the package distribution.
Scribe-Data v3.2.0
✨ Features
- The data and process needed for an English keyboard have been added (#39).
- The Wikidata queries for English have been updated to get all nouns and verbs.
- Formatting scripts have been written to prepare the queried data and load it into an SQLite database.
- The data update process has been cleaned up in preparation for future changes to Scribe-Data and to implement better practices.
- Language data was extracted into a JSON file for more succinct referencing (#52).
- Language codes are now checked with the package langcodes for easier expansion.
- A process has been created to check and update words that can be translated for each Scribe language (#44).
- The baseline data returned from Wikidata queries is now removed once a formatted data file is created.
🐞 Bug Fixes
- Tensorflow was removed from the download wiki process to fix build problems on Macs.
✅ Tests
- A full testing suite has been added to run on GitHub Actions (#37).
- Unit tests have been added for Wikidata queries (#48) and utility functions (#50).
♻️ Code Refactoring
- The Anaconda based virtual environment was removed and documentation was updated to reflect this.
- Language data processes were moved into the `src/scribe_data/extract_transform/languages` directory to clean up the structure.
- Code formatting processes were defined with common structures based on language and word type variables defined at the top of files.
Scribe-Data 3.1.0
✨ Features
- The word "Scribe" is now added to language database nouns files if it's not already present.
- German contracted prepositions have been added to the German prepositions formatting process.
- Words that are upper case are now better included in the autocomplete lexicon with their lower case equivalents being removed.
- Words with apostrophes have been removed from the autocomplete lexicon.
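The two lexicon rules above (prefer upper case entries over their lower case equivalents, and drop words with apostrophes) could be sketched as follows. This is one reading of those items, not the project's code; the `clean_lexicon` function and the sample words are hypothetical.

```python
from typing import Iterable, List


def clean_lexicon(words: Iterable[str]) -> List[str]:
    """Apply both lexicon rules to a list of candidate words."""
    # Rule 1: drop words containing a straight or typographic apostrophe.
    kept = [w for w in words if "'" not in w and "\u2019" not in w]

    # Lower-cased forms of every entry that contains an upper-case letter.
    lowered_upper = {w.lower() for w in kept if w != w.lower()}

    # Rule 2: skip an all-lower-case word when an upper-case entry
    # with the same letters exists; also skip exact repeats.
    result: List[str] = []
    seen = set()
    for w in kept:
        if w == w.lower() and w in lowered_upper:
            continue
        if w not in seen:
            seen.add(w)
            result.append(w)
    return result


lexicon = clean_lexicon(["Berlin", "berlin", "haus", "Haus", "don't", "gehen"])
print(lexicon)
```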
♻️ Code Refactoring
- Database output column names are now zero indexed to better align with Python and other language standards.
Scribe-Data 3.0.0
✨ Features
- Scribe-Data now has the ability to generate SQLite databases from formatted language data.
- `data_to_sqlite.py` is used to read available JSON files and input their information into the databases.
- These databases are now sent to Scribe apps via defined paths.
- `send_dbs_to_scribe.py` finds all available language databases and copies them.
- Separating this step from the data update is in preparation for data import in the future where this will be an individual step.
- Scribe-Data now also creates autocomplete lexicons for each language within `data_to_sqlite.py`.
- JSON data is no longer able to be uploaded to Scribe app directories directly, with the SQLite directories now being exported instead.
- Emojis of singular nouns are now also linked to their plural counterparts if the plural isn't present in the emoji keyword outputs.
- The emoji process also now updates a column to the `data_table.txt` file for sharing on readmes with `update_data.py` maintaining it in the data update process.
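The JSON-to-SQLite step described at the top of this list can be pictured with a minimal sketch using only the standard library. It does not reproduce `data_to_sqlite.py`; the JSON shape, the table layout and the `json_to_sqlite` function are assumptions made for illustration.

```python
import json
import sqlite3

# Hypothetical formatted noun data; the real JSON layout may differ.
nouns_json = json.dumps(
    {
        "Buch": {"plural": "Bücher", "form": "N"},
        "Katze": {"plural": "Katzen", "form": "F"},
    }
)


def json_to_sqlite(json_text: str, conn: sqlite3.Connection, table: str) -> int:
    """Load a word-keyed JSON object into a table and return the row count."""
    data = json.loads(json_text)
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{table}" '
        "(noun TEXT PRIMARY KEY, plural TEXT, form TEXT)"
    )
    rows = [(noun, info["plural"], info["form"]) for noun, info in data.items()]
    # Parameterized inserts avoid quoting problems in the word data itself.
    conn.executemany(f'INSERT OR REPLACE INTO "{table}" VALUES (?, ?, ?)', rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
inserted = json_to_sqlite(nouns_json, conn, "nouns")
plural = conn.execute(
    'SELECT plural FROM "nouns" WHERE noun = ?', ("Buch",)
).fetchone()[0]
print(inserted, plural)
```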
♻️ Code Refactoring
- The Jupyter notebooks for autosuggestions and emojis as well as `update_data.py` were moved to the `extract_transform` directory given that they're not used to load data anymore.
- Their code was refactored to reflect their new locations.
- Massive amounts of refactoring happened to achieve the shift in the data export method:
- `format_WORD_TYPE.py` files export to a `formatted_data` directory within `extract_transform`.
- Copies of all data JSONs that were originally in Scribe apps are now in the `formatted_data` directories.
- Functions in `update_utils.py` were switched given that data is no longer uploaded into a `Data` directory within the language keyboard directories within Scribe apps.
- Lots of functions and variables were renamed to make them more understandable.
- Code to derive appropriate export locations within `format_WORD_TYPE.py` files was removed in favor of a language `formatted_data` directory.
- regex was added as a dependency.
- pylint comments were removed.
- Verb SPARQL query scripts for Spanish and Italian were simplified to remove unneeded repeat conditions.
🐞 Bug Fixes
- The statements in translation files have been fixed as they were improperly defined after a file was moved.
Scribe-Data 2.1.0
✨ Features
- Scribe-Data can now split Wikidata queries into multiple stages to break up those that were too large to run.
Scribe-Data 2.0.0
✨ Features
- Scribe-Data now has the ability to download Wikipedia dumps of any language.
- Functions have been added to parse and clean the above dumps.
- Autosuggestions are generated from the cleaned texts by deriving most common words and those words that most commonly follow them.
- A query for profane words has been added and integrated into the autosuggest flow to make sure that inappropriate words are not included.
- The adjectives column has been removed from Scribe data tables until support is offered.
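The autosuggestion idea described above (most common words, the words that most commonly follow them, and a profanity filter) can be sketched with `collections.Counter`. This is a simplified illustration under stated assumptions, not the project's pipeline: tokenization is a plain whitespace split, and `build_autosuggestions` and its parameters are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, FrozenSet, List


def build_autosuggestions(
    text: str,
    blocked_words: FrozenSet[str] = frozenset(),
    top_words: int = 2,
    suggestions_per_word: int = 2,
) -> Dict[str, List[str]]:
    """Map each of the most common words to the words that most often follow it."""
    tokens = text.lower().split()

    # Count individual words, ignoring anything on the block list.
    word_counts = Counter(t for t in tokens if t not in blocked_words)

    # Count which word follows which, skipping pairs that involve a blocked word.
    followers: Dict[str, Counter] = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        if current in blocked_words or following in blocked_words:
            continue
        followers[current][following] += 1

    return {
        word: [w for w, _ in followers[word].most_common(suggestions_per_word)]
        for word, _ in word_counts.most_common(top_words)
    }


sample = "the cat sat on the mat the cat ran"
suggestions = build_autosuggestions(sample)
filtered = build_autosuggestions(sample, blocked_words=frozenset({"mat"}))
print(suggestions)
print(filtered)
```

Filtering pairs rather than deleting blocked tokens up front keeps unrelated words from being counted as neighbors.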
♻️ Code Refactoring
- The error messages for incorrect args in `update_data.py` have been updated.