Scribe-Data 4.0.0

@andrewtavis released this 28 Nov 18:27
· 46 commits to main since this release

✨ Features

  • Queries for many data types across many languages were added and expanded ❤️
  • Scribe-Data is now a fully functional CLI.
    • Querying Wikidata lexicographical data can be done via the get command (#159).
    • Query output can be written as JSON, CSV, TSV or SQLite, and existing output can be converted between types (#145, #146).
    • Output paths can be set for query results (#144).
    • The version of the CLI can be printed to the command line, and the CLI can also be used to upgrade itself (#186, #157).
    • Total Wikidata lexemes for languages and data types can be derived with the total command (#147).
    • Interactive and total commands can be used via an interactive mode with the --interactive argument (#158, #203).
    • Outputs were standardized to ensure that the CLI experience is consistent.
  • The machine translation process has been removed to make way for the Wiktionary-based implementation (#292).
  • Package metadata files were standardized for languages, data types and Wikidata lexeme forms.
  • CLI commands have an argument check that can suggest correct languages and data types (#341).
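
The argument check described in the last item above can be pictured with a short sketch. This is not Scribe-Data's actual implementation: `KNOWN_LANGUAGES` and `suggest_language` are hypothetical names chosen for illustration, and the fuzzy matching relies on `difflib.get_close_matches` from the Python standard library.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical list of supported languages (illustrative only, not the
# package's real metadata).
KNOWN_LANGUAGES = [
    "english",
    "french",
    "german",
    "italian",
    "portuguese",
    "russian",
    "spanish",
    "swedish",
]


def suggest_language(value: str, cutoff: float = 0.6) -> Optional[str]:
    """Return the known language closest to `value`, or None if nothing is close."""
    matches = get_close_matches(value.lower(), KNOWN_LANGUAGES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    typo = "Englsh"
    suggestion = suggest_language(typo)
    if suggestion is not None:
        print(f"Unknown language '{typo}'. Did you mean '{suggestion}'?")
```

The same approach would presumably extend to data types by swapping in a second list of valid values; the `cutoff` parameter controls how close a typo must be before a suggestion is offered.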

🐞 Bug Fixes

  • Wikidata query process stages no longer trigger the tqdm progress bar when they're unsuccessful (#155).

✅ Tests

  • Tests have been written for the CLI to ensure that its functionality remains consistent.
  • Workflows were created to check that the Wikidata queries and project structure are consistent, which helps ensure package functionality (#339, #357).
    • The project's queries and structure have been updated to match the rules developed for the checks.

📝 Documentation

  • The CLI's functionality has been fully documented (#152, #208).
  • Documentation was created to show how to write Scribe-Data queries (#395).

♻️ Code Refactoring

  • word_type has been switched to data_type throughout the codebase (#160).
  • Case, gender and annotation utility functions were removed as the formatting process that used them has changed.
  • The SPARQLWrapper access method has been extracted to the Wikidata utils and is imported into the files that need it (#164).
  • Export data paths have been converted to centrally saved variables to reduce hard coded string repetition.
  • Many files were renamed, including update_data.py, which is now query_data.py.
  • Paths within the package have been updated to work for all operating systems via pathlib (#125).
  • The language formatting scripts have been dramatically simplified now that all export paths are the same.
  • The update_files directory was removed in preparation for other means of showing data totals.
  • The language_data_extraction directory was moved under the Wikidata directory as it's only used for those processes now (#446).
  • The emoji keyword process was centralized to simplify project maintenance (#359).
  • PyICU was removed as a dependency and a process was made to install it and its needed dependencies given the operating system of the user (#196).
  • The data formatting step was centralized such that we only have one for all languages (#142).
  • Sub-query processes are no longer hard coded, so we no longer need to maintain the total possible sub-queries within the query_data.py process.
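
Two of the refactoring items above, centrally saved export paths and cross-platform paths via pathlib, can be illustrated together. The following is a minimal sketch under stated assumptions: `DEFAULT_EXPORT_DIR` and `export_path` are hypothetical names, and the directory layout shown is not necessarily the one Scribe-Data uses.

```python
from pathlib import Path

# Hypothetical central export directory (an assumption for illustration,
# not the package's actual constant).
DEFAULT_EXPORT_DIR = Path("scribe_data_json_export")


def export_path(
    language: str,
    data_type: str,
    output_type: str = "json",
    base_dir: Path = DEFAULT_EXPORT_DIR,
) -> Path:
    """Build an export file path from one central base directory.

    Joining with the `/` operator lets pathlib pick the right separator
    for the operating system instead of hard coding "/" or "\\".
    """
    return base_dir / language.lower() / f"{data_type}.{output_type}"


if __name__ == "__main__":
    print(export_path("English", "nouns"))
    print(export_path("German", "verbs", output_type="csv"))
```

Keeping the base directory in one variable means a change to the export location touches a single line rather than every script that writes data.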