Yfirlestur.is is a web application where you can enter or submit Icelandic text and have it checked for spelling and grammar errors.
The tool also gives hints on words and structures that might not be appropriate, depending on the intended audience for the text.
Try Yfirlestur (in Icelandic) at https://yfirlestur.is!
Text with annotations, as displayed by Yfirlestur.is
The core spelling and grammar checking functionality of Yfirlestur.is is provided by the GreynirCorrect engine, by the same authors. User feedback is greatly appreciated, either through GitHub Issues or by e-mail to [email protected].
In addition to its graphical web front-end, Yfirlestur.is exposes a public HTTPS/JSON application programming interface (API) to perform spelling and grammar checking.
This API can, for example, be accessed with `curl` from the Linux/macOS command line as follows (try it!):
curl https://yfirlestur.is/correct.api -d "text=Manninum á verkstæðinu vantar hamar"
...or, of course, via an HTTPS POST from your own code; see below.
All text is assumed to be coded in UTF-8.
The example returns the following JSON (shown indented, for ease of reading):
{
  "result": [
    [
      {
        "annotations": [
          {
            "code":"P_WRONG_CASE_þgf_þf",
            "detail":"Sögnin 'að vanta' er ópersónuleg. Frumlag hennar á að vera í þolfalli í stað þágufalls.",
            "start":0,
            "end":2,
            "start_char":0,
            "end_char":21,
            "suggest":"Manninn á verkstæðinu",
            "text":"Á líklega að vera 'Manninn á verkstæðinu'"
          }
        ],
        "corrected":"Manninum á verkstæðinu vantar hamar",
        "tokens": [
          {"k":6,"x":"Manninum"},
          {"k":6,"x":"á"},
          {"k":6,"x":"verkstæðinu"},
          {"k":6,"x":"vantar"},
          {"k":6,"x":"hamar"}
        ]
      }
    ]
  ],
  "stats":
  {
    "ambiguity":1.0,
    "num_parsed":1,
    "num_sentences":1,
    "num_tokens":5
  },
  "text":"Manninum á verkstæðinu vantar hamar",
  "valid":true
}
The `result` field contains the result of the annotation, as a list of paragraphs, each containing a list of sentences, each containing a list of annotations (under the `annotations` field). Of course, if a sentence is correct and has no annotations, its annotation list will be empty. An overview of error codes used in annotations is available here.
Each sentence entry has a `corrected` field containing a version of the sentence in which likely errors have been corrected. The `corrected` string includes corrections of most spelling errors but only a subset of suspected grammar errors; the system is intentionally less aggressive about automatically applying those (as can be seen in the example above).
Sentence entries also contain a list of `tokens`. The tokens originate in the Tokenizer package and contain the following fields:
- `i`: Character index of token start.
- `k`: Number identifying the token type (WORD, DATEREL, AMOUNT, etc.). The mapping from numbers to token types can be found in the documentation for the Tokenizer package.
- `o`: Original token text.
- `x`: Corrected text of token.

Other possible fields:

- `s`: Lemma of word. It can contain '-' if the lemma does not appear in BÍN and the word has been identified as a compound word.
- `c`: Part-of-speech (kk/kvk/hk, so, lo, ao, fs, st, etc.).
- `b`: Inflectional form given in BÍN. Can be '-' if the word cannot be inflected.
- `t`: Terminal that the token is connected to in the CFG.
- `v`: Token value (if applicable). Number, amount, date or name of currency.
- `f`: BÍN category (alm, ism, fyr, örn, etc.).
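As a rough illustration of how the `o` and `x` fields might be used together, the sketch below lists the tokens whose text was changed by the checker. The token list is a hand-written sample in the shape described above, not real API output, and the helper name `changed_tokens` is hypothetical. It also assumes that `o` may carry surrounding whitespace, so both fields are stripped before comparison.

```python
# Hand-written sample in the shape described above (not real API output).
sample_tokens = [
    {"k": 6, "o": "Manninum", "x": "Manninum"},
    {"k": 6, "o": "á", "x": "á"},
    {"k": 6, "o": "verkstæðinu", "x": "verkstæðinu"},
    {"k": 6, "o": "vanntar", "x": "vantar"},
    {"k": 6, "o": "hamar", "x": "hamar"},
]


def changed_tokens(tokens):
    """Return (index, original, corrected) for tokens whose text was altered."""
    changes = []
    for index, tok in enumerate(tokens):
        original = tok.get("o")
        corrected = tok.get("x")
        # Skip tokens that lack either field rather than guessing a value.
        if original is None or corrected is None:
            continue
        # Assumption: whitespace differences are not meaningful corrections.
        if original.strip() != corrected.strip():
            changes.append((index, original.strip(), corrected.strip()))
    return changes


print(changed_tokens(sample_tokens))
```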
Each annotation applies to a span of sentence tokens, starting at the token whose index is given in `start` and ending with the token whose index is in `end`. Both indices are 0-based and inclusive. Also, a starting character index is found in `start_char` and an ending index in `end_char`. Again, both are 0-based and inclusive. Note that these are character indices within the original source string, not byte indices.
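Because both kinds of index are inclusive, one has to be added to the end index when slicing in Python. The following minimal sketch uses the values from the example response above; the helper names `annotated_text` and `annotated_tokens` are hypothetical, and it assumes that the character indices correspond to positions in a Python `str`.

```python
def annotated_text(source, annotation):
    """Return the part of the source string covered by an annotation."""
    # end_char is inclusive, while Python slice ends are exclusive.
    return source[annotation["start_char"] : annotation["end_char"] + 1]


def annotated_tokens(tokens, annotation):
    """Return the sentence tokens covered by an annotation."""
    # end is inclusive as well, hence the + 1.
    return tokens[annotation["start"] : annotation["end"] + 1]


# Values copied from the example response shown earlier.
source = "Manninum á verkstæðinu vantar hamar"
tokens = [
    {"k": 6, "x": "Manninum"},
    {"k": 6, "x": "á"},
    {"k": 6, "x": "verkstæðinu"},
    {"k": 6, "x": "vantar"},
    {"k": 6, "x": "hamar"},
]
annotation = {"start": 0, "end": 2, "start_char": 0, "end_char": 21}

print(annotated_text(source, annotation))
print([tok["x"] for tok in annotated_tokens(tokens, annotation)])
```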
An annotation has a `code` which uniquely determines the type of error or warning. If the code ends with `/w`, it is a warning, otherwise it is an error.
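The `/w` rule can be checked with a simple suffix test. In the sketch below, the first code is taken from the example response, while `X999/w` is a made-up code used only to illustrate the warning case; the function names are hypothetical.

```python
def is_warning(annotation):
    """Return True if the annotation code marks a warning (ends with /w)."""
    return annotation["code"].endswith("/w")


def split_by_severity(annotations):
    """Split annotations into (errors, warnings), preserving their order."""
    errors = [ann for ann in annotations if not is_warning(ann)]
    warnings = [ann for ann in annotations if is_warning(ann)]
    return errors, warnings


sample_annotations = [
    {"code": "P_WRONG_CASE_þgf_þf"},  # code from the example response
    {"code": "X999/w"},  # made-up code, for illustration only
]

errors, warnings = split_by_severity(sample_annotations)
print(len(errors), "error(s),", len(warnings), "warning(s)")
```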
An annotation has a short, human-readable `text` field which describes the annotation succinctly, as well as a `detail` field which has further detail on the annotation, possibly containing grammatical explanations.
Finally, some annotations contain a `suggest` field with text that could replace the text within the token span, if the user agrees with the suggestion being made.
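One possible way to apply accepted suggestions is to splice them into the original text using the character span. This is only a sketch under a few assumptions: the character span covers the same text as the token span (as it does in the example response), the annotations do not overlap, and the user has accepted all of them. The function name `apply_suggestions` is hypothetical.

```python
def apply_suggestions(source, annotations):
    """Return source with each annotation's suggestion spliced in."""
    result = source
    # Work from right to left so that earlier offsets stay valid
    # even when a suggestion changes the length of the text.
    for ann in sorted(annotations, key=lambda a: a["start_char"], reverse=True):
        suggestion = ann.get("suggest")
        if suggestion is None:
            continue  # not every annotation carries a suggestion
        # end_char is inclusive, so the untouched tail starts one past it.
        result = result[: ann["start_char"]] + suggestion + result[ann["end_char"] + 1 :]
    return result


# Values copied from the example response shown earlier.
source = "Manninum á verkstæðinu vantar hamar"
annotations = [
    {"start_char": 0, "end_char": 21, "suggest": "Manninn á verkstæðinu"},
]

print(apply_suggestions(source, annotations))
```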
The result JSON further includes a `stats` field with information about the annotation job, such as the number of tokens and sentences processed, and how many of those sentences could be parsed. The `valid` field is `true` if the request was correctly formatted and could be processed without error, or `false` if there was a problem.
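A client would typically check `valid` before reading anything else. The sketch below does that and derives the share of sentences that could be parsed; the sample dictionary repeats the `stats` and `valid` values from the example response, and the function name `summarize` and the choice of raising `ValueError` are illustrative assumptions.

```python
def summarize(response):
    """Check the valid flag and return a few figures from the stats field."""
    if not response.get("valid", False):
        raise ValueError("The request could not be processed")
    stats = response.get("stats", {})
    num_sentences = stats.get("num_sentences", 0)
    num_parsed = stats.get("num_parsed", 0)
    # Avoid division by zero for empty input.
    parsed_ratio = num_parsed / num_sentences if num_sentences else 0.0
    return {
        "tokens": stats.get("num_tokens", 0),
        "sentences": num_sentences,
        "parsed_ratio": parsed_ratio,
    }


# Values copied from the example response shown earlier.
sample_response = {
    "stats": {"ambiguity": 1.0, "num_parsed": 1, "num_sentences": 1, "num_tokens": 5},
    "valid": True,
}

print(summarize(sample_response))
```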
The `/correct.api` endpoint supports several options that can be included with the request data, either as additional form fields (for `application/x-www-form-urlencoded` requests) or as JSON properties (for `application/json` requests).
Key | Type | Default | Explanation |
---|---|---|---|
annotate_unparsed_sentences | bool | true | Annotate sentence even when parsing fails |
suppress_suggestions | bool | false | Don't return suggestions |
ignore_wordlist | list | [] | Words to accept without comment |
ignore_rules | list | [] | Rules to ignore when annotating |
As an example, to suppress suggestions:
curl https://yfirlestur.is/correct.api -d "text=Manninum á verkstæðinu vantar hamar&suppress_suggestions=true"
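For `application/json` requests, the options would be sent as properties of the JSON body. The following sketch builds such a request with the Python standard library without sending it. It assumes that the text itself is passed as a `text` property, mirroring the form field, and the value given for `ignore_rules` is only an illustration of a list-valued option. The helper name `build_request` is hypothetical.

```python
import json
import urllib.request

API_URL = "https://yfirlestur.is/correct.api"


def build_request(text, **options):
    """Build (but do not send) a JSON POST request for the checker."""
    # Assumption: the text goes in a "text" property, like the form field.
    payload = {"text": text, **options}
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


req = build_request(
    "Manninum á verkstæðinu vantar hamar",
    suppress_suggestions=True,
    ignore_rules=["P_WRONG_CASE_þgf_þf"],  # illustrative value only
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually send the request:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     result = json.load(resp)
```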
As an example of accessing the Yfirlestur API from Python, here is a short demo program which submits two paragraphs of text to the spelling and grammar checker:
# $ pip install requests
import requests

# The text to check, two paragraphs of two and one sentences, respectively
my_text = (
    "Manninum á verkstæðinu vanntar hamar. Guðjón setti kókið í kælir.\n"
    "Mér dreimdi stórann brauðhleyf."
)

# Make the POST request, submitting the text
# Include additional keys in the dict if you want to specify options,
# such as dict(text=my_text, suppress_suggestions=True)
rq = requests.post("https://yfirlestur.is/correct.api", data=dict(text=my_text))

# Retrieve the JSON response
resp = rq.json()

# Enumerate through the returned paragraphs, sentences and annotations
for ix, pg in enumerate(resp["result"]):
    print(f"\n{ix+1}. efnisgrein")
    for sent in pg:
        print(f" {sent['corrected']}")
        for ann in sent["annotations"]:
            print(
                f" {ann['start']:03} {ann['end']:03} "
                f"{ann['code']:20} {ann['text']}"
            )
This program prints the following output:
$ python test.py
1. efnisgrein
Manninum á verkstæðinu vantar hamar.
000 002 P_WRONG_CASE_þgf_þf Á líklega að vera 'Manninn á verkstæðinu'
003 003 S004 Orðið 'vanntar' var leiðrétt í 'vantar'
Guðjón setti kókið í kælir.
004 004 P_NT_EndingIR Á sennilega að vera 'kæli'
2. efnisgrein
Mér dreymdi stóran brauðhleif.
000 000 P_WRONG_CASE_þgf_þf Á líklega að vera 'Mig'
001 001 S004 Orðið 'dreimdi' var leiðrétt í 'dreymdi'
002 002 S001 Orðið 'stórann' var leiðrétt í 'stóran'
003 003 S004 Orðið 'brauðhleyf' var leiðrétt í 'brauðhleif'
The open source GreynirCorrect engine that powers Yfirlestur.is is further documented here.
The service can be packaged and started in development mode using Docker. Run the following commands to start the service and expose it via HTTP on port 5002:
# Set internal Gunicorn (WSGI web server) user and password
if [ ! -f "./gunicorn_user.txt" ]; then
    echo 'root' > gunicorn_user.txt
    echo 'root' >> gunicorn_user.txt
fi
docker build -t yfirlestur:latest .
docker run -it -p 5002:5002 yfirlestur
For production use, the Docker container should be placed behind a robust web server such as nginx, and the Gunicorn user should be configured appropriately.
Parts of this software were developed under the auspices of the Icelandic Government's 5-year Language Technology Programme for Icelandic, managed by Almannarómur. The LT Programme is described here (English version here).
Yfirlestur.is is Copyright © 2023 Miðeind ehf.
The original author of this software is Vilhjálmur Þorsteinsson.
This software is licensed under the MIT License:
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
If you would like to use this software in ways that are incompatible with the standard MIT license, contact Miðeind ehf. to negotiate custom arrangements.