Language_Identification



Description of Dataset

WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs in 235 languages. Each language in the dataset has 1,000 rows/paragraphs.

After data selection and preprocessing, I used 22 languages selected from the original dataset:

  • English
  • Arabic
  • French
  • Hindi
  • Urdu
  • Portuguese
  • Persian
  • Pushto
  • Spanish
  • Korean
  • Tamil
  • Turkish
  • Estonian
  • Russian
  • Romanian
  • Chinese
  • Swedish
  • Latin
  • German
  • Dutch
  • Japanese
  • Thai
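
The selection step can be sketched roughly as follows. This is only an illustration, not the preprocessing code used in this repo: the helper names `select_languages` and `rows_per_language` are hypothetical, and it assumes the data has already been read into `(text, language)` pairs with language names spelled as in the list above.

```python
from collections import Counter

# The 22 languages listed above, spelled as in this README.
SELECTED_LANGUAGES = {
    "English", "Arabic", "French", "Hindi", "Urdu", "Portuguese",
    "Persian", "Pushto", "Spanish", "Korean", "Tamil", "Turkish",
    "Estonian", "Russian", "Romanian", "Chinese", "Swedish", "Latin",
    "German", "Dutch", "Japanese", "Thai",
}


def select_languages(samples, keep=SELECTED_LANGUAGES):
    """Keep only the (text, language) pairs whose language is in `keep`."""
    return [(text, lang) for text, lang in samples if lang in keep]


def rows_per_language(samples):
    """Count how many rows each language contributes."""
    return Counter(lang for _, lang in samples)


if __name__ == "__main__":
    # Tiny made-up sample, only to show the call pattern.
    demo = [("hello world", "English"), ("hola mundo", "Spanish"),
            ("saluton mondo", "Esperanto")]
    kept = select_languages(demo)
    print(rows_per_language(kept))
```

With 1,000 rows per language in the original dataset, a subset of 22 languages would hold 22,000 rows, provided preprocessing does not drop any.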

Description of Repo!!

This is a language identification tool deployed with Flask. It is a simple language prediction tool, so don't be surprised if it occasionally gives a wrong result. It works well, but it still has a long way to go. The project evaluates the results of different models across 3 different algorithms.

The project has 2 .sav files:

  • unigram_model.sav, a unigram feature model that uses logistic regression as the classification algorithm.
  • unigram_model_rfc.sav, which uses a random forest classifier as the classification algorithm.
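
As a rough sketch of how such a file might be used: the code below assumes the .sav files were written with Python's `pickle` module and that the saved object is a scikit-learn style estimator whose `predict` accepts raw strings. Neither assumption is confirmed by this README (the files could have been saved with joblib, or the vectorizer could be stored separately), and `load_model` and `predict_language` are hypothetical helper names.

```python
import pickle


def load_model(path):
    """Load a pickled object from `path`.

    Only unpickle files you trust: unpickling can execute arbitrary code.
    """
    with open(path, "rb") as fh:
        return pickle.load(fh)


def predict_language(model, text):
    """Return the predicted label for one piece of text.

    Assumes `model.predict` takes a list of raw strings (for example a
    pipeline that bundles the unigram vectorizer with the classifier).
    """
    return model.predict([text])[0]


# Hypothetical usage, with the path depending on where the file lives:
# model = load_model("unigram_model.sav")
# print(predict_language(model, "This is an English sentence."))
```

Unpickling a scikit-learn model also requires scikit-learn to be installed, ideally in the same version that saved the file.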

The Web folder contains the main code to run the server. You can run it with the following command:

python3 main.py
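
For orientation, a Flask prediction endpoint for a tool like this could look like the sketch below. It is not the contents of main.py: the `/predict` route, the JSON request and response shapes, and the `create_app` and `predict_fn` names are all assumptions made for illustration.

```python
from flask import Flask, jsonify, request


def create_app(predict_fn):
    """Build a small Flask app around `predict_fn(text) -> language name`.

    `predict_fn` is a hypothetical stand-in for whatever wraps the model.
    """
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        # Read a JSON body of the assumed form {"text": "..."}.
        payload = request.get_json(silent=True)
        if not isinstance(payload, dict):
            payload = {}
        text = str(payload.get("text", "")).strip()
        if not text:
            return jsonify(error="empty input"), 400
        return jsonify(language=predict_fn(text))

    return app


if __name__ == "__main__":
    # Dummy predictor so the sketch runs without a trained model.
    create_app(lambda text: "English").run(debug=True)
```

Passing the predictor in as an argument keeps the route independent of how the model is loaded, which makes it easy to try with a stub.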

The requirements are listed in requirements.txt. LI is the virtual environment, which you can use for setting up the project. The Data folder contains the dataset used in the project. You may see the project demo here


Setting Up the Project in your machine

  • Fork the GitHub repo to create a copy in your account.
  • Clone the repo
git clone https://github.com/honeybhardwaj/Language_Identification.git
  • Activate the virtual environment
source LI/bin/activate
  • Install dependencies
pip3 install -r requirements.txt
  • Run the server from the Web directory
cd Web
python3 main.py

Congratulations!! Everything is set up for development. Go ahead and contribute. Contact me if you have any questions, and open an issue before contributing.


How it Looks

Screenshot from 2021-03-27 22-14-05 Screenshot from 2021-03-27 22-14-22


Prediction Images

Screenshot from 2021-03-27 22-44-59 Screenshot from 2021-03-27 22-44-11


Project Admin ❤️

Happy Coding 👨‍💻

Please don't forget to give a star ⭐ if you liked it.