WebBEAT

  • backend program for checking live and extinct websites

/usr/local/bin/python3.9 --version

Prerequisites

  • System - runs on Windows, macOS or any other environment; a Linux server is recommended
  • Python 3.9+
  • pip + dependencies listed in
  • a dedicated service user is recommended, e.g. 'webbeat'

Installation of WebBEAT

  • prepare a path for this project
  • clone this project and set up cron with the desired checking frequency; once a month is recommended
cd /opt/
git clone https://github.com/JanMeritus/WebBEAT.git
cd WebBEAT
mkdir logs
python3 -m pip check    # check main dependencies status
python3 -m pip install  # install dependencies 

Basic usage

  • the script run can be configured with several options, detailed in the help output
$ python3 WebBEAT.py --help
usage: WebBEAT.py [-h] [-e ENDPOINT] [-s SEEDS] [-p PAUSE] [-t TIMEOUTMARGIN] [-r MAXREDIRECTS] [--whois_c] [--no-whois]

optional arguments:
  -h, --help            show this help message and exit
  -e ENDPOINT, --Endpoint ENDPOINT
                        set API DB endpoint; -e {endpoint adress}/api/v2
  -s SEEDS, --Seeds SEEDS
                        set API seeds list; -s 'https://webarchiv.cz https://nkp.cz' OR dont specify and get it from seeds endpoint
  -ss SEEDSSERVICE, --SeedsService SEEDSSERVICE
                        set full adress for seeds API service; -ss {endpoint adress}
  -bss BATCHSEEDSSERVICE, --BatchSeedsService BATCHSEEDSSERVICE
                        set batch size for seeds service; -bss {integer}
  -p PAUSE, --Pause PAUSE
                        set Pause between seeds, def. for Whois 61 s.; -p 10
  -t TIMEOUTMARGIN, --TimeoutMargin TIMEOUTMARGIN
                        set Timeout Margin call constraint in live requests, def. 0.02;
  -r MAXREDIRECTS, --MaxRedirects MAXREDIRECTS
                        set Max Redirects constraint in live requests, def. 12;
  --whois_c             Activate WHOIS checking procedure, def. activated. Use as parameter; --whois_c
  --no-whois            Desactivate WHOIS checking procedure, def. activated. Use as parameter; --no-whois
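
The option list above can be mirrored in a small `argparse` declaration. This is only a sketch of how such a command line might be declared, not the actual code in WebBEAT.py: the function name `build_parser`, the shared `whois` destination, the numeric types, and the choice of 61 as the default pause are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical re-creation of the options shown in the help output."""
    parser = argparse.ArgumentParser(prog="WebBEAT.py")
    parser.add_argument("-e", "--Endpoint", help="API DB endpoint")
    parser.add_argument("-s", "--Seeds", help="space-separated seed URLs")
    parser.add_argument("-ss", "--SeedsService", help="seeds API service address")
    parser.add_argument("-bss", "--BatchSeedsService", type=int,
                        help="batch size for the seeds service")
    # Assumed default: the help text mentions 61 s as the WHOIS default.
    parser.add_argument("-p", "--Pause", type=float, default=61,
                        help="pause between seeds in seconds")
    parser.add_argument("-t", "--TimeoutMargin", type=float, default=0.02)
    parser.add_argument("-r", "--MaxRedirects", type=int, default=12)
    # Both flags write to one destination; WHOIS stays on unless --no-whois is given.
    parser.add_argument("--whois_c", dest="whois", action="store_true")
    parser.add_argument("--no-whois", dest="whois", action="store_false")
    parser.set_defaults(whois=True)
    return parser


args = build_parser().parse_args(
    ["-p", "10", "--no-whois", "-s", "https://webarchiv.cz https://nkp.cz"]
)
seeds = args.Seeds.split()  # the -s value is one string of space-separated URLs
```

Routing `--whois_c` and `--no-whois` to a single destination keeps the "activated by default" behaviour described in the help text while letting either flag be passed explicitly.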

Operation

  • endpoint for data export -- a DB endpoint for importing the data is recommended; it is part of the main repository https://github.com/WebarchivCZ/extinct-websites, which assumes a relational DB -- the data can also easily be sent to a custom NoSQL DB endpoint, e.g. MongoDB
  • seed import -- can be done via the file system (e.g. for tests) -- the recommended way is to import seeds from the data endpoint that is part of the main project https://github.com/WebarchivCZ/extinct-websites, or from any other provider of JSON-structured data of the form {'data':[{'url':'seed'},{'url':'seed'}]}
  • time schedule decision -- for a large number of websites, running this script once per month is recommended
  • single web vs batch decision -- because of the specific implementation, the script sends page data serially; the data model, however, assumes a batch approach
  • whois decision -- decide whether you want to use the whois module -- it is implemented here specifically for the Czech CZ.NIC provider; for international use, switch the functionality of the tweaked library and set longer pauses, e.g. 120 seconds
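
The seed structure mentioned in the list above can be unpacked and split into batches along these lines. This is a minimal sketch rather than WebBEAT's own code: the helper names `extract_seeds` and `batched` are hypothetical, and the payload is assumed to arrive as valid JSON (double-quoted keys) with the same `data` / `url` layout.

```python
import json
from typing import Iterator


def extract_seeds(payload: str) -> list[str]:
    """Return the seed URLs from a {"data": [{"url": ...}, ...]} document."""
    doc = json.loads(payload)
    # Skip entries with a missing or empty "url" value.
    return [item["url"] for item in doc.get("data", []) if item.get("url")]


def batched(seeds: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` seeds."""
    if size < 1:
        raise ValueError("batch size must be a positive integer")
    for start in range(0, len(seeds), size):
        yield seeds[start:start + size]


payload = '{"data": [{"url": "https://webarchiv.cz"}, {"url": "https://nkp.cz"}, {"url": ""}]}'
seeds = extract_seeds(payload)
batches = list(batched(seeds, 1))
```

With a batch size of 50, as in the `-bss 50` crontab example, a list of 120 seeds would be processed as slices of 50, 50 and 20.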

Crontab installation

#crontab -e
# no-whois example with urlFeeder service and batching
1 * * * *  python3 /opt/WebBEAT/WebBEAT.py -p 2 --no-whois -ss http://121.0.0.1/api/urlFeeder/ -t 30 -bss 50 -e http://121.0.0.1/api/v2/ >> /opt/WebBEAT/logs/WebBEAT_$(date +\%Y\%m\%d_\%H\%M).log
# whois option example without service
0 1 1 * *  python3 /opt/WebBEAT/WebBEAT.py -p 120 --whois_c  -e http://121.0.0.1/api/v2/  -s "seeds1 seed2 ..." >> /opt/WebBEAT/logs/WebBEAT_$(date +\%Y\%m\%d__\%H\%M).log
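
In the crontab entries above, each `%` in the `date` format is escaped as `\%` because cron otherwise treats an unescaped `%` as a newline. The timestamped file name produced by `date +%Y%m%d_%H%M` can be reproduced in Python as follows; this is only an illustration of the naming pattern, the helper `log_name` is hypothetical, and nothing here implies that WebBEAT builds its own log names.

```python
from datetime import datetime


def log_name(now: datetime) -> str:
    """Build a log file name matching WebBEAT_$(date +%Y%m%d_%H%M).log."""
    return now.strftime("WebBEAT_%Y%m%d_%H%M.log")


# A run at 01:00 on the first day of the month, as in the monthly cron example.
example = log_name(datetime(2024, 1, 1, 1, 0))
```

One log file per run keeps the output of consecutive runs separate, which makes it easier to compare results between checks.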

Dedication

For Webarchive of the National Library of the Czech Republic

Supported by

Carried out as part of the institutional research of the National Library of the Czech Republic, funded by the Ministry of Culture of the Czech Republic under the Long-term Conceptual Development of a Research Organization programme.