Data storage (Binance to Parquet)

Language: Python

Source files: realtime.py, restful.py, utils.py

Output folder: /data

Utilizes Binance's API library (binance-connector)

A program that demonstrates the use of the Binance API and operations on Parquet files.

Resources:

| Design choice made | Reason |
| --- | --- |
| "snappy" compression | Fastest query time for column-selected queries of all compression types, even faster than uncompressed (130% faster than the next-best method, GZIP). (Source: Riz Ang, "What is Apache Parquet file") |
| PyArrow engine for writing and reading Parquet files | Natively supported by Pandas, the library used for data framing, and faster than other engines at writing files. |
| FastParquet engine for appending to Parquet files | PyArrow does not support appending to Parquet files. |
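
A minimal sketch of these three choices in Pandas; the DataFrame contents and file names are placeholders, not code from this repo:

```python
import pandas as pd

# Placeholder rows; the real program frames Binance kline/trade records.
df = pd.DataFrame({"price": [42000.1, 42000.5], "qty": [0.01, 0.25]})

# Write and read with the PyArrow engine, compressing with snappy.
df.to_parquet("demo.parquet", engine="pyarrow", compression="snappy")
df2 = pd.read_parquet("demo.parquet", engine="pyarrow")

# PyArrow cannot append to an existing Parquet file; FastParquet can
# (appending is most reliable when FastParquet also wrote the file).
df.to_parquet("append_demo.parquet", engine="fastparquet")
df.to_parquet("append_demo.parquet", engine="fastparquet", append=True)
```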

RESTful API Test

Module used: Spot from binance.spot

Reference: rest-api.md

Implementation: restful.py

Uses the Binance RESTful API, an easy interface for pulling historical data.

Tests independent reading and writing of Parquet files.

Tests column-selective querying. A sketch of the flow follows.
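
A minimal sketch of this flow, assuming the binance-connector package; the kline column names follow rest-api.md, and the output path is illustrative rather than taken from restful.py:

```python
import pandas as pd
from binance.spot import Spot

client = Spot()
# Pull up to 1000 candles per request (see the note under Results).
klines = client.klines("BTCUSDT", "15m", limit=1000)

# The twelve fields of a kline record, per rest-api.md.
cols = ["open_time", "open", "high", "low", "close", "volume",
        "close_time", "quote_volume", "trades",
        "taker_buy_base", "taker_buy_quote", "ignore"]
df = pd.DataFrame(klines, columns=cols)

# Write with the PyArrow engine and snappy compression.
df.to_parquet("data/klines.parquet", engine="pyarrow", compression="snappy")

# Column-selective query: only the requested columns are read from disk.
subset = pd.read_parquet("data/klines.parquet", columns=["open_time", "close"])
```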


WebSocket Stream Test (real-time data pulling)

Module used: SpotWebsocketStreamClient from binance.websocket.spot.websocket_stream

Reference document: web-socket-streams.md

Implementation: realtime.py

Uses a WebSocket stream, useful for listening to real-time data.

PyArrow does not support appending to Parquet files, so FastParquet is used for appending instead.

Observation: the first interval in the stream raises an exception, so exception handling was implemented to avoid writing that interval into the data files. A sketch of this flow follows.
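
A minimal sketch of the streaming flow, assuming binance-connector's SpotWebsocketStreamClient; the file path, message parsing, and exception types are illustrative, not the exact code in realtime.py:

```python
import json
import os
import pandas as pd
from binance.websocket.spot.websocket_stream import SpotWebsocketStreamClient

PATH = "data/stream.parquet"

def on_message(_, message):
    try:
        k = json.loads(message)["k"]  # kline payload of the stream event
        row = pd.DataFrame([{"open_time": k["t"], "close": float(k["c"])}])
        # PyArrow cannot append, so FastParquet handles the appends;
        # the very first write creates the file instead of appending.
        row.to_parquet(PATH, engine="fastparquet", append=os.path.exists(PATH))
    except (KeyError, json.JSONDecodeError):
        # The first message is a subscription acknowledgement, not a
        # kline; skip it rather than writing it into the data file.
        pass

client = SpotWebsocketStreamClient(on_message=on_message)
client.kline(symbol="btcusdt", interval="1m")
```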

CSV to Parquet

Utilizing Binance public data, download the monthly data using the command:

python download-kline.py -t spot -s BTCUSDT -i 15m -folder D:\Learning\y2s2\comp3500\Binance-public-API-pull-to-Parquet-Demo\public_kline

Binance will then return a folder of zip files:

[Screenshot: CMD showing the download process]

Using the unzip step in csv_to_parquet.py, extract all files in the given folder (a sketch follows); result:

[Screenshot: extracted CSV files]
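
A minimal sketch of the unzip-and-convert step; the folder names mirror the download command above, and header=None assumes the public kline CSVs carry no header row:

```python
import glob
import zipfile
import pandas as pd

# Extract every downloaded archive into one folder.
for zip_path in glob.glob("public_kline/**/*.zip", recursive=True):
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall("public_kline/extracted")

# Convert each extracted CSV to a snappy-compressed Parquet file.
for csv_path in glob.glob("public_kline/extracted/*.csv"):
    df = pd.read_csv(csv_path, header=None)
    df.to_parquet(csv_path.replace(".csv", ".parquet"),
                  engine="pyarrow", compression="snappy")
```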


Week 7 research progress: focus on converting and storing tick data CSV files

Focused on tick data, i.e. trades.

Goal:

Analysis

Sampled data from 6 days (17–22 April 2024):

CSV:

| Total | Per day |
| --- | --- |
| 12 012 572 entries | ~2 002 095 entries |
| 888 357 908 bytes | ~148 059 651 bytes |

Parquet control test (converting only the data from the 19th, the peak day):

| Property | Value | Note |
| --- | --- | --- |
| Number of entries | 2 828 283 | |
| CSV file size | 209 333 843 bytes | |
| Parquet file size | 54 456 020 bytes | ~74% size reduction |
| Time to convert and write CSV to Parquet | ~1.44 seconds | |
| Time to write each entry | ~0.50 microseconds | |

Parquet:

| Total | Per file | Per entry |
| --- | --- | --- |
| 220 309 998 bytes | ~36 718 333 bytes (36.7 MB) | 18.339 bytes |
| 5.9533 seconds to convert CSV to Parquet | ~0.99 seconds per file | ~0.5 microseconds |
  • 75.2% size reduction
  • 1 GB of HDD space can store ~27 days of tick data
  • A relatively affordable 256 GB of space can store ~6 971 days of tick data (~19.1 years)
  • Converting and storing a year's worth of tick data takes ~362 seconds (~6 minutes)
  • Zipping the Parquet files yields a further 55% size reduction, for an 85.5% total compression ratio
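
The per-file figures above can be reproduced with a small timing harness; a sketch, with illustrative paths:

```python
import os
import time
import pandas as pd

csv_path, pq_path = "trades.csv", "trades.parquet"

# Time the full read-convert-write cycle for one day of tick data.
start = time.perf_counter()
df = pd.read_csv(csv_path, header=None)
df.to_parquet(pq_path, engine="pyarrow", compression="snappy")
elapsed = time.perf_counter() - start

print(f"entries:       {len(df)}")
print(f"csv bytes:     {os.path.getsize(csv_path)}")
print(f"parquet bytes: {os.path.getsize(pq_path)}")
print(f"seconds:       {elapsed:.2f}")
print(f"us per entry:  {elapsed / len(df) * 1e6:.2f}")
```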

Dependencies

The following libraries are required:

pip install pandas 
pip install binance-connector
pip install pyarrow
pip install fastparquet

Results

REST API testing result

Note: Binance has a limit of 1000 intervals per request; a pagination sketch follows the screenshot below.

[Screenshot: tables from reading the Parquet files]
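
A sketch of one way to work around the 1000-interval cap, paging by startTime; the function name is illustrative, not code from restful.py:

```python
from binance.spot import Spot

client = Spot()

def klines_paged(symbol, interval, start_ms, end_ms):
    """Collect klines beyond the 1000-per-request limit."""
    out = []
    while start_ms < end_ms:
        batch = client.klines(symbol, interval,
                              startTime=start_ms, endTime=end_ms, limit=1000)
        if not batch:
            break
        out.extend(batch)
        # Field 6 of a kline record is its close time; resume just after it.
        start_ms = batch[-1][6] + 1
        if len(batch) < 1000:
            break
    return out
```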

Real-time data pulling and appending

[Screenshot: tables showing the data progression]

Selective reading

Two tables showing a column-selected query and a column + row-filtered query, respectively. A sketch of both reads follows the screenshot below.

Note: the index column is pandas' DataFrame index and is not modifiable; the index of a filtered row does not match its original position in the full table.

[Screenshot: table showing the filtered data]
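
A sketch of the two selective reads, assuming the PyArrow engine; the column names and filter predicate are illustrative:

```python
import pandas as pd

# Column-selected query: read only the named columns from disk.
cols = pd.read_parquet("data/klines.parquet", columns=["open_time", "close"])

# Column + row-filtered query: PyArrow applies the row predicate while
# reading; as noted above, the index of the filtered rows may not match
# their original positions in the full table.
filtered = pd.read_parquet(
    "data/klines.parquet",
    columns=["open_time", "close"],
    filters=[("close", ">", 42000)],
)
```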

CSV to Parquet

[Screenshot: a view of the written Parquet file]
