# Data filtering, preprocessing, and selection for further use

## Traffic data

- IP packet traces are taken from the [MAWI Working Group traffic archive (samplepoint F, 2023)](https://mawi.wide.ad.jp/mawi/samplepoint-F/2023/)
- Filtering
  - L4: limit to TCP and UDP
  - L3: IPv6 accounts for only around 10% of packets, so it is dropped
- Selection of fields (see the extraction sketch below):
  - Timestamp
    - capture window is 0500-0515 UTC
    - nanosecond precision; use the `DateTime64` data type in ClickHouse
  - IP
    - addresses: src, dst
    - L4 protocol: TCP or UDP; use the `Enum` data type in ClickHouse
  - TCP/UDP ports: sport, dport
  - Packet size, in bytes
- `sample_output.csv` contains a partial subset of `202310081400.pcap`, roughly 600K packets
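As a rough illustration of the filtering and field selection above, the sketch below walks a pcap file with scapy and keeps only the listed columns. It is an assumption-laden example (scapy as the parsing library, the function name `extract_rows`), not the repository's actual `pcap_processor.py`.

```python
# Minimal sketch of the per-packet filtering and field selection described above.
# Assumes scapy is installed; structure and names are illustrative only.
from scapy.all import PcapReader, IP, TCP, UDP

def extract_rows(pcap_path):
    """Yield (timestamp, src, dst, protocol, sport, dport, size) for TCP/UDP over IPv4."""
    with PcapReader(pcap_path) as packets:
        for pkt in packets:
            if IP not in pkt:           # drop non-IPv4 traffic (including IPv6)
                continue
            if TCP in pkt:
                l4, proto = pkt[TCP], "TCP"
            elif UDP in pkt:
                l4, proto = pkt[UDP], "UDP"
            else:                       # drop anything that is not TCP or UDP
                continue
            yield (
                pkt.time,               # epoch seconds; scapy keeps sub-second precision
                pkt[IP].src,
                pkt[IP].dst,
                proto,
                l4.sport,
                l4.dport,
                len(pkt),               # captured packet length in bytes
            )
```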
## IP geolocation database

- This project uses the IP2Location LITE database for [IP geolocation](https://lite.ip2location.com)
- A bit of preprocessing is applied to leave out the country code and convert IP addresses from decimal format to dotted-string format (see the conversion sketch below)
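The exact preprocessing script is not shown in this README, so the following is only a minimal sketch of the conversion step using the standard library. The column layout (ip_from, ip_to, country_code, country_name) follows the IP2Location LITE DB1 CSV and is an assumption here.

```python
# Sketch of the decimal-to-dotted-string conversion mentioned above.
# Assumed input columns: ip_from, ip_to, country_code, country_name.
import csv
import ipaddress

def convert(in_csv, out_csv):
    with open(in_csv, newline="") as src, open(out_csv, "w", newline="") as dst:
        reader, writer = csv.reader(src), csv.writer(dst)
        for ip_from, ip_to, country_code, country_name in reader:
            writer.writerow([
                str(ipaddress.ip_address(int(ip_from))),  # e.g. 16777216 -> "1.0.0.0"
                str(ipaddress.ip_address(int(ip_to))),
                country_name,                             # country code column left out
            ])
```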
# Setting up Kafka

- Download and install Kafka [from here](https://kafka.apache.org/downloads)
- Run each of the following commands in a separate terminal, from the installation directory
- Zookeeper:
  - Windows: `.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties`
  - Mac: `bin/zookeeper-server-start.sh config/zookeeper.properties`
- Kafka broker:
  - Windows: `.\bin\windows\kafka-server-start.bat .\config\server.properties`
  - Mac: `bin/kafka-server-start.sh config/server.properties`
- Creating a Kafka topic (see the smoke-test sketch below):
  - Windows: `.\bin\windows\kafka-topics.bat --create --topic %topicname% --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1`
  - Mac: `bin/kafka-topics.sh --create --topic $topicname --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1`
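Once the broker is running and a topic exists, a quick way to confirm the setup end to end is to push a single test message from Python. The sketch below assumes the `kafka-python` package and a topic literally named `traffic`; both are assumptions for illustration, not part of the repository.

```python
# Smoke test: send one message to the newly created topic.
# Assumes `pip install kafka-python` and a broker on localhost:9092;
# the topic name "traffic" is illustrative.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("traffic", b"hello from the pcap pipeline")
producer.flush()   # block until the message is actually delivered
producer.close()
```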
# Streaming from pcap file using Kafka

- Start Zookeeper and the Kafka broker whenever the Python code is run after a machine reboot
- Run the `pcap_processor.py` file
- Arguments (see the argparse sketch at the end of this section):
  - `-f` or `--pcap_file`: pcap file path, mandatory argument
  - `-o` or `--out_file`: output CSV file path
  - `-x` or `--sample`: boolean flag indicating whether the data should be sampled
  - `-s` or `--stream`: boolean flag indicating whether Kafka streaming should happen
  - `--stream_size`: integer giving the number of sampled packets
  - `-d` or `--debug`: boolean flag indicating whether the program runs in debug mode
- Examples:
  - `python pcap_processor.py -f C:/Users/akash/storage/Asu/sem3/dds/project/202310081400.pcap -s --stream_size 1000`
  - `python pcap_processor.py -c sample_output.csv -s --stream_size 1000`
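For reference, the sketch below shows how an argparse interface matching the flags listed above could look. It is a hypothetical reconstruction for illustration, not the actual argument handling in `pcap_processor.py`; the `-c` flag used in the second example is not documented above and is therefore omitted, and all defaults and help strings are assumptions.

```python
# Hypothetical argparse sketch mirroring the documented flags of pcap_processor.py.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Stream packets from a pcap file to Kafka")
    parser.add_argument("-f", "--pcap_file", required=True, help="pcap file path")
    parser.add_argument("-o", "--out_file", help="output CSV file path")
    parser.add_argument("-x", "--sample", action="store_true", help="sample the data")
    parser.add_argument("-s", "--stream", action="store_true", help="stream packets to Kafka")
    parser.add_argument("--stream_size", type=int, default=1000, help="number of sampled packets")
    parser.add_argument("-d", "--debug", action="store_true", help="run in debug mode")
    return parser.parse_args()

if __name__ == "__main__":
    print(parse_args())
```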