Mirror of https://github.com/20kaushik02/real-time-traffic-analysis-clickhouse.git (synced 2025-12-06 08:04:06 +00:00)
Data filtering, preprocessing and selection for further use
Traffic data
- IP packet traces are taken from here
- Filtering
- L4 - Limit to TCP and UDP
- L3 - IPv6 is only around 10% of the traffic, so it is dropped and only IPv4 is kept
- Selection of fields:
- Timestamp
- capture window is from 0500-0515 UTC
- nanosecond precision, use the DateTime64 data type in ClickHouse
- IP
- addresses - src, dst
- L4 protocol - TCP, UDP. Use the Enum data type in ClickHouse
- TCP/UDP - ports - sport, dport
- Packet size - in bytes
sample_output.csv contains a partial subset of 202310081400.pcap, ~600K packets
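The filtering and field selection described above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function name `select_fields`, the output keys, and the choice of the IPv4 total-length field as the "packet size" are all assumptions. It also assumes the input starts at the IP header and ignores IP fragmentation.

```python
import socket
import struct

# IANA protocol numbers for the two L4 protocols kept by the filter.
L4_PROTOCOLS = {6: "TCP", 17: "UDP"}


def select_fields(ip_packet: bytes, timestamp_ns: int):
    """Return the selected fields for an IPv4 TCP/UDP packet, else None.

    Hypothetical helper: `ip_packet` is assumed to start at the IP header
    (link-layer header already stripped) and `timestamp_ns` is the capture
    time in nanoseconds since the epoch.
    """
    if len(ip_packet) < 20:
        return None  # too short to hold a minimal IPv4 header
    if ip_packet[0] >> 4 != 4:
        return None  # L3 filter: drop IPv6 and anything else that is not IPv4
    proto = L4_PROTOCOLS.get(ip_packet[9])
    if proto is None:
        return None  # L4 filter: keep only TCP and UDP
    header_len = (ip_packet[0] & 0x0F) * 4  # IHL is counted in 32-bit words
    if header_len < 20 or len(ip_packet) < header_len + 4:
        return None  # malformed header or truncated before the ports
    # Both TCP and UDP start with source port, then destination port.
    sport, dport = struct.unpack("!HH", ip_packet[header_len:header_len + 4])
    return {
        "time": timestamp_ns,
        "src": socket.inet_ntoa(ip_packet[12:16]),
        "dst": socket.inet_ntoa(ip_packet[16:20]),
        "l4_proto": proto,
        "sport": sport,
        "dport": dport,
        # Assumption: packet size is the IPv4 total-length field, in bytes.
        "size": struct.unpack("!H", ip_packet[2:4])[0],
    }
```

Packets that fail either filter come back as None, so a caller can skip them before writing CSV rows or inserting into ClickHouse.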
IP geolocation database
- This project uses the IP2Location LITE database for IP geolocation
- A bit of preprocessing is done to leave out the country code and convert IP addresses from decimal format to dotted string format
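The preprocessing step above might look roughly like the following sketch. The function names are hypothetical, and the column layout (ip_from, ip_to, country_code, then the remaining columns) is an assumption about the CSV being processed, not something stated in this README.

```python
import ipaddress


def decimal_to_dotted(ip_decimal: int) -> str:
    """Convert an IPv4 address stored as a 32-bit integer to dotted form."""
    return str(ipaddress.IPv4Address(ip_decimal))


def preprocess_row(row):
    """Drop the country code and convert both range bounds to dotted strings.

    Assumed input layout: [ip_from, ip_to, country_code, *other_columns].
    """
    ip_from, ip_to, _country_code, *rest = row
    return [decimal_to_dotted(int(ip_from)), decimal_to_dotted(int(ip_to)), *rest]
```

For example, `decimal_to_dotted(3232235777)` returns `"192.168.1.1"`.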
Setting up Kafka
- Download and install Kafka from here
- Run each command in a separate terminal from the Kafka installation directory
- Zookeeper:
- Windows:
.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties
- Mac:
bin/zookeeper-server-start.sh config/zookeeper.properties
- Kafka Broker:
- Windows:
.\bin\windows\kafka-server-start.bat .\config\server.properties
- Mac:
bin/kafka-server-start.sh config/server.properties
- Creating a Kafka topic:
- Windows:
.\bin\windows\kafka-topics.bat --create --topic %topicname% --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
- Mac:
bin/kafka-topics.sh --create --topic %topicname% --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
Streaming from pcap file using Kafka
- After a machine reboot, start Zookeeper and the Kafka broker again before running the Python code
- Run pcap_processor.py file
- Arguments
- -f or --pcap_file: pcap file path, mandatory argument
- -o or --out_file: output csv file path
- -x or --sample: boolean value indicating if data has to be sampled
- -s or --stream: boolean value indicating if kafka streaming should happen
- --stream_size: integer indicating number of sampled packets
- -d or --debug: boolean value indicating if program is run in debug mode
python pcap_processor.py -f C:/Users/akash/storage/Asu/sem3/dds/project/202310081400.pcap -s --stream_size 1000
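The argument list above maps naturally onto `argparse`. The following is a sketch of how such a command-line interface could be declared, not the actual contents of pcap_processor.py: the `build_parser` name, the help strings, and the defaults (boolean flags off, no output file, no stream size) are assumptions. The boolean options are modelled as value-less flags because the example command passes `-s` without a value.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented arguments (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Process a pcap file")
    parser.add_argument("-f", "--pcap_file", required=True,
                        help="pcap file path (mandatory)")
    parser.add_argument("-o", "--out_file",
                        help="output CSV file path")
    parser.add_argument("-x", "--sample", action="store_true",
                        help="sample the data")
    parser.add_argument("-s", "--stream", action="store_true",
                        help="stream packets to Kafka")
    parser.add_argument("--stream_size", type=int,
                        help="number of sampled packets")
    parser.add_argument("-d", "--debug", action="store_true",
                        help="run in debug mode")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With this layout, `-f trace.pcap -s --stream_size 1000` yields `stream=True`, `stream_size=1000`, and leaves `sample` and `debug` at `False`.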