update readme

Kaushik Narayan R 2024-11-02 22:25:05 -07:00
parent 59ea030790
commit 3bf8565a16

@@ -1,17 +1,16 @@
 # Data filtering, preprocessing and selection for further use
-- IP packet traces are taken [from here](https://mawi.wide.ad.jp/mawi/samplepoint-F/2023/), specifically from 2023/10/01-2023/10/31 (yet to confirm)
+- IP packet traces are taken [from here](https://mawi.wide.ad.jp/mawi/samplepoint-F/2023/)
-- Filtering - TODO
+- Filtering
 - L4 - Limit to TCP and UDP
-- maybe GRE for VPN usage?
 - L3 - IPv6 is only around 10%, let's drop it
-- Selection (of fields):
+- Selection of fields:
 - Timestamp
 - capture window is from 0500-0515 UTC
-- nanosecond precision, use DateTime64 data type in ClickHouse
+- nanosecond precision, use `DateTime64` data type in ClickHouse
 - IP
 - addresses - src, dst
-- protocol - TCP or UDP. cld go for boolean in ClickHouse to save space
+- L4 protocol - TCP, UDP. use `Enum` data type in ClickHouse
-- TCP/UDP
-- ports - sport, dport
+- TCP/UDP - ports - sport, dport
 - Packet size - in bytes
+- `sample_output.csv` contains a partial subset of `202310081400.pcap`, ~600K packets
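
The filtering and field selection described in the README could be sketched roughly as below. This is only an illustration, not the repository's actual script: the names `PacketRecord` and `select_fields` are hypothetical, the input is assumed to be a raw IPv4 packet with the link-layer header already stripped, and "packet size" is assumed to mean the IPv4 total-length field. IP fragmentation is not handled.

```python
import ipaddress
import struct
from typing import NamedTuple, Optional

# IANA protocol numbers for the two L4 protocols that are kept.
L4_PROTOCOLS = {6: "TCP", 17: "UDP"}


class PacketRecord(NamedTuple):
    """Hypothetical row layout mirroring the selected fields."""
    timestamp_ns: int  # nanoseconds since epoch (maps to DateTime64 with precision 9)
    src: str
    dst: str
    proto: str         # "TCP" or "UDP" (maps to an Enum column)
    sport: int
    dport: int
    size: int          # bytes; assumed here to be the IPv4 total length


def select_fields(timestamp_ns: int, ip_packet: bytes) -> Optional[PacketRecord]:
    """Return the selected fields, or None if the packet is filtered out."""
    if len(ip_packet) < 20:
        return None                      # too short to hold an IPv4 header
    if ip_packet[0] >> 4 != 4:
        return None                      # L3 filter: drop anything that is not IPv4
    proto = L4_PROTOCOLS.get(ip_packet[9])
    if proto is None:
        return None                      # L4 filter: keep only TCP and UDP
    header_len = (ip_packet[0] & 0x0F) * 4
    if header_len < 20 or len(ip_packet) < header_len + 4:
        return None                      # malformed header or truncated L4 header
    (total_length,) = struct.unpack_from("!H", ip_packet, 2)
    src = str(ipaddress.IPv4Address(ip_packet[12:16]))
    dst = str(ipaddress.IPv4Address(ip_packet[16:20]))
    # TCP and UDP both start with source port, then destination port.
    sport, dport = struct.unpack_from("!HH", ip_packet, header_len)
    return PacketRecord(timestamp_ns, src, dst, proto, sport, dport, total_length)
```

Rows produced this way would line up with the columns listed above (timestamp, addresses, L4 protocol, ports, size); how they are written to CSV or loaded into ClickHouse is left out of the sketch.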