diff --git a/preprocessing/README.md b/preprocessing/README.md
index 0daa27c..c0eabd7 100644
--- a/preprocessing/README.md
+++ b/preprocessing/README.md
@@ -1,17 +1,16 @@
 # Data filtering, preprocessing and selection for further use
 
-- IP packet traces are taken [from here](https://mawi.wide.ad.jp/mawi/samplepoint-F/2023/), specifically from 2023/10/01-2023/10/31 (yet to confirm)
-- Filtering - TODO
+- IP packet traces are taken [from here](https://mawi.wide.ad.jp/mawi/samplepoint-F/2023/)
+- Filtering
   - L4 - Limit to TCP and UDP
-    - maybe GRE for VPN usage?
   - L3 - IPv6 is only around 10%, let's drop it
-- Selection (of fields):
+- Selection of fields:
   - Timestamp - capture window is from 0500-0515 UTC
-    - nanosecond precision, use DateTime64 data type in ClickHouse
+    - nanosecond precision, use `DateTime64` data type in ClickHouse
   - IP
     - addresses - src, dst
-    - protocol - TCP or UDP. cld go for boolean in ClickHouse to save space
-  - TCP/UDP
-    - ports - sport, dport
+  - L4 protocol - TCP, UDP. use `Enum` data type in ClickHouse
+  - TCP/UDP - ports - sport, dport
   - Packet size - in bytes
 
+- `sample_output.csv` contains a partial subset of `202310081400.pcap`, ~600K packets