[P] I trained a Mamba-3 log anomaly detector that hit 0.9975 F1 on HDFS — and I’m curious how far this can go
![[P] I trained a Mamba-3 log anomaly detector that hit 0.9975 F1 on HDFS — and I’m curious how far this can go](/_next/image?url=https%3A%2F%2Fpreview.redd.it%2F3hrr4prgbzsg1.png%3Fwidth%3D140%26height%3D120%26auto%3Dwebp%26s%3Dad74d593f251847f8d8acb4e0fc71c0f5679f4bf&w=3840&q=75)
| Experiment #324 ended well. ;) This time I built a small project around log anomaly detection. In about two days, I went from roughly 60% effectiveness in the first runs to a final F1 score of 0.9975 on the HDFS benchmark. Under my current preprocessing and evaluation setup, LogAI reaches F1=0.9975, which is slightly above the 0.996 HDFS result reported for LogRobust in a recent comparative study. What that means in practice:
What I find especially interesting is that this is probably the first log anomaly detection model built on top of Mamba-3 / SSM, which was only published a few weeks ago. The model is small:
For comparison, my previous approach took around 20 hours to train. The dataset here is the classic HDFS benchmark from LogHub / Zenodo, based on Amazon EC2 logs:
This benchmark has been used in a lot of papers since 2017, so it’s a useful place to test ideas. The part that surprised me most was not just the score, but what actually made the difference. I started with a fairly standard NLP-style approach:
That got me something like 0.61–0.74 F1, depending on the run. It looked reasonable at first, but I kept hitting a wall. Hyperparameter tuning helped a bit, but not enough. The breakthrough came when I stopped treating logs like natural language. Instead of splitting lines into subword tokens, I switched to template-based tokenization: one log template = one token representing an event type. So instead of feeding the model something like text, I feed it sequences like this: [5, 3, 7, 5, 5, 3, 12, 12, 5, ...] Where for example:
That one change did a lot at once:
The second important change was matching the classifier head to the architecture. Mamba is causal, so the last token carries a compressed summary of the sequence context. Once I respected that in the pooling/classification setup, the model started behaving the way I had hoped. The training pipeline was simple:
Data split was 70% train / 10% val / 20% test, so the reported F1 is on sessions the model did not see during training. Another useful thing is that the output is not just binary. The model gives a continuous anomaly score from 0 to 1. So in production this could be used with multiple thresholds, for example:
Or with an adaptive threshold that tracks the baseline noise level of a specific system. A broader lesson for me: skills and workflows I developed while playing with AI models for chess transfer surprisingly well to other domains. That’s not exactly new - a lot of AI labs started with games, and many still do - but it’s satisfying to see it work in practice. Also, I definitely did not get here alone. This is a combination of:
Very rough split:
Now I’ll probably build a dashboard and try this on my own Astrography / Astropolis production logs. Or I may push it further first on BGL, Thunderbird, or Spirit. Honestly, I still find it pretty wild how much can now be done on a gaming PC if you combine decent hardware, public research, and newer architectures quickly enough. Curious what people here think:
If there’s interest, I can also share more about the preprocessing, training loop, and the mistakes that got me stuck at 60-70% before it finally clicked. P.S. I also tested its effectiveness and reproducibility across different seeds. On most of them, it actually performed slightly better than before. [link] [comments] |
Want to read more?
Check out the full article on the original site