[GH-ISSUE #4] How to train Naive Bayes Classifier ? #5

Closed
opened 2026-03-04 00:58:17 +03:00 by kerem · 6 comments
Owner

Originally created by @JQuags on GitHub (Aug 22, 2020).
Original GitHub issue: https://github.com/spamscanner/spamscanner/issues/4

Is there more information on how to train the classifier?

I see in the source that classifier.json is currently private, which explains the broken links on the site.

The source indicates that removing classifier.json and setting SPAM_CATEGORY and SCAN_DIRECTOR should be all that is needed to train. Is that all, and do I then feed it a directory of spam or ham in EML or ARF format?
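For readers unfamiliar with the general idea, here is a minimal sketch of training a Naive Bayes classifier from a directory of EML files. It is written in Python for illustration and is not spamscanner's own training script: the names `tokens_from_eml`, `NaiveBayes`, and `train_directory` are hypothetical, the category and directory are passed as plain arguments rather than environment variables, and ARF reports are not handled.

```python
import math
import re
from collections import Counter, defaultdict
from email import message_from_bytes, policy
from pathlib import Path


def tokens_from_eml(raw: bytes) -> list[str]:
    """Parse one EML message and return lowercase tokens from subject and body."""
    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    text = f"{msg['subject'] or ''} {body.get_content() if body else ''}"
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def __init__(self) -> None:
        self.doc_counts = Counter()              # messages seen per category
        self.word_counts = defaultdict(Counter)  # token counts per category
        self.vocabulary = set()                  # every token seen in training

    def learn(self, tokens, category):
        self.doc_counts[category] += 1
        self.word_counts[category].update(tokens)
        self.vocabulary.update(tokens)

    def categorize(self, tokens):
        """Return the most likely category, or None if nothing was learned."""
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for category, docs in self.doc_counts.items():
            counts = self.word_counts[category]
            # The extra +1 reserves probability mass for unseen tokens.
            denom = sum(counts.values()) + len(self.vocabulary) + 1
            score = math.log(docs / total_docs)  # log prior
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best, best_score = category, score
        return best


def train_directory(model, directory, category):
    """Learn every *.eml file in `directory` under the given category."""
    for path in sorted(Path(directory).glob("*.eml")):
        model.learn(tokens_from_eml(path.read_bytes()), category)
```

Under these assumptions, training would be one call per labeled folder, for example `train_directory(model, "corpus/spam", "spam")` followed by `train_directory(model, "corpus/ham", "ham")`, where both paths are placeholders.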

kerem closed this issue 2026-03-04 00:58:18 +03:00

@wis commented on GitHub (Sep 13, 2020):

I thought you provided a well-trained classifier.json. The link in the README returns a 404; why was it removed? @niftylettuce


@JQuags commented on GitHub (Sep 14, 2020):

  • "(spam dataset is private at the moment)" is what the comments in the source say.

I suspect it has never been provided, and there may be a privacy reason.


@niftylettuce commented on GitHub (Sep 14, 2020):

I should have this published in the near future. For now I have had to focus on something else. But this is not a privacy concern anymore, as I have sha256-hashed all the tokens.
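As a rough illustration of what hashing tokens can look like, the sketch below replaces each token with its SHA-256 hex digest before it would be counted. It uses Python's standard `hashlib`; the helper names `hash_token` and `hash_tokens` are hypothetical, and the project's actual tokenization and hashing options may differ.

```python
import hashlib


def hash_token(token: str) -> str:
    """Return the SHA-256 hex digest of a single token."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def hash_tokens(tokens):
    """Hash every token so the stored model holds digests, not raw words."""
    return [hash_token(token) for token in tokens]
```

Because the same token always maps to the same digest, token counts are unaffected, as long as incoming mail is hashed the same way before classification.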


@wis commented on GitHub (Sep 16, 2020):

Good! Can we contribute to the training data by forwarding spam emails from our inboxes to an email address you set up?


@niftylettuce commented on GitHub (Sep 16, 2020):

abuse@forwardemail.net works


@titanism commented on GitHub (Dec 22, 2025):

See https://github.com/spamscanner/spamscanner?tab=readme-ov-file#custom-classifier

You'd just write the JSON file you train to classifier.json and then load it.

You can also make it do sha256 hashing (customizable).
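The write-then-load round trip can be sketched generically as below. This is only an illustration in Python: the JSON layout (`doc_counts`, `word_counts`) and the function names are assumptions and are not spamscanner's actual classifier.json schema or API, so the linked README section remains the reference for real usage.

```python
import json
from collections import Counter
from pathlib import Path


def save_classifier(path, doc_counts, word_counts):
    """Write trained counts to a JSON file (hypothetical layout)."""
    payload = {
        "doc_counts": dict(doc_counts),
        "word_counts": {cat: dict(words) for cat, words in word_counts.items()},
    }
    Path(path).write_text(json.dumps(payload), encoding="utf-8")


def load_classifier(path):
    """Read the JSON file back into Counter objects."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    doc_counts = Counter(payload["doc_counts"])
    word_counts = {cat: Counter(words) for cat, words in payload["word_counts"].items()}
    return doc_counts, word_counts
```

Keeping the model as plain JSON means it can be trained in one process and loaded in another without retraining.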


v6 is released. We will update classifier.json (there's one published now with sha256) after the @fwdemail integration (we're on the older v5). The current classifier.json is not that accurate, but we will improve it after the integration (since we process millions of emails daily, it'll be very accurate soon enough).

https://github.com/spamscanner/spamscanner

https://github.com/spamscanner/spamscanner/releases

X post/announcement @ https://x.com/fwdemail/status/2002872581402063281

We also support TypeScript in the project now (thanks to AI; we despise TS internally, though).
