Collect a dataset of emails labeled as spam (1) or ham (0). Popular datasets include the Enron Spam Dataset and the SpamAssassin Public Corpus.
Example (the sample emails are in Indonesian):
| Email | Label |
|-------------------------|-------|
| "Dapatkan diskon 50%" | 1 |
| "Berita baik untuk Anda"| 0 |
The steps in this process are as follows:
Example: "Saya pergi berbelanja!" ("I go shopping!") → ["saya", "pergi", "berbelanja"]
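The example above can be reproduced with a small tokenizer. This is a minimal sketch, assuming that preprocessing means lowercasing, dropping punctuation, and splitting into words; the function name `tokenize` is our own choice, not something defined in the text.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and return its word tokens."""
    # \w+ matches runs of letters, digits, and underscores,
    # so punctuation such as "!" is dropped.
    return re.findall(r"\w+", text.lower())

print(tokenize("Saya pergi berbelanja!"))  # ['saya', 'pergi', 'berbelanja']
```

A real pipeline might also remove stop words or apply stemming; those steps are left out here because the example does not show them.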
We count word occurrences using the Bag-of-Words (BoW) method.
| Word | Frequency |
|------------|-----------|
| diskon | 2 |
| hadiah | 1 |
| beli | 1 |
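A frequency table like the one above can be built with `collections.Counter`. This is a sketch only: the two token lists are hypothetical spam emails, chosen so that the totals match the table, and `bag_of_words` is a helper name we made up.

```python
from collections import Counter

def bag_of_words(token_lists):
    """Sum word counts over a collection of tokenized emails."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts

# Hypothetical tokenized spam emails (not taken from a real dataset).
spam_tokens = [["diskon", "hadiah"], ["beli", "diskon"]]
counts = bag_of_words(spam_tokens)
print(counts["diskon"], counts["hadiah"], counts["beli"])  # 2 1 1
```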
We then compute the prior probability of each class, spam and ham.
P(Spam) = Total Spam Emails / Total Emails
P(Ham) = Total Ham Emails / Total Emails
Example:
If there are 3 spam and 2 ham emails:
P(Spam) = 3 / 5 = 0.6
P(Ham) = 2 / 5 = 0.4
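The same priors can be computed directly from a list of labels. This is a minimal sketch, assuming labels are encoded as 1 for spam and 0 for ham as in the dataset description; `class_priors` is an illustrative name.

```python
from collections import Counter

def class_priors(labels):
    """Return P(class) for every label as count / total."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

labels = [1, 1, 1, 0, 0]  # 3 spam, 2 ham
priors = class_priors(labels)
print(priors)  # {1: 0.6, 0: 0.4}
```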
Next, we compute the probability of each word given the spam and ham classes.
P(Word|Spam) = Frequency of Word in Spam / Total Words in Spam
P(Word|Ham) = Frequency of Word in Ham / Total Words in Ham
Example:
If "diskon" appears 2 times out of 5 total words in spam:
P(diskon|Spam) = 2 / 5 = 0.4
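If "diskon" does not appear in ham and we use Laplace smoothing (add 1 to the word count, and add the number of features, i.e. distinct words, to the denominator):
P(diskon|Ham) = (0 + 1) / (Total words in ham + Total features)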
= 1 / (3 + 1) = 0.25
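Both estimates can be expressed with one function. This is a sketch under stated assumptions: the parameter `alpha` (the smoothing strength) is our own addition, with `alpha=0` giving the unsmoothed ratio and `alpha=1` giving Laplace smoothing; the counts and the vocabulary size of 1 are taken from the worked example above.

```python
def word_likelihood(word_count, total_words, vocab_size, alpha=1.0):
    """Estimate P(word | class) with additive (Laplace) smoothing.

    alpha=0 reproduces the plain frequency ratio;
    alpha=1 is classic Laplace smoothing.
    """
    return (word_count + alpha) / (total_words + alpha * vocab_size)

# Unsmoothed estimate: "diskon" appears 2 times among 5 spam words.
print(word_likelihood(2, 5, vocab_size=1, alpha=0))  # 0.4

# Smoothed estimate: "diskon" never appears among 3 ham words.
print(word_likelihood(0, 3, vocab_size=1))  # 0.25
```

In practice the same `alpha` is usually applied to every word in both classes, so that no likelihood is ever exactly zero.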
We use the Naive Bayes formula to predict the class of an email.
P(Spam|Email) ∝ P(Spam) * P(Word1|Spam) * P(Word2|Spam) * ...
P(Ham|Email) ∝ P(Ham) * P(Word1|Ham) * P(Word2|Ham) * ...
Example:
If the email contains the words "diskon" and "hari":
P(Spam|Email) ∝ 0.6 * P(diskon|Spam) * P(hari|Spam)
P(Ham|Email) ∝ 0.4 * P(diskon|Ham) * P(hari|Ham)
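The comparison of the two scores can be sketched as follows. Several things here are assumptions: the values for P(hari|Spam) and P(hari|Ham) are hypothetical, since the text does not give them, and the function names are our own. Only the priors and the two "diskon" likelihoods come from the earlier examples. The sketch sums logarithms instead of multiplying probabilities, a common way to avoid numeric underflow on long emails; the ranking of the classes is unchanged.

```python
import math

def log_score(prior, likelihoods, words):
    """log P(class) + sum of log P(word | class) over the email's words."""
    return math.log(prior) + sum(math.log(likelihoods[w]) for w in words)

def predict(words, priors, likelihoods):
    """Return the class label with the highest log score."""
    scores = {
        label: log_score(priors[label], likelihoods[label], words)
        for label in priors
    }
    return max(scores, key=scores.get)

priors = {"spam": 0.6, "ham": 0.4}
likelihoods = {
    "spam": {"diskon": 0.4, "hari": 0.2},   # "hari" value is hypothetical
    "ham": {"diskon": 0.25, "hari": 0.25},  # "hari" value is hypothetical
}
print(predict(["diskon", "hari"], priors, likelihoods))  # spam
```

With these numbers the spam score is 0.6 * 0.4 * 0.2 = 0.048 and the ham score is 0.4 * 0.25 * 0.25 = 0.025, so the email is classified as spam. The sketch assumes every word has a smoothed, non-zero likelihood in both classes.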
The metrics used for evaluation include accuracy, precision, and recall.
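These three metrics can be computed from true and predicted labels as sketched below. Spam (1) is treated as the positive class, and the label lists are hypothetical, made up only to exercise the function.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Guard against division by zero when nothing is predicted or labeled positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

y_true = [1, 1, 1, 0, 0]  # hypothetical ground truth
y_pred = [1, 1, 0, 0, 1]  # hypothetical predictions
accuracy, precision, recall = evaluate(y_true, y_pred)
print(accuracy, precision, recall)
```

Here precision answers "of the emails flagged as spam, how many really were spam?" and recall answers "of the real spam emails, how many were caught?".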
Integrate the model into the email system to detect spam in real time.
Example: If the model detects spam, the email in question is flagged as spam.
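The flagging step could look roughly like this. Everything in the sketch is hypothetical: `route_email`, the folder names, and the keyword-based `stub_classifier` are placeholders standing in for a trained model and a real mail system.

```python
from typing import Callable

def route_email(text: str, is_spam: Callable[[str], bool]) -> str:
    """Return the folder an incoming email should go to."""
    return "spam" if is_spam(text) else "inbox"

def stub_classifier(text: str) -> bool:
    """Placeholder for a trained model: flags emails mentioning 'diskon'."""
    return "diskon" in text.lower()

print(route_email("Dapatkan diskon 50%", stub_classifier))     # spam
print(route_email("Berita baik untuk Anda", stub_classifier))  # inbox
```

Passing the classifier in as a function keeps the routing logic independent of how the model is trained or loaded.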