Dynamic Bayesian driven Story listing

Forum: LXer Meta Forum
Total Replies: 6

Jul 09, 2004
9:03 AM EDT
Dave, I think your system for selecting news with a Bayesian filter is very interesting, and I think it should get some coverage elsewhere. I submitted a story to Slashdot a while back, but it was rejected. I was thinking that the news stories on your site, and on any other news site, just sit passively in the sequence once entered and fall off once the length of the line has been reached. Why not try using Bayesian filtering to determine the fall-off rate? Highly rated stories, or stories with a high score, get to stay longer. Some sites have a "most popular" section, but I think what I am proposing would be better, maybe flagged with an "hours since posted" time stamp or something.
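
To make the proposal concrete, here is a minimal sketch of a score-dependent fall-off. It is only an illustration under stated assumptions, not anything LXer actually runs: the `Story` fields, the linear mapping, and the 6 and 48 hour limits are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Story:
    title: str
    score: float      # hypothetical filter score in [0, 1]; higher = more "worth reading"
    age_hours: float  # hours since posted


def lifetime_hours(score, min_hours=6.0, max_hours=48.0):
    """Map a score in [0, 1] linearly onto a front-page lifetime (assumed limits)."""
    score = min(max(score, 0.0), 1.0)  # clamp out-of-range scores
    return min_hours + score * (max_hours - min_hours)


def front_page(stories):
    """Keep stories younger than their score-dependent lifetime, newest first."""
    alive = [s for s in stories if s.age_hours < lifetime_hours(s.score)]
    return sorted(alive, key=lambda s: s.age_hours)


stories = [
    Story("High score, a day old", 0.9, 24.0),
    Story("Low score, a day old", 0.2, 24.0),
    Story("Low score, fresh", 0.2, 2.0),
]
for s in front_page(stories):
    print(f"{s.title} ({s.age_hours:.0f} hours since posted)")
```

With these made-up numbers the day-old high scorer is still listed, while the day-old low scorer has already dropped off.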

Jul 09, 2004
9:08 AM EDT
Hey, that's a very good idea. I hadn't considered using Bayesian algorithms to determine the front page "drop off" rate.

The main problem is that my Bayes code is far from perfect, and I'm not satisfied with it. I'm getting a lot of "in the middle" type returns, between 30% and 70%, but very few near 0% or 99%, which is what I want.
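
For reference, here is a minimal sketch of the combination step commonly used in naive Bayes text filters, where per-token probabilities are merged as prod(p) / (prod(p) + prod(1 - p)). This is not the site's code, and the token probabilities are invented for illustration. It only shows one possible source of "in the middle" scores: when most tokens sit near 0.5, the combined result stays near the middle too.

```python
import math


def combine(token_probs, eps=1e-9):
    """Naive Bayes combination: prod(p) / (prod(p) + prod(1 - p)), in log space."""
    clamped = [min(max(p, eps), 1.0 - eps) for p in token_probs]
    log_yes = sum(math.log(p) for p in clamped)
    log_no = sum(math.log(1.0 - p) for p in clamped)
    # Clamp the log-odds so math.exp cannot overflow on long token lists.
    diff = max(-700.0, min(700.0, log_yes - log_no))
    return 1.0 / (1.0 + math.exp(-diff))


# Hypothetical token probabilities, not taken from any real training data.
weak_tokens = [0.5, 0.55, 0.45, 0.6]    # mostly uninformative words
strong_tokens = [0.9, 0.9, 0.9, 0.8]    # words that clearly favor one class

print(f"weak tokens:   {combine(weak_tokens):.2f}")
print(f"strong tokens: {combine(strong_tokens):.4f}")
```

If that is what is happening, restricting the combination to the tokens farthest from 0.5 is one thing that might be worth trying, though that is a guess rather than a diagnosis.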

I wonder whether, if I put the code up somewhere, anyone else would be interested in taking it and actually making it work. What is really needed is a Bayes class (PHP) that really works well (developerWorks had one, but it's not generic enough). I emailed Eric Raymond (author of bogofilter) and suggested he make this tool and, while he expressed some interest, he hasn't acted on it.

The bayes stuff here needs to be much better before I'll put it in production use for the users.

bstadil: are you a mathematician?


Jul 09, 2004
10:30 AM EDT
Interesting. I found this: http://www.phpgeek.com/pragmacms/index.php?layout=main&cslot...

It's a GPL generic Bayesian filter, which is EXACTLY what I've been looking for. This wasn't available several months ago but it sure is now. I'm playing with it at the moment and we'll see how it goes.

If I can get this perfected, then I'll definitely be incorporating it all over the place, including in user preferences (your own personal newswire with the stuff that the system knows you enjoy).


Jul 09, 2004
4:49 PM EDT
I have been using POPFile for a year or so and it is extremely accurate. I just checked my accuracy and it's at 99.54%, so the generic Bayesian filter is probably an excellent choice.

You asked whether I am a mathematician. No, not as such, but I have an advanced degree in Operations Research. You know, model building and optimization stuff.

Maybe you want to make something dead simple for the variable "drop off" rate, in case it turns out to be confusing or not something people want.

Second, I have mixed feelings about having your own ranking and selection of stories. I kind of like the idea that you benefit from others' views and that the selection is a bit of a group effort.

Jul 09, 2004
8:04 PM EDT
The front page, I think, would always be available as it is right now, but I'm thinking of giving another sorting option, which would be a personal sorter.

Not a lot of success with the naivebayesian (sic) tool today. I'll play with it again tomorrow, if I find time between gardening sessions. :)

I'll bounce any ideas off you as I get 'em.


Jul 11, 2004
11:37 AM EDT
I think it would be good if there were more explicit guidelines as to just what "Worth Reading" means. I just read "The Five Top Objections to Open Source". It was an excellent article... for someone considering moving toward open source software. It would certainly be "worth reading" for that category of reader. That category of reader would probably be in the minority here, and also less likely to vote than the regulars, who might not find it "worth reading". I consider those readers to be quite important. I would like to see more articles like that on the site, since I see LXer as being an important portal for people just starting to investigate open source, even if their presence is transitory. (OSS is not a passion for those people, after all.) As discussed in a previous thread, some people may vote negatively simply because an article points out a problem with OSS as opposed to certain proprietary software, or positively for an article which is mainly a fluffy Linux "love fest" without much substance. Without clear guidelines, that's perfectly understandable.

Bayesian filtering is a statistical method, and as with any statistical method, it is subject to Garbage In->Garbage Out. The more consistent the voting criteria, the more effective the filter will be.

Another point that comes to mind regards voting frequency. If the Bayesian filter does its job well, then many people will be seeing articles which are "worth reading" and only a few which are not. So, do they vote positively for even the average articles (which are theoretically "worth reading" but not especially so) or do they vote only for the exceptionally good ones and against the exceptionally bad ones?

Remember, in spam filtering, it doesn't take that much "spam in the ham" or vice versa to throw off the filter. Or at least that's my understanding.

Jul 12, 2004
8:05 AM EDT
You make some interesting points, but I do not think they are major. If the Bayesian filter determines the Fall-Off rate and the Cut-Off rate, then the lesser ranked stories will still have a time span in which they can be voted on, albeit a shorter one than the initially highly ranked stories. Once implemented, a positive vote could result in a Kicker of some kind, with the Kicker value to be determined by how the system in general performs.
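
A rough sketch of how such a Kicker might work, assuming each positive vote simply adds a fixed number of hours to a story's score-based lifetime. The function name, the linear lifetime, and every constant here are hypothetical and would need tuning against how the system actually performs.

```python
def remaining_hours(score, age_hours, positive_votes,
                    min_hours=6.0, max_hours=48.0, kicker_hours=2.0):
    """Hours a story has left on the front page (0.0 means it has fallen off)."""
    score = min(max(score, 0.0), 1.0)                    # clamp score to [0, 1]
    base = min_hours + score * (max_hours - min_hours)   # score-based lifetime
    bonus = positive_votes * kicker_hours                # the Kicker from votes
    return max(0.0, base + bonus - age_hours)


# A low-ranked story (score 0.2) that is 12 hours old:
print(f"{remaining_hours(0.2, 12.0, positive_votes=0):.1f} hours left, no votes")
print(f"{remaining_hours(0.2, 12.0, positive_votes=3):.1f} hours left, 3 votes")
```

Even a low-ranked story gets a window in which votes can extend its stay, which is the point made above.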
