Now that my environment is set up, I can finally start doing some coding.
My current project is creating a spam filter for OkCupid messages. I have a dataset of about 700 messages, as well as the profile of the woman the messages were sent to.
Setup: The profile belongs to a straight cis woman. The messages are first messages written to her by straight men.
As I mentioned in my last post, I initially had a CSV of these messages that I loaded into a MySQL table.
I have also started thinking about what features I should use to analyze this data and what the end goal is.
Some of the initial features I thought of were:
- length of messages (number of characters or number of words)
- match percentage
- enemy percentage
- keywords
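To make the list above more concrete, here is a minimal sketch of how those features might be pulled out of a single message. Everything named here is an assumption for illustration: the `MessageFeatures` container, the `extract_features` function, and especially the `KEYWORDS` set, which is a placeholder and not a list I have actually settled on. The match and enemy percentages are assumed to arrive alongside the message text.

```python
from dataclasses import dataclass

# Hypothetical keyword list, purely for illustration.
KEYWORDS = {"hey", "beautiful", "sexy"}


@dataclass
class MessageFeatures:
    num_chars: int      # length of the message in characters
    num_words: int      # length of the message in whitespace-separated words
    match_pct: float    # match percentage with the recipient
    enemy_pct: float    # enemy percentage with the recipient
    keyword_hits: int   # how many words appear in the keyword list


def extract_features(text, match_pct, enemy_pct, keywords=KEYWORDS):
    """Turn one message plus its match/enemy percentages into features."""
    words = text.lower().split()
    # Strip common punctuation so "beautiful," still counts as "beautiful".
    cleaned = [w.strip(".,!?\"'") for w in words]
    hits = sum(1 for w in cleaned if w in keywords)
    return MessageFeatures(
        num_chars=len(text),
        num_words=len(words),
        match_pct=match_pct,
        enemy_pct=enemy_pct,
        keyword_hits=hits,
    )


features = extract_features("Hey beautiful, how are you?", 85.0, 10.0)
print(features)
```

Splitting on whitespace is the simplest possible word count; a real tokenizer would handle emoji and odd punctuation better, but this is enough to start exploring the data.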
I also started thinking about what type of learning algorithm to use. I decided that I should do some form of supervised learning.
I was initially planning on just labeling the messages as spam or not spam. However, I quickly realized that I needed a third category, which I am calling "terrible".
Here are the definitions I am using so far.
Spam: A message that makes no reference to the profile of the woman it was sent to. This type of message is often really short, focuses on the woman's looks, or asks lame, generic questions ("How are you", "how's your weekend").
Terrible: A message that does reference the woman's profile, but typically only in a shallow way. It may focus on the woman's looks, ask no questions, or be otherwise terrible in some way.
I realize this is all arbitrary labeling, but I have to start somewhere.
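As a rough illustration of how these definitions could be turned into a first-pass guess, here is a small sketch. It is not necessarily how the actual labeling is done; the `Label` enum, the `OK` name for the remaining category, the tiny `STOPWORDS` set, and the `suggest_label` function are all hypothetical. The heuristic simply assumes that a message "references the profile" if it shares at least one non-trivial word with it, and that a message with no question mark asks no questions.

```python
import re
from enum import Enum


class Label(Enum):
    SPAM = "spam"
    TERRIBLE = "terrible"
    OK = "ok"  # hypothetical name for messages that are neither


# Tiny hypothetical stopword list; a real one would be much longer.
STOPWORDS = {
    "the", "a", "an", "and", "or", "i", "you", "your", "my",
    "is", "are", "to", "of", "in", "how", "hey", "hi",
}


def content_words(text):
    """Lowercased words in the text, minus stopwords."""
    return set(re.findall(r"[a-z']+", text.lower())) - STOPWORDS


def suggest_label(message, profile):
    """Suggest a label using a crude word-overlap heuristic."""
    shared = content_words(message) & content_words(profile)
    if not shared:
        # No overlap with the profile at all: treat as spam.
        return Label.SPAM
    if "?" not in message:
        # Mentions the profile but asks nothing: treat as terrible.
        return Label.TERRIBLE
    return Label.OK


profile = "I love hiking and reading science fiction."
print(suggest_label("Hey, how are you?", profile))
print(suggest_label("You like hiking. Cool.", profile))
print(suggest_label("Which hiking trail is your favorite?", profile))
```

A heuristic like this would only ever be a starting point for hand labeling, since word overlap says nothing about whether the reference to the profile is shallow.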
You can check out the code for the labeling on my GitHub.