Wednesday, September 23, 2015

Adventures in Data Science: Labeling Data

Now that my environment is set up, I can finally start doing some coding.

My current project is a spam filter for OkCupid messages. I have a dataset of about 700 messages, as well as the profile of the woman the messages were sent to.

Setup: The profile is of a straight cis-woman. The messages are from straight men writing first messages to her.

As I mentioned in my last post, I initially had a CSV of these messages that I loaded into a MySQL table.

I have also started thinking about what features I should use to analyze this data and what the end goal is.

Some of the initial features I thought of were:
- length of messages (number of characters or number of words)
- match percentage
- enemy percentage
- keywords
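
To make the list above concrete, here is a rough sketch of how those features could be pulled out of a single message. This is only an illustration, not the code from my repo: the function name `extract_features`, the feature names, and the keyword list are all made up for the example, and it assumes the match and enemy percentages are already available as numbers.

```python
import re

# Hypothetical keyword list, chosen only for illustration.
EXAMPLE_KEYWORDS = ("hey", "beautiful", "weekend")


def extract_features(text, match_pct, enemy_pct, keywords=EXAMPLE_KEYWORDS):
    """Turn one message into a flat dict of numeric features."""
    # Lowercase and split into words, keeping apostrophes ("how's").
    words = re.findall(r"[a-z']+", text.lower())
    features = {
        "num_chars": len(text),
        "num_words": len(words),
        "match_pct": match_pct,
        "enemy_pct": enemy_pct,
    }
    # One count per keyword, matching whole words only.
    for kw in keywords:
        features["kw_" + kw] = words.count(kw)
    return features


example = extract_features("Hey beautiful, how's your weekend?", 82, 10)
```

A flat dict per message keeps things simple for now; it can be turned into a row of numbers later, once I know which features are worth keeping.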

I also started thinking about what type of learning algorithm to use. I decided that I should do some form of supervised learning.

I was initially planning on just labeling the messages as spam or not. However, I quickly realized that I needed another category of messages. This new category I am calling "terrible".

Here are the definitions I am using so far.
Spam: A message that makes no reference to the profile of the woman being messaged. This type of message is often really short, focuses on the woman's looks, or asks lame arbitrary questions ("How are you", "how's your weekend").

Terrible: This message does reference the woman's profile in some way, but typically in a shallow way. It may focus on the woman's looks, ask no questions, or be otherwise terrible in some way.
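
Since I am labeling by hand, it helps to pin the categories down in code so a typo can't sneak in as a fourth label. Below is a minimal sketch of that idea, not the actual labeling script. The names `Label` and `parse_label` are hypothetical, and so is `OK`, which is just a placeholder name for messages that are neither spam nor terrible.

```python
from enum import Enum


class Label(Enum):
    SPAM = "spam"
    TERRIBLE = "terrible"
    OK = "ok"  # assumed name for messages that are neither of the above


def parse_label(raw):
    """Map a hand-typed answer such as 's' or 'Terrible' to a Label."""
    cleaned = raw.strip().lower()
    for label in Label:
        # Accept the full name or any non-empty prefix of it.
        if cleaned and label.value.startswith(cleaned):
            return label
    raise ValueError("unrecognized label: %r" % raw)
```

With this, typing `s`, `t`, or `o` at a prompt is enough to label a message, and anything else raises an error instead of being stored.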

I realize this is all arbitrary labeling, but I have to start somewhere.

You can check out the code for the labeling on my GitHub.
