
Wednesday, September 23, 2015

Adventures in Data Science: Labeling Data

Now that my environment is set up, I can finally start doing some coding.

My current project is creating a spam filter for OkCupid messages. I have a dataset of about 700 messages, as well as the profile of the woman the messages were sent to.

Setup: The profile is of a straight cis-woman. The messages are from straight men writing first messages to her.

As I mentioned in my last post, I initially had a CSV of these messages that I loaded into a MySQL table.

I have also started thinking about what features I should use to analyze this data and what the end goal is.

Some of the initial features I thought of were:
- length of messages (number of characters or number of words)
- match percentage
- enemy percentage
- keywords
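As a rough sketch of how the text-based features above could be computed, here is a small Python function. The function name, the tokenizing regex, and the example keyword set are all my own assumptions for illustration. Match and enemy percentages come from the site itself, so they are not computed here.

```python
import re

def message_features(text, keywords):
    """Return simple length and keyword features for one message.

    `keywords` is a hypothetical set of lowercase words to look for.
    """
    # Split into lowercase words; apostrophes are kept so "how's" is one word.
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "keyword_hits": sum(1 for w in words if w in keywords),
    }

example = message_features("Hey, how's your weekend?", {"weekend", "hiking"})
```

A dictionary per message like this could later be turned into a row of a feature table for whichever supervised learner ends up being used.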

I also started thinking about what type of learning algorithm to use. I decided that I should do some form of supervised learning.

I was initially planning on just labeling the messages as spam or not. However, I quickly realized that I needed another category of messages. This new category I am calling "terrible".

Here are the definitions I am using so far.
Spam: A message that makes no reference to the profile of the woman being messaged. This type of message is often really short, focuses on the woman's looks, or asks lame, arbitrary questions ("How are you", "how's your weekend").

Terrible: This message does reference the woman's profile in some way, but typically in a shallow way. This message may focus on the woman's looks, doesn't ask any questions, or is otherwise terrible in some way.

I realize this is all arbitrary labeling, but I have to start somewhere.

You can check out the code for the labeling on my GitHub.

Wednesday, September 16, 2015

Adventures in Data Science: MySQL on a Chromebook

After a thousand million years, I was finally able to put data into a MySQL table.

I first tried to follow this tutorial to learn how to use MySQL with Python.

However, I quickly got stuck at the step of:
$ mysql -u root -p
This brought me the first error message:
- "ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)"

I followed this down a rabbit hole of other commands and error messages, some of which are listed below.

$ /etc/init.d/mysql
- "Rather than invoking init scripts through /etc/init.d, use the service(8)
utility, e.g. service mysql start
/etc/init.d/mysql: 54: /etc/init.d/mysql: initctl: not found

Since the script you are attempting to invoke has been converted to an
Upstart job, you may also use the start(8) utility, e.g. start mysql
/etc/init.d/mysql: 82: /etc/init.d/mysql: start: not found
/usr/bin/service: 123: exec: start: not found"


$ service mysql start
- "start : Unknown job: mysql"

Many hours of research brought me to this page after I searched for "initctl not found crouton": https://github.com/dnschneid/crouton/wiki/Running-servers-in-crouton

The problem apparently arose because I was running Crouton on a Chromebook. It turns out the MySQL server doesn't start automatically when the chroot boots up. Never would have guessed that.

Once I applied the solution listed on the GitHub page, the MySQL server started running whenever I start my chroot. Fingers crossed that it will stay like that!

I went back and finished the Python tutorial mentioned above. I found it really useful for getting started with MySQLdb in Python.

I finally got to read in my CSV of data and put it into a MySQL table.
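For anyone curious what that step can look like, here is a minimal sketch, not my exact code. The table name `messages`, the columns `sender` and `body`, and the assumption that the CSV has a header row are all hypothetical. The function takes any Python DB-API cursor; the `placeholder` argument is `%s` for MySQLdb, and other drivers may use a different marker.

```python
import csv

def load_csv_rows(cursor, csv_file, placeholder="%s"):
    """Insert every row of a two-column CSV (sender, body) into `messages`.

    `cursor` is any DB-API cursor; `csv_file` is an open file object.
    Returns the number of rows inserted.
    """
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row (assumed to be present)
    rows = [(sender, body) for sender, body in reader]
    # Parameterized query: the driver handles quoting of message text.
    sql = "INSERT INTO messages (sender, body) VALUES ({0}, {0})".format(placeholder)
    cursor.executemany(sql, rows)
    return len(rows)

# Hypothetical usage with MySQLdb (file name and database name are made up):
#   conn = MySQLdb.connect(host="localhost", user="root", passwd=password, db="okcupid")
#   with open("messages.csv") as f:
#       load_csv_rows(conn.cursor(), f)
#   conn.commit()
```

Passing the values separately from the SQL string matters here, since message text is full of quotes and commas that would otherwise break the query.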

I'm very happy that I don't have to deal directly with the CSV anymore. I'm hoping the MySQL database will be easier to handle.

Next time: labeling data.

Wednesday, September 9, 2015

Adventures in Data Science: Setting up a Chromebook

After a year and a half of no posts, I'm back with a new blog series about doing data science on a Chromebook.

Recently, I decided to up my coding skills by working on some programming side projects. Unfortunately, my personal computer is a Windows laptop, which is terrible for programming.

During Amazon Prime Days, I purchased a Chromebook with the intention of putting Linux on it.

The model I purchased was the Acer C720-3871 (Amazon). This one has 2 GB of RAM, an Intel processor, and 32 GB of solid state storage.

Stickers so you know I'm a real programmer.

The open Chromebook.

For loading Linux, I followed this tutorial from Lifehacker and used Crouton. I chose to load Xfce4 on top of Ubuntu.

This version is pretty lightweight, so it might not work for everyone. I'm only planning on using the Chromebook for programming, so I didn't care about having a pretty interface or graphics.

Once I installed Linux, I set up my environment with:
- git
- IPython
- Sublime Text
- and more!

Since I'm a scientist, I'm sticking with Python 2.7 right now.

So far, the Chromebook has been good enough for me. There are times when the laptop feels slow to respond, but I'm not sure yet whether that's a result of only having 2 GB of RAM or of me not remembering how to use Linux.

More on installation issues next time.