Finding social networking site accounts for a list of organizations

I’m working on collecting data for my dissertation right now, and one major problem I ran into was finding organizations on Twitter and Facebook. I have heard from more than one person who has a list of organizations (say, the top non-profit organizations or Fortune 500 companies) and wants to build a collector to gather Twitter posts, but doesn’t have the usernames for those organizations. Twitter lists are great for finding lots of accounts, but there are two major problems: 1) the list you need may not exist, and 2) the accuracy and currency of any list are wholly dependent on its curator. If you are concerned with getting the most accurate sampling of a group of organizations on social networking sites, chances are you have to make your own list.

I first encountered this problem when I was compiling Twitter lists of members of the U.S. House of Representatives and U.S. Senate in the 113th Congress. At the CaSM Lab, we use these lists to collect tweets authored by and directed at members of Congress (MOCs). To compile the lists, I had to do a Google search with the name of each MOC plus the words “Congress” and “Twitter.” While adding these terms (usually) weeded out people who coincidentally shared a name with an MOC, it did not weed out MOCs’ Congressional information pages or well-meaning websites like Tweet Congress. Even after a focused search, I still had to scan the results, verify an account, and copy the URL or username.

For my dissertation, I am pulling from an initial list of 2,720 non-profit organizations that potentially have SNS accounts. Manually performing a search and extracting a potential URL for each organization would take far too long. Since the task requires some degree of human intelligence, paying someone to perform the searches and find the URLs would seem to be the only option. Since this is a dissertation, however, I have approximately no funds allocated for that. I also wanted a method for finding URLs that works across a variety of projects, so that I don’t have to pay someone every time I need to make a new list.

I had some previous experience with Ruby and the Watir gem, so I chose that route for automating the search task. Watir is a Ruby library that lets you automate a web browser, passing information to a website’s forms and monitoring the results. It also has some limited scraping abilities, which makes it perfect for pulling structured information such as search results or tables.
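To give a sense of what that looks like, here is a minimal sketch of driving a Google search with Watir; the browser choice and the query are illustrative, not taken from my script:

```ruby
require 'watir'

# Start a browser session (Chrome here; Firefox also works).
browser = Watir::Browser.new :chrome
browser.goto 'https://www.google.com'

# Fill in the search form and submit it.
query = 'American Red Cross Twitter'   # hypothetical example query
browser.text_field(name: 'q').set(query)
browser.text_field(name: 'q').send_keys(:enter)

# Watir can also read the results back out as structured elements.
browser.links.first(10).each { |link| puts link.href }

browser.close
```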

My initial script grabbed the first three URLs from a Google search indiscriminately, but that caused a couple of problems. First, for organizations whose own website takes up more than one of the top Google results, relevant social networking site URLs risk being crowded out (a recall problem). Second, the script returned lots of URLs from third-party non-profit information sites that had dummy entries for the organizations I searched for (similar to the Tweet Congress problem). These irrelevant URLs lowered the instrument’s precision.

Unfortunately, since I wanted to start the Twitter collector immediately, that meant I still had to do a large amount of manual searching and scanning of results when collecting Twitter URLs for my study. When it came time to collect Facebook URLs, I decided to return to the search script and fix these problems.

I recently finished a revised script (available on GitHub) that returns the first ten URLs for a given search term when the URL matches a predetermined string. To increase the instrument’s recall, I expanded the number of URLs it collects from three to ten (the number of results on the first page of a Google search). To increase its precision, I changed the script to collect only URLs containing a given string (e.g. “facebook.com”). These changes greatly increased my confidence that when the script returns zero URLs for an organization, there are no social networking sites associated with that organization.
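For illustration, the core of that approach can be sketched in a few lines of Ruby. The organization names, output file, and “facebook.com” filter below are placeholders; the full script on GitHub handles the details differently:

```ruby
require 'watir'
require 'cgi'
require 'csv'

TARGET   = 'facebook.com'   # only keep URLs containing this string
MAX_URLS = 10               # roughly the first page of Google results

orgs    = ['American Red Cross', 'Habitat for Humanity']  # hypothetical sample
browser = Watir::Browser.new :chrome

results = orgs.map do |org|
  # Run the search, then keep only result links that match the target string.
  browser.goto "https://www.google.com/search?q=#{CGI.escape("#{org} facebook")}"
  urls = browser.links
                .map(&:href)
                .compact
                .select { |u| u.include?(TARGET) }
                .uniq
                .first(MAX_URLS)
  [org, urls]
end

browser.close

# One row per organization: name followed by any matching URLs (possibly none).
CSV.open('facebook_urls.csv', 'w') do |csv|
  results.each { |org, urls| csv << [org, *urls] }
end
```

Checking each link’s href against the target string is what keeps third-party directory pages and the organization’s own site from crowding out the social networking URLs.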

While this script doesn’t replace the need for human verification, it does eliminate the tedious process of performing the initial searches and picking through the results to find a potential URL. There is certainly a chance that I’m missing a few accounts by using automation, but, as I learned when searching for MOCs, fatigue is just as likely to produce a false negative as automation is.

Feel free to try the script out, and if you do, please let me know how it works for your searches. It’s pretty versatile and can be adapted to almost any search task where you need to find URLs for a list of people or organizations. Also, although I haven’t done so, I’m sure it could be modified to work with Ubuntu or as part of a Rails app. Its only real limitation is that memory constraints slow it down after about 1,000 searches (a problem I don’t have time to investigate right now).

Also, if you are looking for some introductory help on using Watir to automate a web browser, I have a tag on Diigo with links to some helpful resources.