Wednesday, June 11th, 2008

Studying the Web With Free/Open Source Tools

by Josh Braun

In case you don’t follow this site (overwhelmingly likely), I’m a grad student doing social science research on the Web. This sounds simple, and being employed at a university, I have access to a lot of tools for just these sorts of projects. Unfortunately, on closer inspection many of them turned out to be far less useful than I had hoped. Many of the software packages (a) wouldn’t run on a Mac, (b) required a dedicated web server, (c) were attached to someone’s specific research initiative, (d) were housed on a university server or in a computer lab, meaning I couldn’t, say, work on a plane, or (e) were built for analyzing older text formats and handled HTML pages very poorly.

Slave to convenience that I am, I wanted to use my own computer. And I didn’t want to shell out several grand on proprietary software. So I went looking for free and open source solutions for doing social science research online. I had a surprising amount of trouble finding just what I was looking for, so I thought I’d share some of my hard-won knowledge, in the hope that it benefits the next person. Here is a set of free software tools I can no longer live without, described warts and all. As mentioned above, I use a Mac, but PC users should still get some mileage out of the Firefox plugins.

Scrapbook

I heart Scrapbook. This Firefox plugin is a great web caching tool and more. You can set it up to crawl and snapshot a single page, an entire website, or whole cascades of linked pages on the big bad Web. When you set it loose, you can even implement advanced filters to specify what sorts of pages you do or don’t want it to retrieve in the process. Zotero (see below) is another handy plugin that takes snapshots, but it unfortunately will grab only a page at a time. And it’s not as good at the next thing I’m going to describe.

Scrapbook is good for more than web caching. I was also able to use it as a coding tool. Here’s how: I took a snapshot of the page I wanted to code, then used the plugin’s “capture selection” tool to make additional snapshots of interesting excerpts in the text. I then used the comment tool to add my code to each of these excerpts. So, for instance, if I was coding a blog post for whether it mentioned celebrity politicians, I could clip the paragraph about Barack Obama into its own snapshot. Say I was coding these posts on a five-point scale, where Presidential candidates were an extreme example. Then, in the comment field for that snapshot, I would add the code “Politicians 5”.

Later, using Scrapbook’s “search comments” function, I can call up any excerpts with the Politicians code simply by typing in “Politicians”. Or, if I’m interested in only extreme examples, I can search “Politicians 5”.

Better yet, say there are three paragraphs about Barack Obama in a post, but only the first and last sentences are relevant. I could snapshot the whole three paragraphs, and then use the plugin’s eraser tool to remove all but the relevant material from the excerpt I was coding.

Of course, when you’re reviewing your data, it’s sometimes hard to remember the context in which you originally saw those coded excerpts. Fortunately, Scrapbook has a menu option that will instantly take you to the URL from which an excerpt was taken—in this case the original snapshot you took of the whole page.  I also write further notes to myself in the comment field of the original snapshot.  If I’m worried about those remarks turning up in a search for my coded excerpts, I replace the first letter of the code word with an “X” in my remarks.  Awkward, but there you have it.  If I want to see any remarks I might have made about politicians, I’ll search “Xoliticians” later on.
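
The search behavior described above can be mimicked in a few lines of Python. This is only a sketch of the idea, not how Scrapbook itself is implemented: the comment strings are made up, and the case-insensitive substring match is an assumption about how a comment search might behave.

```python
# Hypothetical comment fields, following the coding scheme described above.
comments = [
    "Politicians 5",
    "Politicians 2",
    "Xoliticians: revisit this post, the tone is ambiguous",
    "Celebrities 3",
]


def search_comments(comments, query):
    """Return every comment that contains the query, ignoring case."""
    needle = query.lower()
    return [c for c in comments if needle in c.lower()]


# All excerpts coded for politicians, regardless of score.
all_politicians = search_comments(comments, "Politicians")

# Only the extreme examples.
extreme_only = search_comments(comments, "Politicians 5")

# Free-form remarks, hidden from the searches above by the "X" prefix.
remarks = search_comments(comments, "Xoliticians")

print(all_politicians)
print(extreme_only)
print(remarks)
```

Because “Xoliticians” does not contain the string “Politicians”, the remarks stay out of the coded results until you search for them on purpose.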

Of course, if you’re coding a hundred posts for twenty variables, you’ll end up with a lot of snapshots. I was able to keep them organized by grouping excerpts into folders with their parent snapshots. It won’t affect how the search tool brings up your data later on. You can also do statistical analysis on your coded data using TextWrangler (discussed below), along with your favorite stats program.

Of course, Scrapbook is not as powerful as expensive content analysis packages like QDA Miner or Atlas.ti, but then again, it’s free. The pay-to-play packages cost anywhere from $290 to $1800.

RealPlayer Downloader

It’s increasingly difficult to study the web without considering video.  But streaming and embedded video can present a challenge when it comes to capturing data.  Lots of videos are here today, gone tomorrow, so you’ll want to cache a copy.  But most websites make it difficult to do.

RealPlayer is a software program just about everyone has already.  What you may not know is that the newest version, RealPlayer 11, is packing a nifty new feature called “Downloader” that allows you to save Flash streaming video to your hard drive from anywhere on the Web.  If you want to convert videos from the .FLV format to something more usable, you may want to use Downloader in conjunction with a free conversion program like iSquint.  H/T Sol Hart

SearchStatus

The ability to do content analysis on webpages is great, but at some point you’ll probably want some metadata about the posts you’re coding, like how many people link to them, or to the site they’re residing on, or how many outbound links there are on a given page. You can find most of this information using advanced queries on search engines like Yahoo, Google, or MSN Live. But banging out all those search commands is likely to make your fingers tired before long.

The SearchStatus Firefox plugin automates some of this process, but unfortunately not the whole thing. The plugin will instantly deliver you Yahoo, Google, or MSN link reports for a given URL or domain, allowing you to see who’s linking to the post you’re interested in, and the total number of inbound links. SearchStatus also gives you Google and Alexa page rankings for every page you browse to, and it can generate a “link report” for any webpage, showing not only how many outbound links there are on a page, but also how many of them point to other domains, and how many to internal pages.

It’s a thing of beauty, except for the fact that you have to generate all your statistics one page at a time. It’d be much nicer if someone with better coding skills than I have came along and wrote a program that would generate all these stats for a list of URLs. As it is, you’ll have to browse to each page you’re interested in by hand, and cut-n-paste the number of inbound or outbound links from the result page delivered by SearchStatus into your metadata spreadsheet. If you’re doing this for a few hundred pages, it gets old quickly. It’s obviously better than typing the same Yahoo queries by hand, but as I said, it seems like someone with even modest scripting or programming skills could bang out a better software solution to this problem without too much trouble.
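
As a rough illustration of what such a script might look like, here is a minimal Python sketch that covers only the outbound half of the problem: given the saved HTML of each page, it counts links and splits them into internal and external. It is not related to SearchStatus, the page URL and HTML are invented, and inbound link counts (which need a search engine) are not attempted.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def link_report(page_url, html):
    """Count a page's outbound links, split into internal and external."""
    collector = LinkCollector()
    collector.feed(html)
    page_host = urlparse(page_url).netloc.lower()
    internal = external = 0
    for href in collector.hrefs:
        # Resolve relative links such as "/about" against the page URL.
        parsed = urlparse(urljoin(page_url, href))
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parsed.netloc.lower() == page_host:
            internal += 1
        else:
            external += 1
    return {
        "url": page_url,
        "internal": internal,
        "external": external,
        "total": internal + external,
    }


# Hypothetical saved pages: URL -> HTML text.
pages = {
    "http://example.com/post-1": (
        '<p><a href="/about">About</a> '
        '<a href="http://example.org/x">Elsewhere</a> '
        '<a href="mailto:me@example.com">Mail</a></p>'
    ),
}

reports = [link_report(url, html) for url, html in pages.items()]
for report in reports:
    print(report)
```

One design choice to be aware of: hosts are compared exactly, so a link from example.com to blog.example.com would be counted as external. Whether that is right depends on what you mean by “the same site”.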

TextWrangler

TextWrangler is a free advanced text editing program, and a relative of the commercial BBEdit software. The coolest thing about this software is that it supports GREP searches. GREP is an extremely powerful command-line search tool for Unix, built around pattern-matching expressions known as regular expressions. Of course, most of us don’t use Unix command lines, or don’t want to use them for everything. The beauty of TextWrangler is that it lets you use this nifty search tool in the relative comfort of the (Unix-based) OS X operating system. GREP is basically like Microsoft Word’s “find” feature on steroids. You can set it up to search, not just for a certain line of text, but for sophisticated variants of it. TextWrangler has equally advanced “text replace” functions that you can use in conjunction with GREP to whip a full-sized document, or series of files, into a data set. Examples of some handy things I’ve done with TextWrangler and GREP (not all of these were for research projects):

  1. Generated a list of the screennames of everyone who commented on a particular blog post (handy when a comment thread tops 1000 responses).
  2. Located deleted comments in a two-week sample of blog posts, by comparing old and new snapshots of a site.
  3. Culled the timestamps of all activity on a website over a period of several days, along with the users involved.
  4. Generated a list of unique RSS subscribers from raw server log data.
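
To give a feel for the first item on that list, here is a small Python sketch of the same kind of pattern search, using Python’s regular expressions rather than TextWrangler. The comment markup is hypothetical; a real blog’s HTML will differ, and the pattern would need adjusting to match it.

```python
import re

# Hypothetical comment-thread markup; real blog HTML will look different.
thread = """
<li><cite>Dima</cite> says: Thanks for sharing!</li>
<li><cite>Josh</cite> says: Great find.</li>
<li><cite>Dima</cite> says: One more thing.</li>
"""

# Capture whatever sits between <cite> and </cite> when followed by " says:".
pattern = re.compile(r"<cite>(.*?)</cite> says:")

names = pattern.findall(thread)

# Drop repeat commenters while keeping first-appearance order.
unique_names = list(dict.fromkeys(names))

print(unique_names)
```

The parentheses mark the part of the match you want to keep, which is the same idea you would lean on in a GREP find-and-replace.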

TextWrangler can also search, compare, and modify entire lists of files. You can set it up to find all files on your computer or in a certain set of directories that contain a specified variable or text string, and perform the desired functions on those files only. It’s a pretty incredible tool. And when you’re done locating everything you want, you can even use the find and replace tools to transform the data into a comma-separated-values (.CSV) list that can be read by Excel or your favorite stats program.
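
The find-and-replace-into-CSV step can be sketched in Python as well. The log format below is invented for illustration (it is not from any particular server or blog platform), and the pattern simply pulls a date, time, user, and action out of each line that matches.

```python
import csv
import io
import re

# Hypothetical activity log; the line format is an assumption.
raw = """\
[2008-06-09 14:02:11] dima posted a comment
[2008-06-09 14:05:40] josh edited a post
malformed line that should be skipped
[2008-06-10 09:15:03] jon posted a comment
"""

# Four capture groups: date, time, user, and the rest of the line.
line_pattern = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2})\] (\S+) (.+)$"
)

rows = []
for line in raw.splitlines():
    match = line_pattern.match(line)
    if match:  # lines that don't fit the pattern are ignored
        rows.append(match.groups())

# Write to an in-memory buffer; swap in open("out.csv", "w", newline="")
# to produce a file a spreadsheet or stats program can read.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["date", "time", "user", "action"])
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```

Using the csv module instead of joining fields with commas by hand means any field that happens to contain a comma or quote gets escaped properly.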

In fact, if you’re coding with Scrapbook you can open up the .RDF file, where Scrapbook saves all your comments, and turn that into a spreadsheet with all your coding data. You’ll want to do this to a copy, of course, lest you destroy your original Scrapbook database.
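
If you would rather script this step, the sketch below shows one way it might go in Python. Everything about the sample file is an assumption: the namespace URL and the title and comment attribute names are placeholders, so open your own copy of the file and check what it actually contains before adapting this.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a COPY of the .RDF file. The namespace and the
# "title"/"comment" attribute names are assumptions, not a documented schema.
sample_rdf = """<?xml version="1.0"?>
<RDF:RDF xmlns:RDF="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:NS1="http://example.com/hypothetical-scrapbook-ns#">
  <RDF:Description RDF:about="urn:item:1"
                   NS1:title="Post about the primaries"
                   NS1:comment="Politicians 5" />
  <RDF:Description RDF:about="urn:item:2"
                   NS1:title="Post about a film premiere"
                   NS1:comment="Politicians 2" />
  <RDF:Description RDF:about="urn:item:3"
                   NS1:title="Uncoded page"
                   NS1:comment="" />
</RDF:RDF>
"""


def local_name(qualified):
    """Strip the '{namespace}' prefix ElementTree adds to names."""
    return qualified.rsplit("}", 1)[-1]


rows = []
for element in ET.fromstring(sample_rdf).iter():
    attrs = {local_name(key): value for key, value in element.attrib.items()}
    comment = attrs.get("comment", "").strip()
    if not comment:
        continue
    # Split "Politicians 5" into the variable name and its numeric score.
    variable, _, score = comment.rpartition(" ")
    if variable and score.isdigit():
        rows.append((attrs.get("title", ""), variable, int(score)))

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["title", "variable", "score"])
writer.writerows(rows)

print(buffer.getvalue())
```

Comments that don’t end in a number, such as free-form remarks, are skipped rather than guessed at.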

The only issue with using TextWrangler to do all this is that you’ll need to know a bit about GREP. Fortunately, (a) it’s not that hard to learn, (b) for any specific task, you’ll only need to devise one or two search strings, so you don’t need to know everything GREP can do, and (c) there are a number of great tutorials around, including one packaged with the TextWrangler software. It only took me a few minutes to get a handle on the basics, and about an hour to figure out some of the more advanced functions. In any case, I frequently just cheat and check the various guides to figure out a specific problem.

Zotero

Zotero is a wonderful Firefox plugin that serves as an alternative to EndNote and more.  The plugin automatically grabs the citation metadata for journal articles, books, news articles, and even multimedia that you browse to on the Web.  Next time you visit the Amazon page for a favorite book, just click an icon in your URL bar and the bibliographic information will automatically be saved to your Zotero database.  The plugin syncs with Microsoft Word, OpenOffice, and WordPress, just like any other reference program, allowing you to “cite while you write” in any major bibliography format.  It stores PDFs, and can also take snapshots of webpages if you like, though Scrapbook is often somewhat better for this purpose.

Moreover, Zotero is being actively developed. In recent months it’s acquired a tagging system, notetaking features resembling those in Scrapbook, and lots of other goodies that make it superior in many ways to proprietary packages like EndNote. It imports references from other bibliography programs, so switching over is a breeze. My only word of caution is that upgrades are sometimes a bit buggy, so be sure to back up your bibliography each time you install a new version of the plugin.

Other Resources

As a grad student, I have access to lots of nifty proprietary software, so I’m using a professional statistics program to parse the quantitative portions of my data. Fortunately, though, for people who don’t have such access, there are many, many statistical software packages out there that are entirely free and open source.  Wikipedia has a nice list to get you started.

I also ran across NetVis recently, which is a cool open source tool for visualizing social networks. I may have to check it out as I move forward with my research.

There’s also a project at the University of Pittsburgh that apparently brings together tools for web analysis, including some that supposedly make blog data and the proprietary Atlas.ti package play more nicely together.

If you’re looking to graph and visualize your data, the free Vvidget software package is by all accounts the best thing since sliced bread.  Thus far, I’ve only tinkered with it, but the graphics it produces are drop-dead gorgeous.  They make Excel graphs look like ASCII charts.

Lastly, Google has a list of qualitative analysis programs and tools that may be of assistance.

Add Yours!

Got a comment about any of these? A PC equivalent to a Mac tool I listed?  Want to share your own favorites?  Did you write or mod something yourself? Please use the comment thread to add it here!

Tags: open source | research | software

6 Responses to “Studying the Web With Free/Open Source Tools”

  1. Dima says:

    Thanks for sharing this useful information!
    You may want to contribute it to this wiki or alternatively check it out for other tools. I haven’t explored it in depth, but it looks promising.

  2. Josh says:

    Great find, Dima! There’s a ton of stuff there that looks really useful.

  3. I’ve used a number of open-source – and expensive programs. Am now trying to find the best info visualization software. But google notebook – free and searchable – if you’re not disciplined, you get piles of notes – but you can always find things – especially in conjunction with Google Desktop.

    Notepad ++ is my current html editor – but I’m ready to ditch it for something better; and more and more I’m using Open Office instead of MS Word.

    And there are others – but I’m old and my memory, at the moment, is failing.
    Two things about Blue Highway – I’d like to talk to you about it – and the nav link leads to a slightly wonky page.

    Great blog

    Jon

  4. Josh says:

    Hi Jon,

    Thanks for stopping by and for the kind words!

    @Notepad++, I’ve heard great things about the program. If you’re looking to step up, a developer friend of mine recently introduced me to Aptana Studio. It’s a free and fully featured IDE, with syntax highlighting and assistance for html, javascript, and css. You can also get plugins for writing and running php, ruby, etc.

    @Google Notebook, I use it myself. The best thing about it, IMHO, is that it’s web-based, and therefore available from anywhere. I also use a (cheap, not free) program called NoteBook for Mac, which is similar, but has a lot of additional features. Unfortunately, it’s not an online tool, so it can make you a slave to your desktop. It’s also one of those programs that forces you to think linearly, which not every situation calls for. A professor friend of mine is a fan and proponent of a much more free-form organizational tool called FreeMind, which lets you map your thoughts out visually. The interface takes some getting used to (as you might expect, given its open-source origins), and the graphics are all very wire-frame in nature, but I’ve seen some of the notes he takes with the program and it certainly has some powerful features.

    @Blue Highways, The wonky page is the result of my upgrading to php5. A number of items on this site broke as a consequence, including my installations of bbPress, which powers Blue Highways, and reBlog, which powers the new-media-and-journalism feed. bbPress is easily fixed: I just need to upgrade the software to the latest version (though the community is largely dormant at the moment; the traffic and discussion come and go in spurts). reBlog is another story. It appears to have been orphaned, so I’ll have to repair it myself. I could always roll back to php4, which would get everything working again at once, but I’m taking some development courses and beginning to become attached to the newer version.

    In either case, send me an email and I’m happy to chat about BH. My whole site needs quite a bit of updating and repair, which is one reason I’m taking some elective courses on the subject. Keep stopping by, though, and I’ll be sure to keep you abreast of the updates. I also update this site more often when people are actually reading it. ;)

  5. gamepwner says:

    bbPress ftw, it’s so simple to use and customize! I recommend http://www.bbpresstemplates.net for cool bbPress themes.
