Mar 25 2007

LibraryFind code refactoring

I’ve been spending time this weekend refactoring a major piece of the LibraryFind code, partly in an effort to make it easier to add protocol classes.  This change affects a lot of the current API code base, but the biggest change comes in the meta_search.rb file, where nearly all the business logic relating to searching, etc. will be removed in favor of a loose plug-in architecture that I’m hoping will make it easier to add additional search classes to the program.  However, as with all refactoring, there’s a bit of debugging that happens and I tell you, this morning at 4 am, it just wasn’t happening.  The big change deals with about 200 lines of code in the meta_search.rb file (which in turn affects the current files that actually make up the protocols and searching).  These 200 lines of code have been replaced by the following block:


if is_in_cache == false
    _tmparray = Array.new()
    objSearch = nil
    eval("objSearch = " + _collect.conn_type.capitalize + "SearchClass.new")
    _tmparray = objSearch.SearchCollection(_collect, _qtype, _qstring, _start.to_i, _max.to_i, _last_id, _session_id, _action_type, _data, _bool_obj)
    record.concat(_tmparray) if _tmparray != nil
end 

Basically, this code snippet is called if the query isn’t located in the cache.  Originally, the code that this snippet replaced was a large case statement that performed different actions depending on what protocol was being utilized.  This snippet moves all this logic into models, where search classes are then pluggable.  The protocol functions will follow a naming convention, take the same values (though they’ll do different things with them) and in theory will make it easier to add support for new search types.  At least, this is what I’m seeing at this moment as I add the ability to query OpenSearch targets to LF.
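
A minimal sketch of the naming-convention lookup, assuming two hypothetical plug-ins (OpensearchSearchClass and Z3950SearchClass) and a much shorter SearchCollection signature than the real one.  It uses Object.const_get rather than eval to turn the connection type into a class; that swap is an illustration, not what LibraryFind itself does.

```ruby
# Hypothetical plug-in classes; the names and the two-argument signature
# are assumptions for illustration only.
class OpensearchSearchClass
  def SearchCollection(collection, query)
    ["opensearch result for #{query} in #{collection}"]
  end
end

class Z3950SearchClass
  def SearchCollection(collection, query)
    ["z3950 result for #{query} in #{collection}"]
  end
end

# Build the class name from the connection type, mirroring
# conn_type.capitalize + "SearchClass" in the snippet above.
def search_class_for(conn_type)
  Object.const_get(conn_type.capitalize + "SearchClass")
end

# Every plug-in answers to the same method, so the caller needs no case statement.
def run_search(conn_type, collection, query)
  search_class_for(conn_type).new.SearchCollection(collection, query)
end

puts run_search("opensearch", "archives", "salmon").inspect
```

Adding a new search type then means adding one class that follows the naming convention; the calling code does not change.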

Anyway, I worked on this for about 2 hours this morning.  The new plug-ins were working fine — items were going into the cache and results were being returned.  However, they were being lost during the transition from the API to the UI.  Odd.  Couldn’t figure out what was going on and debugging is difficult because the application is threaded (another change — a dedicated global thread pump) so sometimes errors occur while other parts of the application are executing.  Anyway, at 4 am, I decided to knock off and come back to it in the morning with new eyes to see if I could see what I was doing.

Well, I’m glad I decided to sleep on it.  As I was in church this morning, I had an epiphany.  It’s in lines 7 and 8.  The way Ruby’s threading works, variables created within a thread are isolated from the rest of the application.  To deal with that, Ruby has a syntax that allows you to create thread-local variables that can be read from outside the thread.  So for example, if I have a variable that I want to access outside of the thread, I would use something like:

Thread.current["myrecord"] = Array.new()

This syntax is how plug-ins utilizing the global thread pump will return data to the application.  And there was the rub.  I’d forgotten that Ruby always returns a value from a function: the value of the last expression evaluated.  Simply for clarity, I always explicitly note what is being returned at the end of each function using the older return syntax:

return xxxxx

I’d conveniently forgotten that feature in the language, and this is what was gumming up the process.  The thread pump would finish evaluating the threads, capturing the thread data, and then a string or common array would also be returned (outside of the thread pump) which, since it was not nil, would overwrite the current record variable.  Once I had these plug-ins return nil values and let the thread pump handle the data processing, all was right in the world again.  Unfortunately, I lost 2 hours of sleep last night on this problem, and I’d like to have them back. :)
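
The problem can be reproduced in a few lines.  This is a simplified sketch, not the LibraryFind thread pump: noisy_plugin, quiet_plugin and collect are hypothetical names, and the “overwrite if not nil” step stands in for what the calling code was doing.

```ruby
# Hypothetical plug-in: stores its results in a thread-local variable,
# then falls off the end of the method.
def noisy_plugin
  Thread.current["myrecord"] = ["record 1", "record 2"]
  "done"   # last expression, so Ruby returns this string implicitly
end

# Same plug-in, but it returns nil so the caller has nothing to overwrite with.
def quiet_plugin
  Thread.current["myrecord"] = ["record 1", "record 2"]
  nil
end

# Simplified stand-in for the caller: run the plug-in in a thread, read the
# thread-local data from outside, then apply the "overwrite if not nil" step.
def collect(plugin_name)
  returned = nil
  thread = Thread.new { returned = send(plugin_name) }
  thread.join
  record = thread["myrecord"]             # data captured from the thread
  record = returned unless returned.nil?  # a non-nil return value clobbers it
  record
end

puts collect(:noisy_plugin).inspect
puts collect(:quiet_plugin).inspect
```

The first call prints "done" because the implicit return value replaced the records; the second prints the two records.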

–TR


Jan 14 2007

Ruby and Date classes

I’d seen this somewhere before, but thought I’d benchmark it to see if it was true.  Ruby has 3 classes for doing dates: Date, DateTime and Time.  In languages like C# and PHP, the equivalent DateTime class is the one that is generally utilized to do date/time manipulations.  However, in Ruby, DateTime and Time are nearly identical classes, the difference being that DateTime is a convenience class.  Benchmarking shows the DateTime class to be 20-25 times slower than the native Time class.  After doing a little research, the reason is clear.  The Time class is a native class developed in C, while DateTime is a wrapper class written in Ruby.  So while I rarely have occasion to do time manipulation in a loop, it would seem that the DateTime class should pretty much always be avoided.  In fact, I can’t see why the DateTime class exists.  I don’t see what it does that isn’t available in either the Date class or Time class.
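
A benchmark along these lines can be sketched with the standard Benchmark module.  This is not the original test; the time_creation helper and the iteration count are assumptions, and the ratio you see will depend on the Ruby version you run it on.

```ruby
require 'benchmark'
require 'date'

# Time how long it takes to create n objects of each class.
# Returns a hash of class name => elapsed seconds.
def time_creation(n)
  {
    "Time"     => Benchmark.realtime { n.times { Time.now } },
    "DateTime" => Benchmark.realtime { n.times { DateTime.now } }
  }
end

# 50_000 iterations is an arbitrary choice for illustration.
time_creation(50_000).each do |name, seconds|
  puts "#{name}: #{seconds} seconds"
end
```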

 

–TR


Nov 30 2006

Why I miss strong variable typing in scripting languages

I love strongly typed languages.  Languages that require you to define all variables before use.  They save me from myself, or, from my atrocious spelling.  I was working on LibraryFind, OSU’s open source metasearch service, three days ago making a few tweaks and changes.  One of the things that this tool has that is somewhat researchy is this idea of caching all results and how that cache gets reutilized in different sessions.  Anyway, I was making a few changes to the following lines:

      #======================================================
      # Check to see if data was cached -- if it is load
      #======================================================
      if _search_id != nil
         _lxml = CachedSearch.retrieve_metadata(_search_id, _collect.id, _max.to_i)
         if _xml != nil
           if _lxml.status == LIBRARYFIND_CACHE_OK
             if _lxml.data != nil
               _lrecord =  _objRec.unpack_cache(_lxml.data, _max.to_i)
               _results_count = _results_count +  _lrecord.length
               record = record.concat(_lrecord)
               is_in_cache = true
             else
               is_in_cache = true
             end
           elsif _lxml.status == LIBRARYFIND_CACHE_EMPTY
             is_in_cache = true
           end
         end
      end

 

See what I did?  I’m sure you do.  However, I didn’t until last night when I thought that the caching service was running really slow.  So I looked in the code, and found my problem.  A typo.  It should have looked like:

      #======================================================
      # Check to see if data was cached -- if it is load
      #======================================================
      if _search_id != nil
         _lxml = CachedSearch.retrieve_metadata(_search_id, _collect.id, _max.to_i)
         if _lxml != nil
           if _lxml.status == LIBRARYFIND_CACHE_OK
             if _lxml.data != nil
               _lrecord =  _objRec.unpack_cache(_lxml.data, _max.to_i)
               _results_count = _results_count +  _lrecord.length
               record = record.concat(_lrecord)
               is_in_cache = true
             else
               is_in_cache = true
             end
           elsif _lxml.status == LIBRARYFIND_CACHE_EMPTY
             is_in_cache = true
           end
         end
      end

Since Ruby, like many scripting languages, happily creates variables for you, I didn’t notice it.  And since the program kept running (albeit more slowly), it didn’t dawn on me that I’d caused a boo-boo.  It wasn’t until last night, 2 days later, that I found it while doing a code audit.  A practice I have when dealing with scripting languages is to audit modified code weekly to inventory the life of each variable/process.  It’s something I do mostly as a way to eliminate variable usage, but in this case, it helped me find this problem.  Ack.
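
A small sketch of how a misspelled name can slip through silently.  It uses instance variables, which Ruby evaluates to nil when they have never been assigned; that is an assumption made for illustration, since the snippet above uses plain locals.  The CacheCheck class and its method names are hypothetical.

```ruby
# Hypothetical cache check, loosely modeled on the snippet above.
class CacheCheck
  def initialize(cached)
    @lxml = cached
  end

  # Typo: @xml was never assigned, so it is nil and this is always false.
  # No error is raised; the cache simply never gets used.
  def hit_with_typo?
    @xml != nil
  end

  # Corrected name: true whenever something was cached.
  def hit?
    @lxml != nil
  end
end

check = CacheCheck.new("cached data")
puts check.hit_with_typo?
puts check.hit?
```

The typo version prints false and the corrected one prints true, which matches the symptom: nothing breaks, the cached data is just never loaded.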

 

–TR


Nov 29 2006

Google Spell API — Ruby

Someone had asked if I could post the Ruby code we use to interact with the Google Toolbar spell API.  Well, here it is.

require 'net/https'
require 'uri'
require 'rexml/document'

class GoogleSpell
   def GetWords(phrase)
     results = []
     x = 0
     i = 0

     phrase = phrase.downcase
     phrase = phrase.gsub("&", "&amp;")
     phrase = phrase.gsub("<", "&lt;")
     phrase = phrase.gsub(">", "&gt;")
     word_frag = phrase.split(" ")
     word_frag.each do |lookup|
       words = '<spellrequest textalreadyclipped="0" ignoredups="1" ignoredigits="1" ignoreallcaps="0"><text>' + lookup + '</text></spellrequest>'
       gword = Hash.new()
       gword["original"] = lookup;
       gword["data"] = ""
       http = Net::HTTP.new('www.google.com', 443)
       http.use_ssl = true
       http.verify_mode = OpenSSL::SSL::VERIFY_NONE
       response =  http.start {|net|
         net.request_post("/tbproxy/spell?lang=en", words) {|response|
           doc = REXML::Document.new response.body
           nodelist = doc.elements.to_a("//c")
           nodelist.each do |item|
             if item.text.downcase != gword["original"]
               gword["data"] = item.text.downcase
             else
               gword["data"] = ""
             end
           end
         }
       }
       results << gword
     end
     return results
   end
end
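
The request body built inside GetWords can also be put together with the standard library’s CGI.escapeHTML, which covers the three gsub calls in one step.  This is a sketch only: spell_request_body is a hypothetical helper, not part of the class above, and it builds the XML string without sending anything.

```ruby
require 'cgi'

# Build the spellrequest body for a single word.  CGI.escapeHTML replaces
# &, <, > and quote characters with their entities.
def spell_request_body(word)
  attrs = 'textalreadyclipped="0" ignoredups="1" ignoredigits="1" ignoreallcaps="0"'
  "<spellrequest #{attrs}><text>#{CGI.escapeHTML(word.downcase)}</text></spellrequest>"
end

puts spell_request_body("R&D")
```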

–TR


Sep 20 2006

ruby-oai 0.0.3 — thanks ed

Ed posted revisions to the ruby-oai package.  It includes the ability to utilize libxml as the XML parser.

 

For those interested, this can now be coupled with the following class to parse a normal unqualified Dublin Core record into an object version of the record.  BTW, I realize that this code makes some use of the eval function, probably one that you want to generally avoid, but it is a very powerful function when one needs to execute dynamic code.

require 'xml/libxml'
require 'oai'

  class OaiDc
    attr_accessor :title, :creator, :subject, :description, :publisher,
                  :relation, :date, :type, :format, :contributor,
                  :identifier, :source, :language, :coverage, :rights

    def parse_metadata(element)

      labels = self.metadata_list()

      return nil if element == nil
      labels.each do |item|
        x = 0
        tmp_element = element.metadata.find("./oai_dc:dc/" + item, ["oai_dc:http://www.openarchives.org/OAI/2.0/oai_dc/","dc:http://purl.org/dc/elements/1.1/"])
        item = item.gsub('dc:','')
        eval("@" + item + " = []")
        tmp_element.each do |i|
          s = i.content
          if s != nil
            eval("@" + item + "[" + x.to_s + '] = ' +  s.dump )
            x += 1
          end
        end

        eval('@' + item + '[' + x.to_s + '] = nil') if x == 0
      end
    end

    def metadata_list()
      labels = ['dc:title','dc:creator',
                'dc:subject','dc:description',
                'dc:publisher','dc:relation',
                'dc:date','dc:type','dc:format',
                'dc:contributor','dc:identifier',
                'dc:source','dc:language',
                'dc:coverage','dc:rights']
    end
  end
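
For anyone who would rather avoid eval, the same dynamic assignment can be sketched with instance_variable_set.  This is a simplified stand-in, not a drop-in replacement: DcRecord is a hypothetical class that takes a plain hash of field names to values instead of a libxml node, so it runs without ruby-oai or libxml installed, and it returns self rather than whatever the original returns.

```ruby
# Simplified stand-in for OaiDc (hypothetical class and input format).
class DcRecord
  FIELDS = %w[title creator subject description publisher relation date
              type format contributor identifier source language coverage
              rights].freeze

  attr_accessor(*FIELDS)

  # values_by_field: hash of field name => array of strings (assumed shape).
  def parse_metadata(values_by_field)
    return nil if values_by_field.nil?
    FIELDS.each do |field|
      values = Array(values_by_field[field]).compact
      values = [nil] if values.empty?   # like the original: [nil] when absent
      instance_variable_set("@#{field}", values)
    end
    self
  end
end

record = DcRecord.new.parse_metadata("title" => ["Oregon maps"])
puts record.title.inspect
puts record.subject.inspect
```

Because the field name is only ever used as a variable name and the value is never turned into code, there is no need for the s.dump quoting step.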

Sep 11 2006

ruby-oai and processing metadata elements

Since modifying the ruby-oai module to work with libxml — I’ve found extracting the data from “metadata” to be much easier.  And the nice part — using libxml, I was able to do an initial metadata harvest from 4 collections in under 10 seconds.  Here’s the code from the harvesting component — see below.

–TR

oai_dc harvester

require 'rubygems'
require 'xml/libxml'
require 'oai'

  class OaiDc
    attr_accessor :title, :creator, :subject, :description, :publisher,
                  :relation, :date, :type, :format, :contributor,
                  :identifier, :source, :language, :coverage, :rights

    def parse_metadata(element)

      labels = self.metadata_list()

      return nil if element == nil
      labels.each do |item|
        x = 0
        tmp_element = element.metadata.find("./oai_dc:dc/" + item, ["oai_dc:http://www.openarchives.org/OAI/2.0/oai_dc/","dc:http://purl.org/dc/elements/1.1/"])
        item = item.gsub('dc:','')
        eval("@" + item + " = []")
        tmp_element.each do |i|
          s = i.content
          if s != nil
            eval("@" + item + "[" + x.to_s + '] = ' +  s.dump )
            x += 1
          end
        end

        eval('@' + item + '[' + x.to_s + '] = nil') if x == 0
      end
    end

    def metadata_list()
      labels = ['dc:title','dc:creator',
                'dc:subject','dc:description',
                'dc:publisher','dc:relation',
                'dc:date','dc:type','dc:format',
                'dc:contributor','dc:identifier',
                'dc:source','dc:language',
                'dc:coverage','dc:rights']
    end
  end


Sep 8 2006

More reasons to avoid REXML (like the plague)

I’ve been having a great time playing with Ruby, but one of the things that I find myself constantly running up against is the “Ruby way”.  The place where I notice this most is when dealing with XML.  One of the things I’ve noticed in working with XML in Ruby is that the Ruby crowd doesn’t seem to know what to do with it.  In most documentation that I’ve read, it appears that most Ruby folks find little value in XML and prefer to work in YAML.  However, in my world, XML is king and working around it really isn’t an option.  So with XML, what are the Ruby options?  There are a few, but by default Ruby pushes a component known as REXML.  In general, I’ve found this to be a nifty little library with lots of convenience functions.  However, it comes at a very high cost.  First, I’ve found that there’s a limit to the size of file that can be loaded (~5 MB, then the component starts breaking), and second, it’s slow.  Oh, how it is slow.  But how slow?

Well, I’ve been spending a lot of time reworking the API for LibraryFind, our soon-to-be open source hybrid federated search system.  The big project of course has been moving the data into Ruby and out of PHP so we can place the project within a web framework (in our case, Rails).  Well, part of the project deals with harvesting data for local indexing, and in many cases, the data being harvested is in OAI.  So, I’ve been playing with Ed Summers’s ruby-oai (very cool), and it’s based around REXML.  After installing the module and testing the harvesting of a small collection, I was pretty dismayed with the speed of the application.  I suspected that it was REXML that was causing the slowdown.

So after chatting with Ed a little bit, he said that he’d be open to hacking up a version that supported libxml, provided the changes didn’t:

  1. Turn the code into an unholy mess
  2. Require custom XPath statements (and they almost do)

Given that I was going to make these changes anyway for our own local instance, I thought this sounded like a good idea (and I’ve never had a chance to work with Ed, which I couldn’t pass up either :) )

So I spent some time today modifying the ruby-oai module and finished integrating libxml into the module (I guess I’ll wait now to see if I violated number 1 though :) ), and after running a small benchmarking application, all is well with the world again.

So how did the benchmarking go?  Funny you should ask. :)  I made an OAI request using the REXML codebase which returned 394 records, parsing just the identifier from the header.  Total time:

Time to run: 21.685583
Records returned: 394

Same codebase, just changing the parser.  What’s the difference?  Let’s see:

Time to run: 0.75901
Records returned: 394

This kind of difference (about 28 times faster) really brings a smile to my face.  And considering how many OAI sites I have to harvest from (~300), these extra seconds really start to add up.  Test code below…

–TR

Test Code:

require 'oai'

buffer = ""
start_time = Time.now()

client = OAI::Client.new 'http://digitalcollections.library.oregonstate.edu/cgi-bin/oai.exe', :parser =>'libxml'

last_check = Date.new(2006,1,1)
records = client.list_records :set => 'archives', :metadata_prefix => 'oai_dc', :from => last_check
x = 0
records.each do |record|
  #fields = record.serialize_metadata(record.metadata, "oai_dc", "Oai_Dc")
  #puts "Primary Title: " + fields.title[0] + "\n"
  buffer << record.header.identifier + "\n"
  x += 1
end

end_time = Time.now()

puts buffer
puts "Time to run: " + (end_time - start_time).to_s + "\n"
puts "Records returned: " + x.to_s