reeset

 

It’s funny how things work out.  As some folks know, we are currently rewriting our hybrid metasearch tool at OSU from PHP to Ruby.  Now that much of the heavy lifting has been accomplished (a simple caching engine, WSDL API, OAI harvesting, Ferret integration, etc.), it’s time to start porting some of the niceties that we added to the PHP instance of LibraryFind to gauge our users’ reactions.  One of the features that remained to be ported was a spell checker.

In the PHP instance of LibraryFind, the utility made use of a couple of built-in PHP components to provide a spell checker that wasn’t simply a dictionary lookup.  Using Aspell as the dictionary behind the spell checker, the tool applied metaphonics (the analysis of sounds) to determine which entry returned by the dictionary was closest to the text actually typed.  This of course meant that the text entered into the application had to be misspelled phonetically if the tool was to find the best match possible.
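
The ranking step described above can be sketched roughly as follows.  This is only an illustration in JavaScript, not the LibraryFind code: it swaps in a simplified Soundex key for the metaphone algorithm, and the function names (soundex, editDistance, rankSuggestions) and the sample words are assumptions made up for the example.

```javascript
// Simplified Soundex: first letter plus up to three consonant-class digits.
// This stands in for metaphone, which is a different (richer) algorithm.
function soundex(word) {
  const codes = { b: 1, f: 1, p: 1, v: 1,
                  c: 2, g: 2, j: 2, k: 2, q: 2, s: 2, x: 2, z: 2,
                  d: 3, t: 3, l: 4, m: 5, n: 5, r: 6 };
  const letters = word.toLowerCase().replace(/[^a-z]/g, "");
  if (letters.length === 0) return "";
  let key = letters[0].toUpperCase();
  let prev = codes[letters[0]] || 0;
  for (let i = 1; i < letters.length && key.length < 4; i++) {
    const ch = letters[i];
    const code = codes[ch] || 0;
    if (code !== 0 && code !== prev) key += code;
    // h and w do not separate repeated codes; vowels do
    if (ch !== "h" && ch !== "w") prev = code;
  }
  return key.padEnd(4, "0");
}

// Plain Levenshtein distance between two strings.
function editDistance(a, b) {
  let prevRow = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const row = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(prevRow[j] + 1, row[j - 1] + 1, prevRow[j - 1] + cost);
    }
    prevRow = row;
  }
  return prevRow[b.length];
}

// Order dictionary suggestions by how close they sound to the typed text,
// using spelling distance only to break ties.
function rankSuggestions(typed, suggestions) {
  const typedKey = soundex(typed);
  return suggestions
    .map((word) => ({
      word,
      phonetic: editDistance(typedKey, soundex(word)),
      spelling: editDistance(typed.toLowerCase(), word.toLowerCase()),
    }))
    .sort((x, y) => x.phonetic - y.phonetic || x.spelling - y.spelling)
    .map((entry) => entry.word);
}

console.log(rankSuggestions("forrestry", ["foresight", "forester", "forestry"]));
```

The point of the two-level sort is that a suggestion which sounds the same as the typed text wins over one that merely shares more letters with it.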

Well, Ruby doesn’t have many of these tools available to it.  While I could have coupled it with Aspell, I couldn’t find a good metaphonics engine to attach to Ruby, which I think is important since straight dictionary matching is only so useful.  So I started looking for alternatives…

It’s actually interesting how many alternatives one can find.  Google has two.  The first is its publicly available Search API, which includes an option to return spell-checked items.  This looked promising, but the 1000 search limit was disappointing.  However, Google provides a second, largely undocumented service that is currently used by its Firefox toolbar.  Apparently, this undocumented API has been well known for some time, going back to 2005 when Google first released its beta Firefox toolbar.  What’s interesting as well (and I took a look at the source to see it for myself) is that within the toolbar, Google has a number of code comments that indicate whether a specific piece of functionality is intended to be used by outside organizations.  A good example of this is the PageRank calls in the toolbar: comments surrounding the code make it clear that this isn’t a publicly available function.  However, this isn’t the case with the toolbar spell checker API, and after looking through the EULA and doing some web sleuthing, I couldn’t find anyone talking about restrictions on use.  Quite the contrary, there are a number of people building interfaces to this API.  You can find them here and here and here and here and here.  In fact, I could find examples working with this API in PHP, Python, C# and Perl.  So I figured, why not Ruby?  So, last night I wrote a small Ruby class to parse this data for LibraryFind.

I know, I know, what does this have to do with DSpace?  Well, after coding it in Ruby I figured, heck, why not Java?  So I spent today putting together a new Java class that implements the Google spell checking API within DSpace.  Unfortunately, this isn’t the simplest of hacks (it requires touching a number of the JSP files and adding a class file), but take a look at the results:

Currently, this exists on our development server, but will likely make its way into our production environment next week.  ***Side note:  it makes sense that today would be the day that I’d choose to make these changes.  We are currently freezing DSpace development so we can finish porting our current hacks into the 1.4 source and bring that into production.  The process was just about finished when I decided that we had time to sneak one more change into the migration. :)   You can see what Jeremy puts up with: too much copious free time at home, I guess. :)

So how can you add this functionality to your DSpace instance too?  Good question.  Well, first you need this Java class.  I won’t guarantee that you won’t find any problems with it (and if you do, give me a holler), since I just finished it today and only did a little bit of light testing, but I haven’t had trouble with it so far, so I feel pretty good about it.  You’ll likely need to change where the class is packaged; we tend to package OSU-specific classes within their own namespace.  Second, you need to make changes to the following files:

  1. search/results.jsp
  2. utils.js

First, results.jsp: 

Around line 75, add the reference to the class.


<%@ page import="edu.oregonstate.library.util.GoogleSpell" %>

Around line 104 — you need to add an id to the simple-search form tag.


Around line 161 — you need to add an id to the query text box


 
 
" />

Around line 170, add the following snippet. You will notice that there is a snippet of inline JavaScript code. It’s there because if Google offers no corrections for any word, I don’t want to show the results at all. So, by default, the results are hidden, and the inline JavaScript displays them only if the can_see variable is set to true.


<%

if (query!=null) {
%>

<%
   boolean can_see = false;
   GoogleSpell objURL = new GoogleSpell();
   String[] words = query.split(" ");
   String[] t = objURL.GetWords(query);
   if (t!=null) {
%>

 

And next, the utils.js file:

In the utils.js file we are just adding two new functions. One is a convenience function and the other is the BuildSearch function. I put these at the end of the utils.js file, but it really doesn’t matter where they go:


function BuildSearch(f,q, cf, n) {
  var s = "";
  for (i =0; i

And that's pretty much it.  Recompile the source and the next time you do a search, if Google returns a suggestion, the tool will present it in the context of a "Did you mean" question.  Since misspellings sometimes occur within a phrase, or can have multiple suggestions, I've built the interface so that multiple selections show up in a listbox.  If a word is spelled correctly, it is frozen, so only the misspelled words can be selected from.  For example:

In this example, degree was spelled correctly, but forestry was not.  Since Google returns suggestions for the misspelled word, those options are placed into a listbox, while the other word is frozen since it's spelled correctly and no other suggestions were offered.
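
To make the frozen-versus-listbox behavior concrete, here is a rough sketch of the logic in JavaScript.  It is not the BuildSearch function from utils.js or the JSP snippet: the function names (buildSuggestionSlots, shouldShow, assembleQuery) and the shape of the corrections map are assumptions invented for this illustration.

```javascript
// corrections: hypothetical Map of misspelled word -> suggestions returned
// by the spell checker. Words without an entry are treated as correct.
function buildSuggestionSlots(query, corrections) {
  return query.split(/\s+/).filter(Boolean).map((word) => {
    const options = corrections.get(word.toLowerCase()) || [];
    return options.length === 0
      ? { word, frozen: true, options: [word] }   // shown as fixed text
      : { word, frozen: false, options };         // shown as a listbox
  });
}

// Mirrors the idea behind can_see: only show the "Did you mean" block
// when at least one word actually has a suggestion.
function shouldShow(slots) {
  return slots.some((slot) => !slot.frozen);
}

// Rebuild the query from the option index chosen for each slot
// (frozen slots always keep their original word).
function assembleQuery(slots, choices) {
  return slots
    .map((slot, i) => (slot.frozen ? slot.word : slot.options[choices[i] || 0]))
    .join(" ");
}

const corrections = new Map([["forestery", ["forestry", "forester"]]]);
const slots = buildSuggestionSlots("forestery degree", corrections);
console.log(shouldShow(slots));
console.log(assembleQuery(slots, [0]));
```

A Map is used for the lookup so that ordinary query words such as "constructor" cannot collide with built-in object properties.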

So that's it in a nutshell.  Hopefully someone else will find these snippets useful.

 

--TR

 

Anyway, once MarcEdit starts allowing users to utilize the Saxon XSLT engine, a new COM property will be made available so that users scripting against the MARCEngine can choose which engine is in use.  Here’s how it would look:

' Engine constants (values as given above)
Const MSXML = 1
Const SAXON = 2
lret = 0

Set obj_MARC = CreateObject("MARCEngine5.MARC21")
' Select the engine before running the transformation
obj_MARC.Set_XSLT_Engine = SAXON
lret = obj_MARC.XML2MARC("c:\test.xml", "c:\test.mrc", "c:\ead2marc21.xsl", "c:\marc21slim2mnemonic.xsl")

MsgBox "finished"

 –TR

 

So I’ve just about finished making the changes to MarcEdit 5.0 to include Saxon’s .NET component in the application.  This will give users support for XSLT 1.0 and 2.0, XPath, XQuery, etc.  By default, the application will continue to use the customized version of the System.XML classes provided in .NET, since they are a bit faster than the Saxon implementation when dealing with 1.0 documents, but users will have the option to change this setting, which might be the way to go depending on your XML.  The System.XML component is fairly strict and doesn’t support non-numeric character entities (e.g., &copy; isn’t supported), while the Saxon component does.
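
One way to cope with a strict parser, sketched here purely as an assumption and not as anything MarcEdit does, is to rewrite named entities as numeric ones before parsing.  The function name toNumericEntities and the tiny entity table are made up for the example; a real table would need many more entries.

```javascript
// The five entities every XML parser understands; these are left alone.
const PREDEFINED = new Set(["amp", "lt", "gt", "quot", "apos"]);

// A deliberately tiny sample table of HTML entity names and code points.
const NAMED = new Map([["copy", 169], ["nbsp", 160], ["eacute", 233]]);

// Replace known named entities with numeric character references.
// Unknown names are left untouched so nothing is silently dropped.
function toNumericEntities(xml) {
  return xml.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (match, name) => {
    if (PREDEFINED.has(name) || !NAMED.has(name)) return match;
    return "&#" + NAMED.get(name) + ";";
  });
}

console.log(toNumericEntities("&copy; 2006 &amp; caf&eacute;"));
```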

Anyway, the setting will be found in the properties box — so it will show up like:

properties.PNG

I’ve also been updating a number of the XSLT files provided with MarcEdit, with the biggest updates to the Dublin Core translations (both OAI harvester and straight DC translation).  And with XSLT 2.0 support added to MarcEdit, I’ll likely provide LC’s FRBR code with MarcEdit shortly as well.

Lastly, I’ll be changing the MarcEdit license shortly.  I’d like to make it easier for people to use and develop with the program so I’m looking at something like an MIT-like license.  Anyway, I’m thinking of using something like:

Redistribution and use in binary forms are permitted provided that the following conditions are met: 

  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 

  • Neither the name of TERRY REESE nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. 

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

 Anyway, I want to do a bit more testing to make sure that the COM changes, XSLT changes, etc. all work and then I’ll post.

–TR

 

In the opening keynote, Tim O’Reilly presented what he calls his 5 big ideas about open source.  Some I thought were fairly interesting.

  1. Architectures of participation beyond software development (Web 2.0)
    The idea here is that in this new world of Web 2.0, successful companies are looking to architectures where people build systems simply by using them.  He noted companies like Amazon and Google, where simply by using the tools, users contribute to the growth of the application.  He argues that open source needs to borrow from this model (and to a large degree I think it does), in that people build tools out of need.  The next step of course is to share those tools once created.
  2. Open Source Licenses are Obsolete:
    I guess context needs to be added here.  O’Reilly wasn’t saying that the licenses themselves are obsolete within the current software distribution model, but that in the Web 2.0 model, where software is no longer distributed but hosted, the open source license model is broken.  So how do we deal with this?  O’Reilly notes that we need to reinvent “open source” for a world in which software is performed rather than distributed and relies on vast databases and not just code.  He points to some new development models (mashups, developer APIs like the Google APIs and Flickr Services), and argues that the open source community needs to address this idea of building an open services API.
  3. Asymmetric Competition
    The idea that open source should look seriously at business frameworks and how they can be implemented to drive open source.
  4. Operations of advantage
    The notion that in the future, software will be hosted and virtualization will become more important.
  5. Open Data
    Data needs to be made free as we at this point don’t know what data lockdown will look like in the future.  He points to some services working on this problem: http://www.movemydata.org/

–TR

 

I’m going to be away from the office for a few days while attending the OSCON 2006 conference here in Portland, Oregon.  I’ll be periodically blogging about the sessions throughout the conference, so stay tuned.  BTW, if anyone is here at OSCON, give me a holler.
–TR

 

I had someone who uses UniMARC ask me about some problems that they were having with the conversion from UniMARC to Dublin Core.  Some of the characters were being skewed when the translation occurred.  The problem, of course, relates to the character set that the UniMARC records are encoded in.  Because a number of MARC formats utilize the MARC8 character set, I’ve coded MarcEdit to expect either MARC8 or UTF8 when moving to XML.  This matters because the MARC8 character set overlaps with the ISO 8859-1 (cp 1252) character map.  Since the UniMARC data was in the ISO 8859-1 encoding, the overlapping elements were translated into Unicode as if they were in MARC8.  Ugh.  Fortunately, MarcEdit provides a facility that allows users to migrate their data between other formats and UTF8.  This layer allows users to move data from any supported character set (any available Windows code page) to UTF8 and then back to MARC8 if necessary.  So for this example, I was able to recommend that the user use this character tool to convert the data into UTF8 and then process the file into XML.  It adds one extra step, but it works for now.  I’m thinking that in the near future, I’ll likely add an option in the Preferences to allow users to set a default character set.  This will allow the MarcEngine to handle these problems internally more easily when moving between MARC and XML.
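
The underlying problem, the same bytes meaning different things under different character sets, can be shown with a small sketch.  This is an illustration only: Node.js has no MARC8 decoder, so UTF8 stands in here for the "wrong assumption", and the helper names decodeAs and toUtf8Bytes are made up for the example.

```javascript
// Decode raw bytes under a given assumption about their encoding.
// Buffer is a Node.js built-in; "latin1" is its name for ISO 8859-1.
function decodeAs(bytes, encoding) {
  return Buffer.from(bytes).toString(encoding);
}

// Decode with the correct source encoding, then re-encode as UTF-8 bytes.
function toUtf8Bytes(bytes, sourceEncoding) {
  return Buffer.from(decodeAs(bytes, sourceEncoding), "utf8");
}

// "café" with é stored as the single ISO 8859-1 byte 0xE9
const record = [0x63, 0x61, 0x66, 0xe9];

console.log(decodeAs(record, "latin1"));    // right assumption
console.log(decodeAs(record, "utf8"));      // wrong assumption: last character is damaged
console.log(toUtf8Bytes(record, "latin1")); // é becomes the two bytes 0xC3 0xA9
```

Converting to UTF8 first, with the source character set stated explicitly, removes the guesswork from every later step.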

One of the things I’ve been happy with in .NET has been the ease of moving between character sets.  As some of you may know, the MARCEngine in MarcEdit has traditionally been written in assembler.  This meant that I wrote my own character conversions for the most part, making the process fairly tedious.  In C#, however, this is handled in a couple of lines of code.  So for example, if I were opening a file in Windows code page 1252 and needed to convert it to code page 1250 or even UTF8:

string s = "";
byte[] inBytes;   // "in" and "out" are C# keywords, so the buffers need other names
byte[] outBytes;
System.IO.StreamReader reader = new System.IO.StreamReader(@"c:\test1252.txt", System.Text.Encoding.GetEncoding(1252));
System.IO.StreamWriter writer = new System.IO.StreamWriter(@"c:\testutf8.txt", false, System.Text.Encoding.UTF8);

// Read the file in, decoding it as code page 1252
s = reader.ReadToEnd();
// Convert the code page 1252 bytes to UTF8 bytes
inBytes = System.Text.Encoding.GetEncoding(1252).GetBytes(s);
outBytes = System.Text.Encoding.Convert(System.Text.Encoding.GetEncoding(1252), System.Text.Encoding.UTF8, inBytes);
writer.Write(System.Text.Encoding.UTF8.GetString(outBytes));

reader.Close();
writer.Close();

 

So I noticed David Bigwood mention on his blog (http://catalogablog.blogspot.com/2006/07/static-oai.html) that he wished that there existed an XSLT stylesheet that moved data from MARC21XML to Static OAI.  Well, ask and you shall receive.  While I doubt that David will be able to make use of this (it sounds like he’s finished with his project), maybe someone down the line will find it useful.  I’ll eventually fold this into the MarcEdit xslt sample repository.  Until then, you can download it from here: http://oregonstate.edu/~reeset/marcedit/xslt/MARC21slim2StaticOAIDC.xsl

–tr

 

So I’m wrapping this up as we speak.  I’ve been testing some of the instances and noticed that I’d left out a small block of code in the modified subfield section, so I’ll be finishing that tonight with the hope of posting tonight.  But I thought I’d include a sample Perl script generated by the wizard.  This adds a single field to a MARC file.  As with the VBScript wizard, I think that the most useful function of the script wizard will be to provide a set of template code for processing MARC records in a number of different languages, but it also fills a gap for folks with very little scripting knowledge, so hopefully this will continue to fulfill that need.  Anyway, here’s the example: Generated PERL script

–tr

 

New updates to the MarcEdit 5.0 program.  The updates are as follows:

  1. Script Wizard:  The script wizard is back in MarcEdit.  Currently the script wizard only produces VBScripts (i.e., it has the same functionality as the 4.6 version), but the internal framework is such that new script languages should be easy to add.  I’ll be working on adding Perl output next, followed by Python output.  After that, I’ll see if there are additional languages that I should support.
  2. MarcValidator changes:  I’ve added a new field to the marcrules.txt file.  The new field is valid and is used to denote valid characters within a field.  Where might this be useful?  Well, the 020 for example, in testing valid data in the subfield.  If you want to use this new feature and have downloaded MarcEdit 5.0 in the past, you will need to modify your marcrules.txt file after doing the update.  The installer will not modify this file because it’s a config file that the user may modify.  Here’s a sample of the new format:

020 R INTERNATIONAL STANDARD BOOK NUMBER
valid [^0-9x] Valid Characters
ind1 blank Undefined
ind2 blank Undefined
a NR International Standard Book Number
c NR Terms of availability
z R Canceled/invalid ISBN
6 NR Linkage
8 R Field link and sequence number

Notice the valid statement.  Here, valid must be a regular expression that can evaluate as true or false.  In this case, the expression notes that any character that is not 0-9 or x is invalid.  So if a character is not one of these values, the expression evaluates as true and an error message is shown.  At some point in the future, I’ll likely create a special holder for ISBN so the tool could also validate digits (check digits, etc.).  Right now, the tool simply doesn’t know that it’s dealing with an ISBN; it’s just looking at the rules file to see how the field should be constructed.
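
How a rule like this behaves can be sketched in a few lines of JavaScript.  This is not MarcEdit’s validator: the function names findInvalidCharacters and isValidIsbn10 are made up for the example, the match is assumed to be case-sensitive, and the check digit function only illustrates the kind of ISBN-aware test mentioned above as a future possibility.

```javascript
// Return every character in `value` that the rule's pattern flags.
// With the pattern "[^0-9x]" a match means "this character is not allowed".
function findInvalidCharacters(value, invalidPattern) {
  const matcher = new RegExp(invalidPattern, "g");
  return value.match(matcher) || [];
}

// ISBN-10 check: weight the ten positions 10 down to 1 (X counts as 10);
// the weighted sum must be divisible by 11.
function isValidIsbn10(isbn) {
  if (!/^[0-9]{9}[0-9xX]$/.test(isbn)) return false;
  let sum = 0;
  for (let i = 0; i < 10; i++) {
    const ch = isbn[i];
    const digit = ch === "x" || ch === "X" ? 10 : Number(ch);
    sum += digit * (10 - i);
  }
  return sum % 11 === 0;
}

console.log(findInvalidCharacters("0131103628", "[^0-9x]"));    // nothing flagged
console.log(findInvalidCharacters("0-13-110362-8", "[^0-9x]")); // the hyphens are flagged
console.log(findInvalidCharacters("080442957X", "[^0-9x]"));    // the uppercase X is flagged
console.log(isValidIsbn10("080442957X"));
```

Note that under a case-sensitive match the sample pattern flags an uppercase X, so a rule meant to accept both forms would need to list both (for instance [^0-9xX]).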

Anyway, those are the two big updates.  You can download the file from: MarcEdit50_Setup.exe

–TR

 

Shuttle launch days are exciting days around our house.  I love watching the shuttle launch.   And my boys love it as well.  At this point, there are 2 minutes till the shuttle launch and I find it very appropriate that we should be sending the shuttle back into space on the 4th.  I know that there are many people doing it — but I pray for a safe trip….

And they are off.  The entire house is watching the shuttle as it flies.  At 4 minutes, a guess of ~5000 mph and 60 miles of altitude.  Very cool.

–TR