Jan 2 2008

Using MarcEdit to reuse (and maybe import) items in Dspace

I’ve been thinking a little bit about some of the things that I use MarcEdit for and have been pushing some of this work off my desk to some of the staff in our technical services department.  We actually use MarcEdit quite a bit when it comes to sharing metadata from our Dspace instance with other systems, like OCLC’s WorldCat and our online catalog.  For example, we use MarcEdit to automatically generate MARC21 records for our theses submitted through Dspace.  The process seems to work fairly well, and has been very easy for our staff to learn.  I should write an article documenting this process and how it’s working at OSU at some point.

To that end, I’m writing a plug-in for MarcEdit that may enable me to mainstream the processing of web page archiving in Dspace.  At this point, the process is a bit too manual for my tastes.  Along with spidering a site (using whatever the chosen depth may be), there is this pesky manual step of flattening the site and making the urls relative.  Not a big deal (unless there are file name collisions, which there always are when crawling more than one level deep), but it takes time.  So, I spent some time this afternoon and wrote a threaded web crawler.  Seems to work well.  At this point, I just need to add the logic to flatten all paths, and come up with a naming scheme to re-write all urls to provide unique file names.  Once I get that down, building the batch import package for Dspace should be fairly trivial.  Not sure how much time I’ll have to work on this over the week/weekend, but it would be a pretty cool project to finish, I think.  It would certainly allow the library to provide site archiving as a Dspace option (at this point, it’s only done under very special circumstances) and should simplify the process enough that it could probably become a mainstream process.
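To make the naming problem concrete, here is a minimal sketch of one way the flattening step could assign unique file names.  This is not the plug-in's code: the UrlFlattener class, the GetFlatName method and the "_1", "_2" suffix scheme are all assumptions made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of MarcEdit: maps crawled URLs to unique flat file names.
public class UrlFlattener
{
    // URL -> flat file name that has already been assigned to it.
    private readonly Dictionary<string, string> assigned = new Dictionary<string, string>();

    // Names already taken, compared case-insensitively to suit Windows file systems.
    private readonly HashSet<string> used = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    public string GetFlatName(Uri url)
    {
        string key = url.AbsoluteUri;
        string existing;
        if (assigned.TryGetValue(key, out existing))
        {
            return existing; // the same URL always maps to the same file
        }

        // Use the last path segment; directory-style URLs fall back to index.html.
        string name = Path.GetFileName(url.AbsolutePath);
        if (string.IsNullOrEmpty(name))
        {
            name = "index.html";
        }

        string stem = Path.GetFileNameWithoutExtension(name);
        string ext = Path.GetExtension(name);
        string candidate = name;
        int n = 1;

        // On a collision, append a counter until the name is free.
        while (!used.Add(candidate))
        {
            candidate = stem + "_" + n + ext;
            n++;
        }

        assigned[key] = candidate;
        return candidate;
    }
}
```

With something like this in place, re-writing the links in each harvested page would come down to looking up GetFlatName for every href and src value and writing the returned name as a relative url.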

Anyway, if I do get a chance to get this finished, I’ll certainly make it available as a plug-in (with source).  Of course, if someone has already developed a simplified process that requires no manual processing after harvest, I would love to hear it.

–TR


Dec 26 2007

MarcEdit 5.x and OCLC Plug-in Update

I took a couple of minutes and made a few changes to MarcEdit and the OCLC plug-in to provide some additional functionality to the plug-in framework and fix an error in the OCLC plug-in. 

Changes:

MarcEdit:

  • One real change.  In the MacroInterfaces.dll (the library that allows the Scripting interface and the Plug-in interface access to the MarcEditor and its functions) I’ve added two new functions: AddButton and RemoveButton.  These two functions allow users to have plug-ins place buttons on the toolbar of the application (at least, on specific windows). 

OCLC Plug-in:

So what will you see with the changes?  Well, the big change you will see is when you initialize the plug-in in the MarcEditor.  Once you have downloaded the update to both MarcEdit and the plug-in (you need both), you will see the following when the plug-in is executed:

image

The new button added has been highlighted.  This button now acts as the new Save button when you have made your changes to the OCLC data records.  This will move the data back into the OCLC Save File.  Remember — at this point, you will want to make a backup of your Save Files before you make your changes — just in case there are other fields in OCLC’s XML format that are different than I would have expected. 

If you have downloaded the OCLC Plug-in and would like to update it, the process isn’t as streamlined as I’d like at this point (I’ll fix that this week while I’m taking some time off to recharge).  Essentially, you need to uninstall the plug-in, either by using the plug-in manager or by deleting oclc_helper.dll from the MarcEdit plug-ins directory (generally, c:\program files\marcedit 5.0\plugins\).  If you uninstall with the plug-in manager, you will need to close and restart MarcEdit, then open the plug-in manager and download the new plug-in.  If you delete the library directly from the plug-ins directory, just open MarcEdit, select the plug-in manager and download the plug-in.

I’ve uploaded a new version of MarcEdit: MarcEdit51_Setup.exe.  I’ve also uploaded a new version of the plug-in (download this through the plug-in manager [see above]).  Source can be downloaded from: oclc_helper.zip.

If you have a strong desire to see how this type of interaction is accomplished in C#, please see the following post and sample project file: http://oregonstate.edu/~reeset/blog/archives/481.

 

–TR


Dec 24 2007

C# plug-ins continued — Interacting with one’s hosted application

Example Project Source: PluginProject.zip

Because I’ve been doing a lot of work with MarcEdit and plug-ins, I thought I’d post some sample code for anyone interested in how this might work.  Essentially, the sample project includes 3 parts — a host application, a set of Interfaces and a Shared library.  Making this work requires a couple of important parts. 

First, the host application (either the form or class) needs to implement the set of interfaces.  So for example, if interaction with a form in the hosted application was needed, you would configure the form to implement a set of interfaces.  This would look like:
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;

namespace HostApp
{
    // The form implements IHost so that plug-ins can reach it through the interface.
    public partial class Form1 : Form, HostInterfaces.IHost
    {
        //...
    }
}

This implements the IHost interface (link to msdn), a generic interface that allows you to pass objects between dynamically loaded libraries.  .NET includes an IScript interface that allows for scripting functionality as well.

Anyway, the interfaces are simply like delegates: they define the visible functions/methods that will be accessible to a foreign assembly.  This is the simplest file to create.  It looks something like this:

using System;
using System.Collections.Generic;
using System.Text;

namespace HostInterfaces
{
public interface IHost
{
System.Windows.Forms.Label label { get;}
System.Windows.Forms.ToolStripButton AddButton(string caption);
void RemoveButton(System.Windows.Forms.ToolStripButton t);

}
}

Finally, the Dynamic assembly has the ability to work with any function/object within the host application that has been made public through the interface.  For this sample project, I’ve shown how to modify a label (on the host application), add a button to a toolbar and respond to click events from that button. 

The project is a simple one — but should go a long way towards showing how this works.
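For anyone who wants the shape of the pattern without opening the project, here is a stripped-down sketch with the Windows Forms types removed so it runs anywhere.  None of these names come from PluginProject.zip: IMiniHost, MiniHost and SamplePlugin are hypothetical stand-ins, with a string property playing the part of the label and a caption/callback pair playing the part of the toolbar button.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, UI-free stand-in for the IHost interface in the sample project.
public interface IMiniHost
{
    string Status { get; set; }                       // stands in for the host's label
    void AddCommand(string caption, Action onClick);  // stands in for AddButton
}

// The host implements the interface, just as Form1 implements IHost.
public class MiniHost : IMiniHost
{
    private readonly Dictionary<string, Action> commands = new Dictionary<string, Action>();

    public string Status { get; set; }

    public void AddCommand(string caption, Action onClick)
    {
        commands[caption] = onClick;
    }

    // Simulates the user clicking a button that a plug-in added.
    public void Click(string caption)
    {
        commands[caption]();
    }
}

// The plug-in only ever sees the interface, never the concrete host class.
public class SamplePlugin
{
    public void Initialize(IMiniHost host)
    {
        host.Status = "plug-in loaded";
        host.AddCommand("Save", delegate { host.Status = "Save clicked"; });
    }
}
```

The design point is that the plug-in assembly needs a reference to the interfaces library only, so the host can change freely as long as the interface stays the same.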

Cheers,

–TR



Dec 21 2007

OCLC’s Connexion XML — why, oh why?

As I’d noted previously (http://oregonstate.edu/~reeset/blog/archives/479), some early testers had found that the Connexion plug-in that I’d written for MarcEdit stripped the 007.  I couldn’t originally figure out why, since it’s just a control field and their syntax for control fields is pretty straightforward.  However, after looking at a few records with 007 fields, I could see why.  In Connexion, OCLC lets folks code the 007 using delimiters like a normal variable MARC field (when it’s not), and they save it as such, using delimiters.  For example:

<v007 i2=" " i1=" " im="0">
  <sa>
    <d>s</d>
  </sa>
  <sb>
    <d>d</d>
  </sb>
  <sd>
    <d>f</d>
  </sd>
  <se>
    <d>s</d>
  </se>
  <sf>
    <d>n</d>
  </sf>
  <sg>
    <d>g</d>
  </sg>
  <sh>
    <d>n</d>
  </sh>
  <si>
    <d>n</d>
  </si>
  <sj>
    <d>z</d>
  </sj>
  <sk>
    <d>u</d>
  </sk>
  <sl>
    <d>u</d>
  </sl>
  <sm>
    <d>u</d>
  </sm>
  <sn>
    <d>d</d>
  </sn>
</v007>

I’ll admit, I have no idea why they went with this format.  From my perspective, it’s clunky.  The 007, as a single control field, is fairly easy to parse: it can have up to 14 bytes, with the number of bytes determined by byte 0 of the data element (the category of material).  In this format, you actually have to create 9 different templates for the different possibilities in order to account for different field lengths, byte combinations and delimiter settings.  Honestly, my first impression when looking at this was that it’s a perfect example of how something so simple can become much more difficult than need be.  Personally, I would have been happier had they broken from their MARCXML-like syntax for this one field to create a special 007 element.  Again, this is something that could have been easily abstracted in the XSLT translation, but to be fair, I don’t think that they figured anyone but OCLC’s Connexion team would ever be trying to work with this.

So how am I solving it?  Well, one of the cool things about working with XSLT (and .NET in general) is the ability to use extensions to help fill in missing functionality in the XSLT language (in my case, the ms:script extension in the msxml library).  Since this transformation isn’t one that I’m really sharing (outside the plug-in), I’m not too worried about its portability.  So, what I’ve done is create a number of helper C# functions and embed them within the xslt document to aid processing.  For example,

<xsl:stylesheet version="1.0"
xmlns:marc="http://www.loc.gov/MARC21/slim"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:ms="urn:schemas-microsoft-com:xslt"
 xmlns:osu="urn:oregonstate-edu:xslt"
 extension-element-prefixes="osu">
  <xsl:output method="xml" indent="yes" />
  <ms:script language="C#" implements-prefix="osu">
    <![CDATA[
        public int length(string s) {
          s = s.ToLower();
          if (s=="c") {
             return 14;
          } else if (s=="d") { return 6;}
          else if (s=="a") { return 8;}
          else if (s=="h") { return 13;}
          else if (s=="m") { return 10;}
          else if (s=="k") { return 6;}
          else if (s=="g") { return 9;}
          else if (s=="r") { return 11;}
          else if (s=="s") { return 14;}
          else if (s=="f") { return 10;}
          else if (s=="v") { return 9;}
          else { return 8;}
        }
      ]]>
  </ms:script>
 

This is a simple function that I’m using to track the number of elements needed for the processing template.  I don’t want to create 9 different XSLT templates, one for each processing type, so I’m using some embedded C# to simplify the process.  On the plus side, using these embedded scripts makes the translation process much faster on the .NET side (since .NET compiles xslt to byte code anyway before running any translation process), and this is a technique that I’ve never really had to use before, so I was able to get a little practical experience.  Still don’t like it though.
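For comparison, here is a rough sketch of what collapsing the delimited 007 back into a flat control field might look like in plain C#, outside of XSLT.  This is not the plug-in's code.  The Field007 class is hypothetical, the lengths are copied from the length() helper above, and it assumes that subfield element sa holds byte 0, sb holds byte 1 and so on, with missing subfields left as blanks.  It also assumes the elements carry no XML namespace, as in the sample record.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical helper, not the plug-in's code.
public static class Field007
{
    // Field lengths by category of material, copied from the embedded length() helper.
    private static readonly Dictionary<char, int> Lengths = new Dictionary<char, int>
    {
        {'c', 14}, {'d', 6}, {'a', 8}, {'h', 13}, {'m', 10}, {'k', 6},
        {'g', 9}, {'r', 11}, {'s', 14}, {'f', 10}, {'v', 9}
    };

    public static int Length(char category)
    {
        int len;
        return Lengths.TryGetValue(char.ToLowerInvariant(category), out len) ? len : 8;
    }

    // Assumption: element <sX> holds the byte for position (X - 'a').
    public static string Flatten(XElement v007)
    {
        Dictionary<int, char> bytes = new Dictionary<int, char>();
        foreach (XElement sub in v007.Elements())
        {
            string name = sub.Name.LocalName;
            string value = (string)sub.Element("d");
            if (name.Length != 2 || name[0] != 's' || string.IsNullOrEmpty(value))
            {
                continue; // not a subfield element, or no data in it
            }
            bytes[name[1] - 'a'] = value[0];
        }

        if (!bytes.ContainsKey(0))
        {
            return string.Empty; // no category of material, so the length is unknown
        }

        // Start with blanks, then drop each byte into its position.
        char[] field = new string(' ', Length(bytes[0])).ToCharArray();
        foreach (KeyValuePair<int, char> b in bytes)
        {
            if (b.Key >= 0 && b.Key < field.Length)
            {
                field[b.Key] = b.Value;
            }
        }
        return new string(field);
    }
}
```

Under those assumptions, the sample record shown earlier has no sc element, so position 2 stays blank and the rest of the bytes land in positions 0 through 13.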

–TR


Nov 19 2007

Dynamically loading and Unloading Assemblies in C#

While working on a plugin manager for a program written in C#, I found myself with a need to be able to load and unload assemblies dynamically within an application.  In C#, loading assemblies is a fairly easy prospect: one just needs to make use of the System.Reflection namespace.  Something like the following:

System.Reflection.Assembly assembly = System.Reflection.Assembly.LoadFile(@"c:\yourassembly.dll");

However, if you need to unload the assembly, good luck.  The .NET Assembly class doesn’t include an unload method.  If you need to be able to dynamically load and unload assemblies, you need to work with the AppDomain class.  The .NET framework works on an Application Domain model, so for items like plugins (where you may need to load, unload or modify an assembly), you need to create an Application Domain to load assemblies into.  This way, when you need to unload an assembly, you use the Unload method found within the AppDomain class.

Of course, when dealing with plugins, you likely will need to create a new application domain for each plugin to be loaded.  This is because you unload the appdomain, not the assemblies attached to the domain.  So for my project, I decided to create something much like the TempFileCollection.  In a global class, I decided to create a hash that stores a domain name and the domain object.  Using this method, I can do something like the following:

string path = cglobal.mglobal.AppPath() + "plugins" + System.IO.Path.DirectorySeparatorChar;
string[] files = System.IO.Directory.GetFiles(path);

lstInstalled.Items.Clear();
foreach (string f in files)
{
    try
    {
        string fileName = System.IO.Path.GetFileName(f);

        // One application domain per plug-in, so each can be unloaded on its own.
        System.AppDomain domain = System.AppDomain.CreateDomain(fileName);

        // Read the whole file up front; loading from bytes does not hold the file open.
        byte[] b = System.IO.File.ReadAllBytes(f);
        domain.Load(b);

        // Find the plug-in among the assemblies now present in the domain.
        System.Reflection.Assembly[] a = domain.GetAssemblies();
        int index = 0;
        for (int x = 0; x < a.Length; x++)
        {
            if (a[x].GetName().Name + ".dll" == fileName)
            {
                index = x;
                break;
            }
        }

        // List the plug-in's name, version and size.
        System.Windows.Forms.ListViewItem item = new ListViewItem();
        item.Text = a[index].GetName().Name + ".dll";
        item.SubItems.Add(a[index].GetName().Version.ToString());
        item.SubItems.Add(b.Length.ToString());
        lstInstalled.Items.Add(item);

        // Remember the domain under the file name so it can be unloaded later.
        cglobal.mglobal.domains.Add(fileName, domain);
    }
    catch { } // ignore files that fail to load
}
 

Then, if we need to unload the assembly, we can unload the domain that it’s attached to.  Something like:

for (int x = 0; x < lstInstalled.Items.Count; x++)
{
    if (lstInstalled.Items[x].Selected)
    {
        try
        {
            string name = lstInstalled.Items[x].Text;
            string file = cglobal.mglobal.AppPath() + "plugins" + System.IO.Path.DirectorySeparatorChar + name;
            if (System.IO.File.Exists(file))
            {
                // Unload the domain the plug-in lives in, then remove it from disk.
                System.AppDomain.Unload((System.AppDomain)cglobal.mglobal.domains[name]);
                cglobal.mglobal.domains.Remove(name);
                System.IO.File.Delete(file);
            }
        }
        catch { }
    }
}

Seems a little more involved than it has to be, but once you know how it works, it’s not that big of a deal.

–TR


Aug 1 2007

.NET 64-bit processor memory issues when using sendmessage to access a winform element

I’m posting this in hopes that it will save someone else a lot of time, or that someone who knows .NET a bit better than I do can provide a better solution.

Problem:

Last week, I had someone ping me regarding MarcEdit and a problem that they were running into with the Editor when running it on a 64-bit version of Windows 2003 Server.  MarcEdit is compiled for any processor, so in theory, the framework should adjust the variable types to the current CPU type and go on its merry way.  And were it not that I have to work with some unmanaged code within my application, I’m sure that this would be the case.  However, when opening the MarcEditor, the user was getting the following error message:

This is odd because I test MarcEdit on every version of Windows from 98 to Vista.  The problem, however, is that I had never run the program in a 64-bit version of Windows.

Background:

I did a little bit of research, and found what I thought to be the problem.  The 64-bit version of windows shares many of the same signatures as its 32-bit counter-part, but one place where the signatures differ is in the Messaging Queue.  SendMessage, for example, which uses integers to pass values between processes had been updated to 64 bit integers and would crash if the wrong data type is sent into the function.  No problem, I fixed the signature issue, but the error message remained.  What I didn’t realize is that this wasn’t the actual problem (though it was a problem).  The real problem seemed to be related to simply accessing the RichTextbox Handle and passing it the callback.  Anytime the Handle was touched and passed, this error would be generated.

Solution:

So, Microsoft does make the Enterprise version of Windows 2003 Server available on a trial basis for developers wanting to test their software.  So, I dug up a box with an AMD 64-bit processor and set to installing the software.  Next, I installed SharpDevelop, an Open Source IDE for .NET.  I created a small sample program to isolate the code that was causing me problems.

In my case, the code that was causing the problem is necessary because of MARC being a UTF8 encoded data format.  Microsoft’s Richtext library supports the loading of plaintext (ASCII), Unicode text, text with OLE objects and text in just about any character format, including UTF8.  Unfortunately, the .NET framework only exposes plaintext and Unicode text as supported formats.  This means that in order to load UTF8 data and utilize the component’s streaming nature to minimize the memory footprint during loading, we need to essentially write our own EditStreamCallback function, create the delegates, the EDITSTREAM struct, etc.

And therein lies the rub.  When compiling the code in SharpDevelop, I specified that the code should be targeted specifically for a 64-bit processor.  During compile, I got two warning messages that two core .NET components are compiled specifically for 32-bit processors.  Since the signatures on the 64 and 32 bit machines are identical, one can generally ignore these compilation warnings, as the framework does its magic.  However, the fact that I’m utilizing functionality from one of these two components within an unmanaged code block causes the problem.  Within .NET (and 64-bit environments in general), a 64-bit process cannot load a library compiled for 32-bit.  A 32-bit process can run within a 64-bit environment; the two just cannot share code within a single process.  My best guess is that this is what was happening.

Since these two .NET components were compiled specifically for 32-bit processors, my attempts to load them into a 64-bit process and utilize them within an unmanaged code block caused issues.  The solution is a simple one: for the GUI application of MarcEdit (which doesn’t do much anyway), the program simply needs to be compiled to target 32-bit processors.  Now it runs just fine within a 64-bit environment, and will remain so until Microsoft cleans up these two core libraries.  With that said, if anyone has a better way of dealing with this problem (code is attached, so if you can make it work), I’d love to hear about it.

RichText Code:

Finally, it’s pretty difficult to find example code dealing with the Richtext components in C#.  I think this is primarily because most folks that use high level languages like C# either don’t have a need for it or don’t have the background in C++ to understand what is actually happening at the Proc level.  Anyway, to that end, I’m posting the source to my small sample program (get it here) that I used to diagnose this problem.  The trick to doing this type of interaction is to avoid the use of integer class variables.  In .NET, you have to remember that you are dealing with managed code, so when you make the call to an API like SendMessage, you should be marshalling all your data and passing it into the function via the IntPtr structure.  The only exception to that with the SendMessage API is the message argument, which Microsoft defines as an unsigned 32-bit integer on all platforms, though for practical purposes, the message argument can be declared as a 32-bit integer.
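The reason IntPtr matters here can be shown in a few lines.  This is only an illustrative sketch, and the PointerWidth class and its method names are made up for it: an int is always four bytes, while IntPtr takes the pointer width of whatever process the code happens to be running in.

```csharp
using System;

// Hypothetical illustration of why pointer-sized arguments should be IntPtr, not int.
public static class PointerWidth
{
    // Width in bytes of a pointer-sized argument (such as wParam or lParam)
    // in the current process: 4 in a 32-bit process, 8 in a 64-bit one.
    public static int NativeParamSize()
    {
        return IntPtr.Size;
    }

    // True only when a plain int happens to be as wide as a native pointer,
    // which is the case in a 32-bit process and not in a 64-bit one.
    public static bool IntMatchesNativeParam()
    {
        return sizeof(int) == IntPtr.Size;
    }
}
```

A signature written with int arguments lines up with the native one only while IntMatchesNativeParam() is true, which is why code like that can run for years on 32-bit Windows and then fall over on a 64-bit box.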

API/Delegate Declarations

// Requires: using System; using System.Runtime.InteropServices;

// Stream format flags and the UTF-8 code page identifier.
private const int SF_USECODEPAGE = 0x020;
private const int SF_TEXT = 0x001;
private const int SF_RTF = 0x002;
private const int CP_UTF8 = 65001;

private const int WM_SETREDRAW = 0x000B;

// Rich edit messages.
private const int WM_USER = 0x400;
private const int EM_STREAMIN = WM_USER + 73;
private const int EM_GETEVENTMASK = (WM_USER + 59);
private const int EM_SETEVENTMASK = (WM_USER + 69);
private const int EM_STREAMOUT = WM_USER + 74;
private const int ENM_NONE = 0;
private const int EM_SETTEXTMODE = WM_USER + 89;

private const int TM_PLAINTEXT = 1;

// Rich edit option flags and the operations used to apply them.
private const int ECO_AUTOWORDSELECTION = 0x00000001;
private const int ECO_AUTOVSCROLL = 0x00000040;
private const int ECO_AUTOHSCROLL = 0x00000080;
private const int ECO_NOHIDESEL = 0x00000100;
private const int ECO_READONLY = 0x00000800;
private const int ECO_WANTRETURN = 0x00001000;
private const int ECO_SAVESEL = 0x00008000;
private const int ECO_SELECTIONBAR = 0x01000000;
private const int ECO_VERTICAL = 0x00400000;
private const int ECOOP_SET = 0x0001;
private const int ECOOP_OR = 0x0002;
private const int ECOOP_AND = 0x0003;
private const int ECOOP_XOR = 0x0004;

private const int EM_SETOPTIONS = (WM_USER + 77);
private const int EM_GETOPTIONS = (WM_USER + 78);

// Managed prototype for the callback the control invokes to pull data.
// Note: the native EDITSTREAMCALLBACK prototype declares cb as LONG and pcb as LONG*,
// so these IntPtr parameters only line up with it in a 32-bit process.
delegate IntPtr EditStreamCallback(IntPtr dwCookie, IntPtr pbBuff, IntPtr cb, out IntPtr pcb);

struct EDITSTREAM
{
    public IntPtr dwCookie;                 // application-defined value handed back to the callback
    public IntPtr dwError;                  // set by the control if streaming fails
    public EditStreamCallback pfnCallback;  // the callback itself
}

// Plain message form: both parameters are passed as IntPtr.
[DllImport("user32.dll", CharSet = CharSet.Auto, SetLastError = false)]
static extern IntPtr SendMessage(HandleRef hWnd, Int32 Msg, IntPtr wParam, IntPtr lParam);

// Streaming form: lParam is a reference to the EDITSTREAM structure.
[DllImport("user32.dll", CharSet = CharSet.Auto, SetLastError = false)]
static extern IntPtr SendMessage(HandleRef hwnd, Int32 msg, IntPtr wParam, ref EDITSTREAM lParam);

In the declarations, you will see that two forms of SendMessage have been defined: one where the lParam references the EDITSTREAM structure and one where it references an IntPtr structure.  The former is used when streaming data into the RichText window; the latter is used when sending regular messages between controls.  It should be noted that the latter could be removed in .NET 2.0 by making use of the System.Windows.Forms.Message class, which essentially allows you to send messages to controls so long as all arguments can be sent as IntPtrs.

After the declarations, the remainder of the code is setting up the actual streaming, and creating the function that the delegate prototypes.  In this example, I’ve called the function that sets up the streaming ReadRichTextStream, and the actual streaming function StreamIn.  These functions would look like the following:

ReadRichTextStream: Accepts a RichTextBox Object and the filename of the file to load.

private void ReadRichTextStream(System.Windows.Forms.RichTextBox objRich, string sfilename)
{
    string filename = sfilename.ToLower();
    objRich.Text = "";

    // Pick the stream format from the file extension.
    int eType = SF_TEXT;
    if (filename.EndsWith(".mrk") || filename.EndsWith(".mrk8") || filename.EndsWith(".tmp") || filename.EndsWith(".xml"))
    {
        // UTF-8 text: code page in the high word, format flags in the low word.
        eType = (((CP_UTF8) << 16) | SF_USECODEPAGE | SF_TEXT);
    }
    else if (filename.EndsWith(".bmrk"))
    {
        eType = SF_TEXT;
    }
    else if (filename.EndsWith(".rtf"))
    {
        eType = SF_RTF;
    }
    else if (filename.EndsWith(".txt"))
    {
        eType = SF_TEXT;
    }
    else
    {
        eType = (((CP_UTF8) << 16) | SF_USECODEPAGE | SF_TEXT);
    }

    //this.Redraw = false;
    System.IO.FileStream fs = new System.IO.FileStream(sfilename, System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read);
    Application.DoEvents();

    // Hand the stream to the callback through the cookie.
    System.Runtime.InteropServices.GCHandle gch = System.Runtime.InteropServices.GCHandle.Alloc(fs, System.Runtime.InteropServices.GCHandleType.Normal);
    EDITSTREAM es = new EDITSTREAM();
    es.dwCookie = (IntPtr)gch;
    EditStreamCallback callback = new EditStreamCallback(StreamIn);
    es.pfnCallback = callback;

    SendMessage(new HandleRef(objRich, objRich.Handle), (Int32)EM_STREAMIN, (IntPtr)eType, ref es);

    //Remember to free allocated memory to avoid leaks.
    gch.Free();
    fs.Close();
}
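The eType value used for UTF8 files in ReadRichTextStream packs two things into one integer: the code page in the high 16 bits and the format flags in the low 16 bits.  A small sketch makes the arithmetic visible.  The StreamFormat class and its Compose, CodePageOf and FlagsOf helpers are hypothetical names for illustration; the constants are the same ones declared above.

```csharp
using System;

// Hypothetical helper that shows how the EM_STREAMIN format value is put together.
public static class StreamFormat
{
    public const int SF_TEXT = 0x001;
    public const int SF_USECODEPAGE = 0x020;
    public const int CP_UTF8 = 65001;

    // Code page goes in the high word, format flags in the low word.
    public static int Compose(int codePage, int flags)
    {
        return (codePage << 16) | flags;
    }

    // Shift as unsigned so the sign bit is not dragged into the result.
    public static int CodePageOf(int eType)
    {
        return (int)(unchecked((uint)eType) >> 16);
    }

    public static int FlagsOf(int eType)
    {
        return eType & 0xFFFF;
    }
}
```

One thing worth knowing: 65001 is 0xFDE9, so shifting it into the high word sets the top bit, and Compose(CP_UTF8, SF_USECODEPAGE | SF_TEXT) comes out as 0xFDE90021, which is a negative number when read as a signed int.  That is expected and harmless for the bit pattern itself, but it means the value is sign-extended when it is cast to IntPtr in a 64-bit process.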

StreamIn: StreamIn is the callback that actually reads the data from the file and pushes it into the buffer supplied by the RichTextBox, which then prints it into the control.

public IntPtr StreamIn(IntPtr dwCookie, IntPtr pbBuff, IntPtr cb, out IntPtr pcb)
{
    // Recover the FileStream that ReadRichTextStream stored in the cookie.
    System.IO.FileStream fs = (System.IO.FileStream)((GCHandle)dwCookie).Target;
    byte[] buffer = new byte[cb.ToInt32()];

    try
    {
        // Read at most cb bytes and report how many were actually read.
        int read = fs.Read(buffer, 0, buffer.Length);
        pcb = (IntPtr)read;

        if (read > 0)
        {
            // Copy the managed buffer into the control's unmanaged buffer.
            System.Runtime.InteropServices.Marshal.Copy(buffer, 0, pbBuff, read);
        }

        // Zero means success; a pcb of zero tells the control the stream has ended.
        return IntPtr.Zero;
    }
    catch
    {
        // A nonzero return value tells the control that the read failed.
        pcb = IntPtr.Zero;
        return (IntPtr)1;
    }

    // The stream is not closed here: the control calls back once per chunk,
    // and ReadRichTextStream closes the stream after SendMessage returns.
}

Anyway, the gist of all this, is that by setting the compile option to target 32-bit processors in the MarcEdit gui, I’ve been able to solve this issue.  I’m having the user that found the problem verify that I’ve indeed hunted this bug down and squashed it — so as soon as that’s confirmed, I’ll be pushing this fix out with MarcEdit.

–TR