Jul 8 2009

Supporting Right to Left languages in MarcEdit

Life is a funny thing.  I’ve never been someone with a distinct desire to understand the intricacies of language or characters.  That’s stuff that my wife enjoys – she’s the linguist while I’ve barely mastered English. :)

Yet, here I am, thanks to MarcEdit, having to become much more familiar with character sets and language than I’d ever dreamed.  Some of these issues are easy.  Character sets for example.  There are actual solutions to the character set issues (both moving and identifying).  In MarcEdit, I found a long time ago that I could appease the math geek inside of me by forgoing the creation of large translation tables and coming up with mathematical translations for moving data between various character sets.  It’s not perfect (i.e., there are exceptions that require the use of a lookup table) – but it’s slick enough that in MarcEdit all language translations (save CJK) are handled through a relatively simple formula with support for exceptions.
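
To make the "formula plus exceptions" idea concrete, here is a minimal Python sketch.  It is not MarcEdit's actual translation code, and it does not use MARC-8.  It uses ISO-8859-5 (Cyrillic) to Unicode purely as an illustration, because most of that mapping is a constant offset.  The names OFFSET, EXCEPTIONS, to_unicode and translate are made up for this example.

```python
# Illustration only: map ISO-8859-5 bytes to Unicode with one arithmetic
# rule plus a small exception table (an assumption-laden sketch, not
# MarcEdit's real code).

OFFSET = 0x0360  # byte 0xA1 maps to U+0401, so the shift is 0x0401 - 0x00A1

# High-range bytes that do not follow the offset rule.
EXCEPTIONS = {
    0xA0: 0x00A0,  # no-break space
    0xAD: 0x00AD,  # soft hyphen
    0xF0: 0x2116,  # numero sign
    0xFD: 0x00A7,  # section sign
}


def to_unicode(byte_value: int) -> str:
    """Translate a single ISO-8859-5 byte to a Unicode character."""
    if byte_value in EXCEPTIONS:
        return chr(EXCEPTIONS[byte_value])
    if byte_value >= 0xA1:
        return chr(byte_value + OFFSET)  # the "formula"
    return chr(byte_value)  # ASCII and C1 control bytes are unchanged


def translate(data: bytes) -> str:
    """Translate a whole byte string, one byte at a time."""
    return "".join(to_unicode(b) for b in data)
```

The whole table collapses into one addition and four exceptions, which is the appeal of the approach; a character set with a less regular layout would need a longer exception list.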

Language on the other hand stymies me.  Lately I’ve been having conversations with Arabic and Hebrew users who would like to be able to support Right to Left processing of their language in MarcEdit.  In general, support for this is built right into the operating system through a localized implementation of the Unicode Bidirectional (BiDi) Algorithm.  So, in MarcEdit, you can mark the MarcEditor window and have it shift to the OS supported output.  The problem comes from MARC and the Notepad-like format MarcEdit uses for editing.  Because of how the algorithm handles numbers, punctuation, and mixed Latin and non-Latin characters, the output presented to the user is close – but not perfect.  For example, in Arabic:

[Screenshot: Arabic record in the MarcEditor]

This is a window from a colleague in Dubai who has been helping me wrap my head around the changes needed to make MarcEdit easier to use for the creation of non-English records.

The problem that we run into specifically has to do with numbers and words.  Many of the switched elements are elements that have numbers (or groups of numbers) attached to them – which is problematic because there is a lot of numerical data in MARC.
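
One way to see why numbers and punctuation cause trouble: the BiDi algorithm assigns every character a direction class, and digits and punctuation fall into "weak" or "neutral" classes whose final position depends on their neighbours.  The Python sketch below uses the standard unicodedata module to list those classes.  The sample line is a made-up mnemonic-style field, not the record from the screenshot, and the helper names are hypothetical.

```python
import unicodedata
from typing import Dict, List, Tuple


def bidi_classes(text: str) -> List[Tuple[str, str]]:
    """Return (character, BiDi class) pairs for each character in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


def summarize(text: str) -> Dict[str, int]:
    """Count how many characters fall into each BiDi class."""
    counts: Dict[str, int] = {}
    for _, cls in bidi_classes(text):
        counts[cls] = counts.get(cls, 0) + 1
    return counts


# Hypothetical mnemonic line: tag, blank indicators, subfield codes,
# an Arabic word, and a date.
sample = "=260  \\\\$a\u0628\u064a\u0631\u0648\u062a,$c1998."

for ch, cls in bidi_classes(sample):
    print(repr(ch), cls)
```

The Arabic letters come back as AL (strongly right to left), the subfield letters as L (strongly left to right), and the digits as EN (a weak type).  A MARC line is full of exactly these weak and neutral characters sitting between strong ones.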

I’ve been pounding away at this, and I’m hoping that I might have come up with a workable solution.  It’s required writing a custom implementation of the Unicode Bidirectional (BiDi) Algorithm – though, I’ll admit, at this point, it’s actually a very simplified version so I’m going to need to send this to a few people to make sure that it doesn’t butcher some general assumptions.  But with a little bit of work, I’ve been able to get an output that looks something like this:

[Screenshot: MarcEditor output with the custom BiDi handling]

Of course, since I’ve changed the output algorithm, I’ve had to create a method for reassembling the data so that it gets placed back into an order that MarcEdit can compile into MARC.  I’m certain that there is still more work to be done, but the upside is that very soon, MarcEdit should be able to provide full (or at least, much closer) language support for any user working in a Right to Left language.  Will I understand the whole language issue any better?  Probably not – but I guess that’s as it should be.

–TR


May 26 2009

OLC MarcEdit Workshop

So, last week (May 20th), I had the opportunity to meet with 56ish folks in Columbus, OH to talk about MarcEdit.  I don’t give MarcEdit workshops very often, though I have always enjoyed doing it when I have.  (The main reason I don’t is time – I simply don’t have the time to organize them – so when I do provide these workshops, it’s really for much larger organizations that can handle all the logistics for me.)  This particular workshop was much more lecture style (so I spent more time talking than anyone really wants to hear), but I thought that it went really well.  There were some good comments and some enhancement requests (which I’m working on) – and I had a good time in Columbus – even if I did forget to pack clothes (and I’m talking pretty much nothing – had to do some shopping the first night I was there).

Anyway, this is the first of two public workshops that I’ll be teaching over the next month.  The second will be in Washington, DC, though I know that this session is already full.  The session in DC will be more hands on (less lecture) and I’m really interested to see what kind of feedback about the program comes out of it.

Lastly, slides.  You can get the slides from this workshop here:
http://oregonstate.edu/~reeset/marcedit/tutorials/olc.zip

–TR


Apr 16 2009

MarcEdit version number change coming

Just an FYI – I’m a big believer in incremental builds – hence the reason why MarcEdit’s build number changes so slowly.  For example, my current development build number is: 5.1.3393.37654 (actually built today).  I personally am not fond of software that makes large leaps between build numbers, but I also realize that my incremental build numbers can sometimes confuse folks (I mean, how many people would expect over 3,000 minor builds, likely 150-200 actual releases?).  So, I’m going to take the opportunity to bump the build number to 5.2 sometime in May.  The change will correspond to the following:

  1. Remove the cap on the Z39.50 query (at present, you can search 3 databases simultaneously – that will be removed so you can query to your heart’s content)
  2. Maybe the addition of a form-based record editor (for editing individual records – since I occasionally hear from smaller libraries that use MarcEdit for cataloging).  I say maybe because I’m still not convinced that this is something that should be in MarcEdit.  If you have strong feelings, let me know.
  3. New Edit function in the Editor: Copy Field – for those times when you need to just copy or change a field from one tag to another
  4. Updated Tutorials page (will include Youtube videos – as well as a few more videos being posted)
  5. A first crack at a MARC=>Excel export.  My guess is that this first crack will suck horribly, but it should at least be a starting place from which we can move forward.
  6. Finally, RobertCompare will make it back into MarcEdit (by popular request).  RobertCompare is a tool that uses heuristics to compare two records and identify the differences between them.  The most challenging part is obviously dealing with Subjects/Main entry – but again, it’s a good starting point.  It will be enhanced from the previous version – though I’m still working out some of the code bottlenecks.
  7. For users that leave the RSS option on, I’ve enhanced the notifications so that it is easier for users to tell if an update is available.

In regards to the build changes – for users, the change will be fairly unnoticeable (save for some icons, program title changes and possibly some color changes in the UI).  The program will install over the previous 5.x incarnations – so updates will be seamless (i.e., it will work like any other install of MarcEdit).  So no worries there.

Finally, I’m going to be teaching two 6ish hour long workshops in the coming months.  One in Columbus, Ohio in May and one in the DC area in early June.  If you want information on either, I can send it out.

–TR


Mar 22 2009

MarcEdit 5.1 update

Couple of quick changes. 

  1. Delete Fields – you can delete multiple fields using “x” syntax, where the ‘x’ is no longer case sensitive.  For example:
    a) 900 – deletes the 900 field
    b) 90x – deletes fields 900, 901, 902, 903, etc.
    c) 9xX – deletes fields 900, 901, 902, 903…910, 911…920, 921, etc.
    Originally, x’s had to be lower case.  I’ve modified the code so that this is no longer case sensitive.
  2. Edit Subfield Function:  When working with Control fields (00x fields) – MarcEdit used to require data in the Position element.  Now, if the Position element is empty, the program will do a usual find/replace – including prepending and concatenating data.
  3. MarcEdit Preview Function (Bug Fix):  If a user loaded a MARC file directly into the MarcEditor – the program would lose the temp file – so if you clicked on the link to load the entire file into the MarcEditor, it would get lost.  This only occurred when users opened a .mrc (MARC) file directly into the MarcEditor.  In all other contexts, the preview link works as documented.
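
For anyone curious how the “x” syntax in item 1 might work, here is a rough Python sketch.  It is not MarcEdit’s code.  It assumes that x (or X) stands for any single digit and that records are in the mnemonic format, with each field on a line starting with “=” followed by the three-character tag.  The function names are hypothetical.

```python
import re
from typing import List


def tag_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Turn a tag pattern such as '90x' or '9xX' into a compiled regex.

    Assumption: x or X stands for any single digit.
    """
    if len(pattern) != 3:
        raise ValueError("a MARC tag pattern must be exactly 3 characters")
    parts = [r"\d" if ch in "xX" else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts))


def delete_fields(lines: List[str], pattern: str) -> List[str]:
    """Return the mnemonic lines whose tag does NOT match the pattern."""
    tag_re = tag_pattern_to_regex(pattern)
    kept = []
    for line in lines:
        # In the mnemonic format the tag follows the leading '='.
        tag = line[1:4] if line.startswith("=") else ""
        if not tag_re.fullmatch(tag):
            kept.append(line)
    return kept


record = [
    "=245  10$aA title",
    "=900  \\\\$aLocal note",
    "=912  \\\\$aAnother local note",
]
print(delete_fields(record, "90x"))  # drops only the 900 field
print(delete_fields(record, "9xX"))  # drops both 9xx fields
```

Treating both cases of x the same way in one place is what makes “9xX” and “9xx” behave identically.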

If any of these affect your workflow, you can download the update here: MarcEdit_Setup.msi

–TR


Mar 16 2009

MarcEdit Update

I just uploaded a new version of MarcEdit 5.1.  Most of the updates are related to changes in the MarcEditor.  So here’s the list of changes:

  1. MarcEdit Find (specifically when the regular expression option is selected) – Previously, when searches were done, items were located but the window didn’t scroll to the located item.  That’s been corrected.
  2. MarcEdit Replace All (Regular Expressions): One of the changes made during the last update of MarcEdit was to change the MarcEditor’s Replace All function (when using regular expressions) from a single line evaluation to evaluating whole records.  This allows for the ability to perform replacement actions by evaluating multiple fields – but I had neglected to consider how this might break current workflows that relied on the previous functionality.  So, I’ve returned the functionality to evaluating single lines and added a switch for users who want to process data across multiple lines.  So for example:
    (=6.*)([^.]$) – this would evaluate field by field (line by line)
    (=6.*)([^.]$)/m – the /m tells MarcEdit to evaluate multiple lines – so this would run this expression against the entire record.
  3. MarcEdit Replace All (Regular Expressions):  The expression evaluator was too greedy – causing matches to blank records.  This should no longer happen.
  4. MarcEditor – when opening a blank .mrk file using the Open button – the window would hang.  That’s been corrected.
  5. MarcEditor – when opening a .mrk file by double clicking on it, then opening a new window – closing either window would close both windows.  This has been corrected.
  6. XSLT file updates – I’ve added Creative Commons Zero license headers to the FGDC stylesheets distributed in MarcEdit.
  7. Help File has been updated to cover some of the noted changes.
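
The difference between the two Replace All modes in item 2 can be sketched in Python like this.  It is only an approximation of the behaviour described above, not MarcEdit’s implementation.  The function name is hypothetical, replacement strings use Python’s \1 syntax, and I am assuming that in whole-record mode ^ and $ still anchor at line boundaries (hence re.MULTILINE).

```python
import re


def replace_all(record: str, expression: str, replacement: str) -> str:
    """Apply a regex replacement line by line, or to the whole record
    when the expression ends with the /m switch."""
    if expression.endswith("/m"):
        # Whole-record mode: strip the switch and run over all lines at once.
        pattern = re.compile(expression[:-2], re.MULTILINE)
        return pattern.sub(replacement, record)
    # Default mode: each field (line) is evaluated on its own.
    pattern = re.compile(expression)
    return "\n".join(pattern.sub(replacement, line) for line in record.split("\n"))


record = "=245  10$aA title\n=650  \\0$aCats\n=650  \\0$aDogs."

# Line by line: add a period to any 6xx field that lacks one.
print(replace_all(record, r"(=6.*)([^.]$)", r"\1\2."))

# An expression that spans two fields only matches with the /m switch.
print(replace_all(record, r"(=245.*)\n(=6.*)/m", r"\2\n\1"))
```

Keeping line-by-line as the default means older expressions behave as they did before, while anything that needs to look across fields has to opt in.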

Couple of notes – I’m currently writing up some new notes on using MarcEdit on Linux.  Mono 2.0+ essentially has added all the functionality necessary to run MarcEdit on Linux.  I’ll be creating a handful of Youtube videos for folks interested in giving this a try.  As for running on a Mac – well, I’ll look at that next. 

You can download the new version of MarcEdit from here: MarcEdit_Setup.msi

–TR


Mar 3 2009

EAD => MARC tutorial

Just an FYI, since some people ask how we go about generating MARC records from our EAD records using MarcEdit…I’ve posted a short video tutorial.  What I didn’t include was the EAD translation (it’s somewhat specific to OSU), but I’m happy to add a link to it if anyone is interested.

Anyway, you can find the video here:

http://www.youtube.com/watch?v=9LP4TTHx7e0

And just for reference, I think going forward, as I create tutorials or clarify information in the documentation – I’ll likely upload a video to youtube and will use MarcEdit as a common tag.

–TR


Mar 2 2009

MarcEdit 5.1 Update

Getting this last update out has taken a little more time than I would have liked, but I really wanted to think through some of the issues that this update raised so that the update process would be seamless.  Realistically, were I following any formal versioning process, this would be, at the very least, a new point release.  However, I’ve already planned out my 5.5 release, and this request, while major, fell into a gray area – so I decided to keep it in the 5.1 branch.  Anyway, March 2nd is a good day to officially make this release.  March 2nd is my 32nd birthday, and this version of MarcEdit will be my present to the MarcEdit user community.  Cheers.

Major Changes:

  1. In the continued work towards helping enterprise users, I’ve finally finished the installer update process that began with the last update.  In the previous update, MarcEdit’s installer was modified (as was MarcEdit itself) to make it more user aware.  What do I mean here?  The program was changed so that users running MarcEdit no longer need to be administrators; instead, all configuration and mutable files were moved into the user’s Application Data directory.  This had a number of unintended benefits, like supporting multiple MarcEdit users on the same machine (using custom user profiles) and making it easier to copy configuration settings from one computer to another.

    This update takes this one step further.  As libraries continue to move to more sophisticated application management, I’ve been running into more users that have their software managed through a central IT source.  The IT groups manage the software by automating distributed installation.  In the past, MarcEdit’s installer really didn’t do this well.  About 2 months ago, I was contacted by a large IT group on the west coast wanting to know if this would be possible.  This required finishing the migration from MarcEdit’s custom installer to the Microsoft Windows Installer – while at the same time making sure that the program cleans up the previous install while still keeping the user’s previous settings throughout the upgrade.  After a lot of testing (over the last month, spanning multiple institutions and users in different development environments) – I feel like this is ready to go.

    When you install the new msi installer, here is what will happen:
    1.  MarcEdit will evaluate your current installation – if you have never installed MarcEdit, it simply installs the application.
    2.  If a previous version of MarcEdit is present, the program will copy, in order: config. data in the User Application directory, data in the MarcEdit Program directory – and then silently uninstall the previous version of MarcEdit.  Once the previous version has been removed, the installer will then install the new version of the application.
    3.  Part of the clean up process of the new install is to move the copied user data back into scope of the application.

    So what does this mean to you?  Well, if you are an individual user, who manages MarcEdit on your own machine, very little.  The one benefit that you will likely see from this migration is the eventual development of an automated updater.  The msi installer provides a number of very powerful and integrated functions that I should be able to leverage to potentially create an unmediated upgrade process for users.  For enterprise users however, this change will for the first time give your IT administrators the ability to install MarcEdit on multiple machines simply by using their enterprise software management system.  For colleges and universities that manage thousands of users, this should be a really big win.

  2. Z39.50 Changes:  This will be an incremental process, but for the first time, MarcEdit will allow users to query multiple databases during a Z39.50 lookup, so a single search can return data from several Z39.50 targets.  This initial implementation allows multiple searching to be done in the Single Search mode.  In a planned future update, this will be extended to batch searching, with rules for how to disambiguate duplicate records (for example, the ability to accept records from one target over another, etc.).  So how does this work?  Essentially, when you enter the single search, you select Select database and then select multiple items.  Up front, this is limited to 3 databases, but that limit will eventually be removed (especially as I get UI feedback).  When you select multiple databases, the Single search screen changes to look like the below…

    Do you see the data in the red box?  This is how you can see what resources MarcEdit will be querying.  Also, see the green box.  You can see here that MarcEdit’s Z39.50 results list has changed slightly to let users see what institution each record is from.

  3. Help File is now local again:  Sadly, some topics are already slightly out of date (the Z39.50 info, for example, doesn’t cover the multiple querying functionality) – but this makes the help available both online and offline.  The online help will always be more up to date, but the local copy will be updated with each build.
  4. Youtube tutorials.  If you go to Youtube and look for marcedit, you will find a series of tutorials related to MarcEdit topics.  At present, you will find topics for:
    1.  Breaking your file
        http://www.youtube.com/watch?v=MOLJlQQ5RL8
    2.  Making your file
        http://www.youtube.com/watch?v=8UU7gjv_eo8
    3.  Editing a MARC file
        http://www.youtube.com/watch?v=kNMExHdki9k
    4.  Converting a file’s character set
        http://www.youtube.com/watch?v=gHxMALV-C4k
    5.  Adding a new XML Function
        http://www.youtube.com/watch?v=3x5Ke81AoEU
    6.  Updating a current XML Function
    7.  Using the Delimited Text Wizard
        http://www.youtube.com/watch?v=Kp_N3ncjS7Q
    8.  Extracting a subset of records from a larger set
        http://www.youtube.com/watch?v=2F8DgEO4zSg
    9.  Using the Z39.50 Client
        http://www.youtube.com/watch?v=y0YibTP1dIs
    10.  Harvesting OAI data into MARC
        http://www.youtube.com/watch?v=gvBrMVH6j7U
    11.  Managing Plug-ins through the plug-in editor
        http://www.youtube.com/watch?v=ZTxgL1BAmew
  5. OCLC Connexion Client Plugin has been updated (this was needed due to updates in a few other components).  If you use the Connexion Client plugin, you will need to update this plugin once you update.
  6. MarcEditor update to better support UTF-8 data loading.  In layman’s terms, here’s what has changed.  In previous versions of MarcEdit, loading UTF8 data into the MarcEditor would sometimes be slow.  The reason had to do with the way the specific Windows component I was using handled text.  I’ve replaced that component, so data loaded into the MarcEditor now goes through a new editing component that natively handles UTF8 data.  The lag time that users previously experienced should be gone.  In addition to this fix, the MarcEditor’s memory footprint has been reduced.  Not drastically, but a bit.  One thing to remember when loading data into the MarcEditor:  there is roughly a 4-to-1 memory usage when loading bytes into a visual interface in Windows.  So, for example, load a 20 MB file, and Windows will allocate ~80 MB of memory to view the file.  Open a 120 MB file, and Windows will need to allocate ~480 MB to render the file.  The new Editor is able to reclaim some of this memory on the high end, but this is, in part, how visual interfaces work.

    Also, note the 1/2 GB limit of data loading into the Editor, but that data of any size can continue to be edited in the editor if one makes use of the Preview mode.

  7. Yaz Update – Previous versions of MarcEdit used Yaz 1.+ because it was small and fast.  I had need for some of the enhanced functionality, so I’ve updated the version of Yaz used to 3.+.
  8. ‡biblios.net Proof of concept Plug-in:  As noted in a recent post, while attending code4lib this week, the folks at LibLime demonstrated their new ‡biblios.net platform.  For those that haven’t heard, the ‡biblios.net platform is an attempt to create a large, shared, Open Data repository of bibliographic metadata.  What I find most interesting about LibLime’s effort has been the development of an open API to provide push/pull functionality into the database.  In theory, this allows library developers to develop tools around the ‡biblios.net platform.  The plug-in demonstrates how this works, and gives folks that want to work with the ‡biblios.net platform a way to integrate their workflow with MarcEdit.  You can see the Youtube video talking about how it works, here:
    http://www.youtube.com/watch?v=Y8tldqWhFjs

Minor Changes:

  1. UI changes to the Z39.50 (to accommodate the changes in functionality)
  2. Extended regular expression support in the Replace function so that regular expressions can be run over multiple lines.
  3. Updated workflow – when converting data from UTF-8 to MARC8 using the MarcBreaker, the 9th byte in the leader wasn’t set correctly.  This is partly because the previous workflow assumed moving in the other direction.  Since this caused problems with some loaders, it’s been corrected.
  4. Updated the Marc21XML xslt function to accommodate the following:
         a.  Up to 9 indicators (per UniMARC)
         b.  Ability for indicators to be mixed.  The current version assumes indicator order, the update allows indicators to appear in whatever order.
  5. Other minor changes
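
For reference on item 3 above: in MARC 21, leader position 09 (counting from zero) is the character coding scheme, where “a” means UCS/Unicode and a blank means MARC-8.  Here is a small Python sketch of flipping that byte.  The function name and the sample leader are made up for illustration, and this is not the code MarcEdit uses.

```python
LEADER_LENGTH = 24
CODING_SCHEME_POS = 9  # Leader/09, counting from zero


def set_coding_scheme(leader: str, utf8: bool) -> str:
    """Return the leader with position 09 set to 'a' (UTF-8) or blank (MARC-8)."""
    if len(leader) != LEADER_LENGTH:
        raise ValueError("a MARC leader is always 24 characters long")
    flag = "a" if utf8 else " "
    return leader[:CODING_SCHEME_POS] + flag + leader[CODING_SCHEME_POS + 1:]


# Hypothetical leader from a record that was encoded in UTF-8.
utf8_leader = "00714cam a2200205 a 4500"

# After converting the data to MARC-8, position 09 has to become a blank.
marc8_leader = set_coding_scheme(utf8_leader, utf8=False)
print(repr(marc8_leader))
```

A loader that trusts this byte will misread the record if the data and the flag disagree, which is why the flag has to change in whichever direction the conversion runs.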

You can pick up the update at: MarcEdit_Setup.msi.

If you run into any problems, please give me a holler.

–TR


Feb 25 2009

‡biblios.net MarcEdit Plug-in

So, one of the presentations at Code4Lib this year discussed one of the latest initiatives to come out of the LibLime company, ‡biblios.net.  ‡biblios.net is a repository of ~30 million MARC records released under an Open Data license. 

From my perspective, one of the things that I found most interesting about the ‡biblios.net platform is the support for developers.  ‡biblios.net provides a very simple to use push/pull data model over simple HTTP.  It’s something that I’ve been interested in taking a look at for a little while and after chatting with Joshua Ferraro after his session, decided to see how difficult it would be to actually work with this service.

So, I decided to create a plug-in.  Much in the same way MarcEdit has a Connexion batch editing plug-in, I created a ‡biblios.net helper plug-in.  I’ll make the plug-in available for download with the next version of MarcEdit through the Plugin Manager (so, likely when I get back from Code4Lib), but have posted the source code (in C#) and a short youtube video for folks wanting to see how it works.

While I’m not sure if ‡biblios.net will catch on as a service, I’m very impressed by LibLime’s effort to create a large, shared cataloging database that comes with a set of APIs that let developers integrate directly with it.  Hopefully, OCLC will follow LibLime’s lead and eventually add push functionality integration for WorldCat – allowing membership and library developers an opportunity to develop their own cataloging interfaces to the resource.  Until then…

Youtube video (I think youtube has finished processing it – if not, check back):

C# Source code: biblios_net.zip

–TR


Feb 1 2009

MarcEdit update notes

Since a few folks have asked, the update mentioned last week is coming along nicely.  At this point, I have a number of MarcEdit users that are testing new features and the install program (both individually and in an enterprise environment).  Things are looking good, so it will be out the door as soon as all reports have come in.  Some notes again on the update:

  • Extending the Z39.50/SRU client to provide multi-target searching.
    This is a fairly often requested feature, but I’ve always held off adding it as a courtesy to OCLC (a number of years ago, I’d received a request when MarcEdit was set to add a Z39.50 client to keep it simple.)  It’s been a long time and times have changed — so I’ll be lifting this restriction slowly.  To start, the program will allow searching of 3 databases at once.  As I finish rethreading the requesting engine, I’ll likely remove even that restriction. 
  • Improved large UTF-8 file support in MarcEdit:
    MarcEdit has what I call a preview mode.  It essentially opens a snippet of a large file into an editor so users can preview global changes that they might run on a record.  This is how I like to work with large file sets.  However, some people like loading the entire file into a viewable space.  Files under 30 MB are fairly harmless and easy to load into any traditional editor, but files over 30 MB are difficult to view in Windows.  The reason for this is memory.  For every byte loaded, it takes ~4 bytes of RAM to view.  So, a 30 MB file consumes 120 MB of system memory.  A 60 MB file would consume 240 MB of memory, and so on (I’m oversimplifying this, but you get the general idea).  MarcEdit has traditionally relied on a customized version of what is called the RichEdit component.  This is the same component Microsoft leverages in Wordpad.  It provides fairly good UTF8 support, as well as a good deal of formatting support.  The downside is speed.  When loading lots of data, the component tends to get bogged down, as part of the loading process is converting data not in richtext to richtext on the fly.  For smaller documents, the conversion isn’t noticed.  For larger documents, it’s a real time drain.  Likewise, character sets like Hebrew, Arabic and Asian sets cause it all kinds of speed and resource problems.  So after sending emails back and forth to Microsoft’s support, I finally came up with a solution – write my own text rendering built upon the general textbox components in Windows.  It took a while, but I’ve got it working, and early reports from testers are that it’s faster and consumes less memory.  Of course, since I spun my own, I also set a hard limit of a half GB on the total file size that can be loaded into the editor.  In reality, you’d run out of system memory before you were ever able to load a 1/2 GB, so this seemed more than reasonable.
    Of course, if you have files larger than a 1/2 GB, you can load them just fine into MarcEdit for editing, so long as you utilize the preview mode.
  • Installation:
    You wouldn’t think that this would be a big deal.  I mean, installing a program is one of the basic parts of program design.  However, when you are talking about a large number of users or an enterprise system, it gets very complicated.  MarcEdit previously utilized an installer that relied on the Nullsoft installer with a number of plug-ins and shims that I’d written myself.  It was great for what it did, but poor for large enterprise level installations.  Up until recently, I really didn’t care because most people that use MarcEdit simply download and install it.  However, times are changing as more libraries move from in-house IT management to centralized IT management.  The downside of this is that desktop management is all done through a central switch – with users losing the ability to install and manage their own software.  In the past month, I’ve had two system administrators (one from a very large university system) contact me inquiring about changing the install model.  They want to be able to manage the program centrally.  Makes sense, but when you have a user community measured in the tens of thousands, making these types of significant changes takes a lot of planning and work because you need to bring everyone along at the same time.  There are a lot of gory details involved in making this work – but after many long nights, I finally was able to slay the beast, which is the windows installer, and come up with a package that would be usable.  Yay me!  Right now, I’m having users from a wide variety of institutions, that have done various levels of customization, testing the installer within their environments.  So far, the only comments received have been cosmetic – which makes me think that this is close.
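
The memory rule of thumb from the UTF-8 note above is easy to turn into a quick calculation.  This Python sketch just restates the ~4-to-1 estimate and the half GB editor limit.  The names are made up, and treating a half GB as 512 MB is my assumption.

```python
DISPLAY_RATIO = 4      # ~4 bytes of RAM per byte of file, per the estimate above
EDITOR_LIMIT_MB = 512  # assumption: "half GB" taken as 512 MB


def estimated_view_memory_mb(file_size_mb: float) -> float:
    """Rough RAM needed to display a file of the given size in the editor."""
    return file_size_mb * DISPLAY_RATIO


def should_use_preview_mode(file_size_mb: float) -> bool:
    """True when the file is too large to load whole into the editor."""
    return file_size_mb > EDITOR_LIMIT_MB


for size_mb in (30, 60, 600):
    print(size_mb, "MB file:",
          estimated_view_memory_mb(size_mb), "MB to view,",
          "preview mode" if should_use_preview_mode(size_mb) else "full load")
```

The numbers are oversimplified in the same way the estimate itself is, but they make it clear why the preview mode matters for large files.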

One additional note.  I’ll be participating with a number of folks in the coming year on examining RDA and how it will require tools to change to support the new models.  As I work on this and make changes, I’ll try to relay any pertinent information back to the user community.

–TR


Jan 18 2009

Quick MarcEdit note

Just a couple of quick MarcEdit notes.

  1. As part of my campaign to make MarcEdit work better within group managed systems, I’ve been working on migrating MarcEdit’s installer from a general .exe setup to one using Microsoft’s MSI installer.  This will allow system administrators the ability to use Microsoft’s package management tools to distribute MarcEdit and updates on those systems that don’t allow individual computer users to install software.
  2. Z39.50 Client:  Changes have been made — MarcEdit’s general search mode will now support the ability to query 3 databases simultaneously.  This is a first version of this functionality — so expect it to be revised as we move forward.
  3. Export Tab Delimited Data — This function has been modified to more of a Wizard format.  This should make the program a little easier to use.
  4. MarcEdit OAI Harvester – I’ve made some changes to the Harvester to make it a little more fault tolerant when dealing with OAI servers that don’t function exactly as documented.
  5. Tutorials — I’m in the process of creating 25 video tutorials that will be loaded into You Tube.  These tutorials will be added to the tutorial area.
  6. OCLC Connexion Plug-in — This has been updated and will be posted.
  7. Fixes to the Export Selected MARC Records tool
  8. Having the help just on a Wiki doesn’t seem to be working well — so starting with the next install, I’ll be including a compiled version of the help with the program.  Hopefully, this will give people better access to the documentation.
  9. Will fix a problem from the previous install — under certain conditions, new users were not getting the xslt directory installed in the correct location.  The fix was easy for affected users (copy the xslt directory under program directory/shadow to program directory/xslt) — but this will correct the problem.

All of these changes have been completed.  At this point, I’m doing final Q/A for all changes.  With luck, this will be posted early next week.

–TR