MarcEdit – dealing with data in mixed character sets Posted: 2013-02-12 14:11:54
One of the hard and fast rules that MarcEdit has consistently enforced is that you are not allowed to mix character set streams. What this means – if your data is in UTF8, MarcEdit will not process mnemonic data. There are some good reasons for this, the best being that mnemonics in MarcEdit map back to MARC8 representations of a character, which are completely incompatible with the UTF8 character set.

I’ve tried a few times to look at different ways to deal with this, but in most cases I’ve been thwarted by the way C# handles streams. In C#, text streams are treated as UTF8 by default unless the data is specifically typed otherwise. In order to support MARC8 formatted data, MarcEdit reads all data as either UTF8 or as binary. This allows MarcEdit to move easily between MARC8 and UTF8. The problem occurs when someone wants to use mnemonics in a string that is already UTF8 encoded. For example:

=246 13$aal-Mujāhid $bRees{aacute}, T{eacute}rry

The above is problematic to process. Currently, MarcEdit ignores the mnemonics and treats them simply as strings. Because these mnemonics convert directly to MARC8 bytes, either the UTF8 character or the mnemonics would be flattened when processed against the stream. If the stream is defaulted to UTF8, the {aacute} and {eacute} encoded data will be flattened and the record generated by MarcEdit will have incorrect lengths. If the stream is converted to binary and then reconstituted as a UTF8 stream, any UTF8 data present in the stream is flattened, but the mnemonic data is processed correctly. A bit of a pickle.

To make this work, I ended up having to atomize the data that is to be processed, meaning that only the data in the mnemonic is processed and then inserted back into a UTF8 data stream. So, it would look something like this:

if (RecognizeUTF8(System.Text.Encoding.GetEncoding(1252).GetBytes(str_Source)) == RET_VAL_UTF_8)
{
    // Create the MARC8-to-UTF8 converter on first use
    if (objChar == null)
    {
        objChar = new marc82utf8.MARCDictionary();
        objChar.UTFNormalize = UTFNormalization;
    }
    // Look up the MARC8 value for the mnemonic held in tmp_string
    string tmp_diacritic = (string)lc_mnemonics_patch[tmp_string];
    // Convert only this atomized value from MARC8 to UTF8
    tmp_diacritic = objChar.MARC8UTF8(tmp_diacritic);
    // Get the UTF8 bytes, then re-read them in the internal (code page 1252) staging encoding
    byte[] bytes = System.Text.Encoding.UTF8.GetBytes(tmp_diacritic);
    tmp_diacritic = System.Text.Encoding.GetEncoding(1252).GetString(bytes);
    // Insert the converted value back into the source string
    str_Source = str_Source.Replace(tmp_string, tmp_diacritic);
}
In this case, the atomized data is in tmp_diacritic and must be processed as MARC8 data to UTF8 utilizing the MarcEdit UTF8 Normalization library. At this point, the stream is switched to UTF8. This data must now be converted to bytes, then transcoded to the internal base encoding for staging all character data, so it can then be passed back into the library for proper character handling.
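For readers who want to experiment with the staging idea outside of MarcEdit, here is a minimal sketch in Python (not the C# that MarcEdit uses). It rests on several assumptions: the two-entry mnemonic table is hypothetical and maps straight to Unicode characters, where MarcEdit goes through MARC8 first. The function names are made up for illustration. Latin-1 stands in for the single-byte staging encoding because Python's cp1252 codec refuses a handful of byte values that appear in UTF8 data, including the 0x81 byte in the UTF8 encoding of ā.

```python
import re

# Hypothetical mnemonic table; MarcEdit's real table maps mnemonics to MARC8 bytes.
MNEMONICS = {"{aacute}": "\u00e1", "{eacute}": "\u00e9"}
MNEMONIC_RE = re.compile(r"\{[a-z]+\}")

def stage(text):
    """Hold the UTF-8 bytes of text in a str, one code point per byte."""
    return text.encode("utf-8").decode("latin-1")

def unstage(staged):
    """Recover the real text from a staged string."""
    return staged.encode("latin-1").decode("utf-8")

def expand_mnemonics(staged):
    """Replace each known mnemonic in a staged string; leave everything else alone."""
    def repl(match):
        char = MNEMONICS.get(match.group(0))
        # Unknown mnemonics stay as literal text.
        return stage(char) if char is not None else match.group(0)
    return MNEMONIC_RE.sub(repl, staged)

field = "=246 13$aal-Muj\u0101hid $bRees{aacute}, T{eacute}rry"
staged = stage(field)                       # UTF-8 data already in the string survives
result = unstage(expand_mnemonics(staged))  # only the mnemonics are converted
print(result)
```

Because the staged string has exactly one character per byte, its length matches the byte length of the data, which is the property that keeps record lengths from going wrong when the two kinds of data are mixed.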
The upshot of this – MarcEdit will soon allow this type of mixed character editing. The downside is that we still can’t get away from this type of MARC8 legacy crap.
--tr
ALA Midwinter Presentation Posted: 2013-01-28 01:04:13
I want to thank everyone who took the time to attend my Midwinter presentation. I really appreciated the feedback and the ideas. It’s days like today, when I get to share some ideas and talk to some interesting people, that make me glad I work in higher education. Anyway – my presentation, entitled “Dragging old data forward: finding yourself an RDA Helper”, can be found here:
--TR
MarcEdit 5.9 Update Posted: 2013-01-26 01:16:57
I have been working hard over the past week to close a couple of outstanding issues with the application. The biggest of those issues is related to the Find/Find All function. While making some changes a few weeks ago, it appears that I introduced a bug. I’ve been working off-line with a few folks who have been helping me debug the issue, and it appears that I’ve been able to isolate and correct it. The following changes have been made:
- Bug Fix: Find/Find All – Regular Expressions were resulting in “Text Not Found” and Boundary errors. This has been corrected.
- Bug Fix: Find/Find All – The Find process had become incredibly slow due to some of the enhancements made to help the program jump directly to the found text when using Find All. The culprit was an inefficient loop, which has been corrected.
- Bug Fix: Find/Find All – When searching, and then moving the cursor and searching again, the program wouldn’t reset where the search would begin. This has been corrected.
- Enhancement: Find/Find All – I’ve added a directional component. You can now search up and down the record while using Find.
- Enhancement: RDA Helper – I’ve added support for the automatic generation of multiple 380 fields when the data to create those elements is available.
- Enhancement: Console Program – I’ve added an –xml switch to allow for the processing of data from MARCXML to other XML schemas.
You can pick up the new build from:
--tr
MarcEdit 5.9 Update Posted: 2013-01-12 01:06:22
I know that there have been a lot of updates lately. Hopefully, folks haven’t minded. These updates have largely been due to folks really working out the RDA Helper, which has been nice. This update is specific to the RDA Helper.
- Bug Fix: RDA Helper – When attempting to create the 380/1, the program looks for the presence of a 130. Under rare conditions, the 130 format can cause parsing problems. This fixes that issue.
- Bug Fix: RDA Helper – Incorrectly processes data when the copyright mark is set as {copy}. This has been fixed.
- Bug Fix: RDA Helper – after the last fix to correct an error message generated when processing records where the largest field is no longer than 300, the line separating records could be dropped. This has been corrected.
You can pick up the download from:
--tr
MarcEdit 5.9 Update Posted: 2013-01-11 04:06:59
I’ve posted a new update that includes the following changes:
1) Bug Fix: RDA Helper – Corrects the exception that occurs when trying to insert 33x, 38x fields when the highest field in the record is lower than 329.
2) Enhancement: RDA Helper – I added the ability to embed regular expressions into the Abbreviation expansion, and will include 2 as examples with the next update. This is necessary to deal with items like “v.”, which could expand to volume or volumes depending on the data around it. So, I’ve included a regular expression that will evaluate the data prior to “v.” to try and determine what the expansion should look like. This should give some added flexibility for anyone wanting to augment the substitution list.
3) Behavior Change: RDA Helper – Because I’m modifying the substitutions list due to the item above, I need to provide a mechanism to update the codes. Essentially, I’ve added a routine that will automatically keep the substitution list synchronized with the master list. However, if you add your own items to the list, they will get overwritten, so I’ve added a notification and save the previous list as a backup so you can move any custom substitutions to the new list. I don’t anticipate changing the master list often, so this might be good enough. But I’ll look at other ways of making this process less intrusive in the future.
4) Enhancement: RDA Helper – When generating the 336, I’ve expanded the data elements consulted to help improve the selection of more granular values in the 336$a, specifically for text items.
5) Enhancement: RDA Helper – I spent a good deal of time last night optimizing how regular expressions are run in the RDA Helper and have seen a good deal of performance gained.
6) Bug Fix: Delimited Text Translator – When using autogenerate, indicators would occasionally be dropped. This has been corrected.
7) Enhancement: RDA Helper – MarcEdit will protect data within quotes from substitution. This makes the assumption that quoted data should be data that has been transcribed from the object.
8) Bug Fix: RDA Helper – When generating 264s from 260s, the first indicator wasn’t being retained correctly. This has been corrected.
9) Bug Fix: RDA Helper – When generating a 264, fields that ended with a hyphen were occasionally having periods added to the end. This has been corrected.
10) Bug Fix: Find/Find All – I made a change to the jump list that allowed the program to select the text searched for. Unfortunately, if you are looking for a regular expression, the match can expand beyond the selection. This has been corrected.
You can download the program from:
--TR
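To make items 2 and 7 a little more concrete, here is a rough sketch in Python. This is not MarcEdit's code or its actual regular expressions. The rule used here (a count of 1 before “v.” expands to “volume”, any other count to “volumes”) and the function names are assumptions made purely for illustration. Quoted data is split out first so that it is never touched by the substitution.

```python
import re

# Hypothetical rule: a number before "v." decides singular vs. plural.
V_ABBREV = re.compile(r"(?P<count>\d+)\s+v\.")
# Double-quoted runs are treated as transcribed data and left alone.
QUOTED = re.compile(r'("[^"]*")')

def expand_v(match):
    """Build the expansion from the data that precedes 'v.'."""
    count = match.group("count")
    word = "volume" if int(count) == 1 else "volumes"
    return f"{count} {word}"

def expand_abbreviations(text):
    """Expand 'v.' outside of quoted data only."""
    parts = QUOTED.split(text)
    # With one capture group, split() places the quoted runs at odd indexes,
    # so only the even-indexed (unquoted) pieces are substituted.
    for i in range(0, len(parts), 2):
        parts[i] = V_ABBREV.sub(expand_v, parts[i])
    return "".join(parts)

print(expand_abbreviations('$a3 v. ;$c27 cm'))
print(expand_abbreviations('$a"3 v." and 1 v.'))
```

A real substitution list would need many more rules than this, but the shape is the same: the expression looks at the data in front of the abbreviation before deciding what to emit.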
MarcEdit 5.9 Update Posted: 2013-01-09 02:15:37
Light update – really just to correct a couple of data problems. One note: to enhance the Abbreviation substitution, I had to make the regular expressions a bit more complex. I’ll be keeping an eye on this part of the function to ensure that this doesn’t cause a performance bottleneck.
- Bug Fix: RDA Helper – When doing substitution, sometimes it would match greedily. This has been corrected.
- Bug Fix: RDA Helper – 338 – under some circumstances, the $a was generated as a $u. This has been corrected.
- Bug Fix: Delimited Text Translator – When auto generating, defined control data (like the 006, 007) caused a validation issue. This has been corrected.
- Enhancement: Merge Records – Enhanced the function so that it can now handle records with multiple control numbers in a record, and will match those correctly.
- Enhancement: MARCEngine COM – new function – MarcEngine_Version to return the assembly build number.
Here’s an example of the assembly build number function:
' Create the MARCEngine COM object
Dim obj_MB
Set obj_MB=CreateObject("MARCEngine5.MARC21")
' Display the assembly build number
msgbox obj_MB.MarcEngine_Version
Download from:
--tr
MarcEdit 5.9 Update Posted: 2013-01-07 22:38:11
This is a light update that includes three enhancements and a couple of bug fixes:
- Enhancement: Task Manager – I’ve included the ability to add a task to a task. This allows you to chain tasks together, essentially allowing for the creation of a library of tasks. You can read more information about this function here: http://people.oregonstate.edu/~reeset/blog/archives/1144
- Enhancement: Task Manager – Import/Export Tasks. While MarcEdit has a section specifically for sharing your configuration information en masse, I added this enhancement to make it easier to share specific tasks (since I think that this is likely the kind of data that would be shared most often). Hopefully, this makes it easier for folks within their institutions and on the MarcEdit Listserv to share tasks. Note that imported tasks that reference other task lists will break, because the file names change on import to prevent overwriting existing tasks on the system.
- Enhancement: Merge Records tool: You can use the Merge records tool to merge duplicate data into a single record, but only using a control number. I haven’t implemented this option when using the MARC21 matching option. That is a much more complicated process.
- Bug Fix: RDA Helper – the 336 was creating a $c instead of $2 when it fell through to the default output when specific data was missing.
- Bug Fix: Delimited Text Translator – when Joining an item, if the last column joined is blank, it can incorrectly join the adjacent field data. This has been corrected.
- Behavior Change: When you added the copy field function to a task, a message box was generated showing completion. This would stop the task process until it was dismissed. I’ve changed the behavior so that when run as a task, the message boxes are suppressed.
Download the new update from:
--TR
MarcEdit 5.9 – Linking Task Lists Posted: 2013-01-07 20:54:03
One of the new enhancements to make it into MarcEdit 5.9 is the ability to link task lists. Task lists are essentially “recorded” macros that link together a set of global actions to be performed on a file. You can learn more about MarcEdit’s Task Automation tool here:

One of the most recent requests has been the ability to chain task lists together. This would allow users to create libraries of tasks and then link them together. Seemed like a great idea, so I created the following:

In addition to this enhancement, I added two new options to the Task Manager which allow users to export and import selected tasks to make it easier to share them with other MarcEdit users.
--TR