#duraspace IRC Log


IRC Log for 2012-05-23

Timestamps are in GMT/BST.

[6:34] -wolfe.freenode.net- *** Looking up your hostname...
[6:34] -wolfe.freenode.net- *** Checking Ident
[6:34] -wolfe.freenode.net- *** Found your hostname
[6:34] -wolfe.freenode.net- *** No Ident response
[6:34] * DuraLogBot (~PircBot@atlas.duraspace.org) has joined #duraspace
[6:34] * Topic is '[Welcome to DuraSpace - This channel is logged - http://irclogs.duraspace.org/]'
[6:34] * Set by cwilper!ad579d86@gateway/web/freenode/ip. on Fri Oct 22 01:19:41 UTC 2010
[12:02] * mhwood (mwood@mhw.ulib.iupui.edu) has joined #duraspace
[13:10] * tdonohue (~tdonohue@c-67-177-108-221.hsd1.il.comcast.net) has joined #duraspace
[14:14] * ClaudiaJuergen (~Miranda@pc5208.ub.uni-dortmund.de) has joined #duraspace
[15:04] * hpottinger (~hpottinge@mu-162198.dhcp.missouri.edu) has joined #duraspace
[15:14] * ClaudiaJuergen (~Miranda@pc5208.ub.uni-dortmund.de) Quit (Quit: Miranda IM! Smaller, Faster, Easier. http://miranda-im.org)
[15:15] * hpottinger (~hpottinge@mu-162198.dhcp.missouri.edu) Quit (Quit: Later, taterz!)
[17:00] <tdonohue> Hi all, my weekly "DSpace Office Hours" is starting now: https://wiki.duraspace.org/display/~tdonohue/DSpace+Office+Hours Please feel free to 'ping me' if you have anything you'd like to chat about.
[17:00] <kompewter> [ DSpace Office Hours - Tim Donohue - DuraSpace Wiki ] - https://wiki.duraspace.org/display/~tdonohue/DSpace+Office+Hours
[17:02] <tdonohue> until then, I'll be keeping close tabs on this channel while I do some other work behind the scenes
[19:53] <tdonohue> Reminder that the DSpace Developers Mtg is starting at the top of the hour: https://wiki.duraspace.org/display/DSPACE/DevMtg+2012-05-23
[19:53] <kompewter> [ DevMtg 2012-05-23 - DSpace - DuraSpace Wiki ] - https://wiki.duraspace.org/display/DSPACE/DevMtg+2012-05-23
[19:57] * hpottinger (~hpottinge@mu-162198.dhcp.missouri.edu) has joined #duraspace
[20:00] <tdonohue> Hi all, welcome. DSpace Developers Meeting is starting now: https://wiki.duraspace.org/display/DSPACE/DevMtg+2012-05-23
[20:00] <kompewter> [ DevMtg 2012-05-23 - DSpace - DuraSpace Wiki ] - https://wiki.duraspace.org/display/DSPACE/DevMtg+2012-05-23
[20:00] <tdonohue> we'll kick off, as usual, with JIRA reviews: https://jira.duraspace.org/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+DS+AND+resolution+%3D+Unresolved+AND+Key%3E%3DDS-1013+ORDER+BY+key+ASC
[20:00] <kompewter> [ https://jira.duraspace.org/browse/DS-1013 ] - [#DS-1013] Track new user registrations in statistics - DuraSpace JIRA
[20:00] <kompewter> [ Issue Navigator - DuraSpace JIRA ] - https://jira.duraspace.org/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+DS+AND+resolution+%3D+Unresolved+AND+Key%3E%3DDS-1013+ORDER+BY+key+ASC
[20:00] <tdonohue> today we're starting with DS-1013
[20:00] <kompewter> [ https://jira.duraspace.org/browse/DS-1013 ] - [#DS-1013] Track new user registrations in statistics - DuraSpace JIRA
[20:01] * richardrodgers (~richardro@ has joined #duraspace
[20:02] <tdonohue> any thoughts on DS-1013 idea? Seems like it may be hard to track exactly *when* a user registered (at least currently in DSpace)
[20:02] <kompewter> [ https://jira.duraspace.org/browse/DS-1013 ] - [#DS-1013] Track new user registrations in statistics - DuraSpace JIRA
[20:04] <hpottinger> would probably be a useful statistic to track, especially for self-submit repositories
[20:04] <mhwood> User registered when he followed the confirmation email's link. We'd need an event fired there.
[20:04] <tdonohue> I do agree that it seems like something useful to track. I think mhwood is right that we need a new event to track this
[20:05] <tdonohue> anyone interested in investigating this further / potentially looking into implementing?
[20:05] * mdiggory (~mdiggory@rrcs-74-87-47-114.west.biz.rr.com) has joined #duraspace
[20:06] <tdonohue> (note: I'm not implying any timeline on this -- so, if you are interested, you can set your own timeline)
[20:07] <tdonohue> hearing no volunteers -- sounds like the best we can do for now is: Summary for DS-1013: seems to be a good idea, may need a new event to capture user registration. Needs a volunteer developer to investigate and/or implement
[20:07] <mdiggory> \me ponders the FBI coming knocking at his door asking for user registration and login records....
[20:08] <mdiggory> Means that we need more Usage Events
[20:09] <mdiggory> and Usage Event Types....
[20:09] <tdonohue> not sure it's much more of a stretch from what DSpace already captures -- we already have user info captured -- all they want is a new event that captures the *time* that registration occurred
[20:09] <tdonohue> yep, agreed mdiggory
[20:10] <mdiggory> We already need to add new UsageEvents for searches
[20:10] <tdonohue> ok. gonna move on in the interest of time (we should post these notes into comments of DS-1013 -- if someone else doesn't get to it, I'll do it after the meeting)
[20:10] <mdiggory> yep
[20:10] <tdonohue> next up, DS-1017
[20:10] <kompewter> [ https://jira.duraspace.org/browse/DS-1017 ] - [#DS-1017] mobile dspace theme - DuraSpace JIRA
[20:11] <tdonohue> looks like a placeholder. I'm +1 getting a mobile theme developed
[20:11] <tdonohue> we probably should followup with Jonathon to see if there are any updates. This could be a nice feature for 3.0 if he's got something ready by then
[20:12] * helix84 (a@ has joined #duraspace
[20:12] * mhwood shudders to think of reading _The Endochronic Properties of Resublimated Thiotimoline_ on a 4cm screen.
[20:13] <mdiggory> I wonder what view technologies there are that best support mobile theming...
[20:13] <tdonohue> mhwood -- what about on a Kindle Fire? or a 7cm screen :) In general though, I think we should be more mobile friendly
[20:16] <tdonohue> ok. looks like no other immediate comments: DS-1017 Summary : Followup with Jonathon about status. (I can do this)
[20:16] <tdonohue> Last one for today, DS-1020
[20:16] <kompewter> [ https://jira.duraspace.org/browse/DS-1020 ] - [#DS-1020] Discovery filter dialog usability - DuraSpace JIRA
[20:16] <mhwood> The UI can be more mobile friendly. The content may be intractable, but that's not reason not to let people try.
[20:16] <mdiggory> Cocoon and WebMVC let you make decisions based on client types... but I wonder if there's a general html presentation technology or ways to leverage html5 so that we don't create completely separate views for mobile devices vs common browsers
[20:17] <tdonohue> mdiggory: could be. Dunno, it's worth looking into
[20:17] <mdiggory> mhwood, I do read journal articles on my ipad...
[20:17] <tdonohue> it looks like DS-1020 is assigned to KevinVdV already. I'm assuming this is in his lap then
[20:17] <mdiggory> but "I have an app for that" ;-)
[20:18] <mhwood> An Ipad (or Fire) is technically a mobile platform, but its display capacity is more like a desktop. Wouldn't a mobile theme look weird on such devices?
[20:19] <tdonohue> mhwood -- not always. I have a Kindle Fire. By default it does use the regular Web UIs. But, some are so cruddy/busy, that it gives you the option to switch over to the mobile view if the regular website is too difficult to handle on the small screen
[20:19] <tdonohue> so, it's at least nice to have the option :)
[20:19] * ryscher (98033b45@gateway/web/freenode/ip. has joined #duraspace
[20:20] <mdiggory> If the ipad theme did away with significant noise in the rest of the webpage and just rendered navigation and content... then it'd be nice... not that DSpace sites are not already somewhat "Bare bones"
[20:20] <hpottinger> main thing about "mobile" or "mobile-friendly" is to be sure your links and buttons are big enough for a finger to tap
[20:20] <tdonohue> +1 hpottinger
[20:21] <tdonohue> (otherwise, it's extremely frustrating to do anything via a touch screen)
[20:21] <tdonohue> ok. We'll stop JIRA review there for today. I'm assuming KevinVdV will let us know if he needs help on DS-1020 at a later date (since he's not here right now)
[20:21] <mhwood> OK, I bow to the more extensive experience of others.
[20:21] <richardrodgers> I think the responsive design stuff worries about all that (in CSS etc)
[20:22] <tdonohue> agreed richardrodgers. A mobile UI can just be a different CSS on existing HTML, if your HTML is "clean enough". It need not actually be a separately generated HTML page
[20:23] <mdiggory> TBH... with all the various reference manager apps that are emergent, etc... maybe we should be focused on integration with those services more than browsing in DSpace like it was a mobile app.
[20:24] <mdiggory> http://www.thirdstreetsoftware.com/site/SenteForMac.html
[20:24] <kompewter> [ Sente Academic Reference Manager for Mac OS X ] - http://www.thirdstreetsoftware.com/site/SenteForMac.html
[20:24] <mdiggory> http://itunes.apple.com/us/app/mendeley-reference-manager/id380669300?mt=8
[20:24] <kompewter> [ App Store - Mendeley - Reference Manager (Lite) ] - http://itunes.apple.com/us/app/mendeley-reference-manager/id380669300?mt=8
[20:25] <mdiggory> http://blogs.plos.org/mfenner/2010/10/06/reference-management-with-the-ipad/
[20:25] <kompewter> [ Reference Management with the iPad | Gobbledygook ] - http://blogs.plos.org/mfenner/2010/10/06/reference-management-with-the-ipad/
[20:25] <tdonohue> mdiggory -- I'd argue you'd want both actually. A "mobile UI" need not be that scary -- as I said, it may be possible to even do via just CSS
[20:25] <tdonohue> what are these tools using for the searching/browsing?
[20:26] <tdonohue> ugh...Z39.50, really? http://www.thirdstreetsoftware.com/site/Images/Searches.png
[20:26] <mdiggory> a myriad of aggregators... OCLC, Ebsco, google, MS academic search, so on
[20:26] <mdiggory> its one of the options...
[20:26] <mdiggory> actually I use google scholar primarily with sente
[20:27] <tdonohue> well -- I guess the question is "what else can we do to integrate with stuff like sente, if we are already optimizing DSpace for Google Scholar & similar"?
[20:29] <mdiggory> I think the same stuff we do for zotero etc, embedded rdfa, microformats, COinS, etc
[20:30] <tdonohue> right, I'd agree with looking more at implementing more microformats -- start up a jira ticket! :)
[20:30] <mdiggory> Actually, I think getting the DSpace metadata into PDF metadata would go a really long way in this environment
[20:30] <tdonohue> DSpace metadata -> PDF metadata? How? Wouldn't that involve actually dynamically changing/modifying/creating PDFs?
[20:31] <richardrodgers> I think he meant RDF
[20:31] <mdiggory> no, I meant altering the PDF
[20:31] <tdonohue> oh, yea. RDF makes more sense :)
[20:31] <mdiggory> Yes.... literally altering the PDF (OMG... Noooo)
[20:32] <tdonohue> I think that leaves too many chances for corruption to be honest. If we did that, we'd need to keep the original intact
[20:32] <hpottinger> ooh, on the fly munging, cool
[20:32] <richardrodgers> Oh, I misheard mdiggory - we do enhance PDFs with metadata
[20:33] <richardrodgers> for certain content
[20:33] <mdiggory> richardrodgers: Who does?
[20:33] <richardrodgers> DSpace@MIT
[20:33] <mdiggory> inside DSpace or in external workflows?
[20:34] <tdonohue> out of curiosity, what software do you use to do that enhancement?
[20:34] <richardrodgers> just request a PDF in the Open Access collection - we programmatically insert a front page to the PDF
[20:34] <helix84> tdonohue: pdftk
[20:34] <hpottinger> richardrodgers: would you have a writeup on what you're doing? I'd love to read more about it
[20:35] <richardrodgers> Um, sure thing hpottinger - I believe we use iText
[20:36] <tdonohue> link (for others): http://itextpdf.com/
[20:36] <kompewter> [ iText ® - Free / Open Source PDF Library for Java and C# ] - http://itextpdf.com/
[20:36] <helix84> i think even exiftool can be used
[20:36] <richardrodgers> yea, I think there are a number of options
[20:37] <mdiggory> I agree there are a number of options, and the original can be preserved and augmented on either archive or download, as richard points out.
[20:37] <tdonohue> well -- it sounds like others are interested in this. Wondering out loud if this is something we should think about releasing a "plugin" to DSpace for?
[20:38] <helix84> tdonohue: so what would the use case be again?
[20:38] <tdonohue> mdiggory -- I agree. I'm +1 this idea, as long as you are still preserving the original. I only get nervous if you replace the original with something "programmatically" created
[20:39] <mdiggory> http://thinktibits.blogspot.com/2011/05/java-itext-add-pdf-metadata-tutorial.html
[20:39] <kompewter> [ Java iText Add PDF Metadata Tutorial Example | ThinkTibits! ] - http://thinktibits.blogspot.com/2011/05/java-itext-add-pdf-metadata-tutorial.html
[20:39] <mhwood> Yes, what? archived-at: foo; permalink: bar; ?
[20:40] <tdonohue> helix84: I didn't start this discussion. mdiggory did. he was pointing out that PDF metadata may "play better" with other external tools/readers (esp. some of the mobile apps)
[20:41] <tdonohue> in any case. do we wish to continue down this thread? Or is this something to create a JIRA ticket for? I just wanted to pause here & ask. The only other topic on today's agenda is 3.0 stuff (but I didn't have anything specific to bring up)
[20:42] <richardrodgers> If someone opens a JIRA ticket, we'd be glad to post info (and code) on what we are doing..
[20:42] <mdiggory> I'd say make it a JIRA ticket... and contributors can add details/comments...
[20:42] <tdonohue> Ok. mdiggory, would you be willing to create an initial ticket? (Since you started this discussion thread)
[20:43] <mdiggory> tdonohue: do you have a specific 3.0 agenda at this point?
[20:44] <helix84> i'd like to ask a question about metadata 4 all
[20:44] <tdonohue> what do you mean by "specific 3.0 agenda"? I didn't have anything specific to discuss today, if that's what you mean
[20:44] <mdiggory> yes, just today
[20:44] <tdonohue> the floor is open to topics. If you wanted to ask something, feel free helix84
[20:45] <helix84> in https://github.com/DSpace/DSpace/pull/12 mark created a separate EAV table for metadata on all other objects, while item metadata remains in the existing table, right?
[20:45] <kompewter> [ Pull Request #12: Support Metadata On All DSpaceObjects by mdiggory · DSpace/DSpace · GitHub ] - https://github.com/DSpace/DSpace/pull/12
[20:47] * ryscher (98033b45@gateway/web/freenode/ip. Quit (Ping timeout: 245 seconds)
[20:47] <hpottinger> I'm currently looking right at richardrodgers mds (modern DSpace) on another screen, and the readme mentions metadata for all objects
[20:47] <richardrodgers> helix84: I noted that also - in my experiments (mds) its all in the existing metadatavalue table
[20:47] <mdiggory> Yes, this is something we have used in some @mire projects and is used behind some of our addons that need to record metadata on bitstreams
[20:48] <helix84> mdiggory: what's the reason for a separate table?
[20:48] <mdiggory> And richardrodgers and I have chatted about that approach too, I think all options are on the table at this point, they all have various benefits
[20:49] <tdonohue> hpottinger -- yea, currently we have two implementations of Metadata 4 All (neither of which has received final committer approval, yet). There's Mark's version (in pull request #12) and Richard's version in MDS (https://github.com/richardrodgers/mds)
[20:49] <mdiggory> because its an addon for DSpace and a separate table reduces interference with the existing API and other apps.
[20:49] <mdiggory> We could have written it without overriding DSO's entirely
[20:50] <richardrodgers> whereas in my case, I'm breaking the API anyway, so non-interference isn't so important...
[20:50] <helix84> mdiggory: sorry, i don't understand what you mean by addon in this context
[20:50] <mdiggory> I do like a more common table approach.... but also, I've talked some with both tdonohue and richardrodgers about a need for being able to "group" metadata into sections.
[20:51] <mdiggory> helix84: this was originally an override of the dspace-api to support @mire addons.
[20:51] <tdonohue> personally, I'm not against going towards a more "common table" approach. As this is such a big feature, I think others won't mind if we break some APIs here or there in order to fully support it
[20:51] <mhwood> Hey, it's a .0 release -- things can be broken if they must be.
[20:51] <tdonohue> (plus, in the long term, it makes more sense to move towards a "common table" -- it seems odd to have to look in different areas for different types of metadata)
[20:52] <tdonohue> +1 mhwood -- as long as we provide a migration path for current data
[20:52] <helix84> i'm also in favor of a single table, but i also wanted to talk about another possible approach
[20:53] <mdiggory> I will note that the Item Level Versioning proposal stores metadata for all Item Versions in the metadatavalue table (given that all Versions are true Items)
[20:53] <helix84> i needed to run some one time SQL queries against metadatavalue and every time I do that these queries get very complicated
[20:53] <helix84> not impossible, just very complicated and ineffective
[20:53] <mdiggory> note that in this case Versioning is more like "Editioning"
[20:54] <helix84> i asked about it at some #sql channels and such and the database guys immediately recognized that it's EAV and insisted I shouldn't use EAV
[20:54] <mdiggory> helix84: can you give us the usecase?
[20:55] <helix84> yes, give me a minute to let me dig into my notes
[20:55] <mdiggory> do you mean... http://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model
[20:55] <kompewter> [ Entity–attribute–value model - Wikipedia, the free encyclopedia ] - http://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model
[20:56] <helix84> anyway, meanwhile you can take a look at one of the suggestions they gave me - hstore: http://www.postgresql.org/docs/9.1/static/hstore.html
[20:56] <kompewter> [ PostgreSQL: Documentation: 9.1: hstore ] - http://www.postgresql.org/docs/9.1/static/hstore.html
[20:56] <helix84> really nice, but no equivalent in oracle
[20:57] <mdiggory> is reminded of an offline dialog with tdonohue about getting metadata out of the db and into Bitstreams....
[20:57] <helix84> another option would be getting metadata into NoSQL
[20:57] <helix84> hstore is actually NoSQL
[20:58] <helix84> but then I was also thinking maybe that's why you also put metadata into SOLR
[20:58] <helix84> but I don't know how to write complex SQL-like queries in SOLR syntax
[20:59] <mdiggory> you try not to attempt that
[20:59] <mhwood> What problem are we trying to solve here?
[21:00] <mhwood> Efficiency is the query optimizer's job. Give it the indices it needs and let it work for you.
[21:00] <helix84> joins on EAV table
[21:00] <mdiggory> generally you flatten your data into solr to support using it for its search and faceting capabilities. there is some experimental new work with joining solr queries, but you won't see the same capabilities as a db
[21:00] <tdonohue> huh. hadn't heard of PostgreSQL hstore -- it reminds me of stuff in the Ruby on Rails world (as I've seen similar DB structures there, where you can stick structured data in a single DB field).
[21:00] <helix84> i'm trying to find a nice example
[21:00] <mdiggory> http://www.oracle.com/technetwork/products/nosqldb/overview/index.html
[21:00] <kompewter> [ Oracle NoSQL Database Technical Overview ] - http://www.oracle.com/technetwork/products/nosqldb/overview/index.html
[21:01] <mdiggory> probably overkill
[21:02] <tdonohue> In general though, I do see Solr as the way to index DB for quick browse/search (to avoid complex DB queries). But, I agree with mdiggory that custom Solr queries are sometimes hard -- depending on exactly what you are attempting.
[21:02] * mdiggory (~mdiggory@rrcs-74-87-47-114.west.biz.rr.com) has left #duraspace
[21:02] <helix84> ok, here's a nice one - find items with duplicate titles
[21:02] * mdiggory (~mdiggory@rrcs-74-87-47-114.west.biz.rr.com) has joined #duraspace
[21:04] <hpottinger> This thread appears to have branched from Metadata For All Objects discussion, at about 3:54
[21:05] <helix84> actually, i'm talking about the best way to store metadata
[21:05] <mhwood> I'm going to have to go. I do recall seeing discussion of something like this in _SQL Antipatterns_.
[21:05] <helix84> since there would be changes due to m4a, i think it's the right time to bring it up
[21:06] <mdiggory> thats a good point helix84
[21:06] <tdonohue> I'll admit, I'm not "connecting" all the dots here. But, it *sounds* interesting. I just need a more concrete example or two somewhere :)
[21:07] <mdiggory> my thoughts are that metadata comes in more forms than EAV
[21:08] <richardrodgers> interesting discussion, but I have to run. I'll read the IRC log
[21:08] <tdonohue> I will note though that I don't think we can just "cut off Oracle support". I do think these ideas seem very interesting to explore, but we need to avoid being PostgreSQL specific.
[21:08] <helix84> i'll paste my notes about duplicate titles
[21:08] * richardrodgers (~richardro@ Quit (Quit: richardrodgers)
[21:08] <mdiggory> I remember back a while ago Rob Tansley commenting that the database should be a cache of the preserved content
[21:08] <mdiggory> I also recall the Fedora folks having this position as well, which led to a more file based datastore
[21:09] * mhwood (mwood@mhw.ulib.iupui.edu) has left #duraspace
[21:09] <mdiggory> and applications on top for search, semantic queries, so on
[21:09] <helix84> http://pastebin.com/8U7uCegC
[21:09] <kompewter> [ [SQL] ---------------------------------------------- -- finding duplicate titles in D - Pastebin.com ] - http://pastebin.com/8U7uCegC
[21:09] <tdonohue> yea. currently, we are storing the preserved content (or metadata at least) in the DB, but our "cache" is Solr (or similar).
[21:10] <helix84> if you try to run this kind of query on real-world size data, it will take some time :)
[21:11] <tdonohue> right, I get that. Just not sure I get how hstore/NoSQL would make this sort of query *easier*
[21:12] <helix84> good point, i don't remember right now :)
[21:12] <tdonohue> and also whether hstore/NoSQL could also make other queries/metadata management easier -- I think if we can make a good case for why/how it simplifies things in the DB, then we'd have a better chance of convincing other Committers why it would be a worthwhile direction.
[21:13] <tdonohue> So, I'm glad you are thinking about this helix84 and glad you brought it up. It's just we probably need to all better understand why/how it could work with our current data model
[21:14] <mdiggory> There's something similar to this in Solr called Field Collapsing... it would group duplicates on a specific field by value
[21:14] <helix84> i think if you try to solve the problem of finding duplicate values, you'll see the current disadvantages. the rest is just suggestions for now, no recommendations.
[21:14] <helix84> mdiggory: but i need to know the item ids of those duplicates
[21:15] <tdonohue> aren't item ids in Solr too? If they aren't they could also be indexed
[21:16] <helix84> what i meant is if you do "group by", i think you can lose ids (or any other columns) of the grouped items
[21:16] <helix84> don't know what the case is in SOLR, though
[21:17] <tdonohue> yea, I actually don't know either, to be honest.
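On the SQL side of helix84's concern, a grouped query does not have to lose the item ids: an aggregate can carry them along. Below is a minimal sketch, assuming a simplified three-column EAV table. The table layout, the `field` column holding a dotted name, the sample rows, and both function names are hypothetical illustrations rather than the actual DSpace schema, and the query in the pastebin link is not reproduced here. SQLite's `group_concat` is used so the example is self-contained; other databases offer comparable aggregates.

```python
import sqlite3


def demo_connection():
    """Build an in-memory database with a simplified, hypothetical EAV table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadatavalue (item_id INTEGER, field TEXT, text_value TEXT)"
    )
    conn.executemany(
        "INSERT INTO metadatavalue VALUES (?, ?, ?)",
        [
            (1, "dc.title", "Annual Report"),
            (1, "dc.contributor.author", "Doe, John"),
            (2, "dc.title", "Thesis"),
            # An author value that happens to equal a title must not count.
            (2, "dc.contributor.author", "Annual Report"),
            (3, "dc.title", "Annual Report"),
        ],
    )
    return conn


def duplicate_titles_with_ids(conn):
    """Return (title, count, [item ids]) for every title used by more than one item."""
    rows = conn.execute(
        """
        SELECT text_value, COUNT(*) AS n, group_concat(item_id) AS ids
        FROM metadatavalue
        WHERE field = 'dc.title'
        GROUP BY text_value
        HAVING COUNT(*) > 1
        ORDER BY text_value
        """
    ).fetchall()
    # group_concat returns a comma-separated string in no guaranteed order,
    # so split it and sort the ids explicitly.
    return [
        (title, n, sorted(int(item_id) for item_id in ids.split(",")))
        for title, n, ids in rows
    ]
```

Calling `duplicate_titles_with_ids(demo_connection())` groups by title while still reporting which items share it. Whether Solr's field collapsing exposes ids in a comparable way is left open in the discussion above.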
[21:18] <tdonohue> In general though, I like the idea of doing *less SQL queries* in DSpace. I think there'd be two approaches -- one is to rely more heavily on Solr for indexing content (and only touch the DB when content needs to *change*), the other could be something like NoSQL
[21:19] <helix84> to sum up, the problem with storing metadata is that EAV tables are bad for complex SQL queries, and the reason we use EAV is that we need a way to be flexible with metadata schemas.
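To make the summary above concrete, here is a minimal sketch of why EAV queries grow quickly: every additional attribute in a search needs another self-join on the same table. The table layout, sample rows, and the names `demo_connection` and `items_matching` are hypothetical simplifications, not the actual DSpace schema or API.

```python
import sqlite3


def demo_connection():
    """Build an in-memory database with a simplified, hypothetical EAV table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadatavalue (item_id INTEGER, field TEXT, text_value TEXT)"
    )
    conn.executemany(
        "INSERT INTO metadatavalue VALUES (?, ?, ?)",
        [
            (1, "dc.title", "Annual Report"),
            (1, "dc.contributor.author", "Doe, John"),
            (2, "dc.title", "Annual Report"),
            (2, "dc.contributor.author", "Roe, Jane"),
            (3, "dc.title", "Thesis"),
            (3, "dc.contributor.author", "Doe, John"),
        ],
    )
    return conn


def items_matching(conn, criteria):
    """Return sorted ids of items that have every (field, value) pair in criteria.

    One alias of the EAV table is needed per criterion, so the SQL grows
    by one self-join for each extra attribute.
    """
    if not criteria:
        raise ValueError("at least one (field, value) pair is required")
    joins, conditions, params = [], [], []
    for i, (field, value) in enumerate(criteria):
        alias = f"m{i}"
        if i > 0:
            joins.append(
                f"JOIN metadatavalue AS {alias} ON {alias}.item_id = m0.item_id"
            )
        conditions.append(f"{alias}.field = ? AND {alias}.text_value = ?")
        params.extend([field, value])
    sql = (
        "SELECT DISTINCT m0.item_id FROM metadatavalue AS m0 "
        + " ".join(joins)
        + " WHERE "
        + " AND ".join(conditions)
        + " ORDER BY m0.item_id"
    )
    return [row[0] for row in conn.execute(sql, params)]
```

With a conventional table that has one column per attribute, the same search would be a single `WHERE title = ? AND author = ?` with no joins. That is the trade-off being described: the EAV layout buys schema flexibility at the cost of query complexity.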
[21:20] <tdonohue> yep, good summary
[21:20] <helix84> i hope you don't mind that i hijacked the conversation with just the problem and no tested solution
[21:21] <hpottinger> metadata bitstreams looking kinda pretty good right now :-)
[21:21] <tdonohue> ok, well, meeting is obviously well into "overtime" here. Several folks already had to head out. So, there's no more "official discussion", but feel free to hang around or leave as you need to.
[21:21] <helix84> hpottinger: i don't see why. queries would be only slower...
[21:21] <tdonohue> helix84 -- not at all. I think it's a good brainstorm
[21:22] <tdonohue> I think if you went with metadata bitstreams, you'd have to still index that metadata in something like Solr --> so queries would end up being via Solr
[21:22] * helix84 really needs to study up on solr queries
[21:24] <tdonohue> i.e. I think we'll always likely want two main things when it comes to metadata : (1) the preserved metadata (could be in a more static place like a bitstream, or in a dynamic place like a DB), and (2) the cached / indexed / searchable metadata (can be indexed in such a way that it becomes more highly searchable, like in Solr or similar)
[21:24] <tdonohue> So, the *preserved* location need not be the same place as the *searchable* location
[21:25] <hpottinger> tdonohue: correct, that's where we're heading anyway, is my take on this discussion, so, we have a "cache" of the metadata in solr, or something else (NoSQL?), and the metadata can be as crazy as it wants to be.
[21:25] <helix84> hpottinger: i remember now, you wanted bitstreams because of _structured_ metadata
[21:26] * hpottinger just wants happier users... :-)
[21:26] <tdonohue> yep, and users these days seem to want *structured* metadata ;)
[21:27] <helix84> we must have different users, mine only bitch about content, not DSpace
[21:27] <tdonohue> (or at least the repository managers seem to)
[21:27] <hpottinger> correct, my definition of "user" may differ from others, my job is to keep librarians happy
[21:28] <tdonohue> oh, yea. There are two levels of users : The repository managers/librarians (who often care about metadata & structured metadata), and the end users (who just want to submit content into the system & don't really care about metadata)
[21:30] <helix84> IMHO dspace metadata currently has just the right balance between structure and being able to query it. When you introduce more complex structure into metadata, your search has to _know_ about that structure.
[21:33] <tdonohue> yea, I agree with that helix84. But, at the same time, DSpace currently cannot easily support things like "author affiliations" without more structured metadata. Essentially, "dublin core" like structures can be a little too simplistic at times, which is why more structured, XML-based metadata schemas (like MODS) came about
[21:34] <hpottinger> helix84, I think I tend to agree, I like the current approach (I do like SQL, probably mostly because it's a skill already in my toolbox), but we're getting friction where the approach doesn't fit with what we're being asked to do.
[21:34] <helix84> i was thinking of just the same use case - authors + urls
[21:35] <hpottinger> alas, interesting discussion, but I've gotta go pick up kids
[21:35] <hpottinger> will catch up on the transcripts later, bye all!
[21:35] * hpottinger (~hpottinge@mu-162198.dhcp.missouri.edu) has left #duraspace
[21:35] <tdonohue> yea, I unfortunately have to leave in a bit too. But the discussion is very interesting. In any case, the repository managers are starting to request things that they see in MODS (like author affiliations or similar), and DSpace just cannot support them easily
[21:36] <helix84> so, what would a search look like with simple metadata like this: <author><name>Doe, John</name><affiliation>ABC</affiliation></author>?
[21:37] <tdonohue> whereas, if we could support MODS / XML metadata, then we'd be in a much better place. But, as stated above, it will mean that DSpace would need to know how to index this type of metadata so that it could be searchable/browseable
[21:38] <tdonohue> helix84 - I think in that case, you could index all that in Solr, and then be able to search by author or by affiliation. Solr is a very powerful search engine that can create many different facets
[21:38] <helix84> exactly, how do you express the relations in Solr (or a relational DB)?
[21:39] <helix84> so in solr you can search by author name and get author name + affiliation?
[21:39] <tdonohue> In Solr, you express them as indexes. You can define new indexes in your Solr schema & then essentially index metadata coming from an external source into those Solr indexes. But, my Solr is a bit rusty.
[21:40] <tdonohue> yea, once you set up the Solr indexes & then index content, you can then construct queries that say stuff like: "give me everything that has an author whose affiliation is 'ABC'" and Solr will return it
[21:41] <helix84> 1) it's nice it can do that 2) i think here's the problem i wanted to point out - you would need to set up indexes in a way to reflect your structure - which you don't know in advance
[21:42] <tdonohue> for #2, you actually *do* know the metadata structure as long as you limit it to some standards, e.g. MODS and similar. You are correct though that we'd never be able to say "we support all XML structured metadata".
[21:42] <helix84> so along with each structured metadata format you would also need to define the structure of solr indexes (to reflect the groups and relations)
[21:43] <tdonohue> The solr indexes are actually independent of the structure of the external metadata. There's an Indexer class (in Java) that actually would need to do the translation from an external MODS metadata (or DC metadata) into the existing Solr Indexes
[21:44] <tdonohue> the hard part would be to determine the proper Solr indexes that "generally cover" the main metadata field types that we are wanting to deal with (e.g. title, author, author affiliation, subject, etc etc)
[21:44] <helix84> you're losing me a bit here
[21:45] <helix84> but it's because i don't understand solr well enough
[21:46] <tdonohue> yea, essentially, you'd need to dig into solr more. Essentially, a Solr index is more generic thing -- some samples could be "title", "author", "author affiliation", "subject".
[21:47] <mdiggory> http://code.google.com/p/solrmarc/
[21:47] <kompewter> [ solrmarc - Index your MaRC records with apache solr. - Google Project Hosting ] - http://code.google.com/p/solrmarc/
[21:47] <tdonohue> But, when you actually perform *indexing* of metadata, you'd need to write a Java class that says: The metadata field: <author><name>Doe, John</name></author> should be stored in the Solr index "author"
[21:47] <helix84> i would understand if you expressed it as data structures
[21:49] <tdonohue> Similarly, you could say the metadata field: <author><affiliation>ABC</affiliation></author> gets indexed in the Solr index named "author affiliation"
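The mapping tdonohue describes (structured XML elements translated into flat, named index fields) can be sketched in a few lines. This is a minimal illustration only: the `<record>` wrapper element, the field names `author` and `author_affiliation`, and the names `FIELD_MAP` and `flatten_for_index` are assumptions made for the example. It does not use the Solr API or any DSpace Discovery class, and it only builds the flat document that an indexer might then send on.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from an element path in the source metadata
# to the name of a flat index field.
FIELD_MAP = {
    "author/name": "author",
    "author/affiliation": "author_affiliation",
}

SAMPLE_RECORD = """
<record>
  <author><name>Doe, John</name><affiliation>ABC</affiliation></author>
  <author><name>Roe, Jane</name><affiliation>XYZ</affiliation></author>
</record>
"""


def flatten_for_index(xml_text, field_map=FIELD_MAP):
    """Translate structured XML metadata into a flat {index field: [values]} dict."""
    root = ET.fromstring(xml_text)
    document = {}
    for path, index_field in field_map.items():
        # findall() resolves the path relative to the root element.
        for element in root.findall(path):
            text = (element.text or "").strip()
            if text:
                document.setdefault(index_field, []).append(text)
    return document
```

For `SAMPLE_RECORD` this yields two multi-valued fields, one of names and one of affiliations. Note that the flat fields on their own no longer record which affiliation belongs to which author; preserving that relation needs extra design on the indexing side, which is the question helix84 raises at [21:38].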
[21:49] <helix84> yeah, you currently set up these mappings in dspace.cfg. since in structured metadata it could be more complicated, you would do the mappings with code.
[21:49] <mdiggory> there would always be a need to map, its generally easiest to map to the Solr XML document format, that might make it clearer...
[21:50] <helix84> what is Solr XML document format? link please?
[21:50] <tdonohue> correct, helix84. In fact, Discovery already does some of that mapping. It maps DSpace DB metadata fields into Solr indexes
[21:50] <mdiggory> Solr does support mapping via xslt, but I've just used xslt outside of Solr to create Solr Documents and then post them
[21:51] <tdonohue> I'll let mdiggory take over -- he's done more Solr than I recently ;) Plus, I need to run here in a bit.
[21:51] <helix84> tdonohue: where can i see these discovery mappings?
[21:52] <mdiggory> I'm here but have to do another meeting...
[21:52] <helix84> no problem
[21:52] <tdonohue> helix84 -- they are in the discovery codebase. Not sure of the exact classes off the top of my head
[21:52] <mdiggory> https://wiki.duraspace.org/display/DSPACE/DSpace+Discovery
[21:52] <kompewter> [ DSpace Discovery - DSpace - DuraSpace Wiki ] - https://wiki.duraspace.org/display/DSPACE/DSpace+Discovery
[21:53] <tdonohue> It looks like the org.dspace.discovery.SolrServiceImpl class does the main Indexing for Discovery (see the indexContent() methods)
[21:54] <tdonohue> thats under dspace-discovery/dspace-discovery-solr/
[21:54] <tdonohue> but, admittedly, it's using the Solr Java API. So you may need to also look at that to understand how it is working
[21:55] <mdiggory> The configuration of which metadata fields are inserted into which facets / indexes is in the Spring configuration.
[21:56] <tdonohue> yea, that metadata field mapping (in Spring) looks to be in /dspace-discovery/dspace-discovery-provider/src/main/resources/spring/spring-dspace-addon-discovery-configuration-services.xml
[21:56] <mdiggory> Think of this as sort of an ORM strategy... Item / Solr / View mapping
[21:56] <tdonohue> (and mdiggory knows how this works much better than I -- I'm just digging in code right now)
[21:56] <helix84> it seems the main thing there is buildDocument(), but i'll have to look up what that does
[21:58] <tdonohue> buildDocument() is actually in that same SolrServiceImpl class & it looks like it loads the configs from the Spring config that I mentioned above
[21:58] <mdiggory> note, we use a great deal of wildcard fields in the actual solr schema... its best to dig into the solr admin UI to see what is actually there, it can be accessed inside the solr webapp
[21:58] <helix84> mdiggory: yes, i've looked at it before
[21:58] <mdiggory> correct, the spring config is used on both indexing and querying...
[21:59] <tdonohue> ok, gotta go. Been a good discussion though. Bye
[21:59] <helix84> do you mean spring-dspace-addon-discovery-configuration-services.xml? but that's just facets and such, not mapping metadata to indexes
[21:59] <helix84> bye tim
[21:59] <mdiggory> its a description of how to index the metadata fields so that the term completion, faceting and other features operate well on both string and date fields...
[22:01] * tdonohue (~tdonohue@c-67-177-108-221.hsd1.il.comcast.net) Quit (Read error: Connection reset by peer)
[22:01] <mdiggory> Note, that Discovery uses solr by default, but is designed to be able to be implemented on other search technologies, for instance, elasticsearch.
[22:04] <helix84> i honestly can't see where it says e.g. "map dc.contributor.author into the 'author' solr index". I only see the definition of such facet and filter.
[22:08] <helix84> anyway, lots of interesting stuff to study was mentioned here. i'm definitely going to read it again tomorrow when my head stops spinning :) thanks everyone
[22:51] * bradmc (~bradmc@207-172-69-79.c3-0.smr-ubr3.sbo-smr.ma.static.cable.rcn.com) has joined #duraspace
[23:27] * bradmc (~bradmc@207-172-69-79.c3-0.smr-ubr3.sbo-smr.ma.static.cable.rcn.com) Quit (Quit: bradmc)
[23:31] * bradmc (~bradmc@207-172-69-79.c3-0.smr-ubr3.sbo-smr.ma.static.cable.rcn.com) has joined #duraspace
[23:32] * bradmc (~bradmc@207-172-69-79.c3-0.smr-ubr3.sbo-smr.ma.static.cable.rcn.com) Quit (Client Quit)

These logs were automatically created by DuraLogBot on irc.freenode.net using the Java IRC LogBot.