#duraspace IRC Log

IRC Log for 2017-02-22

Timestamps are in GMT/BST.

[7:03] -orwell.freenode.net- *** Looking up your hostname...
[7:03] -orwell.freenode.net- *** Checking Ident
[7:03] -orwell.freenode.net- *** Found your hostname
[7:03] -orwell.freenode.net- *** No Ident response
[7:03] * DuraLogBot (~PircBot@webster.duraspace.org) has joined #duraspace
[7:03] * Topic is 'Welcome to DuraSpace IRC. This channel is used for formal meetings and is logged - http://irclogs.duraspace.org/'
[7:03] * Set by tdonohue on Thu Sep 15 17:49:38 UTC 2016
[13:20] * mhwood (mwood@mhw.ulib.iupui.edu) has joined #duraspace
[13:59] * tdonohue (~tdonohue@dspace/tdonohue) has joined #duraspace
[14:59] * tdonohue reminds everyone that our DSpace DevMtg starts shortly. Agenda https://wiki.duraspace.org/display/DSPACE/DevMtg+2017-02-22
[15:01] <tdonohue> Well, we are a tiny group today. Agenda is above, but it seems there's only 2 Committers here (pinging helix84 and mhwood)
[15:01] <mhwood> PONG
[15:02] * terry-b (~chrome@97-113-118-39.tukw.qwest.net) has joined #duraspace
[15:03] <tdonohue> And there's another Committer. Welcome terry-b
[15:03] <terry-b> hello
[15:03] <terry-b> I am back from a day off. Just starting my day here
[15:04] <tdonohue> It seems like in recent weeks, it's only been the three of us here (mhwood, terry-b & I). That's a bit disappointing, as it's harder for just three of us to accomplish much
[15:06] <tdonohue> Just as a comparison right now... currently in this IRC we have 5 real humans (only 3 verified "active"). While in Slack #dev it looks like 12 people are displaying as active
[15:06] <tdonohue> So, I'm really starting to wonder if we should move these meetings. If no one remembers about IRC meetings anymore, we should consider moving them to where there are more "eyes"
[15:07] <tdonohue> Even though, we don't have a kompewter or similar in Slack (speaking of, kompewter needs a kick)
[15:07] <terry-b> It would be easier for those not present to catch up on the threads in slack
[15:08] <tdonohue> Maybe I'll bring this up over in Slack #dev (post meeting), and see what folks think there as well.
[15:08] * kompewter (~kompewter@ec2-50-17-201-82.compute-1.amazonaws.com) has joined #duraspace
[15:08] <terry-b> It sounds like a good thing to try... and a good excuse to remind them of the meeting
[15:10] <tdonohue> For now, I put a reminder into #dev Slack about this mtg. Will ask over there after this meeting if we want to move it to Slack
[15:10] * hpottinger (~hpottinge@162.104.218.179) has joined #duraspace
[15:11] * ntorres (c1895861@gateway/web/freenode/ip.193.137.88.97) has joined #duraspace
[15:11] <tdonohue> That reminder woke up a few other people. Welcome hpottinger & ntorres
[15:11] <tdonohue> ;)
[15:11] * hpottinger shakes his head groggily, wut?
[15:12] <tdonohue> Ok, so back to our agenda: https://wiki.duraspace.org/display/DSPACE/DevMtg+2017-02-22
[15:12] <kompewter> [ DevMtg 2017-02-22 - DSpace - DuraSpace Wiki ] - https://wiki.duraspace.org/display/DSPACE/DevMtg+2017-02-22
[15:12] <tdonohue> We have the usual reminders at the top. Tomorrow is the next DSpace 7 UI meeting
[15:13] <tdonohue> The details were sent to dspace-devel mailing list though, and there will be a reminder on Slack
[15:13] <tdonohue> The main topics for today are DSpace 6.1 work (i.e. 6.x maintenance)
[15:15] <tdonohue> We have our usual list of high priority tickets in the Agenda itself. 10 of those are scheduled for 6.1
[15:15] <tdonohue> Here's that list https://jira.duraspace.org/issues/?jql=filter%20%3D%2013904%20AND%20fixVersion%20%3D%206.1%20ORDER%20BY%20fixVersion%20DESC%2C%20priority%20ASC%20
[15:15] <kompewter> [ Issue Navigator - DuraSpace JIRA ] - https://jira.duraspace.org/issues/?jql=filter%20%3D%2013904%20AND%20fixVersion%20%3D%206.1%20ORDER%20BY%20fixVersion%20DESC%2C%20priority%20ASC%20
[15:16] <mhwood> 3367 and 3378 are waiting for me to test. I'll be doing that today.
[15:16] * th5 (~th5@unaffiliated/th5) has joined #duraspace
[15:16] <tdonohue> So, I'd like to do some updates on these tickets.
[15:16] <tdonohue> thanks mhwood
[15:17] <tdonohue> Another one worth us looking at today is DS-2952 (lower on that list). It has several PRs now DSPR#1595 and DSPR#1647
[15:17] <kompewter> [ https://jira.duraspace.org/browse/DS-2952 ] - [DS-2952] SOLR: Full text indexing only includes the text on the last bitstream - DuraSpace JIRA
[15:17] <kompewter> [ https://github.com/DSpace/DSpace/pull/1595 ] - DS-2952 SOLR full text indexing multiple bitstreams by tomdesair
[15:17] <kompewter> [ https://github.com/DSpace/DSpace/pull/1647 ] - DS-2952: Alternative approach for PR 1595 by terrywbrady
[15:18] <tdonohue> Tom Desair recently commented on the PRs I see... terry-b and I also had some recent discussions on the behavior here
[15:18] <hpottinger> 1647 seems squishy
[15:19] <terry-b> I see tom desair's comment about the need to do a full text rebuild. We have many occasions in DSpace where we need to run a full rebuild.
[15:19] <tdonohue> I think the main question/concern here is that we've realized that 1595 "looks good" except it indexes *all* private bitstreams as publicly searchable. terry-b looked to "fix" that in 1647 by changing the behavior to just index public bitstreams as publicly searchable.
[15:20] <terry-b> Although I can see the argument that it is 2 separate issues, it forces us to find volunteers to test/merge twice... What is going to get this resolved faster?
[15:20] <mhwood> But that means that the cache can become inconsistent with the main store, if access rights are changed. Maybe what we need is to (1) index everything, but (2) check rights before presenting each search result.
[15:21] <hpottinger> mhwood++
[15:22] <tdonohue> mhwood: agreed that is ideal here. But, currently it seems the Solr index doesn't store permissions. So, there's nothing to "check" against when returning results, I believe
[15:22] <terry-b> Some actions in DSpace are infrequent enough that it is probably correct to defer the index update until it is requested
[15:22] <mhwood> Don't ask Solr; ask the database, which is definitive.
[15:22] <hpottinger> full index rebuild requires downtime
[15:22] <terry-b> Solr stores item permissions that are used during search/browse
[15:22] <mhwood> Just filter the hits before presenting them.
[15:22] <terry-b> owningComm and owningColl
[15:23] <tdonohue> mhwood: so you are talking about getting *all* results from Solr, but then somehow filtering them via additional DB queries (per item/object)? That sounds expensive
[15:23] <terry-b> It would be prohibitively slow without that info
[15:24] <tdonohue> terry-b: do you know if the Solr item permissions stored could be useful here (as a filter for full text)?
[15:24] <tdonohue> Although, actually that's not the right level. We need *Bitstream* permissions
[15:24] <terry-b> They are item permissions not bitstream permissions
[15:24] <mhwood> How many hits do we show per page? Would that really be so very expensive?
[15:25] <tdonohue> mhwood: default is 10 items per page. But, you can switch that to up-to 100
[15:25] <terry-b> We have a collection of 150,000 items that are restricted to the university community. We rely on the permission filtering in Solr
[15:25] <hpottinger> if we change the results as we show them, we change the number of results
[15:25] <tdonohue> mhwood: the harder part is that you might be returned 10 items, but 5 may be restricted, so then you need to get another 5, but if one of those is restricted, then you need another 1.... all just to fill out 10 results
[15:26] <tdonohue> and hpottinger is right, we have no idea what the new "total" number of results is
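The refill problem described above can be sketched in a few lines. This is only an illustration of post-filtering in general, not DSpace code: `fill_page`, `hits`, and `can_read` are hypothetical names, and `can_read` stands in for whatever per-object database permission check would be used.

```python
from typing import Callable, Iterator, List, Tuple


def fill_page(
    hits: Iterator[int],
    can_read: Callable[[int], bool],
    page_size: int = 10,
) -> Tuple[List[int], int]:
    """Consume ranked hits until page_size readable ones are collected.

    Returns the page plus the number of hits that had to be examined.
    `can_read` is a hypothetical stand-in for a per-object permission lookup.
    """
    page: List[int] = []
    examined = 0
    for item_id in hits:
        examined += 1
        # Each check would cost one permission lookup in a real system.
        if can_read(item_id):
            page.append(item_id)
            if len(page) == page_size:
                break
    return page, examined


# Hypothetical data: 30 ranked hits, every odd-numbered one restricted.
page, examined = fill_page(iter(range(1, 31)), lambda i: i % 2 == 0)
```

With every other hit restricted, filling a page of 10 means examining 20 hits, and the total number of readable results stays unknown until every hit has been checked.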
[15:26] <mhwood> Good, fast, cheap...pick any two.
[15:27] <hpottinger> I don't know how to properly handle the "logged in user" use case, but for non-logged-in, we could just have an "is_OA" flag.
[15:27] <tdonohue> I only see two ways to make this "quick" : (1) Include permissions in the Solr index, so we can filter by them (and if permissions are updated, it should trigger an obj reindex anyhow)... or (2) Only index publicly available bitstreams (but as we pointed out that's not ideal)
[15:28] <tdonohue> By "quick" here I mean "speedy return of results"
[15:28] <mhwood> If querying a few hundred results out of an RDBMS is slow, you need to tune your indexes.
[15:28] <tdonohue> The #1 behavior is essentially what we already do for Items. We just would need the same behavior now for Bitstreams
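Option (1) above, filtering on permissions stored in the index itself, might look roughly like the following in-memory sketch. It assumes each indexed document carries the set of groups allowed to read it; `IndexedDoc`, `read_groups`, and `search` are hypothetical names chosen for illustration, not the actual DSpace or Solr schema.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class IndexedDoc:
    doc_id: str
    fulltext: str
    # Hypothetical field: groups allowed to read this document,
    # written at index time and rewritten when permissions change.
    read_groups: FrozenSet[str]


def search(index: List[IndexedDoc], term: str, user_groups: FrozenSet[str]) -> List[str]:
    """Return ids of documents matching `term` that the user may read."""
    return [
        doc.doc_id
        for doc in index
        # The permission test is part of the query, so counts and paging
        # are computed over readable documents only.
        if term in doc.fulltext and doc.read_groups & user_groups
    ]


# Hypothetical sample index: one public document, one restricted to staff.
index = [
    IndexedDoc("public-thesis", "solr full text indexing", frozenset({"anonymous"})),
    IndexedDoc("restricted-thesis", "solr permissions", frozenset({"staff"})),
]
anonymous_hits = search(index, "solr", frozenset({"anonymous"}))
staff_hits = search(index, "solr", frozenset({"anonymous", "staff"}))
```

The trade-off raised in the discussion still applies: the stored groups must be refreshed whenever an object's permissions change, which is the reindex-on-change behavior mentioned for Items.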
[15:29] <terry-b> Also remember that we should consider approaches that are appropriate for a point release
[15:29] <th5> What's the downside for adding permissions to solr?
[15:29] <mhwood> Updating them every time they change.
[15:29] <tdonohue> The only downside is it may be a lot of work (undetermined though)
[15:29] <terry-b> We do not currently represent bitstreams as objects in solr... they are collapsed into an item object
[15:30] <tdonohue> mhwood: we already do that for Items though. When Item permissions change, that single item is reindexed in Solr
[15:30] <terry-b> Before PR1595 we only represent one bitstream in the record at all
[15:30] <tdonohue> terry-b: good point. I forgot that Solr isn't even "aware" of bitstreams
[15:30] <hpottinger> seems a fair trade-off: only full-text index OA bitstreams
[15:31] <mhwood> But then logged-in users may not find everything they are entitled to.
[15:32] <th5> I second that concern. Our collections are largely locked down.
[15:32] <mhwood> So, do we need to split the problem into (1) make it righter without major code disturbance, for 6.1, and (2) make it rightest later (perhaps 7.0)?
[15:32] * dyelar (~dyelar@biolinux.mrb.ku.edu) has joined #duraspace
[15:32] <tdonohue> Yes, only indexing OA bitstreams would hide restricted content from searches. That is a huge problem for sites that lock down or restrict content (e.g. even sites that have local IP restrictions for access to theses & dissertations)
[15:33] <hpottinger> mhwood: good point
[15:33] <tdonohue> mhwood++ yes, likely
[15:33] <terry-b> mhwood, That makes sense to me although the solution may be tricky
[15:34] <hpottinger> it's literally what happens with embargos, at a smaller scale: your content isn't discoverable
[15:34] <terry-b> (solution for 7.0)
[15:34] <tdonohue> For embargoes, I'm not sure embargoed files SHOULD EVER be indexed (until the embargo expires). That's a different question to me than access restricted files (which should be indexed so they are searchable)
[15:35] <tdonohue> So, if we were to split this up for 6.1, I still think we need to exclude content under embargo from the Solr index
[15:35] <mhwood> It seems reasonable to treat embargo specially, since the people who should be able to see embargoed material are the people who should already know it's in there.
[15:35] <hpottinger> isn't embargo achieved by access policy these days?
[15:36] <tdonohue> hpottinger: yes, I'm now realizing that...which makes this harder :(
[15:36] <mhwood> Don't we still have "old embargo" AND "new embargo"? Fun....
[15:37] <terry-b> As we choose our approach for 6.1, we need to figure out if we (1)continue current practice which exposes embargo content or (2)make the full text more restrictive
[15:37] * tdonohue really wonders if we should have someone investigate what it would take to simply add a permission filter on the "full text" (not even adding the idea of bitstreams to Solr, just one filter on all the full text)
[15:38] <mhwood> So, for the short term: if it's restricted by old embargo, don't index it. If it's restricted by resource policy, don't index it. We still get more material indexed than before.
[15:38] <terry-b> Ultimately, the "old embargo" is effectively saved as an access policy so that distinction does not matter
[15:38] <tdonohue> mhwood: currently *everything* is indexed. So, we cannot get "more" than everything ;)
[15:39] <mhwood> Ah, I was thinking we only got one bitstream.
[15:39] <tdonohue> mhwood: well, yes, we do only get one bitstream. True.
[15:39] <terry-b> tdonohue, I do not know how we could add permissions for full text without introducing bitstream objects into solr
[15:42] <th5> Shouldn't things eventually move to having bitstreams (and permissions) in solr?
[15:42] <hpottinger> maybe we characterize the permissions? I'm not nearly smart enough to pull that off, but, some kind of code that we can use in a Solr query?
[15:42] <mhwood> If we index all bitstreams (as I believe the original PR does?) then filtering the hits would do that without further changes to Solr. If you want Solr to do the filtering then that's right -- we need to index each bitstream separately, and relate it to its item.
[15:42] <tdonohue> th5: yes, likely. but the point here is that's a significant effort. We're talking about what we can "fix" in 6.1 (nearterm) versus what is a much larger project (likely for 7.0)
[15:43] <th5> Thanks. Making sure I'm keeping up.
[15:43] <terry-b> It is useful to look at some of those records in solr admin, it helps to explain the functionality that is happening
[15:44] <tdonohue> So, I'm starting to come around to the idea of accepting DSPR#1595 "as-is" for 6.1. Let's see if I can explain why....
[15:44] <kompewter> [ https://github.com/DSpace/DSpace/pull/1595 ] - DS-2952 SOLR full text indexing multiple bitstreams by tomdesair
[15:44] <tdonohue> 1) This fixes the main bug (only one bitstream is currently indexed)...now all bitstreams are indexed
[15:45] <tdonohue> 2) It doesn't make things "worse"...currently it is possible to search within restricted/embargoed bitstreams (you won't be able to *SEE* the bitstreams, though you might see a snippet of where your search matched)
[15:45] <tdonohue> 3) In reality, if the Item itself is Restricted or Embargoed (at the Item Level), we already cover those scenarios...so you won't see those in your results.
[15:45] <hpottinger> if that's troublesome, you *can* turn off snippets
[15:45] <mhwood> So the result is that you find everything that fits your query, but you may not be able to read some of the results. That is mildly unpleasant, but it is correct.
[15:46] <tdonohue> I think that #2 and #3 are important... we need to realize we already handle Item restrictions properly. We're only talking about the unique scenario where the Item is *public* but one bitstream is *not*
[15:47] <tdonohue> And in that scenario you'd be able to search within the non-public bitstream, but you wouldn't be able to see it or download it
[15:47] <tdonohue> Thoughts?
[15:47] <hpottinger> tdonohue++ I think 1595 is good enough for now
[15:48] <terry-b> I think that is a clean decision on 1595. I think the issue about searching bitstreams under embargo is significant and we should provide some direction on that.
[15:48] * tdonohue notes we still should open up a separate ticket about adding Bitstream permissions to Solr (or some other way to filter out restricted bitstreams of public items)
[15:48] <mhwood> Yeah, it's progress. If we can commit to revisiting the issue, we can announce that "it's not perfect, but it's better, and we will make it better still."
[15:49] <terry-b> There is a security ticket for that.
[15:50] <tdonohue> I'm not sure it needs to be a security ticket. While it's not ideal, I don't think we need to keep this behavior secret (as you can turn off the snippets, in which case while you might get "hits" from embargoed bitstreams, you won't be able to see why)
[15:51] <terry-b> OK. When I first raised it you had suggested a security ticket.
[15:51] <th5> The biggest issue my users have brought up is respect for permissions / leaking data via discovery. In response we've turned off many features and put in very restrictive permissions.
[15:51] <mhwood> An infinite number of monkeys could eventually winkle out all of the restricted text using an infinite number of searches....
[15:51] <tdonohue> So, I'd vote we accept 1595 for 6.1, and open up the ticket terry-b is talking about, and clarify the workaround (turn off snippets), and move that forward for 7.0
[15:52] <tdonohue> terry-b: I've re-thought that now though. It's still a minor security issue, but there's an obvious workaround (turn off snippets), and the security implications here are extremely low (the most you can get back is a snippet)
[15:52] <tdonohue> terry-b: so, I suspect it'd be better to make this a public ticket, as then (hopefully) we get a volunteer from our community to *fix it*
[15:52] <terry-b> tdonohue, that sounds reasonable to me and your plan sounds do-able
[15:53] <terry-b> https://jira.duraspace.org/browse/DS-3498
[15:53] <kompewter> [ https://jira.duraspace.org/browse/DS-3498 ] - [DS-3498] Full Text Index Behavior on Items Under Embargo in DSpace 4x, 5x, 6x - DuraSpace JIRA
[15:53] <tdonohue> terry-b: I'll clean up the ticket description based on this discussion and open it up
[15:53] <kompewter> [ [DS-3498] Full Text Index Behavior on Items Under Embargo in DSpace 4x, 5x, 6x - DuraSpace JIRA ] - https://jira.duraspace.org/browse/DS-3498
[15:54] <tdonohue> oh, you already opened it. Ok. well, I'll clean up the ticket description then (post meeting), as I think we need to clarify the risk is "very low" here
[15:54] <terry-b> I can volunteer if the approach that I provided in the alternate PR seems reasonable for 6.1
[15:55] <terry-b> https://github.com/DSpace/DSpace/pull/1647 - please disregard the number of commits as I would plan to submit a fresh PR.
[15:55] <tdonohue> we are nearly out of time here. This has been a good discussion, but I do want to ask if there's other tickets we need to look at more closely (either now, or before next mtg)
[15:55] <kompewter> [ DS-2952: Alternative approach for PR 1595 by terrywbrady · Pull Request #1647 · DSpace/DSpace · GitHub ] - https://github.com/DSpace/DSpace/pull/1647
[15:56] <tdonohue> terry-b: I'm not sure the approach of only indexing public bitstreams would be best here. I think I now have convinced myself that #1595 (original PR) is "good enough" for 6.1
[15:57] <hpottinger> I volunteer to write the "how to hide snippets" docs, I've done that before and talked a few people through it, too.
[15:57] <terry-b> sounds good
[15:57] <hpottinger> gives me something to do during our "documentation working session"
[15:59] <tdonohue> ok, sounds like a plan. I'll update the DS-3498 ticket description now.
[15:59] <kompewter> [ https://jira.duraspace.org/browse/DS-3498 ] - [DS-3498] Full Text Index Behavior on Items Under Embargo in DSpace 4x, 5x, 6x - DuraSpace JIRA
[15:59] <tdonohue> We'll close up today's meeting then. I will be available for some JIRA Backlog reviews (if anyone is interested) in #dspace in ~5mins
[16:02] * mhwood (mwood@mhw.ulib.iupui.edu) has left #duraspace
[16:04] * tdonohue finished updating DS-3498 to better describe the issue and link back to this discussion
[16:04] <kompewter> [ https://jira.duraspace.org/browse/DS-3498 ] - [DS-3498] Full Text Index Behavior on Public Items with Embargoed Bitstreams - DuraSpace JIRA
[16:08] <tdonohue> I've also updated my comments on DSPR#1595 as approval
[16:08] <kompewter> [ https://github.com/DSpace/DSpace/pull/1595 ] - DS-2952 SOLR full text indexing multiple bitstreams by tomdesair
[16:09] * mhwood (mwood@mhw.ulib.iupui.edu) has joined #duraspace
[16:11] <terry-b> I merged https://github.com/DSpace/DSpace/pull/1595... I noticed this was for master. We will need a PR for 6.1 as well.
[16:11] <kompewter> [ Sign in to GitHub · GitHub ] - https://github.com/DSpace/DSpace/pull/1595...
[16:12] <tdonohue> terry-b: we can cherry-pick it over to 6.x. I can do so
[16:12] <terry-b> sounds good
[17:22] * terry-b (~chrome@97-113-118-39.tukw.qwest.net) Quit (Remote host closed the connection)
[17:30] * hpottinger (~hpottinge@162.104.218.179) Quit (Quit: Leaving 三三ᕕ( ᐛ )ᕗ LATER TATERS!)
[18:06] * hpottinger (~hpottinge@162.104.218.179) has joined #duraspace
[18:10] * ntorres (c1895861@gateway/web/freenode/ip.193.137.88.97) Quit (Ping timeout: 260 seconds)
[21:39] * hpottinger (~hpottinge@162.104.218.179) Quit (Quit: Leaving 三三ᕕ( ᐛ )ᕗ LATER TATERS!)
[22:26] * th5 (~th5@unaffiliated/th5) Quit ()
[22:27] * mhwood (mwood@mhw.ulib.iupui.edu) Quit (Remote host closed the connection)
[22:43] * tdonohue (~tdonohue@dspace/tdonohue) Quit (Read error: Connection reset by peer)
[22:56] * terry-b (~chrome@97-113-118-39.tukw.qwest.net) has joined #duraspace
[22:57] * terry-b (~chrome@97-113-118-39.tukw.qwest.net) Quit (Remote host closed the connection)
[23:35] * dyelar (~dyelar@biolinux.mrb.ku.edu) Quit (Quit: Leaving.)

These logs were automatically created by DuraLogBot on irc.freenode.net using the Java IRC LogBot.