SOLR-18130: new param "solrConnection" in solrj-streaming, support HTTP#4320
SOLR-18130: new param "solrConnection" in solrj-streaming, support HTTP#4320vyatkinv wants to merge 49 commits into
Conversation
27f4ae5 to
c877f2a
Compare
|
Shouldn't |
@dsmiley previously suggested naming this parameter |
Agreed... I believe the intent is to specify how you connect to your running Solr? In the first PR I believe it was |
dsmiley
left a comment
There was a problem hiding this comment.
Great work here! Clearly this was a bit more involved than a rename ;-)
"connectionString" is somewhat long and it's ambiguous as to what we're even connecting to. But it's not bad. How about "solrCloudString"? If you don't like that, I'll capitulate and be satisfied with "connectionString".
Incorporating auth is out-of-scope but I'll say now that I'm suspicious that it makes sense to embed secret credentials into this. A conversation for another time.
I imagine there should be some ref-guide updates on this in addition to major-changes-in-solr-10.adoc.
I can see the potential of a follow-on issue to update CLIUtils.getZkHost (and thus CLI commands that call it) to instead resolve a SolrCloud connection string that is not necessarily ZK.
Answers to the PR questions:
- It's a test issue -- good catch. Glad you made resolving this mandatory. I wouldn't call this "mandatory SolrCloud parameter" since the param can be blank when it's used inside Solr, defaulting to the server's connection.
- Normalize; do not round-trip with "zkHost".
- IMO SolrParams should be used for "parameters" like this, not Map. Perhaps too much scope creep here but I leave that to you.
- Let's not add support for that to ScoreNodes now, albeit a single comment where we initialize it would be nice. Like "for now limited to the current cluster but we could expand support if needed".
I suggest that you change StreamFactory to not mention "ZK" whatsoever, instead using the name we choose (e.g. getDefaultCloudString). For backwards-compatibility, however, the older method names should exist as deprecated.
|
Hi all... So I'm following up after @dsmiley pinged me (thanks!)... What I am excited about this is the opportunity to make it simpler for non Solr experts to access solr. I don't think that you should need to know if it's solrCloud or standalone or zk that you are using, it's just a "connection string". Yes, every connection string has rules of formatting... I would like a name like solrConnection or solrConnString but not something that is as specific as "solrCloudConnection" that ties us to a specific mode of Solr. I know that "Streaming Expressions" is a solr cloud specific feature. However, from a user perspective, they may not know that, or care. Someday I hpoe to see Streaming Expressions pushed more towards our Data Scientist / ML user base, and they don't care about Cloud versus non Cloud mode, and also don't care about zkHost... So, having a property that speaks to their concern "how do I connect" is perfect. Could we have a property like |
|
+1 to "solrConnection"
huh; you just agreed to "solrConnection"? |
|
I was just wondering if in the context of a stream expression if saying "connection" was enough. I like "solrConnection" too. We might look at other connections and see what they are named... |
|
Then lets move forward @vyatkinv with "solrConnection" throughout. If at the last minute Eric gets us to change our minds again, we shouldn't need to subsequently change the vast majority of this PR since "solrConnection" is a perfectly fine parameter & field & local-var name in all places we have such. |
…g> by SolrParams in streams, which used it
|
I’ve addressed most of the review comments. Summary of changes in the latest commits:
As follow-up work (either separate issue or PR), I suggest: PS: Documentation and new tests are not added yet. |
|
I've taken all the comments into account, and this evening I'll try to add all the necessary integration tests and check whether it's safe to remove the solrClientCache initialization for standalone mode. |
|
Integration tests via Launching via Is there a workaround to run integration tests now? |
dsmiley
left a comment
There was a problem hiding this comment.
I'll do a couple of the suggestions I included in this review, finishing today.
I also want to ensure no stream creates a SolrClientCache() (no args) -- should always rely on StreamContext.
| streamFactory.withCollectionSolrConnection(defaultCollection, solrConnection); | ||
| streamFactory.withDefaultSolrConnection(solrConnection); |
There was a problem hiding this comment.
this line seems redundant with the previous
There was a problem hiding this comment.
Why? The first line sets the default collection and connection to it, and the second line sets the default connection for all other collections if solrConnection is not specified in the expression itself.
If we remove any of them, the behavior will change.
There was a problem hiding this comment.
This will be relevant when explicitly enabling external clusters. By default, indeed, one connection is always established.
| streamFactory.withDefaultZkHost(defaultZkhost); | ||
| var solrConnection = CloudSolrClient.CloudSolrClientConnection.parse(defaultZkhost); | ||
| streamFactory.withCollectionSolrConnection(defaultCollection, solrConnection); | ||
| streamFactory.withDefaultSolrConnection(solrConnection); |
There was a problem hiding this comment.
this line appears redundant with the previous
| if (!solrConnection.isZookeeper()) { | ||
| throw new SQLException( | ||
| String.format( | ||
| Locale.ROOT, "Expected ZooKeeper connection string, but got: '%s'.", schemaName)); | ||
| } |
There was a problem hiding this comment.
not sure there's any point in ensuring this.
This is a great thing to drop a note to [email protected] about, and let the person who merged the code know about the issue! |
|
I already did yesterday: https://issues.apache.org/jira/browse/SOLR-17855 |
|
Added integration tests for CLI utilities with |
|
I'll pursue the SolrClientCache security matter in another JIRA issue TBD. Sorry to see this JIRA issue get so much scope creep. We can merge something here but just won't release it until the security matter is handled. |
|
Good |
|
I leave it to @epugh to review the CLI aspects of this PR. Ideally that'd be a separate PR; it's a large scope divergence. I don't have time/experience or frankly interest to review the CLI. |
I'm rolling back |
|
Totally happy to give the CLI stuff another review! |
| this.parameters = parameters; | ||
| } | ||
|
|
||
| public void addParameters(SolrParams parameters) { |
There was a problem hiding this comment.
The next method has this signature:
public StreamExpression withParameters(List<StreamExpressionParameter> parameters)
Shouldn't this method here, which calls that one, use the same style?
There was a problem hiding this comment.
This method should be called withParameters, but this name is taken by the method below, which overwrites existing parameters. That's why I chose the name addParameters to avoid confusion, because behaviour of my method different from withParameters
There was a problem hiding this comment.
if we are talking about returning this - I can add
There was a problem hiding this comment.
withMoreParameters or withAdditionalParameters?
yes, return this on any method that starts with "with".
There was a problem hiding this comment.
done
I also switched completely to chanined builder in several classes where possible.
| return shards; | ||
| } | ||
|
|
||
| public static CloudSolrClient.CloudSolrClientConnection buildSolrConnection( |
There was a problem hiding this comment.
An argument could be made for this being an instance method of streamFactory. It's a required param and calls 2 of its methods, one of which is similar to this one. That one is named differently, getConnectionForCollection vs buildSolrConnection but this is really the same concept. WDYT?
There was a problem hiding this comment.
good idea.
We can generally make getConnectionForCollection private - it should no longer be used directly in streams
There was a problem hiding this comment.
Love that -- making methods private or less scope exposure generally
There was a problem hiding this comment.
done
We can also make getDefaultSolrConnection private, if we decide not to use it in ScoreNodesStream, but at the beginning we said that we were not going to change its behavior and add the solrConnection parameter to its arguments.
dsmiley
left a comment
There was a problem hiding this comment.
NIce.
I'm itching to merge this. As soon as you're good with the CLI changes Eric, I'll get this in.
|
Okay, I am doing some testing. I actually think we can remove the |
…in-solr-10.adoc Co-authored-by: Eric Pugh <[email protected]>
https://issues.apache.org/jira/browse/SOLR-18130
Description
This PR updates the solrj-streaming module by replacing all usages of zkHost with solrCloud to enable support for HTTP-based quorum configurations.
Solution
Parameters, fields and variables in solrj-streaming, named as
zkHostrenamed tosolrCloudFor backward compatibility, specifying zkHost is still supported.
A shared method has been introduced in an abstract class to resolve the effective solrCloud value using a priority-based approach (e.g., explicit parameter → legacy zkHost → default Zookeeper host).
Tests
Introduced method, that randomly provides either an HTTP or Zookeeper record in different scenarios. Additionally, all zkHost parameters have been replaced with connectionString, using
getSolrConnection().toString()for substitution.Checklist
Please review the following and check all that apply:
mainbranch../gradlew check.