Friday, May 9, 2014

Elasticsearch – State of the .Net Clients

Like thousands of other developers, I have joined the Elasticsearch fan base.

Elasticsearch is a flexible and powerful open source, distributed, real-time search and analytics engine. Architected from the ground up for use in distributed environments where reliability and scalability are must haves, Elasticsearch gives you the ability to move easily beyond simple full-text search.

I have been working with Elasticsearch for just under a year now, and I have a pretty good understanding of building queries, templates, facets, and aggregations. About three months ago I switched from Couchbase+Elasticsearch to Elasticsearch alone to “skip the middle man”. Elasticsearch is a powerful document store: it is fast, and it supports versioning as well as partial and scripted updates.

So let’s talk about the .Net clients that are available for Elasticsearch. There are a few out there, and each has its benefits.


UPDATE – I have started contributing to Elasticsearch.Net; read about it in the updated blog post.


Elasticsearch.Net / NEST

Like most developers, I snagged this one first and started using it. It is a “hybrid” client with two different builds available: one with just the Elasticsearch.Net components for connectivity and API support, generated from the Elasticsearch API definitions, and the NEST build, which contains Elasticsearch.Net and adds a Fluent interface to the Query DSL.

NEST is used by a few notable names, specifically Stack Overflow and Kiln, and has some pretty good documentation.

So why did I stop using NEST?

For one, the API is horribly complex. I had to live in the documentation. And once I built out what I wanted, reading the code just hurt my head.

This is a simple Lucene query:

client.Search(body => body.Query(q => q.QueryString(qs => qs.Query(query))).Skip(skip).Size(take))

That is the equivalent of this JSON:

{
  "query": {
    "query_string": {
      "query": "lucene query"
    }
  },
  "size": 10,
  "from": 0
}

I didn’t even want to think about what the code would look like when doing multiple nested aggregations against multiple filters and queries. I would most likely have ended up with 300+ Fluent lambda expressions.


So my next step was to break it down, and then try just using the Elasticsearch.Net connectivity portions, and leave out the Fluent wrapper.


Let’s break out the anonymous objects!

// Anonymous object that serializes to the same JSON body as above
var search = new
{
    size = size,
    from = page * size,
    query = new { query_string = new { query = query } }
};

// Elasticsearch.Net takes the index, the type, and the request body
var r = await client.SearchAsync("myindex", "mytype", search);

Same results, slightly better code. However, anonymous objects smell like moldy cheese to me, and I still have to deal with the results of this query.


ElasticLinq


Announced a few months ago, ElasticLinq is a project from CenturyLink, spearheaded by Brad Wilson (who has a few great tools under his belt and has published a few books) and James Newkirk. They also pulled in Damien Guard (who worked on the LinqToSql/EF expression builders).


This project seemed like a who’s who on building a Linq interface into Elasticsearch, which gave me a bit of hope.


I really don’t recommend this project. Even though the people involved are notable and definitely qualified, it has taken a deep dive into open source release death. Zero commits in 2 months on a brand new project that is heavily lacking in features puts this in my “might use it in a demo app” category for adoption. If the people who made it lost interest that fast, the community isn’t going to pick it up and finish the heavy lifting.


Let’s face it though, implementing Linq is hard and rather tedious, so as a side project, this is something I expected to die off pretty quickly. I hope, however, that they prove me wrong.


PlainElastic.Net


This one is a bit interesting. It is a lot like Elasticsearch.Net, but it takes the approach of just being a wire format helper.


However, I could rinse/repeat the entire section on NEST for this one. It has a slightly cleaner Fluent interface, but again, tons of lambda expressions.


What’s Broken?


Steve Jobs once said:


That’s been one of my mantras -- focus and simplicity. Simple can be harder than complex: You have to work hard to get your thinking clean to make it simple. But it’s worth it in the end because once you get there, you can move mountains.

Both Elasticsearch.Net and PlainElastic.Net suffer from the same problem: a poor Http implementation. I am not saying it is bad code. It is just not where the thought behind the library went; the focus was heavily on other features. Looking at the two code bases, it is also obvious that one is based on the other, though I won’t claim to care which is based on which.


They both suffer heavily from another problem: they don’t maintain an Http connection. The Elasticsearch documentation for Http connections says:


When possible, consider using HTTP keep alive when connecting for better performance and try to get your favorite client not to do HTTP chunking.

In most of the Elasticsearch talks I have seen, as well as in the Google groups, they push heavily on keeping the connection alive. That is because you are draining sockets on both sides of the connection when open/close is used. Creating TCP connections, even with Http, isn’t cheap; it takes a multi-packet exchange to negotiate the connection. While this might seem like a micro-optimization, at scale it is a problem.


One could argue that WebRequest.Create() will share connections internally, but it is never guaranteed.


Finally, WebRequest has async support, but it uses the APM model (Begin/End method pairs) rather than the TAP model (Task-returning methods), and TAP is how most modern Web API clients are going to communicate with remote services. There are also some performance differences between the two. I won’t get into them, but it really comes down to when it is appropriate to use either.
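To make the APM vs. TAP distinction concrete, here is a minimal sketch of wrapping WebRequest's Begin/End pair in a Task so it can be awaited. The class and method names (WebRequestTapExtensions, GetResponseTaskAsync) are made up for illustration, and this is the general pattern rather than what any of the clients above necessarily do. On .NET 4.5, WebRequest.GetResponseAsync() already provides the same thing out of the box.

```csharp
using System;
using System.Net;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class WebRequestTapExtensions
{
    // Adapts the APM pair BeginGetResponse/EndGetResponse into a single
    // Task<WebResponse> that callers can await (TAP).
    public static Task<WebResponse> GetResponseTaskAsync(this WebRequest request)
    {
        if (request == null) throw new ArgumentNullException("request");

        return Task<WebResponse>.Factory.FromAsync(
            request.BeginGetResponse,   // APM "begin" half
            request.EndGetResponse,     // APM "end" half
            null);                      // no extra state object
    }
}
```

With that in place, calling code reads the same as any other TAP API: `var response = await WebRequest.Create(url).GetResponseTaskAsync();`.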


HttpClient does use the TAP model, and is probably more appropriate for Elasticsearch.




    1. An HttpClient instance is the place to configure extensions, set default headers, cancel outstanding requests and more.
    2. You can issue as many requests as you like through a single HttpClient instance.
    3. HttpClients are not tied to particular HTTP server or host; you can submit any HTTP request using the same HttpClient instance.
    4. You can derive from HttpClient to create specialized clients for particular sites or patterns
    5. HttpClient uses the new Task-oriented pattern for handling asynchronous requests making it dramatically easier to manage and coordinate multiple outstanding requests.

HttpClient is async only. If you are doing IO, async should be the only way to access it, especially if you are performing IO on a server. You can keep a single HttpClient instance for as long as you want and dispose of it later, which gives you control over the Http layer instead of relying on internal Http connection factories like the ones WebRequest uses. You can inherit from it, which means custom objects with a specific API. And finally, HttpClient lets you predefine things such as default headers and authentication, as well as a full request/response pipeline.


BTW, ElasticLinq uses HttpClient, which is another reason I wish it hadn’t died off so fast.
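As a rough sketch of the points above, the following keeps one long-lived HttpClient with default headers and uses it to post a query body to Elasticsearch. The class name ElasticHttp, its method names, and the http://localhost:9200/ address are assumptions for illustration; this is not code from any of the clients discussed here.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

// Hypothetical wrapper, for illustration only.
public static class ElasticHttp
{
    // One shared instance for the lifetime of the process.
    // The address is an assumed local node on the default port.
    private static readonly HttpClient Client =
        CreateClient(new Uri("http://localhost:9200/"));

    // Builds a client with the defaults every request should carry.
    public static HttpClient CreateClient(Uri baseAddress)
    {
        var client = new HttpClient { BaseAddress = baseAddress };
        client.DefaultRequestHeaders.Accept.Add(
            new MediaTypeWithQualityHeaderValue("application/json"));
        return client;
    }

    // Posts a raw JSON query to {index}/_search and returns the raw JSON reply.
    public static async Task<string> SearchAsync(string index, string queryJson)
    {
        using (var content = new StringContent(queryJson, Encoding.UTF8, "application/json"))
        using (var response = await Client.PostAsync(index + "/_search", content).ConfigureAwait(false))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync().ConfigureAwait(false);
        }
    }
}
```

A call would then look like `var json = await ElasticHttp.SearchAsync("myindex", body);`, where body is a JSON query string such as the query_string example earlier in the post.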


Given that we have a modern Http Client, isn’t it time to have a modern Elasticsearch Client?



Next week I will be attending TechEd, locked up in a hotel room at night for 5 days. This will give me an excuse to start building this and get it up on GitHub. Additionally, I plan on doing a couple of blog posts explaining why some of the decisions in the code are being made, and showing some sample usages of it.


I have created the initial GitHub Repository with a really awesome name of AsyncElastic… naming things is hard. Sometime today I plan on going through and adding Issues for the features I want to support in the 1.0 beta, which will most likely include Document Storage and Raw Http requests for single, multiple, and “sniffed” servers.


Why not just contribute to ElasticLinq? Well, I don’t want to even think about touching that Linq expression code, and I would prefer not to take that approach: “Connection has a Client” vs. “Connection is a Client”.


The ASP.Net MVC AngularJS series will also be moving forward with AsyncElastic as the data store for the next few blog posts.

9 comments:

Martijn Laarman said...

Hi Tom,

Author of Elasticsearch.NET/NEST here.

The idea of the lambda expressions is that they allow you to abstract to methods exactly to avoid nesting 300 lambda expression/object creations. I am working on allowing both lambda and object initializers in the future though as I understand this is the most common pain point for folks.

While on the latter point I have to concede that this might be a hindrance to folks who'd rather do manual object creation, the claim that NEST has a poor http implementation I have to rebut:

The default connection in Elasticsearch.Net uses WebRequest, although it's been written to use TAP completely internally and still be .NET 4.0 compatible.

In fact the common connection abstraction interface is TAP.

The msdn docs are quite adamant about KeepAlive for HTTP/1.1:

When using HTTP/1.1, Keep-Alive is on by default. Setting KeepAlive to false may result in sending a Connection: Close header to the server.

The reason NEST does not use HttpClient by default in .NET 4.5 is because the default HttpClientMessageHandler still uses HttpWebRequest under the hood. Don't be fooled by the new interfaces it exposes. The same claims about not being able to guarantee WebRequest not sharing connections still apply to HttpClient.

I also started on implementing an HttpClient version of IConnection but did not see any performance or resource gains so I opted for the lowest common denominator which works well across most platforms/.NET versions

https://github.com/elasticsearch/elasticsearch-net/blob/f201a6bd13ecb084fe35a57449068e6be9677c73/src/Connections/Elasticsearch.Net.Connection.HttpClient/ElasticsearchHttpClient.cs

All the async methods in Elasticsearch.Net/NEST are async to the core and rely solely on IOCP.

That's not to say HttpClient's interface is not vastly superior to HttpWebRequest, but because the default HttpClientMessageHandler still relies on HttpWebRequest, it also still shares many of the same warts.

This is why projects like these https://github.com/abock/CurlSharp are being followed by me very closely! None of the alternative HttpClientMessageHandlers are production ready (not counting Paul Betts' native iOS/Android bindings)

I welcome the competition but I would also love a contributor who totally owns the HttpClient IConnection implementation! Combining efforts > fragmentation :)

Tom Anderson said...

Hi Martijn,

Thanks so much for the well thought out reply.

I did take a closer look into the implementations of the HttpClientHandler that is by default used in the deep trenches of the HttpClient, and... feel a bit derpy when I realized that under the hood, it is just a much better API wrapped around the same core internals.

I actually had totally skimmed over the IConnection implementation that was started in the Elasticsearch.Net code base for HttpClient. This looks like a better way to go about this, rather than creating a side-tracked project that goes a totally different direction, I agree that combining efforts is definitely much better.

With regard to the libcurl wrapper (thanks for showing me that), with the HttpClient IConnection implementation it could be very easy to swap out the HttpMessageHandler layer with the curl one once completed.

I spent a few hours so far this week hammering out the first draft of the core connection logic of the inner HttpClient implementations in AsyncElastic, but I think I want to focus my energy on the IConnection implementation in NEST instead, mostly because it frees up a lot of work on my side, and it also gets someone to poke through my code and abuse it on a larger scale.

This goes back to one of my core philosophies that I totally bypassed on this blog post/project.

"Did you talk to someone else, did they think it was a good idea, and did they get excited about it?"

Again, thanks for the response, and I look forward to hopefully talking with you more in the future.

Thanks,
~Tom

Rich Ward said...

Hi -

What's the latest on AsyncElastic? Just went to GitHub and couldn't find it.

We're new to ElasticSearch - and .NET developers - so we'd love something that is clear, well-documented and works!

:-)

Any wisdom greatly appreciated.

Best,

- Rich

Rich Ward said...

Hi Tom -

I read your new post and saw that you are now working on Elasticsearch.Net.Connection.HttpClient.

I did try to install that and received the following:
Install-Package : Unable to find package 'Elasticsearch.Net.Connection.HttpClient'.
At line:1 char:1
+ Install-Package Elasticsearch.Net.Connection.HttpClient
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (:) [Install-Package], InvalidOperationException
+ FullyQualifiedErrorId : NuGetCmdletUnhandledException,NuGet.PowerShell.Commands.InstallPackageCommand

Any ideas?

Many thanks,

- Rich

Tom Anderson said...

Hey @Rich, glad you stopped by :)

The HttpClient hasn't made its presence on NuGet.org yet, but it is available from the CI MyGet repository here: https://www.myget.org/gallery/elasticsearch-net

You can go bug Martijn on his github to get it published to Nuget.org :)

As far as responding to your previous comment, I dropped support of the AsyncElastic only due to the fact that contributing was a better approach, especially with my availability to maintain an open source project (see my response to this blog post in http://www.tomanderson.me/2014/05/elasticsearchnet-and-contributing-to.html)

If you have any questions, or find any issues, please feel free to hit me up!

Oliver said...

Hi Tom,

thanks for your post and touching on the .NET ecosystem around ElasticSearch.

Fortunately, I read all the way down to the end - not only the end of your post but also the end of the comments ;-) There was some very valuable information in there so I'd suggest you put a note at the top of your post with an update/hint as to how or why your post is kind of biased. Maybe you could even link to http://www.tomanderson.me/2014/05/elasticsearchnet-and-contributing-to.html for people to understand where things went after you published your post.

Anyway, I'm glad folks are teaming up on an awesome project like NEST and ElasticSearch.NET!

Tom Anderson said...

Thanks @Oliver, I have modified the post to redirect to the updated information.

AJ said...

Hi Tom,

Recently we deployed our application using Nest 2.5.0 to production. Our application does 100s of read and write requests per second. This in turn is leading to port exhaustion on our server side. I need an opinion: should I use a single static ElasticClient instance, or create an instance per request, to address this issue?
A little guidance would be really appreciated.

Regards,
Abhijeet

Tom Anderson said...

Under the hood, Nest (.Net) should still be doing connection pooling, so you should be safe to share a connection between everything.

If you are running into port exhaustion on the server side (I am assuming on the one calling Elasticsearch, not the Elasticsearch server), you might want to up the ServicePointManager settings.