Site Search with Middleman and lunr.js

Published: 2016-09-22 19:10 -0400

One of the tasks I have for this year is to review all the applications I’ve developed and consider how to lower their maintenance costs. Even applications that aren’t being actively fed new content need to be updated for security vulnerabilities in the framework and libraries. One easy way to lower those costs is to consider shutting them down, and I wish more applications I have developed were candidates for sunsetting.

We have some older applications that are still useful and can’t be shut down. They’re largely static but occasionally do get an update. We’ve thought about how to “pickle” certain applications by taking a snapshot of them and letting that static representation live on without the application code running behind it, but we’ve never pursued that approach because making changes that need to be applied across the whole site can be annoying.

For a couple of these applications I’m considering migrating them to a static site generator. That would allow us to make changes, not worry about updating dependencies, and remove concerns about security. One feature though that seemed difficult to replace without a server-side component is search. So I’m newly interested in the problem of site search for static sites. Here’s how I added site search to this blog as a way to test out site search without a server-side component.

Before making this change I was just pointing to a Google site search, which isn’t the kind of thing I could do for one of our sites at work. What I’m doing now is certainly more complex than a simple search box like that, but the middleman-search gem made it rather simple to implement. There were a few things that took me a little time to figure out, so I’m sharing snippets here to maybe save someone else some time.

First, if you are using this with Middleman 4, using the master version of the gem might help:

gem 'middleman-search', github: 'manastech/middleman-search'

Then I adapted the code that activates the plugin in config.rb to the structure of my blog. The pages for tagging polluted the index, so I added a very rudimentary way to keep them from getting indexed by skipping pages with certain titles. I also added a way to store the section of the site (as “group”) so that it can be displayed along with any search result.

activate :search do |search|
  search.resources = ['about/', 'blog/', 'bots/', 'bots-blog/', 'demos/',
    'experience/', 'presentations/', 'projects/', 'writing/']
  search.fields = {
    title:   {boost: 100, store: true, required: true},
    content: {boost: 50},
    url:     {index: false, store: true}
  }

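  # Titles of the tag listing pages, which should stay out of the index
  search_skip = ['Articles Tagged', 'Posts by Tag']

  search.before_index = Proc.new do |to_index, to_store, resource|
    # Skip any page whose title is on the skip list
    throw(:skip) if search_skip.include?(resource.data.title)
    # Store the top-level section of the site so results can display it
    to_store[:group] = resource.path.split('/').first
  end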
end

When the site is built it creates a search.json file at the root (unless you tell it to put it somewhere else). To encourage the client to cache the file, we’ll turn on caching for the ajax request that fetches it. As the site gets updated we’ll want to bust that cache, so we need to add “.json” to the list of extensions that Middleman will create a digest hash for and properly link to. The way of doing this that is in all of the documentation, appending “.json” to asset_hash.exts, did not work for me. What did work was spelling out every extension to create a hash for.

activate :asset_hash do |asset_hash|
  asset_hash.ignore = [/demos/]
  asset_hash.exts = %w[ .css .js .png .jpg .eot .svg .ttf .woff .json ]
end
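
To make the cache-busting idea concrete, here is a minimal Node.js sketch of content-based file naming. It is not Middleman's implementation: the function name digestName, the choice of SHA-1, and the eight-character suffix are all assumptions made for illustration. The only point is that the file name changes whenever the contents change, so a browser never reuses a cached copy of an old index after a new build.

```javascript
// Illustrative only: this is not Middleman's actual asset_hash code.
const crypto = require('crypto');
const path = require('path');

// Build a name like "search-<8 hex chars>.json" from the file's contents.
// digestName, SHA-1, and the 8-character length are assumptions.
function digestName(filename, contents) {
  const ext = path.extname(filename);        // ".json"
  const base = path.basename(filename, ext); // "search"
  const hash = crypto
    .createHash('sha1')
    .update(contents)
    .digest('hex')
    .slice(0, 8);
  return base + '-' + hash + ext;
}

// Two builds with different (made-up) index contents
const before = digestName('search.json', '{"docs":{"0":{"title":"Old"}}}');
const after = digestName('search.json', '{"docs":{"0":{"title":"New"}}}');

// The name differs between builds, so the old cached file is bypassed
console.log(before, after, before !== after);
```

Because the name is derived from the contents, an unchanged index keeps its name between builds and can stay cached indefinitely.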

Next I created a simple erb file (with frontmatter) to make up the search page. I’ve added a form to fall back to a Duck Duck Go site search.

---
title: Search
---

<%= javascript_include_tag 'search' %>

<h1>Search</h1>

<p>
  <input type="text" id="search" placeholder="Search..." style="width: 100%">
</p>

<div id="result-count"></div>

<div class="list-group searchresults">
</div>


<div id="duckduckgo-fallback-search">
  <p>If you can't find what you're looking for try searching this site via Duck Duck Go:</p>
  <form action="http://duckduckgo.com" method="get" role="search">
    <div class="form-group">
      <input class="search form-control" type="text" name="q" value="site:ronallo.com " autocomplete="off">
    </div>
  </form>
</div>

And here’s the JavaScript, the beginnings of it borrowed from the middleman-search readme and this blog post. Unfortunately the helper search_index_path provided by middleman-search did not work–the method was simply never found. One magic thing that took me a long time to figure out was that using this helper was completely unnecessary. It is totally fine to just include the URL as /search.json and Middleman will convert it to the asset hash name when it builds the site.

The other piece that I needed to open the console for was to find out why the search results only gave me back documents with a ref and score like this: { ref: 6, score: 0.5273936305006518 }. The data packaged into search.json includes both the index and the documents. Once we get the reference to the document, we can retrieve the document to give us the url, title, and section for the page.

Updated 2016-09-23 to use Duck Duck Go as the fallback search service.

var lunrIndex = null;
var lunrData  = null;
// Download index data
$.ajax({
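  // Middleman rewrites this URL to the hashed file name
  // (for example /search-dcb4f2bc.json) when it builds the site
  url: "/search.json",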
  cache: true,
  method: 'GET',
  success: function(data) {
    lunrData = data;
    lunrIndex = lunr.Index.load(lunrData.index);
  }
});

$(document).ready(function () {
  var duckduckgosearch = $('#duckduckgo-fallback-search');
  duckduckgosearch.hide();

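  $('input#search').on('keyup', function () {
    // The index loads asynchronously; ignore keystrokes until it is ready
    if (lunrIndex === null) { return; }
    // Get query
    var query = $(this).val();
    // Search for it
    var result = lunrIndex.search(query);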
    // Output it
    var searchresults = $('.searchresults');
    var resultcount = $('#result-count');
    if (result.length === 0) {
      // Hide results
      searchresults.hide();
      resultcount.hide();
      if (query.length === 0) {
        duckduckgosearch.hide();
      } else {
        duckduckgosearch.show();
      }
    } else {
      // Show results (the count may have been hidden by an earlier empty result)
      resultcount.html('results: ' + result.length).show();
      searchresults.empty();
      result.forEach(function (item) {
        // A result only gives us a reference to a document;
        // use that reference to look up the stored document
        var doc = lunrData.docs[item.ref];
        // Badge showing the section of the site
        var group = $('<span class="badge"></span>').text(doc.group);
        // Build the link with .text() so titles are inserted as text, not HTML
        var searchitem = $('<a class="list-group-item"></a>')
          .attr('href', doc.url)
          .text(doc.title + ' ')
          .append(group);
        searchresults.append(searchitem);
      });
      searchresults.show();
    }
  });
});
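
The lookup step in the script above can be tried on its own, without jQuery or lunr. This is a minimal sketch under assumptions: the searchData object only imitates the docs part of search.json (its entries and URLs are made up), sampleResults stands in for what lunrIndex.search() returns, and resolveResults is a hypothetical helper name.

```javascript
// Stand-in for the "docs" part of search.json; these entries are invented.
var searchData = {
  docs: {
    '3': { title: 'First post', url: '/blog/first-post/', group: 'blog' },
    '6': { title: 'About', url: '/about/', group: 'about' }
  }
};

// Stand-in for lunr search results: each has only a ref and a score.
var sampleResults = [
  { ref: 6, score: 0.5273936305006518 },
  { ref: 3, score: 0.21 }
];

// Hypothetical helper: swap each ref for its stored document,
// dropping any ref that has no stored document.
function resolveResults(results, docs) {
  return results
    .map(function (r) { return docs[r.ref]; })
    .filter(function (doc) { return doc !== undefined; });
}

var resolved = resolveResults(sampleResults, searchData.docs);
// Each resolved entry now carries the title, url, and group to display
console.log(resolved.map(function (d) { return d.title + ' (' + d.group + ')'; }));
```

The results keep the order lunr returned them in, so the highest-scoring document is still listed first.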

That’s it. Solr-like search for a completely static site. Try it.
