Monday, June 16, 2014

Elasticsearch Notes

Elasticsearch short introduction

Elasticsearch use cases

  • GitHub – code search (GitHub uses Elasticsearch to search 20TB of data, including 1.3 billion files and 130 billion lines of code)
  • Foursquare – location search (searches over 50 million locations)
  • SoundCloud – music search (provides music search for 180 million users)
  • Fog Creek – code search (supports 30 million searches per month across 40 billion lines of code)

Elasticsearch installation

Java install

cd ~
sudo apt-get update
sudo apt-get install openjdk-7-jre-headless -y

Install Elasticsearch 1.2.1

wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.2.1.deb
sudo dpkg -i elasticsearch-1.2.1.deb
sudo service elasticsearch start

Configuration

  • System config file: /etc/elasticsearch/elasticsearch.yml
    • Application-wide settings (zen discovery, available analyzers)
    • Index default configurations (number of shards)
    • Cluster name and node name
  • Log config file: /etc/elasticsearch/logging.yml; by default, log files are written to /var/log/elasticsearch/
  • Another important part is tuning your operating system
    • Elasticsearch creates many files while indexing, so the open file descriptor limit should not be lower than 32000; it can be raised in /etc/security/limits.conf
    • ES is written in Java and obviously runs inside a JVM. The most important JVM option is -Xmx. I set it to about 50% of the total physical memory; the machine happened to have 64GB of RAM, so I set ES_HEAP_SIZE to 32GB.
The cluster name is important because Elasticsearch auto-discovers new nodes and joins them into a cluster based on the cluster name. A node name is also configured to make management easier.
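For reference, a minimal elasticsearch.yml sketch covering the settings mentioned above. All values here are illustrative placeholders, not our actual settings:

```yaml
# /etc/elasticsearch/elasticsearch.yml – illustrative values only
cluster.name: my-es-cluster
node.name: "node-1"

# Index defaults (placeholders)
index.number_of_shards: 5
index.number_of_replicas: 1
```

For the file descriptor limit, the matching /etc/security/limits.conf entry would look like `elasticsearch - nofile 32000` (the `elasticsearch` user name is an assumption; `-` sets both the soft and hard limit).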

Elasticsearch distribution model

  • Node
    • A running ES instance, i.e. a process running on a machine
    • Nodes can run on the same or on different machines
    • Testing: one machine can host several nodes. Production: one node per machine is recommended
  • Cluster
    • A distributed ES system made up of several nodes
    • Dynamic master election, so there is no single point of failure (the cluster fails only as a whole)
    • Communication between nodes, data distribution, and balancing are handled automatically
    • Viewed as a single whole from the outside
  • Index
    • Multiple indexes (like MySQL databases) are supported
    • Multiple types (like MySQL tables) inside each index
  • Shard
    • The building blocks of an index; an index is divided into shards
    • Each shard is a Lucene index
    • Shards are placed on different machines
    • Moving shards does not require reindexing
  • Replica
    • Each shard can have 0 or more replicas
    • A true copy of the primary shard
    • Increases fault tolerance and search performance
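The reason moving a shard is cheap while changing the shard count is not follows from the routing rule: a document's shard is a hash of its routing key modulo the number of shards. A minimal sketch of the idea (hashlib.md5 here is only a stand-in for ES's internal hash function):

```python
import hashlib

def pick_shard(doc_id: str, number_of_shards: int) -> int:
    """Sketch of ES document routing: hash(routing key) % shard count."""
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
    return h % number_of_shards

# The mapping depends only on the document id and the shard count, so a
# shard can move freely between nodes, but changing the shard count would
# send existing documents to different shards - hence reindexing is needed.
print(pick_shard("doc-42", 5))
```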

Usage

Please refer to http://www.elasticsearch.org/resources/ for more information, or take a look at the PPT attachment in the first section.

Monitoring

We use ElasticHQ to monitor and manage our cluster; it is simply a local website and needs no configuration.
The coloring of the cluster status is important:
  • Red – the index cannot be used; some primary shards are not allocated
  • Yellow – the index can still be used, but some replicas are not allocated
  • Green – the index runs normally

Manage outdated indexes

We use elasticsearch-curator to manage our outdated indexes

Installation

cd
sudo git clone https://github.com/elasticsearch/curator.git curator
cd curator
sudo python setup.py install

Configure a cron job

Manage indexes with test- as the prefix and - as the date separator (example index name: test-2014-06-18); to delete indexes older than 4 days and close indexes older than 1 day:
20 12 * * * /usr/local/bin/curator --host 192.168.1.1 -s - --prefix test- -d 4 -c 1
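The retention policy above can be sketched as follows (index names and dates are hypothetical; curator itself does this date math in production):

```python
from datetime import date

def classify(index_name: str, today: date, prefix: str = "test-",
             delete_days: int = 4, close_days: int = 1) -> str:
    """Mirror the curator policy above: delete indexes older than 4 days,
    close indexes older than 1 day, leave the rest open."""
    index_date = date.fromisoformat(index_name[len(prefix):])
    age = (today - index_date).days
    if age > delete_days:
        return "delete"
    if age > close_days:
        return "close"
    return "open"

today = date(2014, 6, 18)
for name in ["test-2014-06-18", "test-2014-06-16", "test-2014-06-10"]:
    print(name, "->", classify(name, today))  # open, close, delete
```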

Some problems

Address yellow or red cluster status

Red status is usually caused by a cluster failure and can be fixed by restarting the problem nodes (take a look at the logs located at /var/log/elasticsearch/{cluster name}.log).
Yellow status usually means some replica shards are not allocated; this can be solved by manually allocating the shards.
  1. Take a look at the cluster management tool (ElasticHQ) for index status: you can find out which shard fails, for example: shard 1 is not allocated
  2. find an available node which the shard is not allocated on, say Xemu.
  3. Invoke api to reroute:
    1. Put the following JSON in a file named reroute.json
      {
          "commands" : [
             {
                "allocate" : {
                    "index" : "test", "shard" : 1, "node" : "Xemu"
                }
              }
          ]
      }
    2. invoke
      curl -XPOST 'localhost:9200/_cluster/reroute' -d @reroute.json

Add replicas on the fly

curl -XPUT 'localhost:9200/test/_settings' -d '
{
    "index" : {
        "number_of_replicas" : 1
    }
}'