Monday, June 16, 2014

Elasticsearch Notes

Elasticsearch short introduction

Elasticsearch use cases

  • GitHub – code search (GitHub uses Elasticsearch to search 20TB of data, including 1.3 billion files and 130 billion lines of code)
  • Foursquare – location search (searches over 50 million locations)
  • SoundCloud – music search (provides music search for 180 million users)
  • Fog Creek – code search (supports 30 million searches per month across 40 billion lines of code)

Elasticsearch installation

Java install

cd ~
sudo apt-get update
sudo apt-get install openjdk-7-jre-headless -y

Install Elasticsearch 1.2.1

wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.2.1.deb
sudo dpkg -i elasticsearch-1.2.1.deb
sudo service elasticsearch start

Configuration

  • System config file: /etc/elasticsearch/elasticsearch.yml
    • Application-wide settings (zen discovery, available analyzers)
    • Index default configurations (number of shards)
    • Cluster name and node name
  • Log config file: /etc/elasticsearch/logging.yml; by default, log files are written to /var/log/elasticsearch/
  • Another important part is tuning your operating system
    • Elasticsearch creates many files while indexing, so the open file descriptor limit should not be lower than 32000; it can be raised in /etc/security/limits.conf
    • ES is written in Java and obviously runs inside a JVM. The most important JVM option is -Xmx. I set it to about 50% of the total physical memory; the machine happened to have 64GB of RAM, so I set ES_HEAP_SIZE to 32GB.
The cluster name is important because Elasticsearch auto-discovers new nodes and joins them into a cluster based on the cluster name. A node name is also configured to make management easier.
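For reference, a minimal elasticsearch.yml sketch covering the settings mentioned above. All values here are illustrative placeholders, not our actual settings:

```yaml
# /etc/elasticsearch/elasticsearch.yml – illustrative values only
cluster.name: my-es-cluster
node.name: "node-1"

# Index defaults (placeholders)
index.number_of_shards: 5
index.number_of_replicas: 1
```

For the file descriptor limit, the matching /etc/security/limits.conf entry would look like `elasticsearch - nofile 32000` (the `elasticsearch` user name is an assumption; `-` sets both the soft and hard limit).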

Elasticsearch distribution model

  • Node
    • A running ES instance, i.e. a process running on a machine
    • Nodes can run on the same or on different machines
    • Testing: one machine can host several nodes. Production: one node per machine is recommended
  • Cluster
    • A distributed ES system made up of several nodes
    • Dynamic master election, so there is no single point of failure (the cluster fails only as a whole)
    • Communication between nodes, data distribution, and balancing are handled automatically
    • Viewed as a single whole from the outside
  • Index
    • Multiple indexes (like MySQL databases) are supported
    • Multiple types (like MySQL tables) inside each index
  • Shard
    • The building blocks of an index; an index is divided into shards
    • Each shard is a Lucene index
    • Shards are placed on different machines
    • Moving shards does not require reindexing
  • Replica
    • Each shard can have 0 or more replicas
    • A true copy of the primary shard
    • Increases fault tolerance and search performance
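The reason moving a shard is cheap while changing the shard count is not follows from the routing rule: a document's shard is a hash of its routing key modulo the number of shards. A minimal sketch of the idea (hashlib.md5 here is only a stand-in for ES's internal hash function):

```python
import hashlib

def pick_shard(doc_id: str, number_of_shards: int) -> int:
    """Sketch of ES document routing: hash(routing key) % shard count."""
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
    return h % number_of_shards

# The mapping depends only on the document id and the shard count, so a
# shard can move freely between nodes, but changing the shard count would
# send existing documents to different shards - hence reindexing is needed.
print(pick_shard("doc-42", 5))
```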

Usage

Please refer to http://www.elasticsearch.org/resources/ for more information, or take a look at the PPT attachment in the first section.

Monitoring

We use ElasticHQ to monitor and manage our cluster; it is simply a local website and needs no configuration.
The coloring of the cluster status is important:
  • Red – the index cannot be used; some primary shards are not allocated
  • Yellow – the index can still be used, but some replicas are not allocated
  • Green – the index runs normally

Manage outdated indexes

We use elasticsearch-curator to manage our outdated indexes

Installation

cd
sudo git clone https://github.com/elasticsearch/curator.git curator
cd curator
sudo python setup.py install

Configure a cron job

Manage indexes with test- as the prefix and - as the date separator (example index name: test-2014-06-18); to delete indexes older than 4 days and close indexes older than 1 day:
20 12 * * * /usr/local/bin/curator --host 192.168.1.1 -s - --prefix test- -d 4 -c 1
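The retention policy above can be sketched as follows (index names and dates are hypothetical; curator itself does this date math in production):

```python
from datetime import date

def classify(index_name: str, today: date, prefix: str = "test-",
             delete_days: int = 4, close_days: int = 1) -> str:
    """Mirror the curator policy above: delete indexes older than 4 days,
    close indexes older than 1 day, leave the rest open."""
    index_date = date.fromisoformat(index_name[len(prefix):])
    age = (today - index_date).days
    if age > delete_days:
        return "delete"
    if age > close_days:
        return "close"
    return "open"

today = date(2014, 6, 18)
for name in ["test-2014-06-18", "test-2014-06-16", "test-2014-06-10"]:
    print(name, "->", classify(name, today))  # open, close, delete
```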

Some problems

Address yellow or red cluster status

Red status is usually caused by a cluster failure and can be fixed by restarting the problem nodes (take a look at the logs located at /var/log/elasticsearch/{cluster name}.log).
Yellow status usually means some replica shards are not allocated; this can be solved by manually allocating the shards.
  1. Take a look at the cluster management tool (ElasticHQ) for index status: you can find out which shard fails, for example: shard 1 is not allocated
  2. find an available node which the shard is not allocated on, say Xemu.
  3. Invoke api to reroute:
    1. Put the following JSON in a file named reroute.json
      {
          "commands" : [
             {
                "allocate" : {
                    "index" : "test", "shard" : 1, "node" : "Xemu"
                }
              }
          ]
      }
    2. invoke
      curl -XPOST 'localhost:9200/_cluster/reroute' -d @reroute.json

Add replicas on the fly

curl -XPUT 'localhost:9200/test/_settings' -d '
{
    "index" : {
        "number_of_replicas" : 1
    }
}'