Elasticsearch short introduction
If interested, can read the book:http://books.google.com.sg/books?id=PEFK3MuwBsIC&printsec=frontcover&dq=elasticsearch&hl=en&sa=X&ei=RuhtUpv8J4KRrQfjpoDQAg&redir_esc=y#v=onepage&q&f=false
Elasticsearch use case
- GitHub – search code base(Github uses Elasticsearch to search 20TB data,including 1.3 billion files and 130 billion code lines)
- Foursquare – location search(50 million location data search)
- SoundCloud – music search(provide music search service for 180 million users)
- Fog Creek – search code base(support 30 million search in 40 billion code lines every month)
Elasticsearch installation
Java install
install elasticsearch 1.2.1
Configuration
- system config file: /etc/elasticsearch/elasticsearch.yml
- Application-wide settings (zen discovery, available analyzers)
- index default configurations (number of shards)
- Cluster Name and Node Name
- log config file: /etc/elasticsearch/logging.yml, default logging files are located in /var/log/elasticsearch/
- Another important part is tuning your operating system
- Elasticsearch will created several files when indexing, so the system cannot limit the open file descriptors to less than 32000, can be edit in /etc/security/limits.conf
- ES is written in Java and obviously runs inside a JVM. The most apparent JVM option is -Xmx. I set it to about 50% of the total physical memory, it happened to be a 64GB of RAM machine, so I set the
ES_HEAP_SIZE
size to 32GB.
Cluster name is important because elasticsearch will auto discover new nodes and connect nodes in to cluster based on cluaster name. Node name is also configured to make management easier.
Elasticsearch distribution model
- Node
- A running ES instance, or process running on a machine
- Run on same or different machine
- Testing: same machine can have several nodes. Production:suggestion one machine single node
- Cluster
- Distributed es system made of several nodes
- Dynamic master election, no single node fail(fail as a whole)
- Communication between nodes and data distribution and balancing is automatically handled
- View as a whole from outside
- Index
- Multiple index(like mysql database) support
- Multiple types(like mysql tables) inside indexes
- Shard
- Building blocks of index, index is divided in to shards
- Each Shard is a Lucence index
- Shards will be placed on different machines
- Moving of shards does not require reindex
- Replica
- Each shards can have 0 or more replicas
- True copy of primary shard
- Increase system fault tollerace, and search performance
Usage
Please refer to http://www.elasticsearch.org/resources/ for more information or please take a look at the ppt attachment in the first section.
Monitoring
We use ElastcHQ to monitor and manage our cluster, it's simply a local website. No configuration is needed.
Important coloring of cluster status:
- Red – Index cannot be used, some primary shards are not allocated
- Yellow – Index can still be used, but some replicas are not allocated
- Green – Index run as normal
Manage outdated indexes
We use elasticsearch-curator to manage our outdated indexes
Installation
Configuration cron job
Manage indexes with test- as a prefix and - as date separator(example index name: test-2014-06-18), to delete indexes order than 4 days and close indexes order than 1 day:
Some problems
Address yellow or red cluster status
red status usually caused by cluster fail, will be fixed by restarting problem nodes(take a look at the logs located at /var/log/elasticsearch/{cluster name}.log)
yellow status usually means some replica shards are not allocated. Can be solved by manually allocate the shard.
- Take a look at the cluster management tool (ElasticHQ) for index status: you can find out which shard fails, for example: shard 1 is not allocated
- find an available node which the shard is not allocated on, say Xemu.
- Invoke api to reroute:
- put the following json in a file reroute.json
- invoke