How to import repository.tar.gz?

Hi All,

On my portal, the “data repository” page doesn’t show any data repositories. I realized I haven’t imported repository.tar.gz. My question is: how do I import the downloaded repository.tar.gz?

Thanks,

Brady

Hi Brady,

repository.tar.gz should be imported with the Knapsack plugin. The import tool does not support it yet.

First, you need to install the plugin and restart the Elasticsearch node where the plugin is installed.

/usr/share/elasticsearch/bin/plugin -url http://bit.ly/29A1hsz -install knapsack
sudo service elasticsearch restart

Then use the following command to start an archive import:

curl -XPOST "http://elasticsearchnode:9200/_import?path=/repository.tar.gz"

where

  • elasticsearchnode is the address of the node where the plugin is installed
  • the path query parameter is the path to repository.tar.gz

Check the Elasticsearch node's logs to see when the import finishes. It usually takes around 2 minutes.
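For anyone who wants to script the log check, here is a rough sketch. It assumes the log line formats quoted later in this thread (an "end of import" line and ERROR lines from KnapsackImportAction). The function names are made up for illustration, and the log file location depends on your installation.

```shell
# Sketch only: helpers for checking a Knapsack import in an Elasticsearch log.
# The patterns are based on the log lines quoted later in this thread.

# Succeeds if the log contains Knapsack's "end of import" line.
# $1: path to the Elasticsearch log file (location varies by installation)
import_finished() {
  grep -q 'KnapsackImportAction.*end of import' "$1"
}

# Prints any ERROR lines logged by the import action.
# $1: path to the Elasticsearch log file
import_errors() {
  grep 'ERROR.*KnapsackImportAction' "$1"
}
```

You would call these as `import_finished /path/to/elasticsearch.log && echo done`, then run `import_errors` on the same file. An "end of import" line alone does not prove the import was clean, so it is worth looking at both.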

Hi Vitalii,

Thanks for the quick reply.

I have repository.tar.gz in the current working directory. I ran the command, but it failed:

curl -XPOST "http://lxv-icgc-elastic01:9200/_import?path=repository.tar.gz"
{"error":"InvalidIndexNameException[[_import] Invalid index name [_import], must not start with '_']","status":400}

I removed “_” from “_import” and reran the command:

curl -XPOST "http://lxv-icgc-elastic01:9200/import?path=/repository.tar.gz"

This time it didn’t fail. In the Elasticsearch log, I see this message:

[2016-09-02 07:53:57,279][INFO ][cluster.metadata ] [Krystalin] [import] creating index, cause [api], shards [5]/[1], mappings []

There is no completion message. The following command shows the index:

curl lxv-icgc-elastic02:9200/_cat/indices
green open .marvel-2016.09.02  1 1    73111 0 289.6mb 144.1mb 
green open .marvel-2016.08.31  1 1   102301 0 435.7mb 217.8mb 
green open .marvel-2016.09.01  1 1   121452 0 488.4mb 244.2mb 
green open .marvel-2016.08.29  1 1    10277 0    46mb    23mb 
green open icgc22-13          15 1 55977215 0    28gb    14gb 
green open .marvel-2016.08.30  1 1    86640 0 373.7mb 186.8mb 
green open icgc21-0-0          1 1        0 0    230b    115b 
green open import              5 1        0 0    970b    575b 
green open terms-lookup        1 4        0 0    575b    115b 

It looks like the “import” index is still empty after more than 10 minutes. I guess “import” is the wrong name for the index. What name should I use?

Answering my own question: “_import” is a command, so changing it to “import” is wrong.

The issue was probably caused by a non-functional Knapsack plugin. I used the following commands to remove it and add it back:

sudo /usr/share/elasticsearch/bin/plugin -remove knapsack
sudo /usr/share/elasticsearch/bin/plugin -url http://bit.ly/29A1hsz -install knapsack
sudo service elasticsearch restart
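Before retrying the import, it may help to confirm that the node actually loaded the plugin. The InvalidIndexNameException above suggests the _import endpoint was not registered, so Elasticsearch treated "_import" as an index name. A minimal sketch, assuming your Elasticsearch version provides the _cat/plugins endpoint and that the plugin is listed under the name "knapsack"; the helper name is made up:

```shell
# Sketch only: succeeds if the named plugin appears in `_cat/plugins` output
# read from stdin. The exact column layout is not relied upon.
# $1: plugin name, e.g. knapsack
has_plugin() {
  grep -qw -- "$1"
}

# Usage (hostname is a placeholder):
#   curl -s 'http://elasticsearchnode:9200/_cat/plugins' | has_plugin knapsack && echo "knapsack loaded"
```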

Then I used this command to import:

curl -XPOST "http://lxv-icgc-elastic01:9200/_import?path=/tmp/repository.tar.gz"

The import started, but it appears to have hit an error:

[2016-09-02 17:25:50,872][INFO ][KnapsackImportAction ] resetting refresh rate for index icgc-repository-20160830
[2016-09-02 17:25:50,872][ERROR][KnapsackImportAction ] null
java.lang.NumberFormatException: null
at java.lang.Integer.parseInt(Integer.java:542)
at java.lang.Integer.parseInt(Integer.java:615)
at org.xbib.elasticsearch.action.knapsack.imp.TransportKnapsackImportAction.performImport(TransportKnapsackImportAction.java:245)
at org.xbib.elasticsearch.action.knapsack.imp.TransportKnapsackImportAction$1.run(TransportKnapsackImportAction.java:115)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2016-09-02 17:26:01,598][INFO ][BulkNodeClient ] closing bulk processor...
[2016-09-02 17:26:01,599][INFO ][BulkNodeClient ] shutting down...
[2016-09-02 17:26:01,599][INFO ][BulkNodeClient ] shutting down completed
[2016-09-02 17:26:01,600][INFO ][KnapsackImportAction ] end of import: {"mode":"import","started":"2016-09-03T00:24:55.602Z","path":"file:///tmp/repository.tar.gz","node_name":"Typeface"}, count = 415529
[2016-09-02 17:26:01,622][INFO ][KnapsackService ] remove: plugin.knapsack.import.state -> [{"mode":"import","started":"2016-09-03T00:24:55.602Z","path":"file:///tmp/repository.tar.gz","node_name":"Typeface"}]
[2016-09-02 17:26:01,623][INFO ][KnapsackService ] update cluster settings: plugin.knapsack.import.state -> []

When I visit the “data repository” page in a browser, I get this kind of error message on a lot of the Elasticsearch nodes:

[2016-09-03 06:53:15,959][DEBUG][action.search.type       ] [Tag] [icgc-repository][3], node[bzl_ovkhQuq3-67T0k87Ug], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@4e8e6974]
org.elasticsearch.transport.RemoteTransportException: [Jolt][inet[/10.103.131.26:9300]][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.search.aggregations.AggregationExecutionException: [nested] nested path [file_copies] is not nested
        at org.elasticsearch.search.aggregations.bucket.nested.NestedAggregator.<init>(NestedAggregator.java:71)
        at org.elasticsearch.search.aggregations.bucket.nested.NestedAggregator$Factory.create(NestedAggregator.java:185)
        at org.elasticsearch.search.aggregations.AggregatorFactories.createAndRegisterContextAware(AggregatorFactories.java:53)
        at org.elasticsearch.search.aggregations.AggregatorFactories.createSubAggregators(AggregatorFactories.java:71)
        at org.elasticsearch.search.aggregations.Aggregator.<init>(Aggregator.java:191)
        at org.elasticsearch.search.aggregations.bucket.BucketsAggregator.<init>(BucketsAggregator.java:39)
        at org.elasticsearch.search.aggregations.bucket.terms.TermsAggregator.<init>(TermsAggregator.java:135)
        at org.elasticsearch.search.aggregations.bucket.terms.AbstractStringTermsAggregator.<init>(AbstractStringTermsAggregator.java:37)
        at org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator.<init>(GlobalOrdinalsStringTermsAggregator.java:73)
        at org.elasticsearch.search.aggregations.bucket.terms.TermsAggregatorFactory$ExecutionMode$2.create(TermsAggregatorFactory.java:60)

It appears the index is not imported correctly. Below are the indexes in elasticsearch:

curl lxv-icgc-elastic02:9200/_cat/indices
green open .marvel-2016.09.02  1 1   125802 0 467.9mb 233.9mb 
green open .marvel-2016.08.31  1 1   102301 0 435.7mb 217.8mb 
green open .marvel-2016.09.01  1 1   121452 0 488.4mb 244.2mb 
green open .marvel-2016.08.29  1 1    10277 0    46mb    23mb 
green open icgc22-13          15 1 55977215 0    28gb    14gb 
green open .marvel-2016.08.30  1 1    86640 0 373.7mb 186.8mb 
green open .marvel-2016.09.03  1 1    91314 0 323.1mb 161.6mb 
green open icgc21-0-0          1 1        0 0    230b    115b 
green open icgc-repository     5 1   415524 0 984.8mb 492.4mb 
green open terms-lookup        1 4        0 0    467b    115b

Please note that I renamed icgc-repository-2016-* to icgc-repository, which seems to be what the code looks for.

I am wondering whether this is related to the error message at the end of importing.

Hi Brady,

You should import the repository index using the following steps:

  • Install the Knapsack plugin and restart the ES node (which you have already done).
  • Start the import process from the same node where the plugin is installed. The index archive should be on the same node.
  • After the import is finished, create an icgc-repository alias for the imported index. You can do this with the following command:
curl -XPOST 'http://localhost:9200/_aliases' -d'
{
  "actions": [
    {
      "add": {
        "index": "icgc-repository-20160908",
        "alias": "icgc-repository"
      }
    }
  ]
}'
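The name of the imported index depends on the archive. The import log earlier in this thread shows icgc-repository-20160830, for example, so check the real name with _cat/indices first. A small sketch that builds the same request body for any index name; the helper name is made up for illustration:

```shell
# Sketch only: print the _aliases request body for a given index and alias.
# $1: name of the imported index (check it with _cat/indices)
# $2: alias name the portal should see, e.g. icgc-repository
alias_actions() {
  printf '{"actions":[{"add":{"index":"%s","alias":"%s"}}]}\n' "$1" "$2"
}

# Usage (index name is an example):
#   curl -XPOST 'http://localhost:9200/_aliases' -d "$(alias_actions icgc-repository-20160830 icgc-repository)"
#   curl 'http://localhost:9200/_cat/aliases?v'   # confirm the alias exists
```

An alias leaves the imported index and its mappings untouched, so the portal can keep looking for icgc-repository while the dated index stays as it was imported.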

P.S. We are going to add functionality to import this file with the import tool.

Hi Vitalii,

Thanks for the detailed steps. I ended up using “repoIndexName” in the elasticsearch section of application.yml before seeing your reply. Using an alias is better than hard-coding the name in application.yml. I will use it next time.

Brady

We recently added support to import the repository.tar.gz Elasticsearch index archive with the dcc-download-import tool.

  1. Download the tool:
wget https://artifacts.oicr.on.ca/artifactory/dcc-release/org/icgc/dcc/dcc-download-import/[RELEASE]/dcc-download-import-[RELEASE].jar -O dcc-download-import.jar
  2. Import the archive:
java -jar dcc-download-import.jar -i repository.tar.gz -es es://localhost:9300