Wikidata Import 2024-01-20

Import

state  ✅
url  https://wiki.bitplan.com/index.php/Wikidata_Import_2024-01-20
target  QLever
start  2024-01-20
end  2024-01-21
days  0.5
os  Ubuntu 22.04.2 LTS
cpu  Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
ram  256
triples  19.1
comment  


Docker

docker pull adfreiburg/qlever
Using default tag: latest
latest: Pulling from adfreiburg/qlever
29202e855b20: Pull complete 
94ca9f61181f: Pull complete 
367bd497f93c: Pull complete 
5d9353f3c7b1: Pull complete 
fa1b81522802: Pull complete 
70be4539455c: Pull complete 
d3c042ca662a: Pull complete 
Digest: sha256:19106e3606851a1b0a3ca736aa6c3a5246ce48913dbc2a434c4a821c6d0af492
Status: Downloaded newer image for adfreiburg/qlever:latest
docker.io/adfreiburg/qlever:latest
docker pull adfreiburg/qlever-ui
Using default tag: latest
latest: Pulling from adfreiburg/qlever-ui
59bf1c3509f3: Already exists 
07a400e93df3: Already exists 
64052ee245ef: Already exists 
a44d093ad4a5: Already exists 
0381087ee065: Already exists 
91c88323734b: Pull complete 
fdcee6d0309d: Pull complete 
e6b2715c1d5d: Pull complete 
b9c9f00cb678: Pull complete 
3f12ea50b177: Pull complete 
Digest: sha256:7f4b358d6a127e512979074de0c6e84f250a37bca46c494d8e04a62844716e48
Status: Downloaded newer image for adfreiburg/qlever-ui:latest
docker.io/adfreiburg/qlever-ui:latest

QLever control

https://github.com/ad-freiburg/qlever-control

git clone https://github.com/ad-freiburg/qlever-control.git
Cloning into 'qlever-control'...
remote: Enumerating objects: 1107, done.
remote: Counting objects: 100% (865/865), done.
remote: Compressing objects: 100% (422/422), done.
remote: Total 1107 (delta 392), reused 781 (delta 374), pack-reused 242
Receiving objects: 100% (1107/1107), 242.20 KiB | 8.97 MiB/s, done.
Resolving deltas: 100% (506/506), done.
cd qlever-control/
git checkout python-qlever
Branch 'python-qlever' set up to track remote branch 'python-qlever' from 'origin'.
Switched to a new branch 'python-qlever'
pip install .
Defaulting to user installation because normal site-packages is not writeable
Processing /home/wf/source/python/qlever-control
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: UNKNOWN
  Building wheel for UNKNOWN (pyproject.toml) ... done
  Created wheel for UNKNOWN: filename=UNKNOWN-0.0.0-py3-none-any.whl size=5111 sha256=1f79a64c282f532143b3ea16551e4698077fd186ce36aa5c306bf6d08822ca26
  Stored in directory: /home/wf/.cache/pip/wheels/07/95/58/79d49197785a6e837569fd3f894d646428d2e272f53582c762
Successfully built UNKNOWN
Installing collected packages: UNKNOWN
Successfully installed UNKNOWN-0.0.0

dblp warmup test

wf@wikidata:/hd/mantax/qlever/dblp$ qlever setup-config dblp
# ...
wf@wikidata:/hd/mantax/qlever/dblp$ qlever get-data index restart test-query ui 
Action "get-data"

curl -LO -C - https://dblp.org/rdf/dblp.ttl.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1648M  100 1648M    0     0  13.0M      0  0:02:06  0:02:06 --:--:-- 13.1M
Total file size: 1.7 GB

Action "index"

Write value of config variable index.SETTINGS_JSON to file dblp.settings.json
docker run -it --rm -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -w /index --entrypoint bash --name qlever.indexer.dblp adfreiburg/qlever -c 'zcat dblp.ttl.gz | IndexBuilderMain -F ttl -f - -i dblp -s dblp.settings.json --text-words-from-literals | tee dblp.index-log.txt'

2024-01-20 06:31:08.935 - INFO: QLever IndexBuilder, compiled on Fri Jan 19 21:22:46 UTC 2024 using git hash 95fc20
2024-01-20 06:31:08.936 - INFO: You specified the input format: TTL
2024-01-20 06:31:08.936 - INFO: Processing input triples from /dev/stdin ...
2024-01-20 06:31:08.936 - INFO: Locale was not specified in settings file, default is en_US
2024-01-20 06:31:08.936 - INFO: You specified "locale = en_US" and "ignore-punctuation = 0"
2024-01-20 06:31:08.936 - INFO: You specified "parallel-parsing = true", which enables faster parsing for TTL files that don't include multiline literals with unescaped newline characters and that have newline characters after the end of triples.
2024-01-20 06:31:08.936 - INFO: You specified "num-triples-per-batch = 1,000,000", choose a lower value if the index builder runs out of memory
2024-01-20 06:31:08.936 - INFO: Integers that cannot be represented by QLever will throw an exception (this is the default behavior)
2024-01-20 06:32:09.940 - INFO: Input triples processed: 100,000,000
2024-01-20 06:33:11.794 - INFO: Input triples processed: 200,000,000
2024-01-20 06:34:10.230 - INFO: Input triples processed: 300,000,000
2024-01-20 06:34:59.558 - INFO: Done, total number of triples read: 383,101,871 [may contain duplicates]
2024-01-20 06:34:59.558 - INFO: Number of QLever-internal triples created: 383,102,213 [may contain duplicates]
2024-01-20 06:34:59.558 - INFO: Merging partial vocabularies in byte order (internal only) ...
2024-01-20 06:36:12.093 - INFO: Words merged: 100,000,000
2024-01-20 06:36:12.646 - INFO: Number of words in internal vocabulary: 103,056,659
2024-01-20 06:36:12.646 - INFO: Building prefix tree from internal vocabulary ...
2024-01-20 06:36:41.754 - INFO: Words processed: 100,000,000
2024-01-20 06:36:42.106 - INFO: Computing maximally compressing prefixes (greedy algorithm) ...
2024-01-20 06:38:01.075 - INFO: Reduction of size of internal vocabulary: 29%
2024-01-20 06:38:03.758 - INFO: Merging partial vocabularies in Unicode order (internal and external) ...
2024-01-20 06:39:32.850 - INFO: Words merged: 100,000,000
2024-01-20 06:39:34.292 - INFO: Number of words in external vocabulary: 0
2024-01-20 06:39:34.292 - INFO: Removing temporary files ...
2024-01-20 06:39:36.775 - INFO: Converting external vocabulary to binary format ...
2024-01-20 06:39:36.775 - INFO: Converting triples from local IDs to global IDs ...
2024-01-20 06:39:44.002 - INFO: Triples converted: 100,000,000
2024-01-20 06:39:50.239 - INFO: Triples converted: 200,000,000
2024-01-20 06:39:56.800 - INFO: Triples converted: 300,000,000
2024-01-20 06:40:02.612 - INFO: Done, total number of triples converted: 383,102,213
2024-01-20 06:40:02.828 - INFO: Writing compressed vocabulary to disk ...
2024-01-20 06:40:35.135 - INFO: Creating a pair of index permutations ...
2024-01-20 06:40:49.488 - INFO: Triples processed: 100,000,000
2024-01-20 06:41:02.700 - INFO: Triples processed: 200,000,000
2024-01-20 06:41:18.308 - INFO: Triples processed: 300,000,000
2024-01-20 06:41:30.384 - INFO: Number of unique elements: 383,101,422
2024-01-20 06:41:31.919 - INFO: Statistics for SPO: #relations = 51,248,218, #blocks = 8,175, #triples = 383,101,422
2024-01-20 06:41:31.919 - INFO: Statistics for SOP: #relations = 51,248,218, #blocks = 8,175, #triples = 383,101,422
2024-01-20 06:41:31.920 - INFO: Writing meta data for SPO and SOP ...
2024-01-20 06:41:32.252 - INFO: Number of distinct patterns: 808
2024-01-20 06:41:32.252 - INFO: Number of subjects with pattern: 51,248,049 [all]
2024-01-20 06:41:32.252 - INFO: Total number of distinct subject-predicate pairs: 319,963,105
2024-01-20 06:41:32.252 - INFO: Average number of predicates per subject: 6.2
2024-01-20 06:41:32.252 - INFO: Average number of subjects per predicate: 4,705,340
2024-01-20 06:41:32.302 - INFO: Creating a pair of index permutations ...
2024-01-20 06:42:04.079 - INFO: Triples processed: 100,000,000
2024-01-20 06:42:24.519 - INFO: Triples processed: 200,000,000
2024-01-20 06:42:40.771 - INFO: Triples processed: 300,000,000
2024-01-20 06:43:05.492 - INFO: Statistics for OSP: #relations = 103,020,074, #blocks = 9,768, #triples = 383,101,422
2024-01-20 06:43:05.492 - INFO: Statistics for OPS: #relations = 103,020,074, #blocks = 9,768, #triples = 383,101,422
2024-01-20 06:43:05.492 - INFO: Writing meta data for OSP and OPS ...
2024-01-20 06:43:05.971 - INFO: Adding 51,248,049triples to the POS and PSO permutation for `ql:has-pattern` ...
2024-01-20 06:43:10.427 - INFO: Creating a pair of index permutations ...
2024-01-20 06:43:25.420 - INFO: Triples processed: 100,000,000
2024-01-20 06:43:38.544 - INFO: Triples processed: 200,000,000
2024-01-20 06:43:52.415 - INFO: Triples processed: 300,000,000
2024-01-20 06:44:08.522 - INFO: Triples processed: 400,000,000
2024-01-20 06:44:12.925 - INFO: Statistics for PSO: #relations = 73, #blocks = 13,933, #triples = 434,349,471
2024-01-20 06:44:12.925 - INFO: Statistics for POS: #relations = 73, #blocks = 13,933, #triples = 434,349,471
2024-01-20 06:44:12.925 - INFO: Writing meta data for PSO and POS ...
2024-01-20 06:44:13.158 - INFO: Index build completed
2024-01-20 06:44:13.329 - INFO: 
2024-01-20 06:44:13.329 - INFO: Adding text index ...
2024-01-20 06:44:13.329 - INFO: Considering each literal as a text record
2024-01-20 06:44:13.331 - INFO: The git hash used to build this index was 95fc20
2024-01-20 06:44:13.332 - INFO: Reading vocabulary from file dblp.vocabulary.internal ...
2024-01-20 06:44:15.709 - INFO: Done, number of words: 103,056,659
2024-01-20 06:44:15.709 - INFO: Number of words in external vocabulary: 0
2024-01-20 06:44:15.709 - INFO: Building text vocabulary ...
2024-01-20 06:45:01.479 - INFO: Writing vocabulary to file dblp.text.vocabulary ...
2024-01-20 06:45:01.567 - INFO: Done, number of words: 10,600,142
2024-01-20 06:45:01.642 - INFO: Building the half-inverted index lists ...
2024-01-20 06:54:07.233 - INFO: Statistics for text index: #words = 255,624,870, #blocks = 251,607
2024-01-20 06:54:07.256 - INFO: Text index build completed
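The timestamps in the index log give the wall-clock time of a build. A minimal sketch for reading it off a log file, assuming GNU date (as on Ubuntu) and log lines that start with "YYYY-MM-DD HH:MM:SS"; the function name log_duration_seconds is made up for illustration and is not part of qlever-control:

```shell
# Seconds between the first and the last timestamped line of a QLever log.
# The function body is a subshell, so the helper variables do not leak.
log_duration_seconds() (
  ts_re='^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}'
  # Keep only the leading timestamp of each matching line.
  first=$(grep -oE "$ts_re" "$1" | head -n 1)
  last=$(grep -oE "$ts_re" "$1" | tail -n 1)
  # Convert both to epoch seconds (UTC, so no DST effects) and subtract.
  echo $(( $(date -u -d "$last" +%s) - $(date -u -d "$first" +%s) ))
)

# Usage with the log file written by the index action:
# log_duration_seconds dblp.index-log.txt
```

For the dblp log above (06:31:08 to 06:54:07) this gives 1379 seconds, about 23 minutes. For the wikidata run further down, from 17:21:21 to 07:05:27 the next day, it gives 49446 seconds, roughly 13 h 44 min or 0.57 days.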

Action "restart"

Stop running server if found, then start new server

Checking for process matching "ServerMain.* -i [^ ]*dblp" and for Docker container with name "qlever.server.dblp"

Found process 4232 from user wf with command line: ServerMain -i dblp -j 8 -p 7015 -m 20 -c 5 -e 1 -k 100 -a dblp_110931226 -t

Killed process 4232

docker run -d --restart=unless-stopped -u $(id -u):$(id -g) -it -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -p 7015:7015 -w /index --entrypoint bash --name qlever.server.dblp adfreiburg/qlever -c 'ServerMain -i dblp -j 8 -p 7015 -m 30G -c 5G -e 1G -k 100 -a dblp_7643543846 -t > dblp.server-log.txt 2>&1'

Follow dblp.server-log.txt until the server is ready (Ctrl-C stops following the log, but not the server)

2024-01-20 06:54:09.324 - INFO: QLever Server, compiled on Fri Jan 19 21:22:46 UTC 2024 using git hash 95fc20
2024-01-20 06:54:09.326 - INFO: Initializing server ...
2024-01-20 06:54:09.326 - INFO: The git hash used to build this index was 95fc20
2024-01-20 06:54:09.327 - INFO: Reading vocabulary from file dblp.vocabulary.internal ...
2024-01-20 06:54:11.783 - INFO: Done, number of words: 103,056,659
2024-01-20 06:54:11.783 - INFO: Number of words in external vocabulary: 0
2024-01-20 06:54:11.786 - INFO: Registered PSO permutation: #relations = 73, #blocks = 13,933, #triples = 434,349,471
2024-01-20 06:54:11.788 - INFO: Registered POS permutation: #relations = 73, #blocks = 13,933, #triples = 434,349,471
2024-01-20 06:54:11.790 - INFO: Registered OPS permutation: #relations = 103,020,074, #blocks = 9,768, #triples = 383,101,422
2024-01-20 06:54:11.791 - INFO: Registered OSP permutation: #relations = 103,020,074, #blocks = 9,768, #triples = 383,101,422
2024-01-20 06:54:11.792 - INFO: Registered SPO permutation: #relations = 51,248,218, #blocks = 8,175, #triples = 383,101,422
2024-01-20 06:54:11.793 - INFO: Registered SOP permutation: #relations = 51,248,218, #blocks = 8,175, #triples = 383,101,422
2024-01-20 06:54:11.793 - INFO: Reading patterns from file dblp.index.patterns ...
2024-01-20 06:54:11.793 - INFO: Reading vocabulary from file dblp.text.vocabulary ...
2024-01-20 06:54:11.905 - INFO: Done, number of words: 10,600,142
2024-01-20 06:54:11.905 - INFO: Reading metadata from file dblp.text.index ...
2024-01-20 06:54:11.919 - INFO: Registered text index: #words = 255,624,870, #blocks = 251,607
2024-01-20 06:54:11.920 - INFO: Sorting random result tables to estimate the sorting performance of this machine ...
2024-01-20 06:54:12.657 - INFO: Access token for restricted API calls is "dblp_7643543846"
2024-01-20 06:54:12.657 - INFO: The server is ready, listening for requests on port 7015 ...
2024-01-20 06:54:13.398 - INFO: 
2024-01-20 06:54:13.398 - INFO: Request received via GET, no content type specified
2024-01-20 06:54:13.398 - INFO: Alive check with message "from the qlever script"
2024-01-20 06:54:13.415 - INFO: 
2024-01-20 06:54:13.415 - INFO: Request received via GET, no content type specified
2024-01-20 06:54:13.415 - INFO: Setting index description to: "DBLP computer science bibliography, data from https://dblp.org/rdf/dblp.ttl.gz"
2024-01-20 06:54:13.431 - INFO: 
2024-01-20 06:54:13.431 - INFO: Request received via GET, no content type specified
2024-01-20 06:54:13.431 - INFO: Setting text description to: "All literals, search with FILTER KEYWORDS(?text, ...)"

Action "test-query"

curl -s http://localhost:7015 -H "Accept: text/tab-separated-values" -H "Content-Type: application/sparql-query" --data "SELECT * WHERE { ?s ?p ?o } LIMIT 10"

?s	?p	?o
<https://dblp.org/rec/books/acm/19/X19>	<https://dblp.org/rdf/schema#numberOfCreators>	0
<https://dblp.org/rec/books/acm/19/X19a>	<https://dblp.org/rdf/schema#numberOfCreators>	0
<https://dblp.org/rec/books/acm/19/X19b>	<https://dblp.org/rdf/schema#numberOfCreators>	0
<https://dblp.org/rec/books/acm/19/X19c>	<https://dblp.org/rdf/schema#numberOfCreators>	0
<https://dblp.org/rec/books/acm/19/X19d>	<https://dblp.org/rdf/schema#numberOfCreators>	0
<https://dblp.org/rec/books/acm/19/X19e>	<https://dblp.org/rdf/schema#numberOfCreators>	0
<https://dblp.org/rec/books/acm/19/X19f>	<https://dblp.org/rdf/schema#numberOfCreators>	0
<https://dblp.org/rec/books/acm/19/X19g>	<https://dblp.org/rdf/schema#numberOfCreators>	0
<https://dblp.org/rec/books/acm/19/X19h>	<https://dblp.org/rdf/schema#numberOfCreators>	0
<https://dblp.org/rec/books/acm/22/X22>	<https://dblp.org/rdf/schema#numberOfCreators>	0

Action "ui"

docker rm -f qlever-ui
docker pull adfreiburg/qlever-ui
docker run -d -p 7000:7000 --name qlever-ui adfreiburg/qlever-ui 
docker exec -it qlever-ui bash -c "python manage.py configure dblp http://wikidata:7015"

Error response from daemon: No such container: qlever-ui
docker: Error response from daemon: driver failed programming external connectivity on endpoint qlever-ui (1091f6b275a72fe1f875445a90cccba525578bb363fce560d3b2aae1bc380213): Bind for 0.0.0.0:7000 failed: port is already allocated.
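The two errors look independent: the first presumably comes from docker rm -f not finding a container named qlever-ui, while the bind error says that something else already holds host port 7000. A minimal sketch for checking whether another container publishes that port. The helper name containers_on_port is hypothetical, and it only parses the Ports column of docker ps, so a non-Docker process listening on the port will not show up:

```shell
# Print the names of containers whose published ports include the given
# host port. Reads "name|ports" lines on stdin, as produced by:
#   docker ps --format '{{.Names}}|{{.Ports}}'
containers_on_port() {
  # A published port appears as e.g. "0.0.0.0:7000->7000/tcp", so search
  # for the literal ":<port>->" in the ports column.
  awk -F '|' -v needle=":$1->" 'index($2, needle) { print $1 }'
}

# Usage (needs a running Docker daemon):
# docker ps --format '{{.Names}}|{{.Ports}}' | containers_on_port 7000
```

Any container name printed is the one to stop or remove before the ui action can bind port 7000 again.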

wikidata

wf@wikidata:/hd/mantax/qlever/wikidata.2024$ qlever setup-config wikidata

Action "setup-config"

Created Qleverfile for config "wikidata" in current directory

You can now execute a sequence of actions, for example:

qlever get-data index restart test-query ui 

Available action names are: autocompletion-warmup, cache-stats-and-settings, clear-cache, clear-cache-complete, example-queries, get-data, index, index-stats, log, memory-profile, memory-profile-show, remove-index, restart, setup-config, show-config, start, status, stop, test-query, ui

To get autocompletion for these, run the following or add it to your `.bashrc`:

eval "$(qlever setup-autocompletion)"

wf@wikidata:/hd/mantax/qlever/wikidata.2024$ nohup qlever get-data index &

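index failure

The index action failed with "input is not a tty": the docker run command generated by qlever uses -it, but a job started in the background with nohup has no terminal attached.

Workaround: remove -it and run the container detached (-d):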

docker run -d --rm -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -w /index --entrypoint bash --name qlever.indexer.wikidata adfreiburg/qlever -c 'ulimit -Sn 1048576; bzcat latest-lexemes.ttl.bz2 latest-all.ttl.bz2 | IndexBuilderMain -F ttl -f - -i wikidata -s wikidata.settings.json --stxxl-memory 10g | tee wikidata.index-log.txt'
2024-01-20 17:21:21.049 - INFO: QLever IndexBuilder, compiled on Fri Jan 19 21:22:46 UTC 2024 using git hash 95fc20
2024-01-20 17:21:21.049 - INFO: You specified the input format: TTL
2024-01-20 17:21:21.049 - INFO: Processing input triples from /dev/stdin ...
2024-01-20 17:21:21.049 - INFO: You specified "locale = en_US" and "ignore-punctuation = 1"
2024-01-20 17:21:21.050 - INFO: You specified "parallel-parsing = true", which enables faster parsing for TTL files that don't include multiline literals with unescaped newline characters and that have newline characters after the end of triples.
2024-01-20 17:21:21.050 - INFO: You specified "num-triples-per-batch = 5,000,000", choose a lower value if the index builder runs out of memory
2024-01-20 17:21:21.050 - INFO: Integers that cannot be represented by QLever will throw an exception (this is the default behavior)
2024-01-20 17:23:30.828 - INFO: Input triples processed: 100,000,000
2024-01-20 17:25:31.249 - INFO: Input triples processed: 200,000,000
2024-01-20 17:27:19.354 - INFO: Input triples processed: 300,000,000
...
2024-01-21 07:04:39.765 - INFO: Triples processed: 27,900,000,000
2024-01-21 07:04:56.640 - INFO: Triples processed: 28,000,000,000
2024-01-21 07:05:12.927 - INFO: Triples processed: 28,100,000,000
2024-01-21 07:05:17.949 - INFO: Statistics for PSO: #relations = 69,508, #blocks = 905,604, #triples = 28,133,360,586
2024-01-21 07:05:17.949 - INFO: Statistics for POS: #relations = 69,508, #blocks = 905,604, #triples = 28,133,360,586
2024-01-21 07:05:17.949 - INFO: Writing meta data for PSO and POS ...
2024-01-21 07:05:27.403 - INFO: Index build completed
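The "input is not a tty" failure above can also be avoided in a wrapper script by passing -it only when a terminal is actually attached. A minimal sketch of that guard; the function name docker_tty_flags is hypothetical and not part of qlever-control:

```shell
# Print "-it" only if stdin is a terminal. Under nohup with "&", stdin is
# not a terminal, so nothing is printed and docker gets no TTY flags.
docker_tty_flags() {
  if [ -t 0 ]; then
    printf '%s' '-it'
  fi
}

# Usage: put the substitution where -it would normally go, and keep the
# remaining arguments as in the index command above:
# docker run $(docker_tty_flags) --rm --name qlever.indexer.wikidata adfreiburg/qlever
```

The check uses stdin only, because stdout is always a pipe inside a command substitution.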