FAQ

Also relevant is the Troubleshooting page, which covers some (blocking) difficulties that can occur while launching.

Note

Please include the version in any bug reports or feature requests. The version number should look something like v0.4.1. It can be found at http://[url]:8421/docs or in the name of the downloaded experiment file (available at http://[url]:8421/download, with a filename like exp-2021-05-20T07:31-salmon-v0.4.1.rdb).
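If you are filing a report from a downloaded experiment file, the version can be read off the filename. Here is a minimal sketch, assuming filenames follow the pattern in the example above. The helper name version_from_filename is hypothetical and not part of Salmon.

```python
import re
from typing import Optional

# Assumed filename pattern, based on the single example in the text:
#   exp-<timestamp>-salmon-v<major>.<minor>.<patch>.rdb
_VERSION_RE = re.compile(r"salmon-(v\d+\.\d+\.\d+)")


def version_from_filename(filename: str) -> Optional[str]:
    """Return the version string (e.g. 'v0.4.1'), or None if none is found."""
    match = _VERSION_RE.search(filename)
    return match.group(1) if match else None


print(version_from_filename("exp-2021-05-20T07:31-salmon-v0.4.1.rdb"))
```

If the helper returns None, fall back to the version shown at http://[url]:8421/docs.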

How do I cite Salmon?

See the Users section for the specific BibTeX citation for the relevant paper, “Efficiently Learning Relative Similarity Embeddings with Crowdsourcing”.

When should I use random/active sampling?

Rule of thumb:

Use random sampling for simple problems when not many responses are required (small number of targets and clean responses).
Use active sampling for anything more complicated (large number of targets or noisy responses) when the crowdsourcing budget and/or embedding quality are relevant.

Specifics are in “How many responses will be needed?” and “What active samplers are recommended?”. Random sampling can produce good embeddings, but it requires about 3× the number of responses that active sampling requires.

By default, Salmon uses random sampling. This is the simplest sampler, and it doesn’t require any user configuration. Tips on how to use active samplers are in “What active samplers are recommended?”.

How many responses will be needed?

It depends on the targets used and on how humans respond. Let’s say there are \(n\) targets that are being embedded into \(d\) dimensions. At best, we can provide bounds on how many responses you’ll need:

Lower bound: at least \(nd\log_2(n)\) responses are needed for a perfect embedding with noiseless responses. Active triplet algorithms require \(\Omega(nd\log_2(n))\) responses (that is, \(nd\log_2(n)\) up to a constant factor). 1
Upper bound: if responses are collected with random sampling, a high-quality embedding will likely be generated with \(O(nd\log_2(n))\) responses, likely \(20 nd \log_2(n)\) responses (or possibly \(10 nd \log_2(n)\)). 2
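The two bounds can be evaluated for a planned experiment. This is a rough sketch of the arithmetic only. The function name response_bounds and the default constant of 20 are illustrative choices taken from the figures quoted above, not part of Salmon.

```python
import math
from typing import Tuple


def response_bounds(n: int, d: int, upper_constant: float = 20.0) -> Tuple[float, float]:
    """Return (lower, upper) estimates for the number of responses.

    lower = n * d * log2(n)          (noiseless responses, perfect embedding)
    upper = upper_constant * lower   (random sampling; the text quotes 10 to 20)
    """
    if n < 2 or d < 1:
        raise ValueError("need n >= 2 targets and d >= 1 dimensions")
    lower = n * d * math.log2(n)
    return lower, upper_constant * lower


lower, upper = response_bounds(n=30, d=2)
print(f"lower ~ {lower:.0f} responses, upper ~ {upper:.0f} responses")
```

Treat the results as order-of-magnitude guidance: the true number depends on the targets and on how noisy the human responses are.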

This suggests that the number of responses required scales like \(nd\log_2(n)\) as \(n\) and \(d\) change. For example, if an embedding requires 5,000 responses for \(n=30\) and \(d=2\), scaling to \(n=40\) with \(d=1\) would likely require about \(3600 \approx 5000\frac{40 \cdot 1 \cdot \log_2(40)}{30\cdot 2 \cdot \log_2(30)}\) responses for the same dataset.
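The scaling rule can be applied to other values of \(n\) and \(d\). The sketch below reproduces the worked example, where the denominator of the formula implies the original embedding used \(d=2\). The function name scale_responses is hypothetical.

```python
import math


def scale_responses(known: float, n_old: int, d_old: int, n_new: int, d_new: int) -> float:
    """Scale a known response count, assuming the n * d * log2(n) rule of thumb."""
    old = n_old * d_old * math.log2(n_old)
    new = n_new * d_new * math.log2(n_new)
    return known * new / old


# Worked example from the text: 5,000 responses at n=30, d=2, moving to n=40, d=1.
estimate = scale_responses(5000, n_old=30, d_old=2, n_new=40, d_new=1)
print(round(estimate, -2))  # about 3600
```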

See the benchmarks on active sampling for reference points on the number of responses required for particular datasets. If you think your dataset will require too many responses, see our recommendations on active samplers. Active samplers might be able to generate better embeddings with a fixed number of responses.

What active samplers are recommended?

ARR is the most recommended sampler. See the benchmarks on active sampling for an example configuration and the number of responses required for that usage. The defaults of ARR have been explored pretty thoroughly. In the benchmarks on active sampling, we used the default parameters for ARR after exploring the possible values.

Monitoring performance is difficult with active/adaptive algorithms; random sampling is much better suited to it. Typically, between 10% and 20% of the queries are sampled randomly to monitor and report performance. That means we’d recommend this partial configuration:

samplers:
  ARR: {}
  Random: {}
sampling:
  probs: {"ARR": 85, "Random": 15}

Can I choose a different machine?

All of our experiments are run with t3.xlarge instances. If you want to choose a different machine, ensure that it has the following:

At least 4GB of RAM
At least 3 CPU cores.

These are required because Salmon requires 3.2GB of memory and Dask has three tasks per adaptive algorithm: posting queries, searching queries, and model updating. Generally, the number of cores should be 3 * n_algs. This isn’t a strict guideline; only 2 out of the 3 tasks take significant amounts of time. Using 2 * n_algs will work with a small performance hit; we recommend at least 4 cores for two algorithms.
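The core-count guideline can be written as a small helper. This sketch is one reading of the guideline: combining the 3-core floor with the 2 * n_algs figure is an assumption, and cores_for is a hypothetical name.

```python
def cores_for(n_algs: int) -> dict:
    """Rule-of-thumb core counts for `n_algs` adaptive algorithms.

    recommended: 3 cores per algorithm (posting, searching, model updating)
    minimum:     2 cores per algorithm, but never below the 3-core floor
    """
    if n_algs < 1:
        raise ValueError("n_algs must be at least 1")
    return {"recommended": 3 * n_algs, "minimum": max(3, 2 * n_algs)}


print(cores_for(2))
```

For two algorithms this gives a minimum of 4 cores, matching the recommendation above.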

How do I ask specific questions?

It’s possible to configure what queries get shown to the crowdsourcing user with two methods:

By using Validation. You can specify a list of queries, or generate n_queries random queries. See Sampler configuration and the Validation sampler section.
By specifying the query key in Sampling’s detail field. This allows showing users specific queries at specific times. e.g., “for the first query the user sees, show them a query with [these objects].”
- Example: Sampling detail
- Example: the documentation for Sampling.

How do I specify when samplers are used?

Controlling when or how often samplers get used is possible with two methods:

By setting the probs key in Sampling. If the YAML specifies sampling.probs: {"ARR": 80, "Random": 20}, the Random sampler will be used 20% of the time.
By setting the sampler field in Sampling’s detail field. This will ensure that the query is generated by a specific sampler (or which sampler will receive the answer if the query is pre-specified in your init.yaml per “How do I ask specific questions?”).

When sampling.probs is specified in your init.yaml, Salmon serves a query to a user with a certain probability. This can pose difficulties if you want to ask exactly \(N\) questions to each crowdsourcing participant. Specifying sampling.details in your init.yaml will work around this and allow configuring the details of particular queries.
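To see why probabilistic selection makes exact per-participant counts hard, consider this small simulation. It only mimics weighted random selection using the example probs values from above. It is an illustration, not Salmon’s actual implementation.

```python
import random
from collections import Counter

# Weights copied from the example `sampling.probs` value in the text.
probs = {"ARR": 80, "Random": 20}


def simulate_participant(n_queries: int, rng: random.Random) -> Counter:
    """Count which sampler serves each of one participant's queries.

    Each query independently picks a sampler with probability proportional
    to its weight in `probs`.
    """
    names = list(probs)
    weights = list(probs.values())
    return Counter(rng.choices(names, weights=weights, k=n_queries))


rng = random.Random(0)
for participant in range(3):
    counts = simulate_participant(50, rng)
    print(participant, dict(counts))
```

The number of Random queries generally differs from participant to participant, so fixed per-participant counts cannot be guaranteed with probabilities alone.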

How do I ask every crowdsourcing user exactly the same questions?

By specifying both the sampler key and the query key in Sampling’s detail field. The answers to the questions “How do I specify when samplers are used?” and “How do I ask specific questions?” are relevant.

An example is in “Sampling detail”, delegated to the Sampler configuration page because the target indexing in “Validation sampler” is relevant.

How do I see the Dask dashboard?

Look at port 8787 if you want more information on how jobs are scheduled. If on EC2, this will require some port forwarding to your own machine:

ssh -i key.pem -L 7787:localhost:8787 ubuntu@[EC2 public DNS or IP]
# visit http://localhost:7787 in the browser to see Salmon's Dask dashboard

If desired, it is possible to open port 8787 on the Amazon EC2 machine. If you do, we recommend allowing only a specific IP address to access that port.

How do I customize the participant unique identifier aka “puid”?

Visiting http://[url]:8421/?puid=foobar will set the participant UID to foobar.
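When generating one link per participant (for example, from a list of crowdsourcing worker IDs), it helps to URL-encode the identifier. This is a minimal sketch. The function name participant_url and the host salmon.example.com are placeholders, not part of Salmon.

```python
from urllib.parse import urlencode


def participant_url(base_url: str, puid: str) -> str:
    """Build a URL for the query page that sets the participant UID."""
    return f"{base_url.rstrip('/')}/?{urlencode({'puid': puid})}"


# "salmon.example.com" is a placeholder host, not a real deployment.
print(participant_url("http://salmon.example.com:8421", "foobar"))
```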

How do I use HTTPS with Salmon?

HTTP is how web servers communicate; HTTPS protects that communication from third parties.

Some crowdsourcing services require HTTPS. There are two ways to provide these crowdsourcing services an HTTPS URL:

(1) Redirect to Salmon from an HTTPS page.
(2) Set up a TLS termination proxy.

Option (1) is a lot easier because various hosting services support HTTPS (e.g., GitHub Pages and GitLab Pages support HTTPS for custom domains). Hosting a redirect HTML page at one of these services with HTTPS will likely satisfy any requirements you may have.
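One way to produce such a redirect page is sketched below. It generates a small HTML file that uses a standard meta refresh tag. The function name redirect_page is hypothetical, and whether a given crowdsourcing service accepts this kind of redirect is an assumption you should verify.

```python
import html


def redirect_page(salmon_url: str) -> str:
    """Return a small HTML page that immediately redirects to `salmon_url`."""
    safe = html.escape(salmon_url, quote=True)
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "  <head>\n"
        '    <meta charset="utf-8">\n'
        f'    <meta http-equiv="refresh" content="0; url={safe}">\n'
        "  </head>\n"
        "  <body>\n"
        f'    <p>Redirecting to <a href="{safe}">the experiment</a>.</p>\n'
        "  </body>\n"
        "</html>\n"
    )


# "[url]" is the same placeholder used throughout this page.
page = redirect_page("http://[url]:8421/")
print(page)
```

Saving the output as index.html and publishing it on an HTTPS-enabled hosting service gives the crowdsourcing service an HTTPS URL to link to.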

Option (2) is more complex. A good overview is at FastAPI’s page “About HTTPS,” available at https://fastapi.tiangolo.com/deployment/https/. This process is beyond the scope of this project. 3

1: “Low-dimensional embedding using adaptively selected ordinal data.” Jamieson and Nowak. 2011. Allerton Conference on Communication, Control, and Computing. https://homes.cs.washington.edu/~jamieson/resources/activeMDS.pdf
2: “Finite sample prediction and recovery bounds for ordinal embedding.” Jain, Jamieson and Nowak. 2016. NeurIPS. https://nowak.ece.wisc.edu/ordinal_embedding.pdf
3: though the package mkcert might help.

The Docker machines aren’t launching

Are you using the command docker-compose up to launch Salmon? The command docker build . doesn’t work.

Salmon requires a Redis Docker machine and certain directories/ports to be available. Technically, it’s possible to build all the Docker machines yourself (but it’s not practical).