Automatic assignment of columns to bindings
Each binding in the data
section of template.yml can be configured for automatic assignment of columns to bindings.
data_type
Simple column assignment by A user can ask the UI to automatically assign columns of their dataset to bindings. By default, the UI will try to assign an unused column (or columns for multiple column type: columns
bindings) to the binding based on the types allowed by its data_type
field. For example:
- name: Color by
dataset: data
key: color
type: column
column: Data::A
data_type:
- string
- number
allows the UI to assign a string or number column to this binding. Columns are assigned to bindings in order, so the first binding (closest to the top of the YAML file) is assigned a column or columns first, then the binding below it second, and so on. When no data_type
is specified it defaults to string
.
Customizing assignment with hints
You may want to allow bindings for only some of the data types. To do that, an assignment
field is needed on your binding:
- name: Color by
dataset: data
key: color
type: column
column: Data::A
data_type:
- string
- number
assignment:
hints:
string: true # This default can be omitted.
number: false # This stops number columns being picked.
Behind the scenes, columns are scored and the highest scoring columns are assigned to a binding. In the basic assignment configurations above every column not already assigned is scored 1 if it is a type we're allowing, and 0 otherwise. There may be more than one column scoring 1, so to tie-break them the column on the left wins.
One thing you may want to do is favor one type over another for a particular binding:
- name: ID
description: Unique ID.
dataset: data
key: id
type: column
data_type:
- string
- number
assignment:
hints:
number:
weight: 1 # The default value.
string:
weight: 0.9
The weight property effectively scores number columns more highly than string columns. The above configuration means that this binding will receive an available number column when there is one, but it can fall back to a string column.
This basic picking can be useful, but you may want to refine the columns picked for a given binding. For example, you may want to pick bindings which contain unique values, or contain uniformly distributed entries. The assignment field allows you to do this with hints. Hints allow you to do enhanced filtering and scoring of columns to increase the chances of an optimal set of assignments for a given dataset.
Let's enhance the ID binding above:
- name: ID
description: Unique ID.
dataset: data
key: id
type: column
data_type:
- number
- string
assignment:
hints:
number:
weight: 1
unique: true
string:
weight: 0.9
unique: true
The configuration above defines an ID binding which can be automatically assigned a number and string column populated with unique values, favoring number columns. The unique hint will score a column 1
when every populated cell is unique, and 0
if there are any repetitions. Hints are declared per type.
Most hints ignore empty cells of a column. This is important for hints like unique
so that they don't see two empty cells as a repeated value. There are two special hints called sparse and dense which do not ignore empty cells, and we can use one to further refine automatic assignment to the ID binding:
data_type:
- string
- number
assignment:
hints:
number:
weight: 1
unique: true
dense: true
string:
weight: 0.9
unique: true
dense: true
Now number columns which are dense (not sparse) can be automatically assigned to the ID binding.
Unlike the unique
hint, the dense
hint scores a column based on the proportion of cells in a column which are populated. For example, when half the cells are empty the dense
score will be 0.5. By default the dense
hint floors anything less than 0.8 (80% of cells populated) to 0. The dense
hint takes optional configuration to move the threshold. To set it to 90%:
data_type:
- string
- number
assignment:
hints:
number:
weight: 1
unique: true
dense:
threshold: 0.9
string:
weight: 0.9
unique: true
dense:
threshold: 0.9
Multiple hints are at work in the above configuration. The overall score of a column is the geometric mean of the score of its hints (not counting the weight
) multiplied by the weight
. One consequence of this is that when any hint results in a score of 0, the column is scored 0 overall (a behaviour which is useful for filtering columns).
Many hints have defaults, which are used when you activate them with true
(as we did with the unique
hint above). Other hints may require configuration.
Bindings are assigned columns in order, and by default each column is assigned once. A consequence of this is that you should place your bindings in descending order of importance to give your most important binding first choice of columns. If you have a binding you'd like to allow to reuse columns which have already been automatically assigned, you can use the can_rebind
option:
assignment:
can_rebind: true
By default this option is false and bindings should omit it.
Multiple-columns bindings take the same assignment configuration, and have additional optional fields:
- name: Color by
dataset: data
key: color
type: columns
columns: Data::B-D,F
data_type:
number
assignment:
max_columns: 10 # defaults to Infinity
min_columns: 5 # defaults to 1
contiguous: false # defaults to false
sort: true # defaults to false
Multiple-column bindings take the same hints as single column bindings. Assignment to multiple-column bindings is like a repeated version of single column assignment. Available columns are scored, and up to max_columns
of the highest scoring are picked. Zero scoring columns are never picked, and then if less than min_columns
would be assigned, none are assigned. Multiple-column bindings can be configured to look for a contiguous series of columns.
More information on the rules which single, multiple, and multiple-contiguous binding assignments follow can be found in the assignment strategies section below.
Assignment strategies
Column assignment follows a set of rules, described in this section. These rules mean that, for a given dataset and template configuration, the automatic assignment of columns is deterministic and predictable.
Single column
Single column bindings are bindings with a column
property.
For each binding, working from top to bottom in the data
section of your template YAML configuration:
- Score available columns based on types and hints (available columns are those unused by previous bindings or all columns when
can_rebind
istrue
). - Rule out columns with a score of 0.
- Pick the highest scoring column:
- When there is one column with a high score, it is assigned.
- When there are two or more with the high score, the left-most column is assigned.
- When there are no columns to pick, assign no column.
Multiple column
Multiple column bindings are bindings with a columns
property.
Per-column
The default multiple-column strategy is like the single column strategy. At least min_columns
and at most max_columns
are picked in the same manner as single column assignment. When there are not at least min_columns
with a non-zero score available, no columns are assigned to the binding.
Contiguous blocks
A second assignment strategy is available to multiple-column bindings to select contiguous blocks of columns (no gaps between the columns). This is useful for series data. In this mode (contiguous: true
) groups of neighboring columns at least min_columns
and at most max_columns
wide are scored. The score of a group is the geometric mean of the scores of the columns within it. A consequence of this is that a zero scoring column leads to the score of a group it is in being zero too. The strategy does the following things to select a group to assign. For each multiple column contiguous binding:
- When there are no groups scoring more than 0 at least
min_columns
wide, assign no columns. - When there is one high scoring group of columns at least
min_columns
wide, assign the columns. - When there are two or more groups sharing a high score, pick the widest group and assign the columns.
- When there are two or more groups sharing a high score with the same width, pick the group furthest to the left and assign the columns.
Sort selected columns
It's possible to set the sort order of selected columns using the sort
property (boolean). When set to false
(default), columns will be selected in the order in which they appear in the data table. Useful when it's important to retain the order of the data in a sheet (eg. time series). When set to true
, it will order the columns based on the score they receive from the hints. This is useful when you want to sort columns on how well they fit the hint (eg. finding the most categorical column in the Hierarchy template, selecting the most useful column first in a multi-value dropdown)
Hints library
The hints in this section are available to configure for your bindings. Hints are type sensitive, so some hints are available only for particular types.
sparse
and dense
Types: string
, number
, datetime
The sparse
and dense
hints allow you to score and filter columns according to the proportion of their elements which are empty. For example, a column with nine cells populated and one empty will get a dense
score of 0.9
and a sparse score of 0.1
. These hints have a threshold option to allow a score below a threshold to be floored to 0
.
The default configuration of the dense
hint looks like:
string:
dense: true
which is equivalent to:
string:
dense:
threshold: 0.8
Similarly the default configuration of the sparse
hint looks like:
string:
sparse: true
which is equivalent to:
string:
sparse:
threshold: 0.2
unique
The unique hint asserts that all elements of a column which are not empty are not repeated. When this is true the column is scored as 1
, otherwise 0
. The unique
hint takes no configuration:
string:
unique: true
distinct
Types: string
, number
, datetime
The distinctness hint allows you to make assertions about how many distinct values a column has. By default the hint will score any non-empty column 1, so it must be tuned. It can make assertions based on both the absolute number of distinct values and the proportion of values which are distinct.
An example of the distinct hint configured for an absolute number of values:
string:
distinct:
min: 5 # The lower absolute bound of distinct values. Default 0.
max: 9 # The upper (inclusive) bound of distinct values. Default Infinity.
An example of the distinct hint configured for a bound on the proportion of distinct values (the number of distinct values divided by the total number of values):
string:
distinct:
proportional_min: 0.1 # The lower absolute bound of distinct values. Default 0.
proportional_max: 0.9 # The upper (inclusive) bound of distinct values. Default 1.
Any mixture of these parameters may be used.
cardinality
Types: string
, number
, datetime
The cardinality
hint returns the proportion of distinct values in a column (the number of distinct values divided by the total number of values) as its score. It takes an optional threshold
parameter to place a lower bound (inclusive) on the proportion.
string:
cardinality:
threshold: 0.5 # n_distinct / n_populated >= 0.5
color
Types: string
The color
hint returns 1 when at least 2 and at most 8 distinct values are found in a column, otherwise it returns 0. It is given this name because this number of distinct values can be useful for mapping to a color palette. It takes no parameters.
string:
color: true
boolean
Types: string
, number
, datetime
The boolean
hint returns 1 when there are exactly two distinct values found in a column, otherwise it returns 0. It is given this name because this number of distinct values means the column may represent boolean data.
uniform
Types: string
, number
, datetime
The uniform
hint measures how uniform the data in a column is. A result of 1 represents a perfectly uniform distribution (each distinct value is represented the same number of times). This hint takes no parameters.
string:
uniform: true
categorical
and diverse
Types: string
, number
, datetime
The categorical
and diverse
hints are a mixture of cardinality
, distinct
, and dense
hints. They're provided as a convenience for the commonly needed constraints on columns.
For categorical, columns with too many empty cells, or with too many distinct values are scored 0. Columns which are sufficiently dense with up to a maximum number of distinct values are scored as 1 minus their cardinality (the proportion of distinct values). When that score is below a threshold, it gets floored to 0.
string:
categorical: true
# Equivalent to:
string:
categorical:
minimum_distinct: 2
maximum_distinct: 20
density: 0.8
threshold: 0.75
In the case of diverse
, columns with too many empty cells, or with too few distinct values are scored 0. Columns which are sufficiently dense with at least a minimum number of distinct values use their cardinality (the proportion of distinct values). When that score is below a threshold, it gets floored to 0.
string:
diverse: true
# Equivalent to:
string:
diverse:
minimum_distinct: 10
maximum_distinct: Infinity
density: 0.8
threshold: 0.75
allow_ids: false
Do not use categorical
or diverse
hints along with other hints!
If categorical
and diverse
don't quite fit your use case, it is better to compose what you need from the other hints.
url
Types: string
The url
string hint returns 1 when all populated cells of a column can be parsed as fully qualified URLs, otherwise it returns 0. It takes no parameters.
string:
url: true
geojson
Types: string
The geojson
string hint returns the proportion of populated cells which look like GeoJSON. It takes no parameters.
string:
geojson: true
text
Types: string
The text
string hint returns 1 when all populated cells of a column are not urls, geojson or booleans. The hint takes two optional parameters, max_length
and min_length
which set upper and lower limits on the length of each string.
string:
text:
max_length: 50 # The upper bound of string length. Default Infinity.
min_length: 1 # The lower bound of string length. Default 0.
year
Types: number
The year
hint looks for columns with numbers within a valid range.
25% of the score is given based on the header string. If the header is the string "year"
, "yr"
or "y"
it gets the 25%. The remaining 75% is the proportion of populated cells of the column which are in the range 1500 to 2100 inclusive. This hint takes no parameters:
number:
year: true
latitude
and longitude
Types: number
The latitude
and longitude
hints look for columns with numbers within a valid range (in degrees).
In the case of the latitude
hint, 25% of the score is given based on the header string. If the header is the string "lat"
or "latitude"
it gets the 25%. The remaining 75% is the proportion of populated cells of the column which are in the range -90 to +90 inclusive. This hint takes no parameters:
number:
latitude: true
The longitude
hint is similar. The first 25% of the score is gained by the header being "longitude
",
"lon",
"long", or
"lng"`. The remaining 75% is the proportion of populated cells in the range -180 to +180.
number:
longitude: true
Even with the header bonuses, a binding for longitude data should come before a binding for latitude data to increase the chances of a good assignment for both.
id
Types: number
, string
, datetime
The id
hint looks for columns that likely contain hints.
70% of the score is given base on the header string. If the header contains "id" or similar, it will score the column 0.7. If the column data is also unique and fully dense, 0.15 will be added to the score respectively. If the column header string doesn't contain "id", the score will always be 0.
This hint is also used in the diverse
hint to penalise id columns.
number:
latitude: true
The longitude
hint is similar. The first 25% of the score is gained by the header being "longitude
",
"lon",
"long", or
"lng"`. The remaining 75% is the proportion of populated cells in the range -180 to +180.
number:
longitude: true
Even with the header bonuses, a binding for longitude data should come before a binding for latitude data to increase the chances of a good assignment for both.
integer
Types: number
The integer
number hint returns the proportion of populated cells which look like integers. It takes no parameters:
number:
integer: true
comparison
Types: number
The comparison number hint returns the proportion of populated cells of which satisfy a comparison. This hint has two required parameters, operator
(the comparison operator), and value
(the number with which the cell is compared).
# cell > value
number:
comparison:
operator: ">"
value: 10
The value
may be any number. The comparison may be one of:
">"
: greater than">="
: greater than or equal"=="
: equal"!="
: not equal"<="
: less than or equal"<"
: less than