Automatic assignment of columns to bindings

Each binding in the data section of template.yml can be configured for automatic assignment of columns to bindings.

Simple column assignment by `data_type`

A user can ask the UI to automatically assign columns of their dataset to bindings. By default, the UI will try to assign an unused column (or columns for multiple column type: columns bindings) to the binding based on the types allowed by its data_type field. For example:

- name: Color by
  dataset: data
  key: color
  type: column
  column: Data::A
  data_type:
    - string
    - number

allows the UI to assign a string or number column to this binding. Columns are assigned to bindings in order, so the first binding (closest to the top of the YAML file) is assigned a column or columns first, then the binding below it second, and so on. When no data_type is specified it defaults to string.

Customizing assignment with hints

You may want to allow bindings for only some of the data types. To do that, an assignment field is needed on your binding:

- name: Color by
  dataset: data
  key: color
  type: column
  column: Data::A
  data_type:
    - string
    - number
  assignment:
    hints:
      string: true  # This default can be omitted.
      number: false # This stops number columns being picked.

Behind the scenes, columns are scored and the highest scoring columns are assigned to a binding. In the basic assignment configurations above every column not already assigned is scored 1 if it is a type we're allowing, and 0 otherwise. There may be more than one column scoring 1, so to tie-break them the column on the left wins.

One thing you may want to do is favor one type over another for a particular binding:

- name: ID
  description: Unique ID.
  dataset: data
  key: id
  type: column
  data_type:
    - string
    - number
  assignment:
    hints:
      number:
        weight: 1 # The default value.
      string:
        weight: 0.9

The weight property effectively scores number columns more highly than string columns. The above configuration means that this binding will receive an available number column when there is one, but it can fall back to a string column.

This basic picking can be useful, but you may want to refine the columns picked for a given binding. For example, you may want to pick bindings which contain unique values, or contain uniformly distributed entries. The assignment field allows you to do this with hints. Hints allow you to do enhanced filtering and scoring of columns to increase the chances of an optimal set of assignments for a given dataset.

Let's enhance the ID binding above:

- name: ID
  description: Unique ID.
  dataset: data
  key: id
  type: column
  data_type:
    - number
    - string
  assignment:
    hints:
      number:
        weight: 1
        unique: true
      string:
        weight: 0.9
        unique: true

The configuration above defines an ID binding which can be automatically assigned a number and string column populated with unique values, favoring number columns. The unique hint will score a column 1 when every populated cell is unique, and 0 if there are any repetitions. Hints are declared per type.

Most hints ignore empty cells of a column. This is important for hints like unique so that they don't see two empty cells as a repeated value. There are two special hints called sparse and dense which do not ignore empty cells, and we can use one to further refine automatic assignment to the ID binding:

  data_type:
    - string
    - number
  assignment:
    hints:
      number:
        weight: 1
        unique: true
        dense: true
      string:
        weight: 0.9
        unique: true
        dense: true

Now number columns which are dense (not sparse) can be automatically assigned to the ID binding.

Unlike the unique hint, the dense hint scores a column based on the proportion of cells in a column which are populated. For example, when half the cells are empty the dense score will be 0.5. By default the dense hint floors anything less than 0.8 (80% of cells populated) to 0. The dense hint takes optional configuration to move the threshold. To set it to 90%:

  data_type:
    - string
    - number
  assignment:
    hints:
      number:
        weight: 1
        unique: true
        dense:
          threshold: 0.9
      string:
        weight: 0.9
        unique: true
        dense:
          threshold: 0.9

Multiple hints are at work in the above configuration. The overall score of a column is the geometric mean of the score of its hints (not counting the weight) multiplied by the weight. One consequence of this is that when any hint results in a score of 0, the column is scored 0 overall (a behaviour which is useful for filtering columns).

Many hints have defaults, which are used when you activate them with true (as we did with the unique hint above). Other hints may require configuration.

Bindings are assigned columns in order, and by default each column is assigned once. A consequence of this is that you should place your bindings in descending order of importance to give your most important binding first choice of columns. If you have a binding you'd like to allow to reuse columns which have already been automatically assigned, you can use the can_rebind option:

  assignment:
    can_rebind: true

By default this option is false and bindings should omit it.

Multiple-columns bindings take the same assignment configuration, and have additional optional fields:

- name: Color by
  dataset: data
  key: color
  type: columns
  columns: Data::B-D,F
  data_type:
    number
  assignment:
    max_columns: 10   # defaults to Infinity
    min_columns: 5    # defaults to 1
    contiguous: false # defaults to false
    sort: true        # defaults to false

Multiple-column bindings take the same hints as single column bindings. Assignment to multiple-column bindings is like a repeated version of single column assignment. Available columns are scored, and up to max_columns of the highest scoring are picked. Zero scoring columns are never picked, and then if less than min_columns would be assigned, none are assigned. Multiple-column bindings can be configured to look for a contiguous series of columns.

More information on the rules which single, multiple, and multiple-contiguous binding assignments follow can be found in the assignment strategies section below.

Assignment strategies

Column assignment follows a set of rules, described in this section. These rules mean that, for a given dataset and template configuration, the automatic assignment of columns is deterministic and predictable.

Single column

Single column bindings are bindings with a column property.

For each binding, working from top to bottom in the data section of your template YAML configuration:

Score available columns based on types and hints (available columns are those unused by previous bindings or all columns when can_rebind is true).
Rule out columns with a score of 0.
Pick the highest scoring column:
- When there is one column with a high score, it is assigned.
- When there are two or more with the high score, the left-most column is assigned.
- When there are no columns to pick, assign no column.

Multiple column

Multiple column bindings are bindings with a columns property.

Per-column

The default multiple-column strategy is like the single column strategy. At least min_columns and at most max_columns are picked in the same manner as single column assignment. When there are not at least min_columns with a non-zero score available, no columns are assigned to the binding.

Contiguous blocks

A second assignment strategy is available to multiple-column bindings to select contiguous blocks of columns (no gaps between the columns). This is useful for series data. In this mode (contiguous: true) groups of neighboring columns at least min_columns and at most max_columns wide are scored. The score of a group is the geometric mean of the scores of the columns within it. A consequence of this is that a zero scoring column leads to the score of a group it is in being zero too. The strategy does the following things to select a group to assign. For each multiple column contiguous binding:

When there are no groups scoring more than 0 at least min_columns wide, assign no columns.
When there is one high scoring group of columns at least min_columns wide, assign the columns.
When there are two or more groups sharing a high score, pick the widest group and assign the columns.
When there are two or more groups sharing a high score with the same width, pick the group furthest to the left and assign the columns.

Sort selected columns

It's possible to set the sort order of selected columns using the sort property (boolean). When set to false (default), columns will be selected in the order in which they appear in the data table. Useful when it's important to retain the order of the data in a sheet (eg. time series). When set to true, it will order the columns based on the score they receive from the hints. This is useful when you want to sort columns on how well they fit the hint (eg. finding the most categorical column in the Hierarchy template, selecting the most useful column first in a multi-value dropdown)

Hints library

The hints in this section are available to configure for your bindings. Hints are type sensitive, so some hints are available only for particular types.

`sparse` and `dense`

Types: string, number, datetime

The sparse and dense hints allow you to score and filter columns according to the proportion of their elements which are empty. For example, a column with nine cells populated and one empty will get a dense score of 0.9 and a sparse score of 0.1. These hints have a threshold option to allow a score below a threshold to be floored to 0.

The default configuration of the dense hint looks like:

string:
  dense: true

which is equivalent to:

string:
  dense:
    threshold: 0.8

Similarly the default configuration of the sparse hint looks like:

string:
  sparse: true

which is equivalent to:

string:
  sparse:
    threshold: 0.2

`unique`

The unique hint asserts that all elements of a column which are not empty are not repeated. When this is true the column is scored as 1, otherwise 0. The unique hint takes no configuration:

string:
  unique: true

`distinct`

Types: string, number, datetime

The distinctness hint allows you to make assertions about how many distinct values a column has. By default the hint will score any non-empty column 1, so it must be tuned. It can make assertions based on both the absolute number of distinct values and the proportion of values which are distinct.

An example of the distinct hint configured for an absolute number of values:

string:
  distinct:
    min: 5 # The lower absolute bound of distinct values. Default 0.
    max: 9 # The upper (inclusive) bound of distinct values. Default Infinity.

An example of the distinct hint configured for a bound on the proportion of distinct values (the number of distinct values divided by the total number of values):

string:
  distinct:
    proportional_min: 0.1 # The lower absolute bound of distinct values. Default 0.
    proportional_max: 0.9 # The upper (inclusive) bound of distinct values. Default 1.

Any mixture of these parameters may be used.

`cardinality`

Types: string, number, datetime

The cardinality hint returns the proportion of distinct values in a column (the number of distinct values divided by the total number of values) as its score. It takes an optional threshold parameter to place a lower bound (inclusive) on the proportion.

string:
  cardinality:
    threshold: 0.5 # n_distinct / n_populated >= 0.5

`color`

Types: string

The color hint returns 1 when at least 2 and at most 8 distinct values are found in a column, otherwise it returns 0. It is given this name because this number of distinct values can be useful for mapping to a color palette. It takes no parameters.

string:
  color: true

`boolean`

Types: string, number, datetime

The boolean hint returns 1 when there are exactly two distinct values found in a column, otherwise it returns 0. It is given this name because this number of distinct values means the column may represent boolean data.

`uniform`

Types: string, number, datetime

The uniform hint measures how uniform the data in a column is. A result of 1 represents a perfectly uniform distribution (each distinct value is represented the same number of times). This hint takes no parameters.

string:
  uniform: true

`categorical` and `diverse`

Types: string, number, datetime

The categorical and diverse hints are a mixture of cardinality, distinct, and dense hints. They're provided as a convenience for the commonly needed constraints on columns.

For categorical, columns with too many empty cells, or with too many distinct values are scored 0. Columns which are sufficiently dense with up to a maximum number of distinct values are scored as 1 minus their cardinality (the proportion of distinct values). When that score is below a threshold, it gets floored to 0.

string:
  categorical: true

# Equivalent to:

string:
  categorical:
    minimum_distinct: 2
    maximum_distinct: 20
    density: 0.8
    threshold: 0.75

In the case of diverse, columns with too many empty cells, or with too few distinct values are scored 0. Columns which are sufficiently dense with at least a minimum number of distinct values use their cardinality (the proportion of distinct values). When that score is below a threshold, it gets floored to 0.

string:
  diverse: true

# Equivalent to:

string:
  diverse:
    minimum_distinct: 10
    maximum_distinct: Infinity
    density: 0.8
    threshold: 0.75
    allow_ids: false

Do not use categorical or diverse hints along with other hints!

If categorical and diverse don't quite fit your use case, it is better to compose what you need from the other hints.

`url`

Types: string

The url string hint returns 1 when all populated cells of a column can be parsed as fully qualified URLs, otherwise it returns 0. It takes no parameters.

string:
  url: true

`geojson`

Types: string

The geojson string hint returns the proportion of populated cells which look like GeoJSON. It takes no parameters.

string:
  geojson: true

`text`

Types: string

The text string hint returns 1 when all populated cells of a column are not urls, geojson or booleans. The hint takes two optional parameters, max_length and min_length which set upper and lower limits on the length of each string.

string:
  text:
    max_length: 50 # The upper bound of string length. Default Infinity.
    min_length: 1 # The lower bound of string length. Default 0.

`year`

Types: number

The year hint looks for columns with numbers within a valid range.

25% of the score is given based on the header string. If the header is the string "year", "yr" or "y" it gets the 25%. The remaining 75% is the proportion of populated cells of the column which are in the range 1500 to 2100 inclusive. This hint takes no parameters:

number:
  year: true

`latitude` and `longitude`

Types: number

The latitude and longitude hints look for columns with numbers within a valid range (in degrees).

In the case of the latitude hint, 25% of the score is given based on the header string. If the header is the string "lat" or "latitude" it gets the 25%. The remaining 75% is the proportion of populated cells of the column which are in the range -90 to +90 inclusive. This hint takes no parameters:

number:
  latitude: true

The longitude hint is similar. The first 25% of the score is gained by the header being "longitude","lon","long", or"lng"`. The remaining 75% is the proportion of populated cells in the range -180 to +180.

number:
  longitude: true

Even with the header bonuses, a binding for longitude data should come before a binding for latitude data to increase the chances of a good assignment for both.

`id`

Types: number, string, datetime

The id hint looks for columns that likely contain hints.

70% of the score is given base on the header string. If the header contains "id" or similar, it will score the column 0.7. If the column data is also unique and fully dense, 0.15 will be added to the score respectively. If the column header string doesn't contain "id", the score will always be 0.

This hint is also used in the diverse hint to penalise id columns.

number:
  latitude: true

number:
  longitude: true

Even with the header bonuses, a binding for longitude data should come before a binding for latitude data to increase the chances of a good assignment for both.

`integer`

Types: number

The integer number hint returns the proportion of populated cells which look like integers. It takes no parameters:

number:
  integer: true

`comparison`

Types: number

The comparison number hint returns the proportion of populated cells of which satisfy a comparison. This hint has two required parameters, operator (the comparison operator), and value (the number with which the cell is compared).

# cell > value
number:
  comparison:
    operator: ">"
    value: 10

The value may be any number. The comparison may be one of:

">": greater than
">=": greater than or equal
"==": equal
"!=": not equal
"<=": less than or equal
"<": less than

Simple column assignment by data_type

Customizing assignment with hints

Assignment strategies

Single column

Multiple column

Per-column

Contiguous blocks

Sort selected columns

Hints library

sparse and dense

unique

distinct

cardinality

color

boolean

uniform

categorical and diverse

url

geojson

text

year

latitude and longitude

id

integer

comparison

Simple column assignment by `data_type`

`sparse` and `dense`

`unique`

`distinct`

`cardinality`

`color`

`boolean`

`uniform`

`categorical` and `diverse`

`url`

`geojson`

`text`

`year`

`latitude` and `longitude`

`id`

`integer`

`comparison`