Module 2: Selecting the Right Data

Pick data that answers your question, not just data that exists.

1

Quick Intro

~2 min
WHAT

Data selection means picking a small number of variables that directly support your core question and rejecting the rest, even when more data is available.

WHY

Decision-makers have 5 minutes. Every extra variable steals attention from the answer. Bad data choice can't be rescued by good visualization downstream.

HOW IT FITS

This is the first of four pillars (data, visualization, narration, interaction). It determines what your chart in Module 3 can actually show.

EXAMPLE

For the question "Should we expand cycling lanes?", pick cycling trips per district and collision counts. Reject GDP, even though it correlates with everything.

1

Essential vs. noise

30 sec
Essential data = directly answers your question. Noise = interesting but irrelevant.
Essential
Public transport (PT) ridership 2019-2023 (answers "did ridership recover?")
Noise
Average ticket price by region (interesting, but not the question)
Every dataset must pass the test: "Does this help answer my question?"
2

The 2-layer filter

30 sec
Layer 1: Is it accessible and traceable? Layer 2: Does it answer your question?
Layer 1: Quick Filter
✓ Can I download it?
✓ Is the source named (Destatis, KBA)?
✓ Is it clean/structured?
Layer 2: Story Fit
✓ Does it answer my question?
✓ Does it cover the crisis period (e.g., 2020)?
✓ Does it have a turning point?
Only data that passes BOTH layers goes into your story.
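The two layers above can be read as a pair of yes/no checklists. The following is a minimal sketch of that idea, not part of the course tooling: the `Candidate` class, its field names, and the true/false answers given for the two example datasets are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate dataset with one yes/no answer per checklist item."""
    name: str
    # Layer 1: Quick Filter
    downloadable: bool
    source_named: bool
    structured: bool
    # Layer 2: Story Fit
    answers_question: bool
    includes_crisis_data: bool
    has_turning_point: bool


def passes_layer_1(c: Candidate) -> bool:
    """Accessible and traceable?"""
    return c.downloadable and c.source_named and c.structured


def passes_layer_2(c: Candidate) -> bool:
    """Does it fit the story?"""
    return c.answers_question and c.includes_crisis_data and c.has_turning_point


def select(candidates):
    """Keep only the datasets that pass BOTH layers."""
    return [c.name for c in candidates if passes_layer_1(c) and passes_layer_2(c)]


# Checklist answers below are assumed for illustration only.
candidates = [
    Candidate("PT ridership 2019-2023", True, True, True, True, True, True),
    Candidate("Average ticket price by region", True, True, True, False, False, False),
]
print(select(candidates))
```

Under these assumed answers, the ticket price dataset clears Layer 1 but fails Layer 2, so only the ridership dataset is selected.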
3

Spot cherry-picking

30 sec
Cherry-picking = showing only data that supports your conclusion. Always include uncomfortable data.
Cherry-picked
"Ridership grew 15% from 2021 to 2023" (ignores 2020 crash)
Complete
"Ridership crashed 56% in 2020, then recovered to 96% of 2019 by 2023"
MobiDaS R2 (Integrity): Include the 2020 pandemic drop. Always.
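The effect of the chosen start year can be checked with simple arithmetic. The sketch below uses index values (2019 = 100) implied by the "Complete" example; the 2021 value is back-calculated from the 15% claim and is an assumption, not published data. The `pct_change` helper is a hypothetical name.

```python
# Index values with 2019 = 100. The 2020 and 2023 values follow from the
# "Complete" example; the 2021 value is back-calculated from the 15% claim
# (96 / 1.15) and is an assumption, not published data.
ridership_index = {2019: 100.0, 2020: 44.0, 2021: 96.0 / 1.15, 2023: 96.0}


def pct_change(series, start, end):
    """Percentage change from the start year to the end year."""
    return (series[end] - series[start]) / series[start] * 100


cherry_picked = pct_change(ridership_index, 2021, 2023)  # starts after the crash
complete = pct_change(ridership_index, 2019, 2023)       # starts before the crash

print(f"2021 -> 2023: {cherry_picked:+.0f}%")
print(f"2019 -> 2023: {complete:+.0f}%")
```

The same series reads as growth of about 15% or as a 4% shortfall, depending only on where the comparison starts.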
2

Guided Practice

~3 min
1

Choose the right dataset

1 min
Practice applying the 2-layer filter.
Question: "Should the city invest in suburban rail to recover lost ridership?"
Which dataset is essential?
A: Ticket revenue 2019-2023
Source: Transport authority annual reports
B: Suburban rail ridership 2019-2023
Source: Destatis/VDV
C: Car ownership rates by district
Source: KBA vehicle registry
2

Check for integrity

1 min
Good data includes uncomfortable truths. Check for gaps.
Dataset: PT ridership 2015-2023
Years available: 2015, 2016, 2017, 2018, 2019, 2021, 2022, 2023
What year is missing?
Missing year:
Why it matters: Leaving out 2020 hides the pandemic crash, which is a form of cherry-picking.
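A gap check like this one is easy to automate before you start charting. This is a minimal sketch that assumes one data point per year; `missing_years` is a hypothetical helper name.

```python
def missing_years(years):
    """Return every year between the first and last year that is absent."""
    present = set(years)
    return [y for y in range(min(years), max(years) + 1) if y not in present]


# Years available in the PT ridership dataset from the exercise above.
available = [2015, 2016, 2017, 2018, 2019, 2021, 2022, 2023]
print(missing_years(available))
```

An empty list means the series has no gaps between its first and last year. It does not tell you whether the series starts early enough to include the uncomfortable data.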
3

Write a source citation

30 sec
MobiDaS R3 (Transparency): Every data point needs a traceable source.
Citation format
[Metric]: [Value] ([Source], [Year])
Example
Suburban rail ridership: 1.2B journeys (Destatis, 2023)
Write a citation for this fact: "EV registrations reached 500,000 in 2023"
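The citation format can be sketched as a small formatting function. The function name `cite` and the rule of rejecting an empty source are assumptions made for illustration, in the spirit of R3 (Transparency).

```python
def cite(metric: str, value: str, source: str, year: int) -> str:
    """Format a data point as '[Metric]: [Value] ([Source], [Year])'."""
    if not source.strip():
        # Assumed rule: a data point without a named source is not citable.
        raise ValueError("every data point needs a traceable source")
    return f"{metric}: {value} ({source}, {year})"


print(cite("Suburban rail ridership", "1.2B journeys", "Destatis", 2023))
```

This reproduces the example citation above. A fact given without a source, like the one in the exercise, cannot be formatted until you have traced where it comes from.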
3

Applied Task

~3 min
1

Identify your data needs

1 min
Based on your Module 1 question, what data would answer it?
Your question
(Your Module 1 question will appear here)
What metric would directly answer this question?
2

Name your source

1 min
German mobility data sources: Destatis, KBA, UBA, VDV, Mobilithek.
Select the most likely source for your data:
Destatis
Official statistics (ridership, modal split)
KBA
Vehicle registrations (cars, EVs)
UBA
Environmental data (emissions, air quality)
VDV
Public transport statistics
3

Write your data selection

1 min
Combine: metric + time range + source + why it answers your question.
Data selection template
I will use [metric] from [source] ([years]) because it shows [how it answers the question].
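One way to make sure no part of the template is skipped is to fill it from named fields and reject blanks. This is a sketch only: `data_selection` is a hypothetical helper, and the example values are borrowed from the Guided Practice question rather than from your own Module 1 question.

```python
def data_selection(metric: str, source: str, years: str, reason: str) -> str:
    """Fill the data selection template, refusing to leave any part blank."""
    fields = {"metric": metric, "source": source, "years": years, "reason": reason}
    missing = [name for name, value in fields.items() if not value.strip()]
    if missing:
        raise ValueError(f"missing template parts: {', '.join(missing)}")
    return f"I will use {metric} from {source} ({years}) because it shows {reason}."


# Illustrative values based on the Guided Practice example.
print(data_selection(
    metric="suburban rail ridership",
    source="Destatis/VDV",
    years="2019-2023",
    reason="whether ridership recovered after the 2020 drop",
))
```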
Write your data selection:
4

Quick Refinement

~2 min
1

Check your selection

30 sec
Your data selection must pass the 2-layer filter.
Layer 1: Quick Filter (accessible and traceable)
Layer 2: Story Fit
2

Add integrity note

30 sec
Note any limitations or uncomfortable data points you'll include.
What uncomfortable truth will you include? (e.g., pandemic drop)
3

Commit your data selection

30 sec
Your data selection will guide your visualization in Module 3.
Final data selection: